# Concrete

Concrete is the most important material in civil engineering. Its compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

![](https://folio.news/wp-content/uploads/2017/02/Resistencia-del-concreto.jpg)

The goal of this notebook is to predict the compressive strength of concrete from several variables using regression. Linear regression is introduced as background, and the model fitted below is a random forest regressor.

The dataset contains a variety of concrete formulations and the compressive strength of the resulting product, which is a measure of how much load that type of concrete can bear.

In [1]:
import pandas as pd
import numpy as np

In [29]:
# Upgrade scikit-learn. The flag is --upgrade; `pip install update scikit-learn`
# would instead install an unrelated package named `update`.
#!pip install --upgrade scikit-learn


In [23]:
#!pip install xlrd==2.0.0

Collecting xlrd==2.0.0
  Downloading xlrd-2.0.0-py2.py3-none-any.whl (95 kB)
Installing collected packages: xlrd
  Attempting uninstall: xlrd
    Found existing installation: xlrd 2.0.1
    Uninstalling xlrd-2.0.1:
      Successfully uninstalled xlrd-2.0.1
Successfully installed xlrd-2.0.0


In [2]:
# import pkg_resources
# pkg_resources.get_distribution("xlrd").version

'2.0.0'

## Load data

In [2]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [3]:
#path = 'https://docs.google.com/spreadsheets/d/1BjCqE5NVvprFNADOkI_z5aEs0cVTwuk6/edit?usp=sharing&ouid=117022111338564756150&rtpof=true&sd=true'


path = 'gdrive/MyDrive/Factored_preparation/data_for_exercises/'

df_concrete = pd.read_excel(path + 'Concrete_Data.xls', sheet_name='Sheet1')
df_concrete.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | CoarseAggregate | FineAggregate | Age | Concrete compressive strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05278 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |


In [7]:
df_concrete.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Cement                         1030 non-null   float64
 1   Blast Furnace Slag             1030 non-null   float64
 2   Fly Ash                        1030 non-null   float64
 3   Water                          1030 non-null   float64
 4   Superplasticizer               1030 non-null   float64
 5   CoarseAggregate                1030 non-null   float64
 6   FineAggregate                  1030 non-null   float64
 7   Age                            1030 non-null   int64  
 8   Concrete compressive strength  1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB


*   Cement - amount of cement
*   Blast Furnace Slag - amount of blast furnace slag
*   Fly Ash - amount of fly ash
*   Water - amount of water
*   Superplasticizer - amount of superplasticizer
*   CoarseAggregate - amount of coarse aggregate
*   FineAggregate - amount of fine aggregate
*   Age - age of the concrete

In [4]:
# Import the libraries for the model

from sklearn.ensemble import RandomForestRegressor  # regression with a random forest

# Evaluate a score by cross-validation
from sklearn.model_selection import cross_val_score

**Linear regression**

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/03dc99b761b997908d3aa34aff2f72eb33f50f11)

![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/700px-Linear_regression.svg.png)

**A random forest regressor**

A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting. For more information see: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

![](https://miro.medium.com/max/1400/1*ZFuMI_HrI3jt2Wlay73IUQ.png)
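
The averaging described above can be checked directly. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the concrete dataset). It fits a small forest and compares its prediction with the mean of the predictions of the individual trees stored in `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, assumed only for this illustration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + 3 * X_demo[:, 1] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in `estimators_`; collect one row of
# predictions per tree for the first five samples.
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Average the per-tree predictions and compare with the forest's own output.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo[:5])
print(np.allclose(manual_average, forest_prediction))
```

Because each tree is trained on a different bootstrap sample, the individual predictions differ, and averaging them tends to reduce variance compared with a single deep tree.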

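**criterion {"mse", "mae"}, default="mse"**

The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as a feature selection criterion, and "mae" for the mean absolute error. These are the names used by the scikit-learn version in this notebook; from scikit-learn 1.0 onward they are called "squared_error" and "absolute_error", and the old names were removed in version 1.2.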

In statistics, the mean absolute error (**MAE**) is a measure of the errors between paired observations expressing the same phenomenon, for example predicted versus observed values. It is the average absolute difference between the paired values X and Y: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - x_i|$. MAE is a scale-dependent accuracy measure, so it cannot be used to compare series that are measured on different scales.
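
As a small worked example of this definition, the sketch below computes MAE by hand in plain Python. The helper name `mae` and all numbers are hypothetical, and the units (MPa and kPa) are chosen only to illustrate what "scale-dependent" means.

```python
def mae(observed, predicted):
    """Average absolute difference between paired observations."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical strengths, taken here to be in MPa.
observed_mpa = [40.0, 35.0, 50.0, 45.0]
predicted_mpa = [38.0, 39.0, 47.0, 45.0]
mae_mpa = mae(observed_mpa, predicted_mpa)
print(mae_mpa)  # 2.25

# The same values expressed in kPa: the errors are identical in reality,
# but the MAE is 1000 times larger because it carries the unit of the data.
observed_kpa = [v * 1000 for v in observed_mpa]
predicted_kpa = [v * 1000 for v in predicted_mpa]
mae_kpa = mae(observed_kpa, predicted_kpa)
print(mae_kpa)  # 2250.0
```

The two results describe the same prediction quality, which is why MAE values are only comparable between models evaluated on the same target in the same units.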


**Cross-validation**

Cross-validation is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent dataset. It is mainly used when the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset) and a dataset of unseen data against which the model is tested (the validation or test set). The goal of cross-validation is to test the model's ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give insight into how the model will generalize to an independent dataset (for instance, one from a real problem).
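
To make the splitting concrete, here is a minimal sketch of k-fold index generation in plain Python. The helper `k_fold_indices` is hypothetical and written only for illustration. It uses contiguous, unshuffled folds, which is an assumption made here for simplicity; it is also how scikit-learn's `KFold` behaves by default.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample is used for validation exactly once and for training in the remaining folds. The model is refit on each training part, scored on the held-out part, and the scores are averaged.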

In [13]:
X = df_concrete.copy()
y = X.pop("Concrete compressive strength")

In [32]:
X.columns

Index(['Cement', 'Blast Furnace Slag', 'Fly Ash', 'Water ',
       'Superplasticizer ', 'CoarseAggregate  ', 'FineAggregate ', 'Age'],
      dtype='object')

In [33]:
# Create synthetic features
X["FCRatio"] = X['FineAggregate '] / X['CoarseAggregate  ']
X["AggCmtRatio"] = (X['CoarseAggregate  '] + X['FineAggregate ']) / X["Cement"]
X["WtrCmtRatio"] = X['Water '] / X["Cement"]

In [34]:
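# Train and score a baseline model with 5-fold cross-validation.
# Note: criterion="mae" works in scikit-learn < 1.2; newer versions call it "absolute_error".
baseline = RandomForestRegressor(criterion="mae", random_state=0)
baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
# scikit-learn reports the negated MAE so that higher is better; flip the sign back.
baseline_score = -1 * baseline_score.mean()

print(f"MAE Score: {baseline_score:.4}")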

MAE Score: 8.01
