[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_sPEdqNneqh4RxVmpYkasbLe_Ku9JsG4?usp=sharing)

We work with the Concrete Compressive Strength Data Set from the UCI Machine Learning Repository. More information about this data set is available at https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

A CSV version of this data set is available in this GitHub repository: https://github.com/farhad-pourkamali/machine-learning/blob/main/Concrete_Data.csv

In a nutshell, we have 1,030 samples with 8 input features and 1 continuous output variable (the compressive strength), so this is a regression problem.

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split, KFold  
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/farhad-pourkamali/machine-learning/main/Concrete_Data.csv')
df

In [None]:
df.describe()

In [None]:
# Data matrix X
X = df.iloc[:,:-1].to_numpy()

# Targets y 
y = df.iloc[:,-1].to_numpy()

# Print types 
print(type(X), type(y))

# Print sizes 
print(X.shape, y.shape)

Divide the data into two parts: a training set (70% of the samples) and a test set (the remaining 30%).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [None]:
# Print train and test sizes 
print(X_train.shape, X_test.shape)

In [None]:
print(X_train[0],'\n', y_train[0])

Case 1: Linear regression without preprocessing

In [None]:
# Train (i.e., find coefficients)
reg1 = LinearRegression().fit(X_train, y_train)

# Test (i.e., make predictions) 
y_pred1 = reg1.predict(X_test)

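# Evaluate (we use RMSE); taking np.sqrt of the MSE avoids the
# squared=False argument, which newer scikit-learn releases no longer accept
err1 = np.sqrt(mean_squared_error(y_test, y_pred1))
print(err1)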

In [None]:
from sklearn.metrics import explained_variance_score
print(explained_variance_score(y_test, y_pred1))
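
The two metrics used above can be written out by hand. The following is a minimal sketch on made-up numbers (the arrays `y_true` and `y_hat` are illustrative and not taken from the concrete data): RMSE is the square root of the mean squared residual, and the explained variance score is one minus the ratio of the residual variance to the target variance.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, explained_variance_score

# Toy targets and predictions (hypothetical values, for illustration only)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

# Explained variance: 1 - Var(residuals) / Var(targets)
ev_manual = 1.0 - np.var(y_true - y_hat) / np.var(y_true)
ev_sklearn = explained_variance_score(y_true, y_hat)

print(rmse_manual, rmse_sklearn)
print(ev_manual, ev_sklearn)
```

A score of 1 means the residuals have no variance left; lower values mean more of the target's variance is unexplained.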

Case 2: Linear regression with preprocessing

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)

In [None]:
X_train[0]

In [None]:
X_train_pre = scaler.transform(X_train)

In [None]:
X_train_pre[0]
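
To see what the scaler does, here is a small sketch on synthetic data (the array `X_demo` and its three columns are assumptions made for illustration). It checks that `StandardScaler` subtracts the per-column mean and divides by the per-column standard deviation, both estimated on the data passed to `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical, for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))

# Standardize with scikit-learn
scaler_demo = StandardScaler().fit(X_demo)
Z_sklearn = scaler_demo.transform(X_demo)

# Standardize by hand: z = (x - mean) / std, column by column
Z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(Z_sklearn, Z_manual))
print(Z_sklearn.mean(axis=0).round(6), Z_sklearn.std(axis=0).round(6))
```

In the notebook, the scaler is fitted on `X_train` only, so the test set is transformed with the training means and standard deviations.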

In [None]:
# Train (i.e., find coefficients)
reg2 = LinearRegression().fit(X_train_pre, y_train)

# Test (i.e., make predictions) 
y_pred2 = reg2.predict(scaler.transform(X_test))

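# Evaluate (we use RMSE); taking np.sqrt of the MSE avoids the
# squared=False argument, which newer scikit-learn releases no longer accept
err2 = np.sqrt(mean_squared_error(y_test, y_pred2))
print(err2)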
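
Ordinary least squares with an intercept is unaffected by shifting and rescaling the features, so Cases 1 and 2 are expected to give the same test RMSE up to floating-point error (assuming the features are not collinear). The sketch below illustrates this on synthetic data; `make_regression`, the column scales, and all `*_demo` names are assumptions introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with 8 features on very different scales
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 1.0, 10.0, 100.0, 1000.0]) + 5.0

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Linear regression on raw features
pred_raw = LinearRegression().fit(Xtr, ytr).predict(Xte)

# Linear regression on standardized features (scaler fitted on the training part only)
sc = StandardScaler().fit(Xtr)
pred_scaled = LinearRegression().fit(sc.transform(Xtr), ytr).predict(sc.transform(Xte))

# Largest difference between the two sets of predictions
print(np.max(np.abs(pred_raw - pred_scaled)))
```

Standardization starts to matter once the model is no longer plain least squares, for example with regularization, and it keeps polynomial features on comparable scales.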

Case 3: Polynomial regression with preprocessing

In [None]:
poly = PolynomialFeatures(2)
X_train_pre_pol = poly.fit_transform(X_train_pre)
print(X_train_pre.shape, X_train_pre_pol.shape)

In [None]:
print(X_train_pre[0], '\n', X_train_pre_pol[0])

In [None]:
# Train (i.e., find coefficients)
reg3 = LinearRegression().fit(X_train_pre_pol, y_train)

# Test (i.e., make predictions) 
y_pred3 = reg3.predict(poly.transform(scaler.transform(X_test)))

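# Evaluate (we use RMSE); taking np.sqrt of the MSE avoids the
# squared=False argument, which newer scikit-learn releases no longer accept
err3 = np.sqrt(mean_squared_error(y_test, y_pred3))
print(err3)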

In [None]:
print(explained_variance_score(y_test, y_pred3))

Case 4: Polynomial regression with preprocessing, using a pipeline (the nice way!)

We set up a very basic *pipeline* that consists of the following sequence:

* scaler: standardizes each feature to zero mean and unit variance using StandardScaler()

* poly: creates polynomial features using PolynomialFeatures()

* regressor: fits the linear regression model using LinearRegression()

In [None]:
# Define pipe
pipe = Pipeline([
('scaler', StandardScaler()),
('poly', PolynomialFeatures(2)),
('regressor', LinearRegression())
])

# Train 
pipe.fit(X_train, y_train)

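# Predict 
y_pred4 = pipe.predict(X_test)

# Evaluate (we use RMSE); np.sqrt of the MSE avoids the squared=False
# argument, which newer scikit-learn releases no longer accept
err4 = np.sqrt(mean_squared_error(y_test, y_pred4))
print(err4)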

In [None]:
print(explained_variance_score(y_test, y_pred4))
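
A pipeline simply chains the steps that Case 3 carried out by hand. As a hedged illustration on synthetic data (`make_regression` and the `*_demo` names are assumptions, not part of the original notebook), the sketch below pulls the fitted steps out of a pipeline through `named_steps` and checks that applying them one after another gives the same predictions as `predict`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data, for illustration only
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(2)),
    ('regressor', LinearRegression()),
]).fit(X_demo, y_demo)

# Access the fitted steps by the names given in the pipeline definition
sc = pipe_demo.named_steps['scaler']
po = pipe_demo.named_steps['poly']
reg = pipe_demo.named_steps['regressor']

# Apply the steps manually, in order, and compare with the pipeline
pred_manual = reg.predict(po.transform(sc.transform(X_demo)))
pred_pipe = pipe_demo.predict(X_demo)

print(np.allclose(pred_manual, pred_pipe))
```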

To learn more about pipelines, see the short tutorial at https://youtu.be/jzKSAeJpC6s

Instead of a single train/test split, we can use cross-validation (CV for short). In k-fold CV, the data is split into k folds, and the following is repeated k times so that each fold serves as the validation set exactly once:

* a model is trained using the other (k-1) folds as training data;

* the resulting model is validated on the remaining fold.

In [None]:
# Shuffle the data before splitting it into 3 folds
kf = KFold(n_splits=3, shuffle=True, random_state=6)
scores = []
for train_index, val_index in kf.split(X):
    # Use fold-specific names so the earlier train/test split is not overwritten
    X_tr, X_val = X[train_index], X[val_index]
    y_tr, y_val = y[train_index], y[val_index]
    # Refit the pipeline on the training folds and score it on the held-out fold
    pipe.fit(X_tr, y_tr)
    scores.append(explained_variance_score(y_val, pipe.predict(X_val)))

scores = np.array(scores)
print(scores)
print(f"{scores.mean():.2f} mean score with a standard deviation of {scores.std():.2f}")
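
The manual loop can also be expressed with scikit-learn's `cross_val_score` helper. The following is a sketch under stated assumptions: it uses synthetic data from `make_regression` rather than the concrete data set, and the `*_demo` names are introduced only for this example. With the same `KFold` splitter and the `'explained_variance'` scorer, it should reproduce the per-fold scores of an explicit loop.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data, for illustration only
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(2)),
    ('regressor', LinearRegression()),
])
kf_demo = KFold(n_splits=3, shuffle=True, random_state=6)

# One call: fits a fresh copy of the pipeline per fold and scores the held-out fold
cv_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=kf_demo,
                            scoring='explained_variance')

# Equivalent explicit loop over the same folds
manual_scores = []
for tr, va in kf_demo.split(X_demo):
    pipe_demo.fit(X_demo[tr], y_demo[tr])
    manual_scores.append(explained_variance_score(y_demo[va], pipe_demo.predict(X_demo[va])))
manual_scores = np.array(manual_scores)

print(cv_scores)
print(manual_scores)
```

Because the scaler sits inside the pipeline, it is refitted on the training folds in every iteration, so no information from the held-out fold leaks into preprocessing.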