# Exercise - XGBoost

In this exercise, you will build several models to predict the compressive strength of
concrete from a set of features, including its composition. The conventional way to measure compressive strength is to cast several concrete cubes with different compositions and observe their strength over a period of time.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("data/ccs.csv")

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.shape

In [None]:
data.describe()

In [7]:
# features: all columns except the last; target: the last column
X = data.iloc[:, :-1].to_numpy()
Y = data.iloc[:, -1].to_numpy()

1. Split the dataset into training (80%) and test (20%) sets and standardize them using `scikit-learn`
   (`train_test_split` and [`StandardScaler`](https://scikit-learn.org/stable/modules/preprocessing.html)). Remember: standardize the test set using
   the mean and variance of the training set (see also Section 10.2.1 [here](https://scikit-learn.org/stable/common_pitfalls.html)).

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

In [9]:
# shuffling and splitting
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42, test_size=0.2)

scX = StandardScaler()
scY = StandardScaler()

# standardize the training set
X_train_scaled = scX.fit_transform(X_train) 
y_train_scaled = scY.fit_transform(y_train.reshape(-1,1))

# transform uses the mean and variance of the training set to standardize the test set
X_test_scaled = scX.transform(X_test) 
y_test_scaled = scY.transform(y_test.reshape(-1,1))
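
To make the distinction between `fit_transform` and `transform` concrete, the following sketch reproduces `StandardScaler` by hand with NumPy. It runs on synthetic data: the arrays `X_tr` and `X_te` are made up for illustration and only stand in for the training and test features of this exercise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic stand-ins for a training and a test feature matrix (not the concrete data)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_te = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# the statistics are estimated on the training set only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0), as in StandardScaler

X_tr_manual = (X_tr - mu) / sigma
X_te_manual = (X_te - mu) / sigma  # the test set reuses the training statistics

sc = StandardScaler().fit(X_tr)
print(np.allclose(sc.transform(X_tr), X_tr_manual))
print(np.allclose(sc.transform(X_te), X_te_manual))

# the standardized test set is only approximately centred, which is expected
print(X_te_manual.mean(axis=0))
```

Calling `fit_transform` on the test set instead would re-estimate the mean and variance from the test data, which leaks information from the test set into the preprocessing.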

2. Fit a linear model using [`LinearRegression`](https://scikit-learn.org/stable/modules/linear_model.html) from `scikit-learn`. Evaluate the R2 scores on the training and test sets.

In [None]:
lr = LinearRegression()
lr.fit(X_train_scaled, y_train_scaled)
y_pred_lrtr = lr.predict(X_train_scaled)
y_pred_lrte = lr.predict(X_test_scaled)
print('Train R2 score: ', r2_score(y_train_scaled, y_pred_lrtr))
print('Test R2 score: ', r2_score(y_test_scaled, y_pred_lrte))
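
For reference, the R2 score reported by `r2_score` is the coefficient of determination, `1 - SS_res / SS_tot`. The sketch below computes it by hand on a small made-up example and compares the result with `scikit-learn`; the helper name `r2_manual` and the toy arrays are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# toy values, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(r2_manual(y_true, y_pred))
print(r2_score(y_true, y_pred))

# always predicting the mean of y_true gives a score of 0
y_mean = np.full_like(y_true, y_true.mean())
print(r2_manual(y_true, y_mean))
```

A score of 1 means a perfect fit, 0 means the model does no better than predicting the mean, and the score can become negative on a test set when the model does worse than that baseline.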

3. Fit a quadratic model by building polynomial features using [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions)
   from `scikit-learn`. Evaluate the R2 scores on the training and test sets. Increase
   the degree of the polynomial up to 4 and observe the changes in the R2 scores.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

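pf = PolynomialFeatures(degree=2)
X_poly_train = pf.fit_transform(X_train)
# use a separate scaler so that scX (fitted on the raw features) is not overwritten
scX_poly = StandardScaler()
X_poly_train_scaled = scX_poly.fit_transform(X_poly_train)
lr_poly = LinearRegression()
lr_poly.fit(X_poly_train_scaled, y_train_scaled)
print("Training R2 score: ", lr_poly.score(X_poly_train_scaled, y_train_scaled))
# only transform the test set: the expansion and the scaler were fitted on the training set
X_poly_test = pf.transform(X_test)
X_poly_test_scaled = scX_poly.transform(X_poly_test)
print("Test R2 score: ", lr_poly.score(X_poly_test_scaled, y_test_scaled))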
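
One possible way to compare several polynomial degrees is to wrap the expansion, the scaling and the regression in a single pipeline and loop over the degree. The sketch below does this on a synthetic dataset, not on the concrete data; the names `X_demo`, `y_demo`, `Xd_train` and so on are hypothetical and chosen only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# synthetic regression problem: a quadratic signal plus noise
X_demo = rng.uniform(-1.0, 1.0, size=(200, 3))
y_demo = (
    X_demo[:, 0] ** 2
    + X_demo[:, 1] * X_demo[:, 2]
    + rng.normal(scale=0.1, size=200)
)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

train_scores = {}
test_scores = {}
for degree in range(1, 5):
    # the pipeline fits every step on the training data only
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(Xd_train, yd_train)
    train_scores[degree] = model.score(Xd_train, yd_train)
    test_scores[degree] = model.score(Xd_test, yd_test)
    print(
        f"degree={degree}: "
        f"train R2={train_scores[degree]:.3f}, test R2={test_scores[degree]:.3f}"
    )
```

Because each higher degree contains all the features of the lower ones, the training R2 cannot decrease as the degree grows. The test R2 has no such guarantee, and a growing gap between the two is the usual sign of overfitting.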

## XGBoost regressor

4. Create an
   [`XGBRegressor`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn)
   object from the `xgboost` library (you may need to install it first, e.g. with `pip install xgboost`). Train the
   model using the `fit` method and evaluate the R2 scores on the training and test sets.

In [None]:
from xgboost import XGBRegressor

xgb = XGBRegressor()
xgb.fit(X_train_scaled, y_train_scaled)
print('Train R2 score: ', xgb.score(X_train_scaled, y_train_scaled))
print('Test R2 score: ', xgb.score(X_test_scaled, y_test_scaled))
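
To give some intuition for what a boosted tree ensemble does, here is a minimal sketch of plain gradient boosting for the squared loss, built from `scikit-learn` decision trees on synthetic data. It is not XGBoost's actual algorithm, which additionally uses a regularized objective and second-order gradient information. The variable names mirror the `XGBRegressor` hyperparameters `n_estimators`, `learning_rate` and `max_depth` for readability only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# synthetic 1-D regression problem, for illustration only
X_demo = rng.uniform(0.0, 6.0, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

n_estimators = 50    # number of boosting rounds (one tree per round)
learning_rate = 0.1  # shrinkage applied to each tree's contribution
max_depth = 2        # depth of each individual tree

# start from a constant prediction, then repeatedly fit a tree to the residuals
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]
trees = []
for _ in range(n_estimators):
    # residuals = negative gradient of the (half) squared loss
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each round can only lower the training error, so a large number of deep trees with a high learning rate fits the training set almost perfectly. This is why the training R2 alone says little about the quality of a boosted model.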

5. Use the `GridSearchCV` class from `scikit-learn` to carry out a grid search over the
   following hyperparameters. You can read more about the grid search implemented in
   `scikit-learn`
   [here](https://scikit-learn.org/stable/modules/grid_search.html#grid-search). Evaluate
   the R2 scores of the best model on the training and test sets.

In [13]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 500],
    'max_depth': [2, 4, 6, 8, 10],
    'gamma': [0.001, 0.01],
    'learning_rate': [0.01, 0.1, 0.3],
}
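
Before launching the search, it can help to estimate its cost. The sketch below uses `ParameterGrid` from `scikit-learn` to enumerate the combinations of the grid above. The grid is copied into a separate variable, `param_grid_demo`, a name introduced here for illustration, and the fold count assumes the default `cv` setting of `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

# copy of the grid used in this exercise
param_grid_demo = {
    'n_estimators': [50, 100, 500],
    'max_depth': [2, 4, 6, 8, 10],
    'gamma': [0.001, 0.01],
    'learning_rate': [0.01, 0.1, 0.3],
}

candidates = list(ParameterGrid(param_grid_demo))
n_candidates = len(candidates)
n_folds = 5  # default number of folds in GridSearchCV when cv is not set

print(n_candidates)            # 3 * 5 * 2 * 3 = 90 combinations
print(n_candidates * n_folds)  # 450 cross-validation fits, plus one final refit
print(candidates[0])           # one candidate, as a dict of hyperparameters
```

Since the candidates with 500 estimators and deep trees dominate the running time, passing `n_jobs=-1` to `GridSearchCV` to use all available cores may be worthwhile.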

In [None]:
grid_search_model = GridSearchCV(xgb, param_grid=param_grid)

grid_search_model.fit(X_train_scaled, y_train_scaled)

print(f'Best hyperparameters: {grid_search_model.best_params_}')

In [None]:
print(f'Train R2 score: {grid_search_model.score(X_train_scaled, y_train_scaled)}')
print(f'Test R2 score: {grid_search_model.score(X_test_scaled, y_test_scaled)}')
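
Beyond `best_params_`, a fitted `GridSearchCV` object exposes the cross-validation results of every candidate through `cv_results_`. The sketch below shows one way to inspect them. To keep it fast and self-contained it uses a `DecisionTreeRegressor` as a stand-in for the XGBoost model, a small hypothetical grid (`demo_grid`) and synthetic data, so the numbers it prints say nothing about the concrete dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# synthetic regression problem, for illustration only
X_demo = rng.uniform(-1.0, 1.0, size=(300, 4))
y_demo = (
    np.sin(3 * X_demo[:, 0])
    + X_demo[:, 1] ** 2
    + rng.normal(scale=0.1, size=300)
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# stand-in estimator and grid (assumptions, not the ones used in the exercise)
demo_grid = {'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid=demo_grid, cv=5)
search.fit(Xd_train, yd_train)

# one row per candidate; "test" here means the held-out validation folds
results = pd.DataFrame(search.cv_results_)
cols = [
    'param_max_depth',
    'param_min_samples_leaf',
    'mean_test_score',
    'std_test_score',
    'rank_test_score',
]
print(results[cols].sort_values('rank_test_score'))

print(f"Best CV R2: {search.best_score_:.3f}")
print(f"Train R2:   {search.score(Xd_train, yd_train):.3f}")
print(f"Test R2:    {search.score(Xd_test, yd_test):.3f}")
```

Note that `mean_test_score` is computed on validation folds carved out of the training set, so the actual test set is never seen during the search. Comparing `best_score_` with the final test score is a quick check that the selected hyperparameters generalize.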