# Training, Tuning, and Evaluating Models in Python

In this notebook, we will demonstrate:
1. How to train and evaluate models using scikit-learn `model.fit` and `model.predict`.
2. How to tune hyperparameters using `GridSearchCV`.

We will use a dataset with both categorical and numerical features to showcase preprocessing steps, model training, and evaluation. 

For the purposes of instruction, we're making some poor decisions when it comes to preprocessing and modeling. (We really don't need to be using PCA.) Be sure to understand the *why* of each step during your data dive project. 

If you don't have the following packages installed, delete the leading `#` on each line and run the cell.

In [116]:
# !pip install pandas
# !pip install scikit-learn

In [117]:
# Import necessary libraries
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

In [118]:
# Load dataset 
df = pd.read_csv("insurance_with_missing.csv")

In [119]:
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0,yes,southwest,16884.924
1,18.0,male,33.77,1,no,,1725.5523
2,28.0,male,33.0,3,no,southeast,4449.462
3,33.0,male,22.705,0,no,northwest,21984.47061
4,32.0,male,28.88,0,no,northwest,3866.8552


## Dataset Overview – Medical Insurance Charges

### Response

- `charges`: **[float64]** – Individual medical costs billed by health insurance (in USD)

### Features

- `age`: **[float64]** – Age of the primary insurance beneficiary (in years; stored as a float because the column contains missing values)

- `sex`: **[object]** – Gender of the insurance contractor (`male`, `female`)

- `bmi`: **[float64]** – Body Mass Index (kg/m²), calculated as weight divided by height squared  
  *(Healthy range: 18.5 – 24.9)*

- `children`: **[int64]** – Number of dependents covered by health insurance

- `smoker`: **[object]** – Smoking status (`yes`, `no`)

- `region`: **[object]** – Residential region in the U.S. (`northeast`, `southeast`, `southwest`, `northwest`)

### Problem Type

- Supervised Learning
- Regression (predicting a continuous target variable)


## Preprocessing

In [120]:
df.dtypes

age         float64
sex             str
bmi         float64
children      int64
smoker          str
region          str
charges     float64
dtype: object

All of our data types look as we expect them to, so we can proceed.

In [121]:
df.isna().sum()

age          66
sex           0
bmi         133
children      0
smoker        0
region       93
charges       0
dtype: int64

We have multiple columns with NA values. We will need to fix this before the modeling stage.

### Imputation

In [122]:
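# .copy() makes X independent of df, so the in-place imputation below
# changes X only and avoids chained-assignment warnings in older pandas versions
X = df[['age', 'sex', 'bmi', 'children', 'smoker', 'region']].copy()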
y = df['charges']

In [123]:
imputer_region = SimpleImputer(strategy='most_frequent')

X.loc[:,'region'] = imputer_region.fit_transform(X[['region']])

For categorical variables, we can replace the NA values with the most frequent value in the column. Here, `region` is the only categorical column with missing values.
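To make the `most_frequent` strategy concrete, here is a minimal sketch on a made-up five-row column (the `toy` DataFrame is hypothetical, not part of the insurance data). It shows that the imputer learns one fill value per column during `fit` and stores it in `statistics_`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column; the values are made up for illustration.
toy = pd.DataFrame(
    {"region": pd.Series(
        ["southeast", "northwest", np.nan, "southeast", np.nan], dtype=object
    )}
)

imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy[["region"]])

# statistics_ holds the fill value learned for each column during fit.
print(imputer.statistics_)
# fit_transform returns a 2D array; ravel() flattens it to one column.
print(filled.ravel())
```

Both missing entries are replaced by `southeast`, the mode of the observed values.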

In [124]:
imputer_mean = SimpleImputer(strategy='mean')

X.loc[:,'age'] = imputer_mean.fit_transform(X[['age']])
X.loc[:,'bmi'] = imputer_mean.fit_transform(X[['bmi']])

For the numeric variables, we can replace the NA values with the mean of the data points. (This is not always a good idea! Be sure to understand the consequences of imputation before applying it.)
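One consequence worth seeing for yourself: mean imputation pulls values toward the center and so shrinks the spread of the column. The sketch below uses a made-up `bmi` column (hypothetical values) and compares the standard deviation before and after imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical values, chosen so the arithmetic is easy to follow.
bmi = pd.DataFrame({"bmi": [20.0, 25.0, np.nan, 35.0, np.nan, 40.0]})

imputer = SimpleImputer(strategy="mean")
bmi_filled = imputer.fit_transform(bmi).ravel()

# pandas skips NaN when computing the standard deviation.
std_observed = bmi["bmi"].std()
std_imputed = pd.Series(bmi_filled).std()

print("Fill value:", imputer.statistics_[0])
print("Std of observed values:", std_observed)
print("Std after mean imputation:", std_imputed)
```

The imputed column has a smaller standard deviation than the observed values alone, because every filled-in point sits exactly at the mean.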

### One-Hot Encoding

In [125]:
encoder = OneHotEncoder()

encoded_categorical = encoder.fit_transform(X[['sex', 'smoker', 'region']])

In [126]:
encoded_categorical_df = pd.DataFrame(encoded_categorical.toarray(), columns=encoder.get_feature_names_out())
encoded_categorical_df

Unnamed: 0,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
4,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...
1333,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1334,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1335,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1336,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


Since most scikit-learn models cannot work with string-valued categorical data directly, we need to represent each category as its own binary (0/1) column.

In [127]:
X = X.drop(['sex', 'smoker', 'region'], axis=1)
X = pd.concat([X, encoded_categorical_df], axis=1)

X.head()

Unnamed: 0,age,bmi,children,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19.0,27.9,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,18.0,33.77,1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
2,28.0,33.0,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
3,33.0,22.705,0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
4,32.0,28.88,0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0


Then we can remove our categorical columns and add the binary columns to our original dataframe.

In [95]:
X.isna().sum()

age                 0
bmi                 0
children            0
sex_female          0
sex_male            0
smoker_no           0
smoker_yes          0
region_northeast    0
region_northwest    0
region_southeast    0
region_southwest    0
dtype: int64

### Splitting the Data

In [96]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [97]:
X_train.head()

Unnamed: 0,age,bmi,children,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
560,46.0,19.95,2,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1285,47.0,24.32,0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1142,52.0,24.86,0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
969,39.0,34.32,5,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
486,54.0,21.47,3,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


### Scaling

In [98]:
scaler = StandardScaler()

X_train[['age', 'bmi', 'children']] = scaler.fit_transform(X_train[['age', 'bmi', 'children']])
X_test[['age', 'bmi', 'children']] = scaler.transform(X_test[['age', 'bmi', 'children']])

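We want to scale our variables after the data has been split into training and test datasets. The scaler learns its mean and standard deviation from the training set only (`fit_transform`) and then applies those same values to the test set (`transform`), so no information from the test set leaks into preprocessing.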
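The following sketch, on randomly generated ages (synthetic data, not the insurance dataset), shows what fitting the scaler on the training split only looks like in numbers: the training column ends up with mean 0 and standard deviation 1, while the test column is only approximately standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic ages, generated only for this illustration.
rng = np.random.default_rng(0)
ages = rng.uniform(18, 64, size=(100, 1))

ages_train, ages_test = train_test_split(ages, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(ages_train)  # learns mean/std from train
test_scaled = scaler.transform(ages_test)        # reuses the train mean/std

print("Train mean/std:", train_scaled.mean(), train_scaled.std())
print("Test mean/std: ", test_scaled.mean(), test_scaled.std())
```

The test statistics are close to, but not exactly, 0 and 1. That is expected: the test set is being treated as unseen data.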

In [99]:
X_train.head()

Unnamed: 0,age,bmi,children,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
560,0.477927,-1.756525,0.734336,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1285,0.550992,-1.033082,-0.911192,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1142,0.916318,-0.943687,-0.911192,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
969,-0.03353,0.622393,3.202629,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
486,1.062449,-1.504893,1.5571,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


### PCA

Should we be using PCA with one-hot encoded variables? Maybe not. But let's do it anyway.

In [100]:
pca = PCA(n_components=0.95)

We want to keep 95% of the variance in the original data.

In [101]:
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [102]:
print("Number of components selected:", pca.n_components_)

Number of components selected: 7


We only need 7 principal components (columns) instead of our original 11 columns to keep 95% of the variance.

In [103]:
print("Explained variance ratio:")
print(pca.explained_variance_ratio_) 

Explained variance ratio:
[0.25010269 0.2209823  0.19230341 0.1101324  0.0713488  0.05751027
 0.04914904]


This tells us how much variance each of the principal components explains. 
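To see why 7 components were selected, we can add up the ratios printed above. The sketch below copies those seven values by hand and finds the first point at which the running total reaches 95%.

```python
import numpy as np

# Explained variance ratios copied from the output above.
ratios = np.array([0.25010269, 0.2209823, 0.19230341, 0.1101324,
                   0.0713488, 0.05751027, 0.04914904])

# Running total of variance explained by the first k components.
cumulative = np.cumsum(ratios)
print(cumulative)

# argmax returns the first index where the condition is True.
n_needed = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% of the variance:", n_needed)
```

Six components explain roughly 90% of the variance, so the seventh is needed to cross the 95% threshold.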

Wow, that was a lot of work! Is there an easier way to perform all of these steps? Hmm...

## Modeling

### L2 Penalty Linear Regression

In [104]:
# Initialize model
ridge = Ridge(alpha=1.0)

# Fit and predict 
ridge.fit(X_train_pca, y_train)
ridge_preds = ridge.predict(X_test_pca)

# Evaluate 
ridge_rmse = root_mean_squared_error(y_test, ridge_preds)
ridge_r2 = r2_score(y_test, ridge_preds)

print("Ridge RMSE:", ridge_rmse)
print("Ridge R2:", ridge_r2)

Ridge RMSE: 6055.583475861047
Ridge R2: 0.7637978044602285


The `alpha` parameter sets the strength of the penalty term.
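A quick way to see the effect of `alpha` is to fit the same model with several values and compare the size of the coefficients. The sketch below uses synthetic data (the true coefficients 3, -2, and 1 are made up) and records the Euclidean norm of the fitted coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=50)

norms = {}
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Size of the coefficient vector for this penalty strength.
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha}: coefficients={model.coef_}, norm={norms[alpha]:.3f}")
```

Larger values of `alpha` shrink the coefficients toward zero, trading some fit on the training data for a simpler model.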

### L1 Penalty Linear Regression

In [105]:
# Initialize model
lasso = Lasso(alpha=0.1)

# Fit and predict 
lasso.fit(X_train_pca, y_train)
lasso_preds = lasso.predict(X_test_pca)

# Evaluate 
lasso_rmse = root_mean_squared_error(y_test, lasso_preds)
lasso_r2 = r2_score(y_test, lasso_preds)

print("Lasso RMSE:", lasso_rmse)
print("Lasso R2:", lasso_r2)

Lasso RMSE: 6053.707219197562
Lasso R2: 0.7639441511462102


### K-Nearest Neighbors 

In [106]:
# Initialize model
knn = KNeighborsRegressor(n_neighbors=18)

# Fit and predict  
knn.fit(X_train_pca, y_train)
y_pred_knn = knn.predict(X_test_pca)

# Evaluate
knn_rmse = root_mean_squared_error(y_test, y_pred_knn)
knn_r2 = r2_score(y_test, y_pred_knn)

print("KNN RMSE:", knn_rmse)
print("KNN R2:", knn_r2)

KNN RMSE: 6406.481607380838
KNN R2: 0.7356306477268459


The `n_neighbors` parameter is the number of data points closest to the new point to consider when making a prediction.
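With the default uniform weights, the prediction of a KNN regressor is simply the average target of the `n_neighbors` closest training points. The tiny made-up example below (four hypothetical points) lets you check this by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Four hypothetical training points on a single feature.
X_demo = np.array([[1.0], [2.0], [3.0], [10.0]])
y_demo = np.array([10.0, 20.0, 30.0, 100.0])

knn_demo = KNeighborsRegressor(n_neighbors=2).fit(X_demo, y_demo)

# Which training rows are closest to the new point x = 2.4?
distances, indices = knn_demo.kneighbors([[2.4]])
print("Nearest rows:", indices[0])

# The prediction is the mean of those rows' targets.
pred = knn_demo.predict([[2.4]])
print("Prediction:", pred[0])
```

The two nearest points are x = 2 and x = 3, so the prediction is the mean of 20 and 30.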

### Decision Tree Regression 

In [107]:
# Initialize model
dt = DecisionTreeRegressor(random_state=42)

# Fit and predict (on the scaled features, not the PCA components used above)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Evaluate
dt_rmse = root_mean_squared_error(y_test, y_pred_dt)
dt_r2 = r2_score(y_test, y_pred_dt)

print("Decision Tree RMSE:", dt_rmse)
print("Decision Tree R2:", dt_r2)


Decision Tree RMSE: 8185.221205589798
Decision Tree R2: 0.5684483503300826


We haven't set a `max_depth` for the tree, so with the default settings it keeps splitting until every leaf is pure or holds a single sample (and will probably overfit).
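The sketch below illustrates this on synthetic data (a noisy sine curve, made up for the example). It compares an unrestricted tree with one limited to `max_depth=3`, looking at depth, number of leaves, and the R2 score on the training data itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: evenly spaced x values and a noisy sine target.
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=200)

deep = DecisionTreeRegressor(random_state=42).fit(X_demo, y_demo)
shallow = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_demo, y_demo)

for name, tree in [("No max_depth", deep), ("max_depth=3", shallow)]:
    print(name,
          "| depth:", tree.get_depth(),
          "| leaves:", tree.get_n_leaves(),
          "| train R2:", tree.score(X_demo, y_demo))
```

The unrestricted tree reproduces the training data almost perfectly, noise included, which is the signature of overfitting. A score on held-out data is needed to judge which tree actually generalizes better.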

That was cool, but we're a bit limited by only being able to choose one set of hyperparameters each time...

## Pipeline 

### What is GridSearchCV?

GridSearchCV is a scikit-learn tool for **hyperparameter tuning**. It tries every combination of the hyperparameter values you specify (e.g., `n_neighbors` in KNN, `max_depth` in decision trees) and keeps the combination with the best cross-validated score.

### How It Works:
1. **Define a Parameter Grid**: Specify ranges for hyperparameters.
2. **Cross-Validation**: Train and evaluate the model on multiple data splits for each combination.
3. **Select Best Parameters**: Choose the combination with the best performance.
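The three steps above can be sketched by hand with `ParameterGrid` and `cross_val_score`. This is a simplified illustration on synthetic data from `make_regression`, with a small made-up grid, not how `GridSearchCV` is implemented internally. The last lines run `GridSearchCV` on the same grid for comparison.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=10.0, random_state=0)

# 1) Define a parameter grid (2 x 2 = 4 combinations).
grid_values = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}

# 2) Cross-validate a model for each combination.
results = []
for params in ParameterGrid(grid_values):
    model = DecisionTreeRegressor(random_state=42, **params)
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_root_mean_squared_error")
    results.append((scores.mean(), params))

# 3) Select the combination with the best (highest) mean score.
best_score, best_params = max(results, key=lambda r: r[0])
print("Manual search:", best_params, "CV RMSE:", -best_score)

# The same search using GridSearchCV.
gs = GridSearchCV(DecisionTreeRegressor(random_state=42), grid_values, cv=5,
                  scoring="neg_root_mean_squared_error").fit(X_demo, y_demo)
print("GridSearchCV: ", gs.best_params_, "CV RMSE:", -gs.best_score_)
```

The score is the negative RMSE, so "highest" means "smallest error". After the search, `GridSearchCV` also refits the best combination on all of the data passed to `fit`, which the manual loop does not do.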

Next, we’ll use GridSearchCV to tune a model and evaluate its performance!


In [108]:
# 1) Define X and y
X = df[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]
y = df['charges']

# 2) Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [109]:
# 3) Define preprocessing for numerical and categorical features
numeric_features = ['age', 'bmi', 'children']
categorical_features = ['sex', 'smoker', 'region']

numeric_transformer = Pipeline(steps=[
    ('imputer_num', SimpleImputer(strategy='mean'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer_cat', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# 4) Pipeline: preprocessing + DecisionTreeRegressor
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('tree', DecisionTreeRegressor(random_state=42))
])


In [110]:
# 5) Define the parameter grid for GridSearchCV
param_grid = {
    'tree__max_depth': [3, 5, 10, None],
    'tree__min_samples_split': [2, 5, 10],
    'tree__min_samples_leaf': [1, 2, 4]
}

# 6) Set up GridSearchCV with the pipeline and parameter grid
grid_search = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error'
)

# 7) Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best CV RMSE:", -grid_search.best_score_)

# 8) Evaluate best model on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("Test RMSE:", root_mean_squared_error(y_test, y_pred))
print("Test R2:", r2_score(y_test, y_pred))


Best Parameters: {'tree__max_depth': 3, 'tree__min_samples_leaf': 1, 'tree__min_samples_split': 2}
Best CV RMSE: 4885.074803214875
Test RMSE: 6588.430526151194
Test R2: 0.7204008278790928


### Group Activity: Titanic Survival Prediction with Preprocessing + GridSearchCV

#### Objective
You will use the Titanic dataset to build a complete machine learning workflow that:
1) selects features and the target variable  
2) splits the data into train/test sets  
3) builds a preprocessing pipeline for numeric and categorical features  
4) trains a Decision Tree model inside a Pipeline  
5) uses GridSearchCV to tune hyperparameters using cross-validation  
6) evaluates the tuned model on the test set

This assignment focuses on building a correct **scikit-learn Pipeline** and using **GridSearchCV**.


#### Dataset Columns
#### Response

- `Survived`: ***[int64]*** - Survival (0 = No, 1 = Yes)


#### Features

- `Pclass`: ***[int64]*** - Passenger Class (1 = 1st, 2 = 2nd, 3 = 3rd)
- `Sex`: ***[object]*** - Sex (male, female)
- `Age`: ***[float64]*** - Age (in years)
- `SibSp`: ***[int64]*** - Number of Siblings/Spouses Aboard
- `Parch`: ***[int64]*** - Number of Parents/Children Aboard
- `Fare`: ***[float64]*** - Fare (in British pounds)

### Important: Classification vs Regression

The target variable is:

- `Survived` → 0 = No, 1 = Yes

This is a **binary classification problem**, NOT a regression problem.

Therefore:

- You must use a **classifier** (e.g., `DecisionTreeClassifier`)
- You must NOT use a regressor (e.g., `DecisionTreeRegressor`)
- Your evaluation metric should be **accuracy**, not RMSE
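To show the difference in code without using the Titanic data, here is a minimal sketch on a synthetic binary dataset from `make_classification`. It only swaps in the classifier and the accuracy metric; the preprocessing pipeline and `GridSearchCV` (where `scoring='accuracy'` would be the matching choice) are left for the activity.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A classifier predicts class labels (0 or 1), not continuous values.
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
preds = clf.predict(X_te)

# Accuracy is the fraction of test rows labeled correctly.
acc = accuracy_score(y_te, preds)
print("Predicted labels:", sorted(set(preds.tolist())))
print("Test accuracy:", acc)
```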


In [111]:
# Load Titanic dataset 
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select features and target variable
X = data[[]]
y = data[[]]

In [112]:
# Split the data into training and test sets 


In [113]:
# Define preprocessing for numerical and categorical features


# Combine preprocessors in a column transformer



In [114]:
# Define the parameter grid for GridSearchCV


# Create a pipeline with the preprocessor and Decision Tree model


# Set up GridSearchCV with the pipeline and parameter grid


# Fit GridSearchCV on the training data


# Evaluate the best model on the test set
