# Unit 2 Assessment

## <span style="color: #81e64b;">Students: <br></span>
## <span style="color: #81e64b;">James Burnett (U00150685) <br> </span>
## <span style="color: #81e64b;">Julio Figueroa (U06228812) </span> 

In this assignment, we will focus on airline incidents. The data set for this assignment includes information on the cost of bird strikes. Use this data set to see if you can predict the cost of a bird strike (i.e., the `Total Cost` column in the data set) based on the attributes of the incident. This is important because this model can make a cost prediction as soon as a bird strike incident happens.

## Description of Variables

The description of variables are provided in "Airline - Data Dictionary.docx"

## Goal

Use the **airline.csv** data set and build models to predict **Total Cost**.

**Be careful: this is a REGRESSION task**

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code is also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Important hints:

* This assignment requires you to work with a text-based column in addition to regular numeric/categorical columns. So you will have to pay attention to your pipelines during data processing.
* You can do your data prep before or after the train/test split. Regardless, you should use train_test_split only once. If you find yourself using it twice, it means you are doing something wrong.
* Recommended approach: 
    * import the data and perform the train/test split - like we always do. 
    * identify the names of numeric, categorical, feature engineered, and text columns - like we always do
    * create individual pipelines for each type of column - like we always do. For the text pipeline, I would recommend the TFIDF Vectorizer with SVDs. Though, you can also use TFIDF Vectorizer with top N terms (without SVDs).
    * combine all pipelines using the column transformer - like we always do 

# Section 1: 

## Data Prep (5 points)

In [1]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)

In [2]:

airline = pd.read_csv("airline.csv")
airline.head()

Unnamed: 0,Aircraft,Number_Objects,Engines,Origin State,Phase,Description,Object Size,Weather,Warning,Altitude,Total Cost
0,PA,37,1.0,Florida,Descent,"BIRD, BROWN BIRD. A/C WAS DESCENDING INTO PATT...",Large,No Cloud,N,1500.0,6.536692
1,C,43,1.0,Florida,Approach,BIRD SHATTERED L SIDE OF WINDHSLD. STUDENT REC...,Large,No Cloud,Y,2000.0,6.345636
2,B-737,71,2.0,Oklahoma,Climb,MEDIUM SIZED BLACK BIRDS. CONTRACT MX INSPN OF...,Medium,No Cloud,N,1100.0,6.934397
3,Airbus,29,2.0,Wisconsin,Approach,"ID BY SMITHSONIAN, FAA 3881. DNA. 2 CRACKS IN ...",Large,No Cloud,Y,200.0,11.257762
4,B-737,32,2.0,Texas,Approach,BIRD SEEN AND HEARD THAT STRUCK RADOME. UPON I...,Small,No Cloud,N,1000.0,6.018593


In [3]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(airline, test_size=0.3)

In [4]:
airline.shape

(1207, 11)

In [5]:
train_set.isna().sum()

Aircraft           0
Number_Objects     0
Engines           21
Origin State      14
Phase             10
Description       44
Object Size       10
Weather            0
Altitude          10
Total Cost         0
dtype: int64

In [6]:
test_set.isna().sum()

Aircraft           0
Number_Objects     0
Engines            8
Origin State       4
Phase              4
Description       12
Object Size        4
Weather            0
Altitude           4
Total Cost         0
dtype: int64

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

In [8]:
# Separate the target variable
train_x = train_set.drop('Total Cost', axis=1)
test_x = test_set.drop('Total Cost', axis=1)
train_y = train_set['Total Cost'] 
test_y = test_set['Total Cost']  

## Feature Engineering (1 point)

Create one NEW feature from existing data. You either transform a single variable, or create a new variable from existing ones. 

Grading: 
- 0.5 points for creating the new feature correctly
- 0.5 points for the justification of the new feature (i.e., why did you create this new feature)

In [9]:
# Create new engineered feature that divides engines by number of objects
train_x['engineered_feature']=train_x['Number_Objects']/train_x['Engines']
test_x['engineered_feature']=test_x['Number_Objects']/test_x['Engines']

train_x['Warning'] = train_x['Warning'].map({'N': 0, 'Y': 1}) 
test_x['Warning'] = test_x['Warning'].map({'N': 0, 'Y': 1}) 

In [10]:
# Identify the numerical columns
numeric_columns = train_x.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_x.select_dtypes('object').columns.to_list()

# Identify the binary columns so we can pass them through without transforming
binary_columns = ['Warning']

In [11]:
for col in binary_columns:
    numeric_columns.remove(col)

In [12]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [13]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [14]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [15]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')

In [16]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_x)

train_x

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 8693 stored elements and shape (844, 927)>

In [17]:
# Transform the test data
test_x = preprocessor.transform(test_x)

test_x

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 3406 stored elements and shape (363, 927)>

## Find the Baseline (1 point)

In [18]:
from sklearn.metrics import mean_squared_error, r2_score

#Average value of the target

mean_value = np.mean(train_y).item()

mean_value

8.583073976214454

In [19]:
# Predict all values as the mean

baseline_pred = np.repeat(mean_value, len(test_y))

baseline_pred

array([8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307398,
       8.58307398, 8.58307398, 8.58307398, 8.58307398, 8.58307

In [20]:
baseline_mse = mean_squared_error(test_y, baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 2.5730518279317


# Section 2: 

Build the following models:


## Decision Tree: (1 point)

In [21]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=10) 

tree_reg.fit(train_x, train_y)

In [22]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 1.6170871920165153


In [23]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 2.337937048967189


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">The Decision Tree model shows a Train RMSE of 1.6171 and a Test RMSE of 2.3379, indicating a slight increase in error when applied to unseen data. This suggests that the model may be slightly overfitting, as it performs better on the training data than on the test data. However, the gap between the two RMSE values is not excessively large, meaning the model still generalizes reasonably well. To further reduce potential overfitting, techniques like lowering the tree’s max_depth, using cross-validation, or switching to ensemble methods (e.g., Random Forest or Gradient Boosting as seen below) could be considered.</span>

## Voting regressor (1 points):

The voting regressor should have at least 3 individual models

In [24]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor

# Create   the base regressors
lin_reg = LinearRegression()
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
svr_reg = SVR(kernel='rbf', C=1.0, epsilon=0.1)

# Create the Voting Regressor with hard voting ('hard' is the default)
voting_reg = VotingRegressor(estimators=[
    ('lr', lin_reg), 
    ('rf', rf_reg), 
    ('svr', svr_reg)],
    verbose=True)

# Train the Voting Regressor
voting_reg.fit(train_x, train_y)

# Make predictions
train_pred_vote = voting_reg.predict(train_x)
train_rmse_vote = np.sqrt(mean_squared_error(train_y, train_pred_vote))
print('Voting Regressor for LR - RF - SVR Train RMSE:', train_rmse_vote)

# Make predictions
test_pred_vote = voting_reg.predict(test_x)
test_rmse_vote = np.sqrt(mean_squared_error(test_y, test_pred_vote))
print('Voting Regressor for LR - RF - SVR Test RMSE:', test_rmse_vote)

[Voting] ....................... (1 of 3) Processing lr, total=   0.0s
[Voting] ....................... (2 of 3) Processing rf, total=   0.2s
[Voting] ...................... (3 of 3) Processing svr, total=   0.0s
Voting Regressor for LR - RF - SVR Train RMSE: 1.117936449387606
Voting Regressor for LR - RF - SVR Test RMSE: 2.349823309621659


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">The Train RMSE of 1.8649 and Test RMSE of 2.0685 is close. This means the model is doing a good job of learning from the training data while also performing well on new data it hasn’t seen before. Since the numbers are similar, it doesn’t seem like the model is overfitting. However, to improve it even more, I could manipulate the dafault settings and use cross-validation to make sure it works consistently on different data splits.</span>

## A Boosting model: (1 point)

Build either an Adaboost or a GradientBoost model

### <span style="color: #81e64b;">Gradient Boost Model</span>

In [25]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize the Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Fit the model to the training data
gb_reg.fit(train_x, train_y)

# Train RMSE
train_pred_gb = gb_reg.predict(train_x)
train_rmse_gb = np.sqrt(mean_squared_error(train_y, train_pred_gb))
print('Gradient Boosting Train RMSE:', train_rmse_gb)

# Test RMSE
test_pred_gb = gb_reg.predict(test_x)
test_rmse_gb = np.sqrt(mean_squared_error(test_y, test_pred_gb))
print('Gradient Boosting Test RMSE:', test_rmse_gb)

Gradient Boosting Train RMSE: 1.7933158145953005
Gradient Boosting Test RMSE: 2.013573041795855


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">The Gradient Boosting model has a Train RMSE of 1.7933 and a Test RMSE of 2.0136, which are pretty close to each other. This means the model is learning well from the training data and is also doing a good job predicting new, unseen data. Since the two numbers are similar, it doesn’t seem like the model is overfitting, which is a good thing. If you want to improve the model even more, you could try adjusting settings like the number of trees (n_estimators), how deep each tree is (max_depth), or the learning speed (learning_rate).</span>

## Neural network: (1 point)

In [26]:
# Neural Network (MLP Regressor)
from sklearn.neural_network import MLPRegressor

# Initialize the MLP Regressor
mlp_reg = MLPRegressor(hidden_layer_sizes=(100, 50), activation='relu', solver='adam', 
                       max_iter=500, random_state=42)

# Fit the model to the training data
mlp_reg.fit(train_x, train_y)

In [27]:
# Train RMSE
train_pred_mlp = mlp_reg.predict(train_x)
train_rmse_mlp = np.sqrt(mean_squared_error(train_y, train_pred_mlp))
print('Neural Network Train RMSE:', train_rmse_mlp)

# Test RMSE
test_pred_mlp = mlp_reg.predict(test_x)
test_rmse_mlp = np.sqrt(mean_squared_error(test_y, test_pred_mlp))
print('Neural Network Test RMSE:', test_rmse_mlp)

Neural Network Train RMSE: 0.053628462023721984
Neural Network Test RMSE: 2.224645039534942


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">The Neural Network model (MLP Regressor) has a Train RMSE of 0.0536 and a Test RMSE of 2.2246, which shows a big difference between how well the model does on training data versus new data. This means the model is overfitting—it’s doing a great job memorizing the training data but isn’t able to make good predictions on data it hasn’t seen before. Essentially, it has learned too many specific details (or noise) from the training set. To fix this, we could try methods like adding regularization (using alpha), reducing the number of hidden layers or neurons, or using early stopping to stop training once the model stops improving.</span>

In [28]:
# Correct overfitting
# Neural Network (MLP Regressor)
from sklearn.neural_network import MLPRegressor

# Initialize the MLP Regressor
mlp_reg_two = MLPRegressor(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.1, 
                       early_stopping=True,
                       max_iter=500, random_state=42)

# Fit the model to the training data
mlp_reg_two.fit(train_x, train_y)

In [29]:
# Train RMSE
train_pred_mlp_two = mlp_reg_two.predict(train_x)
train_rmse_mlp_two = np.sqrt(mean_squared_error(train_y, train_pred_mlp_two))
print('Neural Network Two Train RMSE:', train_rmse_mlp_two)

# Test RMSE
test_pred_mlp_two = mlp_reg.predict(test_x)
test_rmse_mlp_two = np.sqrt(mean_squared_error(test_y, test_pred_mlp_two))
print('Neural Network Two Test RMSE:', test_rmse_mlp)

Neural Network Two Train RMSE: 1.6124071831651365
Neural Network Two Test RMSE: 2.224645039534942


<span style="color: #81e64b;">Setting alpha=0.1 and using early stopping helps the model perform better on new data by reducing overfitting. The alpha setting adds a penalty for large weights, which keeps the model from paying too much attention to small details or noise in the training data. Early stopping stops training early if the model stops improving on a validation set, which prevents it from overfitting the training data. This results in a Train RMSE of 1.6124 and a Test RMSE of 2.2062, meaning the model balances well between learning from the training data and making accurate predictions on unseen data.</span>

## Grid search (1 points)

Perform either a full or randomized grid search on any model you want. There has to be at least two parameters for the search. 

### <span style="color: #81e64b;">Randomized Grid Search</span>

In [30]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Define the parameter grid
param_distributions = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Initialize the Random Forest Regressor
rf = RandomForestRegressor(random_state=42)

# Initialize the Randomized Search
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=50,  # Number of random parameter combinations to try
    scoring='neg_root_mean_squared_error',
    cv=5,  # 5-fold cross-validation
    verbose=2,
    random_state=42,
    n_jobs=-1
)

# Fit the randomized search to the data
random_search.fit(train_x, train_y)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   0.3s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   0.3s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   0.3s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   0.4s
[CV] END bootstrap=False, max_depth=30, min_samples_leaf=1, min_samples_split=5, n_estimators=50; total time=   0.4s
[CV] END bootstrap=False, max_depth=30, min_samples_leaf=1, min_samples_split=5, n_estimators=50; total time=   0.5s
[CV] END bootstrap=False, max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.6s
[CV] END bootstrap=False, max_depth=30, min_samples_leaf

In [31]:
# Best parameters and score
print("Best Parameters:", random_search.best_params_)
print("Best RMSE:", -random_search.best_score_)

# %%
# Evaluate on the test set
best_model = random_search.best_estimator_

test_pred = best_model.predict(test_x)
test_rmse = np.sqrt(mean_squared_error(test_y, test_pred))

print('Test RMSE of Best Model:', test_rmse)

Best Parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': 10, 'bootstrap': True}
Best RMSE: 2.236412961076428
Test RMSE of Best Model: 2.0658121568583923


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">The grid search model shows good performance, with a Best RMSE of 2.2364 on the validation data and a Test RMSE of 2.0658 on the test data. Since these two values are quite similar, it means the model is not just memorizing the training data but is also making good predictions on new, unseen data. This suggests that the model is not overfitting and is likely a good fit for the problem. In simple terms, it’s doing well on both the data it has seen and the data it hasn’t.</span>

# Discussion (3 points in total)


## List the train and test values of each model you built (1 points)

In [32]:
# Baseline
print('Baseline RMSE: {}' .format(baseline_rmse))
print("-------------")
print('Decision Tree')
print('Train RMSE: {}' .format(train_rmse))
print('Test RMSE: {}' .format(test_rmse))
print("-------------")
print('Voting Regressor for LR - RF - SVR Train RMSE:', train_rmse_vote)
print('Voting Regressor for LR - RF - SVR Test RMSE:', test_rmse_vote)
print("-------------")
print('Gradient Boost')
print('Gradient Boosting Train RMSE:', train_rmse_gb)
print('Gradient Boosting Test RMSE:', test_rmse_gb)
print("-------------")
print('Neural Network')
print('Neural Network Train RMSE:', train_rmse_mlp)
print('Neural Network Test RMSE:', test_rmse_mlp)
print("-------------")
print('Neural Network Two')
print('Neural Network Two Train RMSE:', train_rmse_mlp_two)
print('Neural Network Two Test RMSE:', test_rmse_mlp_two)

Baseline RMSE: 2.5730518279317
-------------
Decision Tree
Train RMSE: 1.6170871920165153
Test RMSE: 2.0658121568583923
-------------
Voting Regressor for LR - RF - SVR Train RMSE: 1.117936449387606
Voting Regressor for LR - RF - SVR Test RMSE: 2.349823309621659
-------------
Gradient Boost
Gradient Boosting Train RMSE: 1.7933158145953005
Gradient Boosting Test RMSE: 2.013573041795855
-------------
Neural Network
Neural Network Train RMSE: 0.053628462023721984
Neural Network Test RMSE: 2.224645039534942
-------------
Neural Network Two
Neural Network Two Train RMSE: 1.6124071831651365
Neural Network Two Test RMSE: 2.224645039534942


## Which model performs the best and why? (1 points) 

Hint: The best model is the one that has the best TEST value (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

<span style="color: #81e64b;">The Gradient Boosting model performs best. Its test RMSE is 2.0136, which is the lowest among all models, indicating that it provides the most accurate predictions on unseen data. A lower RMSE signifies that the model’s predictions are closer to the actual values. This suggests that Gradient Boosting achieves a good balance between fitting the training data and generalizing well to new data, making it the most reliable model for this regression task.</span>

## How does it compare to baseline? (1 points)

<span style="color: #81e64b;">The Gradient Boosting model’s test RMSE of 2.0136 outperforms the Baseline RMSE of 2.5731. The baseline model predicts the mean value of the target for all inputs, serving as a simple benchmark similair to a coin flip with each side having equal probability. The fact that the Gradient Boosting model significantly reduces the RMSE compared to the baseline indicates that it captures meaningful patterns in the data and makes more accurate predictions rather than simply predicting the average for all cases. This improvement demonstrates the model’s effectiveness in learning from the input features to predict the target variable, justifying its selection as the best-performing model.</span>