# Bird Strike Project

In this assignment, we will focus on airline incidents. The data set for this assignment includes information on the cost of bird strikes. Use this data set to see if you can predict the cost of a bird strike (i.e., the `Total Cost` column in the data set) based on the attributes of the incident. This is important because this model can make a cost prediction as soon as a bird strike incident happens.

## Description of Variables

Descriptions of the variables are provided in "Airline - Data Dictionary.docx".

## Goal

Use the **airline.csv** data set and build models to predict **Total Cost**.

**Be careful: this is a REGRESSION task**

# Section 1: 

In [2]:
import pandas as pd
import numpy as np

np.random.seed(42)

# Get the data

In [3]:
# We will predict the "Total Cost" value in the data set:

airline = pd.read_csv("airline.csv")
airline.head()

Unnamed: 0,Aircraft,Number_Objects,Engines,Airline,Origin State,Phase,Description,Object Size,Weather,Warning,Altitude,Total Cost
0,B-737-400,859,2.0,US AIRWAYS*,New York,Climb,FLT 753. PILOT REPTD A HUNDRED BIRDS ON UNKN T...,Medium,No Cloud,N,1500.0,30736
1,LEARJET-25,227,2.0,BUSINESS,Delaware,Climb,,Small,No Cloud,N,150.0,1481711
2,A-320,320,2.0,UNITED AIRLINES,DC,Approach,WS ASSISTED IN CLEAN-UP OF 273 STARLINGS AND 1...,Small,Some Cloud,Y,100.0,1483141
3,HAWKER 800,3,2.0,EXECUTIVE JET AVIATION,Colorado,Approach,"SAW SML FLOCK FLYING UPON LDG FLARE, ACROSS RW...",Small,No Cloud,N,20.0,8600
4,DC-9-10,5,2.0,NORTHWEST AIRLINES,Minnesota,Climb,FLT 1493 STATED HE FLEW THRU A FLOCK OF ABOUT ...,Large,Overcast,Y,800.0,35146


In [4]:
# Check for missing values

airline[['Description']].isna().sum()

Description    56
dtype: int64

In [5]:
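# Fill the missing descriptions with a placeholder string
# (assign the result back instead of calling inplace=True on a column selection)

airline['Description'] = airline['Description'].fillna('missing')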

# Split the data into train and test

In [6]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(airline, test_size=0.3)
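
The split above has no explicit `random_state`, so it depends on the global NumPy seed set earlier with `np.random.seed(42)`. Re-running only this cell would give a different split. A minimal sketch of pinning the split directly, using a hypothetical toy frame rather than airline.csv:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for airline.csv
toy = pd.DataFrame({"x": range(10), "Total Cost": range(100, 110)})

# Passing random_state makes the split independent of the global NumPy seed
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=42)

# Both calls select exactly the same rows
print(train_a.index.equals(train_b.index))
print(len(train_a), len(test_a))
```

Whether to pin the seed here or keep the global seed is a design choice; the explicit argument is simply easier to audit.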

# Check the missing values

In [7]:
train_set.isna().sum()

Aircraft           0
Number_Objects     0
Engines           23
Airline           13
Origin State      15
Phase             13
Description        0
Object Size       13
Weather            0
Altitude          13
Total Cost         0
dtype: int64

In [8]:
test_set.isna().sum()

Aircraft          0
Number_Objects    0
Engines           6
Airline           1
Origin State      3
Phase             1
Description       0
Object Size       1
Weather           0
Altitude          1
Total Cost        0
dtype: int64

## Data Prep 

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Separate the target variable

In [10]:
train_target = train_set[['Total Cost']]
test_target = test_set[['Total Cost']]

train_inputs = train_set.drop(['Total Cost'], axis=1)
test_inputs = test_set.drop(['Total Cost'], axis=1)

## Feature Engineering

Create one NEW feature from existing data. You can either transform a single variable or create a new variable from existing ones.


The new feature I created is the ratio of `Engines` to `Number_Objects`.

In [11]:
def new_col1(df):

    # Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()

    # Compute the ratio; fill NaN with 0 (0/0 or a missing operand generates "nan" values)
    df1['Engine_to_Object_Ratio'] = (df1['Engines']/df1['Number_Objects']).fillna(0)

    # Replace the infinity values with 1 (because a nonzero value divided by 0 generates infinity)
    df1['Engine_to_Object_Ratio'] = df1['Engine_to_Object_Ratio'].replace(np.inf, 1)

    # To check whether the calculation is made correctly, return df1 instead
    return df1[['Engine_to_Object_Ratio']]

In [12]:
#Let's test the new function:

# Send the train set to the function we created
new_col1(train_set)

Unnamed: 0,Engine_to_Object_Ratio
864,2.0
1083,2.0
692,2.0
354,2.0
1081,0.0
...,...
1044,2.0
1095,2.0
1130,2.0
860,2.0


In [13]:
train_inputs.dtypes

Aircraft           object
Number_Objects      int64
Engines           float64
Airline            object
Origin State       object
Phase              object
Description        object
Object Size        object
Weather            object
Altitude          float64
dtype: object

Justification of the new feature

I believe that "Engine to Object Ratio" is useful because it relates the number of engines on the aircraft to the number of objects involved in the strike. Because the ratio is `Engines` / `Number_Objects`, a lower value means more objects per engine, which could indicate more extensive damage and therefore a higher cost. The feature may help the model distinguish single-object strikes from flock strikes on aircraft of different sizes.
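
To see how the ratio behaves at the edge cases the function handles (0/0, division by zero, and missing values), here is a small sketch on a hypothetical toy frame. The helper name `engine_to_object_ratio` is illustrative; it mirrors the logic of `new_col1` and is not part of the notebook.

```python
import numpy as np
import pandas as pd

def engine_to_object_ratio(df):
    """Illustrative copy of the new_col1 logic: Engines / Number_Objects."""
    ratio = (df["Engines"] / df["Number_Objects"]).fillna(0)  # NaN (0/0 or missing) -> 0
    ratio = ratio.replace(np.inf, 1)                          # x/0 -> inf -> 1
    return ratio.to_frame("Engine_to_Object_Ratio")

# Hypothetical rows: a normal case, 0/0, 2/0, and a missing engine count
toy = pd.DataFrame({
    "Engines": [2.0, 0.0, 2.0, np.nan],
    "Number_Objects": [4, 0, 0, 3],
})

toy_ratio = engine_to_object_ratio(toy)
print(toy_ratio["Engine_to_Object_Ratio"].tolist())
```

Note that a missing `Engines` value also ends up as 0 here, because `fillna(0)` runs before any imputation in the pipeline.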

# Sklearn: Text preparation

In [14]:
def new_col(df):
    # Convert the 'Description' column to a numpy array, then call ravel()
    # to make it one-dimensional, which is what TfidfVectorizer expects
    return np.array(df['Description']).ravel()

In [15]:
new_col(train_set)

array(['HIT BIRD IN REDUCED VISABILITY (ABOUT 4 MILES) RAIN. DID NOT SEE BIRD UNTIL LAST MOMENT. ONLY REALIZED STRIKE HAPPENED AFTER LDG DURING POST FLT INSPN. BIRD WAS INGESTED INTO #2 ENG WITH NO DMG NOTED DURING BORESCOPE. UNABLE TO RECOVER ANY USABLE FEATHERS',
       'BIRD WAYON RWY, FLEW UP, STRUCK RT LEADING EDGE MIDWING. NO DMG',
       'NOTIFIED DEPTARTURE. DO NOT KNOW IF REMAINS WERE COLLECTED. EST $100 FOR INSPN.',
       'FLT1040',
       "MALLARD STRUCK RADOME, DAMAGING BOTH RADOME AND RADAR. CARCASS COLLECTED BY ARPT OPS AND ID/ADDED TO STRIKE DATABASE BY USDA/WILDLF SPEC. MALE MALLARD DUCK ID'S FROM CARCASS COLLECTION BY USDA/WILDLF SPEC.",
       'ENG WAS CLEANED OF BIRD DEBRIS. NO FURTHER WORK WAS REQUIRED.',
       'INITIAL CLIMB, GEESE SPOTTED COMING UP IN FRONT OF A/C. NO TIME TO AVOID. NUMEROUS STRIKES, #1 ENG FAILED DUE TO INGESTION. FLT DIVERTED TO AGS & LANDED UNEVENTFULLY. 14 GEESE FOUND ON RWY. 3 FOUND ON A/C INLET. MIN STRUCK 17. ID BY SMITHSONIAN FAA 910.',


In [16]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [17]:
# Identify the binary columns so we can pass them through without transforming
binary_columns = ['Warning']

In [18]:
# Be careful: categorical columns already include the binary columns,
# so we need to remove the binary columns from the categorical columns.

for col in binary_columns:
    categorical_columns.remove(col)

In [19]:
binary_columns



In [20]:
numeric_columns

['Number_Objects', 'Engines', 'Altitude']

In [21]:
categorical_columns

['Aircraft',
 'Airline',
 'Origin State',
 'Phase',
 'Description',
 'Object Size',
 'Weather']

In [22]:
feat_eng_columns = ['Engines', 'Number_Objects']

In [23]:
text_column = ['Description']

In [24]:
# Be careful: categorical columns already include the text columns,
# so we need to remove the text columns from the categorical columns.

for col in text_column:
    categorical_columns.remove(col)

In [25]:
categorical_columns

['Aircraft', 'Airline', 'Origin State', 'Phase', 'Object Size', 'Weather']

# Pipeline

In [26]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [27]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [28]:
from sklearn.preprocessing import OrdinalEncoder

binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=[['N', 'Y']]))])

In [29]:

my_new_column1 = Pipeline(steps=[('my_new_column1', FunctionTransformer(new_col1)),
                               ('scaler', StandardScaler())])

In [30]:
number_svd_components = 300

In [31]:
text_transformer = Pipeline(steps=[
                ('my_new_column', FunctionTransformer(new_col)),
                ('text', TfidfVectorizer(stop_words='english')),
                ('svd', TruncatedSVD(n_components=number_svd_components, n_iter=10))
            ])
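
The text pipeline turns each description into a TF-IDF vector and then compresses it with truncated SVD. A minimal sketch of those two steps on a hypothetical four-document corpus, with `n_components=2` chosen only because the toy vocabulary is tiny:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical descriptions written in the style of the data set
docs = [
    "BIRD STRUCK RADOME NO DMG",
    "GEESE INGESTED INTO ENG ENG FAILED",
    "BIRD STRUCK WING NO DMG",
    "FLOCK OF GEESE ON RWY",
]

text_pipe = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(stop_words="english")),       # sparse term weights
    ("svd", TruncatedSVD(n_components=2, random_state=0)),  # dense, low-dimensional
])

reduced = text_pipe.fit_transform(docs)
vocab_size = len(text_pipe.named_steps["tfidf"].vocabulary_)

# One row per document, one column per SVD component
print(vocab_size, reduced.shape)
```

The number of components has to stay below the vocabulary size, which is worth checking if the corpus is small.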

In [32]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns),
        ('trans', my_new_column1, feat_eng_columns),
        ('text', text_transformer,text_column),
],
        remainder='drop')


# Transform: fit_transform() for TRAIN

In [33]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[-1.22997682e-01,  1.49320144e-01,  8.61530209e-01, ...,
         9.78352984e-02, -1.14917078e-02, -9.72581495e-02],
       [-1.22997682e-01,  1.49320144e-01, -5.29337739e-01, ...,
         1.53227731e-03, -4.72851209e-02,  3.02531360e-02],
       [-1.22997682e-01,  1.49320144e-01, -5.29337739e-01, ...,
         5.63993729e-03, -1.01024312e-02,  5.06883837e-02],
       ...,
       [-1.22997682e-01,  1.49320144e-01, -5.29337739e-01, ...,
        -8.64173963e-04, -3.48075423e-04, -3.25992080e-04],
       [-1.22997682e-01,  1.49320144e-01,  3.05183030e-01, ...,
         2.69272909e-02, -3.88563964e-02, -1.43000256e-02],
       [-1.22997682e-01,  1.49320144e-01,  6.42500200e+00, ...,
        -4.46945613e-02,  2.99697558e-03, -1.06586757e-02]])

In [34]:
train_x.shape

(847, 664)

# Transform: transform() for TEST

In [35]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[-0.06074549,  0.14932014, -0.45979434, ..., -0.01200592,
        -0.01960757,  0.01006036],
       [-0.12299768,  0.14932014, -0.43661321, ..., -0.02141604,
        -0.01112644,  0.01344794],
       [-0.12299768,  0.14932014, -0.52933774, ...,  0.00382433,
         0.00336162, -0.05244516],
       ...,
       [-0.12299768,  0.14932014, -0.52933774, ..., -0.04852786,
         0.02084468,  0.00780351],
       [-0.12299768, -1.73835391, -0.52609238, ..., -0.00908029,
        -0.01857279,  0.0124783 ],
       [-0.12299768,  0.14932014, -0.52701963, ...,  0.05382771,
        -0.02859989, -0.00261714]])

In [36]:
test_x.shape

(364, 664)
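
Train and test end up with the same number of columns because the preprocessor is fitted on the training data only and then reused. A small sketch, on hypothetical toy frames, of what that means for a category that appears only in the test data (`sparse_threshold=0` is used here just to force a dense array for printing):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; 'Taxi' never appears in the training frame
toy_train = pd.DataFrame({"Altitude": [100.0, 500.0, np.nan, 1500.0],
                          "Phase": ["Climb", "Approach", "Climb", np.nan]})
toy_test = pd.DataFrame({"Altitude": [800.0], "Phase": ["Taxi"]})

toy_pre = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["Altitude"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Phase"]),
], sparse_threshold=0)

toy_train_x = toy_pre.fit_transform(toy_train)  # learns median, scale, categories
toy_test_x = toy_pre.transform(toy_test)        # reuses them, learns nothing new

# The unseen category is encoded as all zeros in the one-hot columns
print(toy_train_x.shape, toy_test_x.shape)
print(toy_test_x[0, 1:])
```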

## Find the Baseline

In [37]:
from sklearn.dummy import DummyRegressor

dummy_regr = DummyRegressor(strategy="mean")

dummy_regr.fit(train_x, train_target)

In [38]:
from sklearn.metrics import mean_squared_error

In [39]:
#Baseline Train RMSE
dummy_train_pred = dummy_regr.predict(train_x)

baseline_train_mse = mean_squared_error(train_target, dummy_train_pred)

baseline_train_rmse = np.sqrt(baseline_train_mse)

print('Baseline Train RMSE: {}' .format(baseline_train_rmse))

Baseline Train RMSE: 574386.4093841937


In [40]:
#Baseline Test RMSE
dummy_test_pred = dummy_regr.predict(test_x)

baseline_test_mse = mean_squared_error (test_target, dummy_test_pred)

baseline_test_rmse = np.sqrt(baseline_test_mse)

print('Baseline Test RMSE: {}' .format(baseline_test_rmse))

Baseline Test RMSE: 482867.49956297135
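
A mean-predicting baseline has a useful interpretation: its training RMSE equals the population standard deviation of the training target. The sketch below checks this on hypothetical synthetic data (the lognormal target is an assumption chosen only to be skewed).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical skewed target; the features are irrelevant to a dummy model
rng = np.random.default_rng(0)
y_toy = rng.lognormal(mean=10, sigma=1.5, size=200)
X_toy = np.zeros((200, 1))

dummy = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
dummy_rmse = np.sqrt(mean_squared_error(y_toy, dummy.predict(X_toy)))

# np.std uses ddof=0 by default, i.e. the population standard deviation
print(np.isclose(dummy_rmse, y_toy.std()))
```

So the baseline train RMSE reported above is simply the spread of `Total Cost` in the training set, which is the yardstick every model has to beat.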


# Section 2: 

Build the following models:


## Decision Tree: 

In [80]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=5) 

tree_reg.fit(train_x, train_target)

In [81]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 268928.5684185048


In [82]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 629822.6269433041


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points.

Baseline Model:

Train RMSE: 574,386.41
Test RMSE: 482,867.50

Decision Tree (max_depth=5):

Train RMSE: 268,928.57
Test RMSE: 629,822.63

This model has a much lower training RMSE than the baseline, so it fits the training data well. But its test RMSE is about 30% higher than the baseline test RMSE (629,822.63 vs. 482,867.50), which indicates that it is overfitting the training data and not generalizing to the test data.

To remediate that, I tried several values of min_samples_leaf and max_depth.

The updated Decision Tree (min_samples_leaf=26, max_depth=20):

Train RMSE: 547,699.37
Test RMSE: 473,840.15

The training RMSE is now close to that of the baseline, so the tree is no longer memorizing the training data.
Most importantly, its test RMSE is lower than the baseline's test RMSE, indicating better performance on unseen data.
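
The effect of `min_samples_leaf` on overfitting can also be seen without touching the test set, by comparing training RMSE with cross-validated RMSE. This is a sketch on hypothetical synthetic data from `make_regression`, not on the airline data, so the numbers will not match the ones above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data standing in for train_x / train_target
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=25.0, random_state=0)

results = {}
for leaf in (1, 5, 26):
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0)
    # Cross-validated RMSE estimates performance on unseen data
    cv_rmse = -cross_val_score(tree, X_demo, y_demo, cv=5,
                               scoring="neg_root_mean_squared_error").mean()
    # Training RMSE on the same data the tree was fitted on
    tree.fit(X_demo, y_demo)
    train_rmse = np.sqrt(np.mean((y_demo - tree.predict(X_demo)) ** 2))
    results[leaf] = (train_rmse, cv_rmse)
    print(leaf, round(train_rmse, 1), round(cv_rmse, 1))
```

With `min_samples_leaf=1` the tree reproduces the training targets exactly, so the gap between the two columns is the overfitting; larger leaves raise the training error and narrow that gap.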

In [83]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(min_samples_leaf = 26,max_depth=20) 

tree_reg.fit(train_x, train_target)

In [84]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 547699.3719292778


In [85]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 473840.1460378704


## Voting regressor :

The voting regressor should have at least 3 individual models

In [86]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=20)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), ('svr', svm_reg), ('sgd', sgd_reg)],
                        weights=[0.6, 0.2, 0.2])

voting_reg.fit(train_x, train_target['Total Cost'])


In [87]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 165003.67186654703


In [88]:
#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 584257.801469813


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points.

Baseline Model:

Train RMSE: 574,386.41
Test RMSE: 482,867.50

Voting Regressor 1st:

Train RMSE: 165,003.67
Test RMSE: 584,257.80

The first Voting Regressor has a much lower training RMSE than the baseline, so it fits the training data far better, but its test RMSE is higher than the baseline's. It does not generalize to the test data as well as the baseline does, which indicates overfitting.

To remediate that, I changed the hyperparameters as follows:

dtree_reg = DecisionTreeRegressor(min_samples_leaf = 26,max_depth=20) 
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.5, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=50, penalty='l2', alpha = 0.1, eta0=0.01)

Voting Regressor 2nd:

Train RMSE: 545,549.31
Test RMSE: 469,450.16

The second model's training RMSE is comparable to the baseline's, so it no longer fits the training data far more closely than it fits unseen data.
Its test RMSE is lower than both the baseline's and the first Voting Regressor's, indicating better generalization to unseen data, i.e., the test data.
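
The `weights` argument determines how much each model contributes: the ensemble prediction is the weighted average of the individual predictions. A minimal sketch on hypothetical synthetic data, with `LinearRegression` and `Ridge` swapped in for the SVR and SGD models to keep the example fast and deterministic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data standing in for train_x / train_target
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

weights = [0.6, 0.2, 0.2]
voter = VotingRegressor(
    estimators=[("dt", DecisionTreeRegressor(min_samples_leaf=26, random_state=0)),
                ("lin", LinearRegression()),
                ("ridge", Ridge(alpha=1.0))],
    weights=weights)
voter.fit(X_demo, y_demo)

# Recompute the ensemble output by hand from the fitted sub-models
individual = np.column_stack([est.predict(X_demo) for est in voter.estimators_])
manual = individual @ np.array(weights)  # weights sum to 1, so this is the weighted mean

print(np.allclose(manual, voter.predict(X_demo)))
```

With a weight of 0.6 on the tree, regularizing the tree (as done with `min_samples_leaf`) has the largest effect on the ensemble.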

In [92]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(min_samples_leaf = 26,max_depth=20) 
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.5, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=50, penalty='l2', alpha = 0.1, eta0=0.01)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), ('svr', svm_reg), ('sgd', sgd_reg)],
                        weights=[0.6, 0.2, 0.2])

voting_reg.fit(train_x, train_target['Total Cost'])


In [93]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 545549.3191641965


In [94]:
#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 469450.15983156324


## A Boosting model: 

Build either an Adaboost or a GradientBoost model

In [95]:
#Use GradientBoosting

from sklearn.ensemble import GradientBoostingRegressor

gb_reg = GradientBoostingRegressor(max_depth=5, n_estimators=100, learning_rate=0.1) 

gb_reg.fit(train_x, train_target['Total Cost'])


In [96]:
#Train RMSE
train_pred = gb_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 35433.15627982174


In [97]:
#Test RMSE
test_pred = gb_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 509069.51467035245


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. 

Baseline Model:

Train RMSE: 574,386.41
Test RMSE: 482,867.50

Gradient Boosting Model:

Train RMSE: 35,433.16
Test RMSE: 509,069.51

The Gradient Boosting model shows an extremely low training RMSE, which means it fits the training data too closely. Its test RMSE is higher than the baseline's, which confirms the overfitting: the model performs well on the training data but fails to generalize to the test data.

To remediate that, I used early stopping with the following hyperparameters:
(max_depth=3, n_estimators=100, learning_rate=0.1, tol=0.1, n_iter_no_change=5, validation_fraction=0.2, verbose=1)

Gradient Boosting with early stopping:
Train RMSE: 539,520.83
Test RMSE: 479,889.25

With early stopping, the training RMSE is much closer to the baseline, indicating less overfitting than in the first Gradient Boosting model. The test RMSE is slightly lower than the baseline's test RMSE, suggesting better generalization to unseen data than both the baseline and the initial Gradient Boosting model. These figures come from an earlier run: the cell outputs shown below report a train RMSE of 417,771.93 and a test RMSE of 492,292.73, and the test cell (In [117]) was executed before the final fit (In [118]).
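
Another way to choose where boosting should stop is to fit the full number of trees once and inspect the validation error after each stage with `staged_predict`. The sketch below uses hypothetical synthetic data and an explicit validation split; it is an alternative to `n_iter_no_change`, not the method used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for train_x / train_target
X_demo, y_demo = make_regression(n_samples=400, n_features=10, noise=30.0, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

gb_demo = GradientBoostingRegressor(max_depth=3, n_estimators=200,
                                    learning_rate=0.1, random_state=0)
gb_demo.fit(X_fit, y_fit)

# staged_predict yields the prediction after 1, 2, ..., n_estimators trees
val_rmse = [np.sqrt(np.mean((y_val - pred) ** 2))
            for pred in gb_demo.staged_predict(X_val)]

best_n = int(np.argmin(val_rmse)) + 1  # stages are 1-indexed
print(best_n, round(val_rmse[best_n - 1], 1))
```

Plotting `val_rmse` against the stage number shows whether the validation error flattens or turns upward, which is harder to see from a single early-stopped fit.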

In [118]:
# Notice that tol is set high so that early stopping triggers quickly
gb_reg = GradientBoostingRegressor(max_depth=3, n_estimators=100, 
                                   learning_rate=0.1, 
                                  tol=0.1, n_iter_no_change=5, validation_fraction=0.2,
                                  verbose=1) 

gb_reg.fit(train_x, train_target['Total Cost'])

      Iter       Train Loss   Remaining Time 
         1 304646964128.4088           40.69s
         2 263127495482.0468           41.70s
         3 229496725878.4936           41.49s
         4 202255802499.6155           40.85s
         5 180190654562.7242           40.26s
         6 162285874455.7681           39.88s


In [119]:
gb_reg.n_estimators_

6

In [120]:
#Train RMSE
train_pred = gb_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 417771.92650419974


In [117]:
#Test RMSE
test_pred = gb_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 492292.731364155


## Neural network: 

In [121]:
from sklearn.neural_network import MLPRegressor

#Default settings create 1 hidden layer with 100 neurons
mlp_reg = MLPRegressor(hidden_layer_sizes=(100,))

mlp_reg.fit(train_x, train_target['Total Cost'])


In [122]:
#Train RMSE
train_pred = mlp_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 585166.131317479


In [123]:
#Test RMSE
test_pred = mlp_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 499243.94627652405


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. 

Baseline Model:

Train RMSE: 574,386.41
Test RMSE: 482,867.50

Neural Network 1st:

Train RMSE: 585,166.13
Test RMSE: 499,243.95

The training RMSE is slightly higher than the baseline's, so this neural network performs worse than the baseline even on the data it was trained on. The test RMSE is also higher than the baseline's. Because the training error is not lower than the test error, this looks like underfitting rather than overfitting: the network does not even match a constant mean prediction.

To try to improve it, I used a deeper network with regularization and early stopping, changing the hyperparameters to:
hidden_layer_sizes=(50,50,50),max_iter=100,tol=0.1,alpha=0.1,learning_rate='constant',early_stopping=True

Neural Network 2nd:

Train RMSE: 585,333.86
Test RMSE: 499,464.56

The training RMSE is almost identical to that of the first neural network, and the test RMSE is still about 3% higher than the baseline's, so the model is still not generalizing to new data. Even after multiple attempts with early stopping, I was not able to improve its performance.
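
Both networks end up at or above the baseline RMSE on the training data. One possible cause, which is an assumption and has not been verified on this data, is that the target is on the order of hundreds of thousands while the network starts from outputs near zero. The sketch below, on hypothetical synthetic data with a deliberately large target, compares a plain `MLPRegressor` with one wrapped in `TransformedTargetRegressor` so that the target is standardized during training.

```python
import warnings
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Hypothetical data: shift and rescale the target so it is large, like a cost
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
y_demo = 500_000 + 1_000 * y_demo
X_fit, X_hold, y_fit, y_hold = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

def holdout_rmse(model):
    """Fit on the training part and return RMSE on the held-out part."""
    model.fit(X_fit, y_fit)
    return np.sqrt(np.mean((y_hold - model.predict(X_hold)) ** 2))

raw_rmse = holdout_rmse(MLPRegressor(hidden_layer_sizes=(100,), random_state=0))

# Same network, but the target is standardized for training and
# predictions are mapped back to the original units automatically
scaled_rmse = holdout_rmse(TransformedTargetRegressor(
    regressor=MLPRegressor(hidden_layer_sizes=(100,), random_state=0),
    transformer=StandardScaler()))

print(round(raw_rmse), round(scaled_rmse))
```

If the same pattern holds on the airline data, scaling the target would be worth trying before adding layers or regularization.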

In [142]:
dnn_reg = MLPRegressor(hidden_layer_sizes=(50,50,50),
                       max_iter=100,tol=0.1,alpha=0.1,learning_rate='constant',
                       early_stopping=True)

dnn_reg.fit(train_x, train_target['Total Cost'])


In [143]:
#Train RMSE
train_pred = dnn_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 585333.857788859


In [144]:
#Test RMSE
test_pred = dnn_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 499464.5624696545


## Grid search 

Perform either a full or randomized grid search on any model you want. There has to be at least two parameters for the search. 

In [145]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(1, 30), 
     'max_depth': np.arange(1,30)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=10,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_target)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [146]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

659368.1339209817 {'min_samples_leaf': 3, 'max_depth': 2}
657945.7565166292 {'min_samples_leaf': 5, 'max_depth': 4}
617866.5930710235 {'min_samples_leaf': 16, 'max_depth': 5}
602248.1385677645 {'min_samples_leaf': 19, 'max_depth': 16}
599129.3034196673 {'min_samples_leaf': 19, 'max_depth': 3}
616211.1840377086 {'min_samples_leaf': 13, 'max_depth': 4}
594356.4433650728 {'min_samples_leaf': 26, 'max_depth': 18}
593462.9442373025 {'min_samples_leaf': 29, 'max_depth': 24}
690019.4133289763 {'min_samples_leaf': 3, 'max_depth': 4}
590026.0542875259 {'min_samples_leaf': 28, 'max_depth': 27}
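
Two details of the search are easy to miss. Without a `random_state`, each run samples different candidates, and `np.sqrt(-mean_score)` is the root of the fold-averaged MSE rather than the average of the per-fold RMSEs. The sketch below, on hypothetical synthetic data, pins the seed and scores with `neg_root_mean_squared_error` so that the reported value is directly a mean RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data standing in for train_x / train_target
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=25.0, random_state=0)

param_dist = {"min_samples_leaf": np.arange(1, 30),
              "max_depth": np.arange(1, 30)}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0), param_dist, n_iter=10, cv=5,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE values
    random_state=42)                         # same candidates on every run
search.fit(X_demo, y_demo)

best_cv_rmse = -search.best_score_
print(search.best_params_, round(best_cv_rmse, 1))
```

With the seed fixed, `best_params_` stays consistent with the printed candidate list when the cells are re-run.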


In [68]:
grid_search.best_params_

{'min_samples_leaf': 28, 'max_depth': 23}

In [69]:
grid_search.best_estimator_

In [70]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 540974.1603672268


In [71]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 470670.41666696436


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

Baseline Model:

Train RMSE: 574,386.41
Test RMSE: 482,867.50

Grid Search Decision Tree:

Train RMSE: 540,974.16
Test RMSE: 470,670.42

I believe there is no overfitting. The Grid Search Decision Tree outperforms the baseline: its training RMSE (540,974.16) is lower than the baseline's (574,386.41), indicating a better fit to the training data, and its test RMSE (470,670.42) is about 2.5% lower than the baseline's (482,867.50), demonstrating better generalization to unseen data.

# Discussion 


## List the train and test values of each model you built (1 points)
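
One way to produce this list is to compute both RMSE values for every fitted model in a single table. The helper `rmse_table` below is hypothetical (it is not part of the notebook), and the demonstration uses synthetic data with two stand-in models.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def rmse_table(models, X_train, y_train, X_test, y_test):
    """Return one row per fitted model with train and test RMSE, sorted by test RMSE."""
    rows = []
    for name, model in models.items():
        rows.append({
            "model": name,
            "train_rmse": np.sqrt(mean_squared_error(y_train, model.predict(X_train))),
            "test_rmse": np.sqrt(mean_squared_error(y_test, model.predict(X_test))),
        })
    return pd.DataFrame(rows).sort_values("test_rmse").reset_index(drop=True)

# Hypothetical data and models, only to show the helper running
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_models = {
    "Baseline": DummyRegressor(strategy="mean").fit(X_tr, y_tr),
    "Decision Tree": DecisionTreeRegressor(min_samples_leaf=26, random_state=0).fit(X_tr, y_tr),
}

summary = rmse_table(demo_models, X_tr, y_tr, X_te, y_te)
print(summary)
```

In this notebook the dictionary would hold the fitted objects from the sections above (for example `dummy_regr`, `voting_reg`, `gb_reg`, `dnn_reg`, and `grid_search.best_estimator_`), called with `train_x`, `train_target`, `test_x`, and `test_target`. Because `tree_reg` is reassigned in the grid search cell, the tuned tree would need its own variable name to appear in the table.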

## Which model performs the best and why? 

Hint: The best model is the one that has the best TEST value (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

Voting Regressor 2nd: Test RMSE = 469,450.16

The Voting Regressor 2nd model has the lowest test RMSE (469,450.16) of the models I built, so it generalizes best to unseen data. It also shows little sign of overfitting, since its training RMSE is not lower than its test RMSE.

## How does it compare to baseline? 

The Voting Regressor 2nd model performs better than the baseline model. It has a slightly lower training RMSE (545,549.31 vs. 574,386.41) and, more importantly, a lower test RMSE (469,450.16 vs. 482,867.50).
This lower test RMSE shows that it generalizes better to unseen data than the baseline, although the improvement is modest, at just under 3%.