# Cost Of A Bird Strike

In this project, I focus on airline incidents. The data set for this assignment includes information on the cost of bird strikes. I use it to see whether the cost of a bird strike (i.e., the `Total Cost` column in the data set) can be predicted from the attributes of the incident. This matters because such a model can produce a cost estimate as soon as a bird strike incident happens.

## Description of Variables

The descriptions of the variables are provided in "Airline - Data Dictionary.docx".

## Goal

Use the **airline.csv** data set and build models to predict **Total Cost**.

# Section 1: 

## Data Prep

In [1]:
# Common imports
import numpy as np
import pandas as pd
np.random.seed(42)

In [2]:
!pip install pypandoc






In [3]:
airline = pd.read_csv("airline.csv")
airline.head(5)

Unnamed: 0,Aircraft,Number_Objects,Engines,Airline,Origin State,Phase,Description,Object Size,Weather,Warning,Altitude,Total Cost
0,B-737-400,859,2.0,US AIRWAYS*,New York,Climb,FLT 753. PILOT REPTD A HUNDRED BIRDS ON UNKN T...,Medium,No Cloud,N,1500.0,30736
1,LEARJET-25,227,2.0,BUSINESS,Delaware,Climb,,Small,No Cloud,N,150.0,1481711
2,A-320,320,2.0,UNITED AIRLINES,DC,Approach,WS ASSISTED IN CLEAN-UP OF 273 STARLINGS AND 1...,Small,Some Cloud,Y,100.0,1483141
3,HAWKER 800,3,2.0,EXECUTIVE JET AVIATION,Colorado,Approach,"SAW SML FLOCK FLYING UPON LDG FLARE, ACROSS RW...",Small,No Cloud,N,20.0,8600
4,DC-9-10,5,2.0,NORTHWEST AIRLINES,Minnesota,Climb,FLT 1493 STATED HE FLEW THRU A FLOCK OF ABOUT ...,Large,Overcast,Y,800.0,35146


In [4]:
airline.shape

(1211, 12)

In [5]:
target = ['Total Cost']

In [6]:
airline[['Description']].isna().sum()

Description    56
dtype: int64

In [7]:
# Assign the result back instead of calling fillna with inplace=True on a column selection
airline['Description'] = airline['Description'].fillna('missing')


In [8]:
text_input = airline[['Description']]

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [10]:
def new_col1(df):
    # Convert the one-column dataframe to a NumPy array (this already makes a copy,
    # so the original dataframe is not modified), then call ravel to make it
    # one-dimensional, which is the input shape TfidfVectorizer expects
    return np.array(df).ravel()

In [11]:
new_col1(text_input)

array(['FLT 753. PILOT REPTD A HUNDRED BIRDS ON UNKN TYPE. #1 ENG WAS SHUT DOWN AND DIVERTED TO EWR. SLIGHT VIBRATION. A/C WAS OUT OF SVC FOR REPAIRS TO COWLING, FAN DUCT ACCOUSTIC PANEL. INGESTION. DENTED FAN BLADE #26 IN #1 ENG. HEAVY BLOOD STAINS ON L WINGTIP',
       'missing',
       'WS ASSISTED IN CLEAN-UP OF 273 STARLINGS AND 1 BROWN-HEADED COWBIRD FROM RWY THRESHOLD. PHOTOS OF A/C TAKEN. BORESCOPED BOTH ENGS. FOUND DENTS AND NICKS IN STAGES 3-6. ALL WITHIN LIMITS. CLEANED RADOME, L WING, FLAPS, PYLON, GEAR AND LEADING EDGE FLAPS. R',
       ...,
       'AT A/C ROTATED, 1 FOX WAS SEEN. IT WAS NOT BELIEVED TO HAVE BEEN STRUCK BUT WAS LATER REPTD TO HAVE BEEN HIT. NO EFFECT ON FLT OR A/C INDICATIONS. ID LATER UPDATED TO COYOTE.',
       'ID BY SMITHSONIAN. LEFT WING ROOT CRACKED FAIRING AND RIVETS POPPED. SUBSTANTIAL DMG TO K-FLAPS, WING ROOT, WING AND BODY FAIRING AND UNDERLYING STRUCTURE. A/C TAKEN TO SEATTLE FOR REPAIRS.',
       'ID BY SMITHSONIAN. PS FOUND HAWK ALONG WITH PA
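The flattening step matters because `TfidfVectorizer` expects a one-dimensional sequence of strings, while a one-column dataframe converts to a two-dimensional array. The following is a minimal sketch of that shape change; the three strings are made-up stand-ins for the `Description` column, not rows from the data set.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the Description column (made-up strings, not real data)
toy = pd.DataFrame(
    {"Description": ["bird hit engine", "missing", "engine damaged by bird"]}
)

# A one-column dataframe converts to a 2-D array of shape (3, 1)
two_d = np.array(toy)

# ravel() flattens it to a 1-D array of strings, one document per element
one_d = two_d.ravel()

# The vectorizer treats each element of the 1-D array as one document
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(one_d)

print(two_d.shape, one_d.shape)  # (3, 1) (3,)
print(matrix.shape)              # (3, 6): three documents, six distinct terms
```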

In [12]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(airline, test_size=0.3)
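The split above relies on the global NumPy seed set in the first cell. An alternative that does not depend on cell execution order is to pass `random_state` directly. The sketch below uses a made-up toy dataframe to show that two calls with the same `random_state` return the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe (made-up values) standing in for the airline data
toy = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) * 2})

# Two independent calls with the same random_state
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=42)

# The same rows end up in the same partitions, in the same order
print(train_a.index.equals(train_b.index))
print(test_a.index.equals(test_b.index))
```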

In [13]:
#Separate the target variable

train_target = train[['Total Cost']]
test_target = test[['Total Cost']]

train_set = train.drop(['Total Cost'], axis=1)
test_set = test.drop(['Total Cost'], axis=1)

## Feature Engineering

Create one new feature from the existing data: `hit_on_runway`, which flags strikes whose `Altitude` is not above 0.

In [14]:
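# Flag whether the strike happened on the runway (Altitude not above 0)
def new_col(df):
    # Work on a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    # 1 when Altitude is not greater than 0, otherwise 0
    df1['hit_on_runway'] = np.where(df1['Altitude'] > 0, 0, 1)
    return df1[['hit_on_runway']]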

In [15]:
new_col(train_set)

Unnamed: 0,hit_on_runway
864,0
1083,1
692,1
354,1
1081,0
...,...
1044,0
1095,0
1130,1
860,0
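One detail of this rule is worth checking: the feature is computed on the raw `Altitude` column, before any imputation, and a comparison with a missing value evaluates to `False`. The sketch below, on a made-up three-row frame, shows that a missing altitude is therefore flagged as a runway hit. The stricter variant `flag_strict` is a hypothetical alternative, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): on the ground, airborne, and missing altitude
toy = pd.DataFrame({"Altitude": [0.0, 150.0, np.nan]})

# Same rule as new_col: NaN > 0 is False, so the missing row is flagged as 1
flag = np.where(toy["Altitude"] > 0, 0, 1)

# Hypothetical alternative: flag only rows whose altitude is exactly 0
flag_strict = (toy["Altitude"] == 0).astype(int).to_numpy()

print(flag.tolist())         # [1, 0, 1]
print(flag_strict.tolist())  # [1, 0, 0]
```

Whether missing altitudes should count as runway hits is a modeling decision; the sketch only makes the current behavior visible.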


In [16]:
train_set.dtypes

Aircraft           object
Number_Objects      int64
Engines           float64
Airline            object
Origin State       object
Phase              object
Description        object
Object Size        object
Weather            object
Warning            object
Altitude          float64
dtype: object

In [17]:
# Identify the numerical columns
numeric_columns = train_set.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_set.select_dtypes('object').columns.to_list()

In [18]:
#numeric Col
numeric_columns

['Number_Objects', 'Engines', 'Altitude']

In [19]:
categorical_columns

['Aircraft',
 'Airline',
 'Origin State',
 'Phase',
 'Description',
 'Object Size',
 'Weather',
 'Warning']

In [20]:
# Text Column
text_column = ['Description']

In [21]:
for col in text_column:
    categorical_columns.remove(col)

In [22]:
#Categorical Columns
categorical_columns

['Aircraft',
 'Airline',
 'Origin State',
 'Phase',
 'Object Size',
 'Weather',
 'Warning']

In [23]:
input_set = airline[['Description']]

In [24]:
# Column used in feature engineering
feat_eng_columns = ['Altitude']

In [25]:
#Pipeline
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [26]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [27]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col))])

In [28]:
number_svd_components = 100
text_transformer = Pipeline(steps=[
                ('my_new_column1', FunctionTransformer(new_col1)),
                ('text', TfidfVectorizer(stop_words='english')),
                ('svd', TruncatedSVD(n_components=number_svd_components, n_iter=10))
            ])

In [29]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('text', text_transformer, text_column),
        ('trans', my_new_column, feat_eng_columns)],
        remainder='drop')
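To make the behavior of the numeric and categorical branches concrete, here is a minimal sketch on a made-up two-column frame. It mirrors the imputation, scaling, and one-hot steps above, and shows that a category unseen during fitting is encoded as all zeros because of `handle_unknown='ignore'`. The toy values and the `sparse_threshold=0` setting (used only to get a dense array for printing) are assumptions of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training and test frames (made-up values)
toy_train = pd.DataFrame({
    "Altitude": [0.0, 100.0, np.nan, 50.0],
    "Phase": pd.Series(["Climb", "Approach", "Climb", np.nan], dtype=object),
})
toy_test = pd.DataFrame({
    "Altitude": [20.0],
    "Phase": pd.Series(["Taxi"], dtype=object),  # category not seen in training
})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense result, which is easier to inspect here
toy_preprocessor = ColumnTransformer(
    [("num", numeric, ["Altitude"]), ("cat", categorical, ["Phase"])],
    remainder="drop",
    sparse_threshold=0,
)

train_out = toy_preprocessor.fit_transform(toy_train)
test_out = toy_preprocessor.transform(toy_test)

# One scaled numeric column plus three one-hot columns: Approach, Climb, unknown
print(train_out.shape)
# The unseen category "Taxi" produces zeros in every one-hot column
print(test_out[0, 1:])
```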

In [30]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_set)

train_x

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 93406 stored elements and shape (847, 466)>

In [31]:
train_x.shape

(847, 466)

In [32]:
train_x.toarray()

array([[-1.22997682e-01,  1.49320144e-01,  8.61530209e-01, ...,
        -2.10613325e-02, -2.90259325e-03,  0.00000000e+00],
       [-1.22997682e-01,  1.49320144e-01, -5.29337739e-01, ...,
        -2.28011657e-02, -1.31029021e-02,  1.00000000e+00],
       [-1.22997682e-01,  1.49320144e-01, -5.29337739e-01, ...,
        -1.17929124e-02, -4.92265117e-02,  1.00000000e+00],
       ...,
       [-1.22997682e-01,  1.49320144e-01, -5.29337739e-01, ...,
         4.94296894e-05,  3.59440636e-04,  1.00000000e+00],
       [-1.22997682e-01,  1.49320144e-01,  3.05183030e-01, ...,
        -5.18224451e-02,  5.03840746e-02,  0.00000000e+00],
       [-1.22997682e-01,  1.49320144e-01,  6.42500200e+00, ...,
         4.49126656e-03, -9.71355006e-03,  0.00000000e+00]])

In [33]:
test_x = preprocessor.transform(test_set)

test_x

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 40108 stored elements and shape (364, 466)>

In [34]:
test_x.shape

(364, 466)

In [35]:
test_x.toarray()

array([[-0.06074549,  0.14932014, -0.45979434, ...,  0.01309841,
         0.02520036,  0.        ],
       [-0.12299768,  0.14932014, -0.43661321, ...,  0.00462003,
        -0.06664408,  0.        ],
       [-0.12299768,  0.14932014, -0.52933774, ...,  0.02461294,
        -0.03667306,  1.        ],
       ...,
       [-0.12299768,  0.14932014, -0.52933774, ...,  0.00978361,
         0.05575606,  1.        ],
       [-0.12299768, -1.73835391, -0.52609238, ..., -0.02452497,
         0.0174856 ,  0.        ],
       [-0.12299768,  0.14932014, -0.52701963, ..., -0.01888315,
         0.03423408,  0.        ]])

## Find the Baseline

In [36]:
from sklearn.dummy import DummyRegressor

dummy_regr = DummyRegressor(strategy="mean")

dummy_regr.fit(train_x, train_target)

In [37]:
from sklearn.metrics import mean_squared_error

In [38]:
#Baseline Train RMSE
dummy_train_pred = dummy_regr.predict(train_x)

baseline_train_mse = mean_squared_error(train_target, dummy_train_pred)

baseline_train_rmse = np.sqrt(baseline_train_mse)

print('Baseline Train RMSE: {}' .format(baseline_train_rmse))

Baseline Train RMSE: 574386.4093841937
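A useful way to read this number: a `DummyRegressor` with `strategy="mean"` always predicts the training mean, so its training RMSE equals the (population) standard deviation of the training target. The sketch below checks this on synthetic, made-up cost values rather than on the airline data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic, skewed target values (made up for illustration)
rng = np.random.default_rng(0)
y = rng.lognormal(mean=10, sigma=1.5, size=200)
X = np.zeros((200, 1))  # the features are ignored by DummyRegressor

# The dummy model predicts the mean of y for every row
dummy = DummyRegressor(strategy="mean").fit(X, y)
dummy_rmse = np.sqrt(mean_squared_error(y, dummy.predict(X)))

# np.std uses ddof=0 by default, which matches the RMSE around the mean
print(np.isclose(dummy_rmse, np.std(y)))
```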


In [39]:
#Baseline Test RMSE
dummy_test_pred = dummy_regr.predict(test_x)

baseline_test_mse = mean_squared_error(test_target, dummy_test_pred)

baseline_test_rmse = np.sqrt(baseline_test_mse)

print('Baseline Test RMSE: {}' .format(baseline_test_rmse))

Baseline Test RMSE: 482867.49956297135
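The same train and test RMSE computation is repeated for every model in this notebook. One possible way to reduce the repetition is a small helper. The function name `rmse_report` and the synthetic data below are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def rmse_report(model, X_train, y_train, X_test, y_test):
    """Return train and test RMSE for an already fitted model."""
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return {"train_rmse": float(train_rmse), "test_rmse": float(test_rmse)}


# Synthetic data (made up) to exercise the helper
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Fit on the first 70 rows and evaluate on the remaining 30
model = LinearRegression().fit(X[:70], y[:70])
report = rmse_report(model, X[:70], y[:70], X[70:], y[70:])
print(report)
```

In this notebook the call would look like `rmse_report(tree_reg, train_x, train_target, test_x, test_target)` once a model has been fitted.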


# Section 2: 

Build the following models:


## Decision Tree

In [40]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=10) 

tree_reg.fit(train_x, train_target)

In [41]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 120656.88925462912


In [42]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 521759.92530176823


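### Is the model overfitting?

Yes. The train RMSE (about 120,657) is far below the test RMSE (about 521,760), so the tree is overfitting. Limiting the depth and requiring a minimum number of samples per leaf narrows the gap: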

In [43]:
tree_reg = DecisionTreeRegressor(min_samples_leaf = 30, max_depth= 5)

tree_reg.fit(train_x, train_target)

In [44]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 549974.1059952383


In [45]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 477310.62570611236
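The effect of `max_depth` on the train/test gap can be seen more systematically by sweeping it. The sketch below does this on synthetic data from `make_regression`, so the numbers do not correspond to the airline results; it only illustrates the pattern of training error falling as depth grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (not the airline data)
X, y = make_regression(n_samples=300, n_features=10, noise=30.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = []
for depth in [2, 5, 10, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))
    results.append((depth, train_rmse, test_rmse))

# A widening gap between the two columns is the signature of overfitting
for depth, train_rmse, test_rmse in results:
    print(f"max_depth={depth}: train RMSE={train_rmse:.1f}, test RMSE={test_rmse:.1f}")
```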


## Voting regressor

The voting regressor combines three individual models: a decision tree, a support vector regressor, and an SGD regressor.

In [46]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=10)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=200000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

# Pass the target as a 1-D array to avoid scikit-learn's DataConversionWarning
voting_reg.fit(train_x, train_target.values.ravel())


In [47]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 326441.4933364595


In [48]:
#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 495519.33402992453


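The model is overfitting: the train RMSE (about 326,441) is well below the test RMSE (about 495,519). Regularizing the tree, lowering `C` for the SVR, and capping the SGD iterations reduces the gap: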

In [49]:
dtree_reg = DecisionTreeRegressor(min_samples_leaf = 30,max_depth= 5)
svm_reg = SVR(kernel="rbf", C=7, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=500, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

# Pass the target as a 1-D array to avoid scikit-learn's DataConversionWarning
voting_reg.fit(train_x, train_target.values.ravel())


In [50]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 511016.61185214855


In [51]:
#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 465649.3057318923
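With default (uniform) weights, a `VotingRegressor` predicts the plain average of its members' predictions, so a weak member pulls the ensemble toward its own output. The sketch below verifies the averaging on synthetic data; the data and hyperparameters are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (not the airline data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

voting = VotingRegressor(estimators=[
    ("dt", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ("svr", SVR(kernel="rbf", C=10)),
    ("sgd", SGDRegressor(random_state=0)),
])
voting.fit(X, y)

# estimators_ holds the fitted clones of the three members
individual = np.column_stack([est.predict(X) for est in voting.estimators_])

# The ensemble prediction is the row-wise mean of the member predictions
print(np.allclose(voting.predict(X), individual.mean(axis=1)))
```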


## A Boosting model
Build either an AdaBoost or a gradient boosting model.

In [52]:
from sklearn.ensemble import AdaBoostRegressor 


ada_reg = AdaBoostRegressor( 
            DecisionTreeRegressor(max_depth=4), n_estimators=50, 
            learning_rate=0.1) 

# Pass the target as a 1-D array to avoid scikit-learn's DataConversionWarning
ada_reg.fit(train_x, train_target.values.ravel())


In [53]:
#Train RMSE
train_pred = ada_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 227511.54106947163


In [54]:
#Test RMSE
test_pred = ada_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 479161.4502522493


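### Overfitting

The train RMSE (about 227,512) is less than half the test RMSE (about 479,161), so the AdaBoost model is overfitting. Shallower base trees, a minimum leaf size, and a smaller learning rate reduce the gap: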

In [55]:
ada_reg = AdaBoostRegressor( 
            DecisionTreeRegressor(min_samples_leaf = 5, max_depth=3), n_estimators= 50, 
            learning_rate=0.005) 

# Pass the target as a 1-D array to avoid scikit-learn's DataConversionWarning
ada_reg.fit(train_x, train_target.values.ravel())


In [56]:
#Train RMSE
train_pred = ada_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 517303.0263513528


In [57]:
#Test RMSE
test_pred = ada_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 484690.52324094536
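Rather than refitting with different settings, one way to see how the number of boosting rounds affects the test error is `staged_predict`, which yields the ensemble's prediction after each added estimator. The sketch below uses synthetic data and illustrative hyperparameters, so it shows the mechanics only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (not the airline data)
X, y = make_regression(n_samples=300, n_features=10, noise=30.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ada = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=3),
    n_estimators=50,
    learning_rate=0.1,
    random_state=0,
).fit(X_tr, y_tr)

# Test RMSE after 1, 2, ..., k boosting rounds
staged_test_rmse = [
    np.sqrt(mean_squared_error(y_te, pred)) for pred in ada.staged_predict(X_te)
]

# Number of rounds with the lowest test RMSE on this synthetic split
best_n = int(np.argmin(staged_test_rmse)) + 1
print(best_n, staged_test_rmse[best_n - 1])
```

Choosing the number of rounds from the test set would leak information, so in practice this curve is better computed on a separate validation split.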


## Neural network

In [58]:
from sklearn.neural_network import MLPRegressor
mlp_reg = MLPRegressor(hidden_layer_sizes=(50,50),
                       max_iter=1000)

# Pass the target as a 1-D array to avoid scikit-learn's DataConversionWarning
mlp_reg.fit(train_x, train_target.values.ravel())


In [59]:
#Train RMSE
train_pred = mlp_reg.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 528208.6184570465


In [60]:
#Test RMSE
test_pred = mlp_reg.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 455780.4844918081
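The network's train RMSE is close to the baseline, which may indicate that it struggles with a target measured in hundreds of thousands. One option that could be explored, and that this notebook does not use, is to train on a log-transformed target with `TransformedTargetRegressor`. The sketch below is an assumption-laden illustration on synthetic, skewed cost-like values.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic, positive, right-skewed target (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.exp(10 + X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(scale=0.3, size=300))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The wrapper fits the network on log1p(y) and maps predictions back with expm1
log_mlp = TransformedTargetRegressor(
    regressor=MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=500, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_mlp.fit(X_tr, y_tr)

# Predictions come back on the original cost scale
pred = log_mlp.predict(X_te)
print(pred[:3])
```

Whether this helps on the airline data would need to be checked against the same train and test RMSE values reported above.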


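## Grid search

A randomized search (`RandomizedSearchCV`, 10 sampled candidates with 5-fold cross-validation) is used here in place of an exhaustive grid search over the decision tree hyperparameters.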

In [61]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(1, 30), 
     'max_depth': np.arange(1,30)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=10,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_target)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [62]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

624202.9115127813 {'min_samples_leaf': np.int64(11), 'max_depth': np.int64(27)}
657504.9825904495 {'min_samples_leaf': np.int64(6), 'max_depth': np.int64(9)}
588481.1060531567 {'min_samples_leaf': np.int64(22), 'max_depth': np.int64(28)}
598168.8427091968 {'min_samples_leaf': np.int64(15), 'max_depth': np.int64(15)}
594676.3140460313 {'min_samples_leaf': np.int64(19), 'max_depth': np.int64(21)}
643735.4824580738 {'min_samples_leaf': np.int64(8), 'max_depth': np.int64(24)}
624202.9115127813 {'min_samples_leaf': np.int64(11), 'max_depth': np.int64(29)}
682385.0294633267 {'min_samples_leaf': np.int64(3), 'max_depth': np.int64(3)}
635370.2957492163 {'min_samples_leaf': np.int64(9), 'max_depth': np.int64(28)}
663131.8519999741 {'min_samples_leaf': np.int64(4), 'max_depth': np.int64(17)}


In [63]:
grid_search.best_params_

{'min_samples_leaf': np.int64(22), 'max_depth': np.int64(28)}

In [64]:
grid_search.best_estimator_

In [65]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_target, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 546443.9419648456


In [66]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_target, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 484526.7600914328
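Two small variations on the search may be worth considering: passing `random_state` so the sampled candidates are repeatable, and using the `neg_root_mean_squared_error` scorer so the scores are already on the RMSE scale. The sketch below runs on synthetic data and is not a rerun of the search above. Note that averaging per-fold RMSE values gives a slightly different number than taking the square root of the averaged MSE, as the loop above does.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (not the airline data)
X, y = make_regression(n_samples=300, n_features=10, noise=30.0, random_state=0)

param_distributions = {
    "min_samples_leaf": np.arange(1, 30),
    "max_depth": np.arange(1, 30),
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions,
    n_iter=10,
    cv=5,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE values
    random_state=0,                         # makes the sampled candidates repeatable
)
search.fit(X, y)

# Flip the sign to get the mean cross-validated RMSE of the best candidate
cv_rmse = -search.best_score_
print(search.best_params_, cv_rmse)
```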


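### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting.

No. The train RMSE (about 546,444) is higher than the test RMSE (about 484,527), so there is no sign of overfitting. Both values sit close to the baseline, which points to underfitting instead.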

# Discussion

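## All train and test values

| Model | Train RMSE | Test RMSE |
|---|---|---|
| Baseline (mean) | 574,386 | 482,867 |
| Decision tree (`max_depth=10`) | 120,657 | 521,760 |
| Decision tree (`min_samples_leaf=30`, `max_depth=5`) | 549,974 | 477,311 |
| Voting regressor (initial) | 326,441 | 495,519 |
| Voting regressor (regularized) | 511,017 | 465,649 |
| AdaBoost (initial) | 227,512 | 479,161 |
| AdaBoost (regularized) | 517,303 | 484,691 |
| Neural network (`MLPRegressor`) | 528,209 | 455,780 |
| Randomized search tree (`min_samples_leaf=22`, `max_depth=28`) | 546,444 | 484,527 |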

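## Which model performs the best and why?

By test RMSE, the neural network (`MLPRegressor`) performs best (about 455,780), followed by the regularized voting regressor (about 465,649). The network's train RMSE (about 528,209) is not below its test RMSE, so it shows no sign of overfitting.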


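## How does it compare to the baseline?

The baseline test RMSE is about 482,867. The neural network improves on it by roughly 27,000 (about 5.6%). The regularized voting regressor, the regularized decision tree, and the initial AdaBoost model also beat the baseline by smaller margins, while the initial decision tree, the initial voting regressor, the regularized AdaBoost model, and the tree from the randomized search do worse than the baseline on the test set.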