## Modeling

Using my scaled data, complete with the features I engineered in the previous step, I now want to try a few different models and evaluate their performance while also evaluating the performance of each type of model using different sets of parameters. Based on the nature of the question, I know I want to limit my evaluation to supervised learning models.

A grid search and some cross validation will be required.

In [1]:
#basic imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#ignore warning messages to ensure clean outputs
import warnings
warnings.filterwarnings('ignore')

#import utilities for train/test split, CV, and model evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from sklearn.preprocessing import StandardScaler

#import model packages to try
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [2]:
df = pd.read_csv('../csv/preprocessed_data.csv')

### Splitting into Test & Train Sets

In [3]:
y = df['positionOrder']
X = df.drop(columns = ['positionOrder'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

### Scaling

In [4]:
scaler = StandardScaler()

x_scaler = scaler.fit(X_train)
X_train = x_scaler.fit_transform(X_train)
X_test = x_scaler.transform(X_test)

### Model 1: Linear Regression

#### Parameter Tuning With Grid Search

In [5]:
C_param_range = [0.001,0.01,0.1,1,10,100]

table = pd.DataFrame(columns = ['C_parameter','Accuracy'])
table['C_parameter'] = C_param_range


j = 0
for i in C_param_range:
    
    # Apply logistic regression model to training data
    Logreg = LogisticRegression(C = i,random_state = 42)
    Logreg.fit(X_train,y_train)
    
    # Predict using model
    y_pred_lr = Logreg.predict(X_test)
    
    # Saving accuracy score in table
    table.iloc[j,1] = accuracy_score(y_test,y_pred_lr)
    j += 1
    
table

Unnamed: 0,C_parameter,Accuracy
0,0.001,0.158389
1,0.01,0.183893
2,0.1,0.214765
3,1.0,0.267785
4,10.0,0.319463
5,100.0,0.351007


In [15]:
Logreg = LogisticRegression(C = 100, random_state = 42)
Logreg.fit(X_train,y_train)

LogisticRegression(C=100, random_state=42)

#### Cross Validation & Scoring

In [7]:
cv_scores_test = cross_val_score(Logreg, X_test, y_test, cv=5, scoring='accuracy')
cv_scores_train = cross_val_score(Logreg, X_train, y_train, cv=5, scoring='accuracy')
print(cv_scores_test)

cv_scores_lr_test= cv_scores_test.mean()
cv_scores_lr_train= cv_scores_train.mean()
cv_scores_std_test_lr= cv_scores_test.std()

print ('Mean cross validation test score: ' +str(cv_scores_lr_test))
print ('Mean cross validation train score: ' +str(cv_scores_lr_train))
print ('Standard deviation in cv test scores: ' +str(cv_scores_std_test_lr))

[0.23825503 0.22147651 0.2114094  0.22147651 0.21812081]
Mean cross validation test score: 0.22214765100671144
Mean cross validation train score: 0.3303597122302159
Standard deviation in cv test scores: 0.008852957018975111


### Model 2: Random Forest

#### Parameter Tuning With Grid Search

In [8]:
n_estimators_param_range = [10, 100, 300, 500, 750, 800]

table = pd.DataFrame(columns = ['n_parameter','Accuracy'])
table['n_parameter'] = n_estimators_param_range

j = 0
for i in C_param_range:
    rf = RandomForestClassifier(bootstrap=True,n_estimators=100,criterion='entropy')
    rf.fit(X_train, y_train)
    
    y_pred_rf = rf.predict(X_test)
    
    table.iloc[j,1] = accuracy_score(y_test,y_pred_rf)
    j += 1

table

Unnamed: 0,n_parameter,Accuracy
0,10,0.360403
1,100,0.343624
2,300,0.34094
3,500,0.348322
4,750,0.368456
5,800,0.356376


In [16]:
rf = RandomForestClassifier(bootstrap=True, n_estimators=750, criterion='entropy', random_state=42)
rf.fit(X_train, y_train)

RandomForestClassifier(criterion='entropy', n_estimators=750, random_state=42)

#### Cross Validation & Scoring

In [10]:
cv_scores_test = cross_val_score(rf, X_test, y_test, cv=5, scoring='accuracy')
cv_scores_train = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
print(cv_scores_test)

cv_scores_rf_test = cv_scores_test.mean()
cv_scores_rf_train = cv_scores_train.mean()
cv_scores_std_test_rf = cv_scores_test.std()

print ('Mean cross validation test score: ' +str(cv_scores_rf_test))
print ('Mean cross validation train score: ' +str(cv_scores_rf_train))
print ('Standard deviation in cv test scores: ' +str(cv_scores_std_test_rf))

[0.33221477 0.34899329 0.31208054 0.30201342 0.2885906 ]
Mean cross validation test score: 0.31677852348993285
Mean cross validation train score: 0.361726618705036
Standard deviation in cv test scores: 0.021497472990667062


### Model 3: Gradient Boosting


In [11]:
gbc = GradientBoostingClassifier(subsample=0.8, learning_rate=0.05 , n_estimators=160, random_state=5, max_depth=9, max_leaf_nodes=100)

gbc.fit(X_train, y_train)

y_pred_gbc = gbc.predict(X_test)

In [12]:
cv_scores_test = cross_val_score(gbc, X_test, y_test, cv=5, scoring='accuracy')
cv_scores_train = cross_val_score(gbc, X_train, y_train, cv=5, scoring='accuracy')
print(cv_scores_test)

cv_scores_gbc_test = cv_scores_test.mean()
cv_scores_gbc_train = cv_scores_train.mean()
cv_scores_std_test_gbc = cv_scores_test.std()

print ('Mean cross validation test score: ' +str(cv_scores_gbc_test))
print ('Mean cross validation train score: ' +str(cv_scores_gbc_train))
print ('Standard deviation in cv test scores: ' +str(cv_scores_std_test_gbc))

[0.49328859 0.52684564 0.48993289 0.49328859 0.47651007]
Mean cross validation test score: 0.4959731543624161
Mean cross validation train score: 0.5315107913669065
Standard deviation in cv test scores: 0.016630217038072295


### Comparing Model Accuracy

In [17]:
myLabels = [ 'Logistic Regression', 'Random Forest','Gradient Boost']

Accuracy_lr=Logreg.score(X_test,y_test)
Accuracy_rf=rf.score(X_test,y_test)
Accuracy_gbc=gbc.score(X_test,y_test)

Accuracy_score = [Accuracy_lr, Accuracy_rf, Accuracy_gbc]

accuracy_table = pd.DataFrame(list(zip(myLabels, Accuracy_score)), 
               columns =['Algorithm', 'Model accuracy score']) 

accuracy_table

Unnamed: 0,Algorithm,Model accuracy score
0,Logistic Regression,0.351007
1,Random Forest,0.36443
2,Gradient Boost,0.512081


Based on these accuracy scores, it's clear that Gradient Boost significantly outperforms the other models tested. All models outperform random choice, which would roughly equate to 1/20 - 1/22 (depedning on how many drivers are participating in the average Grand Prix race.