# Interactive Widget: Back End Code: Bagging Classifier

This is our official final version of the widget's back end code.

Throughout this workbook, we used steps from the following web pages to inform our widgets.
- https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html
- https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html
- https://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html

## Setting Up the Model for the Widget

### Set Up the Training and Testing Sets

In [1]:
# Import necessary data libraries.
from collections import Counter
from imblearn.datasets import fetch_datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
import numpy as np 
import pandas as pd

In [2]:
# Set up datasets.
features_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/processed/final_features.csv'
features = pd.read_csv(features_url, sep = ',', engine = 'python', encoding = 'latin-1')

In [3]:
# What columns are in this dataset?
features.columns

Index(['raceId', 'driverId', 'CompletionStatus', 'alt', 'grid', 'trackType',
       'average_lap_time', 'minimum_lap_time', 'year', 'PRCP', 'TAVG',
       'isHistoric', 'oneHot_circuits_1', 'oneHot_circuits_2',
       'oneHot_circuits_3', 'oneHot_circuits_4', 'oneHot_circuits_5',
       'oneHot_circuits_6', 'alt_trans', 'PRCP_trans', 'normalized_minLapTime',
       'normalized_avgLapTime'],
      dtype='object')

In [4]:
features.shape

(9258, 22)

In [5]:
# Establish our X (independent) variables.
X = features[['trackType', 'alt_trans', 'grid', 'normalized_minLapTime', 'normalized_avgLapTime', 'year', 'PRCP_trans', 
'TAVG', 'isHistoric', "oneHot_circuits_1", "oneHot_circuits_2", "oneHot_circuits_3",
"oneHot_circuits_4","oneHot_circuits_5","oneHot_circuits_6"]]

In [6]:
# Establish our y (dependent, target) variable.
y = features['CompletionStatus']

In [7]:
# Split our data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [8]:
# Import SMOTE so we can deal with our class imbalance.
from imblearn.over_sampling import SMOTE, ADASYN

In [9]:
# Use SMOTE on our X_ and y_train to create X_ and y_resampled.
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

In [10]:
# Check the balance of our resampled data.
print(sorted(Counter(y_resampled).items()))

[(0, 5694), (1, 5694)]


Above we can see that we've fixed the class imbalance of our training sets.
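
For intuition about how SMOTE fills out the minority class, the sketch below creates one synthetic point by interpolating between a minority sample and one of its nearest minority neighbors. This is a simplified illustration under our own assumptions, not imblearn's implementation: the function name `smote_like_sample`, the toy points, and `k = 2` are all hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbor."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick a random minority sample.
    i = rng.integers(len(minority))
    x = minority[i]
    # Rank the other minority samples by Euclidean distance to x.
    dists = np.linalg.norm(minority - x, axis=1)
    dists[i] = np.inf  # exclude the sample itself
    neighbors = np.argsort(dists)[:k]
    # Pick one neighbor and move a random fraction of the way toward it.
    j = rng.choice(neighbors)
    gap = rng.random()
    return x + gap * (minority[j] - x)

# Hypothetical toy minority class with two features.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
synthetic = smote_like_sample(minority, rng=np.random.default_rng(0))
print(synthetic)
```

Because the new point always lies on the segment between two real minority samples, it stays inside the range of the existing minority data.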

### Create CSV Files

To avoid a different randomized training set every time someone uses the widget, we'll save our training and testing data to CSV files that we can load later. Although the file names say they are for the K-Neighbors widget, we did not change anything in the resampling or testing data, so we reuse the same files here.

In [11]:
# Use pandas.DataFrame.to_csv to create the CSV file.
X_resampled.to_csv("data/interim/X_resampled_forKNeighborWidget.csv", index = False)

In [12]:
# Use pandas.DataFrame.to_csv to create the CSV file.
y_resampled.to_csv("data/interim/y_resampled_forKNeighborWidget.csv", index = False)

In [13]:
# Use pandas.DataFrame.to_csv to create the CSV file.
X_test.to_csv("data/interim/X_test_forKNeighborWidget.csv", index = False)

In [14]:
# Use pandas.DataFrame.to_csv to create the CSV file.
y_test.to_csv("data/interim/y_test_forKNeighborWidget.csv", index = False)
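
Another way to keep the split stable between runs, sketched here as an alternative rather than what this notebook does, is to pass a fixed `random_state` to `train_test_split` (imblearn's `SMOTE` accepts a `random_state` as well). The toy arrays and the seed value 42 below are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, alternating labels.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Two separate calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Compare X_train, X_test, y_train, and y_test pairwise.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

Saving CSV files, as we do above, has the added benefit that the widget can load the data without rerunning the resampling.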

Further down, after we loaded the above CSV files and ran our model, we got a message stating `"A column-vector y was passed when a 1d array was expected."` We know that the model worked beforehand, so we need to convert our new `y_resampled` back to the type it used to be.

In [15]:
# What type was y_resampled?
type(y_resampled)

pandas.core.series.Series

The result above shows that `y_resampled` was originally a `pandas.core.series.Series`.

### Set Up the Initial Model

Although our work involves several models, we're only using one for now: Bagging Classifier. This model will run with the regular `X_test` and `y_test` data.

In [16]:
# Import the necessary data libraries that we'll need for our model.
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from yellowbrick.classifier import ClassificationReport
from sklearn.ensemble import BaggingClassifier

As with the file names, the URLs below refer to the K-Neighbors widget, but the resampled and testing data are unchanged, so we load the same files.

In [17]:
# Set up datasets. 
X_resampled_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/interim/X_resampled_forKNeighborWidget.csv'
X_resampled = pd.read_csv(X_resampled_url, sep = ',', engine = 'python')
y_resampled_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/interim/y_resampled_forKNeighborWidget.csv'
y_resampled = pd.read_csv(y_resampled_url, sep = ',', engine = 'python')
X_test_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/interim/X_test_forKNeighborWidget.csv'
X_test = pd.read_csv(X_test_url, sep = ',', engine = 'python')
y_test_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/interim/y_test_forKNeighborWidget.csv'
y_test = pd.read_csv(y_test_url, sep = ',', engine = 'python')

In [18]:
# View X_resampled.
X_resampled.head()

| | grid | trackType | year | TAVG | isHistoric | oneHot_circuits_1 | oneHot_circuits_2 | oneHot_circuits_3 | oneHot_circuits_4 | oneHot_circuits_5 | oneHot_circuits_6 | alt_trans | PRCP_trans | normalized_minLapTime | normalized_avgLapTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 2018 | 84.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5.605802 | 0.0 | 0.987352 | -0.020431 |
| 1 | 15 | 0 | 2001 | 73.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 5.463832 | 0.0 | 1.006597 | 0.003767 |
| 2 | 14 | 0 | 2005 | 84.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2.70805 | 0.0 | 0.996195 | -0.011305 |
| 3 | 16 | 1 | 2016 | 83.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3.258097 | 0.24686 | 1.011321 | 0.002241 |
| 4 | 11 | 1 | 2015 | 62.0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2.70805 | 0.0 | 1.00129 | -0.000971 |


We know from checking the type of `y_resampled` before we loaded the CSV files that `y_resampled` and `y_test` each need to be a Series for our model to run correctly. This site (https://datatofish.com/pandas-dataframe-to-series/) shows how to change a single-column DataFrame into a Series.

In [19]:
# Change the y_resampled dataframe into a y_resampled series.
y_resampled = y_resampled.squeeze()

In [20]:
# View y_resampled.
y_resampled.head()

0    1
1    1
2    1
3    1
4    1
Name: CompletionStatus, dtype: int64

In [21]:
# Change the y_test dataframe into a y_test series.
y_test = y_test.squeeze()
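
The round trip that caused the column-vector message can be reproduced on a small scale. The sketch below writes a Series to an in-memory CSV, reads it back, and squeezes the result. The toy values and the variable names `y_original`, `y_loaded`, and `y_series` are hypothetical.

```python
import io
import pandas as pd

y_original = pd.Series([1, 0, 1, 1], name="CompletionStatus")

# Write the Series to an in-memory CSV, the same way to_csv(index=False) writes a file.
buffer = io.StringIO()
y_original.to_csv(buffer, index=False)
buffer.seek(0)

# read_csv returns a DataFrame, even when the file has a single column.
y_loaded = pd.read_csv(buffer)
print(type(y_loaded).__name__, y_loaded.shape)

# squeeze("columns") collapses the single column back into a Series.
y_series = y_loaded.squeeze("columns")
print(type(y_series).__name__, y_series.shape)
```

Passing `"columns"` is a small safeguard: a bare `squeeze()` would return a scalar instead of a Series if the DataFrame happened to have only one row.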

In [22]:
# Create the function score_model.
def score_model(X_resampled, y_resampled, X_test, y_test, estimator, **kwargs):
    """
    Fit an estimator on the resampled training data and print its F1 score on the test data.
    """
    # Fit the classification model on the resampled training data.
    estimator.fit(X_resampled, y_resampled, **kwargs)
    
    expected  = y_test
    predicted = estimator.predict(X_test)
    
    # Compute and print F1 (harmonic mean of precision and recall).
    print("{}: {}".format(estimator.__class__.__name__, f1_score(expected, predicted)))

In [23]:
# Run the Bagging Classifier model.
score_model(X_resampled, y_resampled, X_test, y_test, BaggingClassifier())

BaggingClassifier: 0.854182087342709
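
To make the F1 score above more concrete, here is a minimal worked example of the harmonic mean of precision and recall. The helper name `f1_from_counts` and the counts are hypothetical and are not taken from our model's results.

```python
def f1_from_counts(tp, fp, fn):
    """Compute F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp)  # share of positive predictions that were right
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for illustration only.
f1 = f1_from_counts(tp=80, fp=20, fn=10)
print(round(f1, 4))
```

The same value can be written as 2·TP / (2·TP + FP + FN), which shows that F1 ignores true negatives entirely.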


## Widget Experimentation

### Set Up

In [24]:
# Import necessary data libraries.
import pandas as pd
import os 
import csv
import io
import requests
import numpy as np
import matplotlib.pyplot as plt
import category_encoders as ce
import scipy.stats as stats

# The following are for Classification Accuracy.
from sklearn import metrics

# The following are for Jupyter Widgets.
import ipywidgets as widgets
from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual
from ipywidgets import FloatSlider

### Working with the Data in the Input Columns

In [25]:
# What features are in X_resampled and will therefore be required for our widget?
X_resampled.columns

Index(['grid', 'trackType', 'year', 'TAVG', 'isHistoric', 'oneHot_circuits_1',
       'oneHot_circuits_2', 'oneHot_circuits_3', 'oneHot_circuits_4',
       'oneHot_circuits_5', 'oneHot_circuits_6', 'alt_trans', 'PRCP_trans',
       'normalized_minLapTime', 'normalized_avgLapTime'],
      dtype='object')

As shown above, with slight changes to account for the one-hot encoding, we'll have to ask interactors to choose grid, trackType, year, average temperature, whether the track is historic or not, a binned circuit, altitude, precipitation, minimum lap time, and average lap time.
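
As a compact illustration of the one-hot step, the sketch below maps a circuit bin label to the six `oneHot_circuits_*` flags. The labels are the ones offered by the widget later in this notebook. The helper `one_hot_circuit` and the list `CIRCUIT_BINS` are hypothetical names, and this is an alternative sketch rather than the code the widget runs.

```python
# Bin labels in the order of oneHot_circuits_1 through oneHot_circuits_6.
CIRCUIT_BINS = [
    "Used 500+ times",
    "Used 400-499 times",
    "Used 300-399 times",
    "Used 200-299 times",
    "Used 100-199 times",
    "Used less than 100 times",
]

def one_hot_circuit(circuit):
    """Return a dict of six flags with a 1 only for the chosen bin."""
    flags = {f"oneHot_circuits_{i}": 0 for i in range(1, 7)}
    position = CIRCUIT_BINS.index(circuit) + 1  # raises ValueError for unknown labels
    flags[f"oneHot_circuits_{position}"] = 1
    return flags

print(one_hot_circuit("Used 300-399 times"))
```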

In [26]:
# What minimum and maximum numbers will we have to allow for in our input columns?
X_resampled.describe()

| | grid | trackType | year | TAVG | isHistoric | oneHot_circuits_1 | oneHot_circuits_2 | oneHot_circuits_3 | oneHot_circuits_4 | oneHot_circuits_5 | oneHot_circuits_6 | alt_trans | PRCP_trans | normalized_minLapTime | normalized_avgLapTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 | 11386.0 |
| mean | 11.407167 | 0.227297 | 2007.588793 | 67.905169 | 0.270244 | 0.246267 | 0.212103 | 0.1544 | 0.107149 | 0.083524 | 0.033989 | 4.370972 | 0.077372 | 0.998519 | -0.00407 |
| std | 6.212801 | 0.419104 | 7.380854 | 8.95464 | 0.444105 | 0.430855 | 0.408815 | 0.361348 | 0.309316 | 0.276684 | 0.181209 | 1.449241 | 0.187889 | 0.016467 | 0.021178 |
| min | 0.0 | 0.0 | 1996.0 | 44.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.941946 | -0.079882 |
| 25% | 6.0 | 0.0 | 2001.0 | 61.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.993564 | 0.0 | 0.989047 | -0.015462 |
| 50% | 12.0 | 0.0 | 2007.0 | 67.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.70953 | 0.0 | 0.997514 | -0.005171 |
| 75% | 17.0 | 0.0 | 2014.0 | 74.489699 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.463832 | 0.057923 | 1.008695 | 0.007451 |
| max | 24.0 | 1.0 | 2021.0 | 94.2 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 7.711997 | 1.987874 | 1.04938 | 0.067129 |


- grid has a min of 0 and a max of 24.
- year has a min of 1996 and a max of 2021.
- TAVG has a min of 44.0 and a max of 94.2.
- alt, non-transformed, has a min of -7.0 and a max of 2227.0. We know this from Feature_Transformation.csv.
- average_lap_time (stored as `normalized_avgLapTime`) has a min of -0.079882 and a max of 0.067129. We know from Feature_Transformation.csv, however, that the value we're actually asking for (the normalized average lap time before transformation and imputation) has a min of 0.523032 and a max of 4.702234.
- minimum_lap_time (stored as `normalized_minLapTime`) has a min of 0.941946 and a max of 1.049380. We know from Feature_Transformation.csv, however, that the value we're actually asking for (the normalized minimum lap time before imputation) has a min of 0.768296 and a max of 4.837281.
- PRCP, non-transformed, has a min of 0.0 and a max of 6.3. We know this from Feature_Transformation.csv.
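
As a quick sanity check on the raw ranges listed above, we can apply the log transforms used by the `predictfinish` function later in this notebook and compare the results with the `alt_trans` and `PRCP_trans` columns of `describe()`. The helper names `transform_alt` and `transform_prcp` are hypothetical; the formulas are the ones from `predictfinish`.

```python
import math

ALT_MIN = -7.0  # minimum raw altitude from the list above

def transform_alt(alt):
    """Shift altitude so the minimum maps to 1, then take the natural log."""
    return math.log(alt + 1 - ALT_MIN)

def transform_prcp(prcp):
    """Take the natural log of precipitation plus 1."""
    return math.log(prcp + 1)

print(round(transform_alt(-7.0), 6))     # raw minimum altitude
print(round(transform_alt(2227.0), 6))   # raw maximum altitude
print(round(transform_prcp(0.0), 6))     # raw minimum precipitation
print(round(transform_prcp(6.3), 6))     # raw maximum precipitation
```

The transformed extremes agree, to within rounding, with the `min` and `max` rows of `alt_trans` (0.0 and 7.711997) and `PRCP_trans` (0.0 and 1.987874) shown above.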

### Building the Widget

In [27]:
# Create the function widgetpred.
def widgetpred(X_resampled, y_resampled, X_test, y_test, input_test, estimator, **kwargs):
    """
    Fit the estimator, then predict on both the test set and the widget input.
    """
    # Fit the classification model on the resampled training data.
    estimator.fit(X_resampled, y_resampled, **kwargs)
    
    # Predict on the test set (used later for the confusion matrix).
    predicted = estimator.predict(X_test)
    
    # Predict on the single row of widget input.
    inputpred = estimator.predict(input_test)
    
    # Return the test-set predictions and the widget-input prediction.
    return [predicted, inputpred]

In [28]:
# Create the function conmatrix.
def conmatrix(y_test, predicted, inputpred):
    """
    Compute the confusion matrix and print how often the predicted class is correct.
    """
    confusion = metrics.confusion_matrix(y_test, predicted)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    
    # When the prediction is positive, how often is it correct?
    # Note: TP / (TP + FP) is the precision (positive predictive value), not the true positive rate.
    truepositive_rate = round((TP / (TP + FP)) * 100, 2)
    
    # When the prediction is negative, how often is it correct?
    # Note: TN / (TN + FN) is the negative predictive value, not the true negative rate.
    truenegative_rate = round((TN / (TN + FN)) * 100, 2)
    
    # Print the statement that matches the prediction for the widget input.
    if inputpred == 1:
        print("When our model predicts that a car will finish the race, it is correct", truepositive_rate, "% of the time.")
    else:
        print("When our model predicts that a car will not finish the race, it is correct", truenegative_rate, "% of the time.")
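
The two percentages printed by `conmatrix` answer "how often is this kind of prediction correct?", which differs from "how many of the actual finishers did we catch?". The sketch below uses hypothetical labels to show both quantities side by side. The toy `y_true` and `y_pred` lists are assumptions and are not our model's output.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = finished, 0 = did not finish.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes,
# so ravel() yields TN, FP, FN, TP in that order for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # correct share of "will finish" predictions
npv = tn / (tn + fn)        # correct share of "will not finish" predictions
recall = tp / (tp + fn)     # true positive rate: share of finishers found

print(precision, npv, recall)
```

Here the hand-computed `precision` and `recall` match scikit-learn's `precision_score` and `recall_score`, and the two values differ from each other, which is why the distinction matters when wording the message shown to users.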

In [29]:
def predictfinish(trackType, historic, circuit, year, grid, alt, average_lap_time, normalized_minLapTime, PRCP, TAVG):
    """
    Allow selection of one of two track types, whether the track is historic
    or not, and how popular the circuit is, as well as the input of one of
    each of the following values:
    year, grid, alt, average_lap_time, normalized_minLapTime, PRCP, TAVG.

    Place these values in the dataframe input_df and display the dataframe.

    Create a prediction based on the widgetpred function and display the prediction:
    0 for did not finish, 1 for did finish.
    """
    # Use an if-else statement to determine the output based on the input track.
    if trackType == "Race":
        trackType = 0
    else:
        trackType = 1
        
    # Use an if-else statement to determine the output based on the input historic.
    if historic == "Not Historic":
        isHistoric = 0
    else:
        isHistoric = 1
    
    # Use an if-else statement to determine the output based on the input circuit.
    if circuit == "Used 500+ times":
        oneHot_circuits_1 = 1
        oneHot_circuits_2 = 0
        oneHot_circuits_3 = 0
        oneHot_circuits_4 = 0
        oneHot_circuits_5 = 0
        oneHot_circuits_6 = 0
    elif circuit == "Used 400-499 times":
        oneHot_circuits_1 = 0
        oneHot_circuits_2 = 1
        oneHot_circuits_3 = 0
        oneHot_circuits_4 = 0
        oneHot_circuits_5 = 0
        oneHot_circuits_6 = 0
    elif circuit == "Used 300-399 times":
        oneHot_circuits_1 = 0
        oneHot_circuits_2 = 0
        oneHot_circuits_3 = 1
        oneHot_circuits_4 = 0
        oneHot_circuits_5 = 0
        oneHot_circuits_6 = 0
    elif circuit == "Used 200-299 times":
        oneHot_circuits_1 = 0
        oneHot_circuits_2 = 0
        oneHot_circuits_3 = 0
        oneHot_circuits_4 = 1
        oneHot_circuits_5 = 0
        oneHot_circuits_6 = 0
    elif circuit == "Used 100-199 times":
        oneHot_circuits_1 = 0
        oneHot_circuits_2 = 0
        oneHot_circuits_3 = 0
        oneHot_circuits_4 = 0
        oneHot_circuits_5 = 1
        oneHot_circuits_6 = 0
    elif circuit == "Used less than 100 times":
        oneHot_circuits_1 = 0
        oneHot_circuits_2 = 0
        oneHot_circuits_3 = 0
        oneHot_circuits_4 = 0
        oneHot_circuits_5 = 0
        oneHot_circuits_6 = 1
    
    # Transform average_lap_time.
    normalized_avgLapTime = np.log(average_lap_time)
    
    # Replace any potential outlier in normalized_avgLapTime (outside Q1 - 2.5 * IQR or Q3 + 2.5 * IQR) with the median.
    avgQ1 = -0.019303
    avgQ3 = 0.006690
    avgIQR = avgQ3 - avgQ1
    avglowertail = avgQ1 - 2.5 * avgIQR
    avguppertail = avgQ3 + 2.5 * avgIQR
    avgmedian = -0.005962837883204569
    if normalized_avgLapTime > avguppertail or normalized_avgLapTime < avglowertail:
        normalized_avgLapTime = avgmedian
        
    # Replace any potential outlier in normalized_minLapTime (outside Q1 - 2.0 * IQR or Q3 + 2.0 * IQR) with the median.
    minQ1 = 0.984717
    minQ3 = 1.006281
    minIQR = minQ3 - minQ1
    minlowertail = minQ1 - 2.0 * minIQR
    minuppertail = minQ3 + 2.0 * minIQR
    minmedian = 0.995628475361378
    if normalized_minLapTime > minuppertail or normalized_minLapTime < minlowertail:
        normalized_minLapTime = minmedian
    
    # Transform altitude: shift by the minimum altitude (-7) so the log argument is at least 1.
    alt_trans = np.log(alt + 1 - (-7))
    
    # Transform precipitation.
    PRCP_trans = np.log(PRCP + 1)
    
    # Establish the data of our input_df dataframe.
    inputdata = [[grid, trackType, year, TAVG, isHistoric, oneHot_circuits_1, oneHot_circuits_2,
                 oneHot_circuits_3, oneHot_circuits_4, oneHot_circuits_5, oneHot_circuits_6, alt_trans,
                 PRCP_trans, normalized_minLapTime, normalized_avgLapTime]]
    
    # Establish the dataframe input_df itself with pd.DataFrame.
    input_df = pd.DataFrame(inputdata, columns =
                ['grid', 'trackType', 'year', 'TAVG',
             'isHistoric', 'oneHot_circuits_1', 'oneHot_circuits_2',
             'oneHot_circuits_3', 'oneHot_circuits_4', 'oneHot_circuits_5',
             'oneHot_circuits_6', 'alt_trans', 'PRCP_trans', 'normalized_minLapTime',
             'normalized_avgLapTime'])
    
    display(input_df)
    
    # Using the widgetpred function, predict whether the car will finish the race or not given input_df.
    pred = widgetpred(X_resampled, y_resampled, X_test, y_test, input_df, BaggingClassifier())
    
    # Using an if-else statement, determine what interactors will see given the data they input.
    if pred[1] == 1:
        writtenpred = "finish the race."
    else:
        writtenpred = "not finish the race."
    
    # Print the model's prediction.
    print("According to our Bagging Classifier model, your car is predicted to", writtenpred)
    
    # Using the conmatrix function, print out a statement about how often
    # this kind of prediction (finish or not finish) is correct.
    conmatrix(y_test, pred[0], pred[1])
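
The outlier handling inside `predictfinish` can be read as one reusable rule: compute fences from the quartiles and replace anything outside them with the median. The sketch below restates that rule as a helper. The name `replace_outlier` is hypothetical, and the constants are the average lap time values used in the function above.

```python
import math

def replace_outlier(value, q1, q3, median, k):
    """Return the median if value falls outside [q1 - k*IQR, q3 + k*IQR], else value."""
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    if value < lower or value > upper:
        return median
    return value

# Constants copied from the average lap time branch of predictfinish.
AVG_Q1, AVG_Q3, AVG_MEDIAN = -0.019303, 0.006690, -0.005962837883204569

inside = replace_outlier(0.0, AVG_Q1, AVG_Q3, AVG_MEDIAN, k=2.5)
outside = replace_outlier(math.log(0.1), AVG_Q1, AVG_Q3, AVG_MEDIAN, k=2.5)
print(inside, outside)
```

Note that the log of 0.1 is far below the lower fence, so an average lap time input of 0.1 is replaced by the median before the model sees it.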

In [30]:
# Create a widget that will interact with the predictfinish function.
interact(predictfinish, trackType = widgets.Dropdown(options = ["Race", "Street"], value = "Race", description = 'Track Type'),
         historic = widgets.Dropdown(options = ["Not Historic", "Historic"], value = "Not Historic", description = 'Historic?'),
         circuit = widgets.Dropdown(options = ["Used 500+ times", "Used 400-499 times", "Used 300-399 times", "Used 200-299 times", "Used 100-199 times", "Used less than 100 times"], value = "Used less than 100 times", description = 'Circuit'),
         year = widgets.IntSlider(min = 1996, max = 2021, description = 'Year', disabled = False, continuous_update = False),
         grid = widgets.IntSlider(min = 0, max = 30, description = 'Grid', disabled = False, continuous_update = False),
         # alt must be at least -7 (the minimum in our data) so that np.log(alt + 1 - (-7)) stays defined.
         alt = widgets.BoundedFloatText(min = -7, max = 2500, description = 'Altitude', disabled = False, continuous_update = False),
         average_lap_time = widgets.FloatSlider(min = 0.1, max = 6.0, value = 0.1, description = 'Avg Lap Time', disabled = False, continuous_update = False),
         normalized_minLapTime = widgets.FloatSlider(min = 0.1, max = 6.0, value = 0.1, description = 'Min Lap Time', disabled = False, continuous_update = False),
         PRCP = widgets.FloatSlider(min = 0, max = 10, description = 'Precipitation', disabled = False, continuous_update = False),
         TAVG = widgets.FloatSlider(min = 0, max = 110, description = 'Avg Temp (F)', disabled = False, continuous_update = False));

interactive(children=(Dropdown(description='Track Type', options=('Race', 'Street'), value='Race'), Dropdown(d…
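
Because `predictfinish` passes a fresh `BaggingClassifier()` to `widgetpred`, the model is refit every time a widget value changes, and since no `random_state` is set, the same inputs may not always produce the same prediction. One possible refinement, which is not what this notebook does, is to fit the model once and reuse it inside the callback. The sketch below shows the idea on synthetic data. The names `model` and `predict_row`, the data from `make_classification`, and the seed are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Hypothetical stand-in for X_resampled and y_resampled.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit once, outside the widget callback.
model = BaggingClassifier(random_state=0).fit(X_demo, y_demo)

def predict_row(row):
    """Predict the class for one row of feature values using the already fitted model."""
    return int(model.predict(np.asarray(row).reshape(1, -1))[0])

# Repeated calls reuse the same fitted model, so the answer is stable.
first = predict_row(X_demo[0])
second = predict_row(X_demo[0])
print(first, second)
```

In the widget, a function like `predict_row` could take the place of the `widgetpred` call, with the test-set predictions for the confusion matrix also computed once up front.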