# Testing One-Hot Encoding with MasterData5

Our original "Testing One-Hot Encoding" workbook used MasterData3 and looked at encoding for our multi-class models. Here we'll use MasterData5 and look at encoding for our binary models, with Completion Status as our target variable instead of familyStatus.

## Set Up Libraries and Data

In [1]:
# Import necessary data libraries.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
import category_encoders as ce
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

In [2]:
# Set up URLs.
master_data_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/processed/MasterData5.csv'

In [3]:
# Set up dataframes.
master_data = pd.read_csv(master_data_url, sep = ',', engine = 'python')

## Establishing Variables

In [4]:
# Count the unique values in each column we would like to one-hot encode.
# Uncomment one line at a time to inspect a different column.
print(
    #master_data['Completion Status'].value_counts()
    #master_data['isHistoric'].value_counts()
    #master_data['trackType'].value_counts()
    master_data['binned_circuits'].value_counts()
)

1    2558
2    2262
3    1771
4    1218
5    1030
6     419
Name: binned_circuits, dtype: int64
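
Rather than uncommenting one line at a time, the counts for every candidate column can be collected in a single pass. The following is a minimal sketch on a made-up stand-in for `master_data`; the `toy` frame and its values are assumptions for illustration only, not rows from MasterData5.

```python
import pandas as pd

# Hypothetical stand-in for master_data; the values are invented for illustration.
toy = pd.DataFrame({
    "trackType": [1, 2, 2, 1, 2],
    "isHistoric": [0, 1, 1, 0, 1],
    "binned_circuits": [1, 1, 2, 3, 1],
})

# Build one value_counts() Series per column we intend to one-hot encode.
counts = {col: toy[col].value_counts() for col in toy.columns}

# Print each column's category counts under its own header.
for col, vc in counts.items():
    print(f"--- {col} ---")
    print(vc.to_string())
```

With the real data, replacing `toy.columns` with the list of columns to encode should give the same view as the cell above for each column in turn.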


In [5]:
master_data.head()

Unnamed: 0,raceId,driverId,constructorId,grid,familyStatus,Completion Status,year,circuitId,country,alt,...,trackType,nationality,total_lap_time,average_lap_time,minimum_lap_time,PRCP,TAVG,TMAX,TMIN,binned_circuits
0,1,2,2,9,4,1,2009,1,Australia,10,...,2,German,5662869,97635.672414,88283,0.0,72.0,78.0,66.0,2
1,1,3,3,5,4,1,2009,1,Australia,10,...,2,German,5661506,97612.172414,87706,0.0,72.0,78.0,66.0,2
2,1,4,4,10,4,1,2009,1,Australia,10,...,2,Spanish,5660663,97597.637931,88712,0.0,72.0,78.0,66.0,2
3,1,6,3,11,1,0,2009,1,Australia,10,...,2,Japanese,1560978,91822.235294,89923,0.0,72.0,78.0,66.0,2
4,1,7,5,17,4,1,2009,1,Australia,10,...,2,French,5662082,97622.103448,89823,0.0,72.0,78.0,66.0,2


In [6]:
master_data['binned_circuits'].dtype

dtype('int64')

In [17]:
master_data['binned_circuits'] = master_data.binned_circuits.astype("str")

In [8]:
master_data['binned_circuits'].dtype

dtype('O')
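
The cast above turns the bin numbers into labels, so that dtype-based checks treat `binned_circuits` as a categorical column rather than a quantity. A small sketch of the same cast on made-up values follows (the `toy` frame is an assumption, not part of the workbook). Note that `OneHotEncoder` also accepts integer categories, so the cast mainly affects how the column is classified, not whether it can be encoded.

```python
import pandas as pd

# Hypothetical stand-in column; the values are invented for illustration.
toy = pd.DataFrame({"binned_circuits": [1, 2, 3, 2, 1]})

# Before the cast the column is numeric.
is_numeric_before = pd.api.types.is_numeric_dtype(toy["binned_circuits"])

# Cast the bin numbers to strings so they read as labels.
toy["binned_circuits"] = toy["binned_circuits"].astype(str)

# After the cast the column is no longer numeric and every value is a Python str.
is_numeric_after = pd.api.types.is_numeric_dtype(toy["binned_circuits"])
all_strings = all(isinstance(v, str) for v in toy["binned_circuits"])

print(is_numeric_before, is_numeric_after, all_strings)
```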

In [22]:
# Analytics Vidhya suggests we check features before proceeding. Do so using their code.
check_features = master_data.select_dtypes(include='O').keys()
# Display variables.
check_features

Index(['country', 'nationality', 'binned_circuits'], dtype='object')

## Quick Column Rename

In [23]:
# Rename Completion Status so it doesn't have any spaces. This will make it easier to use in the code below.
master_data = master_data.rename(columns={"Completion Status": "CompletionStatus"})

## X, y Setup

In [24]:
# Are there any NAs in our data?
master_data['binned_circuits'].isna().sum()

0

In [25]:
X = master_data.loc[:, ['average_lap_time', # numeric
                        'trackType', # categorical
                        'alt', # numeric
                        'grid', # numeric
                        'average_lap_time', # numeric (listed a second time; duplicates the first entry)
                        'minimum_lap_time', # numeric
                        'year', # numeric
                        'PRCP', # numeric
                        'TAVG', # numeric
                        'isHistoric', # categorical
                        'binned_circuits' # categorical
                       ]]
y = master_data.loc[:, 'CompletionStatus'] # categorical (binary target)
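
The feature list above names `average_lap_time` twice, which puts two identical columns into `X`. A quick check like the one below can catch this before modeling. It is a minimal sketch: the `features` list copies the names from the cell above, and `duplicates` and `unique_features` are hypothetical helper names.

```python
from collections import Counter

# Feature names copied from the X selection above, in the same order.
features = [
    'average_lap_time', 'trackType', 'alt', 'grid', 'average_lap_time',
    'minimum_lap_time', 'year', 'PRCP', 'TAVG', 'isHistoric', 'binned_circuits',
]

# Names that appear more than once in the selection.
duplicates = [name for name, n in Counter(features).items() if n > 1]

# The same list with repeats dropped, keeping first-seen order.
unique_features = list(dict.fromkeys(features))

print(duplicates)
print(unique_features)
```

Selecting with `unique_features` instead would drop the repeated column. The accuracy reported later in this workbook was computed with the duplicate in place, so a rerun without it may give a slightly different score.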

## Basic Logistic Regression Test

In [26]:
logreg = LogisticRegression(solver = 'lbfgs')

## Transformations

In [27]:
column_trans =  make_column_transformer(
    (OneHotEncoder(), ['trackType', 'isHistoric', 'binned_circuits']),
    remainder = 'passthrough')

In [28]:
column_trans.fit_transform(X)

array([[0.000e+00, 1.000e+00, 1.000e+00, ..., 2.009e+03, 0.000e+00,
        7.200e+01],
       [0.000e+00, 1.000e+00, 1.000e+00, ..., 2.009e+03, 0.000e+00,
        7.200e+01],
       [0.000e+00, 1.000e+00, 1.000e+00, ..., 2.009e+03, 0.000e+00,
        7.200e+01],
       ...,
       [1.000e+00, 0.000e+00, 1.000e+00, ..., 2.021e+03, 3.000e-02,
        6.700e+01],
       [1.000e+00, 0.000e+00, 1.000e+00, ..., 2.021e+03, 3.000e-02,
        6.700e+01],
       [1.000e+00, 0.000e+00, 1.000e+00, ..., 2.021e+03, 3.000e-02,
        6.700e+01]])
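
The transformed array can be hard to read at full size, so a toy example helps show what the transformer does. The one-hot columns come first, in the order the categorical columns were listed, and the passthrough columns follow. This is a minimal sketch on invented data. `toy` and `toy_trans` are hypothetical names, and `handle_unknown="ignore"` is an optional setting not used in the workbook; it makes the encoder emit all zeros for categories unseen during fitting instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X; the values are invented for illustration.
toy = pd.DataFrame({
    "grid": [9, 5, 10, 11],
    "trackType": [1, 2, 2, 1],
    "binned_circuits": ["1", "2", "3", "1"],
})

# One-hot encode the two categorical columns and pass 'grid' through unchanged.
toy_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["trackType", "binned_circuits"]),
    remainder="passthrough",
)

encoded = toy_trans.fit_transform(toy)
# The result may be a sparse matrix depending on its density; densify for printing.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()
encoded = np.asarray(encoded, dtype=float)

# 2 trackType columns + 3 binned_circuits columns + 1 passthrough column.
print(encoded.shape)
print(encoded)
```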

In [29]:
pipe = make_pipeline(column_trans, logreg) 

In [30]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.7663651543208796
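
The warning above says that `lbfgs` hit its iteration limit, and it suggests either raising `max_iter` or scaling the data. The sketch below combines both by routing the numeric columns through `StandardScaler` (already imported in this workbook) and setting `max_iter=1000`. It runs on synthetic data, so the `toy` frame, the `toy_y` target, and the value of `max_iter` are all assumptions for illustration. The score it prints says nothing about MasterData5.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X and y; none of these values come from MasterData5.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "average_lap_time": rng.normal(95000, 5000, n),  # large-magnitude feature
    "grid": rng.integers(1, 21, n),
    "trackType": rng.integers(1, 3, n),
})
# Invented binary target loosely tied to grid position.
toy_y = (toy["grid"] + rng.normal(0, 4, n) < 10).astype(int)

# One-hot encode the categorical column and standardize the numeric ones.
scaled_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["trackType"]),
    (StandardScaler(), ["average_lap_time", "grid"]),
)

# Give the solver more iterations than the default in addition to scaling.
scaled_pipe = make_pipeline(
    scaled_trans,
    LogisticRegression(solver="lbfgs", max_iter=1000),
)

scores = cross_val_score(scaled_pipe, toy, toy_y, cv=5, scoring="accuracy")
print(scores.mean())
```

Applied to the real pipeline, the same idea would mean listing the numeric columns of `X` under `StandardScaler()` instead of leaving them to `remainder = 'passthrough'`. Whether that changes the accuracy reported above would need to be checked by rerunning the cell.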