# Testing One-Hot Encoding

We're considering one-hot encoding for our multi-class models, so this workbook explores how to set it up and use it.

## Set Up Libraries and Data

In [12]:
# Import necessary data libraries.
import pandas as pd
import os 
import csv
import io
import requests
import numpy as np
import matplotlib.pyplot as plt
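from sklearn.preprocessing import OneHotEncoder
# category_encoders provides the ce.OneHotEncoder and ce.TargetEncoder used below.
import category_encoders as ce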

In [2]:
# Set up URLs.
circuits_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/circuits.csv'
constructor_results_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/constructor_results.csv'
constructor_standings_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/constructor_standings.csv'
constructors_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/constructors.csv'
driver_standings_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/driver_standings.csv'
drivers_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/drivers.csv'
lap_times_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/lap_times.csv'
pit_stop_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/pit_stops.csv'
qualifying_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/qualifying.csv'
races_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/races.csv'
results_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/results.csv'
seasons_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/seasons.csv'
status_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/status.csv'
race_status_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/race_status.csv'
master_data_url = 'https://raw.githubusercontent.com/georgetown-analytics/Formula1/main/data/processed/MasterData3.csv'

In [3]:
# Set up dataframes.
circuits_df = pd.read_csv(circuits_url, sep = ',', encoding = 'latin-1')
constructor_results_df = pd.read_csv(constructor_results_url, sep = ',', engine = 'python')
constructor_standings_df = pd.read_csv(constructor_standings_url, sep = ',', engine = 'python')
constructors_df = pd.read_csv(constructors_url, sep = ',', engine = 'python')
driver_standings_df = pd.read_csv(driver_standings_url, sep = ',', engine = 'python')
lap_times_df = pd.read_csv(lap_times_url, sep = ',', engine = 'python')
pit_stop_df = pd.read_csv(pit_stop_url, sep = ',', engine = 'python')
qualifying_df = pd.read_csv(qualifying_url, sep = ',', engine = 'python')
results_df = pd.read_csv(results_url, sep = ',', engine = 'python')
seasons_df = pd.read_csv(seasons_url, sep = ',', engine = 'python')
status_df = pd.read_csv(status_url, sep = ',', engine = 'python')
races_df = pd.read_csv(races_url, sep = ',', engine = 'c')
drivers_df = pd.read_csv(drivers_url, sep = ',', encoding = 'latin-1')
race_status_df = pd.read_csv(race_status_url, sep = ',', engine = 'python')
master_data = pd.read_csv(master_data_url, sep = ',', engine = 'python')

## Establishing Variables

In [10]:
# Are there any NAs in our data?
master_data.isna().sum()

raceId                  0
driverId                0
constructorId           0
grid                    0
position             2164
positionText            0
positionOrder           0
laps                    0
fastestLap           2745
rank                 2704
fastestLapSpeed      2745
familyStatus            0
Completion Status       0
year                    0
circuitId               0
country                 0
alt                     0
isHistoric              0
trackType               0
nationality             0
total_lap_time          0
average_lap_time        0
minimum_lap_time        0
PRCP                  925
TAVG                  925
TMAX                  925
TMIN                  925
dtype: int64

Yes: position, fastestLap, rank, fastestLapSpeed, PRCP, TAVG, TMAX, and TMIN all contain NAs. The NAs in the four weather variables (PRCP, TAVG, TMAX, and TMIN) have since been fixed in our weather dataset, but not yet in the master data.
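
To read the NA check more easily, the counts can be filtered down to only the columns that actually contain NAs. This is a minimal sketch on a made-up `toy` dataframe (not our real master data); the column values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for master_data; the values are made up for illustration.
toy = pd.DataFrame({
    "grid": [1, 2, 3, 4],
    "position": [1.0, np.nan, 3.0, np.nan],
    "PRCP": [0.0, 0.2, np.nan, 0.0],
})

# Count NAs per column, then keep only the columns that have at least one.
na_counts = toy.isna().sum()
cols_with_na = na_counts[na_counts > 0]
print(cols_with_na)
```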

I'll be using familyStatus as our target variable, with grid, year, alt, isHistoric, and average_lap_time as independent variables.
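
One way to make that split explicit is to keep the column names in two variables and slice the dataframe with them. The sketch below uses a made-up `toy` dataframe in place of master_data; `feature_cols` and `target_col` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for master_data; the values are made up for illustration.
toy = pd.DataFrame({
    "familyStatus": [1, 2, 4],
    "grid": [3, 10, 7],
    "year": [2010, 2011, 2012],
    "alt": [10, 153, 7],
    "isHistoric": [0, 1, 0],
    "average_lap_time": [95000.0, 88000.5, 101250.0],
    "country": ["Italy", "UK", "UK"],
})

# Hypothetical helper names: the target and the independent variables listed above.
target_col = "familyStatus"
feature_cols = ["grid", "year", "alt", "isHistoric", "average_lap_time"]

X = toy[feature_cols]
y = toy[target_col]
print(X.shape, y.name)
```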

In [11]:
# Check on the mins and maxes of familyStatus.
master_data["familyStatus"].describe()

count    9466.000000
mean        4.071097
std         1.069539
min         1.000000
25%         4.000000
50%         4.000000
75%         4.000000
max         6.000000
Name: familyStatus, dtype: float64

We can see above that familyStatus ranges from a minimum of 1 to a maximum of 6. This matches how we set up familyStatus, where the values 1-6 each correspond to a status of an individual car.
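
Because familyStatus is a class label rather than a measurement, counting the rows per class can be more informative than the numeric summary. This is a small sketch on a made-up series, not our actual data.

```python
import pandas as pd

# Toy stand-in for master_data["familyStatus"]; the values are made up.
status = pd.Series([4, 4, 2, 1, 4, 6], name="familyStatus")

# Count rows per class and order the result by class label.
counts = status.value_counts().sort_index()
print(counts)
```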

## Start One-Hot Encoding

We used these sites (https://towardsdatascience.com/target-encoding-for-multi-class-classification-c9a7bcb1a53 and https://www.analyticsvidhya.com/blog/2021/05/how-to-perform-one-hot-encoding-for-multi-categorical-variables/) as foundations for our code.

In [65]:
# Analytics Vidhya suggests we check features before proceeding. Do so using their code.
check_features = master_data.select_dtypes(include='O').keys()
# Display variables.
check_features

Index(['positionText', 'country', 'nationality'], dtype='object')

In [66]:
"""
Encode and transform familyStatus using the one-hot encoding code from Towards Data Science.
Note that the target variable must be a string here.
"""
encodeFamilyStatus = ce.OneHotEncoder().fit(master_data.familyStatus.astype(str))
y_onehot = encodeFamilyStatus.transform(master_data.familyStatus.astype(str))
y_onehot


Unnamed: 0,familyStatus_1,familyStatus_2,familyStatus_3,familyStatus_4,familyStatus_5,familyStatus_6
0,1,0,0,0,0,0
1,0,1,0,0,0,0
2,0,1,0,0,0,0
3,0,1,0,0,0,0
4,0,0,1,0,0,0
...,...,...,...,...,...,...
9461,0,1,0,0,0,0
9462,0,1,0,0,0,0
9463,0,1,0,0,0,0
9464,0,1,0,0,0,0
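
For intuition, the same kind of one-column-per-class output can be sketched with pandas alone. This example uses `pd.get_dummies` on made-up data as a stand-in; it is not the category_encoders call used above, and the column order it produces may differ from that library's.

```python
import pandas as pd

# Toy stand-in for master_data["familyStatus"]; the values are made up.
status = pd.Series([1, 2, 2, 3], name="familyStatus")

# One indicator column per class; exactly one column is 1 in each row.
y_onehot_toy = pd.get_dummies(status.astype(str), prefix="familyStatus", dtype=int)
print(y_onehot_toy)
```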


In [67]:
"""
Target-encode the country column against each one-hot familyStatus class,
using the for loop shown in the Towards Data Science article.
"""
class_names = y_onehot.columns
for class_ in class_names:
    encodeCountry = ce.TargetEncoder(smoothing = 0)
    print(encodeCountry.fit_transform(master_data["country"], y_onehot[class_]))

       country
0     0.042644
1     0.042644
2     0.042644
3     0.042644
4     0.042644
...        ...
9461  0.020468
9462  0.020468
9463  0.020468
9464  0.020468
9465  0.020468

[9466 rows x 1 columns]
       country
0     0.648188
1     0.648188
2     0.648188
3     0.648188
4     0.648188
...        ...
9461  0.728070
9462  0.728070
9463  0.728070
9464  0.728070
9465  0.728070

[9466 rows x 1 columns]
       country
0     0.049041
1     0.049041
2     0.049041
3     0.049041
4     0.049041
...        ...
9461  0.011696
9462  0.011696
9463  0.011696
9464  0.011696
9465  0.011696

[9466 rows x 1 columns]
       country
0     0.213220
1     0.213220
2     0.213220
3     0.213220
4     0.213220
...        ...
9461  0.187135
9462  0.187135
9463  0.187135
9464  0.187135
9465  0.187135

[9466 rows x 1 columns]
       country
0     0.044776
1     0.044776
2     0.044776
3     0.044776
4     0.044776
...        ...
9461  0.049708
9462  0.049708
9463  0.049708
9464  0.049708
9465  0.049708
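
As a rough illustration of what the loop computes: unsmoothed target encoding replaces each category with the mean of the target within that category, so each country gets the share of its rows that fall in the given class. The sketch below reproduces that idea with pandas only, on made-up data. It is not the category_encoders implementation, and that library's smoothing behavior may make its numbers differ.

```python
import pandas as pd

# Toy stand-in for master_data; the values are made up for illustration.
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "UK", "UK", "UK"],
    "familyStatus": [1, 2, 2, 2, 1],
})

# One indicator column per class of the target.
y_onehot_toy = pd.get_dummies(toy["familyStatus"].astype(str), prefix="familyStatus", dtype=int)

# For each class, replace each country with the mean of that class's indicator.
encoded = {}
for class_ in y_onehot_toy.columns:
    class_means = y_onehot_toy[class_].groupby(toy["country"]).mean()
    encoded[f"country_{class_}"] = toy["country"].map(class_means)
encoded = pd.DataFrame(encoded)
print(encoded)
```

Each row of `encoded` sums to 1 here, because the per-class shares within a country add up to the whole.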



In [68]:
"""
The Towards Data Science article gives a function, target_encode_multiclass,
that target-encodes every categorical column of the dataset once per class of the
given target variable. That function is below.
"""
def target_encode_multiclass(X, y):  # X is a pandas DataFrame, y is a pandas Series
    y = y.astype(str)  # convert to string to one-hot encode
    enc = ce.OneHotEncoder().fit(y)
    y_onehot = enc.transform(y)
    class_names = y_onehot.columns  # names of one-hot encoded columns
    X_obj = X.select_dtypes('object')  # separate categorical columns
    X = X.select_dtypes(exclude='object')
    for class_ in class_names:
        enc = ce.TargetEncoder()
        enc.fit(X_obj, y_onehot[class_])  # target-encode all categorical columns for class_
        temp = enc.transform(X_obj)
        temp.columns = [str(x) + '_' + str(class_) for x in temp.columns]
        X = pd.concat([X, temp], axis=1)  # add to original dataset
    return X

In [69]:
# Use the above function to encode the categorical columns of master_data, with familyStatus as our target variable.
onehot_data = target_encode_multiclass(master_data, master_data["familyStatus"])


In [70]:
# Take a look at the new dataset using the describe() function.
onehot_data.describe()

Unnamed: 0,raceId,driverId,constructorId,grid,position,positionOrder,laps,fastestLap,rank,fastestLapSpeed,...,nationality_familyStatus_3,positionText_familyStatus_4,country_familyStatus_4,nationality_familyStatus_4,positionText_familyStatus_5,country_familyStatus_5,nationality_familyStatus_5,positionText_familyStatus_6,country_familyStatus_6,nationality_familyStatus_6
count,9466.0,9466.0,9466.0,9466.0,7302.0,9466.0,9466.0,6721.0,6762.0,6721.0,...,9466.0,9466.0,9466.0,9466.0,9466.0,9466.0,9466.0,9466.0,9466.0,9466.0
mean,500.169977,249.438411,36.706634,11.070357,8.744043,10.817135,52.982252,42.216039,10.692399,202.509826,...,0.045956,0.1455,0.145468,0.145461,0.031911,0.031904,0.03190486,0.003804,0.003803,0.003803
std,408.988287,355.593273,63.937258,6.24087,5.090236,6.043638,17.737604,17.000168,6.059511,21.342117,...,0.022961,0.242194,0.040549,0.036695,0.052178,0.016824,0.007799094,0.005982,0.003089,0.0045
min,1.0,1.0,1.0,0.0,1.0,1.0,0.0,2.0,0.0,89.54,...,0.0,0.0,0.016667,0.03125,0.0,0.007648,2.982379e-15,0.0,0.0,0.0
25%,121.0,15.0,4.0,6.0,4.0,6.0,49.0,32.0,5.0,192.346,...,0.027586,0.00216,0.122616,0.114583,0.0,0.020716,0.02879291,0.0,0.0,0.002244
50%,236.0,35.0,9.0,11.0,8.0,11.0,56.0,45.0,11.0,203.989,...,0.041536,0.018519,0.141447,0.146497,0.004338,0.029674,0.03115265,0.0,0.002967,0.003822
75%,934.0,810.0,20.0,16.0,13.0,16.0,66.0,54.0,16.0,215.688,...,0.062016,0.074419,0.178974,0.161683,0.017544,0.04,0.03590127,0.005291,0.006061,0.005664
max,1060.0,854.0,214.0,24.0,24.0,24.0,87.0,85.0,24.0,257.32,...,0.214285,0.596117,0.296875,0.357142,0.128788,0.107527,0.08695652,0.014678,0.011834,0.064516


### Create a CSV File with Our New Encoded Dataset

In [71]:
# Use pandas.DataFrame.to_csv to create the CSV file.
onehot_data.to_csv("data/processed/OneHot_MasterData3.csv", index = False)
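
Note that `to_csv` does not create missing folders, so the write fails if `data/processed` does not exist yet. The sketch below shows one way to guard against that and to confirm the file reads back, using a temporary directory and a made-up dataframe; `out_path` and the toy file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for onehot_data; the values are made up for illustration.
toy = pd.DataFrame({"grid": [1, 2], "country_familyStatus_1": [0.5, 0.25]})

# Hypothetical output path inside a temporary directory.
base_dir = tempfile.mkdtemp()
out_path = os.path.join(base_dir, "data", "processed", "OneHot_Toy.csv")

# Create the folder first, then write without the index column.
os.makedirs(os.path.dirname(out_path), exist_ok=True)
toy.to_csv(out_path, index=False)

# Read the file back to confirm the columns and row count survived.
round_trip = pd.read_csv(out_path)
print(round_trip.shape)
```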