[**Introduction**](#1)

The first question a person asks during a horse race is: *who will win the race?*

Even this simple question needs a thorough analysis. The winner of a race is always the horse that sets the best time.
So the task of machine learning, in this case, is to use the available datasets, choose the features that can be determining factors, and finally predict each horse's finish time in a given race.


[**Task**](#2)

We are going to create a new database from runs.csv and races.csv, keeping and building useful features and dropping those that are derived from other information or are not useful for our machine learning.

Next we will clean up our database, eliminating the NA values, adjusting the data types and observing the correlation between the columns.

Through machine learning we will create regression models and we will evaluate these through Cross Validation.
We will predict our Y_test and compare the results.

Finally we will use our model on "race_id" numbers 0 and 3 and try to see how many winning and placed horses we have guessed.

[**Conclusion**](#3)

Our best model is Random Forest:

Its results are:

*R^2* 0.99560

*Mean Absolute Error* 0.708858

*Mean Squared Error* 1.50512

*Root Mean Squared Error* 1.22683


For "race_id=0" the algorithm identified 2 of the 3 placed horses.

For "race_id=3" the algorithm identified the winner and 2 of the 3 placed horses.

In [None]:
import pandas as pd
import sklearn.preprocessing as preprocessing
import sklearn.model_selection as model_selection
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import VotingRegressor


I load the CSV files

In [None]:
races_df = pd.read_csv('../input/hkracing/races.csv',delimiter=",",header=0, index_col='race_id')
runs_df = pd.read_csv('../input/hkracing/runs.csv', delimiter=",", header=0)

Shape of the two dataframes

In [None]:
races_df.shape

In [None]:
runs_df.shape

In [None]:
#useful code to see all the features of my database

pd.set_option('display.max_columns', None)

Our main database will undoubtedly be runs_df, because it records how each horse ran and its finish time (Y) in each race.

In [None]:
df=runs_df

Now let's delete the columns that are derived from other information or are not useful for our machine learning

In [None]:
# general columns that are derived from the result or not useful for the model
cols_to_drop = ['won', 'lengths_behind', 'horse_rating', 'horse_gear', 'draw',
                'win_odds', 'place_odds']
# per-section positions, margins and times (sections 1 to 6)
cols_to_drop += [f'{name}{i}' for name in ('position_sec', 'behind_sec', 'time') for i in range(1, 7)]
df = df.drop(columns=cols_to_drop)

Now let's add the interesting features from races_df to our dataframe

In [None]:
races_df.head(1)

In [None]:
runs_df.columns

I didn't add "race_no" because if the ground gets damaged during the races I know it from the 'going' variable. 

I didn't add "date" because I know the horse age.

In [None]:
df = pd.merge(df,races_df[['venue','config','surface','distance','going']],on='race_id', how='left')
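The join above matches a `race_id` column on the left with a `race_id` index on the right. As a minimal sketch on toy data (all values are invented, and `validate='many_to_one'` is an optional safeguard that the notebook itself does not use), the same pattern looks like this:

```python
import pandas as pd

# Toy stand-ins for runs.csv and races.csv (values invented for illustration)
toy_runs = pd.DataFrame({
    "race_id": [0, 0, 1],
    "horse_id": [10, 11, 10],
    "finish_time": [83.9, 84.1, 69.5],
})
toy_races = pd.DataFrame(
    {"distance": [1400, 1200], "going": ["GOOD", "GOOD TO FIRM"]},
    index=pd.Index([0, 1], name="race_id"),
)

# 'race_id' is a column on the left and the index name on the right;
# pandas resolves `on` against either. validate= raises an error if a
# race_id appears more than once in the right-hand table.
toy_merged = pd.merge(
    toy_runs, toy_races[["distance", "going"]],
    on="race_id", how="left", validate="many_to_one",
)
print(toy_merged)
```

With `how='left'` every run is kept, and the race-level columns are repeated for each horse in the same race.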

Including the raw IDs as features does not make sense: the model would treat each ID as a numeric quantity, which would hurt its performance.

Looking at the previous dataframe, I replace 'horse_id', 'jockey_id' and 'trainer_id' with the percentage of podium finishes (top-3 results) for each.

In [None]:
runs_df.head(3)

*New horse features*

In [None]:
# number of races run by each horse (count the rows, do not sum the finishing positions)
horse_tot_race=runs_df.groupby(['horse_id'])['result'].count().reset_index(name='horse_tot_race')

In [None]:
df=pd.merge(df,horse_tot_race,on='horse_id',how='left')

In [None]:
horse_tot_place=runs_df.groupby(['horse_id'])['result'].apply(lambda x: (x <=3).sum()).reset_index(name='horse_tot_place')

In [None]:
df=pd.merge(df,horse_tot_place,on='horse_id',how='left')

*New jockey features*

In [None]:
# number of races ridden by each jockey (count the rows, do not sum the finishing positions)
jockey_tot_race=runs_df.groupby(['jockey_id'])['result'].count().reset_index(name='jockey_tot_race')

In [None]:
df=pd.merge(df,jockey_tot_race,on='jockey_id',how='left')

In [None]:
jockey_tot_place=runs_df.groupby(['jockey_id'])['result'].apply(lambda x: (x <=3).sum()).reset_index(name='jockey_tot_place')

In [None]:
df=pd.merge(df,jockey_tot_place,on='jockey_id',how='left')

*New trainer features*

In [None]:
# number of races entered by each trainer (count the rows, do not sum the finishing positions)
trainer_tot_race=runs_df.groupby(['trainer_id'])['result'].count().reset_index(name='trainer_tot_race')

In [None]:
df=pd.merge(df,trainer_tot_race,on='trainer_id',how='left')

In [None]:
trainer_tot_place=runs_df.groupby(['trainer_id'])['result'].apply(lambda x: (x <=3).sum()).reset_index(name='trainer_tot_place')

In [None]:
df=pd.merge(df,trainer_tot_place,on='trainer_id',how='left')

Now that we have the new data, we can create the new columns with the placement percentages of the horses, jockeys and trainers.

In [None]:
#new horse features
df['horse_place_perc']=df['horse_tot_place']/df['horse_tot_race']

#new jockey features
df['jockey_place_perc']=df['jockey_tot_place']/df['jockey_tot_race']

#new trainer features
df['trainer_place_perc']=df['trainer_tot_place']/df['trainer_tot_race']
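The place-percentage idea can also be sketched in one pass on a toy table. This is only an illustration with invented values, and it assumes the denominator is the number of races run and that a "place" means finishing in the top 3:

```python
import pandas as pd

# Toy results table (invented values); 'result' is the finishing position
toy = pd.DataFrame({
    "horse_id": [1, 1, 1, 2, 2],
    "result":   [1, 5, 3, 7, 2],
})

# Per horse: number of races run and number of top-3 finishes
stats = toy.groupby("horse_id")["result"].agg(
    tot_race="count",
    tot_place=lambda s: (s <= 3).sum(),
)
stats["place_perc"] = stats["tot_place"] / stats["tot_race"]
print(stats)
```

Horse 1 places in 2 of 3 races and horse 2 in 1 of 2, so the ratio always stays between 0 and 1.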

We delete the columns that we no longer need.

In [None]:
df.columns

In [None]:
# the raw counts and the ids are no longer needed once the percentages exist
df=df.drop(columns=['horse_tot_place','horse_tot_race','horse_id',
                    'trainer_tot_place','trainer_tot_race','trainer_id',
                    'jockey_tot_place','jockey_tot_race','jockey_id'])

The goal is a model that predicts the finish time, so we also drop 'horse_no'. We keep 'race_id' for now because we need it later to build the test races.

In [None]:
df=df.drop('horse_no',axis=1)

[**Data Cleaning**](#4)

I sort the columns alphabetically and move "finish_time" (Y) to the first position.

In [None]:
df = df.sort_index(axis=1, ascending=True)

In [None]:
temp_cols=df.columns.tolist()
index=df.columns.get_loc("finish_time")
new_cols=temp_cols[index:index+1] + temp_cols[0:index] + temp_cols[index+1:]
df=df[new_cols]

In [None]:
df.shape

In [None]:
df.head(2)

There are 2 NA values, so I drop the rows that contain them.

In [None]:
df.isna().sum()

In [None]:
df=df.dropna()

Let's correct the dtypes and encode our columns

In [None]:
df.info()

In [None]:
# encode ordinal columns: config,going 
config_encoder = preprocessing.OrdinalEncoder()
df['config'] = config_encoder.fit_transform(df['config'].values.reshape(-1, 1))

going_encoder = preprocessing.OrdinalEncoder()
df['going'] = going_encoder.fit_transform(df['going'].values.reshape(-1, 1))

In [None]:
# encode nominal column: venue, horse_country, horse_type
venue_encoder = preprocessing.LabelEncoder()
df['venue'] = venue_encoder.fit_transform(df['venue'])

horse_country_encoder = preprocessing.LabelEncoder()
df['horse_country'] = horse_country_encoder.fit_transform(df['horse_country'])

horse_type_encoder = preprocessing.LabelEncoder()
df['horse_type'] = horse_type_encoder.fit_transform(df['horse_type'])
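One detail worth knowing about the ordinal encoding above: without an explicit `categories` argument, `OrdinalEncoder` orders labels alphabetically. A hedged sketch of passing the order by hand follows; the label list and its ordering are assumptions for illustration and should be checked against the actual values in the data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical subset of 'going' labels, listed from firmest to softest.
# The order is an assumption for illustration; check it against the data.
going_order = ["FIRM", "GOOD TO FIRM", "GOOD", "GOOD TO YIELDING", "YIELDING"]

toy = pd.DataFrame({"going": ["GOOD", "YIELDING", "FIRM", "GOOD TO FIRM"]})

# Without `categories`, OrdinalEncoder sorts labels alphabetically,
# which need not match the physical ordering of the track condition.
encoder = OrdinalEncoder(categories=[going_order])
toy["going_code"] = encoder.fit_transform(toy[["going"]]).ravel()
print(toy)
```

Tree-based models are fairly insensitive to this choice, but linear models read the codes as distances, so a meaningful order can matter there.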

In [None]:
df.head(3)

In [None]:
df.shape

Let's now see the correlation between the variables.

As expected, there is a very high correlation between finish time and distance.

In [None]:
corr=df.corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient(cmap='coolwarm').format(precision=2)
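When the full matrix is hard to read, one option is to look only at the correlations with the target, sorted by absolute value. A small sketch on invented toy data (the column names mirror the notebook, the numbers do not come from the dataset):

```python
import pandas as pd

# Toy data (invented): finish time grows almost linearly with distance
toy = pd.DataFrame({
    "finish_time": [57.5, 69.8, 100.2, 122.0, 70.3],
    "distance":    [1000, 1200, 1650, 2000, 1200],
    "draw":        [3, 7, 1, 5, 2],
})

# Correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["finish_time"]
    .drop("finish_time")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```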

[**Machine Learning**](#5)

As you can see from the database, we have kept the columns "result" and "race_id", which are not useful as features for machine learning.

These columns let us create 2 new databases to test our algorithm on a particular race.

In [None]:
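#1st database with race_id=0 , 2nd race_id=3
#.copy() gives independent frames, so adding columns later does not raise SettingWithCopyWarning
df_0=df[df.race_id==0].copy()
df_3=df[df.race_id==3].copy()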

In [None]:
df_0.shape

In [None]:
df_3.shape

In [None]:
df.shape

In [None]:
df=df[df.race_id!=0]

In [None]:
df=df[df.race_id!=3]

In [None]:
df.shape

In [None]:
#now I can drop race_id and result
df=df.drop('race_id',axis=1)
df=df.drop('result',axis=1)

df_0_ML=df_0
df_0_ML=df_0_ML.drop('race_id',axis=1)
df_0_ML=df_0_ML.drop('result',axis=1)

df_3_ML=df_3
df_3_ML=df_3_ML.drop('race_id',axis=1)
df_3_ML=df_3_ML.drop('result',axis=1)

[**Standardization**](#6)

The features are not on the same scale, so we have to *standardize* the variables in the X dataset.

In [None]:
X = df[df.columns[1:]] 
ss = preprocessing.StandardScaler()
X = pd.DataFrame(ss.fit_transform(X),columns = X.columns)

In [None]:
y_time = df.finish_time

In [None]:
#We are not going to use those db now
#reuse the scaler fitted on X: refitting on a single race would zero out race-constant columns such as distance
X_0= df_0_ML[df_0_ML.columns[1:]]
X_0= pd.DataFrame(ss.transform(X_0),columns = X_0.columns)

In [None]:
X_3 = df_3_ML[df_3_ML.columns[1:]] 
X_3 = pd.DataFrame(ss.transform(X_3),columns = X_3.columns)
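As a small sketch of why held-out rows are usually transformed with the scaler fitted on the training features: within a single race, columns such as distance are constant, so a scaler fitted on that race alone maps them to 0. The toy frames and their values below are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training features (invented): distance varies across races
train = pd.DataFrame({"distance": [1000.0, 1200.0, 1650.0, 2000.0],
                      "weight":   [1020.0, 1100.0, 1075.0, 1150.0]})
# One held-out race: every runner shares the same distance
race = pd.DataFrame({"distance": [1400.0, 1400.0, 1400.0],
                     "weight":   [1050.0, 1120.0, 1090.0]})

scaler = StandardScaler().fit(train)

# Reusing the training scaler keeps the race on the training scale
race_ok = pd.DataFrame(scaler.transform(race), columns=race.columns)

# Refitting on the race alone maps the constant column to 0
race_refit = pd.DataFrame(StandardScaler().fit_transform(race), columns=race.columns)

print(race_ok["distance"].unique())
print(race_refit["distance"].unique())
```

In the refitted version the model can no longer tell a 1400 m race from any other, which is why the first variant is generally preferred.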

Split in train (75%) and test (25%) sets

In [None]:
# split data into train and test sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y_time, train_size=0.75, test_size=0.25, random_state=1)

In [None]:
print('X_train', X_train.shape)
print('y_train', y_train.shape)
print('X_test', X_test.shape)
print('y_test', y_test.shape)

[**Cross Validation Models**](#7)

In [None]:
lr= linear_model.LinearRegression()
cv= cross_val_score(lr,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
knn= KNeighborsRegressor(n_neighbors=4)
cv= cross_val_score(knn,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
tree=DecisionTreeRegressor(random_state=1)
cv= cross_val_score(tree,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
rf=RandomForestRegressor(random_state=1)
cv= cross_val_score(rf,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
#VotingRegressor
voting_rg=VotingRegressor(estimators=[('lr',lr),('rf',rf)])
cv= cross_val_score(voting_rg,X_train,y_train,cv=5)
print(cv)
print(cv.mean())
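By default `cross_val_score` reports R^2 for regressors. The sketch below, on synthetic data rather than the race dataset, shows two optional variations: wrapping the scaler and the model in a pipeline so the scaler is refit inside each fold, and requesting MAE alongside R^2. All names here are illustrative assumptions, not part of the notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the race features (not the real dataset)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
y_toy = 60 + 5 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

# Scaler and model are refit on the training part of each fold
pipe = make_pipeline(StandardScaler(), LinearRegression())
folds = KFold(n_splits=5, shuffle=True, random_state=1)

r2_scores = cross_val_score(pipe, X_toy, y_toy, cv=folds, scoring="r2")
# scikit-learn returns negated errors so that "higher is better" always holds
mae_scores = -cross_val_score(pipe, X_toy, y_toy, cv=folds,
                              scoring="neg_mean_absolute_error")
print(r2_scores.mean(), mae_scores.mean())
```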

[**Predictions**](#8)

We keep only the 3 best models (Linear Regression, Random Forest and Voting Regressor) and make predictions on the test set.

In [None]:
#linear regression
lr.fit(X_train,y_train)
y_lr=lr.predict(X_test)

In [None]:
print("Linear Regression results:")
print("R^2",metrics.r2_score(y_test,y_lr))
print("Mean Absolute Error", metrics.mean_absolute_error(y_test,y_lr))
print("Mean Squared Error", metrics.mean_squared_error(y_test,y_lr))
# RMSE is the square root of the MSE (np.sqrt, not np.square)
print("Root Mean Squared Error",np.sqrt(metrics.mean_squared_error(y_test,y_lr)))

In [None]:
#Random Forest
rf.fit(X_train,y_train)
y_rf=rf.predict(X_test)

In [None]:
print("Random Forest results:")
print("R^2",metrics.r2_score(y_test,y_rf))
print("Mean Absolute Error", metrics.mean_absolute_error(y_test,y_rf))
print("Mean Squared Error", metrics.mean_squared_error(y_test,y_rf))
# RMSE is the square root of the MSE (np.sqrt, not np.square)
print("Root Mean Squared Error",np.sqrt(metrics.mean_squared_error(y_test,y_rf)))

In [None]:
#VotingRegressor
voting_rg.fit(X_train,y_train)
y_voting_rg=voting_rg.predict(X_test)

In [None]:
print("Voting Regressor results:")
print("R^2",metrics.r2_score(y_test,y_voting_rg))
print("Mean Absolute Error", metrics.mean_absolute_error(y_test,y_voting_rg))
print("Mean Squared Error", metrics.mean_squared_error(y_test,y_voting_rg))
# RMSE is the square root of the MSE (np.sqrt, not np.square)
print("Root Mean Squared Error",np.sqrt(metrics.mean_squared_error(y_test,y_voting_rg)))

Judging by MSE and MAE, Random Forest is the best model, followed by the Voting Regressor, with Linear Regression third.
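To make the relationship between these error metrics concrete, here is a tiny worked sketch on invented finish times (in seconds). It only illustrates the definitions and does not reproduce the notebook's results:

```python
import numpy as np
from sklearn import metrics

# Toy finish times in seconds (invented for illustration)
y_true = np.array([70.0, 82.5, 100.0, 69.0])
y_pred = np.array([71.0, 82.0, 98.0, 69.5])

mae = metrics.mean_absolute_error(y_true, y_pred)
mse = metrics.mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE, back in seconds

print(mae, mse, rmse)
```

The absolute errors are 1, 0.5, 2 and 0.5 seconds, so MAE is 1.0 and MSE is 1.375. RMSE is the square root of that, which keeps it in the same unit as the target.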

[**Conclusion Test**](#8)

We now predict the finish time of each horse in race_id=0 and race_id=3 and check whether our algorithm identifies the winner.

In [None]:
df_0['pred']=voting_rg.predict(X_0)

In [None]:
df_0 = df_0[['finish_time','pred','result']]

In [None]:
df_0['result_pred'] = df_0['pred'].rank(ascending=True).astype(int)

In [None]:
df_0

We see that the algorithm has guessed 2 of the 3 placed horses.

Now we use the second test database.

In [None]:
df_3['pred']=voting_rg.predict(X_3)

In [None]:
df_3 = df_3[['finish_time','pred','result']]

In [None]:
df_3['result_pred'] = df_3['pred'].rank(ascending=True).astype(int)

In [None]:
df_3

In this case too, our algorithm has identified the winner and 2 of the 3 placed horses.
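The by-eye comparison of `result` and `result_pred` can also be done programmatically. Below is a minimal sketch on an invented five-horse race; `winner_hit` and `placed_hits` are hypothetical helper names, and `method="first"` is one assumed way to break ties between equal predicted times:

```python
import pandas as pd

# Toy race (invented): actual finishing position and predicted time
race = pd.DataFrame({
    "result": [1, 2, 3, 4, 5],
    "pred":   [69.8, 70.4, 70.1, 69.9, 71.0],
})

# Rank predicted times; method="first" breaks ties by row order
race["result_pred"] = race["pred"].rank(method="first").astype(int)

# Row labels of the horses that actually placed and of those predicted to place
placed_true = set(race.index[race["result"] <= 3])
placed_pred = set(race.index[race["result_pred"] <= 3])

winner_hit = bool((race.loc[race["result"] == 1, "result_pred"] == 1).all())
placed_hits = len(placed_true & placed_pred)
print(winner_hit, placed_hits)
```

In this toy race the predicted winner is correct and 2 of the 3 placed horses are found.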

Thanks for your attention :)