# Predicting EPL Match Winners

**Background**

In this project, we'll predict match winners in the English Premier League (EPL) using machine learning. The match data covers the 2020-2021 and 2021-2022 seasons. (The data was scraped partway through the 2021-2022 season, so the match history for that season is incomplete.)

**Python packages**

- `pandas`
- `requests`
- `BeautifulSoup`
- `scikit-learn`

**Steps**

1. Investigating Missing Data
2. Cleaning Data for ML Algorithm
3. Creating Predictors for ML Algorithm
4. Training the initial ML Model 
5. Improving the Model precision with Rolling Averages 
6. Retraining the Model 
7. Combining Home and Away Predictions 

## 1. Investigating Missing Data 

In [1]:
import pandas as pd  

In [2]:
#read dataset
matches = pd.read_csv("data/matches.csv", index_col=0)
#first 5 entries of the dataset
matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,Match Report,,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,Match Report,,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,Match Report,,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,Match Report,,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,Match Report,,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City


In [3]:
#the total number of entries in the dataset
matches.shape

(1389, 27)

There are 20 teams in each EPL season, and each team plays every other team twice (home and away), which gives 38 matches per team. The dataset holds one row per team per match, so the expected number of rows for 2 seasons is:

In [4]:
38 * 20 * 2

1520

The dataset has only 1389 rows, fewer than the expected 1520, so data must be missing for certain clubs.
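As a quick check of the arithmetic, the sketch below recomputes the expected row count and the shortfall against the 1389 rows reported by `matches.shape`. The variable names are illustrative only and do not appear elsewhere in the notebook.

```python
# Each of the 20 teams plays the other 19 twice (home and away).
teams_per_season = 20
matches_per_team = (teams_per_season - 1) * 2  # 38
seasons = 2

# One row per team per match, so every fixture is counted once for each side.
expected_rows = teams_per_season * matches_per_team * seasons
actual_rows = 1389  # first element of matches.shape above

missing_rows = expected_rows - actual_rows
print(expected_rows, missing_rows)
```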

In [5]:
matches['team'].value_counts()

Southampton                 72
Brighton and Hove Albion    72
Manchester United           72
West Ham United             72
Newcastle United            72
Burnley                     71
Leeds United                71
Crystal Palace              71
Manchester City             71
Wolverhampton Wanderers     71
Tottenham Hotspur           71
Arsenal                     71
Leicester City              70
Chelsea                     70
Aston Villa                 70
Everton                     70
Liverpool                   38
Fulham                      38
West Bromwich Albion        38
Sheffield United            38
Brentford                   34
Watford                     33
Norwich City                33
Name: team, dtype: int64

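Three teams are relegated each season and replaced by three promoted teams, so a club that spent only one of the two seasons in the EPL has at most 38 rows. That accounts for Fulham, West Bromwich Albion and Sheffield United (38 rows each) and for Brentford, Watford and Norwich City (33 or 34 rows from the partial 2021-2022 season). Liverpool, however, played in both seasons but has only 38 rows, which suggests that one of its seasons is missing from the dataset. To check this assumption, we investigate Liverpool's data.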

In [6]:
# matches[matches['team']=='Liverpool']
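One way to check for a missing season is to count rows per team and season. The sketch below does this on a small made-up frame that stands in for `matches` (the teams "A" and "B" and their rows are invented for illustration); the same `groupby` call should work on the real data, assuming the `team` and `season` columns shown earlier.

```python
import pandas as pd

# Toy stand-in for `matches`; the rows are made up for illustration.
toy = pd.DataFrame({
    "team":   ["A", "A", "A", "B", "B"],
    "season": [2021, 2021, 2022, 2022, 2022],
})

# Rows per team and season; a missing (team, season) pair shows up as 0.
season_counts = toy.groupby(["team", "season"]).size().unstack(fill_value=0)
print(season_counts)
```

A team with a zero in one of the season columns has no rows for that season.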

## 2. Cleaning Data for ML Algorithm

In [7]:
# check value types of all columns 
matches.dtypes

date             object
time             object
comp             object
round            object
day              object
venue            object
result           object
gf              float64
ga              float64
opponent         object
xg              float64
xga             float64
poss            float64
attendance      float64
captain          object
formation        object
referee          object
match report     object
notes           float64
sh              float64
sot             float64
dist            float64
fk              float64
pk              float64
pkatt           float64
season            int64
team             object
dtype: object

The `date` column is stored as `object` (string). We convert it to a datetime type so that we can sort and filter by date later.

In [8]:
matches['date'] = pd.to_datetime(matches['date'])

In [9]:
matches['date'].dtypes

dtype('<M8[ns]')

## 3. Creating Predictors for ML Algorithm

The model chosen for this project is a random forest classifier (`RandomForestClassifier`), which requires all predictors to be numeric. Some of the predictors are still strings, so we encode them as numbers before training the initial model.

In [10]:
# to differentiate home or away matches
matches['venue_code'] = matches['venue'].astype('category').cat.codes 

In [11]:
# to encode each opponent as an integer code (categories are assigned in alphabetical order)
matches['opp_code'] = matches['opponent'].astype('category').cat.codes 

In [12]:
# to extract the kick-off hour from the time string (e.g. "16:30" becomes 16)
matches['hour'] = matches['time'].str.replace(":.+", "", regex=True).astype('int')

In [13]:
# to retrieve the day number of the week 
matches['day_code'] = matches['date'].dt.dayofweek

In [14]:
# to differentiate if the final result is a win or not 
matches['target'] = (matches['result'] == 'W').astype('int')

In [15]:
matches[['venue_code','opp_code','hour','day_code','target']].head()

Unnamed: 0,venue_code,opp_code,hour,day_code,target
1,0,18,16,6,0
2,1,15,15,5,1
3,1,0,12,5,1
4,0,10,15,5,1
6,1,17,15,5,0


The encoded columns are all numeric, so they are ready to be used for training the ML model.
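To make the category encoding concrete, the sketch below applies the same `astype('category').cat.codes` call to a short, made-up list of opponents and recovers the code-to-name mapping. The names `opponents` and `mapping` are illustrative only.

```python
import pandas as pd

# A made-up sample of opponent names.
opponents = pd.Series(["Tottenham", "Arsenal", "Leicester City", "Arsenal"])

# Converting to a categorical sorts the distinct values, then numbers them from 0.
as_category = opponents.astype("category")
codes = as_category.cat.codes

# Map each integer code back to the name it stands for.
mapping = dict(enumerate(as_category.cat.categories))
print(codes.tolist())
print(mapping)
```

The codes depend on which values are present, so the same opponent can get a different code if the set of opponents changes.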

## 4. Training the initial ML Algorithm

In [16]:
from sklearn.ensemble import RandomForestClassifier

In [17]:
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)

Now we split the original dataset to define the train and test sets for the algorithm.

In [18]:
#Define the train data set
train = matches[matches["date"] < '2022-01-01']

In [19]:
#Define the test data set (the strict > means any match dated exactly 2022-01-01 falls in neither set)
test = matches[matches["date"] > '2022-01-01']
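A detail worth checking in a date-based split is the boundary. The sketch below uses three made-up rows to show that with `<` for training and `>` for testing, a row dated exactly on the cutoff lands in neither set, while `>=` keeps it in the test set. Whether this matters for the real data depends on whether any match was played on the cutoff date; the frame and variable names here are hypothetical.

```python
import pandas as pd

# Toy data: one row before, one on, and one after the cutoff date.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-28", "2022-01-01", "2022-01-02"]),
    "target": [1, 0, 1],
})
cutoff = "2022-01-01"

train_toy = toy[toy["date"] < cutoff]        # rows strictly before the cutoff
test_strict = toy[toy["date"] > cutoff]      # drops the row dated on the cutoff
test_inclusive = toy[toy["date"] >= cutoff]  # keeps the row dated on the cutoff

print(len(train_toy), len(test_strict), len(test_inclusive))
```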

In [20]:
predictors = ["venue_code", "opp_code", "hour", "day_code"]

In [21]:
rf.fit(train[predictors], train["target"])

RandomForestClassifier(min_samples_split=10, n_estimators=50, random_state=1)

In [22]:
preds = rf.predict(test[predictors])

In [23]:
from sklearn.metrics import accuracy_score 

In [24]:
acc = accuracy_score(test["target"], preds)
print("Accuracy score:", acc)

Accuracy score: 0.6123188405797102


In [25]:
#Compare the actual and predicted values 
combined = pd.DataFrame(dict(actual = test["target"], prediction = preds))

In [26]:
pd.crosstab(index=combined["actual"], columns=combined["prediction"])

actual \ prediction,0,1
0,141,31
1,76,28


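The model is right more often when it predicts a non-win (a draw or a loss): 141 of its 217 non-win predictions are correct. Its win predictions are weaker: only 28 of the 59 predicted wins were actual wins, and it missed 76 of the 104 actual wins. Accuracy hides this weakness, so we also look at precision, the share of predicted wins that were correct.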

In [27]:
from sklearn.metrics import precision_score

In [28]:
precision_score(test["target"], preds)

0.4745762711864407
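Both scores can be recomputed by hand from the four counts in the crosstab above, which makes it clear what each metric measures. The variable names below follow the usual confusion-matrix abbreviations and are not part of the original notebook.

```python
# Counts taken from the crosstab above (rows: actual, columns: prediction).
tn, fp = 141, 31   # actual non-wins: predicted non-win / predicted win
fn, tp = 76, 28    # actual wins:     predicted non-win / predicted win

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: share of predicted wins that were actual wins.
precision = tp / (tp + fp)

print(accuracy, precision)
```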

## 5. Improving the Model precision with Rolling Averages 

In [29]:
# Group the dataset by single team 
grouped_matches = matches.groupby("team")

In [32]:
# Inspect Manchester United's matches in ascending date order
group = grouped_matches.get_group("Manchester United").sort_values("date")
group.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
0,2020-09-19,17:30,Premier League,Matchweek 2,Sat,Home,L,1.0,3.0,Crystal Palace,...,1.0,0.0,0.0,2021,Manchester United,1,6,17,5,0
2,2020-09-26,12:30,Premier League,Matchweek 3,Sat,Away,W,3.0,2.0,Brighton,...,0.0,1.0,1.0,2021,Manchester United,0,3,12,5,1
4,2020-10-04,16:30,Premier League,Matchweek 4,Sun,Home,L,1.0,6.0,Tottenham,...,0.0,1.0,1.0,2021,Manchester United,1,18,16,6,0
5,2020-10-17,20:00,Premier League,Matchweek 5,Sat,Away,W,4.0,1.0,Newcastle Utd,...,0.0,0.0,1.0,2021,Manchester United,0,14,20,5,1
7,2020-10-24,17:30,Premier League,Matchweek 6,Sat,Home,D,0.0,0.0,Chelsea,...,0.0,0.0,0.0,2021,Manchester United,1,5,17,5,0


In [33]:
# Create the rolling averages calculation function
def rolling_averages(group, cols, new_cols):
    # Compute rolling means of `cols` for one team and store them in `new_cols`
    group = group.sort_values("date")
    # closed='left' excludes the current match, so each value averages the previous 3 matches only
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    group[new_cols] = rolling_stats
    # A team's first 3 matches have no full window of past matches, so drop them
    group = group.dropna(subset=new_cols)
    return group
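The effect of `closed='left'` is easiest to see on a tiny series. In the sketch below (the values are made up), each rolling mean uses only the three rows before the current one, so the first three rows have no full window and come out as `NaN`.

```python
import pandas as pd

# Made-up goals scored in five consecutive matches.
goals = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Window of 3 that stops just before the current row.
rolled = goals.rolling(3, closed="left").mean()
print(rolled.tolist())  # [nan, nan, nan, 2.0, 3.0]
```

Leaving the current match out of the window matters here because its statistics are only known after the result we are trying to predict.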

In [35]:
# Define the input columns and create new columns respectively 
cols = ["gf", "ga", "sh", "sot", "dist", "fk", "pk", "pkatt"]
new_cols = [f"{c}_rolling" for c in cols]

In [36]:
new_cols

['gf_rolling',
 'ga_rolling',
 'sh_rolling',
 'sot_rolling',
 'dist_rolling',
 'fk_rolling',
 'pk_rolling',
 'pkatt_rolling']

In [37]:
# Apply the rolling avg function 
rolling_averages(group, cols, new_cols) 

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
5,2020-10-17,20:00,Premier League,Matchweek 5,Sat,Away,W,4.0,1.0,Newcastle Utd,...,5,1,1.666667,3.666667,9.000000,2.333333,19.800000,0.333333,0.666667,0.666667
7,2020-10-24,17:30,Premier League,Matchweek 6,Sat,Home,D,0.0,0.0,Chelsea,...,5,0,2.666667,3.000000,12.333333,4.666667,20.000000,0.000000,0.666667,1.000000
9,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Home,L,0.0,1.0,Arsenal,...,6,0,1.666667,2.333333,15.000000,5.333333,19.700000,0.000000,0.333333,0.666667
11,2020-11-07,12:30,Premier League,Matchweek 8,Sat,Away,W,3.0,1.0,Everton,...,5,1,1.333333,0.666667,16.666667,5.666667,18.300000,0.000000,0.000000,0.333333
12,2020-11-21,20:00,Premier League,Matchweek 9,Sat,Home,W,1.0,0.0,West Brom,...,5,1,1.000000,0.666667,12.000000,3.666667,18.733333,0.666667,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40,2022-04-02,17:30,Premier League,Matchweek 31,Sat,Home,D,1.0,1.0,Leicester City,...,5,0,1.333333,2.000000,12.666667,3.666667,15.333333,0.333333,0.000000,0.000000
41,2022-04-09,12:30,Premier League,Matchweek 32,Sat,Away,L,0.0,1.0,Everton,...,5,0,1.666667,2.333333,9.000000,4.333333,14.333333,0.000000,0.000000,0.000000
42,2022-04-16,15:00,Premier League,Matchweek 33,Sat,Home,W,3.0,2.0,Norwich City,...,5,1,1.333333,1.333333,11.333333,5.000000,15.333333,0.000000,0.000000,0.000000
43,2022-04-19,20:00,Premier League,Matchweek 30,Tue,Away,L,0.0,4.0,Liverpool,...,1,0,1.333333,1.333333,14.333333,6.000000,15.666667,0.333333,0.000000,0.000000


Now we apply the function to all clubs in the original dataset:

In [38]:
matches_rolling = matches.groupby('team').apply(lambda x: rolling_averages(x, cols, new_cols))

In [39]:
matches_rolling

team,,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
Arsenal,6,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Home,W,2.0,1.0,Sheffield Utd,...,6,1,2.000000,1.333333,7.666667,3.666667,14.733333,0.666667,0.000000,0.000000
Arsenal,7,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0.0,1.0,Manchester City,...,5,0,1.666667,1.666667,5.333333,3.666667,15.766667,0.000000,0.000000,0.000000
Arsenal,9,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0.0,1.0,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.666667,16.733333,0.666667,0.000000,0.000000
Arsenal,11,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Away,W,1.0,0.0,Manchester Utd,...,6,1,0.666667,1.000000,9.666667,4.000000,16.033333,1.000000,0.000000,0.000000
Arsenal,13,2020-11-08,19:15,Premier League,Matchweek 8,Sun,Home,L,0.0,3.0,Aston Villa,...,6,0,0.333333,0.666667,9.666667,2.666667,18.033333,1.000000,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wolverhampton Wanderers,32,2022-03-13,14:00,Premier League,Matchweek 29,Sun,Away,W,1.0,0.0,Everton,...,6,1,1.333333,1.000000,12.333333,3.666667,19.300000,0.000000,0.000000,0.000000
Wolverhampton Wanderers,33,2022-03-18,20:00,Premier League,Matchweek 30,Fri,Home,L,2.0,3.0,Leeds United,...,4,0,1.666667,0.666667,12.333333,4.333333,19.600000,0.000000,0.000000,0.000000
Wolverhampton Wanderers,34,2022-04-02,15:00,Premier League,Matchweek 31,Sat,Home,W,2.0,1.0,Aston Villa,...,5,1,2.333333,1.000000,13.000000,5.333333,19.833333,0.000000,0.000000,0.000000
Wolverhampton Wanderers,35,2022-04-08,20:00,Premier League,Matchweek 32,Fri,Away,L,0.0,1.0,Newcastle Utd,...,4,0,1.666667,1.333333,13.000000,5.000000,18.533333,0.000000,0.000000,0.000000


We do not need the extra index level that holds the team name, so we can drop it.

In [40]:
matches_rolling = matches_rolling.droplevel('team')

The updated dataset now has a single-level index:

In [41]:
matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
6,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Home,W,2.0,1.0,Sheffield Utd,...,6,1,2.000000,1.333333,7.666667,3.666667,14.733333,0.666667,0.000000,0.000000
7,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0.0,1.0,Manchester City,...,5,0,1.666667,1.666667,5.333333,3.666667,15.766667,0.000000,0.000000,0.000000
9,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0.0,1.0,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.666667,16.733333,0.666667,0.000000,0.000000
11,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Away,W,1.0,0.0,Manchester Utd,...,6,1,0.666667,1.000000,9.666667,4.000000,16.033333,1.000000,0.000000,0.000000
13,2020-11-08,19:15,Premier League,Matchweek 8,Sun,Home,L,0.0,3.0,Aston Villa,...,6,0,0.333333,0.666667,9.666667,2.666667,18.033333,1.000000,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32,2022-03-13,14:00,Premier League,Matchweek 29,Sun,Away,W,1.0,0.0,Everton,...,6,1,1.333333,1.000000,12.333333,3.666667,19.300000,0.000000,0.000000,0.000000
33,2022-03-18,20:00,Premier League,Matchweek 30,Fri,Home,L,2.0,3.0,Leeds United,...,4,0,1.666667,0.666667,12.333333,4.333333,19.600000,0.000000,0.000000,0.000000
34,2022-04-02,15:00,Premier League,Matchweek 31,Sat,Home,W,2.0,1.0,Aston Villa,...,5,1,2.333333,1.000000,13.000000,5.333333,19.833333,0.000000,0.000000,0.000000
35,2022-04-08,20:00,Premier League,Matchweek 32,Fri,Away,L,0.0,1.0,Newcastle Utd,...,4,0,1.666667,1.333333,13.000000,5.000000,18.533333,0.000000,0.000000,0.000000


The dataset should have 1317 entries, but the index values never get that high. Each row kept the index it had within its own team, so the same values repeat across teams. We can therefore replace the index of `matches_rolling` with a plain range so that every row has a unique value.

In [43]:
matches_rolling.index = range(matches_rolling.shape[0])
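The sketch below reproduces the situation on a made-up frame: two teams whose rows share the same index values, as happens after dropping the `team` level. It relabels the rows with a range, as in the cell above, and also shows `reset_index(drop=True)`, which is a common alternative that gives the same result. The frame and variable names are hypothetical.

```python
import pandas as pd

# Toy frame whose index repeats per team, as after droplevel('team').
toy = pd.DataFrame(
    {"team": ["A", "A", "B", "B"], "gf": [1, 2, 0, 3]},
    index=[0, 1, 0, 1],
)
has_duplicates = toy.index.duplicated().any()

# Option 1: assign a fresh range, as done in the notebook.
relabelled = toy.copy()
relabelled.index = range(relabelled.shape[0])

# Option 2: let pandas build the new 0..n-1 index and discard the old one.
reset = toy.reset_index(drop=True)

print(has_duplicates, relabelled.index.tolist(), reset.index.tolist())
```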

In [44]:
matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
0,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Home,W,2.0,1.0,Sheffield Utd,...,6,1,2.000000,1.333333,7.666667,3.666667,14.733333,0.666667,0.000000,0.000000
1,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0.0,1.0,Manchester City,...,5,0,1.666667,1.666667,5.333333,3.666667,15.766667,0.000000,0.000000,0.000000
2,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0.0,1.0,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.666667,16.733333,0.666667,0.000000,0.000000
3,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Away,W,1.0,0.0,Manchester Utd,...,6,1,0.666667,1.000000,9.666667,4.000000,16.033333,1.000000,0.000000,0.000000
4,2020-11-08,19:15,Premier League,Matchweek 8,Sun,Home,L,0.0,3.0,Aston Villa,...,6,0,0.333333,0.666667,9.666667,2.666667,18.033333,1.000000,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1312,2022-03-13,14:00,Premier League,Matchweek 29,Sun,Away,W,1.0,0.0,Everton,...,6,1,1.333333,1.000000,12.333333,3.666667,19.300000,0.000000,0.000000,0.000000
1313,2022-03-18,20:00,Premier League,Matchweek 30,Fri,Home,L,2.0,3.0,Leeds United,...,4,0,1.666667,0.666667,12.333333,4.333333,19.600000,0.000000,0.000000,0.000000
1314,2022-04-02,15:00,Premier League,Matchweek 31,Sat,Home,W,2.0,1.0,Aston Villa,...,5,1,2.333333,1.000000,13.000000,5.333333,19.833333,0.000000,0.000000,0.000000
1315,2022-04-08,20:00,Premier League,Matchweek 32,Fri,Away,L,0.0,1.0,Newcastle Utd,...,4,0,1.666667,1.333333,13.000000,5.000000,18.533333,0.000000,0.000000,0.000000


## 6. Retraining the Model

## 7. Combining Home and Away Predictions