# Predicting EPL Match Winners

**Background**

In this project, we'll predict match winners in the English Premier League (EPL) using machine learning. The match data covers the 2020-2021 and 2021-2022 seasons. (The data was scraped partway through the 2021-2022 season, so the match history for that season is incomplete.)

**Python packages**

- `pandas`
- `requests`
- `BeautifulSoup`
- `scikit-learn`

**Steps**

1. Investigating Missing Data
2. Cleaning Data for ML Algorithm
3. Creating Predictors for ML Algorithm
4. Training the initial ML Algorithm
5. Improving the Model with Rolling Averages 
6. Retraining the Model 
7. Combining Home and Away Predictions 

## 1. Investigating Missing Data 

In [2]:
import pandas as pd  

In [3]:
#read dataset
matches = pd.read_csv("data/matches.csv", index_col=0)
#first 5 entries of the dataset
matches.head()

| | date | time | comp | round | day | venue | result | gf | ga | opponent | ... | match report | notes | sh | sot | dist | fk | pk | pkatt | season | team |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2021-08-15 | 16:30 | Premier League | Matchweek 1 | Sun | Away | L | 0.0 | 1.0 | Tottenham | ... | Match Report | | 18.0 | 4.0 | 16.9 | 1.0 | 0.0 | 0.0 | 2022 | Manchester City |
| 2 | 2021-08-21 | 15:00 | Premier League | Matchweek 2 | Sat | Home | W | 5.0 | 0.0 | Norwich City | ... | Match Report | | 16.0 | 4.0 | 17.3 | 1.0 | 0.0 | 0.0 | 2022 | Manchester City |
| 3 | 2021-08-28 | 12:30 | Premier League | Matchweek 3 | Sat | Home | W | 5.0 | 0.0 | Arsenal | ... | Match Report | | 25.0 | 10.0 | 14.3 | 0.0 | 0.0 | 0.0 | 2022 | Manchester City |
| 4 | 2021-09-11 | 15:00 | Premier League | Matchweek 4 | Sat | Away | W | 1.0 | 0.0 | Leicester City | ... | Match Report | | 25.0 | 8.0 | 14.0 | 0.0 | 0.0 | 0.0 | 2022 | Manchester City |
| 6 | 2021-09-18 | 15:00 | Premier League | Matchweek 5 | Sat | Home | D | 0.0 | 0.0 | Southampton | ... | Match Report | | 16.0 | 1.0 | 15.7 | 1.0 | 0.0 | 0.0 | 2022 | Manchester City |


In [4]:
# number of rows and columns in the dataset
matches.shape

(1389, 27)

Each EPL season has 20 teams, and every team plays each of the other 19 teams twice (home and away), for 38 matches per team. Because the dataset has one row per team per match, the expected number of rows for 2 seasons is:

In [5]:
38 * 20 * 2

1520

The dataset has only 1389 rows, fewer than the expected 1520, so some clubs must be missing matches.

In [6]:
matches['team'].value_counts()

Southampton                 72
Brighton and Hove Albion    72
Manchester United           72
West Ham United             72
Newcastle United            72
Burnley                     71
Leeds United                71
Crystal Palace              71
Manchester City             71
Wolverhampton Wanderers     71
Tottenham Hotspur           71
Arsenal                     71
Leicester City              70
Chelsea                     70
Aston Villa                 70
Everton                     70
Liverpool                   38
Fulham                      38
West Bromwich Albion        38
Sheffield United            38
Brentford                   34
Watford                     33
Norwich City                33
Name: team, dtype: int64

Three teams are relegated each season, so some clubs appear in only one of the two seasons. Teams with 38 or fewer matches in the dataset may therefore have only one season of data. To check this assumption, we inspect Liverpool's data.

In [8]:
# matches[matches['team']=='Liverpool']
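
One way to check the one-season assumption for every club at once, rather than one team at a time, is to count rows per team and season. The sketch below runs on a small made-up frame (`toy`) that reuses the `team` and `season` column names from `matches`; the club names and counts are illustrative only and are not taken from the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for `matches`: one row per team per match.
toy = pd.DataFrame(
    {
        "team": ["Club A"] * 5 + ["Club B"] * 3,
        "season": [2021, 2021, 2022, 2022, 2022, 2021, 2021, 2021],
    }
)

# Rows per (team, season); a season with no rows shows up as 0 after unstack.
per_season = toy.groupby(["team", "season"]).size().unstack(fill_value=0)
print(per_season)

# Teams that have no rows at all for at least one season.
single_season_teams = per_season[(per_season == 0).any(axis=1)].index.tolist()
print(single_season_teams)
```

Applied to the real data, the same pattern would use `matches` in place of `toy`. A team with a zero in one column has data for only one season, which separates relegated or promoted clubs from clubs whose rows are missing for another reason.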

## 2. Cleaning Data for ML Algorithm

In [9]:
# check value types of all columns 
matches.dtypes

date             object
time             object
comp             object
round            object
day              object
venue            object
result           object
gf              float64
ga              float64
opponent         object
xg              float64
xga             float64
poss            float64
attendance      float64
captain          object
formation        object
referee          object
match report     object
notes           float64
sh              float64
sot             float64
dist            float64
fk              float64
pk              float64
pkatt           float64
season            int64
team             object
dtype: object

The `date` column is stored as the generic `object` type (strings), so it must be converted to a datetime type before date-based features can be derived from it.

In [10]:
matches['date'] = pd.to_datetime(matches['date'])

In [12]:
matches['date'].dtypes

dtype('<M8[ns]')
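
The `<M8[ns]` shown above is NumPy's shorthand for a `datetime64[ns]` dtype. As a minimal sketch on made-up values, assuming the dates are ISO-formatted strings like those shown in `matches.head()`, an explicit `format` makes the parse stricter, and the `.dt` accessor becomes available once the conversion succeeds:

```python
import pandas as pd

# Two sample date strings in the same YYYY-MM-DD layout as the dataset.
dates = pd.Series(["2021-08-15", "2021-08-21"])

# An explicit format raises an error on strings that do not match it.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# Confirm the conversion without depending on the exact dtype string.
print(pd.api.types.is_datetime64_any_dtype(parsed))

# Datetime-only functionality is now available through `.dt`.
print(parsed.dt.day_name().tolist())
```

The two day names agree with the `day` column (`Sun`, `Sat`) for the same dates in the dataset preview.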

## 3. Creating Predictors for ML Algorithm

Because the model chosen for this project is a Random Forest, all of the predictors must be numeric. Some of the predictors are still strings, so they need to be encoded before training the initial ML algorithm.

In [16]:
# to differentiate home or away matches
matches['venue_code'] = matches['venue'].astype('category').cat.codes 

In [17]:
# to encode each opponent as an integer (codes follow alphabetical order)
matches['opp_code'] = matches['opponent'].astype('category').cat.codes 

In [18]:
# to retrieve matches' time of the day
matches['hour'] = matches['time'].str.replace(":.+", "", regex=True).astype('int')

In [19]:
# to retrieve the day of the week as a number (Monday = 0, Sunday = 6)
matches['day_code'] = matches['date'].dt.dayofweek

In [20]:
# to differentiate if the final result is a win or not 
matches['target'] = (matches['result'] == 'W').astype('int')

In [24]:
matches[['venue_code','opp_code','hour','day_code','target']].head()

| | venue_code | opp_code | hour | day_code | target |
|---|---|---|---|---|---|
| 1 | 0 | 18 | 16 | 6 | 0 |
| 2 | 1 | 15 | 15 | 5 | 1 |
| 3 | 1 | 0 | 12 | 5 | 1 |
| 4 | 0 | 10 | 15 | 5 | 1 |
| 6 | 1 | 17 | 15 | 5 | 0 |


The encoded values look as expected, so the data is ready for training the ML model.
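
Because `cat.codes` assigns integers according to the sorted category labels, it can help to print the code-to-label mapping before relying on it. The following sketch repeats the encoding steps on a small made-up frame (`toy`) that reuses column names from `matches`; the names `venue_cat` and `venue_mapping` are hypothetical helpers, and `str.split` is used here as an alternative to the regex replacement for extracting the hour.

```python
import pandas as pd

# Hypothetical three-row stand-in for `matches`.
toy = pd.DataFrame(
    {
        "venue": ["Away", "Home", "Home"],
        "time": ["16:30", "15:00", "12:30"],
        "date": pd.to_datetime(["2021-08-15", "2021-08-21", "2021-08-28"]),
        "result": ["L", "W", "W"],
    }
)

# Keep the categorical around so the code-to-label mapping can be inspected.
venue_cat = toy["venue"].astype("category")
toy["venue_code"] = venue_cat.cat.codes
venue_mapping = dict(enumerate(venue_cat.cat.categories))
print(venue_mapping)  # {0: 'Away', 1: 'Home'}

# Hour of kick-off: the text before the colon, converted to an integer.
toy["hour"] = toy["time"].str.split(":").str[0].astype(int)

# Day of the week as a number (Monday = 0, Sunday = 6).
toy["day_code"] = toy["date"].dt.dayofweek

# Binary target: 1 for a win, 0 for a draw or loss.
toy["target"] = (toy["result"] == "W").astype(int)

print(toy[["venue_code", "hour", "day_code", "target"]])
```

One caveat with this style of encoding is that the codes depend on which labels are present in the data, so the same opponent can receive a different code if the set of opponents changes.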

## 4. Training the initial ML Algorithm