# Predicting English Premier League Matches

In my past projects I worked only with datasets that were given to me. For my next project, I wanted to scrape the data myself, since I think this is an important skill for a data scientist.

Since I love football (soccer), I thought that it would be fun to scrape football data - more specifically, English Premier League matches - and use it to try to predict future matches. 

In the following sections I'll explain the process and the reasoning behind it.

### Sections:
1. [Loading Packages and Files](#Loading-Packages-and-Files)
2. [Data Scraping](#Data-Scraping)<br>
    2.1. [Setting Scraping Variables](#Setting-Scraping-Variables) <br>
    2.2. [Scraping Data](#Scraping-Data)
3. [Building our DataFrame](#Building-our-DataFrame)
4. [Models](#Models) <br>
    4.1. [Random Forest](#Random-Forest) <br>
    4.2. [Neural Network](#Neural-Network) <br>
    4.3. [Betting Odds Comparison](#Betting-Odds-Comparison) <br>
5. [Conclusion](#Conclusion)

## Loading Packages and Files

We start by loading the following standard libraries:

In [1]:
import pandas as pd
import numpy as np

import re
import datetime

We now set a few global variables that will be used later:

In [2]:
#to build our database, will be explained later.
LAST_MATCHES = datetime.timedelta(days = 15)
OUTCOMES = {'W' : 3, 'D' : 1, 'L': 0}

# for the train-test split
RANDOM_STATE = 42

## Data Scraping

First, we load libraries that will help our scraping process:

In [3]:
import requests
from bs4 import BeautifulSoup as BS

We will be scraping data from [Football Reference (FBRef)](http://fbref.com). Since FBRef only has xG and xGA records for English Premier League seasons from 2017-2018 onwards, we chose to scrape three seasons: 2017-2018, 2018-2019 and 2019-2020. Note that xG and xGA are both important metrics of team performance in matches, since they track the expected goals and expected goals against, respectively.

We now define a few variables that will help our scraping process:

### Setting Scraping Variables

In [4]:
SEASONS = ['2017-2018', '2018-2019', '2019-2020']
BASE_URL = r'http://fbref.com'

# codes, names and numbers/IDs (from FBRef) for each Premier League team from 2017 through 2020
TEAMS_CODES = ['ARS', 'AVL', 'BOU', 'BHA', 'BUR', 'CAR', 'CHE', 'CRY', 'EVE', 'FUL',
               'HUD', 'LEI', 'LIV', 'MCI', 'MUN', 'NEW', 'NOR', 
               'SHU', 'SOU', 'STK', 'SWA', 'TOT', 'WAT', 'WBA', 'WHU', 'WOL']
TEAMS_NAMES = ['Arsenal', 'Aston Villa', 'Bournemouth', 'Brighton', 'Burnley', 'Cardiff City', 'Chelsea', 'Crystal Palace', 'Everton', 'Fulham', 
               'Huddersfield', 'Leicester City', 'Liverpool', 'Manchester City', 'Manchester Utd', 'Newcastle Utd', 'Norwich City',
               'Sheffield Utd', 'Southampton', 'Stoke City', 'Swansea City', 'Tottenham', 'Watford', 'West Brom', 'West Ham', 'Wolves']
TEAMS_NUMBERS = ['18bb7c10', '8602292d', '4ba7cbea', 'd07537b9', '943e8050','75fae011', 'cff3d9bb', '47c64c55', 'd3fd31cc', 'fd962109',
                 'f5922ca5', 'a2d435b3', '822bd0ba', 'b8fd03ef', '19538871', 'b2b47a98', '1c781004', 
                 '1df6b87e', '33c895d4', '17892952',  'fb10988f', '361ca564', '2abfe087', '60c6b05f', '7c21e445', '8cec06e1']

# teams that played in each season
TEAMS_SEASONS = {'2017-2018' : ['ARS', 'BOU', 'BHA', 'BUR', 'CHE', 'CRY', 'EVE', 'HUD', 'LEI', 'LIV', 'MCI', 'MUN', 'NEW', 'SOU', 'STK', 'SWA', 'TOT', 'WAT', 'WBA', 'WHU'],
                 '2018-2019' : ['ARS', 'BOU', 'BHA', 'BUR', 'CAR', 'CHE', 'CRY', 'EVE', 'FUL', 'HUD', 'LEI', 'LIV', 'MCI', 'MUN', 'NEW', 'SOU', 'TOT', 'WAT', 'WHU', 'WOL'],
                 '2019-2020' : ['ARS', 'AVL', 'BOU', 'BHA', 'BUR', 'CHE', 'CRY', 'EVE', 'LEI', 'LIV', 'MCI', 'MUN', 'NEW', 'NOR', 'SHU', 'SOU', 'TOT', 'WAT', 'WHU', 'WOL']}

TEAMS_NUMBERS_DICT = dict(zip(TEAMS_CODES, TEAMS_NUMBERS))
TEAMS_CODES_DICT = dict(zip(TEAMS_NAMES, TEAMS_CODES))

We are now ready to start scraping data.
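
As a quick sanity check of the lookup tables, the sketch below chains a team name to its code, its FBRef ID and a schedule URL, using three of the teams listed above. The URL pattern is the one used by the schedule helper defined further down; `schedule_url` and the shortened lists are names introduced only for this illustration.

```python
# Minimal sketch: name -> code -> FBRef ID -> schedule URL (three teams only).
BASE_URL = 'http://fbref.com'
names = ['Arsenal', 'Aston Villa', 'Bournemouth']
codes = ['ARS', 'AVL', 'BOU']
numbers = ['18bb7c10', '8602292d', '4ba7cbea']

name_to_code = dict(zip(names, codes))
code_to_number = dict(zip(codes, numbers))

def schedule_url(team_name, season):
    """Hypothetical helper: build the all-competitions schedule URL for a team."""
    number = code_to_number[name_to_code[team_name]]
    return f'{BASE_URL}/en/squads/{number}/{season}/matchlogs/all_comps/schedule'

print(schedule_url('Arsenal', '2017-2018'))
```

Because both dictionaries are built with `zip`, the three lists must stay in the same order; a misplaced entry would silently pair a team with the wrong ID.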

### Scraping Data

First, we'll scrape the full league schedule for every season (2017-18, 2018-19 and 2019-20).

In [5]:
EPL = {}
EPL_LINKS = [r'https://fbref.com/en/comps/9/1631/schedule/2017-2018-Premier-League-Scores-and-Fixtures',
             r'https://fbref.com/en/comps/9/1889/schedule/2018-2019-Premier-League-Scores-and-Fixtures',
             r'https://fbref.com/en/comps/9/3232/schedule/2019-2020-Premier-League-Scores-and-Fixtures']

for index, epl in enumerate(EPL_LINKS):
    fixtures = pd.read_html(epl)[0].iloc[:,0:12]
    fixtures.dropna(axis=0, inplace=True, how='all')
    fixtures['Date'] = pd.to_datetime(fixtures['Date'])
    fixtures.drop(['Day', 'Time', 'xG', 'xG.1', 'Attendance', 'Venue', 'Referee'], axis=1, inplace=True)
    fixtures.reset_index(drop=True, inplace=True)
    name = SEASONS[index]
    EPL[name] = fixtures    

In [6]:
EPL['2017-2018'].head(10)

Unnamed: 0,Wk,Date,Home,Score,Away
0,1.0,2017-08-11,Arsenal,4–3,Leicester City
1,1.0,2017-08-12,Watford,3–3,Liverpool
2,1.0,2017-08-12,West Brom,1–0,Bournemouth
3,1.0,2017-08-12,Everton,1–0,Stoke City
4,1.0,2017-08-12,Southampton,0–0,Swansea City
5,1.0,2017-08-12,Chelsea,2–3,Burnley
6,1.0,2017-08-12,Crystal Palace,0–3,Huddersfield
7,1.0,2017-08-12,Brighton,0–2,Manchester City
8,1.0,2017-08-13,Newcastle Utd,0–2,Tottenham
9,1.0,2017-08-13,Manchester Utd,4–0,West Ham


Since every team also played matches outside the Premier League (e.g., in the FA Cup, Carabao Cup and European competitions), we'll also scrape each team's individual schedule. For that, we'll define a helper function:

In [7]:
def getTeamSchedule(team, season):
    ''' 
    variable 'team' is the team code provided by 'TEAMS_CODES' and 'season' is one of the seasons described in the 'SEASONS' list.
    '''
    number = TEAMS_NUMBERS_DICT[team]
    url = BASE_URL + '/en/squads/' + number +  '/' + season + '/matchlogs/all_comps/schedule'
    d = pd.read_html(url)[0]
    d.drop(['Time', 'Day', 'Attendance', 'Captain', 'Formation', 'Referee', 'Match Report', 'Notes'], axis=1, inplace=True)
    d['Date'] = d['Date'].apply(pd.to_datetime)
    
    try:
        d['GF'].astype('int32')
        d['GA'].astype('int32')
    except ValueError:
        re_number = re.compile(r"\d+")  # \d+ so a multi-digit tally is read whole
        d['GF'] = d['GF'].apply(lambda x : int(re_number.findall(x)[0]))
        d['GA'] = d['GA'].apply(lambda x : int(re_number.findall(x)[0]))
    return d
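
Cup games decided on penalties report goals as strings such as '2 (5)', which is why the helper falls back to a regular expression. The following is a minimal, standalone sketch of that parsing step; `parse_goals` is a hypothetical name, and the sample values are made up for illustration.

```python
import re

_FIRST_NUMBER = re.compile(r'\d+')

def parse_goals(value):
    """Hypothetical helper: goals before any shootout tally, e.g. '2 (5)' -> 2."""
    if isinstance(value, int):
        return value
    # search() returns the first run of digits in the string
    return int(_FIRST_NUMBER.search(str(value)).group())

print([parse_goals(v) for v in ['2 (5)', '0', 3, '10']])
```

Using `\d+` rather than `\d` matters only for double-digit tallies, which are rare but would otherwise be cut down to their first digit.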

Before getting every schedule, let's make one test:

In [8]:
test = getTeamSchedule('ARS', '2017-2018')
test.head(10)

Unnamed: 0,Date,Comp,Round,Venue,Result,GF,GA,Opponent,xG,xGA,Poss
0,2017-08-06,Community Shield,FA Community Shield,Neutral,W,1,1,Chelsea,,,
1,2017-08-11,Premier League,Matchweek 1,Home,W,4,3,Leicester City,2.1,1.6,68.0
2,2017-08-19,Premier League,Matchweek 2,Away,L,0,1,Stoke City,1.2,0.7,77.0
3,2017-08-27,Premier League,Matchweek 3,Away,L,0,4,Liverpool,0.7,2.6,51.0
4,2017-09-09,Premier League,Matchweek 4,Home,W,3,0,Bournemouth,1.6,0.8,58.0
5,2017-09-14,Europa Lg,Group stage,Home,W,3,1,de Köln,,,72.0
6,2017-09-17,Premier League,Matchweek 5,Away,D,0,0,Chelsea,1.1,0.7,50.0
7,2017-09-20,EFL Cup,Third round,Home,W,1,0,Doncaster,,,57.0
8,2017-09-25,Premier League,Matchweek 6,Home,W,2,0,West Brom,2.1,1.1,68.0
9,2017-09-28,Europa Lg,Group stage,Away,W,4,2,by BATE Borisov,,,62.0


We can see above that the FBRef database **only** tracks xG and xGA for **Premier League** matches. In order to make a good dataframe (as we'll explain later), we'll fill the missing xG and xGA values with the goals actually scored and conceded in that match.

In [9]:
def dealNaN(value, goal_column):
    '''
    'value' is a variable that could be missing and 'goal_column' is a variable that, if 'value' is missing, will replace it
    '''
    if pd.isnull(value):
        return goal_column
    else:
        return value

def NEWgetTeamSchedule(team, season):
    ''' 
    variable 'team' is the team code provided by 'TEAMS_CODES' and 'season' is one of the seasons described in the 'SEASONS' list.
    '''
    number = TEAMS_NUMBERS_DICT[team]
    url = BASE_URL + '/en/squads/' + number +  '/' + season + '/matchlogs/all_comps/schedule'
    d = pd.read_html(url)[0]
    d.drop(['Time', 'Day', 'Attendance', 'Captain', 'Formation', 'Referee', 'Match Report', 'Notes'], axis=1, inplace=True)
    d['Date'] = d['Date'].apply(pd.to_datetime)
    
    # cup games that go to penalties report goals as e.g. '2 (5)', so we fall back to RegEx when the column isn't numeric:
    try:
        d['GF'].astype('int32')
        d['GA'].astype('int32')
    except ValueError:
        re_number = re.compile(r"\d+")  # first number = goals before the shootout
        d['GF'] = d['GF'].apply(lambda x : int(re_number.findall(x)[0]))
        d['GA'] = d['GA'].apply(lambda x : int(re_number.findall(x)[0]))
    
    
    # replacing missing xG and xGA with the actual goals scored and conceded
    d['xG'] = d.apply(lambda x : dealNaN(x['xG'], x['GF']), axis=1)
    d['xGA'] = d.apply(lambda x : dealNaN(x['xGA'], x['GA']), axis=1)

    return d
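
The same fill can be written without a row-wise `apply`. This is a minimal sketch on a toy frame (the three rows reuse the goals and xG values from the Arsenal preview above), offered as an equivalent alternative rather than as the notebook's method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'GF':  [1, 4, 3],
    'GA':  [1, 3, 1],
    'xG':  [np.nan, 2.1, np.nan],
    'xGA': [np.nan, 1.6, np.nan],
})

# Series.fillna accepts another Series and fills row by row (aligned on the index).
toy['xG'] = toy['xG'].fillna(toy['GF'])
toy['xGA'] = toy['xGA'].fillna(toy['GA'])
print(toy)
```

Rows that already had xG and xGA are left untouched; only the missing entries take the actual goals.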

In [10]:
test = NEWgetTeamSchedule('ARS', '2017-2018')
test.head(10)

Unnamed: 0,Date,Comp,Round,Venue,Result,GF,GA,Opponent,xG,xGA,Poss
0,2017-08-06,Community Shield,FA Community Shield,Neutral,W,1,1,Chelsea,1.0,1.0,
1,2017-08-11,Premier League,Matchweek 1,Home,W,4,3,Leicester City,2.1,1.6,68.0
2,2017-08-19,Premier League,Matchweek 2,Away,L,0,1,Stoke City,1.2,0.7,77.0
3,2017-08-27,Premier League,Matchweek 3,Away,L,0,4,Liverpool,0.7,2.6,51.0
4,2017-09-09,Premier League,Matchweek 4,Home,W,3,0,Bournemouth,1.6,0.8,58.0
5,2017-09-14,Europa Lg,Group stage,Home,W,3,1,de Köln,3.0,1.0,72.0
6,2017-09-17,Premier League,Matchweek 5,Away,D,0,0,Chelsea,1.1,0.7,50.0
7,2017-09-20,EFL Cup,Third round,Home,W,1,0,Doncaster,1.0,0.0,57.0
8,2017-09-25,Premier League,Matchweek 6,Home,W,2,0,West Brom,2.1,1.1,68.0
9,2017-09-28,Europa Lg,Group stage,Away,W,4,2,by BATE Borisov,4.0,2.0,62.0


We see that we still have a NaN value, now in the Poss column. However, since it affects only one game so far (the Community Shield match) and possession is tracked for all the other competitions, we'll deal with it later.

With xG and xGA fixed, we proceed to gather all schedules:

In [11]:
SCHEDULES = {}
for season in TEAMS_SEASONS:
    #print(season) 
    team_dict = {}
    for team in TEAMS_SEASONS[season]:
        #print(team)
        team_dict[team] = NEWgetTeamSchedule(team, season)
    
    SCHEDULES[season] = team_dict
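
The loop above issues 60 requests back to back. A gentler variant pauses between calls and retries on failure; the sketch below is one way to do that. `fetch_all`, its parameters and the 3-second default are assumptions made for illustration (the delay is an arbitrary placeholder, not a documented limit), and a stand-in function replaces the real scraper so the example runs offline.

```python
import time

def fetch_all(teams_by_season, fetch, delay_seconds=3.0, retries=2):
    """Hypothetical wrapper: call fetch(team, season) for every pair, pausing between calls."""
    schedules = {}
    for season, teams in teams_by_season.items():
        schedules[season] = {}
        for team in teams:
            for attempt in range(retries + 1):
                try:
                    result = fetch(team, season)
                except Exception:
                    if attempt == retries:
                        raise  # give up after the last retry
                    time.sleep(delay_seconds)
                else:
                    schedules[season][team] = result
                    time.sleep(delay_seconds)
                    break
    return schedules

# Stand-in for the real scraper so the sketch runs without network access.
def fake_fetch(team, season):
    return f'{team} {season}'

demo = fetch_all({'2017-2018': ['ARS', 'BOU']}, fake_fetch, delay_seconds=0)
print(demo)
```

In the notebook, `NEWgetTeamSchedule` would take the place of `fake_fetch` and `TEAMS_SEASONS` the place of the toy dictionary.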

Now that we have scraped our data, let's build our Dataframe:

## Building our DataFrame

In order to build a model that predicts the outcome of a match (a win for the home team, a draw or a win for the away team), we need to choose which explanatory variables we'll be using. We can't use data from the match itself, since that would be an obvious data leakage: we want to be able to predict the outcome **before** the match starts.

Everyone who watches football regularly knows that **form** matters. A team with great players and a great manager could be so *out-of-form* that they lose many matches in which they are considered favorites (judging simply by the quality of each starting lineup). For example, in recent times Barcelona (even with Messi), Man United, Arsenal and a number of other teams have struggled in their domestic leagues despite their superb squads.

With that in mind, we'll be using the matches from the last **LAST_MATCHES period** (as defined in the beginning) as a metric for "form". Since our data scraped from FBRef only has Result, GF (Goals For), GA (Goals Against), xG, xGA and Poss (as the dataframe 'test' above shows), we'll mainly be using these metrics in our dataframe. We'll also track how many matches were played in that timeframe.

We'll be defining some helper functions that will do this job next.

In [12]:
def sumPastMatches(data):
    ''' 
    'data': a dataframe with the last matches 
    returns the Number of Matches plus the averages of Points, Goals Scored, Goals Against, xG, xGA and Possession over those matches
    '''
    if data.shape[0] == 0:
        return pd.DataFrame(data=[[0]*7], columns = ['Number of Matches', 'Points', 'GF', 'GA', 'xG', 'xGA', 'Poss'])
    
    data = data.copy()  # work on a copy so the caller's frame is left untouched
    data['Result'] = data['Result'].apply(lambda x : OUTCOMES[x])
    data = data.select_dtypes(include=['float64','int64'])
    new = pd.DataFrame(data = [data.shape[0]], columns=['Number of Matches'])
    data = pd.DataFrame(data.mean()).transpose()
    data = pd.concat([new, data], axis=1)
    data.columns = ['Number of Matches', 'Points', 'GF', 'GA', 'xG', 'xGA', 'Poss']
    
    return data
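
To make the "form" window concrete, here is a minimal sketch on a toy schedule. The dates and results are invented for illustration; the 15-day window and the points mapping mirror `LAST_MATCHES` and `OUTCOMES` from the start of the notebook.

```python
import datetime
import pandas as pd

OUTCOMES = {'W': 3, 'D': 1, 'L': 0}
WINDOW = datetime.timedelta(days=15)

# Toy schedule, dates and results invented for illustration.
schedule = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01', '2020-01-10', '2020-01-18', '2020-01-25']),
    'Result': ['W', 'D', 'L', 'W'],
    'GF': [2, 1, 0, 3],
})

match_day = pd.Timestamp('2020-01-25')
age = match_day - schedule['Date']
# Keep matches strictly before the match day and at most 15 days old.
recent = schedule[(age > datetime.timedelta(0)) & (age <= WINDOW)]

form = {
    'Number of Matches': len(recent),
    'Points': recent['Result'].map(OUTCOMES).mean(),
    'GF': recent['GF'].mean(),
}
print(form)
```

Only the matches on 10 and 18 January fall inside the window, so the match being predicted (25 January) never contributes to its own features.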

Since football is a complex sport, this small set of variables is clearly not enough to predict the outcome of a match reliably. In an ideal world, we'd at least use analytical data on the players starting the match (to cover cases when a star player is injured, when good players are in bad form or vice-versa, etc.) and the strength of recent opponents (to help our model understand that Chelsea beating Qarabağ, a weak team from Azerbaijan, 6-0 in a European competition 5 days ago does not make them heavy favorites for their next match against Arsenal).

We could try to scrape some player data, since FBRef also has it, but that would demand a lot of code and many more requests to the website. To keep it simple, we'll use only the data scraped so far. As surprising as it may be, we shall see that even with only these data points we can be reasonably accurate.

Before going further, one can ask which other variables we could engineer from the data we scraped. One piece of information that could help predict the outcome of a match is each team's **league position** going into the match, since this is also an indication of form (over the whole season).

Because we have all the matches played in each season (in the EPL dictionary), we can compute this with a helper function:

In [13]:
def buildSeasonStandings(matches):
    '''
    'matches' is a dataframe with columns ['Date', 'Home', 'Score', 'Away'], where the 'Score' is written as 'D-D', (D is a digit)
    (just like the ones contained in the EPL dict)
    returns a dictionary with DATES as KEYS and a STANDING DATAFRAME as VALUE, where the standings are calculated
    AFTER each game on said date was played
    '''
    
    dates = sorted(matches['Date'].unique())
    teams = sorted(matches['Home'].unique())
    
    table = pd.DataFrame(data = [[team, 0,0,0,0] for team in teams], columns=['Team', 'GF', 'GA', 'GD', 'Points'])
    
    d = {}
    
    for date in dates:
        rel = matches[matches['Date'] == date]
        date = pd.to_datetime(date)
        for row in rel.iterrows():
            home = row[1]['Home']
            away = row[1]['Away']
            home_goals, away_goals = int(row[1]['Score'][0]), int(row[1]['Score'][2])
    
            if home_goals == away_goals:
                points = [1,1]
            elif home_goals > away_goals:
                points = [3,0]
            else:
                points = [0,3]
            
            home_idx = table[table['Team'] == home].index[0]
            away_idx = table[table['Team'] == away].index[0]
            
            home_gf = table.iloc[home_idx, 1] + home_goals
            home_ga = table.iloc[home_idx, 2] + away_goals
            home_gd = home_gf - home_ga
            home_pts = table.iloc[home_idx, 4] + points[0]
            
            away_gf = table.iloc[away_idx, 1] + away_goals
            away_ga = table.iloc[away_idx, 2] + home_goals
            away_gd = away_gf - away_ga
            away_pts = table.iloc[away_idx, 4] + points[1]
            
            table.iloc[home_idx] = [home, home_gf, home_ga, home_gd, home_pts]
            table.iloc[away_idx] = [away, away_gf, away_ga, away_gd, away_pts]   
             
        table.sort_values(by=['Points', 'GD', 'GF'], ascending=False, inplace=True)
        table.reset_index(drop=True, inplace=True)
        d[date] = table.copy()    

    return d
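
For comparison, the core of the table computation can also be expressed with a `groupby`. The sketch below builds a single table from a toy list of results (teams and scores invented for illustration); `standings` is a hypothetical helper and not a drop-in replacement for `buildSeasonStandings`, which additionally keeps one snapshot per date.

```python
import pandas as pd

# Toy results, invented for illustration.
matches = pd.DataFrame({
    'Home': ['A', 'C', 'B'],
    'Away': ['B', 'A', 'C'],
    'HomeGoals': [2, 1, 0],
    'AwayGoals': [0, 1, 3],
})

def standings(matches):
    """Hypothetical helper: league table from a frame of finished matches."""
    home = matches.rename(columns={'Home': 'Team', 'HomeGoals': 'GF', 'AwayGoals': 'GA'})
    away = matches.rename(columns={'Away': 'Team', 'AwayGoals': 'GF', 'HomeGoals': 'GA'})
    # One row per team per match, seen from that team's point of view.
    rows = pd.concat([home[['Team', 'GF', 'GA']], away[['Team', 'GF', 'GA']]], ignore_index=True)
    # 3 points for a win, 1 for a draw, 0 for a loss.
    rows['Points'] = 3 * (rows['GF'] > rows['GA']) + 1 * (rows['GF'] == rows['GA'])
    table = rows.groupby('Team', as_index=False)[['GF', 'GA', 'Points']].sum()
    table['GD'] = table['GF'] - table['GA']
    table = table.sort_values(['Points', 'GD', 'GF'], ascending=False).reset_index(drop=True)
    return table[['Team', 'GF', 'GA', 'GD', 'Points']]

print(standings(matches))
```

Calling it on the matches played up to a given date would give the snapshot for that date, using the same tie-breakers (points, goal difference, goals scored) as above.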

We'll now calculate the standings throughout each season:

In [14]:
STANDINGS = {}
for season in SEASONS:
    STANDINGS[season] = buildSeasonStandings(EPL[season])

Just to illustrate what our buildSeasonStandings function does, let's take a look at the 2017-2018 season. First, let's look at the first 5 keys:

In [15]:
list(STANDINGS['2017-2018'].keys())[:5]

[Timestamp('2017-08-11 00:00:00'),
 Timestamp('2017-08-12 00:00:00'),
 Timestamp('2017-08-13 00:00:00'),
 Timestamp('2017-08-19 00:00:00'),
 Timestamp('2017-08-20 00:00:00')]

So the key *'Timestamp('2017-08-20 00:00:00')'* in the dictionary returns the standings for 20/08/2017 **after** the games played on that day.

Since we can easily swap from 'Timestamp' to 'datetime', we'll be using the latter:

In [16]:
STANDINGS['2017-2018'][datetime.datetime(2017,8,20)]

Unnamed: 0,Team,GF,GA,GD,Points
0,Manchester Utd,8,0,8,6
1,Huddersfield,4,0,4,6
2,West Brom,2,0,2,6
3,Watford,5,3,2,4
4,Liverpool,4,3,1,4
5,Southampton,3,2,1,4
6,Manchester City,2,0,2,3
7,Leicester City,5,4,1,3
8,Tottenham,3,2,1,3
9,Everton,1,0,1,3


With all the standings in hand, we now continue to define our helper functions and build our dataframe.

In [17]:
def getScore(score):
    '''
    A function to one-hot encode the score of a match.
    'score': a string written as 'D-D', where D are digits.
    '''
    
    home = int(score[0])
    away = int(score[2])
    
    if home == away:
        score = [0,1,0]
    elif home > away:
        score = [1,0,0]
    else:
        score = [0,0,1]
    
    return pd.DataFrame(data=[score], columns=['Home Win', 'Draw', 'Away Win'])

def getTeamStandings(standings, team, date):
    '''
    standings is one dict with dates as keys (just like STANDINGS) whereas 'team' is the team full name (as in 'TEAMS_NAMES')
    returns the position in the table BEFORE the match. if it's the first match of the season, return 0
    '''
    idx = list(standings.keys())
    
    date_idx = idx.index(date)
    
    if date_idx == 0:
        return 0
    
    else:
        date = idx[date_idx-1]
        table = standings[date]
        pos = table[table['Team'] == team].index[0]
        return pos + 1

def buildDataFrame(sched_dict, standings_dict, home, away, date, score, time=LAST_MATCHES):
    '''
    The function that will build the Dataframe as desired
    '''
    home_code = TEAMS_CODES_DICT[home]
    away_code = TEAMS_CODES_DICT[away]
    
    home_matches = sched_dict[home_code].copy()
    away_matches = sched_dict[away_code].copy()
    
    home_matches['Date'] = date - home_matches['Date']
    away_matches['Date'] = date - away_matches['Date']
        
    home_matches = home_matches[(home_matches['Date'] > datetime.timedelta(0)) & (home_matches['Date'] <= time)]
    away_matches = away_matches[(away_matches['Date'] > datetime.timedelta(0)) & (away_matches['Date'] <= time)]

    home_matches = sumPastMatches(home_matches)
    away_matches = sumPastMatches(away_matches)
           
    home_pos = pd.DataFrame([getTeamStandings(standings_dict, home, date)], columns=['Home League Position'])
    away_pos = pd.DataFrame([getTeamStandings(standings_dict, away, date)], columns=['Away League Position'])
        
    score = getScore(score)
    
    return pd.concat([home_pos, home_matches, away_pos, away_matches, score], axis=1).iloc[0]
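
Both `getScore` and `buildSeasonStandings` read the score by character position, which works because every score in these seasons has single-digit tallies. A more defensive parse splits on the dash instead. This is a minimal sketch; `parse_score` and `one_hot_outcome` are hypothetical names introduced here.

```python
def parse_score(score):
    """Hypothetical helper: split a score such as '4–3' or '10-0' into two ints."""
    # FBRef writes scores with an en dash; normalise it to a plain hyphen first.
    home, away = score.replace('–', '-').split('-')
    return int(home), int(away)

def one_hot_outcome(score):
    """Return [home win, draw, away win] as a one-hot list."""
    home, away = parse_score(score)
    return [int(home > away), int(home == away), int(home < away)]

for s in ['4–3', '0–0', '0–2', '10-0']:
    print(s, one_hot_outcome(s))
```

Taking the position of the 1 in that list (for example with `argmax`) gives the label encoding 0, 1, 2 for home win, draw and away win.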

With all the needed functions written, let's build our dataframe:

In [18]:
#### BUILDING THE DATAFRAME
SEASONS_DF = {}
for s in EPL:
    matches = EPL[s]
    scheds = SCHEDULES[s]
    stands = STANDINGS[s]
    df = matches.apply(lambda x : buildDataFrame(scheds, stands, x['Home'], x['Away'], x['Date'], x['Score']), axis=1)
    SEASONS_DF[s] = df

# Joining all Dataframes from each season
df = pd.concat(list(SEASONS_DF.values()))

Let's take a peek at the final dataframe:

In [19]:
df.head()

Unnamed: 0,Home League Position,Number of Matches,Points,GF,GA,xG,xGA,Poss,Away League Position,Number of Matches.1,Points.1,GF.1,GA.1,xG.1,xGA.1,Poss.1,Home Win,Draw,Away Win
0,0.0,1.0,3.0,1.0,1.0,1.0,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,7.0,1.0,3.0,1.0,0.0,1.0,0.0,,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Before dealing with the NaN values (and the 0.0 Home League Position in the first row), let's drop matches where one of the teams hadn't played within the **LAST_MATCHES** window (the previous 15 days). Since our main assumption is that **form** is an important factor in the outcome of a match, this is a reasonable thing to do:

In [20]:
df = df[(df.iloc[:,1] != 0) & (df.iloc[:,9] != 0)]
print(df.shape)

(1077, 19)


We shouldn't have any NaN values left. Let's just check it:

In [21]:
df.isna().sum()

Home League Position    0
Number of Matches       0
Points                  0
GF                      0
GA                      0
xG                      0
xGA                     0
Poss                    0
Away League Position    0
Number of Matches       0
Points                  0
GF                      0
GA                      0
xG                      0
xGA                     0
Poss                    0
Home Win                0
Draw                    0
Away Win                0
dtype: int64

For easier future use, we'll save it:

In [22]:
df.to_csv(r'CustomDATA.csv', index=False)

We are now ready to use this dataset.

## Models

In [23]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

We'll try to use the above data on two models. First, we'll be using Random Forests.

Note that we'll **not** be using Cross-Validation since the main purpose of this notebook is to show our ability to parse data.

### Random Forest

In [24]:
from sklearn.ensemble import RandomForestClassifier

def runRF(df):
    '''
    df is a dataframe with matches in which the last 3 columns are a one-hot encoding of the possible result (home win, draw, away win)
    '''
    # separating variables
    X = df.iloc[:,:-3]
    y = df.iloc[:,-3:]
    
    # label encoding the response variable
    y = y.values.argmax(axis=1)

    # train, test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_STATE)

    # scaling
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # random forest
    model = RandomForestClassifier(n_estimators=100)
    
    model.fit(X_train, y_train)
    
    preds = model.predict(X_test)
    print(confusion_matrix(y_test,preds))
    print(classification_report(y_test,preds))
    return accuracy_score(y_test, preds)



score = runRF(df)
print(score)

[[116   5  29]
 [ 41   8  22]
 [ 41   8  54]]
              precision    recall  f1-score   support

           0       0.59      0.77      0.67       150
           1       0.38      0.11      0.17        71
           2       0.51      0.52      0.52       103

    accuracy                           0.55       324
   macro avg       0.49      0.47      0.45       324
weighted avg       0.52      0.55      0.51       324

0.5493827160493827


The Random Forest model has around 55% accuracy. We'll see that this score is not that bad. Now we train a Neural Network on this same dataset.

### Neural Network

After importing the standard libraries, we'll define a function that trains and evaluates the network on our dataframe.

In [25]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

In [26]:
def runNN(df):
    '''
    df is a dataframe with matches in which the last 3 columns are a one-hot encoding of the possible result (home win, draw, away win)
    '''
    # separating variables
    X = df.iloc[:,:-3]
    y = df.iloc[:,-3:]

    # train, test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_STATE)

    # scaling
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # neural network
    units = df.shape[1]
    units_2 = round(units/2)
    model = Sequential()
    model.add(Dense(units=units,activation='relu'))
    model.add(Dense(units=units_2,activation='relu'))
    model.add(Dense(units=3,activation='softmax'))


    early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)


    # compile and fitting
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
    model.fit(x=X_train, y=y_train, epochs=300, validation_data=(X_test, y_test), verbose=0, callbacks=[early_stop])

    #model_loss = pd.DataFrame(model.history.history)
    #model_loss.plot()

    preds = model.predict(X_test).argmax(axis=1)
    print(confusion_matrix(y_test.values.argmax(axis=1),preds))
    print(classification_report(y_test.values.argmax(axis=1),preds))
    return accuracy_score(y_test.values.argmax(axis=1), preds)

Let's now just run our Neural Network:

In [27]:
score = runNN(df)
print(score)

Epoch 00059: early stopping
[[118   2  30]
 [ 46   1  24]
 [ 46   0  57]]
              precision    recall  f1-score   support

           0       0.56      0.79      0.66       150
           1       0.33      0.01      0.03        71
           2       0.51      0.55      0.53       103

    accuracy                           0.54       324
   macro avg       0.47      0.45      0.41       324
weighted avg       0.50      0.54      0.48       324

0.5432098765432098


The Neural Network performed almost the same as the Random Forest: around 54% accuracy. 


These models are 'generic' (i.e. not specifically designed for football matches or competitions) and, as already discussed, trained on data that is far from ideal (it has no player data, for example). Even so, it is fair to ask the question we should always ask ourselves: are these good results? Is roughly 54-55% accuracy a good mark?

Well, it's obviously better than a random choice (that is 33% accurate) - but that shouldn't be our benchmark. So we may need to look at something else to compare. We chose to compare these models with one that is also simple (but not random): we'll predict the result for each match based on the pre-match betting odds from [BET365](https://www.bet365.com). The historical betting data has been obtained via [Football Data](https://www.football-data.co.uk/englandm.php).
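
Before turning to the odds, one more cheap reference point is worth a glance: always predicting the most frequent outcome. The sketch below derives it from the Random Forest confusion matrix printed earlier (rows are true classes, so row sums are the class supports); `accuracy_from_confusion` is a name introduced here for illustration.

```python
import numpy as np

# Confusion matrix of the Random Forest run above (rows: true class, columns: predicted).
rf_cm = np.array([[116, 5, 29],
                  [41, 8, 22],
                  [41, 8, 54]])

def accuracy_from_confusion(cm):
    """Share of test matches on the diagonal, i.e. predicted correctly."""
    return np.trace(cm) / cm.sum()

# Always predicting the most frequent true class gets every match of that class right.
majority_baseline = rf_cm.sum(axis=1).max() / rf_cm.sum()

print(round(accuracy_from_confusion(rf_cm), 3))  # 0.549
print(round(majority_baseline, 3))               # 0.463
```

On this test split, "always home win" would be right about 46% of the time, so that figure is a more demanding floor than the 33% of a uniform random guess.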

### Betting Odds Comparison

We'll first load the files:

In [28]:
BET_DATA = {}

for s in SEASONS:
    name = 'BET_' + s
    file = r'BET_DATA/' + s + '.csv'
    data = pd.read_csv(file)
    #we'll be using only BET365 data. Below B365H denotes their betting odds for the Home Team, 
    #B365D for Draw and B365A for the Away Team
    BET_DATA[s] = data[['HomeTeam', 'AwayTeam', 'B365H', 'B365D', 'B365A']]

We'll now define a helper function to find the odds for a given match. Since the betting CSVs use different club names from those we've used so far, we'll take the opportunity to use [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy), a fuzzy string matching library for Python. For more on fuzzy matching, see [Wikipedia](https://en.wikipedia.org/wiki/Approximate_string_matching).

In [29]:
from fuzzywuzzy import process


def getOdds(data, home, away):
    '''
    data is a dataframe with the betting odds.
    home & away are the team names as written in the EPL dataframes; they are matched against the names used in data
    '''
    
    #hardcoding for Man Utd since fuzzywuzzy extract 'Chelsea' from 'Manchester Utd'... bad wuzzy!
    if home == 'Manchester Utd':
        home = 'Man United'
    elif away == 'Manchester Utd':
        away = 'Man United'
   

    try:
        ret = data[(data['HomeTeam'] == home) & (data['AwayTeam'] == away)].iloc[0, 2:]

    except IndexError:
        # no exact match: fuzzy-match both names against the names used in the betting data
        choices = list(set(data['HomeTeam']))
        new_home, home_score = process.extractOne(home, choices)
        new_away, away_score = process.extractOne(away, choices)
        # fail loudly instead of hitting an undefined variable when no candidate is close enough
        if home_score <= 60 or away_score <= 60:
            raise ValueError('No close match for ' + home + ' / ' + away)

        ret = data[(data['HomeTeam'] == new_home) & (data['AwayTeam'] == new_away)].iloc[0, 2:]
        
    
    return ret
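
The name-matching idea can be tried without any third-party package. The sketch below swaps FuzzyWuzzy for `difflib` from the standard library, so it is a stand-in for the approach above and its similarity scores are not the same as FuzzyWuzzy's. `match_team`, `ALIASES` and the short list of names are assumptions made for illustration.

```python
import difflib

# Names that string similarity tends to get wrong are mapped by hand.
ALIASES = {'Manchester Utd': 'Man United'}

def match_team(name, choices, cutoff=0.6):
    """Hypothetical helper: map a team name to the closest entry in choices, or None."""
    if name in choices:
        return name
    if name in ALIASES:
        return ALIASES[name]
    close = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return close[0] if close else None

# Illustrative names in the style of the betting CSV.
choices = ['Man United', 'Man City', 'Chelsea', 'Leicester', 'Newcastle', 'Wolves']
for name in ['Leicester City', 'Newcastle Utd', 'Manchester Utd', 'Chelsea']:
    print(name, '->', match_team(name, choices))
```

Returning `None` when nothing clears the cutoff makes a failed match visible to the caller, and printing the mapped pairs once is a cheap way to audit them by eye.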

Now we'll rebuild the main dataframe with the appropriate betting data for each match:

In [30]:
SEASONS_DF = {}
for s in EPL:
    matches = EPL[s]
    scheds = SCHEDULES[s]
    bet_odds = BET_DATA[s]
    stands = STANDINGS[s]
    df = matches.apply(lambda x : buildDataFrame(scheds, stands, x['Home'], x['Away'], x['Date'], x['Score']), axis=1)
    odds = matches.apply(lambda x : getOdds(bet_odds, x['Home'], x['Away']), axis=1)
    SEASONS_DF[s] = pd.concat([df, odds], axis=1)
    
df = pd.concat(list(SEASONS_DF.values()))
    
df = df[(df.iloc[:,1] != 0) & (df.iloc[:,9] != 0)]
print(df.shape)

(1077, 22)


Note that the above dataframe has the same number of rows, but 3 extra columns (the betting odds).

In [31]:
df.head()

Unnamed: 0,Home League Position,Number of Matches,Points,GF,GA,xG,xGA,Poss,Away League Position,Number of Matches.1,Points.1,GF.1,GA.1,xG.1,xGA.1,Poss.1,Home Win,Draw,Away Win,B365H,B365D,B365A
10,12.0,1.0,1.0,0.0,0.0,0.4,2.1,41.0,1.0,2.0,1.5,2.5,1.0,1.55,1.3,46.0,0.0,0.0,1.0,11.0,5.0,1.36
11,11.0,1.0,1.0,0.0,0.0,2.1,0.4,59.0,20.0,1.0,0.0,0.0,4.0,0.6,2.1,45.0,1.0,0.0,0.0,1.75,3.8,5.25
12,13.0,1.0,0.0,3.0,4.0,1.6,2.1,32.0,18.0,1.0,0.0,0.0,2.0,0.5,1.6,23.0,1.0,0.0,0.0,1.73,3.8,5.5
13,15.0,1.0,0.0,0.0,1.0,0.4,1.2,69.0,10.0,1.0,1.0,3.0,3.0,2.0,3.0,47.0,0.0,0.0,1.0,2.0,3.6,4.0
14,6.0,1.0,3.0,3.0,2.0,0.8,1.0,38.0,8.0,1.0,3.0,1.0,0.0,1.2,0.4,31.0,0.0,0.0,1.0,2.63,3.2,3.0


Since **lower** decimal odds mean that the bookmaker considers the outcome more likely (the implied probability is roughly the inverse of the odds, and it also reflects how people are betting), the outcome with the lowest odds will be selected as the prediction (some sort of [Collective Intelligence](https://en.wikipedia.org/wiki/Collective_intelligence)).

In [32]:
def betOddsModel(df):
    '''
    df is a dataframe with matches in which the first 3 columns are the betting data in order (Home Win, Draw, Away Win) and
    the last 3 columns are a one-hot encoding of the possible result (also in the order above)
    '''
    # separating variables
    X = df.iloc[:,:3]
    y = df.iloc[:,-3:]

    # train, test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_STATE)
      
    # the model prediction: using the lowest betting odds
    preds = X_test.values.argmin(axis=1)
    
    print(confusion_matrix(y_test.values.argmax(axis=1),preds))
    print(classification_report(y_test.values.argmax(axis=1),preds))
    return accuracy_score(y_test.values.argmax(axis=1), preds)
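
To see why taking the lowest odds amounts to taking the most likely outcome, decimal odds can be turned into implied probabilities. This is a minimal sketch using the odds from the second row of the preview above; dividing by the sum is one simple convention for removing the bookmaker's margin, assumed here for illustration.

```python
import numpy as np

# Decimal odds for one match, in the order (home win, draw, away win).
odds = np.array([1.75, 3.8, 5.25])

raw = 1.0 / odds          # implied probabilities, including the bookmaker margin
overround = raw.sum()     # exceeds 1 because of that margin
probs = raw / overround   # rescaled so the three outcomes sum to 1

print(probs.round(3), round(overround, 3))
# The lowest odds and the highest implied probability pick the same outcome.
print(odds.argmin() == probs.argmax())
```

Since the inverse is a decreasing function, `argmin` on the odds and `argmax` on the probabilities always agree, so the function above can work on the raw odds directly.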

For the dataframe to fit our function, we'll make a small adjustment:

In [33]:
df = df[['B365H', 'B365D', 'B365A', 'Home Win', 'Draw', 'Away Win']]

score = betOddsModel(df)
print(score)

[[123   0  27]
 [ 47   0  24]
 [ 38   0  65]]
              precision    recall  f1-score   support

           0       0.59      0.82      0.69       150
           1       0.00      0.00      0.00        71
           2       0.56      0.63      0.59       103

    accuracy                           0.58       324
   macro avg       0.38      0.48      0.43       324
weighted avg       0.45      0.58      0.51       324

0.5802469135802469



Around 58% accuracy. That is better than what we achieved with the Random Forest (around 55%) and the Neural Network (around 54%), but not by a wide margin.

## Conclusion

Both the Random Forest and the Neural Network performed worse than the "Betting Model". However, since the gap is small (only 3 to 4 p.p.), it is plausible that, with better data (as discussed earlier), our predictions based on "form" could become more accurate.

Also, we used generic classification models. We could try other generic classifiers (such as KNN, Naive Bayes, LinearSVC, XGBClassifier, etc.) but we expect to obtain similar results. A more promising way to improve is to design a model from scratch (similar to what is done [here](https://link.springer.com/article/10.1007/s10994-018-5741-1), for example), but that would be long and heavy statistical work, and that is not our intention right now.