In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from functions import *

In [2]:
path = '/Users/allanbellahsene/Desktop/FOOTBALL_PREDICTION_PROJECT/data/PREMIER_LEAGUE/PL_'
years = ['2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
         '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
data = import_data(init_path=path, years=years)

The variable data is a list; each element is the pandas DataFrame holding the data of one Premier League season. For instance, the first element, data[0], holds the 2000-2001 season.
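The import_data function comes from the local functions module, which is not shown here. The following is only a minimal sketch of what such a loader could look like, assuming one CSV file per season whose name is init_path followed by the year and a ".csv" extension. The name import_data_sketch, the extension parameter, and the file naming are assumptions for illustration, not the author's code.

```python
import pandas as pd

def import_data_sketch(init_path, years, extension=".csv"):
    """Hypothetical loader: read one file per season into a list of DataFrames.

    Assumes files are named init_path + year + extension, e.g. '.../PL_2001.csv'.
    """
    return [pd.read_csv(f"{init_path}{year}{extension}") for year in years]
```

The list keeps the order of years, so the first element corresponds to the first year passed in.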

In [3]:
data[0].head()

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,...,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
0,E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,...,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
1,E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,...,4.2,1.5,3.4,6.0,1.5,3.6,6.0,1.44,3.6,6.5
2,E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,...,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
3,E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,...,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2.0,3.2,3.2
4,E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,...,4.5,1.55,3.5,5.0,1.57,3.6,5.0,1.61,3.5,4.5


To analyze all seasons together, we merge the elements of this list into a single DataFrame.

In [4]:
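def merge_list(data):
    """Outer-merge the per-season DataFrames into a single DataFrame."""
    df = data[0]
    for season in data[1:]:
        # With no `on` argument, merge joins on every column the two frames share;
        # how='outer' keeps all rows and all columns from both sides.
        df = pd.merge(df, season, how='outer')
    return df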

In [5]:
df = merge_list(data)
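Because an outer merge joins on all shared columns, rows that are identical on those columns in two seasons would be combined rather than stacked. If the goal is simply to stack the seasons on top of each other, pd.concat is a common alternative. A minimal sketch on toy frames (concat_seasons and the toy values are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

def concat_seasons(seasons):
    """Stack per-season frames; columns missing from a season are filled with NaN."""
    return pd.concat(seasons, ignore_index=True, sort=False)

# Two toy "seasons" with partly different bookmaker columns.
s1 = pd.DataFrame({"HomeTeam": ["Charlton"], "FTHG": [4], "WHH": [2.1]})
s2 = pd.DataFrame({"HomeTeam": ["Liverpool"], "FTHG": [2], "PSH": [1.31]})

combined = concat_seasons([s1, s2])
print(combined)
```

With ignore_index=True the result gets a fresh 0..n-1 index, which matches the row numbering seen in the merged DataFrame.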

In [6]:
len(df) == len(data) * len(data[0]) + 1

True
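The "+ 1" in this check means the merged frame has one row more than 19 seasons of 380 matches. The missing-value counts computed later in this notebook show exactly one missing value in core columns such as Div, Date and HomeTeam, which is consistent with a single row carrying no match data. A minimal sketch, on toy data, of how such a row could be located and dropped (the toy frame and the choice of key_cols are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame with one row that has no match data at all.
toy = pd.DataFrame({
    "Date": ["19/08/00", np.nan, "12/05/2019"],
    "HomeTeam": ["Charlton", np.nan, "Watford"],
    "FTHG": [4.0, np.nan, 1.0],
})

key_cols = ["Date", "HomeTeam", "FTHG"]  # columns every real match should have

# Rows where all key columns are missing.
empty_rows = toy[toy[key_cols].isna().all(axis=1)]

# Drop those rows and renumber the index.
cleaned = toy.dropna(how="all", subset=key_cols).reset_index(drop=True)
print(len(toy), len(empty_rows), len(cleaned))
```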

In [7]:
df.head()

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,...,BbAvAHA,BSH,BSD,BSA,PSH,PSD,PSA,PSCH,PSCD,PSCA
0,E0,19/08/00,Charlton,Man City,4.0,0.0,H,2.0,0.0,H,...,,,,,,,,,,
1,E0,19/08/00,Chelsea,West Ham,4.0,2.0,H,1.0,0.0,H,...,,,,,,,,,,
2,E0,19/08/00,Coventry,Middlesbrough,1.0,3.0,A,1.0,1.0,D,...,,,,,,,,,,
3,E0,19/08/00,Derby,Southampton,2.0,2.0,D,1.0,2.0,A,...,,,,,,,,,,
4,E0,19/08/00,Leeds,Everton,2.0,0.0,H,2.0,0.0,H,...,,,,,,,,,,


In [8]:
df.tail()

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,...,BbAvAHA,BSH,BSD,BSA,PSH,PSD,PSA,PSCH,PSCD,PSCA
7216,E0,12/05/2019,Liverpool,Wolves,2.0,0.0,H,1.0,0.0,H,...,1.95,,,,1.31,5.77,10.54,1.32,5.89,9.48
7217,E0,12/05/2019,Man United,Cardiff,0.0,2.0,A,0.0,1.0,A,...,1.64,,,,1.28,6.33,10.21,1.3,6.06,9.71
7218,E0,12/05/2019,Southampton,Huddersfield,1.0,1.0,D,1.0,0.0,H,...,1.73,,,,1.44,4.83,7.62,1.37,5.36,8.49
7219,E0,12/05/2019,Tottenham,Everton,2.0,2.0,D,1.0,0.0,H,...,1.8,,,,2.1,3.64,3.64,1.91,3.81,4.15
7220,E0,12/05/2019,Watford,West Ham,1.0,4.0,A,0.0,2.0,A,...,1.72,,,,2.2,3.85,3.21,2.11,3.86,3.41


In [9]:
n_observations, n_features = df.shape
start_date = df['Date'].iloc[0]
final_date = df['Date'].iloc[-1]
print(f'This dataset contains data for {n_observations} Premier League games, '
      f'between {start_date} and {final_date}, with a total of {n_features} variables.')

This dataset contains data for 7221 Premier League games, between 19/08/00 and 12/05/2019, with a total of 103 variables.
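The two dates printed here use different formats (two-digit year for the first season, four-digit year for the last), and the Date column is stored as plain strings. A minimal sketch of one way to convert such a column to real timestamps, assuming only these two day-first formats occur (parse_match_dates is a hypothetical helper, not part of the notebook):

```python
import pandas as pd

def parse_match_dates(dates):
    """Parse dd/mm/yy and dd/mm/yyyy strings into timestamps (NaT if neither fits)."""
    two_digit = pd.to_datetime(dates, format="%d/%m/%y", errors="coerce")
    four_digit = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")
    # Each string matches at most one of the two formats; take whichever parsed.
    return four_digit.fillna(two_digit)

parsed = parse_match_dates(pd.Series(["19/08/00", "12/05/2019"]))
print(parsed)
```

Once the column holds timestamps, sorting matches chronologically and computing the number of days between two games becomes straightforward.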


In [10]:
df.dtypes

Div            object
Date           object
HomeTeam       object
AwayTeam       object
FTHG          float64
FTAG          float64
FTR            object
HTHG          float64
HTAG          float64
HTR            object
Attendance    float64
Referee        object
HS            float64
AS            float64
HST           float64
AST           float64
HHW           float64
AHW           float64
HC            float64
AC            float64
HF            float64
AF            float64
HO            float64
AO            float64
HY            float64
AY            float64
HR            float64
AR            float64
HBP           float64
ABP           float64
               ...   
VCH           float64
VCD           float64
VCA           float64
Bb1X2         float64
BbMxH         float64
BbAvH         float64
BbMxD         float64
BbAvD         float64
BbMxA         float64
BbAvA         float64
BbOU          float64
BbMx>2.5      float64
BbAv>2.5      float64
BbMx<2.5      float64
BbAv<2.5  

# Description of variables

Key to results data:

- Div = League Division
- Date = Match Date (dd/mm/yy)
- Time = Time of match kick off
- HomeTeam = Home Team
- AwayTeam = Away Team
- FTHG and HG = Full Time Home Team Goals
- FTAG and AG = Full Time Away Team Goals
- FTR and Res = Full Time Result (H=Home Win, D=Draw, A=Away Win)
- HTHG = Half Time Home Team Goals
- HTAG = Half Time Away Team Goals
- HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)

Match Statistics (where available)
- Attendance = Crowd Attendance
- Referee = Match Referee
- HS = Home Team Shots
- AS = Away Team Shots
- HST = Home Team Shots on Target
- AST = Away Team Shots on Target
- HHW = Home Team Hit Woodwork
- AHW = Away Team Hit Woodwork
- HC = Home Team Corners
- AC = Away Team Corners
- HF = Home Team Fouls Committed
- AF = Away Team Fouls Committed
- HFKC = Home Team Free Kicks Conceded
- AFKC = Away Team Free Kicks Conceded
- HO = Home Team Offsides
- AO = Away Team Offsides
- HY = Home Team Yellow Cards
- AY = Away Team Yellow Cards
- HR = Home Team Red Cards
- AR = Away Team Red Cards
- HBP = Home Team Bookings Points (10 = yellow, 25 = red)
- ABP = Away Team Bookings Points (10 = yellow, 25 = red)

Note that Free Kicks Conceded includes fouls, offsides and any other offense committed and will always be equal to or higher than the number of fouls. Fouls make up the vast majority of Free Kicks Conceded. Free Kicks Conceded are shown when specific data on Fouls are not available (France 2nd, Belgium 1st and Greece 1st divisions).

Note also that English and Scottish yellow cards do not include the initial yellow card when a second is shown to a player converting it into a red, but this is included as a yellow (plus red) for European games.
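As a small worked example of the bookings points definition above (10 per yellow card, 25 per red card), the value for a team can be recomputed from its card counts. The function name is illustrative only:

```python
def booking_points(yellows, reds):
    """Bookings points as defined in the key: 10 per yellow card, 25 per red card."""
    return 10 * yellows + 25 * reds

# A team shown 3 yellow cards and 1 red card.
print(booking_points(3, 1))  # 3 * 10 + 1 * 25 = 55
```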


Key to 1X2 (match) betting odds data:

- B365H = Bet365 home win odds
- B365D = Bet365 draw odds
- B365A = Bet365 away win odds
- BSH = Blue Square home win odds
- BSD = Blue Square draw odds
- BSA = Blue Square away win odds
- BWH = Bet&Win home win odds
- BWD = Bet&Win draw odds
- BWA = Bet&Win away win odds
- GBH = Gamebookers home win odds
- GBD = Gamebookers draw odds
- GBA = Gamebookers away win odds
- IWH = Interwetten home win odds
- IWD = Interwetten draw odds
- IWA = Interwetten away win odds
- LBH = Ladbrokes home win odds
- LBD = Ladbrokes draw odds
- LBA = Ladbrokes away win odds
- PSH and PH = Pinnacle home win odds
- PSD and PD = Pinnacle draw odds
- PSA and PA = Pinnacle away win odds
- SOH = Sporting Odds home win odds
- SOD = Sporting Odds draw odds
- SOA = Sporting Odds away win odds
- SBH = Sportingbet home win odds
- SBD = Sportingbet draw odds
- SBA = Sportingbet away win odds
- SJH = Stan James home win odds
- SJD = Stan James draw odds
- SJA = Stan James away win odds
- SYH = Stanleybet home win odds
- SYD = Stanleybet draw odds
- SYA = Stanleybet away win odds
- VCH = VC Bet home win odds
- VCD = VC Bet draw odds
- VCA = VC Bet away win odds
- WHH = William Hill home win odds
- WHD = William Hill draw odds
- WHA = William Hill away win odds

- Bb1X2 = Number of BetBrain bookmakers used to calculate match odds averages and maximums
- BbMxH = Betbrain maximum home win odds
- BbAvH = Betbrain average home win odds
- BbMxD = Betbrain maximum draw odds
- BbAvD = Betbrain average draw odds
- BbMxA = Betbrain maximum away win odds
- BbAvA = Betbrain average away win odds

- MaxH = Market maximum home win odds
- MaxD = Market maximum draw odds
- MaxA = Market maximum away win odds
- AvgH = Market average home win odds
- AvgD = Market average draw odds
- AvgA = Market average away win odds
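To use these odds as model inputs, it can help to convert them to probabilities. A minimal sketch, assuming the columns hold decimal odds (values such as 2.1 suggest so): the inverse of each price gives a raw probability, and because the bookmaker's margin makes the three raw values sum to more than 1, they are rescaled proportionally. Proportional rescaling is one simple convention among several, and the function name is illustrative.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal 1X2 odds into probabilities that sum to 1.

    Returns (probabilities, overround), where overround is the sum of the
    raw inverse odds (greater than 1 when the bookmaker takes a margin).
    """
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)
    return [p / overround for p in raw], overround

# William Hill odds for Charlton v Man City, first row of the 2000-2001 data.
probs, overround = implied_probabilities(2.1, 3.2, 3.1)
print(probs, overround)
```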



Key to total goals betting odds:

- BbOU = Number of BetBrain bookmakers used to calculate over/under 2.5 goals (total goals) averages and maximums
- BbMx>2.5 = Betbrain maximum over 2.5 goals
- BbAv>2.5 = Betbrain average over 2.5 goals
- BbMx<2.5 = Betbrain maximum under 2.5 goals
- BbAv<2.5 = Betbrain average under 2.5 goals

- GB>2.5 = Gamebookers over 2.5 goals
- GB<2.5 = Gamebookers under 2.5 goals
- B365>2.5 = Bet365 over 2.5 goals
- B365<2.5 = Bet365 under 2.5 goals
- P>2.5 = Pinnacle over 2.5 goals
- P<2.5 = Pinnacle under 2.5 goals
- Max>2.5 = Market maximum over 2.5 goals
- Max<2.5 = Market maximum under 2.5 goals
- Avg>2.5 = Market average over 2.5 goals
- Avg<2.5 = Market average under 2.5 goals



Key to Asian handicap betting odds:

- BbAH = Number of BetBrain bookmakers used to calculate Asian handicap averages and maximums
- BbAHh = Betbrain size of handicap (home team)
- AHh = Market size of handicap (home team) (since 2019/2020)
- BbMxAHH = Betbrain maximum Asian handicap home team odds
- BbAvAHH = Betbrain average Asian handicap home team odds
- BbMxAHA = Betbrain maximum Asian handicap away team odds
- BbAvAHA = Betbrain average Asian handicap away team odds

- GBAHH = Gamebookers Asian handicap home team odds
- GBAHA = Gamebookers Asian handicap away team odds
- GBAH = Gamebookers size of handicap (home team)
- LBAHH = Ladbrokes Asian handicap home team odds
- LBAHA = Ladbrokes Asian handicap away team odds
- LBAH = Ladbrokes size of handicap (home team)
- B365AHH = Bet365 Asian handicap home team odds
- B365AHA = Bet365 Asian handicap away team odds
- B365AH = Bet365 size of handicap (home team)
- PAHH = Pinnacle Asian handicap home team odds
- PAHA = Pinnacle Asian handicap away team odds
- MaxAHH = Market maximum Asian handicap home team odds
- MaxAHA = Market maximum Asian handicap away team odds
- AvgAHH = Market average Asian handicap home team odds
- AvgAHA = Market average Asian handicap away team odds

As a first step, we want to build a model that, based on our knowledge of football, will let us predict match results (Home Win, Draw, Away Win) from the following explanatory variables:
- Home/away
- Form of both teams (e.g. results of their last 5 matches)
- League position of both teams
- Head-to-head history of the two teams (giving more weight to recent matches, and perhaps more weight to matches played under the same conditions, i.e. home or away)
- Whether key players of either team are injured
- Fatigue (date of the most recent match)
- Importance of the match: whether the match matters more to one team than the other given the context (title race, relegation battle, etc.)
- Whether either team has a more important match coming up (e.g. a Champions League match)

This is therefore a classification problem (with three possible outcomes). We can, for example, use methods such as Logistic Regression or Support Vector Machine. Later, it will also be interesting to include as many explanatory variables as possible and see whether our accuracy improves. But as a first step, let us focus on the approach described here.
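To make the three-class setup concrete, here is a minimal sketch using scikit-learn's LogisticRegression. The single feature (a "form difference" between home and away team) and the labels are toy values invented purely to show the API, not results from the dataset, and the choice of feature is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature: home form minus away form, one value per match.
X = np.arange(-6, 7, dtype=float).reshape(-1, 1)

# Toy labels: away win for strongly negative differences, home win for
# strongly positive ones, draw in between.
y = np.where(X[:, 0] < -2, "A", np.where(X[:, 0] > 2, "H", "D"))

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# One probability per class for each match; columns follow model.classes_.
probs = model.predict_proba(np.array([[-5.0], [0.0], [5.0]]))
print(model.classes_)
print(probs.round(2))
```

predict_proba returns one probability per outcome, which can later be compared with the probabilities implied by bookmaker odds.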

To implement this approach, we will need to construct the desired explanatory variables from the data we have.
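As an illustration of constructing such a variable, the sketch below derives a simple "form" feature: the points each team earned in its previous five matches (3 for a win, 1 for a draw, 0 for a loss). It assumes rows are in chronological order and that FTR is never missing. The function and the HomeForm/AwayForm column names are hypothetical.

```python
import pandas as pd

# Points for (home team, away team) given the full-time result.
POINTS = {"H": (3, 0), "D": (1, 1), "A": (0, 3)}

def add_form_features(matches, window=5):
    """Add HomeForm/AwayForm: points each side earned in its previous `window` games."""
    history = {}  # team -> list of points from past matches, oldest first
    home_form, away_form = [], []
    for home, away, result in zip(matches["HomeTeam"], matches["AwayTeam"], matches["FTR"]):
        # Form is computed before the current match is added to the history,
        # so the feature never uses the result it is meant to predict.
        home_form.append(sum(history.get(home, [])[-window:]))
        away_form.append(sum(history.get(away, [])[-window:]))
        home_pts, away_pts = POINTS[result]
        history.setdefault(home, []).append(home_pts)
        history.setdefault(away, []).append(away_pts)
    out = matches.copy()
    out["HomeForm"] = home_form
    out["AwayForm"] = away_form
    return out

toy = pd.DataFrame({
    "HomeTeam": ["A", "B", "C", "A"],
    "AwayTeam": ["B", "C", "A", "B"],
    "FTR":      ["H", "D", "A", "D"],
})
print(add_form_features(toy))
```

Early-season matches get a form of 0 simply because no history exists yet; a real feature would need to handle that case explicitly, for example by carrying form over from the previous season.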

In [11]:
len(df) - df.count() 

Div              1
Date             1
HomeTeam         1
AwayTeam         1
FTHG             1
FTAG             1
FTR              1
HTHG             1
HTAG             1
HTR              1
Attendance    6462
Referee          1
HS               1
AS               1
HST              1
AST              1
HHW           6461
AHW           6461
HC               1
AC               1
HF               1
AF               1
HO            6461
AO            6461
HY               1
AY               1
HR               1
AR               1
HBP           6461
ABP           6461
              ... 
VCH           1901
VCD           1901
VCA           1901
Bb1X2         1901
BbMxH         1901
BbAvH         1901
BbMxD         1901
BbAvD         1901
BbMxA         1901
BbAvA         1901
BbOU          1901
BbMx>2.5      1901
BbAv>2.5      1901
BbMx<2.5      1901
BbAv<2.5      1901
BbAH          1911
BbAHh         1911
BbMxAHH       1911
BbAvAHH       1911
BbMxAHA       1911
BbAvAHA       1911
BSH         

This gives the number of NaN values per variable.
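The same counts can be obtained with isna().sum(), which may read more directly. The sketch below checks the equivalence on toy data and lists columns that are mostly empty; the 50% threshold is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FTHG": [4.0, 2.0, 1.0, 0.0],
    "Attendance": [20043.0, np.nan, np.nan, np.nan],
    "WHH": [2.1, 1.44, np.nan, 2.0],
})

# Two equivalent ways to count missing values per column.
missing_by_count = len(toy) - toy.count()
missing_by_isna = toy.isna().sum()

# Columns where more than half of the rows are missing.
share_missing = missing_by_isna / len(toy)
sparse_cols = share_missing[share_missing > 0.5].index.tolist()
print(missing_by_isna)
print(sparse_cols)
```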