## Data Collation of Fixtures and League Tables

This workbook combines all Premier League fixtures from 1993 to 2018 with the Premier League tables from the same period into one usable CSV.
Some columns have also been added that may help in finding links between league results: the difference in league position and the form, over the past 2, 3, 4 and 5 games, of the team whose result is being calculated.
The form of the opposition is added as well; a further column that could be added is the difference between the forms of the two sides.

Comments appear throughout the workbook, and the data is displayed regularly to show the changes that have been made. I have highlighted these where useful.

In [None]:
# Import the basic packages used in this worksheet:
# NumPy and pandas for the usual data analysis, and datetime to help turn the date strings into dates.
import numpy as np
import pandas as pd
from datetime import datetime

# Premier League Games

Start by importing the Premier League game data to see what we are working with. This is a basic set downloaded from Kaggle.

In [None]:
LeagueGames = pd.read_csv("EPL_Set.csv")
#LeagueGames is a DataFrame from pandas.
LeagueGames

Now we need to remove any data that is not useful. It is unlikely that half-time data will be relevant in building our model, so it will be removed.

The division should not be useful either, as all games are expected to be in the Premier League, but this is checked before deleting. If that is true, the check below will show only one unique value.

In [None]:
LeagueGames['Div'].unique()

From the above it can be seen that only one value exists in the Div column, so it can be removed.

The full-time goals will also be removed, as they seem unlikely to be useful.

In [None]:
LeagueGames = LeagueGames.drop(['Div','FTHG','FTAG','HTHG','HTAG','HTR'],axis=1)
LeagueGames.head()

The "Season" column will be converted into something more usable: seasonStart and seasonEnd.
This will allow an easier comparison with the final league table from the season before.


In [None]:
LeagueGames['seasonStart'] = LeagueGames['Season'].apply(lambda title: title.split('-')[0])
LeagueGames['seasonEnd'] = LeagueGames['Season'].apply(lambda title: title.split('-')[1])
LeagueGames['seasonEnd'] = LeagueGames['seasonEnd'].astype(int)
LeagueGames['seasonStart'] = LeagueGames['seasonStart'].astype(int)
LeagueGames['seasonEnd'] = np.where(LeagueGames['seasonEnd']<=20,2000+LeagueGames['seasonEnd'],
                                    1900+LeagueGames['seasonEnd'])

# A simple lambda expression splits the season string in two; the first and
# second parts give the start and end years respectively.
# Both are converted from strings to ints to allow easier subtraction and sorting later.
# The end year has two digits, so values up to 20 are mapped to 20xx and the rest to 19xx.

LeagueGames.tail(1)
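
As a quick sanity check on the conversion above, the sketch below applies the same split-and-convert logic to a few hand-written season labels. The `demo` frame and its labels are made up for illustration and assume the `Season` strings look like `1993-94`; they are not rows taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical season labels in the assumed 'YYYY-YY' format
demo = pd.DataFrame({'Season': ['1993-94', '1999-00', '2017-18']})

# Split once and keep both halves as integer columns
parts = demo['Season'].str.split('-', expand=True).astype(int)
demo['seasonStart'] = parts[0]

# Two-digit years up to 20 belong to the 2000s, the rest to the 1900s
demo['seasonEnd'] = np.where(parts[1] <= 20, 2000 + parts[1], 1900 + parts[1])

print(demo)
```

The `1999-00` label is the interesting case, since the two-digit end year wraps around the century.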

Because every result needs to be considered from each team's point of view, the table will be duplicated so that every game appears twice. The new columns will be team, home/away, opposition and result, alongside the season. At the moment each fixture appears once, so each team appears in it only as the home or the away side. After doubling, each team appears twice for the same fixture: once as the team whose result is being modelled and once as the opposition.

This makes it easier to analyse the data and each team's league position, both as the team and as the opposition.

In [None]:
LeagueGamesHome = LeagueGames.copy()  # copy so that LeagueGames itself is not modified
LeagueGamesHome['team'] = LeagueGamesHome['HomeTeam']
LeagueGamesHome['opposition'] = LeagueGamesHome['AwayTeam']
LeagueGamesHome = LeagueGamesHome.drop(['HomeTeam','AwayTeam'],axis=1)
LeagueGamesHome['result'] = np.where(LeagueGamesHome['FTR']=='H', 'Win',np.where(LeagueGamesHome['FTR']=='A','Loss','Draw'))
LeagueGamesHome['ho/Aw'] = 'Home'
LeagueGamesHome.tail()

In [None]:
LeagueGamesAway = LeagueGames.copy()  # copy so that LeagueGames itself is not modified
LeagueGamesAway['team'] = LeagueGamesAway['AwayTeam']
LeagueGamesAway['opposition'] = LeagueGamesAway['HomeTeam']
LeagueGamesAway = LeagueGamesAway.drop(['HomeTeam','AwayTeam'],axis=1)
LeagueGamesAway['result'] = np.where(LeagueGamesAway['FTR']=='H', 'Loss',np.where(LeagueGamesAway['FTR']=='A','Win','Draw'))
LeagueGamesAway['ho/Aw'] = 'Away'
LeagueGamesAway.head()

In [None]:
LeagueGamesCombo = pd.concat([LeagueGamesHome,LeagueGamesAway])

# This simply combines the two dataframes, which are reversed duplicates of each other, into one listing.
# The listing has the columns shown below.

LeagueGamesCombo.info()
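
To make the doubling step concrete, here is a minimal sketch on two invented fixtures. The helper `team_view` and the fixture rows are hypothetical and only mirror the column names used above; the results are not real match data.

```python
import numpy as np
import pandas as pd

# Two made-up fixtures; FTR is 'H' (home win), 'A' (away win) or 'D' (draw)
fixtures = pd.DataFrame({
    'HomeTeam': ['Arsenal', 'Leeds'],
    'AwayTeam': ['Chelsea', 'Everton'],
    'FTR': ['H', 'D'],
})

def team_view(df, team_col, opp_col, win_code, venue):
    """Return one row per fixture from the point of view of `team_col`."""
    loss_code = 'A' if win_code == 'H' else 'H'
    out = pd.DataFrame({
        'team': df[team_col],
        'opposition': df[opp_col],
        'ho/Aw': venue,
    })
    out['result'] = np.where(df['FTR'] == win_code, 'Win',
                             np.where(df['FTR'] == loss_code, 'Loss', 'Draw'))
    return out

# Stack the home view on top of the away view: every fixture now has two rows
combo = pd.concat([
    team_view(fixtures, 'HomeTeam', 'AwayTeam', 'H', 'Home'),
    team_view(fixtures, 'AwayTeam', 'HomeTeam', 'A', 'Away'),
], ignore_index=True)

print(combo)
```

Building each view as a fresh frame avoids writing new columns back into the source table.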

In [None]:
# Now attention turns to the date column which, as the dtype below shows, is currently just an object.
# It's easier to work with a datetime object, so it will be converted.

LeagueGamesCombo['Date'].tail()

In [None]:
LeagueGamesCombo['date'] =  pd.to_datetime(LeagueGamesCombo['Date'],dayfirst=True)
LeagueGamesCombo = LeagueGamesCombo.drop(['Date','FTR'],axis=1)

# Then the old date column and the FTR column are removed as they are obsolete

LeagueGamesCombo.info()
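
The `dayfirst=True` argument matters for ambiguous dates. The short sketch below uses two invented date strings, assumed to be in day/month/year order, to show the effect.

```python
import pandas as pd

# Made-up date strings in day/month/year order
raw = pd.Series(['03/04/2010', '25/12/2010'])

parsed = pd.to_datetime(raw, dayfirst=True)

# 03/04/2010 is read as 3 April rather than 4 March
print(parsed.dt.day.tolist(), parsed.dt.month.tolist())
```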

In [None]:
LeagueGamesCombo.columns=LeagueGamesCombo.columns.str.strip()
LeagueGamesCombo = LeagueGamesCombo.sort_values(['team','date'])
LeagueGamesCombo.info()

Next it is important to establish the form of each team. For the purposes of this workbook, form is the number of points gained in the previous 2, 3, 4 and 5 games. If an initial linear regression shows that one of these metrics is not useful, it will be discarded.

Currently this is calculated only for the team being focused on, and it runs across seasons, which may or may not be helpful.

First we must establish the points gained by the 'team' in each of their games, which should be straightforward. A running row number is also added to help keep track of each team's fixtures.

In [None]:
# Award 3 points for a win, 1 for a draw and 0 for a loss
LeagueGamesCombo['teamPoints'] = np.where(LeagueGamesCombo['result']=='Win',3,np.where(LeagueGamesCombo['result']=='Draw',1,0))
# Renumber the rows from 0 after sorting, discarding the old index
LeagueGamesCombo = LeagueGamesCombo.reset_index(drop=True)

In [None]:
LeagueGamesCombo = LeagueGamesCombo.reset_index()
LeagueGamesCombo.head()

In [None]:
LeagueGamesCombo

The functions below establish a team's form in the Premier League exclusively, measured as the number of points earned in the previous 'x' games. The window currently runs from 2 to 5 games because the method is not the most efficient, so larger windows would take longer.

In [None]:
def find_last2(x):
    if x == 0:
        return 0
    elif x == 1:
        return LeagueGamesCombo['teamPoints'][x-1]
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-1]:
        return 0
    elif LeagueGamesCombo['team'][x-2] != LeagueGamesCombo['team'][x]:
        return LeagueGamesCombo['teamPoints'][x-1]
    else:
        return LeagueGamesCombo['teamPoints'][x-1]+LeagueGamesCombo['teamPoints'][x-2]
    
def find_last3(x):
    if x == 0:
        return 0
    elif x == 1:
        return LeagueGamesCombo['teamPoints'][x-1]
    elif x == 2:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2]
    
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-1]:
        return 0
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-2]:
        return LeagueGamesCombo['teamPoints'][x-1]
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-3]:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2]
    
    else:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2] + LeagueGamesCombo['teamPoints'][x-3]
    
def find_last4(x):
    if x == 0:
        return 0
    elif x == 1:
        return LeagueGamesCombo['teamPoints'][x-1]
    elif x == 2:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2]
    elif x == 3:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2] + LeagueGamesCombo['teamPoints'][x-3]
    
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-1]:
        return 0
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-2]:
        return LeagueGamesCombo['teamPoints'][x-1]
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-3]:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2]
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-4]:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2] + LeagueGamesCombo['teamPoints'][x-3]
    
    else:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2] + LeagueGamesCombo['teamPoints'][x-3] + LeagueGamesCombo['teamPoints'][x-4]
    
def find_last5(x):
    if x == 0:
        return 0
    elif x == 1:
        return LeagueGamesCombo['teamPoints'][x-1]
    elif x == 2:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2]
    elif x == 3:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2] + LeagueGamesCombo['teamPoints'][x-3]
    elif x == 4:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2] + LeagueGamesCombo['teamPoints'][x-3] + LeagueGamesCombo['teamPoints'][x-4]
    
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-1]:
        return 0
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-2]:
        return LeagueGamesCombo['teamPoints'][x-1]
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-3]:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2]
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-4]:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2] + LeagueGamesCombo['teamPoints'][x-3]
    elif LeagueGamesCombo['team'][x] != LeagueGamesCombo['team'][x-5]:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2] + LeagueGamesCombo['teamPoints'][x-3] + LeagueGamesCombo['teamPoints'][x-4]
    
    else:
        return LeagueGamesCombo['teamPoints'][x-1] + LeagueGamesCombo['teamPoints'][x-2] + LeagueGamesCombo['teamPoints'][x-3] + LeagueGamesCombo['teamPoints'][x-4] + LeagueGamesCombo['teamPoints'][x-5]

In [None]:
LeagueGamesCombo['team_form_2games'] = LeagueGamesCombo['index'].apply(lambda x : find_last2(x))
LeagueGamesCombo['team_form_3games'] = LeagueGamesCombo['index'].apply(lambda x : find_last3(x))
LeagueGamesCombo['team_form_4games'] = LeagueGamesCombo['index'].apply(lambda x : find_last4(x))
LeagueGamesCombo['team_form_5games'] = LeagueGamesCombo['index'].apply(lambda x : find_last5(x))
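
Since the row-by-row lookups are slow, a possible alternative is to let pandas do the windowing per team. The sketch below is not the method used in this workbook; it is an assumed equivalent built from `groupby`, `shift` and `rolling`, run on invented points sequences. The names `games` and `rolling_form` are hypothetical.

```python
import pandas as pd

# Made-up points for two hypothetical teams, already sorted by team and date
games = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'teamPoints': [3, 1, 0, 3, 0, 3, 1],
})

def rolling_form(df, n):
    """Points from each team's previous n games, excluding the current game."""
    return (
        df.groupby('team')['teamPoints']
          # shift(1) drops the current game; rolling sums up to n earlier games
          .transform(lambda s: s.shift(1).rolling(n, min_periods=1).sum())
          .fillna(0)      # a team's first game has no history
          .round()
          .astype(int)
    )

for n in (2, 3):
    games[f'team_form_{n}games'] = rolling_form(games, n)

print(games)
```

Grouping by team keeps one side's results from leaking into the next side's form, which is what the team comparisons in the functions above guard against.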

It is also worth finding the form of the opposition leading into each game, to see whether that has an impact on the result. The difference in form between the two teams could turn out to be a more useful predictor.

To find the opposition form, all that is needed is to find the corresponding reverse fixture, where the "team" form has already been calculated.

In [None]:
def find_form(x,y):
    # Pick the team-form column that matches the window size y
    if y==2: finder = 'team_form_2games'
    elif y==3: finder = 'team_form_3games'
    elif y==4: finder = 'team_form_4games'
    elif y==5: finder = 'team_form_5games'
    else: raise ValueError('y must be an integer between 2 and 5')
    
    # Find the opposition's own row for the same date and read its form from there
    team = LeagueGamesCombo['opposition'][x]
    date = LeagueGamesCombo['date'][x]
    result = LeagueGamesCombo[(LeagueGamesCombo['team'] == team) & (LeagueGamesCombo['date'] == date)][finder].iat[0]
    return result

In [None]:
LeagueGamesCombo['opp_form_2games'] = LeagueGamesCombo['index'].apply(lambda x,y=2 : find_form(x,y))
LeagueGamesCombo['opp_form_3games'] = LeagueGamesCombo['index'].apply(lambda x,y=3 : find_form(x,y))
LeagueGamesCombo['opp_form_4games'] = LeagueGamesCombo['index'].apply(lambda x,y=4 : find_form(x,y))
LeagueGamesCombo['opp_form_5games'] = LeagueGamesCombo['index'].apply(lambda x,y=5 : find_form(x,y))
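
The same reverse-fixture lookup can also be expressed as a self-merge, which avoids filtering the whole frame once per row. This is a sketch on invented rows, assuming a team plays at most one game per date; the `lookup` frame and the `form_diff_2games` column are illustrative names, not part of the workbook.

```python
import pandas as pd

# Made-up rows: each fixture appears twice, once per team
games = pd.DataFrame({
    'team':       ['A', 'B', 'C', 'D'],
    'opposition': ['B', 'A', 'D', 'C'],
    'date': pd.to_datetime(['2010-08-14', '2010-08-14',
                            '2010-08-15', '2010-08-15']),
    'team_form_2games': [4, 1, 6, 0],
})

# The opponent's own row supplies its form, keyed by (team, date)
lookup = games[['team', 'date', 'team_form_2games']].rename(
    columns={'team': 'opposition', 'team_form_2games': 'opp_form_2games'}
)

games = games.merge(lookup, on=['opposition', 'date'], how='left')

# Positive values mean the team arrives in better form than its opponent
games['form_diff_2games'] = games['team_form_2games'] - games['opp_form_2games']

print(games)
```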

In [None]:
LeagueGamesCombo.head(5)

# Premier League Tables

Now that we have our league games in a usable format, it's time to add the league position in which each team finished the previous season. In addition, each club's form will need to be established.

In [None]:
LeagueTables = pd.read_csv("tables_1968_2019.csv")
LeagueTables.head()

Now we need to remove the features that are not useful. The season, team and position will be used, but not the other features on this first run through, so those columns are removed for ease of use.

In [None]:
LeagueTables = LeagueTables.drop(['name','p','w','d','l','f','a','gd','points'],axis=1)

The rows relating to seasons for which no game results are present should also be dropped, meaning all seasons before 92/93.

To do so, create new columns with the years the season started and ended.

In [None]:
LeagueTables['seasonStart'] = LeagueTables['season'].apply(lambda title: title.split('/')[0])
LeagueTables['seasonEnd'] = LeagueTables['season'].apply(lambda title: title.split('/')[1])
LeagueTables.head()

Finally, for ease of use, any data from before the 92/93 season can be discarded.

In [None]:
LeagueTables['seasonEnd'] = LeagueTables['seasonEnd'].astype(int)
LeagueTables = LeagueTables[LeagueTables['seasonEnd'] >= 1993]
LeagueTables.info()

In [None]:
LeagueTables

# Combine Results and league Position

It is now time to combine the game results with the league position each team finished in the previous year.

First it is important to consider whether all teams have the same names in each dataset, as they have come from different sources. The number of teams should be the same in both.

The cell below checks whether each source has the same number of teams and what their names are.

In [None]:
LTT = pd.Series(LeagueTables['team'].sort_values().unique())
LGT = pd.Series(LeagueGamesCombo['team'].sort_values().unique())
Check1 = pd.concat((LTT,LGT),axis=1)
Check1

From the above table we can see that the offending teams causing the differences are 'Leeds United' and 'West Ham United', which appear twice on the league table side. To filter them out, Leeds United will be renamed Leeds and West Ham United renamed West Ham in the LeagueTables dataframe.

On the league games side we can see that Middlesboro also appears twice, so this should also be filtered out in favour of Middlesbrough.

In [None]:
LeagueTables['team'] = np.where(LeagueTables['team'] == 'Leeds United', 'Leeds', LeagueTables['team'])
LeagueTables['team'] = np.where(LeagueTables['team'] == 'West Ham United', 'West Ham', LeagueTables['team'])
LeagueGamesCombo['team'] = np.where(LeagueGamesCombo['team'] == 'Middlesboro', 'Middlesbrough', LeagueGamesCombo['team'])
LeagueGamesCombo['opposition'] = np.where(LeagueGamesCombo['opposition'] == 'Middlesboro', 'Middlesbrough', LeagueGamesCombo['opposition'])
LeagueGamesCombo.info()
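
Rather than reading mismatches off a side-by-side table, set differences can list them directly. The sketch below uses short invented name lists standing in for the two sources; `table_teams` and `game_teams` are hypothetical names.

```python
import pandas as pd

# Made-up name lists standing in for the two sources
table_teams = pd.Series(['Arsenal', 'Leeds United', 'West Ham United'])
game_teams = pd.Series(['Arsenal', 'Leeds', 'West Ham'])

# Names that appear in one source but not in the other
only_in_tables = sorted(set(table_teams) - set(game_teams))
only_in_games = sorted(set(game_teams) - set(table_teams))

print(only_in_tables)
print(only_in_games)
```

Two empty lists would mean the names already agree.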

Now check that the team names are almost consistent by reusing the same approach as before.

In [None]:
LTT = pd.Series(LeagueTables['team'].sort_values().unique())
LGTt = pd.Series(LeagueGamesCombo['team'].sort_values().unique())
LGTo = pd.Series(LeagueGamesCombo['opposition'].sort_values().unique())
Check2 = pd.concat((LTT,LGTt,LGTo),axis=1)
Check2

Now change the league table team names so that they match those in the LeagueGamesCombo dataframe. This will make it easier later when looking up league positions.

In [None]:
check_dict = Check2.set_index(0)[1].to_dict()
LeagueTables = LeagueTables.replace(check_dict)

Now do a final check to ensure that all team names are aligned.

In [None]:
LTT = pd.Series(LeagueTables['team'].sort_values().unique())
LGTt = pd.Series(LeagueGamesCombo['team'].sort_values().unique())
LGTo = pd.Series(LeagueGamesCombo['opposition'].sort_values().unique())
Check3 = pd.concat((LTT,LGTt,LGTo),axis=1)
Check3['check'] = (Check3[0] == Check3[1]) & (Check3[0] == Check3[2])
Check3

Now it's important to use the league table from the previous season to guide the current season's results.
So for the results in the 93/94 season we need the table from 92/93.
For the tables we will therefore use the 'seasonEnd' column, and for the results the 'seasonStart' column.

To do this a pivot table will be made.

In [None]:
LeagueTables.head(3)

In [None]:
UnstackedLeagueTables = LeagueTables.pivot(index='team',columns='seasonEnd',values='pos').unstack().to_frame()
UnstackedLeagueTables.info()

In [None]:
LeagueGamesCombo.info()

In [None]:
LeagueGamesCombo1 = LeagueGamesCombo.merge(UnstackedLeagueTables, left_on=["seasonStart", "team"],
                                          right_on=['seasonEnd','team'],how='left')


In [None]:
LeagueGamesCombo1.info()

In [None]:
LeagueGamesCombo2 = LeagueGamesCombo1.merge(UnstackedLeagueTables, left_on=["seasonStart", "opposition"],
                                          right_on=['seasonEnd','team'],how='left')

In [None]:
LeagueGamesCombo2.info()

In [None]:
LeagueGamesCombo2 = LeagueGamesCombo2.rename(columns={"0_x": "py_T_Leag_Pos", "0_y": "py_O_Leag_Pos"})
LeagueGamesCombo2.info()

In [None]:
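# Teams with no league position for the previous season (for example, newly promoted sides)
# are left as NaN by the left merges; these are given a nominal position of 21.
LeagueGamesCombo2 = LeagueGamesCombo2.fillna(21)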
LeagueGamesCombo2.info()

A more useful measure than league position may be the difference in league position between the two sides, so this metric will also be included.

In [None]:
LeagueGamesCombo2['leagueDifference'] = LeagueGamesCombo2['py_T_Leag_Pos'] - LeagueGamesCombo2['py_O_Leag_Pos']
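
For reference, the whole previous-season lookup can be sketched end to end on a tiny invented table. This version uses a flat lookup frame with explicit column names instead of the pivot, so it is an assumed alternative rather than the approach taken above; `add_prev_pos`, `tables` and `games` are hypothetical.

```python
import pandas as pd

# Made-up final tables: 'seasonEnd' is the year a season finished
tables = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B'],
    'seasonEnd': [1993, 1993, 1994, 1994],
    'pos': [1, 2, 3, 1],
})

# Made-up games: 'seasonStart' is the year the season began
games = pd.DataFrame({
    'team': ['A', 'B', 'C'],
    'opposition': ['B', 'C', 'A'],
    'seasonStart': [1993, 1993, 1994],
})

def add_prev_pos(df, side, new_col):
    """Attach the previous season's finishing position for the `side` column."""
    # A table ending in year Y describes the season before one starting in Y
    lookup = tables.rename(
        columns={'team': side, 'seasonEnd': 'seasonStart', 'pos': new_col}
    )
    return df.merge(lookup, on=[side, 'seasonStart'], how='left')

games = add_prev_pos(games, 'team', 'py_T_Leag_Pos')
games = add_prev_pos(games, 'opposition', 'py_O_Leag_Pos')

# Teams absent from the previous table get the nominal position 21
pos_cols = ['py_T_Leag_Pos', 'py_O_Leag_Pos']
games[pos_cols] = games[pos_cols].fillna(21).astype(int)
games['leagueDifference'] = games['py_T_Leag_Pos'] - games['py_O_Leag_Pos']

print(games)
```

Filling only the two position columns keeps the placeholder value from spreading into unrelated columns that might also contain missing values.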

In [None]:
LeagueGamesCombo2.to_csv(r"C:\Users\Elliott\Documents\Python\PremierLeaguePredictor\combined_data.csv")

# Finally the dataframe is exported for use in another workbook, where the data analysis and logistic/linear regression can take place.