# Sportsbetting Project Bundesliga

This notebook develops a machine learning model to predict the outcome of upcoming Bundesliga games. The goal is to pass the model two team names (the home team and the away team) and get back the probability of each possible outcome of the game.

### Setup

In [1]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import sklearn

import tensorflow as tf
from tensorflow import keras

### Load Data

In [2]:
data1 = pd.read_csv('D1.csv')
data1.head()

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,...,AvgC<2.5,AHCh,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA
0,D1,18/09/2020,19:30,Bayern Munich,Schalke 04,8,0,H,3,0,...,4.34,-2.5,1.89,2.04,1.87,2.02,1.95,2.18,1.85,2.02
1,D1,19/09/2020,14:30,Ein Frankfurt,Bielefeld,1,1,D,0,0,...,2.33,-0.75,1.96,1.97,1.96,1.96,2.02,1.98,1.94,1.93
2,D1,19/09/2020,14:30,FC Koln,Hoffenheim,2,3,A,1,2,...,2.27,0.0,1.91,2.02,1.92,2.01,1.97,2.08,1.89,1.98
3,D1,19/09/2020,14:30,Stuttgart,Freiburg,2,3,A,0,2,...,2.33,-0.25,1.92,2.01,1.91,2.02,1.94,2.04,1.88,1.99
4,D1,19/09/2020,14:30,Union Berlin,Augsburg,1,3,A,0,1,...,1.71,-0.25,2.02,1.91,2.0,1.92,2.05,1.93,2.0,1.87


In [3]:
data2 = pd.read_csv('MD_10.csv', skiprows=1)
data2.head()

Unnamed: 0,Team,Rank,P,M,W,D,L,G,GA,GD,...,GD.1,Rank.2,P.2,M.2,W.2,D.2,L.2,G.2,GA.2,GD.2
0,Bayern Munich,1,23,10,7,2,1,34,16,18,...,14,2,12,5,4,0,1,13,9,4
1,Bayer Leverkusen,2,22,10,6,4,0,19,9,10,...,3,1,14,6,4,2,0,11,4,7
2,RB Leipzig,3,21,10,6,3,1,21,9,12,...,11,9,6,5,1,3,1,7,6,1
3,Borussia Dortmund,4,19,10,6,1,3,22,10,12,...,8,4,10,5,3,1,1,9,5,4
4,VfL Wolfsburg,5,18,10,4,6,0,16,10,6,...,4,7,7,5,1,4,0,7,5,2


First, we take care of the training features **X** contained in the datasets, starting with the selection of the important ones. The goal is to use as many informative features as possible to build a row vector for each team that holds its stats.

In [4]:
X1 = data1.iloc[:, 3:15] # Leaving only most important features
X1.head()

Unnamed: 0,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HS,AS,HST,AST
0,Bayern Munich,Schalke 04,8,0,H,3,0,H,22,5,12,1
1,Ein Frankfurt,Bielefeld,1,1,D,0,0,D,18,14,6,4
2,FC Koln,Hoffenheim,2,3,A,1,2,A,13,13,6,7
3,Stuttgart,Freiburg,2,3,A,0,2,A,22,7,7,6
4,Union Berlin,Augsburg,1,3,A,0,1,A,13,9,3,5
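
Selecting columns by position with `iloc` gives a different result as soon as the column order of the CSV changes. A minimal sketch of the same kind of selection by column name, assuming the column names shown above; the two-row `matches` frame and the name `feature_cols` are only stand-ins for illustration:

```python
import pandas as pd

# Hypothetical two-row stand-in for data1, using values from the rows above
matches = pd.DataFrame({
    "Div": ["D1", "D1"],
    "Date": ["18/09/2020", "19/09/2020"],
    "HomeTeam": ["Bayern Munich", "Ein Frankfurt"],
    "AwayTeam": ["Schalke 04", "Bielefeld"],
    "FTHG": [8, 1],
    "FTAG": [0, 1],
    "FTR": ["H", "D"],
    "HS": [22, 18],
    "AS": [5, 14],
})

# Name the wanted columns explicitly instead of relying on their position
feature_cols = ["HomeTeam", "AwayTeam", "FTHG", "FTAG", "HS", "AS"]
features = matches.loc[:, feature_cols]
print(features)
```

The list of names also documents which stats enter the model, which a slice such as `3:15` does not.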


### Abbreviation Definition

- **FTHG** = Full time home team goals
- **FTAG** = Full time away team goals
- **FTR** = Full time result
- **HTR** = Half time result
- **HTHG** = Half time home team goals
- **HTAG** = Half time away team goals
- **HS** = Home team shots
- **AS** = Away team shots
- **HST** = Home team shots on target
- **AST** = Away team shots on target

In [10]:
X1_clean = X1.drop(['HTR','FTR'], axis=1) # Dropping half time and full time result for training data
X1_clean[0:10]

Unnamed: 0,HomeTeam,AwayTeam,FTHG,FTAG,HTHG,HTAG,HS,AS,HST,AST
0,Bayern Munich,Schalke 04,8,0,3,0,22,5,12,1
1,Ein Frankfurt,Bielefeld,1,1,0,0,18,14,6,4
2,FC Koln,Hoffenheim,2,3,1,2,13,13,6,7
3,Stuttgart,Freiburg,2,3,0,2,22,7,7,6
4,Union Berlin,Augsburg,1,3,0,1,13,9,3,5
5,Werder Bremen,Hertha,1,4,0,2,17,13,7,6
6,Dortmund,M'gladbach,3,0,1,0,9,8,4,2
7,RB Leipzig,Mainz,3,1,2,0,23,8,10,1
8,Wolfsburg,Leverkusen,0,0,0,0,9,6,1,2
9,Hertha,Ein Frankfurt,1,3,0,2,12,10,6,3


In [11]:
X2_clean = data2.iloc[:, 0:9] # Leaving only important features
X2_clean[0:10]

Unnamed: 0,Team,Rank,P,M,W,D,L,G,GA
0,Bayern Munich,1,23,10,7,2,1,34,16
1,Bayer Leverkusen,2,22,10,6,4,0,19,9
2,RB Leipzig,3,21,10,6,3,1,21,9
3,Borussia Dortmund,4,19,10,6,1,3,22,10
4,VfL Wolfsburg,5,18,10,4,6,0,16,10
5,Union Berlin,6,16,10,4,4,2,22,14
6,M&ouml;nchengladbach,7,16,10,4,4,2,19,16
7,VfB Stuttgart,8,14,10,3,5,2,19,16
8,Eintracht Frankfurt,9,13,10,2,7,1,15,17
9,1899 Hoffenheim,10,12,10,3,3,4,18,17


### Integrate both Datasets
In this section, we combine the two datasets and arrange the features accordingly. The idea is that each team has its own row, including its rank, goals scored, etc., as well as the stats of its last games.

In [12]:
# First, we need to make sure the team names of both datasets match.
# Map the names used in the table data (data2) to the names used in the match data (data1).
name_map = {
    '1. FC K&ouml;ln': 'FC Koln',
    'VfL Wolfsburg': 'Wolfsburg',
    'M&ouml;nchengladbach': "M'gladbach",
    'Borussia Dortmund': 'Dortmund',
    'VfB Stuttgart': 'Stuttgart',
    'Eintracht Frankfurt': 'Ein Frankfurt',
    '1899 Hoffenheim': 'Hoffenheim',
    'FC Augsburg': 'Augsburg',
    'Hertha BSC Berlin': 'Hertha',
    'SC Freiburg': 'Freiburg',
    'Arminia Bielefeld': 'Bielefeld',
    'FSV Mainz 05': 'Mainz',
    'FC Schalke 04': 'Schalke 04',
    'Bayer Leverkusen': 'Leverkusen',
}

X2_clean = X2_clean.copy() # Work on a copy so the slice of data2 is not modified
X2_clean['Team'] = X2_clean['Team'].replace(name_map) # Names without an entry stay unchanged

X2_clean['Team']

0     Bayern Munich
1        Leverkusen
2        RB Leipzig
3          Dortmund
4         Wolfsburg
5      Union Berlin
6        M'gladbach
7         Stuttgart
8     Ein Frankfurt
9        Hoffenheim
10         Augsburg
11           Hertha
12    Werder Bremen
13         Freiburg
14          FC Koln
15        Bielefeld
16            Mainz
17       Schalke 04
Name: Team, dtype: object
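
A quick way to confirm that the renaming caught every team is to compare the two sets of names. The following is a minimal sketch; `match_teams` and `table_teams` are hypothetical stand-ins for `X1_clean['HomeTeam']` and `X2_clean['Team']`, with one name deliberately left unmapped:

```python
import pandas as pd

# Hypothetical stand-ins for X1_clean['HomeTeam'] and X2_clean['Team']
match_teams = pd.Series(["Bayern Munich", "FC Koln", "Stuttgart"])
table_teams = pd.Series(["Bayern Munich", "1. FC K&ouml;ln", "Stuttgart"])

# Names that appear in one dataset but not in the other
missing_in_table = set(match_teams) - set(table_teams)
missing_in_matches = set(table_teams) - set(match_teams)

print(missing_in_table)    # names from the matches with no table entry
print(missing_in_matches)  # names from the table with no match entry
```

Both sets are empty once every name has been mapped; any leftover entry points at a name that still needs a rule.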

In [13]:
#X = pd.DataFrame()
X1_clean_num = X1_clean.iloc[:,:].values # using numpy array
X1_clean_num[0:5]

array([['Bayern Munich', 'Schalke 04', 8, 0, 3, 0, 22, 5, 12, 1],
       ['Ein Frankfurt', 'Bielefeld', 1, 1, 0, 0, 18, 14, 6, 4],
       ['FC Koln', 'Hoffenheim', 2, 3, 1, 2, 13, 13, 6, 7],
       ['Stuttgart', 'Freiburg', 2, 3, 0, 2, 22, 7, 7, 6],
       ['Union Berlin', 'Augsburg', 1, 3, 0, 1, 13, 9, 3, 5]],
      dtype=object)

In [14]:
# Changing the categorical Team names to numerical values with ordinal encoder
#from sklearn.preprocessing import OrdinalEncoder

#ordinal_encoder = OrdinalEncoder()
#X_clean[:, (0,1)] = ordinal_encoder.fit_transform(X_clean[:, (0,1)])
#X_clean[:, (0,1)] = X_clean[:, (0,1)].astype(int)
#X_clean = X_clean[:, 2:] ## Dropping numerical values of teams
#X_clean[0:20]

### Calculating Home Team stats
In the following code, we sum up the total stats (goals, shots, ...) of each team in its home games and then calculate the per-game average of each feature.

In [15]:
X1_clean.head()

Unnamed: 0,HomeTeam,AwayTeam,FTHG,FTAG,HTHG,HTAG,HS,AS,HST,AST
0,Bayern Munich,Schalke 04,8,0,3,0,22,5,12,1
1,Ein Frankfurt,Bielefeld,1,1,0,0,18,14,6,4
2,FC Koln,Hoffenheim,2,3,1,2,13,13,6,7
3,Stuttgart,Freiburg,2,3,0,2,22,7,7,6
4,Union Berlin,Augsburg,1,3,0,1,13,9,3,5


##### Total home team stats for each Bundesliga team

In [43]:
teamnames = X2_clean['Team']
sum_FTHG = {key: 0 for key in teamnames} # Initialize dictionary (Team : home goals scored) with all values set to 0
sum_FTHGC = {key: 0 for key in teamnames} # Initialize dictionary (Team : home goals conceded)
sum_HST = {key: 0 for key in teamnames} # Initialize dictionary (Team : home shots on target)
sum_HSTC = {key: 0 for key in teamnames} # Initialize dictionary (Team : home shots on target conceded)
sum_HS = {key: 0 for key in teamnames} # Initialize dictionary (Team : home shots)
sum_HSC = {key: 0 for key in teamnames} # Initialize dictionary (Team : home shots conceded)

hometeams = X1_clean['HomeTeam']
FTHG = X1_clean['FTHG']
FTHGC = X1_clean['FTAG']
HST = X1_clean['HST']
HSTC = X1_clean['AST']
HS = X1_clean['HS']
HSC = X1_clean['AS']

homegames_count = {key: 0 for key in teamnames} # Counting total home games for each team to calculate the average later on

# Loop through all matches and add up each stat for the respective home team
for team in range(len(hometeams)):
    sum_FTHG[hometeams[team]] += FTHG[team] # Home team goals scored
    sum_FTHGC[hometeams[team]] += FTHGC[team] # Home team goals conceded
    sum_HST[hometeams[team]] += HST[team] # Home team shots on target
    sum_HSTC[hometeams[team]] += HSTC[team] # Home team shots on target conceded
    sum_HS[hometeams[team]] += HS[team] # Home team shots
    sum_HSC[hometeams[team]] += HSC[team] # Home team shots conceded
    homegames_count[hometeams[team]] += 1 # Increasing count of home games by 1
    #print(hometeams[team], sum_HST[hometeams[team]])

print("Total goals Home Team: ")
print(sum_FTHG)
print("\n")
print("Total goals conceded Home Team: ")
print(sum_FTHGC)
print("\n")
print("Total shots on Target at home for each Team: ")
print(sum_HST)
print("\n")
print("Total shots on Target conceded at home for each Team: ")
print(sum_HSTC)
print("\n")
print("Total shots at home for each Team: ")
print(sum_HS)
print("\n")
print("Total shots conceded at home for each Team: ")
print(sum_HSC)
print("\n")
print("Total Home games for each team: ")
print(homegames_count)

Total goals Home Team: 
{'Bayern Munich': 17, 'Leverkusen': 8, 'RB Leipzig': 12, 'Dortmund': 12, 'Wolfsburg': 4, 'Union Berlin': 11, "M'gladbach": 3, 'Stuttgart': 6, 'Ein Frankfurt': 4, 'Hoffenheim': 5, 'Augsburg': 5, 'Hertha': 2, 'Werder Bremen': 4, 'Freiburg': 4, 'FC Koln': 5, 'Bielefeld': 2, 'Mainz': 5, 'Schalke 04': 3}


Total goals conceded Home Team: 
{'Bayern Munich': 3, 'Leverkusen': 5, 'RB Leipzig': 2, 'Dortmund': 3, 'Wolfsburg': 2, 'Union Berlin': 4, "M'gladbach": 2, 'Stuttgart': 7, 'Ein Frankfurt': 3, 'Hoffenheim': 5, 'Augsburg': 6, 'Hertha': 6, 'Werder Bremen': 6, 'Freiburg': 6, 'FC Koln': 9, 'Bielefeld': 6, 'Mainz': 10, 'Schalke 04': 5}


Total shots on Target at home for each Team: 
{'Bayern Munich': 31, 'Leverkusen': 16, 'RB Leipzig': 32, 'Dortmund': 23, 'Wolfsburg': 19, 'Union Berlin': 25, "M'gladbach": 11, 'Stuttgart': 18, 'Ein Frankfurt': 17, 'Hoffenheim': 16, 'Augsburg': 15, 'Hertha': 15, 'Werder Bremen': 15, 'Freiburg': 16, 'FC Koln': 16, 'Bielefeld': 10, 'Mainz': 1

##### Average stats per game as a home team for each Bundesliga team

In [45]:
avg_FTHG = {key: 0 for key in teamnames} # Initialize dictionary (Team : average FTHG)
avg_FTHGC = {key: 0 for key in teamnames} # Initialize dictionary (Team : average FTHGC)
avg_HST = {key: 0 for key in teamnames} # Initialize dictionary (Team : average HST)
avg_HSTC = {key: 0 for key in teamnames} # Initialize dictionary (Team : average HSTC)
avg_HS = {key: 0 for key in teamnames} # Initialize dictionary (Team : average HS)
avg_HSC = {key: 0 for key in teamnames} # Initialize dictionary (Team : average HSC)

for team in range(len(teamnames)):
    avg_FTHG[teamnames[team]] = sum_FTHG[teamnames[team]] / homegames_count[teamnames[team]]
    avg_FTHGC[teamnames[team]] = sum_FTHGC[teamnames[team]] / homegames_count[teamnames[team]]
    avg_HST[teamnames[team]] = sum_HST[teamnames[team]] / homegames_count[teamnames[team]]
    avg_HSTC[teamnames[team]] = sum_HSTC[teamnames[team]] / homegames_count[teamnames[team]]
    avg_HS[teamnames[team]] = sum_HS[teamnames[team]] / homegames_count[teamnames[team]]
    avg_HSC[teamnames[team]] = sum_HSC[teamnames[team]] / homegames_count[teamnames[team]]

print("Average goals scored as a Home team")
rounded_FTHG = {key : round(avg_FTHG[key], 2) for key in avg_FTHG} # Round numbers
print(rounded_FTHG)
print('\n')
print("Average goals conceded as a Home team")
rounded_FTHGC = {key : round(avg_FTHGC[key], 2) for key in avg_FTHGC} # Round numbers
print(rounded_FTHGC)
print('\n')
print("Average shots on target as a Home team")
rounded_HST = {key : round(avg_HST[key], 2) for key in avg_HST} # Round numbers
print(rounded_HST)
print('\n')
print("Average shots on target conceded as a Home team")
rounded_HSTC = {key : round(avg_HSTC[key], 2) for key in avg_HSTC} # Round numbers
print(rounded_HSTC)
print('\n')
print("Average shots as a Home team")
rounded_HS = {key : round(avg_HS[key], 2) for key in avg_HS} # Round numbers
print(rounded_HS)
print('\n')
print("Average shots conceded as a Home team")
rounded_HSC = {key : round(avg_HSC[key], 2) for key in avg_HSC} # Round numbers
print(rounded_HSC)

Average goals scored as a Home team
{'Bayern Munich': 5.67, 'Leverkusen': 2.67, 'RB Leipzig': 3.0, 'Dortmund': 3.0, 'Wolfsburg': 1.0, 'Union Berlin': 2.75, "M'gladbach": 1.0, 'Stuttgart': 1.5, 'Ein Frankfurt': 1.33, 'Hoffenheim': 1.67, 'Augsburg': 1.25, 'Hertha': 0.67, 'Werder Bremen': 1.0, 'Freiburg': 1.33, 'FC Koln': 1.25, 'Bielefeld': 0.67, 'Mainz': 1.25, 'Schalke 04': 1.0}


Average goals conceded as a Home team
{'Bayern Munich': 1.0, 'Leverkusen': 1.67, 'RB Leipzig': 0.5, 'Dortmund': 0.75, 'Wolfsburg': 0.5, 'Union Berlin': 1.0, "M'gladbach": 0.67, 'Stuttgart': 1.75, 'Ein Frankfurt': 1.0, 'Hoffenheim': 1.67, 'Augsburg': 1.5, 'Hertha': 2.0, 'Werder Bremen': 1.5, 'Freiburg': 2.0, 'FC Koln': 2.25, 'Bielefeld': 2.0, 'Mainz': 2.5, 'Schalke 04': 1.67}


Average shots on target as a Home team
{'Bayern Munich': 10.33, 'Leverkusen': 5.33, 'RB Leipzig': 8.0, 'Dortmund': 5.75, 'Wolfsburg': 4.75, 'Union Berlin': 6.25, "M'gladbach": 3.67, 'Stuttgart': 4.5, 'Ein Frankfurt': 5.67, 'Hoffenheim': 5
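
The per-team totals and averages built up with dictionaries and loops can also be computed with a pandas `groupby`. This is a minimal sketch of that alternative, not the approach used in this notebook; the team names and numbers are made up, only the column names follow `X1_clean`:

```python
import pandas as pd

# Made-up matches for illustration; column names follow X1_clean
matches = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team A", "Team B"],
    "AwayTeam": ["Team B", "Team A", "Team B", "Team A"],
    "FTHG": [2, 1, 4, 0],
    "FTAG": [1, 1, 0, 3],
    "HST": [6, 3, 8, 2],
    "AST": [4, 5, 1, 7],
})

stat_cols = ["FTHG", "FTAG", "HST", "AST"]

# Totals and per-game averages of every stat, grouped by home team
home_totals = matches.groupby("HomeTeam")[stat_cols].sum()
home_avgs = matches.groupby("HomeTeam")[stat_cols].mean().round(2)

print(home_totals)
print(home_avgs)
```

In the home-team view, `FTAG` and `AST` play the role of goals and shots on target conceded. Grouping by `AwayTeam` instead yields the away stats with the roles of the columns swapped.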

### Calculating Away Team stats
In the following code, we sum up the total stats (goals, shots, ...) of each team in its away games and then calculate the per-game average of each feature.

In [48]:
sum_FTAG = {key: 0 for key in teamnames} #Initialize dictionary (Team : AG) and set values to 0
sum_FTAGC = {key: 0 for key in teamnames} #Initialize dictionary (Team : AGC) and set values to 0
sum_AST = {key: 0 for key in teamnames} #Initialize dictionary (Team : AST) and set values to 0
sum_ASTC = {key: 0 for key in teamnames} #Initialize dictionary (Team : ASTC) and set values to 0
sum_AS = {key: 0 for key in teamnames} #Initialize dictionary (Team : AS) and set values to 0
sum_ASC = {key: 0 for key in teamnames} #Initialize dictionary (Team : ASC) and set values to 0

awayteams = X1_clean['AwayTeam']
FTAG = X1_clean['FTAG']
FTAGC = X1_clean['FTHG']
AST = X1_clean['AST']
ASTC = X1_clean['HST']
AS = X1_clean['AS']
ASC = X1_clean['HS']

awaygames_count = {key: 0 for key in teamnames} # Counting total away games for each team to calculate the average later on

# Loop through all matches and add up each stat for the respective away team
for team in range(len(awayteams)):
    sum_FTAG[awayteams[team]] += FTAG[team] # Away team goals scored
    sum_FTAGC[awayteams[team]] += FTAGC[team] # Away team goals conceded
    sum_AST[awayteams[team]] += AST[team] # Away team shots on target
    sum_ASTC[awayteams[team]] += ASTC[team] # Away team shots on target conceded
    sum_AS[awayteams[team]] += AS[team] # Away team shots
    sum_ASC[awayteams[team]] += ASC[team] # Away team shots conceded
    awaygames_count[awayteams[team]] += 1 # Increasing count of away games by 1

print("Total goals scored away for each Team: ")
print(sum_FTAG)
print("\n")
print("Total goals conceded away for each Team: ")
print(sum_FTAGC)
print("\n")
print("Total shots on Target away for each Team: ")
print(sum_AST)
print("\n")
print("Total shots on Target conceded away for each Team: ")
print(sum_ASTC)
print("\n")
print("Total shots away for each Team: ")
print(sum_AS)
print("\n")
print("Total shots conceded away for each Team: ")
print(sum_ASC)
print("\n")
print("Total Away games for each team: ")
print(awaygames_count)

Total goals scored away for each Team: 
{'Bayern Munich': 10, 'Leverkusen': 6, 'RB Leipzig': 3, 'Dortmund': 3, 'Wolfsburg': 3, 'Union Berlin': 5, "M'gladbach": 9, 'Stuttgart': 7, 'Ein Frankfurt': 6, 'Hoffenheim': 6, 'Augsburg': 4, 'Hertha': 11, 'Werder Bremen': 5, 'Freiburg': 4, 'FC Koln': 2, 'Bielefeld': 2, 'Mainz': 2, 'Schalke 04': 2}


Total goals conceded away for each Team: 
{'Bayern Munich': 8, 'Leverkusen': 3, 'RB Leipzig': 2, 'Dortmund': 2, 'Wolfsburg': 3, 'Union Berlin': 3, "M'gladbach": 10, 'Stuttgart': 2, 'Ein Frankfurt': 9, 'Hoffenheim': 7, 'Augsburg': 4, 'Hertha': 7, 'Werder Bremen': 3, 'Freiburg': 10, 'FC Koln': 3, 'Bielefeld': 9, 'Mainz': 10, 'Schalke 04': 17}


Total shots on Target away for each Team: 
{'Bayern Munich': 25, 'Leverkusen': 20, 'RB Leipzig': 11, 'Dortmund': 22, 'Wolfsburg': 11, 'Union Berlin': 15, "M'gladbach": 31, 'Stuttgart': 14, 'Ein Frankfurt': 17, 'Hoffenheim': 22, 'Augsburg': 9, 'Hertha': 20, 'Werder Bremen': 16, 'Freiburg': 15, 'FC Koln': 7, 'Biele

##### Average stats per game as an away team for each Bundesliga team

In [49]:
avg_FTAG = {key: 0 for key in teamnames} #Initialize dictionary (Team : average FTAG)
avg_FTAGC = {key: 0 for key in teamnames} #Initialize dictionary (Team : average FTAGC)
avg_AST = {key: 0 for key in teamnames} #Initialize dictionary (Team : average AST)
avg_ASTC = {key: 0 for key in teamnames} #Initialize dictionary (Team : average ASTC)
avg_AS = {key: 0 for key in teamnames} #Initialize dictionary (Team : average AS)
avg_ASC = {key: 0 for key in teamnames} #Initialize dictionary (Team : average ASC)


for team in range(len(teamnames)):
    avg_FTAG[teamnames[team]] = sum_FTAG[teamnames[team]] / awaygames_count[teamnames[team]]
    avg_FTAGC[teamnames[team]] = sum_FTAGC[teamnames[team]] / awaygames_count[teamnames[team]]
    avg_AST[teamnames[team]] = sum_AST[teamnames[team]] / awaygames_count[teamnames[team]]
    avg_ASTC[teamnames[team]] = sum_ASTC[teamnames[team]] / awaygames_count[teamnames[team]]
    avg_AS[teamnames[team]] = sum_AS[teamnames[team]] / awaygames_count[teamnames[team]]
    avg_ASC[teamnames[team]] = sum_ASC[teamnames[team]] / awaygames_count[teamnames[team]]
    
print("Average goals scored as an Away team")
rounded_FTAG = {key : round(avg_FTAG[key], 2) for key in avg_FTAG} # Round numbers
print(rounded_FTAG)
print('\n')
print("Average goals conceded as an Away team")
rounded_FTAGC = {key : round(avg_FTAGC[key], 2) for key in avg_FTAGC} # Round numbers
print(rounded_FTAGC)
print('\n')
print("Average shots on target as an Away team")
rounded_AST = {key : round(avg_AST[key], 2) for key in avg_AST} # Round numbers
print(rounded_AST)
print('\n')
print("Average shots on target conceded as an Away team")
rounded_ASTC = {key : round(avg_ASTC[key], 2) for key in avg_ASTC} # Round numbers
print(rounded_ASTC)
print('\n')
print("Average shots as an Away team")
rounded_AS = {key : round(avg_AS[key], 2) for key in avg_AS} # Round numbers
print(rounded_AS)
print('\n')
print("Average shots conceded as an Away team")
rounded_ASC = {key : round(avg_ASC[key], 2) for key in avg_ASC} # Round numbers
print(rounded_ASC)

Average goals scored as an Away team
{'Bayern Munich': 2.5, 'Leverkusen': 1.5, 'RB Leipzig': 1.0, 'Dortmund': 1.0, 'Wolfsburg': 1.0, 'Union Berlin': 1.67, "M'gladbach": 2.25, 'Stuttgart': 2.33, 'Ein Frankfurt': 1.5, 'Hoffenheim': 1.5, 'Augsburg': 1.33, 'Hertha': 2.75, 'Werder Bremen': 1.67, 'Freiburg': 1.0, 'FC Koln': 0.67, 'Bielefeld': 0.5, 'Mainz': 0.67, 'Schalke 04': 0.5}


Average goals conceded as an Away team
{'Bayern Munich': 2.0, 'Leverkusen': 0.75, 'RB Leipzig': 0.67, 'Dortmund': 0.67, 'Wolfsburg': 1.0, 'Union Berlin': 1.0, "M'gladbach": 2.5, 'Stuttgart': 0.67, 'Ein Frankfurt': 2.25, 'Hoffenheim': 1.75, 'Augsburg': 1.33, 'Hertha': 1.75, 'Werder Bremen': 1.0, 'Freiburg': 2.5, 'FC Koln': 1.0, 'Bielefeld': 2.25, 'Mainz': 3.33, 'Schalke 04': 4.25}


Average shots on target as an Away team
{'Bayern Munich': 6.25, 'Leverkusen': 5.0, 'RB Leipzig': 3.67, 'Dortmund': 7.33, 'Wolfsburg': 3.67, 'Union Berlin': 5.0, "M'gladbach": 7.75, 'Stuttgart': 4.67, 'Ein Frankfurt': 4.25, 'Hoffenheim'
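
With the home and away averages available as dictionaries keyed by team name, the row-per-team layout described in the integration section can be assembled by joining them with the league table. A minimal sketch under assumptions: the values are made up, `table` stands in for `X2_clean`, and `team_features` is a hypothetical name:

```python
import pandas as pd

# Made-up per-team averages; in the notebook these would come from
# dictionaries such as avg_FTHG and avg_FTAG
avg_home_goals = {"Team A": 3.0, "Team B": 0.5}
avg_away_goals = {"Team A": 2.0, "Team B": 1.0}

# Made-up stand-in for the league table X2_clean
table = pd.DataFrame({"Team": ["Team A", "Team B"], "Rank": [1, 2], "P": [9, 3]})

# Dictionaries keyed by team name become columns indexed by team
averages = pd.DataFrame({
    "avg_FTHG": avg_home_goals,
    "avg_FTAG": avg_away_goals,
})

# One row per team: table stats joined with the averages on the team name
team_features = table.set_index("Team").join(averages)
print(team_features)
```

The join only lines up if the team names are identical in both sources, which is what the renaming step earlier takes care of.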

The label **y** is simply the outcome of the game, i.e. the full time result **FTR**.

In [31]:
y = data1[['FTR']] # Full time result from the match data
y.head()

Unnamed: 0,FTR
0,H
1,D
2,A
3,A
4,A


#### Using an Ordinal Encoder instead of a One-Hot Encoder to ensure a one-dimensional target shape

In [11]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
y_encoded = ordinal_encoder.fit_transform(y)
y_encoded[0:10] # Categories are sorted alphabetically (A, D, H): Home win = 2, Draw = 1, Away win = 0

array([[2.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [2.],
       [2.],
       [1.],
       [0.]])
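
The mapping between result letters and codes does not have to be remembered, since the fitted encoder stores it. A minimal sketch using the results of the first five matches shown earlier; the variable names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Results of the first five matches shown above
results = np.array([["H"], ["D"], ["A"], ["A"], ["A"]])

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(results)

# categories_ holds the sorted labels; a label's position is its code
print(encoder.categories_[0])
print(encoded.ravel())

# Map codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel())
```

`inverse_transform` is handy later on for turning predicted codes back into `H`, `D` and `A`.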

### Splitting Data into Train and Test Set

In [12]:
from sklearn.model_selection import train_test_split

X_clean = X1_clean_num[:, 2:] # Dropping the team names, keeping the eight numeric match stats
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_encoded, test_size=0.2)
X_train[0:10]

array([[1, 1, 1, 1, 22, 8, 7, 4],
       [2, 2, 2, 0, 16, 15, 5, 8],
       [1, 1, 1, 1, 5, 15, 2, 6],
       [1, 1, 1, 1, 21, 8, 7, 2],
       [1, 1, 0, 1, 9, 16, 5, 4],
       [1, 3, 0, 1, 13, 9, 3, 5],
       [1, 0, 0, 0, 7, 12, 3, 2],
       [2, 2, 2, 1, 14, 22, 7, 3],
       [1, 1, 0, 0, 14, 15, 5, 2],
       [3, 0, 1, 0, 15, 8, 7, 3]], dtype=object)
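
`train_test_split` shuffles the rows by default, so the split differs between runs and the test set mixes early and late matches. A minimal sketch of two options, either fixing the seed with `random_state` or keeping the chronological order with `shuffle=False`. The data is synthetic and only mirrors the 63 matches with 8 features used here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 63 matches in chronological order, 8 features each
X = np.arange(63 * 8).reshape(63, 8)
y = np.arange(63) % 3

# Reproducible random split: the same rows end up in the test set on every run
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Chronological split: the last 13 matches form the test set
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(X, y, test_size=0.2, shuffle=False)

print(X_tr.shape, X_te.shape)
print(X_tr_c.shape, X_te_c.shape)
```

Which of the two fits better depends on whether the evaluation should imitate predicting future games from past ones.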

### Further Data Preprocessing

In [13]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [14]:
X_train.shape

(50, 8)

In [15]:
print(X_train[0:5])

[[-0.56926044 -0.38313051  0.27854301  0.3721042   1.52703709 -0.77673734
   0.80819019 -0.14071951]
 [ 0.17976646  0.41505806  1.43913887 -0.86824314  0.43888715  0.75055518
  -0.11020775  1.26647558]
 [-0.56926044 -0.38313051  0.27854301  0.3721042  -1.55605442  0.75055518
  -1.48780468  0.56287804]
 [-0.56926044 -0.38313051  0.27854301  0.3721042   1.34567877 -0.77673734
   0.80819019 -0.84431705]
 [-0.56926044 -0.38313051 -0.88205286  0.3721042  -0.83062113  0.96873983
  -0.11020775 -0.14071951]]
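
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. A minimal sketch that checks this by hand on synthetic data with the same shapes as above; all names and values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the match stats: 50 training and 13 test rows
X_train_raw = rng.integers(0, 25, size=(50, 8)).astype(float)
X_test_raw = rng.integers(0, 25, size=(13, 8)).astype(float)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_raw)  # learns mean and std from the training set
X_test_std = scaler.transform(X_test_raw)        # reuses the training statistics

# The same result by hand: subtract the training mean, divide by the training std
manual = (X_test_raw - X_train_raw.mean(axis=0)) / X_train_raw.std(axis=0)

print(X_train_std.mean(axis=0).round(6))
print(X_train_std.std(axis=0).round(6))
print(np.allclose(manual, X_test_std))
```

After scaling, every training column has mean 0 and standard deviation 1, while the test columns only come close to these values.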


In [16]:
y_test[0:5]

array([[2.],
       [2.],
       [1.],
       [0.],
       [1.]])

# Training a Logistic regression Model for game prediction

In [23]:
from sklearn.linear_model import LogisticRegression

reg_clf = LogisticRegression(random_state=42)
reg_clf.fit(X_train, y_train.ravel())

LogisticRegression(random_state=42)
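
Since the stated goal is a probability for each outcome rather than a single predicted class, `predict_proba` is the relevant method of the fitted classifier. A minimal sketch on synthetic data; `X_demo` and `y_demo` are stand-ins for the scaled features and encoded labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Synthetic stand-in for the scaled features and encoded labels (0, 1, 2)
X_demo = rng.normal(size=(60, 8))
y_demo = np.arange(60) % 3

clf = LogisticRegression(random_state=42)
clf.fit(X_demo, y_demo)

# One row per match, one column per class, in the order given by clf.classes_
proba = clf.predict_proba(X_demo[:3])
print(clf.classes_)
print(proba.round(3))
```

With the encoding used in this notebook, the three columns correspond to away win, draw and home win.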

### Evaluation of training model

In [24]:
## After further research: cross-validation is left out here, because standard k-fold splitting ignores the chronological order of the matches, so later games would be used to predict earlier ones
#from sklearn.model_selection import cross_val_score
#cross_val_score(reg_clf, X_clean, y_encoded, cv=3, scoring='accuracy')
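
If a time-aware evaluation is wanted instead, scikit-learn's `TimeSeriesSplit` is one option. It is not used in this notebook; the sketch below only illustrates how its folds keep the match order, using synthetic indices for 63 matches:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in: indices of 63 matches in chronological order
X = np.arange(63).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    # Every training index precedes every test index
    print(train_idx.max(), test_idx.min(), test_idx.max())
```

A splitter like this can be passed as the `cv` argument of `cross_val_score`, provided the rows are sorted by match date.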

In [25]:
y_pred = reg_clf.predict(X_test)
y_pred

array([2., 2., 0., 0., 1., 2., 0., 1., 0., 1., 1., 1., 1.])

### Evaluation score of Prediction from Logistic Regression model

In [26]:
from sklearn.metrics import f1_score

f1_score(y_test, y_pred, average='micro')

0.9230769230769231

In [28]:
comparison = [y_pred, y_test.ravel()]
comparison

[array([2., 2., 0., 0., 1., 2., 0., 1., 0., 1., 1., 1., 1.]),
 array([2., 2., 1., 0., 1., 2., 0., 1., 0., 1., 1., 1., 1.])]
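
For a single-label multiclass problem, the micro-averaged F1 score equals plain accuracy, so the score above is simply 12 correct predictions out of 13. A minimal sketch that reproduces this from the two arrays printed above and adds a confusion matrix to show where the one error lies:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Predictions and true labels as printed above
y_pred = np.array([2, 2, 0, 0, 1, 2, 0, 1, 0, 1, 1, 1, 1])
y_true = np.array([2, 2, 1, 0, 1, 2, 0, 1, 0, 1, 1, 1, 1])

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")
print(acc, f1_micro)  # both are 12 correct out of 13

# Rows are true classes, columns are predicted classes (0 = away, 1 = draw, 2 = home)
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The single off-diagonal entry is a draw that was predicted as an away win.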

# Train a NN for Game predictions

In [21]:
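model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=X_train.shape[1:]), # Shape of a single sample, without the batch dimension
    keras.layers.Dense(10, activation="sigmoid"),
    keras.layers.Dense(3, activation="softmax") # One output per class: away win, draw, home win
])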

In [22]:
model

<tensorflow.python.keras.engine.sequential.Sequential at 0x7ffe6b475040>
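
The model above is only defined, not yet compiled or trained. To illustrate what its softmax output layer computes, here is a minimal NumPy sketch of softmax together with the sparse categorical cross-entropy loss that is commonly paired with integer labels. The choice of loss is an assumption, not something this notebook specifies, and the scores are made up:

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability; the result is unchanged
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Made-up raw scores of the last layer for two matches and three classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)
print(probs.round(3))

# Sparse categorical cross-entropy: negative log-probability of the true class
y_true = np.array([0, 2])
loss = -np.log(probs[np.arange(len(y_true)), y_true]).mean()
print(round(float(loss), 3))
```

Each row of `probs` sums to one, so it can be read as the predicted probabilities of away win, draw and home win.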