# Betting strategies 
# _English Premier League Match Results & Odds Dataset_

## I. Introduction

> New to sports betting, or looking to implement a new strategy? This notebook will help you understand this world and provide a **betting strategy** that you can apply on your own.

In this report, we take the example of football, one of the sports that attracts the most bets. More precisely, we concentrate on **English Premier League match results**, because the large number of matches allows us to build and test a strategy.

### Betting vocabulary
First, let's set up the vocabulary. When betting, _odds_ are attached to each bet you make. Betting odds tell you how likely an event is considered to be, and they determine how much money you win if your bet comes true.

> For example, suppose you are on a betting website. If the odds for "Home team win" are 1.40 and you stake 10€, then if the Home team actually wins (so your prediction comes true), you earn 4€ and get back your initial 10€.
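
The arithmetic of this example can be sketched in a few lines of Python. The function name `decimal_odds_payout` is hypothetical (it is not part of the notebook), and the sketch assumes decimal odds, as in the example above.

```python
def decimal_odds_payout(stake, odds):
    """Return (profit, total_returned) for a winning bet at decimal odds."""
    total_returned = stake * odds      # the stake is included in the payout
    profit = total_returned - stake    # net gain on top of the stake
    return profit, total_returned

# Values from the example above: 10 EUR staked at odds of 1.40
profit, total = decimal_odds_payout(stake=10, odds=1.40)
print(f"profit: {profit:.2f} EUR, total returned: {total:.2f} EUR")
```

If the bet loses, the stake is simply lost, so the profit is `-stake`.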

It is possible to bet on different types of outcomes before a match. Here, we only consider bets on the match result (Home team wins, Away team wins, Draw).


## II. Dataset & Variables

The dataset we chose groups the results of the English Premier League matches **from 2008 to 2019**. It was taken from the following website:
We also added some features to this dataset, taken from this website:

The available variables describe the matches: the match's **ID**, its **date and time**, the **Home team**, the **Away team**, the final **number of goals**, the **match result** (Home Team win, Away Team win or Draw), each **team's ranking** in the previous season (static rank at the end of the past season), the **referee**, the number of **shots** and the number of **fouls**. The remaining variables are not match statistics: they are **the odds set before the match** by **different betting websites** on the **final result** (Home Team win, Away Team win, Draw).

## III. Goal of the project & models

The goal of the project is to build a model that predicts one of the following outputs: Home Team win, Away Team win or Draw.
Of course, the model will not predict the output perfectly, but knowing which match result is the most likely can help to place winning bets. This will be done through **classification** models.

We decided to build two models, taking into account different independent variables:
- In the first, we try to predict a match result based on the **betting websites' predictions**, which are reflected **in the odds** the websites set for each possible result.

- In the second, we use the **Match Statistics**: the rank of each team, the number of shots/fouls, the number of goals, etc.

To obtain these predictions, we will also try two classification models: **Decision Trees** and **Logistic Regressions**.

> In the end, we will evaluate the relevance of each classification model, and also determine **which independent variables** (Odds or Match Statistics) **best explain the Match Result**.

## IV. Methodology

1. First, we will import the packages needed for this analysis.
2. Then, we will load and clean the data, and add the variables necessary to complete the dataset.
3. Some variables will be transformed so that we can use them properly.
4. Next, we will create scatter plots showing the distribution of realised vs. unrealised predictions (for the "Home win" odds, for example, the outcome is 0 when the Home team did not win and 1 when it won).
5. We will then create the classification models (logistic regressions, then decision trees). This will be done for each match result, once using the match statistics and once using the odds.
6. We will also evaluate the accuracy of the models and create confusion matrices to visualise the results.
7. Finally, we will draw the conclusions that follow from these classifications and methodologies.

> To close, we will analyse our models and think about how we could improve them and how they could be useful for business.


## 1. Packages

In [None]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import collections  as mc
%load_ext autoreload
%autoreload 2
import pandas as pd 
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
sns.set_style("white")

## 2. The dataset

The following table shows the first 10 rows of the dataset. It was loaded from a CSV file, and we decided to name it **_data_**.

In [None]:
# Load the match dataset (the file name indicates the 2008 to 2018 seasons)
data = pd.read_csv('https://raw.githubusercontent.com/abdul232/DMML_Team_Rolex/master/data/England_2008_2018_Premiere_League_Final.csv',sep=";")

#data = data.sort_values(by="Match_ID",ascending=True)
# view the first 10 rows 
data.head(10)


In [None]:
data.shape

As we can see, the data contains some empty (_NaN_) columns due to a format change, so we have to remove them.

In [None]:
# Drop the empty columns "Unnamed: 49" to "Unnamed: 69"
unnamed_cols = [f"Unnamed: {i}" for i in range(49, 70)]
data = data.drop(unnamed_cols, axis=1)

# The dataset now has a new shape
data.shape

We also saw that some rows contained _NaN_ values, so we removed them too.

In [None]:
data.dropna(inplace=True)
data.shape

Finally, this dataset (without the _NaN_ values) has **4'176** rows and **49** columns.


## 3. Changing variables types

Now, let's set the **types of the variables**. We will convert some to int (integer) and the date to datetime, while the others will remain objects.

In [None]:
data.dtypes

In [None]:
# Convert the match ID to an integer
data['Match_ID'] = data.Match_ID.astype(int)

In [None]:
# Convert the match-statistics columns to integers
int_cols = ['Home Team Goals', 'Away Team Goals', 'Home Team Shots', 'Away Team Shots', 'Home Team Shots on Target', 'Away Team Shots on Target', 'Home Fouls Committed', 'Away Fouls Committed', 'Home Corners', 'Away Corners', 'Home Yellow Cards', 'Away Yellow Cards', 'Home Red Cards', 'Away Red Cards']
data[int_cols] = data[int_cols].astype(int)

In [None]:
# Convert the date column to datetime
data['Date'] = pd.to_datetime(data['Date'])


In [None]:
data.dtypes


## 4. Tendencies through scatter plots

A match has 3 possible results (the Home Team wins, the Away Team wins or a Draw). Each of these results either happens or does not happen, which calls for a binary variable for each match result (Home, Away or Draw).
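
As an illustration of what these binary variables look like, here is a minimal sketch on a few made-up rows. The toy frame is hypothetical; only the column name `Match Result` and the codes H, D and A follow the dataset.

```python
import pandas as pd

# Toy results: H = home win, D = draw, A = away win (hypothetical rows)
toy = pd.DataFrame({"Match Result": ["H", "D", "A", "H"]})

# One 0/1 column per possible result; dtype=int avoids boolean columns
toy_dummies = pd.get_dummies(toy, columns=["Match Result"], dtype=int)
print(toy_dummies)
```

Each row has exactly one 1 across the three new columns, since a match has exactly one result.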

In [None]:
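# We create three binary columns: one for Home team wins, one for draws and one for Away team wins
# dtype=int keeps them as 0/1 integers (recent pandas versions default to booleans)
data = pd.get_dummies(data, columns=['Match Result'], dtype=int)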

Before building the graphs, we normalise the odds to make the graphs easier to compare. This creates a common scale for all the odds.
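
To make the rescaling concrete, here is a small sketch of min-max normalisation on made-up values. The column names `Site A Home` and `Site B Home` and their odds are hypothetical.

```python
import pandas as pd

# Hypothetical raw decimal odds for two bookmaker columns
toy_odds = pd.DataFrame({"Site A Home": [1.5, 2.0, 4.0],
                         "Site B Home": [1.4, 2.1, 3.5]})

# Min-max scaling, column by column: (x - min) / (max - min)
toy_scaled = toy_odds.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
print(toy_scaled)
```

Note that each column is rescaled with its own minimum and maximum, so every column ends up between 0 and 1, but a scaled value no longer reads as a decimal odd.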


In [None]:
from sklearn import preprocessing
# Min-max normalisation of the odds columns, using the formula (x - x.min()) / (x.max() - x.min())
cols_to_norm = ['B365 Home','B365 Draw','B365 Away','Bet&Win Home','Bet&Win Draw','Bet&Win Away','Interwetten Home','Interwetten Draw','Interwetten Away','William Hill Home','William Hill Draw','William Hill Away','VC Bet Home','VC Bet Draw','VC Bet Away']
data[cols_to_norm] = data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min())) 
data[cols_to_norm].head(10)

To build a betting strategy, it is first useful to know how often the betting companies' predictions turn out to be right. To understand this, we will build **three different scatter plots**, showing the realised and unrealised predictions:

- The official result is a Home team victory, crossed with the "Home team win" odds.
- The official result is an Away team victory, crossed with the "Away team win" odds.
- The official result is a draw, crossed with the "Draw" odds.

We will build a scatter plot for each of the 3 possible match results and compare them with the odds.
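
As a numerical complement to the scatter plots, one could also group the odds into ranges and compute how often the outcome happened in each range. The sketch below is not part of the original analysis; the data, the column names `home_odds` and `home_win`, and the bin edges are all hypothetical.

```python
import pandas as pd

# Hypothetical matches: raw decimal home-win odds and whether home won (1/0)
toy = pd.DataFrame({
    "home_odds": [1.3, 1.5, 1.8, 2.2, 2.6, 3.0, 4.5, 6.0],
    "home_win":  [1,   1,   0,   1,   0,   0,   0,   1],
})

# Group the odds into ranges, then compute the share of home wins per range
bins = [1.0, 2.0, 3.0, 10.0]
toy["odds_range"] = pd.cut(toy["home_odds"], bins=bins)
win_rate = toy.groupby("odds_range", observed=True)["home_win"].mean()
print(win_rate)
```

A table like this states the same information as the scatter plot in a single number per odds range.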

### Home team wins scenario: if the Home team wins the output is 1,  otherwise it's 0.

In [None]:
# number of Homewin vs No Home win
Homewin = data["Match Result_H"].value_counts()[1]
NoHomewin = data["Match Result_H"].value_counts()[0]
print("Home won:", Homewin)
print("Home did not win:", NoHomewin)

In [None]:
# Base rate: share of matches in which the Home team did not win
BaseRate = NoHomewin/data['Match Result_H'].count()
BaseRate

**Interpretation**: We can see that the split is almost 50-50.
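
The base rate is a useful reference point: a model that always predicts the most frequent class already reaches that accuracy, so a classifier has to beat it to be worth using. A minimal sketch on made-up labels (the list below is hypothetical, not the dataset):

```python
# Hypothetical 0/1 labels: 1 = home win, 0 = no home win
labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

share_no_win = labels.count(0) / len(labels)   # base rate of "no home win"
share_win = labels.count(1) / len(labels)      # base rate of "home win"

# Accuracy of a trivial model that always predicts the most frequent class
baseline_accuracy = max(share_no_win, share_win)
print(share_no_win, baseline_accuracy)
```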

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score


#tips = sns.load_dataset(data)

a_B365_Home = sns.scatterplot(x="B365 Home", y="Match Result_H", data=data)

a_BetWin_Home = sns.scatterplot(x="Bet&Win Home", y="Match Result_H", data=data)

a_Interwetten_Home = sns.scatterplot(x="Interwetten Home", y="Match Result_H", data=data)

a_WilliamHill_Home = sns.scatterplot(x="William Hill Home", y="Match Result_H", data=data)

a_VCBet_Home = sns.scatterplot(x="VC Bet Home", y="Match Result_H", data=data)

a_VCBet_Home.set(xlabel='Website Odd while Home Team wins', ylabel='Home team won [1]')


**Interpretation:** As we can see, the higher the odds, the lower the team's chance of winning. However, it still happens sometimes.
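
This inverse relationship follows from how decimal odds are built: the probability implied by an odd is roughly its inverse. The sketch below assumes raw decimal odds (not the min-max normalised columns plotted above). The helper name `implied_probabilities` and the three odds are hypothetical. Dividing by the total is only one simple way to remove the bookmaker's margin.

```python
def implied_probabilities(home, draw, away):
    """Convert raw decimal odds into implied probabilities.

    Returns the normalised probabilities and the bookmaker margin
    (how much the raw inverse odds exceed 1 in total).
    """
    raw = [1 / home, 1 / draw, 1 / away]
    total = sum(raw)
    margin = total - 1
    return [p / total for p in raw], margin

# Hypothetical odds for one match
probs, margin = implied_probabilities(home=1.40, draw=4.50, away=8.00)
print([round(p, 3) for p in probs], round(margin, 3))
```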


### Away team wins scenario: if the Away team wins the output is 1,  otherwise it's 0.

In [None]:
# number of Away win vs No Away win
Awaywin = data["Match Result_A"].value_counts()[1]
NoAwaywin = data["Match Result_A"].value_counts()[0]
print("Away won:", Awaywin)
print("Away did not win:", NoAwaywin)

In [None]:
# Base rate: share of matches in which the Away team did not win
BaseRateA = NoAwaywin/data['Match Result_A'].count()
BaseRateA

**Interpretation**: We can see that the split is almost 70-30.

In [None]:
a_B365_Away = sns.scatterplot(x="B365 Away", y="Match Result_A", data=data)

a_BetWin_Away = sns.scatterplot(x="Bet&Win Away", y="Match Result_A", data=data)

a_Interwetten_Away = sns.scatterplot(x="Interwetten Away", y="Match Result_A", data=data)

a_WilliamHill_Away = sns.scatterplot(x="William Hill Away", y="Match Result_A", data=data)

a_VCBet_Away = sns.scatterplot(x="VC Bet Away", y="Match Result_A", data=data)

a_VCBet_Away.set(xlabel='Website Odd while Away Team wins', ylabel='Away team won [1]')


**Interpretation:** We can observe the same trend as the previous graph.

### Draw match scenario: if the result is draw, the output is 1,  otherwise it's 0.

In [None]:
# number of Draw vs No Draw
Draw = data["Match Result_D"].value_counts()[1]
NoDraw = data["Match Result_D"].value_counts()[0]
print("Draw match:", Draw)
print("No Draw match:", NoDraw)

In [None]:
# Base rate 
# the base rate of the No Draw
BaseRateD = NoDraw/data['Match Result_D'].count()
BaseRateD

**Interpretation**: We can see that the split is almost 75-25.

In [None]:
a_B365_Draw = sns.scatterplot(x="B365 Draw", y="Match Result_D", data=data)

a_BetWin_Draw = sns.scatterplot(x="Bet&Win Draw", y="Match Result_D", data=data)

a_Interwetten_Draw = sns.scatterplot(x="Interwetten Draw", y="Match Result_D", data=data)

a_WilliamHill_Draw = sns.scatterplot(x="William Hill Draw", y="Match Result_D", data=data)

a_VCBet_Draw = sns.scatterplot(x="VC Bet Draw", y="Match Result_D", data=data)

a_VCBet_Draw.set(xlabel='Website Odd while Draw match', ylabel='Draw [1]')

**Interpretation:** We can observe the same trend as in the previous graph for the low odds. However, the high odds simply never come true.


## 5. Comparison between two Classification models

We will compare two models:

> - **Logistic Regressions**: comparing the predictions based on the odds with those based on the match statistics
> - **Decision Trees**: comparing the predictions based on the odds with those based on the match statistics
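
Since the same steps (split, fit, score) are repeated for every target and feature set, they could be wrapped in a small helper. The sketch below is only an illustration: `fit_and_score` is a hypothetical name, the data is synthetic rather than the Premier League dataset, and the `multi_class` argument is left out because the targets are binary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

def fit_and_score(X, y, random_state=72):
    """Fit a cross-validated logistic regression; return model and accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)
    model = LogisticRegressionCV(solver="lbfgs", cv=5, max_iter=1000)
    model.fit(X_train, y_train)
    return model, model.score(X_train, y_train), model.score(X_test, y_test)

# Synthetic stand-in data (not the Premier League dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model, train_acc, test_acc = fit_and_score(X_demo, y_demo)
print(train_acc, test_acc)
```

With such a helper, comparing the odds-based and statistics-based models comes down to calling it with two different feature matrices.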

## 5.1 Logistic Regression

### I. With the odds

**Here is the 1st logistic regression for Home Team Win:**

In [None]:
feature_names = ['B365 Home','Bet&Win Home','Interwetten Home','William Hill Home','VC Bet Home']

X = np.array(data[feature_names])
y = np.array(data["Match Result_H"])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)


In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# logistic regression with 5 fold cross validation
LR = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, multi_class="multinomial")

In [None]:
LR.fit(X_train,y_train)

In [None]:
# best regulariser parameter
LR.C_

In [None]:
# train accuracy
LR.score(X_train,y_train)

The train set accuracy is about 64%

In [None]:
# test accuracy
LR.score(X_test, y_test)

The test set accuracy is about 60%

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, LR.predict(X_train))

The raw matrix above already shows how results and predictions are distributed; the plot below presents the same confusion matrix in a more readable way:

In [None]:
# Normalized confusion matrix, (code from the Lab 5.0)
from sklearn.utils.multiclass import unique_labels

y_pred = LR.predict(X_train)

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Reds):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

#     print(cm)

    fig, ax = plt.subplots(figsize=(10,7))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    plt.ylim([cm.shape[0] - 0.5, -0.5])  # fit the y-axis to the matrix rows, first class on top

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout();
    
    ax.xaxis.set_ticklabels(["No Home Win", "Home Win"])
    ax.yaxis.set_ticklabels(["No Home Win", "Home Win"])
    return ax


np.set_printoptions(precision=2)


# Plot normalized confusion matrix (normalize=True so that the cells match the title)
plot_confusion_matrix(y_train, y_pred, classes=unique_labels(y_train, y_pred), normalize=True, title='Confusion matrix, with normalization')


### Explanation of confusion matrix 1
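
As a reading aid, here is a sketch of how a 2x2 confusion matrix is laid out in scikit-learn, using made-up labels (not the notebook's results). Rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = home win, 0 = no home win
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 1, 1, 0, 1]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()   # rows = true class, columns = predicted class
print(cm_demo)

# Row-normalised version: each row sums to 1 (share of each true class)
cm_norm = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")
print(cm_norm)
```

In the row-normalised matrix, the diagonal gives the share of each true class that was predicted correctly.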

In [None]:
from sklearn.metrics import precision_score, recall_score, precision_recall_curve

# Precision: share of predicted Home wins that were actual Home wins
Precision = precision_score(y_train, y_pred)
print(Precision)
# precision (60.42% of the predicted Home wins were actual Home wins)

# Recall: share of actual Home wins that the model identified
Recall = recall_score(y_train, y_pred)
print(Recall)
# recall (75.8% of the Home wins were correctly identified)


### Explanation of recall
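
For reference, precision and recall can be computed directly from the counts of a confusion matrix. The helper `precision_recall` and the counts below are hypothetical and only illustrate the two formulas.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    return precision, recall

# Hypothetical counts, not the notebook's results
precision, recall = precision_recall(tp=5, fp=2, fn=1)
print(round(precision, 3), round(recall, 3))
```

A high recall with a lower precision means the model finds most of the positive cases but also labels many negative cases as positive.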


**Here is the 2nd logistic regression for Draws:**

In [None]:
feature_names = ['B365 Draw','Bet&Win Draw','Interwetten Draw','William Hill Draw','VC Bet Draw']

X = np.array(data[feature_names])
y = np.array(data["Match Result_D"])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# logistic regression with 5 fold cross validation
LRD = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, multi_class="multinomial")

In [None]:
LRD.fit(X_train,y_train)

In [None]:
# best regulariser parameter
LRD.C_

Here the result is not satisfying: we should take a sample of the data to get a more balanced distribution of the [0,1] values representing the match results.
### MORE
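
One possible way to obtain a more balanced sample is to randomly downsample the majority class. The sketch below uses made-up data and hypothetical column names (`feature`, `is_draw`); in practice it would be applied to the training split only. An alternative that keeps all rows is the `class_weight="balanced"` option of `LogisticRegressionCV`.

```python
import pandas as pd

# Hypothetical imbalanced target: 1 = draw, 0 = no draw
toy = pd.DataFrame({"feature": range(12),
                    "is_draw": [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

minority = toy[toy["is_draw"] == 1]
majority = toy[toy["is_draw"] == 0]

# Randomly keep as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=72)
balanced = pd.concat([minority, majority_down]).sort_index()
print(balanced["is_draw"].value_counts())
```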

In [None]:
# train accuracy
LRD.score(X_train,y_train)

In [None]:
# test accuracy
LRD.score(X_test, y_test)

### Explain that the results of the cells above are due to the poor class distribution.

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, LRD.predict(X_train))

In [None]:
# Normalized confusion matrix, (code from the Lab 5.0)
from sklearn.utils.multiclass import unique_labels

y_pred = LRD.predict(X_train)

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Greens):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

#     print(cm)

    fig, ax = plt.subplots(figsize=(10,7))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    plt.ylim([cm.shape[0] - 0.5, -0.5])  # fit the y-axis to the matrix rows, first class on top

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout();
    
    ax.xaxis.set_ticklabels(["No Draw", "Draw"])
    ax.yaxis.set_ticklabels(["No Draw", "Draw"])
    return ax


np.set_printoptions(precision=2)


# Plot normalized confusion matrix (normalize=True so that the cells match the title)
plot_confusion_matrix(y_train, y_pred, classes=unique_labels(y_train, y_pred), normalize=True, title='Confusion matrix, with normalization')



### Poor class distribution, so we will not keep the draw example for the logistic regressions

**Here is the 3rd logistic regression for Away wins:**

In [None]:
feature_names = ['B365 Away','Bet&Win Away','Interwetten Away','William Hill Away','VC Bet Away']

X = np.array(data[feature_names])
y = np.array(data["Match Result_A"])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# logistic regression with 5 fold cross validation
LRA = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, multi_class="multinomial")

In [None]:
LRA.fit(X_train,y_train)

In [None]:
# best regulariser parameter
LRA.C_

In [None]:
# train accuracy
LRA.score(X_train,y_train)

In [None]:
# test accuracy
LRA.score(X_test, y_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, LRA.predict(X_train))

In [None]:
# Normalized confusion matrix, (code from the Lab 5.0)
from sklearn.utils.multiclass import unique_labels

y_pred = LRA.predict(X_train)

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

#     print(cm)

    fig, ax = plt.subplots(figsize=(10,7))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    plt.ylim([cm.shape[0] - 0.5, -0.5])  # fit the y-axis to the matrix rows, first class on top

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout();
    
    ax.xaxis.set_ticklabels(["No Away win", "Away win"])
    ax.yaxis.set_ticklabels(["No Away win", "Away win"])
    return ax


np.set_printoptions(precision=2)


# Plot normalized confusion matrix (normalize=True so that the cells match the title)
plot_confusion_matrix(y_train, y_pred, classes=unique_labels(y_train, y_pred), normalize=True, title='Confusion matrix, with normalization')



### Poor class distribution, so we will not keep the away example for the logistic regressions, for the same reason as for draws

### II. With the Match Statistics

**Logistic Regression for Home win with statistics**


In [None]:
feature_names = ["Home ex-Rank","Home Team Shots","Away ex-Rank", "Away Team Shots","Home Team Shots on Target", "Away Team Shots on Target", "Home Fouls Committed", "Away Fouls Committed", "Home Corners", "Away Corners", "Home Yellow Cards", "Away Yellow Cards", "Home Red Cards", "Away Red Cards"]
#feature_names = ['B365 Home']
X = np.array(data[feature_names])
y = np.array(data["Match Result_H"])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# logistic regression with 5 fold cross validation
LR_MS = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, multi_class="multinomial")

In [None]:
LR_MS.fit(X_train,y_train)

In [None]:
# best regulariser parameter
LR_MS.C_

In [None]:
# train accuracy
LR_MS.score(X_train,y_train)

In [None]:
# test accuracy
LR_MS.score(X_test, y_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, LR_MS.predict(X_train))

In [None]:
# Normalized confusion matrix, (code from the Lab 5.0)
from sklearn.utils.multiclass import unique_labels

y_pred = LR_MS.predict(X_train)

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Reds):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

#     print(cm)

    fig, ax = plt.subplots(figsize=(10,7))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    plt.ylim([cm.shape[0] - 0.5, -0.5])  # fit the y-axis to the matrix rows, first class on top

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout();
    
    ax.xaxis.set_ticklabels(["No Home win", "Home win"])
    ax.yaxis.set_ticklabels(["No Home win", "Home win"])
    return ax


np.set_printoptions(precision=2)


# Plot normalized confusion matrix (normalize=True so that the cells match the title)
plot_confusion_matrix(y_train, y_pred, classes=unique_labels(y_train, y_pred), normalize=True, title='Confusion matrix, with normalization')

**Logistic Regression for Draws with statistics**

In [None]:
feature_names = ["Home ex-Rank","Home Team Shots","Away ex-Rank", "Away Team Shots","Home Team Shots on Target", "Away Team Shots on Target", "Home Fouls Committed", "Away Fouls Committed", "Home Corners", "Away Corners", "Home Yellow Cards", "Away Yellow Cards"]
#feature_names = ['B365 Home']
X = np.array(data[feature_names])
y = np.array(data["Match Result_D"])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# logistic regression with 5 fold cross validation
LR_MSD = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, multi_class="multinomial")

In [None]:
LR_MSD.fit(X_train,y_train)

In [None]:
# best regulariser parameter
LR_MSD.C_

In [None]:
# train accuracy
LR_MSD.score(X_train,y_train)

In [None]:
# test accuracy
LR_MSD.score(X_test, y_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, LR_MSD.predict(X_train))

In [None]:
# Normalized confusion matrix, (code from the Lab 5.0)
from sklearn.utils.multiclass import unique_labels

y_pred = LR_MSD.predict(X_train)

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Greens):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

#     print(cm)

    fig, ax = plt.subplots(figsize=(10,7))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    plt.ylim([cm.shape[0] - 0.5, -0.5])  # fit the y-axis to the matrix rows, first class on top

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout();
    
    ax.xaxis.set_ticklabels(["No Draw", "Draw"])
    ax.yaxis.set_ticklabels(["No Draw", "Draw"])
    return ax


np.set_printoptions(precision=2)


# Plot normalized confusion matrix (normalize=True so that the cells match the title)
plot_confusion_matrix(y_train, y_pred, classes=unique_labels(y_train, y_pred), normalize=True, title='Confusion matrix, with normalization')

**Logistic Regression for Away win with statistics**

In [None]:
feature_names = ["Home ex-Rank","Home Team Shots","Away ex-Rank", "Away Team Shots","Home Team Shots on Target", "Away Team Shots on Target", "Home Fouls Committed", "Away Fouls Committed", "Home Corners", "Away Corners", "Home Yellow Cards", "Away Yellow Cards", "Home Red Cards", "Away Red Cards"]
#feature_names = ['B365 Home']
X = np.array(data[feature_names])
y = np.array(data["Match Result_A"])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# logistic regression with 5 fold cross validation
LR_MSA = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, multi_class="multinomial")

In [None]:
LR_MSA.fit(X_train,y_train)

In [None]:
# best regulariser parameter
LR_MSA.C_

In [None]:
# train accuracy
LR_MSA.score(X_train,y_train)

In [None]:
# test accuracy
LR_MSA.score(X_test, y_test)
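`StandardScaler` is imported above but never applied, while the features mix very different scales (ranks, shot counts, card counts). As a minimal sketch of how scaling could be combined with the same classifier, the example below chains both steps in a pipeline. It runs on synthetic data, not on the match dataset, and the names `X_demo`, `y_demo` and `scaled_lr` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three match statistics on different scales
# (NOT the real dataset; the numbers are invented for the illustration).
rng = np.random.default_rng(72)
X_demo = rng.normal(loc=[10.0, 4.0, 1.5], scale=[4.0, 2.0, 1.0], size=(400, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=4.0, size=400) > 10.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=72)

# The scaler is fitted on the training split only, then reused on the test split.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000),
)
scaled_lr.fit(X_tr, y_tr)

test_acc = scaled_lr.score(X_te, y_te)
print("test accuracy:", test_acc)
```

With the real data, `X_train` and `y_train` from the cell above would presumably take the place of `X_tr` and `y_tr`.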

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, LR_MSA.predict(X_train))
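Raw counts are hard to compare when one class is much rarer than the other. The short sketch below shows the row normalization that the plotting helper applies when `normalize=True`. The counts are invented for the illustration and are not results from this dataset.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix (rows: true class, columns: predicted class).
# These counts are made up for the example.
cm = np.array([[80, 20],
               [30, 70]])

# Divide each row by its total so that every row sums to 1:
# entry [i, j] becomes the share of true class i that was predicted as class j.
cm_normalized = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]
print(cm_normalized)
```

After normalization, the diagonal reads as the recall of each class (here 0.8 and 0.7).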

In [None]:
# Normalized confusion matrix, (code from the Lab 5.0)
from sklearn.utils.multiclass import unique_labels

y_pred = LR_MSA.predict(X_train)

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

#     print(cm)

    fig, ax = plt.subplots(figsize=(10,7))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    # Show exactly one band per class, with the first row at the top
    ax.set_ylim(cm.shape[0] - 0.5, -0.5)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout();
    
    ax.xaxis.set_ticklabels(["No Away win", "Away win"])
    ax.yaxis.set_ticklabels(["No Away win", "Away win"])
    return ax


np.set_printoptions(precision=2)


# Plot normalized confusion matrix
plot_confusion_matrix(y_train, y_pred, classes=y[unique_labels(y_train, y_pred)], normalize=True, title='Confusion matrix, with normalization')



### 5.2 Decision Trees

#### I. With the Odds

**Decision Tree for Home win**

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
feature_names = ['B365 Home','Bet&Win Home','Interwetten Home','William Hill Home','VC Bet Home']

X = data[feature_names]
y = data["Match Result_H"]


In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
clf = DecisionTreeClassifier(criterion='entropy')
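The `criterion='entropy'` option makes the tree pick, at each node, the split that most reduces the Shannon entropy of the labels. A minimal sketch of that computation follows. The odds, the labels and the 2.0 threshold are invented for the illustration, and `entropy` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Toy data (invented): 1 = home win, 0 = no home win, with a made-up home odd.
y_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0])
odds_toy = np.array([1.3, 1.5, 1.6, 1.9, 2.4, 2.8, 3.5, 4.0])

# Candidate split: home odd <= 2.0 versus home odd > 2.0.
left = y_toy[odds_toy <= 2.0]    # three wins, one non-win
right = y_toy[odds_toy > 2.0]    # one win, three non-wins

parent_entropy = entropy(y_toy)
children_entropy = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(y_toy)
information_gain = parent_entropy - children_entropy

print("parent entropy:  ", round(parent_entropy, 4))
print("information gain:", round(information_gain, 4))
```

The parent node is perfectly balanced (entropy of 1 bit), and this split removes about 0.19 bit of uncertainty.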

In [None]:
clf.fit(X_train, y_train)

In [None]:
# test accuracy
clf.score(X_test,y_test)

In [None]:
# depth of the unconstrained decision tree
Depth = clf.get_depth()

In [None]:
# test accuracy for every maximum depth from 1 up to the unconstrained depth
depths = list(range(1, Depth + 1))
scores = []
for d in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

In [None]:
plt.plot(depths, scores, "r")
plt.ylabel('accuracy', fontsize=15)
plt.xlabel('depth', fontsize=15)

In [None]:
# scores[i] belongs to depths[i], so the result is always a valid max_depth (>= 1)
Max_Depth = depths[int(np.argmax(scores))]
print("The optimal Maximum Depth is", Max_Depth)

In [None]:
clfMax = DecisionTreeClassifier(criterion='entropy', max_depth=Max_Depth)

In [None]:
clfMax.fit(X_train, y_train)

In [None]:
clfMax.score(X_test,y_test)
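Picking the depth that maximizes the test accuracy and then reporting that same test accuracy tends to give an optimistic score. One common alternative is to choose the depth by cross-validation on the training split only, and to keep the test split for the final check. The sketch below illustrates this on synthetic odds. The data and the names `odds_demo`, `y_demo` and `search` are assumptions for the example, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for bookmaker odds (NOT the real dataset).
rng = np.random.default_rng(72)
odds_demo = rng.uniform(1.1, 8.0, size=(500, 2))
# Assumption for the toy data: a home win happens with probability 1 / home odd.
y_demo = (rng.uniform(size=500) < 1.0 / odds_demo[:, 0]).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    odds_demo, y_demo, test_size=0.2, random_state=72)

# 5-fold cross-validation over the candidate depths, on the training split only.
search = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=72),
    param_grid={'max_depth': list(range(1, 11))},
    cv=5,
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

best_depth = search.best_params_['max_depth']
final_test_acc = search.score(X_te, y_te)
print("depth chosen by cross-validation:", best_depth)
print("test accuracy at that depth:", final_test_acc)
```

The upper bound of 10 on the depth grid is arbitrary here. With the real data, the depth of the unconstrained tree would be a natural limit.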

### Explanations

**Decision Tree for Draw**

In [None]:
feature_namesD = ['B365 Draw','Bet&Win Draw','Interwetten Draw','William Hill Draw','VC Bet Draw']

# use the Draw odds defined just above, not the Home odds from the previous model
X = data[feature_namesD]
y = data["Match Result_D"]


In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
clfD = DecisionTreeClassifier(criterion='entropy')

In [None]:
clfD.fit(X_train, y_train)

In [None]:
# test accuracy
clfD.score(X_test,y_test)

In [None]:
# depth of the unconstrained decision tree
Depth = clfD.get_depth()

In [None]:
# test accuracy for every maximum depth from 1 up to the unconstrained depth
depths = list(range(1, Depth + 1))
scores = []
for d in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

In [None]:
plt.plot(depths, scores, "g")
plt.ylabel('accuracy', fontsize=15)
plt.xlabel('depth', fontsize=15)

In [None]:
# scores[i] belongs to depths[i], so the result is always a valid max_depth (>= 1)
Max_Depth = depths[int(np.argmax(scores))]
print("The optimal Maximum Depth is", Max_Depth)

In [None]:
clfMaxD = DecisionTreeClassifier(criterion='entropy', max_depth=Max_Depth)

In [None]:
clfMaxD.fit(X_train, y_train)

In [None]:
clfMaxD.score(X_test,y_test)

### Explanations

**Decision Tree for Away win**

In [None]:
feature_namesA = ['B365 Away','Bet&Win Away','Interwetten Away','William Hill Away','VC Bet Away']

# use the Away odds defined just above, not the Home odds from the first model
X = data[feature_namesA]
y = data["Match Result_A"]


In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
clfA = DecisionTreeClassifier(criterion='entropy')

In [None]:
clfA.fit(X_train, y_train)

In [None]:
# test accuracy
clfA.score(X_test,y_test)

In [None]:
# depth of the unconstrained decision tree
Depth = clfA.get_depth()

In [None]:
# test accuracy for every maximum depth from 1 up to the unconstrained depth
depths = list(range(1, Depth + 1))
scores = []
for d in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

In [None]:
plt.plot(depths, scores)
plt.ylabel('accuracy', fontsize=15)
plt.xlabel('depth', fontsize=15)

In [None]:
# scores[i] belongs to depths[i], so the result is always a valid max_depth (>= 1)
Max_Depth = depths[int(np.argmax(scores))]
print("The optimal Maximum Depth is", Max_Depth)

In [None]:
clfMaxA = DecisionTreeClassifier(criterion='entropy', max_depth=Max_Depth)

In [None]:
clfMaxA.fit(X_train, y_train)

In [None]:
clfMaxA.score(X_test,y_test)

#### II. With the Match Statistics

**Match Statistics for Home win**

In [None]:
feature_names = ['Home ex-Rank', 'Home Team Shots', 'Away ex-Rank', 'Away Team Shots', 'Home Team Shots on Target', 'Away Team Shots on Target', 'Home Fouls Committed', 'Away Fouls Committed', 'Home Corners', 'Away Corners', 'Home Yellow Cards', 'Away Yellow Cards', 'Home Red Cards', 'Away Red Cards']

X = np.array(data[feature_names])
y = np.array(data["Match Result_H"])

In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
clfHS = DecisionTreeClassifier(criterion='entropy')

In [None]:
clfHS.fit(X_train, y_train)

In [None]:
# test accuracy
clfHS.score(X_test,y_test)

In [None]:
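# depth of the unconstrained decision tree (stored, so the loop below does not reuse an earlier model's depth)
Depth = clfHS.get_depth()

In [None]:
# test accuracy for every maximum depth from 1 up to the unconstrained depth
depths = list(range(1, Depth + 1))
scores = []
for d in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

In [None]:
plt.plot(depths, scores, "r")
plt.ylabel('accuracy', fontsize=15)
plt.xlabel('depth', fontsize=15)

In [None]:
# scores[i] belongs to depths[i], so the result is always a valid max_depth (>= 1)
Max_Depth = depths[int(np.argmax(scores))]
print("The optimal Maximum Depth is", Max_Depth)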

In [None]:
clfMaxHS = DecisionTreeClassifier(criterion='entropy', max_depth=Max_Depth)

In [None]:
clfMaxHS.fit(X_train, y_train)

In [None]:
clfMaxHS.score(X_test,y_test)

**Match Statistics for Draw**

In [None]:
feature_names = ['Home ex-Rank', 'Home Team Shots', 'Away ex-Rank', 'Away Team Shots', 'Home Team Shots on Target', 'Away Team Shots on Target', 'Home Fouls Committed', 'Away Fouls Committed', 'Home Corners', 'Away Corners', 'Home Yellow Cards', 'Away Yellow Cards', 'Home Red Cards', 'Away Red Cards']

X = np.array(data[feature_names])
y = np.array(data["Match Result_D"])

In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
clfDS = DecisionTreeClassifier(criterion='entropy')

In [None]:
clfDS.fit(X_train, y_train)

In [None]:
# test accuracy
clfDS.score(X_test,y_test)

In [None]:
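# depth of the unconstrained decision tree (stored, so the loop below does not reuse an earlier model's depth)
Depth = clfDS.get_depth()

In [None]:
# test accuracy for every maximum depth from 1 up to the unconstrained depth
depths = list(range(1, Depth + 1))
scores = []
for d in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

In [None]:
plt.plot(depths, scores, "g")
plt.ylabel('accuracy', fontsize=15)
plt.xlabel('depth', fontsize=15)

In [None]:
# scores[i] belongs to depths[i], so the result is always a valid max_depth (>= 1)
Max_Depth = depths[int(np.argmax(scores))]
print("The optimal Maximum Depth is", Max_Depth)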

In [None]:
clfMaxDS = DecisionTreeClassifier(criterion='entropy', max_depth=Max_Depth)

In [None]:
clfMaxDS.fit(X_train, y_train)

In [None]:
clfMaxDS.score(X_test,y_test)

**Match Statistics for Away win**

In [None]:
feature_names = ['Home ex-Rank', 'Home Team Shots', 'Away ex-Rank', 'Away Team Shots', 'Home Team Shots on Target', 'Away Team Shots on Target', 'Home Fouls Committed', 'Away Fouls Committed', 'Home Corners', 'Away Corners', 'Home Yellow Cards', 'Away Yellow Cards', 'Home Red Cards', 'Away Red Cards']

X = np.array(data[feature_names])
y = np.array(data["Match Result_A"])

In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
clfAS = DecisionTreeClassifier(criterion='entropy')

In [None]:
clfAS.fit(X_train, y_train)

In [None]:
# test accuracy
clfAS.score(X_test,y_test)

In [None]:
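# depth of the unconstrained decision tree (stored, so the loop below does not reuse an earlier model's depth)
Depth = clfAS.get_depth()

In [None]:
# test accuracy for every maximum depth from 1 up to the unconstrained depth
depths = list(range(1, Depth + 1))
scores = []
for d in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

In [None]:
plt.plot(depths, scores)
plt.ylabel('accuracy', fontsize=15)
plt.xlabel('depth', fontsize=15)

In [None]:
# scores[i] belongs to depths[i], so the result is always a valid max_depth (>= 1)
Max_Depth = depths[int(np.argmax(scores))]
print("The optimal Maximum Depth is", Max_Depth)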

In [None]:
clfMaxAS = DecisionTreeClassifier(criterion='entropy', max_depth=Max_Depth)

In [None]:
clfMaxAS.fit(X_train, y_train)

In [None]:
clfMaxAS.score(X_test,y_test)
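The six decision-tree experiments above repeat the same steps: split the data, fit an unconstrained tree, sweep the maximum depth, and keep the best one. As a sketch, these steps could be gathered into one function. `sweep_tree_depth` is a hypothetical helper that is not part of the original notebook, and the demonstration runs on synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def sweep_tree_depth(X, y, random_state=72):
    """Fit one tree per max_depth; return (depths, test_scores, best_depth)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)

    # The unconstrained tree gives the largest depth worth trying.
    full_tree = DecisionTreeClassifier(criterion='entropy',
                                       random_state=random_state)
    full_tree.fit(X_train, y_train)

    depths = list(range(1, full_tree.get_depth() + 1))
    scores = []
    for d in depths:
        tree = DecisionTreeClassifier(criterion='entropy', max_depth=d,
                                      random_state=random_state)
        tree.fit(X_train, y_train)
        scores.append(tree.score(X_test, y_test))

    # scores[i] belongs to depths[i]; ties go to the shallowest tree.
    best_depth = depths[int(np.argmax(scores))]
    return depths, scores, best_depth

# Demonstration on synthetic data (NOT the match dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

depths, scores, best_depth = sweep_tree_depth(X_demo, y_demo)
print("best depth:", best_depth, "with test accuracy", max(scores))
```

In the notebook, a call such as `sweep_tree_depth(data[feature_names], data["Match Result_H"])` would presumably replace the corresponding group of cells. It would also remove the risk of reusing a `Depth` or `feature_names` value left over from a previous model.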

# Conclusion