# Betting strategies 
## _England Premier League Results & Odds Dataset_

### I. Introduction

> New to sports betting, or looking for a new strategy to implement? This notebook will help you understand how betting works and provide a **betting strategy** that you can apply on your own.

In this report, we take football as our example, the sport with the highest number of bets. More precisely, we concentrate on **English Premier League results**, because the large number of matches lets us build and test a robust strategy.

##### Explanations of betting vocabulary
First, let's set up the vocabulary. Every bet you place has _odds_ attached to it. The odds reflect how likely the bookmaker considers an event to be, and they determine how much money you win if your bet succeeds.
You can bet on several types of results before a match. Here, we consider two types of bets: bets on the number of goals and bets on the match result (home team wins, away team wins, or draw).
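
To make the link between odds, probability and winnings concrete, here is a minimal sketch. It assumes decimal (European) odds, which the text does not state explicitly, and the function names `implied_probability` and `payout` are hypothetical helpers, not part of the notebook.

```python
def implied_probability(decimal_odds):
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds


def payout(stake, decimal_odds):
    """Total amount returned (stake included) if the bet wins."""
    return stake * decimal_odds


# Made-up odds for one match: home win, draw, away win
odds = {"Home": 2.0, "Draw": 4.0, "Away": 4.0}
probs = {outcome: implied_probability(o) for outcome, o in odds.items()}

print(probs)                      # {'Home': 0.5, 'Draw': 0.25, 'Away': 0.25}
print(payout(10, odds["Home"]))   # 20.0
```

In practice, the implied probabilities of the three outcomes usually add up to slightly more than 1; the excess is the bookmaker's margin.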


#### First, we import the libraries that will be useful for this notebook:

> 1. numpy (...)
> 2. matplotlib (...)
> 3. pandas (...)
> 4. seaborn (...)
> 5. sklearn (...)

In [None]:
# Useful starting lines
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import collections as mc
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

sns.set_style("white")

#### Dataset

The dataset we chose gathers the results of Premier League matches since 2008, with fields such as the date, the home and away teams, the goals, the match result and the referee. It also contains odds on the match result and on the number of goals, taken from different betting sites.

We load the dataset from a CSV file into a DataFrame that we call _data_.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/abdul232/DMML_Team_Rolex/master/data/England%202008%202018%20Premiere%20League%20Clean%20DATA.csv', sep=';')
# view the first 10 rows 
data.head(10)

In [None]:
data.shape

In [None]:
# Drop the last two rows because they contain only NaN values
data = data.drop([4180, 4181])

# Check the new dimensions
data.shape

The raw file counts **4'182** rows; without the two _NaN_ rows, the dataset counts **4'180** rows for **47** columns.
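
The cell above drops the empty rows by their index labels, which only works as long as the file keeps exactly the same length. As a hedged alternative, the sketch below removes every row that is entirely NaN with `DataFrame.dropna(how="all")`. The `toy` frame and its values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset: the last two rows are empty
toy = pd.DataFrame({
    "Match_ID": [1.0, 2.0, 3.0, np.nan, np.nan],
    "Home Team Goals": [2.0, 0.0, 1.0, np.nan, np.nan],
})

# Drop the rows in which every column is NaN, whatever their index labels
cleaned = toy.dropna(how="all")

print(toy.shape, "->", cleaned.shape)   # (5, 2) -> (3, 2)
```

Rows that are only partly empty are kept by `how="all"`, so this does not silently discard matches with a single missing odd.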

Now, let's set up the **types of the variables**. We will convert some columns to int (integer) and the date to datetime, while the others will remain objects.

In [None]:
data.dtypes

In [None]:
# Convert the match identifier to integer
data['Match_ID'] = data.Match_ID.astype(int)

In [None]:
# Parse the date column as datetime (format: day.month.year)
data['Date'] = pd.to_datetime(data['Date'], format="%d.%m.%Y")


In [None]:
# Convert the count columns (goals, shots, fouls, corners, cards) to integer
int_cols = ['Home Team Goals', 'Away Team Goals', 'Home Team Shots', 'Away Team Shots',
            'Home Team Shots on Target', 'Away Team Shots on Target',
            'Home Fouls Committed', 'Away Fouls Committed',
            'Home Corners', 'Away Cornners',  # column name kept as spelled in the original code
            'Home Yellow Cards', 'Away Yellow Cards', 'Home Red Cards', 'Away Red Cards']
data[int_cols] = data[int_cols].astype(int)

In [None]:
data.dtypes

### Regressions

To build a betting strategy, it is first useful to know how often the betting companies predict the right outcome. To understand this, we will build three different scatter plots:
> * The official result is a home team victory, crossed with the "Home team win" odd.
> * The official result is an away team victory, crossed with the "Away team win" odd.
> * The official result is a draw, crossed with the "Draw" odd.

Each graph will show the odds of all the betting websites, to compare them.
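
Before plotting, a quick way to estimate how often a bookmaker is right is to treat the outcome with the lowest odd as its prediction and compare it with the actual result. The sketch below does this on made-up values. It assumes that the `Match Result` column is coded as `H`, `D` and `A`, and the `Favourite` column is a hypothetical helper that does not exist in the dataset.

```python
import pandas as pd

# Toy data: column names follow the dataset, the values are made up
toy = pd.DataFrame({
    "B365 Home": [1.5, 3.2, 2.4, 1.8],
    "B365 Draw": [4.0, 3.3, 3.1, 3.6],
    "B365 Away": [6.5, 2.2, 3.0, 4.5],
    "Match Result": ["H", "A", "D", "A"],   # assumed coding: H, D, A
})

odds_cols = ["B365 Home", "B365 Draw", "B365 Away"]
labels = {"B365 Home": "H", "B365 Draw": "D", "B365 Away": "A"}

# The favourite is the outcome with the lowest odd
toy["Favourite"] = toy[odds_cols].idxmin(axis=1).map(labels)

# Share of matches in which the favourite actually happened
hit_rate = (toy["Favourite"] == toy["Match Result"]).mean()
print(hit_rate)   # 0.5
```

This comparison only makes sense on the raw odds, so it should run before the odds are normalised and before `Match Result` is turned into dummy columns.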

#### Normalisation
As a first step, we normalise the odds to make them easier to compare:


In [None]:
# We want three separate regressions (home win, draw, away win), so we one-hot encode the match result
data = pd.get_dummies(data, columns=['Match Result'])

In [None]:
from sklearn import preprocessing
# separate the data from the target attributes
#X = data['B365 Home','B365 Draw','B365 Away','Bet&Win Home','Bet&Win Draw','Bet&Win Away','Interwetten Home','Iterwetten Draw','Interwetten Away','William Hill Home','William Hill Draw','William Hill Away','VC Bet Home','VC Bet Draw','VC Bet Away']
# Min-max normalisation with the formula (x - x.min()) / (x.max() - x.min())
cols_to_norm = ['B365 Home','B365 Draw','B365 Away','Bet&Win Home','Bet&Win Draw','Bet&Win Away','Interwetten Home','Iterwetten Draw','Interwetten Away','William Hill Home','William Hill Draw','William Hill Away','VC Bet Home','VC Bet Draw','VC Bet Away']
data[cols_to_norm] = data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min())) 
data[cols_to_norm].head(10)

In [None]:
data[cols_to_norm].round(3).head(10)
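
To see what the min-max formula does, here is a small worked example on made-up odds. Only the column names are taken from the dataset; the `toy` frame is an illustration, not real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "B365 Home": [1.5, 2.0, 3.5],
    "B365 Away": [2.0, 4.0, 6.0],
})

# Min-max scaling, column by column: (x - min) / (max - min)
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled["B365 Home"].tolist())   # [0.0, 0.25, 1.0]
print(scaled["B365 Away"].tolist())   # [0.0, 0.5, 1.0]
```

After scaling, every column spans the interval from 0 to 1: the lowest odd of each column maps to 0 and the highest to 1. The scaled values therefore no longer represent a payout and are only useful for comparing the shape of the distributions.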

#### Plot with seaborn

We then build a first scatter plot with seaborn, using the odds from the betting website B365 for the case where the home team wins:

In [None]:
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


#tips = sns.load_dataset(data)
ax = sns.scatterplot(x="B365 Home", y="Match Result_H", data=data)

# For the regression, we can only use one variable at a time
#feature_home_wins = ['B365 Home','Bet&Win Home','Interwetten Home','William Hill Home','VC Bet Home']
#feature_home_wins = ['B365 Home']
#X = data[feature_home_wins]
#y = data['Match Result_H']
#We create the linear model.


#model = LinearRegression(fit_intercept=True)
#model.fit(X, y)

#predicted = model.predict(X)
#print(predicted)

#plt.scatter(X, y)
#plt.plot([min(X), max(X)], [min(predicted), max(predicted)], color='red') # predicted
#plt.show()


#mae = mean_absolute_error(predicted, y)
 
#print("MAE = %.2f" % mae)
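
The commented-out draft above outlines a single-feature linear regression. A minimal runnable sketch of that idea could look as follows. The arrays `X` and `y` are made-up stand-ins for the normalised `B365 Home` odds and the `Match Result_H` indicator, so the numbers say nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Made-up normalised odds (one feature) and home-win indicator (1 = home team won)
X = np.array([[0.0], [0.1], [0.2], [0.4], [0.7], [1.0]])
y = np.array([1, 1, 1, 0, 0, 0])

# Fit a straight line y = intercept + slope * x
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
predicted = model.predict(X)

# In this toy data low odds go with home wins, so the fitted slope is negative
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("MAE = %.2f" % mean_absolute_error(y, predicted))
```

Because the target only takes the values 0 and 1, the fitted line is best read as a rough trend rather than as a calibrated probability.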

In [None]:
# Template from the course, kept for reference: feature_names = ['1stFlrSF']
#X = home_data[feature_names]
#y = home_data["SalePrice"]
#X.head()

feature_names = ['B365 Home','Bet&Win Home','Interwetten Home','William Hill Home','VC Bet Home']
X = data[feature_names]
y = data["Match Result_H"]
X.head()

