# Soccer Game Prediction: Exploratory Data Analysis

The purpose of this notebook is to find insights in the match data that will help us build a predictive model for soccer game results, with the goal of maximizing our ROI on online betting.

In [1]:
# import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# load the raw 2015/2016 season data for division E0
data = pd.read_csv('../data/Games/raw/E0_2015_2016.csv')
# show the first five rows
data.head()

,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,...,BbAv<2.5,BbAH,BbAHh,BbMxAHH,BbAvAHH,BbMxAHA,BbAvAHA,PSCH,PSCD,PSCA
0,E0,08/08/15,Bournemouth,Aston Villa,0,1,A,0,0,D,...,1.79,26,-0.5,1.98,1.93,1.99,1.92,1.82,3.88,4.7
1,E0,08/08/15,Chelsea,Swansea,2,2,D,2,1,H,...,1.99,27,-1.5,2.24,2.16,1.8,1.73,1.37,5.04,10.88
2,E0,08/08/15,Everton,Watford,2,2,D,0,1,A,...,1.96,26,-1.0,2.28,2.18,1.76,1.71,1.75,3.76,5.44
3,E0,08/08/15,Leicester,Sunderland,4,2,H,3,0,H,...,1.67,26,-0.5,2.0,1.95,1.96,1.9,1.79,3.74,5.1
4,E0,08/08/15,Man United,Tottenham,1,0,H,1,0,H,...,2.01,26,-1.0,2.2,2.09,1.82,1.78,1.64,4.07,6.04


Features that we can craft from the data:
* Number of home victories for the home team in the current season [OK]
* Number of home games for the home team in the current season [OK]
* Number of points of the home team in the current season [OK]
* Number of away victories or draws for the away team in the current season [OK]
* Number of away games for the away team in the current season [OK]
* Number of points of the away team in the current season [OK]
* Total number of games played in the season by the home team [OK]
* Total number of games played in the season by the away team [OK]
* Number of points won at home by the home team in its last n home games [OK]
* Number of points won away by the away team in its last n away games [OK]
* Number of goals scored at home by the home team in the current season [OK]
* Mean number of goals scored at home by the home team in the current season [OK]
* Number of goals conceded at home by the home team in the current season [OK]
* Mean number of goals conceded at home by the home team in the current season [OK]
* Number of goals scored away by the away team in the current season [OK]
* Mean number of goals scored away by the away team in the current season [OK]
* Number of goals conceded away by the away team in the current season [OK]
* Mean number of goals conceded away by the away team in the current season [OK]

* Mean difference in points with the away team over the home team's last 3 home games
* Mean difference in points with the home team over the away team's last 3 away games
* Number of home victories for the home team last season
* Number of home defeats or draws for the home team last season
* Number of away victories or draws for the away team last season
* Number of away defeats for the away team last season
* Mean number of goals scored at home by the home team over its last 5 home games
* Mean number of goals conceded at home by the home team over its last 5 home games
* Mean number of goals scored away by the away team over its last 5 away games
* Mean number of goals conceded away by the away team over its last 5 away games
* Current position of the home team
* Current position of the away team
* Mean position of the home team over its last 5 games
* Mean position of the away team over its last 5 games
* Mean position of the home team over its last 3 seasons in the league
* Mean position of the away team over its last 3 seasons in the league
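
As an illustration of the season-to-date features listed above, here is a minimal sketch of how a few of the home-team ones could be computed with pandas. It assumes the rows are already sorted chronologically and uses a small made-up `games` frame in place of the real file. The output column names (`home_games_before`, `home_wins_before`, and so on) are hypothetical and not part of the dataset. Each feature counts only games played before the current one, so the result of a match never leaks into its own features.

```python
import pandas as pd

# Toy stand-in for the season file (made-up results, only the columns used here).
# Assumption: rows are in chronological order.
games = pd.DataFrame({
    "HomeTeam": ["A", "B", "A", "B", "A"],
    "FTHG": [2, 1, 0, 3, 1],             # full-time home goals
    "FTR": ["H", "D", "A", "H", "D"],    # full-time result (H/D/A)
})

home_win = (games["FTR"] == "H").astype(int)
home_points = games["FTR"].map({"H": 3, "D": 1, "A": 0})
team = games["HomeTeam"]

# Number of earlier home games of the same team (0 for its first home game).
games["home_games_before"] = games.groupby("HomeTeam").cumcount()

# Running totals minus the current game's own value = totals BEFORE this game.
games["home_wins_before"] = home_win.groupby(team).cumsum() - home_win
games["home_points_before"] = home_points.groupby(team).cumsum() - home_points
games["home_goals_before"] = (
    games.groupby("HomeTeam")["FTHG"].cumsum() - games["FTHG"]
)

# Mean goals per earlier home game; NaN when the team has no earlier home game.
games["home_mean_goals_before"] = (
    games["home_goals_before"] / games["home_games_before"]
)

print(games)
```

The away-team features would follow the same pattern, grouping on `AwayTeam` and using `FTAG` instead.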

## What is the profile of the winner? 

Here, we assess whether winning teams share common characteristics.

### Is the home team more likely to win?

In [3]:
pd.crosstab(index=data['FTR'],columns='norm_count',normalize=True)

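FTR,norm_count
A,0.305263
D,0.281579
H,0.413158

In this season the home team won about 41% of games, against roughly 31% away wins and 28% draws, so home advantage is visible in the raw outcome frequencies.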


## What features can we craft?

* Percentage of home victories over the last 5 home games
* Number of goals scored at home during the current season
* Number of goals scored at home during the last 5 home games of the season
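
The "last 5 games" features above can be sketched with a shifted rolling window per team. This is a rough illustration on made-up data, assuming chronologically sorted rows. The helper `prior_rolling` and the output column names are hypothetical, and the window is set to 2 only so the toy result is easy to check by hand (the list above uses 5).

```python
import pandas as pd

# Toy stand-in for the season file (made-up results).
# Assumption: rows are in chronological order.
games = pd.DataFrame({
    "HomeTeam": ["A", "A", "A", "A", "B", "B"],
    "FTHG": [2, 0, 1, 3, 1, 1],                # full-time home goals
    "FTR": ["H", "A", "D", "H", "D", "H"],     # full-time result (H/D/A)
})

N = 2  # window size for this toy example; the features above use 5


def prior_rolling(values, keys, n, how):
    """Rolling statistic over each team's previous n games (hypothetical helper).

    shift(1) drops the current game from its own window, and min_periods=1
    lets the window be shorter early in the season.
    """
    return values.groupby(keys).transform(
        lambda s: getattr(s.shift(1).rolling(n, min_periods=1), how)()
    )


home_win = (games["FTR"] == "H").astype(float)

# Share of home wins over the team's previous N home games.
games["home_win_pct_last_n"] = prior_rolling(home_win, games["HomeTeam"], N, "mean")

# Goals scored at home over the team's previous N home games.
games["home_goals_last_n"] = prior_rolling(games["FTHG"], games["HomeTeam"], N, "sum")

print(games)
```

A team's first home game has no history, so both features are NaN there. How to fill those values (for example with last season's figures) is left open.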

In [4]:
data.columns

Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
       'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
       'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD',
       'BWA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA',
       'WHH', 'WHD', 'WHA', 'VCH', 'VCD', 'VCA', 'Bb1X2', 'BbMxH', 'BbAvH',
       'BbMxD', 'BbAvD', 'BbMxA', 'BbAvA', 'BbOU', 'BbMx>2.5', 'BbAv>2.5',
       'BbMx<2.5', 'BbAv<2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH',
       'BbMxAHA', 'BbAvAHA', 'PSCH', 'PSCD', 'PSCA'],
      dtype='object')

In [5]:
len(data.columns)

65