# Final Project - Beat the Bookies

## Day One - Initial Data Exploration

The goals for today are:

- Create repositories
- Add collaborators
- Add notebooks to the project
- Confirm primary data source
- Build a dataframe that includes all the relevant and desired data for our model (check Google Sheets for the columns).


Using the dataset downloaded from Kaggle (https://www.kaggle.com/hugomathien/soccer?), we explore the data available for one season (2015/16) and combine it into our own dataframe containing the information we need for our models.

Columns for the dataset:

- Home Team Rank
- Home Team Total Wins
- Home Team win
- Draw
- Away Team Win
- Away Team Rank

Columns to add later, once we know that the core of the table can be combined easily for all seasons: possession, reds, yellows, corners, shots at goal, shots on target, goal percentage, throw-ins (?), fouls, free kicks.

Target:
- Home Team win y/n

### Imports

In [None]:
import pandas as pd
import numpy as np 
import sqlite3

### Importing the data from SQLite

In [None]:
# create the file path (include two beatthebookies)
path = '/Users/georgebrockman/code/georgebrockman/beatthebookies/beatthebookies/data/database.sqlite'
# create the connection
conn = sqlite3.connect(path)
# create the cursor
cursor = conn.cursor()

In [None]:
# from the match table we want to select all matches played in the English Premier league for the 2015/16 season.
# Premier League - league_id = 1729
# order by stages (Game Weeks)
# need quotes around '2015/2016'
result = cursor.execute(""" SELECT * FROM "Match" m 
                        WHERE league_id = 1729 
                        AND season = '2015/2016' 
                        ORDER BY stage ASC""")

In [None]:
row = result.fetchall()
row[0] # seems to be a lot of null values for shots, s
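
The league and season filter above can also be passed as named parameters rather than written into the SQL string. This is a minimal sketch, assuming a toy in-memory table that stands in for the Kaggle `Match` table; the rows and the helper name `season_matches` are made up for illustration.

```python
import sqlite3

# Toy in-memory stand-in for the Kaggle "Match" table (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Match (id INTEGER, league_id INTEGER, season TEXT, stage INTEGER)")
conn.executemany(
    "INSERT INTO Match VALUES (?, ?, ?, ?)",
    [
        (1, 1729, "2015/2016", 2),
        (2, 1729, "2015/2016", 1),
        (3, 1729, "2014/2015", 1),
        (4, 21518, "2015/2016", 1),
    ],
)

def season_matches(conn, league_id, season):
    """Return all rows for one league and season, ordered by game week."""
    query = """SELECT * FROM "Match"
               WHERE league_id = :league_id AND season = :season
               ORDER BY stage ASC"""
    return conn.execute(query, {"league_id": league_id, "season": season}).fetchall()

rows = season_matches(conn, 1729, "2015/2016")
```

Because the season is bound as a parameter, the quoting of `'2015/2016'` is handled by the driver.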

In [None]:
# create a DataFrame from the SQL query
matches_df = pd.read_sql(""" SELECT * FROM "Match" m 
                        WHERE league_id = 1729 
                        AND season = '2015/2016' 
                        ORDER BY stage ASC""", conn)

In [None]:
# inspect the DataFrame
matches_df.columns.values

In [None]:
# we can remove all the ones not needed for home-away goal data
simple_matches = matches_df.drop(['home_player_X1',
       'home_player_X2', 'home_player_X3', 'home_player_X4',
       'home_player_X5', 'home_player_X6', 'home_player_X7',
       'home_player_X8', 'home_player_X9', 'home_player_X10',
       'home_player_X11', 'away_player_X1', 'away_player_X2',
       'away_player_X3', 'away_player_X4', 'away_player_X5',
       'away_player_X6', 'away_player_X7', 'away_player_X8',
       'away_player_X9', 'away_player_X10', 'away_player_X11',
       'home_player_Y1', 'home_player_Y2', 'home_player_Y3',
       'home_player_Y4', 'home_player_Y5', 'home_player_Y6',
       'home_player_Y7', 'home_player_Y8', 'home_player_Y9',
       'home_player_Y10', 'home_player_Y11', 'away_player_Y1',
       'away_player_Y2', 'away_player_Y3', 'away_player_Y4',
       'away_player_Y5', 'away_player_Y6', 'away_player_Y7',
       'away_player_Y8', 'away_player_Y9', 'away_player_Y10',
       'away_player_Y11', 'home_player_1', 'home_player_2',
       'home_player_3', 'home_player_4', 'home_player_5', 'home_player_6',
       'home_player_7', 'home_player_8', 'home_player_9',
       'home_player_10', 'home_player_11', 'away_player_1',
       'away_player_2', 'away_player_3', 'away_player_4', 'away_player_5',
       'away_player_6', 'away_player_7', 'away_player_8', 'away_player_9',
       'away_player_10', 'away_player_11', 'goal', 'shoton', 'shotoff',
       'foulcommit', 'card', 'cross', 'corner', 'possession', 'B365H',
       'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA', 'LBH',
       'LBD', 'LBA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'SJH',
       'SJD', 'SJA', 'VCH', 'VCD', 'VCA', 'GBH', 'GBD', 'GBA', 'BSH',
       'BSD', 'BSA'], axis=1)

In [None]:
simple_matches.head(5)

In [None]:
simple_matches['home_goals']= simple_matches.groupby('home_team_api_id')['home_team_goal'].cumsum().fillna(0)

In [None]:
simple_matches['away_goals']= simple_matches.groupby('away_team_api_id')['away_team_goal'].cumsum().fillna(0)

In [None]:
simple_matches

In [None]:
simple_matches.info()

In [None]:
simple_matches['home_wins'] = simple_matches.home_team_goal > simple_matches.away_team_goal
simple_matches['draws'] = simple_matches.home_team_goal == simple_matches.away_team_goal
simple_matches['away_wins'] = simple_matches.home_team_goal < simple_matches.away_team_goal
simple_matches

In [None]:
# convert booleans into binary numbers 
simple_matches['home_wins'] = simple_matches['home_wins'].astype(int)
simple_matches['draws'] = simple_matches['draws'].astype(int)
simple_matches['away_wins'] = simple_matches['away_wins'].astype(int)


In [None]:
simple_matches.head()

In [None]:
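# create cumulative home and away wins, grouping by each team id
# NOTE: 'cumulative_away_wins_home' currently repeats 'cumulative_home_wins_home', and
# 'cumulative_home_wins_away' repeats 'cumulative_away_wins_away'; a team's record at the
# other venue cannot be derived from a single groupby on these columns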

simple_matches['cumulative_home_wins_home'] = simple_matches.groupby('home_team_api_id')['home_wins'].cumsum().fillna(0)
simple_matches['cumulative_away_wins_home'] = simple_matches.groupby('home_team_api_id')['home_wins'].cumsum().fillna(0)
simple_matches['cumulative_home_wins_away'] = simple_matches.groupby('away_team_api_id')['away_wins'].cumsum().fillna(0)
# simple_matches['cumulative_draws'] = simple_matches.groupby('home_team_api_id')['draws'].cumsum().fillna(0)
simple_matches['cumulative_away_wins_away'] = simple_matches.groupby('away_team_api_id')['away_wins'].cumsum().fillna(0)


In [None]:
simple_matches.tail(20)
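
The cumulative columns above include the result of the match on the same row. If they are later used as features for predicting that match (an assumption about how they will be used), the running total as it stood before kick-off may be more appropriate. A minimal sketch on made-up fixtures:

```python
import pandas as pd

# Toy fixtures in chronological order (made-up values).
toy = pd.DataFrame({
    "home_team_api_id": [1, 2, 1, 1],
    "home_wins":        [1, 0, 0, 1],
})

grouped = toy.groupby("home_team_api_id")["home_wins"]
# Running total that includes the current match, as in the cells above.
toy["cum_home_wins"] = grouped.cumsum()
# Running total as it stood before kick-off: subtract the current result.
toy["cum_home_wins_before"] = toy["cum_home_wins"] - toy["home_wins"]
```

The rows need to be in chronological order before the `cumsum`, which the `ORDER BY stage` in the query is assumed to provide.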

## Day Two - Comparing bets (2015/2016)

The goal of today is to find a baseline betting profit and loss for our model to compare with.

From the data set we are going to look at the odds offered by the betting firm Ladbrokes on home win, away win and draw.

- Placing a ten pound bet on the favourite (team with lowest odds).
- Placing a ten pound bet on the Home team every time.
- Placing a ten pound bet on a draw every match.
- Placing a ten pound bet on the underdog (team with highest odds).
- Placing a ten pound bet on the away team.

Maximum loss is the ten pound stake - so after 380 matches worst case scenario is **£3800 down**. 
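
The arithmetic behind each bet can be sketched as below, assuming the Ladbrokes columns hold decimal odds, so that a winning bet returns stake times odds and the profit is that return minus the stake. The helper name `bet_profit` and the odds of 2.5 are illustrative only.

```python
def bet_profit(odds, won, stake=10.0):
    """Profit or loss on a single bet priced in decimal odds."""
    # A winning bet returns stake * odds, which includes the stake itself.
    return stake * (odds - 1) if won else -stake

worst_case = 380 * bet_profit(2.5, won=False)   # every bet of the season loses
example_win = bet_profit(2.5, won=True)         # 10 * (2.5 - 1)
```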


### Import the betting data

In [None]:
import pandas as pd
import numpy as np 
import sqlite3

In [None]:
# create the file path (include two beatthebookies)
path = '/Users/georgebrockman/code/georgebrockman/beatthebookies/beatthebookies/data/database.sqlite'
# create the connection
conn = sqlite3.connect(path)
# create the cursor
cursor = conn.cursor()

In [None]:
# use joining tables to keep the team names for better clarity
matches = pd.read_sql("""SELECT m.id, 
                        m.season, m.stage, m.date, 
                        ht.team_long_name as home_team, at.team_long_name as away_team, m.home_team_goal, 
                        m.away_team_goal, LBH, LBD, LBA                                      
                        FROM Match as m
                        LEFT JOIN Team AS ht on ht.team_api_id = m.home_team_api_id
                        LEFT JOIN Team AS at on at.team_api_id = m.away_team_api_id
                        WHERE league_id = 1729 AND season = '2015/2016'
                        ;""", conn)

In [None]:
# sort matches into chronological order
matches.sort_values('date', inplace=True)

In [None]:
# check correctly imported the required data
matches.head()

In [None]:
# create booleans
matches['home_win'] = matches.home_team_goal > matches.away_team_goal
matches['draw'] = matches.home_team_goal == matches.away_team_goal
matches['away_win'] = matches.home_team_goal < matches.away_team_goal

In [None]:
# convert booleans into binary numbers 
matches['home_win'] = matches['home_win'].astype(int)
matches['draw'] = matches['draw'].astype(int)
matches['away_win'] = matches['away_win'].astype(int)

### Profit from betting on favourite

In [None]:
matches['fav_profit'] = -10

In [None]:
matches.head()

In [None]:
# a winning bet returns odds * stake, so the profit is that return minus the 10 pound stake
matches.loc[(matches[['LBH','LBD','LBA']].min(axis=1) == matches['LBH']) & (matches['home_win'] == 1), 'fav_profit'] = matches['LBH'] * 10 - 10
matches.loc[(matches[['LBH','LBD','LBA']].min(axis=1) == matches['LBD']) & (matches['draw'] == 1), 'fav_profit'] = matches['LBD'] * 10 - 10
matches.loc[(matches[['LBH','LBD','LBA']].min(axis=1) == matches['LBA']) & (matches['away_win'] == 1), 'fav_profit'] = matches['LBA'] * 10 - 10


In [None]:
matches

### Profit betting on Underdog

In [None]:
matches['dog_profit'] = -10

In [None]:
# a winning bet returns odds * stake, so the profit is that return minus the 10 pound stake
matches.loc[(matches[['LBH','LBD','LBA']].max(axis=1) == matches['LBH']) & (matches['home_win'] == 1), 'dog_profit'] = matches['LBH'] * 10 - 10
matches.loc[(matches[['LBH','LBD','LBA']].max(axis=1) == matches['LBD']) & (matches['draw'] == 1), 'dog_profit'] = matches['LBD'] * 10 - 10
matches.loc[(matches[['LBH','LBD','LBA']].max(axis=1) == matches['LBA']) & (matches['away_win'] == 1), 'dog_profit'] = matches['LBA'] * 10 - 10

In [None]:
matches.head()
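
The favourite and underdog cells above compare each odds column with the row minimum or maximum. An alternative sketch uses `idxmin` and `idxmax` to name the backed outcome directly. The toy odds, the `result` column and the helper `strategy_profit` are assumptions made for illustration, not part of the project code. Unlike the comparison approach, `idxmin` and `idxmax` pick only the first column when two prices tie.

```python
import pandas as pd

toy = pd.DataFrame({
    "LBH": [1.5, 4.0, 1.8],
    "LBD": [4.0, 3.5, 3.4],
    "LBA": [6.0, 1.9, 4.5],
    "home_team_goal": [2, 1, 0],
    "away_team_goal": [0, 1, 1],
})
odds = toy[["LBH", "LBD", "LBA"]]

# Actual result expressed as the name of the matching odds column.
toy["result"] = "LBD"
toy.loc[toy.home_team_goal > toy.away_team_goal, "result"] = "LBH"
toy.loc[toy.home_team_goal < toy.away_team_goal, "result"] = "LBA"

def strategy_profit(pick, bet=10):
    """Profit per match when backing the outcome named in `pick`."""
    picked_odds = pd.Series(
        [odds.at[i, col] for i, col in pick.items()], index=pick.index
    )
    won = pick == toy["result"]
    # Winning bets earn odds * bet minus the stake; losing bets lose the stake.
    return (picked_odds * bet - bet).where(won, -bet)

toy["fav_profit"] = strategy_profit(odds.idxmin(axis=1))
toy["dog_profit"] = strategy_profit(odds.idxmax(axis=1))
```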

### Profit from Home / Draw / Away

In [None]:
matches['home_profit'] = -10

In [None]:
matches.loc[(matches['home_win'] == 1), 'home_profit'] = matches['LBH'] * 10 - 10  # return minus stake

In [None]:
matches['draw_profit'] = -10

In [None]:
matches.loc[(matches['draw'] == 1), 'draw_profit'] = matches['LBD'] * 10 - 10  # return minus stake

In [None]:
matches['away_profit'] = -10

In [None]:
matches.loc[(matches['away_win'] == 1), 'away_profit'] = matches['LBA'] * 10 - 10  # return minus stake

In [None]:
matches.sort_values('date', inplace=True)

In [None]:
matches

### Cumulative profit totals for each strategy

In [None]:
# define a function that returns the total profit over the season for each strategy
def c_profits(X):
    fav_profit_cum = X['fav_profit'].sum()
    dog_profit_cum = X['dog_profit'].sum()
    home_profit_cum = X['home_profit'].sum()
    draw_profit_cum = X['draw_profit'].sum()
    away_profit_cum = X['away_profit'].sum()
    return fav_profit_cum, dog_profit_cum, home_profit_cum,draw_profit_cum, away_profit_cum

In [None]:
matches['away_profit'].cumsum().tail(1)

In [None]:
c_profits(matches)

It is very surprising that the underdog approach was the most profitable method of betting in this season, but this was the year that Leicester City won the league.

### Reorganise methods into functions to apply to all the data

In [None]:
# pseudocode
# import the data using get_data():
# this returns a table with home_w, away_w and draw already added, so there is no need to add the booleans and convert them.
# first function is the favourites method
# second function is underdog
# third is home team
# fourth is draw
# fifth is always bet away.

In [None]:
def pick_the_fav(X, season, bet=10):
    """
    function returns a dataframe that includes the per-match profit/loss for a season following the favourites strategy
    """
    # default profit is the lost stake
    X['fav_profit'] = -bet
    # update profit column where the favourite outcome happened (return minus stake)
    X.loc[(X[['LBH','LBD','LBA']].min(axis=1) == X['LBH']) & (X['home_w'] == 1), 'fav_profit'] = (X['LBH'] * bet) - bet
    X.loc[(X[['LBH','LBD','LBA']].min(axis=1) == X['LBD']) & (X['draw'] == 1), 'fav_profit'] = (X['LBD'] * bet) - bet
    X.loc[(X[['LBH','LBD','LBA']].min(axis=1) == X['LBA']) & (X['away_w'] == 1), 'fav_profit'] = (X['LBA'] * bet) - bet
    
    return X

In [None]:
def pick_the_dog(X, season, bet=10):
    """
    function returns a dataframe that includes the per-match profit/loss for a season following the underdog strategy
    """
    # default profit is the lost stake
    X['dog_profit'] = -bet
    # update profit column where the underdog outcome happened (return minus stake)
    X.loc[(X[['LBH','LBD','LBA']].max(axis=1) == X['LBH']) & (X['home_w'] == 1), 'dog_profit'] = (X['LBH'] * bet) - bet
    X.loc[(X[['LBH','LBD','LBA']].max(axis=1) == X['LBD']) & (X['draw'] == 1), 'dog_profit'] = (X['LBD'] * bet) - bet
    X.loc[(X[['LBH','LBD','LBA']].max(axis=1) == X['LBA']) & (X['away_w'] == 1), 'dog_profit'] = (X['LBA'] * bet) - bet
    
    return X

In [None]:
def always_home(X, season, bet=10):
    """
    function returns a dataframe that includes the weekly profit/loss for season following the home team strategy
    """
    X['home_profit'] = -bet
    X.loc[(X['home_w'] == 1), 'home_profit'] = (X['LBH'] * bet) - bet
    
    return X

In [None]:
def always_draw(X, season, bet=10):
    """
    function returns a dataframe that includes the per-match profit/loss for a season following the always-draw strategy
    """
    X['draw_profit'] = -bet
    # use the draw odds (LBD), not the home odds
    X.loc[(X['draw'] == 1), 'draw_profit'] = (X['LBD'] * bet) - bet
    
    return X

In [None]:
def always_away(X, season, bet=10):
    """
    function returns a dataframe that includes the per-match profit/loss for a season following the away team strategy
    """
    X['away_profit'] = -bet
    # use the away odds (LBA), not the home odds
    X.loc[(X['away_w'] == 1), 'away_profit'] = (X['LBA'] * bet) - bet
    
    return X
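
The three `always_*` functions above differ only in the outcome column and the odds column, so they could share one helper. This is a sketch under that assumption; `flat_bet_profit` is a hypothetical name and the toy odds are made up.

```python
import pandas as pd

def flat_bet_profit(X, outcome_col, odds_col, bet=10):
    """Per-match profit from staking `bet` on the same outcome every match."""
    won = X[outcome_col] == 1
    # Winning bets earn odds * bet minus the stake; losing bets lose the stake.
    return (X[odds_col] * bet - bet).where(won, -bet)

toy = pd.DataFrame({
    "home_w": [1, 0, 0], "draw": [0, 1, 0], "away_w": [0, 0, 1],
    "LBH": [2.0, 1.5, 3.0], "LBD": [3.5, 4.0, 3.0], "LBA": [4.0, 6.0, 2.5],
})
toy["home_profit"] = flat_bet_profit(toy, "home_w", "LBH")
toy["draw_profit"] = flat_bet_profit(toy, "draw", "LBD")
toy["away_profit"] = flat_bet_profit(toy, "away_w", "LBA")
totals = toy[["home_profit", "draw_profit", "away_profit"]].sum()
```

Passing the odds column explicitly makes it harder to pair an outcome with the wrong price.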

### test functions

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from beatthebookies.data import get_betting_data, get_data

In [None]:
def get_betting_data(season='2015/2016', league=1729, local=False, optimize=False, **kwargs):
    path = "../data/"
    database = path + 'database.sqlite'
    conn = sqlite3.connect(database)

    df = pd.read_sql("""SELECT m.id,
                            m.season, m.stage, m.date,
                            ht.team_long_name as home_team, at.team_long_name as away_team, m.home_team_goal,
                            m.away_team_goal, m.LBH, m.LBD, m.LBA 
                            FROM Match as m
                            LEFT JOIN Team AS ht on ht.team_api_id = m.home_team_api_id
                            LEFT JOIN Team AS at on at.team_api_id = m.away_team_api_id
                            WHERE league_id = :league AND season = :season
                            ;""", conn, params={"league":league, "season":season})
    # add win columns
    df['home_w'] = 0
    df['away_w'] = 0
    df['draw'] = 0
    # set winner
    df.loc[df['home_team_goal'] > df['away_team_goal'], 'home_w'] = 1
    df.loc[df['home_team_goal'] < df['away_team_goal'], 'away_w'] = 1
    df.loc[df['home_team_goal'] == df['away_team_goal'], 'draw'] = 1
    # sort into order
    df.sort_values('date', inplace=True)

    return df

In [None]:
betting = get_data()

In [None]:
betting.info()

In [None]:
def simple_betting_profits(df, bet=10):
    """
    function returns the cumulative profits/loss from a season of following a consistent, simple betting strategy.
    """
    # set defaults profit equal to bet size
    # create new column for fav strategy
    df['fav_profit'] = -bet
    # update profit column
    df.loc[(df[['LBH','LBD','LBA']].min(axis=1) == df['LBH']) & (df['home_w'] == 1), 'fav_profit'] = (df['LBH'] * bet) - bet
    df.loc[(df[['LBH','LBD','LBA']].min(axis=1) == df['LBD']) & (df['draw'] == 1), 'fav_profit'] = (df['LBD'] * bet) - bet
    df.loc[(df[['LBH','LBD','LBA']].min(axis=1) == df['LBA']) & (df['away_w'] == 1), 'fav_profit'] = (df['LBA'] * bet) - bet

    # set defaults profit equal to bet size
    # create new column for underdog strategy
    df['dog_profit'] = -bet
    # update profit column
    df.loc[(df[['LBH','LBD','LBA']].max(axis=1) == df['LBH']) & (df['home_w'] == 1), 'dog_profit'] = (df['LBH'] * bet) - bet
    df.loc[(df[['LBH','LBD','LBA']].max(axis=1) == df['LBD']) & (df['draw'] == 1), 'dog_profit'] = (df['LBD'] * bet) - bet
    df.loc[(df[['LBH','LBD','LBA']].max(axis=1) == df['LBA']) & (df['away_w'] == 1), 'dog_profit'] = (df['LBA'] * bet) - bet

    # create new column for home team method
    df['home_profit'] = -bet
    df.loc[(df['home_w'] == 1), 'home_profit'] = (df['LBH'] * bet) - bet
    # create new column for draw tactic
    df['draw_profit'] = -bet
    df.loc[(df['draw'] == 1), 'draw_profit'] = (df['LBD'] * bet) - bet
    # create new column for betting on the away team
    df['away_profit'] = -bet
    df.loc[(df['away_w'] == 1), 'away_profit'] = (df['LBA'] * bet) - bet

    fav_profit_total = df['fav_profit'].sum()
    dog_profit_total = df['dog_profit'].sum()
    home_profit_total = df['home_profit'].sum()
    draw_profit_total = df['draw_profit'].sum()
    away_profit_total = df['away_profit'].sum()

    return fav_profit_total, dog_profit_total, home_profit_total, draw_profit_total, away_profit_total


In [None]:
simple_betting_profits(betting)

## Day Three - Ternary outputs and introducing additional Data

In [None]:
import pandas as pd
import numpy as np 
import sqlite3

In [None]:
# create the file path (include two beatthebookies)
path = '/Users/georgebrockman/code/georgebrockman/beatthebookies/beatthebookies/data/database.sqlite'
# create the connection
conn = sqlite3.connect(path)
# create the cursor
cursor = conn.cursor()

In [None]:
# use joining tables to keep the team names for better clarity
matches = pd.read_sql("""SELECT m.id, 
                        m.season, m.stage, m.date, 
                        ht.team_long_name as home_team, at.team_long_name as away_team, m.home_team_goal, 
                        m.away_team_goal, LBH, LBD, LBA, goal, shoton, shotoff, foulcommit, card, 
                        cross, corner, possession                                      
                        FROM Match as m
                        LEFT JOIN Team AS ht on ht.team_api_id = m.home_team_api_id
                        LEFT JOIN Team AS at on at.team_api_id = m.away_team_api_id
                        WHERE league_id = 1729 AND season = '2015/2016'
                        ;""", conn)

In [None]:
matches.head()

From here we have concluded not to use the Kaggle dataset for the match stats. Instead we will compile a complete data frame from 12 separate CSV files (one per season, 2008 to 2019) downloaded from https://www.football-data.co.uk

In [None]:
# load all csv files as dataframes
df = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2008.csv")
df2 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2009.csv")
df3 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2010.csv")
df4 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2011.csv")
df5 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2012.csv")
df6 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2013.csv")
df7 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2014.csv")
df8 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2015.csv")
df9 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2016.csv")
df10 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2017.csv")
df11 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2018.csv")
df12 = pd.read_csv("/Users/georgebrockman/code/georgebrockman/Premier League CSV/premierleague2019.csv")

In [None]:
# issues combining all at once so combine individually
combin_df = pd.concat([df,df2])
combin_df.reset_index(drop=True, inplace=True)

In [None]:
# 2008-2010 dataset
temp = pd.concat([combin_df, df3])
temp.reset_index(drop=True, inplace=True)
temp

In [None]:
# 2008-2011
tmp = pd.concat([temp, df4])
tmp.reset_index(drop=True, inplace=True)
tmp

In [None]:
# create a merged csv for seasons 2008-2011 
tmp.to_csv( "combined_csv.csv", index=True, encoding='utf-8-sig')

In [None]:
# 2012-2013 season merge
tmp1 = pd.concat([df5, df6])
tmp1.reset_index(drop=True, inplace=True)
tmp1

In [None]:
# each csv has different columns, so we need to drop the columns that aren't in the kaggle dataset.
len(df.columns.values)

In [None]:
df.head(20)

In [None]:
df2.columns.values # GBH GBD GBA - GAMEBOOKERS

### Convert CSV Files to Relevant Columns and Combine

In [None]:
df = df[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df2 = df2[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df3 = df3[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df4 = df4[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df5 = df5[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df6 = df6[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df7 = df7[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df8 = df8[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df9 = df9[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df10 = df10[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']]

In [None]:
df11 = df11[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR','WHH', 'WHD', 'WHA']]

In [None]:
df12 = df12[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
       'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF',
       'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'WHH', 'WHD', 'WHA']] 

In [None]:
match_stats = pd.concat([df,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12])
match_stats.reset_index(drop=True, inplace=True)
match_stats

In [None]:
# write the combined data frame to a csv file
match_stats.to_csv( "premierleaguematches.csv", index=True, encoding='utf-8-sig')
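
The select-then-concatenate steps above repeat the same column list for every season, which could be written as a loop. A minimal sketch, assuming two made-up season files held in memory and a shortened column list `KEEP`; the real files and the full column list are as in the cells above.

```python
import io
import pandas as pd

KEEP = ["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"]  # shortened for the example

# Two made-up season files with different extra bookmaker columns.
raw_files = [
    "Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,GBH\n16/08/08,A,B,1,0,H,1.9\n",
    "Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,PSH\n15/08/09,C,D,0,2,A,2.4\n",
]

# Read each file and keep only the columns shared by every season.
frames = [pd.read_csv(io.StringIO(text))[KEEP] for text in raw_files]

# ignore_index=True renumbers the rows, replacing the separate reset_index call.
match_stats = pd.concat(frames, ignore_index=True)
```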

### Add Seasons column to premierleaguematches.csv file 

In [24]:
# import the csv file 
df = pd.read_csv('/Users/georgebrockman/code/georgebrockman/beatthebookies/beatthebookies/notebooks/premierleaguematches.csv')

In [27]:
# rename the unnamed index column to Season
df.rename({'Unnamed: 0': 'Season'}, axis=1, inplace=True)

In [35]:
df.loc[0:379, 'Season'] = '2008/2009'

In [38]:
df.loc[380:759, 'Season'] = '2009/2010'

In [40]:
df.loc[760:1139, 'Season'] = '2010/2011'

In [45]:
df.loc[1140:1519, 'Season'] = '2011/2012'

In [48]:
df.loc[1520:1899, 'Season'] = '2012/2013'

In [49]:
df.loc[1900:2279, 'Season'] = '2013/2014'

In [50]:
df.loc[2280:2659, 'Season'] = '2014/2015'

In [51]:
df.loc[2660:3039, 'Season'] = '2015/2016'

In [52]:
df.loc[3040:3419, 'Season'] = '2016/2017'

In [53]:
df.loc[3420:3799, 'Season'] = '2017/2018'

In [54]:
df.loc[3800:4179, 'Season'] = '2018/2019'

In [57]:
df.loc[4180:4560, 'Season'] = '2019/2020'

In [62]:
# check it's worked
df.Season.unique()

array(['2008/2009', '2009/2010', '2010/2011', '2011/2012', '2012/2013',
       '2013/2014', '2014/2015', '2015/2016', '2016/2017', '2017/2018',
       '2018/2019', '2019/2020'], dtype=object)

In [63]:
df.to_csv( "premierleague.csv", index=True, encoding='utf-8-sig')
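
The row ranges used above follow a fixed pattern of 380 matches per season, so the label can be derived from the row position. This is a sketch under that assumption; `season_for_row` is a hypothetical helper, and positions past the last full block are clamped to the final season, matching the slightly wider last range above.

```python
MATCHES_PER_SEASON = 380  # 20 teams, each playing 19 home games
SEASONS = [f"{year}/{year + 1}" for year in range(2008, 2020)]

def season_for_row(row_index):
    """Map a row position in the combined file to its season label."""
    block = min(row_index // MATCHES_PER_SEASON, len(SEASONS) - 1)
    return SEASONS[block]

labels = [season_for_row(i) for i in (0, 379, 380, 2660, 4180, 4560)]
```

Applied to the combined file, the same helper could fill the column with `df['Season'] = [season_for_row(i) for i in df.index]`.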

## Day Four - Scraping FIFA Stats for 2008, 2012, 2016-2019

The aim of today is to collect the FIFA stats for the seasons missing from the Kaggle dataset and add them to the newly created match stats CSV file, ready for merging with the Kaggle dataset that our model has already been working on.

We are exploring different scraping options, but failure to get anywhere may result in an attempt to use the Beautiful Soup library as a last resort...

In [10]:
import pandas as pd 
import numpy as np

In [1]:
from beatthebookies.data import get_data, get_rankings

In [11]:
data = get_data()

get_data 10.44


In [42]:
pd.options.display.max_columns = None
pd.options.display.max_rows = None

In [13]:
data.head()

Unnamed: 0,id,season,stage,date,home_team,away_team,home_team_goal,away_team_goal,home_w,away_w,draw,home_t_home_goals,home_t_home_goals_against,home_t_home_wins,home_t_home_losses,home_t_home_draws,away_t_away_goals,away_t_away_goals_against,away_t_away_wins,away_t_away_losses,away_t_away_draws,home_t_away_goals,home_t_away_goals_against,home_t_away_wins,home_t_away_losses,home_t_away_draws,away_t_home_goals,away_t_home_goals_against,away_t_home_wins,away_t_home_losses,away_t_home_draws,bups_x,bupp_x,ccp_x,ccc_x,ccs_x,dp_x,da_x,dtw_x,bups_y,bupp_y,ccp_y,ccc_y,ccs_y,dp_y,da_y,dtw_y
0,4390,2015/2016,1,2015-08-08 00:00:00,Bournemouth,Aston Villa,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,47,47,49,45,47,48,50,62,63,54,60,48,38,35,44,54
1,4391,2015/2016,1,2015-08-08 00:00:00,Chelsea,Swansea City,2,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,67,36,41,34,44,39,41,46,45,42,34,36,55,31,47,42
2,4392,2015/2016,1,2015-08-08 00:00:00,Everton,Watford,2,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,43,40,39,33,63,52,58,59,61,52,54,42,45,38,52,45
3,4393,2015/2016,1,2015-08-08 00:00:00,Leicester City,Sunderland,4,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,63,60,47,64,46,58,65,55,43,51,59,50,55,47,45,49
4,4394,2015/2016,1,2015-08-08 00:00:00,Manchester United,Tottenham Hotspur,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,38,44,49,44,40,54,53,56,47,40,41,41,63,63,54,56


Chris took over the web scraping and smashed it!