<h1> Is there a statistical difference in the odds of winning a game when a team is playing in front of a home crowd?</h1>

<h3> Null Hypothesis:</h3> <b>There is no difference in projected win percentage between playing at home and playing away.</b>

<h3> Alternative Hypothesis:</h3> <b>There is a statistical advantage to playing at home.</b>

<h2> Observation </h2>
Our goal is to determine how our models should be handicapped to account for the difference between home and away games. With consistent enough evidence, betting lines could be calculated from the relative strengths of the two teams alone, with a handicap then applied to those predictions to account for home and away field advantage.

<h2> Experiment </h2>
This experiment has been conducted before by other contributors to this dataset, including <a href = "https://www.kaggle.com/aiyoyo/exploring-home-team-advantage"> this experiment</a>. Here we apply a t-test to the normalized sampling distributions of sample means of home and away wins, to obtain a p-value indicating how likely the observed difference would be if home field advantage played no role.
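As a rough illustration of the test described above, the sketch below builds sampling distributions of sample means for home and away win indicators and compares them with a two-sample t-test. The win rates (0.46 and 0.29), the sample sizes, and the helper name `sample_means` are placeholder assumptions, not values taken from the dataset. `scipy.stats.ttest_ind` is called with `equal_var=False` (Welch's variant), which is one reasonable choice and not necessarily the one used later in this analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder data: 1 = win, 0 = no win. The rates are assumptions for illustration.
home_wins = rng.binomial(1, 0.46, size=5000)
away_wins = rng.binomial(1, 0.29, size=5000)

def sample_means(values, n_samples=200, sample_size=50):
    """Draw repeated samples with replacement and return the mean of each."""
    draws = rng.choice(values, size=(n_samples, sample_size), replace=True)
    return draws.mean(axis=1)

home_means = sample_means(home_wins)
away_means = sample_means(away_wins)

# Welch's two-sample t-test on the two sets of sample means.
t_stat, p_value = stats.ttest_ind(home_means, away_means, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

Because the p-value here depends on how many resamples are drawn, it should be read as illustrative only.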

Our dataset contains far more information than this analysis requires, some of which we may use for further questions. The full dataset is structured as follows:

Table | Total Rows | Total Columns | Columns (edited for relevance)
--- | :- | :--- | :---:
Country | 11 | 2 | id, name
League | 11 | 3 | id, country_id, name
Match | 25979 | 115 | id, country_id, league_id, season, stage, date, match_api_id, home_team_api_id, away_team_api_id, home_team_goal, away_team_goal, (others)
Player | 11060 | 7 | id, player_api_id, player_name, player_fifa_api_id, birthday, height, weight
Player_Attributes | 183978 | 42 | id, player_fifa_api_id, player_api_id, date, overall_rating, (other FIFA related statistics)
sqlite_sequence | 7 | 2 | name, seq
Team | 299 | 5 | id, team_api_id, team_fifa_api_id, team_long_name, team_short_name
Team_Attributes | 1458 | 25 | id, team_fifa_api_id, team_api_id, date, buildUpPlaySpeed, buildUpPlaySpeedClass, buildUpPlayDribbling, buildUpPlayDribblingClass, buildUpPlayPassing, buildUpPlayPassingClass, buildUpPlayPositioningClass, chanceCreationPassing, chanceCreationCrossing, chanceCreationShooting, defencePressure, defenceAggression, defenceTeamWidth, (other Class related Attributes)


<h2> Analysis </h2>

<b>Importing Libraries</b>

In [16]:
import sqlite3 
conn = sqlite3.connect('zdata.sqlite')
c = conn.cursor()

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

query = """
SELECT *
FROM Country
"""

# cursor
c.execute(query)

# get the data
data = c.fetchall()

# column names reported by the cursor for the last query
column_names = [column[0] for column in c.description]

# put in dataframe, labelled with those column names
df1 = pd.DataFrame(data, columns=column_names)

print(column_names)

OperationalError: no such table: Match

We start with a few simple queries against the tables we want to use, selecting everything and then attaching the column names to our pandas DataFrame. Let's do the same with the Match table.

In [11]:
query = """
SELECT *
FROM Match
"""

# cursor
c.execute(query)

# get the data
data = c.fetchall()

# column names reported by the cursor for the last query
column_names3 = [column3[0] for column3 in c.description]

# put in dataframe, labelled with those column names
df3 = pd.DataFrame(data, columns=column_names3)

print(column_names3)

OperationalError: no such table: Match
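The `no such table` error above means the connected database does not contain a `Match` table. One common cause is that `sqlite3.connect` creates a new, empty database when the given path does not exist, so a wrong filename only fails at query time. A minimal check, sketched here against an in-memory database with a placeholder `Country` table so that it runs on its own, lists the tables SQLite actually sees (the helper name `list_tables` is hypothetical):

```python
import sqlite3

# In-memory stand-in for the real file; swap in the actual database path.
conn_check = sqlite3.connect(":memory:")
conn_check.execute("CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT)")

def list_tables(connection):
    """Return the names of all tables in the connected SQLite database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]

tables = list_tables(conn_check)
print(tables)
print("Match" in tables)
```

If the list comes back empty for the real file, the path passed to `sqlite3.connect` is the first thing to check.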

To get a better look at some of the data we are going to join and assess, we also want to run this same query on our League data...

as well as our Team data.
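The same pattern applies to both tables. As a sketch, a small helper (the name `table_to_df` is hypothetical) can wrap the query-and-label steps. It is shown here against an in-memory database with made-up rows so that it runs on its own; in this notebook it would be called with the existing `conn` instead.

```python
import sqlite3
import pandas as pd

def table_to_df(connection, table_name):
    """Load a whole table into a DataFrame, labelled with its SQL column names."""
    cursor = connection.cursor()
    # Table names cannot be bound as parameters, so quote the identifier.
    cursor.execute(f'SELECT * FROM "{table_name}"')
    columns = [col[0] for col in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)

# Illustrative stand-in data, not rows from the real dataset.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE League (id INTEGER, country_id INTEGER, name TEXT)")
demo.execute("INSERT INTO League VALUES (1, 1, 'Example League')")
demo.execute("CREATE TABLE Team (id INTEGER, team_api_id INTEGER, team_long_name TEXT)")
demo.execute("INSERT INTO Team VALUES (1, 100, 'Example FC')")

league_df = table_to_df(demo, "League")
team_df = table_to_df(demo, "Team")
print(league_df.head())
print(team_df.head())
```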

Now that we have a better idea of what these tables look like, we can run more advanced SQL queries against them and add columns that will help us test our hypothesis. The dataset does not record the outcome of each match directly, so we derive it from the goal counts: which matches ended in a Home Win, which in an Away Win, and which in a Draw. For our particular question we won't be doing much with the Draw data, but it is still nice to have.

In [12]:
CustomZ_match = pd.read_sql("""SELECT Match.id,  
                                        League.name AS league_name, 
                                        season,
                                        HT.team_long_name AS  home_team,
                                        AT.team_long_name AS away_team,
                                        home_team_goal, 
                                        away_team_goal                                        
                                FROM Match
                                JOIN League on League.id = Match.league_id
                                JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
                                JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
                                ORDER by home_team;""", conn)
CustomZ_match['HomeWin'] = 0
CustomZ_match.loc[CustomZ_match['home_team_goal'] > CustomZ_match['away_team_goal'], "HomeWin"] = 1
CustomZ_match['AwayWin'] = 0
CustomZ_match.loc[CustomZ_match['away_team_goal'] > CustomZ_match['home_team_goal'], "AwayWin"] = 1
CustomZ_match['Draw'] = 0
CustomZ_match.loc[CustomZ_match['home_team_goal'] == CustomZ_match['away_team_goal'], "Draw"] = 1
CustomZ_match.head()

DatabaseError: Execution failed on sql 'SELECT Match.id,  
                                        League.name AS league_name, 
                                        season,
                                        HT.team_long_name AS  home_team,
                                        AT.team_long_name AS away_team,
                                        home_team_goal, 
                                        away_team_goal                                        
                                FROM Match
                                JOIN League on League.id = Match.league_id
                                JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
                                JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
                                ORDER by home_team;': no such table: Match

Now that we have this information, it is fairly easy to build two separate tables: one summing the Home Team wins and one summing the Away Team wins.
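One way this split could look, as a sketch: group by `home_team` and sum `HomeWin`, then group by `away_team` and sum `AwayWin`. The small DataFrame below is invented so that the example runs standalone; it mirrors the columns of `CustomZ_match`, and the names `home_wins` and `away_wins` are assumptions.

```python
import pandas as pd

# Invented matches with the same columns as CustomZ_match.
matches = pd.DataFrame({
    "home_team": ["A", "A", "B", "C"],
    "away_team": ["B", "C", "A", "A"],
    "home_team_goal": [2, 1, 0, 1],
    "away_team_goal": [1, 1, 3, 0],
})
matches["HomeWin"] = (matches["home_team_goal"] > matches["away_team_goal"]).astype(int)
matches["AwayWin"] = (matches["away_team_goal"] > matches["home_team_goal"]).astype(int)

# Wins each team earned at home, and wins each team earned away.
home_wins = matches.groupby("home_team")["HomeWin"].sum().rename("home_wins")
away_wins = matches.groupby("away_team")["AwayWin"].sum().rename("away_wins")

print(home_wins)
print(away_wins)
```

Keeping the two results as separate Series makes it straightforward to compare home and away totals team by team.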