<h1>Chapter 4 | Data Exercise #5 | <code>football</code> | Comparison and correlation</h1>
<h2>Introduction:</h2>
<p>In this notebook, you will find my notes and code for Chapter 4's <b>exercise 5</b> of the book <a href="https://gabors-data-analysis.com/">Data Analysis for Business, Economics, and Policy</a>, by Gábor Békés and Gábor Kézdi. The question was:</p>
<p>5. Use the <code>football</code> dataset and pick a season. Create three groups of teams, based on their performance in their previous season (new teams come from the lower division, and you may put them in the lowest bin).</p>
<p>Assignments:</p>
<ul>
    <li>Examine the extent of home team advantage (as in Chapter 3 Section 3.C1) by comparing it across these three groups of teams.</li>
    <li>Produce bin scatters and scatterplots, and calculate conditional statistics.</li>
    <li>Discuss what you find, and comment on which visualization you find the most useful.</li>
</ul>
<h2><b>1.</b> Load the data</h2>

In [59]:
import os
import sys
import warnings
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from plotnine import *
from mizani.formatters import percent_format
import plotly.express as px
import nbformat

warnings.filterwarnings("ignore")
%matplotlib inline

In [60]:
# Increase number of returned rows in pandas
pd.set_option("display.max_rows", 500)

In [61]:
# Current script folder
current_path = os.getcwd()
dirname = f"{current_path}/"
# Get location folders
data_in = f"{dirname}da_data_repo/football/clean/"
data_out = f"{dirname}da_data_exercises/ch04-comparison_correlation/05-football_corr/data/clean/"
output = f"{dirname}da_data_exercises/ch04-comparison_correlation/05-football_corr/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)

In [62]:
dirname

'c:\\Users\\Felipe\\python_work\\Projects\\bk_data_analysis/'

In [63]:
from py_helper_functions import *

In [64]:
df = pd.read_csv(f"{data_in}epl_games.csv")

In [65]:
df.head()

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
0,E0,2008,16aug2008,Arsenal,West Brom,3,0,1,0
1,E0,2008,16aug2008,West Ham,Wigan,3,0,2,1
2,E0,2008,16aug2008,Middlesbrough,Tottenham,3,0,2,1
3,E0,2008,16aug2008,Everton,Blackburn,0,3,2,3
4,E0,2008,16aug2008,Bolton,Stoke,3,0,3,1


## 2. Data preprocessing
How many seasons do we have?

In [66]:
df["season"].unique()

array([2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018],
      dtype=int64)

Let's choose the two latest available seasons: 2017 to rank the teams and 2018 to measure home advantage.

In [67]:
football_2017_2018 = df[df["season"].isin([2017, 2018])]

Split the data by season.

In [68]:
football_2017 = df.loc[df["season"] == 2017, :]

In [69]:
football_2018 = df.loc[df["season"] == 2018, :]

In [70]:
football_2017.head()

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
3420,E0,2017,11aug2017,Arsenal,Leicester,3,0,4,3
3421,E0,2017,12aug2017,Watford,Liverpool,1,1,3,3
3422,E0,2017,12aug2017,Brighton,Man City,0,3,0,2
3423,E0,2017,12aug2017,Crystal Palace,Huddersfield,0,3,0,3
3424,E0,2017,12aug2017,Southampton,Swansea,1,1,0,0


In [71]:
football_2017["team_home"].unique()


array(['Arsenal', 'Watford', 'Brighton', 'Crystal Palace', 'Southampton',
       'Everton', 'Chelsea', 'West Brom', 'Man United', 'Newcastle',
       'Liverpool', 'Stoke', 'Burnley', 'Swansea', 'Leicester',
       'Bournemouth', 'Huddersfield', 'Tottenham', 'Man City', 'West Ham'],
      dtype=object)

In [72]:
football_2017["team_home"].nunique()

20

In [73]:
football_2017["season"].value_counts()

season
2017    380
Name: count, dtype: int64

Perfect, season 2017 is complete: 20 teams that played each other at home and away, for a total of 20 × 19 = 380 matches.
### 2.1 Task 1 | Create groups of teams
Let's break down the task:
- Create **three** groups of teams.
- Base the groups on each team's performance in the previous season.
- New teams come from the lower division, and you may put them in the lowest bin.

We will number the bins from 1 to 3, with 1 being the top-tier teams.

So, to bin our teams, we need to:
- Measure each team's performance in season 2017: group the matches by team, sum the points, and rank teams by total points.
- Assign teams that did not play in season 2017 (the newly promoted ones) to bin 3.

Sum the number of points for each team.

In [74]:
standings_2017 = pd.DataFrame()

# Create a list of unique teams by combining both team columns
teams = pd.concat([football_2017["team_home"], football_2017["team_away"]]).unique()

# Calculate total points for each team: points won at home plus points won away
standings_2017["team"] = teams
standings_2017["total_points"] = standings_2017["team"].apply(
    lambda team: football_2017.loc[football_2017["team_home"] == team, "points_home"].sum()
    + football_2017.loc[football_2017["team_away"] == team, "points_away"].sum()
)

standings_2017 = standings_2017.sort_values(by="total_points", ascending=False)
standings_2017 = standings_2017.reset_index(drop=True)
standings_2017.index += 1

In [75]:
standings_2017

Unnamed: 0,team,total_points
1,Man City,100
2,Man United,81
3,Tottenham,77
4,Liverpool,75
5,Chelsea,70
6,Arsenal,63
7,Burnley,54
8,Everton,49
9,Leicester,47
10,Newcastle,44
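
As a sketch of an alternative to the per-team `apply`, the same totals can be obtained by stacking home and away results into one long table and grouping by team. The match data below is made up for illustration, and the names `games`, `long_points`, and `toy_standings` are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy matches (made-up results, not the real dataset)
games = pd.DataFrame({
    "team_home":   ["A", "B", "C", "A"],
    "team_away":   ["B", "C", "A", "C"],
    "points_home": [3, 1, 0, 3],
    "points_away": [0, 1, 3, 0],
})

# One row per team per match: first the home side, then the away side
home = games[["team_home", "points_home"]].rename(
    columns={"team_home": "team", "points_home": "points"}
)
away = games[["team_away", "points_away"]].rename(
    columns={"team_away": "team", "points_away": "points"}
)
long_points = pd.concat([home, away], ignore_index=True)

# Sum points per team and rank from best to worst
toy_standings = (
    long_points.groupby("team", as_index=False)["points"].sum()
    .rename(columns={"points": "total_points"})
    .sort_values("total_points", ascending=False)
    .reset_index(drop=True)
)
toy_standings.index += 1  # ranks start at 1, as in the notebook
print(toy_standings)
```

With the real data, `games` would be replaced by `football_2017`; the grouping avoids filtering the full table twice for every team.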


Next, we need to sort these totals into three bins, numbered 1 to 3. We take the minimum and maximum of total points as the outer edges and the 1/3 and 2/3 quantiles as the inner edges, so that each bin holds roughly the same number of teams instead of being driven by a few very high totals.

In [76]:
# 1. Determine the range of total points
min_points, max_points = standings_2017["total_points"].min(), standings_2017["total_points"].max()

# 2. Calculate quantiles
quantile_edges = standings_2017["total_points"].quantile([1/3, 2/3])

# 3. Define bins based on quantiles and maximum point
bins = [min_points, quantile_edges[1/3], quantile_edges[2/3], max_points]
labels = [3, 2, 1]
standings_2017["team3bins"] = pd.cut(standings_2017["total_points"], bins=bins, labels=labels, include_lowest=True)

In [77]:
standings_2017

Unnamed: 0,team,total_points,team3bins
1,Man City,100,1
2,Man United,81,1
3,Tottenham,77,1
4,Liverpool,75,1
5,Chelsea,70,1
6,Arsenal,63,1
7,Burnley,54,1
8,Everton,49,2
9,Leicester,47,2
10,Newcastle,44,2
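
A shorter way to build quantile-based bins is `pd.qcut`, which computes the quantile edges itself. The sketch below uses hypothetical point totals (the series `points` and its team letters are made up) and assumes the same labeling convention as above, where 1 is the top group.

```python
import pandas as pd

# Hypothetical season totals for nine teams
points = pd.Series([100, 81, 77, 63, 54, 49, 44, 40, 33], index=list("ABCDEFGHI"))

# qcut splits at the 1/3 and 2/3 quantiles. Labels are assigned from the
# lowest interval to the highest, so the weakest third gets label 3.
team3bins = pd.qcut(points, q=3, labels=[3, 2, 1])

print(team3bins)
print(team3bins.value_counts())
```

On this toy series each bin receives three teams. With many tied totals the quantile edges can coincide, in which case `qcut` raises an error unless `duplicates="drop"` is passed.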


Now, we can create a DataFrame containing only the bin category and the team name. We can merge this data with the subsequent season. We do a merge that keeps unmatched team names (the teams that came from a lower division). In this case, we can assign the appropriate bin number (`3`) if the field returned `NaN`.

In [78]:
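# Keep only the team name and its bin, renamed to match the home-team column
df_bins = standings_2017[["team", "team3bins"]].rename(columns={"team": "team_home"})

# Left merge keeps every 2018 match, including those of teams absent in 2017
df_merged = football_2018.merge(
    df_bins,
    on="team_home",
    how="left"
)

# Handle new teams by assigning them to the lowest bin
# (assign the result back; in-place fillna on a column selection is unreliable in recent pandas)
df_merged["team3bins"] = df_merged["team3bins"].fillna(3)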

In [79]:
df_merged.shape

(380, 10)

In [80]:
df_merged.sample(10)

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away,team3bins
211,E0,2018,12jan2019,Burnley,Fulham,3,0,2,1,1
213,E0,2018,12jan2019,Cardiff,Huddersfield,1,1,0,0,3
54,E0,2018,22sep2018,Man United,Wolves,1,1,1,1,1
229,E0,2018,20jan2019,Fulham,Tottenham,0,3,1,2,3
136,E0,2018,01dec2018,Leicester,Watford,3,0,2,0,2
238,E0,2018,30jan2019,Tottenham,Watford,3,0,2,1,1
108,E0,2018,04nov2018,Man City,Southampton,3,0,6,1,1
259,E0,2018,10feb2019,Tottenham,Leicester,3,0,3,1,1
64,E0,2018,29sep2018,Everton,Fulham,3,0,3,0,2
124,E0,2018,24nov2018,Tottenham,Chelsea,3,0,3,1,1


In [81]:
# Define a mapping from bin number to performance category
bin_to_category = {
    1: "Top",
    2: "Middle",
    3: "Bottom"
}

# Create a new column with performance category
df_merged["performance"] = df_merged["team3bins"].map(bin_to_category)

### 2.2 Task 2 | Create a `home_goaladv` variable ###

In [82]:
def home_goal_adv(data: pd.DataFrame) -> pd.Series:
    """Return the home goal advantage of each match: home goals minus away goals."""
    return data["goals_home"] - data["goals_away"]

In [83]:
df_merged["home_goaladv"] = home_goal_adv(df_merged)

We can now create a summary statistics table and observe the differences for each bin.

In [84]:
# Create a summary statistics table for home goal advantage for each bin as a function
def summary_stats(data):
    """Return summary statistics of a dataset."""
    return pd.Series(
        {
            "Mean": data["home_goaladv"].mean(),
            "Standard deviation": data["home_goaladv"].std(),
            "Percent positive": (data["home_goaladv"] > 0).mean() * 100,
            "Percent zero": (data["home_goaladv"] == 0).mean() * 100,
            "Percent negative": (data["home_goaladv"] < 0).mean() * 100,
            "Number of observations": data["home_goaladv"].count(),
        }
    ).round(1)

summary_df = df_merged.groupby('team3bins').apply(summary_stats).sort_index(ascending=False)

In [85]:
summary_df

team3bins,Mean,Standard deviation,Percent positive,Percent zero,Percent negative,Number of observations
1,1.2,1.9,67.7,15.8,16.5,133.0
2,0.2,1.7,42.1,19.3,38.6,114.0
3,-0.4,1.7,32.3,21.1,46.6,133.0


Observations:
- The mean home goal advantage falls as previous-season performance falls. Teams that performed well scored on average `1.2` more goals than their opponents per home match, while teams at the bottom (or coming from the lower division) had a negative value of `-0.4`: on average they conceded about 0.4 more goals than they scored at home.
- The standard deviation is similar across bins: 1.7 goals for bins #2 and #3 and slightly higher, 1.9, for bin #1. In every group, individual results spread widely around the group mean.
- Teams in bin #1 performed really well at home, winning **67.7%** of their home matches. Bin #2 teams won fewer than half (**42.1%**), and bin #3 teams won just under one third (**32.3%**).
- Teams that performed well also drew less often at home: only **15.8%** of their matches ended in a tie. The share of draws rises for bins #2 and #3, to **19.3%** and **21.1%**, respectively.
- Finally, teams in bin #1 lost few matches at home, only **16.5%**, while teams in bins #2 and #3 lost **38.6%** and **46.6%**.

Teams that performed well in the previous season tended to perform well again in the following one, making the most of their home games. The gap between top and bottom performers is much larger for wins and losses than for draws.

We can now take a look at this data by producing a few visualizations.

## 3. Data Visualization ##

### 3.1 Histograms ###
To compare the difference in home goal advantage, we can plot one histogram for each group. 


In [86]:
df_merged.columns

Index(['div', 'season', 'date', 'team_home', 'team_away', 'points_home',
       'points_away', 'goals_home', 'goals_away', 'team3bins', 'performance',
       'home_goaladv'],
      dtype='object')

In [87]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Calculate the minimum and maximum home_goaladv across the entire DataFrame
x_min = df_merged['home_goaladv'].min()
x_max = df_merged['home_goaladv'].max()

bins = df_merged["team3bins"].unique()
# Convert to list and sort
bins = list(bins)
bins.sort() 

# Create subplots
fig = make_subplots(rows=len(bins), cols=1, subplot_titles=[f"Bin {bin}" for bin in bins])

# Add a histogram to each subplot
for i, bin in enumerate(bins, start=1):
    filtered_data = df_merged.loc[df_merged["team3bins"] == bin, :]
    fig.add_trace(
        go.Histogram(x=filtered_data["home_goaladv"], name=f"Bin {bin}"),
        row=i, col=1
    )

# Update layout for each x-axis in the subplots
for i in range(1, len(bins) + 1):
    fig.update_xaxes(range=[x_min, x_max], title_text="Home Goal Advantage", row=i, col=1)

# Update layout for each y-axis in the subplots
for i in range(1, len(bins) + 1):
    fig.update_yaxes(title_text="Frequency", row=i, col=1)

fig.update_layout(
    height=900, 
    width=800, 
    title_text="Home Goal Advantage by Bin",
    showlegend=False
)
fig.update_traces(opacity=0.75, marker=dict(line=dict(color="black", width=0.5)))

fig.show()


Teams in bin #1 show a higher frequency of positive home goal advantage, and the mode is clearly `2`. Teams in bin #2 have a flatter distribution, with similar shares of games ending anywhere from `-1` to `2`, which means less consistency when playing at home. The pattern worsens for bin #3: these teams also have similar shares of games between `-1` and `1`, and their distribution is more concentrated on the negative side of home goal advantage. Finally, teams that performed well won some games by a comfortable margin but also suffered a few heavy losses. Being among the best does not rule out occasionally losing by a wide margin at home!

### 3.2 Bin scatters

In [88]:
# Generate mean goal adv scores
df1 = df_merged.groupby("team3bins").agg(mean_goaladv=("home_goaladv", "mean")).sort_index(ascending=False).reset_index()
df1

Unnamed: 0,team3bins,mean_goaladv
0,1,1.210526
1,2,0.157895
2,3,-0.443609


In [89]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Define a color map for bins
color_map = {1: 'blue', 2: 'green', 3: 'red'}

# Create subplots with 1 row and 2 columns
fig = make_subplots(rows=1, cols=2, subplot_titles=("Mean Home Goal Advantage", "Home Goal Advantage Distribution"))

# Adding the scatter plot in the first column
fig.add_trace(
    go.Scatter(
        x=df1['team3bins'],
        y=df1['mean_goaladv'],
        mode='markers',
        marker=dict(size=10, color=[color_map[x] for x in df1['team3bins']], opacity=0.8)
    ),
    row=1, col=1
)

# Adding each bin as a separate box plot trace
for bin in sorted(df_merged['team3bins'].unique()):
    fig.add_trace(
        go.Box(
            y=df_merged[df_merged['team3bins'] == bin]['home_goaladv'],
            name=f"Bin {bin}",
            marker=dict(color=color_map[bin], opacity=0.8),
            boxpoints='all'  # Optionally show all points
        ),
        row=1, col=2
    )

# Update layout for the entire figure
fig.update_layout(
    title_text="Analysis of Home Goal Advantage by Previous Year Team Performance",
    template="plotly_white",
    xaxis=dict(
        title="Team Home Previous Year Performance (ranked 1 to 3)",
        tickmode="array",
        tickvals=[1, 2, 3],
        range=[0, 4]
    ),
    xaxis2=dict(
        title="Team Home Previous Year Performance (ranked 1 to 3)"
    ),
    yaxis=dict(
        title="Mean Home Goal Advantage",
        range=[df1['mean_goaladv'].min() - 0.5, df1['mean_goaladv'].max() + 0.5]
    ),
    yaxis2=dict(
        title="Home Goal Advantage"
    )
)

# Show the figure
fig.show()


The mean home goal advantage in the current season falls as we move from teams that performed well last season to teams that performed poorly.

First observations:
- There seems to be a positive correlation between a team's previous-season performance and its home goal advantage in the following season. Teams that performed well in one season probably performed well in the next one too.
- Teams that performed well in the previous season clearly showed a positive home advantage in the following season. Group `1` had a mean goal advantage of `1.21`.
- The boxplots summarize our histograms. We can see how the statistics decrease as the bin number increases.
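
The correlation mentioned above can also be put into a number. The sketch below computes a Pearson correlation between the bin number and home goal advantage on made-up matches (the `matches` frame is hypothetical). Treating the ordinal bin number as numeric is itself an assumption, so the result is only a rough summary.

```python
import pandas as pd

# Hypothetical matches: three per bin, with made-up goal differences
matches = pd.DataFrame({
    "team3bins":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "home_goaladv": [3, 1, 2, 1, 0, -1, 0, -2, -1],
})

# Pearson correlation between bin number and home goal advantage
r = matches["team3bins"].corr(matches["home_goaladv"])
print(round(r, 2))
```

Because bin 1 holds the best teams, a *positive* link between past performance and home goal advantage shows up as a *negative* coefficient here (about -0.82 on this toy data). On the notebook's data the bin column is categorical, so it would need converting with `astype(int)` before calling `corr`.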

## 3. Conditional Statistics ##
### 3.1 Conditional Mean ###


In [95]:
# Calculate the mean home goal advantage for each bin
conditional_means = df_merged.groupby("performance")["home_goaladv"].mean().reset_index().sort_values(by="home_goaladv", ascending=False)
conditional_means.columns = ["performance", "mean_home_goaladv"]
conditional_means

Unnamed: 0,performance,mean_home_goaladv
2,Top,1.210526
1,Middle,0.157895
0,Bottom,-0.443609


We have already seen that there is a positive correlation between the teams' previous-year performance and their home goal advantage. As we move from the highest-performing teams to the lowest, there is a clear gradient in home advantage.

### 3.2 Conditional Variance
How much **variability** is there in each group's home performance?

In [98]:
# Calculate the variance of home goal advantage for each bin
conditional_variance = df_merged.groupby("performance")["home_goaladv"].var().reset_index().sort_values(by="performance", ascending=False)
conditional_variance.columns = ["performance", "variance_home_goaladv"]
conditional_variance

Unnamed: 0,performance,variance_home_goaladv
2,Top,3.591707
1,Middle,2.912902
0,Bottom,3.036569


Top performers show a higher variance in home goal advantage than middle and bottom ones. Unlike the mean, the variance does not fall steadily with performance: it is highest among Top, lowest among Middle, and a bit higher again among Bottom. Since variance measures spread around each group's own mean, the larger value for Top means that their results are more dispersed, not that they are more positive.

### 3.3 Conditional Quantiles ###
Let's calculate quantiles and the interquartile range for each group to get a sense of how the data is distributed.

In [104]:
# Calculate various quantiles for home goal advantage within each bin
quantiles = df_merged.groupby("performance")["home_goaladv"].quantile([0.25, 0.5, 0.75]).unstack().sort_values(by="performance", ascending=False)
quantiles.columns = ["25th percentile", "50th percentile", "75th percentile"]

# Calculate the IQR for each bin
quantiles["IQR"] = quantiles["75th percentile"] - quantiles["25th percentile"]
quantiles

performance,25th percentile,50th percentile,75th percentile,IQR
Top,0.0,1.0,2.0,2.0
Middle,-1.0,0.0,2.0,3.0
Bottom,-1.0,0.0,1.0,2.0


#### The Quantiles: ####
1. **25th Percentile**
- **Top Performers**. The value of `0.0` means that at least 25% of the top performers' home matches ended with no home goal advantage or worse.
- **Middle Performers**. The value of `-1.0` means that at least a quarter of their home matches ended with a goal difference of `-1.0` or worse.
- **Bottom Performers**. Same value as for middle performers.

2. **50th Percentile**
- **Top Performers**. The median of `1.0` means that at least half of the top performers' home matches ended with a home goal advantage of 1.0 or less.
- **Middle and Bottom Performers**. The value of `0.0` means that at least half of these teams' home matches ended with no advantage or a disadvantage.

3. **75th Percentile**
- **Top Performers**. At least 75% of the top performers' home matches ended with a home goal advantage of `2.0` or less. Together with a 25th percentile of `0.0`, this shows a strong home advantage among the top performers.
- **Middle Performers**. The 75th percentile is also `2.0`, so the upper end looks like that of the top performers, but the central tendency of this group is much lower (median of `0.0`).
- **Bottom Performers**. A 75th percentile of `1.0` means that at least 75% of the bottom performers' home matches ended with a home goal advantage of 1.0 or less, a less pronounced home advantage than in the other groups.

#### Insights and Analysis: ####
- **Spread and Skewness**. Top performers have a higher median, and their interquartile range (IQR) of `2.0` sits around *higher* values (from `0.0` to `2.0`). That IQR is **narrower** than that of middle performers (`3.0`) but equal to that of bottom performers (`2.0`), so the IQR alone does not show less variability for top teams; their higher variance points to a few extreme results outside the middle half of the distribution. Their mean (`1.21`) lies above their median (`1.0`), which hints at a mild *positive* skew, while for bottom performers the mean (`-0.44`) lies below the median (`0.0`). Middle and bottom performers center around zero, which indicates a **less consistent** home advantage.
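
One quick way to check the skewness claim is the mean-median skewness measure, (mean - median) / standard deviation. The sketch below plugs in the conditional means, medians, and variances reported in the tables above; the dictionary `reported` and the variable `skew` are names introduced here for illustration.

```python
import math

# Statistics copied from the conditional mean, quantile, and variance tables
reported = {
    "Top":    {"mean": 1.210526,  "median": 1.0, "variance": 3.591707},
    "Middle": {"mean": 0.157895,  "median": 0.0, "variance": 2.912902},
    "Bottom": {"mean": -0.443609, "median": 0.0, "variance": 3.036569},
}

# Mean-median skewness: positive means a longer right tail,
# negative means a longer left tail
skew = {
    group: (s["mean"] - s["median"]) / math.sqrt(s["variance"])
    for group, s in reported.items()
}

for group, value in skew.items():
    print(f"{group}: {value:.2f}")
```

The values come out at roughly 0.11 for Top, 0.09 for Middle, and -0.25 for Bottom: a mild right skew for the first two groups and a somewhat stronger left skew for bottom performers.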

### 3.4 Conditional Probability ###
We can calculate the probability of having a positive home goal advantage within each bin. This is a basic probability estimate based on the count of positive instances vs total instances.


In [107]:
# Calculate the probability of a positive home goal advantage in each bin
conditional_probability = df_merged.groupby("performance").apply(lambda x: (x["home_goaladv"] >0).sum() / len(x)).reset_index(name="probability_positive_home_goaladv").sort_values(by="performance", ascending=False)
conditional_probability

Unnamed: 0,performance,probability_positive_home_goaladv
2,Top,0.676692
1,Middle,0.421053
0,Bottom,0.323308


Here, we can see a strong association between how teams performed in the last season and their chance of a positive home goal advantage in the subsequent season.
- **Top Performers**. They won around `68%` of their home matches. As discussed before, this is an impressive result.
- **Middle Performers**. They won circa `42%` of their home matches, a drop of about 26 percentage points.
- **Bottom Performers**. With roughly `32%`, these teams performed poorly at home, winning only about a third of their matches.
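
The same conditional probability can be computed without `apply`, because the mean of a boolean column is the share of `True` values. A minimal sketch on made-up matches follows (the `matches` frame and the `home_win` column are hypothetical); it also expresses the gap between groups in percentage points.

```python
import pandas as pd

# Hypothetical matches with made-up goal differences
matches = pd.DataFrame({
    "performance":  ["Top", "Top", "Top", "Top", "Middle", "Middle",
                     "Bottom", "Bottom", "Bottom", "Bottom"],
    "home_goaladv": [2, 1, 0, 3, 1, -1, -2, 0, 1, -1],
})

# A home win is a strictly positive goal difference
matches["home_win"] = matches["home_goaladv"] > 0

# Mean of a boolean column = share of wins within each group
p_win = matches.groupby("performance")["home_win"].mean()

# Gap between Top and Middle, in percentage points
gap_pp = (p_win["Top"] - p_win["Middle"]) * 100

print(p_win)
print(gap_pp)
```

Reporting gaps in percentage points (the difference between two shares) avoids confusion with relative percent changes.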

## 4. Conclusion ##
The analysis suggests that teams with better performance in the previous season are more likely to exploit home advantage effectively in the subsequent season. This correlation is consistent with momentum and confidence from past successes contributing to better home performances, although this data alone cannot separate that from differences in overall team quality. For teams in the middle and lower tiers, the data suggests a need for targeted improvements in strategy, team dynamics, and possibly fan engagement to better capitalize on home game opportunities.