<div style="text-align: right">INFO 6105 Data Science Eng Methods and Tools, Lecture 8 Day 2</div>
<div style="text-align: right">Dino Konstantopoulos, 27 October 2022</div>

# The English Premier League Lab
<br />
<center>
<img src="ipynb.images/man-city.jpg" width=1000 />
</center>

We study the statistics of the English Premier League.

These are the columns in our dataset:

```
Data displays several attributes:

    Div = League Division
    Date = Match Date (dd/mm/yyyy)
    HomeTeam = Home Team
    AwayTeam = Away Team
    FTHG = Full Time Home Team Goals
    FTAG = Full Time Away Team Goals
    FTR and Res = Full Time Result (H=Home Win, D=Draw, A=Away Win)
    HTHG = Half Time Home Team Goals
    HTAG = Half Time Away Team Goals
    HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)

Match Statistics (where available)

    HS = Home Team Shots
    AS = Away Team Shots
    HST = Home Team Shots on Target
    AST = Away Team Shots on Target
    HC = Home Team Corners
    AC = Away Team Corners
    HF = Home Team Fouls Committed
    AF = Away Team Fouls Committed
    HY = Home Team Yellow Cards
    AY = Away Team Yellow Cards
    HR = Home Team Red Cards
    AR = Away Team Red Cards
```

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from datetime import datetime as dt
import itertools

%matplotlib inline

In [3]:
season2021 = pd.read_csv('2020-2021.csv')

In [4]:
columns_req = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']
play_stats = season2021[columns_req]  
play_stats

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,12/09/2020,Fulham,Arsenal,0,3,A
1,12/09/2020,Crystal Palace,Southampton,1,0,H
2,12/09/2020,Liverpool,Leeds,4,3,H
3,12/09/2020,West Ham,Newcastle,0,2,A
4,13/09/2020,West Brom,Leicester,0,3,A
...,...,...,...,...,...,...
375,23/05/2021,Liverpool,Crystal Palace,2,0,H
376,23/05/2021,Man City,Everton,5,0,H
377,23/05/2021,Sheffield United,Burnley,1,0,H
378,23/05/2021,West Ham,Southampton,3,0,H


# Step 1
Create two new columns in the dataset: a binary column called `Home_team_wins` of 0s and 1s, where 1 designates a home team win and 0 designates an away team win or a draw, and an integer column called `Differential` that is the difference between home team goals and away team goals.

In [None]:
# Vectorized alternative with np.where: 1 where the home side scored more goals, else 0
season2021['Home_team_wins'] = np.where(season2021['FTHG'] > season2021['FTAG'], 1, 0)

In [5]:
def determine_winner(fthg, ftag):
    """Return 1 if the home team won, 0 for a draw or an away win."""
    if fthg > ftag:
        return 1
    else:
        return 0
    
def determine_strength(fthg, ftag):
    """Return the goal differential: home goals minus away goals."""
    differential = fthg - ftag
    return differential



In [10]:
# Work on an explicit copy so that pandas does not raise SettingWithCopyWarning
play_stats = play_stats.copy()

play_stats['Home_team_wins'] = play_stats.apply(
    lambda row: determine_winner(row['FTHG'], row['FTAG']),
    axis=1
)

play_stats['Differential'] = play_stats.apply(
    lambda row: determine_strength(row['FTHG'], row['FTAG']),
    axis=1
)

In [11]:
play_stats.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,Home_team_wins,Differential
0,12/09/2020,Fulham,Arsenal,0,3,A,0,-3
1,12/09/2020,Crystal Palace,Southampton,1,0,H,1,1
2,12/09/2020,Liverpool,Leeds,4,3,H,1,1
3,12/09/2020,West Ham,Newcastle,0,2,A,0,-2
4,13/09/2020,West Brom,Leicester,0,3,A,0,-3


# Step 2
Find all unique teams in the English Premier League and assign them to a Python list so that each team can be represented by its index in the list.

In [15]:
teams = list(play_stats['HomeTeam'].unique())

In [17]:
play_stats['HomeTeam'].nunique()

20

In [16]:
teams

['Fulham',
 'Crystal Palace',
 'Liverpool',
 'West Ham',
 'West Brom',
 'Tottenham',
 'Brighton',
 'Sheffield United',
 'Everton',
 'Leeds',
 'Man United',
 'Arsenal',
 'Southampton',
 'Newcastle',
 'Chelsea',
 'Leicester',
 'Aston Villa',
 'Wolves',
 'Burnley',
 'Man City']

# Step 3
Create two new columns in the dataset: An integer column called `HomeTeamIndex` that assigns a number to the home team for each row in the dataset (pointing to the index in your teams list), and an integer column called `AwayTeamIndex` that assigns a number to the away team for each row in the dataset.
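
One possible way to build these two columns is sketched below on a small stand-in table (the first three fixtures shown earlier) together with the `teams` list from Step 2. The dictionary name `team_to_index` is introduced here for illustration only; it is not part of the original notebook.

```python
import pandas as pd

# The teams list from Step 2, in the order returned by unique()
teams = ['Fulham', 'Crystal Palace', 'Liverpool', 'West Ham', 'West Brom',
         'Tottenham', 'Brighton', 'Sheffield United', 'Everton', 'Leeds',
         'Man United', 'Arsenal', 'Southampton', 'Newcastle', 'Chelsea',
         'Leicester', 'Aston Villa', 'Wolves', 'Burnley', 'Man City']

# Stand-in for play_stats: the first three fixtures of the season
play_stats = pd.DataFrame({
    'HomeTeam': ['Fulham', 'Crystal Palace', 'Liverpool'],
    'AwayTeam': ['Arsenal', 'Southampton', 'Leeds'],
})

# Hypothetical helper: map each team name to its position in the list
team_to_index = {team: i for i, team in enumerate(teams)}

# Series.map looks every name up in the dictionary
play_stats['HomeTeamIndex'] = play_stats['HomeTeam'].map(team_to_index)
play_stats['AwayTeamIndex'] = play_stats['AwayTeam'].map(team_to_index)

print(play_stats)
```

On the full season the same two `map` calls apply unchanged, since every team plays at home at least once and therefore appears in `teams`.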

# Step 4
Build a Bayesian model in PyMC3. Use Gaussian priors for the teams’ strengths and let the model infer a posterior for each team.

Note that we do not apply the sigmoid function explicitly. If we pass the difference of the teams’ strengths via the `logit_p` parameter instead of `p`, the `pm.Bernoulli` object applies the sigmoid for us.
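
To make the role of the sigmoid concrete, here is a minimal sketch in plain Python of the mapping from a strength difference to a home-win probability. The function names `sigmoid` and `home_win_probability` are illustrative and are not PyMC3 functions.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def home_win_probability(home_strength, away_strength):
    """Probability of a home win implied by the two strengths."""
    return sigmoid(home_strength - away_strength)

# Equal strengths give an even chance; a stronger home side pushes it above 0.5
for diff in (-2.0, 0.0, 2.0):
    print(diff, round(home_win_probability(diff, 0.0), 3))
```

Only the difference between the two strengths enters the probability, so shifting both strengths by the same constant leaves it unchanged.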

The `shape` parameter in PyMC3 allows you to create multidimensional priors, here one strength per team:
```
strength = pm.Normal("strength", -5, 5, shape = 20)
```

Your data likelihood follows the Bernoulli distribution, since each match is recorded as a home win (1) or not (0):
```
obs = pm.Bernoulli("wins", logit_p=diff, observed=play_stats["Home_team_wins"])
```
Use the `logit_p` form of `pm.Bernoulli`, where `diff` is the difference in strength between the home and away teams:
```
diff = strength[play_stats["HomeTeamIndex"]] - strength[play_stats["AwayTeamIndex"]]
```
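
The three snippets above might be assembled as in the following sketch. It assumes PyMC3 3.9 or later (for `return_inferencedata=False`, which keeps `trace` a `MultiTrace` so that slices such as `trace[100:]` work), and that Steps 1 and 3 have added `Home_team_wins`, `HomeTeamIndex` and `AwayTeamIndex` to `play_stats`. The helper names `model_inputs` and `fit_strengths`, and the draw and tune counts, are choices made here for illustration.

```python
import numpy as np

def model_inputs(play_stats):
    """Extract integer arrays for the model from the match table."""
    home_idx = np.asarray(play_stats["HomeTeamIndex"], dtype=int)
    away_idx = np.asarray(play_stats["AwayTeamIndex"], dtype=int)
    wins = np.asarray(play_stats["Home_team_wins"], dtype=int)
    return home_idx, away_idx, wins

def fit_strengths(play_stats, n_teams=20, draws=1000, tune=1000):
    """Build the strength model and sample from its posterior."""
    import pymc3 as pm  # imported here so model_inputs works without PyMC3

    home_idx, away_idx, wins = model_inputs(play_stats)
    with pm.Model() as model:
        # One Gaussian prior per team, same parameters as in the text
        strength = pm.Normal("strength", -5, 5, shape=n_teams)
        # Strength difference for every match (home minus away)
        diff = strength[home_idx] - strength[away_idx]
        # Bernoulli likelihood; logit_p applies the sigmoid internally
        pm.Bernoulli("wins", logit_p=diff, observed=wins)
        # Assumption: 1000 tuning steps and 1000 draws per chain
        trace = pm.sample(draws=draws, tune=tune, return_inferencedata=False)
    return model, trace
```

Calling `model, trace = fit_strengths(play_stats)` would then produce the `trace` object used in the following steps.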

# Step 5
Draw the simulation traces of all probabilistic variables:
```
pm.traceplot(trace)
```

# Step 6
Plot all posterior pdfs:
```
pm.plot_posterior(trace[100:], var_names=["strength"])
```

# Step 7
Compare all posteriors to rank them, using `arviz`:
```
import arviz as az
az.plot_forest(trace[100:], kind='forestplot')
```

# Step 8
Use 
```
az.summary(trace).round(2)
```
to quickly examine the means and standard deviations of all probabilistic variables.

From here, you can also see that the MCMC seems to have converged well, since the [Gelman-Rubin diagnostic](https://www.stata.com/features/overview/gelman-rubin-convergence-diagnostic/), `r_hat`, doesn’t indicate any problem (all values are equal or close to 1).
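
To see what this diagnostic measures, the following NumPy sketch computes the classic Gelman-Rubin statistic by comparing the variance between chains with the variance within chains. It is an illustration only: ArviZ reports a rank-normalized, split-chain variant, so its numbers will not match this function exactly. The name `classic_r_hat` and the synthetic chains are assumptions made for the example.

```python
import numpy as np

def classic_r_hat(chains):
    """Classic Gelman-Rubin statistic for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    # W: average of the within-chain variances
    within = chains.var(axis=1, ddof=1).mean()
    # B: variance of the chain means, scaled by the number of draws
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    # Pooled estimate of the posterior variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))        # four chains exploring the same distribution
stuck = mixed + np.arange(4)[:, None]     # the same chains shifted to different locations

print(classic_r_hat(mixed), classic_r_hat(stuck))
```

Chains that agree give a value near 1, while chains stuck in different regions give a value clearly above 1.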

You can also see that some teams have a negative strength, but this is totally fine since we only use the difference in strength between two teams anyway. If you do not like this for some reason, you can either replace the strength priors with a HalfNormal distribution or add a constant such as 5 to the posteriors, so that all means and HDIs are in the positive range.

# Step 9
Use `zip` to pair each team with its posterior mean strength:
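
A minimal sketch of this pairing follows. It uses a small made-up array in place of `trace["strength"]`, which in PyMC3 is an array with one row per posterior draw and one column per team. The numbers are placeholders for illustration and are not model output; the name `team_strength` is also an assumption.

```python
import numpy as np

teams = ["Fulham", "Crystal Palace", "Liverpool"]  # first three entries of the Step 2 list

# Stand-in for trace["strength"]: rows are posterior draws, columns are teams.
# These values are invented for the example, not produced by the model.
strength_samples = np.array([
    [-1.0, 0.0, 2.0],
    [-3.0, 1.0, 4.0],
])

# Average over draws to get one posterior mean per team
posterior_means = strength_samples.mean(axis=0)

# zip pairs the i-th team name with the i-th posterior mean
team_strength = dict(zip(teams, posterior_means))

for team, mean_strength in team_strength.items():
    print(f"{team}: {mean_strength:.2f}")
```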

# Step 10
Draw a bar plot of the teams' strengths by converting the dictionary above to a pandas DataFrame:
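
One way this could look is sketched below, with a three-team dictionary of placeholder values standing in for the real posterior means. The names `team_strength`, `strength_df` and `plot_strengths` are illustrative.

```python
import pandas as pd

# Placeholder values for illustration only, not model output
team_strength = {"Fulham": -2.0, "Crystal Palace": 0.5, "Liverpool": 3.0}

# One row per team, strongest first
strength_df = (
    pd.DataFrame({"Team": list(team_strength),
                  "Strength": list(team_strength.values())})
    .sort_values("Strength", ascending=False)
    .reset_index(drop=True)
)

def plot_strengths(df):
    """Draw a horizontal bar plot with the strongest team at the top."""
    # barh draws the first row at the bottom, so sort ascending before plotting
    ax = df.sort_values("Strength").plot.barh(x="Team", y="Strength", legend=False)
    ax.set_xlabel("Posterior mean strength")
    return ax

print(strength_df)
```

In the notebook, `plot_strengths(strength_df)` renders the chart inline. A horizontal layout is used here because twenty team names are easier to read along the vertical axis.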

The model started off with some prior beliefs about the strength levels of teams, which then got updated via observations. The more games a team plays, the smaller the uncertainty about this team’s strength. In one extreme case, if a team never played a single game, the posterior distribution of their strength equals the prior distribution.