<h1>Chapter 2 | Case Study C1 | <b>Identifying Successful Football Managers</b></h1>
<p>In this notebook, I write the code for the aforementioned case study from the book. The main goal is to learn how to structure the data according to our needs, which means applying the concept of <i>tidy data</i> as well as merging different datasets to get the desired outcome.</p>
<p>Here, our question is: <i>who are the most successful football managers in England?</i></p>


In [1]:
import os
import sys
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from mizani.formatters import percent_format
from plotnine import *

warnings.filterwarnings("ignore")

In [7]:
# Current script folder path
current_path = os.getcwd()
dirname = current_path.split("da_case_studies")[0]

# Location folders
data_in = f"{dirname}da_data_repo/football/clean/"
data_out = f"{dirname}ch02-football_manager_success/"
output = f"{dirname}ch02-football_manager_success/output"

<p>Let's load the dataset containing 11 seasons of EPL games (2008/2009 to 2018/2019). I will then sort the rows by the <code>team_home</code> variable in ascending order.</p>

In [8]:
epl_games = pd.read_csv(f"{data_in}epl_games.csv")

In [9]:
epl_games.sort_values(["team_home"])

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
0,E0,2008,16aug2008,Arsenal,West Brom,3,0,1,0
3638,E0,2017,03jan2018,Arsenal,Chelsea,1,1,2,2
1900,E0,2013,17aug2013,Arsenal,Aston Villa,0,3,1,3
1014,E0,2010,12feb2011,Arsenal,Wolves,3,0,2,0
3939,E0,2018,02dec2018,Arsenal,Tottenham,3,0,4,2
...,...,...,...,...,...,...,...,...,...
1004,E0,2010,05feb2011,Wolves,Man United,3,0,2,1
489,E0,2009,07nov2009,Wolves,Arsenal,0,3,1,4
3928,E0,2018,25nov2018,Wolves,Huddersfield,0,3,0,2
553,E0,2009,20dec2009,Wolves,Burnley,3,0,2,0


<p>Note, however, that the games are not sorted by season. Let's sort the dataset once more, this time by <code>season</code> first and then by <code>team_home</code>.</p>

In [10]:
epl_games.sort_values(["season", "team_home"])

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
0,E0,2008,16aug2008,Arsenal,West Brom,3,0,1,0
21,E0,2008,30aug2008,Arsenal,Newcastle,3,0,3,0
53,E0,2008,27sep2008,Arsenal,Hull,0,3,1,2
75,E0,2008,18oct2008,Arsenal,Everton,3,0,3,1
95,E0,2008,29oct2008,Arsenal,Tottenham,1,1,4,4
...,...,...,...,...,...,...,...,...,...
4085,E0,2018,02mar2019,Wolves,Cardiff,3,0,2,0
4115,E0,2018,02apr2019,Wolves,Man United,3,0,2,1
4139,E0,2018,20apr2019,Wolves,Brighton,1,1,0,0
4148,E0,2018,24apr2019,Wolves,Arsenal,3,0,3,1


<p>We can also filter the dataset to keep one particular season. Let's choose <code>2018</code>, which denotes the 2018/2019 season.</p>

In [11]:
epl_games.loc[lambda x: x["season"] == 2018]

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
3800,E0,2018,10aug2018,Man United,Leicester,3,0,2,1
3801,E0,2018,11aug2018,Wolves,Everton,1,1,2,2
3802,E0,2018,11aug2018,Huddersfield,Chelsea,0,3,0,3
3803,E0,2018,11aug2018,Fulham,Crystal Palace,0,3,0,2
3804,E0,2018,11aug2018,Newcastle,Tottenham,0,3,1,2
...,...,...,...,...,...,...,...,...,...
4175,E0,2018,12may2019,Burnley,Arsenal,0,3,1,3
4176,E0,2018,12may2019,Tottenham,Everton,1,1,2,2
4177,E0,2018,12may2019,Liverpool,Wolves,3,0,2,0
4178,E0,2018,12may2019,Southampton,Huddersfield,1,1,1,1


<p>We can now make some observations about this dataset:</p>
<ul>
    <li>Each observation is a single game</li>
    <li>Key variables: <b>date</b>, <b>name of the home team</b>, <b>name of the away team</b>, <b>points home and away</b>, <b>goals home</b>, <b>goals away</b></li>
    <li>Because each observation is a game, and each game is a separate row in the data table, we can affirm this is a <b>tidy data table</b></li>
    <li>There are 3 ID variables: date, home team, and away team</li>
</ul>
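<p>The claim about the ID variables can be checked directly: if date, home team, and away team identify a game, no two rows should share the same combination of the three. Below is a minimal sketch of that check on a toy table whose rows are copied from the outputs above. The names <code>games</code>, <code>id_vars</code>, and <code>ids_are_unique</code> are illustrative and not part of the original notebook. The same expression should work on <code>epl_games</code>.</p>

```python
import pandas as pd

# Toy stand-in for epl_games; the rows are copied from the outputs shown above
games = pd.DataFrame(
    {
        "date": ["16aug2008", "30aug2008", "11aug2018"],
        "team_home": ["Arsenal", "Arsenal", "Wolves"],
        "team_away": ["West Brom", "Newcastle", "Everton"],
    }
)

# Candidate ID variables (illustrative name)
id_vars = ["date", "team_home", "team_away"]

# duplicated() flags rows that repeat an earlier combination of the ID values,
# so the IDs are unique exactly when no row is flagged
ids_are_unique = not games.duplicated(subset=id_vars).any()
print(ids_are_unique)
```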
<p>We can also structure the data differently for the analysis. If each row is a game played by a team, every game appears twice (once for each team), so the data table becomes longer, but this structure makes it easier to answer our question.</p>
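<p>To make the restructuring concrete, here is a minimal sketch of how a game-level table could be turned into a team-game-level table: take the games once from the home side and once from the away side, then stack the two views. This is an assumption about one possible approach, not necessarily how the file loaded next was built, and it does not reproduce every column of that file (for example <code>gameno</code> or the team IDs). The helper <code>team_view</code> and the toy table <code>games</code> are hypothetical. The two rows are copied from the outputs above.</p>

```python
import pandas as pd

# Toy game-level table with the same column names as epl_games
games = pd.DataFrame(
    {
        "season": [2008, 2018],
        "date": ["16aug2008", "11aug2018"],
        "team_home": ["Arsenal", "Wolves"],
        "team_away": ["West Brom", "Everton"],
        "points_home": [3, 1],
        "points_away": [0, 1],
        "goals_home": [1, 2],
        "goals_away": [0, 2],
    }
)


def team_view(df, side, other, home_flag):
    """Return the games as seen from one side (hypothetical helper)."""
    return pd.DataFrame(
        {
            "season": df["season"],
            "date": df["date"],
            "team": df[f"team_{side}"],
            "home": home_flag,  # 1 if the team played at home, 0 otherwise
            "points": df[f"points_{side}"],
            "goals": df[f"goals_{side}"],
            "team_opponent": df[f"team_{other}"],
            "points_opponent": df[f"points_{other}"],
            "goals_opponent": df[f"goals_{other}"],
        }
    )


# Stack the home view and the away view: one row per team per game
team_games = pd.concat(
    [team_view(games, "home", "away", 1), team_view(games, "away", "home", 0)],
    ignore_index=True,
)

# Every game contributes two rows, so the long table is twice as long
print(len(team_games) == 2 * len(games))
```

<p>Sorting the result by <code>season</code> and <code>team</code> would give a layout comparable to the table loaded below.</p>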

In [12]:
epl_teams_games = pd.read_csv(f"{data_in}epl-teams-games.csv")

In [13]:
epl_teams_games.sort_values(["team"])

Unnamed: 0,div,season,date,team,gameno,home,points,goals,team_opponent,points_opponent,goals_opponent,hometeam_uid,awayteam_uid
0,E0,2008,16aug2008,Arsenal,1,1,3,1,West Brom,0,0,1,33
286,E0,2015,13jan2016,Arsenal,21,0,1,3,Liverpool,1,3,18,1
285,E0,2015,02jan2016,Arsenal,20,1,3,1,Newcastle,0,0,1,22
283,E0,2015,26dec2015,Arsenal,18,0,0,0,Southampton,3,4,27,1
282,E0,2015,21dec2015,Arsenal,17,1,3,2,Man City,0,1,1,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8259,E0,2010,20nov2010,Wolves,14,0,0,1,Blackpool,3,2,5,36
8260,E0,2010,27nov2010,Wolves,15,1,3,3,Sunderland,0,2,36,29
8261,E0,2010,04dec2010,Wolves,16,0,0,0,Blackburn,3,3,4,36
8254,E0,2010,23oct2010,Wolves,9,0,0,0,Chelsea,3,2,11,36


In [14]:
epl_teams_games.sort_values(["season", "team"])

Unnamed: 0,div,season,date,team,gameno,home,points,goals,team_opponent,points_opponent,goals_opponent,hometeam_uid,awayteam_uid
0,E0,2008,16aug2008,Arsenal,1,1,3,1,West Brom,0,0,1,33
1,E0,2008,23aug2008,Arsenal,2,0,0,0,Fulham,3,1,14,1
2,E0,2008,30aug2008,Arsenal,3,1,3,3,Newcastle,0,0,1,22
3,E0,2008,13sep2008,Arsenal,4,0,3,4,Blackburn,0,0,4,1
4,E0,2008,20sep2008,Arsenal,5,0,3,3,Bolton,0,1,6,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8355,E0,2018,20apr2019,Wolves,34,1,1,0,Brighton,1,0,36,8
8356,E0,2018,24apr2019,Wolves,35,1,3,3,Arsenal,0,1,36,1
8357,E0,2018,27apr2019,Wolves,36,0,3,2,Watford,0,1,32,36
8358,E0,2018,04may2019,Wolves,37,1,3,1,Fulham,0,0,36,14
