<h1>Chapter 3 | <b>Extra</b> Data Exercise |  EDA With Football Data</h1>

<p>Inspired by the book's analysis of home-team and away-team performance, I decided to reproduce a few of its steps using a different dataset. To do that, I will tap into a great API - Football API - and get some statistics on the main Brazilian football league (Serie A). This will require going back a few steps and reviewing the first two chapters as well. In short, we will do the following here:</p>
<ul>
    <li>Get the data</li>
    <li>Clean the data</li>
    <li>Explore the data</li>
</ul>
<p>This is an opportunity to learn more about how to get data from APIs. Even though this API is packed with data, let's keep the task simple. Without further ado, let's do it.</p>
<hr>
<h2>Get the data</h2>
<p>Let's import the required libraries and set up our working directories, our YAML variables, and our URLs.</p>


In [2]:
import requests
import yaml
import pandas as pd
import os


In [3]:
# Define working directory
dirname = os.getcwd()
data_out = f"{dirname}/data/output/" 

# Define variables set in yaml
with open(f"{dirname}/config/config.yaml") as f:
    config = yaml.safe_load(f)

API_KEY = config["api_key"]
url = "https://v3.football.api-sports.io/"

<p>We can now write a simple function that calls the API at the url and, using the defined parameters, returns the data as parsed JSON. It's important - and that goes for any API - to take a look at the <b>documentation</b> to get acquainted with the required endpoints. We need the following information from the API:</p>
<ul>
    <li>All the matches between all teams for the 2022 Serie A season.</li>
    <li>The number of goals scored by each team.</li>
    <li>The name of the home and the away team for each match.</li>
</ul>
<p>After reading the docs, I decided that the endpoint <code>/fixtures</code> is the right one for this job. Also, to get the required season year and the id of this particular football league, I took a look at the id numbers provided by the API's <a href="https://dashboard.api-football.com/soccer/ids">dashboard</a>.</p>

In [6]:
parameters = {"season": 2022,
              "league": 71,
              }

headers = {
    "x-rapidapi-key": f"{API_KEY}",
    "x-rapidapi-host": "v3.football.api-sports.io",
    }

endpoint = "fixtures"

In [5]:
def call_api(url, url_endpoint, headers, params):
    """Call an API endpoint with the given headers and query parameters and return the parsed JSON."""
    url_call = f"{url}{url_endpoint}"

    response = requests.get(url=url_call, headers=headers, params=params)
    # Report the outcome; the body is parsed either way
    if response.status_code == 200:
        print("All good!")
    else:
        print(f"Oops, something went wrong. \nStatus code: {response.status_code}")

    data = response.json()
    return data

In [7]:
request = call_api(url=url, url_endpoint=endpoint, headers=headers, params=parameters)
df = pd.DataFrame(request["response"])

All good!
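
<p>A status code of 200 only tells us that the HTTP request succeeded. As a minimal sketch, assuming the JSON payload carries an <code>errors</code> field next to <code>response</code> (an assumption about the response envelope that should be checked against the documentation), a small helper could fail loudly before we build the DataFrame. The name <code>extract_response</code> and both sample payloads are hypothetical.</p>

```python
def extract_response(payload):
    """Return the list stored under "response", raising if the payload reports errors."""
    errors = payload.get("errors")
    # Assumption: the API sends an empty list or dict here when nothing went wrong.
    if errors:
        raise ValueError(f"API reported errors: {errors}")
    return payload.get("response", [])


# Hypothetical payloads, shaped like the envelope assumed above
ok_payload = {"errors": [], "results": 1, "response": [{"fixture": {"id": 837991}}]}
bad_payload = {"errors": {"token": "Missing application key"}, "results": 0, "response": []}

fixtures = extract_response(ok_payload)

try:
    extract_response(bad_payload)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

<p>With a helper like this, <code>pd.DataFrame(extract_response(request))</code> would stop early on a bad key or an exhausted quota instead of silently producing an empty DataFrame.</p>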


<p>Ok, now we need to dig our data out. The response is a large JSON payload with many nested dictionaries that we need to access. To reproduce the book's table, we need the following fields:</p>
<ul>
    <li>Division League</li>
    <li>Year of the season</li>
    <li>Date</li>
    <li>Home team</li>
    <li>Away team</li>
    <li>Points scored at home</li>
    <li>Points scored away</li>
    <li>Goals scored at home</li>
    <li>Goals scored away</li>
</ul>
<p>We'll track down one variable at a time, checking where each one is stored.</p>

In [19]:
df.head()

Unnamed: 0,fixture,league,teams,goals,score
0,"{'id': 837991, 'referee': 'Bruno Arleu de Arau...","{'id': 71, 'name': 'Serie A', 'country': 'Braz...","{'home': {'id': 1062, 'name': 'Atletico-MG', '...","{'home': 2, 'away': 0}","{'halftime': {'home': 1, 'away': 0}, 'fulltime..."
1,"{'id': 837992, 'referee': 'Anderson Daronco', ...","{'id': 71, 'name': 'Serie A', 'country': 'Braz...","{'home': {'id': 124, 'name': 'Fluminense', 'lo...","{'home': 0, 'away': 0}","{'halftime': {'home': 0, 'away': 0}, 'fulltime..."
2,"{'id': 837993, 'referee': 'Wagner do Nasciment...","{'id': 71, 'name': 'Serie A', 'country': 'Braz...","{'home': {'id': 126, 'name': 'Sao Paulo', 'log...","{'home': 4, 'away': 0}","{'halftime': {'home': 1, 'away': 0}, 'fulltime..."
3,"{'id': 837994, 'referee': 'Caio Max Augusto Vi...","{'id': 71, 'name': 'Serie A', 'country': 'Braz...","{'home': {'id': 121, 'name': 'Palmeiras', 'log...","{'home': 2, 'away': 3}","{'halftime': {'home': 1, 'away': 2}, 'fulltime..."
4,"{'id': 837995, 'referee': 'Wilton Pereira Samp...","{'id': 71, 'name': 'Serie A', 'country': 'Braz...","{'home': {'id': 120, 'name': 'Botafogo', 'logo...","{'home': 1, 'away': 3}","{'halftime': {'home': 0, 'away': 3}, 'fulltime..."


<h3>Get the id and the date of each match</h3>
<p>Let's start by getting the number of games. We should be able to do that by tapping into <code>fixture</code>, the first column. To do this, we will extract the <code>id</code> and the <code>date</code> value for each match. We will use <code>pd.json_normalize()</code>, which expands nested JSON-like data into separate columns and makes our job much easier. We then call <code>.reset_index()</code>, which stores the original row position in a new <code>index</code> column that we will use as the key in future joins.</p>
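
<p>To see what the flattening does before running it on the real data, here is a small sketch on two made-up records shaped like the <code>fixture</code> dictionaries. The values are illustrative only; nested keys end up as dotted column names.</p>

```python
import pandas as pd

# Two made-up records that mimic the nesting of the "fixture" column
records = [
    {"id": 1, "date": "2022-04-10T19:00:00+00:00", "venue": {"id": 234, "city": "Belo Horizonte"}},
    {"id": 2, "date": "2022-04-09T19:30:00+00:00", "venue": {"id": 204, "city": "Rio de Janeiro"}},
]

# Nested dictionaries become columns named "<parent>.<child>"
flat = pd.json_normalize(records)

# Keep two columns and turn the row position into an "index" column
ids_dates = flat[["id", "date"]].reset_index()
```

<p>Here <code>flat</code> has the columns <code>id</code>, <code>date</code>, <code>venue.id</code> and <code>venue.city</code>, and <code>ids_dates</code> has <code>index</code>, <code>id</code> and <code>date</code>.</p>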

In [41]:
df["fixture"][0]

{'id': 837991,
 'referee': 'Bruno Arleu de Araujo',
 'timezone': 'UTC',
 'date': '2022-04-10T19:00:00+00:00',
 'timestamp': 1649617200,
 'periods': {'first': 1649617200, 'second': 1649620800},
 'venue': {'id': 234,
  'name': 'Estádio Governador Magalhães Pinto',
  'city': 'Belo Horizonte, Minas Gerais'},
 'status': {'long': 'Match Finished', 'short': 'FT', 'elapsed': 90}}

In [53]:
df_ids_dates = pd.json_normalize(df["fixture"])[["id", "date"]].reset_index()

In [57]:
df_ids_dates.shape

(380, 3)

In [58]:
df_ids_dates.head()

Unnamed: 0,index,id,date
0,0,837991,2022-04-10T19:00:00+00:00
1,1,837992,2022-04-09T19:30:00+00:00
2,2,837993,2022-04-10T22:00:00+00:00
3,3,837994,2022-04-10T00:00:00+00:00
4,4,837995,2022-04-10T19:00:00+00:00


<p>Good! We need to keep the date of the game and get rid of the time (which is useful, but not for our task). We can use <code>.str.split()</code> with <code>"T"</code> as the separator and keep the first piece. Then, we convert the column to datetime.</p>

In [63]:
df_ids_dates["date"] = df_ids_dates["date"].str.split("T").str[0]
df_ids_dates["date"] = pd.to_datetime(df_ids_dates["date"], format ="%Y-%m-%d")
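
<p>As an alternative sketch, the ISO timestamps can be parsed directly and truncated to the day afterwards. The fixture shown earlier reports <code>'timezone': 'UTC'</code>, so a kickoff at 00:00 UTC falls on the previous calendar day in Brazil. The example below uses two timestamps from the output above and assumes <code>America/Sao_Paulo</code> is the local time zone we care about; whether local dates matter depends on the analysis.</p>

```python
import pandas as pd

raw = pd.Series(["2022-04-10T19:00:00+00:00", "2022-04-10T00:00:00+00:00"])

# Parse the full timestamp as timezone-aware UTC
parsed = pd.to_datetime(raw, utc=True)

# Same result as splitting on "T": the UTC calendar date
utc_dates = parsed.dt.tz_localize(None).dt.normalize()

# Assumption: Brasilia time is the relevant local time zone
local = parsed.dt.tz_convert("America/Sao_Paulo")
local_dates = local.dt.tz_localize(None).dt.normalize()
```

<p>For the second timestamp, <code>utc_dates</code> gives 2022-04-10 while <code>local_dates</code> gives 2022-04-09, so the two approaches can disagree for late-evening matches.</p>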

In [64]:
df_ids_dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   index   380 non-null    int64         
 1   id      380 non-null    int64         
 2   date    380 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(2)
memory usage: 9.0 KB


In [62]:
df_ids_dates

Unnamed: 0,index,id,date
0,0,837991,2022-04-10
1,1,837992,2022-04-09
2,2,837993,2022-04-10
3,3,837994,2022-04-10
4,4,837995,2022-04-10
...,...,...,...
375,375,838366,2022-11-13
376,376,838367,2022-11-13
377,377,838368,2022-11-13
378,378,838369,2022-11-13


<h3>Get Division and Season</h3>
<p>Man, you live and you surely learn! If I had known about <code>pd.json_normalize()</code> before, I would probably have finished by now! But that's the beauty of pandas and data analysis - there is always room for improvement. Let's apply this function again and quickly get our variables.</p>

In [66]:
df_league = pd.json_normalize(df["league"])

In [71]:
df_name_season = df_league[["name", "season"]].reset_index()
df_name_season.rename(columns={"name": "div"}, inplace=True)

<p>We can now join them using <code>pd.merge()</code> and use our <code>index</code> as the common field.</p>

In [72]:
df_dates_season = pd.merge(df_ids_dates, df_name_season, on="index", how="inner")
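
<p>Because every piece comes from the same DataFrame, each <code>index</code> value should appear exactly once on both sides. A hedged sketch on toy data: passing <code>validate="one_to_one"</code> to <code>pd.merge()</code> makes pandas check that assumption and raise instead of silently duplicating rows. The toy frames below are made up for illustration.</p>

```python
import pandas as pd

# Toy stand-ins for df_ids_dates and df_name_season
left = pd.DataFrame({"id": [837991, 837992], "date": ["2022-04-10", "2022-04-09"]}).reset_index()
right = pd.DataFrame({"div": ["Serie A", "Serie A"], "season": [2022, 2022]}).reset_index()

# validate="one_to_one" raises if a key is repeated on either side
merged = pd.merge(left, right, on="index", how="inner", validate="one_to_one")

# Duplicated keys on the right side trigger the check
right_dup = pd.concat([right, right])
try:
    pd.merge(left, right_dup, on="index", how="inner", validate="one_to_one")
    caught = False
except pd.errors.MergeError:
    caught = True
```

<p>On the toy data, <code>merged</code> keeps two rows and five columns, and the second merge is rejected.</p>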

In [80]:
df_dates_season.shape

(380, 5)

In [73]:
df_dates_season

Unnamed: 0,index,id,date,div,season
0,0,837991,2022-04-10,Serie A,2022
1,1,837992,2022-04-09,Serie A,2022
2,2,837993,2022-04-10,Serie A,2022
3,3,837994,2022-04-10,Serie A,2022
4,4,837995,2022-04-10,Serie A,2022
...,...,...,...,...,...
375,375,838366,2022-11-13,Serie A,2022
376,376,838367,2022-11-13,Serie A,2022
377,377,838368,2022-11-13,Serie A,2022
378,378,838369,2022-11-13,Serie A,2022


<h3>Get home and away team</h3>
<p>This should not be difficult - from now on, it's a matter of repeating what we have done so far. As an extra, I'll also get the team logos, as I want to use them for my data visualization project.</p>

In [75]:
df_teams = pd.json_normalize(df["teams"])

In [78]:
df_teams = df_teams[["home.name", "home.logo", "away.name", "away.logo"]].reset_index()

In [79]:
df_teams.shape

(380, 5)

In [81]:
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   index      380 non-null    int64 
 1   home.name  380 non-null    object
 2   home.logo  380 non-null    object
 3   away.name  380 non-null    object
 4   away.logo  380 non-null    object
dtypes: int64(1), object(4)
memory usage: 15.0+ KB


In [82]:
df_teams.head()

Unnamed: 0,index,home.name,home.logo,away.name,away.logo
0,0,Atletico-MG,https://media-2.api-sports.io/football/teams/1...,Internacional,https://media-2.api-sports.io/football/teams/1...
1,1,Fluminense,https://media-3.api-sports.io/football/teams/1...,Santos,https://media-1.api-sports.io/football/teams/1...
2,2,Sao Paulo,https://media-1.api-sports.io/football/teams/1...,Atletico Paranaense,https://media-1.api-sports.io/football/teams/1...
3,3,Palmeiras,https://media-1.api-sports.io/football/teams/1...,Ceara,https://media-1.api-sports.io/football/teams/1...
4,4,Botafogo,https://media-3.api-sports.io/football/teams/1...,Corinthians,https://media-3.api-sports.io/football/teams/1...


In [83]:
df_dates_season_teams = pd.merge(df_dates_season, df_teams, on="index", how="inner")

In [87]:
df_dates_season_teams.rename(columns={
    "home.name": "team_home",
    "home.logo": "team_home_logo",
    "away.name": "team_away",
    "away.logo": "team_away_logo",
}, inplace=True
)

In [88]:
df_dates_season_teams.shape

(380, 9)

In [89]:
df_dates_season_teams.head()

Unnamed: 0,index,id,date,div,season,team_home,team_home_logo,team_away,team_away_logo
0,0,837991,2022-04-10,Serie A,2022,Atletico-MG,https://media-2.api-sports.io/football/teams/1...,Internacional,https://media-2.api-sports.io/football/teams/1...
1,1,837992,2022-04-09,Serie A,2022,Fluminense,https://media-3.api-sports.io/football/teams/1...,Santos,https://media-1.api-sports.io/football/teams/1...
2,2,837993,2022-04-10,Serie A,2022,Sao Paulo,https://media-1.api-sports.io/football/teams/1...,Atletico Paranaense,https://media-1.api-sports.io/football/teams/1...
3,3,837994,2022-04-10,Serie A,2022,Palmeiras,https://media-1.api-sports.io/football/teams/1...,Ceara,https://media-1.api-sports.io/football/teams/1...
4,4,837995,2022-04-10,Serie A,2022,Botafogo,https://media-3.api-sports.io/football/teams/1...,Corinthians,https://media-3.api-sports.io/football/teams/1...


<h3>Get the goals of each match for home and away team</h3>

In [92]:
df_goals = pd.json_normalize(df["goals"]).reset_index()

In [94]:
df_goals.rename(columns={
        "home": "goals_home",
        "away": "goals_away",
    }, inplace=True
)

In [96]:
df_goals.shape

(380, 3)

In [98]:
df_goals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   index       380 non-null    int64
 1   goals_home  380 non-null    int64
 2   goals_away  380 non-null    int64
dtypes: int64(3)
memory usage: 9.0 KB


In [99]:
df_goals_dates_season_teams = pd.merge(df_dates_season_teams, df_goals, on="index", how="inner")

In [100]:
df_goals_dates_season_teams

Unnamed: 0,index,id,date,div,season,team_home,team_home_logo,team_away,team_away_logo,goals_home,goals_away
0,0,837991,2022-04-10,Serie A,2022,Atletico-MG,https://media-2.api-sports.io/football/teams/1...,Internacional,https://media-2.api-sports.io/football/teams/1...,2,0
1,1,837992,2022-04-09,Serie A,2022,Fluminense,https://media-3.api-sports.io/football/teams/1...,Santos,https://media-1.api-sports.io/football/teams/1...,0,0
2,2,837993,2022-04-10,Serie A,2022,Sao Paulo,https://media-1.api-sports.io/football/teams/1...,Atletico Paranaense,https://media-1.api-sports.io/football/teams/1...,4,0
3,3,837994,2022-04-10,Serie A,2022,Palmeiras,https://media-1.api-sports.io/football/teams/1...,Ceara,https://media-1.api-sports.io/football/teams/1...,2,3
4,4,837995,2022-04-10,Serie A,2022,Botafogo,https://media-3.api-sports.io/football/teams/1...,Corinthians,https://media-3.api-sports.io/football/teams/1...,1,3
...,...,...,...,...,...,...,...,...,...,...,...
375,375,838366,2022-11-13,Serie A,2022,Internacional,https://media-3.api-sports.io/football/teams/1...,Palmeiras,https://media-1.api-sports.io/football/teams/1...,3,0
376,376,838367,2022-11-13,Serie A,2022,Ceara,https://media-3.api-sports.io/football/teams/1...,Juventude,https://media-2.api-sports.io/football/teams/1...,4,1
377,377,838368,2022-11-13,Serie A,2022,Goias,https://media-1.api-sports.io/football/teams/1...,Sao Paulo,https://media-1.api-sports.io/football/teams/1...,0,4
378,378,838369,2022-11-13,Serie A,2022,Cuiaba,https://media-2.api-sports.io/football/teams/1...,Coritiba,https://media-1.api-sports.io/football/teams/1...,2,1


<p>We are nearly there! Let's get the points and we will be almost done with the data wrangling part.</p>
<h3>Get the points for each match for home and away team</h3>

In [101]:
df_points = pd.json_normalize(df["score"]).reset_index()

In [102]:
df_points.head()

Unnamed: 0,index,halftime.home,halftime.away,fulltime.home,fulltime.away,extratime.home,extratime.away,penalty.home,penalty.away
0,0,1,0,2,0,,,,
1,1,0,0,0,0,,,,
2,2,1,0,4,0,,,,
3,3,1,2,2,3,,,,
4,4,0,3,1,3,,,,


<p>Since we are not interested in when the goals were scored, we won't be using this data. But we can work out the number of points each team earned from the goals we already have. We will create each column individually, but the logic is nearly the same. We will call <code>.apply()</code> on our DataFrame. Remember: <code>apply()</code> lets you apply a function along a specific axis of a DataFrame or Series. It is very flexible, as you can perform many operations with it. As its first argument, we pass a lambda function that encodes the rule: for each row, if the home team scored as many goals as the away team, the match was a draw and the function returns 1; if the home team scored more, it won and gets 3 points; otherwise, it lost and gets 0 points. As the second argument, we pass <code>axis=1</code>, so that the lambda is applied to each row. The away column mirrors this with the comparison reversed. We hence get the number of points the home and away teams earned for each match in two different columns!</p>

In [107]:
# Create "points_home" variable
df_goals_dates_season_teams["points_home"] = df_goals_dates_season_teams.apply(
    lambda row: 1 if row["goals_home"] == row["goals_away"] 
    else (3 if row["goals_home"] > row["goals_away"] else 0), 
    axis=1
    )

# Create "points_away" variable
df_goals_dates_season_teams["points_away"] = df_goals_dates_season_teams.apply(
    lambda row: 1 if row["goals_home"] == row["goals_away"] 
    else (3 if row["goals_home"] < row["goals_away"] else 0), 
    axis=1
    )
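
<p>The same rule can also be written without a row-wise <code>apply()</code>. This is a sketch of one vectorized alternative using <code>numpy.select</code>, shown on a toy DataFrame; it is not the approach used in this notebook, and the three sample rows are made up.</p>

```python
import numpy as np
import pandas as pd

# Toy matches: a home win, a draw and an away win
matches = pd.DataFrame({"goals_home": [2, 0, 2], "goals_away": [0, 0, 3]})

# Positive means the home team won, zero means a draw
diff = matches["goals_home"] - matches["goals_away"]

# The first condition that holds picks the value; otherwise the default applies
matches["points_home"] = np.select([diff > 0, diff == 0], [3, 1], default=0)
matches["points_away"] = np.select([diff < 0, diff == 0], [3, 1], default=0)
```

<p>On these rows the home points come out as 3, 1, 0 and the away points as 0, 1, 3, matching the rule described above.</p>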

In [106]:
df_goals_dates_season_teams.head()

Unnamed: 0,index,id,date,div,season,team_home,team_home_logo,team_away,team_away_logo,goals_home,goals_away,points_home,points_away
0,0,837991,2022-04-10,Serie A,2022,Atletico-MG,https://media-2.api-sports.io/football/teams/1...,Internacional,https://media-2.api-sports.io/football/teams/1...,2,0,3,0
1,1,837992,2022-04-09,Serie A,2022,Fluminense,https://media-3.api-sports.io/football/teams/1...,Santos,https://media-1.api-sports.io/football/teams/1...,0,0,1,1
2,2,837993,2022-04-10,Serie A,2022,Sao Paulo,https://media-1.api-sports.io/football/teams/1...,Atletico Paranaense,https://media-1.api-sports.io/football/teams/1...,4,0,3,0
3,3,837994,2022-04-10,Serie A,2022,Palmeiras,https://media-1.api-sports.io/football/teams/1...,Ceara,https://media-1.api-sports.io/football/teams/1...,2,3,0,3
4,4,837995,2022-04-10,Serie A,2022,Botafogo,https://media-3.api-sports.io/football/teams/1...,Corinthians,https://media-3.api-sports.io/football/teams/1...,1,3,0,3


<p>Ok, we just need to create a <code>"home_goaladv"</code> column holding the goal difference from the home team's point of view (home goals minus away goals).</p>

In [108]:
df_goals_dates_season_teams["home_goaladv"] = df_goals_dates_season_teams["goals_home"] - df_goals_dates_season_teams["goals_away"]

In [109]:
df_goals_dates_season_teams.head()

Unnamed: 0,index,id,date,div,season,team_home,team_home_logo,team_away,team_away_logo,goals_home,goals_away,points_home,points_away,home_goaladv
0,0,837991,2022-04-10,Serie A,2022,Atletico-MG,https://media-2.api-sports.io/football/teams/1...,Internacional,https://media-2.api-sports.io/football/teams/1...,2,0,3,0,2
1,1,837992,2022-04-09,Serie A,2022,Fluminense,https://media-3.api-sports.io/football/teams/1...,Santos,https://media-1.api-sports.io/football/teams/1...,0,0,1,1,0
2,2,837993,2022-04-10,Serie A,2022,Sao Paulo,https://media-1.api-sports.io/football/teams/1...,Atletico Paranaense,https://media-1.api-sports.io/football/teams/1...,4,0,3,0,4
3,3,837994,2022-04-10,Serie A,2022,Palmeiras,https://media-1.api-sports.io/football/teams/1...,Ceara,https://media-1.api-sports.io/football/teams/1...,2,3,0,3,-1
4,4,837995,2022-04-10,Serie A,2022,Botafogo,https://media-3.api-sports.io/football/teams/1...,Corinthians,https://media-3.api-sports.io/football/teams/1...,1,3,0,3,-2


<p><b>AWESOME</b>! Let's just reorder our columns and we are good to go.</p>

In [110]:
final_df = df_goals_dates_season_teams[[
    "div", "season", "team_home", "team_away", 
    "points_home", "points_away", "goals_home", 
    "goals_away", "home_goaladv", "team_home_logo", 
    "team_away_logo"
    ]]

In [111]:
final_df

Unnamed: 0,div,season,team_home,team_away,points_home,points_away,goals_home,goals_away,home_goaladv,team_home_logo,team_away_logo
0,Serie A,2022,Atletico-MG,Internacional,3,0,2,0,2,https://media-2.api-sports.io/football/teams/1...,https://media-2.api-sports.io/football/teams/1...
1,Serie A,2022,Fluminense,Santos,1,1,0,0,0,https://media-3.api-sports.io/football/teams/1...,https://media-1.api-sports.io/football/teams/1...
2,Serie A,2022,Sao Paulo,Atletico Paranaense,3,0,4,0,4,https://media-1.api-sports.io/football/teams/1...,https://media-1.api-sports.io/football/teams/1...
3,Serie A,2022,Palmeiras,Ceara,0,3,2,3,-1,https://media-1.api-sports.io/football/teams/1...,https://media-1.api-sports.io/football/teams/1...
4,Serie A,2022,Botafogo,Corinthians,0,3,1,3,-2,https://media-3.api-sports.io/football/teams/1...,https://media-3.api-sports.io/football/teams/1...
...,...,...,...,...,...,...,...,...,...,...,...
375,Serie A,2022,Internacional,Palmeiras,3,0,3,0,3,https://media-3.api-sports.io/football/teams/1...,https://media-1.api-sports.io/football/teams/1...
376,Serie A,2022,Ceara,Juventude,3,0,4,1,3,https://media-3.api-sports.io/football/teams/1...,https://media-2.api-sports.io/football/teams/1...
377,Serie A,2022,Goias,Sao Paulo,0,3,0,4,-4,https://media-1.api-sports.io/football/teams/1...,https://media-1.api-sports.io/football/teams/1...
378,Serie A,2022,Cuiaba,Coritiba,3,0,2,1,1,https://media-2.api-sports.io/football/teams/1...,https://media-1.api-sports.io/football/teams/1...


In [112]:
pd.DataFrame.from_dict(
    {
        "Statistics": [
            "Mean",
            "Standard deviation",
            "Percent positive",
            "Percent zero",
            "Percent negative",
            "Number of observations",
        ],
        "Value": [
            final_df["home_goaladv"].describe()["mean"],
            final_df["home_goaladv"].describe()["std"],
            (final_df["home_goaladv"] > 0).sum() / final_df["home_goaladv"].shape[0] * 100,
            (final_df["home_goaladv"] == 0).sum() / final_df["home_goaladv"].shape[0] * 100,
            (final_df["home_goaladv"] < 0).sum() / final_df["home_goaladv"].shape[0] * 100,
            final_df["home_goaladv"].describe()["count"],
        ],
    }
).round(1)

Unnamed: 0,Statistics,Value
0,Mean,0.4
1,Standard deviation,1.5
2,Percent positive,44.2
3,Percent zero,28.4
4,Percent negative,27.4
5,Number of observations,380.0
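
<p>The summary table above could also be wrapped in a small reusable function. This is a sketch under the assumption that we want the same six statistics for any numeric Series; the name <code>summarize_advantage</code> is hypothetical. It relies on the fact that the mean of a boolean Series is the share of <code>True</code> values.</p>

```python
import pandas as pd


def summarize_advantage(series):
    """Build the six-row summary table for a numeric Series."""
    return pd.DataFrame(
        {
            "Statistics": [
                "Mean",
                "Standard deviation",
                "Percent positive",
                "Percent zero",
                "Percent negative",
                "Number of observations",
            ],
            "Value": [
                series.mean(),
                series.std(),
                (series > 0).mean() * 100,  # share of True values, as a percentage
                (series == 0).mean() * 100,
                (series < 0).mean() * 100,
                float(series.count()),
            ],
        }
    ).round(1)


# The five goal differences shown by head() earlier, used as a toy sample
sample = pd.Series([2, 0, 4, -1, -2])
summary = summarize_advantage(sample)
```

<p>Calling <code>summarize_advantage(final_df["home_goaladv"])</code> should reproduce the table above.</p>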
