# Introduction

This project extends my previous work on scraping Serie A (Italian football league) results and per-club stats from [FBRef](https://fbref.com/en/). The aim now is to scrape an additional season as well as some supplementary stats for each football match. The scraped data will be used to predict the results of these matches. The following libraries will be used:

- Requests
- BeautifulSoup
- Pandas
- Numpy

## Scraping using Requests

We start by using the `requests` library to download the HTML of the page we are interested in, so we first import the library and define the URL of interest.

In [1]:
import requests

In [2]:
league_url = 'https://fbref.com/en/comps/11/Serie-A-Stats'
# this url is from the fbref website and contains the primary Serie A league table for the ongoing 2022-23 season

In [3]:
data = requests.get(league_url)
#using the get method to make a request to the server and download the HTML from the page
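The request above assumes the download succeeded. As a minimal sketch, the status can be checked before the HTML is used. The helper name `fetch_html` and the 30-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=30):
    """Download a page and return its HTML, raising on HTTP errors.

    `fetch_html` is a hypothetical helper; `session` may be any object
    with a requests-style `get` method.
    """
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so a failed download is noticed before any parsing is attempted
    response.raise_for_status()
    return response.text
```

With this in place, `html = fetch_html(league_url)` would stand in for `requests.get(league_url).text`.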

We now have the page's HTML downloaded. We want each individual club's stats for a particular season, and the links to the club pages are contained in the "League Table" of the page we just scraped.

Using the inspect option in Google Chrome, we can see that each club name in the table is wrapped in an `<a> </a>` tag whose `href` attribute holds the URL of that team's "homepage".

## Extracting useful info using BeautifulSoup

Now that we know that we need to extract the `<a>` tags containing `href` in the HTML code we downloaded, we need to use BeautifulSoup to parse through the complex block of text/code that we scraped using `requests` earlier.

In [4]:
from bs4 import BeautifulSoup

In [5]:
#initialize a soup object
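soup = BeautifulSoup(data.text, 'html.parser')
# naming the parser explicitly avoids the "no parser was explicitly specified" warning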

To select the relevant URLs from each team's row in the table, we can use the inspect element panel in Chrome to find that the links sit inside a `<table>` tag with the class `stats_table`. So we first select the table and then the `<a>` tags contained inside it.

In [6]:
# first select any table in the page which has a class called stats_table
league_table = soup.select('table.stats_table')

In [7]:
league_table[0]
# we only need the first entry as that has the links needed

<table class="stats_table sortable min_width force_mobilize" data-cols-to-freeze=",2" id="results2022-2023111_overall"> <caption>Regular season Table</caption> <colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup> <thead> <tr> <th aria-label="Rank" class="poptip sort_default_asc center" data-stat="rank" data-tip="Squad finish in competition&lt;br&gt;Finish within the league or competition.&lt;br&gt;For knockout competitions may show final round reached.&lt;br&gt;Colors and arrows represent promotion/relegation or qualifiation for continental cups.&lt;br&gt;Trophy indicates team won league whether by playoffs or by leading the table.&lt;br&gt;Star indicates topped table in league USING another means of naming champion." scope="col">Rk</th> <th aria-label="Squad" class="poptip sort_default_asc center" data-stat="team" scope="col">Squad</th> <th aria-label="Matches Played" class="poptip center" data-st

In [8]:
links = league_table[0].find_all('a')

In [9]:
team_links = []
for link in links:
    temp = str(link)
    if 'squads' in temp:
        m = temp[9:60].split('"')
        team_links.append(m[0])

In [10]:
team_links

['/en/squads/d48ad4ff/Napoli-Stats',
 '/en/squads/dc56fe14/Milan-Stats',
 '/en/squads/e0652b02/Juventus-Stats',
 '/en/squads/d609edc0/Internazionale-Stats',
 '/en/squads/7213da33/Lazio-Stats',
 '/en/squads/922493f3/Atalanta-Stats',
 '/en/squads/cf74a709/Roma-Stats',
 '/en/squads/04eea015/Udinese-Stats',
 '/en/squads/105360fe/Torino-Stats',
 '/en/squads/421387cf/Fiorentina-Stats',
 '/en/squads/1d8099f8/Bologna-Stats',
 '/en/squads/21680aa4/Monza-Stats',
 '/en/squads/ffcbe334/Lecce-Stats',
 '/en/squads/a3d88bd8/Empoli-Stats',
 '/en/squads/68449f6d/Spezia-Stats',
 '/en/squads/c5577084/Salernitana-Stats',
 '/en/squads/e2befd26/Sassuolo-Stats',
 '/en/squads/0e72edf2/Hellas-Verona-Stats',
 '/en/squads/8ff9e3b3/Sampdoria-Stats',
 '/en/squads/9aad3a77/Cremonese-Stats']

The code above is a quick workaround: each element of `links` is a `bs4.element.Tag`, which cannot be sliced like a string. We therefore convert each tag to a string, slice out the part that follows `href="`, and use `.split()` to cut it off at the closing quote. The results are stored in a new list called `team_links`, as shown above.
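As an alternative to string slicing, the `href` attribute can be read directly from each tag. The sketch below runs on a tiny stand-in for the league table. The two squad links are copied from the output above, while the player link is a made-up example added only to show the filter working.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the league table; the player link is hypothetical.
html = """
<table class="stats_table">
  <tr><td><a href="/en/squads/d48ad4ff/Napoli-Stats">Napoli</a></td>
      <td><a href="/en/players/00000000/Example-Player">Example Player</a></td></tr>
  <tr><td><a href="/en/squads/dc56fe14/Milan-Stats">Milan</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select("table.stats_table")[0]

# Tag.get("href") returns the attribute value, or None if it is absent
hrefs = [a.get("href") for a in table.find_all("a")]
# keep only links that point at a squad page
team_links = [h for h in hrefs if h and "/squads/" in h]
print(team_links)
```

This avoids depending on the exact character positions inside the tag's string form.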

Now that we have the portion of the links for each team's page, we can easily combine them with the website's address and get complete URLs! The process is better illustrated by the code below:

In [11]:
team_links = ['https://fbref.com'+link for link in team_links]
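String concatenation works here because every link starts with `/`. A slightly more defensive sketch uses `urljoin` from the standard library, which handles the slash between base and path. The names `BASE_URL`, `relative_links` and `full_links` are illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://fbref.com"  # site root, as used in the notebook
relative_links = [
    "/en/squads/d48ad4ff/Napoli-Stats",
    "/en/squads/dc56fe14/Milan-Stats",
]

# urljoin resolves each relative path against the site root
full_links = [urljoin(BASE_URL, link) for link in relative_links]
print(full_links)
```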

In [12]:
team_links

['https://fbref.com/en/squads/d48ad4ff/Napoli-Stats',
 'https://fbref.com/en/squads/dc56fe14/Milan-Stats',
 'https://fbref.com/en/squads/e0652b02/Juventus-Stats',
 'https://fbref.com/en/squads/d609edc0/Internazionale-Stats',
 'https://fbref.com/en/squads/7213da33/Lazio-Stats',
 'https://fbref.com/en/squads/922493f3/Atalanta-Stats',
 'https://fbref.com/en/squads/cf74a709/Roma-Stats',
 'https://fbref.com/en/squads/04eea015/Udinese-Stats',
 'https://fbref.com/en/squads/105360fe/Torino-Stats',
 'https://fbref.com/en/squads/421387cf/Fiorentina-Stats',
 'https://fbref.com/en/squads/1d8099f8/Bologna-Stats',
 'https://fbref.com/en/squads/21680aa4/Monza-Stats',
 'https://fbref.com/en/squads/ffcbe334/Lecce-Stats',
 'https://fbref.com/en/squads/a3d88bd8/Empoli-Stats',
 'https://fbref.com/en/squads/68449f6d/Spezia-Stats',
 'https://fbref.com/en/squads/c5577084/Salernitana-Stats',
 'https://fbref.com/en/squads/e2befd26/Sassuolo-Stats',
 'https://fbref.com/en/squads/0e72edf2/Hellas-Verona-Stats',
 '

## Extract Relevant Data

We can use any link in our `team_links` list to extract data for that specific team.

In [13]:
napoli_url = team_links[0]

In [14]:
data = requests.get(napoli_url)

Again on this page we see that Scores & Fixtures, the table we are interested in, is contained in a table tag with a class called `stats_table`. Each row is a match and we need to extract this table into a `Pandas` dataframe.

In [15]:
import pandas as pd

In [16]:
matches = pd.read_html(data.text, match = 'Scores & Fixtures')
#we scan the page for tables using .read_html and then we provide 'Scores & Fixtures' as the specific table we want

The first element of the `matches` list should be the `Pandas` dataframe we want.

In [17]:
matches[0].head(3)

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2022-08-15,18:30,Serie A,Matchweek 1,Mon,Away,W,5.0,2.0,Hellas Verona,2.4,1.1,66.0,16967.0,Giovanni Di Lorenzo,4-3-3,Michael Fabbri,Match Report,
1,2022-08-21,18:30,Serie A,Matchweek 2,Sun,Home,W,4.0,0.0,Monza,2.0,0.1,54.0,36559.0,Giovanni Di Lorenzo,4-3-3,Francesco Fourneau,Match Report,
2,2022-08-28,20:45,Serie A,Matchweek 3,Sun,Away,D,0.0,0.0,Fiorentina,1.7,0.5,52.0,32286.0,Giovanni Di Lorenzo,4-2-3-1,Livio Marinelli,Match Report,


In [18]:
type(matches)

list

In [19]:
# now we can get the shooting stats from the team pages

In [20]:
soup = BeautifulSoup(data.text, 'html.parser')
links = soup.find_all('a')
links = [l.get("href") for l in links]
links = [l for l in links if l and 'all_comps/shooting/' in l]

In [21]:
data = requests.get(f"https://fbref.com{links[0]}")

In [22]:
shooting = pd.read_html(data.text, match="Shooting")[0]

In [23]:
shooting

Unnamed: 0_level_0,For Napoli,For Napoli,For Napoli,For Napoli,For Napoli,For Napoli,For Napoli,For Napoli,For Napoli,For Napoli,...,Standard,Standard,Standard,Standard,Expected,Expected,Expected,Expected,Expected,Unnamed: 25_level_0
Unnamed: 0_level_1,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2022-08-15,18:30,Serie A,Matchweek 1,Mon,Away,W,5,2,Hellas Verona,...,15.2,0,0,0,2.4,2.4,0.1,2.6,2.6,Match Report
1,2022-08-21,18:30,Serie A,Matchweek 2,Sun,Home,W,4,0,Monza,...,15.3,1,0,0,2.0,2.0,0.09,2.0,2.0,Match Report
2,2022-08-28,20:45,Serie A,Matchweek 3,Sun,Away,D,0,0,Fiorentina,...,14.7,1,0,0,1.7,1.7,0.13,-1.7,-1.7,Match Report
3,2022-08-31,20:45,Serie A,Matchweek 4,Wed,Home,D,1,1,Lecce,...,17.7,0,0,0,1.7,1.7,0.09,-0.7,-0.7,Match Report
4,2022-09-03,20:45,Serie A,Matchweek 5,Sat,Away,W,2,1,Lazio,...,16.1,0,0,0,2.1,2.1,0.11,-0.1,-0.1,Match Report
5,2022-09-07,21:00,Champions Lg,Group stage,Wed,Home,W,4,1,eng Liverpool,...,17.5,0,1,2,4.0,2.5,0.16,0.0,0.5,Match Report
6,2022-09-10,15:00,Serie A,Matchweek 6,Sat,Home,W,1,0,Spezia,...,17.8,0,0,0,2.2,2.2,0.08,-1.2,-1.2,Match Report
7,2022-09-14,20:00,Champions Lg,Group stage,Wed,Away,W,3,0,sct Rangers,...,20.0,0,1,2,3.5,2.0,0.09,-0.5,0.0,Match Report
8,2022-09-18,20:45,Serie A,Matchweek 7,Sun,Away,W,2,1,Milan,...,20.7,1,1,1,1.4,0.6,0.07,0.6,0.4,Match Report
9,2022-10-01,15:00,Serie A,Matchweek 8,Sat,Home,W,3,1,Torino,...,18.3,0,0,0,1.6,1.6,0.11,1.4,1.4,Match Report


In [24]:
# to drop the multi level index
shooting.columns = shooting.columns.droplevel()

In [25]:
shooting.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2022-08-15,18:30,Serie A,Matchweek 1,Mon,Away,W,5,2,Hellas Verona,...,15.2,0,0,0,2.4,2.4,0.1,2.6,2.6,Match Report
1,2022-08-21,18:30,Serie A,Matchweek 2,Sun,Home,W,4,0,Monza,...,15.3,1,0,0,2.0,2.0,0.09,2.0,2.0,Match Report
2,2022-08-28,20:45,Serie A,Matchweek 3,Sun,Away,D,0,0,Fiorentina,...,14.7,1,0,0,1.7,1.7,0.13,-1.7,-1.7,Match Report
3,2022-08-31,20:45,Serie A,Matchweek 4,Wed,Home,D,1,1,Lecce,...,17.7,0,0,0,1.7,1.7,0.09,-0.7,-0.7,Match Report
4,2022-09-03,20:45,Serie A,Matchweek 5,Sat,Away,W,2,1,Lazio,...,16.1,0,0,0,2.1,2.1,0.11,-0.1,-0.1,Match Report
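To see what dropping the top header level does, here is a minimal sketch on a hand-built frame with the same two-level column layout. The frame is a small illustration with values taken from the first Napoli row, not the scraped table itself.

```python
import pandas as pd

# Two header rows, mimicking the scraped shooting table
columns = pd.MultiIndex.from_tuples([
    ("For Napoli", "Date"),
    ("Standard", "Sh"),
    ("Standard", "SoT"),
    ("Expected", "xG"),
])
shooting = pd.DataFrame([["2022-08-15", 25, 8, 2.4]], columns=columns)

# droplevel() with no argument removes level 0, i.e. the top header row
shooting.columns = shooting.columns.droplevel()
print(list(shooting.columns))
```

After the call, columns such as `Date` and `Sh` can be selected by a single name.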


## Merging dataframes together

We can now merge the dataframe holding the match results with the shooting stats dataframe for Napoli, using the match date as the key.

In [26]:
matches = matches[0]

In [27]:
merged_data = matches.merge(shooting[["Date", "Sh", "SoT", "Dist", "PKatt", "G-xG"]], on="Date")

In [28]:
merged_data.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Captain,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,PKatt,G-xG
0,2022-08-15,18:30,Serie A,Matchweek 1,Mon,Away,W,5.0,2.0,Hellas Verona,...,Giovanni Di Lorenzo,4-3-3,Michael Fabbri,Match Report,,25,8,15.2,0,2.6
1,2022-08-21,18:30,Serie A,Matchweek 2,Sun,Home,W,4.0,0.0,Monza,...,Giovanni Di Lorenzo,4-3-3,Francesco Fourneau,Match Report,,22,5,15.3,0,2.0
2,2022-08-28,20:45,Serie A,Matchweek 3,Sun,Away,D,0.0,0.0,Fiorentina,...,Giovanni Di Lorenzo,4-2-3-1,Livio Marinelli,Match Report,,13,2,14.7,0,-1.7
3,2022-08-31,20:45,Serie A,Matchweek 4,Wed,Home,D,1.0,1.0,Lecce,...,Giovanni Di Lorenzo,4-2-3-1,Matteo Marcenaro,Match Report,,19,7,17.7,0,-0.7
4,2022-09-03,20:45,Serie A,Matchweek 5,Sat,Away,W,2.0,1.0,Lazio,...,Giovanni Di Lorenzo,4-2-3-1,Simone Sozza,Match Report,,19,7,16.1,0,-0.1
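Note that `merge` performs an inner join by default, so a match with no row in the shooting table is silently dropped. The sketch below, on small made-up frames, contrasts this with a left join. Using `how="left"` and `validate` is a suggestion, not what the notebook does.

```python
import pandas as pd

matches = pd.DataFrame({
    "Date": ["2022-08-15", "2022-08-21", "2022-08-28"],
    "Result": ["W", "W", "D"],
})
# The shooting log is deliberately missing the third date
shooting = pd.DataFrame({
    "Date": ["2022-08-15", "2022-08-21"],
    "Sh": [25, 22],
})

# default how="inner": only dates present in both frames survive
inner = matches.merge(shooting, on="Date")
# how="left" keeps every match; validate raises if a date is duplicated
left = matches.merge(shooting, on="Date", how="left", validate="one_to_one")

print(len(inner), len(left))
print(left["Sh"].isna().sum())
```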


## Combining Previous Steps to Extract Data for All Teams

Now we can use a for loop to repeat the steps above for every team, and run it once per season. We will be scraping the ongoing season as well as the two preceding seasons (2021-22 and 2020-21).

In [29]:
seasons = list(range(2023, 2020, -1))

In [30]:
seasons

[2023, 2022, 2021]

In [31]:
#list of dataframes, each holding the match logs of one team for one season
all_fixtures = []

Previously, we went to the Serie A league table stats page and took the team URLs for the current season. To scrape older results, we follow the 'Previous Season' button on that page and repeat the same team loop with each season's league URL.
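The league URLs for the three seasons follow a visible pattern, so they could be generated from the `seasons` list rather than typed by hand. The sketch below is a hypothetical helper (`season_url` and the constants are not part of the notebook), and the pattern is inferred only from the three URLs used in this notebook.

```python
CURRENT_SEASON = 2023  # seasons are labelled by the year in which they end
BASE = "https://fbref.com/en/comps/11"


def season_url(end_year, current=CURRENT_SEASON):
    """Return the Serie A league-table URL for the season ending in end_year."""
    if end_year == current:
        # the ongoing season lives at the undated URL
        return f"{BASE}/Serie-A-Stats"
    span = f"{end_year - 1}-{end_year}"
    return f"{BASE}/{span}/{span}-Serie-A-Stats"


seasons = list(range(2023, 2020, -1))
for season in seasons:
    print(season, season_url(season))
```

An outer `for season in seasons:` loop around the team loop could then use `season_url(season)` as `league_url` and set `merged_data['Season'] = season`.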

In [32]:
# we start with the same URL which we initially used for scraping the links of individual teams
league_url = 'https://fbref.com/en/comps/11/Serie-A-Stats'

In [35]:
import time

data = requests.get(league_url)
soup = BeautifulSoup(data.text, 'html.parser')
standings_table = soup.select('table.stats_table')[0] #first table holds the standings with links to all teams

#get href 
links = [l.get('href') for l in standings_table.find_all('a')]
#filter for squad links (skipping any anchor without an href)
links = [l for l in links if l and '/squads/' in l]
#turn into website links 
team_urls = [f"https://fbref.com{l}" for l in links]

#loop through each of the team urls
#this is to individually scrape the match logs for each team
for team_url in team_urls:
    team = team_url.split('/')[-1].replace('-Stats', '').replace('-', ' ')
        
    #request the team page to get the Scores & Fixtures table
    data = requests.get(team_url)
    matches = pd.read_html(data.text, match='Scores & Fixtures')[0]
    soup = BeautifulSoup(data.text, 'html.parser')
    links = [l.get("href") for l in soup.find_all('a')]
    links = [l for l in links if l and 'all_comps/shooting/' in l]
    data = requests.get(f"https://fbref.com{links[0]}")
    shooting = pd.read_html(data.text, match="Shooting")[0]
    shooting.columns = shooting.columns.droplevel()
        
    try:
        merged_data = matches.merge(shooting[["Date", "Sh", "SoT", "Dist", "PKatt", "G-xG"]], on="Date")
    except ValueError:
        continue
        
        
    merged_data['Season'] = 2023
    merged_data['Team'] = team
    #append merged_data for each iteration to the all_fixtures list
    all_fixtures.append(merged_data)
        
    #pauses to avoid getting blocked by FBref
    time.sleep(10)
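The team name in the loop above is derived from the last segment of the team URL. As a small sketch, the same logic can be wrapped in a function and checked on two of the URLs collected earlier. The function name `team_name_from_url` is illustrative.

```python
def team_name_from_url(team_url):
    """Turn '.../Hellas-Verona-Stats' into 'Hellas Verona'."""
    slug = team_url.split("/")[-1]  # last path segment, e.g. 'Napoli-Stats'
    return slug.replace("-Stats", "").replace("-", " ")


print(team_name_from_url("https://fbref.com/en/squads/d48ad4ff/Napoli-Stats"))
print(team_name_from_url("https://fbref.com/en/squads/0e72edf2/Hellas-Verona-Stats"))
```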

In [36]:
#combine all dataframes into one
match_df = pd.concat(all_fixtures)

In [37]:
match_df.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Referee,Match Report,Notes,Sh,SoT,Dist,PKatt,G-xG,Season,Team
0,2022-08-15,18:30,Serie A,Matchweek 1,Mon,Away,W,5.0,2.0,Hellas Verona,...,Michael Fabbri,Match Report,,25.0,8.0,15.2,0,2.6,2023,Napoli
1,2022-08-21,18:30,Serie A,Matchweek 2,Sun,Home,W,4.0,0.0,Monza,...,Francesco Fourneau,Match Report,,22.0,5.0,15.3,0,2.0,2023,Napoli
2,2022-08-28,20:45,Serie A,Matchweek 3,Sun,Away,D,0.0,0.0,Fiorentina,...,Livio Marinelli,Match Report,,13.0,2.0,14.7,0,-1.7,2023,Napoli
3,2022-08-31,20:45,Serie A,Matchweek 4,Wed,Home,D,1.0,1.0,Lecce,...,Matteo Marcenaro,Match Report,,19.0,7.0,17.7,0,-0.7,2023,Napoli
4,2022-09-03,20:45,Serie A,Matchweek 5,Sat,Away,W,2.0,1.0,Lazio,...,Simone Sozza,Match Report,,19.0,7.0,16.1,0,-0.1,2023,Napoli


In [38]:
match_df['Date'].value_counts()

2023-01-04    20
2022-10-09    12
2022-10-02    12
2022-09-11    12
2022-11-13    12
              ..
2023-01-10     1
2022-08-18     1
2022-08-25     1
2022-09-13     1
2022-10-18     1
Name: Date, Length: 83, dtype: int64

In [39]:
match_df['Round'].value_counts()

Group stage       42
Matchweek 1       20
Matchweek 10      20
Matchweek 17      20
Matchweek 16      20
Matchweek 15      20
Matchweek 14      20
Matchweek 13      20
Matchweek 12      20
Matchweek 2       20
Matchweek 11      20
Matchweek 9       20
Matchweek 8       20
Matchweek 7       20
Matchweek 6       20
Matchweek 5       20
Matchweek 4       20
Matchweek 3       20
Matchweek 18      18
First round       12
Second round       7
Round of 16        6
Play-off round     2
Name: Round, dtype: int64

In [40]:
match_df['Team'].value_counts()

Fiorentina        27
Internazionale    25
Roma              25
Milan             25
Napoli            24
Juventus          24
Lazio             24
Torino            21
Monza             20
Sampdoria         20
Spezia            20
Bologna           20
Udinese           20
Cremonese         20
Lecce             19
Salernitana       19
Sassuolo          19
Hellas Verona     19
Empoli            18
Atalanta          18
Name: Team, dtype: int64

In [41]:
league_url = 'https://fbref.com/en/comps/11/2021-2022/2021-2022-Serie-A-Stats'
all_fixtures = []

In [42]:
data = requests.get(league_url)
soup = BeautifulSoup(data.text, 'html.parser')
standings_table = soup.select('table.stats_table')[0] #first table holds the standings with links to all teams

#get href 
links = [l.get('href') for l in standings_table.find_all('a')]
#filter for squad links (skipping any anchor without an href)
links = [l for l in links if l and '/squads/' in l]
#turn into website links 
team_urls = [f"https://fbref.com{l}" for l in links]

#loop through each of the team urls
#this is to individually scrape the match logs for each team
for team_url in team_urls:
    team = team_url.split('/')[-1].replace('-Stats', '').replace('-', ' ')
        
    #request the team page to get the Scores & Fixtures table
    data = requests.get(team_url)
    matches = pd.read_html(data.text, match='Scores & Fixtures')[0]
    soup = BeautifulSoup(data.text, 'html.parser')
    links = [l.get("href") for l in soup.find_all('a')]
    links = [l for l in links if l and 'all_comps/shooting/' in l]
    data = requests.get(f"https://fbref.com{links[0]}")
    shooting = pd.read_html(data.text, match="Shooting")[0]
    shooting.columns = shooting.columns.droplevel()
        
    try:
        merged_data = matches.merge(shooting[["Date", "Sh", "SoT", "Dist", "PKatt", "G-xG"]], on="Date")
    except ValueError:
        continue
        
        
    merged_data['Season'] = 2022
    merged_data['Team'] = team
    #append merged_data for each iteration to the all_fixtures list
    all_fixtures.append(merged_data)
        
    #pauses to avoid getting blocked by FBref
    time.sleep(10)

In [43]:
match_df_1 = pd.concat(all_fixtures)

In [44]:
match_df_1.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Referee,Match Report,Notes,Sh,SoT,Dist,PKatt,G-xG,Season,Team
0,2021-08-23,20:45,Serie A,Matchweek 1,Mon,Away,W,1,0,Sampdoria,...,Marco Guida,Match Report,,11.0,4.0,19.7,0,-0.2,2022,Milan
1,2021-08-29,20:45,Serie A,Matchweek 2,Sun,Home,W,4,1,Cagliari,...,Marco Serra,Match Report,,17.0,4.0,19.0,1,1.1,2022,Milan
2,2021-09-12,18:00,Serie A,Matchweek 3,Sun,Home,W,2,0,Lazio,...,Daniele Chiffi,Match Report,,18.0,3.0,20.0,1,-1.0,2022,Milan
3,2021-09-15,20:00,Champions Lg,Group stage,Wed,Away,L,2,3,eng Liverpool,...,Szymon Marciniak,Match Report,,7.0,3.0,13.4,0,0.4,2022,Milan
4,2021-09-19,20:45,Serie A,Matchweek 4,Sun,Away,D,1,1,Juventus,...,Daniele Doveri,Match Report,,13.0,3.0,23.0,0,0.0,2022,Milan


In [45]:
match_df_1['Team'].value_counts()

Roma              55
Juventus          52
Atalanta          52
Internazionale    52
Milan             48
Lazio             48
Napoli            47
Fiorentina        44
Empoli            41
Genoa             41
Cagliari          41
Sampdoria         41
Venezia           41
Udinese           41
Spezia            40
Salernitana       40
Torino            40
Hellas Verona     40
Sassuolo          40
Bologna           39
Name: Team, dtype: int64

In [46]:
league_url = 'https://fbref.com/en/comps/11/2020-2021/2020-2021-Serie-A-Stats'
all_fixtures = []

In [47]:
data = requests.get(league_url)
soup = BeautifulSoup(data.text, 'html.parser')
standings_table = soup.select('table.stats_table')[0] #first table holds the standings with links to all teams

#get href 
links = [l.get('href') for l in standings_table.find_all('a')]
#filter for squad links (skipping any anchor without an href)
links = [l for l in links if l and '/squads/' in l]
#turn into website links 
team_urls = [f"https://fbref.com{l}" for l in links]

#loop through each of the team urls
#this is to individually scrape the match logs for each team
for team_url in team_urls:
    team = team_url.split('/')[-1].replace('-Stats', '').replace('-', ' ')
        
    #request the team page to get the Scores & Fixtures table
    data = requests.get(team_url)
    matches = pd.read_html(data.text, match='Scores & Fixtures')[0]
    soup = BeautifulSoup(data.text, 'html.parser')
    links = [l.get("href") for l in soup.find_all('a')]
    links = [l for l in links if l and 'all_comps/shooting/' in l]
    data = requests.get(f"https://fbref.com{links[0]}")
    shooting = pd.read_html(data.text, match="Shooting")[0]
    shooting.columns = shooting.columns.droplevel()
        
    try:
        merged_data = matches.merge(shooting[["Date", "Sh", "SoT", "Dist", "PKatt", "G-xG"]], on="Date")
    except ValueError:
        continue
        
        
    merged_data['Season'] = 2021
    merged_data['Team'] = team
    #append merged_data for each iteration to the all_fixtures list
    all_fixtures.append(merged_data)
        
    #pauses to avoid getting blocked by FBref
    time.sleep(10)

In [48]:
match_df_2 = pd.concat(all_fixtures)

In [49]:
match_df_2.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Referee,Match Report,Notes,Sh,SoT,Dist,PKatt,G-xG,Season,Team
0,2020-09-26,20:45,Serie A,Matchweek 2,Sat,Home,W,4,3,Fiorentina,...,Giampaolo Calvarese,Match Report,,21.0,8.0,18.3,0,0.7,2021,Internazionale
1,2020-09-30,18:00,Serie A,Matchweek 1,Wed,Away,W,5,2,Benevento,...,Marco Piccinini,Match Report,,18.0,8.0,15.5,0,1.4,2021,Internazionale
2,2020-10-04,15:00,Serie A,Matchweek 3,Sun,Away,D,1,1,Lazio,...,Marco Guida,Match Report,,12.0,3.0,16.9,0,0.4,2021,Internazionale
3,2020-10-17,18:00,Serie A,Matchweek 4,Sat,Home,L,1,2,Milan,...,Maurizio Mariani,Match Report,,18.0,5.0,13.7,0,-1.8,2021,Internazionale
4,2020-10-21,21:00,Champions Lg,Group stage,Wed,Home,D,2,2,de M'Gladbach,...,Björn Kuipers,Match Report,,17.0,4.0,12.2,0,-0.8,2021,Internazionale


In [50]:
match_df_2['Team'].value_counts()

Roma              53
Milan             53
Juventus          52
Atalanta          51
Napoli            51
Internazionale    48
Lazio             48
Spezia            42
Fiorentina        41
Torino            41
Cagliari          41
Genoa             41
Parma             41
Bologna           40
Udinese           40
Hellas Verona     40
Sampdoria         40
Sassuolo          39
Benevento         39
Crotone           39
Name: Team, dtype: int64

In [None]:
#finally we can create a csv with the concatenated match log dataframe with additional stats this time
#match_df.to_csv('italian_footy.csv')

In [52]:
match_df_2.shape

(880, 26)

In [53]:
match_df_1.shape

(883, 26)

In [56]:
data = [match_df_2, match_df_1]
df = pd.concat(data)
df

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Referee,Match Report,Notes,Sh,SoT,Dist,PKatt,G-xG,Season,Team
0,2020-09-26,20:45,Serie A,Matchweek 2,Sat,Home,W,4,3,Fiorentina,...,Giampaolo Calvarese,Match Report,,21.0,8.0,18.3,0,0.7,2021,Internazionale
1,2020-09-30,18:00,Serie A,Matchweek 1,Wed,Away,W,5,2,Benevento,...,Marco Piccinini,Match Report,,18.0,8.0,15.5,0,1.4,2021,Internazionale
2,2020-10-04,15:00,Serie A,Matchweek 3,Sun,Away,D,1,1,Lazio,...,Marco Guida,Match Report,,12.0,3.0,16.9,0,0.4,2021,Internazionale
3,2020-10-17,18:00,Serie A,Matchweek 4,Sat,Home,L,1,2,Milan,...,Maurizio Mariani,Match Report,,18.0,5.0,13.7,0,-1.8,2021,Internazionale
4,2020-10-21,21:00,Champions Lg,Group stage,Wed,Home,D,2,2,de M'Gladbach,...,Björn Kuipers,Match Report,,17.0,4.0,12.2,0,-0.8,2021,Internazionale
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36,2022-05-01,12:30,Serie A,Matchweek 35,Sun,Away,L,1,2,Juventus,...,Alessandro Prontera,Match Report,,9.0,3.0,21.0,0,0.4,2022,Venezia
37,2022-05-05,18:00,Serie A,Matchweek 20,Thu,Away,L,1,2,Salernitana,...,Maurizio Mariani,Match Report,,10.0,4.0,15.0,0,-0.4,2022,Venezia
38,2022-05-08,15:00,Serie A,Matchweek 36,Sun,Home,W,4,3,Bologna,...,Livio Marinelli,Match Report,,16.0,7.0,19.3,2,0.8,2022,Venezia
39,2022-05-14,20:45,Serie A,Matchweek 37,Sat,Away,D,1,1,Roma,...,Simone Sozza,Match Report,,4.0,2.0,25.7,0,0.7,2022,Venezia


In [57]:
data = [df, match_df]
all_data = pd.concat(data)
all_data

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Referee,Match Report,Notes,Sh,SoT,Dist,PKatt,G-xG,Season,Team
0,2020-09-26,20:45,Serie A,Matchweek 2,Sat,Home,W,4,3,Fiorentina,...,Giampaolo Calvarese,Match Report,,21.0,8.0,18.3,0,0.7,2021,Internazionale
1,2020-09-30,18:00,Serie A,Matchweek 1,Wed,Away,W,5,2,Benevento,...,Marco Piccinini,Match Report,,18.0,8.0,15.5,0,1.4,2021,Internazionale
2,2020-10-04,15:00,Serie A,Matchweek 3,Sun,Away,D,1,1,Lazio,...,Marco Guida,Match Report,,12.0,3.0,16.9,0,0.4,2021,Internazionale
3,2020-10-17,18:00,Serie A,Matchweek 4,Sat,Home,L,1,2,Milan,...,Maurizio Mariani,Match Report,,18.0,5.0,13.7,0,-1.8,2021,Internazionale
4,2020-10-21,21:00,Champions Lg,Group stage,Wed,Home,D,2,2,de M'Gladbach,...,Björn Kuipers,Match Report,,17.0,4.0,12.2,0,-0.8,2021,Internazionale
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15,2022-11-08,20:45,Serie A,Matchweek 14,Tue,Home,D,0.0,0.0,Milan,...,Antonio Rapuano,Match Report,,6.0,1.0,20.9,0,-0.3,2023,Cremonese
16,2022-11-11,20:45,Serie A,Matchweek 15,Fri,Away,L,0.0,2.0,Empoli,...,Federico Dionisi,Match Report,,20.0,5.0,14.1,0,-1.3,2023,Cremonese
17,2023-01-04,18:30,Serie A,Matchweek 16,Wed,Home,L,0.0,1.0,Juventus,...,Giovanni Ayroldi,Match Report,,13.0,3.0,18.4,0,-0.9,2023,Cremonese
18,2023-01-09,18:30,Serie A,Matchweek 17,Mon,Away,L,0.0,2.0,Hellas Verona,...,Maurizio Mariani,Match Report,,14.0,5.0,21.3,0,-0.8,2023,Cremonese


In [58]:
all_data.to_csv('football_matches.csv')
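The two-step concatenation above keeps each team's original row numbers, so index values repeat across teams and seasons. A possible variation, sketched here on tiny stand-in frames rather than the scraped data, combines all three seasons in one call and renumbers the rows.

```python
import pandas as pd

# Stand-ins for the three per-season frames built above
match_df_2 = pd.DataFrame({"Date": ["2020-09-26"], "Season": [2021]})
match_df_1 = pd.DataFrame({"Date": ["2021-08-23"], "Season": [2022]})
match_df = pd.DataFrame({"Date": ["2022-08-15"], "Season": [2023]})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
all_data = pd.concat([match_df_2, match_df_1, match_df], ignore_index=True)

# to_csv returns the CSV text when no path is given;
# index=False leaves the row numbers out of the file
csv_text = all_data.to_csv(index=False)
print(csv_text)
```

Whether to keep the index in the CSV is a design choice; dropping it avoids an extra unnamed column when the file is read back with `pd.read_csv`.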