The goal of this notebook is to web-scrape MLS data for at least the 2022 through 2024 seasons.

I found a useful dataset on Kaggle covering the years 1996 - 2022.
I included 2022 in the web-scraping so I can check how the scraped data
lines up with that existing data. If the two fit together, I can save some time
on scraping. However, further exploration is required.

# Essential Libraries + Other

In [1]:
%load_ext autoreload
%autoreload 2

# necessary imports 
import configparser
import os
import sys
import pandas as pd
import sqlite3
import datetime

from pathlib import Path

# for my custom functions
sys.path.insert(0, '../')
import src.data_extraction as de

In [2]:
# pull some variables, paths, and other from a central config.ini file
config = configparser.ConfigParser()
config.read('../src/config.ini')

['../src/config.ini']
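
For reference, `config.read` returns the list of files it managed to parse, which is why the cell above prints the path. The sketch below shows the kind of file layout the later `config['paths']['output']` lookup expects. The `[paths]` section and `output` key come from this notebook; the path value and the `*_example` names are placeholders I made up for illustration, not the contents of the real config.ini.

```python
import configparser
from pathlib import Path

# Hypothetical config.ini contents; the real file may hold other sections and keys.
EXAMPLE_CONFIG = """
[paths]
output = ../data/output
"""

config_example = configparser.ConfigParser()
# read_string parses text directly, so no file on disk is needed for this sketch
config_example.read_string(EXAMPLE_CONFIG)

# Same lookup pattern the notebook uses for its output directory
output_example = Path(config_example["paths"]["output"])
```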

In [3]:
# for file saving
today = datetime.datetime.now()
today = today.strftime("%Y_%m_%d")

# the output path is specified in the config.ini file
output = Path(config['paths']['output'])

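# I plan on at least collecting data from 2022 to 2024
yearly_directories = [output / f"mls_{year}" for year in range(2022, 2025)]

# create the output directory and yearly sub-directories if they don't exist
# parents=True also creates any missing parent directory (including output itself),
# and exist_ok=True makes the call a no-op when the directory is already there
for directory in yearly_directories + [output]:
    directory.mkdir(parents=True, exist_ok=True)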

# Current Season - 2024

## Setup
Run this only on your first time running this file.

In [4]:
# start with a url that looks like this page. It will automatically grab/generate associated URLs
# for each team and player
base_url = 'https://fbref.com/en/comps/22/Major-League-Soccer-Stats'

In [None]:
# Pulls team names, player names, and the URLs needed for the season
# Both all_teams_df and all_players_df are saved with the current date appended to the file name

# year = current season year
# all_teams_df = dataframe of team names and team URLs for the season
# all_players_df = dataframe of player names, positions, teams, and player URLs for the season
year, all_teams_df, all_players_df = de.get_teams_and_players(base_url)

In [6]:
year

'2024'

In [7]:
all_teams_df

Unnamed: 0,team,team_url,season
0,Inter Miami,https://fbref.com/en/squads/cb8b86a2/Inter-Mia...,2024
1,Columbus Crew,https://fbref.com/en/squads/529ba333/Columbus-...,2024
2,FC Cincinnati,https://fbref.com/en/squads/e9ea41b2/FC-Cincin...,2024
3,Orlando City,https://fbref.com/en/squads/46ef01d0/Orlando-C...,2024
4,Charlotte,https://fbref.com/en/squads/eb57545a/Charlotte...,2024
5,NYCFC,https://fbref.com/en/squads/64e81410/New-York-...,2024
6,NY Red Bulls,https://fbref.com/en/squads/69a0fb10/New-York-...,2024
7,CF Montréal,https://fbref.com/en/squads/fc22273c/CF-Montre...,2024
8,Atlanta Utd,https://fbref.com/en/squads/1ebc1a5b/Atlanta-U...,2024
9,D.C. United,https://fbref.com/en/squads/44117292/DC-United...,2024


In [8]:
all_players_df

Unnamed: 0,player_name,player_url,position,team,season
0,Drake Callender,https://fbref.com/en/players/c4d9567d/Drake-Ca...,GK,Inter Miami,2024
1,Julian Gressel,https://fbref.com/en/players/acd47bc0/Julian-G...,"MF,FW",Inter Miami,2024
2,Sergio Busquets,https://fbref.com/en/players/5ab0ea87/Sergio-B...,"MF,DF",Inter Miami,2024
3,Tomás Avilés,https://fbref.com/en/players/f51b9ae1/Tomas-Av...,DF,Inter Miami,2024
4,Jordi Alba,https://fbref.com/en/players/4601e194/Jordi-Alba,DF,Inter Miami,2024
...,...,...,...,...,...
938,Beau Leroux,https://fbref.com/en/players/6a9ab308/Beau-Leroux,MF,SJ Earthquakes,2024
939,Riley Lynch,https://fbref.com/en/players/9eb24ed6/Riley-Lynch,FW,SJ Earthquakes,2024
940,Cruz Medina,https://fbref.com/en/players/89d44509/Cruz-Medina,MF,SJ Earthquakes,2024
941,Emi Ochoa,https://fbref.com/en/players/30a08779/Emi-Ochoa,GK,SJ Earthquakes,2024


Only run this portion if you want a SQL db

In [None]:
# The above dataframes are saved as csv files, but I will add them to a db
# file as well here to refresh my SQL skills at some point.
# I'm also doing this to share the db with some friends since they're more familiar
# with SQL than Python.

# set up connection
# con = sqlite3.connect(output / 'mls.db')
# cur = con.cursor()

In [None]:
# if_exists is set to 'append' since I can always remove duplicates if needed
# I don't want to accidentally replace the entire table
# all_teams_df.to_sql(name='teams', con=con, if_exists='append', index=False)
# all_players_df.to_sql(name='players', con=con, if_exists='append', index=False)

29
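
Since the tables are written with `if_exists='append'`, re-running the cell stacks the same rows again. Here is a minimal sketch of how that plays out and how duplicates could be dropped after reading the table back. It uses an in-memory database and a toy dataframe with placeholder URLs, not the real `mls.db` or `all_teams_df`.

```python
import sqlite3

import pandas as pd

# Toy stand-in for all_teams_df; the team names and URLs are placeholders.
teams = pd.DataFrame({
    "team": ["Team A", "Team B"],
    "team_url": ["https://example.com/a", "https://example.com/b"],
    "season": [2024, 2024],
})

# In-memory database so the sketch leaves no file behind
con_example = sqlite3.connect(":memory:")

# Appending twice simulates running the setup cell on two different days
teams.to_sql(name="teams", con=con_example, if_exists="append", index=False)
teams.to_sql(name="teams", con=con_example, if_exists="append", index=False)

# Read everything back, then drop rows that are exact repeats
stored = pd.read_sql_query("SELECT * FROM teams", con_example)
deduplicated = stored.drop_duplicates().reset_index(drop=True)
con_example.close()
```

Dropping duplicates on the pandas side keeps the table itself untouched, which fits the goal of never replacing it by accident.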

## Scraping
Each team takes about 30-40 minutes to scrape since I am grabbing data from several URLs per player.

It takes this long because the website limits bots to only a few calls per minute. I had to
use *time.sleep()* to delay each extraction so I don't get temporarily or permanently banned :(
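
The delay idea can be sketched roughly as below. This is not the code in `src.data_extraction`; `fetch_all`, its `fetch` argument, and the 6-second default are hypothetical names and values I picked for illustration. Check the site's own bot policy for the real limit.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              delay_seconds: float = 6.0) -> Dict[str, str]:
    """Call fetch(url) for every URL, sleeping between consecutive calls.

    delay_seconds is an assumed placeholder value, not the site's documented limit.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

Passing the download function in as `fetch` keeps the throttling separate from the parsing, so the delay can be tuned without touching the rest of the scraper.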

For each player, it currently grabs the stats from each of the following tables on the web page:
* Summary
* Passing
* Pass Types
* Goal and Shot Creation
* Defensive Actions
* Possession
* Miscellaneous Stats

As a result, I am running the scraping in chunks to make sure everything goes smoothly since it will take hours.
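
One way to split the teams into chunks is sketched here. `chunked` and the chunk size of 3 are my own illustrative choices; the team names are the first six from `all_teams_df` above.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each holding at most size entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# First six teams from all_teams_df, used here only as sample input
sample_teams = [
    "Inter Miami", "Columbus Crew", "FC Cincinnati",
    "Orlando City", "Charlotte", "NYCFC",
]

team_chunks = list(chunked(sample_teams, 3))
```

Each chunk can then be scraped and saved on its own, so a failure partway through only costs one chunk.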

## Progress

* ~~0: Inter Miami~~ - Initial Data Extraction Done
* ~~1: Columbus Crew~~ - Initial Data Extraction Done
* ~~2: FC Cincinnati~~ - Initial Data Extraction Done
* Orlando City
* Charlotte
* NYCFC
* NY Red Bulls
* CF Montréal
* Atlanta Utd
* D.C. United
* Toronto FC
* Philadelphia Union
* Nashville SC
* NE Revolution
* Chicago Fire
* LAFC
* LA Galaxy
* Real Salt Lake
* Seattle Sounders FC
* Houston Dynamo
* Minnesota Utd
* Colorado Rapids
* Vancouver W'caps
* Portland Timbers
* Austin
* FC Dallas
* St. Louis
* Sporting KC
* SJ Earthquakes

In [None]:
# Re-run this cell at the start of each session so the setup above does not have to be rerun
all_players_df = pd.read_csv(output / 'mls_2024/all_players_2024_12_08.csv')
all_teams_df = pd.read_csv(output / 'mls_2024/all_teams_2024_12_08.csv')
year = 2024

In [5]:
current_teams = list(all_teams_df['team'].iloc[:3])
# keep only the players on the teams in the current chunk
current_players = all_players_df[all_players_df['team'].isin(current_teams)]

In [6]:
all_players_stats_df, failed_links = de.get_all_players_data(current_players, year)

on 5: Could not obtain data for Marcelo-Weigandt                                
on 31: Could not obtain data for Israel-Boatwright                              
on 32: Could not obtain data for Ryan-Carmichael                                
on 33: Could not obtain data for Tyler-Hall                                     
on 34: Could not obtain data for Cole-Jensen                                    
on 35: Could not obtain data for Ricardo-Montenegro                             
on 65: Could not obtain data for Giorgio-De-Libera                              
on 66: Could not obtain data for Cole-Johnson                                   
on 67: Could not obtain data for Owen-Presthus                                  
on 68: Could not obtain data for Gibran-Rayo                                    
on 69: Could not obtain data for Stanislau-Lapkies                              
on 99: Could not obtain data for Nico-Benalcazar                                
on 100: Could not obtain dat

In [8]:
# look up the players whose URLs failed during extraction
failed_players = all_players_df[all_players_df['player_url'].isin(failed_links)]
failed_players

Unnamed: 0,player_name,player_url,position,team,season
5,Marcelo Weigandt,https://fbref.com/en/players/ac106741/Marcelo-...,"DF,MF",Inter Miami,2024
31,Israel Boatwright,https://fbref.com/en/players/418e8e85/Israel-B...,DF,Inter Miami,2024
32,Ryan Carmichael,https://fbref.com/en/players/5492c7c6/Ryan-Car...,MF,Inter Miami,2024
33,Tyler Hall,https://fbref.com/en/players/36f876d9/Tyler-Hall,DF,Inter Miami,2024
34,Cole Jensen,https://fbref.com/en/players/129d3be2/Cole-Jensen,GK,Inter Miami,2024
35,Ricardo Montenegro,https://fbref.com/en/players/896d2ff1/Ricardo-...,DF,Inter Miami,2024
65,Giorgio De Libera,https://fbref.com/en/players/1a97e3ce/Giorgio-...,MF,Columbus Crew,2024
66,Cole Johnson,https://fbref.com/en/players/cfe2f825/Cole-Joh...,GK,Columbus Crew,2024
67,Owen Presthus,https://fbref.com/en/players/e8572636/Owen-Pre...,DF,Columbus Crew,2024
68,Gibran Rayo,https://fbref.com/en/players/ed4a5d57/Gibran-Rayo,FW,Columbus Crew,2024


In [10]:
con = sqlite3.connect(output / 'mls.db')
cur = con.cursor()

all_players_stats_df.to_sql(name='player_stats', con=con, if_exists='append', index=False)
all_players_stats_df.to_csv(output/'mls_2024/player_stats_2024_12_08.csv')
failed_players.to_sql(name='failed_extractions', con=con, if_exists='append', index=False)
failed_players.to_csv(output/'mls_2024/failed_extractions_2024_12_08.csv')

OperationalError: duplicate column name: summary_att

In [62]:
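# First attempt at de-duplicating column names: add a trailing '_' to a repeated name.
# Note: a name that occurs three or more times still collides, because every
# repeat after the first gets the same single '_' suffix.
existing = []
updated = []
for col in all_players_stats_df.columns:
    if col in existing:
        updated.append(col+'_')
        existing.append(col+'_')
    else:
        updated.append(col)
        existing.append(col)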

In [63]:
len(updated)

214

In [64]:
len(set(updated))

208
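
214 names but only 208 unique ones means six entries still repeat an earlier name. A small sketch for listing which names are affected, using a short made-up sample instead of the full column list (`find_duplicates` is a hypothetical helper, not part of `src.data_extraction`):

```python
from collections import Counter


def find_duplicates(names):
    """Return a dict of every name that occurs more than once, with its count."""
    counts = Counter(names)
    return {name: count for name, count in counts.items() if count > 1}


# Short sample in the same style as the real column names
sample_columns = ["passing_cmp", "passing_att", "passing_cmp", "passing_cmp", "summary_att"]
repeated = find_duplicates(sample_columns)
```

Running `find_duplicates(updated)` on the real list would show which names the '_' suffix did not fix.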

In [None]:
for col in all_players_stats_df:
    print(f'"{col}",')

"summary_date",
"summary_day",
"summary_comp",
"summary_round",
"summary_venue",
"summary_result",
"summary_squad",
"summary_opponent",
"summary_start",
"summary_pos",
"summary_min",
"summary_gls",
"summary_ast",
"summary_pk",
"summary_pkatt",
"summary_sh",
"summary_sot",
"summary_crdy",
"summary_crdr",
"summary_touches",
"summary_tkl",
"summary_int",
"summary_blocks",
"summary_xg",
"summary_npxg",
"summary_xag",
"summary_sca",
"summary_gca",
"summary_cmp",
"summary_att",
"summary_cmp%",
"summary_prgp",
"summary_carries",
"summary_prgc",
"summary_att",
"summary_succ",
"summary_match report",
"passing_date",
"passing_day",
"passing_comp",
"passing_round",
"passing_venue",
"passing_result",
"passing_squad",
"passing_opponent",
"passing_start",
"passing_pos",
"passing_min",
"passing_cmp",
"passing_att",
"passing_cmp%",
"passing_totdist",
"passing_prgdist",
"passing_cmp",
"passing_att",
"passing_cmp%",
"passing_cmp",
"passing_att",
"passing_cmp%",
"passing_cmp",
"passing_att",
"passing_cmp

In [69]:
updated_names = ["summary_date",
"summary_day",
"summary_comp",
"summary_round",
"summary_venue",
"summary_result",
"summary_squad",
"summary_opponent",
"summary_start",
"summary_pos",
"summary_min",
"summary_gls",
"summary_ast",
"summary_pk",
"summary_pkatt",
"summary_sh",
"summary_sot",
"summary_crdy",
"summary_crdr",
"summary_touches",
"summary_tkl",
"summary_int",
"summary_blocks",
"summary_xg",
"summary_npxg",
"summary_xag",
"summary_sca",
"summary_gca",
"summary_cmp",
"summary_att",
"summary_cmp%",
"summary_prgp",
"summary_carries",
"summary_prgc",
"summary_att",
"summary_succ",
"summary_match report",
"passing_date",
"passing_day",
"passing_comp",
"passing_round",
"passing_venue",
"passing_result",
"passing_squad",
"passing_opponent",
"passing_start",
"passing_pos",
"passing_min",
"passing_cmp",
"passing_att",
"passing_cmp%",
"passing_totdist",
"passing_prgdist",
"passing_cmp",
"passing_att",
"passing_cmp%",
"passing_cmp",
"passing_att",
"passing_cmp%",
"passing_cmp",
"passing_att",
"passing_cmp%",
"passing_ast",
"passing_xag",
"passing_xa",
"passing_kp",
"passing_1/3",
"passing_ppa",
"passing_crspa",
"passing_prgp",
"passing_match report",
"passing_types_date",
"passing_types_day",
"passing_types_comp",
"passing_types_round",
"passing_types_venue",
"passing_types_result",
"passing_types_squad",
"passing_types_opponent",
"passing_types_start",
"passing_types_pos",
"passing_types_min",
"passing_types_att",
"passing_types_live",
"passing_types_dead",
"passing_types_fk",
"passing_types_tb",
"passing_types_sw",
"passing_types_crs",
"passing_types_ti",
"passing_types_ck",
"passing_types_in",
"passing_types_out",
"passing_types_str",
"passing_types_cmp",
"passing_types_off",
"passing_types_blocks",
"passing_types_match report",
"gca_date",
"gca_day",
"gca_comp",
"gca_round",
"gca_venue",
"gca_result",
"gca_squad",
"gca_opponent",
"gca_start",
"gca_pos",
"gca_min",
"gca_sca",
"gca_passlive",
"gca_passdead",
"gca_to",
"gca_sh",
"gca_fld",
"gca_def",
"gca_gca",
"gca_passlive",
"gca_passdead",
"gca_to",
"gca_sh",
"gca_fld",
"gca_def",
"gca_match report",
"defense_date",
"defense_day",
"defense_comp",
"defense_round",
"defense_venue",
"defense_result",
"defense_squad",
"defense_opponent",
"defense_start",
"defense_pos",
"defense_min",
"defense_tkl",
"defense_tklw",
"defense_def 3rd",
"defense_mid 3rd",
"defense_att 3rd",
"defense_tkl",
"defense_att",
"defense_tkl%",
"defense_lost",
"defense_blocks",
"defense_sh",
"defense_pass",
"defense_int",
"defense_tkl+int",
"defense_clr",
"defense_err",
"defense_match report",
"possession_date",
"possession_day",
"possession_comp",
"possession_round",
"possession_venue",
"possession_result",
"possession_squad",
"possession_opponent",
"possession_start",
"possession_pos",
"possession_min",
"possession_touches",
"possession_def pen",
"possession_def 3rd",
"possession_mid 3rd",
"possession_att 3rd",
"possession_att pen",
"possession_live",
"possession_att",
"possession_succ",
"possession_succ%",
"possession_tkld",
"possession_tkld%",
"possession_carries",
"possession_totdist",
"possession_prgdist",
"possession_prgc",
"possession_1/3",
"possession_cpa",
"possession_mis",
"possession_dis",
"possession_rec",
"possession_prgr",
"possession_match report",
"misc_date",
"misc_day",
"misc_comp",
"misc_round",
"misc_venue",
"misc_result",
"misc_squad",
"misc_opponent",
"misc_start",
"misc_pos",
"misc_min",
"misc_crdy",
"misc_crdr",
"misc_2crdy",
"misc_fls",
"misc_fld",
"misc_off",
"misc_crs",
"misc_int",
"misc_tklw",
"misc_pkwon",
"misc_pkcon",
"misc_og",
"misc_recov",
"misc_won",
"misc_lost",
"misc_won%",
"misc_match report"]

In [70]:
final_df = all_players_stats_df.copy()
final_df.columns = updated_names

In [74]:
final_df.to_csv(output/'all_players_stats_20241208.csv', index=False)
failed_players_final = failed_players.copy()
failed_players_final.to_csv(output/'failed_extractions_stats_20241208.csv', index=False)

In [75]:
final_df.to_sql(name='player_stats', con=con, if_exists='append', index=False)
# all_players_stats_df.to_csv(output/'mls_2024/player_stats_2024_12_08.csv')
failed_players_final.to_sql(name='failed_extractions', con=con, if_exists='append', index=False)
# failed_players.to_csv(output/'mls_2024/failed_extractions_2024_12_08.csv')

OperationalError: duplicate column name: summary_att
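
The error comes back because `updated_names` still repeats names (for example, "summary_att" is listed twice). Below is a minimal sketch of one possible fix that numbers each repeat. `make_unique` is a hypothetical helper, and the `_2`, `_3` suffixes are only positional; they don't say which sub-group of the table a column belongs to, so more descriptive names may be a better choice.

```python
from collections import Counter


def make_unique(names):
    """Return a copy of names where repeats get a _2, _3, ... suffix."""
    used = set()
    counts = Counter()
    result = []
    for name in names:
        counts[name] += 1
        # First occurrence keeps its name; later ones get a numbered suffix
        candidate = name if counts[name] == 1 else f"{name}_{counts[name]}"
        # Keep bumping the number if the candidate is already taken
        while candidate in used:
            counts[name] += 1
            candidate = f"{name}_{counts[name]}"
        used.add(candidate)
        result.append(candidate)
    return result


# Short sample in the same style as the real column names
sample_names = ["passing_cmp", "passing_att", "passing_cmp", "passing_cmp", "summary_att"]
unique_names = make_unique(sample_names)
```

Assigning `make_unique(list(all_players_stats_df.columns))` to `final_df.columns` should give `to_sql` a set of distinct column names to work with.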