The goal of this notebook is to web-scrape MLS data for the 2022 through 2024 seasons, at a minimum.

I found a useful dataset on Kaggle covering 1996 - 2022.
I included 2022 in the web-scraping to see how the scraped data lines up with that existing
dataset. If the two fit together, I can reuse the Kaggle data for the earlier seasons and save
some scraping time. Further exploration is required.

# Essential Libraries + Other

In [1]:
%load_ext autoreload
%autoreload 2

# necessary imports 
import configparser
import os
import sys
import pandas as pd
import sqlite3
import datetime

from pathlib import Path

# for my custom functions
sys.path.insert(0, '../')
import src.data_extraction as de

In [2]:
# pull variables, paths, and other settings from a central config.ini file
config = configparser.ConfigParser()
config.read('../src/config.ini')

['../src/config.ini']
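
`config.read()` returns the list of files it managed to parse and silently skips any it cannot find, so an empty list means the config was not loaded. The following is a minimal sketch of guarding against that. The helper name `load_config` and the sample file contents are assumptions for illustration; the only key this notebook is known to use is `output` in the `[paths]` section.

```python
import configparser
import tempfile
from pathlib import Path


def load_config(path):
    """Read an INI file and fail loudly if it could not be read (hypothetical helper)."""
    config = configparser.ConfigParser()
    files_read = config.read(path)  # list of files that were actually parsed
    if not files_read:
        raise FileNotFoundError(f"could not read config file: {path}")
    return config


# Demo with a throwaway config file; the real config.ini contents are not shown here.
with tempfile.TemporaryDirectory() as tmp:
    ini_path = Path(tmp) / "config.ini"
    ini_path.write_text("[paths]\noutput = ../data/output\n")
    config = load_config(ini_path)
    output = Path(config["paths"]["output"])
```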

In [3]:
# for file saving
today = datetime.datetime.now()
today = today.strftime("%Y_%m_%d")

# the output path is specified in the config.ini file
output = Path(config['paths']['output'])

# I plan on at least collecting data from 2022 to 2024
yearly_directories = [Path(output/f"mls_{year}") for year in range(2022,2025)]

# create the output directory and yearly sub-directories if they don't exist
# (parents=True also creates `output` itself; exist_ok=True skips existing ones)
for directory in [output] + yearly_directories:
    directory.mkdir(parents=True, exist_ok=True)

# Current Season - 2024

## Setup
Run this section only on your first time running this file.

In [4]:
# start with a url that looks like this page. It will automatically grab/generate associated URLs
# for each team and player
base_url = 'https://fbref.com/en/comps/22/Major-League-Soccer-Stats'

In [None]:
# Pulls team names, player names, and the URLs needed for the current season
# Both all_teams_df and all_players_df are saved as CSV files with the current date appended to the file name

# year = current season year
# all_teams_df = dataframe of team names and team URLs for the season
# all_players_df = dataframe of player names, positions, teams, and player URLs for the season
year, all_teams_df, all_players_df = de.get_teams_and_players(base_url)

In [6]:
year

'2024'

In [7]:
all_teams_df

Unnamed: 0,team,team_url,season
0,Inter Miami,https://fbref.com/en/squads/cb8b86a2/Inter-Mia...,2024
1,Columbus Crew,https://fbref.com/en/squads/529ba333/Columbus-...,2024
2,FC Cincinnati,https://fbref.com/en/squads/e9ea41b2/FC-Cincin...,2024
3,Orlando City,https://fbref.com/en/squads/46ef01d0/Orlando-C...,2024
4,Charlotte,https://fbref.com/en/squads/eb57545a/Charlotte...,2024
5,NYCFC,https://fbref.com/en/squads/64e81410/New-York-...,2024
6,NY Red Bulls,https://fbref.com/en/squads/69a0fb10/New-York-...,2024
7,CF Montréal,https://fbref.com/en/squads/fc22273c/CF-Montre...,2024
8,Atlanta Utd,https://fbref.com/en/squads/1ebc1a5b/Atlanta-U...,2024
9,D.C. United,https://fbref.com/en/squads/44117292/DC-United...,2024


In [8]:
all_players_df

Unnamed: 0,player_name,player_url,position,team,season
0,Drake Callender,https://fbref.com/en/players/c4d9567d/Drake-Ca...,GK,Inter Miami,2024
1,Julian Gressel,https://fbref.com/en/players/acd47bc0/Julian-G...,"MF,FW",Inter Miami,2024
2,Sergio Busquets,https://fbref.com/en/players/5ab0ea87/Sergio-B...,"MF,DF",Inter Miami,2024
3,Tomás Avilés,https://fbref.com/en/players/f51b9ae1/Tomas-Av...,DF,Inter Miami,2024
4,Jordi Alba,https://fbref.com/en/players/4601e194/Jordi-Alba,DF,Inter Miami,2024
...,...,...,...,...,...
938,Beau Leroux,https://fbref.com/en/players/6a9ab308/Beau-Leroux,MF,SJ Earthquakes,2024
939,Riley Lynch,https://fbref.com/en/players/9eb24ed6/Riley-Lynch,FW,SJ Earthquakes,2024
940,Cruz Medina,https://fbref.com/en/players/89d44509/Cruz-Medina,MF,SJ Earthquakes,2024
941,Emi Ochoa,https://fbref.com/en/players/30a08779/Emi-Ochoa,GK,SJ Earthquakes,2024


Only run this portion if you want a SQL database.

In [None]:
# The above dataframes are saved as csv files, but I will add them to a db
# file as well here to refresh my SQL skills at some point.
# I'm also doing this to share the db with some friends since they're more familiar
# with SQL than Python.

# set up connection
con = sqlite3.connect(output / 'mls.db')
cur = con.cursor()

In [None]:
# if_exists is set to 'append' since I can always remove duplicates if needed
# I don't want to accidentally replace the entire table
all_teams_df.to_sql(name='teams', con=con, if_exists='append', index=False)
all_players_df.to_sql(name='players', con=con, if_exists='append', index=False)

29
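
Because the tables are appended to, re-running the cell above would insert the same rows again. A minimal sketch of removing such duplicates afterwards with plain `sqlite3` follows. The helper `remove_duplicates`, the choice of key columns, and the demo rows with placeholder URLs are assumptions for illustration; it relies on SQLite's implicit `rowid`, which ordinary tables have.

```python
import sqlite3


def remove_duplicates(con, table, key_columns):
    """Keep the first row of each key combination and delete the rest (hypothetical helper)."""
    cols = ", ".join(f'"{c}"' for c in key_columns)
    cur = con.execute(
        f'DELETE FROM "{table}" WHERE rowid NOT IN '
        f'(SELECT MIN(rowid) FROM "{table}" GROUP BY {cols})'
    )
    con.commit()
    return cur.rowcount  # number of rows deleted


# Demo on an in-memory database with the same rows inserted twice.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE teams (team TEXT, team_url TEXT, season INTEGER)")
rows = [
    ("Inter Miami", "https://example.com/a", 2024),    # placeholder URL
    ("Columbus Crew", "https://example.com/b", 2024),  # placeholder URL
]
con.executemany("INSERT INTO teams VALUES (?, ?, ?)", rows * 2)

removed = remove_duplicates(con, "teams", ["team", "team_url", "season"])
remaining = con.execute("SELECT COUNT(*) FROM teams").fetchone()[0]
```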

## Scraping
Continue from here when you come back to continue scraping.

Each team takes about 30-40 minutes to scrape because I grab data from several URLs per player.

It takes this long because the website limits bots to only a few requests per minute. I had to
use *time.sleep()* to delay each request so I don't get temporarily or permanently banned :(
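
The delay idea can be sketched roughly as below. This is not the code inside `de.save_player_htmls`, which is not shown in this notebook. The function name `fetch_politely`, the injectable `fetch` and `sleep` arguments, and the 6-second default are all assumptions for illustration, so check the site's current bot policy before picking a delay.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=6.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests (hypothetical helper).

    Returns the fetched pages keyed by URL and the list of URLs that raised.
    """
    pages = {}
    failed = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        try:
            pages[url] = fetch(url)
        except Exception:
            failed.append(url)  # remember the URL so it can be retried later
    return pages, failed


# Demo with a fake fetcher and a fake sleep so nothing touches the network.
def fake_fetch(url):
    if url.endswith("bad"):
        raise RuntimeError("simulated failure")
    return f"<html>{url}</html>"


waits = []
pages, failed = fetch_politely(["a", "bad", "c"], fake_fetch, sleep=waits.append)
```

Passing `sleep` as an argument keeps the sketch testable without waiting; in a real run the default `time.sleep` is used.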

For each player, it currently grabs the stats from each of the following tables on the player's web page:
* Summary
* Passing
* Pass Types
* Goal and Shot Creation
* Defensive Actions
* Possession
* Miscellaneous Stats
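
Judging by the match-log links that appear in the outputs later in this notebook, each table seems to have its own URL of the form `.../players/<id>/matchlogs/<year>/<stat>/<name>`. A minimal sketch of building those URLs from a player URL follows. The helper `build_matchlog_urls` is hypothetical, and the mapping of table names to slugs is inferred from those links rather than taken from `src.data_extraction`.

```python
# Stat slugs as they appear in the match-log links in this notebook's outputs.
# The pairing with the table names in the list above is an assumption.
STAT_SLUGS = [
    "summary",        # Summary
    "passing",        # Passing
    "passing_types",  # Pass Types
    "gca",            # Goal and Shot Creation
    "defense",        # Defensive Actions
    "possession",     # Possession
    "misc",           # Miscellaneous Stats
]


def build_matchlog_urls(player_url, year):
    """Map each stat slug to the player's match-log URL for that season (hypothetical helper)."""
    # e.g. 'https://fbref.com/en/players/4601e194/Jordi-Alba'
    prefix, player_id, name_slug = player_url.rstrip("/").rsplit("/", 2)
    return {
        stat: f"{prefix}/{player_id}/matchlogs/{year}/{stat}/{name_slug}"
        for stat in STAT_SLUGS
    }


urls = build_matchlog_urls("https://fbref.com/en/players/4601e194/Jordi-Alba", 2024)
```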

As a result, I am running the scraping in chunks to make sure everything goes smoothly since it will take hours.

## Progress

* ~~0: Inter Miami~~ - Initial HTML Extraction Done
* ~~1: Columbus Crew~~ - Initial HTML Extraction Done
* ~~2: FC Cincinnati~~ - Initial HTML Extraction Done
* ~~3: Orlando City~~ - Initial HTML Extraction Done
* ~~4: Charlotte~~ - Initial HTML Extraction Done
* ~~5: NYCFC~~ - Initial HTML Extraction Done
* ~~6: NY Red Bulls~~ - Initial HTML Extraction Done
* ~~7: CF Montréal~~ - Initial HTML Extraction Done
* ~~8: Atlanta Utd~~ - Initial HTML Extraction Done
* ~~9: D.C. United~~ - Initial HTML Extraction Done
* ~~10: Toronto FC~~ - Initial HTML Extraction Done
* ~~11: Philadelphia Union~~ - Initial HTML Extraction Done
* ~~12: Nashville SC~~ - Initial HTML Extraction Done
* ~~13: NE Revolution~~ - Initial HTML Extraction Done
* ~~14: Chicago Fire~~ - Initial HTML Extraction Done
* ~~15: LAFC~~ - Initial HTML Extraction Done
* ~~16: LA Galaxy~~ - Initial HTML Extraction Done
* ~~17: Real Salt Lake~~ - Initial HTML Extraction Done
* ~~18: Seattle Sounders FC~~ - Initial HTML Extraction Done
* ~~19: Houston Dynamo~~ - Initial HTML Extraction Done
* ~~20: Minnesota Utd~~ - Initial HTML Extraction Done
* ~~21: Colorado Rapids~~ - Initial HTML Extraction Done
* ~~22: Vancouver W'caps~~ - Initial HTML Extraction Done
* ~~23: Portland Timbers~~ - Initial HTML Extraction Done
* ~~24: Austin~~ - Initial HTML Extraction Done
* ~~25: FC Dallas~~ - Initial HTML Extraction Done
* ~~26: St. Louis~~ - Initial HTML Extraction Done
* ~~27: Sporting KC~~ - Initial HTML Extraction Done
* ~~28: SJ Earthquakes~~ - Initial HTML Extraction Done
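
The checklist above is maintained by hand. As a rough cross-check, the number of saved HTML files per team could be counted from disk. This sketch assumes the files end up somewhere under one directory per team, as in the `html_files/{team}` layout created in the re-run cell of this notebook; the helper name `count_html_per_team` is hypothetical.

```python
import tempfile
from pathlib import Path


def count_html_per_team(html_root, teams):
    """Count saved .html files under each team's directory (hypothetical helper)."""
    counts = {}
    for team in teams:
        team_dir = Path(html_root) / team
        # a missing directory simply means nothing has been scraped for that team yet
        counts[team] = len(list(team_dir.rglob("*.html"))) if team_dir.is_dir() else 0
    return counts


# Demo on a temporary directory tree instead of the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    html_root = Path(tmp) / "mls_2024" / "html_files"
    miami = html_root / "Inter Miami"
    miami.mkdir(parents=True)
    (miami / "Jordi Alba_summary.html").write_text("<html></html>")
    (miami / "Jordi Alba_passing.html").write_text("<html></html>")
    counts = count_html_per_team(html_root, ["Inter Miami", "Columbus Crew"])
```

A team with a count of zero has not been started; a team with fewer files than expected probably has failed requests to retry.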

In [4]:
# Start re-runs from here so the cells above don't need to be rerun
all_players_df = pd.read_csv(output / 'mls_2024/all_players_2024_12_08.csv')
all_teams_df = pd.read_csv(output / 'mls_2024/all_teams_2024_12_08.csv')
year = 2024

# make sure each team has a directory for its HTML files
for team in all_teams_df['team']:
    directory = output / f'mls_{year}/html_files/{team}'
    directory.mkdir(parents=True, exist_ok=True)

In [5]:
# edit these lines to adjust how many teams you're running at a time
# I have been running a couple of teams at a time by adjusting the iloc slice
current_teams = list(all_teams_df['team'].iloc[14:])
# boolean mask of players on the selected teams (avoids shadowing the built-in `filter`)
is_current = all_players_df['team'].isin(current_teams)
current_players = all_players_df[is_current]
current_players

Unnamed: 0,player_name,player_url,position,team,season
462,Christopher Brady,https://fbref.com/en/players/a71768a2/Christop...,GK,Chicago Fire,2024
463,Hugo Cuypers,https://fbref.com/en/players/9f417f8c/Hugo-Cuy...,FW,Chicago Fire,2024
464,Kellyn Acosta,https://fbref.com/en/players/ece10cfe/Kellyn-A...,MF,Chicago Fire,2024
465,Brian Gutiérrez,https://fbref.com/en/players/d88f31db/Brian-Gu...,"MF,FW",Chicago Fire,2024
466,Rafael Czichos,https://fbref.com/en/players/b0a0698b/Rafael-C...,DF,Chicago Fire,2024
...,...,...,...,...,...
938,Beau Leroux,https://fbref.com/en/players/6a9ab308/Beau-Leroux,MF,SJ Earthquakes,2024
939,Riley Lynch,https://fbref.com/en/players/9eb24ed6/Riley-Lynch,FW,SJ Earthquakes,2024
940,Cruz Medina,https://fbref.com/en/players/89d44509/Cruz-Medina,MF,SJ Earthquakes,2024
941,Emi Ochoa,https://fbref.com/en/players/30a08779/Emi-Ochoa,GK,SJ Earthquakes,2024


In [6]:
failed_indicis, failed_links = de.save_player_htmls(current_players, year)

3 failed html requests                                                          
4 failed html requests                                                          
5 failed html requests                                                          
6 failed html requests                                                          
7 failed html requests                                                          
1 failed html requests                                                          
2 failed html requests                                                          
3 failed html requests                                                          
4 failed html requests                                                          
5 failed html requests                                                          
3 failed html requests                                                          
4 failed html requests                                                          
5 failed html requests      

In [7]:
failed_indicis

[547, 547, 547, 547, 547, 548, 548, 548, 548, 548, 662, 662, 662]

In [8]:
failed_links

['https://fbref.com/en/players/edbfbb3f/matchlogs/2024/passing_types/Eriq-Zavaleta',
 'https://fbref.com/en/players/edbfbb3f/matchlogs/2024/gca/Eriq-Zavaleta',
 'https://fbref.com/en/players/edbfbb3f/matchlogs/2024/defense/Eriq-Zavaleta',
 'https://fbref.com/en/players/edbfbb3f/matchlogs/2024/possession/Eriq-Zavaleta',
 'https://fbref.com/en/players/edbfbb3f/matchlogs/2024/misc/Eriq-Zavaleta',
 'https://fbref.com/en/players/1fafbf53/matchlogs/2024/summary/Tucker-Lepley',
 'https://fbref.com/en/players/1fafbf53/matchlogs/2024/passing/Tucker-Lepley',
 'https://fbref.com/en/players/1fafbf53/matchlogs/2024/passing_types/Tucker-Lepley',
 'https://fbref.com/en/players/1fafbf53/matchlogs/2024/gca/Tucker-Lepley',
 'https://fbref.com/en/players/1fafbf53/matchlogs/2024/defense/Tucker-Lepley',
 'https://fbref.com/en/players/ed4dc7f4/matchlogs/2024/passing_types/DJ-Taylor',
 'https://fbref.com/en/players/ed4dc7f4/matchlogs/2024/gca/DJ-Taylor',
 'https://fbref.com/en/players/ed4dc7f4/matchlogs/2024

In [9]:
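# failed_indicis holds one entry per failed URL, so a player with several failed
# pages appears several times in this selection
failed = all_players_df.iloc[failed_indicis]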

In [10]:
failed

Unnamed: 0,player_name,player_url,position,team,season
547,Eriq Zavaleta,https://fbref.com/en/players/edbfbb3f/Eriq-Zav...,DF,LA Galaxy,2024
547,Eriq Zavaleta,https://fbref.com/en/players/edbfbb3f/Eriq-Zav...,DF,LA Galaxy,2024
547,Eriq Zavaleta,https://fbref.com/en/players/edbfbb3f/Eriq-Zav...,DF,LA Galaxy,2024
547,Eriq Zavaleta,https://fbref.com/en/players/edbfbb3f/Eriq-Zav...,DF,LA Galaxy,2024
547,Eriq Zavaleta,https://fbref.com/en/players/edbfbb3f/Eriq-Zav...,DF,LA Galaxy,2024
548,Tucker Lepley,https://fbref.com/en/players/1fafbf53/Tucker-L...,MF,LA Galaxy,2024
548,Tucker Lepley,https://fbref.com/en/players/1fafbf53/Tucker-L...,MF,LA Galaxy,2024
548,Tucker Lepley,https://fbref.com/en/players/1fafbf53/Tucker-L...,MF,LA Galaxy,2024
548,Tucker Lepley,https://fbref.com/en/players/1fafbf53/Tucker-L...,MF,LA Galaxy,2024
548,Tucker Lepley,https://fbref.com/en/players/1fafbf53/Tucker-L...,MF,LA Galaxy,2024
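
The table above repeats each player once per failed URL. A minimal sketch of collapsing the index list to unique players before a retry, using the values printed earlier in this notebook, is shown below. Whether `de.save_player_htmls` re-requests every page for each row is not shown here, so the time saved is an assumption.

```python
from collections import Counter

# Values copied from the failed_indicis output shown earlier in this notebook.
failed_indices = [547, 547, 547, 547, 547, 548, 548, 548, 548, 548, 662, 662, 662]

# dict.fromkeys keeps the first occurrence of each index and preserves order.
unique_failed = list(dict.fromkeys(failed_indices))

# How many URLs failed for each player index.
failures_per_player = Counter(failed_indices)

# In the notebook, the de-duplicated selection would then be something like:
# failed = all_players_df.iloc[unique_failed]
```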


In [11]:
failed_indicis2, failed_links2 = de.save_player_htmls(failed, year)

|████████████████████████████████████████| 13/13 [100%] in 14:19.0 (0.01/s)     


In [12]:
failed_indicis2

[]

In [13]:
failed_links2

[]

# Data From HTML

In [23]:
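def get_html_stat_file_paths(directory, stat):
    """Return the names of all '<player>_<stat>.html' files found under directory."""
    html_files = []
    # walk every sub-directory; only the file names are collected, not the full paths
    for _root, _dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(f'_{stat}.html'):
                html_files.append(file)
    return html_files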

In [24]:
html_files = get_html_stat_file_paths(output, 'defense')
html_files

['Aiden McFadden_defense.html',
 'Ajani Fortune_defense.html',
 'Aleksei Miranchuk_defense.html',
 'Bartosz Slisz_defense.html',
 'Brad Guzan_defense.html',
 'Brooks Lennon_defense.html',
 'Caleb Wiley_defense.html',
 'Daniel Armando Ríos_defense.html',
 'Dax McCarty_defense.html',
 'Derrick Etienne_defense.html',
 'Derrick Williams_defense.html',
 'Edwin Mosquera_defense.html',
 'Efrain Morales_defense.html',
 'Giorgos Giakoumakis_defense.html',
 'Jamal Thiaré_defense.html',
 'Josh Cohen_defense.html',
 'Luis Abram_defense.html',
 'Luke Brennan_defense.html',
 'Matthew Edwards_defense.html',
 'Matías Gallardo_defense.html',
 'Nicolas Firmino_defense.html',
 'Noah Cobb_defense.html',
 'Pedro Amador_defense.html',
 'Quentin Westberg_defense.html',
 'Ronald Hernández_defense.html',
 'Saba Lobzhanidze_defense.html',
 'Stian Rode Gregersen_defense.html',
 'Thiago Almada_defense.html',
 'Tristan Muyumba_defense.html',
 'Tyler Wolff_defense.html',
 'Xande Silva_defense.html',
 'Alexander Rin

In [25]:
len(html_files)

941
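
The function above returns bare file names, which is not enough to open the files later because the team directory is lost. A minimal alternative sketch that returns full paths with `pathlib` and recovers the player name from the `<player>_<stat>.html` naming seen in the output above follows. The helper names `find_stat_html_paths` and `player_name_from_path` are hypothetical.

```python
import tempfile
from pathlib import Path


def find_stat_html_paths(root, stat):
    """Return sorted full paths of every '<player>_<stat>.html' file under root."""
    return sorted(Path(root).rglob(f"*_{stat}.html"))


def player_name_from_path(path, stat):
    """Strip the '_<stat>.html' suffix to recover the player name."""
    suffix = f"_{stat}.html"
    return path.name[: -len(suffix)]


# Demo on a temporary directory tree instead of the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    team_dir = Path(tmp) / "mls_2024" / "html_files" / "Atlanta Utd"
    team_dir.mkdir(parents=True)
    for stat in ("defense", "passing", "passing_types"):
        (team_dir / f"Brad Guzan_{stat}.html").write_text("<html></html>")

    passing_names = [p.name for p in find_stat_html_paths(tmp, "passing")]
    defense_players = [
        player_name_from_path(p, "defense")
        for p in find_stat_html_paths(tmp, "defense")
    ]
```

Note that the pattern for `passing` does not pick up the `passing_types` file, since the file name must end in exactly `_passing.html`.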