The point of this notebook is to web-scrape MLS data for at least the 2022 through 2024 seasons.

I found a pretty useful dataset on Kaggle covering the years 1996 - 2022.
I included 2022 in the web-scraping to get a better understanding of how the scraped data
lines up with that existing data. If the two fit together, it would save me some time
on scraping. However, further exploration is required.

# Essential Libraries + Other

In [1]:
%load_ext autoreload
%autoreload 2

# necessary imports 
import configparser
import os
import sys
import pandas as pd
import sqlite3
import datetime

from bs4 import BeautifulSoup
from pathlib import Path

# for my custom functions
sys.path.insert(0, '../')
import src.data_extraction as de

In [2]:
# pull some variables, paths, and other from a central config.ini file
config = configparser.ConfigParser()
config.read('../src/config.ini')

['../src/config.ini']
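
The list printed above is the return value of `config.read`, which reports the files it managed to parse. A missing file does not raise an error; it just gives back an empty list. A minimal sketch of a guard for that, assuming a `[paths]` section with an `output` key like the one used in the next cell. `load_config` is a hypothetical helper and the `./output` value is a made-up placeholder, not the real contents of `config.ini`:

```python
import configparser

# Hypothetical stand-in for ../src/config.ini; the real file is not shown here.
EXAMPLE_INI = """
[paths]
output = ./output
"""

def load_config(path=None):
    """Read an ini file, failing loudly if it cannot be found."""
    config = configparser.ConfigParser()
    if path is None:
        # No path given: fall back to the inline example for demonstration.
        config.read_string(EXAMPLE_INI)
        return config
    parsed = config.read(path)  # list of files that were successfully read
    if not parsed:
        raise FileNotFoundError(f"could not read config file: {path}")
    return config

config = load_config()
output_setting = config["paths"]["output"]
```

Calling `load_config('../src/config.ini')` instead would read the real file and raise right away if the relative path is wrong.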

In [3]:
# for file saving
today = datetime.datetime.now()
today = today.strftime("%Y_%m_%d")

# the output path is specified in the config.ini file
output = Path(config['paths']['output'])

# I plan on collecting at least 2022 through 2024; the range also covers the current 2025 season
yearly_directories = [output / f"mls_{year}" for year in range(2022, 2026)]

# create the output directory and yearly sub-directories if they don't exist
# parents=True means a missing output directory no longer breaks the yearly mkdir calls
for directory in [output] + yearly_directories:
    directory.mkdir(parents=True, exist_ok=True)

# Current Season - 2025
I'm able to start this season off with a clean slate. There is still some time before the new season starts, so
parts of the team and player information on the website are missing. I'm going to try to set some things up for when that
information comes rolling in.

## Setup
Run this section only on your first time running this file.

In [4]:
# start with a url that looks like this page. It will automatically grab/generate associated URLs
# for each team and player
base_url = 'https://fbref.com/en/comps/22/Major-League-Soccer-Stats'

In [5]:
# Pulls team names, player names, and associated urls needed for the associated season
# Both all_teams_df and all_players_df are saved with the current date appended to the file name

# year = current season year
# all_teams_df = dataframe of associated team names and associated URLs for the associated season
# all_players_df = dataframe of associated player names, position, team, and URLs for the associated season

# player rosters were not yet available on the website when I ran this, so I at least got the teams and year
# year, all_teams_df, all_players_df = de.get_teams_and_players(base_url)
all_teams_df, year = de.get_all_teams(base_url)

In [6]:
year

'2025'

In [7]:
all_teams_df

Unnamed: 0,team,team_url,season
0,NE Revolution,https://fbref.com/en/squads/3c079def/New-Engla...,2025
1,D.C. United,https://fbref.com/en/squads/44117292/DC-United...,2025
2,Chicago Fire,https://fbref.com/en/squads/f9940243/Chicago-F...,2025
3,Columbus Crew,https://fbref.com/en/squads/529ba333/Columbus-...,2025
4,NY Red Bulls,https://fbref.com/en/squads/69a0fb10/New-York-...,2025
5,Toronto FC,https://fbref.com/en/squads/130f43fa/Toronto-F...,2025
6,Philadelphia Union,https://fbref.com/en/squads/46024eeb/Philadelp...,2025
7,CF Montréal,https://fbref.com/en/squads/fc22273c/CF-Montre...,2025
8,Orlando City,https://fbref.com/en/squads/46ef01d0/Orlando-C...,2025
9,NYCFC,https://fbref.com/en/squads/64e81410/New-York-...,2025


Only run this portion if you want a SQL db

In [8]:
# The above dataframes are saved as csv files, but I will add them to a db
# file as well here to refresh my SQL skills at some point.
# I'm also doing this to share the db with some friends since they're more familiar
# with SQL than Python.

# set up connection
con = sqlite3.connect(output / 'mls.db')
cur = con.cursor()

In [None]:
# if_exists is set to 'append' since I can always remove duplicates if needed
# I don't want to accidentally replace the entire table
all_teams_df.to_sql(name='teams', con=con, if_exists='append', index=False)
# all_players_df.to_sql(name='players', con=con, if_exists='append', index=False)

30
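
Since `if_exists='append'` adds the same rows again every time the cell is rerun, duplicates may need to be cleaned up afterwards. A minimal sketch of one way to do that in SQLite, using an in-memory database and a cut-down, made-up `teams` table rather than the real `mls.db`. Treating `team` plus `season` as the duplicate key is an assumption:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway database for illustration
con.execute("CREATE TABLE teams (team TEXT, team_url TEXT, season INTEGER)")

rows = [
    ("Team A", "https://example.com/a", 2025),
    ("Team B", "https://example.com/b", 2025),
]
# Insert the same rows twice to mimic running the append cell twice.
con.executemany("INSERT INTO teams VALUES (?, ?, ?)", rows)
con.executemany("INSERT INTO teams VALUES (?, ?, ?)", rows)

# Keep the first copy (lowest rowid) of each (team, season) pair.
con.execute(
    """
    DELETE FROM teams
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM teams GROUP BY team, season
    )
    """
)
con.commit()

remaining = con.execute("SELECT team, season FROM teams ORDER BY team").fetchall()
```

After the `DELETE`, `remaining` holds one row per team and season.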

## Scraping
Continue from here when you come back to continue scraping.

Each team takes me about 30-40 minutes to scrape since I am grabbing data from several URLs per player.

It takes so long because the website limits bots to only a few calls per minute. I had to
use *time.sleep()* to delay each request so I wouldn't get temporarily or permanently banned :(
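
The delay can be sketched roughly as follows. This is not the code in `src.data_extraction`; `fetch_politely`, the `fetch` callable, and the 6-second default are assumptions for illustration, and the real limit should be taken from the site's own bot policy.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=6.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch is any callable that returns the page content for a URL
    (hypothetical here; plug in whatever HTTP client is in use).
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages
```

Passing `sleep` in as a parameter makes it possible to swap in a no-op while testing, so a dry run does not take hours.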

For each player, it currently grabs the stats from each of the following tables on the web page:
* Summary - I'm skipping this since it varies from player to player
* Passing
* Pass Types
* Goal and Shot Creation
* Defensive Actions
* Possession
* Miscellaneous Stats

As a result, I am running the scraping in chunks to make sure everything goes smoothly since it will take hours.
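
A minimal sketch of the chunking idea, using only plain lists. `chunked` is a hypothetical helper, and the chunk size of 6 mirrors the `.iloc[:6]` slice used in the cell further down:

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

team_indices = list(range(30))  # the 30 teams listed under Progress
batches = list(chunked(team_indices, 6))
```

Each batch can then be scraped in its own run, so a failure only costs one chunk rather than the whole season.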

## Progress

* 0: NE Revolution
* 1: D.C. United
* 2: Chicago Fire
* 3: Columbus Crew
* 4: NY Red Bulls
* 5: Toronto FC
* 6: Philadelphia Union
* 7: CF Montréal
* 8: Orlando City
* 9: NYCFC
* 10: Atlanta Utd
* 11: Inter Miami
* 12: FC Cincinnati
* 13: Nashville SC
* 14: Charlotte
* 15: LA Galaxy
* 16: Sporting KC
* 17: Colorado Rapids
* 18: Houston Dynamo
* 19: FC Dallas
* 20: Real Salt Lake
* 21: SJ Earthquakes
* 22: Seattle Sounders FC
* 23: Vancouver W'caps
* 24: Portland Timbers
* 25: Minnesota Utd
* 26: LAFC
* 27: St. Louis
* 28: Austin
* 29: San Diego FC

In [None]:
# Start re-runs from here so you don't have to rerun the above
all_players_df = pd.read_csv(output / 'mls_2025/all_players_2025_02_08.csv')
all_teams_df = pd.read_csv(output / 'mls_2024/all_teams_2025_02_08.csv')
year = 2025

# create one html_files sub-directory per team, along with any missing parent directories
for team in all_teams_df['team']:
    directory = output / f'mls_{year}/html_files/{team}'
    directory.mkdir(parents=True, exist_ok=True)
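
The file names above hard-code the date of the scraping run. Because the `%Y_%m_%d` stamp sorts chronologically as plain text, the newest file can also be looked up instead. A sketch, assuming the files follow the `<prefix>_<YYYY_MM_DD>.csv` pattern seen above; `latest_dated_file` is a hypothetical helper, not part of `src.data_extraction`:

```python
from pathlib import Path

def latest_dated_file(directory, prefix):
    """Return the newest '<prefix>_YYYY_MM_DD.csv' in directory, or None."""
    candidates = sorted(Path(directory).glob(f"{prefix}_*.csv"))
    # Zero-padded year_month_day stamps sort in date order as strings.
    return candidates[-1] if candidates else None
```

With this, `pd.read_csv(latest_dated_file(output / 'mls_2025', 'all_players'))` could stand in for the hard-coded file name.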

In [None]:
# edit these lines to adjust how many you're running at a time
# I have been running a couple teams at a time by adjusting the iloc
current_teams = list(all_teams_df['team'].iloc[:6])
# isin builds the same boolean mask without shadowing the built-in filter()
team_mask = all_players_df['team'].isin(current_teams)
current_players = all_players_df[team_mask]
current_players

In [None]:
all_players_df

In [None]:
failed_indicis, failed_links = de.save_player_htmls(current_players, year)

In [None]:
failed = all_players_df.iloc[failed_indicis]
failed

In [None]:
failed_indicis2, failed_links2 = de.save_player_htmls(failed, year)

In [None]:
failed_indicis2

In [None]:
failed_links2
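
The two manual retry passes above could be folded into a loop. A sketch, assuming a `scrape` callable that takes a list of items and returns the ones that failed. This is a simplified stand-in and not the actual signature of `de.save_player_htmls`, which takes a dataframe and a year and returns both indices and links:

```python
def retry_failed(items, scrape, max_passes=3):
    """Re-run scrape on whatever failed until nothing fails or passes run out."""
    remaining = list(items)
    for _ in range(max_passes):
        if not remaining:
            break
        remaining = list(scrape(remaining))
    return remaining
```

Whatever is still in the returned list after `max_passes` attempts is worth inspecting by hand, much like `failed_links2` above.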

# Data From HTML

In [7]:
passing_html = de.get_html_stat_file_paths(output, 'passing')
passingtypes_html = de.get_html_stat_file_paths(output, 'passing_types')
gca_html = de.get_html_stat_file_paths(output, 'gca')
defense_html = de.get_html_stat_file_paths(output, 'defense')
possession_html = de.get_html_stat_file_paths(output, 'possession')
misc_html = de.get_html_stat_file_paths(output, 'misc')

In [None]:
passing_df, passing_failed = de.create_df_for_stat(passing_html, 'passing')
passingtypes_df, passingtypes_failed = de.create_df_for_stat(passingtypes_html, 'passing_types')
gca_df, gca_failed = de.create_df_for_stat(gca_html, 'gca')
defense_df, defense_failed = de.create_df_for_stat(defense_html, 'defense')
possession_df, possession_failed = de.create_df_for_stat(possession_html, 'possession')
misc_df, misc_failed = de.create_df_for_stat(misc_html, 'misc')

In [None]:
print(len(passing_failed))
print(len(passingtypes_failed))
print(len(gca_failed))
print(len(defense_failed))
print(len(possession_failed))
print(len(misc_failed))

In [102]:
passing_df.to_csv(output / 'mls_2025/passing_df.csv', index=False)
passingtypes_df.to_csv(output / 'mls_2025/passingtypes_df.csv', index=False)
gca_df.to_csv(output / 'mls_2025/gca_df.csv', index=False)
defense_df.to_csv(output / 'mls_2025/defense_df.csv', index=False)
possession_df.to_csv(output / 'mls_2025/possession_df.csv', index=False)
misc_df.to_csv(output / 'mls_2025/misc_df.csv', index=False)