The goal of this notebook is to web-scrape MLS data for at least the 2022 through 2024 seasons.

I found a useful dataset on Kaggle covering the years 1996 to 2022.
I included 2022 in the web-scraping to see how the scraped data lines up with that existing dataset.
If the two fit together, the Kaggle data could save me some scraping time.
However, further exploration is required.

# Essential Libraries + Other

In [1]:
%load_ext autoreload
%autoreload 2

# necessary imports 
import configparser
import os
import sys
import pandas as pd
import sqlite3
import datetime

from pathlib import Path

# for my custom functions
sys.path.insert(0, '../')
import src.data_extraction as de

In [2]:
# pull some variables, paths, and other from a central config.ini file
config = configparser.ConfigParser()
config.read('../src/config.ini')

['../src/config.ini']

In [3]:
# for file saving
today = datetime.datetime.now()
today = today.strftime("%Y_%m_%d")

# the output path is specified in the config.ini file
output = Path(config['paths']['output'])

# I plan on collecting data from at least 2022 through 2024
yearly_directories = [output / f"mls_{year}" for year in range(2022, 2025)]

# create the output directory and yearly sub-directories if they don't exist;
# parents=True also creates `output` itself, so the order no longer matters
for directory in [output] + yearly_directories:
    directory.mkdir(parents=True, exist_ok=True)

# Current Season - 2024

## Setup
Run this only the first time you run this notebook.

In [4]:
# start with a url that looks like this page. It will automatically grab/generate associated URLs
# for each team and player
base_url = 'https://fbref.com/en/comps/22/Major-League-Soccer-Stats'

In [None]:
# Pulls team names, player names, and the URLs needed for the season
# Both all_teams_df and all_players_df are saved with the current date appended at the end

# year = current season year
# all_teams_df = dataframe of team names and team URLs for the season
# all_players_df = dataframe of player names, positions, teams, and URLs for the season
year, all_teams_df, all_players_df = de.get_teams_and_players(base_url)

In [6]:
year

'2024'

In [7]:
all_teams_df

Unnamed: 0,team,team_url,season
0,Inter Miami,https://fbref.com/en/squads/cb8b86a2/Inter-Mia...,2024
1,Columbus Crew,https://fbref.com/en/squads/529ba333/Columbus-...,2024
2,FC Cincinnati,https://fbref.com/en/squads/e9ea41b2/FC-Cincin...,2024
3,Orlando City,https://fbref.com/en/squads/46ef01d0/Orlando-C...,2024
4,Charlotte,https://fbref.com/en/squads/eb57545a/Charlotte...,2024
5,NYCFC,https://fbref.com/en/squads/64e81410/New-York-...,2024
6,NY Red Bulls,https://fbref.com/en/squads/69a0fb10/New-York-...,2024
7,CF Montréal,https://fbref.com/en/squads/fc22273c/CF-Montre...,2024
8,Atlanta Utd,https://fbref.com/en/squads/1ebc1a5b/Atlanta-U...,2024
9,D.C. United,https://fbref.com/en/squads/44117292/DC-United...,2024


In [8]:
all_players_df

Unnamed: 0,player_name,player_url,position,team,season
0,Drake Callender,https://fbref.com/en/players/c4d9567d/Drake-Ca...,GK,Inter Miami,2024
1,Julian Gressel,https://fbref.com/en/players/acd47bc0/Julian-G...,"MF,FW",Inter Miami,2024
2,Sergio Busquets,https://fbref.com/en/players/5ab0ea87/Sergio-B...,"MF,DF",Inter Miami,2024
3,Tomás Avilés,https://fbref.com/en/players/f51b9ae1/Tomas-Av...,DF,Inter Miami,2024
4,Jordi Alba,https://fbref.com/en/players/4601e194/Jordi-Alba,DF,Inter Miami,2024
...,...,...,...,...,...
938,Beau Leroux,https://fbref.com/en/players/6a9ab308/Beau-Leroux,MF,SJ Earthquakes,2024
939,Riley Lynch,https://fbref.com/en/players/9eb24ed6/Riley-Lynch,FW,SJ Earthquakes,2024
940,Cruz Medina,https://fbref.com/en/players/89d44509/Cruz-Medina,MF,SJ Earthquakes,2024
941,Emi Ochoa,https://fbref.com/en/players/30a08779/Emi-Ochoa,GK,SJ Earthquakes,2024


Only run this portion if you want a SQL db

In [None]:
# The above dataframes are saved as csv files, but I will add them to a db
# file as well here to refresh my SQL skills at some point.
# I'm also doing this to share the db with some friends since they're more familiar
# with SQL than Python.

# set up connection
con = sqlite3.connect(output / 'mls.db')
cur = con.cursor()

In [None]:
# if_exists is set to 'append' since I can always remove duplicates if needed
# I don't want to accidentally replace the entire table
all_teams_df.to_sql(name='teams', con=con, if_exists='append', index=False)
all_players_df.to_sql(name='players', con=con, if_exists='append', index=False)

29
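
Since `if_exists='append'` can leave repeated rows after a re-run, duplicates could later be removed directly in SQL. The following is a minimal sketch rather than part of the pipeline above: it works on an in-memory database with a stand-in `teams` table and placeholder URLs, and it assumes that a duplicate means the same `team` and `season`, which may not be the definition you want.

```python
import sqlite3

# in-memory stand-in for mls.db so the sketch does not touch the real file
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE teams (team TEXT, team_url TEXT, season INTEGER)")

# simulate running the append cell twice for the same season (placeholder URLs)
rows = [
    ("Inter Miami", "https://example.com/inter-miami", 2024),
    ("Columbus Crew", "https://example.com/columbus-crew", 2024),
]
con.executemany("INSERT INTO teams VALUES (?, ?, ?)", rows + rows)

# keep the first-inserted row of each (team, season) pair and delete the rest
con.execute(
    """
    DELETE FROM teams
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM teams GROUP BY team, season
    )
    """
)
con.commit()

remaining = con.execute("SELECT team, season FROM teams ORDER BY rowid").fetchall()
print(remaining)
```

The same statement should work for the `players` table if the `GROUP BY` columns are swapped for whatever identifies a player uniquely, for example `player_url` and `season`.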

## Scraping
Each team takes me about 30-40 minutes to scrape since I am grabbing data from several URLs per player.

It takes this long because the website limits bots to only a few calls per minute. I had to
use *time.sleep()* to delay each request so that I don't get temporarily or permanently banned :(
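
To illustrate the delay, here is a minimal sketch of a throttled loop. It is not the code in `src.data_extraction`: the function name `fetch_all_throttled`, the `fetch` callback, and the 6-second default are all assumptions made for the example, and the real pause should follow whatever limit the site actually publishes.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all_throttled(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 6.0,  # assumed value, not the site's documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Call fetch(url) for each URL, pausing delay_seconds between consecutive calls."""
    pages: Dict[str, str] = {}
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages


# dry run with stand-ins: no network access and no real waiting
recorded_pauses = []
pages = fetch_all_throttled(
    ["page-a", "page-b", "page-c"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=recorded_pauses.append,
)
print(recorded_pauses)
```

Passing `sleep` in as a parameter keeps the function easy to dry-run; in a real scrape the default `time.sleep` would be used.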

For each player, the scraper currently grabs the stats from each of the following tables on the player's web page:
* Summary
* Passing
* Pass Types
* Goal and Shot Creation
* Defensive Actions
* Possession
* Miscellaneous Stats

Since the full scrape will take hours, I am running it in chunks to make sure everything goes smoothly.
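
For a rough sense of the total, the one timed run recorded in this notebook (35 Charlotte players in 42:38.2, per the progress bar in the last section) can be extrapolated to all players. This is only a back-of-the-envelope sketch: it assumes every player takes about as long as the Charlotte players did, and it takes the player count of 942 from the `all_players_df` index running from 0 to 941.

```python
# figures taken from this notebook; the extrapolation itself is an assumption
timed_players = 35              # Charlotte batch size from the progress bar
timed_seconds = 42 * 60 + 38.2  # 42:38.2 reported by the progress bar
total_players = 942             # all_players_df index runs from 0 to 941

seconds_per_player = timed_seconds / timed_players
estimated_hours = total_players * seconds_per_player / 3600

print(f"{seconds_per_player:.1f} s per player, about {estimated_hours:.1f} hours in total")
```

That comes to roughly 19 hours, which is in line with 29 teams at 30-40 minutes each.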

## Progress

* ~~0: Inter Miami~~ - Initial HTML Extraction Done
* ~~1: Columbus Crew~~ - Initial HTML Extraction Done
* ~~2: FC Cincinnati~~ - Initial HTML Extraction Done
* ~~3: Orlando City~~ - Initial HTML Extraction Done
* ~~4: Charlotte~~ - Initial HTML Extraction Done
* 5: NYCFC
* 6: NY Red Bulls
* 7: CF Montréal
* 8: Atlanta Utd
* 9: D.C. United
* 10: Toronto FC
* 11: Philadelphia Union
* 12: Nashville SC
* 13: NE Revolution
* 14: Chicago Fire
* 15: LAFC
* 16: LA Galaxy
* 17: Real Salt Lake
* 18: Seattle Sounders FC
* 19: Houston Dynamo
* 20: Minnesota Utd
* 21: Colorado Rapids
* 22: Vancouver W'caps
* 23: Portland Timbers
* 24: Austin
* 25: FC Dallas
* 26: St. Louis
* 27: Sporting KC
* 28: SJ Earthquakes
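
The list above is maintained by hand. As a possible cross-check, progress could also be read off the disk by looking at which team folders already contain saved pages. The sketch below is hypothetical: the helper `teams_with_html` is not part of `src.data_extraction`, and it assumes the pages are saved with an `.html` extension directly inside `html_files/<team>`. A non-empty folder only shows that a team was started, not that every player was saved.

```python
import tempfile
from pathlib import Path
from typing import List


def teams_with_html(html_root: Path) -> List[str]:
    """Return the sorted names of team directories that hold at least one .html file."""
    started = []
    for team_dir in sorted(html_root.iterdir()):
        if team_dir.is_dir() and any(team_dir.glob("*.html")):
            started.append(team_dir.name)
    return started


# demonstration on a throwaway directory tree instead of the real output folder
with tempfile.TemporaryDirectory() as tmp:
    html_root = Path(tmp) / "mls_2024" / "html_files"
    for team in ["Inter Miami", "NYCFC"]:
        (html_root / team).mkdir(parents=True)
    (html_root / "Inter Miami" / "player.html").write_text("<html></html>")
    finished = teams_with_html(html_root)

print(finished)
```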

In [4]:
# Start re-runs from here so as not to have to rerun the above
all_players_df = pd.read_csv(output / 'mls_2024/all_players_2024_12_08.csv')
all_teams_df = pd.read_csv(output / 'mls_2024/all_teams_2024_12_08.csv')
year = 2024

# one HTML sub-directory per team; parents=True also creates html_files/ if it is missing
for team in all_teams_df['team']:
    directory = output / f'mls_{year}/html_files/{team}'
    directory.mkdir(parents=True, exist_ok=True)

In [5]:
# edit these lines to adjust how many teams you're running at a time
# I have been running a couple of teams at a time by adjusting the iloc slice
current_teams = list(all_teams_df['team'].iloc[4:5])
# boolean mask of the players on the selected teams (avoids shadowing the built-in `filter`)
team_mask = all_players_df['team'].isin(current_teams)
current_players = all_players_df[team_mask]
current_players

Unnamed: 0,player_name,player_url,position,team,season
134,Kristijan Kahlina,https://fbref.com/en/players/53e360bb/Kristija...,GK,Charlotte,2024
135,Ashley Westwood,https://fbref.com/en/players/495945ef/Ashley-W...,MF,Charlotte,2024
136,Adilson Malanda,https://fbref.com/en/players/9d83969b/Adilson-...,DF,Charlotte,2024
137,Nathan Byrne,https://fbref.com/en/players/c4041fbc/Nathan-B...,DF,Charlotte,2024
138,Andrew Privett,https://fbref.com/en/players/4fd77159/Andrew-P...,DF,Charlotte,2024
139,Kerwin Vargas,https://fbref.com/en/players/758de7b4/Kerwin-V...,FW,Charlotte,2024
140,Jere Uronen,https://fbref.com/en/players/d75f3980/Jere-Uronen,DF,Charlotte,2024
141,Djibril Diani,https://fbref.com/en/players/4d095331/Djibril-...,MF,Charlotte,2024
142,Liel Abada,https://fbref.com/en/players/1339039e/Liel-Abada,FW,Charlotte,2024
143,Patrick Agyemang,https://fbref.com/en/players/d987142b/Patrick-...,FW,Charlotte,2024


In [7]:
failed_indices, failed_links = de.save_player_htmls(current_players, year)

|████████████████████████████████████████| 35/35 [100%] in 42:38.2 (0.01/s)     


In [8]:
failed_indices

[]

In [9]:
failed_links

[]
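
Both lists came back empty for this batch. If a later batch does report failures, one option is to retry just those links a few times before giving up. The sketch below is an assumption about how that could look, not something `src.data_extraction` provides: `retry_links`, its `fetch` callback, and the 6-second pause are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, List


def retry_links(
    links: List[str],
    fetch: Callable[[str], str],
    max_rounds: int = 3,
    delay_seconds: float = 6.0,  # assumed pause, not the site's documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Retry each link up to max_rounds times and return the links that still fail."""
    remaining = list(links)
    for _ in range(max_rounds):
        still_failing = []
        for link in remaining:
            try:
                fetch(link)
            except Exception:
                still_failing.append(link)
            sleep(delay_seconds)  # pause after every attempt to stay under the rate limit
        remaining = still_failing
        if not remaining:
            break
    return remaining


# dry run with a simulated fetch: one link recovers on its second try, one never does
attempts = {}


def flaky_fetch(link: str) -> str:
    attempts[link] = attempts.get(link, 0) + 1
    if link == "always-broken" or attempts[link] < 2:
        raise RuntimeError("simulated failure")
    return "<html></html>"


unresolved = retry_links(["fails-once", "always-broken"], flaky_fetch, sleep=lambda s: None)
print(unresolved)
```

Anything still left in the returned list after the retries is worth inspecting by hand, since the page may have moved or the player may have no stats tables.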