# Comparing NBA and Euroleague Basketball
# Part 1: Data Scraping

## Introduction

Basketball is one of the most popular sports in the world and it's easy to understand why: it involves a lot of scoring and its fast-paced tempo, paired with a rather straightforward set of rules, makes it accessible to a casual audience and exciting to watch. 

When talking about basketball, most people think about the NBA, the top league in the USA, which can count on some of the best players in the world. However, it is important to acknowledge that basketball is played professionally all around the world. Especially in Europe, this sport has deep solid roots and can count on a [long tradition](https://www.sports-fitness.co.uk/blog/growth-basketball-europe). In fact, many believe the level of competition there is comparable to the one of the NBA and, in [exhibition games](https://en.wikipedia.org/wiki/NBA_versus_EuroLeague_games#2010s) held during the off season, it's not unusual to see a European team beat an NBA squad.

The most important European basketball league is (very intuitively) called Euroleague. Unlike the NBA, where all franchises (except one) are American, teams from many different European countries participate in the Euroleague. Moreover, compared to the same 30 teams that are part of the NBA, only 16 to 18 different clubs compete in the Euroleague within a semi-open system; this means that, next to a slate of teams which return every year thanks to a special license, there are a few open slots assigned on merit according to the standings of each country's national league. Besides their format, the NBA and Euroleague also differ in some of their rules, from the size of the court to the actual in-game calls. The reader can refer to [this page](https://www.fiba.basketball/rule-differences) for an overview of these rule differences.   

---

In this notebook we collect the basketball data we will clean and analyze in the remaining parts of the project. We will scrape this data from the internet with the help of the `requests` library and the `Beautiful Soup` parser. The data will come from two main sources: [BasketballReference.com](https://www.basketball-reference.com/) for the NBA and [RealGM.com](https://basketball.realgm.com/international/league/1/Euroleague/home) for the Euroleague. For either league, we are interested in both teams' and players' data. The latter will include both statistical and biographical data.

## Scraping NBA Players' Stats

BasketballReference hosts webpages which contain the stats for all NBA players in each single season. All [these pages](https://www.basketball-reference.com/leagues/NBA_2017_per_game.html) have similar URLs which are easy to format.  

In [1]:
# Allow to run all code in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Define a list containing all the seasons we are interested in (on the URLs we use seaons are marked using their ending year)
years = range(2017, 2022)

In [3]:
# Initialize the URL of the pages containing the NBA players' stats
NBA_player_stats_URL = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"

In [4]:
# We need selenium to execute the page since it's a dynamic webpage
from selenium import webdriver

driver = webdriver.Chrome(executable_path = "C:\\Users\\Olivetti\\chromedriver")

  driver = webdriver.Chrome(executable_path = "C:\\Users\\Olivetti\\chromedriver")


In [5]:
import time

# Loop through each season, scrape the pages with the help of the driver and save them in separate files
for year in years:
    year_NBA_player_stats_URL = NBA_player_stats_URL.format(year)
    driver.get(year_NBA_player_stats_URL)
    driver.execute_script("window.scrollTo(1, 10000)")
    
    # Make the machine sleep for 2 seconds to give time to the driver to actually scroll down the page
    time.sleep(2)

    html = driver.page_source
    
    with open("scraped_webpages/NBA/NBA_player_stats/{}.html".format(year - 1), "w+", encoding="utf-8") as f:
        _ = f.write(html)
        
    # Wait 30 seconds before navigating to the next page
    time.sleep(30)

In [6]:
# Initialize a list to collect the NBA players' stats for each season
NBA_player_stats_by_year = []

# We will parse the HTML using Beautiful Soup
from bs4 import BeautifulSoup

# Loop through each season and parse the pages we scraped
for year in years:
    with open("scraped_webpages/NBA/NBA_player_stats/{}.html".format(year - 1), encoding="utf-8") as f:
        page = f.read()

    soup = BeautifulSoup(page, parser = "html")

    # First get rid of the intra-table headers
    to_remove = soup.find_all("tr", class_= "thead")
    for element in to_remove:
        element.decompose()
        
    # Now find the table we are interested in
    table = soup.find(id = "per_game_stats")

    import pandas as pd
    
    # Read the HTML table into a pandas DataFrame,
    pd_table = pd.read_html(str(table))[0]
    
    # Create a "Year" column and add it to the list
    pd_table["Year"] = year - 1
    NBA_player_stats_by_year.append(pd_table)

In [7]:
# Create a single cumulative DataFrame with the NBA players' stats from all seasons
NBA_player_stats = pd.concat(NBA_player_stats_by_year)

# Add a "League" column to remember this is NBA data
NBA_player_stats["League"] = "NBA"

In [8]:
# Take a look at the final result
NBA_player_stats

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,League
0,1,Álex Abrines,SG,23,OKC,68,6,15.5,2.0,5.0,...,1.0,1.3,0.6,0.5,0.1,0.5,1.7,6.0,2016,NBA
1,2,Quincy Acy,PF,26,TOT,38,1,14.7,1.8,4.5,...,2.5,3.0,0.5,0.4,0.4,0.6,1.8,5.8,2016,NBA
2,2,Quincy Acy,PF,26,DAL,6,0,8.0,0.8,2.8,...,1.0,1.3,0.0,0.0,0.0,0.3,1.5,2.2,2016,NBA
3,2,Quincy Acy,PF,26,BRK,32,1,15.9,2.0,4.8,...,2.8,3.3,0.6,0.4,0.5,0.6,1.8,6.5,2016,NBA
4,3,Steven Adams,C,23,OKC,80,80,29.9,4.7,8.2,...,4.2,7.7,1.1,1.1,1.0,1.8,2.4,11.3,2016,NBA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
700,536,Delon Wright,PG,28,SAC,27,8,25.8,3.9,8.3,...,2.9,3.9,3.6,1.6,0.4,1.3,1.1,10.0,2020,NBA
701,537,Thaddeus Young,PF,32,CHI,68,23,24.3,5.4,9.7,...,3.8,6.2,4.3,1.1,0.6,2.0,2.2,12.1,2020,NBA
702,538,Trae Young,PG,22,ATL,63,63,33.7,7.7,17.7,...,3.3,3.9,9.4,0.8,0.2,4.1,1.8,25.3,2020,NBA
703,539,Cody Zeller,C,28,CHO,48,21,20.9,3.8,6.8,...,4.4,6.8,1.8,0.6,0.4,1.1,2.5,9.4,2020,NBA


In [9]:
# Save the DataFrame to a CSV file
NBA_player_stats.to_csv("final_DataFrames/NBA/NBA_player_stats.csv")

## Scraping NBA Players' Biodata

The NBA players' biodata is contained in the roster table of [each team's webpage](https://www.basketball-reference.com/teams/ATL/2017.html) on BasketballReference. Even in this case the URLs are standardized and easy to format.

In [10]:
# Initialize the URL of the pages containing the NBA teams' rosters
NBA_rosters_URL = "https://www.basketball-reference.com/teams/{}/{}.html"

In [11]:
# Get the NBA teams' three letters nicknames from the DataFrame we created above (we need this to format the URLs)
NBA_team_nicknames = NBA_player_stats["Tm"].unique().tolist()

In [12]:
# Quick look at the result
len(NBA_team_nicknames)
print(*NBA_team_nicknames)

31

OKC TOT DAL BRK SAC NOP MIN SAS IND MEM POR CLE LAC PHI HOU MIL NYK DEN ORL MIA PHO GSW CHO DET ATL WAS LAL UTA BOS CHI TOR


In [13]:
# Remove the "TOT" nickname which doesn't belong to any team (it identifies players whith multiple teams in the same season)
NBA_team_nicknames.remove("TOT")

In [14]:
import requests

# Loop through each season, scrape the pages with the help of the request library and save them in separate files
for year in years:
    
    for team in NBA_team_nicknames:
        year_team_NBA_rosters_URL = NBA_rosters_URL.format(team, year)
        page = requests.get(year_team_NBA_rosters_URL)

        with open("scraped_webpages/NBA/NBA_rosters/{}/{}.html".format(year - 1, team), "w+", encoding = "utf-8") as f:
            _ = f.write(page.text);
            
        # Wait 30 seconds before sending the next web request
        time.sleep(30)    

In [15]:
# Initialize a list to collect the NBA rosters for each season
NBA_rosters_by_year = []

# Loop through each season
for year in years:
                
    # Initialize a DataFrame to contain the NBA rosters for the current season
    year_NBA_player_rosters = pd.DataFrame()
    
    # Loop through each team in the current season and parse the pages we scraped
    for team in NBA_team_nicknames:
        
        with open("scraped_webpages/NBA/NBA_rosters/{}/{}.html".format(year - 1, team), encoding = "utf-8") as f:
            page = f.read()

        soup = BeautifulSoup(page, parser = "html")
        
        # Find the table we are interested in
        table = soup.find(id = "roster")
        
        # Read the HTML table into a pandas DataFrame
        pd_table = pd.read_html(str(table))[0]
        
        # Update the DataFrame adding the roster we just pulled
        year_NBA_player_rosters = year_NBA_player_rosters.append(pd_table)
        
    # Get rid of players who played for multiple teams in one season (they will appear in multiple teams' rosters)
    year_NBA_player_rosters = year_NBA_player_rosters.drop_duplicates(subset = "Player")
    
    # Create the "Year" column and add the DataFrame to the list 
    year_NBA_player_rosters["Year"] = year - 1
    NBA_rosters_by_year.append(year_NBA_player_rosters)

In [16]:
# Create a single cumulative DataFrame with NBA rosters from all seasons
NBA_rosters = pd.concat(NBA_rosters_by_year)

# Add a "League" column to remember this is NBA data
NBA_rosters["League"] = "NBA"

In [17]:
# Take a look at the final result
NBA_rosters

Unnamed: 0,No.,Player,Pos,Ht,Wt,Birth Date,Unnamed: 6,Exp,College,Year,League
0,8,Álex Abrines,SG,6-6,200,"August 1, 1993",es,R,,2016,NBA
1,12,Steven Adams,C,6-11,265,"July 20, 1993",nz,3,Pitt,2016,NBA
2,6,Semaj Christon,PG,6-3,190,"November 1, 1992",us,R,Xavier,2016,NBA
3,30,Norris Cole,PG,6-2,175,"October 13, 1988",us,5,Cleveland State University,2016,NBA
4,4,Nick Collison,PF,6-10,255,"October 26, 1980",us,12,Kansas,2016,NBA
...,...,...,...,...,...,...,...,...,...,...,...
14,22,Patrick McCaw,SF,6-7,181,"October 25, 1995",us,4,UNLV,2020,NBA
16,43,Pascal Siakam,PF,6-9,230,"April 2, 1994",cm,4,New Mexico State,2020,NBA
19,23,Fred VanVleet,SG,6-1,197,"February 25, 1994",us,4,Wichita State,2020,NBA
20,18,Yuta Watanabe,SF,6-9,215,"October 13, 1994",jp,2,George Washington,2020,NBA


In [18]:
# Save the DataFrame to a CSV file
NBA_rosters.to_csv("final_DataFrames/NBA/NBA_rosters.csv")

## Scraping NBA Teams' Stats

BasketballReference hosts webpages which contain the stats for all NBA teams in each single season. All [these pages](https://www.basketball-reference.com/leagues/NBA_2017.html) have similar URLs which are easy to format.  

In [19]:
# Initialize the URL of the pages containing the NBA teams' stats
NBA_team_stats_URL = "https://www.basketball-reference.com/leagues/NBA_{}.html"

In [20]:
# Loop through each season, scrape the pages and save them in separate files
for year in years:
    year_NBA_team_stats_URL = NBA_team_stats_URL.format(year)
    page = requests.get(year_NBA_team_stats_URL)

    with open("scraped_webpages/NBA/NBA_team_stats/{}.html".format(year - 1), "w+", encoding = "utf-8") as f:
       _ = f.write(page.text)
    
    # Wait 30 seconds before sending the next web request
    time.sleep(30)    

In [21]:
# Initialize a list to collect the NBA teams' stats for each season
NBA_team_stats_by_year = []

# Loop through each season and parse the pages we scraped
for year in years:
    with open("scraped_webpages/NBA/NBA_team_stats/{}.html".format(year - 1), encoding = "utf-8") as f:
        page = f.read()

        soup = BeautifulSoup(page, parser = "html")
        
        # First get rid of useless footer
        soup.find("tfoot").decompose()
        
        # Now find the table we are interesting
        table = soup.find(id = "per_game-team")
        
        # Read the HTML table into a pandas DataFrame
        pd_table = pd.read_html(str(table))[0]
        
        # Create the "Year" column and add it to the list
        pd_table["Year"] = year - 1
        NBA_team_stats_by_year.append(pd_table)

In [22]:
# Create a single cumulative DataFrame with NBA teams' stats from all seasons
NBA_team_stats = pd.concat(NBA_team_stats_by_year)

# Add a "League" column to remember this is NBA data
NBA_team_stats["League"] = "NBA"

In [23]:
# Take a look at the final result
NBA_team_stats

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,League
0,1,Golden State Warriors*,82,241.2,43.1,87.1,0.495,12.0,31.2,0.383,...,35.0,44.4,30.4,9.6,6.8,14.8,19.3,115.9,2016,NBA
1,2,Houston Rockets*,82,241.2,40.3,87.2,0.462,14.4,40.3,0.357,...,33.5,44.4,25.2,8.2,4.3,15.1,19.9,115.3,2016,NBA
2,3,Denver Nuggets,82,240.9,41.2,87.7,0.469,10.6,28.8,0.368,...,34.6,46.4,25.3,6.9,3.9,15.0,19.1,111.7,2016,NBA
3,4,Cleveland Cavaliers*,82,242.4,39.9,84.9,0.470,13.0,33.9,0.384,...,34.4,43.7,22.7,6.6,4.0,13.7,18.1,110.3,2016,NBA
4,5,Washington Wizards*,82,242.1,41.3,87.0,0.475,9.2,24.8,0.372,...,32.6,42.9,23.9,8.5,4.1,14.2,21.3,109.2,2016,NBA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,26,New York Knicks*,72,242.1,39.4,86.5,0.456,11.8,30.0,0.392,...,35.5,45.1,21.4,7.0,5.1,12.9,20.5,107.0,2020,NBA
26,27,Detroit Pistons,72,242.1,38.7,85.6,0.452,11.6,32.9,0.351,...,33.1,42.7,24.2,7.4,5.2,14.9,20.5,106.6,2020,NBA
27,28,Oklahoma City Thunder,72,241.0,38.8,88.0,0.441,11.9,35.1,0.339,...,35.7,45.6,22.1,7.0,4.4,16.1,18.1,105.0,2020,NBA
28,29,Orlando Magic,72,240.7,38.3,89.2,0.429,10.9,31.8,0.343,...,35.1,45.4,21.8,6.9,4.4,12.8,17.2,104.0,2020,NBA


In [24]:
# Save the DataFrame to a CSV file
NBA_team_stats.to_csv("final_DataFrames/NBA/NBA_team_stats.csv")

## Scraping NBA Teams' Records

To make our analysis more complete, we are also interested in teams' winning records. The records of the NBA teams are contained in tables appearing in the [same webpages](https://www.basketball-reference.com/leagues/NBA_2017.html) we just scraped. 

In [25]:
# Initialize a list to collect the NBA teams' records for each season
NBA_team_records_by_year = []

# Loop through each season and parse the pages we scraped
for year in years:
    with open("scraped_webpages/NBA/NBA_team_stats/{}.html".format(year - 1), encoding = "utf-8") as f:
        page = f.read()

        # Find the tables with the teams' records (one for each conference)
        table_E = soup.find(id = "confs_standings_E")
        table_W = soup.find(id = "confs_standings_W")
        
        # Read the HTML tables into a pandas DataFrame
        pd_table_E = pd.read_html(str(table_E))[0]
        pd_table_W = pd.read_html(str(table_W))[0]
        
        # Rename the column with the team names
        pd_table_E = pd_table_E.rename(columns = {"Eastern Conference" : "Team"})
        pd_table_W = pd_table_W.rename(columns = {"Western Conference" : "Team"})
        
        # Combine the two DataFrames, add a "Year" column and add the result to the list
        pd_table = pd.concat([pd_table_E, pd_table_W], ignore_index = True)
        pd_table["Year"] = year - 1
        NBA_team_records_by_year.append(pd_table)

In [26]:
# Create a single cumulative DataFrame with NBA teams' records from all seasons
NBA_team_records = pd.concat(NBA_team_records_by_year, ignore_index = True)

# Add a "League" column to remember this is NBA data
NBA_team_records["League"] = "NBA"

In [27]:
# Take a look at the final result
NBA_team_records

Unnamed: 0,Team,W,L,W/L%,GB,PS/G,PA/G,SRS,Year,League
0,Philadelphia 76ers*,49,23,0.681,—,113.6,108.1,5.28,2016,NBA
1,Brooklyn Nets*,48,24,0.667,1.0,118.6,114.1,4.24,2016,NBA
2,Milwaukee Bucks*,46,26,0.639,3.0,120.1,114.2,5.57,2016,NBA
3,New York Knicks*,41,31,0.569,8.0,107.0,104.7,2.13,2016,NBA
4,Atlanta Hawks*,41,31,0.569,8.0,113.7,111.4,2.14,2016,NBA
...,...,...,...,...,...,...,...,...,...,...
145,New Orleans Pelicans,31,41,0.431,21.0,114.6,114.9,-0.20,2020,NBA
146,Sacramento Kings,31,41,0.431,21.0,113.7,117.4,-3.45,2020,NBA
147,Minnesota Timberwolves,23,49,0.319,29.0,112.1,117.7,-5.25,2020,NBA
148,Oklahoma City Thunder,22,50,0.306,30.0,105.0,115.6,-10.13,2020,NBA


In [28]:
# Save the DataFrame to a CSV file
NBA_team_records.to_csv("final_DataFrames/NBA/NBA_team_records.csv")

## Scraping EL Players' Stats

RealGM hosts webpages which contain the stats for all EL players in each single season. Each season's stats are collected into three separate tables which appear on separate webpages. However, all [these pages](https://basketball.realgm.com/international/league/1/Euroleague/stats/2017/Averages/All/All/points/All/desc/1) have similar URLs which are easy to format.  

In [29]:
# Initialize the URL of the pages containing the EL players' stats
EL_player_stats_URL = "https://basketball.realgm.com/international/league/1/Euroleague/stats/{}/Averages/All/All/points/All/desc/{}"

In [30]:
# Loop through each season
for year in years:
    
    # Loop through each table for the current season, scrape the each page and save it in separate files
    for i in range(1, 4):
        
        year_EL_player_stats_URL = EL_player_stats_URL.format(year, i)
        page = requests.get(year_EL_player_stats_URL.format(year, i))

        with open("scraped_webpages/EL/EL_player_stats/{}/{}_{}.html".format(year-1, year-1, i), "w+", encoding = "utf-8") as f:
            _ = f.write(page.text)
            
        # Wait 30 seconds before sending the next web request
        time.sleep(30)        

In [31]:
# Initialize a list to collect the EL players' stats for each season
EL_player_stats_by_year = []

# Loop through each season
for year in years:
    
    # Initialize a DataFrame to contain the EL players' stats for the current season
    year_EL_player_stats = pd.DataFrame()
    
    # Loop through each page containing the tables from the current season and parse the page
    for i in range(1, 4):

        with open("scraped_webpages/EL/EL_player_stats/{}/{}_{}.html".format(year-1, year-1, i), encoding = "utf-8") as f:
            page = f.read()

        soup = BeautifulSoup(page, parser = "html")
        table = soup.find(class_ = "tablesaw compact")
        
        # Read the HTML table into a pandas DataFrame
        pd_table = pd.read_html(str(table))[0] 
        
        # Update the DataFrame adding the table we just pulled
        year_EL_player_stats = year_EL_player_stats.append(pd_table)
    
    # Create the "Year" column and add the DataFrame to the list 
    year_EL_player_stats["Year"] = year - 1
    EL_player_stats_by_year.append(year_EL_player_stats)

In [32]:
# Create a single cumulative DataFrame with EL players' stats from all seasons
EL_player_stats = pd.concat(EL_player_stats_by_year, ignore_index = True)

# Add a "League" column to remember this is EL data
EL_player_stats["League"] = "EL"

In [33]:
# Take a look at the final result
EL_player_stats

Unnamed: 0,#,Player,Team,GP,MPG,PPG,FGM,FGA,FG%,3PM,...,ORB,DRB,RPG,APG,SPG,BPG,TOV,PF,Year,League
0,1,Keith Langford,UNI,28,34.0,21.8,6.9,16.8,0.409,2.0,...,0.7,2.7,3.4,3.7,0.8,0.2,2.4,1.8,2016,EL
1,2,Nando De Colo,CSKA,28,27.1,19.1,6.3,12.2,0.516,1.5,...,0.1,2.8,2.9,3.9,1.0,0.1,3.2,2.2,2016,EL
2,3,Andrew Goudelock,MAC,20,28.7,17.2,6.3,12.5,0.508,2.5,...,0.2,2.5,2.6,2.9,0.2,0.0,1.7,1.6,2016,EL
3,4,Brad Wanamaker,DBI,34,33.5,16.7,5.3,11.8,0.448,1.6,...,0.5,2.6,3.1,4.7,1.5,0.1,3.0,2.8,2016,EL
4,5,Sergio Llull,RMB,33,27.8,16.5,5.4,12.9,0.416,2.2,...,0.3,1.5,1.8,5.9,0.7,0.1,2.3,1.9,2016,EL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1344,295,Yonatan Atias,MAC,2,2.6,0.0,0.0,1.0,0.000,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.5,0.0,2020,EL
1345,296,Andrey Lopatin,CSKA,2,3.4,0.0,0.0,0.5,0.000,0.0,...,0.5,0.5,1.0,0.0,0.0,0.0,1.0,0.0,2020,EL
1346,297,Khadeen Carrington,BASK,1,0.0,0.0,0.0,0.0,0.000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2020,EL
1347,298,Dzanan Musa,EFE,4,2.9,0.0,0.0,0.5,0.000,0.0,...,0.0,0.5,0.5,0.2,0.2,0.0,0.5,0.5,2020,EL


In [34]:
# Save the DataFrame to a CSV file
EL_player_stats.to_csv("final_DataFrames/EL/EL_player_stats.csv")

## Scraping EL Players' Biodata

The EL players' biodata are contained in the roster table of [each team's webpage](https://basketball.realgm.com/international/league/1/Euroleague/team/55/Anadolu-Efes/rosters) on RealGM. The URLs of these webpages contain the full team name rather than the three letters nickname that appear in the `EL_player_stats` DataFrame we just created. They also contain a numeric code which is specific to each team. Hence, in order to make the URLs formatting easier, we first fix the `Team` column with the help of a CSV file which also contains each team's numeric code 

In [35]:
# Read the supporting CSV file we will use
EL_teams_nicknames = pd.read_csv("auxiliary_DataFrames/EL_teams_nicknames.csv")

In [36]:
# Take a look at the DataFrame
EL_teams_nicknames 

Unnamed: 0,Team,Nickname,URL_code
0,Anadolu-Efes,EFE,55
1,AX-Armani-Exchange-Milan,MIL,5
2,Barca,FCB,23
3,Baskonia,BASK,18
4,Brose-Baskets-Bamberg,BRO,109
5,CSKA-Moscow,CSKA,44
6,Darussafaka-Basketbol-Istanbul,DBI,1114
7,Fenerbahce-Ulker,FEN,10
8,Galatasaray,GAL,338
9,KK-Crvena-Zvezda,ZVE,180


In [37]:
# Leave out UNICS-Kazan since its "UNI" nickname is not unique (it also used by the Unicaja team)
EL_unique_nicknames = EL_teams_nicknames.iloc[:-1] 

In [38]:
# Hardcode the data for UNICS-Kazan in the EL_player_stats DataFrame
EL_player_stats.loc[(EL_player_stats["Team"] == "UNI") & (EL_player_stats["Year"] == 2016),
                    "Team"] = "UNICS-Kazan" 

In [39]:
# Convert the EL_unique_nicknames DataFrame to a dictionary with (Nickname, Team) key, value pairs
EL_teams_nicknames_dict = EL_unique_nicknames.set_index("Nickname")["Team"].to_dict() 

In [40]:
# Use the dictionary to replace the nicknames in the "Team" column with the teams' full names
EL_player_stats["Team"] = EL_player_stats["Team"].replace(EL_teams_nicknames_dict)

In [41]:
# Update the CSV file with the new team names format
EL_player_stats.to_csv("final_DataFrames/EL/EL_player_stats.csv")

Given the EL semi-open system, different teams participate each year so we need to consider a different collection of URLs for each season. To make our life easier, we include all the information we need into a dictionary. More specifically, this dictionary will have keys corresponding to the EL teams'; the value of each key is another dictionary containing the list of years the team participated to the EL and the numeric code used in the team's page on RealGM.   

In [42]:
# Initialize an empty dictionary
scraping_dict = dict()

# Collect all the EL teams' names
EL_teams = list(EL_player_stats["Team"].unique())

# Loop through each team and update the dictionary with the info we described above
for team in EL_teams:
    
    # Create a dictionary for each team
    scraping_dict[team] = dict()
    
    # Get the years the current team participated to the EL and include them in the dictionary
    team_years = list(EL_player_stats.loc[EL_player_stats["Team"] == team, "Year"].unique())
    scraping_dict[team]["years"] = team_years
    
    # Get the current team's code that appears in its URL and include it in the dictionary
    scraping_dict[team]["URL_code"] = EL_teams_nicknames.loc[EL_teams_nicknames["Team"] == team, "URL_code"].iloc[0]

In [43]:
# Make sure everything looks good
scraping_dict

{'UNICS-Kazan': {'years': [2016], 'URL_code': 146},
 'CSKA-Moscow': {'years': [2016, 2017, 2018, 2019, 2020], 'URL_code': 44},
 'Maccabi-FOX-Tel-Aviv': {'years': [2016, 2017, 2018, 2019, 2020],
  'URL_code': 4},
 'Darussafaka-Basketbol-Istanbul': {'years': [2016, 2018], 'URL_code': 1114},
 'Real-Madrid': {'years': [2016, 2017, 2018, 2019, 2020], 'URL_code': 52},
 'Fenerbahce-Ulker': {'years': [2016, 2017, 2018, 2019, 2020], 'URL_code': 10},
 'Panathinaikos': {'years': [2016, 2017, 2018, 2019, 2020], 'URL_code': 9},
 'Baskonia': {'years': [2016, 2017, 2018, 2019, 2020], 'URL_code': 18},
 'Zalgiris': {'years': [2016, 2017, 2018, 2019, 2020], 'URL_code': 50},
 'Barca': {'years': [2016, 2017, 2018, 2019, 2020], 'URL_code': 23},
 'Olympiacos': {'years': [2016, 2017, 2018, 2019, 2020], 'URL_code': 120},
 'Anadolu-Efes': {'years': [2016, 2017, 2018, 2019, 2020], 'URL_code': 55},
 'KK-Crvena-Zvezda': {'years': [2016, 2017, 2019, 2020], 'URL_code': 180},
 'AX-Armani-Exchange-Milan': {'years': [

Now we are finally ready to format the URLS and scrape the pages.

In [44]:
# Loop through the teams
for team in scraping_dict.keys():
    
    # Get the current team's scraping info
    team_years = scraping_dict[team]["years"]
    team_url_code = scraping_dict[team]["URL_code"]
    
    # Loop through season the team participated to the EL 
    for year in team_years:
        
        # Format the URL of the webpage containing the team's roster for the current year, scrape the page and save it
        year_EL_roster_URL = "https://basketball.realgm.com/international/league/1/Euroleague/team/{}/{}/rosters/{}".format(
            team_url_code, team, year + 1)
        page = requests.get(year_EL_roster_URL)
        
        with open("scraped_webpages/EL/EL_rosters/{}/{}.html".format(team, year), "w+", encoding = "utf-8") as f:
            _ = f.write(page.text)
            
        # Wait 30 seconds before sending the next web request
        time.sleep(30)        

In [45]:
# Initialize a list to collect the EL rosters for each season
EL_rosters_by_team = []

# Loop through the teams
for team in scraping_dict.keys():
    team_years = scraping_dict[team]["years"]
    
    # Loop through season the current team participated to the EL and parse the pages we scraped 
    for year in team_years:
             
        with open("scraped_webpages/EL/EL_rosters/{}/{}.html".format(team, year), encoding = "utf-8") as f:
            page = f.read()
            
        soup = BeautifulSoup(page, parser = "html")    
        
        # Find the table we are interested in
        table = soup.find(class_ = "tablesaw")
        
        # Read the HTML table into a pandas DataFrame
        pd_table = pd.read_html(str(table))[0]
        
        # Create the "Year" column and add the DataFrame to the list 
        pd_table["Year"] = year
        EL_rosters_by_team.append(pd_table)

In [46]:
# Create a single cumulative DataFrame with EL rosters from all seasons
EL_rosters = pd.concat(EL_rosters_by_team, ignore_index = True)

# Get rid of players who played for multiple teams in one season (they will appear in multiple teams' rosters)
EL_rosters = EL_rosters.drop_duplicates(subset = ["Player", "Year"])

# Add a "League" column to remember this EL data
EL_rosters["League"] = "EL"

In [47]:
# Take a look at the final result
EL_rosters

Unnamed: 0,#,Player,Pos,Height,Weight,Age,Birth City,NBA Draft Status,Nationality,Year,League
0,-,Danilo Andjusic,SG,6-4,200,25,Belgrade,"2013 NBA Draft, Undrafted",Serbia,2016,EL
1,2,Pavel Antipov,F,6-9,212,25,Tatarstan,"2013 NBA Draft, Undrafted",Russia,2016,EL
2,13,Marko Banic,SF,6-8,250,32,Zadar,"2006 NBA Draft, Undrafted",Croatia,2016,EL
3,4,Coty Clarke,F,6-7,235,24,Antioch (TN),"2014 NBA Draft, Undrafted",United States,2016,EL
4,12,Quim Colom,PG,6-2,194,28,Andorra La Vella,"2010 NBA Draft, Undrafted",AndorraSpain,2016,EL
...,...,...,...,...,...,...,...,...,...,...,...
1608,-,Elwin Ndjock,F,6-7,-,19,Lyon,2023 NBA Draft Eligible,France,2020,EL
1609,12,Amine Noua,F,6-8,196,24,Lyon,"2019 NBA Draft, Undrafted",France,2020,EL
1610,32,Matthew Strazel,PG,6-0,178,18,Paris,2024 NBA Draft Eligible,France,2020,EL
1611,-,"Derrick Walton, Jr.",PG,6-0,189,25,Harper Woods (MI),"2017 NBA Draft, Undrafted",United States,2020,EL


In [48]:
# Save the DataFrame to a CSV file
EL_rosters.to_csv("final_DataFrames/EL/EL_rosters.csv")

## Scraping EL Teams' Stats

RealGM hosts webpages which contain the stats for all EL teams in each single season. All [these pages](https://basketball.realgm.com/international/league/1/Euroleague/team-stats/2017) have similar URLs which are easy to format. 

In [49]:
# Initialize the URL of the pages containing the EL teams' stats
EL_team_stats_URL = "https://basketball.realgm.com/international/league/1/Euroleague/team-stats/{}"

In [50]:
# Loop through each season, scrape the pages and save them in separate files
for year in years:
    year_EL_team_stats_URL = EL_team_stats_URL.format(year)
    page = requests.get(year_EL_team_stats_URL)
    
    with open("scraped_webpages/EL/EL_team_stats/{}.html".format(year-1), "w+", encoding = "utf-8") as f:
        _ = f.write(page.text)    
        
    # Wait 30 seconds before sending the next web request
    time.sleep(30)        

In [51]:
# Initialize a list to collect the EL teams' stats for each season
EL_team_stats_by_year = []

# Loop through each season and parse the pages we scraped
for year in years:
    
    with open("scraped_webpages/EL/EL_team_stats/{}.html".format(year - 1), encoding = "utf-8") as f:
        page = f.read()
    
    soup = BeautifulSoup(page, parser = "html")
    
    # Find the table we are interested in
    table = soup.find(class_ = "tablesaw")
    
    # Read the HTML table into a pandas DataFrame
    pd_table = pd.read_html(str(table))[0]

    # Create the "Year" column and add the DataFrame to the list 
    pd_table["Year"] = year - 1
    EL_team_stats_by_year.append(pd_table)

In [52]:
# Create a cumulative DataFrame with the EL teams' stats from all seasons
EL_team_stats = pd.concat(EL_team_stats_by_year, ignore_index = True)

# Create a League column to remember this is EL data
EL_team_stats["League"] = "EL"

In [53]:
# Take a look at the final result
EL_team_stats

Unnamed: 0,#,Team,GP,MPG,PPG,FGM,FGA,FG%,3PM,3PA,...,ORB,DRB,RPG,APG,SPG,BPG,TOV,PF,Year,League
0,1,CSKA Moscow,35,40.3,87.5,30.0,60.2,0.498,8.7,21.6,...,7.8,22.3,30.1,19.9,7.1,3.2,14.1,22.6,2016,EL
1,2,Real Madrid,36,40.0,85.3,30.8,63.4,0.485,9.7,26.1,...,9.4,23.6,33.0,20.2,6.7,2.9,11.9,21.1,2016,EL
2,3,Anadolu Efes,35,40.6,82.0,30.0,67.3,0.445,7.6,22.5,...,10.6,21.6,32.2,17.9,6.9,3.4,11.3,19.4,2016,EL
3,4,Baskonia,33,40.0,82.0,30.0,64.0,0.468,8.2,23.1,...,9.3,23.5,32.8,18.4,6.7,3.0,13.2,21.6,2016,EL
4,5,AX Armani Exchange Milan,30,40.2,80.7,29.5,63.0,0.469,7.1,20.2,...,9.4,20.4,29.8,17.4,7.0,1.5,14.0,22.0,2016,EL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79,14,Khimki,34,40.3,77.5,27.5,62.2,0.442,9.3,26.8,...,7.5,20.8,28.3,18.2,7.0,3.3,13.9,22.1,2020,EL
80,15,Zenit Saint Petersburg,39,40.1,77.5,27.2,56.7,0.480,9.2,24.5,...,7.7,20.5,28.1,17.5,6.0,2.0,12.2,20.9,2020,EL
81,16,Zalgiris,34,40.0,77.4,28.4,58.5,0.486,8.4,20.0,...,8.1,20.7,28.8,18.3,6.4,1.1,13.3,19.4,2020,EL
82,17,ASVEL Basket,34,40.3,76.7,27.4,57.9,0.474,8.4,21.4,...,7.3,20.8,28.1,16.4,6.6,3.2,14.5,21.5,2020,EL


In [54]:
# Save the DataFrame to a CSV file
EL_team_stats.to_csv("final_DataFrames/EL/EL_team_stats.csv")

## Scraping EL Teams' Records

The records of the EL teams are contained in tables appearing on RealGM in a [different webpage](https://basketball.realgm.com/international/league/1/Euroleague/standings/491/2017) for each season. Since the URL for each of this pages includes a numeric code specific to each season, it's faster to just hardcode all the URLs and put them into a list.    

In [55]:
# Collect the URLs of the pages containing the EL teams' records
EL_team_records_URL = ["https://basketball.realgm.com/international/league/1/Euroleague/standings/491/2017",
                       "https://basketball.realgm.com/international/league/1/Euroleague/standings/580/2018",
                       "https://basketball.realgm.com/international/league/1/Euroleague/standings/727/2019",
                       "https://basketball.realgm.com/international/league/1/Euroleague/standings/829/2020",
                       "https://basketball.realgm.com/international/league/1/Euroleague/standings/932/2021"
                        ]

In [56]:
# Start from the first season (we'll use this to name the files we create to save the pages we scrape)
year = 2017


# Loop through each season, scrape the pages and save them in separate files
for url in EL_team_records_URL:
    
    page = requests.get(url)

    with open("scraped_webpages/EL/EL_team_records/{}.html".format(year - 1), "w+", encoding = "utf-8") as f:
        _ = f.write(page.text)
    
    year += 1
    
    # Wait 30 seconds before sending the next web request
    time.sleep(30)    

In [57]:
# Initialize a list to collect the EL team's records for each season
EL_team_records_by_year = []

# Loop through each season and parse the pages we scraped
for year in years:
    
    with open("scraped_webpages/EL/EL_team_records/{}.html".format(year - 1), encoding = "utf-8") as f:
        page = f.read()
    
    soup = BeautifulSoup(page, parser = "html")
    
    # Find the table we are interested in
    table = soup.find(class_ = "tablesaw")
    
    # Read the HTML table into a pandas DataFrame
    pd_table = pd.read_html(str(table))[0]
    
    # Create the "Year" column and add the DataFrame to the list 
    pd_table["Year"] = year - 1
    EL_team_records_by_year.append(pd_table)

In [58]:
# Create a single cumulative DataFrame with the EL teams' records from all seasons
EL_team_records = pd.concat(EL_team_records_by_year, ignore_index = True)

# Create a League column to remember this is EL data
EL_team_records["League"] = "EL"

In [59]:
# Take a look at the final result
EL_team_records

Unnamed: 0,#,Team,W,L,PCT,GB,L10,STRK,PPG,OPPG,DIFF,Home,Away,Year,League
0,1,Real Madrid,23,7,0.767,0,7-3,L 2,86.2,78.4,7.8,14-1,9-6,2016,EL
1,2,CSKA Moscow,22,8,0.733,1,7-3,W 1,87.3,79.6,7.7,14-1,8-7,2016,EL
2,3,Olympiacos,19,11,0.633,4,4-6,L 1,77.9,74.2,3.7,11-4,8-7,2016,EL
3,4,Panathinaikos,19,11,0.633,4,5-5,L 3,77.5,74.5,3.0,14-1,5-10,2016,EL
4,5,Fenerbahce Ulker,18,12,0.600,5,7-3,W 6,76.2,74.8,1.4,11-4,7-8,2016,EL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79,14,ASVEL Basket,13,21,0.382,11,3-7,L 1,76.7,80.6,-3.9,8-9,5-12,2020,EL
80,15,ALBA Berlin,12,22,0.353,12,4-6,W 1,78.6,82.7,-4.1,6-11,6-11,2020,EL
81,16,Panathinaikos,11,23,0.324,13,2-8,L 3,79.7,83.7,-4.0,8-9,3-14,2020,EL
82,17,KK Crvena Zvezda,10,24,0.294,14,3-7,L 1,74.1,79.7,-5.6,6-11,4-13,2020,EL


In [60]:
# Save the DataFrame to a CSV file
EL_team_records.to_csv("final_DataFrames/EL/EL_team_records.csv")