This notebook illustrates how to scrape data from the NBA statistics webpage. This notebook is set up for educational and learning purpose only. In the previous notebook, IPL stats were retrieved using trivial methods with a static url. However, it is possible to exploit the pattern in the url to retrieve a large amount of data as seen in this notebook.

In [1]:
# !pip3 install requests
# !pip3 install pandas

In [2]:
import requests
import os
import pandas as pd
import logging
import time

Set up logging.

In [3]:
logging_dir_path = "../logging"
logging_filename = "logging.log"
if not os.path.exists(logging_dir_path):
    os.mkdir(logging_dir_path)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    filename=logging_dir_path + "/" + logging_filename,
)

#### URL to retrieve the data from.
To scrape data from a web page, the url corresponding to that webpage needs to located. However, just the web url of the webpage would not return the data since an api call needs to be made to that url. The url for the api call can be retrieved from the webpage inspection on the web browser. Entering that url on the browser returns the desired data. Going through the raw data on the browser should give enough understanding on how to retrieve that data as desired.<br><br>
In this case, the link was copied as cURL and then converted into python code using curlconverter.com.

#### URL to retrieve the data from.
To scrape data from a web page, the url corresponding to that webpage needs to located. However, just the web url of the webpage would not return the data since an api call needs to be made to that url. The url for the api call can be retrieved from the webpage inspection on the web browser. Entering that url on the browser returns the desired data. Going through the raw data on the browser should give enough understanding on how to retrieve that data as desired.<br><br>
In this case, the link was copied as cURL and then converted into python code using curlconverter.com.

In [4]:
headers = {
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Origin": "https://www.nba.com",
    "Referer": "https://www.nba.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-site",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "sec-ch-ua": '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"macOS"',
}

params = {
    "College": "",
    "Conference": "",
    "Country": "",
    "DateFrom": "",
    "DateTo": "",
    "Division": "",
    "DraftPick": "",
    "DraftYear": "",
    "GameScope": "",
    "GameSegment": "",
    "Height": "",
    "LastNGames": "0",
    "LeagueID": "00",
    "Location": "",
    "MeasureType": "Base",
    "Month": "0",
    "OpponentTeamID": "0",
    "Outcome": "",
    "PORound": "0",
    "PaceAdjust": "N",
    "PerMode": "Totals",
    "Period": "0",
    "PlayerExperience": "",
    "PlayerPosition": "",
    "PlusMinus": "N",
    "Rank": "N",
    "Season": "2022-23",
    "SeasonSegment": "",
    "SeasonType": "Playoffs",
    "ShotClockRange": "",
    "StarterBench": "",
    "TeamID": "0",
    "VsConference": "",
    "VsDivision": "",
    "Weight": "",
}

After some inspeaction of the url by choosing a different stat on the page, it is known that the seasons and the season types can be altered in the url to retrieve the corresponding data.

In [5]:
start_time = time.time()
scraped_text_json = requests.get(
    "https://stats.nba.com/stats/playerindex", params=params, headers=headers
).json()
msg = (
    "total time take to retrieve all data: "
    + str(time.time() - start_time)
    + " seconds."
)
logging.info(msg)
print(msg)

total time take to retrieve all data: 0.71169114112854 seconds.


#### Identify column names and data in the json data.
After tinkering with the **preview** section under Fetch/XHR of the Chrome webpage **Inspect** option, the field names and values can be retrieved by exploiting the observed info as below.

In [6]:
cols = scraped_text_json["resultSets"][0]["headers"]
players_data = scraped_text_json["resultSets"][0]["rowSet"]
df_players_data = pd.DataFrame(data=players_data, columns=cols)
df_players_data.head(5)

Unnamed: 0,PERSON_ID,PLAYER_LAST_NAME,PLAYER_FIRST_NAME,PLAYER_SLUG,TEAM_ID,TEAM_SLUG,IS_DEFUNCT,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,...,DRAFT_YEAR,DRAFT_ROUND,DRAFT_NUMBER,ROSTER_STATUS,FROM_YEAR,TO_YEAR,PTS,REB,AST,STATS_TIMEFRAME
0,1630173,Achiuwa,Precious,precious-achiuwa,1610612761,raptors,0,Toronto,Raptors,TOR,...,2020.0,1.0,20.0,1.0,2020,2023,9.2,6.0,0.9,Season
1,203500,Adams,Steven,steven-adams,1610612763,grizzlies,0,Memphis,Grizzlies,MEM,...,2013.0,1.0,12.0,1.0,2013,2023,8.6,11.5,2.3,Season
2,1628389,Adebayo,Bam,bam-adebayo,1610612748,heat,0,Miami,Heat,MIA,...,2017.0,1.0,14.0,1.0,2017,2023,20.4,9.2,3.2,Season
3,1630534,Agbaji,Ochai,ochai-agbaji,1610612762,jazz,0,Utah,Jazz,UTA,...,2022.0,1.0,14.0,1.0,2022,2023,7.9,2.1,1.1,Season
4,1630583,Aldama,Santi,santi-aldama,1610612763,grizzlies,0,Memphis,Grizzlies,MEM,...,2021.0,1.0,30.0,1.0,2021,2023,9.0,4.8,1.3,Season


#### Feting all seasons.
Note that the data above shows the info only for a single year and the regular season. In order to retrieve the data from all seasons as well as all season types, a for loop can be implemented.

In [7]:
years = [
    "1996-97",
    "1997-98",
    "1998-99",
    "1999-00",
    "2000-01",
    "2002-03",
    "2003-04",
    "2004-05",
    "2005-06",
    "2006-07",  # add more years as desired.
]
season_types = ["Preseason", "Regular Season", "Playoffs", "All-Star", "Play In"]

In [8]:
# create the header for API call
def get_api_header_and_params(year, season_type):
    headers = {
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        "Origin": "https://www.nba.com",
        "Referer": "https://www.nba.com/",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-site",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
        "sec-ch-ua": '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"macOS"',
    }

    params = {
        "College": "",
        "Conference": "",
        "Country": "",
        "DateFrom": "",
        "DateTo": "",
        "Division": "",
        "DraftPick": "",
        "DraftYear": "",
        "GameScope": "",
        "GameSegment": "",
        "Height": "",
        "LastNGames": "0",
        "LeagueID": "00",
        "Location": "",
        "MeasureType": "Base",
        "Month": "0",
        "OpponentTeamID": "0",
        "Outcome": "",
        "PORound": "0",
        "PaceAdjust": "N",
        "PerMode": "Totals",
        "Period": "0",
        "PlayerExperience": "",
        "PlayerPosition": "",
        "PlusMinus": "N",
        "Rank": "N",
        "Season": year,  # season/ year
        "SeasonSegment": "",
        "SeasonType": season_type,  # season type
        "ShotClockRange": "",
        "StarterBench": "",
        "TeamID": "0",
        "VsConference": "",
        "VsDivision": "",
        "Weight": "",
    }

    return headers, params

In [9]:
cols = scraped_text_json["resultSets"][0]["headers"]
df_cumulative_players_data = pd.DataFrame(columns=cols)

start_time = time.time()
for current_year in years:
    logging.info("Year: " + str(current_year))

    for current_season in season_types:
        logging.info("Season Type: " + str(current_season))
        headers, params = get_api_header_and_params(
            year=current_year, season_type=current_season
        )

        scraped_data_json = requests.get(
            "https://stats.nba.com/stats/playerindex", params=params, headers=headers
        ).json()
        players_data = scraped_text_json["resultSets"][0]["rowSet"]

        # create df for current season type player data
        single_season_type_df = pd.DataFrame(columns=cols, data=players_data)
        # create a column for the season type data
        current_season_type_col_df = pd.DataFrame(
            {
                "Year": [current_year for _ in range(len(single_season_type_df))],
                "Season Type": [
                    current_season for _ in range(0, len(single_season_type_df))
                ],
            }
        )

        current_season_type_col_df = pd.concat(
            [current_season_type_col_df, single_season_type_df], axis=1
        )

        # add current season type data to the cumulative df
        df_cumulative_players_data = pd.concat(
            [df_cumulative_players_data, current_season_type_col_df], axis=0
        )

msg = "total time taken: " + str(time.time() - start_time) + " seconds."
logging.info(msg)
msg

'total time taken: 22.3494610786438 seconds.'

In [11]:
msg = "Total records retrieved: " + str(len(df_cumulative_players_data))
logging.info(msg)
print(msg)

Total records retrieved: 27050


In [12]:
display(df_cumulative_players_data.head(2))

Unnamed: 0,PERSON_ID,PLAYER_LAST_NAME,PLAYER_FIRST_NAME,PLAYER_SLUG,TEAM_ID,TEAM_SLUG,IS_DEFUNCT,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,...,DRAFT_NUMBER,ROSTER_STATUS,FROM_YEAR,TO_YEAR,PTS,REB,AST,STATS_TIMEFRAME,Year,Season Type
0,1630173,Achiuwa,Precious,precious-achiuwa,1610612761,raptors,0,Toronto,Raptors,TOR,...,20.0,1.0,2020,2023,9.2,6.0,0.9,Season,1996-97,Preseason
1,203500,Adams,Steven,steven-adams,1610612763,grizzlies,0,Memphis,Grizzlies,MEM,...,12.0,1.0,2013,2023,8.6,11.5,2.3,Season,1996-97,Preseason


In [13]:
display(df_cumulative_players_data.tail(2))

Unnamed: 0,PERSON_ID,PLAYER_LAST_NAME,PLAYER_FIRST_NAME,PLAYER_SLUG,TEAM_ID,TEAM_SLUG,IS_DEFUNCT,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,...,DRAFT_NUMBER,ROSTER_STATUS,FROM_YEAR,TO_YEAR,PTS,REB,AST,STATS_TIMEFRAME,Year,Season Type
539,203469,Zeller,Cody,cody-zeller,1610612748,heat,0,Miami,Heat,MIA,...,4.0,1.0,2013,2023,6.5,4.3,0.7,Season,2006-07,Play In
540,1627826,Zubac,Ivica,ivica-zubac,1610612746,clippers,0,LA,Clippers,LAC,...,32.0,1.0,2016,2023,10.8,9.9,1.0,Season,2006-07,Play In


#### Store the data in the desired dir.

In [14]:
filename = "nba_players_info_season_1997_to_2007.xlsx"
# create dir
relative_dir_path = "../data"
if not os.path.exists(relative_dir_path):
    os.mkdir(relative_dir_path)
# store the data
df_cumulative_players_data.to_csv(relative_dir_path + "/" + filename)
msg = "Data successfully saved as: " + filename
logging.info(msg)
msg

'Data successfully saved as: nba_players_info_season_1997_to_2007.xlsx'