# Obtaining efficiency data of all NBA players across seasons

You can open this notebook in <a href="https://colab.research.google.com/github/edu9as/web-scraping/blob/master/Gathering-Data-of-NBA-Players-With-Basketball-Reference.ipynb">Google Colab</a>.

The goal of this notebook is to use web scraping to collect performance data for all NBA players active during the 2020-21 season. Because playing pace differs across teams, we expect players on faster-paced teams to post better raw statistics, since they have more opportunities to score, grab rebounds, and so on.

It therefore makes sense to analyze data that is normalized by possessions. Specifically, each player's page on <a href="https://www.basketball-reference.com/">Basketball Reference</a> includes a table that shows the player's performance per 100 possessions.

## 1. Loading libraries

The only four libraries needed here are **requests**, **BeautifulSoup**, **pandas** and **time**.

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time

## 2. Examining robots.txt in Basketball Reference domain

Before crawling a website, it is important to read its ```robots.txt``` file and see the limitations when scraping the website:

In [2]:
print(requests.get("https://www.basketball-reference.com/robots.txt").text)

User-agent: AhrefsBot
Disallow: /

User-agent: Twitterbot
Disallow:

User-agent: *
Disallow: /blazers/
Disallow: /dump/
Disallow: /fc/
Disallow: /my/
Disallow: /7103
Disallow: /play-index/*.cgi?*
Disallow: /play-index/plus/*.cgi?*
Disallow: */gamelog/
Disallow: */splits/
Disallow: */on-off/
Disallow: */lineups/
Disallow: */shooting/

Disallow: /req/
Disallow: /short/
Disallow: /nocdn/

Crawl-delay: 3

# Disallow the plagiarism.org robot, www.slysearch.com
User-agent: SlySearch
User-agent: GroundControl
User-agent: Ground-Control
User-agent: Carmine
User-agent: Skynet
User-agent: The-Matrix
User-agent: Matrix
User-agent: HAL9000
Disallow: /            #Will disallow or robot from all urls on your site




We only request HTML pages under **/teams/** and **/players/**, neither of which is disallowed. The only restriction to keep in mind is therefore the crawl delay: we must wait 3 seconds after every request to the Basketball Reference domain.
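The same check can be done programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`, fed with a trimmed excerpt of the rules printed above rather than the live file. The user-agent name `my-notebook` is a hypothetical placeholder.

```python
from urllib.robotparser import RobotFileParser

# Trimmed excerpt of the "User-agent: *" group shown above
# (assumption: shortened for illustration, not the full file).
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /blazers/
Disallow: /dump/
Disallow: /my/
Crawl-delay: 3
"""

parser = RobotFileParser()
parser.parse(ROBOTS_EXCERPT.splitlines())

AGENT = "my-notebook"  # hypothetical user-agent name
BASE = "https://www.basketball-reference.com"

# Paths this notebook requests
players_ok = parser.can_fetch(AGENT, BASE + "/players/d/duranke01.html")
teams_ok = parser.can_fetch(AGENT, BASE + "/teams/")
# A path that the excerpt disallows
dump_ok = parser.can_fetch(AGENT, BASE + "/dump/")
# Delay (in seconds) requested for this agent
delay = parser.crawl_delay(AGENT)

print(players_ok, teams_ok, dump_ok, delay)
```

Note that this parser matches plain path prefixes, so wildcard rules such as `*/gamelog/` are not interpreted as patterns. Reading the file by eye, as done above, remains a useful complement.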

## 3. Writing some useful functions

### 3.1. Crawler delay

As ```robots.txt``` states, we have to wait 3 seconds after every request to this domain to avoid overloading the website. A very simple function, to be called after every request, is:

In [3]:
def crawler_delay(delay = 3):
    time.sleep(delay)
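`crawler_delay` always sleeps for the full 3 seconds, even when parsing a page has already taken part of that time. A possible variant, sketched here under the assumption that only the gap between consecutive requests matters, sleeps just for the remaining time. `Throttle` is a hypothetical helper, not part of the notebook. Its clock and sleep functions are injectable so the logic can be checked without actually waiting.

```python
import time


class Throttle:
    """Ensure at least `delay` seconds pass between consecutive wait() calls."""

    def __init__(self, delay=3.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        """Sleep only for the part of the delay that has not elapsed yet.

        Returns the number of seconds actually slept.
        """
        slept = 0.0
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last = now
        return slept


class FakeClock:
    """Deterministic clock used only to demonstrate the logic."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


fake = FakeClock()
throttle = Throttle(delay=3.0, clock=fake.now, sleep=fake.sleep)
first = throttle.wait()   # nothing to wait for before the first request
fake.t += 1.0             # pretend parsing took 1 second
second = throttle.wait()  # sleeps only for the remaining time
```

In a crawl, `throttle.wait()` would be called immediately before each `requests.get`, using the default real clock.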

### 3.2. Get player table

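This function looks for the "Per 100 Poss" table inside each player's page and returns the **div** tag where the table is found, together with the player's name. If the table is not found in the parsed page, it falls back to the comment-stripping helper defined in **3.3**.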

In [4]:
def get_player_table(url):
    page = requests.get(url).text
    soup = bs(page, "html.parser")
    
    a = soup.find("div", id="div_per_poss")
    if a is None:
        soup = no_comments(page)
        a = soup.find("div", id="div_per_poss")
    name = soup.find("h1", itemprop="name").text.strip()
    return a, name

### 3.3. Get information in HTML comments

For some reason, the tables we are looking for sit inside HTML comments in the raw HTML, so the parser does not see them as tags. We therefore need a function that overcomes this obstacle and finds the information we want even when it is commented out. It is as simple as removing the comment markers and parsing the HTML document again.

In [5]:
def no_comments(page):
    page = page.replace("<!--", "").replace("-->", "")
    return bs(page, "html.parser")
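To see why the stripping step is needed, here is a small sketch that uses only the standard library's `html.parser` module, on which BeautifulSoup's `"html.parser"` backend is built. The sample markup and the `all_per_poss` id are invented for illustration, and `DivIdCollector` is a hypothetical helper.

```python
from html.parser import HTMLParser


class DivIdCollector(HTMLParser):
    """Collect the id attribute of every <div> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.div_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_ids.extend(value for name, value in attrs if name == "id")


def div_ids(html):
    collector = DivIdCollector()
    collector.feed(html)
    collector.close()
    return collector.div_ids


# Invented sample: the table wrapper is hidden inside an HTML comment
sample_page = (
    '<div id="all_per_poss">'
    '<!-- <div id="div_per_poss"><table></table></div> -->'
    "</div>"
)

# Commented-out markup is treated as comment text, not as tags
ids_raw = div_ids(sample_page)

# Same replacement as in no_comments(): drop the comment markers
uncommented = sample_page.replace("<!--", "").replace("-->", "")
ids_uncommented = div_ids(uncommented)

print(ids_raw)
print(ids_uncommented)
```

Only after the markers are removed does `div_per_poss` appear among the parsed tags, which is why `get_player_table` retries with `no_comments` when its first search returns `None`.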

### 3.4. Build table from HTML

After extracting the table we want with the function defined in **3.2**, we want to convert it to a Pandas dataframe. This function performs this task.

Note that we are not interested in the whole table. We only want the data for each season the player has played in the NBA, not the career and team totals. This is why we only keep the rows above the **Career** row.

Also, the original table contains an empty column (to visually separate the data at the right of the table, I suppose). Because we do not want NAs in our table, we remove this column, which is the only one whose name has more than 8 characters.

Finally, we add a column with the player name.

In [6]:
def build_table_from_html(tuple_with_table_and_player):
    html_table, player = tuple_with_table_and_player
    # pd.read_html expects markup as text, so convert the BeautifulSoup tag
    if not isinstance(html_table, str):
        html_table = str(html_table)
    table = pd.read_html(html_table)[0]
    # Position of the "Career" row; the rows above it are the per-season data
    last_row = int(table[table["Season"] == "Career"].index[0])
    # Drop the empty separator column (the only name longer than 8 characters)
    for col in list(table.columns):
        if len(col) > 8:
            del table[col]
    table["Player"] = player
    return table[:last_row]
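The two filtering steps can be tried in isolation on a toy dataframe. This is only a sketch: the values below are made up, and the column name `Unnamed: 29` is an assumed example of how pandas might label the empty column.

```python
import pandas as pd

# Made-up toy data mimicking the shape of a player's table
toy = pd.DataFrame(
    {
        "Season": ["2018-19", "2019-20", "Career", "ATL"],
        "PTS": [16.4, 20.1, 18.2, 18.2],
        "Unnamed: 29": [None, None, None, None],  # assumed name of the empty column
    }
)

# Position of the "Career" row (the toy frame has a default 0..n-1 index)
career_row = int(toy.index[toy["Season"] == "Career"][0])

# Keep only columns whose name has at most 8 characters
short_cols = [col for col in toy.columns if len(col) <= 8]

# Rows above "Career", without the separator column
seasons = toy.iloc[:career_row][short_cols]
print(seasons)
```

Selecting the columns to keep, rather than deleting while looping, leaves the original frame untouched. The function above mutates the freshly parsed table instead, which is equally fine there.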

Let's test what we have written so far with an example: <a href="https://www.basketball-reference.com/players/d/duranke01.html">Kevin Durant</a>.

In [7]:
build_table_from_html(get_player_table("https://www.basketball-reference.com/players/d/duranke01.html"))

Unnamed: 0,Season,Age,Tm,Lg,Pos,G,GS,MP,FG,FGA,...,TRB,AST,STL,BLK,TOV,PF,PTS,ORtg,DRtg,Player
0,2007-08,19.0,SEA,NBA,SG,80.0,80.0,2768.0,10.6,24.6,...,6.3,3.5,1.4,1.3,4.2,2.2,29.2,100.0,110.0,Kevin Durant
1,2008-09,20.0,OKC,NBA,SF,74.0,74.0,2885.0,11.7,24.7,...,8.6,3.6,1.7,0.9,4.0,2.4,33.3,111.0,109.0,Kevin Durant
2,2009-10,21.0,OKC,NBA,SF,82.0,82.0,3239.0,12.6,26.6,...,9.9,3.7,1.8,1.3,4.3,2.7,39.4,118.0,104.0,Kevin Durant
3,2010-11,22.0,OKC,NBA,SF,78.0,78.0,3038.0,12.1,26.2,...,9.1,3.6,1.5,1.3,3.7,2.7,36.8,115.0,107.0,Kevin Durant
4,2011-12,23.0,OKC,NBA,SF,66.0,66.0,2546.0,13.0,26.3,...,10.7,4.7,1.8,1.6,5.0,2.7,37.5,114.0,101.0,Kevin Durant
5,2012-13,24.0,OKC,NBA,SF,81.0,81.0,3119.0,12.1,23.6,...,10.6,6.2,1.9,1.7,4.6,2.4,37.6,122.0,100.0,Kevin Durant
6,2013-14,25.0,OKC,NBA,SF,81.0,81.0,3122.0,13.7,27.2,...,9.6,7.2,1.7,1.0,4.6,2.8,41.8,123.0,104.0,Kevin Durant
7,2014-15,26.0,OKC,NBA,SF,27.0,27.0,913.0,13.1,25.6,...,9.8,6.0,1.3,1.4,4.1,2.2,37.7,121.0,105.0,Kevin Durant
8,2015-16,27.0,OKC,NBA,SF,72.0,72.0,2578.0,13.4,26.6,...,11.3,7.0,1.3,1.6,4.8,2.6,39.1,122.0,104.0,Kevin Durant
9,2016-17,28.0,GSW,NBA,PF,62.0,62.0,2070.0,12.8,23.8,...,11.9,7.0,1.5,2.3,3.2,2.7,36.1,125.0,101.0,Kevin Durant


The table we have extracted looks really nice.

### 3.5. Get player URLs from team

Because active players are grouped by team, we can obtain each player's URL from the roster table on their team's page and then fetch the table above for each of them.

In [8]:
def get_urls_from_team(url):
    page = requests.get(url).text
    soup = bs(page, "html.parser")
    roster = soup.find("table", id="roster").findAll("td", {"data-stat": "player"})
    
    base_url = "https://www.basketball-reference.com"
    urls = [base_url + player.a.get("href") for player in roster]

    return urls
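The string concatenation in `get_urls_from_team` works as long as every `href` is a root-relative path such as `/players/d/duranke01.html`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves such paths against the base URL and also leaves absolute links untouched. The `href` values below are illustrative and assume that link shape.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.basketball-reference.com"

# Illustrative href values (assumed shape of the links in the roster table)
hrefs = [
    "/players/d/duranke01.html",  # root-relative link
    "https://www.basketball-reference.com/players/y/youngtr01.html",  # absolute link
]

player_urls = [urljoin(BASE_URL, href) for href in hrefs]
for url in player_urls:
    print(url)
```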

### 3.6. Get URL of teams

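This function serves the same purpose as the one in section **3.5**, one level up: it collects the URL of every active team from the **/teams** index page, so that each team's roster can then be crawled.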

In [9]:
def get_url_of_teams():
    base_url = "https://www.basketball-reference.com"
    url = base_url + "/teams"
    
    page = requests.get(url).text
    soup = bs(page, "html.parser")
    teams = soup.find("div", id="div_teams_active").findAll("a")
    urls = [base_url + team.get("href") for team in teams]
    return urls

### 3.7. Get URL of the 2020-2021 team

We are only interested in active players, i.e., players that are playing in the NBA during season 2020-2021:

In [10]:
def get_team_this_season(url):
    base_url = "https://www.basketball-reference.com"
    team_abb = url[-4:-1]
    page = requests.get(url).text
    soup = bs(page, "html.parser")
    table = soup.find("table", id = team_abb)
    team_url = table.findAll("th", {"data-stat": "season"})[1].a.get("href")
    return base_url + team_url
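The slice `url[-4:-1]` in `get_team_this_season` assumes that every team URL ends with a three-letter code followed by a slash, for example `.../teams/ATL/`. A slightly more defensive sketch, under the same assumption about the path layout, takes the last non-empty path segment instead. `team_abbreviation` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def team_abbreviation(team_url):
    """Return the last non-empty path segment of a team URL."""
    segments = [part for part in urlparse(team_url).path.split("/") if part]
    return segments[-1]


with_slash = "https://www.basketball-reference.com/teams/ATL/"
without_slash = "https://www.basketball-reference.com/teams/ATL"

# Both spellings give the same abbreviation
abb_with = team_abbreviation(with_slash)
abb_without = team_abbreviation(without_slash)

# The fixed slice only works when the trailing slash is present
slice_with = with_slash[-4:-1]
slice_without = without_slash[-4:-1]

print(abb_with, abb_without, slice_with, slice_without)
```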

## 4. Crawl the website and store the results

Up to now, we have only loaded libraries and defined some functions. With all these tools, we are able to easily scrape Basketball Reference and get the information we want about all the active players.

The data will be stored in a single Pandas dataframe (```table```). First of all, we load the URL of each team with one of the functions we defined earlier:

In [11]:
team_urls = get_url_of_teams()
crawler_delay() #Crawl-delay: 3
table = pd.DataFrame()

Because I only want to show the power of this code and not to overload the website, I am only crawling 5 teams. 

The URL of each player whose data is being extracted is printed to the console to keep track of the process. A try-except statement is also included, in case some inexperienced players have not yet played in the NBA even though they are signed by a team.

In [12]:
for i, team in enumerate(team_urls):
    team_url = get_team_this_season(team)
    crawler_delay() #Crawl-delay: 3
    
    players_url = get_urls_from_team(team_url)
    crawler_delay() #Crawl-delay: 3
    
    for player in players_url:
        try:
            print(player)
            table = pd.concat([table,
                               build_table_from_html(get_player_table(player))])
        except Exception:
            print("No table for " + player)
        
        crawler_delay() #Crawl-delay: 3
        
    if i > 3: break

https://www.basketball-reference.com/players/c/collijo01.html
https://www.basketball-reference.com/players/h/huertke01.html
https://www.basketball-reference.com/players/h/hillso01.html
https://www.basketball-reference.com/players/y/youngtr01.html
https://www.basketball-reference.com/players/c/capelca01.html
https://www.basketball-reference.com/players/r/reddica01.html
https://www.basketball-reference.com/players/g/gallida01.html
https://www.basketball-reference.com/players/g/goodwbr01.html
https://www.basketball-reference.com/players/s/snellto01.html
https://www.basketball-reference.com/players/r/rondora01.html
https://www.basketball-reference.com/players/h/huntede01.html
https://www.basketball-reference.com/players/f/fernabr01.html
https://www.basketball-reference.com/players/m/mayssk01.html
https://www.basketball-reference.com/players/o/okongon01.html
https://www.basketball-reference.com/players/b/bogdabo01.html
https://www.basketball-reference.com/players/k/knighna01.html
https://ww

Let's see what this partial table looks like:

In [13]:
table

Unnamed: 0,Season,Age,Tm,Lg,Pos,G,GS,MP,FG,FGA,...,TRB,AST,STL,BLK,TOV,PF,PTS,ORtg,DRtg,Player
0,2017-18,20.0,ATL,NBA,PF,74.0,26.0,1785.0,8.6,14.9,...,14.8,2.7,1.3,2.2,2.9,5.9,21.3,117.0,108.0,John Collins
1,2018-19,21.0,ATL,NBA,PF,61.0,59.0,1829.0,11.7,21.0,...,15.0,3.1,0.6,1.0,3.0,5.0,30.0,122.0,115.0,John Collins
2,2019-20,22.0,ATL,NBA,PF,41.0,41.0,1363.0,12.1,20.7,...,14.2,2.1,1.1,2.3,2.6,4.7,30.3,124.0,112.0,John Collins
3,2020-21,23.0,ATL,NBA,PF,36.0,36.0,1104.0,11.2,20.8,...,12.1,2.3,0.7,1.6,2.1,5.2,28.7,123.0,114.0,John Collins
0,2018-19,20.0,ATL,NBA,SG,75.0,59.0,2048.0,6.2,14.8,...,5.5,4.8,1.5,0.6,2.5,3.5,16.4,106.0,116.0,Kevin Huerter
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,2017-18,25.0,CHI,NBA,C,55.0,16.0,977.0,6.3,10.7,...,11.6,2.7,0.8,0.5,2.7,5.2,15.4,114.0,113.0,Cristiano Felício
3,2018-19,26.0,CHI,NBA,C,60.0,0.0,746.0,6.2,11.6,...,14.2,2.4,0.7,0.5,2.1,4.5,15.6,116.0,114.0,Cristiano Felício
4,2019-20,27.0,CHI,NBA,C,22.0,0.0,386.0,4.2,6.7,...,12.7,2.0,1.2,0.2,2.2,4.2,10.7,130.0,112.0,Cristiano Felício
5,2020-21,28.0,CHI,NBA,C,6.0,0.0,38.0,3.7,5.0,...,14.9,6.2,2.5,0.0,3.7,3.7,13.7,136.0,111.0,Cristiano Felício
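The row labels in the output above restart at 0 for every player, because each per-player table keeps its own index when concatenated. If a single running index is preferred, one option is to collect the tables in a list and concatenate once with `ignore_index=True`. The frames below are made-up stand-ins for the tables returned by `build_table_from_html`.

```python
import pandas as pd

# Made-up stand-ins for two per-player tables
frames = [
    pd.DataFrame({"Season": ["2019-20", "2020-21"], "Player": ["Player A"] * 2}),
    pd.DataFrame({"Season": ["2020-21"], "Player": ["Player B"]}),
]

# One concatenation at the end, with a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Concatenating once at the end also avoids copying the growing dataframe on every iteration of the crawl loop.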


## 5. End of the notebook

In this notebook, I have obtained per-100-possessions data for NBA players currently playing in the league (limited here to five teams to avoid overloading the website). With similar code, a massive collection of data could be gathered, always respecting the domain that provides these data, Basketball Reference, and further analysis of these data might yield some insight into NBA basketball.

I hope you have enjoyed it!