# Web Scraping with BeautifulSoup

A web page is a blob of [HTML](https://en.wikipedia.org/wiki/HTML), which is structured text (like JSON, XML, or source code).

You can usually see the HTML of a web page by right-clicking it in your browser and choosing "View page source".

Because most websites structure their HTML consistently, we can parse it, pull out the pieces we need, and build our own dataset. This is called "scraping".
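As a minimal sketch of what parsing and decomposing HTML looks like, the snippet below feeds a small hand-written HTML string to BeautifulSoup and pulls out the heading and the table cells. The markup and the names `html`, `demo_soup`, `heading`, and `rows` are assumptions made up for illustration; they are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, written by hand for illustration only
html = """
<html><body>
  <h1>Scores</h1>
  <table>
    <tr><td>Alice</td><td>12</td></tr>
    <tr><td>Bob</td><td>7</td></tr>
  </table>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Tags can be looked up by name; .text returns the text inside a tag
heading = demo_soup.find("h1").text

# Each <tr> is a row; collect the text of its <td> cells
rows = [[td.text for td in tr.find_all("td")] for tr in demo_soup.find_all("tr")]

print(heading)
print(rows)
```

The nested list in `rows` has the same shape as the data we will later hand to pandas: one inner list per table row.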

Today we'll scrape basketball data tables from a reference website.

In [9]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
page = requests.get(url)

page

<Response [200]>

The `page.content` attribute holds the page's source HTML as raw bytes:

In [10]:
str(page.content)[:200]

'b\'\\n<!DOCTYPE html>\\n<html data-version="klecko-" data-root="/home/bbr/build" itemscope itemtype="https://schema.org/WebSite" lang="en" class="no-js" >\\n<head>\\n<!-- Quantcast Choice. Consent Manager '
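The `b'` prefix and the escaped `\n` sequences in the output above appear because `page.content` is a `bytes` object, and `str()` returns its printable representation instead of decoding it. The sketch below shows the difference on a hand-made bytes value (`raw` is an assumption standing in for a real response body). With a real response, `page.text` gives the decoded string directly.

```python
# Hypothetical stand-in for page.content: raw bytes as received from a server
raw = b"\n<!DOCTYPE html>\n<html lang=\"en\">"

# str() on bytes gives the repr, including the b'' wrapper and escaped newlines
as_repr = str(raw)

# decode() turns the bytes into actual text (UTF-8 is assumed here)
as_text = raw.decode("utf-8")

print(as_repr[:20])
print(as_text[:20])
```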

We can parse it with BeautifulSoup:

In [11]:
soup = BeautifulSoup(page.content, 'html.parser')
# Not printing to save space
# soup.prettify()

# Extracting HTML

Our next step is to extract the HTML content of the table and header.

Start by manually inspecting the HTML in your browser (the "Inspect element" feature is useful).

From there, we can see that each player's row has an HTML class of `full_table`:

In [12]:
table = soup.find_all(class_="full_table")
# table
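To see what `find_all(class_=...)` returns, here is a sketch on a hand-written table. The markup is hypothetical, loosely modeled on the structure described above rather than copied from the site, and the names `demo` and `rows` are made up for this example. Only tags carrying the requested class are returned, so rows with other classes are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: header rows mixed in with player rows
html = """
<table>
  <tr class="thead"><th>Player</th><th>PTS</th></tr>
  <tr class="full_table"><td>Player A</td><td>10.9</td></tr>
  <tr class="thead"><th>Player</th><th>PTS</th></tr>
  <tr class="full_table"><td>Player B</td><td>15.9</td></tr>
</table>
"""

demo = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore, since class is a Python keyword)
# matches tags whose class attribute contains the given value
rows = demo.find_all(class_="full_table")

print(len(rows))
print([td.text for td in rows[0].find_all("td")])
```

`find_all` returns a list-like collection of tags, which is why the real `table` variable can be looped over row by row.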

Now we need to save our column headers. Inspecting the page the same way as above, we see that the column header row has a class of `thead`:

In [13]:
head = soup.find(class_="thead")

# head.text is the text of the whole header row
column_names_raw = head.text

# Split on newlines, then drop the first two entries and the last one
column_names_clean = column_names_raw.replace("\n", ",").split(",")[2:-1]

column_names_clean

['Player',
 'Pos',
 'Age',
 'Tm',
 'G',
 'GS',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 '2P',
 '2PA',
 '2P%',
 'eFG%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS']
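The slicing in the cell above is easier to follow on a small input. The sketch below uses a made-up header string and applies the same replace, split, and slice steps. The string in `header_text` is an assumption about the general shape of `head.text` (one label per line); it is not the site's actual header.

```python
# Hypothetical header text: a rank label followed by three column labels
header_text = "\nRk\nPlayer\nPos\nAge\n"

# Replacing newlines with commas and splitting yields one entry per line,
# plus empty strings from the leading and trailing newline
parts = header_text.replace("\n", ",").split(",")

# [2:-1] drops the first two entries and the last one
names = parts[2:-1]

print(parts)
print(names)
```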

# Filling in the data

Now we can iterate over every row in `table` and, for each player, extract the text of every `td` tag to collect all of that player's statistics.

In [17]:
"""Extracting full list of player_data"""

players = []

for row in table:
    # Each <td> cell in a row holds one value for that player
    players.append([td.text for td in row.find_all("td")])

df = pd.DataFrame(players, columns=column_names_clean).set_index("Player")

# Remove the occasional '*' from player names;
# regex=False treats '*' as a literal character rather than a regex operator
df.index = df.index.str.replace('*', '', regex=False)

In [18]:
df

Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
Steven Adams,C,26,OKC,63,63,26.7,4.5,7.6,.592,0.0,...,.582,3.3,6.0,9.3,2.3,0.8,1.1,1.5,1.9,10.9
Bam Adebayo,PF,22,MIA,72,72,33.6,6.1,11.0,.557,0.0,...,.691,2.4,7.8,10.2,5.1,1.1,1.3,2.8,2.5,15.9
LaMarcus Aldridge,C,34,SAS,53,53,33.1,7.4,15.0,.493,1.2,...,.827,1.9,5.5,7.4,2.4,0.7,1.6,1.4,2.4,18.9
Kyle Alexander,C,23,MIA,2,0,6.5,0.5,1.0,.500,0.0,...,,1.0,0.5,1.5,0.0,0.0,0.0,0.5,0.5,1.0
Nickeil Alexander-Walker,SG,21,NOP,47,1,12.6,2.1,5.7,.368,1.0,...,.676,0.2,1.6,1.8,1.9,0.4,0.2,1.1,1.2,5.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Trae Young,PG,21,ATL,60,60,35.3,9.1,20.8,.437,3.4,...,.860,0.5,3.7,4.3,9.3,1.1,0.1,4.8,1.7,29.6
Cody Zeller,C,27,CHO,58,39,23.1,4.3,8.3,.524,0.3,...,.682,2.8,4.3,7.1,1.5,0.7,0.4,1.3,2.4,11.1
Tyler Zeller,C,30,SAS,2,0,2.0,0.5,2.0,.250,0.0,...,,1.5,0.5,2.0,0.0,0.0,0.0,0.0,0.0,1.0
Ante Žižić,C,23,CLE,22,0,10.0,1.9,3.3,.569,0.0,...,.737,0.8,2.2,3.0,0.3,0.3,0.2,0.5,1.2,4.4
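Because every value was read with `td.text`, the columns hold strings rather than numbers. One possible follow-up is to convert the statistic columns with `pd.to_numeric`, where `errors="coerce"` turns empty cells into `NaN`. The sketch below does this on a small hand-built frame; the rows and the names `demo`, `text_cols`, and `stat_cols` are hypothetical, not the scraped data.

```python
import pandas as pd

# Hypothetical frame mimicking scraped output: all values are strings
demo = pd.DataFrame(
    {"Pos": ["C", "PF"], "Age": ["26", "22"], "FT%": [".582", ""]},
    index=pd.Index(["Player A", "Player B"], name="Player"),
)

# Columns that should stay as text
text_cols = ["Pos"]
stat_cols = [c for c in demo.columns if c not in text_cols]

# Convert each statistic column; unparseable values (such as "") become NaN
demo[stat_cols] = demo[stat_cols].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
```

On the real `df`, the text columns would be `Pos` and `Tm`; everything else in the header list is a numeric statistic.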
