<img src="assets/br_logo.png" width="400px">

## Web Scraping [Basketball-Reference.com](http://www.basketball-reference.com) Using Selenium

Example Player Page:
http://www.basketball-reference.com/players/b/bryanko01.html

 

### Using the `web_scrape` module

**`web_scrape`**

This module contains all the code needed to scrape www.Basketball-Reference.com. With it, you can scrape a player's career averages from four tables on that player's page. For the web scraper to work, you must pass each function the URL of a specific player's page:
- `get_per_game()`
- `get_100()`
- `get_shooting()`
- `get_advanced()`

To label the scraped values, retrieve the column names with the matching functions below. Call them in the same order as the scraping functions so that the column names line up with the values:
- `get_pergame_cols()`
- `get_100_cols()`
- `get_shoot_cols()`
- `get_adv_cols()`
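
One way to keep values and column names aligned is to pair each scraping function with its column function in a single list, so both are always traversed in the same order. The sketch below uses hypothetical stand-in functions (`fake_get_100`, `fake_get_advanced`, and so on) that return short hard-coded lists. It assumes the real `web_scrape` functions return flat lists, which this document does not show.

```python
# Hypothetical stand-ins for the web_scrape functions (assumed to return flat lists).
def fake_get_100(url):
    return [7.2, 18.8]

def fake_get_100_cols():
    return ["FG_100", "FGA_100"]

def fake_get_advanced(url):
    return [12.8, 20.5]

def fake_get_adv_cols():
    return ["TOV%", "USG%"]

# Pair each value function with its column function so the order cannot diverge.
TABLES = [
    (fake_get_100, fake_get_100_cols),
    (fake_get_advanced, fake_get_adv_cols),
]

def scrape_row(url):
    """Return (column_names, values), both built in the same table order."""
    names, values = [], []
    for get_values, get_cols in TABLES:
        names.extend(get_cols())
        values.extend(get_values(url))
    return names, values

# The URL is a placeholder; the stand-ins ignore it.
names, values = scrape_row("file:///hypothetical/player.html")
row = dict(zip(names, values))
```

Reordering or removing a table then only requires editing `TABLES` in one place.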

### Import Module

In [1]:
from lib.web_scrape import get_100, get_100_cols, get_shooting, get_shoot_cols, get_advanced, get_adv_cols

### Import Packages

In [2]:
import numpy as np
import pandas as pd

### Import Data

In [3]:
file_loc = "data/example_players.csv"

In [12]:
player_stats = pd.read_csv(file_loc)
player_stats.head()

| Unnamed: 0 | Season | Player | Pos | Player_ID | url |
|---|---|---|---|---|---|
| 0 | Inactive | A.J. Price | PG | priceaj01 | file:///Users/alexcheng/Downloads/us.sitesucke... |
| 1 | Active | Aaron Brooks | PG | brookaa01 | file:///Users/alexcheng/Downloads/us.sitesucke... |
| 2 | Active | Aaron Gordon | SF | gordoaa01 | file:///Users/alexcheng/Downloads/us.sitesucke... |
| 3 | Active | Aaron Harrison | SG | harriaa01 | file:///Users/alexcheng/Downloads/us.sitesucke... |
| 4 | Active | Adreian Payne | PF | paynead01 | file:///Users/alexcheng/Downloads/us.sitesucke... |


### Web Scrape

In [5]:
# Create a list of URLs to feed the web scraper
url_list = list(player_stats['url'])

In [6]:
# Temporary list used to save scraped data in memory
tmp_list = []

In [7]:
# Web Scraping
for url in url_list:
    # Select which tables to scrape from
    per_100 = get_100(url)
    shooting = get_shooting(url)
    advanced = get_advanced(url)
    
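    # Flatten the three tables into a single row, in order.
    # The row is named player_row so it does not overwrite the
    # player_stats DataFrame, which is needed later for the merge.
    tables = [per_100, shooting, advanced]
    player_row = []
    for sublist in tables:
        for val in sublist:
            player_row.append(val)
    
    # Save to temporary list
    tmp_list.append(player_row)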
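
A long scraping loop like the one above stops at the first page that fails to load or parse. The following is a minimal sketch of a more forgiving loop, not the module's own behavior. `scrape_all` and `demo_scrape_one` are hypothetical names, and the broad `except` is an assumption made because the exceptions raised by `web_scrape` are not documented here.

```python
import time

def scrape_all(urls, scrape_one, delay_seconds=0.0):
    """Scrape each URL, recording failures instead of stopping the loop."""
    rows, failures = [], []
    for url in urls:
        try:
            rows.append(scrape_one(url))
        except Exception as exc:  # broad on purpose: the scraper's exceptions are unknown
            failures.append((url, repr(exc)))
        time.sleep(delay_seconds)  # optional pause between pages
    return rows, failures

# Hypothetical scraper used only to exercise the loop.
def demo_scrape_one(url):
    if "missing" in url:
        raise ValueError("table not found")
    return [url, 1.0]

rows, failures = scrape_all(["a.html", "missing.html", "b.html"], demo_scrape_one)
```

The `failures` list can be inspected afterwards to decide which URLs to retry.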

In [8]:
# Retrieve columns
per_100_cols = get_100_cols()
shoot_cols = get_shoot_cols()
advanced_cols = get_adv_cols()

tmp_list1 = [per_100_cols, shoot_cols, advanced_cols]
cols = []
for sublist in tmp_list1:
    for val in sublist:
        cols.append(val)
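
The nested loop above flattens a list of lists. As an alternative sketch, `itertools.chain.from_iterable` from the standard library does the same thing in one line. The column lists below are short illustrative subsets taken from the output shown in this document, not the full lists returned by the module.

```python
from itertools import chain

# Illustrative subsets of the column lists (the real lists are longer).
per_100_cols = ["FG_100", "FGA_100", "FG%_100"]
advanced_cols = ["OWS", "DWS", "WS"]

# Concatenate the sublists in order into one flat list of column names.
cols = list(chain.from_iterable([per_100_cols, advanced_cols]))
```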

In [9]:
# Build dataframe
cleaned_stats = pd.DataFrame(tmp_list, columns=cols)
cleaned_stats.head()

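| Unnamed: 0 | Player_ID | GAMES | GS | MP_ | FG_100 | FGA_100 | FG%_100 | 3P_100 | 3PA_100 | 3P%_100 | ... | TOV% | USG% | OWS | DWS | WS | WS/48 | OBPM | DPM | BPM | VORP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | priceaj01 | 261.0 | 25.0 | 3929.0 | 7.2 | 18.8 | 0.38 | 2.8 | 9.0 | 0.316 | ... | 12.8 | 20.5 | 1.2 | 3.5 | 4.7 | 0.058 | -0.3 | -2.1 | -2.4 | -0.4 |
| 1 | brookaa01 | 594.0 | 182.0 | 13016.0 | 8.7 | 21.0 | 0.413 | 3.1 | 8.5 | 0.368 | ... | 14.4 | 23.3 | 10.5 | 8.5 | 19.0 | 0.07 | 0.7 | -2.4 | -1.7 | 1.1 |
| 2 | gordoaa01 | 182.0 | 94.0 | 4238.0 | 7.2 | 16.1 | 0.45 | 1.3 | 4.5 | 0.289 | ... | 9.8 | 17.8 | 4.1 | 4.1 | 8.1 | 0.092 | -0.6 | 0.4 | -0.3 | 1.9 |
| 3 | harriaa01 | 26.0 | 0.0 | 110.0 | 2.3 | 10.5 | 0.217 | 1.4 | 5.5 | 0.25 | ... | 12.1 | 13.6 | -0.2 | 0.2 | -0.1 | -0.033 | -6.2 | -0.2 | -6.4 | -0.1 |
| 4 | paynead01 | 96.0 | 24.0 | 1317.0 | 6.2 | 15.5 | 0.399 | 0.5 | 2.0 | 0.235 | ... | 16.4 | 17.8 | -1.5 | 0.7 | -0.8 | -0.03 | -4.9 | -0.8 | -5.8 | -1.3 |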
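
Before building the DataFrame, it can help to confirm that every scraped row has exactly one value per column, since a page with a missing table would produce a short row. The helper below is a hypothetical sketch (`split_by_length` is not part of `web_scrape`), and the rows are illustrative values.

```python
def split_by_length(rows, cols):
    """Separate rows that match the column count from rows that do not."""
    good = [row for row in rows if len(row) == len(cols)]
    bad = [row for row in rows if len(row) != len(cols)]
    return good, bad

# Illustrative data: the second row is missing one value.
example_cols = ["Player_ID", "GAMES", "GS"]
example_rows = [
    ["priceaj01", 261.0, 25.0],
    ["harriaa01", 26.0],
]

good_rows, bad_rows = split_by_length(example_rows, example_cols)
```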


In [13]:
# Merge dataframes to combine names 
cumulative_player_stats = pd.merge(player_stats, cleaned_stats, on='Player_ID', how='right')
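
With `how='right'`, every row of the scraped stats is kept, even if its `Player_ID` has no match in the player list. As a hedged sketch on toy data, the `indicator` and `validate` parameters of `pd.merge` can surface unmatched or duplicated IDs. The two small frames below are illustrative and reuse a few values from the tables in this document.

```python
import pandas as pd

# Toy frames: the stats frame has one player missing from the player list.
players = pd.DataFrame({
    "Player_ID": ["priceaj01", "brookaa01"],
    "Player": ["A.J. Price", "Aaron Brooks"],
})
stats = pd.DataFrame({
    "Player_ID": ["priceaj01", "brookaa01", "gordoaa01"],
    "GAMES": [261.0, 594.0, 182.0],
})

merged = pd.merge(
    players,
    stats,
    on="Player_ID",
    how="right",
    validate="one_to_one",  # raises MergeError if Player_ID repeats on either side
    indicator=True,         # adds a _merge column describing each row's origin
)

# Rows that exist only in the scraped stats have no player name attached.
unmatched = merged[merged["_merge"] == "right_only"]
```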

In [14]:
file1_loc = "data/example_player_stats.csv"

# Save to CSV
cumulative_player_stats.to_csv(file1_loc)
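
By default, `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back with `pd.read_csv`. If that column is not wanted, passing `index=False` is one option. The sketch below demonstrates the difference using in-memory buffers instead of files, with a one-row illustrative frame.

```python
import io
import pandas as pd

df = pd.DataFrame({"Player_ID": ["priceaj01"], "GAMES": [261.0]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
```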