## Web Scraping Basketball-Reference.com Using Selenium

<img src="assets/br_logo.png" width="400px">

Example Player Page:
http://www.basketball-reference.com/players/b/bryanko01.html

 

### Using the `cumulative_player_stats` module

**`cumulative_player_stats`**

This module contains all the code needed to scrape www.Basketball-Reference.com. With it, you can scrape a player's career averages from four tables on any player's page. Each function below scrapes one table and must be given that player's URL:
- `get_per_game()`
- `get_100()`
- `get_shooting()`
- `get_advanced()`

To label the scraped values, retrieve the matching column names with the functions below. Combine the column names in the same order as the scraped tables so that names and values line up:
- `get_pergame_cols()`
- `get_100_cols()`
- `get_shoot_cols()`
- `get_adv_cols()`
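
The ordering requirement can be illustrated with a minimal sketch. The rows, values, and column names below are made-up placeholders, not the module's real output; the only point is that values and names are concatenated in the same table order.

```python
# Hypothetical stand-ins for scraped rows and column-name lists.
# The real functions in cumulative_player_stats may return different shapes and names.
per_100_row = [1.0, 2.0]
shooting_row = [3.0]
advanced_row = [4.0, 5.0]

per_100_cols = ["per100_a", "per100_b"]
shoot_cols = ["shoot_a"]
adv_cols = ["adv_a", "adv_b"]

# Concatenate values and names in the SAME table order:
# per-100 first, then shooting, then advanced.
row = per_100_row + shooting_row + advanced_row
cols = per_100_cols + shoot_cols + adv_cols

# A length mismatch is the cheapest sign that the two orders have drifted apart.
assert len(row) == len(cols)

labeled = dict(zip(cols, row))
print(labeled)
```

If the tables were combined in one order and the names in another, the lengths could still match while every value sat under the wrong name, so it is worth keeping both concatenations side by side.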

### Import Module

In [None]:
from lib.cumulative_player_stats import get_100, get_100_cols, get_shooting, get_shoot_cols, get_advanced, get_adv_cols

### Import Packages

In [None]:
import pandas as pd
import numpy as np

### Import Data

In [None]:
file_loc = "/Users/alexcheng/dsi/dsi_workspace/projects/project-captsone/workspace/data/players_14-17.csv"

In [None]:
player_stats = pd.read_csv(file_loc)
player_stats.head()

### Web Scrape

In [None]:
# Create a list of URLs to feed the web scraper
url_list = list(player_stats['url'])

In [None]:
# Temporary list used to save scraped data in memory
tmp_list = []

In [None]:
# Web Scraping
for url in url_list:
    # Select which tables to scrape from
    per_100 = get_100(url)
    shooting = get_shooting(url)
    advanced = get_advanced(url)
    
    # Flatten the three tables into one row, in a fixed order
    tables = [per_100, shooting, advanced]
    row = []
    for table in tables:
        for val in table:
            row.append(val)
    
    # Save to temporary list
    # (named `row` so the `player_stats` DataFrame is not overwritten before the merge)
    tmp_list.append(row)
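
A long scraping run can fail partway through, for example when a page is missing a table. The sketch below is one possible way to keep going and record failures; `scrape_all`, its parameters, and the stub scrapers are hypothetical and not part of `cumulative_player_stats`.

```python
import time


def scrape_all(urls, scrapers, delay_seconds=1.0):
    """Run every function in `scrapers` on each URL, in order.

    Returns (rows, failed): `rows` is a list of flat value lists, and
    `failed` is a list of (url, error message) pairs for skipped URLs.
    """
    rows, failed = [], []
    for url in urls:
        try:
            row = []
            for scrape in scrapers:
                row.extend(scrape(url))
        except Exception as exc:  # broad on purpose: any error skips this URL
            failed.append((url, str(exc)))
        else:
            rows.append(row)
        time.sleep(delay_seconds)  # pause between players to avoid hammering the site
    return rows, failed


# Stub scrapers standing in for get_100, get_shooting, and get_advanced.
def fake_table_a(url):
    return [1, 2]


def fake_table_b(url):
    if "bad" in url:
        raise ValueError("no table")
    return [3]


rows, failed = scrape_all(["good", "bad"], [fake_table_a, fake_table_b], delay_seconds=0)
print(rows)
print(failed)
```

With the real module, the call would presumably look like `scrape_all(url_list, [get_100, get_shooting, get_advanced])`, and the URLs in `failed` could be retried afterwards.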

In [None]:
# Retrieve columns
per_100_cols = get_100_cols()
shoot_cols = get_shoot_cols()
advanced_cols = get_adv_cols()

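# Maintain specific order: concatenate into one flat list of names
# (assumes each getter returns a list-like of column names)
cols = list(per_100_cols) + list(shoot_cols) + list(advanced_cols)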

In [None]:
# Build dataframe
cleaned_stats = pd.DataFrame(tmp_list, columns=cols)
cleaned_stats.head()

In [None]:
# Merge on Player_ID to attach player names to the scraped stats
# (both dataframes must contain a Player_ID column)
cumulative_player_stats = pd.merge(player_stats, cleaned_stats, on='Player_ID', how='right')
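
Because the merge uses `how='right'`, scraped rows without a matching name are kept with missing values rather than dropped. A small sketch of how that could be checked with the `indicator` option of `pd.merge`, using toy frames with made-up values (only the `Player_ID` key name comes from this notebook):

```python
import pandas as pd

# Toy frames with placeholder values; the column names other than Player_ID are hypothetical.
names = pd.DataFrame({"Player_ID": [1, 2], "Player": ["A", "B"]})
stats = pd.DataFrame({"Player_ID": [2, 3], "stat_a": [10.0, 20.0]})

# indicator=True adds a `_merge` column saying where each row came from.
merged = pd.merge(names, stats, on="Player_ID", how="right", indicator=True)

# Rows tagged 'right_only' have stats but no matching name.
unmatched = merged[merged["_merge"] == "right_only"]
print(unmatched["Player_ID"].tolist())
```

An empty `unmatched` frame would suggest that every scraped row found its player; anything else points to IDs worth inspecting before saving.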

In [None]:
file1_loc = "/Users/alexcheng/dsi/dsi_workspace/projects/project-captsone/workspace/data/stats_14-17.csv"

# Save to CSV
cumulative_player_stats.to_csv(file1_loc)

### Testing Area

In [None]:
# # Load packages
# import re
# import pandas as pd
# import numpy as np

# # Load selenium
# from selenium import webdriver
# from selenium.webdriver.common.keys import Keys
# from selenium.webdriver.common.action_chains import ActionChains
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC

# # Chrome webdriver
# # driver = webdriver.Chrome('/Users/alexcheng/Downloads/chromedriver')

In [None]:
# test = pd.read_csv('/Users/alexcheng/Desktop/test.csv')

In [None]:
# test.head()

In [None]:
# test_list = list(test['url'])

In [None]:
# tmp_list = []

In [None]:
# for url in test_list:
#     per_game = get_per_game(url)
    
#     df = [per_game]
#     player_stats = []
#     for sublist in df:
#         for val in sublist:
#             player_stats.append(val)
            
#     tmp_list.append(player_stats)

In [None]:
# cols = get_pergame_cols()

In [None]:
# test_stats = pd.DataFrame(tmp_list, columns=cols)

In [None]:
# test_stats.head()

In [None]:
# test_stats.to_csv('/Users/alexcheng/Desktop/testtesttesttest.csv')