## Web Scraping Basketball-Reference.com 

This pipeline scrapes three tables (per game, shooting, and advanced) from individual player pages on www.basketball-reference.com. It takes a list of player URLs and cycles through them, scraping each page in turn.

### Load Modules and Packages

**`basketball_reference`** is a module that scrapes data from [Basketball-Reference.com](http://basketball-reference.com). It provides these functions:
- `get_player_name`
- `get_per_game`
- `get_shooting`
- `get_advanced`
- `get_columns`


**`player_list`** is a module that returns the list of player URLs to be fed into the web scraping module. It provides these functions:
- `load_total`
- `load_part1`
- `load_part2`
- `load_test`

In [1]:
import pandas as pd
from lib.basketball_reference import get_player_name, get_per_game, get_shooting, get_advanced, get_columns 
from lib.player_list import load_total, load_part1, load_part2, load_test

### Player List

These are the lists of player URLs that the scraping `for` loop iterates over.

In [2]:
total_list = load_total()
part1_list = load_part1()
part2_list = load_part2()
test_list = load_test()

In [3]:
test_list

['http://www.basketball-reference.com/players/a/abrinal01.html',
 'http://www.basketball-reference.com/players/a/acyqu01.html',
 'http://www.basketball-reference.com/players/a/adamsst01.html',
 'http://www.basketball-reference.com/players/a/afflaar01.html',
 'http://www.basketball-reference.com/players/a/ajincal01.html',
 'http://www.basketball-reference.com/players/a/aldrico01.html']

### Web Scraping

In [4]:
# Temporary list
tmp_list = []

In [5]:
for url in test_list:
    # Get stats
    name = get_player_name(url)
    per_game = get_per_game(url)
    advanced = get_advanced(url)
    shooting = get_shooting(url)

    # Data cleaning: flatten the name and the three stat lists into one row
    player_stats = [name]
    for table in [per_game, shooting, advanced]:
        player_stats.extend(table)

    # Append to tmp_list
    tmp_list.append(player_stats)

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="meta"]/div/p[4]/span[1]"}
  (Session info: chrome=56.0.2924.87)
  (Driver info: chromedriver=2.27.440174 (e97a722caafc2d3a8b807ee115bfb307f7d2cfd9),platform=Mac OS X 10.12.3 x86_64)
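
The traceback above shows that a single missing page element stops the whole loop. One possible way to keep going is sketched below, assuming each scraper takes a URL and either returns a value or raises. The helpers `safe_call` and `build_row`, and the stand-in scrapers `fake_name` and `fake_stats`, are hypothetical and not part of `lib.basketball_reference`; the stand-ins exist only so the sketch runs without a browser.

```python
from itertools import chain


def safe_call(func, url, default=None):
    """Call a scraper function and return `default` if it raises."""
    try:
        return func(url)
    except Exception as exc:  # e.g. selenium's NoSuchElementException
        print(f"{func.__name__} failed for {url}: {exc}")
        return default


def build_row(name, *tables):
    """Flatten the player name plus any number of stat lists into one row."""
    return [name, *chain.from_iterable(tables)]


# Hypothetical stand-in scrapers, used only to make this sketch runnable
def fake_name(url):
    raise LookupError("element not found")


def fake_stats(url):
    return [43.0, 0.0, 13.9]


url = "http://www.basketball-reference.com/players/a/abrinal01.html"
name = safe_call(fake_name, url, default="")
row = build_row(name, safe_call(fake_stats, url, default=[]))
```

In the real loop, `safe_call(get_player_name, url, default="")` would take the place of the direct call, so a player whose page layout differs yields a row with a blank field rather than an exception. Whether a blank is acceptable, or the player should be skipped instead, is a design choice left open here.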


### Final Dataframe

In [6]:
# Columns
cols = get_columns()

In [7]:
# Build dataframe
nba_player_stats = pd.DataFrame(tmp_list, columns=cols)
nba_player_stats

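| | Player | G | GS | MP | FG | FGA | FG% | 3P | 3PA | 3P% | ... | TOV% | USG% | OWS | DWS | WS | WS/48 | OBPM | DPM | BPM | VORP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | 43.0 | 0.0 | 13.9 | 1.7 | 4.4 | 0.397 | 1.2 | 3.2 | 0.375 | ... | 10.1 | 16.3 | 0.6 | 0.5 | 1.1 | 0.091 | 0.0 | -2.4 | -3.0 | -0.1 |
| 1 | Quincy Acy | 237.0 | 51.0 | 14.9 | 1.7 | 3.5 | 0.495 | 0.2 | 0.7 | 0.359 | ... | 13.0 | 13.8 | 4.4 | 2.8 | 7.3 | 0.099 | 0.0 | 0.0 | -1.5 | 0.5 |
| 2 | Steven Adams | 283.0 | 219.0 | 23.3 | 3.0 | 5.2 | 0.569 | 0.0 | 0.0 | 0.0 | ... | 16.8 | 13.9 | 9.4 | 8.7 | 18.1 | 0.132 | 0.0 | 1.7 | 1.0 | 4.9 |
| 3 | Arron Afflalo | 693.0 | 519.0 | 28.4 | 4.1 | 9.2 | 0.45 | 1.2 | 3.0 | 0.385 | ... | 10.4 | 18.0 | 23.4 | 9.3 | 32.7 | 0.08 | 0.0 | -1.3 | -1.0 | 5.2 |
| 4 | Alexis Ajinca | 275.0 | 71.0 | 13.2 | 2.2 | 4.3 | 0.5 | 0.0 | 0.1 | 0.293 | ... | 15.1 | 19.4 | 2.7 | 4.2 | 6.8 | 0.091 | 0.0 | 0.4 | -2.5 | -0.4 |
| 5 | Cole Aldrich | 304.0 | 23.0 | 10.7 | 1.4 | 2.6 | 0.531 | 0.0 | 0.0 | 0.0 | ... | 17.7 | 15.2 | 3.9 | 5.7 | 9.6 | 0.141 | 0.0 | 3.5 | 1.3 | 2.7 |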


### Save to CSV

In [18]:
nba_player_stats.to_csv('data/test.csv')
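
By default, `DataFrame.to_csv` also writes the row index as an unnamed first column, which `pd.read_csv` then reads back as `Unnamed: 0`. A minimal sketch of avoiding this, assuming the index carries no information here, is to pass `index=False`. The one-row frame and the in-memory buffer below are illustrative stand-ins for `nba_player_stats` and `data/test.csv`.

```python
import io

import pandas as pd

# Small stand-in frame; values taken from the Quincy Acy row above
df = pd.DataFrame({"Player": ["Quincy Acy"], "G": [237.0]})

# Write without the index, using a buffer in place of a file on disk
buf = io.StringIO()
df.to_csv(buf, index=False)

# Reading the CSV back gives only the original columns
buf.seek(0)
round_trip = pd.read_csv(buf)
```

For the notebook itself, the corresponding call would be `nba_player_stats.to_csv('data/test.csv', index=False)`.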

### Testing

In [1]:
# Load selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Chrome webdriver
driver = webdriver.Chrome('/Users/alexcheng/Downloads/chromedriver')

In [2]:
test_url = "http://www.basketball-reference.com/players/a/abrinal01.html"

In [6]:
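import re

def get_att(season, url):
    # Load the player page (the season argument is currently unused)
    driver.get(url)
    # Read the text of the height and weight spans in the "meta" block
    height = driver.find_element(By.XPATH, """//*[@id="meta"]/div[2]/p[4]/span[1]""").text
    weight = driver.find_element(By.XPATH, """//*[@id="meta"]/div[2]/p[4]/span[2]""").text  # not used yet

    # The regex expects height text in feet-inches form, such as "6-6"
    match = re.search(r'(\d)-(\d+)', height)
    if match is None:
        raise ValueError('Unexpected height format: ' + height)
    feet, inches = float(match.group(1)), float(match.group(2))

    # Return the total height in inches as a one-element list
    return [(feet * 12) + inches]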
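
The height conversion in `get_att` can be checked without opening a browser if the parsing is separated from the Selenium call. The sketch below assumes the height text has a feet-inches form such as `6-6`, as the regular expressions above imply; the helper name `height_to_inches` is hypothetical.

```python
import re


def height_to_inches(height_text):
    """Convert a feet-inches string such as '6-6' to total inches."""
    match = re.search(r"(\d+)-(\d+)", height_text)
    if match is None:
        raise ValueError(f"Unexpected height format: {height_text!r}")
    feet, inches = (int(group) for group in match.groups())
    return feet * 12 + inches


# Example: 6 feet 6 inches is 6 * 12 + 6 inches
example = height_to_inches("6-6")
```

With a helper like this, `get_att` would only need to fetch the element text and pass it along, and the parsing could be tested on plain strings.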