# Small Python Web Scraping Example: All-Star Classifier
### Project Question:
Using data scraped from Basketball Reference for the 2017-18 NBA season, can we accurately classify whether a player is an All-Star?
### Goals:
- Scrape a clean, valid data frame from Basketball Reference
- Investigate each core variable, understand the relationships between variables, and check whether new variables can be created
- Compare industry-standard classification techniques, and tune an appropriate model for classification

## Web Scraping

In [5]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from collections import OrderedDict
import requests

In [6]:
def gather_season():
    """
    Scrapes the 2017-18 per-game player table from basketball-reference
    and returns it as a DataFrame (every value is kept as a string)
    """
    season = []
    url = 'https://www.basketball-reference.com/leagues/NBA_2018_per_game.html'

    page_req = requests.get(url)
    page_req.raise_for_status()  # stop on an HTTP error instead of parsing an error page
    soup = BeautifulSoup(page_req.text, 'lxml')
    table = soup.find('table')
    table_body = table.find('tbody')

    # Column names for the <td> cells of each row, in page order
    fields = ['Player','Pos','Age','Tm','G','GS','MP','FG','FGA','FG%','3P','3PA','3P%','2P','2PA','2P%','eFG%','FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PS/G']
    for row in table_body.find_all('tr'):
        cell = row.find_all('td')

        # Rows without <td> cells hold no player data, so skip them
        if len(cell) > 0:
            cell = [c.text for c in cell]
            player_entry = OrderedDict(zip(fields, cell))
            season.append(player_entry)

    return pd.DataFrame(season)
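
The row filter in `gather_season` can be tried offline before hitting the live site. The sketch below parses a small hand-written table; the markup, the `parse_rows` helper, and the placeholder player names are assumptions made for illustration, not a copy of the real page. It shows how reading only `td` cells drops a row made up entirely of `th` cells.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the stats table (assumed structure, not the real page)
SAMPLE_HTML = """
<table>
  <thead><tr><th>Rk</th><th>Player</th><th>Pos</th></tr></thead>
  <tbody>
    <tr><th>1</th><td><a href="#">Player A</a></td><td>SG</td></tr>
    <tr class="thead"><th>Rk</th><th>Player</th><th>Pos</th></tr>
    <tr><th>2</th><td><a href="#">Player B</a></td><td>C</td></tr>
  </tbody>
</table>
"""

def parse_rows(html, fields):
    """Return one dict per body row that contains <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('table').find('tbody')
    rows = []
    for tr in body.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # a row with only <th> cells yields an empty list
            rows.append(dict(zip(fields, cells)))
    return rows

rows = parse_rows(SAMPLE_HTML, ['Player', 'Pos'])
print(rows)
```

Only the two player rows survive; the header-style row in the middle of the body is skipped because it has no `td` cells.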

In [7]:
df = gather_season()
df.head()

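| | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PS/G |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alex Abrines | SG | 24 | OKC | 75 | 8 | 15.1 | 1.5 | 3.9 | 0.395 | ... | 0.848 | 0.3 | 1.2 | 1.5 | 0.4 | 0.5 | 0.1 | 0.3 | 1.7 | 4.7 |
| 1 | Quincy Acy | PF | 27 | BRK | 70 | 8 | 19.4 | 1.9 | 5.2 | 0.356 | ... | 0.817 | 0.6 | 3.1 | 3.7 | 0.8 | 0.5 | 0.4 | 0.9 | 2.1 | 5.9 |
| 2 | Steven Adams | C | 24 | OKC | 76 | 76 | 32.7 | 5.9 | 9.4 | 0.629 | ... | 0.559 | 5.1 | 4.0 | 9.0 | 1.2 | 1.2 | 1.0 | 1.7 | 2.8 | 13.9 |
| 3 | Bam Adebayo | C | 20 | MIA | 69 | 19 | 19.8 | 2.5 | 4.9 | 0.512 | ... | 0.721 | 1.7 | 3.8 | 5.5 | 1.5 | 0.5 | 0.6 | 1.0 | 2.0 | 6.9 |
| 4 | Arron Afflalo | SG | 32 | ORL | 53 | 3 | 12.9 | 1.2 | 3.1 | 0.401 | ... | 0.846 | 0.1 | 1.2 | 1.2 | 0.6 | 0.1 | 0.2 | 0.4 | 1.1 | 3.4 |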


In [8]:
np.shape(df)

(664, 29)
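
Because `gather_season` stores each cell's `.text`, every column in `df` holds strings, including the statistics. A minimal sketch of a numeric conversion follows, assuming that 'Player', 'Pos', and 'Tm' are the only text columns and that an empty cell should become NaN. The `to_numeric_columns` helper and the sample rows are hypothetical.

```python
import pandas as pd

def to_numeric_columns(frame, text_columns=('Player', 'Pos', 'Tm')):
    """Return a copy with every column outside text_columns converted to numbers."""
    out = frame.copy()
    for col in out.columns:
        if col not in text_columns:
            # errors='coerce' turns unparseable or empty strings into NaN
            out[col] = pd.to_numeric(out[col], errors='coerce')
    return out

# Hypothetical rows, shaped like the scraped strings
sample = pd.DataFrame({
    'Player': ['Player A', 'Player B'],
    'Tm': ['AAA', 'BBB'],
    'G': ['75', '70'],
    '3P%': ['0.380', ''],
})
converted = to_numeric_columns(sample)
print(converted.dtypes)
```

In the notebook this would be called as `df = to_numeric_columns(df)`. Whether to run it before or after saving the CSV is a design choice, since `pd.read_csv` infers numeric types on reload anyway.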

### Handle Traded Players:
- A player traded mid-season has one row per team plus a summary row whose 'Tm' is 'TOT'
- Drop the per-team rows so each traded player is represented only by his 'TOT' row, then reset the dataframe index

In [9]:
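# Players with a 'TOT' (total) row were traded and also have one row per team
traded_players = set(df[df.Tm == 'TOT'].Player)
# Index labels of the per-team rows, which the 'TOT' row already summarizes
del_row_inds = [i for i, row in df.iterrows() if row['Player'] in traded_players and row['Tm'] != 'TOT']
df = df.drop(index=del_row_inds)
df = df.reset_index(drop=True)
np.shape(df)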

(540, 29)
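
The same filter can also be written without a Python-level loop. This is a sketch of one vectorized alternative using a boolean mask; the `keep_season_totals` helper and the sample rows are hypothetical and stand in for the scraped `df`.

```python
import pandas as pd

def keep_season_totals(frame):
    """Drop the per-team rows of traded players, keeping their 'TOT' rows."""
    traded = frame.loc[frame['Tm'] == 'TOT', 'Player']
    # True for rows that belong to a traded player but are not the summary row
    is_team_split = frame['Player'].isin(traded) & (frame['Tm'] != 'TOT')
    return frame[~is_team_split].reset_index(drop=True)

# Hypothetical rows: Player A was traded, Player B was not
sample = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player A', 'Player B'],
    'Tm': ['TOT', 'AAA', 'BBB', 'CCC'],
    'G': ['60', '25', '35', '70'],
})
result = keep_season_totals(sample)
print(result)
```

Both versions match on the player's name alone, so two different players who share a name would be treated as one.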

### Add Dummy Variable indicating All-Star or Not
- Create list of 2017-2018 all-stars
- Add new column indicating whether player is in list

In [10]:
all_stars = [
    'LeBron James','Kevin Durant','Anthony Davis',
    'Kyrie Irving','DeMarcus Cousins','LaMarcus Aldridge',
    'Bradley Beal','Goran Dragic','Andre Drummond','Paul George',
    'Victor Oladipo','Kemba Walker','Russell Westbrook','Kevin Love',
    'Kristaps Porzingis','John Wall','Stephen Curry','James Harden',
    'DeMar DeRozan','Giannis Antetokounmpo','Joel Embiid','Jimmy Butler',
    'Draymond Green','Al Horford','Damian Lillard','Kyle Lowry',
    'Klay Thompson','Karl-Anthony Towns'
]

### Special Cases:
The list above includes both the original selections and their injury replacements:
- DeMarcus Cousins did not play due to injury; Paul George replaced him on the roster, and Russell Westbrook replaced him as a starter.
- Kevin Love did not play due to injury; replaced by Goran Dragic.
- Kristaps Porzingis did not play due to injury; replaced by Kemba Walker.
- John Wall did not play due to injury; replaced by Andre Drummond.
- Jimmy Butler did not play due to coach's decision.

In [11]:
df['All-Star'] = df['Player'].isin(all_stars)
df.head()

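| | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | ... | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PS/G | All-Star |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alex Abrines | SG | 24 | OKC | 75 | 8 | 15.1 | 1.5 | 3.9 | 0.395 | ... | 0.3 | 1.2 | 1.5 | 0.4 | 0.5 | 0.1 | 0.3 | 1.7 | 4.7 | False |
| 1 | Quincy Acy | PF | 27 | BRK | 70 | 8 | 19.4 | 1.9 | 5.2 | 0.356 | ... | 0.6 | 3.1 | 3.7 | 0.8 | 0.5 | 0.4 | 0.9 | 2.1 | 5.9 | False |
| 2 | Steven Adams | C | 24 | OKC | 76 | 76 | 32.7 | 5.9 | 9.4 | 0.629 | ... | 5.1 | 4.0 | 9.0 | 1.2 | 1.2 | 1.0 | 1.7 | 2.8 | 13.9 | False |
| 3 | Bam Adebayo | C | 20 | MIA | 69 | 19 | 19.8 | 2.5 | 4.9 | 0.512 | ... | 1.7 | 3.8 | 5.5 | 1.5 | 0.5 | 0.6 | 1.0 | 2.0 | 6.9 | False |
| 4 | Arron Afflalo | SG | 32 | ORL | 53 | 3 | 12.9 | 1.2 | 3.1 | 0.401 | ... | 0.1 | 1.2 | 1.2 | 0.6 | 0.1 | 0.2 | 0.4 | 1.1 | 3.4 | False |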
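
The first five rows are all False, so they do not confirm that every name in `all_stars` matched a scraped row. An exact string comparison misses a player whose scraped name is spelled differently, for example with diacritics. Whether the page spells any of these names that way is an assumption worth checking, not something the output above shows. The sketch below lists unmatched names after stripping accents; `strip_accents`, `unmatched_names`, and the sample lists are hypothetical helpers for illustration.

```python
import unicodedata

def strip_accents(name):
    """Return name with combining diacritical marks removed."""
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def unmatched_names(wanted, scraped):
    """List the names in wanted that have no accent-insensitive match in scraped."""
    scraped_plain = {strip_accents(s).strip() for s in scraped}
    return [w for w in wanted if strip_accents(w).strip() not in scraped_plain]

# Hypothetical sample: one accented spelling, one exact match, one absent name
sample_scraped = ['Goran Dragić', 'Kemba Walker', 'Alex Abrines']
sample_wanted = ['Goran Dragic', 'Kemba Walker', 'John Wall']
missing = unmatched_names(sample_wanted, sample_scraped)
print(missing)
```

In the notebook, `unmatched_names(all_stars, df['Player'])` would ideally return an empty list, and `df['All-Star'].sum()` can be compared against `len(all_stars)` as a second check on the exact-match labels.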


### Save Clean Dataset to CSV

In [125]:
df.to_csv('allstars_17_18.csv', sep = ',', index = False)

### Now you can build your model!