# The NBA's Next Super Star

## Abstract
Every NBA superstar has a rookie season. Whether it was a [terrible debut year (e.g. LeBron James)](http://www.thesportster.com/basketball/15-nba-superstars-who-were-terrible-in-their-debut-season/) or a rewarding one that included selection to the [All-Star Game (e.g. Tim Duncan)](http://www.nba-allstar.com/players/lists/all-star-game-rookies.htm), we believe that a player's early years contain evidence that foretells his future success. In this project, we study the key factors that drive players toward a successful career and use them to predict the NBA's future superstars.

![alt text][all_star_game_2016]

[all_star_game_2016]: http://i.cdn.turner.com/nba/nba/dam/assets/160121164737-all-star-starters-graphic-1280-012116.1200x672.jpg "All Star Game 2016"


## Problem Statement
A common way to forecast an NBA player's success from his rookie year is to look at his draft position. However, this approach is incomplete because it ignores other extrinsic factors.

## Introduction
This project aims to predict the career growth of an NBA player using the following data sources, which reflect a player's on-court performance, annual video game ratings, and exposure on social media and in the news.

1. Performance: Past and current on-court statistics (e.g. [1991 NBA Draft](http://www.basketball-reference.com/draft/NBA_1991.html)).

2. Ratings: The player ratings in the NBA 2K16 video game (e.g. [2K16 MyTEAM Players](http://2kmtcentral.com/16/players/theme/dynamic)).

3. Social Media: The number of tweets a player posts might reflect his attitude toward his career and how seriously he takes it. In addition, a language model built from the tweet content could be useful (e.g. [Twitter API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline)).

4. Related News: News coverage carries information such as fan support, anticipation, expert analysis, and injury reports, so a language model built from it could be a strong indicator.

## Objective
Using indicators such as efficiency and other on-court statistics, we would like to predict the likelihood that a player becomes a superstar who dominates the league.
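
As an illustration of one such indicator, the following is a minimal sketch of the NBA's linear efficiency rating (EFF). It assumes that "efficiency" here refers to that rating; the text does not say which efficiency measure the project uses. The function name, argument names, and season totals are made up for the example.

```python
def efficiency(pts, reb, ast, stl, blk, fga, fgm, fta, ftm, tov, games):
    """NBA efficiency (EFF) per game, computed from season totals.

    EFF = (PTS + REB + AST + STL + BLK
           - missed field goals - missed free throws - turnovers) / games
    """
    missed_fg = fga - fgm
    missed_ft = fta - ftm
    total = pts + reb + ast + stl + blk - missed_fg - missed_ft - tov
    return total / float(games)


# Made-up season totals for a hypothetical player (not real data).
season = dict(pts=1000, reb=400, ast=300, stl=80, blk=40,
              fga=800, fgm=380, fta=250, ftm=200, tov=150, games=60)

print(efficiency(**season))  # 20.0
```

A single number like this is easy to compare across rookies, but it weights every box-score event equally, so it is best treated as one feature among several.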

## Import Libraries

In [2]:
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2
from bs4 import BeautifulSoup
import pandas as pd

## Data Preparation

We obtain all rookies with the function `scrape_draft`. To better predict each player's future performance, his performance in each regular season should be considered. Therefore, we crawl per-game scoring statistics for NBA players from ESPN and merge them with the rookies data frame. The result is a data frame containing each rookie and his performance statistics since being drafted.

In [3]:
def obj2numeric(df, cols, str_cols = ['Tm', 'PLAYER', 'College']):
    """
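    Convert the given columns to numeric type, skipping string fields.
    
    Args:
    df(data_frame): data frame to be converted
    cols(iterable): names of the columns to convert
    str_cols(list): names of string columns to leave unconverted
    
    Return:
    (data_frame): data frame with numeric fields converted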
    """
    for c in cols:
        if c not in str_cols: df[c] = pd.to_numeric(df[c], errors='raise')
    return df

def transform_numeric(df):
    """
    convert columns type to numeric
    
    Args:
    (data_frame): data frame whose columns contain object type
    
    Return:
    (data_frame): data frame with numeric values converted
    """
    df.rename(columns={'WS/48':'WS_per_48', 'Player':'PLAYER'}, inplace=True)
    df.columns.values[14:18] = [df.columns.values[14:18][col] + "_per_game" for col in range(4)]
    df = obj2numeric(df, df.columns)
    df = df[df['PLAYER'].notnull()].fillna(0)
    df.loc[:,'Yrs':'AST'] = df.loc[:,'Yrs':'AST'].astype(int)
    return df
    
def scrape_draft(save_file, start_yr=1966, end_yr=2016):
    """
    Scrape draft data for the specified duration from:
    http://www.basketball-reference.com/draft/NBA_{year}.html
    
    Args:
    save_file(str): path of the CSV file to write the result to
    start_yr(int): first draft year to scrape
    end_yr(int): year at which to stop scraping (exclusive)
    
    Return:
    (data_frame): Annually draft pick result
    """
    url_format = 'http://www.basketball-reference.com/draft/NBA_{yr}.html'
    frames = []
    
    for y in range(start_yr, end_yr):
        url = url_format.format(yr = y)
        bs = BeautifulSoup(urlopen(url), 'html.parser')
        
        # The second <tr> holds the column names; drop the leading rank column (Rk)
        tr_tags = bs.findAll('tr')
        th_tags = tr_tags[1].findAll('th')
        
        cols = [th.getText() for th in th_tags][1:]
        # Every following <tr> is a data row
        rows = [[td.getText() for td in tr.findAll('td')] for tr in tr_tags[2:]]

        year_df = pd.DataFrame(rows, columns = cols)
        year_df.insert(0,'Draft_Yr', y)
        frames.append(year_df)
    
    df = pd.concat(frames)
    df = transform_numeric(df)
    df.to_csv(save_file)
    return df

In [4]:
# see ESPN_nba_player_stats.py for more details 
from ESPN_nba_player_stats import get_regular_season

def merge_draft_espn(draft_file, union_file, start_yr=1990, end_yr=2017):
    """
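    Merge all drafted rookies with their regular-season statistics from ESPN.
    
    Args:
    draft_file(str): path of the CSV file written by scrape_draft
    union_file(str): path of the CSV file to write the merged result to
    start_yr(int): first season to merge
    end_yr(int): season at which to stop merging (exclusive)
    
    Return:
    (data_frame): draft data left-joined with the ESPN statistics of each season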
    """
    draft_df = pd.read_csv(draft_file, index_col=0)

    for yr in range(start_yr, end_yr):
        yr_data = get_regular_season(yr)
        # Left join: keep only the players that appear in draft_df
        draft_df = pd.merge(draft_df, yr_data, on="PLAYER", how="left")

    draft_df.to_csv(union_file)
    return draft_df

start_yr = 2009
end_yr = 2017 # exclusive

draft_file_format = "draft_data_{start_yr}_to_{end_yr}.csv"
union_file_format = "all_data_{start_yr}_to_{end_yr}.csv"
draft_file = draft_file_format.format(start_yr = start_yr, end_yr = end_yr)
union_file = union_file_format.format(start_yr = start_yr, end_yr = end_yr)

# Retrieve the 2009 - 2016 rookies and the ESPN regular-season statistics from 2009 - 2016
draft_df = scrape_draft(draft_file, start_yr, end_yr)
all_df = merge_draft_espn(draft_file, union_file, start_yr, end_yr)
print(all_df.head(10))

   Draft_Yr    Pk   Tm            PLAYER                            College  \
0      2009   1.0  LAC     Blake Griffin             University of Oklahoma   
1      2009   2.0  MEM   Hasheem Thabeet          University of Connecticut   
2      2009   3.0  OKC      James Harden           Arizona State University   
3      2009   4.0  SAC      Tyreke Evans              University of Memphis   
4      2009   5.0  MIN       Ricky Rubio                                NaN   
5      2009   6.0  MIN       Jonny Flynn                Syracuse University   
6      2009   7.0  GSW     Stephen Curry                   Davidson College   
7      2009   8.0  NYK       Jordan Hill              University of Arizona   
8      2009   9.0  TOR     DeMar DeRozan  University of Southern California   
9      2009  10.0  MIL  Brandon Jennings                                NaN   

   Yrs    G     MP    PTS   TRB    ...     2016_TEAM  2016_GP  2016_MPG  \
0    7  417  14715   8936  3995    ...           LAC   
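
The repeated left merge in `merge_draft_espn` can be tried on toy data without scraping anything. The following is a minimal sketch: `fake_regular_season` is a hypothetical stand-in for `get_regular_season`, the player names and numbers are made up, and the year-prefixed column naming is an assumption based on columns such as `2016_TEAM` in the output above.

```python
import pandas as pd

# Toy draft table (made-up players, not real data).
draft_df = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C"],
    "Pk": [1, 2, 3],
})


def fake_regular_season(year):
    """Hypothetical stand-in for get_regular_season(year)."""
    seasons = {
        2015: {"PLAYER": ["Player A", "Player B"], "PTS": [21.4, 5.0]},
        2016: {"PLAYER": ["Player A", "Player D"], "PTS": [23.1, 9.9]},
    }
    season_df = pd.DataFrame(seasons[year])
    # Prefix the statistic with the season so the columns do not collide.
    return season_df.rename(columns={"PTS": "{}_PTS".format(year)})


merged = draft_df
for year in (2015, 2016):
    # how="left" keeps every drafted player and drops undrafted ones.
    merged = pd.merge(merged, fake_regular_season(year), on="PLAYER", how="left")

print(merged)
```

Player C never appears in a season table and ends up with `NaN` in both season columns, while Player D is not in the draft table and is dropped. Because the join key is the name alone, two different players with the same name, or a season table that lists one player twice, would produce duplicated rows.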

In [50]:

physical_stats = pd.read_csv('nba-pre-draft-measurements.csv')
physical_stats.head()



Unnamed: 0,Name,Height w/o Shoes,Height w/shoes,Weight,Wingspan,Reach,Body Fat,Hand Length,Hand Width,No Step Vert,No Step Vert Reach,Max Vert,Max Vert Reach,Bench,Agility,Sprint,Rank,Drafted
0,* Shawn Bradley - 1993,"7' 5.5""",,248.0,"7' 5""",,,0.0,0.0,,,,,,,,,2
1,Michael Fusek - 2016,"7' 3.75""","7' 4.75""",222.0,"7' 5""","9' 8""",,0.0,0.0,25.0,"11' 9""",31.0,"12' 3""",,12.94,3.54,,No
2,Pavel Podkolzine - 2003,"7' 3.5""","7' 5""",303.0,"7' 5.75""","9' 8""",16.3,0.0,0.0,19.5,"11' 3.5""",22.5,"11' 6.5""",5.0,13.4,3.8,76.0,21
3,Samuel Deguara - 2011,"7' 3.40""","7' 4.58""",300.0,"7' 5.76""","9' 3.42""",7.5,0.0,0.0,17.6,"10' 9.02""",24.2,"11' 3.62""",,,4.21,,No
4,Zydrunas Ilgauskas - 1996,"7' 3""",,258.0,,,,0.0,0.0,,,,,,,,,20


In [51]:
"""
Edge case examples:
    Shareef O'Neal - 2015
    Maxwell Lorca-Lloyd - 2016
    Johnny O\xd5Bryant - 2013
    Chris Wright (Gtown) - 2009
"""
import re

pattern = r'\W*(\w* \w*[\'-]*\w*).* - (\d{4})'

def separateNameField(row):
    s = row['Name']
    s = s.replace('\xd5', '\'')
    m = re.search(pattern, s)
    
    row['Name'] = m.group(1)
    row['DraftYear'] = m.group(2)
    return row
    
physical_stats2 = physical_stats.apply(separateNameField, axis = 1)
physical_stats2.head()




Unnamed: 0,Name,Height w/o Shoes,Height w/shoes,Weight,Wingspan,Reach,Body Fat,Hand Length,Hand Width,No Step Vert,No Step Vert Reach,Max Vert,Max Vert Reach,Bench,Agility,Sprint,Rank,Drafted,DraftYear
0,Shawn Bradley,"7' 5.5""",,248.0,"7' 5""",,,0.0,0.0,,,,,,,,,2,1993
1,Michael Fusek,"7' 3.75""","7' 4.75""",222.0,"7' 5""","9' 8""",,0.0,0.0,25.0,"11' 9""",31.0,"12' 3""",,12.94,3.54,,No,2016
2,Pavel Podkolzine,"7' 3.5""","7' 5""",303.0,"7' 5.75""","9' 8""",16.3,0.0,0.0,19.5,"11' 3.5""",22.5,"11' 6.5""",5.0,13.4,3.8,76.0,21,2003
3,Samuel Deguara,"7' 3.40""","7' 4.58""",300.0,"7' 5.76""","9' 3.42""",7.5,0.0,0.0,17.6,"10' 9.02""",24.2,"11' 3.62""",,,4.21,,No,2011
4,Zydrunas Ilgauskas,"7' 3""",,258.0,,,,0.0,0.0,,,,,,,,,20,1996
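
The name-splitting pattern can be checked against the listed edge cases without the CSV file. The following is a minimal sketch that reuses the pattern from the cell above; the helper name `split_name_year` and its `None` return value for non-matching input are assumptions for illustration and are not part of the original cell, which would raise an `AttributeError` in that case.

```python
import re

# Same pattern as above: optional leading symbols, a two-word name whose
# second word may contain ' or -, any trailing note, then " - YYYY".
PATTERN = re.compile(r"\W*(\w* \w*['-]*\w*).* - (\d{4})")


def split_name_year(raw):
    """Hypothetical helper: return (name, draft_year), or None if no match."""
    cleaned = raw.replace(u"\xd5", u"'")  # mis-encoded apostrophe in the data
    match = PATTERN.search(cleaned)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


examples = [
    u"* Shawn Bradley - 1993",
    u"Shareef O'Neal - 2015",
    u"Maxwell Lorca-Lloyd - 2016",
    u"Johnny O\xd5Bryant - 2013",
    u"Chris Wright (Gtown) - 2009",
    u"no year here",
]

for raw in examples:
    print(split_name_year(raw))
```

The pattern captures exactly two words, so a name with more than two words, such as a hypothetical `First Middle Last - 2010`, would be cut to `First Middle`. Checking for such rows before merging on the name is worthwhile.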
