# The NBA's Next Super Star

## Abstract
Every NBA superstar has a rookie season. Whether it was a [terrible debut year (e.g. LeBron James)](http://www.thesportster.com/basketball/15-nba-superstars-who-were-terrible-in-their-debut-season/) or a rewarding one that included selection to the [All-Star Game (e.g. Tim Duncan)](http://www.nba-allstar.com/players/lists/all-star-game-rookies.htm), it is believed that evidence from a player's early years foretells his future success. In this project, we study the key factors that drive players toward a successful career and use them to predict the NBA's future superstars.

![alt text][all_star_game_2016]

[all_star_game_2016]: http://i.cdn.turner.com/nba/nba/dam/assets/160121164737-all-star-starters-graphic-1280-012116.1200x672.jpg "All Star Game 2016"


## Problem Statement
A common way to foresee a successful NBA player from his rookie year is to look at the draft order. However, this approach is incomplete because it does not take other extrinsic factors into account.

## Introduction
This project aims to predict the career growth of an NBA player using the following data sources, which reflect a player's on-court performance, annual video game ratings, and exposure on social media and in the news.

1. Performance: Past and current on-court statistics. (e.g. [1991 NBA Draft](http://www.basketball-reference.com/draft/NBA_1991.html))

2. Ratings: The player ratings in the NBA 2K16 video game. (e.g. [2K16 MyTEAM Players](http://2kmtcentral.com/16/players/theme/dynamic))

3. Social Media: The number of tweets a player posts might reflect his attitude and how seriously he takes his career. In addition, a language model built from the tweet content could be useful. (e.g. [Twitter API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline))

4. Related News: News carries information such as fan support, anticipation, expert analysis, and injury reports, so a language model built from it could be a strong indicator.

## Objective
Judging from indicators such as efficiency and other on-court stats, we would like to predict the likelihood that a player becomes a superstar who dominates the league in the future.

## Import Libraries

In [10]:
from urllib import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

## Data Preparation

We can obtain all rookies using the function scrape_draft. To better predict each player's future performance, his performance in each regular season should be considered. Therefore, we crawl NBA per-game player scoring from ESPN and merge it with the rookies dataframe. The result is a dataframe containing each rookie and his performance statistics since being drafted.
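
The merge step described above can be sketched on toy data. This is only an illustration under stated assumptions: the player names, the `Pk` and `PTS` values, and the per-year column suffix are made up, and the real output of `get_regular_season` may already be shaped differently.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped tables; names and numbers are made up.
draft = pd.DataFrame({"PLAYER": ["Player A", "Player B"], "Pk": [1, 2]})
seasons = {
    2015: pd.DataFrame({"PLAYER": ["Player A"], "PTS": [10.5]}),
    2016: pd.DataFrame({"PLAYER": ["Player A", "Player C"], "PTS": [14.0, 8.2]}),
}

merged = draft
for year, stats in sorted(seasons.items()):
    # Tag every stat column with its season so repeated merges do not collide.
    stats = stats.rename(
        columns=lambda c: c if c == "PLAYER" else "{}_{}".format(c, year))
    # how="left" keeps every drafted player and ignores players who are
    # not in the draft table (Player C here).
    merged = merged.merge(stats, on="PLAYER", how="left")

print(merged)
```

A drafted player with no row in a given season simply gets missing values for that season's columns, which is why a later fill or filter step is usually needed.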

In [11]:
def obj2numeric(df, cols, str_cols = ['Tm', 'PLAYER', 'College']):
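    """
    Convert the given columns to numeric type, skipping string fields.
    
    Args:
    df(data_frame): data frame to be converted
    cols(list): names of the columns to convert
    str_cols(list): names of string columns to leave untouched
    
    Return:
    (data_frame): data frame with numeric fields converted
    """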
    for c in cols:
        if c not in str_cols: df[c] = pd.to_numeric(df[c], errors='raise')
    return df

def transform_numeric(df):
    """
    convert columns type to numeric
    
    Args:
    (data_frame): data frame whose columns contain object type
    
    Return:
    (data_frame): data frame with numeric values converted
    """
    df.rename(columns={'WS/48':'WS_per_48', 'Player':'PLAYER'}, inplace=True)
    df.columns.values[14:18] = [df.columns.values[14:18][col] + "_per_game" for col in range(4)]
    df = obj2numeric(df, df.columns)
    df = df[df['PLAYER'].notnull()].fillna(0)
    df.loc[:,'Yrs':'AST'] = df.loc[:,'Yrs':'AST'].astype(int)
    return df
    
def scrape_draft(save_file, start_yr=1966, end_yr=2016):
    """
    Scrape draft data for the specified duration from:
    http://www.basketball-reference.com/draft/NBA_{year}.html
    
    Args:
    save_file(str): path of the CSV file to write the result to
    start_yr(int): first draft year to scrape
    end_yr(int): draft year to stop at (exclusive)
    
    Return:
    (data_frame): draft picks for every year in the range
    """
    url_format = 'http://www.basketball-reference.com/draft/NBA_{yr}.html'
    frames = []
    
    for y in range(start_yr, end_yr):
        url = url_format.format(yr = y)
        bs = BeautifulSoup(urlopen(url), 'html.parser')
        
        # Read column names from the second row and drop the leading rank column (Rk)
        tr_tags = bs.findAll('tr')
        th_tags = tr_tags[1].findAll('th')
        
        cols = [th.getText() for th in th_tags]; cols.pop(0)
        rows = [[td.getText() for td in tr_tags[i].findAll('td')] for i in range(2, len(tr_tags))]

        year_df = pd.DataFrame(rows, columns = cols)
        year_df.insert(0,'Draft_Yr', y)
        frames.append(year_df)
    
    df = pd.concat(frames)
    df = transform_numeric(df)
    df.to_csv(save_file)
    return df

In [13]:
# see ESPN_nba_player_stats.py for more details 
from ESPN_nba_player_stats import get_regular_season

def merge_draft_espn(draft_file, union_file, start_yr=1990, end_yr=2017):
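    """
    Merge all rookies from the draft with regular-season statistics from ESPN.
    
    Args:
    draft_file(str): CSV file written by scrape_draft
    union_file(str): path of the CSV file to write the merged result to
    start_yr(int): first season to merge
    end_yr(int): season to stop at (exclusive)
    
    Return:
    (data_frame): draft data left-joined with each season's ESPN statistics
    """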
    draft_df = pd.read_csv(draft_file, index_col=0)

    for yr in range(start_yr, end_yr):
        yr_data = get_regular_season(yr)
        # Left join: keep only players that appear in draft_df
        draft_df = pd.merge(draft_df, yr_data, on="PLAYER", how="left")

    draft_df.to_csv(union_file)
    return draft_df

start_yr = 1999
end_yr = 2017 # exclusive

draft_file_format = "draft_data_{start_yr}_to_{end_yr}.csv"
union_file_format = "all_data_{start_yr}_to_{end_yr}.csv"
draft_file = draft_file_format.format(start_yr = start_yr, end_yr = end_yr)
union_file = union_file_format.format(start_yr = start_yr, end_yr = end_yr)

# Retrieve 1999 - 2016 rookies, and ESPN regular season statistics from 1999 - 2016
draft_df = scrape_draft(draft_file, start_yr, end_yr)
all_df = merge_draft_espn(draft_file, union_file, start_yr, end_yr)
print all_df.head(10)
print all_df.columns

## Merge with all star stats

In [14]:
def separateNameField(row):
    pattern = r'(\w*[\'-]*\w* \w*[\'-]*\w*).*'
    s = row['Player']
    s = s.replace('\xd5', '\'')
    m = re.search(pattern, s)
    row['Player'] = m.group(1)
    return row

allstar_stats = pd.read_csv('ref/allstar.csv')
allstar_stats = allstar_stats.apply(separateNameField, axis = 1)

feature_stats = pd.read_csv('all_data_1999_to_2017.csv')
player_name = set(allstar_stats['Player'])


features = feature_stats.iloc[:, 0:23].copy()
# Flag players whose name appears in the all-star list
features['allstar'] = features['PLAYER'].isin(player_name)
    
df_college = pd.get_dummies(feature_stats['College'])
features = pd.concat([features, df_college], axis=1, join_axes=[features.index])
features = features.drop(['Unnamed: 0', 'College'], axis = 1)
features.to_csv('features.csv')


# TODO: plot university distribution
# pd.get_dummies(feature_stats['College']).sum(axis = 0)
# feature_stats[feature_stats['allstar'] == True][['PLAYER', 'allstar']]
#     feature_stats['allstar'] = feature_stats['PLAYER']
# feature_stats['allstar'] = player_name(feature_stats['allstar'])
# feature_stats['allstar']

IOError: File ref/allstar.csv does not exist

## Linear Regression

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the feature table built in the previous section
features_all = pd.read_csv('features.csv')
features_all = features_all.drop(['Unnamed: 0', 'Tm', 'PLAYER'], axis = 1)

y = features_all['allstar']
X = features_all.drop('allstar', axis = 1)
ind = int(X.shape[0] * 0.7)

# Shuffle the rows, then split 70/30 into training/testing sets
m, n = X.shape
perm = np.random.permutation(m)
X, y = X.loc[perm], y.loc[perm]
X_train = X[:ind]
X_test = X[ind:]

# Split the targets the same way
y_train = y[:ind]
y_test = y[ind:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training set
regr.fit(X_train, y_train)

# # The coefficients
# print('Coefficients: \n', regr.coef_)
# # The mean squared error
# print("Mean squared error: %.2f"
#       % np.mean((regr.predict(X_test) - y_test) ** 2))
# # Explained variance score: 1 is perfect prediction
# print('Variance score: %.2f' % regr.score(X_test, y_test))

# Threshold the regression output at 0.1 ... 0.9 and
# report training and testing accuracy for each threshold
# print regr.predict(X_train).shape
# print y_train
for threshold in range(1, 10, 1):
    threshold /= 10.0
    print threshold
    print ((regr.predict(X_train) > threshold)  == y_train).mean()
    print ((regr.predict(X_test) > threshold) == y_test).mean()
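
The loop above turns a regression output into a yes/no prediction by thresholding it and then measures accuracy. Because all-stars are a small minority of drafted players, it may help to compare those accuracies against always predicting the more common class. The following is a minimal, library-free sketch of both computations; the function names, scores, and labels are made up for illustration and are not part of the notebook.

```python
def accuracy_at_threshold(scores, labels, threshold):
    """Fraction of examples where (score > threshold) agrees with the label."""
    hits = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / float(len(labels))


def majority_baseline(labels):
    """Accuracy obtained by always predicting the more common class."""
    positives = sum(bool(y) for y in labels)
    return max(positives, len(labels) - positives) / float(len(labels))


# Made-up regression outputs and all-star flags, for illustration only.
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80]
labels = [False, False, False, False, True, True]

for t in (0.1, 0.3, 0.5):
    print(t, accuracy_at_threshold(scores, labels, t))
print("baseline", majority_baseline(labels))
```

A threshold is only informative if its test accuracy clearly beats the baseline; with a rare positive class, an accuracy that looks high can still be no better than predicting "not an all-star" for everyone.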

In [15]:
def transform_ft(df):
    int_cols = ['GP', 'total_FTM', 'total_FTA']
    df.loc[:,'GP':'FT%'] = df.loc[:,'GP':'FT%'].astype(float)
    df[int_cols] = df[int_cols].astype(int)
    return df

def transform_3p(df):
    int_cols = ['GP', '3PA', 'total_3PM', 'total_3PA', '2PM']
    df.loc[:,'GP':'FG%'] = df.loc[:,'GP':'FG%'].astype(float)
    df[int_cols] = df[int_cols].astype(int)
    return df

def transform_fg(df):
    df.columns = list(df.columns[:-1]) + [('adj_' + df.columns[-1]).decode('utf-8')]
    int_cols = ['GP', 'total_FGM', 'total_FGA', '2PM', '2PA']
    df.loc[:,'GP':'adj_FG%'] = df.loc[:,'GP':'adj_FG%'].astype(float)
    df[int_cols] = df[int_cols].astype(int)
    return df
    
def scrape_ncaa(url, cnt):
    bs = BeautifulSoup(urlopen(url), 'html.parser')
    tr_tags = bs.findAll('tr')
    rows = []
    cols = tr_tags[1].findAll('td')
    cols = [th.getText() for th in cols]
    cols[7:9] = ['total_' + s for s in cols[7:9]]

    for i in range(2, len(tr_tags)):
        td_tags = tr_tags[i].findAll('td')
        rows.append([th.getText() for th in td_tags])
    df = pd.DataFrame(rows, columns = cols)
    
    # Clean up
    df = df[df['PLAYER'] != 'PER GAME']
    df = df[df['PLAYER'] != 'PLAYER']
    df = df.reset_index()
    df.drop('index', axis = 1, inplace = True)
    df['RK'] = df.index + cnt
    return df

def scrape_ncaa_year(url):
    df_fg_list = []; cnt = 1
    for cnt in range(1, 100, 40):
        df_fg = scrape_ncaa(url + str(cnt), cnt)
        df_fg_list.append(df_fg)
    df = pd.concat(df_fg_list)
    df = df.reset_index(drop=True)
    return df

def scrape_ncaa_yr_range(s_yr, e_yr, url):
    df_list = []
    for yr in range(s_yr, e_yr):
        url_reg = url + str(yr) + '/count/'
        # Assumed: postseason pages take the same '/count/' suffix as regular-season pages
        url_pos = url + str(yr) + '/seasontype/3/count/'
        df_list.append(scrape_ncaa_year(url_reg))
        df_list.append(scrape_ncaa_year(url_pos))
    df = pd.concat(df_list)
    df = df.reset_index()
    df.drop('index', axis = 1, inplace = True)
    return df

def scrape_driver(s_yr, e_yr):
    # Create urls
    url_base = 'http://www.espn.com/mens-college-basketball/statistics/player/_/stat/'
    stat_list = ['free-throws', '3-points', 'field-goals']
    url_list = [url_base + stat + '/year/' for stat in stat_list]

    # Scrape free throw
    df_ft = scrape_ncaa_yr_range(s_yr, e_yr, url_list[0])
    df_ft = transform_ft(df_ft)
    df_ft.to_csv('free_throw.csv')

    # Scrape three-point
    df_3p = scrape_ncaa_yr_range(s_yr, e_yr, url_list[1])
    df_3p = transform_3p(df_3p)
    df_3p.to_csv('three_point.csv')

    # Scrape field-goal
    df_fg = scrape_ncaa_yr_range(s_yr, e_yr, url_list[2])
    df_fg = transform_fg(df_fg)
    df_fg.to_csv('field_goal.csv')
    
    return (df_ft, df_3p, df_fg)
    
(df_ft, df_3p, df_fg) = scrape_driver(2002, 2016)
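
The paging scheme used by the scraper (40 rows per page, starting ranks 1, 41, 81) can be checked offline before any request is sent. The helper below is a hypothetical sketch, not part of the notebook: `ncaa_page_urls` and its parameters are invented for illustration, and it assumes that postseason pages accept the same `/count/N` suffix after `/seasontype/3`.

```python
BASE = "http://www.espn.com/mens-college-basketball/statistics/player/_/stat/"


def ncaa_page_urls(stat, year, season_type=None, page_size=40, max_rank=100):
    """Build the list of page URLs for one stat and one season.

    season_type=None gives the regular-season URLs; passing 3 adds the
    '/seasontype/3' segment (assumed postseason layout).
    """
    url = "{}{}/year/{}".format(BASE, stat, year)
    if season_type is not None:
        url += "/seasontype/{}".format(season_type)
    # One URL per page: starting ranks 1, 41, 81 with the default arguments.
    return ["{}/count/{}".format(url, start)
            for start in range(1, max_rank, page_size)]


for page_url in ncaa_page_urls("free-throws", 2015):
    print(page_url)
```

Printing the URLs first makes it easy to spot a malformed address, such as a page number glued directly onto the season type.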

## NBA Pre-draft Measurements

In [None]:

physical_stats = pd.read_csv('nba-pre-draft-measurements.csv')
physical_stats.head()



In [None]:
"""
Edge case examples:
    Shareef O'Neal - 2015
    Maxwell Lorca-Lloyd - 2016
    Johnny O\xd5Bryant - 2013
    Chris Wright (Gtown) - 2009
"""
import re

pattern = r'\W*(\w* \w*[\'-]*\w*).* - (\d{4})'

def separateNameField(row):
    s = row['Name']
    s = s.replace('\xd5', '\'')
    m = re.search(pattern, s)
    
    row['Name'] = m.group(1)
    row['DraftYear'] = m.group(2)
    return row
    
physical_stats2 = physical_stats.apply(separateNameField, axis = 1)
physical_stats2.head()
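
The edge cases listed in the cell above can be checked directly against the pattern. The sketch below reuses the same regular expression in a standalone helper; `parse_name_year` is a hypothetical name, and returning `None` for non-matching input is an added safeguard rather than the notebook's behavior.

```python
import re

# Same pattern as the cell above: name, optional extra text, " - ", 4-digit year.
NAME_YEAR = re.compile(r"\W*(\w* \w*['-]*\w*).* - (\d{4})")


def parse_name_year(raw):
    """Return (name, draft_year), or None when the string does not match."""
    # Repair the mis-encoded apostrophe before matching.
    cleaned = raw.replace(u"\xd5", "'")
    match = NAME_YEAR.search(cleaned)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


examples = [
    u"Shareef O'Neal - 2015",
    u"Maxwell Lorca-Lloyd - 2016",
    u"Johnny O\xd5Bryant - 2013",
    u"Chris Wright (Gtown) - 2009",
    u"no year here",
]
for raw in examples:
    print(parse_name_year(raw))
```

One limitation worth keeping in mind: the pattern captures only the first two words of a name, so a three-word name such as `A B C - 2010` is reduced to `A B`.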


