# The NBA's Next Super Star

## Abstract
Every NBA superstar had a rookie season. Whether it was a [terrible debut year (e.g. LeBron James)](http://www.thesportster.com/basketball/15-nba-superstars-who-were-terrible-in-their-debut-season/) or a rewarding one capped by a selection to the [all-star game (e.g. Tim Duncan)](http://www.nba-allstar.com/players/lists/all-star-game-rookies.htm), it is believed that evidence from those early years foretells future success. In this project, we study the key factors that drive players towards a successful career, with the eventual goal of predicting the NBA's next superstar.

![alt text][all_star_game_2016]

[all_star_game_2016]: http://i.cdn.turner.com/nba/nba/dam/assets/160121164737-all-star-starters-graphic-1280-012116.1200x672.jpg "All Star Game 2016"


## Problem Statement
A common way to project a successful NBA player from his rookie year is to look at the draft order. However, this approach is incomplete because it does not take other extrinsic factors into account.

## Introduction
This project aims to predict the career growth of an NBA player with the aid of the following data sources, which reflect a player's on-court performance, annual video game ratings, and exposure on social media and in the news.

1. Performance: Past and current on-court statistics. (e.g. [1991 NBA Draft](http://www.basketball-reference.com/draft/NBA_1991.html))

2. Ratings: The player ratings in the NBA 2K16 video game. (e.g. [2K16 MyTEAM Players](http://2kmtcentral.com/16/players/theme/dynamic))

3. Social Media: The number of tweets a player posts might reflect his attitude and how seriously he takes his career. In addition, a language model built from the tweet content could be useful. (e.g. [Twitter API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline))

4. Related News: As news carries information such as fans' support, anticipation, expert analysis, and injury reports, a language model built from it could be a useful indicator.
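
As a rough illustration of the language-model idea in items 3 and 4, the sketch below builds a unigram (bag-of-words) model with add-one smoothing from a list of texts. It is only one possible reading of "language model"; the function names (`tokenize`, `build_unigram_model`, `log_prob`) and the sample tweets are made up for this example and do not come from the Twitter API or from this project's data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_unigram_model(texts):
    """Count how often each token appears across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def log_prob(text, counts):
    """Log-probability of `text` under the unigram model, with add-one smoothing."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    return sum(math.log((counts[tok] + 1.0) / (total + vocab))
               for tok in tokenize(text))


# Made-up example tweets, for illustration only.
sample_tweets = [
    "Great practice today, back in the gym tomorrow",
    "Film session then gym, no days off",
]
model = build_unigram_model(sample_tweets)
tweet_count = len(sample_tweets)  # the simple "number of tweets" feature
```

Under this sketch, a word the player uses often (such as "gym" here) scores a higher log-probability than an unseen word, so the model gives a crude signal of what a player tends to talk about.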

## Objective
Judging from indicators such as efficiency and other on-court stats, we would like to predict the likelihood that a player becomes a superstar who dominates the league in the future.
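
The text does not define "efficiency". Assuming it refers to the NBA's box-score efficiency statistic (EFF), which adds up positive contributions and subtracts missed shots and turnovers, a minimal sketch looks like this. The function name, its parameters, and the stat line are hypothetical and chosen only for illustration.

```python
def efficiency(pts, reb, ast, stl, blk, fgm, fga, ftm, fta, tov):
    """Single-game EFF: positive contributions minus missed shots and turnovers."""
    missed_fg = fga - fgm  # missed field goals
    missed_ft = fta - ftm  # missed free throws
    return pts + reb + ast + stl + blk - missed_fg - missed_ft - tov


# Hypothetical stat line, for illustration only.
game = dict(pts=20, reb=8, ast=5, stl=1, blk=1,
            fgm=8, fga=15, ftm=4, fta=6, tov=3)
eff = efficiency(**game)
```

Averaging this value over the games a player appeared in would give a per-game efficiency that could serve as one of the indicators.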

## Import Libraries

In [138]:
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2
from bs4 import BeautifulSoup
import pandas as pd

## Data Preparation

In [139]:
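def obj2numeric(df, cols, str_cols = ('Tm', 'Player', 'College')):
    """
    convert column type to numeric except those string fields
    
    Args:
    df(data_frame): data frame to be converted
    cols(list): names of the columns to consider
    str_cols(tuple): names of the string columns to leave untouched
    
    Return:
    (data_frame): data frame with numeric fields converted
    """
    for c in cols:
        if c not in str_cols:
            # errors='raise' stops on any value that cannot be parsed as a number
            df[c] = pd.to_numeric(df[c], errors='raise')
    return df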

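def transform_numeric(df):
    """
    rename ambiguous columns and convert columns type to numeric
    
    Args:
    df(data_frame): data frame whose columns contain object type
    
    Return:
    (data_frame): data frame with numeric values converted
    """
    df = df.rename(columns={'WS/48':'WS/48min'})
    # columns 14-17 repeat MP, PTS, TRB and AST as per-game averages,
    # so give them distinct names before any lookup by column name
    cols = list(df.columns)
    cols[14:18] = [c + "/game" for c in cols[14:18]]
    df.columns = cols
    df = obj2numeric(df, df.columns)
    # drop rows without a player name (such as rows with no td cells), then fill gaps with 0
    df = df[df['Player'].notnull()].fillna(0)
    df.loc[:,'Yrs':'AST'] = df.loc[:,'Yrs':'AST'].astype(int)
    return df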
    
def scrape_draft(start_yr = 1966, end_yr = 2016):
    """
    Scrape draft data for the specified duration from:
    http://www.basketball-reference.com/draft/NBA_{year}.html
    
    Args:
    start_yr(int): first draft year to scrape (inclusive)
    end_yr(int): year at which scraping stops (exclusive)
    
    Return:
    (data_frame): annual draft pick results, one row per drafted player
    """
    url_format = 'http://www.basketball-reference.com/draft/NBA_{yr}.html'
    frames = []
    
    for y in range(start_yr, end_yr):
        url = url_format.format(yr = y)
        bs = BeautifulSoup(urlopen(url), 'html.parser')
        
        # the second table row holds the column names; drop the leading rank column (Rk)
        tr_tags = bs.findAll('tr')
        th_tags = tr_tags[1].findAll('th')
        
        cols = [th.getText() for th in th_tags][1:]
        rows = [[td.getText() for td in tr_tags[i].findAll('td')] for i in range(2, len(tr_tags))]

        year_df = pd.DataFrame(rows, columns = cols)
        year_df.insert(0,'Draft_Yr', y)
        frames.append(year_df)
    
    # ignore_index gives every row a unique label across draft years
    df = pd.concat(frames, ignore_index=True)
    return transform_numeric(df)

scrape_draft(1991, 1992)

Unnamed: 0,Draft_Yr,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP/game,PTS/game,TRB/game,AST/game,WS,WS/48min,BPM,VORP
0,1991,1.0,CHH,Larry Johnson,"University of Nevada, Las Vegas",10,707,25685,11450,5300,...,0.332,0.766,36.3,16.2,7.5,3.3,69.7,0.13,2.4,28.3
1,1991,2.0,NJN,Kenny Anderson,Georgia Institute of Technology,14,858,25868,10789,2641,...,0.346,0.79,30.1,12.6,3.1,6.1,62.5,0.116,0.8,18.0
2,1991,3.0,SAC,Billy Owens,Syracuse University,10,600,17619,7026,4016,...,0.291,0.629,29.4,11.7,6.7,2.8,28.6,0.078,0.5,11.2
3,1991,4.0,DEN,Dikembe Mutombo,Georgetown University,18,1196,36791,11729,12359,...,0.0,0.684,30.8,9.8,10.3,1.0,117.0,0.153,2.1,38.2
4,1991,5.0,MIA,Steve Smith,Michigan State University,14,942,28855,13430,3060,...,0.358,0.845,30.6,14.3,3.2,3.1,83.7,0.139,1.4,24.4
5,1991,6.0,DAL,Doug Smith,University of Missouri,5,296,5833,2356,1234,...,0.083,0.773,19.7,8.0,4.2,1.4,3.0,0.024,-3.0,-1.5
6,1991,7.0,MIN,Luc Longley,University of New Mexico,10,567,12006,4090,2794,...,0.0,0.76,21.2,7.2,4.9,1.5,17.5,0.07,-1.5,1.5
7,1991,8.0,DEN,Mark Macon,Temple University,6,251,5018,1685,467,...,0.27,0.735,20.0,6.7,1.9,1.7,0.1,0.001,-3.2,-1.5
8,1991,9.0,ATL,Stacey Augmon,"University of Nevada, Las Vegas",15,1001,21658,7990,3216,...,0.152,0.728,21.6,8.0,3.2,1.6,43.8,0.097,-0.2,9.8
9,1991,10.0,ORL,Bison Dele,University of Arizona,8,413,10004,4536,2564,...,0.143,0.691,24.2,11.0,6.2,1.1,22.8,0.109,-0.3,4.4
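
To connect this table back to the problem statement, the sketch below measures how closely draft position tracks career win shares (WS) for the ten picks listed above, using a Spearman rank correlation. The `Pk` and `WS` values are copied from the table; the helper names `ranks` and `spearman` are introduced here for illustration, and the shortcut formula assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


# Pk and WS columns for the first ten picks shown in the table above.
picks = list(range(1, 11))
win_shares = [69.7, 62.5, 28.6, 117.0, 83.7, 3.0, 17.5, 0.1, 43.8, 22.8]
rho = spearman(picks, win_shares)
```

For these ten players `rho` comes out at roughly -0.55: earlier picks tend to accumulate more win shares, but the relationship is far from the -1 that a perfect ordering would give. The fourth pick (Dikembe Mutombo) has the highest WS of the group, which is the kind of gap that draft order alone does not explain.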
