Area of interest: Using ML methods to predict whether an NBA player is an All-Star

Steps:


1.   Generate test data
2.   Assess the test data for class imbalance and apply oversampling and/or undersampling techniques
3.   Implement an ML method and assess it with a confusion matrix and other metrics
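
As a rough sketch of steps 2 and 3, the snippet below counts the classes in a label list, randomly oversamples the minority class until the classes are balanced, and tallies a binary confusion matrix. It uses only the standard library; the function names (`random_oversample`, `confusion_matrix`), the toy data, and the threshold "model" are illustrative assumptions, not the method used later in this notebook.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# toy data (hypothetical): 8 non-all-star seasons and 2 all-star seasons, one feature (points)
rows = [[pts] for pts in (5, 7, 9, 11, 6, 8, 10, 12, 25, 28)]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(labels))           # class counts before oversampling
print(Counter(balanced_labels))  # class counts after oversampling

# a trivial threshold "model", used only to exercise the metric
preds = [1 if r[0] >= 12 else 0 for r in rows]
print(confusion_matrix(labels, preds))
```

In practice, oversampling should be applied only to the training split so that duplicated rows do not leak into the evaluation data.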



Step 1: Generate test data

We use NBA season data from Kaggle: https://www.kaggle.com/justinas/nba-players-data

This data:


*   Provides season averages for all players from the 1996 to 2019 NBA season
*   Has the standard columns (features) we expect such as games played, points, rebounds, assists, net rating, etc.
*   Includes advanced features such as usage percentage, true shooting percentage, and assist percentage (usg_pct, ts_pct, ast_pct).

However, the dataset does not include the target label (whether a player was an All-Star), so we need to acquire it separately.

We do so with a web scraper that pulls this information from each player's Wikipedia page.
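
A minimal sketch of the idea behind the scraper, assuming the player's infobox is a `table` with the classes `infobox vcard` and that an All-Star selection shows up as a link titled `NBA All-Star` (the same assumptions the scraper in this notebook makes). It uses only the standard library and a hand-written HTML snippet instead of a live request; `wiki_url`, `AllStarLinkFinder`, `is_all_star`, and the sample markup are hypothetical and for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import quote


def wiki_url(player_name):
    """Build a Wikipedia article URL from a player's name, percent-encoding unsafe characters."""
    return "https://en.wikipedia.org/wiki/" + quote(player_name.replace(" ", "_"))


class AllStarLinkFinder(HTMLParser):
    """Flag an <a title="NBA All-Star"> link that sits inside an infobox table."""

    def __init__(self):
        super().__init__()
        self.table_stack = []  # one bool per open <table>: is it an infobox?
        self.found = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = (attrs.get("class") or "").split()
            self.table_stack.append("infobox" in classes and "vcard" in classes)
        elif tag == "a" and any(self.table_stack) and attrs.get("title") == "NBA All-Star":
            self.found = True

    def handle_endtag(self, tag):
        if tag == "table" and self.table_stack:
            self.table_stack.pop()


def is_all_star(html):
    """Return True if the page markup contains an All-Star link inside an infobox."""
    finder = AllStarLinkFinder()
    finder.feed(html)
    return finder.found


# hand-written sample markup (hypothetical), standing in for a downloaded page
sample = (
    '<table class="infobox vcard"><tr><td><ul>'
    '<li>2x <a href="/wiki/NBA_All-Star_Game" title="NBA All-Star">NBA All-Star</a> (1990, 1992)</li>'
    '</ul></td></tr></table>'
)
not_infobox = '<table class="wikitable"><tr><td><a title="NBA All-Star">x</a></td></tr></table>'

print(wiki_url("Ed O'Bannon"))
print(is_all_star(sample))
print(is_all_star(not_infobox))
```

One caveat with building the URL from the name alone: a common name may resolve to a disambiguation page or to a different person, in which case the infobox lookup finds nothing or finds the wrong player.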

In [4]:
import pandas as pd
#load the data into a pandas DataFrame
data = pd.read_csv("all_seasons.csv")

#acquire all the unique player names from the data
uniquePlayers = data['player_name'].unique()

#11145 total stat lines
#2235 unique players

data.head()

Unnamed: 0.1,Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season
0,0,Dennis Rodman,CHI,36.0,198.12,99.79024,Southeastern Oklahoma State,USA,1986,2,27,55,5.7,16.1,3.1,16.1,0.186,0.323,0.1,0.479,0.113,1996-97
1,1,Dwayne Schintzius,LAC,28.0,215.9,117.93392,Florida,USA,1990,1,24,15,2.3,1.5,0.3,12.3,0.078,0.151,0.175,0.43,0.048,1996-97
2,2,Earl Cureton,TOR,39.0,205.74,95.25432,Detroit Mercy,USA,1979,3,58,9,0.8,1.0,0.4,-2.1,0.105,0.102,0.103,0.376,0.148,1996-97
3,3,Ed O'Bannon,DAL,24.0,203.2,100.697424,UCLA,USA,1995,1,9,64,3.7,2.3,0.6,-8.7,0.06,0.149,0.167,0.399,0.077,1996-97
4,4,Ed Pinckney,MIA,34.0,205.74,108.86208,Villanova,USA,1985,1,10,27,2.4,2.4,0.2,-11.2,0.109,0.179,0.127,0.611,0.04,1996-97


In [8]:
from bs4 import BeautifulSoup
import requests
import numpy as np

#create dataframe with all players and a column indicating whether they were ever selected as an all star during their career
allStarData = pd.DataFrame({'unique_player': uniquePlayers,'All_Star': np.zeros(len(uniquePlayers))})

num = len(allStarData)

#iterate through all NBA players and identify whether the player had been an all star
for i in range(num):
    player = allStarData['unique_player'].iloc[i].replace(" ","_")

    url = "https://en.wikipedia.org/wiki/"+player

    pages = requests.get(url)

    soup = BeautifulSoup(pages.content,'html.parser')

    playerCard = soup.find('table',attrs={'class':'infobox vcard'})

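    if playerCard is not None:
        temp = playerCard.find("a",{"title": "NBA All-Star"})

        #the All_Star column starts at 0, so only players with an all-star link need updating
        if temp is not None:
            #assign with .loc to write to allStarData directly (avoids chained assignment)
            allStarData.loc[i, 'All_Star'] = 1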
    if(i%100 == 0):
        pctComplete = "{:.0%}".format(i/num)
        print(pctComplete + " Complete")

#create a CSV that only has players who have been selected as an all-star at least once
allStarData[allStarData['All_Star'] == 1].to_csv('allStarDesignation_AllStarOnly.csv')
print('ding!')



0% Complete
4% Complete
9% Complete
13% Complete
18% Complete
22% Complete
27% Complete
31% Complete
36% Complete
40% Complete
45% Complete
49% Complete
54% Complete
58% Complete
63% Complete
67% Complete
72% Complete
76% Complete
81% Complete
85% Complete
89% Complete
94% Complete
98% Complete
ding!


In [10]:
from bs4 import BeautifulSoup
import re
import requests
import pandas as pd
import numpy as np

#import data with only all stars
aSData = pd.read_csv("allStarDesignation_AllStarOnly.csv")

#add two columns of interest: the number of all-star selections and the years of those selections
aSData['All-Star_Count'] = 0
aSData['All-Star_Years'] = ''

for a in range(len(aSData)):
    player = aSData['unique_player'].iloc[a].replace(" ","_")

    url = "https://en.wikipedia.org/wiki/"+player

    pages = requests.get(url)

    soup = BeautifulSoup(pages.content,'html.parser')

    playerCard = soup.find('table',attrs={'class':'infobox vcard'})

    playerAwards = playerCard.find_all('ul')

    #find the first <ul> in the infobox that mentions 'All-Star'
    y = 0
    pAIndex = None
    while pAIndex is None:
        checker = str(playerAwards[y])
        if 'All-Star' in checker:
            pAIndex = y
        y += 1

    playerAwardsList = playerAwards[pAIndex].text.split('\n')

    #find the first line of that list that mentions 'All-Star'
    playerIndex = None
    x = 0
    while playerIndex is None:
        if 'All-Star' in playerAwardsList[x]:
            playerIndex = x
        x += 1

    #grab the text inside each pair of parentheses; the first match holds the all-star years
    #(.*? is a non-greedy match, so each match stops at the first closing parenthesis)
    aYears = re.findall(r'\(.*?\)', playerAwardsList[playerIndex])

    aYearsString = re.sub('[()]', '', aYears[0])
    aYearsString = re.sub(' ', '', aYearsString)

    aYearsList = aYearsString.split(',')

    aYearsRaw = []

    #expand any entry given as a range (e.g. 2000-2003) into its individual years
    for yearString in aYearsList:
        if len(yearString) > 4:
            #assumes two 4-digit years separated by a single character
            start = int(yearString[0:4])
            end = int(yearString[5:])
            aYearsRaw = aYearsRaw + list(range(start, end + 1))
        else:
            aYearsRaw.append(int(yearString))
    #assign with .loc to write to aSData directly (avoids chained assignment)
    aSData.loc[a, 'All-Star_Count'] = len(aYearsRaw)
    aSData.loc[a, 'All-Star_Years'] = str(aYearsRaw)

#export to csv
aSData.to_csv('aSData_withCountAndYears.csv')
print('ding!')



ding!
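
The year-expansion step in the cell above can be isolated into a small function, which makes it easier to check against a few sample strings. This is a sketch under assumptions: the awards line is taken to look like `5x NBA All-Star (2000-2003, 2005)`, ranges may use either a hyphen or an en dash, and `parse_all_star_years` is a hypothetical helper name.

```python
import re


def parse_all_star_years(award_line):
    """Expand the parenthesized years in an awards line into a list of ints."""
    match = re.search(r"\(([^)]*)\)", award_line)
    if match is None:
        return []
    years = []
    for part in match.group(1).split(","):
        # each part is either a single year or a range such as 2000-2003
        bounds = [int(y) for y in re.findall(r"\d{4}", part)]
        if len(bounds) == 2:
            years.extend(range(bounds[0], bounds[1] + 1))
        elif len(bounds) == 1:
            years.append(bounds[0])
    return years


print(parse_all_star_years("5x NBA All-Star (2000-2003, 2005)"))
# [2000, 2001, 2002, 2003, 2005]
```

Matching the years with `\d{4}` rather than slicing at fixed positions keeps the function indifferent to which dash character separates the two ends of a range.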


In [11]:
#Combine the baseline dataset with the all-star data acquired above
#Mark each season in which a player was selected as an all-star with 1; all other seasons stay 0
import pandas as pd
import numpy as np
import re

#long list of years yielded '...' so set large column width so full list is retrieved
pd.set_option("display.max_colwidth", 10000)

#read in baseline dataset and all star dataset
fullData = pd.read_csv('all_seasons.csv')
allStarData = pd.read_csv('aSData_withCountAndYears.csv')

col = np.zeros(len(fullData))

fullData['all_star'] = col

#pull player rows from full data based on string of name

#1996 all star means 1995-1996 season

#iterate through all stars
for z in range(len(allStarData)):
    playerName = allStarData['unique_player'].iloc[z]

    #find all rows in full dataset with corresponding player name
    isPlayer = fullData['player_name'] == playerName
    playerRows = fullData[isPlayer]

    #the years are stored as a string such as "[2000, 2001]"; pull out each year and convert to int
    allStarList = [int(j) for j in re.findall(r'\d+', allStarData['All-Star_Years'].iloc[z])]

    #subtract one from each all-star year so it matches the starting year of the season
    #e.g. a 1996 all-star selection belongs to the 1995-96 season
    allStarList = [x - 1 for x in allStarList]

    #iterate over all of the player's seasons, keyed by their row label in fullData
    for rowLabel, season in playerRows['season'].items():
        #season looks like '1996-97'; drop the last three characters to get the starting year
        year = int(season[:-3])
        #mark the season if the player was selected as an all-star that year
        if year in allStarList:
            #assign with .loc to write to fullData directly (avoids chained assignment)
            fullData.loc[rowLabel, 'all_star'] = 1

#export to csv
fullData.to_csv('all_seasons_w_all_star.csv')
print('ding!')



ding!
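
The season-matching rule used above (an All-Star selection in year Y belongs to the season that starts in Y - 1) can be written as a pair of small helpers. A minimal sketch: `season_start_year` and `label_seasons` are hypothetical names, the sample years are made up for illustration, and season labels are assumed to follow the `YYYY-YY` format seen in the `season` column.

```python
def season_start_year(season):
    """Return the starting year of a season label such as '1996-97'."""
    return int(season.split("-")[0])


def label_seasons(seasons, all_star_years):
    """Return 1 for each season in which the player was an All-Star, else 0."""
    start_years = {year - 1 for year in all_star_years}
    return [1 if season_start_year(s) in start_years else 0 for s in seasons]


# hypothetical player: All-Star selections in 1997 and 2000
seasons = ["1996-97", "1997-98", "1998-99", "1999-00"]
print(label_seasons(seasons, [1997, 2000]))
# [1, 0, 0, 1]
```

Keeping the off-by-one adjustment in one place makes it easy to spot-check against a player whose All-Star years are known.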
