# Predicting NBA Games by selecting participating players

This is a submission for the Data Scraping course at Humboldt University. The goal is to build an interface in which a user selects NBA players and the outcome of a match between them is estimated from every player's individual performance and head-to-head (VS) performance.

## Getting Player performance data

All player data was taken from NBA.com's stats API. The data could also have been obtained elsewhere (e.g. from the site the player images are scraped from, described later), but dealing with an unofficially and incompletely documented API seemed like a good challenge.

### Is getting data from stats.nba.com against the site's terms of use?

According to http://www.nba.com/news/termsofuse/#priv9, stats.nba.com allows the use of all stats for non-commercial / unpublished purposes. The robots.txt file does not disallow scraping of any kind either. However, when the site is accessed without browser-like HTTP request headers (as is common for a bot), it does not return a proper response. The terms of use also require that a project prominently states that the data is from NBA.com, and this project adheres to that requirement.
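
The robots.txt check mentioned above can also be done programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt text, the bot name and the example.com URLs are made up for illustration and are not the actual rules of stats.nba.com.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only to illustrate the check
sample = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample, "my-bot", "http://example.com/stats/players"))
print(is_allowed(sample, "my-bot", "http://example.com/private/page"))
```

A robots.txt file only expresses the crawling rules; the terms of use still have to be read separately, as done above.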

# Statistical performance data:

Data is from NBA.com

## Getting a list of all current NBA players

First, a list of all NBA players currently on a team roster is needed. Because the NBA stats website uses IDs to identify teams and players, the ID of every player is required to request their individual stats.

In [11]:
import pandas as pd
import requests
import numpy as np
import urllib.parse as ur
# define the header of url request
headers = { 'Host': 'stats.nba.com',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.33 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'de,en;q=0.9'}

# first get list of all players and their team IDS
def getPlayers():
    # define parameters
    params = {'leagueId'            : '00',
              'season'              : '2017-18',
              'isOnlyCurrentSeason' : 1,}
    # create HTTP request
    url = "http://stats.nba.com/stats/commonallplayers/?"+ur.urlencode(params)
    response = requests.get(url, headers=headers)
    # read response content and create a DataFrame
    data = response.json()['resultSets'][0]
    colnames = data['headers']
    values = data['rowSet']
    players = pd.DataFrame(values,columns=colnames)
    currentPlayers = players[players.ROSTERSTATUS != 0]
    return currentPlayers
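
To make the shape of the response easier to see, here is a small offline sketch of the same parsing step. It assumes only the structure that `getPlayers()` relies on (`resultSets`, `headers`, `rowSet`); the payload, IDs and names are invented, and plain dictionaries stand in for the DataFrame so the example runs without pandas.

```python
def result_set_to_rows(payload, index=0):
    """Turn one entry of 'resultSets' into a list of {header: value} dicts."""
    result_set = payload["resultSets"][index]
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]


# Made-up payload that only mimics the shape used above
sample_payload = {
    "resultSets": [
        {
            "headers": ["PERSON_ID", "DISPLAY_FIRST_LAST", "ROSTERSTATUS", "TEAM_ID"],
            "rowSet": [
                [1, "Player One", 1, 100],
                [2, "Player Two", 0, 0],
            ],
        }
    ]
}

rows = result_set_to_rows(sample_payload)
# same filter as in getPlayers(): keep players whose roster status is not 0
current = [r for r in rows if r["ROSTERSTATUS"] != 0]
print(current)
```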


## Get individual stats for last 10 games

Next, the last 10 games of every player are requested. The assumption is that the last 10 games are the most representative of a player's current performance.

In [12]:
# for every selected player get individual stats of last 10 games
# the process is very similar to getPlayers()
# I could potentially have put all these very similar api call functions into one
def individualStats(playerid,players):
    print("getting individual stats")
    # sum is used here to remove the step to make this a list and take the 0'th element
    teamid = sum(players[players.PERSON_ID==playerid]['TEAM_ID'])
    params = {'measureType'    : 'Base',
              'perMode'        : 'PerGame',
              'leagueId'       : '00',
              'season'         : '2017-18',
              'seasonType'     : 'Regular Season',
              'poRound'        : 0,
              'teamId'         : teamid,
              'playerId'       : playerid,
              'outcome'        : '',
              'location'       : '',
              'month'          : 0,
              'seasonSegment'  : '',
              'dateFrom'       : '',
              'dateTo'         : '',
              'oppTeamId'      : 0,
              'vsConference'   : '',
              'vsDivision'     : '',
              'gameSegment'    : '',
              'period'         : 0,
              'shotClockRange' : '',
              'lastNGames'     : 10}
    url = "http://stats.nba.com/stats/playergamelog/?"+ur.urlencode(params)
    response = requests.get(url, headers=headers)
    data = response.json()['resultSets'][0]
    colnames = data['headers']
    values = data['rowSet']
    print("successfully got individual stats for: ",playerid)
    return pd.DataFrame(values,columns=colnames)
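
As the comment in the cell above notes, the API call functions are very similar and could be merged. One possible shape for such a helper is sketched below. The names `build_url`, `fetch_table` and the `get` parameter are hypothetical; passing the download step in as a callable is a design choice made here so the example can be tried without network access.

```python
import urllib.parse as ur

BASE_URL = "http://stats.nba.com/stats/"


def build_url(endpoint, params):
    """Build the request URL for a stats endpoint from a parameter dict."""
    return BASE_URL + endpoint + "/?" + ur.urlencode(params)


def fetch_table(endpoint, params, get):
    """Request one endpoint and return (column names, rows) of its first result set.

    `get` is any callable that maps a URL to the decoded JSON payload, for example
    lambda url: requests.get(url, headers=headers).json()
    """
    payload = get(build_url(endpoint, params))
    data = payload["resultSets"][0]
    return data["headers"], data["rowSet"]


def fake_get(url):
    # Stand-in for the real HTTP request; returns an invented payload
    return {"resultSets": [{"headers": ["PTS"], "rowSet": [[12], [20]]}]}


url = build_url("playergamelog", {"leagueId": "00", "season": "2017-18", "lastNGames": 10})
colnames, values = fetch_table("playergamelog", {"lastNGames": 10}, fake_get)
print(url)
print(colnames, values)
```

With such a helper, each of the functions in this notebook would only need to define its own parameter dictionary.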

## Getting stats vs all opposing players

Next, the stats against all opposing players are requested. These should represent an individual's performance against specific players.

In [13]:
# Again the structure is very similar to the previous api call functions
# The difference here is that offensive and defensive stats can be requested. However, the stats seem to be
# exactly the same as when the input player is swapped with the opposing player,
# therefore the offensive/defensive distinction is not really needed.
def vsPlayersStats(playerid,offensive):
    print("getting VS stats")
    if offensive:
        offid = playerid
        defid = ''
    else:
        offid = ''
        defid = playerid
    
    params = {"LeagueID":"00",
              "Season":"2017-18",
              "SeasonType":"Regular Season",
              "PORound":0,
              "PerMode":"PerGame",
              "Outcome":'',
              "DateFrom":'',
              "DateTo":'',
              "DefTeamID":'',
              "OffTeamID":'',
              "OffPlayerID":offid,
              "DefPlayerID":defid}
    url = 'https://stats.nba.com/stats/leagueseasonmatchups?'+ur.urlencode(params)
    response = requests.get(url, headers=headers)

    data = response.json()['resultSets'][0]
    colnames = data['headers']
    values = data['rowSet']
    print("successfully got vs stats for: ",playerid)
    return pd.DataFrame(values,columns=colnames)




## Calculate an individual score

After getting all stats for every player, a score is calculated by assigning a weight to every stat that could affect player performance. The weights are arbitrary: I simply made up a number for every stat. For the result to be statistically meaningful, the importance of each variable would first have to be determined. Because machine learning and data science have already been such a large part of my semester, I decided to skip this part.

In [14]:

# Weights for every metric, and the interesting metrics from the stats
indMetrics = [['PTS','FG_PCT','FG3_PCT','FT_PCT','AST','STL','BLK','REB','TOV','PF'],[1,2,3,1,0.5,0.5,1,0.5,-2,-2]]
# It is assumed that the latest game should give the best indicator on the players current performance,
# therefore the weighted average is applied. This might not necessarily be the best way, because it should be considered 
# against which teams the player is playing and with which other players...
def indScore(df,metrics):
    # DataFrame.as_matrix was removed in pandas 1.0; .values works in old and new versions
    matrix = df[metrics[0]].values
    l = matrix.shape[0]+1
    # importance of last 10 games, last one is most important, 10th one less.
    weights = list(reversed(range(1,l)))
    # weighted average
    averaged = np.average(matrix,axis=0, weights=weights)
    scoreweight = metrics[1]
    # weighted sum
    score = sum(averaged*scoreweight)
    return score,averaged[0]

# create a dictionary based on calculated individual scores
def scoreDict(input1,input2,players,metrics):
    inputlist = np.append(input1,input2)
    dic = {}
    for i in inputlist:
        if(i in dic):
            continue
        else:
            dic[i] = indScore(individualStats(i,players),metrics)
    return dic
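
The two weighting steps in `indScore` can be followed by hand on a small example. The sketch below uses three invented games and only two of the metrics. It assumes the game log lists the most recent game first, which is what the weighting in `indScore` implies.

```python
import numpy as np

# Three made-up games, most recent first; columns are PTS and FG_PCT
games = np.array([[30.0, 0.5],
                  [20.0, 0.4],
                  [10.0, 0.6]])

# Recency weights 3, 2, 1: the most recent game counts the most
recency = np.arange(len(games), 0, -1)
averaged = np.average(games, axis=0, weights=recency)

# Stat weights as in indMetrics: 1 for PTS, 2 for FG_PCT
stat_weights = np.array([1.0, 2.0])
score = float(np.dot(averaged, stat_weights))

print(averaged)  # weighted PTS is (30*3 + 20*2 + 10*1) / 6
print(score)
```

Here the weighted points average is 140/6 and the weighted field goal percentage is 2.9/6, so the score is 140/6 + 2 * 2.9/6 = 24.3.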

In [15]:
# the same kind of score creation is done with the opposing statistics
multiMetrics = [['PLAYER_PTS','FG_PCT','FG3_PCT','FTM','AST','BLK','TOV','SFL','DEF_FOULS','OFF_FOULS'],[1,2,3,0.1,0.5,1,-1,-1,-0.5,-0.5]]

def oppScore(ownTeam,oppTeam,dic):
    team = {}
    for i in ownTeam:
        # skip calculating a score twice if a user selects the same player multiple times
        # this is done because getting opposing score is time and a bit more bandwidth intensive
        if i in team:
            continue
        else:
            offStats = vsPlayersStats(playerid=i,offensive=True)
            opps = offStats[offStats.DEF_PLAYER_ID.isin(oppTeam)]
            weights = []
            for j in list(opps['DEF_PLAYER_ID']):
                weights.append(dic[j][1])
            # DataFrame.as_matrix was removed in pandas 1.0; .values works in old and new versions
            stats = opps[multiMetrics[0]].values
            if stats.size:
                averaged = np.average(stats,axis=0,weights=weights)
            else:
                averaged = 0
            scoreweight = multiMetrics[1]
            score = sum(averaged*scoreweight)
            team[i] = score
    print(team)
    return team
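
The averaging step inside `oppScore` weights every opponent by that opponent's own scoring average. A possible standalone version of this step is sketched below with invented numbers. The function name and the fallback to a plain mean are assumptions added here, because `np.average` raises an error when all weights are zero.

```python
import numpy as np


def weighted_matchup_average(stats, weights):
    """Average matchup rows, weighting each opponent by its scoring average."""
    stats = np.asarray(stats, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if stats.size == 0:
        # no recorded matchups against the selected opponents: all metrics are 0
        return np.zeros(stats.shape[1:])
    if weights.sum() == 0:
        # np.average cannot normalize weights that sum to zero; use a plain mean
        return stats.mean(axis=0)
    return np.average(stats, axis=0, weights=weights)


# Two made-up matchups; columns are PLAYER_PTS and FG_PCT
matchups = [[10.0, 0.5],
            [20.0, 0.3]]
# invented scoring averages of the two opponents
opponent_points = [30.0, 10.0]

print(weighted_matchup_average(matchups, opponent_points))
print(weighted_matchup_average(matchups, [0.0, 0.0]))
print(weighted_matchup_average(np.empty((0, 2)), []))
```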



In [16]:
# finally a result is calculated which is presented to the user.
# In a second revision of this project, it should be possible to display all stats and to go into
# the specifics of why a score turned out the way it did.
def calculateResults(dic,score1,score2,input1,input2):
    indSum2 = sum([dic.get(key)[0] for key in input2])
    vsSum2 = sum([score2.get(key) for key in input2])
    if(indSum2 == 0):
        indRes = 0
    else:
        indRes = sum([dic.get(key)[0] for key in input1])/indSum2
    if(vsSum2 == 0):
        vsRes =0
    else:
        vsRes = sum([score1.get(key) for key in input1])/vsSum2
    return indRes,vsRes
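
Both results are ratios of team 1 to team 2, so a value above 1 favours team 1 (red) and a value below 1 favours team 2 (blue). The sketch below reproduces this logic, together with the way the interface code later in this notebook combines the two ratios, on invented totals. The function names are hypothetical.

```python
def ratio(team1_total, team2_total):
    """Team 1 total divided by team 2 total, or 0 when team 2 has no score."""
    return 0 if team2_total == 0 else team1_total / team2_total


def verdict(ind_res, vs_res):
    """Combine both ratios and name the winner."""
    # without matchup data only the individual ratio is used
    res = ind_res if vs_res == 0 else (ind_res + vs_res) / 2
    if res > 1:
        return res, "RED WINS!"
    if res < 1:
        return res, "BLUE WINS!"
    return res, "DRAW"


# made-up team totals: individual scores 120 vs 100, matchup scores 45 vs 50
ind_res = ratio(120, 100)
vs_res = ratio(45, 50)
print(verdict(ind_res, vs_res))
print(verdict(ratio(90, 100), ratio(10, 0)))
```

In the first call the individual ratio is 1.2 and the matchup ratio is 0.9, so the combined result is about 1.05 and red wins. In the second call the matchup ratio is 0 and only the individual ratio of 0.9 is used.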



## Getting mugshots for every player

Because just using an API to get data was too little web scraping for this project, I decided to also get image data for every player. Because nba.com forbids scraping image data unless the bot is the Twitter bot, a different page is used here. A number of challenges arise because of this switch, but as the goal of this project is to learn something, these challenges will be solved.

All image data is taken from foxsports.com, which allows scraping the image and statistics data from its pages (at least according to its robots.txt). Its terms of use explicitly forbid downloading and reproducing its content, but this seems to refer mostly to the original video content and not to the player images. If this were a public project, I would most likely use a different source or send them an email.

In [17]:
from bs4 import BeautifulSoup
import re

# To start out, urls for all players are scraped. A dictionary is created which lists names and stat urls.
def getPlayerUrls():
    url = "https://basketball.realgm.com/nba/players/2018"
    response = requests.get(url)
    # use beautifulSoup to parse the html
    soup = BeautifulSoup(response.content, "html5lib") 
    links = soup.find(class_="tablesaw").find_all('a',href=re.compile('(/player).*'))
    linkDict = {}
    for i in links:
        # map the visible player name to the absolute URL of the player's page
        linkDict[i.get_text()] = 'https://basketball.realgm.com'+i.get('href')
    return linkDict
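
The link extraction can be tried offline on a small HTML snippet. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages; the HTML, player names and paths are invented. It keeps hrefs that start with `/player`, which is slightly stricter than the regular expression used above.

```python
from html.parser import HTMLParser


class PlayerLinkParser(HTMLParser):
    """Collect {link text: href} for every <a> whose href starts with /player."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None  # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("/player"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links["".join(self._text).strip()] = self._href
            self._href = None


# Invented markup that only mimics a table of player links
sample_html = """
<table class="tablesaw">
  <tr><td><a href="/player/Player-One/Summary/1">Player One</a></td></tr>
  <tr><td><a href="/nba/teams/Some-Team/1">Some Team</a></td></tr>
  <tr><td><a href="/player/Player-Two/Summary/2">Player Two</a></td></tr>
</table>
"""

parser = PlayerLinkParser()
parser.feed(sample_html)
base = "https://basketball.realgm.com"
link_dict = {name: base + href for name, href in parser.links.items()}
print(link_dict)
```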

In [18]:
# Download a helper file from my GitHub because I cannot submit two files.
# This is required because in Python notebooks on Windows, functions defined in the notebook namespace
# are not found by the multiprocessing library. When the functions are loaded by importing a Python file,
# they work.
with open("helperfunctions.py", "wb") as file:
    # get request
    response = requests.get('https://raw.githubusercontent.com/BuzzWoll/DataScrapingWS1718/master/helperfunctions.py')
    # write to file
    file.write(response.content)

# create a dictionary from an ordered input list and the original link dictionary
# (the list must be in the same order as the keys of linkDict)
def createDictFromDict(inputList,linkDict):
    return dict(zip(linkDict.keys(),inputList))



In [19]:
# actually get data
players =getPlayers()
linkDict = getPlayerUrls()

### Making use of multiprocessing
To speed up web scraping, multiple processes are used to search for specific parts of an HTML document in parallel. It is important to realize that the requests also run in parallel, so this machine requests 6 pages at once in very rapid succession.

Using multiple processes reduces the execution time of this entire project by up to 10 minutes. Using lxml instead of BeautifulSoup also seems to be an important factor when it comes to efficiency.

The goal of this part is to download the image data and store it in a form that bokeh can use as a glyph.

In [20]:
# a pool containing processes is created and the two helper functions to download all image data are executed.
from multiprocessing import Pool
from helperfunctions import getImageUrl,getImageData
from functools import partial
    
urls = list(linkDict.values())
# the with-block shuts the worker processes down once both map calls have finished
with Pool(6) as pool: # 6 processes
    playerImgUrls = pool.map(getImageUrl,urls)
    imgUrlDict = createDictFromDict(playerImgUrls,linkDict)
    imageUrls = list(imgUrlDict.values())
    imageData = pool.map(getImageData,imageUrls)
imgDict = createDictFromDict(imageData,linkDict)
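
A possible alternative to a process pool is a thread pool from `concurrent.futures`. Threads do not need to pickle the worker function, so functions defined directly in a notebook can be used, and waiting for network responses is the kind of work threads handle well. The sketch below is not what this project does; `fake_fetch`, the URLs and the names are placeholders, and the CPU-bound HTML parsing would not be sped up in the same way as with processes.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, urls, workers=6):
    """Apply fetch to every URL using a pool of threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the same order as urls
        return list(executor.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real download such as getImageUrl; it only echoes the URL
    return "image-for:" + url


names = ["Player One", "Player Two", "Player Three"]
urls = ["http://example.com/1", "http://example.com/2", "http://example.com/3"]
results = fetch_all(fake_fetch, urls)
# because the order is preserved, names and results can be zipped together
name_to_result = dict(zip(names, results))
print(name_to_result)
```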

In [21]:
# Because different data sources are used, different names are an issue. Therefore the names are compared
# and the imageDictionary is aligned with the stats dictionary
# John Holland will be placeholder for all missing images for players

# All dictionary keys are defined as sets and set difference operations are used to get the different names
set1 = set(players.DISPLAY_FIRST_LAST)-set(imgDict.keys())
set2 = set(imgDict.keys())-set(players.DISPLAY_FIRST_LAST)
missing = set(['Briante Weber','Jarell Eddie', 'Jeremy Evans','Larry Drew II', 'Marquis Teague','Xavier Silas'])
missing2 = set(['John Holland', 'London Perrantes'])

origNames = set1-missing
otherNames = set2-missing2

for i in missing:
    imgDict[i] = imgDict['John Holland']
    
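# sets have no defined order, so both sides are sorted before pairing them up;
# this assumes the differing spellings still sort into the same alphabetical order
for i,j in zip(sorted(origNames),sorted(otherNames)):
    imgDict[i] = imgDict.pop(j)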
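
Pairing the remaining names by position depends on both collections lining up. A more robust option could be to compare normalized names instead. The sketch below is one way to do that with the standard library; the normalization rules (dropping accents, punctuation and case) and the example spellings are assumptions, not the differences actually found between the two sources.

```python
import re
import unicodedata


def normalize_name(name):
    """Reduce a player name to a comparison key (lowercase ASCII letters and spaces)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    letters_only = re.sub(r"[^a-z ]", "", no_accents.lower())
    return " ".join(letters_only.split())


def match_names(stat_names, image_names):
    """Map each stats name to the image name with the same key, if there is one."""
    by_key = {normalize_name(n): n for n in image_names}
    matches = {}
    for name in stat_names:
        key = normalize_name(name)
        if key in by_key:
            matches[name] = by_key[key]
    return matches


# illustrative spellings only
stat_names = ["Alex Abrines", "J.J. Barea", "Nobody Here"]
image_names = ["Álex Abrines", "JJ Barea"]
print(match_names(stat_names, image_names))
```

Names without a match, such as the third one here, would still need a placeholder image as in the cell above.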

## Final Part: Creating an interface and displaying the results

Finally, bokeh is used together with ipywidgets to create an interactive display in this ipynb.
I would not recommend using bokeh for this purpose, but I still wanted to learn how to create an interface, and so I did.

In [22]:
from bokeh.io import output_file,show,output_notebook,push_notebook
from bokeh.plotting import figure
from bokeh.models import Label,ColumnDataSource
from bokeh.models.glyphs import ImageURL,ImageRGBA
from ipywidgets import interactive_output,Dropdown,HBox,VBox,Layout,Box,Button


output_notebook()
test="abc"
oldRes = '0'

# create a plot that will contain everything
plot = figure(title="",x_range=(0,750),y_range=(0,500),plot_height=500,plot_width=750,toolbar_location=None)

# plot team backgrounds
plot.quad(top=[500, 500], bottom=[0, 0], left=[0, plot.plot_width/2],
       right=[plot.plot_width/2-150,600], color=["#EF5350","#64B5F6"])

# create labels for player names
labelPlayer1 = Label(x=75, y=plot.plot_height/5*4, text=test, text_font_size='15pt', text_color='#000000')
labelPlayer2 = Label(x=75, y=plot.plot_height/5*3, text=test, text_font_size='15pt', text_color='#000000')
labelPlayer3 = Label(x=75, y=plot.plot_height/5*2, text=test, text_font_size='15pt', text_color='#000000')
labelPlayer4 = Label(x=75, y=plot.plot_height/5*1, text=test, text_font_size='15pt', text_color='#000000')
labelPlayer5 = Label(x=75, y=plot.plot_height/5*0, text=test, text_font_size='15pt', text_color='#000000')

label2Player1 = Label(x=plot.plot_width/2+75, y=plot.plot_height/5*4, text=test, text_font_size='15pt', text_color='#000000')
label2Player2 = Label(x=plot.plot_width/2+75, y=plot.plot_height/5*3, text=test, text_font_size='15pt', text_color='#000000')
label2Player3 = Label(x=plot.plot_width/2+75, y=plot.plot_height/5*2, text=test, text_font_size='15pt', text_color='#000000')
label2Player4 = Label(x=plot.plot_width/2+75, y=plot.plot_height/5*1, text=test, text_font_size='15pt', text_color='#000000')
label2Player5 = Label(x=plot.plot_width/2+75, y=plot.plot_height/5*0, text=test, text_font_size='15pt', text_color='#000000')

# create additional texts
vs = Label(x=290,y=250,text='vs', text_font_size='15pt', text_color='#000000')
result =Label(x=250,y=220,text='result', text_font_size='15pt', text_color='#000000')
resultNumber = Label(x=250,y=190,text='0',text_font_size='15pt', text_color='#000000')

# add them to the plot
for i in [labelPlayer1,labelPlayer2,labelPlayer3,labelPlayer4,labelPlayer5,
          label2Player1,label2Player2,label2Player3,label2Player4,label2Player5,
          vs,result,resultNumber]:
    plot.add_layout(i)

# remove unnecessary lines
plot.axis.visible=False
plot.grid.visible=False


# create individual sources for all images
# this could be done better if I properly understood how to make use of sources
source1 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))
source2 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))
source3 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))
source4 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))
source5 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))
source21 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))
source22 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))
source23 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))
source24 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))
source25 = ColumnDataSource(dict(image=[imgDict['Alex Abrines']]))

# define images for all players
imgPlayer1 = ImageRGBA(image='image',x=0, y=plot.plot_height/5*4, dw=75,dh=100)
imgPlayer2 = ImageRGBA(image='image',x=0, y=plot.plot_height/5*3, dw=75,dh=100)
imgPlayer3 = ImageRGBA(image='image',x=0, y=plot.plot_height/5*2, dw=75,dh=100)
imgPlayer4 = ImageRGBA(image='image',x=0, y=plot.plot_height/5*1, dw=75,dh=100)
imgPlayer5 = ImageRGBA(image='image',x=0, y=plot.plot_height/5*0, dw=75,dh=100)

img2Player1 = ImageRGBA(image='image',x=plot.plot_width/2, y=plot.plot_height/5*4, dw=75,dh=100)
img2Player2 = ImageRGBA(image='image',x=plot.plot_width/2, y=plot.plot_height/5*3, dw=75,dh=100)
img2Player3 = ImageRGBA(image='image',x=plot.plot_width/2, y=plot.plot_height/5*2, dw=75,dh=100)
img2Player4 = ImageRGBA(image='image',x=plot.plot_width/2, y=plot.plot_height/5*1, dw=75,dh=100)
img2Player5 = ImageRGBA(image='image',x=plot.plot_width/2, y=plot.plot_height/5*0, dw=75,dh=100)


# add all images
plot.add_glyph(source1,imgPlayer1)
plot.add_glyph(source2,imgPlayer2)
plot.add_glyph(source3,imgPlayer3)
plot.add_glyph(source4,imgPlayer4)
plot.add_glyph(source5,imgPlayer5)
plot.add_glyph(source21,img2Player1)
plot.add_glyph(source22,img2Player2)
plot.add_glyph(source23,img2Player3)
plot.add_glyph(source24,img2Player4)
plot.add_glyph(source25,img2Player5)

# this function will update players based on the drop down menu selection
# it would be smart to check for every player if anything has changed since the last selection
# such that the function would not have to reload all players every time a change occurs.
def update(p1,p2,p3,p4,p5,p21,p22,p23,p24,p25):    
    labelPlayer1.text = p1
    source1.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p1]['DISPLAY_FIRST_LAST'].item()]])
    labelPlayer2.text = p2
    source2.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p2]['DISPLAY_FIRST_LAST'].item()]])
    labelPlayer3.text = p3
    source3.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p3]['DISPLAY_FIRST_LAST'].item()]])
    labelPlayer4.text = p4
    source4.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p4]['DISPLAY_FIRST_LAST'].item()]])
    labelPlayer5.text = p5
    source5.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p5]['DISPLAY_FIRST_LAST'].item()]])

    label2Player1.text = p21
    source21.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p21]['DISPLAY_FIRST_LAST'].item()]])
    label2Player2.text = p22
    source22.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p22]['DISPLAY_FIRST_LAST'].item()]])
    label2Player3.text = p23
    source23.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p23]['DISPLAY_FIRST_LAST'].item()]])
    label2Player4.text = p24
    source24.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p24]['DISPLAY_FIRST_LAST'].item()]])
    label2Player5.text = p25
    source25.data = dict(image = [imgDict[players.loc[players.DISPLAY_LAST_COMMA_FIRST==p25]['DISPLAY_FIRST_LAST'].item()]])
    push_notebook()

# this function calculates a score and finally uses the previously defined functions.
# It outputs a score based on the selected players
def update2(x):
    print("starting...")
    # if nothing changes, bokeh will throw an error, therefore we need to check if anything changed since last calculation
    global oldRes
    # parse input
    names = [i.text for i in [labelPlayer1,labelPlayer2,labelPlayer3,labelPlayer4,labelPlayer5,label2Player1,label2Player2,label2Player3,label2Player4,label2Player5]]
    input1 = [players.loc[players.DISPLAY_LAST_COMMA_FIRST==i]['PERSON_ID'].item() for i in names[0:5]]    
    input2 = [players.loc[players.DISPLAY_LAST_COMMA_FIRST==i]['PERSON_ID'].item() for i in names[5:10]]
    # create score
    dic = scoreDict(input1,input2,players,indMetrics)
    score1 = oppScore(input1,input2,dic)
    score2 = oppScore(input2,input1,dic)
    indRes,vsRes = calculateResults(dic,score1,score2,input1,input2)
    res = (indRes+vsRes)/2
    # print results depending on score
    if(vsRes == 0):
        resultNumber.text = str(indRes)
        if (indRes>1):
            result.text = 'RED WINS!'
            result.text_color = '#EF5350'
        if (indRes<1):
            result.text = 'BLUE WINS!'
            result.text_color = '#64B5F6'
        if(indRes==1):
            result.text = 'DRAW'
    else:
        resultNumber.text = str(res)
        if (res>1):
            result.text = 'RED WINS!'
            result.text_color = '#EF5350'
        if (res<1):
            result.text = 'BLUE WINS!'
            result.text_color = '#64B5F6'
        if(res==1):
            result.text = 'DRAW'
    if(oldRes == resultNumber.text):
        return
    else:
        oldRes = resultNumber.text
        push_notebook()
        
show(plot, notebook_handle=True)

# create dropdown selection menus
sel1 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 1')
sel2 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 2')
sel3 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 3')
sel4 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 4')
sel5 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 5')

sel21 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 1')
sel22 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 2')
sel23 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 3')
sel24 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 4')
sel25 = Dropdown(options=list(players.DISPLAY_LAST_COMMA_FIRST),description='P 5')

# create button to calculate results
button = Button(description="calculate result",disabled=False,button_style='success')

# lay out the selection menus
box1 = VBox([sel1,sel2,sel3,sel4,sel5])
box2 = VBox([sel21,sel22,sel23,sel24,sel25])

ui = HBox([box1,box2])
out = interactive_output(update,{'p1':sel1,'p2':sel2,'p3':sel3,'p4':sel4,'p5':sel5,'p21':sel21,'p22':sel22,'p23':sel23,'p24':sel24,'p25':sel25})
display(ui,out)
display(button)
button.on_click(update2)

# Feel free to play around with the result
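
The comment above `update` suggests reloading only the players whose selection changed. A minimal sketch of that check is shown below; `changed_slots` and the example selections are hypothetical, and wiring it into `update` would additionally require remembering the previous selection between calls.

```python
def changed_slots(previous, current):
    """Return the indices whose selected player differs from the last call."""
    return [i for i, (old, new) in enumerate(zip(previous, current)) if old != new]


# made-up selections for the five slots of one team, before and after a change
before = ["Abrines, Alex", "One, Player", "Two, Player", "Three, Player", "Four, Player"]
after = ["Abrines, Alex", "One, Player", "Five, Player", "Three, Player", "Four, Player"]

print(changed_slots(before, after))
```

Only the slots returned by this function would then need a new label text and image source.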
