## NBA Statistics Friday Project

##### This project's purpose is to see whether statistical methods can identify the best players for a few NBA teams based on each team's offensive style. This could later evolve into identifying "diamond in the rough" players.

##### Teams for consideration: San Antonio Spurs, Houston Rockets, Golden State Warriors

In [1]:
# !pip install scikit-learn

In [2]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

### Dataset Selection

In [3]:
import pandas as pd

playergamedata = pd.read_csv('/Users/dereklee/ml-projects/nba_player_selection/201819_nbaplayergamedata.csv')

playergamedata.columns.values

array(['Rk', 'Player', 'Player ID', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG',
       'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT',
       'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
       'PF', 'PTS'], dtype=object)

In [None]:
playergamedata
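
A minimal sketch of narrowing the player table to the three teams under consideration. It assumes the `Tm` column holds Basketball-Reference style abbreviations (`SAS`, `HOU`, `GSW`); the helper name `select_teams` and the demo rows are hypothetical and only illustrate the filter.

```python
import pandas as pd

# Assumed team abbreviations for Spurs, Rockets, and Warriors in the 'Tm' column.
TEAMS_OF_INTEREST = ["SAS", "HOU", "GSW"]


def select_teams(player_df: pd.DataFrame, teams=TEAMS_OF_INTEREST) -> pd.DataFrame:
    """Return only the rows whose 'Tm' value is one of `teams`."""
    mask = player_df["Tm"].isin(teams)
    return player_df.loc[mask].reset_index(drop=True)


# Hypothetical rows for illustration only; not taken from the real CSV.
demo = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Tm": ["SAS", "BOS", "GSW", "HOU"],
    "PTS": [10, 12, 8, 15],
})

selected = select_teams(demo)
print(selected)
```

With the real data this would be called as `select_teams(playergamedata)`.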

### Web Scraping method of dataset #2 (Spurs season by season data)

In [4]:
import requests
import json
import pickle

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.basketball-reference.com/teams/SAS/stats_basic_totals.html"

html = urlopen(url)

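# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')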


In [5]:
# gets column headers
soup.findAll('tr', limit=2)

[<tr>
 <th aria-label="If listed as single number, the year the season ended.★ - Indicates All-Star for league.Only on regular season tables." class="poptip sort_default_asc center" data-stat="season" data-tip="If listed as single number, the year the season ended.&lt;br&gt;★ - Indicates All-Star for league.&lt;br&gt;Only on regular season tables." scope="col">Season</th>
 <th aria-label="League" class="poptip sort_default_asc left" data-stat="lg_id" data-tip="League" scope="col">Lg</th>
 <th aria-label="Team" class="poptip sort_default_asc left" data-stat="team_id" data-tip="Team" scope="col">Tm</th>
 <th aria-label="Wins" class="poptip right" data-stat="wins" data-tip="Wins" scope="col">W</th>
 <th aria-label="Losses" class="poptip right" data-stat="losses" data-tip="Losses" scope="col">L</th>
 <th aria-label="Regular season finish (within division, if applicable)" class="poptip sort_default_asc right" data-stat="rank_team" data-tip="Regular season finish (within division, if applica

In [6]:
# extract text we need into a list
headers = [th.getText() for th in soup.findAll('tr', limit = 2)[0].findAll('th')]

In [7]:
headers = headers[0:]
len(headers)

34

In [8]:
# use [1:] to exclude the first header row (2 to exclude 1st 2 rows)

rows = soup.findAll('tr')[1:]

In [9]:
# one list of cell texts per table row (both <td> and <th> cells)
season_stats = [[cell.getText() for cell in row.findAll(['td', 'th'])]
                for row in rows]

In [10]:
pd.set_option('display.max_columns', 50)

spurs_season_data = pd.DataFrame(season_stats, columns = headers)
spurs_season_data.head(5)

Unnamed: 0,Season,Lg,Tm,W,L,Finish,Unnamed: 7,Age,Ht.,Wt.,Unnamed: 11,G,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,2019-20,NBA,SAS,24,31,4,,28.1,6-6,212,,55,13350,2318,4923,0.471,573,1551,0.369,1745,3372,0.517,1017,1263,0.805,501,1981,2482,1346,383,301,679,1044,6226
1,2018-19,NBA,SAS,48,34,2,,28.8,6-6,218,,82,19805,3468,7248,0.478,812,2071,0.392,2656,5177,0.513,1408,1720,0.819,757,2910,3667,2013,501,386,992,1487,9156
2,2017-18,NBA,SAS,47,35,3,,29.3,6-6,214,,82,19730,3202,6999,0.457,696,1977,0.352,2506,5022,0.499,1324,1715,0.772,849,2777,3626,1868,628,460,1078,1408,8424
3,2016-17,NBA,SAS,61,21,1,,29.6,6-7,222,,82,19805,3222,6864,0.469,753,1927,0.391,2469,4937,0.5,1440,1806,0.797,821,2777,3598,1954,655,484,1101,1498,8637
4,2015-16,NBA,SAS,67,15,1,,30.3,6-7,223,,82,19705,3289,6797,0.484,570,1518,0.375,2719,5279,0.515,1342,1672,0.803,770,2831,3601,2010,677,485,1071,1433,8490


In [11]:
# eliminate repeated header rows. 'Finish' should hold a number, but every value
# is currently a string, so just drop any row whose 'Finish' value is the text "Finish".

spurs_season_data = spurs_season_data[spurs_season_data['Finish'] != 'Finish']

# Clean up seasons that are not comparable with the current NBA format
# (lockout-shortened seasons, rule changes).
# Drop the 1994-95 through 1996-97 seasons: shortened 3-point line
# (https://www.deseret.com/2019/2/21/20666425/nba-rules-have-adapted-over-the-years-to-make-the-game-more-fun-for-players-fans)

short_line_seasons = ["1994-95", "1995-96", "1996-97"]
# .copy() so the column assignments below do not write into a slice of the original frame
spurs_season_data = spurs_season_data[~spurs_season_data['Season'].isin(short_line_seasons)].copy()
# Conversion to numeric

cols = ['W', 'L', 'Finish', 'Age', 'Wt.',
       'G', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%',
       'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
       'PF', 'PTS']

for col in cols:
    spurs_season_data[col] = pd.to_numeric(spurs_season_data[col])

In [13]:
# Get a bool series marking which rows satisfy the condition, i.e. True for
# rows in which the value of the 'Finish' column is 1
seriesObj = spurs_season_data['Finish'] == 1
 
# Count the number of True values in the series
numOfRows = seriesObj.sum()
 
print('Number of Rows in dataframe in which Spurs finished 1st in regular season: ', numOfRows)

Number of Rows in dataframe in which Spurs finished 1st in regular season:  20


At this point, we've successfully completed the initial clean of our dataset.
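
The cleaning steps above could be collected into one reusable helper, sketched here. The function name `clean_season_table` and its parameters are hypothetical, and `errors="coerce"` is a choice made for this sketch (unparseable text becomes NaN) rather than what the cells above do.

```python
import pandas as pd


def clean_season_table(raw: pd.DataFrame, numeric_cols, skip_seasons=()) -> pd.DataFrame:
    """Drop repeated header rows and unwanted seasons, then convert columns to numbers."""
    df = raw[raw["Finish"] != "Finish"]              # repeated header rows
    df = df[~df["Season"].isin(list(skip_seasons))]  # e.g. shortened 3-point-line seasons
    df = df.copy()                                   # avoid writing into a slice of `raw`
    for col in numeric_cols:
        # errors="coerce" turns unparseable text into NaN (assumption of this sketch)
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.reset_index(drop=True)


# Two rows taken from the scraped output shown earlier, plus one repeated header row.
raw = pd.DataFrame(
    [["2019-20", "24", "31", "4"],
     ["Season", "W", "L", "Finish"],
     ["2018-19", "48", "34", "2"]],
    columns=["Season", "W", "L", "Finish"],
)

cleaned = clean_season_table(raw, numeric_cols=["W", "L", "Finish"])
print(cleaned.dtypes)
```

Keeping the steps in one function makes it easier to reuse the same cleanup for the Rockets and Warriors tables later.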

### Research:

##### Based on the game data, we get some stats on scoring, scoring percentages, and some defensive metrics (defensive rebounds, steals, blocks).

Glossary: https://www.basketball-reference.com/about/glossary.html#site_menu_link

Found a few articles about some offensive strategies:
https://www.goldenstateofmind.com/2018/3/13/17108248/nba-2018-golden-state-warriors-visualizing-offense-the-top-five-offenses-statistics-houston-rockets (Talks about top 5 offenses of 2017-18 season and how they're all different)

https://www.youtube.com/watch?v=bRxMdABA1qQ (YouTube video of Spurs offensive strategy)

Since my favorite team is the San Antonio Spurs, we can first try to understand their offensive strategy. Historically, they've looked for selfless players and hinged their offense on constant ball movement to collapse defenses and get the ball to the player with the best shot opportunity. It looks like we'll have to find another dataset with the number of passes per game to append to this one. The Spurs offense relies especially on a Big (Center or Power Forward) who passes well: a Big who can pass from the perimeter draws their defender out of the 'paint.'
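
Until a pass-count dataset is available, assists per 36 minutes could serve as a rough stand-in for passing ability. This is only a sketch: it assumes `AST` and `MP` are on the same basis (both season totals or both per-game values), the helper name and `min_minutes` cutoff are hypothetical, and the demo rows are made up. The player table has no position column, so picking out Bigs would need extra data.

```python
import pandas as pd


def assists_per_36(player_df: pd.DataFrame, min_minutes: float = 0) -> pd.DataFrame:
    """Add an AST_per_36 column and sort players from best to worst passer."""
    # Drop low-minute players, whose rates are noisy
    df = player_df[player_df["MP"] > min_minutes].copy()
    df["AST_per_36"] = df["AST"] / df["MP"] * 36
    return df.sort_values("AST_per_36", ascending=False).reset_index(drop=True)


# Hypothetical players for illustration only.
demo = pd.DataFrame({
    "Player": ["Big A", "Big B", "Guard C"],
    "MP": [1800, 2400, 360],
    "AST": [200, 120, 10],
})

ranked = assists_per_36(demo, min_minutes=500)
print(ranked[["Player", "AST_per_36"]])
```

Assists only count passes that lead directly to a made basket, so this undercounts the kind of ball movement described above.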

Article about current Spurs offense: https://www.nba.com/article/2018/09/29/one-team-one-stat-san-antonio-spurs-shooting

Video on current Spurs offense: https://www.youtube.com/watch?v=QHhva5XY_9c

It seems like the Spurs still stick to their core principles: 1) passing a lot and 2) spacing the offense well
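
Both principles can be approximated from the team totals already scraped. In this sketch, assists per made field goal stands in for passing and the share of shot attempts taken from three stands in for spacing. Both proxies are my assumption rather than something the articles define, and the column names `AST_per_FG` and `3PA_rate` are hypothetical. The two rows reuse figures from the season table shown earlier.

```python
import pandas as pd


def add_style_proxies(season_df: pd.DataFrame) -> pd.DataFrame:
    """Add rough passing and spacing proxies computed from team season totals."""
    df = season_df.copy()
    df["AST_per_FG"] = df["AST"] / df["FG"]   # share of made shots that were assisted
    df["3PA_rate"] = df["3PA"] / df["FGA"]    # share of shot attempts taken from three
    return df


# Totals for two seasons from the scraped Spurs table.
seasons = pd.DataFrame({
    "Season": ["2018-19", "2015-16"],
    "FG": [3468, 3289],
    "FGA": [7248, 6797],
    "3PA": [2071, 1518],
    "AST": [2013, 2010],
})

styled = add_style_proxies(seasons)
print(styled[["Season", "AST_per_FG", "3PA_rate"]].round(3))
```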

For the purposes of this project, I'll be focusing on offensive improvements.

The articles above identify some issues; I'll also look for some of my own.

Home/Road splits: perhaps it's up to the Spurs to track how well their players sleep under the rigors of travel.

#### Research Question 1: What have been the best Spurs teams over the years from our dataset?

Explore the data to identify where the current Spurs offense lags behind previous years.  
This will need team-level data over the years.

**Sub-research Question 1: What defines a best team?**  
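Let's do a couple of things. First, the Spurs have won 5 championships (1998-99, 2002-03, 2004-05, 2006-07, 2013-14). Can we look at what made those championship teams great?

In [37]:
# create a table with only data from the championship teams
# TODO: the filter below uses '2005-06', but the title season was '2006-07';
# swap the season and re-run so the table matches the list above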
spurs_champions = spurs_season_data[(spurs_season_data['Season'] == '2013-14') | 
                                    (spurs_season_data['Season'] == '2005-06')| 
                                    (spurs_season_data['Season'] == '2004-05')| 
                                    (spurs_season_data['Season'] == '2002-03')| 
                                    (spurs_season_data['Season'] == '1998-99')]
spurs_champions = spurs_champions.reset_index().drop(['index'], axis=1)
spurs_champions

Unnamed: 0,Season,Lg,Tm,W,L,Finish,Unnamed: 7,Age,Ht.,Wt.,Unnamed: 11,G,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,2013-14,NBA,SAS,62,20,1,,28.9,6-6,213,,82,19755,3326,6844,0.486,698.0,1757.0,0.397,2628,5087,0.517,1289,1642,0.785,762.0,2786.0,3548,2064,604.0,420.0,1180,1495,8639
1,2005-06,NBA,SAS,63,19,1,,29.8,6-7,210,,82,19805,2993,6342,0.472,524.0,1362.0,0.385,2469,4980,0.496,1327,1891,0.702,851.0,2548.0,3399,1717,543.0,467.0,1126,1714,7837
2,2004-05,NBA,SAS,59,23,1,,28.5,6-6,212,,82,19805,2923,6450,0.453,507.0,1395.0,0.363,2416,5055,0.478,1535,2120,0.724,987.0,2489.0,3476,1771,613.0,543.0,1126,1717,7888
3,2002-03,NBA,SAS,60,22,1,,28.4,6-7,213,,82,19830,2908,6297,0.462,449.0,1270.0,0.354,2459,5027,0.489,1591,2194,0.725,939.0,2556.0,3495,1636,629.0,529.0,1295,1672,7856
4,1998-99,NBA,SAS,37,13,1,,30.1,6-6,213,,50,12075,1740,3812,0.456,172.0,521.0,0.33,1568,3291,0.476,988,1415,0.698,614.0,1584.0,2198,1101,421.0,351.0,759,1010,4640
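
The 1998-99 row covers only 50 games, so its raw totals are not comparable with the 82-game seasons. A small sketch of putting seasons on a per-game footing before comparing them follows; the helper name `add_per_game` and the new column names are hypothetical, and the two rows reuse figures from the table above.

```python
import pandas as pd


def add_per_game(df: pd.DataFrame, total_cols) -> pd.DataFrame:
    """Add win percentage and per-game versions of the given total columns."""
    out = df.copy()
    out["W%"] = out["W"] / (out["W"] + out["L"])
    for col in total_cols:
        out[col + "_per_G"] = out[col] / out["G"]
    return out


# Two championship seasons from the table above.
champs = pd.DataFrame({
    "Season": ["2013-14", "1998-99"],
    "W": [62, 37],
    "L": [20, 13],
    "G": [82, 50],
    "PTS": [8639, 4640],
    "AST": [2064, 1101],
})

per_game = add_per_game(champs, total_cols=["PTS", "AST"])
print(per_game[["Season", "W%", "PTS_per_G", "AST_per_G"]].round(3))
```

With the full table this would be `add_per_game(spurs_champions, [...])` using whichever total columns are of interest.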


In [None]:
# can we find patterns with small sample size?



## Player Data Analysis??

Here, we use a different dataset.

In [116]:
playergamedata.columns.values

array(['Rk', 'Player', 'Player ID', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG',
       'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT',
       'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
       'PF', 'PTS'], dtype=object)

Based on the data analysis, it seems like the Spurs are 

In [None]:
# Position-by-position analysis based on offensive holes


### Classification
It may be useful to classify teams by how their points were distributed across positions in the championship years.
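
Neither dataset has a position column yet, so a by-position breakdown is not possible with the data loaded so far. As a first step, the sketch below computes a different point distribution: the share of points from threes, twos, and free throws. The helper name `scoring_mix` and the `share_*` columns are hypothetical, and the two rows reuse figures from the championship table.

```python
import pandas as pd


def scoring_mix(df: pd.DataFrame) -> pd.DataFrame:
    """Return the share of total points coming from 3P, 2P, and FT makes."""
    out = pd.DataFrame({"Season": df["Season"]})
    out["share_3P"] = 3 * df["3P"] / df["PTS"]
    out["share_2P"] = 2 * df["2P"] / df["PTS"]
    out["share_FT"] = df["FT"] / df["PTS"]
    return out


# Two championship seasons from the table in the previous section.
champs = pd.DataFrame({
    "Season": ["2013-14", "2002-03"],
    "3P": [698, 449],
    "2P": [2628, 2459],
    "FT": [1289, 1591],
    "PTS": [8639, 7856],
})

mix = scoring_mix(champs)
print(mix.round(3))
```

These shares could become features for the classifiers imported at the top of the notebook once a target label is defined.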