Lambda School Data Science

*Unit 2, Sprint 3, Module 4*

---


# Model Interpretation 2

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to iterate on your project: data cleaning, exploratory visualization, feature engineering, modeling.
- [ ] Make a Shapley force plot to explain at least 1 individual prediction.
- [ ] Share at least 1 visualization (of any type) on Slack.

If you aren't ready to make a Shapley force plot with your own dataset today, that's okay. You can practice this objective with any dataset you've worked with previously.
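
It can help to see what a single-prediction force plot summarizes: a base value plus one contribution per feature, adding up to the model's prediction. The sketch below computes exact Shapley values by brute force for a tiny made-up model, using a fixed baseline row to stand in for "missing" features. That is a simplification for illustration only; the SHAP library uses faster algorithms and background data. The model, feature names, and numbers are all hypothetical.

```python
from itertools import permutations

def toy_model(x):
    # Hypothetical model with an interaction between 'pts' and 'mp'.
    return 2.0 * x['age'] + x['pts'] * x['mp']

def exact_shapley(model, row, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    features = list(row)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        current = dict(baseline)           # start with every feature at its baseline
        previous = model(current)
        for f in ordering:
            current[f] = row[f]            # switch feature f to its actual value
            value = model(current)
            totals[f] += value - previous  # marginal contribution of f in this ordering
            previous = value
    return {f: total / len(orderings) for f, total in totals.items()}

row = {'age': 3.0, 'pts': 2.0, 'mp': 4.0}
baseline = {'age': 1.0, 'pts': 1.0, 'mp': 1.0}

base_value = toy_model(baseline)
shap_values = exact_shapley(toy_model, row, baseline)
prediction = toy_model(row)

print(base_value, shap_values, prediction)
```

The base value plus the sum of the Shapley values equals the prediction for this row, which is the decomposition a force plot draws.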

## Stretch Goals
- [ ] Make Shapley force plots to explain at least 4 individual predictions.
    - If your project is Binary Classification, you can do a True Positive, True Negative, False Positive, False Negative.
    - If your project is Regression, you can do a high prediction with low error, a low prediction with low error, a high prediction with high error, and a low prediction with high error.
- [ ] Use Shapley values to display verbal explanations of individual predictions.
- [ ] Use the SHAP library for other visualization types.

The [SHAP repo](https://github.com/slundberg/shap) has examples for many visualization types, including:

- Force Plot, individual predictions
- Force Plot, multiple predictions
- Dependence Plot
- Summary Plot
- Summary Plot, Bar
- Interaction Values
- Decision Plots

We just did the first type during the lesson. The [Kaggle microcourse](https://www.kaggle.com/dansbecker/advanced-uses-of-shap-values) shows two more. Experiment and see what you can learn!
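
For the verbal-explanation stretch goal, one possible approach is to rank a single row's Shapley values by absolute size and turn the largest few into sentences. The sketch below assumes you already have one row of Shapley values (for example from `shap.TreeExplainer(model).shap_values(row)`). The function name `explain_in_words`, the feature names, and the numbers are made up for illustration.

```python
import pandas as pd

def explain_in_words(shap_row, feature_values, base_value, top_n=3):
    """Describe the largest Shapley contributions for one prediction.

    shap_row and feature_values are pandas Series indexed by feature name.
    """
    prediction = base_value + shap_row.sum()
    lines = [f'Predicted {prediction:.2f} (average prediction: {base_value:.2f}).']
    # Rank features by the absolute size of their contribution.
    order = shap_row.abs().sort_values(ascending=False).index[:top_n]
    for feature in order:
        direction = 'raised' if shap_row[feature] > 0 else 'lowered'
        lines.append(
            f'{feature} = {feature_values[feature]} {direction} the prediction '
            f'by {abs(shap_row[feature]):.2f}.'
        )
    return lines

# Made-up numbers, for illustration only.
shap_row = pd.Series({'Age': -0.30, 'PTS': 0.50, 'MP': 0.10})
feature_values = pd.Series({'Age': 34, 'PTS': 21.4, 'MP': 33.0})

explanation = explain_in_words(shap_row, feature_values, base_value=1.00, top_n=2)
for line in explanation:
    print(line)
```

For classifiers, Shapley values from tree models are often in log-odds units rather than probabilities, so the wording may need adjusting.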


## Links
- [Kaggle / Dan Becker: Machine Learning Explainability — SHAP Values](https://www.kaggle.com/learn/machine-learning-explainability)
- [Christoph Molnar: Interpretable Machine Learning — Shapley Values](https://christophm.github.io/interpretable-ml-book/shapley.html)
- [SHAP repo](https://github.com/slundberg/shap) & [docs](https://shap.readthedocs.io/en/latest/)

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*
    !pip install eli5
    !pip install pdpbox
    !pip install shap

# If you're working locally:
else:
    DATA_PATH = '../data/'

### Continue to iterate on your project: data cleaning, exploratory visualization, feature engineering, modeling.

In [32]:
# Let's try to get a Coaches column.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

train = pd.read_csv('train-03.csv')
val = pd.read_csv('val-03.csv')

In [33]:
# I'll start by scraping a single season's page.
# This is the website I'm using as a reference: https://lfbueno.com/2019-02-19-scrape-bb/

import requests
from bs4 import BeautifulSoup

In [34]:
stats_page = requests.get('https://www.basketball-reference.com/leagues/NBA_2018_coaches.html')
content = stats_page.content

In [35]:
soup = BeautifulSoup(content, 'html.parser')
table = soup.find(name='table', attrs={'id':'NBA_coaches'})

In [36]:
table_str = str(table)
df = pd.read_html(table_str)[0]

df.head()

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Seasons,Seasons,Unnamed: 5_level_0,Regular Season,Regular Season,Regular Season,Regular Season,...,Unnamed: 16_level_0,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs
Unnamed: 0_level_1,Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,w/ Franch,Overall,Unnamed: 5_level_1,Current Season,Current Season,Current Season,w/ Franchise,...,Unnamed: 16_level_1,Current Season,Current Season,Current Season,w/ Franchise,w/ Franchise,w/ Franchise,Career,Career,Career
Unnamed: 0_level_2,Coach,Tm,Unnamed: 2_level_2,#,#,Unnamed: 5_level_2,G,W,L,G,...,Unnamed: 16_level_2,G,W,L,G,W,L,G,W,L
0,Mike Budenholzer,ATL,,5,5,,82,24,58,410,...,,,,,39.0,17.0,22.0,39.0,17.0,22.0
1,Brad Stevens,BOS,,5,5,,82,55,27,410,...,,19.0,11.0,8.0,47.0,22.0,25.0,47.0,22.0,25.0
2,Kenny Atkinson,BRK,,2,2,,82,28,54,164,...,,,,,,,,,,
3,Fred Hoiberg,CHI,,3,3,,82,27,55,246,...,,,,,6.0,2.0,4.0,6.0,2.0,4.0
4,Steve Clifford,CHO,,5,5,,82,36,46,410,...,,,,,11.0,3.0,8.0,11.0,3.0,8.0


In [37]:
df.shape

(33, 26)

In [38]:
df

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Seasons,Seasons,Unnamed: 5_level_0,Regular Season,Regular Season,Regular Season,Regular Season,...,Unnamed: 16_level_0,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs,Playoffs
Unnamed: 0_level_1,Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,w/ Franch,Overall,Unnamed: 5_level_1,Current Season,Current Season,Current Season,w/ Franchise,...,Unnamed: 16_level_1,Current Season,Current Season,Current Season,w/ Franchise,w/ Franchise,w/ Franchise,Career,Career,Career
Unnamed: 0_level_2,Coach,Tm,Unnamed: 2_level_2,#,#,Unnamed: 5_level_2,G,W,L,G,...,Unnamed: 16_level_2,G,W,L,G,W,L,G,W,L
0,Mike Budenholzer,ATL,,5,5,,82,24,58,410,...,,,,,39.0,17.0,22.0,39.0,17.0,22.0
1,Brad Stevens,BOS,,5,5,,82,55,27,410,...,,19.0,11.0,8.0,47.0,22.0,25.0,47.0,22.0,25.0
2,Kenny Atkinson,BRK,,2,2,,82,28,54,164,...,,,,,,,,,,
3,Fred Hoiberg,CHI,,3,3,,82,27,55,246,...,,,,,6.0,2.0,4.0,6.0,2.0,4.0
4,Steve Clifford,CHO,,5,5,,82,36,46,410,...,,,,,11.0,3.0,8.0,11.0,3.0,8.0
5,Tyronn Lue,CLE,,3,3,,82,50,32,205,...,,22.0,12.0,10.0,61.0,41.0,20.0,61.0,41.0,20.0
6,Rick Carlisle,DAL,,10,16,,82,24,58,804,...,,,,,58.0,28.0,30.0,120.0,58.0,62.0
7,Mike Malone,DEN,,3,5,,82,46,36,246,...,,,,,,,,,,
8,Stan Van Gundy,DET,,4,12,,82,39,43,328,...,,,,,4.0,0.0,4.0,91.0,48.0,43.0
9,Steve Kerr,GSW,,4,4,,82,58,24,328,...,,21.0,16.0,5.0,83.0,63.0,20.0,83.0,63.0,20.0


In [39]:
df.columns[0]

('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Coach')

In [40]:
df = df[[df.columns[0], df.columns[1]]]
df.head()

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0
Unnamed: 0_level_1,Unnamed: 0_level_1,Unnamed: 1_level_1
Unnamed: 0_level_2,Coach,Tm
0,Mike Budenholzer,ATL
1,Brad Stevens,BOS
2,Kenny Atkinson,BRK
3,Fred Hoiberg,CHI
4,Steve Clifford,CHO


In [41]:
df.columns

MultiIndex([('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Coach'),
            ('Unnamed: 1_level_0', 'Unnamed: 1_level_1',    'Tm')],
           )

In [42]:
df.columns = ['Coach', 'Tm']
df.head()

Unnamed: 0,Coach,Tm
0,Mike Budenholzer,ATL
1,Brad Stevens,BOS
2,Kenny Atkinson,BRK
3,Fred Hoiberg,CHI
4,Steve Clifford,CHO
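
`pd.read_html` returned three header levels here, which is why each column label is a tuple. As an alternative to indexing by position, the column names can be flattened by keeping only the last header level. The sketch below rebuilds a small stand-in for the scraped table (the two data rows are copied from the output above), so it runs without a network connection.

```python
import pandas as pd

# Small stand-in for the scraped table, with the same three header levels.
columns = pd.MultiIndex.from_tuples([
    ('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Coach'),
    ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'Tm'),
    ('Regular Season', 'Current Season', 'G'),
    ('Regular Season', 'Current Season', 'W'),
])
demo = pd.DataFrame(
    [['Mike Budenholzer', 'ATL', 82, 24],
     ['Brad Stevens', 'BOS', 82, 55]],
    columns=columns,
)

# Keep only the innermost header level as the column name.
flat = demo.copy()
flat.columns = flat.columns.get_level_values(-1)

coaches = flat[['Coach', 'Tm']]
print(coaches)
```

In the full table the innermost labels repeat (`G`, `W`, and `L` appear several times), so selecting those columns by position may still be the simpler option.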


In [43]:
df['Year'] = 2018
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Coach,Tm,Year
0,Mike Budenholzer,ATL,2018
1,Brad Stevens,BOS,2018
2,Kenny Atkinson,BRK,2018
3,Fred Hoiberg,CHI,2018
4,Steve Clifford,CHO,2018
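
The warning above appears because `df` was created by selecting columns from another DataFrame, so pandas cannot tell whether the assignment is meant to modify the original. A common way to avoid it, sketched here on a small stand-in DataFrame, is to take an explicit copy when subsetting.

```python
import pandas as pd

full = pd.DataFrame({
    'Coach': ['Mike Budenholzer', 'Brad Stevens'],
    'Tm': ['ATL', 'BOS'],
    'G': [82, 82],
})

# .copy() makes the subset an independent DataFrame,
# so adding a column does not trigger SettingWithCopyWarning.
subset = full[['Coach', 'Tm']].copy()
subset['Year'] = 2018

print(subset)
print(full.columns.tolist())  # the original DataFrame is unchanged
```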


In [44]:
# Scrape the coaches table for every BAA, ABA, and NBA season.
# np.arange excludes the stop value, so the ABA range covers 1968 through 1975
# and the NBA range covers 1950 through 2019.
leagues = [
    ('BAA', [1947, 1948, 1949]),
    ('ABA', np.arange(1968, 1976, 1)),
    ('NBA', np.arange(1950, 2020, 1)),
]
url = 'https://www.basketball-reference.com/leagues/{}_{}_coaches.html'

frames = []

for lge, years in leagues:
    for year in years:
        page = requests.get(url.format(lge, year))
        soup = BeautifulSoup(page.content, 'html.parser')
        table = soup.find(name='table', attrs={'id': f'{lge}_coaches'})
        df2 = pd.read_html(str(table))[0]
        # Keep the coach, the team, and the games coached in the current season.
        # .copy() avoids SettingWithCopyWarning when adding columns below.
        df2 = df2[[df2.columns[0], df2.columns[1], df2.columns[6]]].copy()
        df2.columns = ['Coach', 'Tm', 'G']
        df2['Year'] = year
        df2['Lge'] = lge
        frames.append(df2)

df = pd.concat(frames)

print(df.shape)
df.head()

(1857, 5)


Unnamed: 0,Coach,Tm,G,Year,Lge
0,John Russell,BOS,60,1947,BAA
1,Harold Olsen,CHS,61,1947,BAA
2,Dutch Dehnert,CLR,37,1947,BAA
3,Roy Clifford,CLR,23,1947,BAA
4,Glenn Curtis,DTF,34,1947,BAA


In [46]:
df = df.sort_values(by=['Year', 'Lge'])
df.head()

Unnamed: 0,Coach,Tm,G,Year,Lge
0,John Russell,BOS,60,1947,BAA
1,Harold Olsen,CHS,61,1947,BAA
2,Dutch Dehnert,CLR,37,1947,BAA
3,Roy Clifford,CLR,23,1947,BAA
4,Glenn Curtis,DTF,34,1947,BAA


In [49]:
df.tail()

Unnamed: 0,Coach,Tm,G,Year,Lge
28,Dave Joerger,SAC,82,2019,NBA
29,Gregg Popovich,SAS,82,2019,NBA
30,Nick Nurse,TOR,82,2019,NBA
31,Quin Snyder,UTA,82,2019,NBA
32,Scott Brooks,WAS,82,2019,NBA


In [52]:
df = df.reset_index(drop=True)
df.tail()

Unnamed: 0,Coach,Tm,G,Year,Lge
1852,Dave Joerger,SAC,82,2019,NBA
1853,Gregg Popovich,SAS,82,2019,NBA
1854,Nick Nurse,TOR,82,2019,NBA
1855,Quin Snyder,UTA,82,2019,NBA
1856,Scott Brooks,WAS,82,2019,NBA


In [58]:
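# Find team-seasons with more than one coach and mark the coach with fewer games for removal.
# Each row i is only compared with rows i-5 through i+4, and i runs from 6 to len(df)-6,
# so pairs within the first six or the last five rows are never compared.
drop_rows = []

for i in np.arange(6, len(df)-5, 1):
    for j in np.arange(i-5, i+5, 1):
        same_team_season = (
            df.loc[i, 'Lge'] == df.loc[j, 'Lge']
            and df.loc[i, 'Year'] == df.loc[j, 'Year']
            and df.loc[i, 'Tm'] == df.loc[j, 'Tm']
        )
        if i != j and same_team_season:
            if df.loc[i, 'G'] > df.loc[j, 'G']:
                drop_rows.append(j)
            else:
                # Also reached when the game counts are tied.
                drop_rows.append(i)

drop_rows = list(set(drop_rows))
drop_rows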

[512,
 513,
 1537,
 1027,
 1543,
 1544,
 1546,
 11,
 12,
 13,
 523,
 525,
 21,
 536,
 1561,
 28,
 29,
 542,
 31,
 1056,
 547,
 1060,
 1571,
 1063,
 41,
 42,
 554,
 559,
 563,
 566,
 55,
 1079,
 569,
 58,
 1590,
 61,
 579,
 1093,
 70,
 71,
 1095,
 75,
 1099,
 1100,
 1612,
 593,
 1106,
 1618,
 596,
 85,
 1108,
 1109,
 600,
 1619,
 1624,
 1115,
 1628,
 1119,
 609,
 100,
 615,
 1640,
 1641,
 619,
 621,
 1645,
 1138,
 1653,
 118,
 630,
 1144,
 1654,
 130,
 131,
 132,
 644,
 645,
 647,
 136,
 646,
 649,
 139,
 1156,
 1161,
 1670,
 145,
 657,
 151,
 1175,
 156,
 668,
 158,
 160,
 1184,
 1185,
 1700,
 1190,
 685,
 1199,
 1715,
 180,
 1155,
 182,
 1207,
 1719,
 185,
 1210,
 1720,
 1212,
 1728,
 1732,
 1733,
 713,
 1738,
 206,
 209,
 1236,
 213,
 1749,
 729,
 1754,
 220,
 732,
 1246,
 1245,
 225,
 226,
 1250,
 1253,
 232,
 748,
 238,
 1264,
 244,
 1269,
 1277,
 768,
 257,
 1282,
 266,
 1290,
 268,
 1292,
 1805,
 783,
 273,
 274,
 275,
 787,
 1809,
 790,
 1816,
 285,
 1312,
 1313,
 1314,
 803,
 1

In [60]:
drop_rows.sort()
drop_rows

[11,
 12,
 13,
 21,
 28,
 29,
 31,
 41,
 42,
 55,
 58,
 61,
 70,
 71,
 75,
 85,
 100,
 118,
 130,
 131,
 132,
 136,
 139,
 145,
 151,
 156,
 158,
 160,
 180,
 182,
 185,
 206,
 209,
 213,
 220,
 225,
 226,
 232,
 238,
 244,
 257,
 266,
 268,
 273,
 274,
 275,
 285,
 297,
 298,
 303,
 307,
 320,
 322,
 326,
 327,
 328,
 329,
 332,
 334,
 335,
 340,
 369,
 376,
 381,
 382,
 392,
 396,
 411,
 415,
 421,
 422,
 426,
 446,
 447,
 461,
 463,
 467,
 480,
 481,
 490,
 496,
 512,
 513,
 523,
 525,
 536,
 542,
 547,
 554,
 559,
 563,
 566,
 569,
 579,
 593,
 596,
 600,
 609,
 615,
 619,
 621,
 630,
 644,
 645,
 646,
 647,
 649,
 657,
 668,
 685,
 713,
 729,
 732,
 748,
 768,
 783,
 787,
 790,
 803,
 809,
 811,
 814,
 818,
 822,
 834,
 835,
 836,
 838,
 848,
 852,
 857,
 861,
 879,
 895,
 903,
 924,
 927,
 928,
 931,
 940,
 942,
 944,
 945,
 954,
 965,
 971,
 975,
 976,
 994,
 995,
 1017,
 1018,
 1021,
 1027,
 1056,
 1060,
 1063,
 1079,
 1093,
 1095,
 1099,
 1100,
 1106,
 1108,
 1109,
 1115,
 111

In [61]:
df3 = df.drop(drop_rows)
df3.shape

(1596, 5)

In [63]:
df3 = df3.drop('G', axis=1)
df4 = df3[df3['Year'] < 2019]
df4.head()

Unnamed: 0,Coach,Tm,Year,Lge
0,John Russell,BOS,1947,BAA
1,Harold Olsen,CHS,1947,BAA
2,Dutch Dehnert,CLR,1947,BAA
3,Roy Clifford,CLR,1947,BAA
4,Glenn Curtis,DTF,1947,BAA


In [65]:
df4.shape

(1566, 4)

In [21]:
print(train.shape)
train.head()

(16705, 34)


Unnamed: 0,Player,Year,Lge,Pos,Age,Tm,G,GS,MP,FG,...,TRB,STL,BLK,TOV,PF,PTS,AST,Target,CAS,Szn
0,A.C. Green,1986.0,NBA,4.0,22.0,LAL,82.0,1.0,18.8,2.5,...,4.6,0.6,0.6,1.2,2.8,6.4,0.7,1.1,0.7,1
1,A.C. Green,1987.0,NBA,4.0,23.0,LAL,79.0,72.0,28.4,4.0,...,7.8,0.9,1.0,1.3,2.2,10.8,1.1,1.1,0.9,2
2,A.C. Green,1988.0,NBA,4.0,24.0,LAL,82.0,64.0,32.1,3.9,...,8.7,1.1,0.5,1.5,2.5,11.4,1.1,1.3,0.966667,3
3,A.C. Green,1989.0,NBA,4.0,25.0,LAL,82.0,82.0,30.6,4.9,...,9.0,1.1,0.7,1.5,2.1,13.3,1.3,1.1,1.05,4
4,A.C. Green,1990.0,NBA,4.0,26.0,LAL,82.0,82.0,33.0,4.7,...,8.7,0.8,0.6,1.4,2.5,12.9,1.1,0.9,1.06,5


In [129]:
# I just realized that I kept the "TOT" values for years when players were traded.
# So I redid all my cleaning/engineering, but changed "TOT" to the team the player played the most games
# for that season. I did this in a separate notebook, which I can send to you if you want.

train = pd.read_csv('train-04.csv')
val = pd.read_csv('val-04.csv')

train.tail()

Unnamed: 0,Player,Year,Lge,Pos,Age,Tm,G,GS,MP,FG,...,TRB,STL,BLK,TOV,PF,PTS,AST,Target,CAS,Szn
16700,Žarko Čabarkapa,2005.0,NBA,PF,23.0,GSW,40.0,0.0,11.9,2.2,...,2.6,0.3,0.1,0.8,1.5,6.0,0.6,0.3,0.7,2
16701,Željko Rebrača,2002.0,NBA,C,29.0,DET,74.0,4.0,15.9,2.6,...,3.9,0.4,1.0,1.1,2.6,6.9,0.5,0.3,0.5,1
16702,Željko Rebrača,2003.0,NBA,C,30.0,DET,30.0,12.0,16.3,2.7,...,3.1,0.2,0.6,1.0,2.6,6.6,0.3,0.3,0.4,2
16703,Željko Rebrača,2004.0,NBA,C,31.0,DET,24.0,2.0,11.4,1.4,...,2.4,0.2,0.5,0.7,2.2,3.8,0.3,0.4,0.366667,3
16704,Željko Rebrača,2005.0,NBA,C,32.0,LAC,58.0,2.0,16.0,2.3,...,3.2,0.2,0.7,0.8,2.2,5.8,0.4,0.3,0.375,4
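
The "TOT" replacement described in the comment above was done in a separate notebook that is not shown here. The sketch below is one way it could be done, assuming the raw data has one `TOT` row with season totals plus one row per team for each traded player. The player names and numbers are made up.

```python
import pandas as pd

# Made-up rows: a traded player has a 'TOT' row plus one row per team.
raw = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player A', 'Player B'],
    'Year':   [2005, 2005, 2005, 2005],
    'Tm':     ['TOT', 'DET', 'LAC', 'GSW'],
    'G':      [70, 12, 58, 40],
})
key = ['Player', 'Year']

# For each player-season, find the team with the most games played.
team_rows = raw[raw['Tm'] != 'TOT']
main_team = (
    team_rows
    .sort_values('G', ascending=False, kind='mergesort')
    .drop_duplicates(subset=key)
    .rename(columns={'Tm': 'MainTm'})[key + ['MainTm']]
)

# Player-seasons with a single row were not traded and are kept as they are.
n_rows = raw.groupby(key)['Tm'].transform('size')
single = raw[n_rows == 1]

# Keep the season totals for traded players, but label them with the main team.
totals = raw[raw['Tm'] == 'TOT'].merge(main_team, on=key, how='left')
totals['Tm'] = totals.pop('MainTm')

cleaned = (
    pd.concat([single, totals])
    .sort_values(key)
    .reset_index(drop=True)
)
print(cleaned)
```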


In [130]:
train['Year'].describe()

count    16705.000000
mean      1989.593834
std         16.999329
min       1947.000000
25%       1977.000000
50%       1992.000000
75%       2004.000000
max       2014.000000
Name: Year, dtype: float64

In [131]:
train.shape

(16705, 34)

In [105]:
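# Second attempt: compare each row i only with the five rows before it (i-5 through i-1),
# and let i run to the end of the DataFrame. Pairs within rows 0-5 are still never compared.
drop_rows2 = []

for i in np.arange(6, len(df), 1):
    for j in np.arange(i-5, i, 1):
        same_team_season = (
            df.loc[i, 'Lge'] == df.loc[j, 'Lge']
            and df.loc[i, 'Year'] == df.loc[j, 'Year']
            and df.loc[i, 'Tm'] == df.loc[j, 'Tm']
        )
        if same_team_season:
            if df.loc[i, 'G'] < df.loc[j, 'G']:
                drop_rows2.append(i)
            else:
                # Also reached when the game counts are tied.
                drop_rows2.append(j)

drop_rows2 = list(set(drop_rows2))
drop_rows2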

[512,
 513,
 1537,
 1027,
 1543,
 1544,
 1546,
 11,
 12,
 13,
 523,
 525,
 21,
 536,
 1561,
 28,
 29,
 542,
 31,
 1056,
 547,
 1060,
 1571,
 1063,
 41,
 42,
 554,
 559,
 563,
 566,
 55,
 1079,
 569,
 58,
 1590,
 61,
 579,
 1093,
 70,
 71,
 1095,
 75,
 1099,
 1100,
 1612,
 593,
 1106,
 1618,
 596,
 85,
 1108,
 1109,
 600,
 1619,
 1624,
 1115,
 1628,
 1119,
 609,
 100,
 615,
 1640,
 1641,
 619,
 621,
 1645,
 1138,
 1653,
 118,
 630,
 1144,
 130,
 131,
 132,
 644,
 645,
 647,
 136,
 646,
 649,
 139,
 1156,
 1161,
 1670,
 145,
 657,
 151,
 1175,
 156,
 668,
 158,
 160,
 1184,
 1185,
 1700,
 1190,
 685,
 1199,
 1715,
 180,
 1155,
 182,
 1207,
 1719,
 185,
 1210,
 1720,
 1212,
 1728,
 1732,
 713,
 1738,
 206,
 209,
 1236,
 213,
 1749,
 729,
 1754,
 220,
 732,
 1246,
 1245,
 225,
 226,
 1250,
 1253,
 232,
 748,
 238,
 1264,
 244,
 1269,
 1277,
 768,
 257,
 1282,
 266,
 1290,
 268,
 1292,
 1805,
 783,
 273,
 274,
 275,
 787,
 1809,
 790,
 1816,
 285,
 1312,
 1313,
 1314,
 803,
 1827,
 1830,
 2

In [106]:
drop_rows2.sort()

In [107]:
df5 = df.drop(drop_rows2)
df5 = df5.reset_index(drop=True)

In [108]:
df5 = df5.drop('G', axis=1)
df5.head()

Unnamed: 0,Coach,Tm,Year,Lge
0,John Russell,BOS,1947,BAA
1,Harold Olsen,CHS,1947,BAA
2,Dutch Dehnert,CLR,1947,BAA
3,Roy Clifford,CLR,1947,BAA
4,Glenn Curtis,DTF,1947,BAA
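
The loop only compares rows that are at most five positions apart and never compares rows 0 through 5 with each other, which is why both CLR coaches from 1947 are still present in this output. A possible alternative, sketched below on a few rows copied from the scraped data, is to sort by games coached and keep the first row per league, season, and team. One difference to be aware of: on ties this keeps whichever tied row came first, which may not match the loop's choice.

```python
import pandas as pd

# A few rows copied from the scraped coaches table.
coaches = pd.DataFrame({
    'Coach': ['John Russell', 'Harold Olsen', 'Dutch Dehnert', 'Roy Clifford', 'Glenn Curtis'],
    'Tm':    ['BOS', 'CHS', 'CLR', 'CLR', 'DTF'],
    'G':     [60, 61, 37, 23, 34],
    'Year':  [1947, 1947, 1947, 1947, 1947],
    'Lge':   ['BAA', 'BAA', 'BAA', 'BAA', 'BAA'],
})

key = ['Lge', 'Year', 'Tm']

deduped = (
    coaches
    .sort_values('G', ascending=False, kind='mergesort')  # most games first
    .drop_duplicates(subset=key, keep='first')            # one coach per team-season
    .sort_values(['Year', 'Lge', 'Tm'])
    .drop(columns='G')
    .reset_index(drop=True)
)

print(deduped)
```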


In [118]:
df6 = df5[df5['Year'] < 2015]
df6.tail()

Unnamed: 0,Coach,Tm,Year,Lge
1564,Dave Joerger,SAC,2018,NBA
1565,Gregg Popovich,SAS,2018,NBA
1566,Dwane Casey,TOR,2018,NBA
1567,Quin Snyder,UTA,2018,NBA
1568,Scott Brooks,WAS,2018,NBA


In [133]:
train.shape

(16705, 34)

In [132]:
train2 = train.merge(df6, how='left', on=['Lge', 'Year', 'Tm'])
print(train2.shape)
train2.head()

(16717, 35)


Unnamed: 0,Player,Year,Lge,Pos,Age,Tm,G,GS,MP,FG,...,STL,BLK,TOV,PF,PTS,AST,Target,CAS,Szn,Coach
0,A.C. Green,1986.0,NBA,PF,22.0,LAL,82.0,1.0,18.8,2.5,...,0.6,0.6,1.2,2.8,6.4,0.7,1.1,0.7,1,Pat Riley
1,A.C. Green,1987.0,NBA,PF,23.0,LAL,79.0,72.0,28.4,4.0,...,0.9,1.0,1.3,2.2,10.8,1.1,1.1,0.9,2,Pat Riley
2,A.C. Green,1988.0,NBA,PF,24.0,LAL,82.0,64.0,32.1,3.9,...,1.1,0.5,1.5,2.5,11.4,1.1,1.3,0.966667,3,Pat Riley
3,A.C. Green,1989.0,NBA,PF,25.0,LAL,82.0,82.0,30.6,4.9,...,1.1,0.7,1.5,2.1,13.3,1.3,1.1,1.05,4,Pat Riley
4,A.C. Green,1990.0,NBA,PF,26.0,LAL,82.0,82.0,33.0,4.7,...,0.8,0.6,1.4,2.5,12.9,1.1,0.9,1.06,5,Pat Riley


In [134]:
train2.isnull().sum()

Player       0
Year         0
Lge          0
Pos          0
Age          0
Tm           0
G            0
GS        5243
MP         464
FG           0
FGA          0
FG%          0
3P           0
3PA          0
3P%          0
2P           0
2PA          0
2P%          0
eFG%         0
FT           0
FTA          0
FT%          0
ORB       2843
DRB       2843
TRB        381
STL       3351
BLK       3350
TOV       3565
PF           0
PTS          0
AST          0
Target       0
CAS          0
Szn          0
Coach       12
dtype: int64

In [135]:
train2['Coach'] = train2['Coach'].fillna('OTHER')
train2.isnull().sum()

Player       0
Year         0
Lge          0
Pos          0
Age          0
Tm           0
G            0
GS        5243
MP         464
FG           0
FGA          0
FG%          0
3P           0
3PA          0
3P%          0
2P           0
2PA          0
2P%          0
eFG%         0
FT           0
FTA          0
FT%          0
ORB       2843
DRB       2843
TRB        381
STL       3351
BLK       3350
TOV       3565
PF           0
PTS          0
AST          0
Target       0
CAS          0
Szn          0
Coach        0
dtype: int64

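The merged dataset has 12 more rows than the original train dataset (16717 vs. 16705). A left merge repeats a `train` row once for every matching row in `df6`, so the most likely cause is that a few (`Lge`, `Year`, `Tm`) combinations still have more than one coach. For example, `df5.head()` still lists both Dutch Dehnert and Roy Clifford for CLR in 1947, because the loop never compares rows 0 through 5 with each other. With only 12 extra rows the effect on the model is probably small, but the duplicates are worth removing.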
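
One way to investigate the extra rows, sketched below with made-up player names and a tiny stand-in for `df6`, is to look for repeated (`Lge`, `Year`, `Tm`) keys before merging. Passing `validate='many_to_one'` to `merge` also makes pandas raise a `MergeError` when the right-hand keys are not unique.

```python
import pandas as pd

players = pd.DataFrame({
    'Player': ['Player A', 'Player B'],   # made-up names
    'Lge': ['BAA', 'BAA'],
    'Year': [1947, 1947],
    'Tm': ['CLR', 'BOS'],
})
coaches = pd.DataFrame({
    'Coach': ['John Russell', 'Dutch Dehnert', 'Roy Clifford'],
    'Lge': ['BAA', 'BAA', 'BAA'],
    'Year': [1947, 1947, 1947],
    'Tm': ['BOS', 'CLR', 'CLR'],
})
key = ['Lge', 'Year', 'Tm']

# Rows of the coaches table whose key appears more than once.
repeated = coaches[coaches.duplicated(subset=key, keep=False)]
print(repeated)

# Without validation, the CLR player is silently repeated once per coach.
merged = players.merge(coaches, how='left', on=key)
print(len(players), len(merged))

# With validation, pandas refuses to merge on non-unique right-hand keys.
try:
    players.merge(coaches, how='left', on=key, validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```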