[NBA Salaries: Dataset by Chris Davis](https://data.world/datadavis/nba-salaries "Data.World")

[Salary Cap History](https://www.basketball-reference.com/contracts/salary-cap-history.html "Basketball Reference")

In [102]:
import pandas as pd
import numpy as np

In [103]:
salary_cap = pd.read_csv('./temp_csvs/sportsref_download.csv', sep = ',')

salary_cap.head()

Unnamed: 0,year,salary_cap,2015_dollars
0,1984,"$3,600,000","$7,934,034"
1,1985,"$4,233,000","$9,153,509"
2,1986,"$4,945,000","$10,317,292"
3,1987,"$6,164,000","$12,354,015"
4,1988,"$7,232,000","$13,829,137"


The only cleaning needed here is to strip the leading dollar sign from the 2nd and 3rd columns and convert them to a numeric type. Because we won't be accounting for inflation, we can drop the 3rd column (2015_dollars) entirely.

In [104]:
cap_clean = salary_cap.drop(columns = '2015_dollars')

cap_clean.salary_cap = cap_clean.salary_cap.str[1:]
cap_clean.salary_cap = pd.to_numeric(cap_clean.salary_cap.str.replace(',',''))
cap_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 2 columns):
year          37 non-null int64
salary_cap    37 non-null int64
dtypes: int64(2)
memory usage: 672.0 bytes
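
The two-step cleanup above (slice off the first character, then remove commas) can also be done in a single pass. The following is a minimal sketch, assuming the column only ever contains a dollar sign, commas, and digits; the `sample` dataframe is a hypothetical stand-in built from the first two rows shown earlier.

```python
import pandas as pd

# Hypothetical sample mirroring the first two rows of the salary cap table
sample = pd.DataFrame({
    "year": [1984, 1985],
    "salary_cap": ["$3,600,000", "$4,233,000"],
})

# Remove every '$' and ',' in one regex pass, then convert to a numeric dtype
sample["salary_cap"] = pd.to_numeric(
    sample["salary_cap"].str.replace(r"[$,]", "", regex=True)
)
print(sample.dtypes)
```

Unlike slicing with `.str[1:]`, this does not depend on the dollar sign being the first character of every value.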


In [105]:
players = pd.read_csv('./temp_csvs/players_1985to2018.csv')
print(players.shape)
print(players.columns)
players.head(3)

(4685, 24)
Index(['_id', 'birthDate', 'birthPlace', 'career_AST', 'career_FG%',
       'career_FG3%', 'career_FT%', 'career_G', 'career_PER', 'career_PTS',
       'career_TRB', 'career_WS', 'career_eFG%', 'college', 'draft_pick',
       'draft_round', 'draft_team', 'draft_year', 'height', 'highSchool',
       'name', 'position', 'shoots', 'weight'],
      dtype='object')


Unnamed: 0,_id,birthDate,birthPlace,career_AST,career_FG%,career_FG3%,career_FT%,career_G,career_PER,career_PTS,...,draft_pick,draft_round,draft_team,draft_year,height,highSchool,name,position,shoots,weight
0,abdelal01,"June 24, 1968","Cairo, Egypt",0.3,50.2,0.0,70.1,256,13.0,5.7,...,25th overall,1st round,Portland Trail Blazers,1990,6-10,"Bloomfield in Bloomfield, New Jersey",Alaa Abdelnaby,Power Forward,Right,240lb
1,abdulza01,"April 7, 1946","Brooklyn, New York",1.2,42.8,,72.8,505,15.1,9.0,...,5th overall,1st round,Cincinnati Royals,1968,6-9,"John Jay in Brooklyn, New York",Zaid Abdul-Aziz,Power Forward and Center,Right,235lb
2,abdulka01,"April 16, 1947","New York, New York",3.6,55.9,5.6,72.1,1560,24.6,24.6,...,1st overall,1st round,Milwaukee Bucks,1969,7-2,"Power Memorial in New York, New York",Kareem Abdul-Jabbar,Center,Right,225lb


As with the salary cap dataframe, several of these columns need their data types converted, but most of them aren't needed and can be dropped entirely.

In [106]:
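# Keep only the columns we need; .copy() avoids pandas' SettingWithCopyWarning
players_clean = players[['_id','name','height','weight','shoots']].copy()
# Strip the trailing 'lb' unit, then convert weight to a number
players_clean.weight = players_clean.weight.str[:-2]
players_clean.weight = pd.to_numeric(players_clean.weight)
players_clean.head()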

Unnamed: 0,_id,name,height,weight,shoots
0,abdelal01,Alaa Abdelnaby,6-10,240.0,Right
1,abdulza01,Zaid Abdul-Aziz,6-9,235.0,Right
2,abdulka01,Kareem Abdul-Jabbar,7-2,225.0,Right
3,abdulma02,Mahmoud Abdul-Rauf,6-1,162.0,Right
4,abdulta01,Tariq Abdul-Wahad,6-6,223.0,Right


In [107]:
# Split 'feet-inches' strings (e.g. '6-10') into two columns
height = players_clean.height.str.split('-', expand = True)
height.iloc[:,0] = pd.to_numeric(height.iloc[:,0])
height.iloc[:,1] = pd.to_numeric(height.iloc[:,1])
height.columns = ['feet','inches']
# Total height in inches
height['height'] = height['feet']*12 + height['inches']
height.head()

players_clean.height = height.height
players_clean.head()

Unnamed: 0,_id,name,height,weight,shoots
0,abdelal01,Alaa Abdelnaby,82,240.0,Right
1,abdulza01,Zaid Abdul-Aziz,81,235.0,Right
2,abdulka01,Kareem Abdul-Jabbar,86,225.0,Right
3,abdulma02,Mahmoud Abdul-Rauf,73,162.0,Right
4,abdulta01,Tariq Abdul-Wahad,78,223.0,Right
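
The height conversion can be wrapped in a small reusable helper. This is only a sketch, assuming every non-missing value has the form `feet-inches`; the function name `height_to_inches` and the sample series are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

def height_to_inches(heights: pd.Series) -> pd.Series:
    """Convert 'feet-inches' strings such as '6-10' to total inches."""
    # Split into two columns and convert each column to a numeric dtype
    parts = heights.str.split("-", expand=True).apply(pd.to_numeric)
    return parts[0] * 12 + parts[1]

# The first three heights from the players table
sample_heights = pd.Series(["6-10", "6-9", "7-2"])
inches = height_to_inches(sample_heights)
print(inches.tolist())
```

For these three players the result agrees with the height column shown above (82, 81, and 86 inches).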


In [108]:
salaries = pd.read_csv('./temp_csvs/salaries_1985to2018.csv')
print(salaries.shape)
print(salaries.info())
salaries.head(3)

(14163, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14163 entries, 0 to 14162
Data columns (total 7 columns):
league          14163 non-null object
player_id       14163 non-null object
salary          14163 non-null int64
season          14163 non-null object
season_end      14163 non-null int64
season_start    14163 non-null int64
team            14159 non-null object
dtypes: int64(3), object(4)
memory usage: 774.6+ KB
None


Unnamed: 0,league,player_id,salary,season,season_end,season_start,team
0,NBA,abdelal01,395000,1990-91,1991,1990,Portland Trail Blazers
1,NBA,abdelal01,494000,1991-92,1992,1991,Portland Trail Blazers
2,NBA,abdelal01,500000,1992-93,1993,1992,Boston Celtics


All we need from this dataframe are the player_id, salary, and season_start columns. The player_id column lets us merge with the players dataframe and pick up the 'name' column, which ultimately lets us attach salaries to our much larger player-level data elsewhere (not in this notebook).

In [109]:
salaries_clean = salaries.drop(columns = ['league','season','season_end','team'])
salaries_clean = salaries_clean.rename(columns = {'player_id':'_id',
                                                  'season_start':'year'})

salaries_clean.head()

Unnamed: 0,_id,salary,year
0,abdelal01,395000,1990
1,abdelal01,494000,1991
2,abdelal01,500000,1992
3,abdelal01,805000,1993
4,abdelal01,650000,1994


Now that the data is clean and in a usable form, we need to merge the players with the salaries. Thankfully, Chris Davis (who compiled this data) has provided player ids that make this merge possible.

In [110]:
playersalaries = pd.merge(players_clean, salaries_clean, on = '_id')
playersalaries = playersalaries.drop('_id', axis = 1)
playersalaries.info()
playersalaries.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14163 entries, 0 to 14162
Data columns (total 6 columns):
name      14163 non-null object
height    14163 non-null int64
weight    14163 non-null float64
shoots    14163 non-null object
salary    14163 non-null int64
year      14163 non-null int64
dtypes: float64(1), int64(3), object(2)
memory usage: 774.5+ KB


Unnamed: 0,name,height,weight,shoots,salary,year
0,Alaa Abdelnaby,82,240.0,Right,395000,1990
1,Alaa Abdelnaby,82,240.0,Right,494000,1991
2,Alaa Abdelnaby,82,240.0,Right,500000,1992
3,Alaa Abdelnaby,82,240.0,Right,805000,1993
4,Alaa Abdelnaby,82,240.0,Right,650000,1994
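
Before dropping `_id`, it can be worth checking that every salary row found a matching player. The sketch below uses two tiny hypothetical frames (the `unknown01` row is invented purely for illustration) together with the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical miniature versions of players_clean and salaries_clean
players_demo = pd.DataFrame({
    "_id": ["abdelal01", "abdulka01"],
    "name": ["Alaa Abdelnaby", "Kareem Abdul-Jabbar"],
})
salaries_demo = pd.DataFrame({
    "_id": ["abdelal01", "abdelal01", "unknown01"],  # 'unknown01' is made up
    "salary": [395000, 494000, 100000],
    "year": [1990, 1991, 1990],
})

# indicator=True adds a '_merge' column naming the source of each row;
# validate raises an error if '_id' is duplicated on the players side
checked = pd.merge(players_demo, salaries_demo, on="_id",
                   how="outer", indicator=True, validate="one_to_many")

# Rows that exist in only one of the two frames
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["_id", "_merge"]])
```

In the real data, the merged frame has the same 14163 rows as the salaries file and `name` is non-null throughout, which suggests every salary row matched a player.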


In [111]:
cap_clean.to_csv('data/salary_cap_history.csv')

In [114]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [115]:
# URL of the page we will be scraping
url = "https://hoopshype.com/salaries/players/2018-2019/"
    
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html)

# use findAll() to look at the first table on the page
soup.findAll('table', limit=2)[0]

# skip the header row, then collect the text of every cell in each row
rows = soup.findAll('tr')[1:]

salaries = [[td.getText() for td in row.findAll('td')]
        for row in rows]

In [116]:
step_two = [i[1] for i in salaries]

step_three = [i.replace('\n\n\t\t\t\t\t\t\t\t','') for i in step_two]
names = [i.replace('\t\t\t\t\t\t\t\n','') for i in step_three]

In [117]:
step_two = [i[2] for i in salaries]

step_three = [i.replace('\n\t\t\t\t\t\t\t$','') for i in step_two]
step_four = [i.replace('\t\t\t\t\t\t','') for i in step_three]
money = list(pd.to_numeric([i.replace(',','') for i in step_four]))
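
The chained `replace` calls above depend on the exact number of tabs and newlines around each value. A less brittle alternative is to strip whitespace generically. This is a minimal sketch; the raw strings are hypothetical examples shaped like the scraped cells, not actual scraper output.

```python
# Hypothetical raw cell text shaped like the scraped table cells
raw_names = ["\n\n\t\t\t\t\t\t\t\tStephen Curry\t\t\t\t\t\t\t\n"]
raw_money = ["\n\t\t\t\t\t\t\t$37,457,154\t\t\t\t\t\t"]

# strip() removes any leading or trailing whitespace, however much there is
names_demo = [text.strip() for text in raw_names]

# Strip whitespace, drop the dollar sign, remove commas, convert to int
money_demo = [int(text.strip().lstrip("$").replace(",", ""))
              for text in raw_money]

print(names_demo, money_demo)
```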

In [118]:
last = pd.DataFrame({'name': names,
                    'salary': money,
                    'year': 2018})

In [119]:
last.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 576 entries, 0 to 575
Data columns (total 3 columns):
name      576 non-null object
salary    576 non-null int64
year      576 non-null int64
dtypes: int64(2), object(1)
memory usage: 13.6+ KB


In [120]:
final = pd.merge(playersalaries, last, on = ['name', 'salary','year'], how = 'outer')
print(playersalaries.info())
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14163 entries, 0 to 14162
Data columns (total 6 columns):
name      14163 non-null object
height    14163 non-null int64
weight    14163 non-null float64
shoots    14163 non-null object
salary    14163 non-null int64
year      14163 non-null int64
dtypes: float64(1), int64(3), object(2)
memory usage: 774.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14739 entries, 0 to 14738
Data columns (total 6 columns):
name      14739 non-null object
height    14163 non-null float64
weight    14163 non-null float64
shoots    14163 non-null object
salary    14739 non-null int64
year      14739 non-null int64
dtypes: float64(2), int64(2), object(2)
memory usage: 806.0+ KB


In [121]:
final.set_index(['name','year']).sort_values(['name','year']).loc['Stephen Curry',:]

Unnamed: 0_level_0,height,weight,shoots,salary
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2009,75.0,190.0,Right,2710560
2010,75.0,190.0,Right,2913840
2011,75.0,190.0,Right,3117120
2012,75.0,190.0,Right,3958742
2013,75.0,190.0,Right,9887642
2014,75.0,190.0,Right,10629213
2015,75.0,190.0,Right,11370786
2016,75.0,190.0,Right,12112359
2017,75.0,190.0,Right,34682550
2018,,,,37457154
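
The 2018 row has no height, weight, or shoots value because the outer merge matched on name, salary, and year, so the scraped rows were appended without any player attributes. One possible way to carry the attributes over is sketched below. It assumes a player's name identifies them uniquely, which may not hold for every player; the `history` and `latest` frames are small hypothetical stand-ins for `playersalaries` and `last`.

```python
import pandas as pd

# Hypothetical stand-in for playersalaries (two of Stephen Curry's seasons)
history = pd.DataFrame({
    "name": ["Stephen Curry", "Stephen Curry"],
    "height": [75, 75],
    "weight": [190.0, 190.0],
    "shoots": ["Right", "Right"],
    "salary": [12112359, 34682550],
    "year": [2016, 2017],
})
# Hypothetical stand-in for the scraped 2018 frame
latest = pd.DataFrame({"name": ["Stephen Curry"],
                       "salary": [37457154],
                       "year": [2018]})

# One row of physical attributes per player name (assumes names are unique)
attributes = history[["name", "height", "weight", "shoots"]].drop_duplicates("name")

# Attach the attributes to the scraped rows, then stack the two frames
latest_full = latest.merge(attributes, on="name", how="left")
combined = pd.concat([history, latest_full], ignore_index=True)
print(combined)
```

Players who appear only in the scraped data would still have missing attributes after this step, since there is nothing in the historical table to look up.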


In [122]:
file_loc = './temp_csvs/salaries.csv'

final.to_csv(file_loc)