# nba-data-prep

Use this notebook to prepare the latest version of the NBA dataset that we use in lecture.

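We need two CSVs in `raw_data/`. The notebook also reads `raw_data/team_names.txt`, which maps team abbreviations to full team names.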

- `raw_data/nba-salaries-2022.csv`: download the table [here](https://www.basketball-reference.com/contracts/players.html) as a CSV
- `raw_data/nba-stats-2022.csv`: download the table [here](https://www.basketball-reference.com/leagues/NBA_2022_totals.html#totals_stats::pts) as a CSV

To refresh the dataset, re-download both CSVs and rerun the notebook.

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
salaries = pd.read_csv('raw_data/nba-salaries-2022.csv', header=1)
stats = pd.read_csv('raw_data/nba-stats-2022.csv')
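
The `header=1` argument tells pandas to take column names from the second row of the file and to ignore the row above it. A minimal sketch of that behavior follows. The CSV text is made up for illustration and only assumes that the salaries export carries an extra grouping row on top of the real header.

```python
import io
import pandas as pd

# Hypothetical two-row header: a grouping row, then the real column names.
sample_csv = (
    ",,,Salary,Salary\n"
    "Rk,Player,Tm,2021-22,2022-23\n"
    "1,Player A\\playera01,AAA,$100,$200\n"
    "2,Player B\\playerb01,BBB,$50,\n"
)

# header=1 uses the second row (index 1) as the header and skips the first.
demo_salaries = pd.read_csv(io.StringIO(sample_csv), header=1)
print(demo_salaries.columns.tolist())
```

If a future export drops the grouping row, `header=1` would silently turn the first data row into column names, so it is worth checking `salaries.columns` after loading.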

In [None]:
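# Build a lookup from team abbreviation to full team name.
# Each line is parsed as "ABB - Full Name"; tabs are removed before splitting.
team_dict = {}

with open('raw_data/team_names.txt', 'r') as team_file:
    for line in team_file:
        parts = line.strip().replace('\t', '').split('-')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        team_dict[parts[0].strip()] = parts[1].strip()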
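
To check the parsing logic without touching the real file, the same split can be run on a few in-memory lines. This is a sketch only: `parse_team_lines` is a hypothetical helper, and the sample lines assume an `ABB - Full Name` layout, which is inferred from the code above rather than documented.

```python
def parse_team_lines(lines):
    """Map abbreviation -> full name from 'ABB - Full Name' lines (assumed format)."""
    mapping = {}
    for raw in lines:
        parts = raw.strip().replace('\t', '').split('-')
        if len(parts) < 2:
            continue  # ignore blank lines and lines without a dash
        mapping[parts[0].strip()] = parts[1].strip()
    return mapping

# Made-up sample input, including a blank line and a tab-separated line.
sample_lines = ["ATL - Atlanta Hawks\n", "\n", "BOS\t-\tBoston Celtics\n"]
demo_team_dict = parse_team_lines(sample_lines)
print(demo_team_dict)
```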

In [None]:
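# Inner join on the raw 'Player' field; the shared 'Tm' column becomes
# 'Tm_x' (from salaries) and 'Tm_y' (from stats).
merged = salaries.merge(stats, on='Player')
# Keep only rows with at least 15 in the 'G' column.
merged = merged[merged['G'] >= 15]
# Keep the columns we need and drop rows missing any of them.
merged = merged[['Player', 'Tm_x', 'Pos', '2021-22']].dropna()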
merged
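
The merge, filter, and `dropna` steps can be traced on a toy example. All names and values below are invented. The sketch shows that the default inner join drops players who appear in only one table, and that the shared `Tm` column is suffixed with `_x` and `_y`.

```python
import pandas as pd

# Invented rows: Player C has no stats row, Player D has no salary row.
demo_salaries = pd.DataFrame({
    'Player': ['Player A\\a01', 'Player B\\b01', 'Player C\\c01'],
    'Tm': ['AAA', 'BBB', 'CCC'],
    '2021-22': ['$100', '$50', None],
})
demo_stats = pd.DataFrame({
    'Player': ['Player A\\a01', 'Player B\\b01', 'Player D\\d01'],
    'Tm': ['AAA', 'BBB', 'DDD'],
    'Pos': ['PG', 'C', 'SF'],
    'G': [60, 10, 40],
})

joined = demo_salaries.merge(demo_stats, on='Player')  # inner join by default
filtered = joined[joined['G'] >= 15]                   # drops Player B (G = 10)
demo_merged = filtered[['Player', 'Tm_x', 'Pos', '2021-22']].dropna()
print(demo_merged)
```

Passing `how='outer', indicator=True` to `merge` is one way to list the players that fail to match, if the row count after merging looks too low.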

In [None]:
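# Keep only the text before the last backslash in the raw 'Player' field.
merged['Player'] = merged['Player'].str.findall(r'(.*)\\.*').str[0]
# Map team abbreviations to full names; unknown abbreviations are left as-is.
merged['Team'] = merged['Tm_x'].replace(team_dict)
# Drop the leading character (the currency symbol) and convert to an integer.
merged['Salary'] = merged['2021-22'].str[1:].astype(int)
merged['Position'] = merged['Pos']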
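
A short sketch of the three string transformations on invented values, to make their effect visible. The name suffix, the abbreviations, and the salary strings are all assumptions about what the raw export looks like.

```python
import pandas as pd

demo = pd.DataFrame({
    'Player': ['Player A\\a01', 'Player B\\b01'],
    'Tm_x': ['AAA', 'ZZZ'],
    '2021-22': ['$100', '$2500000'],
})
demo_team_dict = {'AAA': 'Team A'}  # 'ZZZ' is deliberately missing

# The pattern captures everything before the last backslash.
demo['Player'] = demo['Player'].str.findall(r'(.*)\\.*').str[0]
# replace() leaves values without a dictionary entry unchanged.
demo['Team'] = demo['Tm_x'].replace(demo_team_dict)
# Strip the first character, then parse the remaining digits.
demo['Salary'] = demo['2021-22'].str[1:].astype(int)
print(demo[['Player', 'Team', 'Salary']])
```

Because unmatched abbreviations pass through `replace` untouched, a team missing from `team_names.txt` would show up in the output as its abbreviation rather than raise an error.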

In [None]:
merged = (
    merged[['Player', 'Position', 'Team', 'Salary']]
    .groupby('Player')
    .first()
    .reset_index()
    .sort_values(['Team', 'Salary'], ascending=[True, False])
)
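
The `groupby('Player').first()` step collapses repeated player rows into one. The sketch below uses invented data and assumes that duplicates arise when a player appears in several rows of the merged table, for example one row per team.

```python
import pandas as pd

# Invented rows: Player A appears twice with different teams.
demo = pd.DataFrame({
    'Player': ['Player B', 'Player A', 'Player A', 'Player C'],
    'Position': ['C', 'PG', 'PG', 'SF'],
    'Team': ['Team Y', 'Team X', 'Team Z', 'Team X'],
    'Salary': [50, 100, 100, 300],
})

demo_out = (
    demo.groupby('Player')
        .first()          # first non-null value per column within each player
        .reset_index()
        .sort_values(['Team', 'Salary'], ascending=[True, False])
)
print(demo_out)
```

Note that `first()` picks the first non-null value of each column independently, so it matches "the first row" only when that row has no missing values. Here Player A keeps `Team X`, the team from its earliest row.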

In [None]:
merged.to_csv('../data/nba-2022.csv', index=False)