# NBA Data :: New Player Dataset

This notebook creates a new `players` dataset containing player statistics from 2004 to 2024. The old players dataset from `nba_api` was insufficient for our hypothesis tests because it was missing player positions, so we were unable to look for correlations between positions and player statistics.

The new dataset is pulled from the [nbaapi](<http://rest.nbaapi.com/index.html>) site. After trying to get an automatic scraper to work, we settled on downloading the data manually. The functions below read the downloaded JSON files, convert each one to a pandas DataFrame, concatenate the results, and write the combined data to both CSV and PKL files.

## 1. Importing Packages and Data

In [3]:
import numpy as np
import pandas as pd
import polars as pl
import pyarrow
import openpyxl
import json
import os

data_dir = '/home/arch-db/Documents/github/bint-capstone/data-sources/player-responses/'

## 2. Converting the JSON files to DataFrames

In [12]:
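def json_to_list_of_dfs(directory: str) -> list[pd.DataFrame]:
    """Read every .json file in `directory` into its own DataFrame."""
    dfs = []

    # os.listdir returns names in arbitrary order, so sort for reproducible output
    for file_name in sorted(os.listdir(directory)):
        if file_name.endswith('.json'):
            file_path = os.path.join(directory, file_name)

            # parse the JSON file, then close it before building the DataFrame
            with open(file_path, 'r', encoding='utf-8') as f:
                data = json.load(f)

            dfs.append(pd.DataFrame(data))

    return dfs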

In [13]:
dfs = json_to_list_of_dfs(data_dir)
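
To make the loading step easier to check without the real downloads, here is a minimal sketch that writes two tiny JSON files to a temporary directory and loads them with the same pattern as `json_to_list_of_dfs`. The file names and the fields `playerName`, `position`, and `season` are hypothetical placeholders; the actual nbaapi responses may use different keys. The sketch also assumes each file holds a top-level list of records, which is the shape `pd.DataFrame(data)` handles most directly.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample records; the real nbaapi field names may differ.
sample_pages = {
    "page-1.json": [
        {"playerName": "Player A", "position": "PG", "season": 2004},
        {"playerName": "Player B", "position": "C", "season": 2004},
    ],
    "page-2.json": [
        {"playerName": "Player C", "position": "SF", "season": 2005},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write each hypothetical page to its own JSON file.
    for file_name, records in sample_pages.items():
        with open(os.path.join(tmp_dir, file_name), "w", encoding="utf-8") as f:
            json.dump(records, f)

    # Same loading pattern as json_to_list_of_dfs: one DataFrame per JSON file.
    sample_dfs = []
    for file_name in sorted(os.listdir(tmp_dir)):
        if file_name.endswith(".json"):
            with open(os.path.join(tmp_dir, file_name), "r", encoding="utf-8") as f:
                sample_dfs.append(pd.DataFrame(json.load(f)))

# One (rows, columns) tuple per file.
sample_shapes = [df.shape for df in sample_dfs]
print(sample_shapes)
```

If a downloaded file wraps its records in an outer object instead of a bare list, the resulting DataFrame will not have one row per player, so checking `df.shape` and `df.columns` on the first file is a quick sanity test.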

## 3. Concatenating the DataFrames and Writing to Files

Now that the JSON data has been successfully converted to DataFrames, we need to concatenate them and write the result to files.

In [14]:
def concat_dfs(dfs: list[pd.DataFrame]) -> pd.DataFrame:
    # pd.concat raises on an empty list, so fall back to an empty DataFrame
    if dfs:
        return pd.concat(dfs, ignore_index=True)
    return pd.DataFrame()
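
The two branches in `concat_dfs` can be illustrated with a small sketch. The helper name `concat_or_empty` and the sample columns are hypothetical and only mirror the function above. The sketch shows that `ignore_index=True` renumbers the rows of the combined frame, and that the empty-list guard avoids the `ValueError` that `pd.concat` raises when given nothing to concatenate.

```python
import pandas as pd


def concat_or_empty(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Hypothetical stand-in that mirrors concat_dfs from this notebook."""
    if frames:
        return pd.concat(frames, ignore_index=True)
    return pd.DataFrame()


# Two small frames whose own indexes both start at 0.
first = pd.DataFrame({"playerName": ["Player A", "Player B"]})
second = pd.DataFrame({"playerName": ["Player C"]})

combined = concat_or_empty([first, second])
combined_index = list(combined.index)  # renumbered 0..n-1 by ignore_index=True

# Without ignore_index, the original per-file indexes are kept and repeat.
kept_index = list(pd.concat([first, second]).index)

# The guard returns an empty frame instead of raising.
empty_result = concat_or_empty([])

# Calling pd.concat directly on an empty list raises ValueError.
try:
    pd.concat([])
    raised_on_empty = False
except ValueError:
    raised_on_empty = True

print(combined_index, kept_index, empty_result.empty, raised_on_empty)
```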

In [19]:
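def write_to_files(df: pd.DataFrame, directory: str) -> None:
    # os.path.join works whether or not `directory` ends with a path separator
    df.to_csv(os.path.join(directory, 'new-player-data.csv'))
    df.to_pickle(os.path.join(directory, 'new-player-data.pkl'))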

In [20]:
write_to_files(concat_dfs(dfs), data_dir)
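
As a follow-up check, the written files can be read back to confirm they round-trip. The sketch below uses a temporary directory and a hypothetical two-row frame rather than the real player data. It assumes the CSV is written with the default `to_csv` settings, as in `write_to_files`, which means the row index is saved as an unnamed first column and should be read back with `index_col=0`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the concatenated player data.
frame = pd.DataFrame(
    {"playerName": ["Player A", "Player B"], "position": ["PG", "C"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "new-player-data.csv")
    pkl_path = os.path.join(tmp_dir, "new-player-data.pkl")

    # Same calls as write_to_files: default to_csv also writes the row index.
    frame.to_csv(csv_path)
    frame.to_pickle(pkl_path)

    # Reading without index_col exposes the saved index as an extra column.
    raw_columns = list(pd.read_csv(csv_path).columns)

    # Reading with index_col=0 restores the original column layout.
    from_csv = pd.read_csv(csv_path, index_col=0)
    from_pkl = pd.read_pickle(pkl_path)

print(raw_columns)
print(list(from_csv.columns))
print(from_pkl.equals(frame))
```

The pickle file preserves dtypes and the index exactly, while the CSV is the more portable format. Passing `index=False` to `to_csv` would be an alternative way to avoid the extra column.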