# 4 - Cleaning Real Data

Now it's time for the real stuff. Let's load in a real dataset and discover our next steps together.

In *Appendix A - Scrape & Build NBA Salary Dataset*, we create an NBA player salary dataset by web scraping [hoopshype.com](https://hoopshype.com). We won't cover web scraping here, but you can run that notebook if you want to learn more.

In [None]:
import datetime
import pathlib
import pandas as pd

# import local utils.py
import utils

In [None]:
PERFORM_SCRAPE = True
BASE_DIR = pathlib.Path().resolve()
DATASET_PATH = BASE_DIR / 'datasets'
INPUT_PATH = DATASET_PATH / 'nba-historical-salaries.csv'
print(f'Dataset *{INPUT_PATH.name}* exists:', INPUT_PATH.exists())

In [None]:
df = pd.read_csv(INPUT_PATH)

In [None]:
df.head()

In [None]:
df.shape

The above commands tell us a lot about this data already:
- Financial data
- Columns with dollar strings need to be cleaned (`$`)
- Rename columns for consistency
- There are 14,549 records, each with 5 data points.
- `adj_salary` is given data. Does this mean adjusted to today's dollars? Is this accurate?

After this assessment, let's get to work.

## Column Consistency

_How you do anything, is how you do everything._

Let's start with the mundane task of committing to a consistent naming convention for our columns across our entire project here. 

Before we do, let's see the columns: 

In [None]:
df.columns

If you're a seasoned programmer, you will notice the issue. If you're new to programming, you might miss it. Look closely at each column name and you will see that the casing style subtly shifts from one column to the next.

Casing types? Yes, seriously. Here are a few options:

- `PascalCase` -> `ThisIsPascalCase`
- `camelCase` -> `thisIsCamelCase`
- `snake_case` -> `this_is_snake_case`
- `kebab-case` -> `this-is-kebab-case` (aka `slugified-string`, `spinal-case`)


Since I use Python and create a lot of web applications, I tend to use `snake_case` or `kebab-case`. If you're a SQL database person, you'd probably use `PascalCase`. If you're from JavaScript, you'd probably use a lot of `camelCase`.

Whatever format you use, just be consistent. Let's rename our columns using `snake_case`:

In [None]:
# %pip install python-slugify

In [None]:
from slugify import slugify

def to_snake_case(val):
    # in the future, this will be stored in
    # utils.py in the courses/ directory
    kebab_case = slugify(val)
    return kebab_case.replace('-', '_')

I like using the `python-slugify` package to consistently and reliably convert any string into a URL-ready slug (aka `kebab-case`). Once we have a slug, we can just swap the dashes (`-`) for underscores (`_`).
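
If you'd rather not install a package, here is a minimal standard-library sketch of the same idea. The function name `to_snake_case_stdlib` is made up for this example, and it is not how `python-slugify` works internally; it also splits `camelCase` and `PascalCase` words, so its output can differ from the `slugify` version above.

```python
import re

def to_snake_case_stdlib(val):
    # Hypothetical stdlib-only alternative; NOT python-slugify's algorithm.
    # Insert an underscore at lower-to-upper boundaries (camelCase, PascalCase).
    val = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', val)
    # Collapse every run of non-alphanumeric characters into one underscore.
    val = re.sub(r'[^A-Za-z0-9]+', '_', val)
    # Drop leading/trailing underscores and lowercase the result.
    return val.strip('_').lower()

# Made-up column names, one per casing style
examples = ['Adj Salary', 'yearStart', 'this-is-kebab-case', 'ThisIsPascalCase']
print([to_snake_case_stdlib(x) for x in examples])
```

This only handles plain ASCII names; for anything messier (accents, symbols), a dedicated package is the safer choice.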

In [None]:
old_columns = df.columns
new_columns = [to_snake_case(x) for x in old_columns]

In [None]:
new_column_mapping = dict(zip(old_columns, new_columns))
new_column_mapping

> `zip` is a handy built-in Python function that pairs up the items of two lists (or any iterables) by position; if the lengths differ, it stops at the shorter one. Once you wrap the result in `dict`, the left-side list becomes the keys and the right-side list becomes the values at the same index. I remember `zip` like a zipper on your pants, backpack, or luggage: each side has "teeth" that correspond to the other side.
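
To see `zip` and `dict` working together outside of pandas, here is a small sketch. The column names are made up for illustration and are not taken from the dataset.

```python
# Made-up column names, only for illustration
old_names = ['Name', 'Year Start', 'Adj Salary']
new_names = ['name', 'year_start', 'adj_salary']

# zip pairs items by position: first with first, second with second, ...
pairs = list(zip(old_names, new_names))
print(pairs)

# dict turns each (left, right) pair into a key/value entry
mapping = dict(zip(old_names, new_names))
print(mapping['Year Start'])

# If the inputs differ in length, zip stops at the shorter one
short = list(zip(['a', 'b', 'c'], [1, 2]))
print(short)
```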

In [None]:
df.rename(columns=new_column_mapping, inplace=True)
df.head()

## Cleaning Rows

Now that we've renamed our columns, let's clean up our rows. In `utils.py` we have the function `dollar_str_to_float`, which converts dollar strings into floats.
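
The contents of `utils.py` aren't shown in this notebook, so here is a rough sketch of what such a helper might look like. The name `dollar_str_to_float_sketch` and its behavior are assumptions for illustration; the real `utils.dollar_str_to_float` may handle more cases or work differently.

```python
def dollar_str_to_float_sketch(value):
    # Hypothetical stand-in; the real utils.dollar_str_to_float may differ.
    # Values that are already numeric are simply converted to float.
    if isinstance(value, (int, float)):
        return float(value)
    # Strip the dollar sign, thousands separators, and surrounding whitespace.
    cleaned = str(value).replace('$', '').replace(',', '').strip()
    return float(cleaned)

print(dollar_str_to_float_sketch('$1,234,567'))
print(dollar_str_to_float_sketch('$30.50'))
```

Note that this sketch raises a `ValueError` on strings it cannot parse (an empty string, for example), which is one way to find dirty values early.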

In [None]:
def clean_row(row_series):
    row_series['salary'] = utils.dollar_str_to_float(row_series['salary'])
    row_series['adj_salary'] = utils.dollar_str_to_float(row_series['adj_salary'])
    return row_series

df_cleaned = df.copy().apply(clean_row, axis=1)
df_cleaned.head()

I hope that your alarm bells are going off. We never covered `df.apply`; we only covered `df['my_col'].apply`. What gives?

When you run `.apply` on an entire DataFrame with `axis=1`, your function receives one row at a time (as a Series), so you can modify several columns of that row at once instead of working on a single column. Another way to write this would be:

```python
df_cleaned = df.copy()
df_cleaned['salary'] = df_cleaned['salary'].apply(utils.dollar_str_to_float)
df_cleaned['adj_salary'] = df_cleaned['adj_salary'].apply(utils.dollar_str_to_float)
```

And that would be perfectly acceptable. But there's a major difference. And it's this:

In [None]:
def clean_row_2(row_series):
    dollar_cols = ['salary', 'adj_salary']
    for col in dollar_cols:
        row_series[col] = utils.dollar_str_to_float(row_series[col])
    return row_series

df_cleaned_2 = df.copy().apply(clean_row_2, axis=1)
df_cleaned_2.head()

`clean_row_2` gives us a way to reduce complexity by iterating over the columns we want to adjust. 
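
To convince yourself that the row-wise and column-wise approaches produce the same values, here is a small self-contained sketch on a toy DataFrame. The data is made up, and `parse_dollars` is a hypothetical stand-in for `utils.dollar_str_to_float`.

```python
import pandas as pd

def parse_dollars(value):
    # Hypothetical stand-in for utils.dollar_str_to_float.
    return float(str(value).replace('$', '').replace(',', ''))

# Made-up toy data shaped like the salary dataset
toy = pd.DataFrame({
    'player': ['A', 'B'],
    'year_start': [2019, 2020],
    'salary': ['$1,000', '$2,500'],
    'adj_salary': ['$1,500', '$3,000'],
})

dollar_cols = ['salary', 'adj_salary']

def clean_toy_row(row):
    # Row-wise: each call receives one row as a Series.
    for col in dollar_cols:
        row[col] = parse_dollars(row[col])
    return row

row_wise = toy.copy().apply(clean_toy_row, axis=1)

# Column-wise: each call receives one cell from a single column.
col_wise = toy.copy()
for col in dollar_cols:
    col_wise[col] = col_wise[col].apply(parse_dollars)

for col in dollar_cols:
    print(col, row_wise[col].tolist() == col_wise[col].tolist())
```

Both versions loop over `dollar_cols`, so adding another dollar column later means changing one list rather than writing another line of cleanup code.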

In [None]:
df_cleaned_2['adj_salary'].dtype

In [None]:
players_per_year = df_cleaned_2['year_start'].value_counts(sort=False)
players_per_year

In [None]:
players_per_year.plot(title='Number of Players Per Year')

In [None]:
adj_salary_df = df_cleaned_2.copy()[['year_start', 'adj_salary']]
adj_salaries_cumlative = adj_salary_df.groupby("year_start").sum()

adj_salaries_cumlative.plot(title='Adjusted Cumulative Salaries Over Time')

Look at these two charts! The second appears to be outpacing the first.

- Upward trend in both the number of players and salaries
- What happened in 2019?
- 2020 seems to be trending towards a massive year for player payments
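
One way to check whether total salaries are really growing faster than the number of players is to put the count, the total, and the average side by side for each year. Here is a minimal sketch on made-up numbers; on the real data you would swap `toy` for `df_cleaned_2`.

```python
import pandas as pd

# Made-up numbers, only to show the shape of the calculation
toy = pd.DataFrame({
    'year_start': [2018, 2018, 2019, 2019, 2019],
    'adj_salary': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# count = players per year, sum = total salaries, mean = average per player
per_year = toy.groupby('year_start')['adj_salary'].agg(['count', 'sum', 'mean'])
print(per_year)
```

If `mean` climbs over time, salaries are rising per player and not only because there are more players.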

The above dataset leaves me with a lot of questions:

- Are these adjusted salary numbers correct? (They are from hoopshype.com.)
- Are the per-player salaries going up or just the top 5% of players?
- How does a player's salary correlate to wins / losses / other stats?
- How does a team (full of players) and their salaries correlate to wins / losses / other stats?
- Do the audience metrics support these numbers? (In person, online, etc) In other words, is there really this much economic value being generated?

Answers to these questions will inevitably lead to more questions, which hopefully means more and better data analysis.


In [None]:
# Export to samples dir

# df_cleaned_2.to_csv('samples/4-player-salaries-cleaned.csv', index=False)

# players_per_year.rename("players", inplace=True)  # value_counts returns a Series, so rename it by name
# players_per_year.to_csv('samples/4-player-salaries-per-year.csv', index_label='year', index=True)

# adj_salaries_cumlative['adj_salary_$'] = adj_salaries_cumlative['adj_salary'].apply(utils.float_to_dollars)
# adj_salaries_cumlative.rename(columns={"year_start": "year"}, inplace=True)
# adj_salaries_cumlative.to_csv("samples/4-adj-salaries-cumlative-per-year.csv", index_label="year")