# 2 - Cleaning Data with Python & Pandas


### Cleaning Data

It's true that we made this data but let's look at it as if we didn't. 

The `Player Salary` column has valid values for US Dollars but there's a key issue with them: they're strings (`str`). In this section, we'll convert this data into a `float` data type. 

The next issue is the column names. `Player Name` and `Player Salary` work but I would prefer to name them a bit more pythonic like `name` and `salary` respectively. 

Let's start by importing our sample data from `1 - Pandas & Datasets`:

In [None]:
import pandas as pd
import random

# utils.py was created by us
import utils

In [None]:
# read sample data
df = pd.read_csv("samples/1.csv") 

> Are you missing the sample data? Be sure to [launch this code on Deepnote](https://deepnote.com/launch?url=https://github.com/codingforentrepreneurs/Try-Pandas)

Now, let's __change the column names__:

In [None]:
column_name_mapping = {
    "Player Name": "name",
    "Player Salary": "salary"
}



In [None]:
# we're using the first DataFrame from the top `df`.
renamed_df = df.rename(columns=column_name_mapping)

In [None]:
renamed_df.head()

The mapping is pretty simple: each `key` is an existing column name and each `value` is the new name you want for it.
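Here is a small self-contained sketch of the same idea, using made-up rows rather than the course sample data (the player names and salaries are invented for illustration). It shows that `rename` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Made-up rows for illustration; not the course sample data.
example_df = pd.DataFrame({
    "Player Name": ["Player A", "Player B"],
    "Player Salary": ["$1,000,000.00", "$2,500,000.50"],
})

example_mapping = {"Player Name": "name", "Player Salary": "salary"}
example_renamed = example_df.rename(columns=example_mapping)

print(list(example_df.columns))       # the original column names are unchanged
print(list(example_renamed.columns))  # the renamed copy
```

Because the original is untouched, you have to keep the returned DataFrame if you want the new names.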

Going forward we'll use the convention `df` instead of `renamed_df` so let's make a copy:

In [None]:
df = renamed_df.copy()

Now, let's convert a Dollar `string` into a `float`:

In [None]:
salary_example = "$30,707,056.00"
salary_replacements = salary_example.replace("$", "").replace(",", "_")
salary_replacements

As you see, I replaced commas `,` with underscores `_`. As you may know, you can write large values in Python using underscores to make it more human readable just like `100000000000` becomes `100_000_000_000`

In [None]:
salary_example_as_float = float(salary_replacements)
salary_example_as_float

Now that we have a `float` value, we can do further analysis. 
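As a quick check of the underscore behavior described above, the sketch below compares two cleaning styles on the same example string. It assumes Python 3.6 or newer, where underscores are accepted both in numeric literals and in strings passed to `float()`.

```python
example = "$30,707,056.00"

# Style 1: swap commas for underscores, as in the cell above.
with_underscores = float(example.replace("$", "").replace(",", "_"))

# Style 2: drop the commas entirely.
without_separators = float(example.replace("$", "").replace(",", ""))

# Both approaches parse to the same number.
print(with_underscores == without_separators)

# Underscores also work directly in numeric literals.
print(100_000_000_000 == 100000000000)
```

Either style works for `float()`; removing the commas is the safer choice if the cleaned string might later be handed to a parser that doesn't accept underscores.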

But this is just one hard-coded value. How do we do this in our `DataFrame`? There are actually a few ways to do this. We'll do it by adding a column to our dataset.

Before we make changes to a column, let's look at all of the values in it:

In [None]:
df['salary']

This shows us:
- How to grab data via column name (our renamed column of course)
- An example of Pandas `Series`
- DataFrame Index Values (based on our data).

All of the above we'll continue to look at in future videos. For now, we need to get *just* the list of values from the column we're getting data from. We'll do that with:

In [None]:
list(df['salary'].values)
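To make these pieces concrete, here is a minimal sketch on a made-up two-row DataFrame (the values are invented). It shows the `Series` type, its index, and `Series.tolist()`, which is another way to get a plain Python list.

```python
import pandas as pd

# Made-up values for illustration.
example_df = pd.DataFrame({"salary": ["$1,000.00", "$2,500.50"]})
salary_series = example_df["salary"]

print(type(salary_series).__name__)  # the column is a pandas Series
print(list(salary_series.index))     # default integer index labels
print(salary_series.tolist())        # plain Python list of the values
```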

So how would we convert all this data in pure python? Perhaps something like:

In [None]:
values = list(df['salary'].values)
new_values = []
for val in values:
    new_val = float(val.replace("$", "").replace(",", "_"))
    # utils.float_to_dollars does the reverse (float back to a dollar string)
    new_values.append(new_val)

print(new_values)

Let's bear something in mind here: the position (or index) of each value should correspond to its counterpart in our table values (i.e. `new_values[312]` should be the converted form of `values[312]`). Let's test that here:

In [None]:
# randint includes both endpoints, so stop at the last valid index
random_index = random.randint(0, len(values) - 1)
new_value_via_index = new_values[random_index]
new_value_in_dollars = utils.float_to_dollars(new_value_via_index)

assert new_value_in_dollars == values[random_index]
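The notebook does not show `utils.float_to_dollars`, so the function below is a hypothetical stand-in (an assumption about its behavior, not the actual `utils.py` code). It sketches the round trip the assertion relies on: string to float, then back to the same string.

```python
def float_to_dollars_sketch(value):
    # Hypothetical stand-in for utils.float_to_dollars:
    # dollar sign, comma separators, and two decimal places.
    return f"${value:,.2f}"


def dollars_to_float_sketch(text):
    # Strip the dollar sign and commas before converting.
    return float(text.replace("$", "").replace(",", ""))


original = "$30,707,056.00"
round_tripped = float_to_dollars_sketch(dollars_to_float_sketch(original))
print(round_tripped == original)
```

The round trip only matches when the source strings already use this exact format (two decimals, comma separators), which is the case for the example value here.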

Now, let's add these values as a new column in our DataFrame

In [None]:
df['salary_raw_py'] = new_values
df.head()

We can add new columns to a Pandas DataFrame using a familiar syntax (much like adding a new key to a Python dictionary, `dict`). In this case, the length of the values we added matches the number of rows in our DataFrame. We know this because the data *came from the DataFrame* in the first place.

Let's try to add arbitrary data. 

In [None]:
import datetime

this_year = datetime.datetime.now().year  # notice this is a single value, not a list
df['year'] = this_year

In [None]:
df.head()

So we now see two properties of a DataFrame that are pretty cool: you can add a new column from a single value, or from a list with one value per row.

How about data with only half the number of rows?

In [None]:
rows_length = df.shape[0]
# column_length = df.shape [1]
half_rows = int(rows_length * 0.5)
try:
    df['is_new'] = [True for x in range(0, half_rows)]
except Exception as e:
    print(e)

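Now we see that you can:
- Add a value for all rows from 1 value
- Add a value for all rows from a corresponding index value in another list

What you can't do is assign a list whose length doesn't match the number of rows: Pandas raises an error, which is the message printed above.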
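The cases in this section can be reproduced on a tiny made-up DataFrame. This is only a sketch: the column names `year`, `rank`, and `is_new` and all of the values are illustrative.

```python
import pandas as pd

# Made-up rows for illustration.
example_df = pd.DataFrame({"name": ["Player A", "Player B", "Player C", "Player D"]})

# A single value is repeated for every row.
example_df["year"] = 2024

# A list with one item per row is assigned position by position.
example_df["rank"] = [1, 2, 3, 4]

# A list with the wrong number of items is rejected.
length_error = None
try:
    example_df["is_new"] = [True, True]
except ValueError as error:
    length_error = error
    print(error)
```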

Everything we did above technically works, but it adds a lot of unnecessary steps that we can skip thanks to Pandas awesomeness.

In [None]:
def dollar_str_to_float(val):
    # in the future, this will be stored in
    # utils.py in the courses/ directory
    return float(val.replace("$", "").replace(",", "_"))

df['salary_as_float'] = df['salary'].apply(dollar_str_to_float)

Let's break this down:
- `df['salary_as_float']` is declaring our new column
- `df['salary']` is a reference to the values in a pre-existing column on this dataframe
- `.apply()` will run a function on *all* values in the referenced column. 
- `dollar_str_to_float` is a function that we pass the values to in order to get the correct result.
- The original `df['salary']` remains unchanged.

In [None]:
df.head()

You can also use a lambda to simplify this further:

```python
df['salary_via_apply_lambda'] = df['salary'].apply(lambda x: float(x.replace('$', '').replace(',', '')))
```
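Another option, not used in this notebook, is Pandas' vectorized string methods, which avoid writing a Python function at all. The sketch below uses made-up values and a hypothetical column name, `salary_via_str`.

```python
import pandas as pd

# Made-up values for illustration.
example_df = pd.DataFrame({"salary": ["$1,000.00", "$30,707,056.00"]})

# regex=False treats "$" as a literal character rather than a regex anchor.
cleaned = (
    example_df["salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)
example_df["salary_via_str"] = cleaned.astype(float)

# Compare with applying a plain Python function to each value.
via_apply = example_df["salary"].apply(
    lambda x: float(x.replace("$", "").replace(",", ""))
)
print(example_df["salary_via_str"].tolist() == via_apply.tolist())
```

Which version to prefer is mostly a matter of taste for small datasets; `.apply()` is easier to extend when the cleaning logic needs more than simple replacements.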

In [None]:
# Export to samples dir
# df.to_csv("samples/2.csv", index=False)
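If you do export, `index=False` keeps the row index out of the file. The sketch below shows the effect with an in-memory buffer instead of `samples/2.csv`, using made-up rows, so nothing is written to disk.

```python
import io

import pandas as pd

# Made-up rows for illustration.
example_df = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "salary_as_float": [1000.0, 2500.5],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
example_df.to_csv(buffer, index=False)

# Read it back; only the two data columns are present.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```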