# 2 - Cleaning Data with Python & Pandas


### Cleaning Data

It's true that we made this data but let's look at it as if we didn't. 

The `Player Salary` column holds valid US dollar amounts, but there's a key issue with them: they're strings (`str`). In this section, we'll convert this data into the `float` data type. 

The next issue is the column names. `Player Name` and `Player Salary` work, but I would prefer more Pythonic names: `name` and `salary`, respectively. 

Let's start by importing our sample data from `1 - Pandas & Datasets`:

In [18]:
import pandas as pd
import random

# utils.py was created by us
import utils

In [19]:
# read sample data
df = pd.read_csv("samples/1.csv") 

> Are you missing the sample data? Be sure to [launch this code on Deepnote](https://deepnote.com/launch?url=https://github.com/codingforentrepreneurs/Try-Pandas)

Now, let's __change the column names__:

In [20]:
column_name_mapping = {
    "Player Name": "name",
    "Player Salary": "salary"
}



In [21]:
# we're using the first DataFrame from the top `df`.
renamed_df = df.rename(columns=column_name_mapping)

In [22]:
renamed_df.head()

Unnamed: 0,name,salary
0,Player-0,"$23,564,932.00"
1,Player-1,"$19,122,655.00"
2,Player-2,"$9,926,467.00"
3,Player-3,"$44,055,782.00"
4,Player-4,"$41,113,231.00"


The mapping is pretty simple: each `key` is the current column name and each `value` is the name you want to rename it to.
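To see the rename step in isolation, here is a minimal sketch on a tiny stand-in DataFrame. The `demo_` names and the two rows of data are made up for illustration and are not part of the sample data. It also shows that `rename` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# A tiny stand-in for the sample data (values are made up for illustration).
demo_df = pd.DataFrame({
    "Player Name": ["Player-0", "Player-1"],
    "Player Salary": ["$1,000.00", "$2,500.50"],
})

# Keys are the current column names; values are the new names.
demo_mapping = {"Player Name": "name", "Player Salary": "salary"}
demo_renamed = demo_df.rename(columns=demo_mapping)

print(list(demo_df.columns))       # the original frame keeps its old names
print(list(demo_renamed.columns))  # the returned frame has the new names
```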

Going forward we'll use the convention `df` instead of `renamed_df` so let's make a copy:

In [23]:
df = renamed_df.copy()

Now, let's convert a Dollar `string` into a `float`:

In [24]:
salary_example = "$30,707,056.00"
salary_replacements = salary_example.replace("$", "").replace(",", "_")
salary_replacements

'30_707_056.00'

As you can see, I replaced commas `,` with underscores `_`. As you may know, Python lets you write large numbers with underscores to make them more readable, so `100000000000` can be written as `100_000_000_000`. `float()` accepts the same underscores when it parses a string, which is why the next step works.
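A quick sketch of that behavior, using only built-in Python. The variable name `parsed` is just for this illustration.

```python
# Underscores in numeric literals are ignored by Python.
print(100_000_000_000 == 100000000000)

# float() follows the same rule when parsing a string.
parsed = float("30_707_056.00")
print(parsed)

# Commas and currency symbols are not accepted, so they must be removed first.
try:
    float("$30,707,056.00")
except ValueError as error:
    print(f"ValueError: {error}")
```

Removing the commas entirely (replacing `,` with an empty string) would parse to the same number; the underscore just keeps the intermediate string readable.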

In [25]:
salary_example_as_float = float(salary_replacements)
salary_example_as_float

30707056.0

Now that we have a `float` value, we can do further analysis. 

But this is just one hard-coded value. How do we do this in our `DataFrame`? There are actually a few ways to do this. We'll do it by adding a column to our dataset.

Before we make changes to a column, let's look at all of its values:

In [26]:
df['salary']

0        $23,564,932.00
1        $19,122,655.00
2         $9,926,467.00
3        $44,055,782.00
4        $41,113,231.00
              ...      
30247    $25,666,586.00
30248       $649,549.00
30249    $18,001,656.00
30250    $16,715,266.00
30251    $28,971,015.00
Name: salary, Length: 30252, dtype: object

This shows us:
- How to grab data via column name (our renamed column of course)
- An example of Pandas `Series`
- DataFrame Index Values (based on our data).
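The three points above can be checked on a small stand-in frame. This is only a sketch: `demo_df` and its three values are invented for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({"salary": ["$1,000.00", "$2,500.50", "$300.00"]})

# Selecting one column by name returns a pandas Series.
salary_series = demo_df["salary"]

print(type(salary_series).__name__)
print(salary_series.dtype)         # a string-like dtype; the exact name depends on the pandas version
print(list(salary_series.index))   # the default index counts rows from 0
```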

We'll continue to look at all of the above in future videos. For now, we need *just* the list of values from the column. We'll get that with:

In [27]:
list(df['salary'].values)

['$23,564,932.00',
 '$19,122,655.00',
 '$9,926,467.00',
 '$44,055,782.00',
 '$41,113,231.00',
 '$38,515,477.00',
 '$24,941,389.00',
 '$4,216,796.00',
 '$38,472,958.00',
 '$37,072,615.00',
 '$24,627,994.00',
 '$15,656,467.00',
 '$44,230,432.00',
 '$28,898,355.00',
 '$9,360,168.00',
 '$40,550,209.00',
 '$47,099,973.00',
 '$44,857,116.00',
 '$18,961,730.00',
 '$6,411,070.00',
 '$22,392,884.00',
 '$38,419,389.00',
 '$5,147,141.00',
 '$14,440,973.00',
 '$7,987,556.00',
 '$20,799,362.00',
 '$36,259,015.00',
 '$47,256,440.00',
 '$16,706,382.00',
 '$26,181,563.00',
 '$20,982,648.00',
 '$31,290,916.00',
 '$45,526,224.00',
 '$11,161,954.00',
 '$4,703,055.00',
 '$4,500,971.00',
 '$32,615,395.00',
 '$32,661,868.00',
 '$13,141,247.00',
 '$45,928,956.00',
 '$1,794,957.00',
 '$29,988,773.00',
 '$35,721,507.00',
 '$7,200,865.00',
 '$44,741,268.00',
 '$2,035,660.00',
 '$31,591,784.00',
 '$14,269,633.00',
 '$14,685,822.00',
 '$44,226,547.00',
 '$20,484,781.00',
 '$10,309,771.00',
 '$5,464,299.00',
 '$15
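As an aside, pandas also provides `Series.tolist()`, which goes straight from a Series to a Python list. The sketch below uses a made-up two-value Series (`demo_series`) to compare the two approaches.

```python
import pandas as pd

demo_series = pd.Series(["$1,000.00", "$2,500.50"], name="salary")

via_values = list(demo_series.values)  # array of values, then converted to a list
via_tolist = demo_series.tolist()      # Series converted to a list directly

print(via_values == via_tolist)
```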

So how would we convert all this data in pure python? Perhaps something like:

In [28]:
values = list(df['salary'].values)
new_values = []
for val in values:
    new_val = float(val.replace("$", "").replace(",", "_"))
    # later, a helper such as dollar_str_to_float (defined below) can do this
    new_values.append(new_val)

print(new_values)

[23564932.0, 19122655.0, 9926467.0, 44055782.0, 41113231.0, 38515477.0, 24941389.0, 4216796.0, 38472958.0, 37072615.0, 24627994.0, 15656467.0, 44230432.0, 28898355.0, 9360168.0, 40550209.0, 47099973.0, 44857116.0, 18961730.0, 6411070.0, 22392884.0, 38419389.0, 5147141.0, 14440973.0, 7987556.0, 20799362.0, 36259015.0, 47256440.0, 16706382.0, 26181563.0, 20982648.0, 31290916.0, 45526224.0, 11161954.0, 4703055.0, 4500971.0, 32615395.0, 32661868.0, 13141247.0, 45928956.0, 1794957.0, 29988773.0, 35721507.0, 7200865.0, 44741268.0, 2035660.0, 31591784.0, 14269633.0, 14685822.0, 44226547.0, 20484781.0, 10309771.0, 5464299.0, 15884172.0, 48155892.0, 30958364.0, 8252851.0, 11937134.0, 19271965.0, 24433176.0, 22254664.0, 22248516.0, 28287889.0, 26329410.0, 25716286.0, 36676054.0, 3502251.0, 30615303.0, 10914569.0, 38700332.0, 33989418.0, 49887175.0, 40143675.0, 10489167.0, 37724535.0, 39281900.0, 45963678.0, 5021758.0, 5285819.0, 1207458.0, 36122811.0, 22091316.0, 41017086.0, 2043600.0, 8584491.0

Let's bear something in mind here: the position (or index) of each value should correspond to its counterpart in our original values (i.e. `new_values[312]` should be the float version of `values[312]`). Let's test that here: 

In [29]:
random_index = random.randint(0, len(values) - 1)  # randint includes both endpoints
new_value_via_index = new_values[random_index]
new_value_in_dollars = utils.float_to_dollars(new_value_via_index)

assert new_value_in_dollars == values[random_index]
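A single random index is a spot check. A minimal sketch of checking every position is shown below. Note that `demo_float_to_dollars` is a hypothetical stand-in written for this example; the real `utils.float_to_dollars` may be implemented differently.

```python
def demo_float_to_dollars(value):
    # Hypothetical stand-in for utils.float_to_dollars, e.g. 1000.0 -> "$1,000.00"
    return f"${value:,.2f}"

# Made-up sample values in the same format as the salary column.
demo_values = ["$23,564,932.00", "$649,549.00", "$9,926,467.00"]
demo_new_values = [float(v.replace("$", "").replace(",", "")) for v in demo_values]

# Compare every position instead of a single random one.
for original, converted in zip(demo_values, demo_new_values):
    assert demo_float_to_dollars(converted) == original

print("all positions match")
```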

Now, let's add these values as a new column in our DataFrame

In [30]:
df['salary_raw_py'] = new_values
df.head()

Unnamed: 0,name,salary,salary_raw_py
0,Player-0,"$23,564,932.00",23564932.0
1,Player-1,"$19,122,655.00",19122655.0
2,Player-2,"$9,926,467.00",9926467.0
3,Player-3,"$44,055,782.00",44055782.0
4,Player-4,"$41,113,231.00",41113231.0


As you can see, we can add new columns to a Pandas DataFrame using a familiar syntax (much like adding a new key to a Python dictionary `dict()`). In this case, the length of the values we added matches the number of rows in our DataFrame. We know this because the data *came from the DataFrame* in the first place.

Let's try to add arbitrary data. 

In [31]:
import datetime

this_year = datetime.datetime.now().year  # notice this is a single value, not a list
df['year'] = this_year

In [32]:
df.head()

Unnamed: 0,name,salary,salary_raw_py,year
0,Player-0,"$23,564,932.00",23564932.0,2021
1,Player-1,"$19,122,655.00",19122655.0,2021
2,Player-2,"$9,926,467.00",9926467.0,2021
3,Player-3,"$44,055,782.00",44055782.0,2021
4,Player-4,"$41,113,231.00",41113231.0,2021


So we now see two properties of a DataFrame that are pretty cool: you can add a new column from a single value, or from a list with one value per row.

How about a list with only half the number of rows?

In [33]:
rows_length = df.shape[0]
# column_length = df.shape [1]
half_rows = int(rows_length * 0.5)
try:
    df['is_new'] = [True for x in range(0, half_rows)]
except Exception as e:
    print(e)

Length of values (15126) does not match length of index (30252)


Now we see that you can:
- Add a column from 1 value, which is used for all rows
- Add a column from a list, as long as it has one value for each row

A list of any other length raises an error.
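The cases covered so far can be reproduced on a four-row stand-in frame. This is a sketch; `demo_df` and the `rank` column are made up for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({"name": ["Player-0", "Player-1", "Player-2", "Player-3"]})

demo_df["year"] = 2021          # one value, repeated for every row
demo_df["rank"] = [4, 3, 2, 1]  # one value per row, matched by position

try:
    demo_df["is_new"] = [True, True]  # 2 values for 4 rows
except ValueError as error:
    print(error)

print(demo_df)
```

Because the mismatched assignment fails, `demo_df` ends up with only the `name`, `year`, and `rank` columns.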

Everything we did above technically works, but it adds a lot of unnecessary steps that we can skip thanks to Pandas.

In [35]:
def dollar_str_to_float(val):
    # in the future, this will be stored in
    # utils.py in the courses/ directory
    return float(val.replace("$", "").replace(",", "_"))

df['salary_as_float'] = df['salary'].apply(dollar_str_to_float)

Let's break this down:
- `df['salary_as_float']` is declaring our new column
- `df['salary']` is a reference to the values in a pre-existing column on this dataframe
- `.apply()` will run a function on *all* values in the referenced column. 
- `dollar_str_to_float` is a function that we pass the values to in order to get the correct result.
- The original `df['salary']` remains unchanged.
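For comparison, here is a different technique from the `.apply()` approach above: pandas' built-in string methods (`Series.str.replace`) followed by `astype(float)`. This is only a sketch on a made-up two-row frame, and the column name `salary_via_str` is invented for this example.

```python
import pandas as pd

demo_df = pd.DataFrame({"salary": ["$23,564,932.00", "$649,549.00"]})

# regex=False treats "$" as a literal character rather than a regex anchor.
demo_df["salary_via_str"] = (
    demo_df["salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(demo_df["salary_via_str"].tolist())
```

Both approaches leave the original `salary` column unchanged; `.apply()` is the more general tool because it can run any Python function.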

In [36]:
df.head()

Unnamed: 0,name,salary,salary_raw_py,year,salary_as_float
0,Player-0,"$23,564,932.00",23564932.0,2021,23564932.0
1,Player-1,"$19,122,655.00",19122655.0,2021,19122655.0
2,Player-2,"$9,926,467.00",9926467.0,2021,9926467.0
3,Player-3,"$44,055,782.00",44055782.0,2021,44055782.0
4,Player-4,"$41,113,231.00",41113231.0,2021,41113231.0


You can also use a lambda to simplify this further:

```python
df['salary_via_apply_lambda'] = df['salary'].apply(lambda x: float(x.replace('$', '').replace(',', '')))
```

In [47]:
# Export to samples dir
# df.to_csv("samples/2.csv", index=False)
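The `index=False` argument in the commented-out export matters. The sketch below writes a one-row stand-in frame to an in-memory buffer (instead of a file) with and without the index, then reads each back. It is an assumption, not something confirmed here, that the `Unnamed: 0` column seen earlier comes from a CSV that was saved with its index.

```python
import io

import pandas as pd

demo_df = pd.DataFrame({"name": ["Player-0"], "salary_as_float": [23564932.0]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the actual columns are written.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```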