<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/2_Advanced/02_Pandas_Data_Cleaning.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Pandas Data Cleaning

Load data.

In [3]:
# Importing Libraries
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

## Handling Missing Data

### Review

We covered this in the basics section; this is just a refresher on how we've handled null values before. Feel free to skip it.

#### Notes

- `df.dropna()`: Drop missing values.

#### Examples

Here we drop a row only if all of its values are missing.

In [None]:
df_cleaned = df.dropna(how='all')
df_cleaned.info()
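
To see what `how='all'` does compared to the default `how='any'`, here is a minimal sketch on a small made-up DataFrame. The `toy` data is hypothetical and only borrows column names from our dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row is missing every value
toy = pd.DataFrame({
    'job_title': ['Data Analyst', 'Data Engineer', None],
    'salary_year_avg': [90000.0, np.nan, np.nan],
})

# how='all' drops a row only when every column is missing
dropped_all = toy.dropna(how='all')

# how='any' (the default) drops a row when at least one column is missing
dropped_any = toy.dropna(how='any')

print(len(dropped_all))  # 2
print(len(dropped_any))  # 1
```

With `how='any'` we would lose every row that lacks salary info, which is why the stricter `how='all'` is the safer choice for a first pass.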

Right now all we can do is drop rows with missing values. That isn't useful here because our DataFrame doesn't have any rows where every value is missing.

So, what if we wanted to fill the missing values with something else? This is especially useful for avoiding errors when working with NaN values.
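
Before filling anything, it can help to check how many values are actually missing in each column. A minimal sketch, again using a hypothetical `toy` DataFrame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing salaries
toy = pd.DataFrame({
    'job_title': ['Data Analyst', 'Data Engineer', 'Data Scientist'],
    'salary_year_avg': [90000.0, np.nan, np.nan],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` call on `df_cleaned` would show which columns are candidates for filling.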

### Fillna

#### Notes

- `df.fillna()`: Fill missing values.

#### Examples

Let's fill in the rows with no salary info (the columns `salary_rate`, `salary_year_avg`, and `salary_hour_avg` contain NaN values) with 0.

First we'll look at a few rows in these 3 columns, so we can compare before and after.

We'll use `iloc` to look at the first 10 rows (`:10`) and the salary information columns (`11:14`).

In [None]:
df_cleaned.iloc[:10,11:14]
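
If the `iloc` slicing is unclear, this small sketch shows the same pattern on a hypothetical four-column DataFrame. The first slice selects rows by position and the second selects columns by position, and in both cases the end position is excluded.

```python
import pandas as pd

# Hypothetical toy data with four columns at positions 0 to 3
toy = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9],
    'd': [10, 11, 12],
})

# Rows at positions 0 and 1, columns at positions 1 and 2
subset = toy.iloc[:2, 1:3]
print(subset)
```

So `11:14` in our example means the columns at positions 11, 12, and 13.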

We fill the values for the 3 columns with 0 using `fillna()`.

In [None]:
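# Columns whose missing values should be replaced with 0
fill_values = ['salary_rate', 'salary_year_avg', 'salary_hour_avg']
# Passing a dict limits the fill to these columns only
df_filled = df_cleaned.fillna({col: 0 for col in fill_values})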
df_filled.info()

Now if we compare the results using `iloc` again on our new DataFrame, we see that the previous NaN values in the columns (`salary_rate`, `salary_year_avg`, `salary_hour_avg`) have been replaced with 0.

In [None]:
df_filled.iloc[:10,11:14]
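
`fillna()` also accepts a dictionary that maps column names to fill values, which limits the fill to the columns you name. Here is a minimal sketch on a hypothetical `toy` DataFrame; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in several columns
toy = pd.DataFrame({
    'job_title': ['Data Analyst', None, 'Data Scientist'],
    'salary_year_avg': [90000.0, np.nan, 120000.0],
    'salary_hour_avg': [np.nan, 45.0, np.nan],
})

salary_cols = ['salary_year_avg', 'salary_hour_avg']

# Only the salary columns are filled; job_title keeps its missing value
filled = toy.fillna({col: 0 for col in salary_cols})
print(filled)
```

This keeps a missing `job_title` visible as missing instead of turning it into a 0, which would be misleading in a text column.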

## Drop Duplicates

### Notes

* `drop_duplicates()`: Remove duplicate rows.
* Analysts often need to clean up data, and one of the most common issues we run into is duplicate rows.

### Examples

Now that we've dealt with NaN values, let's continue cleaning the data by removing any duplicate rows.

In [None]:
df_unique = df_filled.drop_duplicates()
df_unique.info()

Compare that with the original DataFrame, which had 787686 entries: the new DataFrame `df_unique` has 787577 entries, so 109 duplicate rows were removed.

Now let's see what happens if we remove duplicates based only on the `job_title` column.
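
You can also count duplicates before dropping them by using `duplicated()`. A minimal sketch on hypothetical data (the `toy` DataFrame and its `company_name` values are made up):

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first exactly
toy = pd.DataFrame({
    'job_title': ['Data Analyst', 'Data Analyst', 'Data Engineer'],
    'company_name': ['Acme', 'Acme', 'Acme'],
})

# duplicated() flags rows that are identical to an earlier row
n_duplicates = toy.duplicated().sum()

toy_unique = toy.drop_duplicates()

print(n_duplicates)                 # 1
print(len(toy) - len(toy_unique))   # 1
```

The two numbers should always match, so this is a quick sanity check on how many rows `drop_duplicates()` removed.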

In [None]:
df_unique = df_filled.drop_duplicates(subset=['job_title'])
df_unique.info()

Looking at it now, we removed quite a few rows. Each remaining row has a unique `job_title`.

In [None]:
df_unique.head(10)

For our example we don't really need to remove any duplicates right now, but it's important to understand the concept.
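
One detail worth knowing about `subset`: when several rows share the same `job_title`, pandas keeps the first one by default, and the `keep` parameter changes that. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data: two postings share the same job_title
toy = pd.DataFrame({
    'job_title': ['Data Analyst', 'Data Analyst', 'Data Engineer'],
    'salary_year_avg': [90000.0, 95000.0, 110000.0],
})

# Default keep='first' keeps the first row for each job_title
first_kept = toy.drop_duplicates(subset=['job_title'])

# keep='last' keeps the last row for each job_title instead
last_kept = toy.drop_duplicates(subset=['job_title'], keep='last')

print(first_kept['salary_year_avg'].tolist())  # [90000.0, 110000.0]
print(last_kept['salary_year_avg'].tolist())   # [95000.0, 110000.0]
```

Which row survives depends on the row order, so sort the DataFrame first if you care about which posting is kept.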