# Cleaning data using Pandas

#### Description:

This codebook covers methods for cleaning data using Pandas.

#### Skill level:

- Beginner

### Import the required libraries
-------------------------

In [1]:
import os
import sys

platform_path = os.path.abspath(os.path.join(os.path.abspath(''), '../../../'))
sys.path.append(platform_path)

In [2]:
import pandas as pd
import numpy as np

from pandas.api.types import is_numeric_dtype

### Read data into a dataframe
-------------------------

#### References:
- Simple useage examples from the Pandas docs: https://pandas.pydata.org/docs/user_guide/basics.html

In [3]:
df_raw = pd.read_csv(os.path.join(platform_path, 'DATA/simple_data.csv'))

### Check the shape and head of the dataframe
-------------------------

#### Notes:
- Since we are dealing with a small dataframe, we will inspect the whole dataset. Otherwise, you may want to consider using df_raw.head() to inspect only the first five rows.

In [4]:
df_raw.shape

(10, 4)

In [5]:
df_raw

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


### Check common statistics for numeric columns
-------------------------

In [6]:
df_raw.describe()

Unnamed: 0,Age,Salary
count,9.0,9.0
mean,38.777778,63777.777778
std,7.693793,12265.579662
min,27.0,48000.0
25%,35.0,54000.0
50%,38.0,61000.0
75%,44.0,72000.0
max,50.0,83000.0


### Make a copy of the dataframe which we can clean the data for
-------------------------

In [7]:
df = df_raw

### Check each dataframe column for missing values
-------------------------

In [8]:
for col in df.columns:
    print(col + ':', df[col].isnull().values.sum())

Country: 0
Age: 1
Salary: 1
Purchased: 1


### Replace missing values for specific numeric columns using the column mean
-------------------------

In [9]:
cols = ['Age']

for col in cols:
    df[col].fillna((df[col].mean()), inplace=True)

### Replace missing values for specific columns using the most common column value
-------------------------

In [10]:
cols = ['Salary']

for col in df.columns:
    df[col].fillna((df[col].mode()), inplace=True)

### Drop rows where a specific column has missing values
-------------------------

In [11]:
cols = ['Purchased']

df = df.dropna(subset=cols)

### Re-check each dataframe column for missing values
-------------------------

In [12]:
for col in df.columns:
    print(col + ':', df[col].isnull().values.sum())

Country: 0
Age: 0
Salary: 0
Purchased: 0


### Check each numeric column for outliers
-------------------------

#### Notes:
- You should be careful how you define an 'outlier', here we are defining an outlier as anything greater than one standard deviation beyond the mean

In [13]:
std_devs = 1

for col in df.columns:
    if is_numeric_dtype(df[col]):
        print(col + ':', df.loc[np.abs(df[col] - df[col].mean()) > (std_devs * df[col].std()), col].count())

Age: 4
Salary: 3


### Truncate outliers in each numeric column
-------------------------

In [14]:
std_devs = 1

for col in df.columns:
    if is_numeric_dtype(df[col]):
        vals_wo_outliers = df.loc[np.abs(df[col] - df[col].mean()) < (std_devs * df[col].std()), col].values
        
        if len(vals_wo_outliers) > 0:
            vals_wo_outliers_min = vals_wo_outliers.min()
            vals_wo_outliers_max = vals_wo_outliers.max()
        
            df.loc[df[col] < vals_wo_outliers_min, col] = vals_wo_outliers_min
            df.loc[df[col] > vals_wo_outliers_max, col] = vals_wo_outliers_max

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in th

### Re-check common statistics for each numeric column
-------------------------

In [15]:
df.describe()

Unnamed: 0,Age,Salary
count,9.0,9.0
mean,39.197531,62222.222222
std,3.995282,8700.255424
min,35.0,52000.0
25%,35.0,54000.0
50%,38.777778,61000.0
75%,44.0,72000.0
max,44.0,72000.0
