# Data Cleaning
In this guide, we discuss tools for `missing data`, `duplicate data`, `string manipulation`, and some other `analytical data transformations`. 

1. Missing data
2. Duplicate data
3. String manipulation
4. Analytical data transformation

<hr style="border:2px solid gray">

## 0. Import Pandas library

In [1]:
import pandas as pd
import numpy as np

from numpy import random

<hr style="border:2px solid gray">

# 1. Key functions for missing data operations
Operations on missing data: we will `drop` missing data, `fill` it, or check whether it is `null` or `not null`.

- `dropna`: Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.<br><br>

- `fillna`: Fill in missing data with some value or using an interpolation method such as `'ffill'` or `'bfill'`.<br><br>

- `isnull`: Return boolean values indicating which values are missing/NA.<br><br>

- `notnull`: Negation of `isnull`<br><br>
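
The two checking functions in the list above are not demonstrated on their own later in this guide, so here is a minimal sketch. The Series values and variable names are illustrative assumptions; the point is that `isnull` and `notnull` return boolean masks that can be summed to count missing and present values.

```python
import numpy as np
import pandas as pd

# Small Series with two missing entries (illustrative values)
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

missing_mask = s.isnull()    # True where the value is NaN
present_mask = s.notnull()   # element-wise negation of isnull

# Booleans sum as 0/1, so the sum of a mask is a count
n_missing = int(missing_mask.sum())
n_present = int(present_mask.sum())

print(missing_mask.tolist())
print(n_missing, n_present)
```

Counting missing values this way is a common first step before deciding whether to drop or fill them.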

<hr style="border:2px solid gray">

# 2. dropna()

**2.1 Filtering out missing data in a Pandas Series**

In [2]:
#Create a Pandas Series
data = pd.Series([1, np.nan, 3.5, np.nan, 7])

In [3]:
#Drop missing data
data = data.dropna()

In [4]:
#View the data after removing missing fields
data

0    1.0
2    3.5
4    7.0
dtype: float64

In [5]:
#data.dropna() is equal to data[data.notnull()]
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64
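
Because `data` had already been cleaned before the comparison above, the equivalence is easier to see on a fresh Series that still contains missing values. The following is a small sketch with illustrative variable names (`raw`, `dropped`, `filtered`) that are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Fresh Series that still contains NaN values
raw = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = raw.dropna()            # drop missing entries
filtered = raw[raw.notnull()]     # boolean-mask filtering

# equals() compares values, index, and dtype
same = dropped.equals(filtered)
print(same)
```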

**2.2 Filtering out missing data in a Pandas DataFrame**

In [6]:
#Create the DataFrame
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],[np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

In [7]:
#View the data
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [8]:
#Clean the data
cleaned = data.dropna()

In [9]:
#View the cleaned data
cleaned

     0    1    2
0  1.0  6.5  3.0


In [10]:
#Passing how='all' will only drop rows that are all NA
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


A related way to filter DataFrame rows is often useful with time series data.
Suppose you want to keep only rows containing at least a certain number of non-missing observations.
You can specify this minimum with the `thresh` argument:

In [11]:
df = pd.DataFrame(np.random.randn(7, 3))

In [12]:
df.iloc[:4, 1] = np.nan

In [13]:
df.iloc[:2, 2] = np.nan

In [14]:
df

          0         1         2
0  0.384029       NaN       NaN
1 -0.892843       NaN       NaN
2  0.776398       NaN -0.503800
3 -1.289748       NaN  0.984805
4  0.159865  0.706756  0.254438
5 -0.748612  0.494608  0.090972
6  0.508892 -0.954282 -0.236480


In [15]:
df.dropna(thresh=2)

          0         1         2
2  0.776398       NaN -0.503800
3 -1.289748       NaN  0.984805
4  0.159865  0.706756  0.254438
5 -0.748612  0.494608  0.090972
6  0.508892 -0.954282 -0.236480
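
Since the example above uses random numbers, the exact output changes on every run. The sketch below uses a small fixed DataFrame (the column names and values are assumptions made for illustration) to show that `thresh=2` keeps exactly the rows with at least two non-missing values.

```python
import numpy as np
import pandas as pd

# Fixed data so the result is reproducible
df_fixed = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 5.0, 6.0],
    "c": [np.nan, 7.0, np.nan, 8.0],
})

# Number of non-missing values in each row
non_missing_per_row = df_fixed.notnull().sum(axis=1)

# Keep rows that have at least 2 non-missing values
kept = df_fixed.dropna(thresh=2)

print(non_missing_per_row.tolist())
print(kept.index.tolist())
```

Row 0 has only one non-missing value, so it is the only row removed.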


**2.3 Drop columns**

In [16]:
#Create the DataFrame
data = pd.DataFrame([[1., 6.5, np.nan], [1., np.nan, np.nan],[np.nan, np.nan, np.nan], [np.nan, 6.5, np.nan]])

In [17]:
#View the data
data

     0    1   2
0  1.0  6.5 NaN
1  1.0  NaN NaN
2  NaN  NaN NaN
3  NaN  6.5 NaN


In [18]:
#Drop a column with all NaN values
data.dropna(axis=1, how='all')

     0    1
0  1.0  6.5
1  1.0  NaN
2  NaN  NaN
3  NaN  6.5


<hr style="border:2px solid gray">

# 3. fillna()
Rather than filtering out missing data (and potentially discarding other data along with it), you may want to `fill` in the “holes” in any number of ways. For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a constant replaces missing values with that value:

**3.1 fillna() basic**

#View the data first
data

In [21]:
#View data after fillna()
data.fillna(0)

     0    1    2
0  1.0  6.5  0.0
1  1.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  6.5  0.0


In [27]:
#Create the DataFrame again
data = pd.DataFrame([[1., 6.5, np.nan], [1., np.nan, np.nan],[np.nan, np.nan, np.nan], [np.nan, 6.5, np.nan]])

In [28]:
#Calling fillna with a dict, you can use a different fill value for each column:
data.fillna({1: 0.5, 2: 0})

     0    1    2
0  1.0  6.5  0.0
1  1.0  0.5  0.0
2  NaN  0.5  0.0
3  NaN  6.5  0.0


In [31]:
#fillna returns a new object by default, but you can modify the existing object in place.
#With inplace=True the call returns None, so there is nothing to assign:
data.fillna(0, inplace=True)

In [32]:
#View the dataframe
data

     0    1    2
0  1.0  6.5  0.0
1  1.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  6.5  0.0
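
Section 1 mentions the `'ffill'` and `'bfill'` interpolation methods, but the examples so far only fill with constants. The sketch below uses the dedicated `ffill` and `bfill` methods, which recent pandas releases appear to favor over passing `method=` to `fillna`. The Series values and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the last valid value forward
forward = s.ffill()

# Backward fill: pull the next valid value backward;
# the trailing NaN stays because nothing follows it
backward = s.bfill()

# limit caps how many consecutive NaN values get filled
limited = s.ffill(limit=1)

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

Forward filling is often a reasonable choice for ordered data such as time series, where the most recent observation is a sensible stand-in.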


**3.2 fillna() with a mean/median etc.**<br>
With fillna you can do lots of other things with a little creativity. 
For example, you might pass the mean or median value of a Series:

In [34]:
#Create a Series
data = pd.Series([1., np.nan, 3.5, np.nan, 7])

In [35]:
#Fill with mean
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
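
The same idea works with the median, and with a DataFrame where each column is filled with its own mean. This is a minimal sketch; the DataFrame `df_xy` and its column names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Series: fill with the median of the non-missing values
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])
filled_median = s.fillna(s.median())

# DataFrame: df.mean() is a Series indexed by column name,
# so fillna fills each column with that column's mean
df_xy = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                      "y": [np.nan, 4.0, 8.0]})
filled_cols = df_xy.fillna(df_xy.mean())

print(filled_median.tolist())
print(filled_cols)
```

The median is less sensitive to outliers than the mean, which can matter when a column contains a few extreme values.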

<hr style="border:2px solid gray">

# 4. Removing Duplicates
The DataFrame method `duplicated` returns a boolean Series indicating whether each row is a duplicate (that is, whether it has already been observed in a previous row):

**4.1 Check for Duplicates**

In [37]:
#Create a dataframe
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],'k2': [1, 1, 2, 3, 3, 4, 4]})

In [38]:
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


In [39]:
#Check which rows are duplicated
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

**4.2 Drop Duplicates**

In [40]:
#Drop duplicates
data.drop_duplicates()

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4


Both of these methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates.

`duplicated` and `drop_duplicates` by default keep the first observed value combination. Passing `keep='last'` keeps the last occurrence instead:

In [42]:
#Keep the last row
data.drop_duplicates(['k1', 'k2'], keep='last')

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
6  two   4
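
The text above notes that duplicates can be detected on a subset of columns, which the earlier cells do not show directly. Here is a short sketch that rebuilds the same example data under the illustrative name `dup_df`. It also uses `keep=False`, which flags every member of a duplicate group rather than keeping one of them.

```python
import pandas as pd

dup_df = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                       'k2': [1, 1, 2, 3, 3, 4, 4]})

# Consider only column k1 when looking for duplicates;
# the first row for each distinct k1 value is kept
by_k1 = dup_df.drop_duplicates(subset=['k1'])

# keep=False marks all rows that belong to a duplicate group
all_dupes = dup_df.duplicated(keep=False)

print(by_k1.index.tolist())
print(all_dupes.tolist())
```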


<hr style="border:2px solid gray">

# 5. Drop Entries from an Axis

**Dropping entries from a Series**

In [13]:
#Create a new series
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [14]:
#Drop a single entry
obj.drop('c')

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [15]:
#Drop multiple entries
obj.drop(['d','e'])

a    0.0
b    1.0
c    2.0
dtype: float64

**Dropping entries from a DataFrame**<br>
With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [16]:
#Create a new dataframe
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [17]:
#View the dataframe
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [18]:
#Drop rows
data.drop(['Colorado', 'Ohio'])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


In [19]:
#Drop a column
data.drop('two', axis=1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


In [20]:
#Drop multiple columns
data.drop(['three', 'four'], axis='columns')

          one  two
Ohio        0    1
Colorado    4    5
Utah        8    9
New York   12   13


The `drop` method returns a copy of the DataFrame with the specified rows or columns removed; it does not modify the original DataFrame.

To change the original DataFrame, pass `inplace=True`.

In [21]:
#Drop column two
data.drop('two', axis=1, inplace=True)

In [22]:
#View the dataframe
data

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
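
An alternative to `inplace=True` is to assign the result of `drop` back to a variable. The sketch below rebuilds the example DataFrame under the illustrative name `frame` and uses the `columns=` and `index=` keywords, which do the same job as `axis=1` and the default row axis.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

# drop returns a new object; frame itself still has column 'two'
without_two = frame.drop(columns='two')
still_has_two = 'two' in frame.columns

# Reassigning the result is another way to update the variable
frame = frame.drop(index='Ohio')

print(without_two.columns.tolist())
print(still_has_two)
print(frame.index.tolist())
```

Reassignment keeps each step explicit and allows several operations to be chained, which some readers find easier to follow than in-place modification.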


<hr style="border:2px solid gray">

# 6. Replace values
Filling in missing data with the `fillna` method is a special case of more general value replacement. `map` can be used to modify a subset of values in an object, but `replace` provides a simpler and more flexible way to do so. Let’s consider this Series:

In [56]:
#Create a new series
data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [57]:
#The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, 
#we can use replace, producing a new Series (unless you pass inplace=True):
data.replace(-999, np.nan, inplace=True)

In [59]:
#If you want to replace multiple values at once, you instead pass a list and then the substitute value:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [60]:
#To use a different replacement for each value, pass a list of substitutes:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [61]:
#The argument passed can also be a dict:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
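
To make the opening claim of this section concrete, the sketch below checks that replacing `NaN` with `replace` gives the same result as `fillna`. It then chains `replace` and `dropna` to remove sentinel values in one step. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# fillna as a special case of replace
s = pd.Series([1.0, np.nan, 2.0, np.nan])
via_fillna = s.fillna(0)
via_replace = s.replace(np.nan, 0)
same = via_fillna.equals(via_replace)

# Sentinel cleanup: turn sentinels into NaN, then drop them
readings = pd.Series([1., -999., 2., -999., -1000., 3.])
cleaned = readings.replace([-999, -1000], np.nan).dropna()

print(same)
print(cleaned.tolist())
```

Converting sentinels to `NaN` first means every missing-data tool covered earlier (`dropna`, `fillna`, `isnull`) then applies to them.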

# End of sheet