# 06 Cleaning Data
__Math 3080: Fundamentals of Data Science__

Reading:
* McKinney, Chapter 7 - Data Cleaning and Preparation

Outline:
1. Unlabeled data
2. Missing data
    1. Locating missing data
    2. Filling missing values
    3. Dropping missing values
    4. Removing duplicate observations
3. Mapping

-----
If we are lucky, datasets are perfect and ready to use. Most of the time, however, there are problems:
* Unlabeled data
* Missing data
* Unorganized data

In this segment, we will look at how to document unlabeled data and how to clean the data. We will address data wrangling, which deals with unorganized data, in the next lecture.

## 6.1 Unlabeled data
Consider the following dataset:

In [1]:
import pandas as pd

df = pd.read_csv('../Datasets/unlabeled_data.csv', header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,71,72,73,74,75,76,77,78,79,80
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


With columns named only 0 through 80, there is no way to tell what each variable measures. All data should have two things: labeled columns and documentation. Now, consider the original dataset:

In [2]:
df = pd.read_csv('../Datasets/Housing_Data/train.csv')
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


* Open the file ./math3080/Datasets/Housing_Data/data_description.txt

Every dataset should come with labeled columns and with another file that provides more information for each variable.
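
If a file arrives without a header row, the column labels can be supplied when the file is read. The following is a minimal sketch: the tiny inline CSV and the three column names chosen for it are made up for illustration and do not come from the course files.

```python
import io
import pandas as pd

# Hypothetical three-column file with no header row (illustration only)
raw = io.StringIO("1,60,8450\n2,20,9600\n3,60,11250\n")

# header=None tells pandas the first line is data; names= supplies the labels
labeled = pd.read_csv(raw, header=None, names=['Id', 'MSSubClass', 'LotArea'])
print(labeled.columns.tolist())
```

The same labels could also be assigned after loading by setting `labeled.columns` to a list of names.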

## 6.2 Missing data
Another potential problem is that values could be missing. (Unorganized data, where the data is just a list, values are not in the right place, or values have the wrong units, is the subject of the next lecture.) Before anything else can be done, we have to find the missing values and decide how to handle them.

#### 6.2.1 Locating missing data
How do we determine if data is missing? Missing values can take any of the following forms:
* An extreme number (-9999)
* NaN (Not a Number)
  * In Python, a NaN is produced using `np.nan`
* Blank entries (no information) - programs usually fill these with NaN
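
As a rough sketch of how these forms can be handled at load time, `pd.read_csv` accepts an `na_values` argument listing extra values to treat as missing. The inline CSV below is invented for illustration; it mixes a -9999 sentinel with a blank entry.

```python
import io
import pandas as pd

# Hypothetical file that marks missing values with -9999 and with a blank entry
raw = io.StringIO("a,b\n1,-9999\n,4\n3,5\n")

# na_values adds -9999 to the values pandas already treats as missing (such as blanks)
sentinel_df = pd.read_csv(raw, na_values=[-9999])
print(sentinel_df.isnull().sum())
```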

Let's look at the following dataset on passengers of Titanic:

In [1]:
import pandas as pd

titanic = pd.read_csv('../Datasets/Titanic/titanic_train.csv')
titanic

There are definitely some missing values in the __Age__ and __Cabin__ columns. Are there any others?

We can test for missing values using the `.isnull()` method. The result will be a DataFrame of True/False values indicating whether each value is missing (True) or not missing (False).

In [5]:
titanic.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


By taking the sum of this `.isnull()` DataFrame, we can find out exactly how many elements are missing in each column.

In [6]:
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
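
Counts are easier to judge as a share of the column. A minimal sketch, using a small made-up DataFrame rather than the Titanic file: since True counts as 1 and False as 0, taking the mean instead of the sum gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for a real dataset
toy = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan],
                    'y': [1.0, 2.0, 3.0, 4.0]})

# True counts as 1, so the mean of the boolean frame is the fraction missing
fraction_missing = toy.isnull().mean()
print(fraction_missing)
```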

#### 6.2.2 Filling missing values
We can't just leave these missing values in place, otherwise they will affect our analysis. Here are several ways we can handle them:
* Replacing a value or a missing value with another predetermined number
* Filling missing values with the column/row average
  * If there are few missing values, and all the other values in the column/row are random
  * Using row averages will be rare since rows are generally composed of different variables
* Filling missing values with the previous/next value
  * If there are few missing values, and there is an order to the values in that variable
* Filling missing values with the average of the previous and next values
  * If there are few missing values, and there is an order to the values in that variable
* Filling missing values with an interpolation of values
  * Looks at the pattern and fills with a value that fits in that pattern
* Removing NaN rows and columns
  * If it doesn't make sense to fill them
  * If that column is needed but missing values will cause problems
  * If there are too many missing values
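
The "average of previous and next values" option has no single dedicated pandas command that I am certain of, but it can be sketched by combining a forward fill and a backward fill. The Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 6.0, 7.0])

# ffill() carries the previous value forward, bfill() carries the next one back;
# averaging the two fills each gap with the mean of its neighbors
neighbor_avg = (s.ffill() + s.bfill()) / 2
print(neighbor_avg.tolist())
```

Values that were not missing are unchanged, because the forward and backward fills agree there. A gap at the very start or end stays NaN, since one of its neighbors does not exist.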

Following are commands to fill missing values:

In [None]:
import numpy as np

# Replace all -999 entries with np.nan
titanic.replace(-999, np.nan)

# Fill all missing values with a 0
titanic.fillna(value=0)

# Fill all missing values with the value stored in "the_mean"
the_mean = titanic['Age'].mean()
titanic.fillna(value=the_mean)

# Fill all missing values in the "Age" column with the column mean
# (a mean only exists for numeric columns, so not for a text column such as "Embarked")
titanic['Age'].fillna(value=titanic['Age'].mean())

# Fill all missing values in "Age" with a linear interpolation of the variable
titanic['Age'].interpolate(method='linear')

Following are commands to drop missing values:

In [None]:
# Drop rows with NaN values
titanic.dropna(axis=0) 

# Drop columns with NaN values
titanic.dropna(axis=1) 

# Keep only columns with at least 10 non-NaN values (drop the rest)
titanic.dropna(axis=1, thresh=10) 
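
The `thresh` argument is easy to misread: it counts the non-missing values required to keep a row or column, not the number of NaN values allowed. A small sketch with a made-up DataFrame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'full':   [1.0, 2.0, 3.0],
                    'one_na': [1.0, np.nan, 3.0],
                    'two_na': [np.nan, np.nan, 3.0]})

# thresh counts non-missing values: keep columns that have at least 2 of them
kept = toy.dropna(axis=1, thresh=2)
print(kept.columns.tolist())
```

Here `two_na` is dropped because it has only one non-missing value.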

To see how this works, let's work with a smaller dataset:

In [19]:
import numpy as np
import pandas as pd

missing_data = np.array([[1,1,2,1,1,1],
                         [1,3,-999,2,9,2],
                         [2,4,-999,-999,1,3],
                         [9,-999,-999,3,5,4],
                         [4,3,2,3,-999,-999],
                         [3,7,-999,-999,2,6],
                         [1,1,2,3.5,5,1],
                         [1,1,2,3.5,5,1]])

missing = pd.DataFrame(missing_data)
missing.columns=['Variable1','Variable2','Variable3','Variable4','Variable5','Variable6']
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,1.0,1.0,1.0
1,1.0,3.0,-999.0,2.0,9.0,2.0
2,2.0,4.0,-999.0,-999.0,1.0,3.0
3,9.0,-999.0,-999.0,3.0,5.0,4.0
4,4.0,3.0,2.0,3.0,-999.0,-999.0
5,3.0,7.0,-999.0,-999.0,2.0,6.0
6,1.0,1.0,2.0,3.5,5.0,1.0
7,1.0,1.0,2.0,3.5,5.0,1.0


Notice that the missing values are reported as -999. These won't register as NaN:

In [20]:
missing.isna()

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False


So, we first need to change the -999 values to something the computer recognizes as NaN. We do this using the `replace` method:

In [21]:
missing.replace(-999,np.nan)

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,1.0,1.0,1.0
1,1.0,3.0,,2.0,9.0,2.0
2,2.0,4.0,,,1.0,3.0
3,9.0,,,3.0,5.0,4.0
4,4.0,3.0,2.0,3.0,,
5,3.0,7.0,,,2.0,6.0
6,1.0,1.0,2.0,3.5,5.0,1.0
7,1.0,1.0,2.0,3.5,5.0,1.0


But notice that this isn't a permanent change.

In [22]:
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,1.0,1.0,1.0
1,1.0,3.0,-999.0,2.0,9.0,2.0
2,2.0,4.0,-999.0,-999.0,1.0,3.0
3,9.0,-999.0,-999.0,3.0,5.0,4.0
4,4.0,3.0,2.0,3.0,-999.0,-999.0
5,3.0,7.0,-999.0,-999.0,2.0,6.0
6,1.0,1.0,2.0,3.5,5.0,1.0
7,1.0,1.0,2.0,3.5,5.0,1.0


To make it permanent, add the `inplace=True` argument. This will be a common argument we'll use in data wrangling.

In [23]:
missing.replace(-999, np.nan, inplace=True)
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,1.0,1.0,1.0
1,1.0,3.0,,2.0,9.0,2.0
2,2.0,4.0,,,1.0,3.0
3,9.0,,,3.0,5.0,4.0
4,4.0,3.0,2.0,3.0,,
5,3.0,7.0,,,2.0,6.0
6,1.0,1.0,2.0,3.5,5.0,1.0
7,1.0,1.0,2.0,3.5,5.0,1.0
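
Another way to keep a change is to assign the result back to a variable instead of passing `inplace=True`. A minimal sketch on a made-up one-column DataFrame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'v': [1.0, -999.0, 3.0]})

# replace() returns a new DataFrame; rebinding the name keeps the change
toy = toy.replace(-999, np.nan)
print(toy['v'].isna().sum())
```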


Now, we can deal with the missing numbers.

In [24]:
missing.isna()

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,False,False,False,False,False,False
1,False,False,True,False,False,False
2,False,False,True,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,True,True
5,False,False,True,True,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False


Let's start with Variable5. It has one missing value that can be replaced with the column mean.

In [26]:
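# Assign the filled column back; fillna(inplace=True) on a selected column
# may not update the DataFrame in newer pandas versions
missing['Variable5'] = missing['Variable5'].fillna(value=missing['Variable5'].mean())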
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,1.0,1.0,1.0
1,1.0,3.0,,2.0,9.0,2.0
2,2.0,4.0,,,1.0,3.0
3,9.0,,,3.0,5.0,4.0
4,4.0,3.0,2.0,3.0,4.0,
5,3.0,7.0,,,2.0,6.0
6,1.0,1.0,2.0,3.5,5.0,1.0
7,1.0,1.0,2.0,3.5,5.0,1.0


But if there's an obvious pattern (as in time series data such as stock values), we may want the filled data to match the pattern. This is the case with Variable6:

In [7]:
# Assign the interpolated column back rather than using inplace=True on a selected column
missing['Variable6'] = missing['Variable6'].interpolate(method='linear')
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,1.0,1.0,1.0
1,1.0,3.0,,2.0,9.0,2.0
2,2.0,4.0,,,1.0,3.0
3,9.0,,,3.0,5.0,4.0
4,4.0,3.0,2.0,3.0,4.0,5.0
5,3.0,7.0,,,2.0,6.0
6,1.0,1.0,2.0,3.5,5.0,1.0
7,1.0,1.0,2.0,3.5,5.0,1.0
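
Linear interpolation also handles several missing values in a row by spacing the filled values evenly between the known neighbors. A small sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation spaces the filled values evenly between the known neighbors
filled = s.interpolate(method='linear')
print(filled.tolist())
```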


__Fill with previous value__

Sometimes, if the data changes in a set pattern, a missing value could be interpreted as just not having changed. This would be the case with the missing values in Variable4. To fix this, we just fill each missing value with the previous value.

In [8]:
# ffill() fills each NaN with the previous value; fillna(method="ffill") is deprecated in recent pandas
missing['Variable4'] = missing['Variable4'].ffill()
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,1.0,1.0,1.0
1,1.0,3.0,,2.0,9.0,2.0
2,2.0,4.0,,2.0,1.0,3.0
3,9.0,,,3.0,5.0,4.0
4,4.0,3.0,2.0,3.0,4.0,5.0
5,3.0,7.0,,3.0,2.0,6.0
6,1.0,1.0,2.0,3.5,5.0,1.0
7,1.0,1.0,2.0,3.5,5.0,1.0
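
Filling with the next value works the same way in the other direction. The sketch below, on a made-up Series, compares the two; note that a forward fill cannot fill a gap in the first position because there is no previous value.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])

forward = s.ffill()    # previous value carried forward; the leading NaN remains
backward = s.bfill()   # next value carried backward
print(forward.tolist())
print(backward.tolist())
```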


#### 6.2.3 Dropping missing values
Sometimes it doesn't make sense to fill values, or there are just so many missing values that the variable (or observation) in question is useless. If we assume that it doesn't make sense to fill observations in Variable2, then we would just want to drop observation 3:

In [69]:
missing.dropna(axis=0)

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,1.0,1.0,1.0
4,4.0,3.0,2.0,3.0,4.0,5.0
6,1.0,1.0,2.0,3.5,5.0,1.0
7,1.0,1.0,2.0,3.5,5.0,1.0


But this dropped every observation that had any NaN at all, including observations 1, 2, and 5. We want to keep those observations since they have valid numbers for Variable2. We can instead drop only the observations with missing values for Variable2.

In [70]:
missing.dropna(axis=0, subset=['Variable2'], inplace=True)
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,1.0,1.0,1.0
1,1.0,3.0,,2.0,9.0,2.0
2,2.0,4.0,,2.0,1.0,3.0
4,4.0,3.0,2.0,3.0,4.0,5.0
5,3.0,7.0,,3.0,2.0,6.0
6,1.0,1.0,2.0,3.5,5.0,1.0
7,1.0,1.0,2.0,3.5,5.0,1.0


Let's also look at Variable 3. Even though an average might make sense to fill the value, there are so many missing values that this column is pretty much useless. So, let's just remove it.

In [71]:
missing.drop('Variable3', axis=1, inplace=True)
missing

Unnamed: 0,Variable1,Variable2,Variable4,Variable5,Variable6
0,1.0,1.0,1.0,1.0,1.0
1,1.0,3.0,2.0,9.0,2.0
2,2.0,4.0,2.0,1.0,3.0
4,4.0,3.0,3.0,4.0,5.0
5,3.0,7.0,3.0,2.0,6.0
6,1.0,1.0,3.5,5.0,1.0
7,1.0,1.0,3.5,5.0,1.0


#### 6.2.4 Removing duplicate observations
Another common problem is duplicate data. Notice that in our original data, the last two rows are the same. In some cases, this could be valid. However, in most circumstances a repeated row is a suspicious entry. We can remove individual rows, but it is simpler to remove all duplicates at once.

In [72]:
missing.drop_duplicates(inplace=True)
missing

Unnamed: 0,Variable1,Variable2,Variable4,Variable5,Variable6
0,1.0,1.0,1.0,1.0,1.0
1,1.0,3.0,2.0,9.0,2.0
2,2.0,4.0,2.0,1.0,3.0
4,4.0,3.0,3.0,4.0,5.0
5,3.0,7.0,3.0,2.0,6.0
6,1.0,1.0,3.5,5.0,1.0


And there we are. Our data is now cleaned and ready to use.

-----
Let's take a break. I'd love some feedback for the course so far:
* [https://drive.google.com/file/d/1wMq_eWU3jZKyALEQoRMonhZmW70n9QN5/view?usp=sharing](https://drive.google.com/file/d/1wMq_eWU3jZKyALEQoRMonhZmW70n9QN5/view?usp=sharing)

-----