# Lecture 9: Cleaning Data
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney, *Python for Data Analysis*, Chapter 6](https://wesmckinney.com/book/accessing-data)
* Chapter 7

Class notes are available on GitHub. Changes to the notes are uploaded to GitHub automatically as they are made. A link to the repository is on Canvas.

-----
## Outline
* Missing data
  * Locating/Identifying missing data
  * Ways to handle missing data
* Formats
  * Changing str to int
  * DateTime

Before data can be analyzed or used for modeling, it needs to be prepared. We call this __preprocessing__, and it has two parts:
1. Cleaning the Data
    * Handling missing data
    * Formats
      * str to int/float
      * DateTime
2. Data Wrangling
    * Encoding categorical data
    * Rearranging data
    * Combining datasets

-----
## Missing Data

Missing data can show up in a dataset in several ways:
* Common in coding: `NaN` (Not a Number)
* Blank entries (pandas fills these in with `NaN` automatically when it loads the data)
* Large, unreasonable placeholder values (for example, `9999`)
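
As a minimal sketch of how the last two cases can be handled while loading a file: the CSV text, column names, and the choice of `9999` as the placeholder below are made up for illustration. `pd.read_csv` treats blank fields as `NaN` by default, and its `na_values` argument lets us name additional placeholders to treat as missing.

```python
import io
import pandas as pd

# Hypothetical CSV text: a blank field and the placeholder 9999 both mean "missing".
csv_text = """day,customers,revenue
Monday,62,321.45
Tuesday,,295.74
Wednesday,9999,441.24
"""

# Blank fields become NaN by default; na_values adds 9999 as another missing marker.
sales = pd.read_csv(io.StringIO(csv_text), na_values=[9999])

# Count how many customer entries ended up missing.
missing_customers = sales["customers"].isna().sum()

print(sales)
print("Missing customer entries:", missing_customers)
```

Note that `na_values` applies to every column, so this only makes sense when `9999` can never be a legitimate value anywhere in the file.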

In [1]:
import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    'day': ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],
    'Number of Customers': [62,54,71,9999,65,9999,52],
    'Revenue': [321.45, 295.74, 441.24, 9999, 512.64, 652.31, 512.04],
    'Shoplifters': [9999, 9999, 2, 9999, 9999, 5, 1],
    'Expenses': [51.40, 53.75, 9999, 59.63, 61.42, 64.25, 65.12]
})
display(dataset)

| | day | Number of Customers | Revenue | Shoplifters | Expenses |
|---|---|---|---|---|---|
| 0 | Monday | 62 | 321.45 | 9999 | 51.4 |
| 1 | Tuesday | 54 | 295.74 | 9999 | 53.75 |
| 2 | Wednesday | 71 | 441.24 | 2 | 9999.0 |
| 3 | Thursday | 9999 | 9999.0 | 9999 | 59.63 |
| 4 | Friday | 65 | 512.64 | 9999 | 61.42 |
| 5 | Saturday | 9999 | 652.31 | 5 | 64.25 |
| 6 | Sunday | 52 | 512.04 | 1 | 65.12 |


When programming, it is most helpful to convert these placeholders to `NaN` entries so they don't distort our calculations.

In [4]:
dataset.replace(9999, np.nan, inplace=True)
dataset

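| | day | Number of Customers | Revenue | Shoplifters | Expenses |
|---|---|---|---|---|---|
| 0 | Monday | 62.0 | 321.45 | NaN | 51.4 |
| 1 | Tuesday | 54.0 | 295.74 | NaN | 53.75 |
| 2 | Wednesday | 71.0 | 441.24 | 2.0 | NaN |
| 3 | Thursday | NaN | NaN | NaN | 59.63 |
| 4 | Friday | 65.0 | 512.64 | NaN | 61.42 |
| 5 | Saturday | NaN | 652.31 | 5.0 | 64.25 |
| 6 | Sunday | 52.0 | 512.04 | 1.0 | 65.12 |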
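
To see why the replacement matters, here is a small sketch using the customer counts from the table above. The variable names are only for illustration. Pandas skips `NaN` by default when computing statistics such as the mean, whereas a placeholder like `9999` is treated as a real value.

```python
import numpy as np
import pandas as pd

# Customer counts from the example, with 9999 standing in for missing days.
customers = pd.Series([62, 54, 71, 9999, 65, 9999, 52])

# With the placeholders left in, the mean is badly inflated.
mean_with_placeholder = customers.mean()

# After replacing the placeholders with NaN, mean() skips them by default.
customers_clean = customers.replace(9999, np.nan)
mean_without_missing = customers_clean.mean()

print("Mean with 9999 left in:", mean_with_placeholder)
print("Mean with NaN instead: ", mean_without_missing)
```

The second mean is computed from the five observed days only, which is the behavior we usually want.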


### Locating and Handling missing data

Count the missing values in each row and each column.
* If the number of missing values in a row or column reaches ____% of its total number of values, we can simply drop that row or column, since it would not give us enough information anyway.
  * Remove rows with too many missing values
  * Remove any row with a missing value in a particular column
  * Remove columns with too many missing values
* If there isn't too much missing data, we can decide what to do based on the variable that is missing
  * Fill all missing values with a fixed value (0, 0.5, average)
    * `df.fillna(0.5)`
    * `df.fillna({'Col1':val1, 'Col2':val2})`
  * Fill each missing value with a neighboring value
    * `df.ffill()` copies the previous value forward (older code writes this as `df.fillna(method="ffill")`, which is deprecated in recent pandas versions)
    * `df.bfill()` copies the following value backward
  * Fill all missing values based on the data in that variable (use `numeric_only=True` when the DataFrame also has text columns)
    * `df.fillna(df.mean(numeric_only=True))`
    * `df.fillna(df.median(numeric_only=True))`
    * `df.fillna(df.min(numeric_only=True))`
    * `df.fillna(df.max(numeric_only=True))`
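
The options above can be sketched on a small example. The DataFrame below is a made-up two-column excerpt of the store data, and the variable names are hypothetical. It first counts the missing values, then applies a few of the fill strategies so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Two columns from the store example, with missing entries already set to NaN.
df = pd.DataFrame({
    "customers": [62, 54, 71, np.nan, 65, np.nan, 52],
    "shoplifters": [np.nan, np.nan, 2, np.nan, np.nan, 5, 1],
})

# Locating missing data
missing_per_column = df.isna().sum()       # number of NaN in each column
missing_per_row = df.isna().sum(axis=1)    # number of NaN in each row
fraction_missing = df.isna().mean()        # share of NaN in each column

# Handling missing data
filled_constant = df.fillna(0)                        # one fixed value everywhere
filled_mean = df.fillna(df.mean(numeric_only=True))   # each column's own mean
filled_forward = df.ffill()                           # previous value copied forward
filled_backward = df.bfill()                          # following value copied backward

print(missing_per_column)
print(fraction_missing)
print(filled_forward)
```

Forward fill cannot fill a missing value at the very start of a column, because there is no previous value to copy. In this example the first two `shoplifters` entries stay `NaN` after `ffill()`, while `bfill()` fills them with the next observed value.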

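In [None]:
# work on a copy so the original is untouched
# (assumes `dataset` from the cells above)
df = dataset.copy()

# drop columns where more than 10% of values are missing
limit_cols = 0.1 * len(df)        # 10% of the number of rows

for col in df.columns:
    if df[col].isna().sum() > limit_cols:
        # columns= is required here; drop() removes rows by default
        df.drop(columns=col, inplace=True)

# drop rows where more than 10% of values are missing
limit_rows = 0.1 * df.shape[1]    # 10% of the number of remaining columns

# keep only the rows at or below the limit
df = df[df.isna().sum(axis=1) <= limit_rows]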
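
The same idea can also be written with the built-in `DataFrame.dropna`, whose `thresh` argument is the minimum number of non-missing values a row or column needs in order to be kept. The sketch below uses a made-up DataFrame and a hypothetical 50% cutoff (`max_missing_fraction`) so that the effect is easy to follow by hand.

```python
import math
import numpy as np
import pandas as pd

# Made-up data: column "b" is mostly missing, and row 1 is entirely missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 7.0],
    "c": [8.0, np.nan, 10.0, 11.0],
})

max_missing_fraction = 0.5  # hypothetical cutoff: drop if more than 50% is missing

# A column is kept if it has at least this many non-missing values.
min_present_in_col = math.ceil((1 - max_missing_fraction) * len(df))
kept_cols = df.dropna(axis=1, thresh=min_present_in_col)

# A row is kept if it has at least this many non-missing values
# among the columns that remain.
min_present_in_row = math.ceil((1 - max_missing_fraction) * kept_cols.shape[1])
cleaned = kept_cols.dropna(axis=0, thresh=min_present_in_row)

print(cleaned)
```

Here column `b` has only one observed value out of four, so it is dropped, and row 1 has no observed values in the remaining columns, so it is dropped as well.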