# Lecture 9 Cleaning Data - Part 1 - Missing Values
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney, *Python for Data Analysis*, Chapter 6](https://wesmckinney.com/book/accessing-data)
* Chapter 7

Class notes are found through GitHub. As changes are made, they will automatically be uploaded to GitHub. A link to the repository is on Canvas.

-----
## Outline
* Missing data
  * Locating/Identifying missing data
  * Ways to handle missing data

-----

In order to have data ready for analysis or for modeling, the data needs to be prepared. We call this __preprocessing__.
1. Cleaning the Data
    * Handling missing data
    * Cleaning Labels
    * Formats
      * str to int/float
      * DateTime
2. Data Wrangling
    * Encoding categorical data
    * Rearranging data
    * Combining datasets

-----
## Missing Data

Missing data can show up in a dataset in several forms:
* Common in coding: `NaN` (Not a Number)
* Blank entries (pandas automatically fills these in with `NaN` when reading a file)
* Large, unreasonable values
* Characters/strings, such as `-` or `missing`
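
As a minimal sketch of how these forms can be converted to `NaN` while loading, the example below reads a small made-up CSV (the contents and column names are hypothetical, not from the class data). Blank fields become `NaN` by default, and the `na_values` argument of `pd.read_csv` lists extra markers to treat as missing.

```python
import io
import pandas as pd

# Hypothetical raw file contents: a blank field, a "-", the word "missing",
# and a 9999 sentinel value
raw = io.StringIO(
    "day,customers,revenue\n"
    "Monday,62,321.45\n"
    "Tuesday,,295.74\n"
    "Wednesday,-,441.24\n"
    "Thursday,9999,missing\n"
)

# Blank fields are read as NaN automatically;
# na_values adds more markers that should also count as missing
sales = pd.read_csv(raw, na_values=["-", "missing", "9999"])

print(sales)
print(sales.isna().sum())   # number of missing values per column
```

Note that `na_values` given as a list applies to every column, so a sentinel such as 9999 should only be listed if it can never be a legitimate value.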

We can deal with missing values in two ways:
1. Dropping them
2. Filling them in with a reasonable value

In [None]:
import numpy as np
import pandas as pd

info = {
    'day': ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],
    'Number of Customers': [62,54,71,9999,65,9999,52],
    'Revenue': [321.45, 295.74, 441.24, 9999, 512.64, 652.31, 512.04],
    'Shoplifters': [9999, 9999, 2, 9999, 9999, 5, 1],
    'Expenses': [51.40, 53.75, 9999, 59.63, 61.42, 64.25, 75.12]
}

dataset = pd.DataFrame(info)
display(dataset)

While programming, it is most helpful to convert missing entries to `NaN` so they don't mess up our calculations.

In [None]:
dataset.replace(9999, np.nan, inplace=True)
dataset

### Dropping missing values
When to drop an observation or variable:
* When so much data is missing that the observation/variable doesn't provide any significant information
    * If the number of missing values reaches ____% of the total number of values, we can simply drop that data, as it wouldn't give us enough information anyway.
* When entries are duplicated

Ways to drop:
* Remove rows with too many missing values
* Remove any row with a missing value in a particular column
* Remove columns with too many missing values
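
One way to express "too many missing values" is the `thresh` argument of `.dropna()`, which sets the minimum number of non-missing values a row or column must have to be kept. The sketch below uses a small made-up frame; the column names and the thresholds are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Small made-up frame; names and values are for illustration only
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [9.0, np.nan, 11.0, 12.0],
})

# Keep rows that have at least 2 non-missing values
rows_kept = df.dropna(thresh=2, axis=0)

# Keep columns that have at least 3 non-missing values
cols_kept = df.dropna(thresh=3, axis=1)

print(rows_kept)
print(cols_kept)
```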

In [None]:
display(dataset)

# Drop a column with too many missing values
display(dataset.drop('Shoplifters', axis=1, inplace=False))

# Drop a row with too many missing values
display(dataset.drop(3, axis=0, inplace=False))

In [None]:
# Filtering
print(f"There are {dataset['Number of Customers'].isna().sum()} missing values.")

dataset[ dataset['Number of Customers'].notna() ]

In [None]:
# Using the .dropna() method
display(dataset.dropna())

# .dropna() - "how" argument  -  default: how='any'
display(dataset.dropna(how='all', axis=0, inplace=False))

# .dropna() if missing in a particular column
display(dataset.dropna(subset='Number of Customers', inplace=False))
display(dataset.dropna(subset=['Number of Customers','Expenses'], inplace=False))

In [None]:
dataset.dropna(subset=['Number of Customers','Expenses'], inplace=True)
dataset.dropna(axis=1, how="any", inplace=True)
display(dataset)

In [None]:
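# Drop columns where more than 10% of values are missing
limit_cols = 0.1 * len(dataset)        # 10% of the number of rows

for col in list(dataset.columns):
    if dataset[col].isna().sum() > limit_cols:
        # axis=1 drops a column; the default (axis=0) would look for a row label
        dataset.drop(col, axis=1, inplace=True)

# Drop rows where more than 10% of values are missing
limit_rows = 0.1 * dataset.shape[1]    # 10% of the number of columns

# Keep only the rows whose count of missing values is within the limit
dataset = dataset[dataset.isna().sum(axis=1) <= limit_rows]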

### Duplicate Values
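
A minimal sketch of locating and removing duplicated rows, assuming a small made-up frame (the `visits` name and its values are hypothetical). `.duplicated()` flags each row that repeats an earlier one, and `.drop_duplicates()` removes those repeats.

```python
import pandas as pd

# Made-up data with one repeated row (Tuesday appears twice)
visits = pd.DataFrame({
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday"],
    "customers": [62, 54, 54, 71],
})

# Boolean mask: True for each row that repeats an earlier row
dup_mask = visits.duplicated()

# Drop the repeated rows, keeping the first occurrence of each
deduped = visits.drop_duplicates()

print(dup_mask)
print(deduped)
```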

### Filling missing values
Sometimes it makes more sense to fill in a missing value with another value. There are a few ways we can fill in missing values:
* Constant
* Calculation (mean/median/min/max)
* Forward fill / Backward fill

Filling in with a constant or a calculated value works well when the values have no particular order or pattern. The mean is frequently used because it leaves the column's mean unchanged, so it doesn't affect calculations too much. Forward or backward fill is used when there is an order to the data, so that a neighboring value is a reasonable guess.

We can fill in values using the `.fillna()` and `.interpolate()` methods.
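
To illustrate why the mean is a popular fill value, the sketch below fills one missing entry in a short made-up series and compares summary statistics before and after. The mean is unchanged, but the standard deviation shrinks, so mean-filling is not entirely free of side effects.

```python
import numpy as np
import pandas as pd

# Made-up series with one missing value
expenses = pd.Series([51.40, 53.75, np.nan, 59.63, 61.42])

# .mean() skips NaN by default, so this is the mean of the four known values
mean_filled = expenses.fillna(expenses.mean())

# The mean is the same before and after filling
print(expenses.mean(), mean_filled.mean())

# The spread gets smaller, because the filled value sits exactly at the mean
print(expenses.std(), mean_filled.std())
```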

In [None]:
dataset = pd.DataFrame(info)
dataset.replace(9999, np.nan, inplace=True)
dataset

In [None]:
dataset.fillna(0.5, inplace=False)

In [None]:
dataset['Number of Customers'].fillna(dataset['Number of Customers'].mean())

In [None]:
dataset['Expenses'].fillna(dataset['Expenses'].mean())

In [None]:
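# .ffill() replaces the deprecated fillna(method='ffill')
dataset['Expenses'].ffill()

In [None]:
# .bfill() replaces the deprecated fillna(method='bfill'); limit=1 fills at most one consecutive NaN
dataset['Shoplifters'].bfill(limit=1)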

In [None]:
print(dataset['Expenses'])
print(dataset['Expenses'].interpolate())
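
As a small worked example of what `.interpolate()` does by default, the sketch below uses a short made-up series. The default method is linear and treats the entries as equally spaced, so a single missing value between two known values is filled with their midpoint. Forward fill is shown alongside for comparison.

```python
import numpy as np
import pandas as pd

# Made-up series with one gap between 53.75 and 59.63
expenses = pd.Series([51.40, 53.75, np.nan, 59.63])

# Default linear interpolation: midpoint of the two neighbors
interpolated = expenses.interpolate()

# Forward fill copies the previous known value instead
forward_filled = expenses.ffill()

print(interpolated)
print(forward_filled)

# The interpolated value equals (53.75 + 59.63) / 2
print(np.isclose(interpolated.iloc[2], (53.75 + 59.63) / 2))
```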

### Locating and Handling missing data

Count the number of missing values in each row and each column.
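
A minimal sketch of these counts, assuming a small made-up frame (names and values are illustrative). `.isna()` produces a table of `True`/`False`, and summing it along either axis gives the counts; taking the mean gives the fraction missing.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({
    "customers": [62, np.nan, 71, np.nan],
    "revenue": [321.45, 295.74, 441.24, np.nan],
})

missing_per_column = df.isna().sum()         # count per column
missing_per_row = df.isna().sum(axis=1)      # count per row
fraction_per_column = df.isna().mean()       # share of each column that is missing

print(missing_per_column)
print(missing_per_row)
print(fraction_per_column)
```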

* If there isn't too much missing data, we can determine what to do based on the variable that is missing
  * Fill all missing values with a value (0, 0.5, average)
    * `df.fillna(0.5)`
    * `df.fillna({'Col1':val1, 'Col2':val2})`
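  * Fill all missing values with the previous value (forward fill)
    * `df.ffill()`
  * Fill all missing values based on the data in that variable (`numeric_only=True` skips text columns such as `day`)
    * `df.fillna(df.mean(numeric_only=True))`
    * `df.fillna(df.median(numeric_only=True))`
    * `df.fillna(df.min(numeric_only=True))`
    * `df.fillna(df.max(numeric_only=True))`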
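
The options above can be combined so that each variable gets its own fill rule. The sketch below assumes a small made-up frame; the column names and the choice of fill value for each column are illustrative assumptions, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up frame with a text column and two numeric columns
df = pd.DataFrame({
    "day": ["Monday", "Tuesday", "Wednesday"],
    "customers": [62, np.nan, 71],
    "shoplifters": [np.nan, 2, np.nan],
})

# A dictionary maps each column to its own fill value
fill_values = {
    "customers": df["customers"].median(),   # median of the known values
    "shoplifters": 0,                        # assume none were recorded
}
filled = df.fillna(fill_values)

# Alternatively, fill every numeric column with its own mean;
# numeric_only=True leaves the text column out of the calculation
mean_filled = df.fillna(df.mean(numeric_only=True))

print(filled)
print(mean_filled)
```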