# Data Cleaning

## Why is Data Cleaning so important?

Decisions and analytics are increasingly driven by data and models.

Key aspects of the Machine Learning workflow depend on cleaned data:
- Observations: An instance of the data (usually a point or row in a dataset)
- Labels: The output variable(s) being predicted
- Algorithms: Computer programs that estimate models based on available data
- Model: Hypothesized relationship between observations and data

Messy data can lead to a "garbage in, garbage out" effect and unreliable outcomes.

The main data problems companies face: 

- Lack of data 
- Too much data 
- Bad data

Having data ready for ML and AI ensures you are ready to infuse AI into your organization.

---

### How can Data be Messy?

- Duplicate or unnecessary data
- Inconsistent text and typos
- Missing data
- Outliers
- Data sourcing issues:
    - Multiple systems
    - Different database types
    - On premises, in cloud
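
As a small illustration of the "inconsistent text" item above, the sketch below normalizes casing and whitespace with pandas. The `cities` values are made up for this example, and the approach only handles formatting differences; genuine typos would need a lookup table or fuzzy matching.

```python
import pandas as pd

# Hypothetical column with inconsistent casing and stray whitespace
cities = pd.Series(["New York", "new york ", " NEW YORK", "Boston", "boston"])

# Normalize: trim surrounding whitespace, then lowercase
cleaned = cities.str.strip().str.lower()

print(cities.nunique())   # number of distinct raw spellings
print(cleaned.nunique())  # number of distinct values after normalization
```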

--- 
### Duplicate or Unnecessary Data

Pay attention to **duplicate values** and research why there are multiple values. 

It's a good idea to look at the features you're bringing in and **filter** the data as necessary (be careful not to filter out features you may need later).
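
One possible way to inspect and drop duplicates with pandas is sketched below. The DataFrame and its column names (`customer_id`, `plan`, `notes`) are hypothetical; the point is to look at the duplicated rows before removing them and to keep the unfiltered frame around.

```python
import pandas as pd

# Hypothetical data: customer 2 appears twice with identical values
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "plan": ["basic", "pro", "pro", "basic"],
    "notes": ["", "", "", ""],
})

# Inspect every row involved in a duplicate before deciding what to do
dupes = df[df.duplicated(keep=False)]
print(dupes)

# Drop exact duplicate rows, keeping the first occurrence
deduped = df.drop_duplicates()

# Filter to the features needed now; `deduped` still holds all columns
features = deduped[["customer_id", "plan"]]
```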

---

### Policies for Missing Data 

**Remove** the data: Remove the row(s) entirely.

**Impute** the data: Replace with substituted values. Fill in the missing data with the most common value, the average value, etc.

**Mask** the data: Create a category for missing values. 

What are the pros and cons of each of these approaches?
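
The three policies above might look like this in pandas. This is a minimal sketch on a made-up DataFrame (the `age` and `city` columns are assumptions), and the choice of mean and mode for imputation is just one option among several.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Austin", None, "Boston", "Austin"],
})

# Remove: drop any row that contains a missing value
removed = df.dropna()

# Impute: mean for the numeric column, most common value for the categorical one
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# Mask: give missing categorical values their own category
masked = df.copy()
masked["city"] = masked["city"].fillna("Missing")
```

Comparing `len(removed)` with `len(df)` shows how much data the first policy costs, which is one of the trade-offs to weigh.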

---
### Outliers 

An **outlier** is an observation that is distant from most other observations. 

Typically, these observations are aberrations and do not accurately represent the phenomenon we are trying to explain through the model. 

If we do not identify and deal with outliers, they can have a significant impact on the model.

It is important to remember that some outliers are informative and provide insights into the data.

---
### How to find Outliers

![image.png](attachment:image.png)

---
### Detecting Outliers: Plots

![image-2.png](attachment:image-2.png)

---
### Detecting Outliers: Statistics

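```python
import numpy as np

# Assumes `data` is a pandas DataFrame with an 'Unemployment' column
unemployment = data['Unemployment']

# Calculate the quartiles and the interquartile range
q25, q50, q75 = np.percentile(unemployment, [25, 50, 75])
iqr = q75 - q25

# Calculate the lower / upper limits beyond which a point is considered an outlier
# (named to avoid shadowing the built-ins min and max)
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr

print(lower, q25, q50, q75, upper)
# Output: -6.6 7.8 11.5 17.4 31.8

# Identify the points above the upper limit
print([x for x in unemployment if x > upper])
# Output: [40.0, 34.700000003]
```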
![image-3.png](attachment:image-3.png)

--- 
### Detecting Outliers: Residuals 

**Residuals** (differences between actual and predicted values of the outcome variable) represent model failure.

Approaches to calculating residuals:

- **Standardized**: Residual divided by its standard error.
- **Deleted**: Residual from fitting the model on all data excluding the current observation.
- **Studentized**: Deleted residual divided by the residual standard error (based on all data, or on all data excluding the current observation).
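
The three kinds of residuals above can be sketched with NumPy for a simple linear regression. The data here is invented (a straight line with small fixed noise and one point pushed off the line), and the code relies on the standard least-squares leverage identities rather than refitting the model once per observation. The studentized version shown uses the standard error estimated without the current observation.

```python
import numpy as np

# Hypothetical data: a linear trend with small noise and one point pushed off the line
x = np.arange(10, dtype=float)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3, 0.0, -0.3, 0.1])
y = 2.0 * x + 1.0 + noise
y[7] += 15.0

X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
n, p = X.shape

# Ordinary least squares fit and raw residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage of each observation (diagonal of the hat matrix)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Residual standard error from the fit on all data
s = np.sqrt(resid @ resid / (n - p))

# Standardized: residual divided by its standard error
standardized = resid / (s * np.sqrt(1 - h))

# Deleted: residual for observation i from a model fit without observation i
deleted = resid / (1 - h)

# Studentized: deleted residual divided by its standard error,
# estimated with observation i excluded
s_excl = np.sqrt((resid @ resid - resid**2 / (1 - h)) / (n - p - 1))
studentized = deleted / (s_excl / np.sqrt(1 - h))

# Index of the observation with the most extreme studentized residual
print(np.argmax(np.abs(studentized)))
```

Because the deleted and studentized versions leave the current observation out, an outlier cannot mask itself by pulling the fit toward it.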

---
### Policies for Outliers

- **Remove** them
- **Assign** the mean or median value
- **Transform** the variable
- **Predict** what the value should be:
    - Using 'similar' observations to predict likely values.
    - Using regression. 
- **Keep them**, but focus on models that are resistant to outliers.
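
A few of these policies are sketched below in pandas. The `rates` values are invented for illustration, outliers are flagged with the 1.5 × IQR rule, and the log transform is just one possible transformation. Predicting replacement values and choosing outlier-resistant models are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical unemployment rates with one unusually high value
rates = pd.Series([5.2, 6.1, 6.8, 7.4, 7.9, 8.3, 9.0, 40.0])

# Flag outliers above the upper limit of the 1.5 * IQR rule
q25, q75 = rates.quantile([0.25, 0.75])
upper = q75 + 1.5 * (q75 - q25)
is_outlier = rates > upper

# Remove: drop the flagged observations
removed = rates[~is_outlier]

# Assign: replace flagged values with the median of the remaining values
assigned = rates.where(~is_outlier, removed.median())

# Transform: a log transform compresses large values
transformed = np.log1p(rates)
```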
