## dealing with
# missing data


---


### Why does missing data exist? 
- Real world data is messy!
- Values can be missed during the data acquisition process (faulty sensors, human error...)
- Values can be deleted accidentally

### Workflow for treating missing data: <br>
1. Convert all missing values to null <br>
2. Analyze the amount and type of missingness <br>
3. Delete or impute missing values <br>
4. Evaluate and compare the performance of the treated/imputed dataset <br>

** make a copy of the original df, it can be useful for comparisons later on! 

### Null value operations

- **None != np.nan** <br><br>
`None` = NoneType, supports logical operations but not arithmetic <br>
`np.nan` = float, "not a number" (undefined), supports logical and arithmetic operations (the arithmetic result is always NaN)

#### Checking for nulls: 
```python
    import numpy as np

    np.isnan(np.nan)  # True

    None == None      # True (the idiomatic check is `x is None`)

    np.nan == np.nan  # False, NaN never compares equal to itself
```
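A minimal sketch of the difference described above, assuming only NumPy and pandas are installed. The variable names are illustrative and not part of the original notes.

```python
import numpy as np
import pandas as pd

# np.nan is a float, so arithmetic runs and the result stays NaN
nan_plus_one = np.nan + 1

# None is not numeric, so the same arithmetic raises TypeError
try:
    None + 1
    none_arithmetic_failed = False
except TypeError:
    none_arithmetic_failed = True

# pandas treats both None and np.nan as missing
both_missing = pd.isna(None) and pd.isna(np.nan)

print(nan_plus_one, none_arithmetic_failed, both_missing)
```

Because NaN never equals itself, `np.isnan` or `pd.isna` is the safer check than `==`.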

### Missing Values 
Detect and replace <br> 
examples: 'NA','-', '.', '?', numbers out of range, wrong data type

**1st step**: read and print a snippet (e.g. head or info) of the dataset <br>
```python
    df.info()                  # range index, data columns (name, quantity and type), dtypes
    df.csat.unique()           # show unique values
    np.sort(df.csat.unique())  # sort the unique values
    df.describe()              # statistical info
```
**2nd step**: detect what counts as missing data (true nulls or placeholder values) <br>

**3rd step**: replace missing values with NaN <br> 
```python
    #on import
    pd.read_csv('dataset.csv', na_values='.')
```
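When the dataset is already loaded, the same conversion can be done after the fact. The sketch below assumes a small made-up DataFrame; the column names, the placeholder list, and the valid age range of 0 to 120 are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with placeholder strings and out-of-range numbers
df = pd.DataFrame({
    "csat": ["4", "NA", "5", "?", "3"],
    "age": [34, 29, -1, 41, 250],
})

# Placeholder strings become NaN, then the column can be made numeric
df["csat"] = df["csat"].replace(["NA", "-", ".", "?"], np.nan)
df["csat"] = pd.to_numeric(df["csat"])

# Numbers outside the assumed valid range become NaN
df["age"] = df["age"].where(df["age"].between(0, 120))

print(df.isna().sum())
```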

<br>
**extra points for domain knowledge, knowing the context makes for better data analysis 

### Analyze the amount of missingness

**Basic analysis**: find the total number and percentage of missing values in the dataset 
```python
    df.isnull()  # returns a boolean nullity DataFrame
    df.isna()    # alias of isnull()
```

#### Nullity Bar 
```python
    df_nullity = df.isnull() 
    df_nullity.sum() #total quantity
    df_nullity.mean() * 100 #percentage
```

### Missingno package 
graphical analyses of missing values 

```python
    import missingno as msno 
    msno.bar(df) # completeness of the dataframe 
```
#### Nullity matrix 
```python
    msno.matrix(df)  # add freq (e.g. freq='M') only when df has a DatetimeIndex
```

- use 'loc' to extract parts of the dataframe for analysis ;)

### Possible reasons for missing data
Random; or due to another variable

### Types
- Missing Completely at Random (MCAR): missingness has no relationship with any variable, observed or missing
- Missing At Random (MAR): missingness is related to other observed variables, but not to the missing values themselves
- Missing Not At Random (MNAR): missingness is related to the missing values themselves, or to a reason that cannot be observed

Identifying the type helps narrow down the methodology for dealing with the missing values.
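To make the distinction between MCAR and MAR concrete, here is a small simulation on synthetic data. Everything in it (the `age` and `income` columns, the probabilities, the age cutoff of 50) is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
df = pd.DataFrame({
    "age": rng.integers(18, 81, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})
older = (df["age"] > 50).to_numpy()

# MCAR: every row has the same 20% chance of losing its income value
mcar_mask = rng.random(n) < 0.20

# MAR: the chance depends on an observed variable (age), not on income itself
mar_mask = rng.random(n) < np.where(older, 0.40, 0.05)

# Difference in missingness rate between older and younger rows
mcar_gap = mcar_mask[older].mean() - mcar_mask[~older].mean()
mar_gap = mar_mask[older].mean() - mar_mask[~older].mean()

print(f"MCAR gap: {mcar_gap:.3f}, MAR gap: {mar_gap:.3f}")
```

Under MCAR the missingness rate is roughly the same in both age groups, while under MAR it differs clearly between them.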

### Finding correlations between missingness

- Heatmap 
```python 
    msno.heatmap(df) # nullity correlation from -1 to 1; with the default colormap, red marks negative correlation
```

- Dendrogram: tree diagram of missingness, describes correlation by grouping the variables
```python 
    msno.dendrogram(df) 
```

### Visualizing missingness across a variable

- Plotting the values in a scatter plot 
- Fill the NaN with random values (dummy values) 
- Prepare the dataframe to see the correlation of nullity between variables
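The steps above can be sketched as a helper that fills each NaN with a random dummy value placed just below the column's observed minimum, so missing points stay visible on a scatter plot. The function name `fill_dummy_values`, the `scaling_factor` parameter, and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_dummy_values(df, scaling_factor=0.075, seed=0):
    """Return a copy of df where NaNs are replaced by random values below each column's minimum."""
    rng = np.random.default_rng(seed)
    df_dummy = df.copy()
    for col in df_dummy.columns:
        null_mask = df_dummy[col].isnull()
        col_min = df_dummy[col].min()
        col_range = df_dummy[col].max() - col_min
        # Random values in [min - 2*s*range, min - s*range), i.e. below all observed data
        dummy = col_min - (2 - rng.random(null_mask.sum())) * scaling_factor * col_range
        df_dummy.loc[null_mask, col] = dummy
    return df_dummy

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 5.0], "b": [10.0, 20.0, np.nan, np.nan]})
nullity = df.isnull()          # keep the flags to color the points later
df_dummy = fill_dummy_values(df)
print(df_dummy)
```

The filled frame can then be passed to any scatter plot, colored by the nullity flags, to see where one variable is missing relative to the other.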

### When and how to delete missing data 

Types of deletion for MCAR: 
- Pairwise deletion: skip NaN values in each operation (pandas does this by default in e.g. `mean()`) 

- Listwise deletion: the row is deleted

```python
    df.dropna(subset = ['n'], how='any', inplace = True)
```

! Pay attention to the amount of data lost; use deletion only when the number of missing values is small 
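A short sketch comparing the two deletion types on a made-up DataFrame (the columns `n` and `m` and their values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n": [1.0, np.nan, 3.0, 4.0],
    "m": [10.0, 20.0, np.nan, 40.0],
})

# Pairwise: each statistic uses every available value in its own column
pairwise_means = df.mean()

# Listwise: any row with a missing value is dropped before computing anything
df_listwise = df.dropna(how="any")
listwise_means = df_listwise.mean()

# Check how much data the listwise approach cost
rows_lost = len(df) - len(df_listwise)
print(pairwise_means, listwise_means, rows_lost, sep="\n")
```

Working on the returned copy instead of using `inplace=True` keeps the original frame available for comparison.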

### Basic Imputation Techniques 
Using the column where the value is missing
- constant (0, 1)
- statistical parameter: mean, median, mode <br>
! Imputing a statistic preserves that statistic, but it does not account for the correlations between variables = BIASED DATA 

- In code: `SimpleImputer` (sklearn) applies these strategies; it is an imputation transformer rather than a predictive model 

! Visualize imputations with subplots to check validity
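A minimal sketch of mean imputation with scikit-learn's `SimpleImputer`, assuming a small made-up DataFrame. Other accepted values of `strategy` are `"median"`, `"most_frequent"`, and `"constant"` (with `fill_value`).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 5.0],
    "b": [10.0, np.nan, 30.0, 50.0],
})

# Each NaN is replaced by the mean of its own column
mean_imputer = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(
    mean_imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)
print(df_mean)
```

Building a new DataFrame keeps the original untouched, which makes it easier to plot the imputed and original values side by side.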

### Imputing Time-Series Data 
A time series is a series of data points indexed in time order. <br> ~discrete-time data = separate points in time

#### df.fillna(method=" ")
- ffill: replace all NaN with the last observed value 
- bfill: replace all NaN with the following observed value 

! these methods create laterality in the dataset (every gap leans toward the value on one side)
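A small sketch of both fills on a made-up daily series (the dates and values are assumptions). It uses the `ffill()` and `bfill()` methods directly, since recent pandas releases deprecate the `method=` argument of `fillna` in their favor.

```python
import numpy as np
import pandas as pd

ts = pd.Series(
    [10.0, np.nan, np.nan, 40.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Forward fill: carry the last observed value forward
ts_ffill = ts.ffill()

# Backward fill: pull the next observed value backward
ts_bfill = ts.bfill()

print(pd.DataFrame({"original": ts, "ffill": ts_ffill, "bfill": ts_bfill}))
```

Note that a trailing NaN survives `bfill()` and a leading NaN survives `ffill()`, because there is no observed value on that side.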

#### df.interpolate(method="") 
- linear: imputes linearly, with equidistant values <br> 
10 ___ 30 = 10 20 30 <br>
! can create impression of trend

- quadratic: fits a quadratic curve through the observed points <br> ! can overshoot values
- nearest: uses the nearest observed value (ffill + bfill) <br> ! same problem of laterality 
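The linear case can be reproduced on a made-up series (dates and values are assumptions). Only `method="linear"` is shown because, in pandas, the `"quadratic"` and `"nearest"` methods additionally require SciPy to be installed.

```python
import numpy as np
import pandas as pd

ts = pd.Series(
    [10.0, np.nan, 30.0, np.nan, np.nan, 60.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Linear interpolation spaces the imputed values evenly between observed neighbors
ts_linear = ts.interpolate(method="linear")
print(ts_linear)
```

The gap between 10 and 30 is filled with 20, and the gap between 30 and 60 with 40 and 50.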

> "Time-Series data usually comes with special characteristics (trend, seasonality, cyclicality) <br> which we can exploit when imputing missing values"


### Visualizing Time-Series Imputations 
- compare options graphically to decide which is the best

>"Plotting comparative graphs is essential for inferring the best imputation technique"

### Imputing Using Fancy Impute
fancyimpute package contains ML algorithms to impute missing values 

Advanced Technique: uses other columns to predict missing values 

- KNN (K-Nearest-Neighbor) <br>
Selects K nearest or similar data points using all the non-missing features; <br>
Takes advantage of the selected data points to fill in missing value 

```python
    from fancyimpute import KNN
    knn_imputer = KNN() 
    df.iloc[:,:] = knn_imputer.fit_transform(df) # in place
```

- MICE (Multiple Imputations by Chained Equations) <br> 
Performs multiple regressions over random samples of the data <br>
Takes the average of the multiple regression values <br>
Imputes the missing feature value for the data point <br>
Implemented as `IterativeImputer`
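As a hedged illustration of the chained-equations idea, the sketch below swaps in scikit-learn's `IterativeImputer` instead of fancyimpute. It is inspired by MICE but returns a single imputation by default rather than an average over several. The synthetic columns `x` and `y` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data where y is strongly correlated with x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.1, size=200)})

# Remove every 10th y value
df_missing = df.copy()
missing_idx = df_missing.index[::10]
df_missing.loc[missing_idx, "y"] = np.nan

# Each feature with missing values is regressed on the other features
imputer = IterativeImputer(max_iter=10, random_state=0)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df_missing),
    columns=df.columns,
    index=df.index,
)

# Mean absolute error of the imputed values against the known truth
mae = (df_imputed.loc[missing_idx, "y"] - df.loc[missing_idx, "y"]).abs().mean()
print(f"MAE of imputed y: {mae:.3f}")
```

Because the true values are known in this synthetic setup, the error can be measured directly, which is not possible on real data.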

> Datasets always have features which are correlated. Hence it becomes important to consider them as a factor for imputing missing values. <br> Machine Learning Models use features in the DF to find correlations and patterns and predict a selected feature.

### Imputing Categorical Data
Categorical data is usually strings

#### 1st step: encode categories
Techniques: 
- one-hot encoder: one column per category; only 0 and 1; <br>
ex: "color" becomes "color_red", "color_blue", "color_green"

- ordinal encoder: one column for all, each category receives a value <br>
ex: in "color" red = 1, blue = 2, green = 3; 

```python
    from sklearn.preprocessing import OrdinalEncoder
```

#### 2nd step: impute! 
- Fill most frequent category; 
- Use statistical models 

#### 3rd step: convert the columns back to categorical values 
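The three steps can be sketched end to end as follows, assuming a single made-up `color` column and most-frequent imputation. Encoding only the non-null entries is one possible approach and is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

colors = pd.Series(["red", "blue", np.nan, "red", "green"], name="color")

# 1st step: encode only the non-null categories, leaving NaN in place
encoder = OrdinalEncoder()
mask = colors.notna()
encoded = pd.Series(np.nan, index=colors.index, dtype=float)
encoded[mask] = encoder.fit_transform(
    colors[mask].to_numpy().reshape(-1, 1)
).ravel()

# 2nd step: impute the encoded column with the most frequent category code
imputer = SimpleImputer(strategy="most_frequent")
imputed = imputer.fit_transform(encoded.to_numpy().reshape(-1, 1))

# 3rd step: convert the codes back to the original category labels
decoded = encoder.inverse_transform(imputed).ravel()
print(decoded)
```

Note that scikit-learn's `OrdinalEncoder` assigns codes starting at 0 in sorted category order, so the exact numbers differ from the red = 1 example above.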

### Evaluation of different imputation techniques 
Imputations are used to improve model performance; the technique that gives the best ML model performance is selected. <br>

Techniques: <br>
applying a linear model from statsmodels (`statsmodels.api` package); <br>
comparing the coefficients and standard errors; <br>
comparing density plots;

#### Density Plots
Show the distribution of the data, a good way to check for bias introduced by imputation. 

#### Linear Regression Models 
- R-squared measures how much of the variance in the target the model explains 
- Coefficients describe the model itself 
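A rough sketch of such a comparison on synthetic data. Note the swap: it uses scikit-learn's `LinearRegression` rather than statsmodels, so it reports R-squared and coefficients but not standard errors. The columns, the 30% missingness rate, and the helper `fit_linear` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with two correlated predictors
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.5 * x1 - 2.0 * x2 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

# Remove about 30% of x2 completely at random
df.loc[rng.random(n) < 0.3, "x2"] = np.nan

def fit_linear(data):
    """Fit y ~ x1 + x2 and return (R-squared, coefficients)."""
    features, target = data[["x1", "x2"]], data["y"]
    model = LinearRegression().fit(features, target)
    return model.score(features, target), model.coef_

# Treated datasets to compare
candidates = {
    "listwise deletion": df.dropna(),
    "mean imputation": df.fillna({"x2": df["x2"].mean()}),
}
results = {name: fit_linear(data) for name, data in candidates.items()}

for name, (r2, coefs) in results.items():
    print(f"{name}: R-squared={r2:.3f}, coefficients={np.round(coefs, 2)}")
```

The same loop can be extended with further treated copies of the dataset, one entry per imputation technique.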


---
### docs