# Pipeline Breakdown: Imputing

This notebook demonstrates how the different parts of the missing data handler/imputing class work and how they come together.

# Base Class
Every imputer class must extend the MissingDataHandler class and implement its methods. We can use a different imputer for each column, and we can easily swap imputers out for experimentation purposes.
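
To make the contract concrete, here is a minimal sketch of what such a base class and one subclass could look like. It is not the project's actual code: the `process(column, fit_data)` signature is inferred from the example call in this notebook, `ReplaceWithConstant` is a hypothetical handler invented for illustration, and a column is assumed to be a plain list with `None` marking missing entries (the notebook itself works on pandas columns).

```python
from abc import ABC, abstractmethod
from typing import List, Optional

# Assumption: a column is a plain list and None marks a missing entry.
Column = List[Optional[str]]


class MissingDataHandler(ABC):
    """Hypothetical stand-in for the project's base class."""

    @abstractmethod
    def process(self, column: Column, fit_data: bool = False) -> Column:
        """Return a copy of `column` with its missing entries filled in.

        When fit_data is True, the handler first learns whatever it needs
        from `column`.
        """


class ReplaceWithConstant(MissingDataHandler):
    """Illustrative handler (not from the project): fill with a fixed label."""

    def __init__(self, fill_value: str = "Unknown") -> None:
        self.replace_value = fill_value

    def process(self, column: Column, fit_data: bool = False) -> Column:
        # A constant fill has nothing to learn, so fit_data is ignored.
        return [self.replace_value if v is None else v for v in column]


handler = ReplaceWithConstant()
filled = handler.process(["Drama", None, "Comedy"], fit_data=True)
print(filled)
```

Because every handler exposes the same `process` method, swapping one imputer for another only means constructing a different subclass.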

## Example
The next few cells demonstrate a few different missing data handler classes, as well as the data they collect about the dataset. For this demo, we will only be working on a single column/feature.

In [1]:
# Before starting anything, add the folder containing the source code to the module search path
import sys
import os

# Build the path with os.path.join so the separator is correct on every OS
module_path = os.path.abspath(os.path.join('..', 'project_code'))
if module_path not in sys.path:
    sys.path.append(module_path)

### ReplaceWithHighestFrequency

One of the simplest imputers. It scans the observed data for the most frequent categories and replaces missing data with the highest-frequency one. It also has a parameter (keep_all) that lets it either keep all of the topmost categories or, in case of a tie, pick a single one at random.

In [6]:
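from pathlib import Path

from exploration_helper_functions import *
from missingdata import *

# Load the dataset and work on a copy of a single column
data_path = Path(r'data/netflix_data.csv')
df = load_data(data_path)
g = df['Genre'].copy()

# Fit the imputer on the column; fit_data=True makes it learn the replacement value
imputer = ReplaceWithHighestFrequency(keep_all=True)
imputer.process(g, fit_data=True)
print(f'The most frequent categories are : {imputer.replace_value}')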


The most frequent categories are : Drama
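
The tie handling described above can be sketched without the project code. The snippet below is only an illustration under stated assumptions: `top_categories` and `fill_missing` are hypothetical helpers, a column is a plain list with `None` for missing entries, and joining several kept categories into one comma-separated entry is a guess at how multiple categories might be written back, not the behaviour of `ReplaceWithHighestFrequency`.

```python
import random
from collections import Counter


def top_categories(values, keep_all=True, rng=random):
    """Return the most frequent observed categories.

    With keep_all=True every category tied for first place is returned;
    otherwise one of the tied categories is picked at random.
    """
    counts = Counter(v for v in values if v is not None)
    if not counts:
        raise ValueError("column has no observed values to learn from")
    best = max(counts.values())
    tied = sorted(cat for cat, n in counts.items() if n == best)
    return tied if keep_all else [rng.choice(tied)]


def fill_missing(values, categories):
    """Replace None entries with the kept categories.

    Assumption of this sketch: several kept categories are joined into a
    single multi-category entry.
    """
    replacement = ", ".join(categories)
    return [replacement if v is None else v for v in values]


genres = ["Drama", "Comedy", None, "Drama", "Comedy", "Horror", None]

kept = top_categories(genres, keep_all=True)     # ['Comedy', 'Drama']
single = top_categories(genres, keep_all=False)  # one of the tied categories
filled = fill_missing(genres, kept)
print(kept)
print(filled)
```

Sorting the tied categories keeps the `keep_all=True` result deterministic, so only the single-category mode depends on the random choice.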


## Pipeline

The DataImputer class combines all of these modular classes and provides the fit() and transform() methods. It holds a mapping from each column's name to a MissingDataHandler instance and is responsible for calling the handler assigned to each column. It also contains the default scheme, which was the best-performing one among the combinations we tried. If we do not pass a specific handler for a column, the default one is used for that column.
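
The mapping-with-default behaviour can be sketched as follows. This is a simplified stand-in, not the project's `DataImputer`: the constructor arguments, the two small handler classes, and the use of a dict of lists (with `None` for missing entries) in place of a DataFrame are all assumptions made to keep the example self-contained.

```python
from collections import Counter


class MostFrequentHandler:
    """Hypothetical default handler: fill with the most common observed value."""

    def __init__(self):
        self.replace_value = None

    def process(self, column, fit_data=False):
        if fit_data:
            # Assumes the column has at least one observed value.
            observed = [v for v in column if v is not None]
            self.replace_value = Counter(observed).most_common(1)[0][0]
        return [self.replace_value if v is None else v for v in column]


class ConstantHandler:
    """Hypothetical handler: fill with a fixed label."""

    def __init__(self, fill_value):
        self.replace_value = fill_value

    def process(self, column, fit_data=False):
        return [self.replace_value if v is None else v for v in column]


class DataImputerSketch:
    """Maps column names to handlers and falls back to a default handler."""

    def __init__(self, handlers=None, default_factory=MostFrequentHandler):
        self.handlers = dict(handlers or {})
        self.default_factory = default_factory

    def fit(self, table):
        for name, column in table.items():
            # Columns without an explicit handler get a fresh default one.
            if name not in self.handlers:
                self.handlers[name] = self.default_factory()
            self.handlers[name].process(column, fit_data=True)
        return self

    def transform(self, table):
        # Reuse what was learned in fit(); nothing is refit here.
        return {name: self.handlers[name].process(column)
                for name, column in table.items()}


train = {"Genre": ["Drama", None, "Drama", "Comedy"],
         "Country": ["US", None, "UK", "US"]}
new_rows = {"Genre": [None, "Horror"], "Country": [None, "FR"]}

imputer = DataImputerSketch(handlers={"Country": ConstantHandler("Unknown")})
result = imputer.fit(train).transform(new_rows)
print(result)
```

Here `Country` uses the handler passed in explicitly, while `Genre` falls back to the default, and `transform` fills new rows with the value learned during `fit`.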

### Future Work

The missing feature can be set as the resultant variable, and the rest of the columns (including the original resultant variable) can then be treated as features, so that a model predicts the missing values. For this project, however, the existence of multiple categories per entry and the high dimensionality of the data make this quite a difficult task. It is left as future work.