# Data Processing Pipeline Example
This notebook contains sample code showing how to use the data-processing pipeline.


In [4]:
# Before starting, add the folder containing the project source code to sys.path
# so that the notebook can import it
import sys
import os

# Build the path with os.path.join so it works on any OS, and check the same
# path that is appended
module_path = os.path.abspath(os.path.join('..', 'project_code'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [10]:
# Once added, we can call all our project functions without any issues
from pathlib import Path
import pandas as pd
# from project_code.missingdata import DataImputer
from encoding import DataEncoder
from missingdata import DataImputer
from exploration_helper_functions import load_data

# Load up the data
data_path = Path(r'data/netflix_data.csv')
df = load_data(data_path)

In [11]:
# Set up the default imputer
imputer = DataImputer()
imputer.fit_transform(df)

# Set up the default encoder
encoder = DataEncoder()
x, y = encoder.fit_transform(dataframe=df)

In [12]:
print(f"The shape of X is {x.shape}")
print(f"The shape of y is {y.shape}")

The shape of X is (13379, 175)
The shape of y is (13379,)


This gives us the feature matrix X and the target variable y using the default settings, which differ from column to column.

To customize the imputer or the encoder, we can pass a mapping from column names to MissingDataHandler objects (for the imputer) or BaseEncoding objects (for the encoder) through the scheme argument at instantiation. A small example is given below.

The data contains a Genre column. If we want, we can impute its missing values with the most frequent value.

In [13]:
from missingdata import ReplaceWithHighestFrequency

df = load_data(data_path)
custom_column_imputer = ReplaceWithHighestFrequency()
imputer = DataImputer(scheme={'Genre' : custom_column_imputer})
imputer.fit_transform(df)
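
For intuition, most-frequent-value imputation can be sketched in plain pandas. This is only an illustration on made-up toy data, not the actual implementation of ReplaceWithHighestFrequency, whose internals are not shown in this notebook.

```python
import pandas as pd

# Toy frame: the column name 'Genre' mirrors the notebook, the values are made up.
toy = pd.DataFrame({"Genre": ["Drama", None, "Comedy", "Drama", None]})

# Most frequent non-missing value (mode() skips missing values by default).
most_frequent = toy["Genre"].mode().iloc[0]

# Replace the missing entries with that value.
toy["Genre"] = toy["Genre"].fillna(most_frequent)
print(toy["Genre"].tolist())
```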

Let's also assume that, for this Genre column, we want to keep only the 10 most common genres. If an entry contains one or more of these genres, we place a 1.0 in the column for each of them and drop the rest. If an entry contains none of the top genres, it is encoded as all zeroes.

In [14]:
from encoding import KeepTopN

custom_column_encoder = KeepTopN(N = 10)
encoder = DataEncoder(scheme={'Genre' : custom_column_encoder})
x, y = encoder.fit_transform(df)

In [15]:
custom_column_encoder.category_names

['Family',
 'Fantasy',
 'Animation',
 'Adventure',
 'Crime',
 'Romance',
 'Thriller',
 'Action',
 'Comedy',
 'Drama']

These are the top-10 categories as determined by the encoder, listed with the most frequent one ('Drama' in this case) at the bottom.
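
The keep-top-N idea described above can be sketched with the standard library alone. This is a rough illustration, not the KeepTopN implementation: the toy rows, the comma-separated genre format, and the variable names are all assumptions made for the example. Unlike category_names above, this sketch lists the most frequent genre first.

```python
from collections import Counter

# Toy rows; the comma-separated format is an assumption about the 'Genre' column.
rows = ["Drama, Comedy", "Drama, Horror", "Comedy", "Western"]
N = 2  # number of genres to keep (the notebook uses 10)

# Split each entry into its individual genres.
split_rows = [[g.strip() for g in row.split(",")] for row in rows]

# Count genres across all rows and keep the N most common ones.
counts = Counter(g for genres in split_rows for g in genres)
top = [name for name, _ in counts.most_common(N)]

# One column per kept genre: 1.0 if the entry has it, 0.0 otherwise.
# Entries with none of the kept genres become all zeroes.
encoded = [[1.0 if name in genres else 0.0 for name in top] for genres in split_rows]

print(top)
print(encoded)
```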

After trying out simple regression algorithms, XGBoost usually gave the best results (without delving into perceptrons). So for most of the project, different data imputation and encoding schemes are compared using XGBoost. The cells below demonstrate the evaluation pipeline, with scikit-learn's GradientBoostingRegressor as the model.

In [18]:
# Create a model
from method_evaluation import evaluate_model
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(random_state=42, n_estimators=120)
mean, var = evaluate_model(model, x, y)

In [23]:
print(f"Mean: {mean:,.3f} and Variance: {var:,.5f}")

Mean: 0.014 and Variance: 0.00007


The evaluate_model function applies 5-fold cross-validation (by default) and returns the mean and the variance of the test error (MSE for this project). From these two values, we can get a good understanding of how well the model performs: we want a low mean as well as a low variance, since a low variance indicates that the error is consistent across folds.
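
As a rough sketch of what such an evaluation might look like, the snippet below computes the per-fold MSE with scikit-learn's cross_val_score and summarizes it by its mean and variance. The source of evaluate_model is not shown in this notebook, so this is an assumption about its behaviour rather than its actual code. Synthetic data stands in for the encoded x and y.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the notebook's x and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Fewer trees than in the notebook, just to keep the sketch fast.
model = GradientBoostingRegressor(random_state=42, n_estimators=20)

# scikit-learn reports negated MSE (higher is better), so flip the sign back.
scores = -cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")

# Summarize the five per-fold test errors.
mean_mse, var_mse = scores.mean(), scores.var()
print(f"Mean: {mean_mse:,.3f} and Variance: {var_mse:,.5f}")
```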