# Data Cleaning Notebook
In this notebook, I will clean the data and prepare a pipeline for use in the modeling process.

In [20]:
# Import Statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(11.7,8.27)})

In [21]:
# Load training data into dataframe
X_train = pd.read_csv('./Data/Training_Features.csv')

y_train = pd.read_csv('./Data/Training_Labels.csv')

## Training Labels
The only thing that needs to be done to the y_train dataframe is encoding: the string values in status_group need to be turned into numbers. For simplicity, I'll do it by hand rather than using sklearn.

In [22]:
y_train['status_group'].value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

In [25]:
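# Integer-encode the target by hand. The codes are plain labels, not a ranking.
# Assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions warn about and may not apply to y_train.
y_train['status_group'] = y_train['status_group'].replace({'functional': 1, 'non functional': 0, 'functional needs repair': 2})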
y_train['status_group'].value_counts()

1    32259
0    22824
2     4317
Name: status_group, dtype: int64
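
A hand-written mapping is easy to get subtly wrong, since `replace` silently leaves any label it does not recognise unchanged. The sketch below shows one possible guard using `Series.map`, which returns NaN for labels missing from the dictionary. The names `status_codes` and `toy_labels` are hypothetical and the data is a toy stand-in, not the real `y_train`.

```python
import pandas as pd

# Hypothetical name for the same mapping used above.
status_codes = {'functional': 1, 'non functional': 0, 'functional needs repair': 2}

# Toy stand-in for y_train['status_group'].
toy_labels = pd.Series(['functional', 'non functional',
                        'functional needs repair', 'functional'])

# map() yields NaN for any label that is not a key in status_codes,
# so a misspelled or unexpected label can be detected.
encoded = toy_labels.map(status_codes)
assert encoded.notna().all(), 'found a label missing from status_codes'

# Safe to cast once every label is known to be mapped.
encoded = encoded.astype('int64')
```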

## Features and Pipeline
I'll need to do these things before modeling:
- Imputing NaN values
- Ordinal encoding
- One hot encoding

I'll need a separate way of dealing with NaN values depending on whether a column's dtype is numeric or not. I will also need to encode the non-numeric features. I plan to use OHE for anything with fewer than 10 unique values, and ordinal encoding for anything with 10 or more unique values.
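
The plan above could be wired together with scikit-learn roughly as follows. This is a minimal sketch, not the pipeline built later in this project: the toy dataframe, its column names, and the imputation strategies (median for numeric columns, most frequent value for the others) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy stand-in for X_train; the column names are hypothetical.
# dtype=object mirrors the string columns that read_csv produces here.
toy = pd.DataFrame({
    'height': [10.0, np.nan, 30.0, 20.0],
    'source': pd.Series(['river', 'well', np.nan, 'river'], dtype=object),
    'village': pd.Series(['a', 'b', 'c', np.nan], dtype=object),
})

# Numeric columns: fill NaN with the median (assumed strategy).
num_pipe = Pipeline([('impute', SimpleImputer(strategy='median'))])

# Low-cardinality columns: fill NaN with the mode, then one hot encode.
ohe_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

# High-cardinality columns: fill NaN with the mode, then ordinal encode.
# Categories unseen at fit time are encoded as -1.
ord_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OrdinalEncoder(handle_unknown='use_encoded_value',
                              unknown_value=-1)),
])

# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocessor = ColumnTransformer([
    ('num', num_pipe, ['height']),
    ('ohe', ohe_pipe, ['source']),
    ('ord', ord_pipe, ['village']),
], sparse_threshold=0)

toy_clean = preprocessor.fit_transform(toy)
```

In the real notebook, the three hard-coded column lists would be replaced by the lists built in the next cells.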

In [26]:
# initialize three lists of column names
num_cols = []
ohe_cols = []
freq_cols = []

In [27]:
# make the lists of columns
# num = any columns with numerical value
# ohe = any columns with object value with less than 10 unique values
# freq = any columns with object value with 10 or more unique values
for c in X_train.columns:
    if X_train[c].dtype in ['float64', 'int64']:
        num_cols.append(c)
    elif X_train[c].nunique() < 10:
        ohe_cols.append(c)
    else:
        freq_cols.append(c)
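
The same three-way split can also be written without an explicit loop. The sketch below is one possible alternative on a toy dataframe with hypothetical column names. Note one behavioural difference, which is an assumption about what is wanted: `select_dtypes(include='number')` picks up every numeric dtype (for example `int32`), whereas the loop above only matches `float64` and `int64`.

```python
import pandas as pd

# Toy stand-in for X_train with hypothetical column names.
toy = pd.DataFrame({
    'amount': [float(i) for i in range(12)],
    'basin': pd.Series(['x', 'y', 'z'] * 4, dtype=object),
    'village': pd.Series([f'v{i}' for i in range(12)], dtype=object),
})

# Numeric columns by dtype.
toy_num = toy.select_dtypes(include='number').columns.tolist()

# Count unique values in every remaining (non-numeric) column.
n_unique = toy.select_dtypes(exclude='number').nunique()

# Fewer than 10 unique values: one hot encode; 10 or more: the other list.
toy_ohe = n_unique[n_unique < 10].index.tolist()
toy_freq = n_unique[n_unique >= 10].index.tolist()

# Every column should land in exactly one list.
assert sorted(toy_num + toy_ohe + toy_freq) == sorted(toy.columns)
```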