# Data Cleaning Notebook
In this notebook, I will clean the data and prepare a preprocessing pipeline for use in the modeling process. Later, I will use the pipeline to create some basic models.

In [57]:
# Import Statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
sns.set(rc={'figure.figsize':(11.7,8.27)})

In [55]:
# Load the training features, test features, and training labels into dataframes
X_train = pd.read_csv('./Data/Training_Features.csv')

X_test = pd.read_csv('./Data/Test_Features.csv')

y_train = pd.read_csv('./Data/Training_Labels.csv')

## Labels
The only thing that needs to be done to the label dataframe is encoding: the string values need to be turned into numbers. For simplicity, I'll do it by hand rather than using sklearn.

In [37]:
y_train['status_group'].value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

In [38]:
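# Encode the target as integer class codes.
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not propagate to the dataframe.
y_train['status_group'] = y_train['status_group'].map({'functional': 1, 'non functional': 0, 'functional needs repair': 2})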
y_train['status_group'].value_counts()

1    32259
0    22824
2     4317
Name: status_group, dtype: int64
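
Since the encoding is done by hand, it can help to keep the mapping in one place so that predictions can be decoded back to strings later. A minimal sketch on a made-up series of labels; the names `label_to_code` and `code_to_label` are illustrative and do not appear elsewhere in this notebook:

```python
import pandas as pd

# Same codes as used for y_train above.
label_to_code = {'non functional': 0, 'functional': 1, 'functional needs repair': 2}
# Inverse mapping, for turning predicted codes back into label strings.
code_to_label = {code: label for label, code in label_to_code.items()}

# Made-up labels standing in for y_train['status_group'].
labels = pd.Series(['functional', 'non functional', 'functional needs repair', 'functional'])

codes = labels.map(label_to_code)    # strings -> integers
decoded = codes.map(code_to_label)   # integers -> strings
```

Any label missing from the dictionary would come out of `map` as NaN, so a quick `codes.isna().sum()` check is a cheap safeguard.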

## Features and Pipeline
I'll need to do these things before modeling:
- Imputing NaN values
- Ordinal encoding
- One hot encoding

I'll need a separate way of dealing with NaN values depending on whether a column is numeric or not. I will also need to encode the non-numeric features. I plan to one hot encode (OHE) anything with fewer than 10 unique values, and ordinal encode anything with 10 or more unique values.

In [39]:
# initialize three lists of column names
num_cols = []
ohe_cols = []
ord_cols = []

In [40]:
# make the lists of columns
# num = any columns with numerical value
# ohe = any columns with object value with less than 10 unique values
# ord = any columns with object value with 10 or more unique values
for c in X_train.columns:
    if X_train[c].dtype in ['float64', 'int64']:
        num_cols.append(c)
    elif X_train[c].nunique() < 10:
        ohe_cols.append(c)
    else:
        ord_cols.append(c)
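
The splitting rule can be sanity-checked on a small frame before trusting it on the real data. The sketch below is only an illustration: the frame, its column names, and the helper `split_columns` are made up, and it uses `pd.api.types.is_numeric_dtype` rather than the explicit dtype list above, which also treats other integer and float widths (and booleans) as numeric.

```python
import pandas as pd

# Hypothetical toy frame; not the competition data.
toy = pd.DataFrame({
    'amount': [1.0, 2.5, None, 4.0] * 3,       # float column -> numeric
    'basin': ['a', 'b', 'a', 'c'] * 3,         # 3 unique values -> one hot
    'village': [f'v{i}' for i in range(12)],   # 12 unique values -> ordinal
})

def split_columns(df, max_ohe_levels=10):
    """Sort column names into numeric, one hot, and ordinal lists."""
    num, ohe, ord_ = [], [], []
    for c in df.columns:
        if pd.api.types.is_numeric_dtype(df[c]):
            num.append(c)
        elif df[c].nunique() < max_ohe_levels:
            ohe.append(c)
        else:
            ord_.append(c)
    return num, ohe, ord_

toy_num, toy_ohe, toy_ord = split_columns(toy)
print(toy_num, toy_ohe, toy_ord)
```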

In [49]:
# First, the numeric columns.
num_transformer = Pipeline(steps=[
    # Fill missing values with the column median
    ('num_imputer', SimpleImputer(strategy='median'))
    ])

In [53]:
ohe_transformer = Pipeline(steps=[
    # For each missing value, fill in "Unknown".
    ('ohe_imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    # One Hot encode, and ignore unknown categories
    ('oh_encoder', OneHotEncoder(handle_unknown='ignore'))
])

ord_transformer = Pipeline(steps=[
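    # For each missing value, fill in "Unknown".
    ('ord_imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    # Ordinal encode, mapping categories unseen during fitting to -1.
    # OrdinalEncoder does not accept handle_unknown='ignore'; the
    # 'use_encoded_value' option requires scikit-learn >= 0.24.
    ('ord_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),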
])
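
The effect of `handle_unknown` on categories that were not seen during fitting can be checked on toy data. A minimal sketch, assuming scikit-learn 0.24 or later, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'` together with `unknown_value`; the region names are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up categories: 'west' appears only in the new data.
seen = np.array([['north'], ['south'], ['east']], dtype=object)
new = np.array([['south'], ['west']], dtype=object)

# One hot: an unseen category becomes a row of all zeros.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(seen)
ohe_out = ohe.transform(new).toarray()

# Ordinal: an unseen category becomes the chosen sentinel value.
ord_enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
ord_enc.fit(seen)
ord_out = ord_enc.transform(new)

print(ohe_out)   # columns follow the sorted categories: east, north, south
print(ord_out)
```

The sentinel `-1` is a choice made for this sketch; any value outside the range of fitted codes would work.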

In [54]:
# Now that the transformers have been set up, package them together into a single ColumnTransformer.
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', num_transformer, num_cols),
        ('ohe', ohe_transformer, ohe_cols),
        ('ord', ord_transformer, ord_cols)
    ])
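
Before attaching a model, it is worth confirming that a combined transformer of this shape produces a fully numeric array with no missing values. The sketch below rebuilds a small version on made-up data; the frame and the name `toy_prep` are hypothetical, `sparse_threshold=0` is set only so the result is a dense array that is easy to inspect, and scikit-learn 0.24 or later is assumed for the ordinal encoder options.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    'amount': [1.0, np.nan, 3.0, 4.0],
    'basin': pd.Series(['a', 'b', np.nan, 'a'], dtype=object),
    'village': pd.Series(['v1', 'v2', 'v3', np.nan], dtype=object),
})

toy_prep = ColumnTransformer(
    transformers=[
        ('numeric', SimpleImputer(strategy='median'), ['amount']),
        ('ohe', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ]), ['basin']),
        ('ord', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
            ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value',
                                       unknown_value=-1)),
        ]), ['village']),
    ],
    sparse_threshold=0,  # always return a dense array
)

out = toy_prep.fit_transform(toy)
# 1 numeric column + 3 one hot columns ('Unknown', 'a', 'b') + 1 ordinal column
print(out.shape)
```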

# Modeling
Now, I'll use this pipeline to create some simple models. The competition is using accuracy as the primary evaluation metric, so I'll do the same moving forward.

## Logistic Regression
Let's start small with a simple logistic regression model.

In [59]:
# Create a pipeline that chains the preprocessor and the model
lr_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('LogisticReg', LogisticRegression())
])
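
Since accuracy is the metric of interest, one way to score a pipeline like this is `cross_validate` with `scoring='accuracy'`. The sketch below is a stand-in under stated assumptions: it uses synthetic three-class data from `make_classification` and a reduced pipeline (`demo_pipe`, a hypothetical name) instead of the real features, and `max_iter=1000` and `cv=5` are choices made for the sketch, not settings taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Synthetic three-class stand-in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, random_state=0,
)
X_demo[::10, 0] = np.nan  # inject some missing values for the imputer

demo_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('LogisticReg', LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation, scored by accuracy on each held-out fold.
cv_results = cross_validate(demo_pipe, X_demo, y_demo, cv=5, scoring='accuracy')
mean_accuracy = cv_results['test_score'].mean()
print(f'Mean CV accuracy: {mean_accuracy:.3f}')
```

With the notebook's own objects, the analogous call would pass `lr_pipe`, `X_train`, and `y_train['status_group']` in place of the synthetic stand-ins.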