# Data Cleaning Notebook
In this notebook, I clean the data and prepare a preprocessing pipeline for use in the modeling process. Later, I use the pipeline to create some basic models.

In [200]:
# Import Statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
sns.set(rc={'figure.figsize':(11.7,8.27)})

In [176]:
# Load the training features (minus three dropped columns), the test features, and the training labels
X_train = pd.read_csv('./Data/Training_Features.csv')
X_train.drop(columns=['date_recorded', 'permit', 'public_meeting'], inplace=True)
X_test = pd.read_csv('./Data/Test_Features.csv')

y_train = pd.read_csv('./Data/Training_Labels.csv')
y_train = y_train['status_group']

## Labels
The only thing that needs to be done to the labels is encoding: the string values need to be turned into numbers. For simplicity, I'll do it by hand rather than using sklearn.

In [177]:
y_train.value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

In [178]:
# Integer-encode the target. The codes are arbitrary class labels, not an ordering.
y_train = y_train.map({'functional': 1, 'non functional': 0, 'functional needs repair': 2})
y_train.value_counts()

1    32259
0    22824
2     4317
Name: status_group, dtype: int64
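
Because the encoding is done by hand, predictions will eventually need to be turned back into strings. A minimal sketch of one way to do that, assuming the hypothetical names `label_map` and `inverse_label_map` and a small toy series standing in for `y_train`:

```python
import pandas as pd

# Hypothetical names; the mapping itself is the one used in the cell above.
label_map = {'functional': 1, 'non functional': 0, 'functional needs repair': 2}
inverse_label_map = {code: name for name, code in label_map.items()}

# Toy stand-in for y_train so the sketch runs on its own.
toy_labels = pd.Series(
    ['functional', 'non functional', 'functional needs repair', 'functional'],
    name='status_group',
)

encoded = toy_labels.map(label_map)        # strings -> integer codes
decoded = encoded.map(inverse_label_map)   # integer codes -> strings
```

Keeping the mapping in one dictionary means the forward and reverse encodings cannot drift apart.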

## Features and Pipeline
I'll need to do these things before modeling:
- Imputing NaN values
- Ordinal encoding
- One hot encoding

I'll need a separate way of dealing with NaN values depending on whether the dtype of the column is numeric or not. I will also need to encode the non-numeric features. I plan to use OHE for all non-numeric columns.

In [179]:
# Initialize three lists of column names (ord_cols is not filled in this notebook)
num_cols = []
ohe_cols = []
ord_cols = []

In [180]:
# Build the lists of columns
# num = columns with a float64 or int64 dtype
# ohe = all remaining columns (object dtype and anything else)
for c in X_train.columns:
    if X_train[c].dtype in ['float64', 'int64']:
        num_cols.append(c)
    else:
        ohe_cols.append(c)
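
The same numeric / non-numeric split could also be sketched with `DataFrame.select_dtypes`, which picks up every numeric dtype rather than only `float64` and `int64`. This is an alternative to the loop above, not what the notebook runs; the toy frame and its column names (`num_a`, `num_b`, `cat_a`) are made up for illustration:

```python
import pandas as pd

# Toy frame with hypothetical column names, standing in for X_train.
toy = pd.DataFrame({
    'num_a': [0.0, 25.0, None],   # float64 with a missing value
    'num_b': [10, 0, 7],          # int64
    'cat_a': ['x', None, 'y'],    # object
})

# 'number' matches all numeric dtypes; everything else goes to the OHE list.
num_cols_alt = toy.select_dtypes(include='number').columns.tolist()
ohe_cols_alt = toy.select_dtypes(exclude='number').columns.tolist()
```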

In [181]:
# First, the numeric columns.
num_transformer = Pipeline(steps=[
    # Fill missing values with the median value for the column
    ('num_imputer', SimpleImputer(strategy='median')),
    # Scale each column to the [0, 1] range
    ('min_max_scaler', MinMaxScaler())
    ])

In [182]:
# Next, the non-numeric columns.
ohe_transformer = Pipeline(steps=[
    # Fill each missing value with the string "Unknown".
    ('ohe_imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    # One-hot encode, ignoring categories that were not seen during fitting
    ('oh_encoder', OneHotEncoder(handle_unknown='ignore'))
])


In [183]:
# Now that the transformers have been set up, package them together into a single ColumnTransformer.
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', num_transformer, num_cols),
        ('ohe', ohe_transformer, ohe_cols),
    ])
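
To see what a preprocessor like this produces, here is a small self-contained sketch on toy data. The frame, its column names (`num_a`, `cat_a`), and the `toy_` variable names are assumptions for illustration only; the transformer settings mirror the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data with one numeric and one categorical column, each with a missing value.
toy = pd.DataFrame({
    'num_a': [1.0, np.nan, 3.0, 5.0],
    'cat_a': pd.Series(['x', 'y', np.nan, 'x'], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('numeric', Pipeline(steps=[
        ('num_imputer', SimpleImputer(strategy='median')),
        ('scaler', MinMaxScaler()),
    ]), ['num_a']),
    ('ohe', Pipeline(steps=[
        ('ohe_imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
        ('oh_encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), ['cat_a']),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density, so normalize it.
dense = transformed.toarray() if hasattr(transformed, 'toarray') else np.asarray(transformed)

# One scaled numeric column plus one column per category ('Unknown', 'x', 'y').
print(dense.shape)
```

The missing numeric value is replaced by the column median before scaling, and the missing category becomes its own `Unknown` column.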

# Modeling
Now, I'll use this preprocessing pipeline to create some simple models. The competition uses accuracy as its primary evaluation metric, so I'll do the same moving forward.

## Logistic Regression
Let's start small with a simple logistic regression model.

In [197]:
# Create pipeline with preprocessor and Logistic Regression predictor
lr_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # Using 'saga' solver because default solver "lbfgs" was not converging
    ('LogisticReg', LogisticRegression(solver='saga', random_state=15))
])

In [198]:
# Cross-validate (5 folds by default) since we don't have access to the test labels
lr_scores = cross_validate(lr_pipe, X_train, y_train)

In [199]:
# Print mean of all test scores
np.mean(lr_scores['test_score'])

0.7891414141414141
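
To put this score in context, it helps to know what a model that always predicts the most common class would get. A quick sketch using the class counts from the `value_counts()` output earlier in the notebook (the variable names are hypothetical):

```python
# Class counts copied from the value_counts() output in the Labels section.
class_counts = {
    'functional': 32259,
    'non functional': 22824,
    'functional needs repair': 4317,
}

total = sum(class_counts.values())
# Accuracy of always predicting the majority class on the training labels.
majority_baseline = max(class_counts.values()) / total
print(f'{majority_baseline:.3f}')  # roughly 0.543
```

Any model worth keeping should clear this floor comfortably.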

### Analysis
The baseline logistic regression model scored about 0.79 mean cross-validated accuracy. The model will probably perform a little worse on the test data, but this is about what I would expect from this model type with no tuning. Let's also test a decision tree model and see if it performs better than the logistic regression model.

## Sklearn Decision Tree


In [None]:
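# Create pipeline with preprocessor and decision tree predictor
dt_pipe = Pipeline(steps=[
    ('dt_preprocessor', preprocessor),
    # Assumed: default hyperparameters, with random_state matching the logistic regression
    ('DecisionTree', DecisionTreeClassifier(random_state=15))
])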
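
A decision tree can be scored with the same `cross_validate` call used for the logistic regression. The sketch below runs on synthetic data from `make_classification` rather than the real features, so its scores say nothing about this dataset; the `toy_` names and the tree settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the preprocessed features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=15
)

toy_tree = DecisionTreeClassifier(random_state=15)

# Five-fold cross-validation scored on accuracy, as in the logistic regression cell.
toy_scores = cross_validate(toy_tree, X_toy, y_toy, cv=5, scoring='accuracy')
toy_mean_accuracy = np.mean(toy_scores['test_score'])
```

On the real data, the pipeline object would take the place of `toy_tree`, with `X_train` and `y_train` as the inputs.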