# Project Introduction 
This is the motivation and background for the project. 

*THIS IS THE OVERALL GOAL & DESIRED OUTCOME OF OUR PROJECT*

## Process Overview
1. Ingest the Data
    * Import the raw data files (from ```data/raw```)
    * Define & call functions to perform baseline processing
        * Categorical to dummy variables
        * Continuous numerical to standardized scale (mean 0, variance 1)
        * Mapping of Target variable to appropriate classes
    * Save data into ```data/processed``` with appropriate naming
1. Exploratory Data Analysis
    * This is done iteratively with Ingest Data phase above
    * Understand the distribution of values for features of interest
    * Identify anomalies in the data
    * Categories ultimately get transformed as described above for ease of interfacing with ML algorithms
    * Pay particular attention to the output variable
    * Consider plotting the data by latitude/longitude to visualize its geographic distribution
    * Look into the correlation plots for features - see if there are "duplicated" variables
    * Consider what kind of data features may be missing from the data set
    * Note the strengths and limitations of the dataset
    * Consider augmenting the dataset with additional sources
1. Split Data into Training, Validation, & Test Sets
    * Based on the EDA above, determine how to responsibly split the data into subsets
    * Define functions to split up the data
    * Save the split data so that all users work with the same subsets
    * Sample the data subsets to create smaller working subsets for faster analysis (1000s of examples instead of 100000s). Call these ```train_mini```, ```val_mini```, and ```test_mini```
1. KNN with routine normalization of features as baseline
    * Load in the processed data from ```data/processed```
    * Apply KNN for ```k = [1, 3, 5, 7, 9, 11, 13]```
    * For each value of K, call ```classification_report``` to note various metrics
    * Comment on the results and identify next steps and opportunities 
    * Run any follow-up KNN experiments on this initial data that seem worth exploring
1. Feature Reduction for feature relationships & visualization via PCA
    * Apply PCA to reduce data to 2 dimensions 
    * Be sure to create callable functions
    * Save this data as PCA data in ```data/processed```
    * Decompose the PCA to see which features contribute most to each component
1. Visualize 2D PCA output
    * Use matplotlib to visualize the training dataset
    * Identify the key clusters - how many are there?
    * Add coloring with the intended labels to see if there are any relationships
1. Apply Unsupervised Learning to 2D PCA
    * Apply K-Means Clustering
        * Try different numbers of clusters from 2 to 16 (or based on above) 
        * Plot the results in a grid with color coding
    * Apply GMM Clustering
        * Plot results
    * Comment on the results. Draw conclusions
1. Perform more Feature Engineering
    * Based on insights above, fine-tune the dataset while maintaining interpretability
1. Apply Supervised Learning Methods with attention to feature weights
    * Return to KNN - again, this is just memorizing the data, but hopefully we do better than the 1st pass
    * Apply Decision Trees
        * Expand to Random Forests
        * Expand to Gradient Boosting 
    * Apply Logistic Regression
    * For all of the above, examine the feature weights
    * For all of the above, evaluate the interpretability
    * For all of the above, review the performance metrics
1. Conclusions & Recommendations
    * What do we see in the data?
    * Which model performs the best from a performance metrics perspective?
    * Which model performs the best from an interpretability perspective?
    * Which features do we identify as the strongest identifiers?
1. Propose Next Steps
    * If the project continued, this is what we would recommend looking at next...
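
The baseline processing in step 1 (dummy variables, standardization, target mapping) could be sketched roughly as follows. This is only an illustration: the use of pandas, the function name `baseline_process`, and the toy column names are assumptions, not part of the project.

```python
import pandas as pd


def baseline_process(df, categorical_cols, numeric_cols, target_col, target_map):
    """Return a processed copy of df (hypothetical helper, not the project's API)."""
    out = df.copy()
    # Map raw target values to the class labels we want to predict
    out[target_col] = out[target_col].map(target_map)
    # Standardize continuous columns to mean 0 and variance 1 (population std).
    # A constant column would divide by zero here and needs separate handling.
    for col in numeric_cols:
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    # Expand each categorical column into 0/1 dummy columns
    out = pd.get_dummies(out, columns=categorical_cols, dtype=int)
    return out


# Toy data standing in for a file from data/raw
raw = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size_cm": [10.0, 20.0, 30.0, 40.0],
    "outcome": ["yes", "no", "yes", "no"],
})
processed = baseline_process(
    raw, ["color"], ["size_cm"], "outcome", {"no": 0, "yes": 1}
)
print(processed)
```

One design point to keep in mind: if the scaling statistics are computed on the full dataset before splitting, the validation and test rows influence the scaling. Computing the mean and standard deviation on the training subset only avoids that.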

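For step 3, one way to keep the subsets consistent among users is to derive them from a fixed random seed. The sketch below uses a plain random split with assumed fractions and a hypothetical seed; the EDA may instead call for a stratified or grouped split, which this does not implement.

```python
import random


def split_indices(n_rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/val/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def mini(indices, size, seed=42):
    """Draw a fixed-size random subset of indices for faster experiments."""
    return random.Random(seed).sample(indices, min(size, len(indices)))


train, val, test = split_indices(100_000)
train_mini = mini(train, 1000)
val_mini = mini(val, 1000)
test_mini = mini(test, 1000)
print(len(train), len(val), len(test), len(train_mini))
```

Because the shuffle depends only on the seed and the row count, two users who run this on the same row ordering get identical subsets. Saving the index lists alongside the data removes even that dependency.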

### Python Library Imports

# To format the notebook to include plots and figures inline
%matplotlib inline

# General Python libraries
import matplotlib.pyplot as plt
import numpy as np
import os
import re
import time

# Feature Extraction & Unsupervised Libraries
from sklearn.feature_extraction.text import *
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture  # replaces GMM, which was removed in scikit-learn 0.20

from matplotlib.colors import LogNorm

# Learning/Model Libraries
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
# Add trees libraries

# Evaluation Libraries
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
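
As an illustration of the KNN baseline in step 4, the loop over `k` might look like the sketch below. The synthetic data from `make_classification` is only a stand-in for the files in `data/processed`, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed dataset (assumption, not project data)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training data only, then apply it to both subsets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

reports = {}
for k in [1, 3, 5, 7, 9, 11, 13]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    y_pred = knn.predict(X_val_s)
    # output_dict=True returns the metrics as a nested dict instead of text
    reports[k] = classification_report(y_val, y_pred, output_dict=True)
    print(f"k={k:2d}  accuracy={reports[k]['accuracy']:.3f}")
```

Keeping the reports in a dict keyed by `k` makes it easy to compare precision, recall, and F1 across values afterwards.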

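Steps 5 through 7 (PCA to 2 dimensions, then K-Means and GMM clustering) could be prototyped along these lines. The blob data, the choice of 3 mixture components, and all variable names are assumptions for illustration; plotting is left out so the sketch runs without a display.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the processed training features (assumption)
X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=0)

# Reduce to 2 dimensions
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X)

# components_ has shape (2, n_features): one row of feature loadings per component
for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:3]
    print(f"PC{i + 1}: largest-loading feature indices {top.tolist()}")

# K-Means for each cluster count from 2 to 16
kmeans_labels = {}
for n_clusters in range(2, 17):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans_labels[n_clusters] = km.fit_predict(X_2d)

# Gaussian mixture model with an assumed 3 components
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_2d)
gmm_labels = gmm.predict(X_2d)
```

Each label array can be passed as the `c` argument of `plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)` to build the color-coded grid described in the outline.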