**This is an ML script to code Naive Bayes and Logistic Regression models for our project, starting with 8 features.**

Naive Bayes Overview:
"...if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification. Different types of naive Bayes classifiers rest on different naive assumptions about the data..." (VanderPlas, Jake. Python Data Science Handbook. O'Reilly Media, Inc.: 2016.)

Logistic Regression Overview:
Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X. (https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8)
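
To make "predicts P(Y=1) as a function of X" concrete, here is a minimal sketch of the logistic (sigmoid) function applied to a linear combination of features. The weights, intercept, and inputs are made-up values for illustration only; they are not fitted to the HMDA data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba_positive(X, weights, intercept):
    """Return P(Y=1 | X) for each row of X under a logistic model."""
    return sigmoid(X @ weights + intercept)

# Hypothetical coefficients and inputs, chosen only for illustration
demo_weights = np.array([0.8, -0.5])
demo_intercept = 0.1
X_sigmoid_demo = np.array([[0.0, 0.0], [2.0, 1.0], [-3.0, 2.0]])

demo_probs = predict_proba_positive(X_sigmoid_demo, demo_weights, demo_intercept)
demo_labels = (demo_probs >= 0.5).astype(int)  # threshold at 0.5 to get a class

print(demo_probs)
print(demo_labels)
```

Thresholding at 0.5 is the usual default, but the cutoff is a modeling choice and could be tuned for the project.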

Description of Target (replace with a .gov source in the final notebook; this is enough to get started, though):
https://regulatorysol.com/action-taken-action-taken-date/

In [None]:
%matplotlib inline

import os
import json
import time
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

In [None]:
#delete this in a future version of the notebook
root = ''

**Import 2017 sample of 25,000 observations.** Note the import warning: "Columns (29,30,39,40) have mixed types. Specify dtype option on import or set low_memory=False."

In [None]:
# Fetch the data if required
filepath = os.path.abspath(os.path.join( "..", "fixtures", "hmda2017sample.csv"))
DATA = pd.read_csv(filepath)
DATA.describe(include='all')

**Write the initial script using the subset of features that are already int or float, plus the target.** A future version of the script will address the full set of features and will move away from the lambda function for readability.

In [None]:
DATA['action_taken'] = DATA.action_taken_name.apply(lambda x: 1 if x in ['Loan purchased by the institution', 'Loan originated'] else 0)
pd.crosstab(DATA['action_taken_name'],DATA['action_taken'], margins=True)
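
One possible way to drop the lambda is pandas' `Series.isin`. This is a sketch on a tiny made-up frame, not the real sample; the names `POSITIVE_ACTIONS` and `encode_action_taken` are hypothetical.

```python
import pandas as pd

# Actions treated as the positive class, copied from the lambda above
POSITIVE_ACTIONS = ["Loan purchased by the institution", "Loan originated"]

def encode_action_taken(action_names):
    """Return 1 where the action is an origination or purchase, else 0."""
    return action_names.isin(POSITIVE_ACTIONS).astype(int)

# Made-up rows for illustration only
demo_actions = pd.DataFrame({
    "action_taken_name": [
        "Loan originated",
        "Application denied by financial institution",
        "Loan purchased by the institution",
    ]
})
demo_actions["action_taken"] = encode_action_taken(demo_actions["action_taken_name"])
print(demo_actions)
```

Missing values are mapped to 0 here, which matches what the lambda does.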

In [None]:
DATA = DATA[['tract_to_msamd_income', 
            'population', 
            'minority_population', 
            'number_of_owner_occupied_units', 
            'number_of_1_to_4_family_units', 
            'loan_amount_000s', 
            'hud_median_family_income', 
            'applicant_income_000s', 
            'action_taken']]
DATA.info()
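# Fill missing values (e.g. in applicant_income_000s) with each column's mean.
# Assigning the result avoids an in-place update on a slice of the original frame.
DATA = DATA.fillna(DATA.mean())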
DATA.info()

In [None]:
#TO DO: fix column [0]
tofilepath = os.path.abspath(os.path.join( "..", "fixtures", "hmda2017sample_test.csv"))
DATA.to_csv(tofilepath, index=False)

In [None]:
FEATURES  = [
    'tract_to_msamd_income', 
    'population', 
    'minority_population', 
    'number_of_owner_occupied_units', 
    'number_of_1_to_4_family_units', 
    'loan_amount_000s', 
    'hud_median_family_income', 
    'applicant_income_000s', 
    'action_taken'
]

ACTION_TAKEN_MAP = {
    1: "originated or purchased",
    0: "other"
}

In [None]:
# Determine the shape of the data (the last column is the target, not a feature)
print("{} instances with {} features\n".format(DATA.shape[0], DATA.shape[1] - 1))

# Determine the frequency of each class
print(pd.crosstab(index=DATA['action_taken'], columns="count"))

**Stage the data for ML algorithms.** We still need to determine whether y can stay binary or must be label-encoded for Scikit-Learn, Yellowbrick, and other libraries to work.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Extract our X and y data
X = DATA[FEATURES[:-1]]
y = DATA['action_taken']

# Encode our target variable
encoder = LabelEncoder().fit(y)
y = encoder.transform(y)

print(X.shape, y.shape)
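
On the question of whether a binary y needs encoding: Scikit-Learn classifiers accept integer targets directly, and for a target already coded 0/1 `LabelEncoder` returns the same values. A quick check on a made-up array (the `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up binary target, coded the same way as action_taken
y_binary_demo = np.array([0, 1, 1, 0, 1])

encoder_demo = LabelEncoder().fit(y_binary_demo)
y_encoded_demo = encoder_demo.transform(y_binary_demo)

print(encoder_demo.classes_)                           # classes found, in sorted order
print(np.array_equal(y_binary_demo, y_encoded_demo))   # whether encoding changed anything
```

This says nothing about other libraries' requirements, which would need to be checked separately.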

In [None]:
# Create a scatter matrix of the dataframe features
from pandas.plotting import scatter_matrix
scatter_matrix(X, alpha=0.2, figsize=(12, 12), diagonal='kde')
plt.show()

## Data Extraction 

One way that we can structure our data for easy management is to save files on disk. The Scikit-Learn datasets are already structured this way, and when loaded into a `Bunch` (a dictionary-like class that Scikit-Learn's dataset loaders return) we can expose a data API that is very similar to how we've trained on our toy datasets in the past. A `Bunch` object exposes some important properties:

- **data**: array of shape `n_samples` * `n_features`
- **target**: array of length `n_samples`
- **feature_names**: names of the features
- **target_names**: names of the targets
- **filenames**: names of the files that were loaded
- **DESCR**: contents of the readme
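
As a small illustration of the properties listed above, here is a sketch that builds a `Bunch` by hand from made-up values. The feature names, file name, and description are placeholders, not the HMDA fixtures.

```python
import numpy as np
from sklearn.utils import Bunch

# All values below are toy placeholders for illustration
toy_bunch = Bunch(
    data=np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    target=np.array([0, 1, 0]),
    feature_names=["feature_a", "feature_b"],
    target_names=["other", "originated or purchased"],
    filenames={"data": "toy.csv"},
    DESCR="Toy dataset used only to illustrate the Bunch API.",
)

# A Bunch supports both attribute access and dictionary-style access
print(toy_bunch.data.shape)
print(toy_bunch["target"])
print(toy_bunch.feature_names)
```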

**Note**: This does not preclude database storage of the data; in fact, a database can easily be extended to expose the same `Bunch` API. Simply store the README and features in a dataset description table and load them from there. The filenames property will be redundant, but you could store a SQL statement that shows the data load.

In order to manage our data set _on disk_, we'll structure our data as follows:

In [None]:
# Bunch lives in sklearn.utils; sklearn.datasets.base was removed in scikit-learn 0.24
from sklearn.utils import Bunch

In [None]:
def load_data(root=root):
    # Construct the `Bunch` for the HMDA dataset
    filenames     = {
        'meta': os.path.join(root, 'fixtures','hmdameta.json'),
        'rdme': os.path.join(root, 'fixtures','hmdareadme.txt'),
        'data': os.path.join(root, 'fixtures','hmda2017sample_test.csv'),
    }

    # Load the meta data from the meta json
    with open(filenames['meta'], 'r') as f:
        meta = json.load(f)
        target_names  = meta['target_names']
        feature_names = meta['feature_names']

    # Load the description from the README. 
    with open(filenames['rdme'], 'r') as f:
        DESCR = f.read()

    # Load the dataset from the text file.
    dataset = np.loadtxt(filenames['data'], delimiter = ",", dtype=str, skiprows=1)

    # Extract the target from the data
    data   = dataset[:, 0:8]
    data = data.astype(np.float64)
    target = dataset[:, -1]
    target = target.astype(np.float64)

    # Create the bunch object
    return Bunch(
        data=data,
        target=target,
        filenames=filenames,
        target_names=target_names,
        feature_names=feature_names,
        DESCR=DESCR
    )

# Save the dataset as a variable we can use.
dataset = load_data()

print(dataset.data.shape)
print(dataset.target.shape)
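
`load_data` expects the meta JSON to provide `target_names` and `feature_names` keys. The actual contents of `hmdameta.json` are not shown in this notebook, so the following is only a sketch of how such a file could be written, assuming its values mirror `FEATURES` and `ACTION_TAKEN_MAP`. It writes to a temporary directory so the real fixtures are not touched.

```python
import json
import os
import tempfile

# Assumed contents: the 8 features used above and the two target labels
meta_sketch = {
    "feature_names": [
        "tract_to_msamd_income",
        "population",
        "minority_population",
        "number_of_owner_occupied_units",
        "number_of_1_to_4_family_units",
        "loan_amount_000s",
        "hud_median_family_income",
        "applicant_income_000s",
    ],
    "target_names": ["other", "originated or purchased"],
}

# Write to a temporary directory rather than the project's fixtures folder
meta_path = os.path.join(tempfile.mkdtemp(), "hmdameta.json")
with open(meta_path, "w") as f:
    json.dump(meta_sketch, f, indent=2)

# Read it back the same way load_data does
with open(meta_path, "r") as f:
    meta_loaded = json.load(f)

print(meta_loaded["target_names"])
print(len(meta_loaded["feature_names"]))
```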

## Classification 

Now that we have a dataset `Bunch` loaded and ready, we can begin the classification process. Let's attempt to build a classifier with Gaussian Naive Bayes, Logistic Regression, and SVM classifiers.

In [None]:
from sklearn import metrics

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [None]:
def fit_and_evaluate(dataset, model, label, **kwargs):
    start  = time.time() # Start the clock! 
    scores = {'precision':[], 'recall':[], 'accuracy':[], 'f1':[]}
    
    kf = KFold(n_splits = 12, shuffle=True)
    
    for train, test in kf.split(dataset.data):
        X_train, X_test = dataset.data[train], dataset.data[test]
        y_train, y_test = dataset.target[train], dataset.target[test]
        
        estimator = model(**kwargs)
        estimator.fit(X_train, y_train)
        
        expected  = y_test
        predicted = estimator.predict(X_test)
        
        # Append our scores to the tracker
        scores['precision'].append(metrics.precision_score(expected, predicted, average="weighted"))
        scores['recall'].append(metrics.recall_score(expected, predicted, average="weighted"))
        scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
        scores['f1'].append(metrics.f1_score(expected, predicted, average="weighted"))

    # Report
    print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time()-start))
    print("Validation scores are as follows:\n")
    print(pd.DataFrame(scores).mean())
    
    # Write official estimator to disk
    estimator = model(**kwargs)
    estimator.fit(dataset.data, dataset.target)
    
    outpath = label.lower().replace(" ", "-") + ".pickle"
    with open(outpath, 'wb') as f:
        pickle.dump(estimator, f)

    print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))
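
For comparison, Scikit-Learn's `cross_validate` can compute the same four scores without a hand-written loop. This is a sketch on synthetic data from `make_classification`, not the HMDA sample. It also swaps in `StratifiedKFold`, which keeps the class balance similar in each fold; that is a substitution for the plain `KFold` used above, not the notebook's current method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for dataset.data / dataset.target (8 features, binary target)
X_cv_demo, y_cv_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# 12 shuffled folds, as in fit_and_evaluate, but stratified by class
cv_demo = StratifiedKFold(n_splits=12, shuffle=True, random_state=0)
scoring_demo = ["precision_weighted", "recall_weighted", "accuracy", "f1_weighted"]

cv_results = cross_validate(GaussianNB(), X_cv_demo, y_cv_demo, cv=cv_demo, scoring=scoring_demo)

# Mean of each score across the folds
for name in scoring_demo:
    print(name, cv_results["test_" + name].mean())
```

Fixing `random_state` makes the folds reproducible between runs, which the unseeded `KFold` above does not guarantee.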
    

In [None]:
# Perform Gaussian Naive Bayes
# need to try this out and extend to MultinomialNB, introducing Pipeline
fit_and_evaluate(dataset, GaussianNB, "Gaussian Naive Bayes")
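
To make the "naive" assumption behind Gaussian Naive Bayes concrete, here is a from-scratch sketch in NumPy on toy data. It is not Scikit-Learn's implementation (which adds its own variance smoothing); the function names are hypothetical and the data is made up.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = {
            "prior": rows.shape[0] / X.shape[0],
            "mean": rows.mean(axis=0),
            "var": rows.var(axis=0) + 1e-9,  # small constant avoids division by zero
        }
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log posterior for each row of X."""
    labels = sorted(params)
    scores = []
    for label in labels:
        p = params[label]
        # Naive assumption: features are independent given the class,
        # so the joint log-likelihood is a sum over features.
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * p["var"]) + (X - p["mean"]) ** 2 / p["var"],
            axis=1,
        )
        scores.append(np.log(p["prior"]) + log_likelihood)
    # scores has one row per class; choose the best class for each sample
    return np.array(labels)[np.argmax(np.vstack(scores), axis=0)]

# Two well-separated toy clusters
X_nb_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                     [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_nb_toy = np.array([0, 0, 0, 1, 1, 1])

nb_params = fit_gaussian_nb(X_nb_toy, y_nb_toy)
nb_predictions = predict_gaussian_nb(nb_params, np.array([[1.1, 2.1], [5.1, 5.9]]))
print(nb_predictions)
```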

In [None]:
# Perform Logistic Regression 
fit_and_evaluate(dataset, LogisticRegression, "Logistic Regression")

In [None]:
# Perform SVM classification
fit_and_evaluate(dataset, SVC, "SVM Classifier", gamma='auto')
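
`Pipeline` is imported above but not yet used. Logistic regression and SVMs are often sensitive to feature scale, so one option is to standardize the features before fitting. The sketch below assumes that scaling is wanted here, which has not been verified on the HMDA sample; `scaled_logistic_regression` is a hypothetical helper and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def scaled_logistic_regression(**kwargs):
    """Hypothetical factory: standardize features, then fit LogisticRegression."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(**kwargs)),
    ])

# Synthetic stand-in for the 8-feature dataset
X_pipe_demo, y_pipe_demo = make_classification(n_samples=300, n_features=8, random_state=0)

pipe_demo = scaled_logistic_regression(max_iter=1000)
pipe_demo.fit(X_pipe_demo, y_pipe_demo)
pipe_predictions = pipe_demo.predict(X_pipe_demo)
print(pipe_predictions[:5])
```

Because `fit_and_evaluate` builds its estimator by calling `model(**kwargs)`, a factory like this could in principle be passed in place of a class, for example `fit_and_evaluate(dataset, scaled_logistic_regression, "Scaled Logistic Regression", max_iter=1000)`.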