# Model Training
Authors: Wei Mai, Christina Xu

## 1. Model Selection

Several classifiers are available. Here we train the following three, compare their performance, and select one.

1. [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
2. [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
3. [Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)


In [7]:
# models 
from sklearn import tree 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

## 1.1 Load training and testing data

In [56]:
import numpy as np

data_file_path = '../data/traintest/'
model_file_path = '../models/'

# load the prepared feature matrices and label vectors
X_train_prepared = np.loadtxt(data_file_path + "X_train_prepared.csv", delimiter=",")
X_test_prepared = np.loadtxt(data_file_path + "X_test_prepared.csv", delimiter=",")
y_train_prepared = np.loadtxt(data_file_path + "y_train_prepared.csv", delimiter=",")
y_test_prepared = np.loadtxt(data_file_path + "y_test_prepared.csv", delimiter=",")
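
Before training, it can help to confirm that the loaded arrays line up. The following is a minimal sketch, not part of the original notebook: `check_split_shapes` is a hypothetical helper, and it assumes the feature files load as 2-D arrays and the label files as 1-D arrays. Toy arrays stand in for the CSV contents.

```python
import numpy as np

def check_split_shapes(X_train, X_test, y_train, y_test):
    """Raise ValueError if the train/test arrays are inconsistent (hypothetical helper)."""
    if X_train.shape[0] != y_train.shape[0]:
        raise ValueError("X_train and y_train have different numbers of rows")
    if X_test.shape[0] != y_test.shape[0]:
        raise ValueError("X_test and y_test have different numbers of rows")
    if X_train.shape[1] != X_test.shape[1]:
        raise ValueError("train and test sets have different numbers of features")
    return X_train.shape, X_test.shape

# toy arrays standing in for the arrays loaded from CSV
shapes = check_split_shapes(np.zeros((4, 3)), np.zeros((2, 3)), np.zeros(4), np.zeros(2))
print(shapes)
```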

## 1.2 Define a function to evaluate model performance

In [61]:
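from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

def evaluate_model(model):
    """Print and return training-set metrics for a fitted model."""
    # accuracy on the training data itself (not on a held-out set)
    accuracy = model.score(X_train_prepared, y_train_prepared)
    # 3-fold cross-validated accuracy; cross_val_score refits clones of the model
    cross_validation = cross_val_score(model, X_train_prepared, y_train_prepared, cv=3, scoring='accuracy')
    y_pred = model.predict(X_train_prepared)
    # mean_squared_error returns the MSE (no square root is taken); the
    # printed label "RMSE" is kept so the output matches the recorded runs
    mse = mean_squared_error(y_train_prepared, y_pred)
    print(f'Accuracy: {accuracy}' + f'\nCross validation score: {cross_validation}' + f'\nRMSE: {mse}')
    return {'accuracy': accuracy, 'cross_validation': cross_validation, 'mse': mse}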
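
The value that `evaluate_model` prints under the label "RMSE" comes straight from `mean_squared_error`, which returns the mean squared error without taking a root. The sketch below uses made-up 0/1 labels (an assumption; the notebook does not state the label encoding) to show how MSE, RMSE, and accuracy relate for a binary classifier. With 0/1 labels, MSE equals `1 - accuracy`, which would be consistent with the logistic regression values reported in section 2.3 summing to 1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# toy 0/1 labels (illustrative only, not taken from the dataset)
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0])

mse = mean_squared_error(y_true, y_pred)   # mean of the squared errors
rmse = np.sqrt(mse)                        # square root of the MSE
accuracy = accuracy_score(y_true, y_pred)  # fraction of correct predictions

# each misclassified 0/1 label contributes a squared error of 1,
# so the MSE is the misclassification rate, i.e. 1 - accuracy
print(mse, rmse, accuracy)
```

For a classifier, accuracy and cross-validated accuracy are usually the more informative numbers; the squared-error metric adds nothing beyond the error rate in the 0/1 case.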

## 2. Training and Evaluation

## 2.1 Decision Tree

https://scikit-learn.org/stable/modules/tree.html

In [29]:
# model training
dtc = tree.DecisionTreeClassifier()

decision_tree = dtc.fit(X_train_prepared, y_train_prepared)

In [62]:
# model evaluation
eval_metrics0 = evaluate_model(decision_tree)

Accuracy: 1.0
Cross validation score: [0.82199441 0.83038211 0.83970177]
RMSE: 0.0
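
A training accuracy of 1.0 alongside cross-validation scores near 0.83 suggests the unrestricted tree has memorized the training set. The sketch below, which uses synthetic data from `make_classification` as a stand-in for the real training set, compares an unrestricted tree with a depth-limited one. The depth of 4 is an arbitrary illustrative choice, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data with 10% label noise (a stand-in for the real training set)
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.1, random_state=0)

results = {}
for depth in (None, 4):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_acc = clf.fit(X, y).score(X, y)  # accuracy on the data the tree was fit on
    cv_acc = cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()
    results[depth] = (train_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

The gap between the training and cross-validated accuracy is the quantity to watch; the cross-validated score is the better guide to how the model behaves on unseen data.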


In [58]:
# save model
import pickle
with open(model_file_path + 'dtc.pkl', 'wb') as f:
    pickle.dump(decision_tree, f)
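
A saved model is only useful if it can be loaded back. The following sketch shows a pickle round trip on a small synthetic dataset, writing to a temporary directory rather than to `../models/`. The data and the temporary path are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# small synthetic dataset standing in for the prepared training data
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "dtc.pkl"
    # write the fitted model to disk
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    # read it back into a new object
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# the restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a model is best loaded with the same scikit-learn version that saved it.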

## 2.2 Random Forest 
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

In [33]:
# model training
rfc = RandomForestClassifier()

random_forest = rfc.fit(X_train_prepared, y_train_prepared)

In [63]:
eval_metrics1 = evaluate_model(random_forest)

Accuracy: 1.0
Cross validation score: [0.89561976 0.91053122 0.92078285]
RMSE: 0.0


In [59]:
# save model
with open(model_file_path + 'rfc.pkl', 'wb') as f:
    pickle.dump(random_forest, f)

## 2.3 Logistic Regression
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [41]:
# model training
lr = LogisticRegression()

log_reg = lr.fit(X_train_prepared, y_train_prepared)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
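
The warning above offers two remedies: raise `max_iter` or scale the features. A minimal sketch of both combined in one pipeline follows. It uses synthetic data with deliberately inflated feature values, since the notebook does not say whether the prepared data were already scaled, and `max_iter=1000` is an illustrative value rather than a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data; features are multiplied up to mimic unscaled inputs
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X = X * 1000.0

# standardize the features, then fit logistic regression with a higher iteration cap
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X, y)

# n_iter_ holds the number of solver iterations actually used
n_iter = scaled_lr.named_steps["logisticregression"].n_iter_
print(n_iter)
```

Putting the scaler inside the pipeline means it is refit on each training fold during cross-validation, so no information from the validation fold leaks into the scaling.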


In [64]:
eval_metrics2 = evaluate_model(log_reg)

Accuracy: 0.8990369680024852
Cross validation score: [0.88536813 0.89282386 0.90866729]
RMSE: 0.10096303199751476


In [60]:
# save model
with open(model_file_path + 'lr.pkl', 'wb') as f:
    pickle.dump(log_reg, f)
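
In the recorded outputs above, the random forest has the highest cross-validation score in every fold, ahead of logistic regression and the decision tree. One way to make that comparison explicit is to collect the mean cross-validated accuracy for each candidate and pick the largest. The sketch below does this on synthetic data; the dictionary names and the fixed `random_state` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data standing in for the prepared training set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# mean 3-fold cross-validated accuracy per model
mean_cv = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# name of the model with the highest mean score
best_name = max(mean_cv, key=mean_cv.get)
print(mean_cv, best_name)
```

Once a model is selected this way, a single final check against the held-out test set gives an estimate that was not used in the selection.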

## 3. Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# set the random seed for reproducibility
np.random.seed(42)

# number of trees
n_estimators = np.arange(10, 1000, 50)

# maximum number of levels in tree
max_depth = [None, 3, 5, 10]

# minimum number of samples required to split a node
min_samples_split = np.arange(2, 20, 2)

# minimum number of samples required at each leaf node
min_samples_leaf = np.arange(1, 20, 2)

# hyperparameter grid for RandomForestClassifier
rf_grid = {"n_estimators": n_estimators,
           "max_depth": max_depth,
           "min_samples_split": min_samples_split,
           "min_samples_leaf": min_samples_leaf}


# initialize the random search model with a fresh, unfitted random forest
rf_random = RandomizedSearchCV(estimator=RandomForestClassifier(),
                               param_distributions=rf_grid,
                               cv=5,
                               n_iter=20,
                               verbose=True)
                           
# fit the random search model
rf_random.fit(X_train_prepared, y_train_prepared)
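
After fitting, the search object exposes the winning configuration. The sketch below runs a much smaller search on synthetic data to show the attributes involved; the reduced grid, `n_iter=3`, and `cv=3` are assumptions chosen to keep the example fast, not the settings used above. Passing `random_state` to `RandomizedSearchCV` is a more explicit alternative to seeding NumPy globally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic data standing in for the prepared training set
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# deliberately tiny grid so the example runs quickly
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 3]}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=small_grid,
    n_iter=3,          # sample 3 of the 4 possible combinations
    cv=3,
    random_state=42,   # makes the sampled combinations reproducible
)
search.fit(X, y)

print(search.best_params_)   # sampled combination with the highest mean CV score
print(search.best_score_)    # that mean cross-validated score
best_model = search.best_estimator_  # refit on all of X, y with the best parameters
```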