# Comparison of how to train different models

Training different models follows a similar pattern:
1. Prepare the data in a format that `sklearn` can understand (i.e. feature data in a 2-dimensional array, and target data in a 1-dimensional array)
2. Split the data into a training set and a test set
3. Choose a model (e.g. `LinearRegression`, `LogisticRegression`, `RandomForestClassifier`, etc.) and train it using the `.fit()` method
4. Evaluate the model
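
To make the shape requirement in step 1 concrete, here is a minimal sketch using NumPy. The `heights` and `labels` values are made up for illustration and are not part of the iris data; the point is only that a single feature stored as a flat array has to be reshaped into a 2-dimensional input with one column, while the target stays 1-dimensional.

```python
import numpy as np

# Hypothetical data: one feature per sample, stored as a flat 1-D array.
heights = np.array([1.5, 1.7, 1.6, 1.8])
labels = np.array([0, 1, 0, 1])

# sklearn estimators expect the input as (n_samples, n_features),
# so a single feature becomes a column vector.
X = heights.reshape(-1, 1)
# The target stays 1-dimensional: (n_samples,)
y = labels

print(X.shape)  # (4, 1)
print(y.shape)  # (4,)
```

The iris dataset loaded below already comes in this layout, so no reshaping is needed there.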

In [5]:
# load libraries
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split

# load the iris dataset
dataset = datasets.load_iris()

In [57]:
# defining some helper methods
def print_header(title):
    newline = "\n============================================\n"
    print(newline + title + newline)
    
def print_model_header(title):
    newline = "\n****************************************************\n"
    print(newline + title + newline)

def print_evaluation_tables(model):
    print_model_header(type(model).__name__)
    # 4. Evaluate our model (uses the notebook globals X_train, X_test, y_train, y_test)
    training_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print_header("COMPARING SCORES OF TRAINING AND TEST SET")
    print("training set score: %f" % training_score)
    print("test set score: %f" % test_score)

    # 5. Make predictions on the full dataset (training and test rows together)
    expected = dataset.target
    predicted = model.predict(dataset.data)
    print_header("CLASSIFICATION REPORT")
    classification_report = metrics.classification_report(expected, predicted)
    print(classification_report)

    print_header("CONFUSION MATRIX")
    confusion_matrix = metrics.confusion_matrix(expected, predicted)
    print(confusion_matrix)

## Common steps for preparing data for modeling

In [28]:
# 1. prepare data into a format that sklearn can understand
X = dataset.data
y = dataset.target

# 2. Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
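
As a quick check on what the two steps above produce, the sketch below repeats them and prints the array shapes. It assumes `sklearn` is installed and relies on the default behaviour of `train_test_split`, which holds out 25% of the rows when no `test_size` is given.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# No test_size is passed, so 25% of the 150 rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X.shape, y.shape)            # (150, 4) (150,)
print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
```

These sizes explain the scores printed later: a score is a fraction of 112 rows on the training set and a fraction of 38 rows on the test set.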

## Logistic Regression

In [36]:
# 3. Choose our model and train our model 
from sklearn.linear_model import LogisticRegression     ## IMPORTANT: These 2 lines are the only lines that 
logistic_regression_model = LogisticRegression()        ## you need to change to build a different model
logistic_regression_model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [30]:
# 4. Evaluate our model
training_score = logistic_regression_model.score(X_train, y_train)
test_score = logistic_regression_model.score(X_test, y_test)
print_header("COMPARING SCORES OF TRAINING AND TEST SET")
print("training set score: %f" % training_score)
print("test set score: %f" % test_score)

# 5. Make predictions on the full dataset (training and test rows together)
expected = dataset.target
predicted = logistic_regression_model.predict(dataset.data)
print_header("CLASSIFICATION REPORT")
classification_report = metrics.classification_report(expected, predicted)
print(classification_report)

print_header("CONFUSION MATRIX")
confusion_matrix = metrics.confusion_matrix(expected, predicted)
print(confusion_matrix)


COMPARING SCORES OF TRAINING AND TEST SET

training set score: 0.946429
test set score: 0.868421

CLASSIFICATION REPORT

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       1.00      0.78      0.88        50
          2       0.82      1.00      0.90        50

avg / total       0.94      0.93      0.93       150


CONFUSION MATRIX

[[50  0  0]
 [ 0 39 11]
 [ 0  0 50]]
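
The numbers in the classification report can be recovered by hand from the confusion matrix. The sketch below is plain Python rather than a call into `sklearn`; it copies the matrix printed above and assumes the usual `sklearn` layout, where rows are the true classes and columns are the predicted classes.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
confusion = [
    [50, 0, 0],
    [0, 39, 11],
    [0, 0, 50],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[k][k] for k in range(n_classes))
accuracy = correct / total  # fraction of all samples on the diagonal

precisions = []
recalls = []
for k in range(n_classes):
    true_positives = confusion[k][k]
    predicted_as_k = sum(confusion[i][k] for i in range(n_classes))  # column sum
    actually_k = sum(confusion[k])                                   # row sum
    precisions.append(true_positives / predicted_as_k)
    recalls.append(true_positives / actually_k)
    print("class %d: precision %.2f, recall %.2f" % (k, precisions[k], recalls[k]))

print("accuracy: %.4f" % accuracy)
```

For class 1, 39 of the 50 true samples are found (recall 0.78) and every sample predicted as class 1 is correct (precision 1.00). For class 2, all 50 samples are found, but 11 of the 61 predictions are wrong, which gives the precision of 0.82 shown in the report.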


## Naive Bayes

In [37]:
# 3. Choose our model and train our model 
from sklearn.naive_bayes import GaussianNB
naive_bayes_model = GaussianNB()
naive_bayes_model.fit(X_train, y_train)

GaussianNB(priors=None)

## k-Nearest Neighbour

In [38]:
# 3. Choose our model and train our model 
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

## Decision Trees

In [None]:
# 3. Choose our model and train our model 
from sklearn.tree import DecisionTreeClassifier
decision_trees_model = DecisionTreeClassifier()
decision_trees_model.fit(X_train, y_train)


## Support Vector Machines

In [None]:
# 3. Choose our model and train our model 
from sklearn.svm import SVC
svm_model = SVC()
svm_model.fit(X_train, y_train)

In [58]:
models = [logistic_regression_model, naive_bayes_model, knn_model, decision_trees_model, svm_model]
for model in models:
    print_evaluation_tables(model)


****************************************************
LogisticRegression
****************************************************


COMPARING SCORES OF TRAINING AND TEST SET

training set score: 0.946429
test set score: 0.868421

CLASSIFICATION REPORT

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       1.00      0.78      0.88        50
          2       0.82      1.00      0.90        50

avg / total       0.94      0.93      0.93       150


CONFUSION MATRIX

[[50  0  0]
 [ 0 39 11]
 [ 0  0 50]]

****************************************************
GaussianNB
****************************************************


COMPARING SCORES OF TRAINING AND TEST SET

training set score: 0.946429
test set score: 1.000000

CLASSIFICATION REPORT

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       0.94      0.94      0.94        50
          2       0.94    