# Hands-on 5: Classification - Naive Bayes


## Task 3: Comparing different classifiers on the Iris dataset

  - Using the training/test splits from Practicum 2, train different classifiers and compare their performance.
  - Documentation for the classifiers can be found here:
    * [Decision Trees](http://scikit-learn.org/stable/modules/tree.html)
    * [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html)
    * [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html)
    * [Random Forest](http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees)

## References

  - Scikit-learn
    * [Tutorial](http://scikit-learn.org/stable/tutorial/index.html)
    * [Supervised learning](http://scikit-learn.org/stable/supervised_learning.html)
    * [Class and function reference](http://scikit-learn.org/stable/modules/classes.html)
    * [Dataset loading utilities](http://scikit-learn.org/stable/datasets/index.html)
  - [Scikit-learn tutorial video](https://vimeo.com/53062607) and [online material](http://www.astroml.org/sklearn_tutorial/)

Train different classifiers and compare their performance by filling out the following table:

| Method                                    | Accuracy | Error rate |
| ----------------------------------------- | -------- | ---------- |
| Decision Tree                             |   0.94   |    0.06    |
| Nearest Neighbors (k=3)                   |   0.98   |    0.02    |
| Naive Bayes (Gaussian)                    |   0.94   |    0.06    |
| Random Forest (trees=10, max. features=2) |   0.94   |    0.06    |

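Since the Nearest Neighbors and Random Forest classifiers have tunable parameters, they may appear in multiple rows, one per parameter setting. Random Forest is also randomized, so its accuracy can differ between runs unless `random_state` is fixed; the run recorded below reports 0.92 rather than 0.94.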

In [1]:
import csv

We import the scikit-learn classes that implement the four classifiers being compared.

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

Data loader:

In [3]:
def load_data(filename):
    """Read the Iris CSV file; every third valid row becomes a test instance."""
    train_x, train_y = [], []
    test_x, test_y = [], []
    with open(filename, 'rt') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        count = 0  # number of valid instances seen so far
        for row in csvreader:
            if len(row) == 5:  # skip blank or malformed lines
                count += 1
                instance = [float(value) for value in row[:4]]  # first four values are attributes
                label = row[4]  # 5th value is the class label
                if count % 3 == 0:  # test instance
                    test_x.append(instance)
                    test_y.append(label)
                else:  # train instance
                    train_x.append(instance)
                    train_y.append(label)

    return train_x, train_y, test_x, test_y
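
The split rule inside `load_data` can be checked in isolation. The sketch below is only an illustration: `split_indices` is a hypothetical helper, not part of the original notebook, and it assumes that every row of the file is valid, so that the 1-based row number equals the instance counter.

```python
def split_indices(n_instances):
    """Return (train, test) lists of 1-based instance numbers.

    Mirrors the rule in load_data: every third instance is a test instance.
    Hypothetical helper, for illustration only.
    """
    train, test = [], []
    for number in range(1, n_instances + 1):
        if number % 3 == 0:
            test.append(number)
        else:
            train.append(number)
    return train, test


# Small case: instances 3, 6, and 9 are held out for testing.
small_train, small_test = split_indices(9)
print(small_train, small_test)

# Assuming the standard Iris file with 150 instances.
iris_train, iris_test = split_indices(150)
print(len(iris_train), len(iris_test))
```

Under that assumption, a 150-instance file yields 100 training and 50 test instances, a fixed two-thirds/one-third split that does not depend on any random seed.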

And a function that evaluates predictions against the true labels:

In [4]:
def evaluate(predictions, true_labels):
    correct = 0
    incorrect = 0
    for i in range(len(predictions)):
        if predictions[i] == true_labels[i]:
            correct += 1
        else:
            incorrect += 1

    print("\tAccuracy:   ", correct / len(predictions))
    print("\tError rate: ", incorrect / len(predictions))
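
As a quick worked example of the two metrics, the sketch below computes them for a made-up set of predictions. `accuracy_and_error` is a hypothetical variant of `evaluate` that returns the values instead of printing them, and the labels are invented for illustration; it assumes 50 test instances, of which one is misclassified.

```python
def accuracy_and_error(predictions, true_labels):
    """Return (accuracy, error rate) as fractions of the number of predictions.

    Hypothetical variant of evaluate() that returns values instead of printing.
    """
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    total = len(predictions)
    return correct / total, (total - correct) / total


# Invented labels: 50 test instances, exactly one of them misclassified.
true_labels = ["Iris-setosa"] * 50
predictions = ["Iris-setosa"] * 49 + ["Iris-versicolor"]

accuracy, error_rate = accuracy_and_error(predictions, true_labels)
print("Accuracy:  ", accuracy)
print("Error rate:", error_rate)
```

The two values always sum to 1, so either one determines the other. With 50 test instances, each misclassification changes the accuracy by 0.02.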

#### Main logic

Load data.

In [5]:
train_x, train_y, test_x, test_y = load_data("data/iris.data")

For each classifier, we train a model on the training instances, predict labels for the test instances, and evaluate those predictions against the known class labels.

In [6]:
classifiers = {"Decision Tree": DecisionTreeClassifier(),
               "Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
               "Naive Bayes (Gaussian)": GaussianNB(),
               "Random Forests": RandomForestClassifier(n_estimators=10, max_features=2)  # 10 trees in the forest; at most 2 features considered at each split
               }
for name, clf in classifiers.items():
    print(name)
    clf.fit(train_x, train_y)
    predictions = clf.predict(test_x)
    evaluate(predictions, test_y)

Nearest Neighbors
	Accuracy:    0.98
	Error rate:  0.02
Decision Tree
	Accuracy:    0.94
	Error rate:  0.06
Random Forests
	Accuracy:    0.92
	Error rate:  0.08
Naive Bayes (Gaussian)
	Accuracy:    0.94
	Error rate:  0.06
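
The task mentions that Nearest Neighbors and Random Forest may get several table rows, one per parameter value. A minimal sketch of such a sweep follows. It rests on several assumptions that are not in the original notebook: scikit-learn's bundled `load_iris` data stands in for `data/iris.data`, the helper `sweep` is hypothetical, the parameter values are arbitrary, and `random_state=0` is fixed only to make the forest repeatable. The resulting accuracies need not match the table above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris data stands in for data/iris.data.
iris = load_iris()
X, y = iris.data, iris.target

# Same split rule as load_data: every third instance (1-based) is a test instance.
numbers = np.arange(1, len(X) + 1)
is_test = numbers % 3 == 0
train_x, train_y = X[~is_test], y[~is_test]
test_x, test_y = X[is_test], y[is_test]


def sweep(make_classifier, values):
    """Return {parameter value: test accuracy} (hypothetical helper)."""
    scores = {}
    for value in values:
        clf = make_classifier(value).fit(train_x, train_y)
        scores[value] = clf.score(test_x, test_y)  # mean accuracy on the test set
    return scores


knn_scores = sweep(lambda k: KNeighborsClassifier(n_neighbors=k), [1, 3, 5, 7])
rf_scores = sweep(
    lambda n: RandomForestClassifier(n_estimators=n, max_features=2, random_state=0),
    [5, 10, 50],
)

for k, acc in knn_scores.items():
    print(f"Nearest Neighbors (k={k}): accuracy {acc:.2f}, error rate {1 - acc:.2f}")
for n, acc in rf_scores.items():
    print(f"Random Forest (trees={n}): accuracy {acc:.2f}, error rate {1 - acc:.2f}")
```

Each printed line corresponds to one candidate row of the comparison table. Selecting parameters by test-set accuracy is acceptable for this exercise, but a separate validation split or cross-validation would give a less optimistic estimate.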
