# Part 1: Classification

A common task in computational research is to classify an object based on a set of features. In supervised machine learning, we can give an algorithm a dataset of training examples that say "here are specific features, and this is the target class it belongs to". With enough training examples, a model can be built that recognizes which features are important in determining an object's class. This model can then be used to predict the class of an object given its known features.

## 1) Iris Dataset

We'll start off by loading scikit-learn's [Iris](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) dataset. Using this dataset we can classify an iris flower as one of three types: setosa, versicolour, or virginica. The features that we'll use to predict this are sepal length, sepal width, petal length, and petal width.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

In [None]:
type(iris)

Let's look inside of it to see what datatypes scikit-learn wants, and how their sample dataset is formatted, so that we can prepare our own datasets later:

In [None]:
iris.keys()

So the data is in a dictionary-like format, and we can access the data, the labels, and a description of the dataset by indexing certain keys:

In [None]:
print(iris.DESCR)

Again, here are the features:

In [None]:
print(iris.feature_names)
print(len(iris.feature_names))

And here's what we're predicting:

In [None]:
print(iris.target_names)
print(len(iris.target_names))

So we are using 4 features for each observation, trying to classify each observation into one of three categories, using only those 4 features. How are these input features formatted?

In [None]:
print(iris.data.shape)
print(type(iris.data))
iris.data[0:2]

We have a 2D numpy array with 150 rows, one for each observation, and each row is itself an array of length 4, one value for each feature. The values in each row *must* line up with the order of the features, and that order must be the same in every row. **ORDER MATTERS**.

What about the target?

In [None]:
print(iris.target.shape)
print(type(iris.target))
iris.target

Again, we have 150 observations, but *no* sub-arrays: the target data is one-dimensional. Order matters here as well, since each target must sit at the same index as its features in the data array. The targets are the correct classes corresponding to each observation in our dataset.

In other words, the indices of the data and the targets should match up like this for three of the observations:

In [None]:
for x in [0, 50, 100]:
    print("Data:", iris.data[x])
    print("Target:", iris.target[x])

Hopefully this helps you convert your data from CSV or other formats into the correct numpy arrays for scikit-learn.
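As a rough sketch of that conversion, here is one way to turn CSV rows into a feature matrix and a target array. The CSV contents, the column names, and the variable names below are made up for illustration; your own file will differ.

```python
import csv
import io

import numpy as np

# Hypothetical CSV contents; in practice this would be read from a file on disk.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
4.9,3.0,1.4,0.2,setosa
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
feature_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Feature matrix: one row per observation, one column per feature, always in the same order.
X = np.array([[float(row[col]) for col in feature_columns] for row in rows])

# Map each class name to an integer code, then build the 1-D target array in row order.
class_names = sorted({row["species"] for row in rows})
class_to_index = {name: idx for idx, name in enumerate(class_names)}
y = np.array([class_to_index[row["species"]] for row in rows])

print(X.shape, y.shape)
print(class_names, y)
```

The key point is that row `i` of `X` and element `i` of `y` describe the same observation.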

Now we will split the data into training and testing sets, but first things first: **set the random seed!** This is very important for the reproducibility of your analyses.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

np.random.seed(10)

Here we'll use 75% of the data for training, and test on the remaining 25%.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25)

In [None]:
X_train.shape, X_test.shape
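Besides the global seed, `train_test_split` also accepts a `random_state` argument. The sketch below (the names `split_a`, `split_b`, and `same` are just for illustration) checks that two calls with the same `random_state` return identical partitions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Passing random_state pins the shuffle for this call only,
# independent of NumPy's global random state.
split_a = train_test_split(iris.data, iris.target, test_size=0.25, random_state=10)
split_b = train_test_split(iris.data, iris.target, test_size=0.25, random_state=10)

# Each split is the tuple (X_train, X_test, y_train, y_test); compare them piece by piece.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)
```

This can be handy when a notebook's cells are run out of order, since the result no longer depends on how many random numbers were drawn earlier.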

Now that we've split our data up into `train` and `test` sets, let's look to see how the target classes are distributed within the two datasets. This is known as the **class distribution**.

In [None]:
plt.figure(figsize=(13,5))
plt.subplot(1,2,1)
plt.hist(y_train, bins=5)
plt.title('Train')
plt.subplot(1,2,2)
plt.hist(y_test, bins=5);
plt.title('Test');

Imbalanced classes can cause problems for model performance and evaluation. 

When we started, there was an equal distribution of 50 observations for each target class in the dataset. After splitting the data into training and testing sets, the target classes were no longer distributed evenly across our partitions. Fortunately, we can tell `sklearn` to preserve the class proportions in both partitions using the `stratify` parameter as follows (note that this split also holds out 20% of the data for testing rather than 25%):

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.20,
                                                   stratify=iris.target)

In [None]:
plt.figure(figsize=(13,5))
plt.subplot(1,2,1)
plt.hist(y_train, bins=5)
plt.title('Train')
plt.subplot(1,2,2)
plt.hist(y_test, bins=5);
plt.title('Test');

That's much better: the three classes are now equally represented in both sets!
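If you'd rather check the class distribution with numbers than with histograms, counting the labels works too. This is a small sketch; the `random_state=10` argument and the names `train_counts` and `test_counts` are our own additions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, stratify=iris.target, random_state=10
)

# np.bincount counts how many times each integer label (0, 1, 2) occurs.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("Train counts per class:", train_counts)
print("Test counts per class:", test_counts)
```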

## 2) Decision Trees

The first model we're going to explore is [Decision Trees: Classification](http://scikit-learn.org/stable/modules/tree.html#classification).

After the train/test split, scikit-learn makes the rest of the process relatively easy, since it already has a Decision Tree (DT) classifier for us. We just have to choose the parameters:

In [None]:
from sklearn import tree

dt_classifier = tree.DecisionTreeClassifier(criterion='gini',  # or 'entropy' for information gain
                       splitter='best',  # or 'random' to choose among randomly drawn splits
                       max_depth=None,  # maximum depth of the tree (None = no limit)
                       min_samples_split=2,  # minimum samples required to split a node
                       min_samples_leaf=1,  # minimum samples required at a leaf
                       min_weight_fraction_leaf=0.0,  # minimum weighted fraction of samples required at a leaf
                       max_features=None,  # number of features to consider at each split (None = all)
                       max_leaf_nodes=None,  # maximum number of leaf nodes (None = no limit)
                       min_impurity_decrease=1e-07,  # split only if impurity decreases by at least this much
                       random_state=10)  # random seed

We then use the `fit` method to fit our model to the training data. The syntax is a little strange at first, but it's powerful. All the functions for fitting data, making predictions, and storing parameters are encapsulated in a single model object. 

In [None]:
dt_classifier.fit(X_train, y_train);

To see how our model performs on the test data, we use the `score` method which returns the mean accuracy. Accuracy can be defined as:

$$ \text{Accuracy} = \frac{\sum{\text{True Positives}}+\sum{\text{True Negatives}}}{\sum{\text{Total Population}}}$$

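Where "True Positives" are those data points whose value should be 1, and they are predicted to be 1, and "True Negatives" are those data points whose values should be 0, and they are predicted to be 0. With more than two classes, as in the iris data, accuracy is simply the fraction of observations whose predicted class matches the true class.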
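As a small sketch, accuracy can be computed by hand as the fraction of matching labels. The label arrays below are made up for illustration and are not model output.

```python
import numpy as np

# Hypothetical true and predicted class labels for five observations.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 1])

# Comparing element-wise gives True/False per observation; the mean is the fraction correct.
accuracy = np.mean(y_true == y_pred)
print(accuracy)
```

Four of the five predictions match here, so the accuracy is 0.8.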

`score` can be used on both the train and test datasets. Using the train data will give us the in-sample accuracy score.

In [None]:
print(dt_classifier.score(X_train, y_train))

That's a perfect score of `1.0`! But the model may be overfit to the train data, so we should evaluate the performance of this model using the test data.

In [None]:
print(dt_classifier.score(X_test, y_test))

Not quite perfect, but still really good!

We can get the feature importance (Gini importance) of the four features to see which one(s) are important in determining the classification:

In [None]:
dt_classifier.feature_importances_

Looks like the fourth variable is most important. Let's find out which feature that is.

In [None]:
iris.feature_names[dt_classifier.feature_importances_.argmax()]
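To see all four importances next to their names, we can pair them up and sort. This sketch fits a fresh tree on the full iris dataset just to stay self-contained (an assumption for illustration), so the numbers may differ from the ones above; `clf` and `ranked` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=10).fit(iris.data, iris.target)

# Pair each feature name with its importance and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances are normalized, so they sum to 1 across the features.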

There are metrics other than accuracy to quantify classification performance. Some common metrics in machine learning are:

1. **Precision**: 
$$\frac{\sum{\text{True Positives}}}{\sum{\text{Predicted Positives}}}$$
2. **Recall** (or **Sensitivity**): 
$$\frac{\sum{\text{True Positives}}}{\sum{\text{Condition Positives}}}$$ 
3. **Specificity** (like recall for negative examples): 
$$\frac{\sum{\text{True Negatives}}}{\sum{\text{Condition Negatives}}}$$


Below is a figure showing how these metrics fit in with other confusion matrix concepts like "True Positives" and "True Negatives" (source: [Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix)):

<img src='https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg' width=500>
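To make the three formulas concrete, here is a small sketch that computes them by hand for a binary problem. The label arrays are invented for illustration.

```python
import numpy as np

# Hypothetical binary labels: 1 = positive, 0 = negative.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# Count the four cells of the confusion matrix.
tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives

precision = tp / (tp + fp)    # true positives / predicted positives
recall = tp / (tp + fn)       # true positives / condition positives
specificity = tn / (tn + fp)  # true negatives / condition negatives

print(precision, recall, specificity)
```

With these labels there are 3 true positives, 2 false positives, 2 true negatives, and 1 false negative, giving a precision of 0.6, a recall of 0.75, and a specificity of 0.5.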

Scikit-learn can print out the **Recall** and **Precision** scores for a classification model by using `metrics.classification_report()`.

In [None]:
from sklearn import metrics

dt_predicted = dt_classifier.predict(X_test)
print("Classification report:")
print(metrics.classification_report(y_test, dt_predicted)) 
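The counts behind those scores can be inspected with `metrics.confusion_matrix`. The sketch below rebuilds a split and a tree so that it runs on its own; the `random_state=10` values and the names `clf` and `cm` are our own choices for illustration.

```python
from sklearn import metrics, tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, stratify=iris.target, random_state=10
)

clf = tree.DecisionTreeClassifier(random_state=10).fit(X_train, y_train)

# Rows are the true classes, columns are the predicted classes.
cm = metrics.confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```

Correct predictions sit on the diagonal, and any off-diagonal entry tells you which class was mistaken for which.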

## 3) Tuning Hyperparameters: Cross-Validation & Grid Search

Tuning hyperparameters is one of the most important steps in building an ML model. Hyperparameters are external to the model and cannot be estimated from data, so you, the modeler, must pick these!

One way to find the best combination of hyperparameters is by using what's called a [grid search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). A grid search tests different possible parameter combinations to see which combination yields the best results. Fortunately, scikit-learn has a function for this which makes it very easy to do.

Here, we'll see what the best combination of the hyperparameters `min_samples_split` and `min_samples_leaf` is. We can make a dictionary with the names of the hyperparameters as the keys and the range of values as the corresponding values.

In [None]:
param_grid = {'min_samples_split': range(2,10),
              'min_samples_leaf': range(1,10)}

param_grid
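It can help to see how large this grid actually is. The sketch below enumerates every combination with `itertools.product`; the names `combinations`, `n_folds`, and `total_fits` are illustrative, and `n_folds = 3` assumes 3-fold cross-validation.

```python
from itertools import product

param_grid = {'min_samples_split': range(2, 10),
              'min_samples_leaf': range(1, 10)}

# Build one dictionary per point on the grid (every pairing of the two ranges).
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3  # assumed number of cross-validation folds
total_fits = len(combinations) * n_folds

print(len(combinations))
print(combinations[0])
print(total_fits)
```

There are 8 x 9 = 72 combinations, and each one is fit once per fold, so the grid grows quickly as you add hyperparameters or values.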

Then we can implement the grid search and fit our model according to the best parameters.

In [None]:
from sklearn.model_selection import GridSearchCV

model_dt = GridSearchCV(dt_classifier, param_grid, cv=3, return_train_score=True)
model_dt.fit(X_train, y_train);

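We can see which parameter values produce the highest mean cross-validated accuracy by finding the max `mean_test_score` and its associated parameter values. Note that "test" here refers to the held-out validation folds within the training data, not our separate test set: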

In [None]:
best_index = np.argmax(model_dt.cv_results_["mean_test_score"])

print('Best parameter values are:', model_dt.cv_results_["params"][best_index])
print('Best Mean Cross-Validation train accuracy: %.03f' % (model_dt.cv_results_["mean_train_score"][best_index]))
print('Best Mean Cross-Validation test (validation) accuracy: %.03f' % (model_dt.cv_results_["mean_test_score"][best_index]))
print('Overall mean test accuracy: %.03f' % (model_dt.score(X_test, y_test)))
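`GridSearchCV` also stores the winning combination directly, so the `argmax` lookup isn't strictly necessary. The sketch below rebuilds a split and a search so that it runs on its own (the `random_state=10` values and the name `grid` are our own choices), then reads `best_params_`, `best_score_`, and `best_estimator_`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, stratify=iris.target, random_state=10
)

param_grid = {'min_samples_split': range(2, 10),
              'min_samples_leaf': range(1, 10)}

grid = GridSearchCV(DecisionTreeClassifier(random_state=10), param_grid, cv=3)
grid.fit(X_train, y_train)

# Shortcuts for the best grid point and its mean cross-validated score.
print(grid.best_params_)
print(grid.best_score_)

# best_estimator_ is the model refit on all of the training data with those parameters.
print(grid.best_estimator_.score(X_test, y_test))
```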

We can also look at all of the combinations and their test and train scores:

In [None]:
n_grid_points = len(model_dt.cv_results_['params'])
min_samples_leaf_vals = np.empty((n_grid_points,))
min_samples_split_vals = np.empty((n_grid_points,))
mean_train_scores = np.empty((n_grid_points,))
mean_test_scores = np.empty((n_grid_points,))
for i in range(n_grid_points):
    min_samples_leaf_vals[i] = model_dt.cv_results_['params'][i]['min_samples_leaf']
    min_samples_split_vals[i] = model_dt.cv_results_['params'][i]['min_samples_split']
    mean_train_scores[i] = model_dt.cv_results_['mean_train_score'][i]
    mean_test_scores[i] = model_dt.cv_results_['mean_test_score'][i]

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np

In [None]:
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_trisurf( min_samples_leaf_vals, min_samples_split_vals, mean_train_scores, cmap=cm.coolwarm,
                       linewidth=10, antialiased=False)
ax.set_title('Mean Train Scores', fontsize=18)
ax.set_xlabel('min_samples_leaf', fontsize=18)
ax.set_ylabel('min_samples_split', fontsize=18)

In [None]:
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_trisurf( min_samples_leaf_vals, min_samples_split_vals, mean_test_scores, cmap=cm.coolwarm,
                       linewidth=10, antialiased=False)
ax.set_title('Mean Test Scores', fontsize=18)
ax.set_xlabel('min_samples_leaf', fontsize=18)
ax.set_ylabel('min_samples_split', fontsize=18)

## 4) Random Forests

Now we'll look at [Random Forests](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

- Random forests are an ensemble method (the classification decision is pooled across many simpler classifiers).
- Each decision tree is fit to a bootstrap sample of the data (bagging), and considers only a random subset of the features when splitting (random subspace).
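To make the two bullets above concrete, here is a simplified sketch of bagging done by hand: each tree sees a bootstrap sample and a random subset of features at each split, and the trees then vote. This is an illustration under our own assumptions (majority voting, 10 trees, evaluation on the same data it was fit to), not how `RandomForestClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(10)
n_samples = len(X)
n_trees = 10

trees = []
for _ in range(n_trees):
    # Bagging: draw a bootstrap sample (same size as the data, with replacement).
    idx = rng.randint(0, n_samples, size=n_samples)
    # Random subspace: each split considers only sqrt(n_features) randomly chosen features.
    single_tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.randint(1_000_000))
    )
    single_tree.fit(X[idx], y[idx])
    trees.append(single_tree)

# Collect every tree's prediction: shape (n_trees, n_samples).
votes = np.stack([t.predict(X) for t in trees])

# Majority vote per observation across the trees.
majority = np.array([np.bincount(column, minlength=3).argmax() for column in votes.T])

print("Ensemble accuracy on the data it was fit to:", (majority == y).mean())
```

Because the accuracy here is measured on data the trees have already seen, it is optimistic; the point is only to show the mechanics of pooling.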

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn import ensemble

rf_classifier = ensemble.RandomForestClassifier(n_estimators=10,  # number of trees
                       criterion='gini',  # or 'entropy' for information gain
                       max_depth=None,  # how deep tree nodes can go
                       min_samples_split=2,  # samples needed to split node
                       min_samples_leaf=1,  # samples needed for a leaf
                       min_weight_fraction_leaf=0.0,  # weight of samples needed for a node
                       max_features='sqrt',  # number of features for best split (same behavior as the older 'auto' option)
                       max_leaf_nodes=None,  # max nodes
                       min_impurity_decrease=1e-07,  # early stopping
                       n_jobs=1,  # CPUs to use
                       random_state = 10,  # random seed
                       class_weight="balanced")  # adjusts weights inverse of freq, also "balanced_subsample" or None

Now we fit the model on our training data.

In [None]:
rf_model = rf_classifier.fit(X_train, y_train)

Let's look at the classification performance on the test data:

In [None]:
print("Score of model with test data defined above:")
print(rf_model.score(X_test, y_test))
print()

predicted = rf_model.predict(X_test)
print("Classification report:")
print(metrics.classification_report(y_test, predicted)) 
print()

Let's do another grid search to determine the best parameters:

In [None]:
param_grid = {'min_samples_split': range(2,10),
              'min_samples_leaf': range(1,10)}

model_rf = GridSearchCV(ensemble.RandomForestClassifier(n_estimators=10), param_grid, cv=3)
model_rf.fit(X_train, y_train)

best_index = np.argmax(model_rf.cv_results_["mean_test_score"])

print("Best parameter values:", model_rf.cv_results_["params"][best_index])
print("Best Mean cross-validated test accuracy:", model_rf.cv_results_["mean_test_score"][best_index])
print("Overall Mean test accuracy:", model_rf.score(X_test, y_test))

## 5) Predict

Great! That's quite accurate. So let's say we're walking through a garden and spot an iris, but have no idea what type it is. We take some measurements:

In [None]:
random_iris = [5.1, 3.5, 2, .1]

for i in range(len(random_iris)):
    print(iris.feature_names[i])
    print(random_iris[i])
    print()

Can we use our model to predict the type?

In [None]:
label_idx = model_rf.predict([random_iris])
label_idx

Now we can just index our labels:

In [None]:
iris.target_names[label_idx]
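A single predicted label hides how confident the model is. As a sketch, `predict_proba` returns one probability per class. To stay self-contained, this example fits a small forest on the full iris dataset, which is an assumption for illustration; `forest` and `probabilities` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=10)
forest.fit(iris.data, iris.target)

random_iris = [5.1, 3.5, 2, .1]

# predict_proba expects a 2-D input, so the single observation is wrapped in a list.
probabilities = forest.predict_proba([random_iris])

for name, probability in zip(iris.target_names, probabilities[0]):
    print(f"{name}: {probability:.2f}")
```

The probabilities in each row sum to 1, and the predicted label is the class with the highest one.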

# Challenge: AdaBoost

AdaBoost is another ensemble method that relies on 'boosting'. Similar to 'bagging', 'boosting' fits multiple classifiers to different versions of the data, but it fits them in sequence, preferentially resampling (or reweighting) the data points that earlier classifiers misclassified.
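To give a feel for what "preferentially" means, here is a sketch of a single reweighting round using the classic binary AdaBoost update. The misclassification pattern is invented, and scikit-learn's implementation may differ in its details, so treat this as a conceptual illustration only.

```python
import numpy as np

# Hypothetical outcome of one weak classifier on five points: True = misclassified.
misclassified = np.array([False, False, True, False, False])

# Every point starts with the same weight.
weights = np.full(5, 1 / 5)

# Weighted error rate of this classifier, and its resulting vote strength (alpha).
error = weights[misclassified].sum() / weights.sum()
alpha = 0.5 * np.log((1 - error) / error)

# Increase the weights of misclassified points, decrease the rest, then renormalize.
weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
weights = weights / weights.sum()

print(weights)
```

After this round the single misclassified point carries half of the total weight, so the next classifier is pushed to get it right.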

### Part 1

Using the scikit-learn [documentation](http://scikit-learn.org/stable/modules/ensemble.html#adaboost), build your own AdaBoost model to test on the iris data set! Start off with `n_estimators` at 100, and `learning_rate` at .5. Use 10 as the `random_state` value.

### Part 2

Now use a grid search to determine what the best values for the `n_estimators` and `learning_rate` parameters are. For `n_estimators` try a range of 50 to 500 with a step of 50, and for `learning_rate` try a range of .1 to 1.1 with a step of .1. For decimal steps in a range use the `np.arange` function.