# Part 1: Classification

A common task in computational research is to classify an object based on a set of features. In supervised machine learning, we give an algorithm a dataset of training examples that each say "here are specific features, and this is the class they belong to". With enough training examples, a model can be built that recognizes which features are important in determining an object's class. This model can then be used to predict the class of an object given its known features.

## 1) Iris Dataset

We'll start off by loading scikit-learn's [Iris](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) dataset. Using this dataset we can classify an iris flower as one of three types: setosa, versicolour, or virginica. The features that we'll use to predict this are sepal length, sepal width, petal length, and petal width.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

Let's look inside of it to see what datatypes scikit-learn wants, and how their sample dataset is formatted, so that we can prepare our own datasets later:

In [None]:
iris.keys()

So the data is in a dictionary-like format, and we can access the data, the labels, and other information by indexing certain keys. For example, `DESCR` holds a description of the dataset:

In [None]:
print(iris.DESCR)

Again, here are the features:

In [None]:
print(iris.feature_names)
print(len(iris.feature_names))

And here's what we're predicting:

In [None]:
print(iris.target_names)
print(len(iris.target_names))

So we are trying to classify each observation into one of three categories, using only 4 features per observation. How are these input features formatted?

In [None]:
print(len(iris.data))
print(type(iris.data))
iris.data

We have a two-dimensional numpy array with 150 rows, one for each observation, and each row is itself an array of length 4, one value for each feature. The values in every row *must* line up with the order of the feature names, *and* that order must be the same in every row. **ORDER MATTERS**.

What about the prediction?

In [None]:
print(len(iris.target))
print(type(iris.target))
iris.target

Again, we have 150 observations, but *no* sub-arrays. The target data is one-dimensional. Order matters here as well: each entry must sit at the same index as its observation's row in the data array. These are the correct classes for the corresponding rows of data.

In other words, the data and the targets should match up like this for three of the observations:

In [None]:
for x in [0, 50, 100]:
    print("Data:", iris.data[x])
    print("Target:", iris.target[x])

This should be enough explanation to be able to get your own data from CSV or other formats into the correct numpy arrays for scikit-learn.
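
As a rough sketch of that conversion, the cell below parses a small, made-up CSV into a 2-D feature array and a 1-D target array. The column names (`height`, `width`, `label`) and the values are hypothetical, and the CSV is held in a string only so the example runs on its own.

```python
import csv
import io

import numpy as np

# Hypothetical CSV content; in practice this would come from a file on disk.
csv_text = """height,width,label
1.0,2.0,cat
1.5,1.8,dog
0.9,2.2,cat
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Features: one row per observation, one column per feature, in a fixed order.
feature_names = ["height", "width"]
X = np.array([[float(row[name]) for name in feature_names] for row in rows])

# Targets: map each class name to an integer code, in the same row order as X.
target_names = sorted({row["label"] for row in rows})
y = np.array([target_names.index(row["label"]) for row in rows])

print(X.shape, y.shape)
print(target_names, y)
```

The important part is that row `i` of `X` and entry `i` of `y` describe the same observation, just as in `iris.data` and `iris.target`.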

Now we split the data into training and testing sets, but first things first: **set the random seed!** This is very important for the reproducibility of your analyses.

In [None]:
import numpy as np

np.random.seed(10)

Here we'll use 75% of the data for training, and test on the remaining 25%.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    train_size=0.75, test_size=0.25)
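
Setting the global numpy seed works, but the result then depends on everything that consumed random numbers beforehand. As an alternative sketch (not a replacement for the cell above), `train_test_split` also accepts a `random_state` argument that pins the split directly:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Passing random_state fixes this particular split, independent of the global seed.
split_a = train_test_split(iris.data, iris.target,
                           train_size=0.75, test_size=0.25, random_state=10)
split_b = train_test_split(iris.data, iris.target,
                           train_size=0.75, test_size=0.25, random_state=10)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# Both calls produce the same test labels in the same order.
print((split_a[3] == split_b[3]).all())
```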

## 2) Decision Trees

The first model we're going to explore is [Decision Trees](http://scikit-learn.org/stable/modules/tree.html).

After the train/test split, scikit-learn makes the rest of the process relatively easy, since it already provides a decision tree (DT) classifier algorithm for us. We just have to decide on the parameters:

In [None]:
from sklearn import tree

dt_classifier = tree.DecisionTreeClassifier(criterion='gini',  # or 'entropy' for information gain
                       splitter='best',  # or 'random' for the best of randomly drawn splits
                       max_depth=None,  # how deep tree nodes can go (None = no limit)
                       min_samples_split=2,  # samples needed to split node
                       min_samples_leaf=1,  # samples needed for a leaf
                       min_weight_fraction_leaf=0.0,  # weighted fraction of samples needed for a leaf
                       max_features=None,  # number of features to consider when splitting
                       max_leaf_nodes=None,  # max number of leaf nodes
                       min_impurity_decrease=0.0,  # split only if impurity drops by at least this much
                       random_state=10)  # random seed

Then we use the `fit` method on the train data to fit our model.

In [None]:
model = dt_classifier.fit(X_train, y_train)

To see how our model performs on the test data, we use the `score` method.

In [None]:
print(model.score(X_test, y_test))
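
For a classifier, `score` reports mean accuracy: the fraction of test observations whose predicted class matches the true class. The sketch below rebuilds a comparable model from scratch (so it runs on its own; the default tree parameters are an assumption here) and compares `score` with `accuracy_score` applied to the predictions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.75, test_size=0.25, random_state=10)

model = DecisionTreeClassifier(random_state=10).fit(X_train, y_train)

# score() predicts on X_test and compares the result with y_test.
score = model.score(X_test, y_test)

# The same number, computed explicitly from the predictions.
manual = accuracy_score(y_test, model.predict(X_test))

print(score, manual)
```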

## 3) Grid Search

Tuning parameters is one of the most important steps in building an ML model. One way to do this is by using what's called a [grid search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). A grid search fits the model with every combination of the candidate parameter values and uses cross-validation to see which combination scores best. Fortunately, scikit-learn has a class for this, `GridSearchCV`, which makes it very easy to do.

Here we'll see what the best combination of the parameters `min_samples_split` and `min_samples_leaf` is. We can make a dictionary with the names of the parameters as the keys and the range of values as the corresponding values.

In [None]:
param_grid = {'min_samples_split': range(2,10),
              'min_samples_leaf': range(1,10)}
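
To get a feel for how much work this grid implies, here is a small sketch using scikit-learn's `ParameterGrid` helper, which enumerates every combination in the dictionary. It is only for inspection; the grid search builds the same combinations itself and fits each one once per cross-validation fold.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'min_samples_split': range(2, 10),
              'min_samples_leaf': range(1, 10)}

grid = ParameterGrid(param_grid)

# 8 values for min_samples_split times 9 values for min_samples_leaf.
print(len(grid))

# Each element is a plain dict of parameter name -> value.
print(list(grid)[:3])
```

Adding a third parameter multiplies the count again, so grids grow quickly.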

Then we can implement the grid search and fit our model according to the best parameters.

In [None]:
from sklearn.model_selection import GridSearchCV

model_c = GridSearchCV(tree.DecisionTreeClassifier(), param_grid,
                       return_train_score=True)  # also record scores on the training folds
model_c.fit(X_train, y_train)

We can see what the best parameters are:

In [None]:
best_index = np.argmax(model_c.cv_results_["mean_test_score"])

print(model_c.cv_results_["params"][best_index])
print(max(model_c.cv_results_["mean_test_score"]))
print(model_c.score(X_test, y_test))
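
A fitted `GridSearchCV` object also exposes the winning combination directly through `best_params_`, `best_score_`, and `best_estimator_`, which saves the manual `argmax`. The sketch below is self-contained, so it refits a search under the name `search`; the fixed `random_state` values are an assumption added for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.75, test_size=0.25, random_state=10)

param_grid = {'min_samples_split': range(2, 10),
              'min_samples_leaf': range(1, 10)}

search = GridSearchCV(DecisionTreeClassifier(random_state=10), param_grid)
search.fit(X_train, y_train)

print(search.best_params_)     # parameter combination with the best mean CV score
print(search.best_score_)      # that mean cross-validated score
print(search.best_estimator_)  # model refit on all of X_train with those parameters
```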

We can also look at all of the combinations and their mean test scores. Mean train scores are only recorded when `GridSearchCV` is created with `return_train_score=True`, so we print them only if they are present:

In [None]:
print(model_c.cv_results_.keys())

for x in range(len(model_c.cv_results_['params'])):
    print("Parameters:")
    print(model_c.cv_results_['params'][x])
    print("Mean Test Score:")
    print(model_c.cv_results_['mean_test_score'][x])
    if 'mean_train_score' in model_c.cv_results_:
        print("Mean Train Score:")
        print(model_c.cv_results_['mean_train_score'][x])
    print()

## 4) Random Forests

Now we'll look at [Random Forests](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
from sklearn import ensemble, metrics
from sklearn.model_selection import cross_val_score

rf_classifier = ensemble.RandomForestClassifier(n_estimators=10,  # number of trees
                       criterion='gini',  # or 'entropy' for information gain
                       max_depth=None,  # how deep tree nodes can go (None = no limit)
                       min_samples_split=2,  # samples needed to split node
                       min_samples_leaf=1,  # samples needed for a leaf
                       min_weight_fraction_leaf=0.0,  # weighted fraction of samples needed for a leaf
                       max_features='sqrt',  # number of features to consider for best split
                       max_leaf_nodes=None,  # max number of leaf nodes
                       min_impurity_decrease=0.0,  # split only if impurity drops by at least this much
                       n_jobs=1,  # CPUs to use
                       random_state=10,  # random seed
                       class_weight="balanced")  # adjusts weights inverse of freq, also "balanced_subsample" or None

Now we fit the model on our training data.

In [None]:
model = rf_classifier.fit(X_train, y_train)

Let's look at our results:

In [None]:
print("Score of model with test data defined above:")
print(model.score(X_test, y_test))
print()

predicted = model.predict(X_test)
print("Classification report:")
print(metrics.classification_report(y_test, predicted)) 
print()

scores = cross_val_score(model, iris.data, iris.target, cv=10)
print("10-fold cross-validation:")
print(scores)
print()

print("Average of 10-fold cross-validation:")
print(np.mean(scores))

Let's do another grid search to determine the best hyperparameters:

In [None]:
param_grid = {'min_samples_split': range(2,10),
              'min_samples_leaf': range(1,10)}

model_r = GridSearchCV(ensemble.RandomForestClassifier(), param_grid)
model_r.fit(X_train, y_train)

best_index = np.argmax(model_r.cv_results_["mean_test_score"])

print("Best parameters:", model_r.cv_results_["params"][best_index])
print("Mean test score:", max(model_r.cv_results_["mean_test_score"]))
print("Held-out:", model_r.score(X_test, y_test))

# Challenge: AdaBoost

### Part 1

Using the scikit-learn [documentation](http://scikit-learn.org/stable/modules/ensemble.html#adaboost), build your own AdaBoost model to test on the iris dataset! Start off with `n_estimators` at 100 and `learning_rate` at 0.5. Use 10 as the `random_state` value.

### Part 2

Now use a grid search to determine what the best values for the `n_estimators` and `learning_rate` parameters are. For `n_estimators` try a range of 50 to 500 with a step of 50, and for `learning_rate` try a range of 0.1 to 1.1 with a step of 0.1. For decimal steps in a range use the `np.arange` function.
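
As a hint on the ranges only (not a solution), here is a minimal sketch of how `np.arange` behaves with integer and decimal steps. The variable names are just for illustration. Like Python's `range`, the stop value is excluded, and decimal steps are subject to floating-point rounding, so the printed values may not be exact.

```python
import numpy as np

# Integer steps: starts at 50, stops before 500.
n_estimators_values = np.arange(50, 500, 50)

# Decimal steps: starts at 0.1, stops before 1.1 (up to floating-point rounding).
learning_rate_values = np.arange(0.1, 1.1, 0.1)

print(n_estimators_values)
print(learning_rate_values)
```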