# Part One: Introduction & Classification with Scikit-learn

## Main Goals

- Define Jupyter Notebooks
- Learn what machine learning is
- Represent data in `scikit-learn`
- Classify data with the `iris` data
- Modify and extend a classification pipeline to new models and data

## About Notebooks in Google Colab

This lesson is composed as a Jupyter Notebook. A Jupyter Notebook provides a computational environment in which to run, write, and edit code snippets alongside text and other content in Markdown format.

You have a number of options for how to run Jupyter Notebooks. For this workshop, we will use the cloud-based computational environment *Google Colaboratory* (or *Google Colab*). To execute code in a cell, select it, then either press the play button to the left of the code or use the keyboard shortcut `Command/Ctrl+Enter`. To edit code, click the cell and start typing.

## About Scikit-Learn

[`Scikit-Learn`](http://github.com/scikit-learn/scikit-learn) is a Python package designed to give access to **well-known** machine learning algorithms within Python code through a **clean, well-thought-out API**. It was built by hundreds of contributors from around the world, and it is used across industry and academia.

`Scikit-Learn` is built upon Python's [`NumPy` (Numerical Python)](http://numpy.org) and [`SciPy` (Scientific Python)](http://scipy.org) libraries, which enable efficient in-core numerical and scientific computation within Python. Other popular ML libraries in Python include [`TensorFlow`](https://www.tensorflow.org/) and [`PyTorch`](https://pytorch.org/).

## Outline

1. Preparing data for classification
2. Training a classifier on the Iris dataset
3. Validating a model with test/train splits
4. Applying the pipeline to new data

## 1. Preparing data for classification

### What is Machine Learning?

In this section we will begin to explore the basic principles of machine learning.
Machine learning is a family of techniques that derive patterns from observed data and apply those patterns to previously unobserved data.

One common flow in working with machine learning is to:

1. Gather and observe a collection of data that is representative of the world
2. Abstract out patterns, rules, and associations between a collection of observations (or "features") and a desired "target" variable, and 
3. Predict the target variable in a previously unseen collection of data, given those features.

To get started, we'll focus on one subset of machine learning approaches: **classification**. Classification is a supervised machine learning technique that attempts to predict a target category or label for a new observation, given other variables or **features** that describe that data.


### How to represent data for machine learning classification

Machine learning is about creating models from data; therefore we'll start by
discussing how to represent data in a way the computer understands.  Along
with this, we'll create `matplotlib` examples to visualize the data.

Classification algorithms implemented in `scikit-learn` expect the following data structures:

**Feature matrix X:** A 2D array or matrix of features about our data. Each row is one sample and each column is one feature (a variable measured for every sample).

**Target vector y:** A 1D array or vector with a target variable, one per sample.

![Data Layout](https://github.com/jakevdp/sklearn_tutorial/blob/master/notebooks/images/data-layout.png?raw=1)

(Figure from the [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook))
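
As a minimal sketch of this layout, the toy example below builds a feature matrix and a target vector by hand with NumPy. The measurements and labels are made up for illustration; only the shapes matter here.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), each described by 2 features (columns)
X = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.3, 3.3],
    [5.8, 2.7],
])

# One target label per sample, in the same order as the rows of X
y = np.array([0, 0, 1, 1])

n_samples, n_features = X.shape
print(X.shape)  # (4, 2): n_samples x n_features
print(y.shape)  # (4,): one label per sample
```

The number of rows in `X` must match the length of `y`, since row `i` of `X` and entry `i` of `y` describe the same sample.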

## 2. Training a classifier on the Iris dataset

### Introduction

We're going to use the `iris` data bundled with `scikit-learn` (Anderson, 1936; Fisher, 1936). The data consists of measurements of the three different species of iris shown below.

![alt text](http://raw.githubusercontent.com/jakevdp/sklearn_tutorial/master/notebooks/images/iris_setosa.jpg "Iris Setosa")

Iris Setosa

![alt text](http://raw.githubusercontent.com/jakevdp/sklearn_tutorial/master/notebooks/images/iris_versicolor.jpg "Iris Versicolor")

Iris Versicolor

![alt text](http://raw.githubusercontent.com/jakevdp/sklearn_tutorial/master/notebooks/images/iris_virginica.jpg "Iris Virginica")

Iris Virginica

### Setup

Run the code below to set up our notebook and bring in relevant packages.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

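# the 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
style = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(style)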

Let's load our data into a new Python object, `iris`. scikit-learn conveniently allows us to bundle all relevant functionality for sample datasets into one object.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

# make sure our data loaded in correctly
print(iris.DESCR[1:2000])

In [None]:
print("Below is what the data attribute of the iris object looks like:\n")
print(iris.data)

print("\nAnd here are corresponding labels for each variable:\n")
print(iris.feature_names)

print("\nBelow is what the target attribute of the iris object looks like:\n")
print(iris.target)

print("\nAnd here is what the target label names look like:\n")
print(iris.target_names)

In [None]:
# here is a prettier version
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

This data is four-dimensional, but we can visualize two of the dimensions
at a time using a simple scatter plot:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x_index = 0
y_index = 1

# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

# plt.get_cmap replaces plt.cm.get_cmap, which was removed in recent Matplotlib releases
plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
            c=iris.target, cmap=plt.get_cmap('RdYlBu', 3))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);

### Quick Exercise:

**Change** `x_index` **and** `y_index` **in the above script
and find a combination of two features
which best separates the three classes.**

This exercise is a preview of **dimensionality reduction**, which we'll see later.

### Train classifier with k-Nearest Neighbors

k-Nearest Neighbors (kNN) is a simple, useful strategy for learning an estimator function. For any point in feature space (in our case, similar to the plot above, but in 4D!), find the k nearest training points and give each one a vote. Which class wins? Assign that class to the point. Repeating this for every point carves the feature space into regions, one per target class.

![](http://cs231n.github.io/assets/knn.jpeg)

When considering a new observation in the future, plot the point according to its features in the feature vector space. What class does the point fall into? This gives us our class prediction.
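
The voting procedure described above can be sketched in a few lines of NumPy. This is an illustrative toy version, not how `scikit-learn` implements it: the function name `knn_predict` and the toy data are hypothetical, Euclidean distance is assumed, and ties are not handled in any special way.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a label for x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Each neighbor votes with its own label; the most common label wins
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in 2D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0])))  # 1
```

Note that there is no real "training" step here: the model simply stores the training data and does all of its work at prediction time.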

If you'd like to play with an interactive version of k-NN, take a look at [this excellent blog post](http://scott.fortmann-roe.com/docs/BiasVariance.html).

Let's try it out on our iris classification problem:

In [None]:
# create a features matrix and target array
X, y = iris.data, iris.target

In [None]:
# import the relevant collection of classifiers into our code
from sklearn import neighbors

# instantiate the model
knn = neighbors.KNeighborsClassifier(n_neighbors=3)

# fit the model
knn.fit(X, y)

In [None]:
# how'd this model perform? test for accuracy (more on this shortly)
knn.score(X, y)

In [None]:
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
# call the "predict" method:

sample_data = [[3, 5, 4, 2],]

result = knn.predict(sample_data)

In [None]:
# print the predicted label

print(iris.target_names[result])

In some cases, predicting a label will not be sufficient. We may also wish to know the confidence the model places in that prediction, or in other words, associate a probability with each possible target label.

Many models in scikit-learn provide a `.predict_proba()` method. Like `.predict()`, `.predict_proba()` accepts a set of observations as input. Instead of a single label per observation, however, `.predict_proba()` returns an array of probabilities, one for each label. Let's try:

In [None]:
knn.predict_proba([[3, 5, 4, 2],])
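
To help read that output, here is a small self-contained sketch that inspects the shape of the returned array. It assumes the same model settings as above (`n_neighbors=3` with the default uniform weights), in which case each probability is simply the share of the three neighbors that voted for that class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

sample_data = [[3, 5, 4, 2]]
proba = knn.predict_proba(sample_data)

# One row per observation, one column per class (columns follow the order of knn.classes_)
print(knn.classes_)
print(proba.shape)  # (1, 3)

# With uniform weights, each entry is the fraction of the 3 neighbors voting for that class
print(proba)

# The probabilities in each row sum to 1
print(proba.sum(axis=1))
```

Because only three neighbors vote, the probabilities can only take the values 0, 1/3, 2/3, or 1; a larger `n_neighbors` gives finer-grained estimates.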

#### CHALLENGE 1: Think of a sample_data observation that would challenge your expectations. Modify sample_data and rerun the above cells. What results do you see? Why do you think that is?

#### CHALLENGE 2: Instantiate a new knn model with a different n_neighbors parameter. Generate a score, prediction, and probabilities. How does it compare?

In [None]:
# your code here
# knn2 = ....


In [None]:
# want to set even more parameters?
# take a look: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

### Recap: Scikit-learn's estimator interface

The `Scikit-learn` developers have tried to maintain a uniform user interface across all methods. We'll see examples of these below. Given a `scikit-learn` *estimator*
object named `model`, the following methods are available:

- Available in **all estimators**
  + `model.fit()` : fit training data. For supervised learning applications,
    this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).
    For unsupervised learning applications, this accepts only a single argument,
    the data `X` (e.g. `model.fit(X)`).
- Available in **supervised estimators**
  + `model.predict()` : given a trained model, predict the label of a new set of data.
    This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`),
    and returns the learned label for each object in the array.
  + `model.predict_proba()` : For classification problems, some estimators also provide
    this method, which returns the probability that a new observation has each categorical label.
    In this case, the label with the highest probability is returned by `model.predict()`.
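  + `model.score()` : for classification or regression problems, most estimators implement
    a score method. For classifiers this is the mean accuracy, between 0 and 1; for regressors it is
    the R² score, which is at most 1 and can be negative. In both cases, a larger score indicates a better fit.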
- Available in **unsupervised estimators**
  + `model.predict()` : predict labels in clustering algorithms.
  + `model.transform()` : given an unsupervised model, transform new data into the new basis.
    This also accepts one argument `X_new`, and returns the new representation of the data based
    on the unsupervised model.
  + `model.fit_transform()` : some estimators implement this method,
    which more efficiently performs a fit and a transform on the same input data.
    
- Available in **some estimators**
  + `model.predict_proba()`: return probabilities of class predictions, ordered by class label
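
The sketch below walks through this interface with one supervised and one unsupervised estimator. `PCA` is used here only as a convenient stand-in for an unsupervised model with a `transform` method; it is an assumption for illustration and is not otherwise part of this lesson. The variable names are likewise hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Supervised estimator: fit(X, y), then predict(X_new) and score(X, y)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_demo, y_demo)
labels = clf.predict(X_demo[:5])
accuracy = clf.score(X_demo, y_demo)

# For classifiers, score() is the mean accuracy: the fraction of correct predictions
manual_accuracy = np.mean(clf.predict(X_demo) == y_demo)
print(accuracy, manual_accuracy)

# Unsupervised estimator: fit(X) takes no labels; transform() changes the representation
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_demo)  # fit and transform in one step
print(X_2d.shape)  # (150, 2)
```

Because the method names are shared, swapping one classifier for another usually means changing only the line that instantiates the model.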

## 3. Model validation

An important piece of machine learning is **model validation**: that is, determining how well your model will generalize from the training data to future unlabeled data.

With the iris data, let's look at our score with a knn with k=1:

In [None]:
# compare our predictions from earlier to the actual class labels
# feel free to substitute knn with your own trained model

knn = neighbors.KNeighborsClassifier(n_neighbors=1)
# fit the model
knn.fit(X, y)
knn.score(X, y)

A more useful way to look at the results is to view the **confusion matrix**, a table that counts how often each true class (rows) is predicted as each class (columns):

In [None]:
from sklearn.metrics import confusion_matrix

# generate predictions for y
y_pred = knn.predict(X)

print(confusion_matrix(y, y_pred))
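
To make the layout of a confusion matrix concrete, here is a small example with made-up labels in which one observation is misclassified. The label lists are hypothetical and unrelated to the iris results.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six observations across three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]
# Row i, column j counts observations whose true class is i and predicted class is j.
# The off-diagonal 1 in row 1, column 2 is the class-1 observation predicted as class 2.
```

Correct predictions sit on the diagonal, so a perfect classifier produces a matrix with zeros everywhere else.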

For each class, all 50 training samples are correctly identified. But this **does not mean that our model is perfect!** With k=1, every training point is its own nearest neighbor, so a perfect score on the training data is nearly automatic and tells us little about how the model will perform on new data. We can get a more honest estimate by splitting our data into a *training set* and a *testing set*. `scikit-learn` contains some convenient routines to do this:

In [None]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.2)

In [None]:
# Let's take a look
print(Xtrain.shape)
print(ytrain.shape)

print(Xtest.shape)
print(ytest.shape)
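
By default, `train_test_split` shuffles the data differently on every call. The sketch below shows two optional arguments that may be useful: `random_state`, which makes the split reproducible, and `stratify`, which keeps the class proportions the same in both subsets. The seed value and variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# random_state fixes the shuffle so the split is reproducible;
# stratify keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (120, 4) (30, 4)
print(np.bincount(y_te))       # [10 10 10]
```

With 150 samples and `test_size=0.2`, 30 samples are held out for testing; stratifying means each of the three species contributes exactly 10 of them.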

In [None]:
# instantiate
knn = neighbors.KNeighborsClassifier(n_neighbors=3)

# fit on TRAINING data
knn.fit(Xtrain, ytrain)

# predict based on TEST data
ypred = knn.predict(Xtest)
print(confusion_matrix(ytest, ypred))

#### Challenge 1: Try rerunning the cell with train_test_split a few times to get a sense of performance across different splits. Also try modifying n_neighbors when instantiating models. What do you notice?
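
One possible way to approach this challenge systematically is to loop over several random splits and several values of `n_neighbors`, then summarize the test scores. The seeds and the values of k below are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

for k in (1, 3, 15):
    test_scores = []
    # Repeat the split with different seeds to see how much the score moves around
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=0.2, random_state=seed
        )
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        test_scores.append(model.score(X_te, y_te))
    print(f"k={k}: mean={np.mean(test_scores):.3f}, std={np.std(test_scores):.3f}")
```

The spread of scores across seeds gives a rough sense of how much a single train/test split can mislead you in either direction.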

## Your turn: 

### Option One: Run classification on the iris dataset using a Support Vector Classifier model AND a test/train split

#### 1) Import your data

In [None]:
## load our data again
from sklearn import datasets

iris = datasets.load_iris()

## create feature matrix and target array based on the dataset
# X, y = ??

#### 2) Set train and test splits

In [None]:
# Xtrain, Xtest, ytrain, ytest = ???

#### 3) Instantiate your model as an object (and, optionally, set any parameters)

In [None]:
from sklearn import svm

## instantiate your model as an object
## in this case, let's use the 'rbf' kernel
svm = svm.SVC(kernel='rbf')




#### 4) Fit your model to TRAINING data

In [None]:
# svm..???

#### 5) Apply your model (make predictions)

In [None]:
### new_sample = [[??, ??, ??, ??],]

#### 6) Validate on TEST data (and, for your reference, compare to performance on training data)

In [None]:
## score on train
svm.score(Xtrain, ytrain)

In [None]:
## score on test

### Option Two: Try classification with a new dataset: blood donation

For this example, we are going to use a new dataset from **openml**.

#### 1) Import your data

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
blood = fetch_openml('blood-transfusion-service-center')

In [None]:
print(blood.DESCR)

In [None]:
# note that this data does not have a .target_names property, only feature_names

# print out to learn more about the dataset...

#### 2) Set train and test splits

#### 3) Instantiate your model as an object (and, optionally, set any parameters)

#### 4) Fit your model to TRAINING data

#### 5) Apply your model (make predictions)

#### 6) Validate on TEST data (and, for your reference, compare to performance on training data)

## Acknowledgments

Thank you first and foremost to my fantastic reviewers: Victor Ruiz, Sheila Braun, and Remo Williams.
    
These notebooks are based primarily on two sources:

1. [Jake Vanderplas' Scikit-learn tutorial](https://github.com/jakevdp/sklearn_tutorial/) 
View [license](https://github.com/jakevdp/sklearn_tutorial/blob/master/LICENSE)
2. [Andreas Müller's Introduction to Machine learning with Python workshop](https://github.com/amueller/ml-workshop-1-of-4) 
View [license](https://github.com/amueller/ml-workshop-1-of-4/blob/master/LICENSE)

This notebook is a project of [Arcus Education](https://education.arcus.chop.edu/). The primary author is [Zoë Wilkinson Saldaña](https://github.com/zoews).