# Lect 5 : Scikit Learn : Supervised Learning and Cross Validation

## Supervised Learning - Classification & Regression
In **Supervised Learning**, we have a dataset consisting of both features and labels.
The task is to construct an estimator which is able to predict the label of an object
given the set of features. A relatively simple example is predicting the species of an
iris from a set of measurements of its flower.

Some more complicated examples are:

- given a multicolor image of an object through a telescope, determine
  whether that object is a star, a quasar, or a galaxy.
- given a photograph of a person, identify the person in the photo.
- given the history of items that you have purchased from Amazon, provide recommendations for items that you are most likely to buy next.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Recap of LinearRegression model.

#Create some noisy data. 
np.random.seed(0)
X = np.random.random(size=(20, 1))
y = 3 * X.squeeze() + 2 + np.random.randn(20)

plt.plot(X.squeeze(), y, 'o');

In [None]:
#import the LinearRegression class
from sklearn.linear_model import LinearRegression

#Instantiate the estimator object : i.e., model with some parameters.
model = LinearRegression()

#Use the fit method of the model object.
model.fit(X, y)

#Get some new X data whose y values need to be predicted
X_f = np.linspace(0.0,1.0,100)[:,np.newaxis]

# Predict the y values for the new X data
y_f = model.predict(X_f)

plt.plot(X.squeeze(), y, 'ro')
plt.plot(X_f.squeeze(), y_f, 'k-')
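
To make the recap concrete, the sketch below refits the same noisy line and inspects what the estimator has learned. It assumes only the data-generating recipe from the cell above; the variable names `slope`, `intercept` and `manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same noisy line as in the recap cell: y = 3x + 2 + noise
np.random.seed(0)
X = np.random.random(size=(20, 1))
y = 3 * X.squeeze() + 2 + np.random.randn(20)

model = LinearRegression().fit(X, y)

# Learned parameters carry a trailing underscore by scikit-learn convention
slope = model.coef_[0]
intercept = model.intercept_
print("slope:", slope, "intercept:", intercept)

# predict() evaluates the fitted line at the new points
X_f = np.linspace(0.0, 1.0, 100)[:, np.newaxis]
manual = X_f.squeeze() * slope + intercept
print(np.allclose(manual, model.predict(X_f)))
```

With only 20 noisy points, the fitted slope and intercept will generally differ somewhat from the true values 3 and 2.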

In [None]:
#import the RandomForestRegressor class
from sklearn.ensemble import RandomForestRegressor

#Instantiate the estimator object : i.e., model with some parameters.


#Use the fit method of the model object.


#Get some new X data whose y values need to be predicted


# Predict the y values for the new X data


# Plot the data along with the fit. 
plt.plot(X.squeeze(), y, 'ro')
plt.plot(X_f.squeeze(), y_f, 'k-')


## Polynomial Regression

Polynomial regression is a natural extension of linear regression: the input is first expanded into polynomial features, and a linear model is then fitted to those features. It involves *sklearn.preprocessing* and *sklearn.pipeline*.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model  import LinearRegression
from sklearn.pipeline import Pipeline

#Instantiate the estimator object : i.e., model with some parameters.
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])

#Create some data 
X = np.arange(45)
y = 3.0 - 2.0*X + 5.0*X*X + 0.75*X*X*X

#Use the fit method of the model object.
model.fit(X[:,np.newaxis], y)

#print the coeffs.
model.named_steps['linear'].coef_
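
The following sketch shows what the two pipeline steps do, under the assumption of a small, well-scaled data set chosen here purely for illustration (it is not the lecture's data). `PolynomialFeatures` includes a constant column by default, which is why the linear step is used with `fit_intercept=False`: the first coefficient plays the role of the intercept.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# PolynomialFeatures turns each x into the row [1, x, x^2, x^3]
poly = PolynomialFeatures(degree=3)
expanded = poly.fit_transform(np.array([[2.0], [3.0]]))
print(expanded)  # rows: [1, 2, 4, 8] and [1, 3, 9, 27]

# Illustrative noise-free cubic (hypothetical data for this sketch)
x = np.linspace(-2.0, 2.0, 30)
y = 3.0 - 2.0 * x + 5.0 * x**2 + 0.75 * x**3

model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])
model.fit(x[:, np.newaxis], y)

# Coefficients come back in order of increasing power: constant, x, x^2, x^3
coeffs = model.named_steps['linear'].coef_
print(np.round(coeffs, 6))
```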


## Data format in Scikit Learn

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a
**two-dimensional array or matrix**.  The arrays can be
either ``numpy`` arrays, or in some cases ``scipy.sparse`` matrices.
The size of the array is expected to be `[n_samples, n_features]`

- **n_samples:**   The number of samples: each sample is an item to process (e.g. classify).
  A sample can be a document, a picture, a sound, a video, an astronomical object,
  a row in database or CSV file,
  or whatever you can describe with a fixed set of quantitative traits.
- **n_features:**  The number of features or distinct traits that can be used to describe each
  item in a quantitative manner.  Features are generally real-valued, but may be boolean or
  discrete-valued in some cases.
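
As a minimal sketch of the `[n_samples, n_features]` layout, the snippet below reshapes a flat array of measurements into a single-feature matrix. The arrays `values` and `other` are made-up numbers used only for illustration.

```python
import numpy as np

# Five measurements of a single feature, stored as a flat 1-D array
values = np.array([0.1, 0.4, 0.35, 0.8, 0.6])
print(values.shape)

# Two equivalent ways to obtain the [n_samples, n_features] layout
X_a = values[:, np.newaxis]
X_b = values.reshape(-1, 1)
print(X_a.shape, X_b.shape)

# Several 1-D feature arrays can be stacked as columns
other = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_two = np.column_stack([values, other])
print(X_two.shape)
```

This is the same trick used in the cells above, where `X[:, np.newaxis]` turns a 1-D array into a column before calling `fit`.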


A simple dataset that we will look at is the Iris dataset. It is widely used in tutorials to explain the basic principles of machine learning.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

### Loading the Iris Data with Scikit-Learn

Scikit-learn has a very straightforward set of data on these iris species.  The data consist of
the following:

- Features in the Iris dataset:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm

- Target classes to predict:

  1. Iris Setosa
  2. Iris Versicolour
  3. Iris Virginica
  
``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:

In [None]:
# Information about the data
print(iris.data.shape)
print(iris.feature_names)
print(iris.target_names)
print(iris.target)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x_index = 0
y_index = 1

# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
            c=iris.target, cmap=plt.get_cmap('RdYlBu', 3))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])

### Exercise. 

Try different combinations of features (change `x_index` and `y_index`) and find the pair for which the classes are most clearly separated by eye.


### More Data. 

Scikit-learn provides three kinds of datasets from various domains for exercises and learning.

- **Packaged data** : datasets.load_..
- **Downloadable data** : datasets.fetch_..
- **Customized data**   : datasets.make_..
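
A short sketch of the three families follows. It uses `load_iris` and `make_blobs` as examples of the first and third kind; the `fetch_` functions are only listed, not called, since they download data on first use. The variable names are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets

# Packaged data: shipped with scikit-learn, no network access needed
iris = datasets.load_iris()
print(iris.data.shape)

# Generated data: synthetic samples with a controllable structure
X_blobs, y_blobs = datasets.make_blobs(n_samples=100, n_features=2,
                                       centers=3, random_state=0)
print(X_blobs.shape, np.unique(y_blobs))

# Downloadable data: list the available fetch_ helpers without calling them
fetchers = [name for name in dir(datasets) if name.startswith('fetch_')]
print(fetchers[:5])
```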

## Classification using K-NearestNeighbours.

In [None]:
from sklearn import neighbors

#Instantiate the estimator object : i.e., model with some parameters.
model = neighbors.KNeighborsClassifier(n_neighbors=5)

#Use the fit method of the model object.
X, y = iris.data, iris.target
model.fit(X, y)

#Get some new X data whose y values need to be predicted
X_f = np.array([[3,5,4,2],])


# Predict the y values for the new X data
y_f = model.predict(X_f)
print(iris.target_names[y_f])

model.predict_proba(X_f)
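
To see where the probabilities come from, the sketch below looks up the five nearest training samples by hand and counts their labels. It assumes the default uniform weighting of `KNeighborsClassifier`, where each neighbour casts one vote; `votes` and `manual_proba` are names introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
k = 5
model = KNeighborsClassifier(n_neighbors=k).fit(X, y)

X_f = np.array([[3, 5, 4, 2]])

# Indices of the k training samples closest to the query point
distances, indices = model.kneighbors(X_f)
neighbor_labels = y[indices[0]]

# With uniform weights, the class probability is the fraction of votes
votes = np.bincount(neighbor_labels, minlength=3)
manual_proba = votes / k

print(neighbor_labels)
print(manual_proba)
print(model.predict_proba(X_f)[0])
```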


In [None]:
X = iris.data[:, :2]  # we only take the first two features. 
y = iris.target

knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                    np.linspace(y_min, y_max, 100))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

#Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z,alpha=0.4)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
ax = plt.axis('tight')

## Exercise 

Classify the Iris dataset using the Support Vector Classifier.
*from sklearn.svm import SVC*

In [None]:
from sklearn.svm import SVC

#Instantiate the estimator object : i.e., model with some parameters.


#Use the fit method of the model object.


#Get some new X data whose y values need to be predicted


# Predict the y values for the new X data



## Cross Validation -- Tip of the Iceberg. 

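It now makes sense to check which of the models performs better, and how to choose the parameter values given to a model. Evaluating a model on the very data it was trained on is misleading: with `n_neighbors=1`, each training point is its own nearest neighbour, so the predictions are trivially correct.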

In [None]:
from sklearn.neighbors import KNeighborsClassifier
X, y = iris.data, iris.target
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
y_pred = clf.predict(X)
print(np.all(y == y_pred))

In [None]:
#The confusion matrix and the accuracy score are measures of classifier performance.

from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(y, y_pred))
print(accuracy_score(y, y_pred))
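
The small worked example below illustrates how to read a confusion matrix: rows are the true classes, columns are the predicted classes, and the diagonal counts the correct predictions. The label arrays `y_true` and `y_hat` are hypothetical values made up for this sketch.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true labels and predictions for six samples
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_hat)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Accuracy is the diagonal (correct predictions) over the total count
acc = np.trace(cm) / cm.sum()
print(acc, accuracy_score(y_true, y_hat))
```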

### Test and Train data sets. 

The model has to be trained on one subset of the data and then tested on the rest. The question is how to split the data into training and test sets so that the split makes statistical sense, i.e. both sets contain a representative mix of samples. 

In [None]:
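# train_test_split lives in sklearn.model_selection in current scikit-learn
# (the old sklearn.cross_validation module has been removed).
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(X, y)
print(Xtr.shape, Xte.shape)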

clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(Xtr, ytr)
ypred = clf.predict(Xte)
print (confusion_matrix(yte, ypred))
print (accuracy_score(yte, ypred))
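
A single random split gives a score that depends on which samples happen to land in the test set. The sketch below shows two common refinements, as one possible approach rather than the lecture's own method: a stratified, reproducible split (the `stratify` and `random_state` arguments of `train_test_split`), and k-fold cross-validation with `cross_val_score`. The choice of 5 folds is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions similar in both subsets;
# random_state makes the shuffle reproducible
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25,
                                      stratify=y, random_state=0)
print(np.bincount(ytr), np.bincount(yte))

# k-fold cross-validation: every sample is used for testing exactly once
clf = KNeighborsClassifier(n_neighbors=7)
scores = cross_val_score(clf, X, y, cv=5)
print(scores, scores.mean())
```

The mean of the fold scores is usually a more stable estimate than the accuracy from one split.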

## Application for Reading Digits. - Optical Character Recognition

In [None]:
from sklearn import datasets
digits = datasets.load_digits()
digits.images.shape

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
    ax.set_xticks([])
    ax.set_yticks([])

In [None]:
# The images themselves
print(digits.images.shape)
print(digits.images[0])

In [None]:
print(digits.data.shape)
print(digits.data[0])

In [None]:
print(digits.target)

So the dataset contains 1797 hand-written digits, and each digit is described by 64 pixel values (an 8 x 8 image).
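
The relation between `digits.images` and `digits.data` can be checked directly. The sketch below flattens each 8 x 8 image into a row of 64 values and compares the result with `digits.data`; `flattened` and `restored` are names introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten each 8 x 8 image into one row of 64 pixel values
n_samples = digits.images.shape[0]
flattened = digits.images.reshape(n_samples, -1)
print(flattened.shape)

# The flattened images match the feature matrix used for learning
print(np.array_equal(flattened, digits.data))

# Going back: one row of the feature matrix reshaped into an image
restored = digits.data[0].reshape(8, 8)
print(np.array_equal(restored, digits.images[0]))
```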

## Exercise Task. 
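Use the *LogisticRegression* classifier and apply the cross-validation (train/test split) method that we learned to get the accuracy score. Store the test features, test labels and predictions in `Xte`, `yte` and `ypred`, then display the confusion matrix using the plot command that follows. 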

In [None]:
# Your LogisticRegression Classifier model. 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


In [None]:
plt.imshow(np.log(confusion_matrix(yte, ypred)),
           cmap='Blues', interpolation='nearest')
plt.grid(False)
plt.ylabel('true')
plt.xlabel('predicted');

Finally see for yourself which numbers have not been correctly identified 
with your classifier. 

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for i, ax in enumerate(axes.flat):
    ax.imshow(Xte[i].reshape(8, 8), cmap='binary')
    ax.text(0.05, 0.05, str(ypred[i]),
            transform=ax.transAxes,
            color='green' if (yte[i] == ypred[i]) else 'red')
    ax.set_xticks([])
    ax.set_yticks([])