# <font color='blue'> Week - 1: It's All About Neighbors </font>
*Provided by:*
![ENSTABrain](images/LogoEB.png)

# <font color='blue'>Agenda</font>

- What is the K-Nearest Neighbors (KNN) classification model?
- What is the famous Iris dataset and how do we load it?
- Why scikit-learn?
- Three Steps to build a model (Build, Train, Test)
- How to Evaluate the Estimator?
- What is the overfitting problem and how do we address it?

# <font color='blue'>K-Nearest Neighbors Classifier</font>

1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
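
The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not how scikit-learn implements KNN: the function name `knn_predict`, the toy data, and the choice of Euclidean distance are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # keep the indices of the k smallest distances
    nearest = np.argsort(distances)[:k]

    # Step 3: the most popular label among those k neighbors wins
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# toy data (made up for illustration): two small, well separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

# Step 1: pick a value for K (here k=3)
print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))  # 0
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))  # 1
```

Each new point is assigned the label of the cluster it sits in, because its three nearest neighbors all carry that label.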

### <font color='blue'>Example training data</font>

![Training data](images/data.png)

### <font color='blue'>KNN classification map (K=1)</font>

![1NN classification map](images/data1N.png)

### <font color='blue'>KNN classification map (K=5)</font>

![5NN classification map](images/data5N.png)

*Image Credits: [Data3classes](http://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](http://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](http://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*

# <font color='blue'>Introducing the Iris dataset</font>


![iris](images/iris.png)

- 50 samples of each of 3 different species of iris (150 samples total)
- Measurements: sepal length, sepal width, petal length, petal width

# <font color='blue'>Loading the data</font>

In [70]:
# since iris is extremely famous, sklearn added it to their library
# that's how we're going to load it
from sklearn.datasets import load_iris
import pandas as pd # import pandas to make matrix prettier

# load a "Bunch" object that contains all iris dataset utilities
iris = load_iris()

# Create an Input dataframe with the iris input data
# Name the columns with the iris feature names
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# show the first 5 rows
X.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [34]:
# show the last 5 rows
X.tail()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3
149,5.9,3.0,5.1,1.8


In [35]:
# build the output vector (labels) from the iris targets
# and name this Series 'species'
y = pd.Series(iris.target, name='species')
y.head()

0    0
1    0
2    0
3    0
4    0
Name: species, dtype: int32

In [36]:
# verify we have as many samples in Inputs as in Labels
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [37]:
# Do you want to see something cool?
# Let's change the numeric labels to strings naming the flower species.
target = pd.DataFrame(y)
target.head()

Unnamed: 0,species
0,0
1,0
2,0
3,0
4,0


In [38]:
target['in_string'] = target.species.apply(lambda x:'setosa' if x==0 else 'versicolor' if x==1 else 'virginica')
target.head()

Unnamed: 0,species,in_string
0,0,setosa
1,0,setosa
2,0,setosa
3,0,setosa
4,0,setosa


In [39]:
target.tail()

Unnamed: 0,species,in_string
145,2,virginica
146,2,virginica
147,2,virginica
148,2,virginica
149,2,virginica


# <font color='blue'>Why Scikit-Learn?</font>

### Benefits:

- **Consistent interface** to machine learning models
- Provides many **tuning parameters** but with **sensible defaults**
- Exceptional **documentation**
- Rich set of functionality for **companion tasks**
- **Active community** for development and support

### Potential drawbacks:

- Harder (than R) to **get started with machine learning**
- Less emphasis (than R) on **model interpretability**

# <font color='blue'>Build it, Train it, Test it !</font>
**Step 1:** Build the Classifier

In [106]:
# Start by importing the classifier, in our case KNN
# The model is a class that we instantiate to get our Estimator
# 'Estimator' is scikit-learn's name for a model (the "thinking" part of an intelligent machine)
# The term is borrowed from estimators in statistics.
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the classifier to use
# Instantiate means make an instance of ==> create an object
knn = KNeighborsClassifier(n_neighbors=1)

#see it
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')


- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults
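
As a small illustration of the last two points, the sketch below sets two hyperparameters explicitly and reads the full configuration back with `get_params()`. The variable name `knn_custom` and the chosen values (`n_neighbors=3`, `weights='distance'`) are arbitrary picks for this example.

```python
from sklearn.neighbors import KNeighborsClassifier

# only n_neighbors and weights are set here; everything else keeps its default
knn_custom = KNeighborsClassifier(n_neighbors=3, weights='distance')

# get_params() returns a dict of every hyperparameter and its current value
params = knn_custom.get_params()
print(params['n_neighbors'])  # 3 (set explicitly)
print(params['weights'])      # distance (set explicitly)
print(params['metric'])       # minkowski (default)
print(params['p'])            # 2 (default)
```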

**Minkowski metric:**

$$Minkowski=\left ( \sum_{i=1}^{n} \left | x_{i} - y_{i} \right | ^{ p} \right )^{\frac{1}{p}}$$

**For p = 2, it becomes the usual Euclidean metric:**

$$Euclidean=\sqrt{\sum_{i=1}^{n} \left | x_{i} - y_{i} \right | ^{ 2}}$$
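
To make the formula concrete, here is a minimal NumPy sketch of the Minkowski distance. The helper name `minkowski` and the two example points are assumptions for illustration; scikit-learn computes the distance internally.

```python
import numpy as np


def minkowski(x, y, p=2):
    """Minkowski distance of order p between two points of the same dimension."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # sum |x_i - y_i|^p over all coordinates, then take the p-th root
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)


a = [0, 0]
b = [3, 4]
print(minkowski(a, b, p=1))  # 7.0 (p = 1 is the Manhattan distance)
print(minkowski(a, b, p=2))  # 5.0 (p = 2 is the usual straight-line distance)
```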

**Step 2:** Train the Classifier on your training dataset

- Model is learning the relationship between X and y
- Occurs in-place (inside the same object)
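
"In-place" means that `fit` stores what it learned inside the estimator object and returns that same object. The sketch below shows this on a tiny made-up dataset; `X_small`, `y_small`, and `model` are hypothetical names used only here.

```python
from sklearn.neighbors import KNeighborsClassifier

# tiny made-up dataset, only to illustrate the call
X_small = [[0.0], [1.0], [2.0], [3.0]]
y_small = [0, 0, 1, 1]

model = KNeighborsClassifier(n_neighbors=1)
returned = model.fit(X_small, y_small)

# fit() updates the object itself and hands the same object back
print(returned is model)  # True
# the labels seen during training are now stored on the object
print(model.classes_)     # [0 1]
```

There is therefore no need to write `model = model.fit(...)`; calling `model.fit(...)` is enough.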

In [107]:
# Feel the scikit-learn power
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

**Step 3:** Test the Classifier on new data

In [108]:
# Use some arbitrary new data
X_test = pd.DataFrame([[3, 5, 4, 2], [5, 4, 3, 2]], columns=iris.feature_names)
X_test.head()


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,3,5,4,2
1,5,4,3,2


In [109]:
# Feel more scikit-learn power
result = knn.predict(X_test)

# show the Estimator results
print(result)

# visualize its type
print(type(result))

[2 1]
<class 'numpy.ndarray'>


In [110]:
# Integers may not be clear enough, so let's replace them with the species names
# for that we will use the numpy.where() function
import numpy as np

# Create our conversion function
def convert(x):
    """
    Args:
        x: list like object (list, numpy array, or pandas series)
    
    Returns:
        numpy.ndarray Object containing respective iris flowers names.
    """
    # use numpy.copy() to make sure our x is transformed to a numpy array to
    # be able to support all list like objects and make sure we don't change the
    # original object passed to the function.
    x = np.copy(x)
    
    # swap the integers with corresponding iris flower names
    return np.where(x==0, 'setosa', np.where(x==1, 'versicolor', 'virginica'))

# let's see how it looks
print(convert(result))

['virginica' 'versicolor']


In [111]:
# Build, Train, Test
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
result = knn.predict(X_test)

# print the results
print(result)
print(convert(result))

[1 1]
['versicolor' 'versicolor']


# <font color='blue'>Model Evaluation</font>

**First, we build, train and test our model**

In [None]:
# Build our model with k = 5
knn = KNeighborsClassifier(n_neighbors=5)
# Train the Estimator
knn.fit(X, y)
# Predict on the training data
y_pred = knn.predict(X)

**Accuracy score:**
- Measures the fraction of samples the Estimator classifies correctly, so our objective is to maximize it
- This metric is widely used in classification problems

$$Accuracy=\frac{1}{n} \sum_{i=1}^n I(label_{i} = prediction_{i})$$

**It is important to note that it is the complement of the popular classification error** (Accuracy = 1 - CE)
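
The relation between accuracy and classification error can be checked on a handful of made-up labels. The arrays and variable names below are assumptions chosen purely for illustration.

```python
import numpy as np

# made-up labels and predictions, for illustration only
labels = np.array([0, 1, 2, 2, 1])
predictions = np.array([0, 1, 1, 2, 1])

# fraction of correct predictions (4 out of 5)
accuracy_value = np.mean(labels == predictions)
# fraction of wrong predictions (1 out of 5)
classification_error = np.mean(labels != predictions)

print(accuracy_value)        # 0.8
print(classification_error)  # 0.2
```

Every sample is either right or wrong, so the two fractions always sum to 1.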

In [57]:
# To evaluate our models we use metrics
# In this case we will use accuracy metric
from sklearn import metrics

# Accuracy score is a function that we need to maximize
train_criteria = metrics.accuracy_score(y, y_pred)
print('Accuracy score is equal to :', train_criteria)

Accuracy score is equal to : 0.966666666667


In [58]:
# let's implement this metric on our own !!
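def accuracy(y_true, y_pred):
    # count the predictions that match the true labels
    # (named 'correct' so we don't shadow the built-in sum())
    correct = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == predicted:
            correct += 1
    acc = correct / len(y_pred)
    return acc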

# testing it.
print(accuracy(y, y_pred))

0.9666666666666667


In [59]:
# Build, Train and test on the same data with k = 1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_pred = knn.predict(X)

train_criteria = metrics.accuracy_score(y, y_pred)
print('Accuracy score is equal to :', train_criteria)

Accuracy score is equal to : 1.0


In [60]:
# our implemented accuracy?
print(accuracy(y, y_pred))

1.0


# <font color='blue'>The overfitting problem and how to address it</font>

**Overfitting:**
![overfit](images/overfit.png)


In [61]:
# Sklearn contains optimized functions that allow us to divide our data into
# training and testing sets.
# (since scikit-learn 0.18 train_test_split lives in sklearn.model_selection;
# the older sklearn.cross_validation module was removed in 0.20)
from sklearn.model_selection import train_test_split

# Divide Inputs and Outputs into training and testing sets.
# Fix random_state so that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

**Dividing the dataset into train and test sets:**

![train_test](images/train_test.png)
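
A quick way to see what the split produces is to check the shapes of the resulting sets. The sketch below assumes a recent scikit-learn, where `train_test_split` is imported from `sklearn.model_selection`, and uses fresh variable names (`X_all`, `X_tr`, `X_te`, ...) so it does not interfere with the notebook's own variables.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load features and labels directly as NumPy arrays
X_all, y_all = load_iris(return_X_y=True)

# keep 40% of the 150 samples aside for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.4, random_state=4)

print(X_tr.shape, X_te.shape)  # (90, 4) (60, 4)
print(y_tr.shape, y_te.shape)  # (90,) (60,)
```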

In [62]:
# Build, Train, Test k = 1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# This time, we calculate the criteria on data never seen before
test_criteria = metrics.accuracy_score(y_test, y_pred)
print(test_criteria)

0.95


In [63]:
# testing our cool accuracy function.
print(accuracy(y_test, y_pred))

0.95


In [64]:
# Build, Train, Test k = 5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# We calculate the test_criteria once again
test_criteria = metrics.accuracy_score(y_test, y_pred)
print(test_criteria)

0.966666666667


In [65]:
# testing our implemented metric
print(accuracy(y_test, y_pred))
# Awesome !

0.9666666666666667


# <font color='blue'>Want to use other Models? No problem at all. </font>

In [67]:
# Since sklearn is built for general-purpose machine learning,
# its architecture lets us follow the same steps for every supervised
# learning problem, whatever the Estimator.
from sklearn.linear_model import LogisticRegression

LogReg = LogisticRegression()
LogReg.fit(X_train, y_train)
y_pred = LogReg.predict(X_test)
test_criteria = metrics.accuracy_score(y_test, y_pred)
print(test_criteria)

0.95
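
Because every estimator exposes the same `fit` / `predict` methods, several models can be compared in one loop. The following is a sketch under assumptions: `DecisionTreeClassifier` and the `max_iter=1000` setting are additions not used elsewhere in this notebook, and the data is reloaded under new names so the block runs on its own.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.4, random_state=4)

# three different estimators, one shared interface
estimators = {
    'knn (k=5)': KNeighborsClassifier(n_neighbors=5),
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, estimator in estimators.items():
    estimator.fit(X_tr, y_tr)                      # Train
    y_hat = estimator.predict(X_te)                # Test
    scores[name] = metrics.accuracy_score(y_te, y_hat)
    print(name, scores[name])
```

The exact scores depend on the scikit-learn version and the estimator settings, so none are quoted here.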


# <font color='blue'>Resources</font>

- **Stanford University Book:** [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)
- **Github repository:** [Introduction to scikit-learn](https://github.com/justmarkham/scikit-learn-videos)
- **Github repository:** [Introduction to ML with python](https://github.com/amueller/introduction_to_ml_with_python)
- **Online PDF Book:** [Artificial Intelligence with python](https://drive.google.com/file/d/0B1QpT-7pDV8nYWhkUWd6N0NaeUU/view)

# <font color='blue'>Week - 1 Challenge</font>

**For this Challenge, you will need to:**
- Get the challenge dataset, which will be provided on "ENSTABrain Classroom" when the course is posted.
- Plot the accuracy scores obtained with KNN for k ranging from 1 to 30.
- Extract the number of neighbors (k) that gives the best accuracy.
- Send the Jupyter Notebook containing your solution to "ENSTABrain@gmail.com"

# <font color='blue'>Thank you for your attention, see you next time</font>