# Training a machine learning model with scikit-learn
*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*

## Agenda

- What is the **K-nearest neighbors** classification model?
- What are the four steps for **model training and prediction** in scikit-learn?
- How can I apply this pattern to **other machine learning models**?

## Reviewing the iris dataset

In [1]:
from IPython.display import IFrame
IFrame('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', width=300, height=200)

- 150 **observations**
- 4 **features** (sepal length, sepal width, petal length, petal width)
- **Response** variable is the iris species
- **Classification** problem since response is categorical
- More information in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

## How might humans predict the species of an unknown iris given its measurements?
 - Looking at the data, we might notice that the three species have somewhat dissimilar measurements
 - If so, we might hypothesise that the species of an unknown Iris could be predicted by looking for Irises in the data with similar measurements
 - We could then assume that our unknown Iris is the same species as those similar Irises
 - This process is similar to how the KNN classification model works

## K-nearest neighbors (KNN) classification
1. Pick a value for K 
    - e.g. 5. We'll see in the next video how to choose this value
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
    - That is, the model calculates the numerical distance between the unknown Iris and each of the 150 known Irises and (for K=5) selects the 5 known Irises with the smallest distance to the unknown Iris
    - Euclidean distance is often used, but other distance metrics are possible
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
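
The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not how scikit-learn implements KNN: the function name `knn_predict` and the toy data are hypothetical, Euclidean distance is assumed, and ties are broken in favour of the class seen first among the nearest neighbours.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Indices of the k training observations with the smallest distances
    nearest = np.argsort(distances)[:k]

    # Step 3: the most popular response value among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated classes with two features each
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

# Step 1: pick a value for K (here K=3)
print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))
print(knn_predict(X_toy, y_toy, [5.0, 5.0], k=3))
```

The first query point sits among the class 0 observations and the second among the class 1 observations, so the votes go to those classes respectively.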

### Example training data
Note that this is not the Iris dataset; it is a separate example used to illustrate KNN:
- This dataset has two numerical features represented by the X & Y co-ordinates
- Each point represents an observation and the colour of the point represents its Response class (Red, Blue or Green)

![Training data](images/04_knn_dataset.png)

### KNN classification map (K=1)
- The background here has been coloured Red in every area where the nearest neighbour is Red
    - The same has been done for Blue and Green
    - The background colour tells you what the predicted Response value would be for a new observation, depending on its X & Y features
    - If a new point fell in the small Green area inside the Blue area (adjacent to another Green area, bottom left of the map), its predicted Response class would be Green because its nearest neighbour is Green

![1NN classification map](images/04_1nn_map.png)

### KNN classification map (K=5)
 - Decision boundaries: the boundaries between colours have changed because more neighbours are taken into account when making predictions
 - The predicted Response for the new observation in the example above has now changed from Green to Blue because four of its five nearest neighbours are Blue
 - The white areas are where KNN cannot make a clear decision because there is a tie between two classes
 - KNN is a very simple ML model, but it can make highly accurate predictions if the different classes in the dataset have very dissimilar feature values

![5NN classification map](images/04_5nn_map.png)

*Image Credits: [Data3classes](http://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](http://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](http://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*

## Loading the data

In [3]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

In [4]:
# print the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


 - X is a 2D array with 150 rows and 4 columns
 - y is a 1D array with length 150
     - There is one response value for each observation
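
As a small sketch of what else the "bunch" object holds, the snippet below prints the names behind the columns of X and the integer codes in y. It assumes the `feature_names` and `target_names` attributes of the object returned by `load_iris`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the 4 feature columns in iris.data
print(iris.feature_names)

# Species names; the position of each name is its integer code in iris.target
print(iris.target_names)

# One row per observation, one column per feature
print(iris.data.shape)
```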

When loading your own data into scikit-learn, make sure it meets the four key requirements for input data that were outlined in the previous lecture

## scikit-learn 4-step modeling pattern
Let's begin the actual ML process.
scikit-learn provides a uniform interface to ML models, so there is a common pattern that can be reused across different models...

**Step 1:** Import the class you plan to use

In [5]:
from sklearn.neighbors import KNeighborsClassifier

scikit-learn is carefully organised into modules, such as `neighbors`, so that it is easy to find the class that you are looking for

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model
    - scikit-learn refers to its models as estimators because their primary role is to estimate unknown quantities
- "Instantiate" means "make an instance of"
    - in this example, an instance of the KNeighborsClassifier class

In [6]:
knn = KNeighborsClassifier(n_neighbors=1)

- We have now created an instance of the KNeighborsClassifier class and called it knn
    - i.e. we now have an object called knn that knows how to do KNN classification and is just waiting for some data
- The name of the object does not matter, but it is a good idea to choose something that helps you remember what it is
    - Possible names are est for estimator or clf for classifier
    - You might choose something to remind you which model you are using
- You can specify tuning parameters (aka "hyperparameters") during this step
    - ``n_neighbors=1`` tells the knn object that, when it runs the KNN algorithm, it should look for the 1 nearest neighbour
    - n_neighbors is a tuning parameter or hyperparameter (covered in the next lecture)
- All parameters not specified are set to their defaults
    - By printing out the estimator object we can see all of those parameters

In [7]:
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')


scikit-learn provides sensible defaults for its models so that you can get started with a new model without researching the meaning of every parameter
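
In more recent scikit-learn releases, printing an estimator may show only the parameters that differ from their defaults, so the output can look shorter than the one above. As a hedged alternative, the sketch below uses the estimator's `get_params` method to list every parameter and its current value.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

# get_params() returns a dict mapping each parameter name to its current value
params = knn.get_params()

# The parameter we set explicitly
print(params['n_neighbors'])

# All parameters, including those left at their defaults
for name, value in sorted(params.items()):
    print(name, '=', value)
```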

**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in-place

In [8]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

 - Where X is the feature matrix and y is the response vector
 - This operation occurs in place, which is why I do not have to assign the result to another object
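
To make the "in place" point concrete, here is a minimal sketch showing that `fit` hands back the very same object it was called on, now holding what it learned. The variable name `returned` is only for illustration; `classes_` is the fitted attribute listing the response values seen during training.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=1)

# fit() modifies knn itself and returns that same object
returned = knn.fit(X, y)
print(returned is knn)

# After fitting, the estimator stores the response values it has seen
print(knn.classes_)
```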

**Step 4:** Predict the response for a new observation<br>
That is, I am inputting the measurements for an unknown Iris and asking the fitted model to predict the Iris species based on what it has learned in the previous step 

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process

In [9]:
knn.predict([[3, 5, 4, 2]])

array([2])

We use the predict method on the knn object and pass it the features of the unknown Iris (as a nested Python list)<br>
It expects a NumPy array, but a list still works because it is automatically converted to an array of the appropriate shape<br>
- The predict method returns an object: a NumPy array with the predicted response value
    - In this case, the KNN algorithm, using K=1, predicts a response value of 2
    - scikit-learn does not know what this 2 represents, so we need to keep track of the fact that 2 is the encoding for Virginica, and thus Virginica is the predicted species for the unknown Iris
- Can predict for multiple observations at once
    - Below we create a list of lists for two new observations...

In [10]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)

array([2, 1])

- When we pass X_new to the predict method, it gets converted to a NumPy array, this time with a shape of 2 x 4, which is interpreted as 2 observations with 4 features each
- The predict method returns a NumPy array with the values 2 and 1, which means that the prediction for the first unknown Iris was 2 (Virginica) and the prediction for the second unknown Iris was 1 (Versicolor)
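
Rather than remembering the encoding by hand, the predicted codes can be used to index the species names stored with the dataset. This is a small sketch that assumes the `target_names` attribute of the object returned by `load_iris`; the variable names `predictions` and `species` are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# The nested list is interpreted as 2 observations with 4 features each
print(np.asarray(X_new).shape)

# Integer codes predicted by the model
predictions = knn.predict(X_new)

# Use the codes as indices into the array of species names
species = iris.target_names[predictions]
print(species)
```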

## Using a different value for K
This is known as model tuning: you vary the arguments that you pass to the model. Note that you do not have to import the class again; you just instantiate the model with the new ``n_neighbors`` argument, fit the model with the data and make predictions...

In [11]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)

array([1, 1])

This time, the model predicts the value 1 for both unknown Irises
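
To see where a K=5 prediction comes from, one option is to ask the fitted model which training observations it treats as the nearest neighbours and then count their response values. The sketch below assumes the `kneighbors` method of `KNeighborsClassifier`; the variable names `neighbor_labels` and `vote_counts` are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Distances to, and row indices of, the 5 nearest training observations
distances, indices = knn.kneighbors(X_new)

# Response values of those neighbours: one row per new observation
neighbor_labels = y[indices]

# Count the votes for each of the 3 classes, per new observation
vote_counts = np.array([np.bincount(row, minlength=3) for row in neighbor_labels])
print(vote_counts)
```

Each row of `vote_counts` sums to 5, and the column with the most votes corresponds to the response value the model favours for that observation.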

## Using a different classification model
All scikit-learn models have a uniform interface, which means that you can use the same 4-step pattern on a different model with relative ease. For example, if we wanted to try logistic regression, which is another model used for classification:
- Import LogisticRegression from the linear_model module
- Instantiate the model with all of the default parameters
- Fit the model with data
- Make predictions

**Note:** Logistic Regression is an extension of Linear Regression, so we may need a grounding in Linear Regression before we can tackle Logistic Regression

In [12]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations
logreg.predict(X_new)

array([2, 0])

This time, the model predicts a value of 2 (Virginica) for the first unknown Iris and a value of 0 (Setosa) for the second unknown Iris<br><br>
**Which model produced the correct predictions?**<br>
We don't know! These are out-of-sample observations, meaning that we do not know the true response values<br>
Our goal with Supervised Learning is to build models that generalise to new data. However, we are often unable to truly measure how well our models will perform on out-of-sample data<br><br>
**Model evaluation procedures**<br>
In the next lecture, we are going to look at evaluation procedures for our models, which will allow us to estimate how well they are likely to perform on out-of-sample data using our existing labelled data.<br>
These procedures will help us to choose which value of K is best for KNN and to choose whether KNN or Logistic Regression is a better choice for our particular task
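
Because every model in this lesson follows the same pattern, the instantiate, fit and predict steps can be sketched as one loop over several estimators. The `models` and `results` dictionaries are only illustrative, and `max_iter=1000` is an assumption added to avoid convergence warnings in newer scikit-learn releases; it is not part of the original lesson, and the predictions printed may differ from the outputs above depending on the installed version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# Step 1 (import) is done above; Step 2: instantiate each estimator
models = {
    'KNN (K=1)': KNeighborsClassifier(n_neighbors=1),
    'KNN (K=5)': KNeighborsClassifier(n_neighbors=5),
    # max_iter is raised here as an assumption, not a setting from the lesson
    'Logistic regression': LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X, y)                         # Step 3: fit the model with data
    results[name] = model.predict(X_new)    # Step 4: predict for new observations
    print(name, results[name])
```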

## Resources

- [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html) (user guide), [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (class documentation)
- [Logistic Regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) (user guide), [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (class documentation)
- [Videos from An Introduction to Statistical Learning](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)
    - Classification Problems and K-Nearest Neighbors (Chapter 2)
    - Introduction to Classification (Chapter 4)
    - Logistic Regression and Maximum Likelihood (Chapter 4)

## Comments or Questions?

- Email: <kevin@dataschool.io>
- Website: http://dataschool.io
- Twitter: [@justmarkham](https://twitter.com/justmarkham)
