# 4. Training a Machine Learning model with scikit-learn

- What is the K-nearest neighbors classification model?

- What are the four steps for model training and prediction in scikit-learn?

- How can I apply this pattern to other Machine Learning models?

## K-nearest neighbors (KNN) classification

1. Pick a value for K.

2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.

3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
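The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not how scikit-learn implements KNN: the function name `knn_predict`, the toy data, and the choice of Euclidean distance as the meaning of "nearest" are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=1):
    """Hypothetical helper: predict one response with a majority vote of k neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: Euclidean distance from the unknown observation to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]

    # Step 3: the most popular response among the k nearest neighbors
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up toy data: two well-separated groups labeled 0 and 1
X_toy = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]]
y_toy = [0, 0, 1, 1]

# Step 1: pick a value for K
toy_prediction = knn_predict(X_toy, y_toy, [1.1, 0.9], k=3)
print(toy_prediction)  # 0 (two of the three nearest neighbors are labeled 0)
```

Ties in the vote are resolved here by whichever label was counted first, which is a simplification.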

## Example using the iris dataset

### Loading the data

In [5]:
# Import load_iris function from datasets module
from sklearn.datasets import load_iris

# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

In [6]:
# print the shapes
print(X.shape)
print(y.shape)

(150, 4)
(150,)


## <span style="color:green"> scikit-learn 4-step modeling pattern</span>


#### <span style="color:blue"> Step 1: Import the class you plan to use</span>


In [8]:
from sklearn.neighbors import KNeighborsClassifier

#### <span style="color:blue"> Step 2: "Instantiate" the estimator</span>

- "Estimator" is `scikit-learn`'s term for a model

- "Instantiate" means "make an instance of"

In [10]:
knn = KNeighborsClassifier(n_neighbors=1)

Note the following:

- Name of the object does not matter

- Can specify tuning parameters (aka "hyperparameters") during this step

- All parameters not specified are set to their defaults
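To see which values the unspecified parameters received, one option is the estimator's `get_params()` method. The sketch below uses a separate object name, `knn_example`, chosen only for this illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Estimator with one hyperparameter set explicitly
knn_example = KNeighborsClassifier(n_neighbors=1)
params = knn_example.get_params()
print(params["n_neighbors"])  # 1

# Estimator created with no arguments keeps every default
default_params = KNeighborsClassifier().get_params()
print(default_params["n_neighbors"])  # 5
```

`get_params()` returns a plain dictionary, so the full set of defaults can be inspected by printing it.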

In [14]:
# When you print or fit an estimator (such as "knn"), 
#   it only displays the parameters that have been changed from the default values
print(knn)

KNeighborsClassifier(n_neighbors=1)


#### <span style="color:blue"> Step 3: Fit the model with data (aka "model training")</span>

- Model is learning the relationship between X and y

- Occurs in place, i.e., you do not need to assign the result to another object

In [17]:
knn.fit(X, y)

KNeighborsClassifier(n_neighbors=1)
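The output above appears because `fit` returns the estimator itself after updating it. A small check of this, using the hypothetical names `X_demo`, `y_demo`, and `knn_demo` so that it runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load the features and response directly as two arrays
X_demo, y_demo = load_iris(return_X_y=True)

knn_demo = KNeighborsClassifier(n_neighbors=1)
returned = knn_demo.fit(X_demo, y_demo)

# The returned object and the original estimator are the same object
print(returned is knn_demo)  # True
```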

#### <span style="color:blue"> Step 4: Predict the response for a new observation</span>

- New observations are called "out-of-sample" data

- Uses the information it learned during the model training process

In [21]:
# option 1: pass the data as a nested list, which will be interpreted as having shape (1, 4)
knn.predict([[3, 5, 4, 2]])

array([2])

In [22]:
# option 2: explicitly change the shape to be (1, 4)
import numpy as np
knn.predict(np.reshape([3, 5, 4, 2], (1, 4)))

array([2])

In [23]:
# option 3: explicitly change the first dimension to be 1, let NumPy infer that the second dimension should be 4
knn.predict(np.reshape([3, 5, 4, 2], (1, -1)))

array([2])

Note that:

- `.predict()` returns a NumPy array

- It can predict the response for multiple observations at once

In [25]:
# predict the response for multiple observations
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new) # predicts 2 and 1 respectively

array([2, 1])
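The predictions are integer codes. One way to turn them into species names is to index the `target_names` attribute of the dataset object. The sketch below refits a K=1 model under the hypothetical names `iris_demo` and `knn_demo` so that it is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()
print(iris_demo.target_names)  # ['setosa' 'versicolor' 'virginica']

knn_demo = KNeighborsClassifier(n_neighbors=1)
knn_demo.fit(iris_demo.data, iris_demo.target)

# Integer codes for the two new observations
codes = knn_demo.predict([[3, 5, 4, 2], [5, 4, 3, 2]])

# Index the array of names with the predicted codes
names = iris_demo.target_names[codes]
print(names)
```

Under this encoding, the codes 2 and 1 shown above correspond to virginica and versicolor.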

### Using a different value for K

In [36]:
# you no longer need to import the class


# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)

array([1, 1])
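A prediction can change with K because more neighbors take part in the vote. One way to look at this is the `kneighbors()` method, which returns the distances to, and the row indices of, the nearest training observations. The names `X_demo`, `y_demo`, and `knn5` below are hypothetical and used only for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)

# Distances and training-row indices of the 5 nearest neighbors
distances, indices = knn5.kneighbors([[3, 5, 4, 2]])

# Response values of those neighbors: these are the votes behind the prediction
neighbor_labels = y_demo[indices[0]]
print(neighbor_labels)
```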

### Using a different classification model

Apply the same 4-step pattern!

In [40]:
# 1: Import the class
from sklearn.linear_model import LogisticRegression

# 2: Instantiate the model 
logreg = LogisticRegression(solver='liblinear') # this solver is chosen only to match the video

# 3: Fit the model with data
logreg.fit(X, y)

# 4: Predict the response for new observations
logreg.predict(X_new)

array([2, 0])
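Because every estimator exposes the same `fit` and `predict` methods, the pattern can be applied to several models in a loop. The sketch below is one possible arrangement, not a prescribed workflow. The dictionary keys and variable names are hypothetical, and the logistic regression here uses the default solver with a larger `max_iter` (an assumption made for this example), so its predictions may differ from the `liblinear` output shown above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_new_demo = [[3, 5, 4, 2], [5, 4, 3, 2]]

# Steps 1 and 2: import the classes and instantiate the estimators
models = {
    "knn_k1": KNeighborsClassifier(n_neighbors=1),
    "knn_k5": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),  # default solver, assumed setting
}

predictions = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)                        # Step 3: fit the model with data
    predictions[name] = model.predict(X_new_demo)    # Step 4: predict new observations
    print(name, predictions[name])
```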