## This notebook introduces the K Nearest Neighbor classification model

### Background Info:
#### The KNN classification model assigns a category to a new observation based on past observations. It finds the K past (known) observations that are closest to the new observation, measured by Euclidean distance, and assigns the category that is most common among those nearest neighbors.
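
As a small worked illustration (the second point is the first row of the iris data, picked only for convenience), the Euclidean distance between two observations $p$ and $q$ with $n$ features is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

For $p = (3, 5, 4, 2)$ and $q = (5.1, 3.5, 1.4, 0.2)$:

$$d(p, q) = \sqrt{(-2.1)^2 + 1.5^2 + 2.6^2 + 1.8^2} = \sqrt{16.66} \approx 4.08$$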

### Steps KNN uses to predict classification of new observation:
#### Step 1: Pick a value for K
#### Step 2: Search for the K observations in the training data that are the "nearest" to the measurements of the unknown observation
#### Step 3: Use the most popular response value among the K nearest neighbors as the predicted response value for the unknown observation
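
The three steps above can be sketched from scratch with NumPy. This is a minimal illustration, not how scikit-learn implements KNN internally; the function name `knn_predict` and the tiny two-cluster dataset are made up for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k training observations with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: the most common label among those neighbors is the prediction
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled 0 and 1
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.5], [6.5, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Step 1: pick K (here K=3)
print(knn_predict(X_train, y_train, np.array([1.2, 1.4]), k=3))  # prints 0
print(knn_predict(X_train, y_train, np.array([6.4, 6.8]), k=3))  # prints 1
```

Ties between classes are resolved here by whichever label `Counter` encounters first, which is one simple choice among several.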

### Data requirements for working with Scikit-Learn:
#### Requirement 1: Features and response/target must be separate objects
#### Requirement 2: Features and response/target must be numeric data type
#### Requirement 3: Features and response/target must be NumPy arrays (or objects that convert cleanly to them, such as pandas DataFrames)
#### Requirement 4: Features and response should have a specific and compatible shape (dimensions)
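
One way to check these requirements on tabular data is sketched below. The DataFrame and its column names are hypothetical, and `pd.factorize` is just one possible way to turn text labels into integers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: two numeric features and a text label
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
    "size": ["small", "small", "large", "large"],
})

# Requirements 1 and 3: features and target as separate NumPy arrays
X = df[["height_cm", "weight_kg"]].to_numpy()

# Requirement 2: encode the text labels as integer codes
y, label_names = pd.factorize(df["size"])

# Requirement 4: X is 2-D (observations x features), y is 1-D,
# and both have the same number of observations
print(X.shape)
print(y.shape)
```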

### Steps to create and apply KNN model:
#### Step 1: Import required libraries (such as pandas and numpy) and read/load dataset
#### Step 2: Separate features from response/target by assigning feature matrix to variable 'X' and target vector to variable 'y'
##### Note: convert to NumPy arrays if needed before assigning to 'X' & 'y'
#### Step 3 (optional): Verify X & y have appropriate shapes
#### Step 4: Import the class/model you plan to use
#### Step 5: 'Instantiate' (i.e. make an instance of) the 'estimator' (i.e. scikit-learn's term for model)
#### Step 6 (optional): Specify tuning parameters (aka 'hyperparameters') if required
##### Note: hyperparameters are settings that you choose before training rather than values the model learns from the data. Some, such as K in KNN, directly change the model's predictions; others mainly affect the speed and quality of the learning process.
#### Step 7: Fit/train the model with data (aka "model training")
#### Step 8: Use model to predict classification of new observation
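
As a brief sketch of Steps 5 and 6, the snippet below inspects and changes a hyperparameter on an estimator. It uses `get_params` and `set_params`, which every scikit-learn estimator provides; the value 3 is an arbitrary choice for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Step 5: instantiate the estimator with its default hyperparameters
knn = KNeighborsClassifier()
print(knn.get_params()["n_neighbors"])  # scikit-learn's default value for K

# Step 6: change a hyperparameter before fitting
knn.set_params(n_neighbors=3)
print(knn.get_params()["n_neighbors"])
```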

In [1]:
#Steps 1 & 2 - import required libraries and load dataset
import pandas as pd
import numpy as np

#import iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

#store feature matrix in 'X' - here 'X' is in uppercase to denote that it represents a matrix (in this case a 150x4
#matrix, since it contains 4 features (i.e. columns/attributes) and 150 observations (i.e. rows/records))
X = iris.data

#store response/target vector in 'y' - here 'y' is in lowercase to denote that it represents a vector (i.e. a
#1-dimensional array)
y = iris.target

In [2]:
#Step 3 (optional) - verify shape of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [3]:
#Step 4 - import KNN classifier model
from sklearn.neighbors import KNeighborsClassifier

In [4]:
#Steps 5 & 6 - create an instance of the KNN classifier and specify the hyperparameter - here we are selecting 1 as
#the value for the hyper/tuning parameter 'K'
knn = KNeighborsClassifier(n_neighbors=1) #note: the KNeighborsClassifier class has additional parameters that are set
#to their default scikit-learn values. Use 'print(knn.get_params())' to view all parameters.

In [5]:
#Step 7 - train model (teach the model the relationship between the features and the target/response)
knn.fit(X,y)

KNeighborsClassifier(n_neighbors=1)

In [10]:
#Step 8 - predict classification of a new observation
i = np.array([3,5,4,2])
knn.predict(i.reshape(1,-1)) #using reshape because the input must be 2-dimensional

#you can also predict for multiple observations at once by passing in a list of lists as follows
i = [[3,5,4,2],[5,4,3,2],[3,2,3,2]]
knn.predict(i)

array([2, 1, 1])

In [12]:
#let's try using a different value for k (this is called model tuning in which you are varying the arguments passed to
#the model)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X,y)
knn.predict(i)

array([1, 1, 1])

In [15]:
#let's try a different classification model such as Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X,y)
logreg.predict(i)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


array([0, 0, 0])
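
The convergence warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch that applies both is shown below; the value `max_iter=1000` and the use of `StandardScaler` in a pipeline are choices made here for illustration, and the predictions may differ from the unscaled model above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize the features, then fit logistic regression with a higher iteration cap
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X, y)

# Same three new observations used in the cells above
new_obs = [[3, 5, 4, 2], [5, 4, 3, 2], [3, 2, 3, 2]]
print(scaled_logreg.predict(new_obs))
```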

Conclusion: As the examples above show, different models (and different values of K) can generate different predictions. Here we cannot tell which model is more accurate because we are predicting on 'out of sample' data (i.e. data for which we do not know the true results). To determine which model is a better fit, we need to evaluate the accuracy of each model using existing labeled data. This will also allow us to choose a better value for K when using KNN.
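
One common way to do that evaluation is to hold out part of the labeled data and compare accuracy on it. The sketch below is an illustration under assumed settings (a 60/40 split, `random_state=4`, `max_iter=1000`), not a definitive evaluation procedure; the scores depend on how the data is split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 40% of the labeled data for testing (split settings are arbitrary choices)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4, stratify=y
)

models = {
    "KNN (K=1)": KNeighborsClassifier(n_neighbors=1),
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the training portion and score it on the held-out portion
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```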

Additional Resources:
- www.scikit-learn.org
- www.dataschool.io/15-hours-of-expert-machine-learning-videos/