# 2. Classifying with k-Nearest Neighbors


## 2.1 Classifying with distance measurements

k-Nearest Neighbors
> Pros: High accuracy, insensitive to outliers, no assumptions about data
> 
> Cons: Computationally expensive, requires a lot of memory
>
> Works with: Numeric values, nominal values

General approach to kNN
1. Collect: Any method.
2. Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
3. Analyze: Any method.
4. Train: Does not apply to the kNN algorithm.
5. Test: Calculate the error rate.
6. Use: The application needs to get some input data and output structured numeric values. Next, it runs the kNN algorithm on this input data and determines which class the input data should belong to. The application then takes some action based on the calculated class.

### 2.1.1 Prepare: importing data with Python

```python
import numpy as np
import operator
```

```python
def createDataSet():
    """
    Create a data set.
    """
    group = np.array([[1.0, 1.1],
                      [1.0, 1.0],
                      [0.0, 0.0],
                      [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
```

```python
group, labels = createDataSet()
```

```python
group
```

```
array([[1. , 1.1],
       [1. , 1. ],
       [0. , 0. ],
       [0. , 0.1]])
```

```python
labels
```

```
['A', 'A', 'B', 'B']
```

### 2.1.2 Putting the kNN classification algorithm into action

To classify an unknown input vector inX against the labeled dataset:

(1) calculate the distance between inX and every point in the dataset

(2) sort the distances in increasing order

(3) take k items with lowest distances to inX

(4) find the majority class among these items

(5) return the majority class as our prediction for the class of inX
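
The five steps above can be sketched as a single function. This is a minimal sketch, not a definitive implementation: it assumes Euclidean distance (the steps do not name a metric), and the parameter names `dataSet`, `labels`, and `k` are illustrative choices. The name `classify0` is the one these notes use later for the classifier.

```python
from collections import Counter

import numpy as np


def classify0(inX, dataSet, labels, k):
    """Predict the class of inX by majority vote among its k nearest neighbors."""
    # (1) Euclidean distance (an assumption) from inX to every row of dataSet.
    diffs = np.asarray(dataSet, dtype=float) - np.asarray(inX, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # (2) Indices of the points, sorted by increasing distance.
    order = distances.argsort(kind="stable")
    # (3) Labels of the k closest points.
    nearest = [labels[i] for i in order[:k]]
    # (4)-(5) Majority class; ties go to the label seen first among the neighbors.
    return Counter(nearest).most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, labels, 3))  # → B
```

With `k = 3`, the three points closest to `[0.0, 0.0]` carry the labels B, B, and A, so the vote returns B.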

### 2.1.3 How to test a classifier
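
One common way to test a classifier is to compute its error rate: the number of misclassified test examples divided by the total number of test examples. The sketch below illustrates that calculation under stated assumptions. The helper `nearest_label` is a hypothetical stand-in classifier (kNN with k = 1 and Euclidean distance), and the test points are made up for illustration.

```python
import numpy as np


def nearest_label(x, train_X, train_y):
    """Stand-in classifier (k = 1): label of the closest training point."""
    distances = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    return train_y[int(distances.argmin())]


def error_rate(test_X, test_y, train_X, train_y):
    """Fraction of test examples whose predicted label differs from the true one."""
    errors = sum(
        nearest_label(x, train_X, train_y) != y
        for x, y in zip(test_X, test_y)
    )
    return errors / len(test_y)


train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = ['A', 'A', 'B', 'B']
# Made-up test points; the last true label deliberately disagrees with its neighbors.
test_X = np.array([[0.9, 1.0], [0.1, 0.0], [0.2, 0.1], [1.1, 0.9]])
test_y = ['A', 'B', 'B', 'B']
print(error_rate(test_X, test_y, train_X, train_y))  # → 0.25
```

An error rate of 0 means every test example was classified correctly; 1.0 means none were.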

## 2.2 Example: improving matches from a dating site with kNN

Example: using kNN on results from a dating site
1. Collect: Text file provided.
2. Prepare: Parse a text file in Python.
3. Analyze: Use Matplotlib to make 2D plots of our data.
4. Train: Doesn’t apply to the kNN algorithm.
5. Test: Write a function to use some portion of the data Hellen gave us as test examples. The test examples are classified against the non-test examples. If the predicted class doesn’t match the real class, we’ll count that as an error.
6. Use: Build a simple command-line program Hellen can use to predict whether she’ll like someone based on a few inputs.

### 2.2.1 Prepare: parsing data from a text file
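
The layout of the text file is not specified in these notes, so the following is only a sketch under an assumed format: one example per line, tab-separated, with numeric feature columns followed by a class label in the last column. The function name `file2matrix` and the sample rows are hypothetical.

```python
import os
import tempfile

import numpy as np


def file2matrix(filename):
    """Parse a tab-separated file: numeric feature columns, then a label column."""
    features, labels = [], []
    with open(filename, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            *values, label = line.split("\t")
            features.append([float(v) for v in values])
            labels.append(label)
    return np.array(features), labels


# Made-up sample rows, written to a temporary file for demonstration only.
sample = "100.0\t2.5\t0.5\tyes\n250.0\t7.2\t1.5\tno\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

matrix, class_labels = file2matrix(tmp.name)
os.remove(tmp.name)
print(matrix.shape, class_labels)  # → (2, 3) ['yes', 'no']
```

Returning the features as a NumPy array and the labels as a plain list matches the shapes of `group` and `labels` from `createDataSet()`.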

### 2.2.2 Analyze: creating scatter plots with Matplotlib

### 2.2.3 Prepare: normalizing numeric values
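
When features are measured on very different scales, the feature with the largest range dominates a Euclidean distance. A common remedy is min-max scaling, which maps each column to the range 0 to 1 via `(value - min) / (max - min)`. The sketch below assumes that technique; the function name `min_max_normalize` and the sample values are hypothetical.

```python
import numpy as np


def min_max_normalize(data):
    """Rescale each column to [0, 1]: (value - min) / (max - min)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Avoid division by zero for constant columns; they map to 0.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (data - col_min) / safe_range, col_range, col_min


# Made-up data: the second column is thousands of times larger than the first.
raw = np.array([[0.0, 20000.0], [5.0, 30000.0], [10.0, 40000.0]])
normed, ranges, mins = min_max_normalize(raw)
print(normed)
```

The function also returns the column ranges and minimums so that a new input can be scaled the same way before it is classified.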

### 2.2.4 Test: testing the classifier as a whole program

### 2.2.5 Use: putting together a useful system

## 2.3 Example: a handwriting recognition system

Example: using kNN on a handwriting recognition system
1. Collect: Text file provided.
2. Prepare: Write a function to convert from the image format to the list format used in our classifier, classify0().
3. Analyze: We’ll look at the prepared data in the Python shell to make sure it’s correct.
4. Train: Doesn’t apply to the kNN algorithm.
5. Test: Write a function to use some portion of the data as test examples. The test examples are classified against the non-test examples. If the predicted class doesn’t match the real class, you’ll count that as an error.
6. Use: Not performed in this example. You could build a complete program to extract digits from an image, such as a system used to sort the mail in the United States.

### 2.3.1 Prepare: converting images into test vectors
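
The image format is not described in these notes, so this sketch assumes each image is stored as text: one row per line, each row a string of `0` and `1` characters. Under that assumption, the conversion just flattens the rows into one long vector that a distance calculation can use. The function name `img2vector` and the tiny sample image are hypothetical.

```python
import numpy as np


def img2vector(lines):
    """Flatten rows of '0'/'1' characters into a single 1-D float vector."""
    rows = [[int(ch) for ch in line.strip()] for line in lines if line.strip()]
    return np.array(rows, dtype=float).ravel()


# A tiny made-up 4x4 "image"; an open text file could be passed in the same way.
tiny = ["0110", "1001", "1001", "0110"]
vec = img2vector(tiny)
print(vec.shape)  # → (16,)
```

Every image must have the same dimensions, since kNN compares vectors element by element.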

### 2.3.2 Test: kNN on handwritten digits

## 2.4 Summary

The algorithm has to carry around the full dataset; for large datasets, this implies a large amount of storage.

In addition, you need to calculate the distance measurement for every piece of data in the database, and this can be cumbersome.

An additional drawback is that kNN doesn’t give you any idea of the underlying structure of the data; you have no idea what an “average” or “exemplar” instance from each class looks like.
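
The storage and per-query costs noted above can be made concrete with a rough sketch. The dataset size and the random data here are made up for illustration, and Euclidean distance is assumed.

```python
import numpy as np

n_points, n_features = 100_000, 3  # made-up sizes for illustration
rng = np.random.default_rng(0)
train = rng.random((n_points, n_features))

# Storage: the whole training set stays in memory (8 bytes per float64 value).
print(train.nbytes)  # → 2400000

# Work per query: one distance for every stored point.
query = rng.random(n_features)
distances = np.sqrt(((train - query) ** 2).sum(axis=1))
print(distances.shape)  # → (100000,)
```

Both numbers grow linearly with the number of stored examples, and the distance pass is repeated for every new input that needs a prediction.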