# Introduction

k-Nearest-Neighbors (kNN) is a simple technique for classification. The idea behind
it is that similar data points should have the same class, at least most of the time.
This method is very intuitive and has proven itself in many domains including
recommendation systems, anomaly detection, image/text classification and more.
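
To make the idea concrete, here is a minimal brute-force sketch of kNN classification in NumPy. It is an illustration only and not the Amazon SageMaker implementation; the function name `knn_predict` and the toy data are assumptions made up for this example.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Label each query row by majority vote among its k nearest training rows."""
    # Pairwise squared Euclidean (L2) distances, shape (n_query, n_train)
    diffs = query_x[:, None, :] - train_x[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k closest training points for every query row
    nearest = np.argsort(dists, axis=1)[:, :k]
    predictions = []
    for neighbor_labels in train_y[nearest]:
        labels, counts = np.unique(neighbor_labels, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy data (hypothetical): two well-separated clusters labeled 1 and 2
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([1, 1, 1, 2, 2, 2])
toy_queries = np.array([[0.05, 0.05], [5.0, 5.1]])

toy_predictions = knn_predict(toy_x, toy_y, toy_queries, k=3)
print(toy_predictions)
```

Each query point receives the label that is most common among its three closest training points. This brute-force version compares every query with every training point, which is fine for a toy example but is the cost that index structures aim to reduce on larger datasets.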

In what follows we present a detailed example of a multi-class classification objective. The dataset we use contains information collected by the US Geological Survey and the US Forest Service about wilderness areas in northern Colorado. The features are measurements like soil type, elevation, and distance to water, and the labels encode the type of trees - the forest cover type - for each location. The machine learning task is to predict the cover type in a given location using the features. Overall there are seven cover types.

The notebook has two sections. In the first, we use the Amazon SageMaker Python SDK to train a kNN classifier in its simplest setting. We explain the components common to all Amazon SageMaker algorithms, including uploading data to Amazon S3, training a model, and setting up an endpoint for online inference. In the second section we dive deeper into the details of Amazon SageMaker kNN. We explain the different knobs (hyper-parameters) associated with it, and demonstrate how each setting can lead to a somewhat different accuracy and latency at inference time.

## Dataset

We're about to work with the UCI Machine Learning Repository Covertype dataset ([covtype](https://archive.ics.uci.edu/ml/datasets/covertype)) (copyright Jock A. Blackard and Colorado State University). It's a labeled dataset where each entry describes a geographic area, and the label is a type of forest cover. There are 7 possible labels and we aim to solve the multi-class classification problem using kNN.
We begin by downloading the dataset and moving it to a temporary folder.

In [1]:
!wget 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz'

--2020-04-28 05:02:58--  https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11240707 (11M) [application/x-httpd-php]
Saving to: ‘covtype.data.gz’


2020-04-28 05:02:59 (13.6 MB/s) - ‘covtype.data.gz’ saved [11240707/11240707]



In [4]:
!rm -rf /root/kNN/01/raw
!mkdir /root/kNN/01/raw
!mv /root/kNN/01/covtype.data.gz /root/kNN/01/raw

### Attribute Information

Each attribute is listed below with its name, data type, measurement unit, and a brief description. The forest cover type is the target of the classification problem. The order of this listing corresponds to the order of the columns in each row of the data file.

Name / Data Type / Measurement / Description

- Elevation / quantitative / meters / Elevation in meters
- Aspect / quantitative / azimuth / Aspect in degrees azimuth
- Slope / quantitative / degrees / Slope in degrees
- Horizontal_Distance_To_Hydrology / quantitative / meters / Horz Dist to nearest surface water features
- Vertical_Distance_To_Hydrology / quantitative / meters / Vert Dist to nearest surface water features
- Horizontal_Distance_To_Roadways / quantitative / meters / Horz Dist to nearest roadway
- Hillshade_9am / quantitative / 0 to 255 index / Hillshade index at 9am, summer solstice
- Hillshade_Noon / quantitative / 0 to 255 index / Hillshade index at noon, summer solstice
- Hillshade_3pm / quantitative / 0 to 255 index / Hillshade index at 3pm, summer solstice
- Horizontal_Distance_To_Fire_Points / quantitative / meters / Horz Dist to nearest wildfire ignition points
- Wilderness_Area (4 binary columns) / qualitative / 0 (absence) or 1 (presence) / Wilderness area designation
- Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1 (presence) / Soil Type designation
- Cover_Type (7 types) / integer / 1 to 7 / Forest Cover Type designation
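
Since the data file has no header row, it can help to spell out the column layout implied by the list above. The sketch below builds one name per column; the numbered suffixes for the binary columns (`Wilderness_Area_1`, `Soil_Type_1`, and so on) are hypothetical names chosen for illustration.

```python
# Ten quantitative attributes, in the order listed above
quantitative = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]
# Binary indicator columns; the numbered suffixes are hypothetical names
wilderness = [f"Wilderness_Area_{i}" for i in range(1, 5)]
soil = [f"Soil_Type_{i}" for i in range(1, 41)]

column_names = quantitative + wilderness + soil + ["Cover_Type"]
feature_names = column_names[:-1]  # everything except the label column

print(len(column_names), len(feature_names))
```

This gives 55 columns in total, 54 features plus the label in the last position.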

## Pre-Processing the Data

In [6]:
import numpy as np 
import os 

In [7]:
cwd = "/root/kNN/01/"

data_file_name = "raw/covtype.data.gz"
processed_subdir = "standardized"

raw_data_file = os.path.join(cwd, data_file_name)

# Training Data
train_features_file = os.path.join(cwd, processed_subdir, "train/csv/features.csv")
train_labels_file = os.path.join(cwd, processed_subdir, "train/csv/labels.csv")

# Test Data 
test_features_file = os.path.join(cwd, processed_subdir, "test/csv/features.csv")
test_labels_file = os.path.join(cwd, processed_subdir, "test/csv/labels.csv")


#### Read raw data

In [8]:
raw_data_file

'/root/kNN/01/raw/covtype.data.gz'

In [9]:
# Read raw data 
print(f"Reading Raw Data from {raw_data_file}")
# np.loadtxt decompresses the file first when the filename ends in .gz or .bz2
raw = np.loadtxt(raw_data_file, delimiter=',')

Reading Raw Data from /root/kNN/01/raw/covtype.data.gz
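
As a small self-contained check of the behavior relied on above, the following sketch writes a tiny gzip-compressed CSV to a temporary directory and reads it back with `np.loadtxt`. The file name and its contents are made up for illustration.

```python
import gzip
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.data.gz")  # hypothetical file name
    # Write two comma-separated rows into a gzip-compressed text file
    with gzip.open(sample_path, "wt") as f:
        f.write("1,2,3\n4,5,6\n")
    # loadtxt picks the decompressor from the .gz extension
    sample = np.loadtxt(sample_path, delimiter=",")

print(sample.shape, sample.dtype)
```

Note that `np.loadtxt` returns floating-point values by default, which is why the integer labels in the last column appear as floats such as `5.0` in the rest of the notebook.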


In [10]:
raw.shape

(581012, 55)

In [12]:
raw

array([[2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        5.000e+00],
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        5.000e+00],
       [2.804e+03, 1.390e+02, 9.000e+00, ..., 0.000e+00, 0.000e+00,
        2.000e+00],
       ...,
       [2.386e+03, 1.590e+02, 1.700e+01, ..., 0.000e+00, 0.000e+00,
        3.000e+00],
       [2.384e+03, 1.700e+02, 1.500e+01, ..., 0.000e+00, 0.000e+00,
        3.000e+00],
       [2.383e+03, 1.650e+02, 1.300e+01, ..., 0.000e+00, 0.000e+00,
        3.000e+00]])

In [14]:
# Split into train/test with an 80/20 split

np.random.seed(0)
np.random.shuffle(raw)

train_size = int(0.8 * raw.shape[0])

In [15]:
train_size

464809

In [16]:
train_features = raw[:train_size, :-1]
train_labels = raw[:train_size, -1]
test_features = raw[train_size:, :-1]
test_labels = raw[train_size:, -1]

In [17]:
train_features.shape

(464809, 54)

In [18]:
train_labels.shape

(464809,)

In [24]:
np.unique(train_labels, return_counts=True)                            # We can check class label distribution 

(array([1., 2., 3., 4., 5., 6., 7.]),
 array([169467, 226581,  28677,   2198,   7575,  13924,  16387]))
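
The counts above show that the classes are far from balanced. A quick sketch, using the counts copied from that output, turns them into fractions and computes the accuracy that a trivial "always predict the most frequent class" rule would reach on the training data. This baseline is added here only as a point of comparison.

```python
import numpy as np

labels = np.array([1, 2, 3, 4, 5, 6, 7])
# Training-set counts copied from the output above
counts = np.array([169467, 226581, 28677, 2198, 7575, 13924, 16387])

fractions = counts / counts.sum()
for label, fraction in zip(labels, fractions):
    print(f"cover type {label}: {fraction:.1%}")

# Accuracy of always predicting the most frequent class
majority_baseline = fractions.max()
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Cover types 1 and 2 together account for roughly 85% of the training points, so overall accuracy alone says little about the rarer classes.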

## Upload to Amazon S3

#### Let's start with the training data

In [35]:
import io
import sagemaker.amazon.common as smac

print(f"train_features shape = {train_features.shape}")
print(f"train_labels shape = {train_labels.shape}")

train_features shape = (464809, 54)
train_labels shape = (464809,)


In [36]:
buff = io.BytesIO()
smac.write_numpy_to_dense_tensor(buff, train_features, train_labels)
buff.seek(0)

0
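
The `seek(0)` call matters because writing leaves the buffer's cursor at the end of the data, and a consumer that reads from the current position would then see nothing. A minimal standard-library sketch of that behavior, with a made-up payload:

```python
import io

demo = io.BytesIO()
demo.write(b"payload")

position_after_write = demo.tell()  # cursor sits at the end of the written data
read_without_rewind = demo.read()   # nothing is left to read from this position

demo.seek(0)                        # rewind to the start
read_after_rewind = demo.read()     # now the full payload is returned

print(position_after_write, read_without_rewind, read_after_rewind)
```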

In [37]:
import boto3 
import os 
import sagemaker 

bucket = "ml-demo-bucket-suman"
prefix = "knn-01"
key = "recordio-pb-data"

s3 = boto3.resource("s3")
s3.Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buff)         # Upload a file-like object to S3.
                                                                                          # The file-like object must be in binary mode.
                                                                                          # This is a managed transfer which will perform a multipart 
                                                                                          # upload in multiple threads if necessary

s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('Uploaded training data location: {}'.format(s3_train_data))

Uploaded training data location: s3://ml-demo-bucket-suman/knn-01/train/recordio-pb-data


#### We do the same for the test data

In [38]:
print('test_features shape = ', test_features.shape)
print('test_labels shape = ', test_labels.shape)

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, test_features, test_labels)
buf.seek(0)

boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test', key)).upload_fileobj(buf)
s3_test_data = 's3://{}/{}/test/{}'.format(bucket, prefix, key)
print('Uploaded test data location: {}'.format(s3_test_data))

test_features shape =  (116203, 54)
test_labels shape =  (116203,)
Uploaded test data location: s3://ml-demo-bucket-suman/knn-01/test/recordio-pb-data


## Training

We take a moment to explain, at a high level, how machine learning training and prediction work in Amazon SageMaker. First, we need to train a model. This is a process that, given a labeled dataset and hyper-parameters guiding the training process, outputs a model. Once the training is done, we set up what is called an **endpoint**. An endpoint is a web service that, given a request containing an unlabeled data point or a mini-batch of data points, returns one or more predictions.

In Amazon SageMaker the training is done via an object called an **estimator**. When setting up the estimator we specify the location (in Amazon S3) of the training data, the path (again in Amazon S3) to the output directory where the model will be serialized, generic hyper-parameters such as the machine type to use during the training process, and kNN-specific hyper-parameters such as the index type, etc. Once the estimator is initialized, we can call its **fit** method in order to do the actual training.

Now that we are ready for training, we start with a convenience function that starts a training job.

In [39]:
import matplotlib.pyplot as plt 

import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker.amazon.amazon_estimator import get_image_uri

In [40]:
get_image_uri(boto3.Session().region_name, "knn")

'404615174143.dkr.ecr.us-east-2.amazonaws.com/knn:1'

In [41]:
def trained_estimator_from_hyperparams(s3_train_data, hyperparams, output_path, s3_test_data=None):
    """
    Create an Estimator from the given hyperparams, fit it to the training data,
    and return the fitted estimator.
    """
    # set up the estimator
    knn = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "knn"),
                                        get_execution_role(),
                                        train_instance_count=1,
                                        train_instance_type='ml.m5.2xlarge',
                                        output_path=output_path,
                                        sagemaker_session=sagemaker.Session())
                                        
    knn.set_hyperparameters(**hyperparams)
    
    # train a model. fit_input contains the locations of the train and test data
    fit_input = {'train': s3_train_data}
    if s3_test_data is not None:
        fit_input['test'] = s3_test_data
    knn.fit(fit_input)
    return knn

In [42]:
s3_train_data

's3://ml-demo-bucket-suman/knn-01/train/recordio-pb-data'

In [43]:
s3_test_data

's3://ml-demo-bucket-suman/knn-01/test/recordio-pb-data'

Now, we run the actual training job. For now, we stick to default parameters.

In [50]:
hyperparams = {
    'feature_dim': 54,
    'k': 11,
    'sample_size': 200000,
    'predictor_type': 'classifier', 
    'index_metric': 'L2'
}

output_path = 's3://' + bucket + '/' + prefix + '/default_example/output'


In [51]:
hyperparams

{'feature_dim': 54,
 'k': 11,
 'sample_size': 200000,
 'predictor_type': 'classifier',
 'index_metric': 'L2'}

In [52]:
knn_estimator = trained_estimator_from_hyperparams(s3_train_data, hyperparams, output_path, s3_test_data=s3_test_data)

2020-04-28 06:50:59 Starting - Starting the training job...
2020-04-28 06:51:01 Starting - Launching requested ML instances...
2020-04-28 06:52:00 Starting - Preparing the instances for training......
2020-04-28 06:52:43 Downloading - Downloading input data
2020-04-28 06:52:43 Training - Downloading the training image...
2020-04-28 06:53:30 Uploading - Uploading generated training model
Docker entrypoint called with argument(s): train
Running default environment configuration script
[04/28/2020 06:53:27 INFO 139921556141888] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'index_metric': u'L2', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'_log_level': u'info', u'faiss_index_ivf_nlists': u'auto', u'epochs': u'1', u'index_type': u'faiss.Flat', u'_faiss_index_nprobe': u'5', u'_kvstore': u'dist_async', u'_num_kv_servers': u'1', u'mini_batch_size': u'5000'}
[04/28/2020 06:53:27 I

Notice that we mentioned a test set in the training job. When a test set is provided the training job doesn't just produce a model but also applies it to the test set and reports the accuracy. In the logs you can view the accuracy of the model on the test set.

## Setting up the endpoint

Now that we have a trained model, we are ready to run inference. The **knn_estimator** object above contains all the information we need for hosting the model. Below we provide a convenience function that, given an estimator, sets up an endpoint that hosts the model. Other than the estimator object, we provide it with a name (string) for the estimator and an **instance_type**. The **instance_type** is the machine type that will host the model. It is not restricted in any way by the parameter settings of the training job.

In [53]:
def predictor_from_estimator(knn_estimator, estimator_name, instance_type, endpoint_name=None): 
    knn_predictor = knn_estimator.deploy(initial_instance_count=1, instance_type=instance_type,
                                        endpoint_name=endpoint_name)
    knn_predictor.content_type = 'text/csv'
    knn_predictor.serializer = csv_serializer
    knn_predictor.deserializer = json_deserializer
    return knn_predictor

In [54]:
import time

instance_type = 'ml.m4.xlarge'
model_name = 'knn_%s'% instance_type
endpoint_name = 'knn-ml-m4-xlarge-%s'% (str(time.time()).replace('.','-'))
print('setting up the endpoint..')
predictor = predictor_from_estimator(knn_estimator, model_name, instance_type, endpoint_name=endpoint_name)

setting up the endpoint..
-------------!

## Inference

Now that we have our predictor, let's use it on our test dataset. The following code runs on the test dataset, computes the accuracy, and measures the total prediction time. It splits the data into 100 batches of roughly 1,160 points each. Then, each batch is given to the inference service to obtain predictions. Once we have all predictions, we compute their accuracy given the true labels of the test set.


In [60]:
predictor.predict(test_features[0])

{'predictions': [{'predicted_label': 2.0}]}

In [61]:
test_labels[0]

2.0

In [58]:

batches = np.array_split(test_features, 100)
print('data split into 100 batches, of size %d.' % batches[0].shape[0])

# obtain an np array with the predictions for the entire test set
start_time = time.time()
predictions = []
for batch in batches:
    result = predictor.predict(batch)
    cur_predictions = np.array([result['predictions'][i]['predicted_label'] for i in range(len(result['predictions']))])
    predictions.append(cur_predictions)
predictions = np.concatenate(predictions)
run_time = time.time() - start_time

test_size = test_labels.shape[0]
num_correct = sum(predictions == test_labels)
accuracy = num_correct / float(test_size)
print('time required for predicting %d data point: %.2f seconds' % (test_size, run_time))
print('accuracy of model: %.1f%%' % (accuracy * 100) )

data split into 100 batches, of size 1163.
time required for predicting 116203 data point: 82.65 seconds
accuracy of model: 91.9%
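
To see where the batch size comes from and what the total run time means per request, the sketch below reproduces the split with `np.array_split` on a stand-in array of the same length and divides the reported run time accordingly. The timing is taken from the output above and includes everything measured on the client side, so it should be read as a rough end-to-end figure rather than pure model latency.

```python
import numpy as np

n_test, n_batches = 116203, 100

# array_split makes the leading batches one element larger when the
# length is not divisible by the number of batches
batch_sizes = [len(b) for b in np.array_split(np.arange(n_test), n_batches)]
print(batch_sizes[:4], batch_sizes[-1])

total_seconds = 82.65  # run time reported above
seconds_per_batch = total_seconds / n_batches
ms_per_point = 1000 * total_seconds / n_test
print(f"{seconds_per_batch:.2f} s per batch, {ms_per_point:.2f} ms per data point")
```

The printed batch size in the cell above is that of the first batch; most batches hold one point fewer.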


In [66]:
len(predictions)

116203

In [67]:
len(test_labels)

116203

In [77]:
from sklearn.metrics import multilabel_confusion_matrix, classification_report

In [78]:
multilabel_confusion_matrix(test_labels, predictions)

array([[[ 70642,   3188],
        [  3612,  38761]],

       [[ 55005,   4478],
        [  3368,  53352]],

       [[108328,    798],
        [   645,   6432]],

       [[115606,     48],
        [   190,    359]],

       [[114018,    267],
        [   511,   1407]],

       [[112317,    443],
        [   733,   2710]],

       [[111843,    237],
        [   400,   3723]]])
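
Each 2x2 block above belongs to one cover type and is laid out as `[[TN, FP], [FN, TP]]` in a one-vs-rest view. As a sketch of how to read it, the code below copies the matrices from the output and derives per-class precision and recall, plus the overall accuracy, directly from them.

```python
import numpy as np

# Per-class 2x2 matrices copied from the output above: [[TN, FP], [FN, TP]]
mcm = np.array([
    [[70642, 3188], [3612, 38761]],
    [[55005, 4478], [3368, 53352]],
    [[108328, 798], [645, 6432]],
    [[115606, 48], [190, 359]],
    [[114018, 267], [511, 1407]],
    [[112317, 443], [733, 2710]],
    [[111843, 237], [400, 3723]],
])
tn, fp, fn, tp = mcm[:, 0, 0], mcm[:, 0, 1], mcm[:, 1, 0], mcm[:, 1, 1]

precision = tp / (tp + fp)  # of the points predicted as this class, how many are right
recall = tp / (tp + fn)     # of the points truly in this class, how many are found

# Every test point is a true positive for at most one class,
# so the correct predictions are the sum of the per-class TPs
n_points = tn[0] + fp[0] + fn[0] + tp[0]
overall_accuracy = tp.sum() / n_points

for cover_type, (p, r) in enumerate(zip(precision, recall), start=1):
    print(f"cover type {cover_type}: precision={p:.2f} recall={r:.2f}")
print(f"overall accuracy: {overall_accuracy:.3f}")
```

For cover type 4, for example, 359 of the 549 true instances are found, which is where its comparatively low recall comes from.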

In [80]:
print(classification_report(test_labels, predictions))

              precision    recall  f1-score   support

         1.0       0.92      0.91      0.92     42373
         2.0       0.92      0.94      0.93     56720
         3.0       0.89      0.91      0.90      7077
         4.0       0.88      0.65      0.75       549
         5.0       0.84      0.73      0.78      1918
         6.0       0.86      0.79      0.82      3443
         7.0       0.94      0.90      0.92      4123

    accuracy                           0.92    116203
   macro avg       0.89      0.83      0.86    116203
weighted avg       0.92      0.92      0.92    116203
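
The two summary rows differ in how they combine the per-class scores: the macro average treats every class equally, while the weighted average weights each class by its support. A small sketch using the recall column and supports copied from the report (the recall values are already rounded to two decimals, so the results are approximate):

```python
import numpy as np

# Per-class recall and support copied from the report above
recall = np.array([0.91, 0.94, 0.91, 0.65, 0.73, 0.79, 0.90])
support = np.array([42373, 56720, 7077, 549, 1918, 3443, 4123])

macro_recall = recall.mean()                            # every class counts equally
weighted_recall = np.average(recall, weights=support)   # frequent classes dominate

print(f"macro avg recall:    {macro_recall:.2f}")
print(f"weighted avg recall: {weighted_recall:.2f}")
```

The gap between the two reflects the weaker recall on the rare cover types 4, 5, and 6, which barely move the weighted figure.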



## Deleting the endpoint

We're now done with the example, except for a final clean-up step. Setting up the endpoint started a machine in the cloud, and as long as the endpoint is not deleted, the machine stays up and we keep paying for it. Once the endpoint is no longer necessary, we delete it. The following code defines a helper that does exactly that; uncomment the last line to run it.

In [93]:
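def delete_endpoint(predictor):
    try:
        boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint)
        print('Deleted {}'.format(predictor.endpoint))
    except Exception as e:
        # Avoid a bare except (it would also swallow KeyboardInterrupt) and report the actual error
        print('Could not delete {} (it may already be deleted): {}'.format(predictor.endpoint, e))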

#delete_endpoint(predictor)
            

## Conclusion

We've seen how to both train a kNN model and host an inference endpoint for it. With absolutely zero tuning we obtain an accuracy of 91.9% on the covtype dataset. As a point of reference for grasping the prediction power of the kNN model, a linear model will achieve roughly 72.8% accuracy. There are several advanced issues that we did not discuss. In the next section we will deep-dive into issues such as run-time / latency, and tuning the model while taking into account both the accuracy and its run-time efficiency.