## Modeling with K-Means Clustering

Steps:

1. Load the dataset into the notebook from S3
2. Clean and transform the data
3. Create and train the model
4. View the results


In [2]:
import pandas as pd
import numpy as np
from datetime import datetime

import boto3
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role

### Step 1: Load dataset to notebook from S3

In [4]:
role = get_execution_role()
bucket = 'ml-projects-bl'
prefix = 'ufo_dataset'
data_key = 'ufo_fullset.csv'
data_location = 's3://{}/{}/{}'.format(bucket, prefix, data_key)
df = pd.read_csv(data_location, low_memory=False)

In [5]:
df.head()

Unnamed: 0,reportedTimestamp,eventDate,eventTime,shape,duration,witnesses,weather,firstName,lastName,latitude,longitude,sighting,physicalEvidence,contact,researchOutcome
0,1977-04-04T04:02:23.340Z,1977-03-31,23:46,circle,4,1,rain,Ila,Bashirian,47.329444,-122.578889,Y,N,N,explained
1,1982-11-22T02:06:32.019Z,1982-11-15,22:04,disk,4,1,partly cloudy,Eriberto,Runolfsson,52.664913,-1.034894,Y,Y,N,explained
2,1992-12-07T19:06:52.482Z,1992-12-07,19:01,circle,49,1,clear,Miller,Watsica,38.951667,-92.333889,Y,N,N,explained
3,2011-02-24T21:06:34.898Z,2011-02-21,20:56,disk,13,1,partly cloudy,Clifton,Bechtelar,41.496944,-71.367778,Y,N,N,explained
4,1991-03-09T16:18:45.501Z,1991-03-09,11:42,circle,17,1,mostly cloudy,Jayda,Ebert,47.606389,-122.330833,Y,N,N,explained


In [6]:
df.shape

(18000, 15)

### Step 2: Clean and transform the data

The objective is to build a model for cluster analysis of all sighting locations in the dataset, so only the latitude and longitude columns are required.

In [8]:
df_geo = df[['latitude', 'longitude']]

In [10]:
df_geo.head()

Unnamed: 0,latitude,longitude
0,47.329444,-122.578889
1,52.664913,-1.034894
2,38.951667,-92.333889
3,41.496944,-71.367778
4,47.606389,-122.330833


In [11]:
df_geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18000 entries, 0 to 17999
Data columns (total 2 columns):
latitude     18000 non-null float64
longitude    18000 non-null float64
dtypes: float64(2)
memory usage: 281.3 KB


Note: the K-Means algorithm requires the float32 datatype, so the float64 columns shown above need to be cast to float32.

Also, check whether the newly created dataframe df_geo contains any missing values.

In [13]:
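missing_values = df_geo.isnull().values.any()
print('Missing values exist? {}'.format(missing_values))
if missing_values:
    # Print the rows that contain at least one missing value
    # (a bare expression inside an if block is not displayed by the notebook)
    print(df_geo[df_geo.isnull().any(axis=1)])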

Missing values exist? False
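
Had the check reported missing values, the affected rows would need to be dropped or filled before training. Below is a minimal sketch on a small made-up frame. The sample values and the names `sample`, `rows_with_gaps`, and `sample_clean` are illustrative and not taken from the UFO dataset; dropping rows is only one possible policy.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing longitude (not from the dataset)
sample = pd.DataFrame({
    'latitude': [47.33, 52.66, 38.95],
    'longitude': [-122.58, np.nan, -92.33],
})

# Rows that contain at least one missing value
rows_with_gaps = sample[sample.isnull().any(axis=1)]

# One option: drop those rows and renumber the index
sample_clean = sample.dropna().reset_index(drop=True)

print(rows_with_gaps)
print(sample_clean.shape)
```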


The K-Means algorithm requires a numpy.ndarray as training data, so the pandas dataframe is converted to a NumPy ndarray and cast to float32 in the same step.

In [14]:
data_train = df_geo.values.astype('float32')
data_train

array([[  47.329445, -122.57889 ],
       [  52.664913,   -1.034894],
       [  38.951668,  -92.333885],
       ...,
       [  36.86639 ,  -83.888885],
       [  35.385834,  -94.39833 ],
       [  29.883055,  -97.94111 ]], dtype=float32)
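
The cast to float32 rounds each coordinate slightly, which is why 47.329444 appears as 47.329445 in the output above. The sketch below, using the first two rows shown earlier, measures how large that rounding is. The names `coords64`, `coords32`, and `max_abs_error` are illustrative.

```python
import numpy as np

# First two coordinate pairs from df_geo.head(), held as float64
coords64 = np.array([[47.329444, -122.578889],
                     [52.664913, -1.034894]], dtype=np.float64)

# Same cast as in the cell above
coords32 = coords64.astype('float32')

# Largest absolute change introduced by the cast, in degrees
max_abs_error = np.abs(coords64 - coords32.astype(np.float64)).max()

print(coords32.dtype, max_abs_error)
```

For these values the rounding stays well below 1e-5 degrees, which is negligible next to the distances between clusters.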

### Step 3: Create and train the model

No data shuffling is required for K-Means

In [15]:
from sagemaker import KMeans

num_clusters = 10
output_location = 's3://' + bucket + '/model-artifacts-922'

kmeans = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.c4.xlarge',
                output_path=output_location,
                k=num_clusters)
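
To make the role of `k` concrete, here is a minimal NumPy sketch of the classic Lloyd iteration for K-means: assign every point to its nearest centroid, then move each centroid to the mean of its members. This is an illustration only and not the SageMaker built-in implementation. The function names `nearest` and `lloyd_kmeans`, the synthetic blobs, and the seeds are all assumptions made for the example.

```python
import numpy as np

def nearest(points, centroids):
    """Index of the closest centroid for each point (squared Euclidean distance)."""
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iteration with random initial centroids drawn from the data."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids
    return centroids, nearest(points, centroids)

# Two synthetic blobs of (latitude, longitude) pairs; made-up data
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[47.0, -122.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[52.0, -1.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b]).astype('float32')

centroids, labels = lloyd_kmeans(points, k=2)
print(centroids)
```

On well-separated blobs like these the centroids typically settle near the blob centers, but Lloyd iteration only finds a local optimum, so the result can depend on the initial centroids.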

In [18]:
job_name = 'kmeans-geo-job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
print('Here is the job name {}'.format(job_name))

Here is the job name kmeans-geo-job-20190923003211


Start the training job here. The %%time cell magic below reports how long the training job takes to run.

In [19]:
%%time
kmeans.fit(kmeans.record_set(data_train), job_name=job_name)

2019-09-23 00:36:49 Starting - Starting the training job...
2019-09-23 00:36:50 Starting - Launching requested ML instances......
2019-09-23 00:37:52 Starting - Preparing the instances for training......
2019-09-23 00:39:19 Downloading - Downloading input data...
2019-09-23 00:39:50 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
[09/23/2019 00:39:52 INFO 139658387015488] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_enable_profiler': u'false', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metric

### Step 4: View the results

To view the training results, the model artifact that the training job wrote to the S3 bucket has to be downloaded and deserialized. The artifact is in tar.gz format, so it is extracted first; the archive contains model_algo-1.

In [23]:
import os
model_key = 'model-artifacts-922/' + job_name + '/output/model.tar.gz'
boto3.resource('s3').Bucket(bucket).download_file(model_key, 'model.tar.gz')
os.system('tar -zxvf model.tar.gz')
os.system('unzip model_algo-1')

2304
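
The extraction can also be done without shelling out, using Python's standard `tarfile` module. The sketch below is self-contained: it first builds a stand-in archive in a temporary directory, then extracts it. The placeholder file contents and the `workdir` and `extract_dir` paths are assumptions for the example; the real `model.tar.gz` comes from S3 as in the cell above.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# Build a stand-in archive holding a placeholder 'model_algo-1' file
member_path = os.path.join(workdir, 'model_algo-1')
with open(member_path, 'wb') as f:
    f.write(b'placeholder bytes, not a real MXNet object')
archive_path = os.path.join(workdir, 'model.tar.gz')
with tarfile.open(archive_path, 'w:gz') as tar:
    tar.add(member_path, arcname='model_algo-1')
os.remove(member_path)

# Extract it; comparable in effect to `tar -zxvf model.tar.gz`
extract_dir = os.path.join(workdir, 'extracted')
os.makedirs(extract_dir, exist_ok=True)
with tarfile.open(archive_path, 'r:gz') as tar:
    member_names = tar.getnames()
    tar.extractall(extract_dir)

print(member_names)
```

One reason to prefer this route: `os.system` returns the command's exit status instead of raising, so a nonzero value such as the 2304 shown above signals that the last command (here `unzip`) did not succeed, and that is easy to overlook. `tarfile` raises an exception on failure.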

The AWS SageMaker documentation notes that when model.tar.gz is untarred, it contains model_algo-1, which is a serialized Apache MXNet object. The serialized object can therefore be loaded with MXNet and converted to a numpy.ndarray to view its data.

In [24]:
!pip install mxnet

Collecting mxnet
  Downloading https://files.pythonhosted.org/packages/50/08/186a7d67998f1e38d6d853c71c149820983c547804348f06727f552df20d/mxnet-1.5.0-py2.py3-none-manylinux1_x86_64.whl (25.4MB)
    100% |████████████████████████████████| 25.4MB 1.8MB/s eta 0:00:01
Collecting numpy<2.0.0,>1.16.0 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/e5/e6/c3fdc53aed9fa19d6ff3abf97dfad768ae3afce1b7431f7500000816bda5/numpy-1.17.2-cp36-cp36m-manylinux1_x86_64.whl (20.4MB)
    100% |████████████████████████████████| 20.4MB 2.9MB/s eta 0:00:01
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
Installing collected packages: numpy, graphviz, mxnet
  Found existing installation: numpy 1.14.3
    Uninstalling numpy-1.14.3:
      Successfully uninstalled numpy-1.14.3
Successfully installed graphviz-0.8.4 mxnet

In [25]:
import mxnet as mx
Kmeans_model_params = mx.ndarray.load('model_algo-1')

In [26]:
cluster_centroids_kmeans = pd.DataFrame(Kmeans_model_params[0].asnumpy())
cluster_centroids_kmeans.columns=df_geo.columns
cluster_centroids_kmeans

Unnamed: 0,latitude,longitude
0,47.827328,-123.282776
1,43.038395,19.588501
2,40.982315,-88.115158
3,-2.605019,108.467651
4,41.236965,-75.401009
5,35.932854,-117.314888
6,30.826124,-82.21302
7,52.188309,-1.870415
8,35.129333,-99.169319
9,21.643465,-157.813766
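
With the centroids in hand, each sighting can be assigned to its nearest cluster center. The sketch below does this in NumPy for the first three rows of df_geo, using the ten centroids from the table above. Distances are plain Euclidean in degree space, matching the raw latitude/longitude features used for training; they are not geodesic distances. The names `centroids`, `points`, and `labels` are illustrative.

```python
import numpy as np

# Cluster centroids copied from the table above (latitude, longitude)
centroids = np.array([
    [47.827328, -123.282776],
    [43.038395, 19.588501],
    [40.982315, -88.115158],
    [-2.605019, 108.467651],
    [41.236965, -75.401009],
    [35.932854, -117.314888],
    [30.826124, -82.21302],
    [52.188309, -1.870415],
    [35.129333, -99.169319],
    [21.643465, -157.813766],
], dtype=np.float32)

# First three rows of df_geo
points = np.array([
    [47.329444, -122.578889],
    [52.664913, -1.034894],
    [38.951667, -92.333889],
], dtype=np.float32)

# Squared Euclidean distance from every point to every centroid
dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = dists.argmin(axis=1)

print(labels.tolist())  # -> [0, 7, 2]
```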


Write the results to CSV and upload them to S3.

In [30]:
from io import StringIO
csv_buffer = StringIO()
cluster_centroids_kmeans.to_csv(csv_buffer, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'results/ten_locations_kmeans.csv').put(Body=csv_buffer.getvalue())

{'ResponseMetadata': {'RequestId': '33C0485CE8ED7201',
  'HostId': 'NSpmY/Ahbx9OTIalTy0Ot6W/wzf5ncdzQwhgjPvSinc0IlxU421tP++IOPSwXDn18jRMFdRbcnY=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'NSpmY/Ahbx9OTIalTy0Ot6W/wzf5ncdzQwhgjPvSinc0IlxU421tP++IOPSwXDn18jRMFdRbcnY=',
   'x-amz-request-id': '33C0485CE8ED7201',
   'date': 'Mon, 23 Sep 2019 01:21:29 GMT',
   'etag': '"7642bb444f037dbae3fd8eb1ddaf13d5"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"7642bb444f037dbae3fd8eb1ddaf13d5"'}
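
The in-memory buffer step can be checked locally before anything is sent to S3. The sketch below writes a small frame to a `StringIO` buffer and reads it back; it makes no S3 call. The name `centroids_df` and the choice of the first two centroid rows are assumptions for the example.

```python
from io import StringIO

import pandas as pd

# First two centroid rows from the results table, as a stand-in frame
centroids_df = pd.DataFrame({
    'latitude': [47.827328, 43.038395],
    'longitude': [-123.282776, 19.588501],
})

# Serialize to CSV in memory, as in the cell above
csv_buffer = StringIO()
centroids_df.to_csv(csv_buffer, index=False)
csv_text = csv_buffer.getvalue()

# Read the text back to confirm the header and values survive the round trip
restored = pd.read_csv(StringIO(csv_text))

print(csv_text)
```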