(c) Ali Parandeh - Beginners Machine Learning - London

# Introduction to Unsupervised Machine Learning with AWS Sagemaker
In this interesting 3hr workshop, you will take the massive dataset of UFO sightings (80,000 reports over the past century) from [National UFO Reporting Center (NUFORC)](http://www.nuforc.org/) and use Amazon's machine learning services ([AWS Sagemaker](https://aws.amazon.com/sagemaker/)) to identify the top 10 locations that are most likely to have UFO sightings. To do so, you will need to use an unsupervised machine learning algorithm.

You will then take your trained model, deserialise it, convert its output to a csv format and visualise it on a map using AWS [Quicksight](https://aws.amazon.com/quicksight/) to see where these locations are. Then you can try correlating these locations with landmarks.

The general machine learning workflow with AWS Sagemaker is shown below. For this assignment we will not evaluate or deploy the model but only use its output to visualise the results on a world map.

<img src="https://docs.aws.amazon.com/sagemaker/latest/dg/images/ml-concepts-10.png">

### What is Unsupervised Machine Learning? 

With unsupervised learning, data features are fed into the learning algorithm, which determines how to label them (usually with numbers 0,1,2..) and based on what. This “based on what” part dictates which unsupervised learning algorithm to follow.

Most unsupervised learning-based applications utilize the sub-field called **Clustering**. 

k-Means Clustering is one of the best-known unsupervised learning algorithms. Although the algorithm is fairly simple, it can look challenging to newcomers to the field.

### What is the difference between supervised and unsupervised machine learning?

The main difference between Supervised and Unsupervised learning algorithms is the absence of data labels in the latter.

### What does clustering mean?

**Clustering** is the process of grouping data samples together into clusters based on a certain feature that they share — exactly the purpose of unsupervised learning in the first place.

<img src="https://cdn-images-1.medium.com/max/1600/1*tWaaZX75oumVwBMcKN-eHA.png">

Source: [Clustering using K-means algorithm](https://towardsdatascience.com/clustering-using-k-means-algorithm-81da00f156f6)

### How does the K-Means Algorithm work?

Being a clustering algorithm, **k-Means** takes data points as input and groups them into `k` clusters. This process of grouping is the training phase of the learning algorithm. The result is a model that takes a data sample as input and returns the cluster that the new data point belongs to, according to the training that the model went through.

<img src="https://miro.medium.com/max/700/1*6EOTS1IE2ULWC9SKgf7mYw.png">

Source - [How Does k-Means Clustering in Machine Learning Work?](https://towardsdatascience.com/how-does-k-means-clustering-in-machine-learning-work-fdaaaf5acfa0)

<img src="https://miro.medium.com/max/700/1*4LOxZL6bFl3rXlr2uCiKlQ.gif">

Source: [How Does k-Means Clustering in Machine Learning Work?](https://towardsdatascience.com/how-does-k-means-clustering-in-machine-learning-work-fdaaaf5acfa0)

Check out the two articles below to learn more about how the K-Means algorithm works:

- [Clustering using K-means algorithm](https://towardsdatascience.com/clustering-using-k-means-algorithm-81da00f156f6)
- [How Does k-Means Clustering in Machine Learning Work?](https://towardsdatascience.com/how-does-k-means-clustering-in-machine-learning-work-fdaaaf5acfa0)
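
The training and prediction phases described above can be sketched in a few lines of NumPy. This is a minimal illustration of plain (Lloyd's) k-means, not the SageMaker implementation used later in this workshop; the names `kmeans`, `predict` and `toy_points` are made up for the example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain (Lloyd's) k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (a centroid with no assigned points is left where it is)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # the centroids stopped moving, so training has converged
        centroids = new_centroids
    return centroids, labels

def predict(centroids, point):
    """Return the index of the cluster whose centroid is closest to `point`."""
    return int(np.linalg.norm(centroids - point, axis=1).argmin())

# Two well-separated toy groups of 2D points
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_centroids, toy_labels = kmeans(toy_points, k=2)
print(toy_centroids)
print(toy_labels)
print(predict(toy_centroids, np.array([9.0, 9.0])))
```

Which group receives label `0` and which receives label `1` depends on the random initialisation; only the grouping itself is meaningful.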


### Where can you use k-means?

The **k-means algorithm** can be a good fit for finding patterns or groups in large datasets that have not been explicitly labeled. Here are some example use cases in different domains:

**E-commerce**

- Classifying customers by purchase history or clickstream activity.

**Healthcare**

- Detecting patterns for diseases or successful treatment scenarios.
- Grouping similar images for image detection.

**Finance**

- Detecting fraud by detecting anomalies in the dataset. For example, detecting credit card frauds by abnormal purchase patterns.

**Technology**

- Building a network intrusion detection system that aims to identify attacks or malicious activity.

**Meteorology**

- Detecting anomalies in collected sensor data, for example for storm forecasting.

## Step 1: Importing Data

For this part of the assignment, we need to import the following packages: 

- **Amazon SageMaker Python SDK**: Amazon SageMaker Python SDK is an open source library for training and deploying machine-learned models on Amazon SageMaker. See [Documentation](https://sagemaker.readthedocs.io/en/stable/index.html)
- **Python Built-in Library** [datetime](https://docs.python.org/2/library/datetime.html)
- **Numpy** and **Pandas**


In [1]:
# TODO: Import the above packages below
import pandas as pd
import numpy as np
import sagemaker
from datetime import datetime

> **Exercise:** Construct a url to the dataset location in your S3 bucket using the following expression and save it to `data_location`.

In [53]:
# TODO: Construct the url path to your dataset file that you have just uploaded to your newly created S3 bucket
bucket = "YOUR-OWN-BUCKET-NAME"
prefix = "ufo_dataset"
data_key = "ufo_complete.csv"

# Construct a url string and save it to data_location variable
data_location = "s3://{}/{}/{}".format(bucket, prefix, data_key)

# print data_location
print(data_location)

s3://YOUR-OWN-BUCKET-NAME/ufo_dataset/ufo_complete.csv


In [3]:
# Setting low_memory=False makes pandas read the file in one pass rather than in chunks,
# which avoids mixed type inference when importing the large UFO dataset.
df = pd.read_csv(data_location, low_memory=False)

In [4]:
# Inspect the tail of the dataframe
df.tail()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
88870,09/09/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,9/30/2013,38.2972222,-122.284444
88871,09/09/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,9/30/2013,38.9011111,-77.265556
88872,09/09/2013 23:00,edmond,ok,us,cigar,1020,17 minutes,2 witnesses 2 miles apart&#44 Red &amp; White...,9/30/2013,35.6527778,-97.477778
88873,09/09/2013 23:00,starr,sc,us,diamond,0,2 nights,On September ninth my wife and i noticed stran...,9/30/2013,34.3769444,-82.695833
88874,09/09/2013 23:30,ft. lauderdale,fl,us,oval,0,still occuring,Hovering object lit with red and white lights&...,9/30/2013,26.1219444,-80.143611


In [5]:
# Inspect the shape of the dataframe
df.shape

(88875, 11)

## Step 2: Cleaning, transforming and preparing the data

In [6]:
# TODO: Select the 'latitude' and 'longitude' columns and save it as a new dataframe `df_geo` with .copy().
df_geo = df[["latitude", "longitude"]].copy()

In [7]:
# Inspect the tail of df_geo
df_geo.tail()

Unnamed: 0,latitude,longitude
88870,38.2972222,-122.284444
88871,38.9011111,-77.265556
88872,35.6527778,-97.477778
88873,34.3769444,-82.695833
88874,26.1219444,-80.143611


In [8]:
# Fully inspect the df_geo dataframe
df_geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88875 entries, 0 to 88874
Data columns (total 2 columns):
latitude     88875 non-null object
longitude    88875 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.4+ MB


Upon successful inspection of the above dataframe, you should note the following:

- There are no `null` or missing values in both columns. However, we still need to check for other incorrect entries that are not **coordinates**. Example: `0`, `string`s, etc.
- The `latitude` column has a `dtype` of `object`. This means the column may have missing or string values where the rest of the values are numbers. If the entries in the column are non-homogeneous, pandas will store the column with the `object` data type. To clean the data in this column we can use pandas' `pd.to_numeric()` function to convert the data in this column to `float` for processing. The machine learning algorithm expects the data passed in to it to be numerical (`float`s or `int`s), not `string`s. - See [Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) on how to use this function.
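
To make the coercion behaviour concrete, here is a small sketch on a made-up dataframe (the values and the names `toy` and `toy_clean` are illustrative and not taken from the UFO dataset). It shows how `errors="coerce"` turns unparseable entries into `NaN`, which can then be dropped.

```python
import pandas as pd

# Toy column mixing numeric strings, a float and a non-numeric entry
toy = pd.DataFrame({"latitude": ["38.29", "not-a-number", 51.5],
                    "longitude": [-122.28, -77.26, -0.12]})
print(toy.dtypes)  # latitude is stored as `object`

# errors="coerce" turns values that cannot be parsed into NaN
toy["latitude"] = pd.to_numeric(toy["latitude"], errors="coerce")

# Drop the rows in which the conversion produced NaN
toy_clean = toy.dropna()
print(toy_clean)
```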

> **Exercise:** Convert the `latitude` column datatype to `float`. You can pass the `errors = "coerce"` option to `pd.to_numeric()` to enforce the conversion. When conversion is not possible - i.e. values are strings that cannot be parsed as numbers - these values will be replaced with NaNs. Therefore, follow up with the `.dropna()` method to drop rows where `NaNs` exist. Then check whether the column formats have been converted to the numerical data type `float` and whether any missing values are still present. Note that `pd.to_numeric()` returns a new Series and has no `inplace` option, so assign its result back to the column; `.dropna()` does accept `inplace = True`, which performs the operation in place and avoids a re-assignment.

In [9]:
# TODO: Convert the column values to numeric and whenever it is not possible replace the value with NaNs and then drop rows
# where NaNs exist

df_geo["latitude"] = pd.to_numeric(df_geo.latitude, errors = "coerce")

In [10]:
# Count the columns that contain null values - expecting 1 (`latitude`) after the coercion above
print("Number of columns with null values before dropping rows is {}".format(df_geo.isnull().any().sum()))

# Drop all rows that contain NaN values
df_geo.dropna(inplace=True)

# Count the columns that contain null values again - expecting this to be zero
print("Number of columns with null values after dropping rows is {}".format(df_geo.isnull().any().sum()))

Number of columns with null values before dropping rows is 1
Number of columns with null values after dropping rows is 0


In [11]:
# Count how many rows in the df have a 0 value in either coordinate
print(df_geo[(df_geo.longitude == 0) | (df_geo.latitude == 0) ].count())

latitude     1494
longitude    1494
dtype: int64


In [12]:
# Select all rows in which both coordinate values are non-zero
df_geo = df_geo[(df_geo.longitude != 0) & (df_geo.latitude != 0)]

In [13]:
# Check that there are no rows left in the df_geo dataframe with a 0 in either coordinate
print(df_geo[(df_geo.longitude == 0) | (df_geo.latitude == 0)])

Empty DataFrame
Columns: [latitude, longitude]
Index: []
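
The relationship between the two boolean masks used in this cleaning step can be checked on a tiny made-up dataframe. In this sketch (the values and the names `toy`, `has_zero` and `kept` are illustrative), the rows to discard have a zero in either coordinate, and the rows to keep are exactly the complement of that mask.

```python
import pandas as pd

toy = pd.DataFrame({"latitude": [0.0, 51.5, 40.7, 0.0],
                    "longitude": [0.0, 0.0, -74.0, 12.3]})

# Rows to discard: a zero in either coordinate
has_zero = (toy.longitude == 0) | (toy.latitude == 0)

# Rows to keep: both coordinates non-zero (the logical complement of has_zero)
kept = toy[(toy.longitude != 0) & (toy.latitude != 0)]

print(has_zero.sum())
print(kept)
```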


In [14]:
# Re-checking the dataframe to ensure both columns have numerical datatype such as `float` or `int`.
df_geo.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 87184 entries, 0 to 88874
Data columns (total 2 columns):
latitude     87184 non-null float64
longitude    87184 non-null float64
dtypes: float64(2)
memory usage: 2.0 MB


In [15]:
# Check if we have any missing values (NaNs) in our dataframe
missing_values = df_geo.isnull().values.any()
print("Are there any missing values? {}".format(missing_values))

Are there any missing values? False


In [16]:
# If there are any missing values in the dataframe, show them
if missing_values:
    print(df_geo[df_geo.isnull().any(axis=1)])

In [17]:
# store the cleaned up dataframe column values as a 2D numpy array (matrix) with datatype of float32
data_train = df_geo.values.astype("float32")

# Print the 2D numpy array
data_train

array([[ 29.883055, -97.94111 ],
       [ 29.38421 , -98.581085],
       [ 53.2     ,  -2.916667],
       ...,
       [ 35.65278 , -97.477776],
       [ 34.376945, -82.69583 ],
       [ 26.121944, -80.14361 ]], dtype=float32)

## Step 3: Visualising the last 5000 reports of the data on the map

One of the useful packages for visualising the data on a map is called **plotly**. 

We can import the following module from plotly package as `px`:

- **plotly**'s [express](https://plot.ly/python/plotly-express/) - Plotly Express is a terse, consistent, high-level wrapper around `plotly.graph_objects` for rapid data exploration and figure generation.

For data available as a tidy pandas DataFrame, we can use the Plotly Express function `px.scatter_geo` for a geographical scatter plot. The `color` argument is used to set the color of markers from a given column of the DataFrame.

In [55]:
import plotly.express as px

# Show only the last 5000 rows on a map
fig = px.scatter_geo(df_geo.iloc[-5000:, :], lat="latitude", lon="longitude",
                     title="UFO Reports by Latitude/Longitude in the world - Last 5000 Reports", color="longitude")
fig.show()

<img src="https://i.imgur.com/LeJzFHj.png">

You may notice that most of the past 5000 UFO reports have been located in the United States. Let's take a closer look at United States by using `plotly`'s `geo` layout feature to show sightings on the US map.

In [56]:
from plotly.offline import iplot

data = [dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lat = df_geo.iloc[-5000:, 0],
        lon = df_geo.iloc[-5000:, 1],
        mode = 'markers',
        marker = dict(
            size = 5.5,
            opacity = 0.75,
            color = 'rgb(0, 163, 81)',
            line = dict(color = 'rgb(255, 255, 255)', width = 1))
        )]

layout = dict(
         title = 'UFO Reports by Latitude/Longitude in United States - Last 5000 Reports',
         geo = dict(
             scope = 'usa',
             projection = dict(type = 'albers usa'),
             showland = True,
             landcolor = 'rgb(250, 250, 250)',
             subunitwidth = 1,
             subunitcolor = 'rgb(217, 217, 217)',
             countrywidth = 1,
             countrycolor = 'rgb(217, 217, 217)',
             showlakes = True,
             lakecolor = 'rgb(255, 255, 255)')
        )

figure = dict(data = data, layout = layout)
iplot(figure)

<img src="https://i.imgur.com/oIQQVIQ.png">

## Step 3: Create and train our model

In [18]:
# Define number of clusters and output location URL to save the trained model
num_clusters = 10
output_location = "s3://" + bucket + "/model-artifacts"

To pass a model training command to Amazon for training, we need to grab the details of the current execution role **ARN ID** whose credentials we are using to call the Sagemaker API. 

> **Exercise:** Grab the ARN ID of your current Execution role using the `sagemaker` SDK - See [Documentation](https://sagemaker.readthedocs.io/en/stable/session.html?highlight=get%20execution#sagemaker.session.get_execution_role)

In [54]:
# TODO: Get the execution role ARN ID to pass to the sagemaker API later on
role = sagemaker.get_execution_role()

# Check that you have this step correctly performed
print(role)

We now can use Amazon's built-in K-means ML algorithm to find `k` clusters of data in our unlabeled UFO dataset.

Amazon SageMaker uses a modified version of the web-scale k-means clustering algorithm. Compared with the original version of the algorithm, the version used by Amazon SageMaker is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To do this, the version used by Amazon SageMaker streams mini-batches (small, random subsets) of the training data. The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. See [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/k-means.html)
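
The idea of streaming mini-batches can be illustrated with a short NumPy sketch. This is only a toy version in the spirit of web-scale (mini-batch) k-means, in which each centroid takes a shrinking step towards the points assigned to it; it is not the modified algorithm that Amazon SageMaker actually runs, and the function name, parameters and toy data are all assumptions made for the example.

```python
import numpy as np

def minibatch_kmeans(points, k, batch_size=32, n_batches=200, seed=0):
    """Mini-batch k-means with per-centroid learning rates (illustrative only)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_batches):
        # Draw a small random subset instead of scanning the full dataset
        batch = points[rng.choice(len(points), size=batch_size)]
        # Assign every point in the batch to its nearest centroid
        distances = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        nearest = distances.argmin(axis=1)
        for x, j in zip(batch, nearest):
            counts[j] += 1
            lr = 1.0 / counts[j]  # a centroid's step shrinks as it sees more points
            centroids[j] = (1.0 - lr) * centroids[j] + lr * x
    return centroids

# Two synthetic blobs standing in for a large dataset
data_rng = np.random.default_rng(42)
blob_a = data_rng.normal(loc=[0.0, 0.0], scale=0.5, size=(500, 2))
blob_b = data_rng.normal(loc=[20.0, 20.0], scale=0.5, size=(500, 2))
toy_points = np.vstack([blob_a, blob_b])

toy_result = minibatch_kmeans(toy_points, k=2)
print(toy_result)
```

Because each update only touches one small batch, the full dataset never has to be held in memory at once, which is what lets this style of algorithm scale to very large datasets.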

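To ask AWS SageMaker to train a model using this algorithm, we need to define a **K-means Estimator**. KMeans Estimators are configured through the arguments passed into the Estimator constructor, which cover both the training infrastructure and the algorithm's **hyperparameters** (such as `k`).

This estimator requires the following arguments to be passed in to `sagemaker.KMeans()`. The argument names below follow version 1 of the SageMaker Python SDK; version 2 renamed `train_instance_count` and `train_instance_type` to `instance_count` and `instance_type`.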

- `role` (str) – An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role if it needs to access an AWS resource.
- `train_instance_count` (int) – Number of Amazon EC2 instances to use for training.
- `train_instance_type` (str) – Type of EC2 instance to use for training, for example, ‘ml.c4.xlarge’. This is the **compute resources** that you want Amazon SageMaker to use for model training. Compute resources are ML compute instances that are managed by Amazon SageMaker.
- `k` (int) – The number of clusters to produce.
- `output_path` (str) - The URL of the S3 bucket where you want to store the output of the job.

In [20]:
# TODO: Define the training API request to AWS Sagemaker
kmeans = sagemaker.KMeans(role = role,
               train_instance_count = 1,
               train_instance_type = "ml.c4.xlarge",
               output_path = output_location,
               k = num_clusters)

The following diagram shows how you train and deploy a model with Amazon SageMaker - See [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html)

<img src="https://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-architecture-training-2.png">

To train a model in Amazon SageMaker, you create a **training job** using the `kmeans.fit()` method. - See [Documentation](https://sagemaker.readthedocs.io/en/stable/kmeans.html?highlight=kmeans.fit#sagemaker.KMeans.fit)

The training job requires the following information passed in to `.fit()` method:

- `kmeans.record_set(data_train)` (`RecordSet`) - The training records to train the KMeans Estimator on. Here `data_train` must be passed in to the `kmeans.record_set()` method to convert our 2D numpy array to the `RecordSet` object that the algorithm requires. - See [Documentation](https://sagemaker.readthedocs.io/en/stable/sagemaker.amazon.amazon_estimator.html?highlight=record_set()#sagemaker.amazon.amazon_estimator.AmazonAlgorithmEstimatorBase.record_set)
- `job_name` (str) - Training job name. If not specified, the estimator generates a default job name, based on the training image name and current timestamp.

Amazon SageMaker then launches the ML compute instances and uses the training code and the training dataset to train the model. It saves the resulting model artifacts and other output in the S3 bucket you specified for that purpose.

Here we are going to construct a job name using the following expression and Python's built-in `datetime` module. This ensures our `job_name` is unique even if the code cell below is run multiple times. Each training job requires a **unique** `job_name`. Otherwise, AWS will throw an error.

In [21]:
# Construct a unique job_name using datetime module
job_name = "kmeans-geo-job-{}".format(datetime.now().strftime("%Y%m%d%H%M%S"))

# Print job_name
print("Here is the job name: {}".format(job_name))

Here is the job name: kmeans-geo-job-20190825212305


> **Exercise**: Create a training job using `kmeans.fit()`. Use the AWS documentation links above to figure out how to pass in the arguments to `kmeans.fit()` for the training job to commence. 

If you do this step right, you should see outputs like this appear underneath the code cell:

```
2019-07-29 00:54:46 Starting - Starting the training job...
2019-07-29 00:54:47 Starting - Launching requested ML instances...
2019-07-29 00:55:44 Starting - Preparing the instances for training......
2019-07-29 00:56:24 Downloading - Downloading input data...
2019-07-29 00:57:05 Training - Downloading the training image..
.
.
.
2019-07-29 00:57:31 Uploading - Uploading generated training model
2019-07-29 00:57:31 Completed - Training job completed
Billable seconds: 68
CPU times: user 1.78 s, sys: 18.7 ms, total: 1.8 s
Wall time: 3min 13s
```

In [134]:
%%time

# TODO: Create a training job and time it. Running this code cell will send a training job request to AWS Sagemaker
kmeans.fit(kmeans.record_set(data_train), job_name= job_name)

2019-07-29 00:54:46 Starting - Starting the training job...
2019-07-29 00:54:47 Starting - Launching requested ML instances...
2019-07-29 00:55:44 Starting - Preparing the instances for training......
2019-07-29 00:56:24 Downloading - Downloading input data...
2019-07-29 00:57:05 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
[07/29/2019 00:57:21 INFO 140106530510656] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_enable_profiler': u'false', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metrics': u'["msd"]', u'_num_kv_s

**Congratulations** on building and training a model in the cloud using an unsupervised machine learning algorithm and saving it! Next we are going to deserialise the model so that we can use its output.

## Step 4: Model Deserialisation

To deserialise the compressed model output saved on our S3 bucket we need to import the following packages.

- **Boto** is the Amazon Web Services (AWS) SDK for Python. It enables Python developers to create, configure, and manage AWS services, such as EC2 and S3. Boto provides an easy to use, object-oriented API, as well as low-level access to AWS service. See [Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)

> **Exercise**: Import the `boto3` package, then use it to download the compressed model output from the S3 bucket to a file. You will need to construct the object key (the path inside the bucket) of the model output and save it to the `path_to_model` variable. Then pass `path_to_model` to the following command: `boto3.resource("s3").Bucket(bucket).download_file(path_to_model, file_name_to_save_to)`. - See [boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html?highlight=s3.object#S3.Client.download_file)

In [23]:
# TODO: Import the required packages for deserialisation
import boto3

# Construct the object key of the model output. Compressed model outputs are saved
# in the job_name folder under the model-artifacts folder
path_to_model = "model-artifacts/" + job_name + "/output/model.tar.gz"

# TODO: Use the AWS Python SDK boto3 to download the compressed model output from S3 bucket onto `model.tar.gz` file.
boto3.resource("s3").Bucket(bucket).download_file(path_to_model, "model.tar.gz")

To extract the downloaded archive, we need to import the following module:

- **Python's Built-in module** `os` - See [Documentation](https://docs.python.org/2/library/os.html#os.system)

The `os.system()` function from Python's built-in `os` module can be used to execute the shell command `tar -zxvf` on the gzip-compressed archive `model.tar.gz`. The `-zxvf` flags passed to `tar` have the following meanings:

- `-z` - filters the archive through gzip
- `-x` - extracts files from the archive
- `-v` - verbosely lists files processed
- `-f` - reads the archive from the file named after the flags (here `model.tar.gz`)


See [Linux's tar Man Pages](https://linux.die.net/man/1/tar) for more details on the `tar` shell command. 

> **Exercise:** Use `os.system()` method to run the `tar` command on the compressed gzip file `model.tar.gz` with the above flags.

In [24]:
# TODO: Import the required packages for deserialisation
import os

# TODO: Use Python's built-in os package to extract the compressed model output
os.system("tar -zxvf model.tar.gz")

0
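
The `0` printed above is the exit status returned by `os.system()`, where zero indicates that the shell command succeeded. As an alternative that does not depend on an external `tar` binary, Python's standard `tarfile` module can extract the same kind of archive. The sketch below is illustrative only: `extract_archive` is a hypothetical helper name, and the demonstration builds its own throwaway archive rather than using the real `model.tar.gz`.

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, target_dir="."):
    """Extract a gzip-compressed tar archive and return the member names."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=target_dir)
    return names

# Self-contained demonstration on a throwaway archive (not the real model)
with tempfile.TemporaryDirectory() as workdir:
    payload = os.path.join(workdir, "demo_model_file")
    with open(payload, "w") as handle:
        handle.write("placeholder contents")

    demo_archive = os.path.join(workdir, "demo.tar.gz")
    with tarfile.open(demo_archive, mode="w:gz") as archive:
        archive.add(payload, arcname="demo_model_file")

    out_dir = os.path.join(workdir, "extracted")
    os.makedirs(out_dir)
    extracted_names = extract_archive(demo_archive, out_dir)
    extracted_ok = os.path.exists(os.path.join(out_dir, "demo_model_file"))
    print(extracted_names, extracted_ok)
```

In the workshop notebook, the equivalent call would be `extract_archive("model.tar.gz")`. As with any archive tool, only extract archives that come from a source you trust.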

`os.system()` later can be used to execute the `unzip` shell command on `model_algo-1`. `unzip` shell command lists, tests, or extracts files from a ZIP archive. See [Linux unzip Man Pages](https://linux.die.net/man/1/unzip) for more details on the `unzip` command.

> **Exercise:** Use `os.system()` method to unzip `model_algo-1`.

In [25]:
# TODO: Use Python's built-in os package to unzip model_algo-1 file. 
os.system("unzip model_algo-1")

2304
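
On Unix, the integer returned by `os.system()` is a wait status that encodes the command's exit code, and any non-zero value means the command did not finish cleanly. The sketch below decodes such a status using functions from the `os` module (Unix only); `exit_code_from_wait_status` is a hypothetical helper name.

```python
import os

def exit_code_from_wait_status(status):
    """Decode the value returned by os.system() on Unix into an exit code.

    Returns None if the process did not exit normally (e.g. killed by a signal).
    """
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None

tar_exit_code = exit_code_from_wait_status(0)       # status printed by the tar cell
unzip_exit_code = exit_code_from_wait_status(2304)  # status printed by the unzip cell
print(tar_exit_code, unzip_exit_code)
```

The status `2304` decodes to exit code 9. Going by the diagnostics section of the `unzip` man page, that code means the specified zipfiles were not found, which suggests that `model_algo-1` was not recognised as a ZIP archive in this run. The cells that follow load `model_algo-1` directly with `mxnet`, so the rest of the workflow does not appear to depend on this step succeeding.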

To load the model output parameters, we need to install the `mxnet` package.

> **Exercise**: Use `!pip install` to install `mxnet`.

In [26]:
# TODO: Install mxnet package
!pip install mxnet

Collecting mxnet
  Downloading https://files.pythonhosted.org/packages/50/08/186a7d67998f1e38d6d853c71c149820983c547804348f06727f552df20d/mxnet-1.5.0-py2.py3-none-manylinux1_x86_64.whl (25.4MB)
    100% |████████████████████████████████| 25.4MB 1.9MB/s eta 0:00:01
Collecting numpy<2.0.0,>1.16.0 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/19/b9/bda9781f0a74b90ebd2e046fde1196182900bd4a8e1ea503d3ffebc50e7c/numpy-1.17.0-cp36-cp36m-manylinux1_x86_64.whl (20.4MB)
    100% |████████████████████████████████| 20.4MB 3.1MB/s eta 0:00:01
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
Installing collected packages: numpy, graphviz, mxnet
  Found existing installation: numpy 1.14.3
    Uninstalling numpy-1.14.3:
      Successfully uninstalled numpy-1.14.3
Successfully installed graphviz-0.8.4 mxnet

To load the model output parameters we need to import the following package:

- **MXNet**: A flexible and efficient library for deep learning. - See [Documentation](https://mxnet.apache.org/versions/master/api/python/index.html) 

> **Exercise**: Use `mxnet`'s `.ndarray.load()` method to load the model output parameters and assign it to `Kmeans_model_params` variable - See [Documentation](https://mxnet.incubator.apache.org/api/python/ndarray/ndarray.html)

In [27]:
# TODO: Import mxnet
import mxnet as mx

# TODO: Use mxnet to load the model parameters
Kmeans_model_params = mx.ndarray.load("model_algo-1")

> **Exercise**: Convert the model parameters to a dataframe cluster_centroids_kmeans using `pd.DataFrame()`. You can grab the model output parameters using `Kmeans_model_params[0].asnumpy()` to pass to `pd.DataFrame()`.

In [28]:
# TODO: Convert the Kmeans_model_params to a dataframe using pandas and numpy: cluster_centroids_kmeans
cluster_centroids_kmeans = pd.DataFrame(Kmeans_model_params[0].asnumpy())

# TODO: Set the column names of the cluster_centroids_kmeans dataframe to match the df_geo column names
cluster_centroids_kmeans.columns = df_geo.columns

# Print cluster_centroids_kmeans
print(cluster_centroids_kmeans)

    latitude   longitude
0  35.379860 -118.177162
1  41.521103  -74.812103
2  51.608204    0.121513
3 -11.612000  128.658752
4  47.705780 -122.042778
5  35.611134  -98.932304
6  31.191694  -82.532051
7  28.319733   37.477905
8  41.149517  -87.080086
9 -18.685837  -53.455894


To write the model output to an in-memory text stream before uploading it, we need to import the following package:

- **Python's Built-in Package** `io` - See [Documentation](https://docs.python.org/3/library/io.html#io.StringIO)
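
As a quick illustration of how an in-memory text stream behaves, the sketch below writes a small made-up dataframe (standing in for the cluster centroids; the names `toy_centroids`, `buffer` and `csv_text` are illustrative) into an `io.StringIO` buffer and reads the CSV text back out, without touching the disk.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cluster centroids
toy_centroids = pd.DataFrame({"latitude": [35.5, 51.25], "longitude": [-118.0, 0.5]})

# to_csv() accepts any file-like object, so the CSV text is kept in memory
buffer = io.StringIO()
toy_centroids.to_csv(buffer, index=False)

csv_text = buffer.getvalue()  # the full CSV content as one string
print(csv_text)

# The same text can be parsed back into an equivalent dataframe
round_trip = pd.read_csv(io.StringIO(csv_text))
```

The string returned by `getvalue()` is what gets passed as the `Body` when the file is uploaded to S3.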

In [29]:
# TODO: Import Python's built-in package io
import io

# When a csv_buffer object is created, it is initialized using StringIO() constructor
# Here no string is given to the StringIO() so the csv_buffer object is empty.
csv_buffer = io.StringIO()

# TODO: Use pandas .to_csv() method to write the cluster_centroids_kmeans dataframe as csv text into the in-memory buffer
cluster_centroids_kmeans.to_csv(csv_buffer, index = False)

# TODO: Use Amazon's boto3 package to create an S3 resource
s3_resource = boto3.resource("s3")

# Create an S3 object at a path given, using the .Object() method in the given `bucket`.
# Then save the content of the csv_buffer file onto the newly created S3 object using the .put() methods
s3_resource.Object(bucket, "results/ten_locations_kmeans.csv").put(Body = csv_buffer.getvalue())

{'ResponseMetadata': {'RequestId': 'A80CA456D349AA28',
  'HostId': 'tHzEAdFgatji4gI50xrA2L31eCImX7RQVFa1R3M3E/tdwGrAUrsoywBv74FMzoxw7X5wCwWWJ/Y=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'tHzEAdFgatji4gI50xrA2L31eCImX7RQVFa1R3M3E/tdwGrAUrsoywBv74FMzoxw7X5wCwWWJ/Y=',
   'x-amz-request-id': 'A80CA456D349AA28',
   'date': 'Sun, 25 Aug 2019 21:25:44 GMT',
   'etag': '"2477206b3fc6b0706e3cd0fde0ca6337"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"2477206b3fc6b0706e3cd0fde0ca6337"'}

## CONGRATULATIONS!!!
Well done on completing this difficult part of the assignment. All that is now left for you to do is to visualise the model outputs you have saved in the `ten_locations_kmeans.csv` file in your S3 bucket on a map. Simply create an **AWS Quicksight** account and use the `my-manifest.json` file under the `quicksight` folder to configure AWS Quicksight.

Again, well done on completing the above assignments! This was a hard exercise and you have learned a lot about how to use AWS Sagemaker to train an unsupervised machine learning model in the cloud. We hope that you enjoyed this **Introduction to unsupervised machine learning with AWS** Workshop. To learn more about AWS Sagemaker and machine learning in the cloud, check out a few resources we have provided in our repo's [README.md](https://github.com/beginners-machine-learning-london/intro_to_unsupervised_ml_with_AWS_Sagemaker).

Also make sure to sign up on our meetup group to be informed of future workshops! [London Beginners Machine Learning Meetup](https://www.meetup.com/beginners-machine-learning-london/).

And join our [slack channel](https://join.slack.com/t/beginnersmach-wlf5812/shared_invite/enQtNzAzODA4OTY3MTcyLWU2ZDMzNGU2YTQ4ZDk5ZjY3OTk1YWU2OGU5NWRmMjM1NzkwM2MwYjk5MDNhZWE1YWVmNzY1MjgzZDk4OGE1OGE) too to ask questions, discuss ML with other BML community members and suggest topics for future workshops.