# K-means clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Its objective is to group similar data points together and so reveal underlying patterns. To do this, K-means looks for a fixed number (k) of clusters in a dataset. The algorithm identifies k centroids and assigns every data point to the nearest one, while keeping each cluster as compact as possible (that is, minimizing the sum of squared distances between the points and their assigned centroid). The ‘means’ in K-means refers to averaging the data points in a cluster to find its centroid.
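The assign-then-average loop described above can be sketched in plain Python. This is a minimal illustration of the standard iteration (often called Lloyd's algorithm), not the implementation used later in this notebook; the function name `kmeans`, its parameters, and the random choice of starting centroids are assumptions made for the example. scikit-learn's `KMeans` uses a more careful initialisation by default.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, n_iter=100, seed=0):
    """Illustrative k-means on a list of (x, y) tuples (hypothetical helper)."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point gets the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Small made-up dataset with three visibly separated groups.
sample_points = [(0, 0), (0, 1), (1, 0),
                 (10, 10), (10, 11), (11, 10),
                 (-10, 10), (-10, 11), (-11, 10)]
sample_labels, sample_centroids = kmeans(sample_points, k=3)
```

Because the starting centroids are random, different seeds can give different label numbering and, on harder data, different clusterings.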

## Data

The input file ("places.txt") consists of the locations of 300 places in the US. Each location is a two-dimensional point that represents the longitude and latitude of the place. For example, "-112.1,33.5" means the longitude of the place is -112.1, and the latitude is 33.5.
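As a small sketch of the line format, the example string above can be split into numeric longitude and latitude values. The helper name `parse_location` is hypothetical and is not used elsewhere in this notebook.

```python
def parse_location(line):
    """Parse one 'longitude,latitude' line into a pair of floats."""
    lon_text, lat_text = line.strip().split(',')
    return float(lon_text), float(lat_text)

lon, lat = parse_location("-112.1,33.5\n")
print(lon, lat)  # prints: -112.1 33.5
```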

Read the text file into a list of lists, then convert it to a NumPy array for input to the K-means clustering algorithm. The values are read as strings, as the `<U14` dtype in the output shows.

In [1]:
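import numpy as np

# Read each "longitude,latitude" line into a list of two strings;
# the with-block closes the file once reading is done
with open('places.txt') as f:
    main_list = [line.strip('\n').split(',') for line in f]

main_array = np.array(main_list)
main_array[:5]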

array([['-112.0707922', '33.4516246'],
       ['-112.0655423', '33.4492979'],
       ['-112.0739312', '33.4564905'],
       ['-112.0748658', '33.4701155'],
       ['-80.5256905', '43.4770992']], dtype='<U14')

## Clustering
Implement the k-means algorithm and use it to cluster the 300 locations into three clusters, such that the locations in the same cluster are geographically close to each other. After reading in the 300 locations in "places.txt" and applying the k-means algorithm (with k = 3), generate an output file named "clusters.txt".

In [0]:
from sklearn.cluster import KMeans

# Convert the string coordinates to floats explicitly before fitting
clustering = KMeans(n_clusters=3, random_state=0).fit(main_array.astype(float))

In [3]:
clustering.labels_

array([0, 0, 0, 0, 1, 1, 2, 0, 0, 2, 0, 2, 1, 0, 2, 2, 2, 0, 2, 1, 1, 1,
       1, 2, 0, 1, 0, 1, 0, 0, 1, 2, 0, 0, 0, 2, 1, 0, 1, 2, 2, 2, 2, 2,
       2, 1, 0, 2, 0, 0, 1, 2, 0, 2, 1, 2, 0, 2, 1, 2, 0, 1, 2, 0, 1, 2,
       2, 0, 1, 2, 1, 0, 2, 1, 0, 0, 0, 2, 1, 0, 0, 2, 0, 1, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 2, 2, 0, 0, 0, 1, 0, 0, 2, 2, 0, 1, 0, 0,
       1, 2, 0, 1, 0, 0, 1, 0, 1, 1, 2, 2, 0, 1, 2, 1, 2, 1, 2, 1, 2, 1,
       2, 0, 0, 1, 1, 0, 0, 1, 2, 2, 0, 1, 2, 2, 2, 0, 2, 2, 1, 2, 2, 1,
       1, 1, 1, 1, 2, 2, 1, 0, 1, 0, 2, 1, 1, 1, 1, 1, 1, 0, 0, 2, 2, 0,
       1, 2, 2, 1, 0, 1, 2, 2, 1, 2, 2, 1, 0, 1, 2, 2, 0, 0, 0, 2, 0, 0,
       2, 0, 1, 0, 2, 0, 1, 2, 0, 1, 2, 2, 1, 2, 2, 0, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 2, 2, 2, 0, 2, 0, 0, 2, 2, 1, 1, 2, 0, 2, 0, 2, 0, 0,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 1, 1, 1, 2, 0, 2, 0, 1, 0, 2, 1,
       1, 0, 1, 0, 1, 0, 1, 2, 0, 2, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2, 1, 1,
       2, 1, 1, 0, 1, 0, 2, 0, 1, 0, 0, 0, 1, 0], d

## Output
The output file should contain exactly 300 lines, one per location, each giving that location's cluster label in the format: location_id cluster_label.

Create an index to serve as location_id in the output.

In [4]:
index = list(range(300))
index[:5]

[0, 1, 2, 3, 4]

Write the lines to the output file in the specified format.

In [0]:
# Open in write mode so that rerunning the cell does not append duplicate lines
with open('clusters.txt', 'w') as f:
  for i, j in zip(index, clustering.labels_):
    f.write(f"{i} {j}\n")
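
The format requirement (exactly 300 lines, each `location_id cluster_label`) can be checked with a short helper. This is only a sketch: the name `check_cluster_lines` is hypothetical, and it assumes that location ids are consecutive integers starting at 0, as in the index created above, and that labels lie in the range 0 to k-1.

```python
def check_cluster_lines(lines, expected_count=300, k=3):
    """Return True if every line looks like 'location_id cluster_label'."""
    if len(lines) != expected_count:
        return False
    for expected_id, line in enumerate(lines):
        parts = line.split()
        # Each line must hold exactly two non-negative integers.
        if len(parts) != 2 or not all(part.isdigit() for part in parts):
            return False
        location_id, label = int(parts[0]), int(parts[1])
        # Assumption: ids run 0, 1, 2, ... and labels are in range(k).
        if location_id != expected_id or not 0 <= label < k:
            return False
    return True

# Made-up lines in the expected format, used only to exercise the check.
example_lines = [f"{i} {i % 3}" for i in range(300)]
print(check_cluster_lines(example_lines))  # prints: True
```

To apply it to the real output, read the file with `open('clusters.txt').read().splitlines()` and pass the resulting list to `check_cluster_lines`.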