# Clustering : K-Means : Uber Pickups

This lab clusters Uber pickup data from New York City.  
The data comes from this [Kaggle dataset](https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city).

Sample data looks like this:
```
"Date_Time","Lat","Lon","Base"
"4/1/2014 0:11:00",40.769,-73.9549,"B02512"
"4/1/2014 0:17:00",40.7267,-74.0345,"B02512"
"4/1/2014 0:21:00",40.7316,-73.9873,"B02512"
"4/1/2014 0:28:00",40.7588,-73.9776,"B02512"
```

## Step 1: Load the Data
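If the file is not available locally, it is downloaded first. The load cell lets pandas infer the column types; specifying a schema (column dtypes) up front can reduce loading time on the larger file.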

In [None]:
import os
import urllib.request

## sample file with 10,000 records
data_location="../data/uber-nyc/uber-sample-10k.csv"
data_url = 'https://elephantscale-public.s3.amazonaws.com/data/uber-nyc/uber-sample-10k.csv'

## larger file with about 500k records
# data_location = "../data/uber-nyc/uber-raw-data-apr14.csv.gz"
# data_url = 'https://elephantscale-public.s3.amazonaws.com/data/uber-nyc/uber-raw-data-apr14.csv.gz'

if not os.path.exists (data_location):
    data_location = os.path.basename(data_location)
    if not os.path.exists(data_location):
        print("Downloading : ", data_url)
        urllib.request.urlretrieve(data_url, data_location)
print('data_location:', data_location)



In [None]:
import pandas as pd

dataset = pd.read_csv(data_location)
dataset.info()  # info() prints its summary itself and returns None
dataset.sample(10)
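
Passing explicit column types to `read_csv` saves pandas from inferring them. The sketch below is a minimal illustration, not part of the lab code: it reads four rows copied from the sample at the top of this notebook from an in-memory string. The `schema` dictionary and the timestamp format are assumptions based on that sample.

```python
import io
import pandas as pd

# Four rows copied from the sample shown at the top of this notebook
sample_csv = io.StringIO(
    '"Date_Time","Lat","Lon","Base"\n'
    '"4/1/2014 0:11:00",40.769,-73.9549,"B02512"\n'
    '"4/1/2014 0:17:00",40.7267,-74.0345,"B02512"\n'
    '"4/1/2014 0:21:00",40.7316,-73.9873,"B02512"\n'
    '"4/1/2014 0:28:00",40.7588,-73.9776,"B02512"\n'
)

# Assumed schema: floating-point coordinates, text for the base code
schema = {"Lat": "float64", "Lon": "float64", "Base": str}

df = pd.read_csv(sample_csv, dtype=schema)

# Parse the timestamp with an explicit format (assumed: month/day/year)
df["Date_Time"] = pd.to_datetime(df["Date_Time"], format="%m/%d/%Y %H:%M:%S")

print(df.dtypes)
```

In the lab itself, the same `dtype=` argument could be passed to `pd.read_csv(data_location, ...)`.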

## Step 2: Clean Up the Data
Make sure our data is clean: drop any rows that are missing a latitude or longitude.

In [None]:
## .copy() gives us an independent DataFrame, so adding a column later is safe
uber_pickups_clean = dataset.dropna(subset=['Lat', 'Lon']).copy()
print ('uber_pickups_clean : ', uber_pickups_clean.shape)

## Step 3 : Create Feature Vectors

In [None]:
## TODO : create feature vectors using the 'Lat' and 'Lon' attributes
x = uber_pickups_clean[['???', '???']]
x

## Step 4 : Run K-Means

In [None]:
%%time 
from sklearn.cluster import KMeans

## TODO : start with num_clusters=4
num_clusters = ???
kmeans = KMeans(n_clusters=num_clusters)

model = kmeans.fit(x)

## WSSSE = within-set sum of squared errors (scikit-learn calls it inertia_)
wssse = model.inertia_

print ("num_clusters = {},  WSSSE = {:,}".format(num_clusters, wssse))
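
WSSSE is the sum of squared distances from each point to the center of the cluster it is assigned to. The sketch below computes that quantity by hand in plain Python. The points, centers, and labels are made-up toy values, not output from the model above.

```python
def wssse(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centers[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total


# Hypothetical toy data: two tight groups of two points each
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]

# Every point is exactly 1 unit from its center, so the total is 4 * 1**2
print(wssse(points, centers, labels))  # 4.0
```

A lower WSSSE means tighter clusters, but it always shrinks as the number of clusters grows, so it cannot be used on its own to pick k.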

## Step 5: Let's find the best K - Hyperparameter tuning

Let's try iterating and plotting over values of k, so we can practice using the elbow method.


In [None]:
import time

kvals = []
wssses = []

## TODO : loop over k values from 2 to 10
## (remember that range() excludes its upper bound)
for k in range(2, ???):
    kmeans = KMeans(n_clusters=k)
    t1 = time.perf_counter()
    model = kmeans.fit(x)
    t2 = time.perf_counter()
    wssse = model.inertia_
    print ("k={},  wssse={},  time took {:,.2f} ms".format(k,wssse, ((t2-t1)*1000)))
    kvals.append(k)
    wssses.append(wssse)

In [None]:
import pandas as pd

df = pd.DataFrame({'k': kvals, 'wssse':wssses})
df

In [None]:
%matplotlib inline
from matplotlib import pyplot

pyplot.plot(kvals, wssses, marker='o')
pyplot.xlabel('k')
pyplot.ylabel('WSSSE')
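
The elbow is the point where adding more clusters stops reducing WSSSE by much. Reading it off the plot is the approach used in this lab. As a rough cross-check, the sketch below picks the k where the curve bends most sharply, using the largest second difference. This is one simple heuristic among several, `elbow_k` is a hypothetical helper, and the WSSSE values are made up for illustration.

```python
def elbow_k(kvals, wssses):
    """Return the k with the largest second difference in WSSSE.

    Assumes kvals are evenly spaced and sorted in increasing order.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(kvals) - 1):
        # How much the drop in WSSSE slows down around kvals[i]
        bend = wssses[i - 1] - 2 * wssses[i] + wssses[i + 1]
        if bend > best_bend:
            best_k, best_bend = kvals[i], bend
    return best_k


# Made-up WSSSE values for k = 2..8, for illustration only
example_kvals = [2, 3, 4, 5, 6, 7, 8]
example_wssses = [100.0, 60.0, 35.0, 30.0, 27.0, 25.0, 24.0]

print(elbow_k(example_kvals, example_wssses))  # 4
```

On real data the curve is often noisier than this, so treat the result as a suggestion and compare it with the plot.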

## Step 6 : Let's run K-Means with the best K we have chosen
From the graph above, choose a good K value.  We will use that below.

In [None]:
%%time 

## TODO : choose an appropriate k
## pick from elbow region from the graph above
num_clusters = ???
kmeans = KMeans(n_clusters=num_clusters)

model = kmeans.fit(x)

wssse = model.inertia_

print ("num_clusters = {},  WSSSE = {:,}".format(num_clusters, wssse))

## Step 7: Predict

In [None]:
predictions = model.predict(x)
predictions

In [None]:
uber_pickups_clean['prediction'] = predictions
uber_pickups_clean
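
Conceptually, K-Means prediction assigns each point to its nearest cluster center. The plain-Python sketch below shows that rule for a single pickup. The centers are hypothetical values chosen for illustration, not the ones fitted above, and `nearest_center` is a made-up helper name.

```python
import math


def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))


# Hypothetical (lat, lon) centers, not taken from the fitted model
example_centers = [(40.75, -73.98), (40.65, -73.78)]

# One pickup location from the sample data at the top of this notebook
pickup = (40.7588, -73.9776)

print(nearest_center(pickup, example_centers))  # 0
```

Plain Euclidean distance on latitude and longitude is only an approximation of distance on the ground, which is acceptable over an area the size of a city.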

### Print Cluster Center and Size

In [None]:
cluster_count = uber_pickups_clean.groupby("prediction").size()
cluster_count

In [None]:
cluster_count.sort_values(ascending=False)
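
To see a center next to each size, one option is to aggregate the coordinates per cluster in the same `groupby`. The sketch below uses a small made-up DataFrame standing in for `uber_pickups_clean`. The per-cluster mean of `Lat` and `Lon` should closely match the fitted centers in `model.cluster_centers_`, since K-Means places each center at the mean of its assigned points.

```python
import pandas as pd

# Hypothetical stand-in for uber_pickups_clean with its 'prediction' column
pickups = pd.DataFrame({
    "Lat": [40.70, 40.72, 40.80, 40.82, 40.84],
    "Lon": [-74.00, -74.02, -73.90, -73.92, -73.94],
    "prediction": [0, 0, 1, 1, 1],
})

# One row per cluster: number of pickups and mean coordinates
summary = (
    pickups.groupby("prediction")
    .agg(
        count=("Lat", "size"),
        center_lat=("Lat", "mean"),
        center_lon=("Lon", "mean"),
    )
    .sort_values("count", ascending=False)
)

print(summary)
```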

## Step 8 : Plotting time!
We are going to plot the results now.  
Since we are dealing with geographic coordinates, let's use Google Maps!  

Go to the following URL :  
[https://jsfiddle.net/sujee/omypetfu/](https://jsfiddle.net/sujee/omypetfu/)

- Run the code cell below
- Copy and paste the output into the JavaScript section of the JSFiddle editor (lower left)
- Click 'Run' (top nav bar)
- Click 'tidy' (top nav bar) to clean up the code

See the following image 

<img src="../assets/images/kmeans_uber_trips_map.png" style="border: 5px solid grey ; max-width:100%;" />

You will be rewarded with a beautiful map of clusters on Google Maps

<img src="../assets/images/Kmeans_uber_trips.png" style="border: 5px solid grey ; max-width:100%;" />

Optional
- You can 'fork' the snippet and keep tweaking

In [None]:
### generate Javascript
s1 = "var clusters = {"

s2 = ""

prediction_count = uber_pickups_clean.groupby("prediction").size()
total_count = 0
cluster_centers = model.cluster_centers_
for i in range(0, num_clusters):
    count = prediction_count[i]
    lat = cluster_centers[i][0]
    lng = cluster_centers[i][1]
    total_count = total_count + count
    if (i > 0):
        s2 = s2 + ","
    s2 = s2 + " {}: {{ center: {{ lat: {}, lng: {} }}, count: {} }}".\
        format(i, lat, lng, count)
    #s2 = s2 + "{}: {{  center: {{ }}, }}".format(i)

s3 = s1 + s2 + "};"

s4 = """
function initMap() {
  // Create the map.
  var map = new google.maps.Map(document.getElementById('map'), {
    zoom: 10,
    center: {
      lat: 40.77274573,
      lng: -73.94
    },
    mapTypeId: 'roadmap'
  });

  // Construct a circle for each cluster.
  // Note: the circle radius is scaled by the cluster's share of all pickups.
  for (var cluster in clusters) {
    // Add the circle for this cluster to the map.
    var cityCircle = new google.maps.Circle({
      strokeColor: '#FF0000',
      strokeOpacity: 0.8,
      strokeWeight: 2,
      fillColor: '#FF0000',
      fillOpacity: 0.35,
      map: map,
      center: clusters[cluster].center,
"""

s5 = "radius: clusters[cluster].count / {} * 100 * 300 }});  }}}}".format(total_count)

# final
s = s3 + s4 + s5

print(s)
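
The `clusters` object can also be produced with the standard `json` module instead of string concatenation, which avoids escaping braces by hand. This is an alternative sketch, not the lab's code. The centers and counts are hypothetical stand-ins for `model.cluster_centers_` and the per-cluster sizes.

```python
import json

# Hypothetical stand-ins for model.cluster_centers_ and the cluster sizes
example_centers = [(40.75, -73.98), (40.65, -73.78)]
example_counts = [7000, 3000]

# float() and int() convert NumPy scalars to plain Python numbers,
# which the json module can always serialize
clusters = {
    i: {"center": {"lat": float(lat), "lng": float(lng)}, "count": int(count)}
    for i, ((lat, lng), count) in enumerate(zip(example_centers, example_counts))
}

# json.dumps writes the integer keys as strings, which JavaScript accepts
js_clusters = "var clusters = " + json.dumps(clusters, indent=2) + ";"
print(js_clusters)
```

The resulting string could take the place of `s3` in the cell above.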

## Step 9: Let's analyze some more data

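- In Step 1, point `data_location` (and the matching `data_url`) at the larger file by commenting out the sample-file lines and uncommenting the larger-file lines:
```
data_location = "../data/uber-nyc/uber-raw-data-apr14.csv.gz"
```
- Then select 'Cell --> Run All' to execute all code blocks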


## Step 10 : Running the script

**Use the download script**

```bash
cd   ~/data/uber-nyc
./download-data.sh
```

This will download more data.

As we run on larger datasets, execution takes longer and the Jupyter notebook might time out. So let's run this from the command line in script mode.

```bash
cd ~/ml-labs-python/clustering
time python kmeans-uber.py 2> logs
```

Watch the output. Anything the script writes to stderr is redirected to the `logs` file.
