## Day 48 Lecture 2 Assignment

In this assignment, we will apply density-based clustering (DBSCAN) to a dataset of Starbucks locations, narrowed down to a single country such as the U.S.

This assignment will also use the haversine and plotly packages, which you should already have installed from the previous assignment.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
import plotly.express as px

This dataset contains the latitude and longitude (as well as several other details we will not be using) of every Starbucks in the world as of February 2017. Each row consists of the following features, which are generally self-explanatory:

- Brand
- Store Number
- Store Name
- Ownership Type
- Street Address
- City
- State/Province
- Country
- Postcode
- Phone Number
- Timezone
- Longitude
- Latitude

Load in the dataset.

In [2]:
# answer goes here

df = pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/starbucks_locations.csv")



Begin by narrowing down the dataset to a specific geographic area of interest. Try just the United States; since you won't be calculating a distance matrix you can use more than just one state.

In [3]:
# answer goes here


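# Keep only the stores in one country (here China, country code "CN").
# .copy() avoids a SettingWithCopyWarning when new columns are added later.
df = df[df["Country"] == "CN"].copy()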


Build a DBSCAN clustering model using eps=2 (miles) and min_samples=5. Some tips that may be helpful:

1. Unlike our approach for hierarchical clustering, we do not need to calculate the NxN distance matrix for DBSCAN upfront. It directly supports the haversine distance metric, provided the nearest-neighbors algorithm is a ball tree. Set the "algorithm" and "metric" parameters to the appropriate values. 
2. Scikit-learn's implementation of haversine distance expects radians instead of degrees. Therefore, it would be advisable to create two new columns, Lat_Rad and Lon_Rad, that convert the Latitude and Longitude columns into radians. (Hint: there is a numpy function that does this.)  
3. The eps parameter, which corresponds to the radius of the neighborhood, will also need to be in radians. The conversion factor for miles to radians is approximately 1/3958.748; in other words, if you want the neighborhood to have a radius of 3 miles, set eps = 3/3958.748.  
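As a hedged illustration of tips 2 and 3, the sketch below converts a radius in miles to radians and checks the conversion against a hand-written haversine formula. It uses only the standard library; the helper names `miles_to_radians` and `haversine_radians` and the two sample points are assumptions made for this example, not part of the assignment.

```python
import math

EARTH_RADIUS_MILES = 3958.748  # same constant as in the tips above


def miles_to_radians(miles):
    """Convert a great-circle distance in miles to an angle in radians."""
    return miles / EARTH_RADIUS_MILES


def haversine_radians(lat1, lon1, lat2, lon2):
    """Haversine distance between two points (inputs and result in radians)."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


# Two hypothetical points one degree of latitude apart on the same meridian.
lat_a, lon_a = math.radians(39.0), math.radians(116.0)
lat_b, lon_b = math.radians(40.0), math.radians(116.0)

dist_rad = haversine_radians(lat_a, lon_a, lat_b, lon_b)
dist_miles = dist_rad * EARTH_RADIUS_MILES  # radians back to miles

eps_2_miles = miles_to_radians(2)  # value to pass as eps for a 2-mile radius
print(f"distance: {dist_rad:.6f} rad = {dist_miles:.2f} miles")
print(f"eps for 2 miles: {eps_2_miles:.8f} rad")
```

Because both the distances and eps are expressed in radians, multiplying or dividing by the same Earth-radius constant is all that is needed to move between the two units.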

Side note: a ball tree is an indexing structure that is very useful for nearest-neighbor calculations. Once the tree is built, a single nearest-neighbor query typically takes about O(log(n)) time, so finding neighbors for all n points takes roughly O(nlog(n)). This is a vast improvement over the naive O($n^{2}$) pairwise comparison and allows us to cluster on much larger subsets of the data, like the entire country. Scikit-learn directly supports creating ball trees through sklearn.neighbors.BallTree; if inclined, you could extend the analysis in the first after-lecture assignment (in which we calculated a similarity matrix for Hawaii) to the entire country using a BallTree and identify "island Starbucks locations" on a much larger scale.
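A minimal sketch of the BallTree idea from the side note, assuming scikit-learn and numpy are installed. The four coordinates are hypothetical stand-ins rather than rows from the dataset; the example finds each point's nearest other point and counts neighbors within a 10-mile radius.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.748

# Hypothetical coordinates in degrees, as (latitude, longitude) pairs.
coords_deg = np.array([
    [39.90, 116.32],
    [39.97, 116.32],
    [39.95, 116.47],
    [31.23, 121.47],  # far away from the other three
])

# The haversine metric expects [latitude, longitude] in radians.
coords_rad = np.radians(coords_deg)
tree = BallTree(coords_rad, metric="haversine")

# k=2 because each point's closest match is itself at distance 0;
# column 1 therefore holds the distance to the nearest *other* point.
dist_rad, idx = tree.query(coords_rad, k=2)
nearest_miles = dist_rad[:, 1] * EARTH_RADIUS_MILES

# Number of points within 10 miles of each point (the point itself is included).
counts = tree.query_radius(coords_rad, r=10 / EARTH_RADIUS_MILES, count_only=True)

print(nearest_miles.round(2))
print(counts)
```

A location whose nearest other store is far away, like the last point here, is the kind of "island" location the side note refers to.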

Additionally, save the predicted cluster assignments as a new column in your dataframe.

In [4]:
# answer goes here

# Scikit-learn's haversine metric expects coordinates in radians.
df["Lat_Rad"] = np.radians(df["Latitude"])
df["Lon_Rad"] = np.radians(df["Longitude"])


In [5]:
df.head()

Unnamed: 0,Brand,Store Number,Store Name,Ownership Type,Street Address,City,State/Province,Country,Postcode,Phone Number,Timezone,Longitude,Latitude,Lat_Rad,Lon_Rad
2091,Starbucks,22901-225145,北京西站第一咖啡店,Company Owned,"丰台区, 北京西站通廊7-1号, 中关村南大街2号",北京市,11,CN,100073.0,,GMT+08:00 Asia/Beijing,116.32,39.9,0.696386,2.030167
2092,Starbucks,32320-116537,北京华宇时尚店,Company Owned,"海淀区, 数码大厦B座华宇时尚购物中心内, 蓝色港湾国际商区1座C1-3单元首层、",北京市,11,CN,100086.0,010-51626616,GMT+08:00 Asia/Beijing,116.32,39.97,0.697608,2.030167
2093,Starbucks,32447-132306,北京蓝色港湾圣拉娜店,Company Owned,"朝阳区朝阳公园路6号, 二层C1-3单元及二层阳台, 太阳宫中路12号",北京市,11,CN,100020.0,010-59056343,GMT+08:00 Asia/Beijing,116.47,39.95,0.697259,2.032785
2094,Starbucks,17477-161286,北京太阳宫凯德嘉茂店,Company Owned,"朝阳区, 太阳宫凯德嘉茂一层01-44/45号, 东三环北路27号",北京市,11,CN,100028.0,010-84150945,GMT+08:00 Asia/Beijing,116.45,39.97,0.697608,2.032436
2095,Starbucks,24520-237564,北京东三环北店,Company Owned,"朝阳区, 嘉铭中心大厦A座B1层024商铺, 金融大街7号",北京市,11,CN,,,GMT+08:00 Asia/Beijing,116.46,39.93,0.69691,2.03261


In [6]:
# eps is a 50-mile radius converted to radians (miles / 3958.748).
clst = DBSCAN(eps=50.0 / 3958.748, min_samples=7, metric="haversine", algorithm="ball_tree")

In [7]:
# The haversine metric expects the columns ordered as [latitude, longitude].
clst.fit(df[["Lat_Rad", "Lon_Rad"]])

DBSCAN(algorithm='ball_tree', eps=0.012630255828357854, leaf_size=30,
       metric='haversine', metric_params=None, min_samples=7, n_jobs=None,
       p=None)

In [8]:
df["label"] = clst.labels_

In [9]:
df["label"].value_counts()

 7     1210
 18     498
 0      301
 22     103
-1       98
 10      73
 16      68
 25      41
 21      41
 4       32
 17      30
 14      28
 3       25
 13      22
 24      21
 15      18
 9       18
 8       16
 6       14
 19      12
 12      11
 11      11
 5       10
 23       9
 20       9
 2        8
 1        7
Name: label, dtype: int64

Finally, plot the resulting clusters on a map using the "scatter_geo" function from plotly.express. The map defaults to the entire world; the "scope" parameter is useful for narrowing down the region plotted in the map. The documentation can be found here:

https://www.plotly.express/plotly_express/#plotly_express.scatter_geo

How many clusters did DBSCAN produce? How many locations were treated as outliers (cluster = -1)?

In [13]:
# answer goes here
px.scatter_geo(data_frame=df, lat="Latitude", lon="Longitude", color="label", scope="asia")
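To answer the two questions above, one possible approach is to count the distinct labels while excluding -1, which DBSCAN uses for noise points. The sketch below uses a small hypothetical label array in place of `clst.labels_`.

```python
import numpy as np

# Hypothetical label array standing in for clst.labels_ / df["label"].
labels = np.array([0, 0, 1, -1, 1, 2, 2, 2, -1, 0])

# DBSCAN marks noise points with -1, so exclude it when counting clusters.
cluster_ids = np.unique(labels[labels != -1])
n_clusters = len(cluster_ids)
n_outliers = int(np.sum(labels == -1))

print(f"clusters: {n_clusters}, outliers: {n_outliers}")
```

Applied to the label counts shown earlier in this notebook (labels 0 through 25, plus -1 with 98 rows), the same logic would report 26 clusters and 98 outliers.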


With eps=2 miles and min_samples=5 on the U.S. data, the previous plot should show a very large number of clusters (400+). This suggests that our definition of a neighborhood may have been too strict. Experiment with other values of eps and min_samples and see how your changes affect the output. Output a map with what you think is the "best" clustering result below.
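One way to structure that experiment is a small grid over eps and min_samples that records the number of clusters and outliers for each combination. The sketch below assumes scikit-learn and numpy are installed and runs on synthetic stand-in points (two tight groups plus one isolated point), not on the Starbucks data. The grid values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_MILES = 3958.748

# Synthetic stand-in data as (latitude, longitude) in degrees:
# two tight groups of 30 points each, plus one isolated point.
rng = np.random.default_rng(0)
group_a = np.array([39.9, 116.4]) + rng.normal(scale=0.01, size=(30, 2))
group_b = np.array([31.2, 121.5]) + rng.normal(scale=0.01, size=(30, 2))
isolated = np.array([[45.0, 100.0]])
coords_rad = np.radians(np.vstack([group_a, group_b, isolated]))

# Maps (eps in miles, min_samples) -> (number of clusters, number of outliers).
results = {}
for eps_miles in (0.1, 2, 50):
    for min_samples in (5, 10):
        model = DBSCAN(
            eps=eps_miles / EARTH_RADIUS_MILES,  # miles converted to radians
            min_samples=min_samples,
            metric="haversine",
            algorithm="ball_tree",
        )
        labels = model.fit_predict(coords_rad)
        n_clusters = len(set(labels.tolist()) - {-1})  # -1 is noise, not a cluster
        n_outliers = int((labels == -1).sum())
        results[(eps_miles, min_samples)] = (n_clusters, n_outliers)
        print(f"eps={eps_miles} mi, min_samples={min_samples}: "
              f"{n_clusters} clusters, {n_outliers} outliers")
```

On real data, the same loop can be pointed at `df[["Lat_Rad", "Lon_Rad"]]`; a very small eps tends to label more points as noise, while a very large one merges distinct areas into a single cluster.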

In [11]:
# answer goes here



