# Clustering of New York Subway Stations

### Introduction

#### In this notebook, we analyze the subway stations of New York City and cluster them by the types of businesses and points of interest (POI) found around each station.

#### Subway station names and geographic coordinates come from the NYC Open Data website. Foursquare API calls identify the businesses and other venues around each station. Once the data is cleaned and processed, the stations are clustered into meaningful categories. The analysis is intended to help prospective businesses choose a location.

### 1. Import all required libraries

In [1]:
# Import the required libraries
import pandas as pd
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import json
from pandas import json_normalize  # public import path since pandas 1.0
from geopy.geocoders import Nominatim
import folium
%matplotlib inline

### 2. Download relevant data

#### The NYC Open Data website provides the subway stations and their coordinates as a downloadable _csv_ file.

In [2]:
# Initialize path
geo_path = "https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv?accessType=DOWNLOAD"
subway_df = pd.read_csv(geo_path) #Read csv to dataframe

#Remove unwanted columns
subway_df.drop(["URL","OBJECTID","LINE","NOTES"],axis = 1, inplace = True)

#Split the Latitude & Longitude to separate columns
subway_splt = subway_df["the_geom"].str.split("(", n = 2, expand = True)
subway_splt = subway_splt[1].str.split(")", n = 2, expand = True)
subway_splt = subway_splt[0].str.split(" ", n = 2, expand = True)

#Add Latitude & Longitude to the Geo dataframe (the_geom lists longitude first, then latitude)
subway_df["LAT"] = pd.to_numeric(subway_splt[1],errors = "coerce")
subway_df["LON"] = pd.to_numeric(subway_splt[0],errors = "coerce")

#Remove duplicate entries and unwanted columns
#Note: deduplicating on NAME keeps only the first of any distinct stations that share a name
subway_df.drop("the_geom",axis=1,inplace = True)
subway_df.drop_duplicates(subset= "NAME", keep = "first",inplace = True)
subway_df.reset_index(drop = True, inplace = True)

subway_df.head()

,NAME,LAT,LON
0,Astor Pl,40.730054,-73.99107
1,Canal St,40.718803,-74.000193
2,50th St,40.761728,-73.983849
3,Bergen St,40.680862,-73.974999
4,Pennsylvania Ave,40.664714,-73.894886
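
#### The chained `str.split` calls above can also be written as a single regular-expression extraction. The sketch below is only an illustration: it assumes `the_geom` holds text of the form `POINT (lon lat)`, and the two sample rows are made up from the coordinates shown above rather than read from the real file.

```python
import pandas as pd

# Hypothetical sample rows in the assumed "POINT (lon lat)" layout
sample = pd.DataFrame({
    "NAME": ["Astor Pl", "Canal St"],
    "the_geom": ["POINT (-73.99107 40.730054)", "POINT (-74.000193 40.718803)"],
})

# Longitude comes first in the text, latitude second
pattern = r"POINT\s*\(\s*(?P<LON>-?\d+(?:\.\d+)?)\s+(?P<LAT>-?\d+(?:\.\d+)?)\s*\)"
coords = sample["the_geom"].str.extract(pattern).astype(float)

# Reorder so the result has the same columns as subway_df
parsed = pd.concat([sample[["NAME"]], coords[["LAT", "LON"]]], axis=1)
print(parsed)
```

#### Rows that do not match the pattern come back as NaN, which plays the same role as `errors = "coerce"` in the original cell.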


### 3. Map the stations

#### Get the coordinates of _New York_ and create a map highlighting the Subway stations.

In [3]:
#Get the latitude and longitude
sub = 'New York City,NY'
geolocator = Nominatim(user_agent="ny_explorer")
lat_lon = geolocator.geocode(sub)
lat = lat_lon.latitude
lon = lat_lon.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(lat, lon))

The geographical coordinates of New York City are 40.7308619, -73.9871558.


In [4]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[lat, lon], zoom_start=11)

# add markers to map (loop variables are named so they do not overwrite lat/lon of the map centre)
for st_lat, st_lng, label in zip(subway_df['LAT'], subway_df['LON'], subway_df['NAME']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [st_lat, st_lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_newyork)  
    
map_newyork

### 4. Explore Businesses and POI around stations

#### Here we connect to Foursquare to get details of amenities and venues around the subway stations of _New York_. The coordinates of each station are passed to Foursquare to retrieve the nearby venues.

In [5]:
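#Set the Client ID details; credentials are read from environment variables instead of being hard-coded
#(the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are placeholders)
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version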

#### Define functions for getting all venues and venue category type around each subway station

In [6]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # explore results nest the categories under the 'venue.' prefix
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [7]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request; requests URL-encodes the parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT}  # LIMIT is a global that must be set before this function is called
            
        # make the GET request
        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
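
#### `getNearbyVenues` indexes `categories[0]` directly, so a venue returned without any category would raise an `IndexError`. The sketch below shows one way the flattening step could tolerate that case. The helper name `venue_rows` and the sample payload are hypothetical; the payload only imitates the shape of `response['groups'][0]['items']` used above and is not real Foursquare output.

```python
def venue_rows(station, lat, lng, items):
    """Flatten explore-style items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        # Use None when a venue carries no category instead of failing
        category = categories[0]["name"] if categories else None
        location = venue.get("location", {})
        rows.append((station, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hypothetical payload shaped like response['groups'][0]['items']
fake_items = [
    {"venue": {"name": "Cafe A", "location": {"lat": 40.73, "lng": -73.99},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabeled Spot", "location": {"lat": 40.731, "lng": -73.991},
               "categories": []}},
]
rows = venue_rows("Astor Pl", 40.730054, -73.99107, fake_items)
for row in rows:
    print(row)
```

#### Rows with a `None` category can then be dropped or kept explicitly before the one-hot encoding step.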

#### Use the latitude & longitude of each subway station to get the venues around it. Only 10 venues per station are requested. The top 3 venue categories per station are later used to characterize the clusters.

In [8]:
#Get only the first 10
LIMIT = 10
subway_venues = getNearbyVenues(names=subway_df['NAME'],
                                   latitudes=subway_df['LAT'],
                                   longitudes=subway_df['LON']
                                  )

Astor Pl
Canal St
50th St
Bergen St
Pennsylvania Ave
238th St
Cathedral Pkwy (110th St)
Kingston - Throop Aves
65th St
36th St
Delancey St - Essex St
Van Siclen Ave
Norwood Ave
104th-102nd Sts
DeKalb Ave
Beach 105th St
Beach 90th St
Freeman St
Intervale Ave
182nd-183rd Sts
174th-175th Sts
167th St
Mets - Willets Point
Junction Blvd
Flushing - Main St
Buhre Ave
3rd Ave - 138th St
Castle Hill Ave
Brooklyn Bridge - City Hall
Zerega Ave
Grand Central - 42nd St
33rd St
96th St
77th St
Chauncey St
Union St
Elmhurst Ave
Ralph Ave
Pelham Pkwy
Gun Hill Rd
Nereid Ave (238 St)
Franklin Ave
Simpson St
Bronx Park East
Winthrop St
149th St - Grand Concourse
161st St - Yankee Stadium
Lexington Ave - 59th St
E 149th St
Morrison Av - Soundview
Whitlock Ave
St Lawrence Ave
Woodside - 61st St
Far Rockaway - Mott Ave
72nd St
168th St
Kingsbridge Rd
42nd St - Bryant Pk
Prospect Park
55th St
Jamaica - Van Wyck
Kew Gardens - Union Tpke
Sutphin Blvd - Archer Av
Court Sq - 23rd St
67th Ave
Grand Ave - Newtown


### 5. Pre-process & Analyze Venue data

#### The information received from Foursquare will require some cleaning in order to perform analysis and cluster them. First let's check how many unique venue categories are available.

In [9]:
#Check how many venues were returned and how many unique categories they cover
print("Size of subway_venues is: ", subway_venues.shape)
print('There are {} unique categories.'.format(len(subway_venues['Venue Category'].unique())))

Size of subway_venues is:  (3523, 7)
There are 318 unique categories.


#### One-hot encode the venue categories and create a new dataframe that holds the station name alongside the encoded columns.

In [10]:
# one hot encoding
subway_onehot = pd.get_dummies(subway_venues[['Venue Category']], prefix="", prefix_sep="")

# add the station name back as the first column, named "Subway"
subway_onehot.insert(0, "Subway", subway_venues['Neighborhood'])

subway_onehot.head()

,Subway,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Group the data by Subway and take the mean of each one-hot column, which gives the share of a station's venues that fall into each category.

In [11]:
subway_grouped = subway_onehot.groupby('Subway').mean().reset_index()
subway_grouped.head()

,Subway,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,103rd St,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0
1,103rd St - Corona Plaza,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,104th St,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,104th-102nd Sts,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,110th St,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
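
#### To see why the mean of one-hot columns is a per-station category share, the two steps can be traced on a tiny made-up table. The station names and categories below are invented for illustration; only the column names `Neighborhood` and `Venue Category` follow the notebook.

```python
import pandas as pd

# Hypothetical venues: four around "A St", two around "B Ave"
toy = pd.DataFrame({
    "Neighborhood": ["A St", "A St", "A St", "A St", "B Ave", "B Ave"],
    "Venue Category": ["Bar", "Bar", "Park", "Pizza Place", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Subway", toy["Neighborhood"])

# Averaging the indicators per station yields the fraction of venues per category
grouped = onehot.groupby("Subway").mean().reset_index()
print(grouped)
```

#### Each row of the result sums to 1, so a value such as 0.1 in the real table means that one of the 10 venues fetched for that station belongs to that category.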


#### Create a dataframe to hold the top 3 amenities or venues near each subway station.

In [12]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [13]:
#Initialize to get the top 3 venues only.
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Subway']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 'st', 'nd', 'rd' every ordinal takes the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Subway'] = subway_grouped['Subway']

#Loop through the list of subway stations and get venue details for each one.
for ind in np.arange(subway_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(subway_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

,Subway,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,103rd St,Pizza Place,Yoga Studio,Ice Cream Shop
1,103rd St - Corona Plaza,Latin American Restaurant,Coffee Shop,Deli / Bodega
2,104th St,Pharmacy,Pizza Place,Discount Store
3,104th-102nd Sts,Deli / Bodega,Metro Station,Ice Cream Shop
4,110th St,Steakhouse,Latin American Restaurant,Pet Store
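
#### With only 10 venues per station, many categories tie at the same share, and the order among tied categories then depends on the sorting routine. As an alternative sketch (not the method used above), `Series.nlargest` can pick the top categories and keeps tied entries in column order by default. The table and the helper name `top_categories` are hypothetical.

```python
import pandas as pd

# Hypothetical per-station category shares, indexed by station name
grouped = pd.DataFrame({
    "Subway": ["A St", "B Ave"],
    "Bar": [0.5, 0.0],
    "Park": [0.2, 0.7],
    "Pizza Place": [0.3, 0.3],
}).set_index("Subway")


def top_categories(row, n):
    """Return the names of the n categories with the largest share."""
    return list(row.nlargest(n).index)


# One row per station, one column per rank
top = grouped.apply(
    lambda r: pd.Series(top_categories(r, 2), index=["1st", "2nd"]), axis=1
)
print(top)
```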


### 6. Clustering the data

#### Use _k-means_ clustering to group the stations by their venue category shares

In [14]:
# set number of clusters
kclusters = 4
subway_grouped_clustering = subway_grouped.drop(columns='Subway')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(subway_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 2, 1, 1, 2, 1, 0, 0, 2, 0])
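
#### The number of clusters is fixed at 4 above. One common way to sanity-check such a choice is to compare silhouette scores over a range of k. The sketch below runs on synthetic points, not on the subway data; with the real data, `subway_grouped_clustering` would take the place of `X`. The three well-separated blobs are an assumption made so the example is easy to follow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score gives the most compact, best separated clusters
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

#### On sparse, high-dimensional one-hot averages the scores are usually much lower and flatter than on this toy data, so the result is a hint rather than a definitive answer.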

#### Add the cluster labels and merge the venue & subway station information. Stations that do not have a cluster label are dropped.

In [15]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# merge on the station name to add latitude/longitude for each station;
# subway_grouped is sorted by name, so a positional join would pair stations with the wrong rows.
# The inner merge also drops stations that have no cluster label.
subway_merged = subway_df.merge(neighborhoods_venues_sorted, left_on="NAME", right_on="Subway", how="inner")
subway_merged.head()
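
#### `subway_grouped` comes out of `groupby` sorted by station name, while `subway_df` keeps the order of the downloaded file, so the two tables do not share a row order. The toy sketch below contrasts aligning by row position (`join`) with aligning by station name (`merge`). The two stations are taken from the table above, but the cluster labels are made up.

```python
import pandas as pd

# File order, as in subway_df
stations = pd.DataFrame({"NAME": ["Canal St", "Astor Pl"],
                         "LAT": [40.718803, 40.730054]})

# Alphabetical order, as produced by groupby; labels are hypothetical
summary = pd.DataFrame({"Subway": ["Astor Pl", "Canal St"],
                        "Cluster Labels": [3, 2]})

# join aligns on the row index, so rows are paired by position
by_position = stations.join(summary)

# merge aligns on the station name
by_name = stations.merge(summary, left_on="NAME", right_on="Subway", how="inner")

mismatched = (by_position["NAME"] != by_position["Subway"]).sum()
print(by_position)
print(by_name)
print("Rows paired with the wrong station by position:", mismatched)
```

#### Comparing the `NAME` and `Subway` columns of the merged table, as done for `mismatched` here, is a quick check that every station carries its own cluster label.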


#### Create a map with the clusters

In [16]:
# create map centred on the geocoded New York City coordinates
map_clusters = folium.Map(location=[lat_lon.latitude, lat_lon.longitude], zoom_start=11)

# set color scheme for the clusters (list index = cluster label 0-3)
rainbow = ["green","blue","yellow","red"]

# add markers to the map
for st_lat, st_lon, poi, cluster in zip(subway_merged['LAT'], subway_merged['LON'], 
                                  subway_merged['NAME'], subway_merged['Cluster Labels']):
    # labels are 0-based; show them 1-based to match the cluster headings below
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster) + 1), parse_html=True)
    folium.CircleMarker(
        [st_lat, st_lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

### 7. Examine and Name the clusters

#### Review each cluster group by doing the following:
1. Group the data by the _1st Most Common Venue_ and find the count
2. Remove all columns except the venue name and the number of stations in the cluster that share it
3. Sort the data in descending order of count
4. Pick the top 3 venues to determine how that cluster should be categorized.
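
#### The four steps above can be sketched as one reusable helper built on `value_counts`, which counts and sorts in a single call. This is an alternative illustration, not the code used in the cells below; the helper name `top_venues_in_cluster`, its parameters and the toy table are hypothetical, while the column names follow `subway_merged`.

```python
import pandas as pd


def top_venues_in_cluster(df, label, n=3, min_count=2):
    """Count the 1st Most Common Venue within one cluster, largest first."""
    in_cluster = df.loc[df["Cluster Labels"] == label, "1st Most Common Venue"]
    counts = in_cluster.value_counts()  # already sorted in descending order
    # Ignore venue types that occur fewer than min_count times
    return counts[counts >= min_count].head(n)


# Hypothetical stand-in for subway_merged
toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 0, 1, 1],
    "1st Most Common Venue": ["Bar", "Bar", "Park", "Bar",
                              "Pizza Place", "Pizza Place"],
})
summary = top_venues_in_cluster(toy, 0)
print(summary)
```

#### Calling the helper once per label, for example in a loop over `range(kclusters)`, would replace the four near-identical cells with a single one.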

#### Cluster 1

In [17]:
#Extract only the respective cluster data and group it by 1st Most Common Venue
subway_cluster1 = subway_merged[subway_merged["Cluster Labels"]==0].groupby(by="1st Most Common Venue",
                                                                           as_index=False).count()
#Rename columns and drop the ones not needed
subway_cluster1.rename(columns = {"NAME":"Frequency"},inplace = True)
subway_cluster1.drop(["LAT","LON","Cluster Labels","Subway","2nd Most Common Venue","3rd Most Common Venue"],
                     axis=1,inplace=True)

#Remove the venues that have single occurrences as we are looking for venues that are high in number
subway_cluster1 = subway_cluster1[subway_cluster1.Frequency!=1]
subway_cluster1.set_index("Frequency",drop = True,inplace = True)

#Sort the data in descending order
subway_cluster1.sort_index(ascending=False,inplace = True)

subway_cluster1.head()

Frequency,1st Most Common Venue
15,Coffee Shop
14,Italian Restaurant
10,Park
9,Japanese Restaurant
5,Bar


#### The top 5 venues above indicate that this cluster is dominated by Coffee Shops, Italian Restaurants & Parks. The stations marked in <font color="green">green</font> on the map fall into this cluster.

#### Cluster 2

In [18]:
#Extract only the respective cluster data and group it by 1st Most Common Venue
subway_cluster2 = subway_merged[subway_merged["Cluster Labels"]==1].groupby(by="1st Most Common Venue",
                                                                           as_index=False).count()
#Rename columns and remove what is not needed
subway_cluster2.rename(columns = {"NAME":"Frequency"},inplace = True)
subway_cluster2.drop(["LAT","LON","Cluster Labels","Subway","2nd Most Common Venue","3rd Most Common Venue"],
                     axis=1,inplace=True)

#Remove the venues that have single occurrences as we are looking for venues that are high in number
subway_cluster2 = subway_cluster2[subway_cluster2.Frequency!=1]
subway_cluster2.set_index("Frequency",drop = True,inplace = True)

#Sort in descending order
subway_cluster2.sort_index(ascending=False,inplace = True)

subway_cluster2.head()

Frequency,1st Most Common Venue
10,Discount Store
6,Caribbean Restaurant
6,Pharmacy
6,Pizza Place
5,Fried Chicken Joint


#### Discount Stores top this cluster, followed by Caribbean Restaurants, Pharmacies and Pizza Places. The stations in this cluster are shown in <font color="blue">blue</font> on the map.

#### Cluster 3

In [19]:
#Extract only the respective cluster data and group it by 1st Most Common Venue
subway_cluster3 = subway_merged[subway_merged["Cluster Labels"]==2].groupby(by="1st Most Common Venue",
                                                                           as_index=False).count()
#Rename columns and drop the unwanted ones
subway_cluster3.rename(columns = {"NAME":"Frequency"},inplace = True)
subway_cluster3.drop(["LAT","LON","Cluster Labels","Subway","2nd Most Common Venue","3rd Most Common Venue"],
                     axis=1,inplace=True)

#Remove the venues that have single occurrences as we are looking for venues that are high in number
subway_cluster3 = subway_cluster3[subway_cluster3.Frequency!=1]
subway_cluster3.set_index("Frequency",drop = True,inplace = True)

#Sort in descending order
subway_cluster3.sort_index(ascending=False,inplace = True)

subway_cluster3.head()

Frequency,1st Most Common Venue
18,Bar
12,Mexican Restaurant
4,Coffee Shop
3,Bakery
3,Latin American Restaurant


#### Cluster 3 stands out, with Bars & Mexican Restaurants topping the list by a wide margin. The stations in this cluster are shown in <font color="yellow">yellow</font> on the map.

#### Cluster 4

In [20]:
#Extract only the respective cluster data and group it by 1st Most Common Venue
subway_cluster4 = subway_merged[subway_merged["Cluster Labels"]==3].groupby(by="1st Most Common Venue",
                                                                           as_index=False).count()

#Rename columns and drop unwanted columns
subway_cluster4.rename(columns = {"NAME":"Frequency"},inplace = True)
subway_cluster4.drop(["LAT","LON","Cluster Labels","Subway","2nd Most Common Venue","3rd Most Common Venue"],
                     axis=1,inplace=True)

#Remove the venues that have single occurrences as we are looking for venues that are high in number
subway_cluster4 = subway_cluster4[subway_cluster4.Frequency!=1]
subway_cluster4.set_index("Frequency",drop = True,inplace = True)

#Sort in descending order
subway_cluster4.sort_index(ascending=False,inplace = True)

subway_cluster4.head()

Frequency,1st Most Common Venue
19,Pizza Place
5,Caribbean Restaurant
4,Café
3,Bar
3,Coffee Shop


#### Cluster 4 is dominated by Pizza Places, which are the most common venue at 19 of its stations. These stations are shown in <font color="red">red</font> on the map.

### Conclusion

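#### Based on the _1st Most Common Venue_, the 4 clusters share several venue types: Coffee Shops, Bars, Pizza Places and Caribbean Restaurants each appear in more than one cluster. What separates the clusters is which venue type dominates: Coffee Shops and Italian Restaurants in Cluster 1, Discount Stores in Cluster 2, Bars and Mexican Restaurants in Cluster 3, and Pizza Places in Cluster 4.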