# Capstone Project - Compatible Boroughs

## Introduction/Business Problem

The rapid growth of tech companies and the related job market has drawn professionals from all across America to tech hub cities, and in some cases from one tech hub city to another.  The resulting population increase has brought exciting new venues along with less desirable changes such as rising rents.  When considering a move to one of these cities, choosing between neighborhoods can be difficult.  How does one evaluate a new neighborhood based on data before visiting it?  What information would help narrow down the choices?

In this project, I will explore the use case of a subject living in a high tech hub city wanting to relocate to another city.
I will be considering three metropolitan cities - Austin, Seattle, and San Francisco.  The study will provide a mapping of a borough in one city to similar boroughs in other cities.  To simplify the mapping, boroughs will be grouped in clusters so that similar clusters of boroughs can be mapped to each other.

Initial assumptions are:
<ol>
    <li>The subject wants to find a neighborhood that is similar to their current one but in a different city.</li>
    <li>The subject is impartial towards climate differences.</li>
    <li>The subject is impartial towards ethnicity of city residents.</li>
</ol>


## Data

To assess similarity, I will use many of the attributes gathered from Foursquare during the labs, such as venue types and counts.  This will require API calls to Foursquare followed by conversion to pandas data frames.

Zip-code-to-borough mappings and the geo coordinates of boroughs will be gathered from Wikipedia and/or Google searches.  The format will likely be an HTML table or CSV, which will be converted to data frames and then merged with the Foursquare data.

Boroughs will be clustered using the data gathered so far.

I would also like to consider the demographics of each borough: population density, mean age, total population, etc.  I plan to obtain these from the US Census Bureau.  The likely format will be CSV, and the data will augment the analysis after the clustering has been done.

If time and APIs allow, I would like to consider the hours of operations and popular times for the most popular venues in the borough.  These require premium API calls, so I may not be able to collect the data, or the data will be limited.

# Capstone Project Part 1 - End

# Capstone Project Part 2 - Data Preparation and Analysis

# Data Preparation

I want to compare the neighborhoods of Austin, San Francisco, and Seattle by gathering venue information from Foursquare.

Venue information consists of venue type and frequency of visits.  The frequency of visits will be normalized by dividing by the total frequency of all venues, which allows consolidating all three cities' venue information into a single data frame.
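
The normalization step can be sketched in plain Python. This is only an illustration: it reads "frequency" as the number of venues per category within a neighborhood, and the function name `category_shares` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def category_shares(venues):
    """venues: iterable of (neighborhood, category) pairs.
    Returns {neighborhood: {category: share}}; shares sum to 1 per neighborhood."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    shares = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        shares[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return shares

# Made-up rows for illustration only
sample = [
    ("Alamo Square", "Park"),
    ("Alamo Square", "Bakery"),
    ("Alamo Square", "Park"),
    ("Alamo Square", "Dog Run"),
    ("Castro", "Bakery"),
]
shares = category_shares(sample)
print(shares["Alamo Square"]["Park"])  # 0.5
```

Because every neighborhood's shares sum to 1, a neighborhood with 100 venues and one with 10 venues become directly comparable, regardless of which city they belong to.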

KMeans will be run on the consolidated city information so that similar neighborhoods in different cities can potentially be grouped into the same cluster.
Neighborhoods in the same cluster will be considered similar or compatible neighborhoods.
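
To make the idea concrete, here is a tiny, dependency-free version of the k-means loop (Lloyd's algorithm). It is a sketch for intuition, not the scikit-learn implementation used later in the notebook; the two-column "profiles" and the deterministic choice of starting centroids are assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny Lloyd's algorithm: returns one cluster label per point.
    Centroids start at the first k points so the result is deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: nearest centroid by Euclidean distance
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Hypothetical profiles: (share of cafes, share of parks) per neighborhood
profiles = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels = kmeans(profiles, k=2)
print(labels)  # [0, 1, 0, 1]
```

Neighborhoods 0 and 2 end up in the same cluster even if they belong to different cities, which is exactly the "compatible neighborhoods" notion described above.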

In [79]:
import pandas as pd

In [80]:
# List of cities and URLs at wiki for neighborhoods

CITIES = ['San Francisco', 'Austin', 'Seattle']
neighs = []
#neighborhoods
# https://en.wikipedia.org/wiki/List_of_neighborhoods_in_San_Francisco
# https://en.wikipedia.org/wiki/List_of_Austin_neighborhoods
# https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Seattle

#geo locations - obtain from google


Getting the list of neighborhoods from Wikipedia was a challenge.
Neighborhood data appeared in a different format for each city, including non-tabular formats, which made scraping the data programmatically difficult.  I manually extracted the neighborhood data from each Wikipedia page and saved it to CSV files.

List of URLs to wikipedia:<br/>
<ul>
<li>https://en.wikipedia.org/wiki/List_of_neighborhoods_in_San_Francisco</li>
<li>https://en.wikipedia.org/wiki/List_of_Austin_neighborhoods</li>
<li>https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Seattle</li>
</ul>

As for geocoding, I found that the geocoder library never returns coordinates.  I looked into Google's geocoding APIs and found that they require a paid subscription.

As a workaround, I used Google's online geocoder tool to manually gather the coordinates for each neighborhood into CSV format: https://www.mapdevelopers.com/geocode_tool.php

In [81]:
# import csv from github

def loadCsvData(url) :
    df_neigh = pd.read_csv(url)
    return df_neigh

city_data = loadCsvData("https://raw.githubusercontent.com/echoi11/data-capstone/master/San_Francisco_Geo.csv")

city_data

Unnamed: 0,Row,Neighborhood,Latitude,Longitude
0,1,Alamo Square,37.776360,-122.434689
1,2,Anza Vista,37.780836,-122.443149
2,3,Ashbury Heights,37.765442,-122.445360
3,4,Balboa Park,37.721427,-122.447547
4,5,Balboa Terrace,37.731524,-122.468539
5,6,Bayview,37.728889,-122.392500
6,7,Belden Place,37.728889,-122.403886
7,8,Bernal Heights,37.742986,-122.415804
8,9,Buena Vista,37.806532,-122.420649
9,10,Butchertown,37.744567,-122.395353
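
Since the same analysis has to run for three cities, it may help to tag every row with its city as soon as the CSV is read, so that identically named neighborhoods in different cities do not collide later. The sketch below uses only the standard library and assumes the CSV columns are `Row`, `Neighborhood`, `Latitude`, and `Longitude`, as in the table above; `load_city_rows` is a hypothetical helper, and the Austin row is a made-up placeholder.

```python
import csv
import io

def load_city_rows(csv_text, city):
    """Parse one city's CSV text and tag every row with the city name."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "City": city,
            "Neighborhood": row["Neighborhood"].strip(),
            "Latitude": float(row["Latitude"]),
            "Longitude": float(row["Longitude"]),
        })
    return rows

# Two rows copied from the San Francisco table; the Austin row is invented.
sf_csv = (
    "Row,Neighborhood,Latitude,Longitude\n"
    "1,Alamo Square,37.776360,-122.434689\n"
    "2,Anza Vista,37.780836,-122.443149\n"
)
austin_csv = "Row,Neighborhood,Latitude,Longitude\n1,Example Heights,30.0,-97.0\n"

all_rows = load_city_rows(sf_csv, "San Francisco") + load_city_rows(austin_csv, "Austin")
print(len(all_rows))  # 3
```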


<h1>Segmenting and Clustering Neighborhoods in Cities</h1>

<h2>Part 1 - Gather Neighborhood and Geocode Data</h2>

In many cases, the geocoder tool could not locate the neighborhood.  For these cases, I entered the neighborhood into https://maps.google.com and obtained the Lat/Long from the center of the neighborhood.
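
The "try the tool first, then fall back to hand-collected coordinates" process can be written down as a small function. This is a sketch under assumptions: `geocode_with_fallback`, the `geocode` callable, and the stand-in `fake_geocode` are hypothetical names, and no real geocoding service is called.

```python
def geocode_with_fallback(name, city, geocode, manual_coords):
    """Try an automatic geocoder first, then hand-collected coordinates.

    geocode: callable taking a query string, returning (lat, lng) or None.
    manual_coords: dict mapping neighborhood name to (lat, lng).
    Returns (lat, lng, source), or None when both lookups fail.
    """
    result = geocode("{}, {}".format(name, city))
    if result is not None:
        return (result[0], result[1], "geocoder")
    if name in manual_coords:
        lat, lng = manual_coords[name]
        return (lat, lng, "manual")
    return None

# Stand-in geocoder for illustration: it only "knows" one query.
def fake_geocode(query):
    known = {"Alamo Square, San Francisco": (37.776360, -122.434689)}
    return known.get(query)

manual = {"Butchertown": (37.744567, -122.395353)}

print(geocode_with_fallback("Alamo Square", "San Francisco", fake_geocode, manual))
print(geocode_with_fallback("Butchertown", "San Francisco", fake_geocode, manual))
```

Recording the source next to each coordinate makes it easy to review later which neighborhoods were placed by hand.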


<h2>Part 3 - Explore and Cluster Neighborhoods for Each City</h2>

In [82]:
# TODO for each city

print('The dataframe has {} neighborhoods.'.format(
        city_data.shape[0]
    )
)

The dataframe has 120 neighborhoods.


#### Map of San Francisco with neighborhoods superimposed

In [83]:
#!conda clean --index-cache

In [84]:
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [85]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

In [86]:
address = 'San Francisco, California'

geolocator = Nominatim(user_agent="us_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of San Francisco, California are 37.7790262, -122.4199061.


In [87]:
# create map using latitude and longitude values
map_city = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(city_data['Latitude'], city_data['Longitude'], city_data['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_city)  
    
map_city

#### Setup Foursquare

In [94]:
import requests

CLIENT_ID = 'BRC3YDAQCYI4A4KCDKO2FMWE0RXWQZYFGH13FBXIC3OL3DV3' # your Foursquare ID
CLIENT_SECRET = 'ZX1EIGLU0HWZVRXBI54OM22WLPGTJYRFWSEZT4G5JMHBR1BB' # your Foursquare Secret
VERSION = '20200414' # Foursquare API version
LIMIT = 100

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: BRC3YDAQCYI4A4KCDKO2FMWE0RXWQZYFGH13FBXIC3OL3DV3
CLIENT_SECRET:ZX1EIGLU0HWZVRXBI54OM22WLPGTJYRFWSEZT4G5JMHBR1BB
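
An alternative to keeping the credentials in the notebook is to read them from environment variables and to mask them when printing. The sketch below is one possible approach, not part of the original analysis; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.
    The variable names are hypothetical; fail early if either is missing."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(
            "Set {} before running the notebook".format(missing.args[0])
        ) from None

def mask(secret, visible=4):
    """Show only the first few characters when printing a credential."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Usage with an explicit mapping instead of the real environment
demo_env = {"FOURSQUARE_CLIENT_ID": "ABCDEFGH", "FOURSQUARE_CLIENT_SECRET": "12345678"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID: " + mask(client_id))  # CLIENT_ID: ABCD****
```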


In [95]:
# create a function to get venues for each neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
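
`getNearbyVenues` indexes straight into the response, so a venue without a category or location would stop the whole run with an exception. A more defensive parsing step could look like the sketch below. It assumes the same `items` structure the function above already relies on; `parse_explore_items` is a hypothetical helper, and the sample payload is hand-written from the first rows of the output table, not a real API response.

```python
def parse_explore_items(items, name, lat, lng):
    """Flatten Foursquare 'items' into rows, skipping incomplete venues."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        if not categories or "lat" not in location or "lng" not in location:
            continue  # incomplete record; direct indexing would raise here
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location["lat"], location["lng"],
            categories[0]["name"],
        ))
    return rows

# Hand-written sample shaped like the response fields used above
sample_items = [
    {"venue": {"name": "The Mill",
               "location": {"lat": 37.776425, "lng": -122.43797},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 37.0, "lng": -122.0},
               "categories": []}},
]
rows = parse_explore_items(sample_items, "Alamo Square", 37.77636, -122.434689)
print(len(rows))  # 1
```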

In [96]:
def getNearbyVenuesByCity(city_data):
    venues = getNearbyVenues(names=city_data['Neighborhood'],
                                   latitudes=city_data['Latitude'],
                                   longitudes=city_data['Longitude']
                            )
    return venues

In [97]:
city_venues = getNearbyVenuesByCity(city_data)
#print(city_venues.shape)
city_venues.head()

Alamo Square
Anza Vista
Ashbury Heights
Balboa Park
Balboa Terrace
Bayview
Belden Place
Bernal Heights
Buena Vista
Butchertown
Castro
Cathedral Hill
Cayuga Terrace
China Basin
Chinatown
Civic Center
Clarendon Heights
Cole Valley
Corona Heights
Cow Hollow
Crocker-Amazon
Design District
Diamond Heights
Dogpatch
Dolores Heights
Duboce Triangle
Embarcadero
Eureka Valley
Excelsior
Fillmore
Financial District South
Financial District
Fisherman's Wharf
Forest Hill
Forest Knolls
Glen Park
Golden Gate Heights
Haight-Ashbury
Hayes Valley
Hunters Point
India Basin
Ingleside
Ingleside Terraces
Inner Sunset
Irish Hill
Islais Creek
Jackson Square
Japantown
Jordan Park
Laguna Honda
Lake Street
Lakeshore
Lakeside
Laurel Heights
Lincoln Manor
Little Hollywood
Little Russia
Little Saigon
Lone Mountain
Lower Haight
Lower Nob Hill
Lower Pacific Heights
Marina District
Merced Heights
Merced Manor
Mid-Market
Midtown Terrace
Miraloma Park
Mission Bay
Mission District
Mission Dolores
Mission Terrace
Monterey 

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Alamo Square,37.77636,-122.434689,Alamo Square,37.776045,-122.434363,Park
1,Alamo Square,37.77636,-122.434689,Alamo Square Dog Park,37.775878,-122.43574,Dog Run
2,Alamo Square,37.77636,-122.434689,Painted Ladies,37.77612,-122.433389,Historic Site
3,Alamo Square,37.77636,-122.434689,The Independent,37.775573,-122.437835,Rock Club
4,Alamo Square,37.77636,-122.434689,The Mill,37.776425,-122.43797,Bakery


In [104]:
def venueCountByCity(cityName, venues):
    # total number of venue rows returned for the city
    count = len(venues)
    print("City: {}, venue count: {}".format(cityName, count))
    return count

# check how many venues for each neighborhood
def printVenueCountByNeighborhood(cityName, venues):
    print("City: " + cityName)
    # counting a single column gives one number per neighborhood
    print(venues.groupby('Neighborhood')['Venue'].count())

In [110]:
city_venues.to_csv('my_csv.csv')

In [107]:
venueCountByCity("San Francisco", venues=city_venues)

In [108]:
printVenueCountByNeighborhood("San Francisco", city_venues)

In [23]:
# count unique categories
print('There are {} unique venue categories.'.format(len(city_venues['Venue Category'].unique())))

There are 231 unique venue categories.


## Analyze Each Neighborhood

In [24]:
def oneHotCategoryGroupByNeighborhood(city_venues):
    # one hot encoding
    city_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

    # add neighborhood column back to dataframe
    city_onehot['Neighborhood'] = city_venues['Neighborhood'] 

    # move neighborhood column to the first column
    fixed_columns = [city_onehot.columns[-1]] + list(city_onehot.columns[:-1])
    city_onehot = city_onehot[fixed_columns]

    return city_onehot

def oneHotGroupByNeighborhood(city_onehot):
    ## TODO make sure frequency count is normalized
    city_grouped = city_onehot.groupby('Neighborhood').mean().reset_index()
    return city_grouped

Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
# TODO, normalize the frequency by dividing by total frequency

city_grouped = city_onehot.groupby('Neighborhood').mean().reset_index()
city_grouped
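
Each city will produce a different set of venue categories, so the per-city frames cannot simply be stacked: the columns have to be aligned and missing categories filled with zero. The plain-Python sketch below illustrates that alignment on hypothetical data; `combine_city_profiles` and the sample profiles are made up for the example. With pandas, concatenating the per-city frames with `pd.concat` and then calling `fillna(0)` should give the same alignment.

```python
def combine_city_profiles(profiles_by_city):
    """profiles_by_city: {city: {neighborhood: {category: share}}}.
    Returns (columns, rows); each row is (city, neighborhood, [share per column])
    and categories absent from a neighborhood are filled with 0.0."""
    columns = sorted({
        category
        for neighborhoods in profiles_by_city.values()
        for profile in neighborhoods.values()
        for category in profile
    })
    rows = []
    for city, neighborhoods in profiles_by_city.items():
        for neighborhood, profile in neighborhoods.items():
            rows.append((city, neighborhood, [profile.get(c, 0.0) for c in columns]))
    return columns, rows

# Hypothetical profiles for illustration only
profiles = {
    "San Francisco": {"Alamo Square": {"Park": 0.5, "Bakery": 0.5}},
    "Austin": {"Example Heights": {"Trail": 1.0}},
}
columns, rows = combine_city_profiles(profiles)
print(columns)  # ['Bakery', 'Park', 'Trail']
print(rows[1])  # ('Austin', 'Example Heights', [0.0, 0.0, 1.0])
```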

Print each neighborhood with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in city_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = city_grouped[city_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

# put data into df
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # only 1st, 2nd and 3rd have special suffixes
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = city_grouped['Neighborhood']

for ind in np.arange(city_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

# Cluster Neighborhoods

In [51]:
from sklearn.cluster import KMeans

def runKMeans(city_grouped, kclusters=5):
    # drop the label column so only the category frequencies are clustered
    city_grouped_clustering = city_grouped.drop(columns='Neighborhood')

    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_grouped_clustering)

    # show the cluster labels generated for the first rows of the dataframe
    print(kmeans.labels_[0:10])

    return kmeans

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [52]:
def mergeKMeansToCityData(neighborhoods_venues_sorted, kmeans):

    # add clustering labels
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

    # merge with city_data to add latitude/longitude for each neighborhood
    city_merged = city_data.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

    # sort_values returns a new dataframe, so keep the result
    city_merged = city_merged.sort_values('Cluster Labels')

    print(city_merged) # check the last columns!

    return city_merged

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636,0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Restaurant,Café,Theater,Mexican Restaurant,Shoe Store
1,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494,0,Coffee Shop,Diner,Sushi Restaurant,Gym,Park,Mexican Restaurant,Juice Bar,Italian Restaurant,Hobby Shop,Fried Chicken Joint
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Café,Restaurant,Bubble Tea Shop,Japanese Restaurant,Middle Eastern Restaurant,Cosmetics Shop,Tea Room,Ramen Restaurant
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Café,Hotel,Gastropub,American Restaurant,Cocktail Bar,Italian Restaurant,Seafood Restaurant,Cosmetics Shop,Department Store
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,4,Health Food Store,Trail,Pub,Women's Store,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Restaurant,Bakery,Beer Bar,Cocktail Bar,Seafood Restaurant,Farmers Market,Cheese Shop,Italian Restaurant,Café
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Bubble Tea Shop,Burger Joint,Japanese Restaurant,Salad Place,Ice Cream Shop,Fried Chicken Joint
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564,0,Grocery Store,Café,Park,Gas Station,Coffee Shop,Diner,Baby Store,Restaurant,Italian Restaurant,Athletics & Sports
8,M5H,Downtown Toronto,Richmond / Adelaide / King,43.650571,-79.384568,0,Coffee Shop,Café,Restaurant,Gym,Clothing Store,American Restaurant,Hotel,Deli / Bodega,Thai Restaurant,Salad Place
9,M6H,West Toronto,Dufferin / Dovercourt Village,43.669005,-79.442259,0,Pharmacy,Bakery,Supermarket,Brazilian Restaurant,Café,Recording Studio,Bar,Bank,Middle Eastern Restaurant,Brewery


toronto_merged.sort_values('Cluster Labels')

In [None]:
# Run analysis for all cities.

city_data = loadCsvData("https://raw.githubusercontent.com/echoi11/data-capstone/master/San_Francisco_Geo.csv")
city_venues = getNearbyVenuesByCity(city_data)
city_onehot = oneHotCategoryGroupByNeighborhood(city_venues)
city_grouped = oneHotGroupByNeighborhood(city_onehot)

# TODO: repeat the steps above for Austin and Seattle and
# combine all city_grouped into a single df.
all_city_grouped = city_grouped  # placeholder until the other cities are added

# try different clusters
kmeans = runKMeans(kclusters=5, city_grouped=all_city_grouped)

# might have to use neighborhoods_venues_sorted
all_city_merged = mergeKMeansToCityData(all_city_grouped, kmeans)

# TODO: group by clusters and print out the neighborhoods of each cluster.
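
The remaining step, listing the neighborhoods of each cluster by city, could be sketched as follows. This is an illustration on hypothetical rows rather than the notebook's real results; `neighborhoods_by_cluster` is a made-up helper that expects `(city, neighborhood, cluster_label)` tuples.

```python
from collections import defaultdict

def neighborhoods_by_cluster(rows):
    """rows: iterable of (city, neighborhood, cluster_label).
    Returns {cluster_label: {city: [neighborhood, ...]}} with sorted names."""
    clusters = defaultdict(lambda: defaultdict(list))
    for city, neighborhood, label in rows:
        clusters[label][city].append(neighborhood)
    return {
        label: {city: sorted(names) for city, names in cities.items()}
        for label, cities in clusters.items()
    }

# Hypothetical labelled rows for illustration only
labelled = [
    ("San Francisco", "Hood A", 0),
    ("Austin", "Hood B", 0),
    ("Seattle", "Hood C", 1),
    ("San Francisco", "Hood D", 1),
    ("Austin", "Hood E", 0),
]
grouped = neighborhoods_by_cluster(labelled)
for label in sorted(grouped):
    print("Cluster", label)
    for city, names in grouped[label].items():
        print("  {}: {}".format(city, ", ".join(names)))
```

Reading one cluster across cities gives the mapping the project is after: a resident of "Hood A" in San Francisco would look at "Hood B" and "Hood E" in Austin.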

In [35]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map to visualize the clusters
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
kclusters = 5  # must match the value passed to runKMeans
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# neighborhoods without venues have no cluster label after the join
clustered = all_city_merged.dropna(subset=['Cluster Labels'])

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(clustered['Latitude'], clustered['Longitude'], clustered['Neighborhood'], clustered['Cluster Labels']):
    cluster = int(cluster)  # the join can turn labels into floats
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Examine Clusters

In [56]:
# print out each cluster
for i in range(0, 5) :
    print('==========================================================================')
    print('Cluster ', i)
    print(toronto_merged.loc[toronto_merged['Cluster Labels'] == i, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]])

Cluster  0
             Borough  Cluster Labels 1st Most Common Venue  \
0   Downtown Toronto               0           Coffee Shop   
1   Downtown Toronto               0           Coffee Shop   
2   Downtown Toronto               0        Clothing Store   
3   Downtown Toronto               0           Coffee Shop   
5   Downtown Toronto               0           Coffee Shop   
6   Downtown Toronto               0           Coffee Shop   
7   Downtown Toronto               0         Grocery Store   
8   Downtown Toronto               0           Coffee Shop   
9       West Toronto               0              Pharmacy   
10  Downtown Toronto               0           Coffee Shop   
11      West Toronto               0                   Bar   
12      East Toronto               0      Greek Restaurant   
13  Downtown Toronto               0           Coffee Shop   
14      West Toronto               0                  Café   
15      East Toronto               0  Fast Food Restaurant 

## Summary
#### Cluster 0 has the most neighborhoods and is popular for cafes and restaurants.
#### Clusters 1 through 4 have few neighborhoods and are more popular for parks, trails, and playgrounds.