## Data Science Capstone Project
#### Submitted by: Ajay Sharma

## Table of contents
* [Introduction: problem description](#Introduction)
* [Target Audience](#TargetAudience)
* [Data](#Data)
* [Methodology](#Methodology)
* [Explore Data](#Analysis)
* [Results and Discussion](#ResultsAndDiscussion)
* [Conclusion](#Conclusion)

## 1. Introduction <a name="Introduction"></a>

In this assignment, we try to find the neighborhoods in Toronto that are most similar to the neighborhoods of Brooklyn, New York, in terms of cuisines, restaurants, and other places that serve food. Brooklyn is known for its cultural, linguistic, and ethnic diversity. It has many tourist attractions, such as museums, breweries, parks, historic sites, and events (the Dyker Heights Christmas Lights, the Mermaid Parade, color runs, the hot dog eating competition, and the Labor Day Carnival, to name a few). It also offers numerous ethnic cuisines to people living in and exploring Brooklyn all year long.

I lived in Brooklyn for over 6 years and always loved living there. Now that I have moved to Canada, I would like to find places in Toronto that offer a similar experience, at least in terms of food. In this assignment, we would like to:

1. Find the top 5 venue categories for each neighborhood and borough.
2. Find the neighborhoods in the Toronto area that are the closest match to the Brooklyn neighborhoods of New York in terms of food.
    

## 2. Target Audience <a name="TargetAudience"></a>
This is a personal project, and the target audience is mostly me and my friends. However, anyone looking for Brooklyn-like neighborhoods in Toronto can also use this data to explore the city or to identify popular food business opportunities there.

## 3. Data <a name="Data"></a>
To solve this problem, we first download the New York and Toronto neighborhood datasets. We fetch the New York City data from IBM Cloud. It covers all 5 boroughs, including Brooklyn, so we filter it to keep only the Brooklyn data. The data includes the latitude and longitude of each neighborhood.

The Toronto data is not as readily available as the New York City data. We use the Toronto neighborhood data listed on the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. We use Python web-scraping libraries to extract the required data from that page and format it for use in our project. After cleaning the data, we use Python geolocation libraries to add latitude and longitude information for each neighborhood. Once all the Toronto data is compiled, we merge it with the Brooklyn neighborhood data and use the merged dataset for clustering and analysis.

## 4. Methodology <a name="Methodology"></a>

#### Download and prepare data

In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import json
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import pgeocode
import requests
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

#! conda install beautifulsoup4
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

print('Libraries imported.')

Libraries imported.


In [2]:
# function to display data on map
def show_on_map(data, latitude, longitude, zoom_start):
    # create a map centered on the given latitude and longitude
    data_map = folium.Map(location=[latitude, longitude], zoom_start=zoom_start)

    # add markers to map
    for lat, lng, borough, neighborhood in zip(data['Latitude'], data['Longitude'], data['Borough'], data['Neighborhood']):
        label = '{}, {}'.format(neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7).add_to(data_map)

    return data_map

#### Foursquare API configurations
Set up the configuration for querying venue information through the Foursquare API.

In [3]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # Foursquare ID (placeholder; keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # Foursquare API limit value
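
Hard-coded API credentials are easy to leak when a notebook is shared. The following is a minimal sketch of one alternative, assuming the credentials are stored in environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical and not part of the original notebook.

```python
import os

def load_foursquare_credentials(env=None):
    """Read the Foursquare credentials from a mapping (os.environ by default).

    The variable names are illustrative assumptions.
    """
    env = os.environ if env is None else env
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment.")
    return client_id, client_secret

# usage with an explicit mapping, so the sketch runs without real credentials
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id",
            "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print(demo_id, demo_secret)
```

In the notebook itself, calling the helper with no argument would read from `os.environ` instead of the demo mapping.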

#### Download New York city data and filter Brooklyn data
Because we are only interested in comparing Brooklyn neighborhoods with Toronto neighborhoods, we filter the New York data down to the Brooklyn borough and use only that for further analysis.

In [6]:
#!conda install -c conda-forge wget --yes
!pwd
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

# read data from downloaded json file
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

# extract relevant data: each feature describes one neighborhood
ny_nh_data = newyork_data['features']

# loop through data and collect the necessary values, one dict per neighborhood
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first)
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
rows = []
for data in ny_nh_data:
    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates'][:2]
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# create dataframe to hold data
ny_neighborhoods = pd.DataFrame(rows, columns=column_names)

# filter brooklyn data
brooklyn_data = ny_neighborhoods[ny_neighborhoods['Borough'] == "Brooklyn"]
brooklyn_data.shape

/Users/ajaysharma/Desktop/data-analysis/Coursera_Capstone_ML
Data downloaded!


(70, 4)

#### Show Brooklyn data on Map
Let us visualize the Brooklyn neighborhood data on a map.

In [7]:
geolocator = Nominatim(user_agent="city_explorer")

brooklyn_address = 'Brooklyn, NY'
brooklyn_location = geolocator.geocode(brooklyn_address)
brooklyn_latitude = brooklyn_location.latitude
brooklyn_longitude = brooklyn_location.longitude
print('The geographical coordinates of Brooklyn are {}, {}.'.format(brooklyn_latitude, brooklyn_longitude))

# show data on map
show_on_map(brooklyn_data, brooklyn_latitude, brooklyn_longitude, 10)

The geographical coordinates of Brooklyn are 40.6501038, -73.9495823.


#### Function to scrape and parse Toronto data
We scrape and parse the Toronto neighborhood data from Wikipedia and then add latitude and longitude information to it.

In [8]:
def parse_data():
    # download the Wikipedia page and parse its first table, cell by cell
    page = urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
    html = page.read().decode("utf-8")
    soup = BeautifulSoup(html, 'html.parser')
    table_data = soup.find("table").find_all("td")
    parsed_data = []
    for data in table_data:
        postal_code_tag = data.find("b")
        borough_data = data.find("span")
        # skip cells without a postal code or a borough entry
        if postal_code_tag is None or borough_data is None:
            continue
        postal_code = postal_code_tag.text
        if postal_code and borough_data.text != "Not assigned":
            # the text before "(" is the borough, the text inside "(...)" lists the neighborhoods
            borough_data_vals = re.split('[(]', borough_data.text)
            borough = borough_data_vals[0]
            neighborhoods = borough_data_vals[1].replace(" / ", ",").replace(")", "")
            parsed_data.append([postal_code, borough, neighborhoods])
    return parsed_data
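
The string handling inside `parse_data` can be tried out without downloading the page. The sketch below isolates that step in a hypothetical helper, `split_borough_cell`, and runs it on made-up cell texts shaped like the scraped ones. Unlike the original, it strips surrounding whitespace and falls back to the borough name when a cell has no parentheses; both are assumptions added for illustration.

```python
import re

def split_borough_cell(cell_text):
    """Split 'Borough(Neighborhood A / Neighborhood B)' into (borough, neighborhoods)."""
    parts = re.split('[(]', cell_text, maxsplit=1)
    borough = parts[0].strip()
    # assumption: a cell without parentheses has no neighborhood list,
    # so the borough name is reused as the neighborhood
    if len(parts) == 1:
        return borough, borough
    neighborhoods = parts[1].replace(" / ", ",").replace(")", "")
    return borough, neighborhoods

# made-up cell texts in the same shape as the scraped table
print(split_borough_cell("Downtown Toronto(Regent Park / Harbourfront)"))
print(split_borough_cell("North York(Parkwoods)"))
```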

In [9]:
data = parse_data()
toronto_neighborhoods = pd.DataFrame(data=data)
headers = ["PostalCode", "Borough", "Neighborhood"]
toronto_neighborhoods.columns = headers
toronto_neighborhoods.head()

|   | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park,Harbourfront |
| 3 | M6A | North York | Lawrence Manor,Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |


In [10]:
# add latitude and longitude info to toronto neighborhood data
# create the geocoder once instead of once per row
nomi = pgeocode.Nominatim("ca")

def get_lat(row):
    # look up the latitude for the row's postal code
    return nomi.query_postal_code(row['PostalCode']).latitude

def get_lon(row):
    # look up the longitude for the row's postal code
    return nomi.query_postal_code(row['PostalCode']).longitude

toronto_neighborhoods["Latitude"] = toronto_neighborhoods.apply(get_lat, axis=1)
toronto_neighborhoods["Longitude"] = toronto_neighborhoods.apply(get_lon, axis=1)
toronto_neighborhoods.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods | 43.7545 | -79.33 |
| 1 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 2 | M5A | Downtown Toronto | Regent Park,Harbourfront | 43.6555 | -79.3626 |
| 3 | M6A | North York | Lawrence Manor,Lawrence Heights | 43.7223 | -79.4504 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.6641 | -79.3889 |


In [11]:
# pgeocode returns no coordinates for postal code M7R, so we fill the missing values manually.
toronto_neighborhoods['Latitude'] = toronto_neighborhoods['Latitude'].replace(np.nan, 43.754500)
toronto_neighborhoods['Longitude'] = toronto_neighborhoods['Longitude'].replace(np.nan, -79.330000)

# drop the PostalCode column, it is not needed for further analysis.
if 'PostalCode' in toronto_neighborhoods.columns:
    del toronto_neighborhoods['PostalCode']

toronto_neighborhoods.shape

(103, 4)
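
Before filling missing coordinates, it can help to list which postal codes are affected. The following is a small sketch on a made-up two-row frame; the fallback values are the ones used in the cell above, and the names `df`, `missing`, and `fallback` are illustrative only.

```python
import numpy as np
import pandas as pd

# toy data: the coordinates for the second postal code are missing
df = pd.DataFrame({
    "PostalCode": ["M3A", "M7R"],
    "Latitude": [43.7545, np.nan],
    "Longitude": [-79.33, np.nan],
})

# list the postal codes that still need coordinates
missing = df[df[["Latitude", "Longitude"]].isna().any(axis=1)]
print(missing["PostalCode"].tolist())

# fill only the missing cells, column by column, with a manually chosen fallback
fallback = {"Latitude": 43.7545, "Longitude": -79.33}
df = df.fillna(value=fallback)
print(df)
```

Passing a dictionary to `fillna` fills each column with its own value, which avoids one `replace` call per column.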

#### Show Toronto data on map
Let us visualize the Toronto neighborhood data on a map.

In [12]:
toronto_address = 'Toronto, Ontario'
toronto_location = geolocator.geocode(toronto_address)
toronto_latitude = toronto_location.latitude
toronto_longitude = toronto_location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(toronto_latitude, toronto_longitude))

# show data on map
show_on_map(toronto_neighborhoods, toronto_latitude, toronto_longitude, 10)

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


#### Merge the data sets for Toronto and Brooklyn
We will now merge the two datasets for further analysis and visualizations.

In [139]:
merged_data = pd.concat([brooklyn_data, toronto_neighborhoods], ignore_index=True)
merged_data.shape

(173, 4)

In [140]:
merged_data.head()

|   | Borough | Neighborhood | Latitude | Longitude |
| --- | --- | --- | --- | --- |
| 0 | Brooklyn | Bay Ridge | 40.625801 | -74.030621 |
| 1 | Brooklyn | Bensonhurst | 40.611009 | -73.99518 |
| 2 | Brooklyn | Sunset Park | 40.645103 | -74.010316 |
| 3 | Brooklyn | Greenpoint | 40.730201 | -73.954241 |
| 4 | Brooklyn | Gravesend | 40.59526 | -73.973471 |


In [141]:
# show data on map
show_on_map(merged_data, toronto_latitude, toronto_longitude, 6)

#### Use Foursquare API to add venues information

In [16]:
# get nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    try:
        venues_list=[]
        for name, lat, lng in zip(names, latitudes, longitudes):
            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                radius, 
                LIMIT)

            # make the GET request
            results = requests.get(url).json()
            if not results["response"]:
                print(results)
                return venues_list
            results = results["response"]['groups'][0]['items']

            # return only relevant information for each nearby venue that serves food items
            for v in results:
                prefix = v['venue']['categories'][0]['icon']['prefix']
                if prefix and '/food/' in prefix:
                    venues_list.append([
                        name, 
                        lat, 
                        lng, 
                        v['venue']['name'], 
                        v['venue']['location']['lat'], 
                        v['venue']['location']['lng'],  
                        v['venue']['categories'][0]['name']])

        return venues_list
    except Exception as e:
        print(e)
        return venues_list
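
The food filter in `getNearbyVenues` relies on the icon prefix of a venue's first category containing `/food/`. The sketch below isolates that check in a hypothetical helper, `is_food_venue`, and adds a guard for venues without categories, which is an assumption not present in the original loop. The sample items and their URLs are made up and only mimic the shape of `response['groups'][0]['items']`.

```python
def is_food_venue(item):
    """Return True if the venue's first category icon prefix contains '/food/'."""
    categories = item.get('venue', {}).get('categories', [])
    if not categories:
        # no category information: treat as not food instead of raising IndexError
        return False
    prefix = categories[0].get('icon', {}).get('prefix') or ''
    return '/food/' in prefix

# made-up items shaped like entries of response['groups'][0]['items']
pizza = {'venue': {'name': 'Example Pizza',
                   'categories': [{'name': 'Pizza Place',
                                   'icon': {'prefix': 'https://example.com/img/categories/food/pizza_'}}]}}
park = {'venue': {'name': 'Example Park',
                  'categories': [{'name': 'Park',
                                  'icon': {'prefix': 'https://example.com/img/categories/parks_outdoors/park_'}}]}}
uncategorized = {'venue': {'name': 'Unknown', 'categories': []}}

print([is_food_venue(v) for v in (pizza, park, uncategorized)])
```

With a guard like this, a single venue without categories would be skipped rather than ending the whole download through the `except` branch.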

In [17]:
places_venues = getNearbyVenues(names=merged_data['Neighborhood'], latitudes=merged_data['Latitude'],longitudes=merged_data['Longitude'])

In [28]:
print("Total food venues : {}".format(len(places_venues)))

Total food venues : 2631


In [25]:
if places_venues:
    columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    nearby_venues = pd.DataFrame(places_venues, columns=columns)
else:
    raise Exception("Error while processing nearby venue data.")


In [26]:
nearby_venues.head()

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Bay Ridge | 40.625801 | -74.030621 | Bagel Boy | 40.627896 | -74.029335 | Bagel Shop |
| 1 | Bay Ridge | 40.625801 | -74.030621 | Georgian Dream Cafe and Bakery | 40.625586 | -74.030196 | Caucasian Restaurant |
| 2 | Bay Ridge | 40.625801 | -74.030621 | Leo's Casa Calamari | 40.6242 | -74.030931 | Pizza Place |
| 3 | Bay Ridge | 40.625801 | -74.030621 | Pegasus Cafe | 40.623168 | -74.031186 | Breakfast Spot |
| 4 | Bay Ridge | 40.625801 | -74.030621 | Ho' Brah Taco Joint | 40.62296 | -74.031371 | Taco Place |


In [29]:
nearby_venues.shape

(2631, 7)

In [31]:
print('There are {} unique venue categories.'.format(len(nearby_venues['Venue Category'].unique())))

There are 116 unique venue categories.


## 5. Explore Data <a name="Analysis"></a>
Now we explore the data to answer our first question:

1. Find the top 5 venue categories in the data set.

We look at the top venue categories for each neighborhood as well as for each borough.

In [88]:
# one hot encoding
nearby_venues_onehot = pd.get_dummies(nearby_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
nearby_venues_onehot['Neighborhood'] = nearby_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [nearby_venues_onehot.columns[-1]] + list(nearby_venues_onehot.columns[:-1])
nearby_venues_onehot = nearby_venues_onehot[fixed_columns]

nearby_venues_onehot.head()

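|   | Neighborhood | Afghan Restaurant | American Restaurant | Arepa Restaurant | Argentinian Restaurant | Asian Restaurant | BBQ Joint | Bagel Shop | Bakery | Belgian Restaurant | ... | Thai Restaurant | Theme Restaurant | Tibetan Restaurant | Turkish Restaurant | Varenyky restaurant | Vegetarian / Vegan Restaurant | Vietnamese Restaurant | Wine Bar | Wings Joint | Yemeni Restaurant |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Bay Ridge | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Bay Ridge | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Bay Ridge | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Bay Ridge | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Bay Ridge | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |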


#### Add borough column to dataset


In [89]:
# add borough information to dataset
nd = merged_data[['Neighborhood', 'Borough']]
nearby_venues_onehot = nearby_venues_onehot.merge(nd, how='inner', on = ['Neighborhood'])

#### Group the data by average frequency of each category of food places in a neighborhood

In [103]:
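# numeric_only=True skips the non-numeric Borough column (recent pandas versions raise an error otherwise)
nearby_venues_grouped = nearby_venues_onehot.groupby('Neighborhood').mean(numeric_only=True).reset_index()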
nearby_venues_grouped.shape

(143, 117)
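
To see why the grouped mean can be read as a frequency, here is a minimal sketch on made-up data: the mean of a 0/1 indicator column within a neighborhood equals the share of that neighborhood's venues in the category. The toy frame and the names `venues`, `onehot`, and `freq` are illustrative only.

```python
import pandas as pd

# toy venue list: neighborhood "A" has two pizza places and one bakery
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Pizza Place", "Pizza Place", "Bakery", "Bakery"],
})

# one indicator column per category (dtype=int keeps the columns as 0/1 integers)
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot["Neighborhood"] = venues["Neighborhood"]

# the mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby("Neighborhood").mean(numeric_only=True)
print(freq.round(2))
```

Each row of `freq` sums to 1, so neighborhoods with very few venues (for example, only two) produce coarse values such as 0.5.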


#### Explore Top 5 venue categories for each Neighborhood

In [92]:
num_top_venues = 5

for hood in nearby_venues_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = nearby_venues_grouped[nearby_venues_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt ----
                       venue  freq
0  Latin American Restaurant   0.5
1             Breakfast Spot   0.5
2          Afghan Restaurant   0.0
3               Noodle House   0.0
4         Russian Restaurant   0.0


----Alderwood,Long Branch----
               venue  freq
0        Coffee Shop  0.33
1        Pizza Place  0.33
2     Sandwich Place  0.33
3  Afghan Restaurant  0.00
4       Noodle House  0.00


----Bath Beach----
                  venue  freq
0            Donut Shop  0.07
1  Fast Food Restaurant  0.07
2    Italian Restaurant  0.07
3    Chinese Restaurant  0.07
4           Pizza Place  0.07


----Bathurst Manor,Wilson Heights,Downsview North----
                       venue  freq
0              Deli / Bodega  0.17
1                Coffee Shop  0.17
2   Mediterranean Restaurant  0.17
3  Middle Eastern Restaurant  0.17
4        Fried Chicken Joint  0.17


----Bay Ridge----
                 venue  freq
0          Pizza Place  0.11
1   Italian Restaurant  0.09
2  

4     Sandwich Place  0.00


----Fairview,Henry Farm,Oriole----
                  venue  freq
0  Fast Food Restaurant  0.18
1   Japanese Restaurant  0.14
2            Restaurant  0.14
3           Coffee Shop  0.14
4             Juice Bar  0.09


----First Canadian Place,Underground city----
                 venue  freq
0          Coffee Shop  0.16
1                 Café  0.09
2           Restaurant  0.08
3  Japanese Restaurant  0.06
4           Steakhouse  0.05


----Flatbush----
                  venue  freq
0  Caribbean Restaurant  0.13
1             Juice Bar  0.13
2           Coffee Shop  0.13
3        Sandwich Place  0.07
4                Bakery  0.07


----Flatlands----
                  venue  freq
0  Caribbean Restaurant  0.25
1   Fried Chicken Joint  0.25
2  Fast Food Restaurant  0.25
3    Chinese Restaurant  0.12
4         Deli / Bodega  0.12


----Fort Greene----
                 venue  freq
0   Italian Restaurant  0.16
1             Wine Bar  0.06
2  American Restaurant  0.

4            Bakery  0.12


----Ontario Provincial Government----
                venue  freq
0  Italian Restaurant  0.12
1    Sushi Restaurant  0.12
2   Indian Restaurant  0.06
3       Burrito Place  0.06
4  Mexican Restaurant  0.06


----Paerdegat Basin----
                     venue  freq
0                     Food   0.5
1         Asian Restaurant   0.5
2  New American Restaurant   0.0
3       Russian Restaurant   0.0
4               Restaurant   0.0


----Park Slope----
                venue  freq
0         Coffee Shop  0.16
1        Burger Joint  0.11
2         Pizza Place  0.08
3          Bagel Shop  0.08
4  Italian Restaurant  0.05


----Parkdale,Roncesvalles----
                         venue  freq
0  Eastern European Restaurant  0.11
1                  Coffee Shop  0.11
2                       Bakery  0.11
3             Sushi Restaurant  0.11
4                         Café  0.07


----Parkview Hill,Woodbine Gardens----
                     venue  freq
0              Pizza Plac

#### Group the data by average frequency of each category of food places in a Borough

In [102]:
# group by borough; numeric_only=True skips the non-numeric Neighborhood column
nearby_venues_grouped_b = nearby_venues_onehot.groupby('Borough').mean(numeric_only=True).reset_index()
nearby_venues_grouped_b.shape

(15, 117)


#### Explore Top 5 venue categories for each Borough

In [94]:
num_top_venues = 5

for borough in nearby_venues_grouped_b['Borough']:
    print("----"+borough+"----")
    temp = nearby_venues_grouped_b[nearby_venues_grouped_b['Borough'] == borough].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Brooklyn----
                venue  freq
0         Pizza Place  0.09
1         Coffee Shop  0.07
2              Bakery  0.05
3       Deli / Bodega  0.05
4  Italian Restaurant  0.05


----Central Toronto----
                venue  freq
0         Coffee Shop  0.13
1      Sandwich Place  0.10
2                Café  0.10
3  Italian Restaurant  0.08
4   Indian Restaurant  0.05


----Downtown Toronto----
                 venue  freq
0          Coffee Shop  0.16
1                 Café  0.10
2           Restaurant  0.05
3  Japanese Restaurant  0.05
4     Sushi Restaurant  0.03


----Downtown TorontoStn A PO Boxes25 The Esplanade----
                venue  freq
0         Coffee Shop  0.26
1          Restaurant  0.11
2                Café  0.05
3  Italian Restaurant  0.05
4       Deli / Bodega  0.05


----East Toronto----
                venue  freq
0    Greek Restaurant  0.18
1  Italian Restaurant  0.07
2         Coffee Shop  0.07
3      Ice Cream Shop  0.07
4          Restaurant  0.07


--

In [104]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
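
A quick way to check `return_most_common_venues` is to call it on a single hand-made row. The sketch below repeats the function so that it runs on its own; the row values are made up.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighborhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# made-up row: one name followed by category frequencies
row = pd.Series({"Neighborhood": "Toy Hood",
                 "Bakery": 0.2,
                 "Pizza Place": 0.5,
                 "Café": 0.3,
                 "Deli / Bodega": 0.0})

print(list(return_most_common_venues(row, 3)))
```

Note that the function always returns `num_top_venues` names. When a neighborhood has fewer distinct categories than that, the remaining slots are filled with zero-frequency categories, which is why a category with frequency 0.0 can show up as a "most common venue" for sparse neighborhoods such as Agincourt.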

#### Most common venues by Neighborhood

In [106]:
num_top_venues = 5
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
nearby_venues_grouped_sorted = pd.DataFrame(columns=columns)
nearby_venues_grouped_sorted['Neighborhood'] = nearby_venues_grouped['Neighborhood']
for ind in np.arange(nearby_venues_grouped.shape[0]):
    nearby_venues_grouped_sorted.iloc[ind, 1:] = return_most_common_venues(nearby_venues_grouped.iloc[ind, :], num_top_venues)
nearby_venues_grouped_sorted.head()

|   | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Agincourt | Latin American Restaurant | Breakfast Spot | Yemeni Restaurant | Fast Food Restaurant | Deli / Bodega |
| 1 | Alderwood,Long Branch | Sandwich Place | Coffee Shop | Pizza Place | Yemeni Restaurant | Cuban Restaurant |
| 2 | Bath Beach | Sushi Restaurant | Pizza Place | Bubble Tea Shop | Italian Restaurant | Fast Food Restaurant |
| 3 | Bathurst Manor,Wilson Heights,Downsview North | Deli / Bodega | Coffee Shop | Fried Chicken Joint | Pizza Place | Mediterranean Restaurant |
| 4 | Bay Ridge | Pizza Place | Italian Restaurant | American Restaurant | Greek Restaurant | Bagel Shop |


In [107]:
nearby_venues_grouped_sorted.shape

(143, 6)

#### Most common venues by Borough

In [99]:
num_top_venues = 5
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
nearby_venues_grouped_b_sorted = pd.DataFrame(columns=columns)
nearby_venues_grouped_b_sorted['Borough'] = nearby_venues_grouped_b['Borough']
for ind in np.arange(nearby_venues_grouped_b.shape[0]):
    nearby_venues_grouped_b_sorted.iloc[ind, 1:] = return_most_common_venues(nearby_venues_grouped_b.iloc[ind, :], num_top_venues)
nearby_venues_grouped_b_sorted.head()

|   | Borough | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Brooklyn | Pizza Place | Coffee Shop | Bakery | Deli / Bodega | Italian Restaurant |
| 1 | Central Toronto | Coffee Shop | Sandwich Place | Café | Italian Restaurant | Restaurant |
| 2 | Downtown Toronto | Coffee Shop | Café | Restaurant | Japanese Restaurant | Italian Restaurant |
| 3 | Downtown TorontoStn A PO Boxes25 The Esplanade | Coffee Shop | Restaurant | Deli / Bodega | Café | Japanese Restaurant |
| 4 | East Toronto | Greek Restaurant | Coffee Shop | Italian Restaurant | Ice Cream Shop | Restaurant |


In [101]:
nearby_venues_grouped_b.shape

(15, 117)

## Data clustering
Finally, we use clustering to answer the second question:

2. Find the neighborhoods in the Toronto area that are the closest match to the Brooklyn neighborhoods of New York.
    

In [109]:
# set number of clusters
kclusters = 5

venues_grouped_clustering = nearby_venues_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 3, 4, 3, 4, 4, 4, 4, 4, 2], dtype=int32)
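
To build some intuition for what the labels mean, here is a small sketch that runs k-means on made-up frequency vectors. Neighborhoods with similar category profiles end up with the same label. The data and the choice of two clusters are assumptions for illustration, and `n_init=10` is set explicitly only to make the sketch behave the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# toy frequency vectors over three categories: [coffee, pizza, sushi]
profiles = np.array([
    [0.8, 0.1, 0.1],   # coffee-heavy
    [0.7, 0.2, 0.1],   # coffee-heavy
    [0.1, 0.8, 0.1],   # pizza-heavy
    [0.2, 0.7, 0.1],   # pizza-heavy
])

kmeans_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
labels = kmeans_toy.labels_

# the label numbers themselves are arbitrary; only shared membership matters
print(labels)
```

The two coffee-heavy rows share one label and the two pizza-heavy rows share the other. Which group is called 0 and which is called 1 carries no meaning.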

In [110]:
# add clustering labels
nearby_venues_grouped_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

venues_merged = merged_data

# join the cluster labels and top venues onto merged_data, which holds latitude/longitude for each neighborhood
venues_merged = venues_merged.join(nearby_venues_grouped_sorted.set_index('Neighborhood'), on='Neighborhood')

venues_merged # check the last columns!

|   | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Brooklyn | Bay Ridge | 40.625801 | -74.030621 | 4.0 | Pizza Place | Italian Restaurant | American Restaurant | Greek Restaurant | Bagel Shop |
| 1 | Brooklyn | Bensonhurst | 40.611009 | -73.995180 | 4.0 | Chinese Restaurant | Ice Cream Shop | Italian Restaurant | Donut Shop | Sushi Restaurant |
| 2 | Brooklyn | Sunset Park | 40.645103 | -74.010316 | 4.0 | Latin American Restaurant | Mexican Restaurant | Pizza Place | Bakery | Fried Chicken Joint |
| 3 | Brooklyn | Greenpoint | 40.730201 | -73.954241 | 4.0 | Pizza Place | Coffee Shop | Deli / Bodega | French Restaurant | Mexican Restaurant |
| 4 | Brooklyn | Gravesend | 40.595260 | -73.973471 | 3.0 | Italian Restaurant | Pizza Place | Chinese Restaurant | Bakery | Donut Shop |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 168 | Etobicoke | The Kingsway,Montgomery Road,Old Mill North | 43.651800 | -79.507600 | 4.0 | Sushi Restaurant | Bakery | Breakfast Spot | Dessert Shop | Coffee Shop |
| 169 | Downtown Toronto | Church and Wellesley | 43.665600 | -79.383000 | 4.0 | Japanese Restaurant | Sushi Restaurant | Coffee Shop | Restaurant | Fast Food Restaurant |
| 170 | East TorontoBusiness reply mail Processing Cen... | Enclave of M4L | 43.780400 | -79.250500 | 4.0 | Restaurant | Coffee Shop | Japanese Restaurant | Deli / Bodega | Breakfast Spot |
| 171 | Etobicoke | Old Mill South,King's Mill Park,Sunnylea,Humbe... | 43.632500 | -79.493900 |  |  |  |  |  |  |


In [112]:
# drop any rows with nan values (neighborhoods that did not receive a cluster label)
venues_merged = venues_merged.dropna()
  
# To reset the indices 
venues_merged = venues_merged.reset_index(drop = True)
venues_merged.shape

(143, 10)

#### Visualize clusters

In [142]:
# Convert float values to int
venues_merged["Cluster Labels"] = venues_merged["Cluster Labels"].astype(int)
venues_merged.head()
# create map
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(venues_merged['Latitude'], venues_merged['Longitude'], venues_merged['Neighborhood'], venues_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

#### Zoom on the Brooklyn and Toronto clusters to take a closer look at individual clusters

In [143]:
map_clusters

In [138]:
venues_merged.head()

|   | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Brooklyn | Bay Ridge | 40.625801 | -74.030621 | 4 | Pizza Place | Italian Restaurant | American Restaurant | Greek Restaurant | Bagel Shop |
| 1 | Brooklyn | Bensonhurst | 40.611009 | -73.99518 | 4 | Chinese Restaurant | Ice Cream Shop | Italian Restaurant | Donut Shop | Sushi Restaurant |
| 2 | Brooklyn | Sunset Park | 40.645103 | -74.010316 | 4 | Latin American Restaurant | Mexican Restaurant | Pizza Place | Bakery | Fried Chicken Joint |
| 3 | Brooklyn | Greenpoint | 40.730201 | -73.954241 | 4 | Pizza Place | Coffee Shop | Deli / Bodega | French Restaurant | Mexican Restaurant |
| 4 | Brooklyn | Gravesend | 40.59526 | -73.973471 | 3 | Italian Restaurant | Pizza Place | Chinese Restaurant | Bakery | Donut Shop |


#### Filter the data for most similar neighborhoods
From the clustering, we can see that cluster 4 occurs very frequently in both Brooklyn and Toronto. We can now filter the data on cluster 4 to get a list of the most similar neighborhoods in Brooklyn and Toronto.

In [145]:
most_similar = venues_merged[venues_merged['Cluster Labels'] == 4]
# keep only the Toronto rows and renumber them from 0
most_similar_toronto = most_similar[most_similar['Borough'] != 'Brooklyn'].reset_index(drop=True)
most_similar_toronto.shape

(47, 10)
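
Choosing a single cluster rests on the observation that it is common in both cities. One way to check such an observation is to count neighborhoods per cluster for each city. The sketch below does this on made-up rows; the frame `toy` and its values are illustrative and do not reproduce the notebook's results.

```python
import pandas as pd

# made-up rows with the same two columns used in venues_merged
toy = pd.DataFrame({
    "Borough": ["Brooklyn", "Brooklyn", "Brooklyn",
                "North York", "Downtown Toronto", "Scarborough"],
    "Cluster Labels": [4, 4, 3, 4, 4, 0],
})

# label each row by city: keep "Brooklyn", map every other borough to "Toronto"
city = toy["Borough"].where(toy["Borough"] == "Brooklyn", "Toronto")

# count neighborhoods per (city, cluster)
counts = pd.crosstab(city, toy["Cluster Labels"])
print(counts)
```

Applied to `venues_merged`, a table like this would show how many Brooklyn and Toronto neighborhoods fall into each cluster, and whether any cluster is dominated by one city.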

In [146]:
most_similar_toronto

|   | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Downtown Toronto | Regent Park,Harbourfront | 43.6555 | -79.3626 | 4 | Coffee Shop | Breakfast Spot | Restaurant | Food Truck | Bakery |
| 1 | North York | Lawrence Manor,Lawrence Heights | 43.7223 | -79.4504 | 4 | Coffee Shop | Restaurant | Sushi Restaurant | Bakery | Sandwich Place |
| 2 | Queen's Park | Ontario Provincial Government | 43.6641 | -79.3889 | 4 | Italian Restaurant | Sushi Restaurant | Restaurant | Bubble Tea Shop | Ramen Restaurant |
| 3 | Downtown Toronto | Garden District, Ryerson | 43.6572 | -79.3783 | 4 | Coffee Shop | Café | Japanese Restaurant | Middle Eastern Restaurant | Burger Joint |
| 4 | Downtown Toronto | St. James Town | 43.6513 | -79.3756 | 4 | Coffee Shop | Café | Seafood Restaurant | Italian Restaurant | Restaurant |
| 5 | East Toronto | The Beaches | 43.6784 | -79.2941 | 4 | Asian Restaurant | Bakery | Gastropub | Yemeni Restaurant | Fast Food Restaurant |
| 6 | Downtown Toronto | Berczy Park | 43.6456 | -79.3754 | 4 | Coffee Shop | Bakery | Seafood Restaurant | Café | Japanese Restaurant |
| 7 | York | Caledonia-Fairbanks | 43.6889 | -79.4507 | 4 | Bakery | Mexican Restaurant | Yemeni Restaurant | Fast Food Restaurant | Deli / Bodega |
| 8 | Scarborough | Woburn | 43.7712 | -79.2144 | 4 | Korean BBQ Restaurant | Yemeni Restaurant | Fast Food Restaurant | Cupcake Shop | Deli / Bodega |
| 9 | East York | Leaside | 43.7124 | -79.3644 | 4 | Restaurant | Coffee Shop | Burger Joint | Bagel Shop | Sushi Restaurant |


This gives us a list of 47 neighborhoods in Toronto whose food venues are very similar to those of Brooklyn neighborhoods.

## 6. Results and Discussion <a name="ResultsAndDiscussion"></a>
From the above analysis, we have the following results:

1. The top 5 venue categories for each neighborhood and borough in our data. We can use this list to explore the best cuisines in Toronto neighborhoods.
2. A list of Toronto neighborhoods that are most similar to Brooklyn. I can use this list to explore these places and find a neighborhood where I would like to settle in the near future.

We started this project with two questions in mind. They may not have a large audience or address a specific business problem. Still, each year tens of thousands of newcomers from around the world make Canada their new home. Naturally, everyone wants to explore their new home, find places to get important supplies, and find new business and employment opportunities. Food is an important part of modern culture. I love to explore different ethnic cuisines, and I am sure most people do. When I moved to Canada, I did not know where to start exploring, but I knew that I wanted to find neighborhoods similar to my previous home. A list of such places gives me an idea of where to start. The diversity of a neighborhood can be gauged by the diversity of the food served in its local restaurants, and that is the basis of this project.

## 7. Conclusion <a name="Conclusion"></a>

Based on the results, we have 47 neighborhoods that are very similar to my old home, and I cannot wait to start exploring them as soon as the current pandemic (COVID-19) is over.

We can use the same approach to explore or compare any neighborhoods based on other attributes, even if they differ from the ones used in our dataset.