# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

# Recommender System for a Café Contractor in Toronto

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for distributing a new brand of coffee in North York, Toronto, Canada. Specifically, this report is targeted at an Asian stakeholder who is interested in helping residents, Asian and otherwise, reconnect with coffee culture. A good place to start could be 'Asian restaurants'.

North York is located directly north of Old Toronto, between Etobicoke to the west and Scarborough to the east. As of the 2011 Census, it had a population of 655,913. The neighbourhoods of North York are highly diverse and are home to people of many different cultures; the Asian population represents more than 40% of residents.
Since there are a lot of restaurants in North York, we aim to find the restaurants best placed to adopt and share the idea of promoting good-quality coffee with diverse flavours and tastes. Coffee is Canada’s most consumed beverage among adults, ahead even of tap water. As champions for the advancement and enjoyment of coffee in Canada, the Coffee Association of Canada (CAC) continually strives to better understand and educate Canadians about their favourite brew.




Every year, the CAC commissions the proprietary Canadian Coffee Drinking Study on behalf of its members. Below is an infographic, which highlights a few key findings from the results of the 2018 survey.

<a href="http://www.coffeeassoc.com/wp-content/uploads/2018/11/CAC-Coffee-Drinking-Trends-INFOGRAPHIC-2018.pdf">CAC Coffee Drinking Trends INFOGRAPHIC 2018</a>

#### The Coffee Industry in Canada:

* $6.2 billion industry

* $4.8 billion sales in Foodservice

* $1.4 billion sales in Grocery / Retail Sales

#### Coffee Creates Jobs in Canada: 

* 160,000+ jobs in Cafes and Coffee Shops

* 5,000+ jobs in Manufacturing and Roasting

* 5,000+ independent café and coffee shop owners and several thousand franchise owner-operators

* Attractive entry-level positions for young people

* Jobs in support sectors such as packaging, cup suppliers, food manufacturing etc.

Given these facts, a new café contractor wants to set up in Toronto, especially in North York, to supply "Bio" (organic) and "Eco" coffee to restaurants. In general, restaurants do not offer their clients a variety of coffees, so clients make do with a single coffee that has nothing distinctive in terms of flavour, taste, aroma, etc. Yet café culture offers a wide variety of coffee types that differ by season and by origin.

The main idea behind the business is to promote good-quality coffee with diverse tastes, throughout the seasons, to restaurants, so that the "Café" moment becomes a real pleasure rather than just the end of a meal. It is a matter of reconnecting with the values of the grand café tradition.

This will benefit restaurants, which will build customer loyalty and retain their current clients. It will also benefit the contractor, who will have a broad range of customers to supply and will make his trademark better known. Consequently, it is a win-win business relationship, with more income on both sides.

The contractor will therefore work in this segment of the market and look for restaurants likely to embrace this idea. The geographical scope of the study is Toronto, and particularly the borough of North York, where the contractor wants to develop this business first.

The contractor should build the store as close as possible to its customers in order to minimize transportation costs and supply them quickly. Which neighborhood in that borough would be the best choice for the store? The recommender system should provide the contractor with a sorted list of neighborhoods, in which the first element is the best suggested neighborhood.

## Data <a name="data"></a>

First, we will collect the geographical coordinates of that specific borough and its neighborhoods. We assume that the borough is "North York" in Toronto. This information is given by the contractor, who has already made up his mind about the borough. The postal codes that fall into that borough would also be sufficient.

We will need data about the venues in the different neighborhoods of that borough. To get that information we will use Wikipedia and the dataset of the location company "Foursquare". By locational information we mean both basic and advanced information about each venue. For example, for a venue in one of the neighborhoods, the basic information is its precise latitude and longitude and its distance from the center of the neighborhood. But we are also looking for advanced information, such as the category of the venue, whether it is popular in its category, or perhaps the average price of its services.
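To make the notion of "distance from the center of the neighborhood" concrete, here is a minimal sketch of the great-circle (haversine) distance between two coordinates. Foursquare reports this distance itself, so the helper is only an illustration: the function name `haversine_m` and the Earth radius constant are our own assumptions, and the two points are the Parkwoods and North York coordinates that appear later in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lng2 - lng1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Parkwoods postal-code centre and the North York reference point
parkwoods = (43.753259, -79.329656)
north_york = (43.761539, -79.411079)
distance_km = haversine_m(*parkwoods, *north_york) / 1000
print('Parkwoods is about {:.1f} km from the North York reference point.'.format(distance_km))
```

The same helper could be applied to a venue and its neighborhood center to cross-check the `distance` field returned by the API.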

### Part 1 - Identifying Neighborhoods inside "North York"

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

#!conda install lxml --yes #Already installed, if not please uncomment it and run it to install.
#!conda install BeautifulSoup4 --yes #Library Already installed, if not please uncomment it and run it to install.
import requests
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


### Load Wikipedia page List_of_postal_codes_of_Canada:_M.html

In [3]:
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

### Pull the HTML data out using Python's BeautifulSoup library with the lxml parser

In [4]:
bs = BeautifulSoup(html.text, 'lxml')

### In the browser's developer view, we see that our data is in a table element of class "wikitable sortable"

In [5]:
My_array = bs.find('table',{'class':'wikitable sortable'})
#My_array

### Get HTML table headers

In [6]:
print(My_array.tr.text)


Postcode
Borough
Neighbourhood



### Construct a raw list of codes based on the TR/TD array (comma-separated rows)

In [7]:
Labels = "PostCode,Borough,Neighborhood"
my_table=""
for row in My_array.find_all('tr'):
    my_row=""
    for field in row.find_all('td'):
        my_row=my_row+","+field.text
    my_table=my_table+my_row[1:]
print(my_table)

M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M5A,Downtown Toronto,Regent Park
M6A,North York,Lawrence Heights
M6A,North York,Lawrence Manor
M7A,Queen's Park,Not assigned
M8A,Not assigned,Not assigned
M9A,Etobicoke,Islington Avenue
M1B,Scarborough,Rouge
M1B,Scarborough,Malvern
M2B,Not assigned,Not assigned
M3B,North York,Don Mills North
M4B,East York,Woodbine Gardens
M4B,East York,Parkview Hill
M5B,Downtown Toronto,Ryerson
M5B,Downtown Toronto,Garden District
M6B,North York,Glencairn
M7B,Not assigned,Not assigned
M8B,Not assigned,Not assigned
M9B,Etobicoke,Cloverdale
M9B,Etobicoke,Islington
M9B,Etobicoke,Martin Grove
M9B,Etobicoke,Princess Gardens
M9B,Etobicoke,West Deane Park
M1C,Scarborough,Highland Creek
M1C,Scarborough,Rouge Hill
M1C,Scarborough,Port Union
M2C,Not assigned,Not assigned
M3C,North York,Flemingdon Park
M3C,North York,Don Mills South
M4C,East York,Woodbine Heights
M

### Writing the resulting data to a csv file for pandas processing

In [8]:
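# the with-block closes (and flushes) the file before pandas reads it back
with open("zipcode_toronto.csv", "wb") as csv_file:
    n_bytes = csv_file.write(bytes(my_table, encoding="ascii", errors="ignore"))
n_bytes  # number of bytes written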

8738

### From CSV to Dataframe conversion

In [2]:
zipcode = pd.read_csv('zipcode_toronto.csv',header=None)

# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
zipcode.columns=["PostalCode","Borough","Neighborhood"]

In [3]:
zipcode.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [11]:
zipcode.shape

(288, 3)

### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [4]:
# Indices of "Not assigned" Borough's column:
NAx_Borough = zipcode[ zipcode['Borough'] =='Not assigned'].index

# Delete "Not assigned" rows based on indices:
zipcode.drop(NAx_Borough , inplace=True)
zipcode.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [13]:
zipcode.shape

(211, 3)

### Set the neighborhood to the borough value if it is not set (i.e., it has the "Not assigned" value)

In [5]:
zipcode.loc[zipcode['Neighborhood'] =='Not assigned' , 'Neighborhood'] = zipcode['Borough']

### Group by PostalCode and Borough and join the neighborhoods with commas

In [6]:
zipcode = zipcode.groupby(['PostalCode','Borough'], sort=False).agg( ', '.join)

In [7]:
zipcode=zipcode.reset_index()

# This is what the zipcode dataframe looks like
zipcode.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [17]:
zipcode.shape

(103, 3)
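The three cleaning steps above (drop unassigned boroughs, fill unassigned neighborhoods, group by postal code) can be replayed on a handful of rows to check that they behave as intended. This is a minimal sketch on a toy frame shaped like the scraped table; the variable names `raw`, `clean` and `missing` are ours, not part of the notebook.

```python
import pandas as pd

# A few rows shaped like the scraped Wikipedia table
raw = pd.DataFrame(
    [
        ('M1A', 'Not assigned', 'Not assigned'),
        ('M5A', 'Downtown Toronto', 'Harbourfront'),
        ('M5A', 'Downtown Toronto', 'Regent Park'),
        ('M7A', "Queen's Park", 'Not assigned'),
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# 1. keep only rows with an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. an unassigned neighborhood takes the name of its borough
missing = clean['Neighborhood'] == 'Not assigned'
clean.loc[missing, 'Neighborhood'] = clean.loc[missing, 'Borough']

# 3. one row per postal code, neighborhoods joined with commas
clean = (
    clean.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(clean)
```

Selecting the `Neighborhood` column before aggregating makes it explicit which column is being joined.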

### Use the Geocoder package or the csv file to create North York dataframe

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [18]:
!wget -q -O 'Latlong_TorontoM.csv'  http://cocl.us/Geospatial_data

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc


In [8]:
df_latlong = pd.read_csv('Latlong_TorontoM.csv')
df_latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [11]:
df_latlong.columns=['PostalCode','Latitude','Longitude']

In [12]:
df_latlong.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Merge the data from the Wikipedia and Geospatial_data sources

In [13]:
TorontoM_df = pd.merge(zipcode, df_latlong, on='PostalCode')

# What does new data look like:
TorontoM_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
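The default inner merge silently drops any postal code that has no coordinates. As a hedged sketch of how one might check for such losses, the toy example below uses `how='left'` with `indicator=True`; the postal code `M0Z` is made up for the illustration, and the frame names `postal`, `coords` and `checked` are our own.

```python
import pandas as pd

postal = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0Z'],  # 'M0Z' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Hypothetical'],
})
coords = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every postal code; indicator=True adds a '_merge' column
checked = pd.merge(postal, coords, on='PostalCode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print('Postal codes without coordinates:', unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner merge used above loses nothing.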


### Select only neighborhoods pertaining to "North York" borough.

In [14]:
NorthYork_data = TorontoM_df[TorontoM_df['Borough'] == 'North York']
NorthYork_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
7,M3B,North York,Don Mills North,43.745906,-79.352188
10,M6B,North York,Glencairn,43.709577,-79.445073


### Create a map of North York with its neighborhoods superimposed on top

In [16]:
latNorthYork = 43.761539
longNorthYork = -79.411079
print('The geographical coordinate of "North York" are: {}, {}.'.format(latNorthYork, longNorthYork))

map_NorthYork = folium.Map(location=[latNorthYork, longNorthYork], zoom_start=10.5)

# add markers to map
for lat, lng, Borough, Neighborhood in zip(NorthYork_data['Latitude'], NorthYork_data['Longitude'], NorthYork_data['Borough'], NorthYork_data['Neighborhood']):
    
    label = '{}, {}'.format(Neighborhood, Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=7,
        popup=label,
        color='Brown',
        fill=True,
        fill_color='#632a40',
        fill_opacity=0.7,
        parse_html=False).add_to(map_NorthYork)  
    
map_NorthYork

The geographical coordinate of "North York" are: 43.761539, -79.411079.


**Folium** is a great visualization library. Feel free to zoom into the above map and click on each circle marker to reveal the name of the neighborhood and its borough.

### Part 2 - Connecting to Foursquare and Retrieving Locational Data

Next, we are going to use the Foursquare API to explore the neighborhoods of North York and segment them.

#### Define Foursquare Credentials and Version

In [28]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = ''
#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

#### Let's create a function to repeat the same process for all the neighborhoods in Toronto/North York

In [54]:
def getNearbyVenues(postal_code_list, neighborhood_list, lat_list, long_list, LIMIT = 500, radius = 1000):    
    venues_list=[]
    counter=0 # A counter for rows processed
    for postal_code, neighborhood, lat, lng in zip(postal_code_list, neighborhood_list, lat_list, long_list):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        # make the GET request and keep only the list of recommended items
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # store the neighborhood metadata together with its raw venue list
        tmp_dict = {
            'Postal Code': postal_code,
            'Neighborhood(s)': neighborhood,
            'Latitude': lat,
            'Longitude': lng,
            'Crawling_result': results,
        }
        venues_list.append(tmp_dict)
        counter += 1
        print('{}.'.format(counter))
        print('Data is Obtained, for the Postal Code {} (and Neighborhoods {}) SUCCESSFULLY.'.format(postal_code, neighborhood))
    return venues_list

#### Now run the above function on each neighborhood and store the raw results in a list called *NorthYork_foursquare_dataset*

In [55]:
print('Getting data from Foursquare for boroughs in North York:')
NorthYork_foursquare_dataset = getNearbyVenues(list(NorthYork_data['PostalCode']),
                                               list(NorthYork_data['Neighborhood']),
                                               list(NorthYork_data['Latitude']),
                                               list(NorthYork_data['Longitude'])
                                              )

Getting data from Foursquare for boroughs in North York:
1.
Data is Obtained, for the Postal Code M3A (and Neighborhoods Parkwoods) SUCCESSFULLY.
2.
Data is Obtained, for the Postal Code M4A (and Neighborhoods Victoria Village) SUCCESSFULLY.
3.
Data is Obtained, for the Postal Code M6A (and Neighborhoods Lawrence Heights, Lawrence Manor) SUCCESSFULLY.
4.
Data is Obtained, for the Postal Code M3B (and Neighborhoods Don Mills North) SUCCESSFULLY.
5.
Data is Obtained, for the Postal Code M6B (and Neighborhoods Glencairn) SUCCESSFULLY.
6.
Data is Obtained, for the Postal Code M3C (and Neighborhoods Flemingdon Park, Don Mills South) SUCCESSFULLY.
7.
Data is Obtained, for the Postal Code M2H (and Neighborhoods Hillcrest Village) SUCCESSFULLY.
8.
Data is Obtained, for the Postal Code M3H (and Neighborhoods Bathurst Manor, Downsview North, Wilson Heights) SUCCESSFULLY.
9.
Data is Obtained, for the Postal Code M2J (and Neighborhoods Fairview, Henry Farm, Oriole) SUCCESSFULLY.
10.
Data is Obtain

### Store the Foursquare result set locally to avoid having to reconnect to Foursquare

In [56]:
import pickle
with open("NorthYork_foursquare_dataset.dmp", "wb") as fp:   #Pickling
    pickle.dump(NorthYork_foursquare_dataset, fp)
print('Received Data from Foursquare are Saved to local drive.') 

Received Data from Foursquare are Saved to local drive.


### Load the dataset from the previously saved local file

In [20]:
import pickle
with open("NorthYork_foursquare_dataset.dmp", "rb") as fp:   # Unpickling
    NorthYork_foursquare_dataset = pickle.load(fp)
# print(type(NorthYork_foursquare_dataset)) ==> List

In [21]:
#NorthYork_foursquare_dataset
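The save-then-reload pattern above can be folded into a single helper that only calls the API when no local copy exists. This is a minimal sketch under our own assumptions: `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real Foursquare call so that the example runs offline.

```python
import os
import pickle
import tempfile

def load_or_fetch(cache_path, fetch):
    """Return the cached object if the file exists, otherwise call fetch() and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as fp:
            return pickle.load(fp)
    data = fetch()
    with open(cache_path, 'wb') as fp:
        pickle.dump(data, fp)
    return data

# Stand-in for the real Foursquare call; it records how often it is invoked
calls = []

def fake_fetch():
    calls.append(1)
    return [{'Postal Code': 'M3A', 'Crawling_result': []}]

cache_file = os.path.join(tempfile.mkdtemp(), 'demo_dataset.dmp')
first = load_or_fetch(cache_file, fake_fetch)   # fetches and writes the cache
second = load_or_fetch(cache_file, fake_fetch)  # reads the cache, no new fetch
print(len(calls), first == second)
```

In this notebook, `fetch` could be a small function that calls `getNearbyVenues` with the North York lists, and `cache_path` would be `"NorthYork_foursquare_dataset.dmp"`.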

In [22]:
# This function reads the saved list of Foursquare results and extracts each venue
# for every neighborhood in it

def get_venue_dataset(foursquare_dataset):
    columns = ['Postal Code', 'Neighborhood',
               'Neighborhood Latitude', 'Neighborhood Longitude',
               'Venue', 'Venue Summary', 'Venue Category', 'Distance']
    rows = []  # one dict per venue; the DataFrame is built once at the end

    for neigh_dict in foursquare_dataset:
        postal_code = neigh_dict['Postal Code']
        neigh = neigh_dict['Neighborhood(s)']
        lat = neigh_dict['Latitude']
        lng = neigh_dict['Longitude']
        print('Number of Venues in Coordination "{}" Postal Code and "{}" Neighborhood(s) is: {}'.format(postal_code, neigh, len(neigh_dict['Crawling_result'])))

        for venue_dict in neigh_dict['Crawling_result']:
            rows.append({'Postal Code': postal_code, 'Neighborhood': neigh,
                         'Neighborhood Latitude': lat, 'Neighborhood Longitude': lng,
                         'Venue': venue_dict['venue']['name'],
                         'Venue Summary': venue_dict['reasons']['items'][0]['summary'],
                         'Venue Category': venue_dict['venue']['categories'][0]['name'],
                         'Distance': venue_dict['venue']['location']['distance']})

    # DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
    return pd.DataFrame(rows, columns=columns)

In [24]:
NorthYork_venues = get_venue_dataset(NorthYork_foursquare_dataset)

Number of Venues in Coordination "M3A" Postal Code and "Parkwoods" Neighborhood(s) is: 28
Number of Venues in Coordination "M4A" Postal Code and "Victoria Village" Neighborhood(s) is: 14
Number of Venues in Coordination "M6A" Postal Code and "Lawrence Heights, Lawrence Manor" Neighborhood(s) is: 50
Number of Venues in Coordination "M3B" Postal Code and "Don Mills North" Neighborhood(s) is: 29
Number of Venues in Coordination "M6B" Postal Code and "Glencairn" Neighborhood(s) is: 28
Number of Venues in Coordination "M3C" Postal Code and "Flemingdon Park, Don Mills South" Neighborhood(s) is: 44
Number of Venues in Coordination "M2H" Postal Code and "Hillcrest Village" Neighborhood(s) is: 22
Number of Venues in Coordination "M3H" Postal Code and "Bathurst Manor, Downsview North, Wilson Heights" Neighborhood(s) is: 28
Number of Venues in Coordination "M2J" Postal Code and "Fairview, Henry Farm, Oriole" Neighborhood(s) is: 44
Number of Venues in Coordination "M3J" Postal Code and "Northwood Park, York
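`get_venue_dataset` indexes the first reason and the first category of every venue directly, so a venue that lacks either one would raise an exception. The sketch below shows one possible defensive variant. The helper name `extract_venue_fields` is ours, and the two sample items are hand-written to mirror only the fields this notebook reads; they are not real Foursquare output.

```python
def extract_venue_fields(item):
    """Pull the four fields used in this notebook out of one Foursquare item.

    Missing reasons or categories yield None instead of raising an exception.
    """
    venue = item.get('venue', {})
    reasons = item.get('reasons', {}).get('items', [])
    categories = venue.get('categories', [])
    return {
        'Venue': venue.get('name'),
        'Venue Summary': reasons[0].get('summary') if reasons else None,
        'Venue Category': categories[0].get('name') if categories else None,
        'Distance': venue.get('location', {}).get('distance'),
    }

# Hand-written items that mirror only the fields read above (not real API output)
complete_item = {
    'reasons': {'items': [{'summary': 'This spot is popular'}]},
    'venue': {
        'name': 'Tim Hortons',
        'categories': [{'name': 'Café'}],
        'location': {'distance': 866},
    },
}
sparse_item = {'venue': {'name': 'Unnamed venue', 'categories': []}}

rows = [extract_venue_fields(complete_item), extract_venue_fields(sparse_item)]
print(rows)
```

Rows with a `None` category can then be dropped or inspected before the one-hot encoding step.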

### Part 3 - Processing the Retrieved Data and Creating a Dataframe for All the Venues inside North York

#### Let's check the size of the resulting dataframe

In [293]:
print(len(NorthYork_venues))

615


In [72]:
NorthYork_venues.head(10)

Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Venue Category,Distance
0,M3A,Parkwoods,43.753259,-79.329656,Allwyn's Bakery,This spot is popular,Caribbean Restaurant,833
1,M3A,Parkwoods,43.753259,-79.329656,Brookbanks Park,This spot is popular,Park,245
2,M3A,Parkwoods,43.753259,-79.329656,Tim Hortons,This spot is popular,Café,866
3,M3A,Parkwoods,43.753259,-79.329656,A&W Canada,This spot is popular,Fast Food Restaurant,852
4,M3A,Parkwoods,43.753259,-79.329656,Food Basics,This spot is popular,Supermarket,895
5,M3A,Parkwoods,43.753259,-79.329656,Bruno's valu-mart,This spot is popular,Grocery Store,882
6,M3A,Parkwoods,43.753259,-79.329656,Shoppers Drug Mart,This spot is popular,Pharmacy,953
7,M3A,Parkwoods,43.753259,-79.329656,High Street Fish & Chips,This spot is popular,Fish & Chips Shop,967
8,M3A,Parkwoods,43.753259,-79.329656,Variety Store,This spot is popular,Food & Drink Shop,312
9,M3A,Parkwoods,43.753259,-79.329656,Shoppers Drug Mart,This spot is popular,Pharmacy,926


**Let's check how many venues were returned for each neighborhood**

In [296]:
NorthYork_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Postal Code,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Venue Category,Distance
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"Bathurst Manor, Downsview North, Wilson Heights",28,28,28,28,28,28,28
Bayview Village,13,13,13,13,13,13,13
"Bedford Park, Lawrence Manor East",38,38,38,38,38,38,38
"CFB Toronto, Downsview East",21,21,21,21,21,21,21
Don Mills North,29,29,29,29,29,29,29
Downsview Central,4,4,4,4,4,4,4
Downsview Northwest,31,31,31,31,31,31,31
Downsview West,9,9,9,9,9,9,9
"Downsview, North Park, Upwood Park",12,12,12,12,12,12,12
"Emery, Humberlea",8,8,8,8,8,8,8


#### Let's find out how many unique categories can be curated from all the returned venues

In [297]:
print('There are {} unique categories.'.format(len(NorthYork_venues['Venue Category'].unique())))

There are 152 unique categories.


In [25]:
# Save a cleaned Version of DataFrame as a Result from Foursquare
NorthYork_venues.to_csv('NorthYork_venues.csv')

## Methodology <a name="methodology"></a>

In this project we direct our efforts at detecting shops and restaurants that we could supply with our coffee brand. Because our Asian stakeholder wants to serve Asian customers first, we focus on one type of venue, mainly restaurants.

In the first step we collected the required **data: location and type (category) of every restaurant**. We also **identified Asian restaurants** (according to the Foursquare categorization).

The second step in our analysis will be the exploration of Asian restaurants in North York.

In the third and final step we will focus on the most promising venues and, within those, create **clusters of locations that meet some basic requirements** established in discussion with the stakeholder: we will consider Asian locations first. We will present a map of all such locations and also create clusters (using **k-means clustering**) of those locations to identify the general zones / neighborhoods / addresses that should be the starting point for the final 'street level' exploration and search for the optimal venue location by the stakeholder.
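As a rough illustration of the clustering step, the sketch below runs k-means on the coordinates of five North York postal codes taken from the tables above. The choices `n_clusters=2`, `n_init=10` and `random_state=0` are ours and purely illustrative, not tuned values from this study.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude of five North York postal codes from the tables above
postal_codes = ['M3A', 'M4A', 'M3B', 'M6A', 'M6B']
coords = np.array([
    [43.753259, -79.329656],
    [43.725882, -79.315572],
    [43.745906, -79.352188],
    [43.718518, -79.464763],
    [43.709577, -79.445073],
])

# k=2 and random_state=0 are illustrative choices, not tuned values
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = dict(zip(postal_codes, kmeans.labels_.tolist()))
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which postal codes share a label. Treating degrees of latitude and longitude as Euclidean coordinates is an approximation that is reasonable over an area as small as one borough.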

## Analysis <a name="analysis"></a>

### Analyze Each Neighborhood

In [26]:
# Loading Data from File (Saved "Foursquare" DataFrame for Venues)
# index_col=0 reads the saved index back as the index instead of an extra 'Unnamed: 0' column
NorthYork_venues = pd.read_csv('NorthYork_venues.csv', index_col=0)
NorthYork_venues.head()

In [258]:
nbh_list = list(NorthYork_venues['Neighborhood'].unique())
print('Number of Neighborhoods inside North York:')
print(len(nbh_list))
print('List of Neighborhoods inside North York:')
nbh_list

Number of Neighborhoods inside North York:
24
List of Neighborhoods inside North York:


['Parkwoods',
 'Victoria Village',
 'Lawrence Heights, Lawrence Manor',
 'Don Mills North',
 'Glencairn',
 'Flemingdon Park, Don Mills South',
 'Hillcrest Village',
 'Bathurst Manor, Downsview North, Wilson Heights',
 'Fairview, Henry Farm, Oriole',
 'Northwood Park, York University',
 'Bayview Village',
 'CFB Toronto, Downsview East',
 'Silver Hills, York Mills',
 'Downsview West',
 'Downsview, North Park, Upwood Park',
 'Humber Summit',
 'Newtonbrook, Willowdale',
 'Downsview Central',
 'Bedford Park, Lawrence Manor East',
 'Emery, Humberlea',
 'Willowdale South',
 'Downsview Northwest',
 'York Mills West',
 'Willowdale West']

In [27]:
nbh_venue_summary = NorthYork_venues.groupby('Neighborhood').count()
# errors='ignore' skips the drop when the 'Unnamed: 0' index column is absent
nbh_venue_summary.drop(columns=['Unnamed: 0'], errors='ignore').head()

In [298]:
print('There are {} unique categories.'.format(len(NorthYork_venues['Venue Category'].unique())))

print('Here is the list of different categories:')
list(NorthYork_venues['Venue Category'].unique())

There are 152 unique categories.
Here is the list of different categories:


['Caribbean Restaurant',
 'Park',
 'Café',
 'Fast Food Restaurant',
 'Supermarket',
 'Grocery Store',
 'Pharmacy',
 'Fish & Chips Shop',
 'Food & Drink Shop',
 'Pizza Place',
 'Road',
 'Bus Stop',
 'Train Station',
 'Discount Store',
 'Laundry Service',
 'Chinese Restaurant',
 'Coffee Shop',
 'Convenience Store',
 'Shopping Mall',
 'Tennis Court',
 'Cosmetics Shop',
 'Shop & Service',
 'Hockey Arena',
 'Portuguese Restaurant',
 'Sporting Goods Shop',
 'Furniture / Home Store',
 "Men's Store",
 'Lounge',
 'Golf Course',
 'Athletics & Sports',
 'Gym / Fitness Center',
 'Boutique',
 'Vietnamese Restaurant',
 'Sushi Restaurant',
 'Deli / Bodega',
 'Dessert Shop',
 'Burger Joint',
 'Greek Restaurant',
 'Fried Chicken Joint',
 'Bowling Alley',
 'Restaurant',
 'Clothing Store',
 'Seafood Restaurant',
 'Pet Store',
 'Bank',
 'Accessories Store',
 'Gym',
 'Event Space',
 'Cheese Shop',
 'Miscellaneous Shop',
 'Sandwich Place',
 "Women's Store",
 'Arts & Crafts Store',
 'Gift Shop',
 'Paper / Of

### One-hot Encoding the "Venue Category" Column into One Feature per Unique Category

In [28]:
# one hot encoding
NorthYork_onehot = pd.get_dummies(NorthYork_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
NorthYork_onehot['Neighborhood'] = NorthYork_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [NorthYork_onehot.columns[-1]] + list(NorthYork_onehot.columns[:-1])
NorthYork_onehot = NorthYork_onehot[fixed_columns]

NorthYork_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Beach,Beer Store,Bike Shop,Boutique,Bowling Alley,Breakfast Spot,Bridal Shop,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Line,Bus Stop,Business Service,Butcher,Cafeteria,Café,Candy Store,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Clothing Store,Coffee Shop,Comfort Food Restaurant,Community Center,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Eastern European Restaurant,Electronics Store,Empanada Restaurant,Event Space,Falafel Restaurant,Fast Food Restaurant,Fireworks Store,Fish & Chips Shop,Fish Market,Food & Drink Shop,Food Court,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,General Entertainment,Gift Shop,Golf Course,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Hardware Store,History Museum,Hockey Arena,Hookah Bar,Hot Dog Joint,Hotel,Housing Development,Ice Cream Shop,Indian Restaurant,Indonesian Restaurant,Intersection,Italian Restaurant,Japanese Restaurant,Juice Bar,Karaoke Bar,Kitchen Supply Store,Korean Restaurant,Latin American Restaurant,Laundry Service,Liquor Store,Lounge,Massage Studio,Mediterranean Restaurant,Men's Store,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Movie Theater,Moving Target,Office,Other Repair Shop,Paper / Office Supplies Store,Park,Pet Store,Pharmacy,Photography Lab,Pizza Place,Playground,Plaza,Pool,Portuguese Restaurant,Pub,Ramen Restaurant,Recreation Center,Rental Car Location,Residential Building (Apartment / Condo),Restaurant,Road,Salad Place,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Shoe Store,Shop & Service,Shopping Mall,Skate Park,Skating Rink,Ski Area,Ski Chalet,Smoothie Shop,Snack Place,Soccer Field,Spa,Sporting Goods Shop,Sports Bar,Sports Club,Steakhouse,Storage Facility,Supermarket,Sushi Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Toy / Game Store,Trail,Train Station,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [300]:
NorthYork_onehot.shape

(615, 153)
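With 153 columns the real one-hot frame is hard to read, so here is a minimal sketch of the same encoding on four toy venues, followed by the per-neighborhood sum used in the next steps. The Glencairn row is invented for the illustration, and `dtype=int` is our addition so that the indicator columns are integers rather than booleans in recent pandas versions.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Parkwoods', 'Parkwoods', 'Glencairn'],
    'Venue Category': ['Café', 'Park', 'Café', 'Sushi Restaurant'],
})

# one column per category, 1 where the venue belongs to it (dtype=int avoids booleans)
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# summing the indicator columns counts the venues of each category per neighborhood
counts = onehot.groupby('Neighborhood').sum()
print(counts)
```

Each cell of `counts` is the number of venues of that category in that neighborhood, which is the quantity the filtered table below is built from.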

### Manually filtering (Subsetting) related features for the Café Contractor

In [29]:
filtered_categories_list = ['Neighborhood',
 'American Restaurant',
 'Asian Restaurant',
 'Bubble Tea Shop',
 'Burger Joint',
 'Burrito Place',
 'Cafeteria',
 'Café',
 'Caribbean Restaurant',
 'Cheese Shop',
 'Chinese Restaurant',
 'Coffee Shop',
 'Comfort Food Restaurant',
 'Deli / Bodega',
 'Dessert Shop',
 'Dim Sum Restaurant',
 'Diner',
 'Eastern European Restaurant',
 'Empanada Restaurant',
 'Falafel Restaurant',
 'Fish & Chips Shop',
 'Food & Drink Shop',
 'French Restaurant',
 'Greek Restaurant',
 'Housing Development',
 'Indian Restaurant',
 'Indonesian Restaurant', 
 'Italian Restaurant',
 'Japanese Restaurant',
 'Korean Restaurant',
 'Latin American Restaurant',
 'Mediterranean Restaurant',
 'Middle Eastern Restaurant',
 'Portuguese Restaurant',
 'Restaurant',
 'Seafood Restaurant',
 'Steakhouse',
 'Sushi Restaurant',
 'Thai Restaurant',
 'Turkish Restaurant',
 'Vietnamese Restaurant']

### Updating the One-hot Encoded DataFrame and Grouping the Data by Neighborhoods

In [30]:
NorthYork_onehot = NorthYork_onehot[filtered_categories_list].groupby('Neighborhood').sum()

NorthYork_onehot.head()

Neighborhood,American Restaurant,Asian Restaurant,Bubble Tea Shop,Burger Joint,Burrito Place,Cafeteria,Café,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Coffee Shop,Comfort Food Restaurant,Deli / Bodega,Dessert Shop,Dim Sum Restaurant,Diner,Eastern European Restaurant,Empanada Restaurant,Falafel Restaurant,Fish & Chips Shop,Food & Drink Shop,French Restaurant,Greek Restaurant,Housing Development,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Latin American Restaurant,Mediterranean Restaurant,Middle Eastern Restaurant,Portuguese Restaurant,Restaurant,Seafood Restaurant,Steakhouse,Sushi Restaurant,Thai Restaurant,Turkish Restaurant,Vietnamese Restaurant
"Bathurst Manor, Downsview North, Wilson Heights",0,0,0,0,0,0,0,0,0,0,2,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0
Bayview Village,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0
"Bedford Park, Lawrence Manor East",1,0,0,0,0,0,1,0,0,0,3,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,3,0,0,0,0,0,0,1,0,0,1,1,0,0
"CFB Toronto, Downsview East",0,0,0,0,0,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,2,1
Don Mills North,0,1,0,2,0,1,1,1,0,0,3,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,1,0,0,0,1,0,0
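
To make the subset-and-sum step concrete, here is a minimal sketch on a toy venue table. The venue rows and every `toy_` name are made up for illustration; only the pattern (select the columns of interest, then `groupby('Neighborhood').sum()`) mirrors the cell above.

```python
import pandas as pd

# Hypothetical venue list: one row per venue (illustrative data only)
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Park', 'Café', 'Sushi Restaurant', 'Café'],
})

# One-hot encode the category, then put the neighborhood name back
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# Keep only the categories of interest, then count venues per neighborhood
toy_keep = ['Neighborhood', 'Café', 'Sushi Restaurant']
toy_counts = toy_onehot[toy_keep].groupby('Neighborhood').sum()
print(toy_counts)
```

After the sum, each cell holds the number of venues of that category in the neighborhood, which is why the table above contains values larger than 1.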


**And let's examine the new dataframe size**

In [31]:
NorthYork_onehot.shape

(24, 40)

In [304]:
NorthYork_onehot.columns

Index(['American Restaurant', 'Asian Restaurant', 'Bubble Tea Shop',
       'Burger Joint', 'Burrito Place', 'Cafeteria', 'Café',
       'Caribbean Restaurant', 'Cheese Shop', 'Chinese Restaurant',
       'Coffee Shop', 'Comfort Food Restaurant', 'Deli / Bodega',
       'Dessert Shop', 'Dim Sum Restaurant', 'Diner',
       'Eastern European Restaurant', 'Empanada Restaurant',
       'Falafel Restaurant', 'Fish & Chips Shop', 'Food & Drink Shop',
       'French Restaurant', 'Greek Restaurant', 'Housing Development',
       'Indian Restaurant', 'Indonesian Restaurant', 'Italian Restaurant',
       'Japanese Restaurant', 'Korean Restaurant', 'Latin American Restaurant',
       'Mediterranean Restaurant', 'Middle Eastern Restaurant',
       'Portuguese Restaurant', 'Restaurant', 'Seafood Restaurant',
       'Steakhouse', 'Sushi Restaurant', 'Thai Restaurant',
       'Turkish Restaurant', 'Vietnamese Restaurant'],
      dtype='object')

### Reduce the scope of the features to restaurants only

In [32]:
# Collect every feature whose name does not contain 'Restaurant'
other_than_restaurant_list = [name for name in NorthYork_onehot.columns
                              if 'Restaurant' not in name]

# Keep only the restaurant categories
NorthYork_onehot = NorthYork_onehot.drop(columns=other_than_restaurant_list)
NorthYork_onehot

Neighborhood,American Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant,Falafel Restaurant,French Restaurant,Greek Restaurant,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Latin American Restaurant,Mediterranean Restaurant,Middle Eastern Restaurant,Portuguese Restaurant,Restaurant,Seafood Restaurant,Sushi Restaurant,Thai Restaurant,Turkish Restaurant,Vietnamese Restaurant
"Bathurst Manor, Downsview North, Wilson Heights",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0
Bayview Village,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0
"Bedford Park, Lawrence Manor East",1,0,0,0,1,0,0,0,0,0,1,1,0,3,0,0,0,0,0,0,1,0,1,1,0,0
"CFB Toronto, Downsview East",0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,2,1
Don Mills North,0,1,1,0,0,0,0,0,0,0,1,0,0,0,3,0,0,0,0,0,1,0,0,1,0,0
Downsview Central,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2
Downsview Northwest,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
Downsview West,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
"Downsview, North Park, Upwood Park",0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
"Emery, Humberlea",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Showing the Fully-Processed DataFrame about Neighborhoods inside North York
### This Dataset is Ready for any Machine Learning Algorithm

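Next, let's group rows by neighborhood and take the mean of each category. Because the data was already summed per neighborhood, each group contains a single row, so the values below remain venue counts rather than frequencies of occurrence.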

In [33]:
NorthYork_grouped = NorthYork_onehot.groupby('Neighborhood').mean().reset_index()
NorthYork_grouped

Unnamed: 0,Neighborhood,American Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant,Falafel Restaurant,French Restaurant,Greek Restaurant,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Latin American Restaurant,Mediterranean Restaurant,Middle Eastern Restaurant,Portuguese Restaurant,Restaurant,Seafood Restaurant,Sushi Restaurant,Thai Restaurant,Turkish Restaurant,Vietnamese Restaurant
0,"Bathurst Manor, Downsview North, Wilson Heights",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0
1,Bayview Village,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0
2,"Bedford Park, Lawrence Manor East",1,0,0,0,1,0,0,0,0,0,1,1,0,3,0,0,0,0,0,0,1,0,1,1,0,0
3,"CFB Toronto, Downsview East",0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,2,1
4,Don Mills North,0,1,1,0,0,0,0,0,0,0,1,0,0,0,3,0,0,0,0,0,1,0,0,1,0,0
5,Downsview Central,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2
6,Downsview Northwest,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
7,Downsview West,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
8,"Downsview, North Park, Upwood Park",0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
9,"Emery, Humberlea",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
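
The difference between summing and averaging one-hot rows can be seen in a small sketch. The data and the `toy_` names below are hypothetical; the point is only that a mean over per-venue rows yields shares, while a mean over rows that were already summed per neighborhood returns the counts unchanged.

```python
import pandas as pd

# Hypothetical one-hot rows: one row per venue (illustrative data only)
toy_onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Thai Restaurant': [1, 0, 0, 0, 0],
    'Greek Restaurant': [0, 1, 1, 1, 1],
})

# Sum gives venue counts per neighborhood; mean gives the share of venues
toy_sum = toy_onehot.groupby('Neighborhood').sum()
toy_mean = toy_onehot.groupby('Neighborhood').mean()

# Taking the mean of already-summed rows leaves the counts unchanged,
# because each neighborhood is then a single-row group
toy_mean_of_sum = toy_sum.groupby('Neighborhood').mean()
print(toy_mean)
print(toy_mean_of_sum)
```

Which of the two scales is fed to the clustering step matters: counts let neighborhoods with many venues dominate the distances, while shares compare only the mix of categories.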


#### Let's confirm the new size

In [34]:
NorthYork_grouped.shape

(24, 27)

### Print each neighborhood along with the top 5 most common venues

In [35]:
num_top_venues = 5

for Nbh in NorthYork_grouped['Neighborhood']:
    print("----"+Nbh+"----")
    top = NorthYork_grouped[NorthYork_grouped['Neighborhood'] == Nbh].T.reset_index()
    top.columns = ['venue','freq']
    top = top.iloc[1:]
    top['freq'] = top['freq'].astype(float)
    top = top.round({'freq': 2})
    print(top.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Downsview North, Wilson Heights----
                       venue  freq
0           Sushi Restaurant   1.0
1                 Restaurant   1.0
2  Middle Eastern Restaurant   1.0
3   Mediterranean Restaurant   1.0
4        American Restaurant   0.0


----Bayview Village----
                 venue  freq
0  Japanese Restaurant   2.0
1   Chinese Restaurant   1.0
2  American Restaurant   0.0
3   Turkish Restaurant   0.0
4      Thai Restaurant   0.0


----Bedford Park, Lawrence Manor East----
                venue  freq
0  Italian Restaurant   3.0
1    Greek Restaurant   1.0
2     Thai Restaurant   1.0
3    Sushi Restaurant   1.0
4          Restaurant   1.0


----CFB Toronto, Downsview East----
                       venue  freq
0         Turkish Restaurant   2.0
1         Italian Restaurant   1.0
2  Latin American Restaurant   1.0
3  Middle Eastern Restaurant   1.0
4      Vietnamese Restaurant   1.0


----Don Mills North----
                  venue  freq
0   Japanese Resta

### Let's put that into a pandas dataframe

**First, let's write a function to sort the venues in descending order**

In [36]:
def top_venues(row, top):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:top]
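
As a quick check of the function above, the sketch below applies it to a made-up row laid out like a row of `NorthYork_grouped` (name first, category counts after). The row contents and the `toy_` names are hypothetical, and the function is repeated only so the example runs on its own.

```python
import pandas as pd

def top_venues(row, top):
    # Skip the first entry (the neighborhood name) and rank the rest
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:top]

# Hypothetical row in the same layout as NorthYork_grouped
toy_row = pd.Series({
    'Neighborhood': 'Toy Village',
    'Italian Restaurant': 3,
    'Thai Restaurant': 1,
    'Japanese Restaurant': 5,
    'Greek Restaurant': 0,
})

toy_top = top_venues(toy_row, 2)
print(list(toy_top))
```

When several categories share the same count, as many do in the real table, the order among them depends on the sort and carries no meaning, so lower-ranked "most common" venues with a count of zero should be read with care.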

### Now let's create the new dataframe and display the top 10 venues for each neighborhood

In [37]:
import numpy as np
top = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(top):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Nbhs_venues_sorted = pd.DataFrame(columns=columns)
Nbhs_venues_sorted['Neighborhood'] = NorthYork_grouped['Neighborhood']

for ind in np.arange(NorthYork_grouped.shape[0]):
    Nbhs_venues_sorted.iloc[ind, 1:] = top_venues(NorthYork_grouped.iloc[ind, :], top)

Nbhs_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Downsview North, Wilson Heights",Sushi Restaurant,Restaurant,Middle Eastern Restaurant,Mediterranean Restaurant,Vietnamese Restaurant,Greek Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant
1,Bayview Village,Japanese Restaurant,Chinese Restaurant,Vietnamese Restaurant,Indian Restaurant,Asian Restaurant,Caribbean Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant
2,"Bedford Park, Lawrence Manor East",Italian Restaurant,American Restaurant,Thai Restaurant,Sushi Restaurant,Restaurant,Comfort Food Restaurant,Indian Restaurant,Greek Restaurant,French Restaurant,Asian Restaurant
3,"CFB Toronto, Downsview East",Turkish Restaurant,Vietnamese Restaurant,Middle Eastern Restaurant,Latin American Restaurant,Italian Restaurant,Greek Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant
4,Don Mills North,Japanese Restaurant,Greek Restaurant,Thai Restaurant,Asian Restaurant,Caribbean Restaurant,Restaurant,Vietnamese Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant


## Part 4 - Applying a Machine Learning Technique (K-Means Clustering)

### Cluster Neighborhoods


Run k-means to cluster the neighborhoods into 3 clusters.

Here we cluster the neighborhoods with the k-means method. We consider 3 clusters sufficient to cover the complexity of our problem. After clustering, we update our dataset with a column that gives the group of each neighborhood.

In [38]:
# set number of clusters (k)
k = 3

NorthYork_grouped_clustering = NorthYork_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(NorthYork_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 1,
       0, 0])
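
The choice of 3 clusters above is a judgment call. One common way to sanity-check such a choice, not used in the original analysis, is to compare silhouette scores for a few candidate values of k. The sketch below does this on random count data that merely stands in for `NorthYork_grouped_clustering`; the data, the candidate range and all `toy_` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical count matrix standing in for NorthYork_grouped_clustering
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=0.5, size=(24, 26)).astype(float)

toy_scores = {}
for candidate_k in range(2, 7):
    model = KMeans(n_clusters=candidate_k, random_state=0, n_init=10)
    labels = model.fit_predict(toy_counts)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    toy_scores[candidate_k] = silhouette_score(toy_counts, labels)

for candidate_k, score in toy_scores.items():
    print(candidate_k, round(score, 3))
```

On the real data, a value of k whose score is clearly higher than its neighbours would support that choice; flat or uniformly low scores would suggest that the clusters are weakly separated whatever k is used.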

In [42]:
means_df = pd.DataFrame(kmeans.cluster_centers_)
means_df.columns = NorthYork_grouped_clustering.columns
means_df.index = ['G1','G2','G3']
means_df['Total Sum'] = means_df.sum(axis = 1)
fixed_columns = [means_df.columns[-1]] + list(means_df.columns[:-1])
means_df = means_df[fixed_columns]
means_df.sort_values(axis = 0, by = ['Total Sum'], ascending=False)

Unnamed: 0,Total Sum,American Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant,Falafel Restaurant,French Restaurant,Greek Restaurant,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Latin American Restaurant,Mediterranean Restaurant,Middle Eastern Restaurant,Portuguese Restaurant,Restaurant,Seafood Restaurant,Sushi Restaurant,Thai Restaurant,Turkish Restaurant,Vietnamese Restaurant
G2,29.0,1.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,5.0,5.0,0.0,0.0,2.0,0.0,3.0,1.0,4.0,0.0,0.0,2.0
G3,7.666667,0.5,0.6666667,0.333333,0.5,0.0,0.166667,0.0,0.0,0.166667,0.0,0.166667,0.166667,0.0,0.166667,1.833333,0.666667,0.0,0.0,0.666667,0.0,1.333333,0.0,0.166667,0.166667,0.0,0.0
G1,3.352941,0.117647,2.775558e-17,0.117647,0.235294,0.058824,0.058824,0.058824,0.058824,0.058824,0.058824,0.176471,0.058824,6.938894e-18,0.352941,0.058824,0.117647,0.117647,0.176471,0.117647,0.058824,0.470588,0.058824,0.176471,0.058824,0.117647,0.411765


In [43]:
print("means_df shape:", means_df.shape)
print("NorthYork_grouped_clustering shape:", NorthYork_grouped_clustering.shape)

means_df shape: (3, 27)
NorthYork_grouped_clustering shape: (24, 26)


### Inserting "kmeans.labels_" into the Original North York DataFrame
#### Finding the Corresponding Group for Each Neighborhood.

In [122]:
neigh_summary = pd.DataFrame([NorthYork_grouped.index, 1 + kmeans.labels_]).T
neigh_summary.columns = ['Neighborhood', 'Group']

Unnamed: 0,Neighborhood,Group
21,21,2


# Deducing Results:
## Best Neighborhoods Are...

In [124]:
neigh_summary[neigh_summary['Group'] == 2]

Unnamed: 0,Neighborhood,Group
21,21,2


In [125]:
NorthYork_grouped.loc[21:21]

Unnamed: 0,Neighborhood,American Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant,Falafel Restaurant,French Restaurant,Greek Restaurant,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Latin American Restaurant,Mediterranean Restaurant,Middle Eastern Restaurant,Portuguese Restaurant,Restaurant,Seafood Restaurant,Sushi Restaurant,Thai Restaurant,Turkish Restaurant,Vietnamese Restaurant
21,Willowdale South,1,0,0,2,1,0,0,0,0,0,0,0,1,2,5,5,0,0,2,0,3,1,4,0,0,2
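
In the summary above, the `Neighborhood` column holds the row index, so the name has to be looked up afterwards by position. A minimal alternative sketch stores the names directly next to the group numbers. The names and labels below are hypothetical stand-ins for `NorthYork_grouped['Neighborhood']` and `kmeans.labels_`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for NorthYork_grouped['Neighborhood'] and kmeans.labels_
toy_names = pd.Series(['Alpha Park', 'Beta Heights', 'Gamma South'])
toy_labels = np.array([0, 2, 1])

# One row per neighborhood, with groups numbered from 1 as in the cell above
toy_summary = pd.DataFrame({
    'Neighborhood': toy_names,
    'Group': toy_labels + 1,
})

# Select the neighborhoods that fall in group 2
toy_best = toy_summary.loc[toy_summary['Group'] == 2, 'Neighborhood'].tolist()
print(toy_best)
```

This relies on the labels being in the same row order as the names, which holds when both come from the same grouped dataframe.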


Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [149]:
# add clustering labels
Nbhs_venues_sorted.insert(0, 'Cluster_Labels', kmeans.labels_)

NorthYork_merged = NorthYork_data

# join the sorted venues onto NorthYork_data to add latitude/longitude for each neighborhood
NorthYork_merged = NorthYork_merged.join(Nbhs_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [150]:
# NorthYork_data
NorthYork_merged

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0,Caribbean Restaurant,Chinese Restaurant,Vietnamese Restaurant,Turkish Restaurant,Asian Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant,Falafel Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,0,Portuguese Restaurant,Vietnamese Restaurant,Indian Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,0,Vietnamese Restaurant,Restaurant,Greek Restaurant,Sushi Restaurant,Seafood Restaurant,Korean Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant
7,M3B,North York,Don Mills North,43.745906,-79.352188,2,Japanese Restaurant,Greek Restaurant,Thai Restaurant,Asian Restaurant,Caribbean Restaurant,Restaurant,Vietnamese Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant
10,M6B,North York,Glencairn,43.709577,-79.445073,0,Restaurant,Mediterranean Restaurant,Latin American Restaurant,Japanese Restaurant,Italian Restaurant,Greek Restaurant,Vietnamese Restaurant,French Restaurant,Asian Restaurant,Caribbean Restaurant
13,M3C,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923,2,Restaurant,American Restaurant,Asian Restaurant,Japanese Restaurant,Sushi Restaurant,Chinese Restaurant,Middle Eastern Restaurant,Dim Sum Restaurant,Italian Restaurant,French Restaurant
27,M2H,North York,Hillcrest Village,43.803762,-79.363452,0,Chinese Restaurant,Korean Restaurant,Vietnamese Restaurant,Indian Restaurant,Asian Restaurant,Caribbean Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant
28,M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights",43.754328,-79.442259,0,Sushi Restaurant,Restaurant,Middle Eastern Restaurant,Mediterranean Restaurant,Vietnamese Restaurant,Greek Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant
33,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,2,Japanese Restaurant,American Restaurant,Caribbean Restaurant,Restaurant,Indian Restaurant,Asian Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant
34,M3J,North York,"Northwood Park, York University",43.76798,-79.487262,2,Restaurant,Chinese Restaurant,Middle Eastern Restaurant,Japanese Restaurant,Falafel Restaurant,Vietnamese Restaurant,Greek Restaurant,Asian Restaurant,Caribbean Restaurant,Comfort Food Restaurant


Finally, let's visualize the resulting clusters

In [151]:
NorthYork_merged['Cluster_Labels'] = NorthYork_merged.Cluster_Labels.astype(int)

latNorthYork = 43.761539
longNorthYork = -79.411079

# create map
map_clusters = folium.Map(location=[latNorthYork, longNorthYork], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, nbh, cluster in zip(NorthYork_merged['Latitude'], NorthYork_merged['Longitude'], NorthYork_merged['Neighborhood'], NorthYork_merged['Cluster_Labels']):
    label = folium.Popup(str(nbh) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.9).add_to(map_clusters)
       
map_clusters

### Examine Clusters

Now we can examine each cluster and determine the venue categories that distinguish it. Based on the defining categories, we can then assign a name to each cluster. There are 3 clusters, labeled 0 to 2.

### Cluster 1 (label 0): At first glance, this cluster appears to group neighborhoods whose most common venues are non-Asian restaurants

In [316]:
NorthYork_merged.loc[NorthYork_merged['Cluster_Labels'] == 0, NorthYork_merged.columns[[2] + list(range(5, NorthYork_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Parkwoods,0,Caribbean Restaurant,Chinese Restaurant,Vietnamese Restaurant,Turkish Restaurant,Asian Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant,Falafel Restaurant
1,Victoria Village,0,Portuguese Restaurant,Vietnamese Restaurant,Indian Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant
3,"Lawrence Heights, Lawrence Manor",0,Vietnamese Restaurant,Restaurant,Greek Restaurant,Sushi Restaurant,Seafood Restaurant,Korean Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant
10,Glencairn,0,Restaurant,Mediterranean Restaurant,Latin American Restaurant,Japanese Restaurant,Italian Restaurant,Greek Restaurant,Vietnamese Restaurant,French Restaurant,Asian Restaurant,Caribbean Restaurant
27,Hillcrest Village,0,Chinese Restaurant,Korean Restaurant,Vietnamese Restaurant,Indian Restaurant,Asian Restaurant,Caribbean Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant
28,"Bathurst Manor, Downsview North, Wilson Heights",0,Sushi Restaurant,Restaurant,Middle Eastern Restaurant,Mediterranean Restaurant,Vietnamese Restaurant,Greek Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant
40,"CFB Toronto, Downsview East",0,Turkish Restaurant,Vietnamese Restaurant,Middle Eastern Restaurant,Latin American Restaurant,Italian Restaurant,Greek Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant
45,"Silver Hills, York Mills",0,Vietnamese Restaurant,Turkish Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant,Falafel Restaurant
46,Downsview West,0,Vietnamese Restaurant,Turkish Restaurant,Asian Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant,Falafel Restaurant
49,"Downsview, North Park, Upwood Park",0,Chinese Restaurant,Dim Sum Restaurant,Mediterranean Restaurant,Vietnamese Restaurant,Indian Restaurant,Asian Restaurant,Caribbean Restaurant,Comfort Food Restaurant,Eastern European Restaurant,Empanada Restaurant


### Cluster 2 (label 1): This cluster appears to gather a large set of Asian restaurants in a single neighborhood, which makes it the candidate of choice

In [317]:
NorthYork_merged.loc[NorthYork_merged['Cluster_Labels'] == 1, NorthYork_merged.columns[[2] + list(range(5, NorthYork_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
59,Willowdale South,1,Korean Restaurant,Japanese Restaurant,Sushi Restaurant,Restaurant,Vietnamese Restaurant,Chinese Restaurant,Italian Restaurant,Middle Eastern Restaurant,Comfort Food Restaurant,Indonesian Restaurant


### Cluster 3 (label 2): This cluster appears to mix Asian and Western restaurants

In [318]:
NorthYork_merged.loc[NorthYork_merged['Cluster_Labels'] == 2, NorthYork_merged.columns[[2] + list(range(5, NorthYork_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,Don Mills North,2,Japanese Restaurant,Greek Restaurant,Thai Restaurant,Asian Restaurant,Caribbean Restaurant,Restaurant,Vietnamese Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant
13,"Flemingdon Park, Don Mills South",2,Restaurant,American Restaurant,Asian Restaurant,Japanese Restaurant,Sushi Restaurant,Chinese Restaurant,Middle Eastern Restaurant,Dim Sum Restaurant,Italian Restaurant,French Restaurant
33,"Fairview, Henry Farm, Oriole",2,Japanese Restaurant,American Restaurant,Caribbean Restaurant,Restaurant,Indian Restaurant,Asian Restaurant,Chinese Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant
34,"Northwood Park, York University",2,Restaurant,Chinese Restaurant,Middle Eastern Restaurant,Japanese Restaurant,Falafel Restaurant,Vietnamese Restaurant,Greek Restaurant,Asian Restaurant,Caribbean Restaurant,Comfort Food Restaurant
39,Bayview Village,2,Japanese Restaurant,Chinese Restaurant,Vietnamese Restaurant,Indian Restaurant,Asian Restaurant,Caribbean Restaurant,Comfort Food Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Empanada Restaurant
52,"Newtonbrook, Willowdale",2,Korean Restaurant,Middle Eastern Restaurant,Asian Restaurant,Japanese Restaurant,Indian Restaurant,Vietnamese Restaurant,Greek Restaurant,Caribbean Restaurant,Chinese Restaurant,Comfort Food Restaurant


## Results and Discussion <a name="results"></a>

After collecting data from different sources, then cleaning and transforming it, we reached the stage where we could retrieve useful information on the neighborhoods and venues that concern us. 
Although the scope is large in terms of the number of restaurants in North York, the clustering gave us a list of neighborhoods with a significant range of restaurants with which the business could start in this city. 

Now we focus on the cluster centers and compare their most common venues. The group whose center has the highest number of Asian restaurants is our best recommendation to the contractor. The neighborhood **Willowdale South** is a good candidate for launching the new coffee brand.
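
The comparison of cluster centers can be sketched as summing the Asian restaurant categories of each center and picking the largest total. The center values below are hypothetical, and which categories count as "Asian" is an assumption made for this example, not a definition taken from the analysis.

```python
import pandas as pd

# Hypothetical cluster centers (rows) over a few restaurant categories
toy_centers = pd.DataFrame(
    {
        'Italian Restaurant': [0.4, 2.0, 0.2],
        'Japanese Restaurant': [0.1, 5.0, 1.8],
        'Korean Restaurant': [0.1, 5.0, 0.7],
        'Vietnamese Restaurant': [0.4, 2.0, 0.0],
    },
    index=['G1', 'G2', 'G3'],
)

# Assumption: which categories count as "Asian" is a judgment call
asian_categories = ['Japanese Restaurant', 'Korean Restaurant', 'Vietnamese Restaurant']

# Total Asian restaurant count per center, and the group with the largest total
toy_asian_total = toy_centers[asian_categories].sum(axis=1)
toy_best_group = toy_asian_total.idxmax()
print(toy_asian_total.sort_values(ascending=False))
print(toy_best_group)
```

Making the category list explicit keeps the recommendation reproducible if stakeholders later want to include or exclude a cuisine.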

Further criteria from stakeholders could help refine this study, for example by targeting customers in the area closest to headquarters. Still, this study is a starting point that gives a sufficient idea of the area to cover when distributing the new coffee brand. Transportation cost could be a key factor, as could the volumes traded, when targeting new customers in North York.

## Conclusion <a name="conclusion"></a>

The purpose of this project is to identify new areas and markets in North York, Toronto, where an established stakeholder can extend his business. Although coffee is Canada’s most consumed beverage among adults, even more than tap water, many places in North York (inhabited by people of many different cultures) still don't give it such value and importance. That is why we aimed at finding the best restaurants in North York that can take up and share the idea of promoting good-quality coffee with diverse flavours and tastes. By calculating the restaurant distribution from Foursquare data, we identified neighborhoods in North York that justify further analysis and then generated a collection of locations that satisfy some basic requirements regarding nearby restaurants. Clustering of those locations was then performed in order to create major zones of interest (containing the greatest number of potential locations), and addresses of those zone centers were created to be used as starting points for final exploration by stakeholders. 

The final decision on the optimal restaurant location will be made by stakeholders based on the specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to other restaurants or coffee shops), noise levels and proximity to major roads, and the social and economic dynamics of every neighborhood.