# <ins>London vs New York</ins>?
# What is your choice? Let the Data Decide!

##### Author - Chirag Sable

<br>

## <ins> Introduction</ins>

New York City and London are two of the world's most important metropolitan areas and the financial centers of their respective countries. The two are often compared to decide which is the better city to live in, and both have very strong tourism sectors. They compete on the quality of life, jobs, education, entertainment and recreational facilities that they offer their residents.

This project analyzes the neighborhoods in each of these two cities to understand what is popular in them and what they have to offer to someone who is choosing between the two as a place to live.

## Table of Contents


1. <a href="#item1">Download, Scrape and Wrangle London Dataset</a><br>

2. <a href="#item2">Explore Neighborhoods in London</a><br>

3. <a href="#item3">Analyze Each Neighborhood of London</a>

4. <a href="#item4">Cluster Neighborhoods of London</a>

5. <a href="#item5">Examine Clusters of London</a> 

6. <a href="#item6">Download and Explore Dataset of New York</a>

7. <a href="#item7">Explore Neighborhoods in New York</a>

8. <a href="#item8">Analyze Each Neighborhood of New York</a>

9. <a href="#item9">Cluster Neighborhoods of New York</a>

10. <a href="#item10">Examine Clusters of New York</a>


# <ins>LONDON DATASET</ins>

* **Import the necessary libraries** 

    A few libraries cannot be imported directly, i.e. you need to install them first. The installation command for each such library is given in the comments.

In [7]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#uncomment and install it, I've already installed 
#!conda install -c conda-forge geopy --yes
!pip install geocoder
from geopy.geocoders import Nominatim 
import geocoder

import requests 
from pandas import json_normalize # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

#uncomment and install it, I've already installed
#!conda install -c conda-forge folium=0.5.0 --yes 
!pip install folium
import folium

# import the library we use to open URLs
import urllib.request

from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>
## 1. Download, Scrape and Wrangle London Dataset

* **Download, Scrape and Wrangle**

    **Web scraping** (also known as screen scraping, data scraping, web harvesting, web data extraction and a multitude of other aliases) is a method for extracting data from web pages. We scrape the Wikipedia page using Python, Urllib, Beautiful Soup and Pandas. The website used for reference is: https://simpleanalytical.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas.
    
    There are different website scraping libraries and packages in Python. One of the most common packages is BeautifulSoup and we will use it in this project. The package's main documentation page is: http://beautiful-soup-4.readthedocs.io/en/latest/
    
    <ins>Urllib.request</ins>: As we are using Python 3.7, we will use urllib.request to fetch the HTML from the URL we specify that we want to scrape.
    
    <ins>BeautifulSoup</ins>: Once urllib.request has pulled in the content from the URL, we use BeautifulSoup to extract and work with the data within it. BeautifulSoup4 has a multitude of functions at its disposal to make this easy for us.
    
    The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood, taken from the table's *Postcode district*, *London borough* and *Location* columns. More than one neighborhood can exist in one postal code area. For example, postcode BR1 in the borough of Bromley covers Bromley, Plaistow, Sundridge and Widmore. Such rows will be combined into one row with the neighborhoods separated by commas.

Using the **urllib.request** library, we query the page and put the HTML data into a variable (which we have called ‘page’). Then we use **Beautiful Soup** to parse the HTML stored in ‘page’ and store the result in a new variable called ‘soup’. Beautiful Soup prefers that we specify a parser explicitly, so we use the “lxml” option.

In [8]:
url = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'

# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

To get an idea of the structure of the underlying HTML in our web page, we view the code using **Beautiful Soup’s prettify function** and check it out right there in our Jupyter Notebook. 

<ins>Find the table we want</ins>:

* By looking at our Wikipedia page, we can see there is a **LOT** of information in there. The table which we are looking for is already set up in nice rows and columns which should make our job a little easier as beginner web scrapers.
    
* The table starts with an **HTML table tag** with a class identifier of "**wikitable sortable**". We’ll make a note of that for later use.
    
* Scroll down a little to see how the table is made up and you’ll see the rows start and end with **tr** tags.

* The top row of headers has **th** tags, while the data rows beneath, one for each area, have **td** tags. It’s from these tags that we will tell Python to extract our data.

In [9]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of areas of London - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"b81dbd96-e80d-4719-a090-8bf7bd1c4d4e","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_areas_of_London","wgTitle":"List of areas of London","wgCurRevisionId":947527724,"wgRevisionId":947527724,"wgArticleId":11915713,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Use dmy dates from August 2015","Use British English from August 2015","Lists of coord

We know the data resides within an **HTML** table so we send **Beautiful Soup** off to retrieve all instances of the **table** tag within the page with class "**wikitable sortable**". We can use this to get Beautiful Soup to only bring back the table data for this particular table and keep that in a variable called "**right_table**":

In [10]:
right_table=soup.find('table', class_='wikitable sortable')
right_table

<table class="wikitable sortable" style="clear:both;">
<tbody><tr>
<th>Location</th>
<th>London borough</th>
<th>Post town</th>
<th>Postcode district</th>
<th>Dial code</th>
<th>OS grid ref
</th></tr>
<tr>
<td><a href="/wiki/Abbey_Wood" title="Abbey Wood">Abbey Wood</a></td>
<td>Bexley,  Greenwich <sup class="reference" id="cite_ref-mills1_7-0"><a href="#cite_note-mills1-7">[7]</a></sup></td>
<td>LONDON</td>
<td>SE2</td>
<td>020</td>
<td><span class="plainlinks nourlexpansion" style="white-space: nowrap"><a class="external text" href="https://tools.wmflabs.org/geohack/en/51.48648031512;0.10859224316653_region:GB_scale:25000?pagename=List_of_areas_of_London">TQ465785</a></span>
</td></tr>
<tr>
<td><a href="/wiki/Acton,_London" title="Acton, London">Acton</a></td>
<td>Ealing, Hammersmith and Fulham<sup class="reference" id="cite_ref-mills2_8-0"><a href="#cite_note-mills2-8">[8]</a></sup></td>
<td>LONDON</td>
<td>W3, W4</td>
<td>020</td>
<td><span class="plainlinks nourlexpansion" style="

<ins>Loop through the rows</ins>:

We have to **loop through the rows** to get the data for every area in the table. The table is well structured, with each area having its own row, which makes things somewhat easier.

There are six columns in our table, so we will set up six empty lists (A to F) to store the data in, even though only three of them are used later.

To start with, **we use the Beautiful Soup ‘find_all’ function again** (spelled `findAll` in the code, an older alias) and set it to look for the string ‘tr’. We then set up a **FOR** loop that goes through the rows one by one.

Within the loop we use **find_all** again to search each row for `<td>` tags with the ‘td’ string. We add all of these to a variable called ‘cells’ and then check that there are 6 items in our 'cells' list (i.e. one for each column).

If there are, we use the `find(text=True)` option to extract the content string from within each `<td>` element in that row and add it to the A-F lists we created at the start of this step. Let’s have a look at the code:

In [11]:
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]

for row in right_table.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==6:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        D.append(cells[3].find(text=True))
        E.append(cells[4].find(text=True))
        F.append(cells[5].find(text=True))
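
The row-and-cell loop can be tried offline on a small inline table. The sketch below is only an illustration: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra packages or network access, and the `TableRowParser` class and the three-column sample table are assumptions made for this example, not part of the notebook.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of the <td> cells of every <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell strings per row
        self._cells = None   # cells of the row currently being read
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells = []
        elif tag == 'td' and self._cells is not None:
            self._in_td = True
            self._cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None

    def handle_data(self, data):
        # only text inside a <td> is kept; header (<th>) text is ignored
        if self._in_td:
            self._cells[-1] += data


html = """
<table>
<tr><th>Location</th><th>London borough</th><th>Postcode district</th></tr>
<tr><td>Addington</td><td>Croydon</td><td>CR0</td></tr>
<tr><td>Albany Park</td><td>Bexley</td><td>DA5, DA14</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(html)

# The header row contains no <td> cells, so the length check drops it,
# in the same way the len(cells) == 6 check does in the notebook.
data_rows = [cells for cells in parser.rows if len(cells) == 3]
locations = [cells[0] for cells in data_rows]
postcodes = [cells[2] for cells in data_rows]
```

The length check is what keeps the header row out of the lists: it is made of `th` cells only, so it contributes an empty list of `td` cells.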

Let's convert these lists into a **DataFrame**, assigning lists D, B and A to columns named PostalCode, Borough and Neighborhood respectively.

In [12]:
df_london = pd.DataFrame(D, columns=['PostalCode'])
df_london['Borough'] = B
df_london['Neighborhood'] = A
df_london.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | SE2 | Bexley, Greenwich | Abbey Wood |
| 1 | W3, W4 | Ealing, Hammersmith and Fulham | Acton |
| 2 | CR0 | Croydon | Addington |
| 3 | CR0 | Croydon | Addiscombe |
| 4 | DA5, DA14 | Bexley | Albany Park |


Let us see the shape of the dataset.

In [13]:
df_london.shape

(533, 3)

Now, we will combine the rows which have the same Postal Code and Borough, joining their neighborhoods into one comma-separated string.

In [14]:
df_london = df_london.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
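
To see what this grouping does, here is a minimal sketch on a three-row frame. The rows are taken from the table above; the names `areas` and `merged` are chosen only for this illustration.

```python
import pandas as pd

areas = pd.DataFrame({
    'PostalCode': ['CR0', 'CR0', 'SE2'],
    'Borough': ['Croydon', 'Croydon', 'Bexley, Greenwich'],
    'Neighborhood': ['Addington', 'Addiscombe', 'Abbey Wood'],
})

# rows sharing the same (PostalCode, Borough) pair collapse into one row,
# with their neighborhood names joined by ', '
merged = (
    areas.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .apply(', '.join)
    .reset_index()
)
```

The two Croydon rows for CR0 end up as a single row whose Neighborhood is `Addington, Addiscombe`, while the SE2 row is left as it was.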

In [15]:
df_london.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | BR1 | Bromley | Bromley, Plaistow, Sundridge, Widmore |
| 1 | BR1 | Lewisham | Downham |
| 2 | BR2 | Bromley | Hayes, Keston, Leaves Green, Southborough |
| 3 | BR3 | Bromley | Bickley, Bromley Common, Eden Park, Elmers End |
| 4 | BR3, SE20 | Bromley | Beckenham |


In [16]:
df_london.shape

(320, 3)

Where a row lists more than one postcode **(for example, *Beckenham* in *Bromley* has 2 postcodes - *BR3* and *SE20*)**, the postcodes are spread over multiple rows, each keeping the same values in the other columns.

In [17]:
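# split comma-separated postcodes into one row each; strip() removes the space left after each comma
df0 = df_london.drop('PostalCode', axis=1).join(df_london['PostalCode'].str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True).rename('PostalCode'))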
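
The same reshaping can be written with `DataFrame.explode` (available since pandas 0.25). This is an alternative sketch rather than the notebook's method; the two-row frame and the variable names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'PostalCode': ['BR3, SE20', 'BR4'],
    'Borough': ['Bromley', 'Bromley'],
    'Neighborhood': ['Beckenham', 'Coney Hall, West Wickham'],
})

exploded = (
    df.assign(PostalCode=df['PostalCode'].str.split(','))   # string -> list of postcodes
    .explode('PostalCode')                                   # one row per list element
    .assign(PostalCode=lambda d: d['PostalCode'].str.strip())  # drop the space after the comma
    .reset_index(drop=True)
)
```

Beckenham appears twice in the result, once for BR3 and once for SE20, and the column order of the original frame is kept.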

In [18]:
df0.head(10)

| | Borough | Neighborhood | PostalCode |
|---|---|---|---|
| 0 | Bromley | Bromley, Plaistow, Sundridge, Widmore | BR1 |
| 1 | Lewisham | Downham | BR1 |
| 2 | Bromley | Hayes, Keston, Leaves Green, Southborough | BR2 |
| 3 | Bromley | Bickley, Bromley Common, Eden Park, Elmers End | BR3 |
| 4 | Bromley | Beckenham | BR3 |
| 4 | Bromley | Beckenham | SE20 |
| 5 | Bromley | Coney Hall, West Wickham | BR4 |
| 6 | Bromley | Derry Downs, Petts Wood, St Mary Cray, St Paul... | BR5 |
| 7 | Bromley | Orpington | BR5 |
| 7 | Bromley | Orpington | BR6 |


Reset the indexes, so that the repeated index values produced by the split become a clean 0..n-1 range.

In [19]:
df0 = df0.reset_index(drop=True)
df0.head()

| | Borough | Neighborhood | PostalCode |
|---|---|---|---|
| 0 | Bromley | Bromley, Plaistow, Sundridge, Widmore | BR1 |
| 1 | Lewisham | Downham | BR1 |
| 2 | Bromley | Hayes, Keston, Leaves Green, Southborough | BR2 |
| 3 | Bromley | Bickley, Bromley Common, Eden Park, Elmers End | BR3 |
| 4 | Bromley | Beckenham | BR3 |


In [20]:
df0.shape

(418, 3)

Now, bring the column **PostalCode** to the front.

In [21]:
df0 = df0.set_index('PostalCode').reset_index()

In [22]:
df0.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | BR1 | Bromley | Bromley, Plaistow, Sundridge, Widmore |
| 1 | BR1 | Lewisham | Downham |
| 2 | BR2 | Bromley | Hayes, Keston, Leaves Green, Southborough |
| 3 | BR3 | Bromley | Bickley, Bromley Common, Eden Park, Elmers End |
| 4 | BR3 | Bromley | Beckenham |


To obtain the coordinates of each postcode, the `geocoder` package is used with its ArcGIS provider (`geocoder.arcgis`) to look up the latitude and longitude.

In [23]:
# Geocoder starts here
# Define a helper function, get_latlng()
def get_latlng(postal_code):
    
    # Initialize the location (lat. and long.) to None
    lat_lng_coords = None
    
    # Keep retrying until the geocoder returns coordinates.
    # Note: this loops forever if a postcode can never be geocoded.
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, London, United Kingdom'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords
# Geocoder ends here
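
Because the loop above retries without limit, a postcode that never resolves would hang the notebook. One possible variation, sketched here under assumptions, caps the number of attempts. The function `get_latlng_with_retries`, its parameters and the stand-in `flaky_geocode` are hypothetical names introduced for this example; no real geocoding service is called.

```python
import time


def get_latlng_with_retries(query, geocode, max_attempts=3, pause_seconds=0.0):
    """Return the result of geocode(query), or None after max_attempts failures."""
    for _ in range(max_attempts):
        coords = geocode(query)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)  # wait a little before trying again
    return None


# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {'count': 0}


def flaky_geocode(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [51.49, 0.12]


found = get_latlng_with_retries('SE2, London, United Kingdom', flaky_geocode)
missing = get_latlng_with_retries('nowhere', lambda q: None, max_attempts=2)
```

In the notebook, the `geocode` argument could be something like `lambda q: geocoder.arcgis(q).latlng`, and rows that come back as `None` could be inspected or dropped afterwards.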

Testing the function above for a sample postcode - `SE2`.

In [24]:
sample = get_latlng('SE2')

In [None]:
sample

[51.492450000000076, 0.12127000000003818]

The geocoder works for the sample postcode, so we proceed to apply it to our dataframe `df0`.

In [None]:
postal_codes = df0['PostalCode']
coordinates = [get_latlng(postal_code) for postal_code in postal_codes.tolist()]

Then we store the location data (latitude and longitude) as follows. The obtained coordinates are joined to `df0` to create a new dataframe.

In [None]:
df_london_loc = df0

# The obtained coordinates (latitude and longitude) are joined with the dataframe as shown
df_london_coordinates = pd.DataFrame(coordinates, columns=['Latitude', 'Longitude'])
df_london_loc['Latitude'] = df_london_coordinates['Latitude']
df_london_loc['Longitude'] = df_london_coordinates['Longitude']

df_london_loc.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | BR1 | Bromley | Bromley, Plaistow, Sundridge, Widmore | 51.41671 | 0.009042 |
| 1 | BR1 | Lewisham | Downham | 51.41671 | 0.009042 |
| 2 | BR2 | Bromley | Hayes, Keston, Leaves Green, Southborough | 51.50642 | -0.12721 |
| 3 | BR3 | Bromley | Bickley, Bromley Common, Eden Park, Elmers End | 51.415095 | -0.035403 |
| 4 | BR3 | Bromley | Beckenham | 51.415095 | -0.035403 |


In [None]:
df_london_loc.shape

(418, 5)

* **Use geopy library to get the latitude and longitude values of London** 
    
    In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ***ny_explorer***, as shown below.

In [None]:
address = 'London, UK'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London are 51.5073219, -0.1276474.


* **Create a map of London with neighborhoods superimposed on top.**

In [None]:
map_london = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood, postal_code in zip(df_london_loc['Latitude'], df_london_loc['Longitude'], df_london_loc['Borough'], df_london_loc['Neighborhood'], df_london_loc['PostalCode']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_london)  
    
map_london

**Note:** Unfortunately, folium maps do not render natively on GitHub. To view the map, paste the GitHub link to your ```.ipynb``` file into nbviewer.org, which renders the full interactive output for any valid ```folium.Map``` instance.




Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

* **Define Foursquare Credentials and Version**
    
    Since this is a sensitive cell it's been hidden.
    
    However, the format is:
    > CLIENT_ID = 'your Foursquare ID'    
    CLIENT_SECRET = 'your Foursquare Secret'    
    VERSION = '20180605'

In [None]:
# The code was removed by Watson Studio for sharing.

* **Let's explore the neighborhood *Hither Green* in our dataframe**.

    Get the neighborhood's name.

In [None]:
df_london_loc.loc[257,'Neighborhood']

'Hither Green, Lewisham'

 Get the neighborhood's latitude and longitude values.

In [None]:
lewisham_lat = df_london_loc.loc[257, 'Latitude']
lewisham_long = df_london_loc.loc[257, 'Longitude']
lewisham_loc = df_london_loc.loc[257, 'Neighborhood']
lewisham_postcode = df_london_loc.loc[257, 'PostalCode']
print('The latitude and longitude values of {} with postcode {}, are {}, {}.'.format(lewisham_loc, lewisham_postcode, lewisham_lat, lewisham_long))

The latitude and longitude values of Hither Green, Lewisham with postcode SE13, are 51.46196000000003, -0.007539999999949032.


* **Now, let's get the top 100 venues that are in Hither Green, Lewisham within a radius of 2000 meters**.

    First, let's create the GET request URL. Name your URL **url**.

In [None]:
radius = 2000
LIMIT = 100
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET,lewisham_lat, lewisham_long, VERSION, radius, LIMIT)
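
As a hedged alternative to formatting the query string by hand, the parameters can be passed through `urllib.parse.urlencode`, which takes care of escaping. The helper `build_explore_url` and the placeholder credentials below are assumptions made for this sketch; nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Build the venues/explore URL with encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# placeholder credentials, not real ones
example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', 51.46196, -0.00754, '20180605', 2000, 100)

# parse the query string back to check what would be sent
query = parse_qs(urlsplit(example_url).query)
```

With `requests`, the same dictionary can also be handed over directly via `requests.get(base_url, params=params)`, which performs the encoding itself.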

Send the GET request and examine the results.

In [None]:
results = requests.get(url).json()
# results

The full '**results**' object is not displayed because it is very large. However, a snapshot of its beginning is shown below to give an idea of its structure.

`{'meta': {'code': 200, 'requestId': '5ed642f2a2e538001b20a5da'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Lewisham Central',
  'headerFullLocation': 'Lewisham Central, London',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 186,
  'suggestedBounds': {'ne': {'lat': 51.47996001800005,
    'lng': 0.021296961190459426},
   'sw': {'lat': 51.44395998200002, 'lng': -0.03637696119035749}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '535823bc498ec8d8da9aad5f',
       'name': 'Street Feast Model Market',
       'location': {'address': '196 Lewisham High St',
        'crossStreet': 'entrance at Molesworth St',`

Let's create a function ***get_category_type***.

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
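
The function tries two keys because the category list can sit under different names: a raw venue dictionary exposes it as `categories`, while a row of a frame flattened with `json_normalize` exposes it as `venue.categories`. The following sketch shows both cases on made-up items; the venue names are hypothetical and the function is repeated so that the block runs on its own.

```python
import pandas as pd


def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# made-up items shaped like entries of the Foursquare response
items = [
    {'venue': {'name': 'Example Pub', 'categories': [{'name': 'Pub'}]}},
    {'venue': {'name': 'Unlabelled Place', 'categories': []}},
]

# flattened rows have dotted column names such as 'venue.categories'
flat = pd.json_normalize(items)
flat_categories = flat.apply(get_category_type, axis=1).tolist()

# a raw venue dict has a plain 'categories' key
raw_category = get_category_type(items[0]['venue'])
```

A venue with an empty category list yields `None` rather than raising an error.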

Now we are ready to clean the json and structure it into a pandas dataframe.

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Street Feast Model Market | Street Food Gathering | 51.460209 | -0.012199 |
| 1 | Maggie's Kitchen | Café | 51.46538 | -0.011213 |
| 2 | Gennaro Delicatessan | Deli / Bodega | 51.461765 | -0.009726 |
| 3 | Levante restaurant | Restaurant | 51.462072 | -0.009491 |
| 4 | Dirty South | Pub | 51.458846 | -0.002666 |
| 5 | Levante Pide Restaurant | Turkish Restaurant | 51.459848 | -0.011476 |
| 6 | Corte | Coffee Shop | 51.459776 | -0.011554 |
| 7 | Manor House Gardens | Park | 51.456686 | 0.004684 |
| 8 | The Sausage Man | Food Truck | 51.462507 | -0.010248 |
| 9 | Côte Brasserie | French Restaurant | 51.467378 | 0.007176 |


In [None]:
nearby_venues_lewisham_unique = nearby_venues['categories'].value_counts().to_frame(name='Count')
nearby_venues_lewisham_unique.head(5)

| | Count |
|---|---|
| Pub | 13 |
| Café | 8 |
| Park | 6 |
| Gastropub | 5 |
| Coffee Shop | 4 |


Interestingly, even though there are restaurants in the Lewisham area, they are not among the top 5 venue categories. Note that since we are limited by data availability, our conclusions rest only on the data we have.

And how many venues were returned by Foursquare?

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


<br>

## 2. Explore Neighborhoods in London

* **Let's create a function to repeat the same process for all the neighborhoods in London**

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        try:
            # make the GET request
            results = requests.get(url).json()["response"]['groups'][0]['items']
            #results = requests.get(url).json()["response"]['venues']
            #print(results)

            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
        except KeyError:
            pass
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
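
One detail worth noting: `v['venue']['categories'][0]['name']` raises an `IndexError` when a venue has an empty category list, and the `except KeyError` above would not catch that. A possible way to guard against it is sketched below; the helper `venue_row` and the sample items are hypothetical and only mirror the keys that the function reads.

```python
def venue_row(name, lat, lng, item):
    """Build one output row, tolerating venues that have no category."""
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else None
    return (
        name, lat, lng,
        venue['name'],
        venue['location']['lat'],
        venue['location']['lng'],
        category,
    )


# made-up items shaped like the entries of response -> groups[0] -> items
items = [
    {'venue': {'name': 'Example Pub',
               'location': {'lat': 51.46, 'lng': -0.01},
               'categories': [{'name': 'Pub'}]}},
    {'venue': {'name': 'Unlabelled Place',
               'location': {'lat': 51.47, 'lng': -0.02},
               'categories': []}},
]

rows = [venue_row('Hither Green, Lewisham', 51.462, -0.0075, item) for item in items]
```

Rows built this way keep a venue even when its category is missing, leaving `None` in the last position so that it can be filtered out later if needed.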

* **Now write the code to run the above function on each neighborhood and create a new dataframe called _london_venues_**.

In [None]:
london_venues = getNearbyVenues(names=df_london_loc['Neighborhood'],
                                   latitudes=df_london_loc['Latitude'],
                                   longitudes=df_london_loc['Longitude']
                                  )

Bromley, Plaistow, Sundridge, Widmore
Downham
Hayes, Keston, Leaves Green, Southborough
Bickley, Bromley Common, Eden Park, Elmers End
Beckenham
Beckenham
Coney Hall, West Wickham
Derry Downs, Petts Wood, St Mary Cray, St Paul's Cray
Orpington
Orpington
Chelsfield, Downe, Goddington, Hazelwood, Locksbottom, Pratt's Bottom
Chislehurst, Elmstead
Addington, Addiscombe, Coombe, Croydon, Forestdale, New Addington, Shirley, Waddon, Woodside
Sanderstead, Selsdon, South Croydon
Mitcham
Coulsdon, Old Coulsdon
Thornton Heath
Kenley, Purley, Riddlesdown
Blendon
Barnes Cray, Crayford
Foots Cray, North Cray
Ruxley, Upper Ruxley
Ruxley, Upper Ruxley
Sidcup
Sidcup
Blackfen, Lamorbey
East Wickham, Welling
Belvedere, Lessness Heath
Bexley
Dartford
Albany Park
Albany Park
Crook Log, Upton
Bexleyheath
Bexleyheath
Bexleyheath
Barnehurst
Colyers, North End, Northumberland Heath, Slade Green
Erith
Erith
Mile End, Ratcliff, Shadwell, Spitalfields, Stepney, Wapping, Whitechapel
Lea Bridge
Leyton
Leyton
Wanste

* **Let's check the size of the resulting dataframe**

In [None]:
print(london_venues.shape)
london_venues.head()

Let's check how many venues were returned for each neighborhood

In [None]:
london_venues.groupby('Neighbourhood').count().head(10)

* **Let's find out how many unique categories can be curated from all the returned venues**.

In [None]:
print('There are {} unique categories.'.format(len(london_venues['Venue Category'].unique())))

* **Let's count the top venues**.

In [None]:
venue_unique_count = london_venues['Venue Category'].value_counts().to_frame(name='Count')
venue_unique_count.head()

<br>

## 3. Analyze Each Neighborhood of London

In [None]:
# one hot encoding
london_onehot = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
london_onehot['Neighborhood'] = london_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [london_onehot.columns[-1]] + list(london_onehot.columns[:-1])
london_onehot = london_onehot[fixed_columns]

london_onehot.head()

And let's examine the new dataframe size.

In [None]:
london_onehot.shape

* **Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category**

In [None]:
london_grouped = london_onehot.groupby('Neighborhood').mean().reset_index()
london_grouped.head(10)
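
To make the one-hot encoding and the mean concrete, here is a small sketch with four made-up venues in two neighborhoods, "A" and "B". The data and variable names are assumptions for this illustration only.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Pub', 'Pub', 'Café', 'Park'],
})

# one column per category, with 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="", dtype=float)
onehot['Neighborhood'] = venues['Neighbourhood']

# the mean of a 0/1 column is the share of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
```

Neighborhood A has three venues, two of them pubs, so its Pub value is 2/3 and its Café value is 1/3. Each row of the grouped frame sums to 1 across the category columns.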

* **Let's confirm the new size**

In [None]:
london_grouped.shape

* **Let's get each neighborhood along with the top 5 most common venues**

In [None]:
num_top_venues = 5

for hood in london_grouped['Neighborhood']:
    print('----'+hood+'----')
    temp = london_grouped[london_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq':2})
    print(temp.sort_values('freq', ascending = False).reset_index(drop = True).head(num_top_venues))
    print('\n')

* **Let's put that into a _pandas_ dataframe**
    
    First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
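
A quick check of the function on a single made-up row shows how it ranks the categories. The sketch repeats the function with one small addition, an explicit cast to float, which is an assumption made here to keep the sort numeric; the sample values are hypothetical.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighborhood name) and rank the rest
    row_categories = row.iloc[1:].astype(float)
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


row = pd.Series({'Neighborhood': 'A', 'Café': 0.3, 'Park': 0.2, 'Pub': 0.5})
top_two = list(return_most_common_venues(row, 2))
```

The function returns category names, not frequencies, ordered from the most to the least frequent.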

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = london_grouped['Neighborhood']

for ind in np.arange(london_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(london_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
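
The `try`/`except` above picks "st", "nd" or "rd" for the first three columns and falls back to "th" for the rest, which is enough for ten columns. If more columns were ever needed, a small helper could produce the suffixes instead; the function `ordinal` below is a hypothetical sketch, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
```

For ten columns this gives the same headers as the cell above.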

In [None]:
neighborhoods_venues_sorted.shape

<br>

## 4. Cluster Neighborhoods of London

Run k-means to cluster the neighborhood into 5 clusters.

In [None]:
#set number of clusters
kclusters = 5

london_grouped_clustering = london_grouped.drop(columns='Neighborhood')

#run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(london_grouped_clustering)

#check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
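
For intuition about what k-means does with the frequency table, the sketch below runs the basic assign-and-update iterations in plain NumPy on four made-up rows. It is a simplified stand-in, not scikit-learn's implementation, which among other things chooses its starting centers more carefully. The function `kmeans_labels` and the data are assumptions for this example.

```python
import numpy as np


def kmeans_labels(points, k, n_iter=20, seed=0):
    """Basic k-means: assign points to the nearest center, then move the centers."""
    rng = np.random.default_rng(seed)
    # start from k distinct rows picked at random
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


# rows = neighborhoods, columns = share of (Pub, Park) venues; values are made up
freqs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels = kmeans_labels(freqs, k=2)
```

The two pub-heavy rows end up in one cluster and the two park-heavy rows in the other. The label numbers themselves are arbitrary, which is also true of the cluster labels produced in the notebook.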

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

london_merged = df_london_loc

london_merged = london_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on = 'Neighborhood')

london_merged.head()

In [None]:
# drop rows without venue data and keep the re-numbered index
london_merged = london_merged.dropna().reset_index(drop=True)
london_merged

In [None]:
df_exp = london_merged[london_merged.columns[6:16]]
df_exp

In [None]:
df1 = df_exp.stack().value_counts()
df1

In [None]:
df1 = pd.DataFrame(data = df1)
df1

In [None]:
df1 = df1.reset_index()

In [None]:
df1.rename(columns = {'index':'Venues',0:'Count'}, inplace = True)

In [None]:
df1 = df1.sort_values(by='Venues', ascending = True)

In [None]:
df1.reset_index(inplace=True, drop = True)

In [None]:
df1

In [None]:
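%matplotlib inline
import matplotlib.pyplot as plt

# draw one horizontal bar per venue category in a single figure
# (grouping by 'Venues' first would create a separate plot for every category)
df1.plot.barh(x='Venues', y='Count', figsize=(10, 40))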


In [None]:
london_merged.shape

In [None]:
neighborhoods_venues_sorted.shape

* **Finally, let's visualize the resulting clusters**.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(london_merged['Latitude'], london_merged['Longitude'], london_merged['Neighborhood'], london_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

**Note:** Unfortunately, folium maps do not render natively on GitHub. To view the map, paste the GitHub link into https://nbviewer.jupyter.org/, which renders the full interactive output for any valid ```folium.Map``` instance.

<br>

## 5. Examine Clusters of London

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster.

#### Cluster 1

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 0, london_merged.columns[[1] + [0] + list(range(5, london_merged.shape[1]))]]

#### Cluster 2

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 1, london_merged.columns[[1] + [0] + list(range(5, london_merged.shape[1]))]]

#### Cluster 3

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 2, london_merged.columns[[1] + [0] + list(range(5, london_merged.shape[1]))]]

#### Cluster 4

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 3, london_merged.columns[[1] + [0] + list(range(5, london_merged.shape[1]))]]

#### Cluster 5

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 4, london_merged.columns[[1] + [0] + list(range(5, london_merged.shape[1]))]]

<br>

<br>

<br>

# <ins>NEW YORK DATASET</ins>

## 6. Download and Explore Dataset of New York

New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, this dataset exists for free on the web. Here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

Simply run a `wget` command and access the data. So let's go ahead and do that.

In [None]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

* **Load and explore the data**

    Next, let's load the data.

In [None]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Let's take a quick look at the data.

In [None]:
neighborhoods_data = newyork_data['features']
neighborhoods_data[0]

* **Transform the data into a *pandas* dataframe**
    
    The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe. Take a look at the empty dataframe to confirm that the columns are as intended.

In [None]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

neighborhoods

Then let's loop through the data, collect one row per neighborhood, and build the dataframe from those rows.

In [None]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)
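
The same flattening can also be done without an explicit loop. The sketch below assumes features shaped like the entries of `newyork_data['features']` and uses `pd.json_normalize`; the two sample features and their coordinates are made up for illustration, and `sample_neighborhoods` is a hypothetical name.

```python
import pandas as pd

# Two hypothetical features mimicking the structure of the dataset;
# the coordinate values are invented for this example.
sample_features = [
    {'properties': {'borough': 'Bronx', 'name': 'Wakefield'},
     'geometry': {'type': 'Point', 'coordinates': [-73.85, 40.89]}},
    {'properties': {'borough': 'Manhattan', 'name': 'Marble Hill'},
     'geometry': {'type': 'Point', 'coordinates': [-73.91, 40.88]}},
]

# Nested dictionaries become dotted column names such as 'properties.name'.
flat = pd.json_normalize(sample_features)

# GeoJSON stores points as [longitude, latitude], so index 1 is the latitude.
sample_neighborhoods = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    'Latitude': flat['geometry.coordinates'].map(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].map(lambda c: c[0]),
})
print(sample_neighborhoods)
```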

Quickly examine the resulting dataframe.

In [None]:
print(neighborhoods.shape)
neighborhoods.head()

And make sure that the dataset has all 5 boroughs and 306 neighborhoods.

In [None]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

* **Use geopy library to get the latitude and longitude values of New York** 
    
    In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ***ny_explorer***, as shown below.

In [None]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

* **Create a map of New York with neighborhoods superimposed on top.**

In [None]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

**Note:** Unfortunately, GitHub does not render folium maps natively. To view the map, paste the GitHub link to your ```.ipynb``` file into nbviewer.org, which renders any valid ```folium.Map``` instance as a fully interactive output.




Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

* **Define Foursquare Credentials and Version**
    
    Since this is a sensitive cell it's been hidden.
    
    However, the format is:
    > CLIENT_ID = 'your Foursquare ID'    
    CLIENT_SECRET = 'your Foursquare Secret'    
    VERSION = '20180605'

In [None]:
# The code was removed by Watson Studio for sharing.

* **Let's explore the neighborhood *Marble Hill* in our dataframe**.

    Get the neighborhood's name.

In [None]:
neighborhoods.loc[6, 'Neighborhood']

Get the neighborhood's latitude and longitude values.

In [None]:
neighborhood_latitude = neighborhoods.loc[6, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[6, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods.loc[6, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

* **Now, let's get the top 100 venues that are in Marble Hill within a radius of 500 meters**.

    First, let's create the GET request URL. Name your URL **url**.

In [None]:
radius = 500
LIMIT = 100
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET,neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
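
Formatting the query string by hand works here, but values are not URL-escaped. A minimal sketch of the same URL assembled with the standard library's `urlencode` follows. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration; only the endpoint and parameter names come from the cell above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Assemble the explore URL with URL-escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude pair
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials and rounded coordinates, not real values.
example_url = build_explore_url('ID', 'SECRET', 40.88, -73.91, '20180605', 500, 100)
print(example_url)
```

Note that `urlencode` escapes the comma in `ll` as `%2C`, which is the standard percent-encoding of that character.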

In [None]:
results = requests.get(url).json()
# 'results' is not displayed because the response is very large; a snapshot is shown below.
#results

The full '**results**' object is not displayed because it is very large. However, a snapshot of its beginning is shown below to give an idea of its structure.

`{'meta': {'code': 200, 'requestId': '5ed6474477af03001b7face2'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Marble Hill',
  'headerFullLocation': 'Marble Hill, New York',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 26,
  'suggestedBounds': {'ne': {'lat': 40.88105078329964,
    'lng': -73.90471933917806},
   'sw': {'lat': 40.87205077429964, 'lng': -73.91659997808156}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b4429abf964a52037f225e3',
       'name': "Arturo's",
       'location': {'address': '5198 Broadway',
        'crossStreet': 'at 225th St.',`

Let's create a function ***get_category_type***.

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
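
To see what the function returns, it can be called on a few hand-made rows. The rows below are hypothetical and only mimic the two key layouts the function expects; the function is repeated (with the fallback limited to `KeyError`) so that the block runs on its own.

```python
def get_category_type(row):
    """Return the name of the first category of a venue row, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened rows carry the key with a 'venue.' prefix
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows; plain dicts are enough for the lookup logic.
row_plain = {'categories': [{'name': 'Pizza Place'}, {'name': 'Bar'}]}
row_flat = {'venue.categories': [{'name': 'Coffee Shop'}]}
row_empty = {'venue.categories': []}

print(get_category_type(row_plain))
print(get_category_type(row_flat))
print(get_category_type(row_empty))
```

Only the first category is kept, so a venue with several categories is represented by its primary one.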

Now we are ready to clean the json and structure it into a pandas dataframe.

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

And how many venues were returned by Foursquare?

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

<br>

## 7. Explore Neighborhoods in New York

* **Let's create a function to repeat the same process for all the neighborhoods in New York**

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        try:
            # make the GET request
            results = requests.get(url).json()["response"]['groups'][0]['items']

            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
        except KeyError:
            pass

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

    return nearby_venues
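
One caveat: if a venue comes back with an empty category list, `v['venue']['categories'][0]` raises an `IndexError`, which the `except KeyError` clause does not catch. A minimal sketch of a more tolerant row builder follows. The helper `extract_venue_rows` and the sample items are hypothetical; they only mimic the shape of the `items` list used above.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten one neighborhood's items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None instead of failing on an empty category list
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Made-up items: the second venue has no categories.
sample_items = [
    {'venue': {'name': 'Venue A', 'location': {'lat': 40.87, 'lng': -73.90},
               'categories': [{'name': 'Pizza Place'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 40.88, 'lng': -73.91},
               'categories': []}},
]

sample_rows = extract_venue_rows('Marble Hill', 40.88, -73.91, sample_items)
print(sample_rows)
```

Rows with a `None` category can then be dropped or kept explicitly before the one-hot encoding step.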

* **Now write the code to run the above function on each neighborhood and create a new dataframe called _ny_venues_**.

In [None]:
ny_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

* **Let's check the size of the resulting dataframe**

In [None]:
print(ny_venues.shape)
ny_venues.head()

Let's check how many venues were returned for each neighborhood

In [None]:
ny_venues.groupby('Neighborhood').count()

* **Let's find out how many unique categories can be curated from all the returned venues**.

In [None]:
print('There are {} unique categories.'.format(len(ny_venues['Venue Category'].unique())))

* **Let's count the top venues**.

In [None]:
venue_unique_count = ny_venues['Venue Category'].value_counts().to_frame(name='Count')
venue_unique_count.head()

<br>

## 8. Analyze Each Neighborhood of New York

In [None]:
# one hot encoding
ny_onehot = pd.get_dummies(ny_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ny_onehot['Neighborhood'] = ny_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[fixed_columns]

ny_onehot.head()

And let's examine the new dataframe size.

In [None]:
ny_onehot.shape

* **Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category**

In [None]:
ny_grouped = ny_onehot.groupby('Neighborhood').mean().reset_index()
ny_grouped
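
Why the mean gives a frequency: after one-hot encoding, every venue row holds a 1 in its own category column and 0 elsewhere, so the per-neighborhood mean of a column is the share of that neighborhood's venues in that category. A small sketch on made-up data (the `toy_` names are illustrative only):

```python
import pandas as pd

# Hypothetical venues: neighborhood A has 2 parks and 1 pub, B has 1 pub.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Pub', 'Pub'],
})

# One column per category, 1.0 where the venue belongs to it.
toy_onehot = pd.get_dummies(toy_venues[['Venue Category']],
                            prefix='', prefix_sep='', dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of the 0/1 indicators is the category's share within each neighborhood.
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

Each row of the grouped frame sums to 1 across the category columns, which makes neighborhoods with different venue counts comparable.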

* **Let's confirm the new size**

In [None]:
ny_grouped.shape

* **Let's get each neighborhood along with the top 5 most common venues**

In [None]:
num_top_venues = 5

for hood in ny_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = ny_grouped[ny_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    #print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    #print('\n')

Again, the result is too big to display in full, but the snapshot below gives an idea of it.

`----Allerton----
           venue  freq
0    Pizza Place  0.17
1  Deli / Bodega  0.07
2    Supermarket  0.07
3  Grocery Store  0.07
4   Dessert Shop  0.03`


`----Annadale----
                 venue  freq
0  American Restaurant  0.21
1         Dance Studio  0.07
2     Sushi Restaurant  0.07
3        Train Station  0.07
4                 Park  0.07`


`----Arden Heights----
          venue  freq
0      Pharmacy  0.25
1   Coffee Shop  0.25
2   Pizza Place  0.25
3        Lawyer  0.25
4  Outlet Store  0.00`

* **Let's put that into a _pandas_ dataframe**
    
    First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
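
As a quick check of what the function returns, it can be applied to a single hand-made row. The row below is hypothetical; the function is repeated so that the block runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    """Return the category names with the highest frequencies in a grouped row."""
    row_categories = row.iloc[1:]  # skip the leading 'Neighborhood' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Made-up frequencies for one neighborhood.
toy_row = pd.Series({'Neighborhood': 'A', 'Pub': 0.3, 'Park': 0.5, 'Gym': 0.2})

top_two = return_most_common_venues(toy_row, 2)
print(top_two)
```

The function relies on the neighborhood name being the first entry of the row, which is why the one-hot frame moves that column to the front.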

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ny_grouped['Neighborhood']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
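
The column labels above are correct for the 10 ranks used here, but the same pattern would produce "21th" if `num_top_venues` were raised past 20. A small sketch of a general ordinal helper follows; `ordinal` is a hypothetical name and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Build the same kind of column labels as in the cell above.
example_columns = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 4)]
print(example_columns)
```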

<br>

## 9. Cluster Neighborhoods of New York

Run k-means to cluster the neighborhoods into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

ny_grouped_clustering = ny_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ny_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
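
Looking at the first ten labels says little about how balanced the clusters are. A quick way to count the members of each cluster is sketched below. The label list is made up; on the real data one could pass `kmeans.labels_.tolist()` instead.

```python
from collections import Counter

# Hypothetical labels standing in for kmeans.labels_; real values will differ.
toy_labels = [0, 0, 2, 1, 0, 4, 0, 3, 0, 1]

cluster_sizes = Counter(toy_labels)
for label, size in sorted(cluster_sizes.items()):
    # labels are zero-based, the headings in this notebook are one-based
    print('Cluster {}: {} neighborhoods'.format(label + 1, size))
```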

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

ny_merged = neighborhoods

# join the sorted venues and cluster labels onto neighborhoods, which holds latitude/longitude for each neighborhood
ny_merged = ny_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

ny_merged.head() 

Drop null values, if any.

In [None]:
ny_merged = ny_merged.dropna().reset_index(drop=True)
ny_merged

In [None]:
neighborhoods_venues_sorted.shape

In [None]:
ny_merged.shape

* **Finally, let's visualize the resulting clusters**.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhood'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

**Note:** Unfortunately, GitHub does not render folium maps natively. To view the map, paste the notebook's GitHub link into https://nbviewer.jupyter.org/, which renders any valid ```folium.Map``` instance as a fully interactive output.

<br>

## 10. Examine Clusters of New York

#### Cluster 1

In [None]:
ny_merged.loc[ny_merged['Cluster Labels'] == 0, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

#### Cluster 2

In [None]:
ny_merged.loc[ny_merged['Cluster Labels'] == 1, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

#### Cluster 3

In [None]:
ny_merged.loc[ny_merged['Cluster Labels'] == 2, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

#### Cluster 4

In [None]:
ny_merged.loc[ny_merged['Cluster Labels'] == 3, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

#### Cluster 5

In [None]:
ny_merged.loc[ny_merged['Cluster Labels'] == 4, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]