# IBM Data Science Professional Certificate
## Applied Data Science Capstone - The Battle of Neighborhoods
_By Kevin Gilson._

---

## Table of contents
* [Introduction - A Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

---

## Introduction - A Business Problem <a name="introduction"></a>

Young entrepreneurs are always looking for good advice and for the keys to success.
Data Science can help them analyze the market and answer questions such as:
* Which places of a city are the most prolific?
* Which types of business are present in which areas?
* Which types of business are lacking in which areas?
* ...

This kind of information is particularly relevant for people looking to open food-related businesses.

With almost 9 million inhabitants in 2020, London is a big city. Not only is it the capital of the United Kingdom, it is also one of the top financial centers in Europe and a major tourist destination.
With that in mind, what could help a young entrepreneur looking to start a food truck in London become successful? What are the most prevalent types of shops? Where are the _hot_ zones of the city? Which location would best maximize his profit?

Thanks to Data Science, we can analyze raw data and come up with suggestions to help this young man grow into a successful entrepreneur.
And who knows, this information might well help him make it big in 10 years.

---

## Data <a name="data"></a>

London is split into Boroughs, each holding several Neighborhoods.

To gather the geographical data, we will first scrape Wikipedia to get the complete list of Boroughs and Neighborhoods.
Then, we will use Python libraries to link each Neighborhood to its geographical coordinates.
After that, we will leverage the Foursquare location data to retrieve venue attributes, cluster zones, and most importantly come up with suggestions on the best places to set up a food truck.

Our primary data, scraped from Wikipedia, takes the following form:

|Location|London Borough|Post Town|Postcode District|Dial code|OS Grid Ref|
|--------|--------------|---------|-----------------|---------|-----------|
|Abbey Wood|Bexley, Greenwich [7]|LONDON|SE2|020|TQ465785|
|Acton|Ealing, Hammersmith and Fulham[8]|LONDON|W3, W4|020|TQ205805|
|Addington|Croydon[8]|CROYDON|CR0|020|TQ375645|
|Addiscombe|Croydon[8]|CROYDON|CR0|020|TQ345665|
|Albany Park|Bexley|BEXLEY, SIDCUP|DA5, DA14|020|TQ478728|

As we can see, the data will need to be cleansed, but it is also important to note that OS Grid References can quite easily be converted into coordinates.
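
To illustrate why this conversion is tractable, here is a minimal sketch of its first half: decoding a grid reference into OSGB36 easting and northing, in meters. The helper name `grid_ref_to_easting_northing` is hypothetical and not part of any library used in this notebook. Turning the result into latitude and longitude needs a further projection and datum transformation, which this sketch does not attempt and which is left to the OSGridConverter library.

```python
def grid_ref_to_easting_northing(grid_ref):
    """Decode an OS grid reference such as 'TQ465785' into (easting, northing).

    The values are in meters and point at the south-west corner of the
    referenced square. Hypothetical helper, written for illustration only.
    """
    ref = grid_ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) != 2 or not letters.isalpha() or "I" in letters:
        raise ValueError("Invalid grid letters in {!r}".format(grid_ref))
    if len(digits) % 2 != 0 or (digits and not digits.isdigit()):
        raise ValueError("Invalid grid digits in {!r}".format(grid_ref))

    # Letter indices on the 5x5 lettering grids (the letter I is not used)
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # Index of the 100 km square, counted from the false origin
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    # Pad each half of the digits to 5 figures so that it is expressed in meters
    half = len(digits) // 2
    easting = int(digits[:half].ljust(5, "0")) if half else 0
    northing = int(digits[half:].ljust(5, "0")) if half else 0

    return e100km * 100000 + easting, n100km * 100000 + northing


print(grid_ref_to_easting_northing("TQ465785"))  # Abbey Wood -> (546500, 178500)
```

A six-figure reference like the ones in the table has a resolution of 100 meters, which is precise enough to place a Neighborhood on a map.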

### Preliminary steps - Import libraries

Before starting our analysis, let's first import the libraries we will need throughout our computations:

In [1]:
# System
import sys

# Regular expression
import re

# Math
import math

# Handle data in a vectorized way
import numpy as np

# Data analysis
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Clustering
from sklearn.cluster import KMeans

# JSON files
import json

# Handle GET / POST requests
import requests

# Convert an address into coordinates
try:
    from geopy.geocoders import Nominatim
except:
    !conda install --yes --prefix {sys.prefix} geopy
    from geopy.geocoders import Nominatim

# Coordinates
try:
    import geocoder
except:
    !conda install --yes --prefix {sys.prefix} geocoder
    import geocoder


    
# Shapes
try:
    import shapely.geometry
except:
    !conda install --yes --prefix {sys.prefix} shapely
    import shapely.geometry

# Pyproj
try:
    import pyproj
except:
    !conda install --yes --prefix {sys.prefix} pyproj
    import pyproj
    
# OS Grid References converter
try:
    from OSGridConverter import grid2latlong
except:
    !conda install --yes --prefix {sys.prefix} OSGridConverter
    from OSGridConverter import grid2latlong

# Web scraping
try:
    from bs4 import BeautifulSoup
except:
    !conda install --yes --prefix {sys.prefix} beautifulsoup4
    from bs4 import BeautifulSoup

# Parser
try:
    import lxml
except:
    !conda install --yes --prefix {sys.prefix} lxml
    import lxml

# URL opening
import urllib.request as req

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Map rendering
try:
    import folium
except:
    !conda install --yes --prefix {sys.prefix} folium=0.5.0
    import folium

# Python ENV file
try:
    from dotenv import load_dotenv
except:
    !conda install --yes --prefix {sys.prefix} python-dotenv
    from dotenv import load_dotenv

# Password input
import getpass

# Importation confirmation
print('Libraries installed and imported.')

Libraries installed and imported.


### Scrape the Wikipedia page

As a first step, we need to scrape the Wikipedia page to retrieve its table, and with it the data about the London districts and areas.
In order to do so, we will:
1. Scrape the page using BeautifulSoup;
2. Convert the OS Grid References found to coordinates (latitude, longitude) using the OSGridConverter library; missing OS Grid References will result in NaN values for their coordinates;
3. Check the shape of the recovered dataframe.
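
The row-filtering idea behind the first step can be tried offline on a small piece of markup. The sketch below deliberately swaps BeautifulSoup for the standard library's `html.parser`, so that it runs without a network connection or extra packages. `SAMPLE_HTML`, `TableRowParser`, and `extract_rows` are hypothetical names, and the markup is a simplified stand-in for the real page.

```python
from html.parser import HTMLParser

# Simplified, made-up markup imitating the structure of the Wikipedia table
SAMPLE_HTML = """
<table>
  <tr><th>Location</th><th>London borough</th><th>Post town</th>
      <th>Postcode district</th><th>Dial code</th><th>OS grid ref</th></tr>
  <tr><td>Abbey Wood</td><td>Bexley, Greenwich [7]</td><td>LONDON</td>
      <td>SE2</td><td>020</td><td>TQ465785</td></tr>
  <tr><td>Addington</td><td>Croydon[8]</td><td>CROYDON</td>
      <td>CR0</td><td>020</td><td>TQ375645</td></tr>
  <tr><td colspan="6">Footnote row that should be skipped</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html, expected_cells=6):
    """Return only the rows holding exactly `expected_cells` data cells."""
    parser = TableRowParser()
    parser.feed(html)
    return [row for row in parser.rows if len(row) == expected_cells]


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Keeping only rows with exactly six data cells discards the header row, which is made of `<th>` cells, as well as any row that does not follow the table layout.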

In [2]:
url_neigh = "https://en.wikipedia.org/wiki/List_of_areas_of_London"

# Scrape the page with BeautifulSoup
page = req.urlopen(url_neigh)
soup = BeautifulSoup(page, "lxml")
all_tables = soup.find_all("table")

# Initiate the dataframe
column_names = ['Location',
                'London Borough',
                'Post Town',
                'Post District',
                'Dial Code',
                'OS Grid Ref',
                'Latitude',
                'Longitude']
neigh_data = pd.DataFrame(columns=column_names)

# Loop through the scraping and extract the second table (first one is the coordinates files)
for row in all_tables[1].find_all('tr'):
    cells = row.findAll('td')
    if len(cells) == 6:
        wiki_location = cells[0].text.strip()
        wiki_borough = cells[1].text.strip()
        wiki_town = cells[2].text.strip()
        wiki_district = cells[3].text.strip()
        wiki_dial = cells[4].text.strip()
        
        wiki_gridref = cells[5].text.strip()
        try:
            wiki_latlong = grid2latlong(wiki_gridref)
            wiki_lat, wiki_lng = wiki_latlong.latitude, wiki_latlong.longitude
        except Exception:
            # Missing or malformed grid reference: record NaN coordinates
            wiki_lat, wiki_lng = np.nan, np.nan

        neigh_data = neigh_data.append({'Location': wiki_location,
                                        'London Borough': wiki_borough,
                                        'Post Town': wiki_town,
                                        'Post District': wiki_district,
                                        'Dial Code': wiki_dial,
                                        'OS Grid Ref': wiki_gridref,
                                        'Latitude': wiki_lat,
                                        'Longitude': wiki_lng}
                                       , ignore_index=True)

# Check the results
print("The resulting dataframe has a shape of: {}\n".format(neigh_data.shape))
neigh_data.head()

The resulting dataframe has a shape of: (532, 8)



Unnamed: 0,Location,London Borough,Post Town,Post District,Dial Code,OS Grid Ref,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785,51.486484,0.109318
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805,51.510591,-0.264585
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645,51.362934,-0.02578
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665,51.381625,-0.068126
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728,51.434929,0.125663


In [3]:
url_borough = "https://en.wikipedia.org/wiki/List_of_London_boroughs"

# Scrape the page with BeautifulSoup
page = req.urlopen(url_borough)
soup = BeautifulSoup(page, "lxml")
all_tables = soup.find_all("table")

# Initiate the dataframe
column_names = ['Borough',
                'Inner',
                'Status',
                'Local authority',
                'Political control',
                'Headquarters',
                'Area (sqmi)',
                'Population (2013 est)',
                'Coordinates',
                'Nr. in map']
borough_data = pd.DataFrame(columns=column_names)

# Loop through the scraping and extract the first table and second
for i in [0,1]:
    for row in all_tables[i].find_all('tr'):
        cells = row.findAll('td')
        if len(cells) == 10:
            wiki_borough = cells[0].text.strip()
            wiki_inner = cells[1].text.strip()
            wiki_status = cells[2].text.strip()
            wiki_locauth = cells[3].text.strip()
            wiki_polctr = cells[4].text.strip()
            wiki_head = cells[5].text.strip()
            wiki_area = cells[6].text.strip()
            wiki_pop = cells[7].text.strip()
            wiki_coord = cells[8].text.strip()
            wiki_nmap = cells[9].text.strip()

            borough_data = borough_data.append({'Borough': wiki_borough,
                                                'Inner': wiki_inner,
                                                'Status': wiki_status,
                                                'Local authority': wiki_locauth,
                                                'Political control': wiki_polctr,
                                                'Headquarters': wiki_head,
                                                'Area (sqmi)': wiki_area,
                                                'Population (2013 est)': wiki_pop,
                                                'Coordinates': wiki_coord,
                                                'Nr. in map': wiki_nmap}
                                               , ignore_index=True)

# Check the results
print("The resulting dataframe has a shape of: {}\n".format(borough_data.shape))
borough_data.head()

The resulting dataframe has a shape of: (33, 10)



Unnamed: 0,Borough,Inner,Status,Local authority,Political control,Headquarters,Area (sqmi),Population (2013 est),Coordinates,Nr. in map
0,Barking and Dagenham [note 1],,,Barking and Dagenham London Borough Council,Labour,"Town Hall, 1 Town Square",13.93,194352,51°33′39″N 0°09′21″E﻿ / ﻿51.5607°N 0.1557°E﻿ /...,25
1,Barnet,,,Barnet London Borough Council,Conservative,"Barnet House, 2 Bristol Avenue, Colindale",33.49,369088,51°37′31″N 0°09′06″W﻿ / ﻿51.6252°N 0.1517°W﻿ /...,31
2,Bexley,,,Bexley London Borough Council,Conservative,"Civic Offices, 2 Watling Street",23.38,236687,51°27′18″N 0°09′02″E﻿ / ﻿51.4549°N 0.1505°E﻿ /...,23
3,Brent,,,Brent London Borough Council,Labour,"Brent Civic Centre, Engineers Way",16.7,317264,51°33′32″N 0°16′54″W﻿ / ﻿51.5588°N 0.2817°W﻿ /...,12
4,Bromley,,,Bromley London Borough Council,Conservative,"Civic Centre, Stockwell Close",57.97,317899,51°24′14″N 0°01′11″E﻿ / ﻿51.4039°N 0.0198°E﻿ /...,20


In [4]:
neigh_ori = neigh_data.copy()
borough_ori = borough_data.copy()

### Columns cleansing

We can see that the resulting dataframes aren't looking their best; let's tidy them up a bit:
1. First we will drop columns that won't be used later;
2. Then we will rename a few of the remaining ones into clearer names;
3. Finally we will reorder the columns.

In [5]:
# Drop useless columns
neigh_data.drop(columns=['Post District',
                         'Dial Code',
                         'OS Grid Ref'], axis=1, inplace=True)

# Rename columns
neigh_data.rename(columns={'Location':'Neighborhood',
                           'London Borough':'Borough',
                           'Post Town':'Town'}, inplace=True)

# Reorder
neigh_data = neigh_data[['Borough',
                        'Neighborhood',
                        'Town',
                        'Latitude',
                        'Longitude']]
# Check results
neigh_data.head()

Unnamed: 0,Borough,Neighborhood,Town,Latitude,Longitude
0,"Bexley, Greenwich [7]",Abbey Wood,LONDON,51.486484,0.109318
1,"Ealing, Hammersmith and Fulham[8]",Acton,LONDON,51.510591,-0.264585
2,Croydon[8],Addington,CROYDON,51.362934,-0.02578
3,Croydon[8],Addiscombe,CROYDON,51.381625,-0.068126
4,Bexley,Albany Park,"BEXLEY, SIDCUP",51.434929,0.125663


In [6]:
# Drop useless columns
borough_data.drop(columns=['Status',
                           'Local authority',
                           'Political control',
                           'Headquarters',
                           'Nr. in map'], axis=1, inplace=True)

# Rename columns
borough_data.rename(columns={'Area (sqmi)':'Area',
                             'Population (2013 est)':'Population'}, inplace=True)
# Check results
borough_data.head()

Unnamed: 0,Borough,Inner,Area,Population,Coordinates
0,Barking and Dagenham [note 1],,13.93,194352,51°33′39″N 0°09′21″E﻿ / ﻿51.5607°N 0.1557°E﻿ /...
1,Barnet,,33.49,369088,51°37′31″N 0°09′06″W﻿ / ﻿51.6252°N 0.1517°W﻿ /...
2,Bexley,,23.38,236687,51°27′18″N 0°09′02″E﻿ / ﻿51.4549°N 0.1505°E﻿ /...
3,Brent,,16.7,317264,51°33′32″N 0°16′54″W﻿ / ﻿51.5588°N 0.2817°W﻿ /...
4,Bromley,,57.97,317899,51°24′14″N 0°01′11″E﻿ / ﻿51.4039°N 0.0198°E﻿ /...


### Data cleansing

We can see a few problems with our data, such as:
* Some values contain Wikipedia reference markers (e.g. "[1]");
* Some Neighborhoods are assigned to two or more Boroughs or Towns;
* Some values are followed by explanations in parentheses;
* The case of the data is not normalized, mixing upper and lower case.

Let's define a function that uses regular expressions to strip brackets, parentheses, and additional values from the data; where several values are listed, only the first one is kept.
Then, let's capitalize each value (i.e. "VALUES" and "values" will become "Values"):

In [7]:
# Cleaning function, based on RegEx patterns
def Clean_DataEnd(raw_data, pattern):
    if re.search(pattern, str(raw_data)):
        pos = re.search(pattern, str(raw_data)).start()
        return raw_data[:pos]
    else:
        return raw_data

In [8]:
# Fix wikipedia references, double values, and annotations
for col in neigh_data.select_dtypes(include='object').columns:
    neigh_data[col] = neigh_data[col].apply(Clean_DataEnd, pattern=r' \[.*')
    neigh_data[col] = neigh_data[col].apply(Clean_DataEnd, pattern=r'\[.*')
    neigh_data[col] = neigh_data[col].apply(Clean_DataEnd, pattern=r' \(.*')
    neigh_data[col] = neigh_data[col].apply(Clean_DataEnd, pattern=r', .*')

# Change City to City of London
neigh_data['Borough'] = neigh_data['Borough'].replace({'City': 'City of London'})

# Capitalize data - Use title to capitalize each words
for col in ['Borough','Neighborhood','Town']:
    neigh_data[col] = neigh_data[col].str.title()

neigh_data['Borough'] = neigh_data['Borough'].str.replace(' And ', ' and ')
neigh_data['Borough'] = neigh_data['Borough'].str.replace(' Upon ', ' upon ')
neigh_data['Borough'] = neigh_data['Borough'].str.replace(' Of ', ' of ')

# Check
print(neigh_data.shape)
print('\n')
neigh_data.head()

(532, 5)




Unnamed: 0,Borough,Neighborhood,Town,Latitude,Longitude
0,Bexley,Abbey Wood,London,51.486484,0.109318
1,Ealing,Acton,London,51.510591,-0.264585
2,Croydon,Addington,Croydon,51.362934,-0.02578
3,Croydon,Addiscombe,Croydon,51.381625,-0.068126
4,Bexley,Albany Park,Bexley,51.434929,0.125663
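
The truncation behavior of this cleaning step can be checked on a few raw values without pandas. The sketch below is a compact variant, not the notebook's own function: `clean_value` and `normalize_borough` are hypothetical names, and the two bracket patterns are merged into a single one with an optional leading space.

```python
import re

# Patterns applied in order; everything from the first match onwards is dropped
PATTERNS = [
    r" ?\[.*",  # Wikipedia reference markers such as "[8]" or " [note 1]"
    r" \(.*",   # explanations in parentheses
    r", .*",    # additional values after the first one
]

# Words that should stay lowercase inside a borough name
LOWERCASE_WORDS = ("And", "Upon", "Of")


def clean_value(value):
    """Cut the value at the first match of each pattern, in order."""
    text = str(value)
    for pattern in PATTERNS:
        match = re.search(pattern, text)
        if match:
            text = text[:match.start()]
    return text


def normalize_borough(value):
    """Clean the value, title-case it, and restore lowercase connectors."""
    text = clean_value(value).title()
    for word in LOWERCASE_WORDS:
        text = text.replace(" {} ".format(word), " {} ".format(word.lower()))
    return text


for raw in ["Bexley, Greenwich [7]",
            "Ealing, Hammersmith and Fulham[8]",
            "BARKING AND DAGENHAM [note 1]"]:
    print("{!r} -> {!r}".format(raw, normalize_borough(raw)))
```

Note the trade-off of this approach: a Neighborhood that spans several Boroughs is attributed to the first one listed only, so "Bexley, Greenwich [7]" becomes "Bexley".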


In [9]:
neigh_data.dtypes

Borough          object
Neighborhood     object
Town             object
Latitude        float64
Longitude       float64
dtype: object

In [10]:
for col in borough_data.select_dtypes(include='object').columns:
    borough_data[col] = borough_data[col].apply(Clean_DataEnd, pattern=r' \[.*')
    borough_data[col] = borough_data[col].apply(Clean_DataEnd, pattern=r'\[.*')
    borough_data[col] = borough_data[col].apply(Clean_DataEnd, pattern=r' \(.*')
    borough_data[col] = borough_data[col].apply(Clean_DataEnd, pattern=r', .*')

for col in ['Borough','Inner']:
    borough_data[col] = borough_data[col].str.title()

borough_data['Borough'] = borough_data['Borough'].str.replace(' And ', ' and ')
borough_data['Borough'] = borough_data['Borough'].str.replace(' Upon ', ' upon ')
borough_data['Borough'] = borough_data['Borough'].str.replace(' Of ', ' of ')

print(borough_data.shape)
print('\n')
borough_data.head()

(33, 5)




Unnamed: 0,Borough,Inner,Area,Population,Coordinates
0,Barking and Dagenham,,13.93,194352,51°33′39″N 0°09′21″E﻿ / ﻿51.5607°N 0.1557°E﻿ /...
1,Barnet,,33.49,369088,51°37′31″N 0°09′06″W﻿ / ﻿51.6252°N 0.1517°W﻿ /...
2,Bexley,,23.38,236687,51°27′18″N 0°09′02″E﻿ / ﻿51.4549°N 0.1505°E﻿ /...
3,Brent,,16.7,317264,51°33′32″N 0°16′54″W﻿ / ﻿51.5588°N 0.2817°W﻿ /...
4,Bromley,,57.97,317899,51°24′14″N 0°01′11″E﻿ / ﻿51.4039°N 0.0198°E﻿ /...


In [11]:
borough_data.loc[0, 'Coordinates']

'51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E\ufeff / 51.5607; 0.1557\ufeff'

In [12]:
borough_data['Coordinates'] = borough_data['Coordinates'].str.replace(u'\ufeff', '')
borough_data.loc[0, 'Coordinates']

'51°33′39″N 0°09′21″E / 51.5607°N 0.1557°E / 51.5607; 0.1557'

In [13]:
borough_data['Coordinates'] = borough_data['Coordinates'].str.split("/", n = 3, expand=False).str[-1]
borough_data.head()

Unnamed: 0,Borough,Inner,Area,Population,Coordinates
0,Barking and Dagenham,,13.93,194352,51.5607; 0.1557
1,Barnet,,33.49,369088,51.6252; -0.1517
2,Bexley,,23.38,236687,51.4549; 0.1505
3,Brent,,16.7,317264,51.5588; -0.2817
4,Bromley,,57.97,317899,51.4039; 0.0198


In [14]:
borough_data['Latitude'] = borough_data['Coordinates'].str.split(";", n=2, expand=False).str[0]
borough_data['Longitude'] = borough_data['Coordinates'].str.split(";", n=2, expand=False).str[1]
borough_data.head()

Unnamed: 0,Borough,Inner,Area,Population,Coordinates,Latitude,Longitude
0,Barking and Dagenham,,13.93,194352,51.5607; 0.1557,51.5607,0.1557
1,Barnet,,33.49,369088,51.6252; -0.1517,51.6252,-0.1517
2,Bexley,,23.38,236687,51.4549; 0.1505,51.4549,0.1505
3,Brent,,16.7,317264,51.5588; -0.2817,51.5588,-0.2817
4,Bromley,,57.97,317899,51.4039; 0.0198,51.4039,0.0198
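
The same three steps (remove the invisible `\ufeff` characters, keep the last `/`-separated part, split on `;`) can be sketched for a single value in plain Python. `parse_wiki_coordinates` is a hypothetical helper name; the input string is the raw value shown above for the first borough.

```python
def parse_wiki_coordinates(raw):
    """Extract (latitude, longitude) as floats from a Wikipedia coordinate cell."""
    # Remove the zero-width no-break spaces (U+FEFF) left over from the page
    cleaned = raw.replace("\ufeff", "")
    # The last '/'-separated part holds the decimal form: ' 51.5607; 0.1557'
    decimal_part = cleaned.split("/")[-1]
    lat_text, lon_text = decimal_part.split(";")
    # float() ignores the surrounding whitespace
    return float(lat_text), float(lon_text)


raw_value = ('51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E\ufeff'
             ' / 51.5607; 0.1557\ufeff')
print(parse_wiki_coordinates(raw_value))  # (51.5607, 0.1557)
```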


In [15]:
borough_data.drop(columns=['Coordinates'], axis=1, inplace=True)
borough_data.head()

Unnamed: 0,Borough,Inner,Area,Population,Latitude,Longitude
0,Barking and Dagenham,,13.93,194352,51.5607,0.1557
1,Barnet,,33.49,369088,51.6252,-0.1517
2,Bexley,,23.38,236687,51.4549,0.1505
3,Brent,,16.7,317264,51.5588,-0.2817
4,Bromley,,57.97,317899,51.4039,0.0198


In [16]:
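# Encode the Inner London flag as an integer: 1 for inner boroughs, 0 otherwise
borough_data['Inner'] = borough_data['Inner'].replace({'': 0, 'Y': 1, '(Y)': 1}).astype(int)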
borough_data.head()

Unnamed: 0,Borough,Inner,Area,Population,Latitude,Longitude
0,Barking and Dagenham,0,13.93,194352,51.5607,0.1557
1,Barnet,0,33.49,369088,51.6252,-0.1517
2,Bexley,0,23.38,236687,51.4549,0.1505
3,Brent,0,16.7,317264,51.5588,-0.2817
4,Bromley,0,57.97,317899,51.4039,0.0198


In [17]:
borough_data.dtypes

Borough       object
Inner          int64
Area          object
Population    object
Latitude      object
Longitude     object
dtype: object

In [18]:
borough_data[['Area','Population','Latitude','Longitude']].head()

Unnamed: 0,Area,Population,Latitude,Longitude
0,13.93,194352,51.5607,0.1557
1,33.49,369088,51.6252,-0.1517
2,23.38,236687,51.4549,0.1505
3,16.7,317264,51.5588,-0.2817
4,57.97,317899,51.4039,0.0198


In [19]:
borough_data['Area'] = borough_data['Area'].astype(float)
borough_data['Latitude'] = borough_data['Latitude'].astype(float)
borough_data['Longitude'] = borough_data['Longitude'].astype(float)
borough_data['Population'] = borough_data['Population'].str.replace(',', '').astype(float)
borough_data.dtypes

Borough        object
Inner           int64
Area          float64
Population    float64
Latitude      float64
Longitude     float64
dtype: object

In [20]:
# Convert the area from square miles to square meters (1 sq mi = 2,589,988.11 m²)
borough_data['Area'] = borough_data['Area'] * 2589988.11
borough_data.head()

Unnamed: 0,Borough,Inner,Area,Population,Latitude,Longitude
0,Barking and Dagenham,0,36078530.0,194352.0,51.5607,0.1557
1,Barnet,0,86738700.0,369088.0,51.6252,-0.1517
2,Bexley,0,60553920.0,236687.0,51.4549,0.1505
3,Brent,0,43252800.0,317264.0,51.5588,-0.2817
4,Bromley,0,150141600.0,317899.0,51.4039,0.0198


In [21]:
# Radius (in meters) of a circle with the same area as the borough: r = sqrt(A / pi)
borough_data['Radius'] = borough_data['Area'] / math.pi
borough_data['Radius'] = borough_data['Radius'].apply(math.sqrt)
borough_data.head()

Unnamed: 0,Borough,Inner,Area,Population,Latitude,Longitude,Radius
0,Barking and Dagenham,0,36078530.0,194352.0,51.5607,0.1557,3388.827846
1,Barnet,0,86738700.0,369088.0,51.6252,-0.1517,5254.501527
2,Bexley,0,60553920.0,236687.0,51.4549,0.1505,4390.320264
3,Brent,0,43252800.0,317264.0,51.5588,-0.2817,3710.497851
4,Bromley,0,150141600.0,317899.0,51.4039,0.0198,6913.143932


In [22]:
# Convert the area from square meters to square kilometers
borough_data['Area'] = borough_data['Area'] / 1000000
borough_data.head()

Unnamed: 0,Borough,Inner,Area,Population,Latitude,Longitude,Radius
0,Barking and Dagenham,0,36.078534,194352.0,51.5607,0.1557,3388.827846
1,Barnet,0,86.738702,369088.0,51.6252,-0.1517,5254.501527
2,Bexley,0,60.553922,236687.0,51.4549,0.1505,4390.320264
3,Brent,0,43.252801,317264.0,51.5588,-0.2817,3710.497851
4,Bromley,0,150.141611,317899.0,51.4039,0.0198,6913.143932
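
The arithmetic behind the last three cells can be written out for a single borough. Each borough is approximated by a circle of equal area, so its radius is r = sqrt(A / π) with A in square meters. The function names below are hypothetical; the conversion factor is the one used in the notebook.

```python
import math

# Conversion factor used in the notebook: square meters in one square mile
SQ_MILE_IN_SQ_METERS = 2589988.11


def equivalent_radius_m(area_sq_mi):
    """Radius, in meters, of a circle with the same area as the borough."""
    area_m2 = area_sq_mi * SQ_MILE_IN_SQ_METERS
    return math.sqrt(area_m2 / math.pi)


def area_km2(area_sq_mi):
    """Area converted from square miles to square kilometers."""
    return area_sq_mi * SQ_MILE_IN_SQ_METERS / 1_000_000


# Barking and Dagenham: 13.93 sq mi
print(round(area_km2(13.93), 2))             # 36.08
print(round(equivalent_radius_m(13.93), 2))  # 3388.83
```

This circle is only a rough approximation, since real borough boundaries are far from circular.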


### Missing values

Before going further, we want to make sure that we have the coordinates of every neighborhood.
Let's check the latitudes and longitudes:

In [23]:
i = 0
for col in neigh_data.columns:
    if neigh_data[col].isnull().values.any():
        i = i + 1
        print("Missing data in column {}, at index:".format(col))
        print(neigh_data.loc[pd.isna(neigh_data[col]), :].index)
if (i == 0):
    print('No missing data found.')

Missing data in column Latitude, at index:
Int64Index([53, 232], dtype='int64')
Missing data in column Longitude, at index:
Int64Index([53, 232], dtype='int64')


We can see that two Neighborhoods are missing their coordinates.

Using the GeoCoder library, let's retrieve their latitudes and longitudes:

In [24]:
# Define a function to retrieve the coordinates, using ArcGIS instead of Google for better performance
def GetCoordinates(df):
    bor = df['Borough']
    neigh = df['Neighborhood']
    town = df['Town']
    
    lat_lng_coords = None
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, London'.format(neigh))
        lat_lng_coords = g.latlng
    lat = lat_lng_coords[0]
    long = lat_lng_coords[1]
    lst = {'Borough': bor,
           'Neighborhood': neigh,
           'Town': town,
           'Latitude': lat,
           'Longitude': long}
    return pd.Series(lst)

In [25]:
# Apply the function to missing coordinates
neigh_data[np.isnan(neigh_data['Latitude'])] = neigh_data[np.isnan(neigh_data['Latitude'])].apply(GetCoordinates, axis=1)

# Check if there is any remaining NaN values
neigh_data.isnull().values.any()

False
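
One caveat with the `while` loop in `GetCoordinates`: if the service never returns a result for a given name, the loop never ends. A bounded retry is a possible alternative, sketched below under assumptions: `geocode_with_retry` and its parameters are hypothetical, and `lookup` stands for any callable that returns a `[lat, lng]` pair or `None`, for instance `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=1.0):
    """Call `lookup(query)` until it returns coordinates or attempts run out.

    Returns the coordinates, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            # Wait a little before asking the service again
            time.sleep(delay_seconds)
    return None


# Stand-in for a real geocoding service: fails twice, then answers
attempts = []

def fake_lookup(query):
    attempts.append(query)
    return None if len(attempts) < 3 else [51.5, -0.1]


print(geocode_with_retry(fake_lookup, "Temple, London", delay_seconds=0))
```

With this variant, a row whose lookup keeps failing would stay at NaN and show up in the missing-value check instead of blocking the notebook.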

In [26]:
i = 0
for col in borough_data.columns:
    if borough_data[col].isnull().values.any():
        i = i + 1
        print("Missing data in column {}, at index:".format(col))
        print(borough_data.loc[pd.isna(borough_data[col]), :].index)
if (i == 0):
    print('No missing data found.')

No missing data found.


Now that we are satisfied with our datasets, let's have a last look at them; the untouched originals remain available in *neigh_ori* and *borough_ori*:

In [27]:
neigh_data.head()

Unnamed: 0,Borough,Neighborhood,Town,Latitude,Longitude
0,Bexley,Abbey Wood,London,51.486484,0.109318
1,Ealing,Acton,London,51.510591,-0.264585
2,Croydon,Addington,Croydon,51.362934,-0.02578
3,Croydon,Addiscombe,Croydon,51.381625,-0.068126
4,Bexley,Albany Park,Bexley,51.434929,0.125663


In [28]:
borough_data.head()

Unnamed: 0,Borough,Inner,Area,Population,Latitude,Longitude,Radius
0,Barking and Dagenham,0,36.078534,194352.0,51.5607,0.1557,3388.827846
1,Barnet,0,86.738702,369088.0,51.6252,-0.1517,5254.501527
2,Bexley,0,60.553922,236687.0,51.4549,0.1505,4390.320264
3,Brent,0,43.252801,317264.0,51.5588,-0.2817,3710.497851
4,Bromley,0,150.141611,317899.0,51.4039,0.0198,6913.143932


In [29]:
borough_data

Unnamed: 0,Borough,Inner,Area,Population,Latitude,Longitude,Radius
0,Barking and Dagenham,0,36.078534,194352.0,51.5607,0.1557,3388.827846
1,Barnet,0,86.738702,369088.0,51.6252,-0.1517,5254.501527
2,Bexley,0,60.553922,236687.0,51.4549,0.1505,4390.320264
3,Brent,0,43.252801,317264.0,51.5588,-0.2817,3710.497851
4,Bromley,0,150.141611,317899.0,51.4039,0.0198,6913.143932
5,Camden,1,21.7559,229719.0,51.529,-0.1255,2631.561911
6,Croydon,0,86.531503,372752.0,51.3714,-0.0977,5248.22187
7,Ealing,0,55.529345,342494.0,51.513,-0.3089,4204.228765
8,Enfield,0,82.206223,320524.0,51.6538,-0.0799,5115.374215
9,Greenwich,1,47.344983,264008.0,51.4892,0.0648,3882.058222


In [30]:
neigh_data['Borough'].unique() # Missing City in borough_data

array(['Bexley', 'Ealing', 'Croydon', 'Redbridge', 'City of London',
       'Westminster', 'Brent', 'Bromley', 'Islington', 'Havering',
       'Barnet', 'Enfield', 'Wandsworth', 'Southwark',
       'Barking and Dagenham', 'Richmond upon Thames', 'Newham', 'Sutton',
       'Lewisham', 'Harrow', 'Camden', 'Kingston upon Thames',
       'Tower Hamlets', 'Greenwich', 'Haringey', 'Hounslow', 'Lambeth',
       'Kensington and Chelseahammersmith and Fulham', 'Waltham Forest',
       'Kensington and Chelsea', 'Merton', 'Hillingdon', 'Hackney',
       'Islington & City', 'Hammersmith and Fulham',
       'Camden and Islington', 'Haringey and Barnet'], dtype=object)

In [31]:
neigh_data[neigh_data['Borough'] == 'City of London']

Unnamed: 0,Borough,Neighborhood,Town,Latitude,Longitude
6,City of London,Aldgate,London,51.514885,-0.078356
18,City of London,Barbican,London,51.51966,-0.095466
49,City of London,Blackfriars,London,51.510767,-0.101607
456,City of London,Temple,London,51.511828,-0.111659


---

## Explore the dataset <a name="explore"></a>
Let's have a first look at our dataset by checking its size and listing its Boroughs and Towns:

In [32]:
print('The dataframe has {} boroughs, {} neighborhoods, and {} towns.'.format(
        len(neigh_data['Borough'].unique()),
        neigh_data.shape[0],
        len(neigh_data['Town'].unique())
    )
)
print('\nThe boroughs are the following:\n{}'.format(neigh_data['Borough'].unique()))
print('\nThe towns are the following:\n{}'.format(neigh_data['Town'].unique()))

The dataframe has 37 boroughs, 532 neighborhoods, and 64 towns.

The boroughs are the following:
['Bexley' 'Ealing' 'Croydon' 'Redbridge' 'City of London' 'Westminster'
 'Brent' 'Bromley' 'Islington' 'Havering' 'Barnet' 'Enfield' 'Wandsworth'
 'Southwark' 'Barking and Dagenham' 'Richmond upon Thames' 'Newham'
 'Sutton' 'Lewisham' 'Harrow' 'Camden' 'Kingston upon Thames'
 'Tower Hamlets' 'Greenwich' 'Haringey' 'Hounslow' 'Lambeth'
 'Kensington and Chelseahammersmith and Fulham' 'Waltham Forest'
 'Kensington and Chelsea' 'Merton' 'Hillingdon' 'Hackney'
 'Islington & City' 'Hammersmith and Fulham' 'Camden and Islington'
 'Haringey and Barnet']

The towns are the following:
['London' 'Croydon' 'Bexley' 'Ilford' 'Wembley' 'Westerham' 'Hornchurch'
 'Barnet' 'Barking' 'Bexleyheath' 'Dartford' 'Beckenham' 'Dagenham'
 'Wallington' 'Harrow' 'Sutton' 'Belvedere' 'Surbiton' 'Bromley' 'Sidcup'
 'Enfield' 'Brentford' 'Edgware' 'Carshalton' 'Romford' 'Sutton/Merton'
 'Orpington' 'Chessington' 'Chisle

Considering that we are analyzing London, it might be appropriate to retrieve its coordinates:

In [33]:
address = 'London'
geolocator = Nominatim(user_agent="london_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London, UK are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London, UK are 51.5073219, -0.1276474.


Now let's plot the neighborhoods on a map of London, in order to get a better view of the situation; we will leverage the coordinates that we just found to center the map:

In [34]:
map_london = folium.Map(location=[latitude, longitude], zoom_start=10)

# Boroughs
for lat, lng, borough, rad, inn in zip(borough_data['Latitude'], borough_data['Longitude'], borough_data['Borough'], borough_data['Radius'], borough_data['Inner']):
    # Outer boroughs in blue, inner boroughs in red
    # (named circle_color to avoid shadowing the matplotlib.colors import)
    if (inn == 0):
        circle_color = 'blue'
    else:
        circle_color = 'red'
    label = borough
    label = folium.Popup(label, parse_html=True)
    folium.Circle(
        [lat, lng],
        radius = rad,
        popup = label,
        color = circle_color,
        fill = True,
        fill_color = circle_color,
        fill_opacity = 0.5,
        parse_html = False).add_to(map_london)

# Neighborhoods
for lat, lng, borough, neighborhood in zip(neigh_data['Latitude'], neigh_data['Longitude'], neigh_data['Borough'], neigh_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_london)

# Display map
map_london

As we can see, circling the boroughs isn't the most appropriate way to visualize neighborhoods. Let's try with a GeoJSON file of the borough boundaries:

In [35]:
url_geolondon = "https://skgrange.github.io/www/data/london_boroughs.json"
#url_geoinner = "https://skgrange.github.io/www/data/inner_london_polygons.json"

map_json = folium.Map(location=[latitude, longitude], tiles='OpenStreetMap', zoom_start=10)

map_json.choropleth(
    geo_data=url_geolondon,
    data=borough_data,
    columns=['Borough', 'Inner'],
    key_on='feature.properties.name',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    reset=True
)

# Neighborhoods

for lat, lng, borough, neighborhood in zip(neigh_data['Latitude'], neigh_data['Longitude'], neigh_data['Borough'], neigh_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_json)

map_json

It is already much clearer. And as we can see, the neighborhoods are pretty spread apart.

London is a big city, and even its inner part is huge. Therefore, it might be best to create subsets of the data: one restricted to Inner London, and one focusing on the City of London borough:

In [36]:
# Join the two dataset
inner_london_data = pd.merge(neigh_data, borough_data[['Borough','Inner']], on='Borough')

# Drop rows that are not part of Inner London
inner_london_data = inner_london_data[inner_london_data['Inner'] == 1] .reset_index(drop=True)
inner_london_data.drop(columns=['Inner'], axis=1, inplace=True)

# Add City column: 1 for neighborhoods in the City of London, 0 otherwise
inner_london_data['City'] = (inner_london_data['Borough'] == 'City of London').astype(int)

# Check results
print(inner_london_data.shape)
print('\n')
inner_london_data.head()

(180, 6)




Unnamed: 0,Borough,Neighborhood,Town,Latitude,Longitude,City
0,City of London,Aldgate,London,51.514885,-0.078356,1
1,City of London,Barbican,London,51.51966,-0.095466,1
2,City of London,Blackfriars,London,51.510767,-0.101607,1
3,City of London,Temple,London,51.511828,-0.111659,1
4,Westminster,Aldwych,London,51.512819,-0.117388,0


In [37]:
london_city_data = neigh_data[neigh_data['Borough'] == 'City of London'].reset_index(drop=True)

# Check results
print(london_city_data.shape)
print('\n')
london_city_data.head()

(4, 5)




Unnamed: 0,Borough,Neighborhood,Town,Latitude,Longitude
0,City of London,Aldgate,London,51.514885,-0.078356
1,City of London,Barbican,London,51.51966,-0.095466
2,City of London,Blackfriars,London,51.510767,-0.101607
3,City of London,Temple,London,51.511828,-0.111659


Let's now compare the two datasets on a map, by first plotting every Inner London neighborhood in blue, then plotting the ones in the City of London in red:

In [38]:
# Create map of London using latitude and longitude values
map_london_city = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add markers corresponding to the neighborhoods to the map
for lat, lng, borough, neighborhood in zip(inner_london_data['Latitude'], inner_london_data['Longitude'], inner_london_data['Borough'], inner_london_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_london_city)  

# Highlight the City of London neighborhoods in red
for lat, lng, borough, neighborhood in zip(london_city_data['Latitude'], london_city_data['Longitude'], london_city_data['Borough'], london_city_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_london_city)  

# Display the map
map_london_city

In [39]:
city_latitude = borough_data.loc[borough_data['Borough'] == 'City of London', 'Latitude'].values[0]
city_longitude = borough_data.loc[borough_data['Borough'] == 'City of London', 'Longitude'].values[0]

map_london_city_json = folium.Map(location=[city_latitude, city_longitude], tiles='OpenStreetMap', zoom_start=13)

# Map.choropleth() has been removed from recent folium releases; use folium.Choropleth instead
folium.Choropleth(
    geo_data=url_geolondon,
    data=inner_london_data,
    columns=['Borough', 'City'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2
).add_to(map_london_city_json)

# Neighborhoods

for lat, lng, borough, neighborhood in zip(inner_london_data['Latitude'], inner_london_data['Longitude'], inner_london_data['Borough'], inner_london_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_london_city_json)

map_london_city_json

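Next, let's plot a hexagonal grid of candidate locations centered on the City of London. To compute distances in meters, we first need to project the coordinates with pyproj.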

In [None]:
# From
from functools import partial
from pyproj import Proj, transform
proj_4326 = Proj(init="epsg:4326")
proj_3857 = Proj(init="epsg:3857")
transformer = partial(transform, proj_4326, proj_3857)
transformer(12, 12)

# To
from pyproj import Transformer
transformer = Transformer.from_crs("epsg:4326", "epsg:3857")
transformer.transform(12, 12)

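In [105]:
from pyproj import Transformer

# transform() is an instance method, so a Transformer must be built first.
# always_xy=True sets the argument order to (longitude, latitude).
transformer = Transformer.from_crs("epsg:4326", "epsg:3857", always_xy=True)
transformer.transform(city_longitude, city_latitude)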



In [40]:
#!pip install shapely
import shapely.geometry

#!pip install pyproj
import pyproj

import math

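# Note: London lies in UTM zone 30; zone 33 is used here, so the city sits far
# from the zone's central meridian and distances are slightly distorted.
proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
proj_xy = pyproj.Proj(proj='utm', zone=33, datum='WGS84')

# Build the transformers once; always_xy=True keeps the (lon, lat) / (x, y) argument order
to_xy = pyproj.Transformer.from_proj(proj_latlon, proj_xy, always_xy=True)
to_lonlat = pyproj.Transformer.from_proj(proj_xy, proj_latlon, always_xy=True)

def lonlat_to_xy(lon, lat):
    # Longitude/latitude (degrees) to projected x/y (meters)
    return to_xy.transform(lon, lat)

def xy_to_lonlat(x, y):
    # Projected x/y (meters) back to longitude/latitude (degrees)
    return to_lonlat.transform(x, y)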

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('London center longitude={}, latitude={}'.format(city_longitude, city_latitude))
x, y = lonlat_to_xy(city_longitude, city_latitude)
print('London center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('London center longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
London center longitude=-0.0922, latitude=51.5155
London center UTM X=-544382.7527455712, Y=5815943.184422143
London center longitude=-0.09219999999999758, latitude=51.5155




In [41]:
city_center_x, city_center_y = lonlat_to_xy(city_longitude, city_latitude) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = city_center_x - 6000
x_step = 600
y_min = city_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(city_center_x, city_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')


364 candidate neighborhood centers generated.
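
The grid construction above can be checked with a small, self-contained sketch. It is not the notebook's exact code: the helper `hex_grid` is hypothetical, the grid is centered on the origin instead of the projected City of London coordinates, and the point count will therefore differ from the one printed above. It only illustrates why rows are spaced by `step * sqrt(3) / 2` and why every other row is shifted by half a step: with that layout, each center ends up exactly one step away from its nearest neighbors.

```python
import math

def hex_grid(center_x, center_y, step=600.0, radius=6000.0):
    """Return (x, y) centers of a hexagonal grid within `radius` of the center.

    Hypothetical helper written for illustration only.
    """
    k = math.sqrt(3) / 2       # row spacing factor for a hexagonal layout
    n = int(radius / step)     # number of steps on each side of the center
    rows = int(n / k)          # number of rows on each side of the center
    points = []
    for i in range(-rows, rows + 1):
        y = center_y + i * step * k
        # Shift every other row by half a step
        x_offset = step / 2 if i % 2 else 0.0
        for j in range(-n - 1, n + 2):
            x = center_x + j * step + x_offset
            if math.hypot(x - center_x, y - center_y) <= radius:
                points.append((x, y))
    return points

points = hex_grid(0.0, 0.0)

# Distance from each center to its closest neighbor
nearest = [
    min(math.hypot(px - qx, py - qy) for qx, qy in points if (qx, qy) != (px, py))
    for px, py in points
]
print(len(points), 'centers, nearest-neighbor distances between',
      round(min(nearest), 3), 'and', round(max(nearest), 3))
```

Since the circles drawn on the map have a radius of 300 m, a 600 m spacing means neighboring circles touch without overlapping.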




In [42]:
map_hexa = folium.Map(location=[city_latitude, city_longitude], zoom_start=13)
folium.Marker([city_latitude, city_longitude], popup='City of London').add_to(map_hexa)
# Draw a 300 m circle around each candidate center
for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_hexa)
map_hexa

In [104]:
import json
from shapely.geometry import shape, Point

# Load the GeoJSON polygons from thames.geojson
with open('thames.geojson') as f:
    thames = json.load(f)

# Coordinates of the grid points that fall inside one of the polygons
lat_bis = []
long_bis = []

for lat, long in zip(latitudes, longitudes):
    point = Point(long, lat)
    # check each polygon to see if it contains the point
    for feature in thames['features']:
        polygon = shape(feature['geometry'])
        if polygon.contains(point):
            lat_bis.append(lat)
            long_bis.append(long)
            break

print('{} coordinates removed out of {}'.format(len(lat_bis), len(latitudes)))

17 coordinates removed out of 364


In [99]:
map_hexa_bis = folium.Map(location=[city_latitude, city_longitude], zoom_start=13)
folium.Marker([city_latitude, city_longitude], popup='City of London').add_to(map_hexa_bis)

for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_hexa_bis)

# Highlight in red the points that fall inside one of the polygons
for lat, lon in zip(lat_bis, long_bis):
    folium.Circle([lat, lon], radius=300, color='red', fill=False).add_to(map_hexa_bis)
map_hexa_bis

We can see that the grid points falling inside the `thames.geojson` polygons are now highlighted in red. We are now ready to move on to the next phase of our analysis.

---

## 4. Leverage the Foursquare API  <a name="foursquare"></a>

Using the Foursquare API, we will retrieve more information about our neighborhoods.
To use it, we first need to set up our credentials.

For confidentiality purposes, we will try to load them from a *.env* file; if they are not found, we prompt for secured input using the `getpass` module:

In [43]:
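import os
import getpass

# Try to load a .env file (requires the python-dotenv package)
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')

# load_dotenv() does not raise when the file is missing, so check the values explicitly
if not CLIENT_ID or not CLIENT_SECRET:
    CLIENT_ID = getpass.getpass(prompt="Please type your CLIENT_ID: ")
    CLIENT_SECRET = getpass.getpass(prompt="Please  type your CLIENT_SECRET")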

# Other parameters
VERSION = '20180605'
LIMIT = 100
radius = 500

# Print end of credentials
print('Your credentials:')
print('CLIENT_ID: {}{}'.format((len(CLIENT_ID)-4)*"*", CLIENT_ID[-4:]))
print('CLIENT_SECRET: {}{}'.format((len(CLIENT_SECRET)-4)*"*", CLIENT_SECRET[-4:]))

Please type your CLIENT_ID:  ················································
Please  type your CLIENT_SECRET ················································


Your credentials:
CLIENT_ID: ********************************************UXXI
CLIENT_SECRET: ********************************************RJLJ


Now we will define a function to retrieve venues for a given neighborhood:

In [44]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    i = 0
    for name, lat, lng in zip(names, latitudes, longitudes):
        i = i + 1
        print('{} - {}'.format(i, name))
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
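
As a side note, the request URL can also be assembled with the standard library instead of manual string formatting, which takes care of escaping the parameters. The sketch below is only an illustration: `build_explore_url` is a hypothetical helper, and `'MY_ID'` and `'MY_SECRET'` are placeholders, not real credentials. The endpoint and parameter names are the ones used in `getNearbyVenues` above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL (hypothetical helper, for illustration)."""
    base = 'https://api.foursquare.com/v2/venues/explore'
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude pair
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters such as the comma in 'll'
    return '{}?{}'.format(base, urlencode(params))

# Placeholder credentials, Aldgate coordinates from the dataframe above
url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 51.514885, -0.078356)
print(url)
```

With `requests`, the same dictionary could be passed through the `params` argument of `requests.get`, which performs the encoding itself.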

The Foursquare API is limited to 950 requests per day for a free user; therefore, we will store the resulting dataframe in a CSV file, in case we need to re-run the whole notebook.

If we reach our limit, or if the API fails, we will read the data from the CSV file.

In [45]:
try:
    london_city_venues = getNearbyVenues(names=london_city_data['Neighborhood'],
                                    latitudes=london_city_data['Latitude'],
                                    longitudes=london_city_data['Longitude'])
    london_city_venues.to_csv('london_city_venues.csv', index = False)
# Fall back to the cached CSV if the API call fails or the quota is reached
except Exception:
    london_city_venues = pd.read_csv('london_city_venues.csv')

1 - Aldgate
2 - Barbican
3 - Blackfriars
4 - Temple


Let's now have a look at the resulting dataframe:

In [46]:
print(london_city_venues.shape)
print('\n')
london_city_venues.head()

(277, 7)




| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Aldgate | 51.514885 | -0.078356 | 1Rebel | 51.515569 | -0.08004 | Gym / Fitness Center |
| 1 | Aldgate | 51.514885 | -0.078356 | Swingers - The Crazy Golf Club | 51.514202 | -0.080383 | Mini Golf |
| 2 | Aldgate | 51.514885 | -0.078356 | The Association | 51.513733 | -0.079132 | Coffee Shop |
| 3 | Aldgate | 51.514885 | -0.078356 | Katsu Wrap | 51.515883 | -0.077849 | Food Truck |
| 4 | Aldgate | 51.514885 | -0.078356 | SUSHISAMBA | 51.516156 | -0.081169 | Sushi Restaurant |


Now that we have all the venues, let's count them per neighborhood:

In [47]:
london_city_venues.groupby('Neighborhood').count()

| Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Aldgate | 100 | 100 | 100 | 100 | 100 | 100 |
| Barbican | 79 | 79 | 79 | 79 | 79 | 79 |
| Blackfriars | 49 | 49 | 49 | 49 | 49 | 49 |
| Temple | 49 | 49 | 49 | 49 | 49 | 49 |


Let's also have a look at how many unique venue categories there are:

In [48]:
print('There are {} unique categories.'.format(len(london_city_venues['Venue Category'].unique())))

There are 89 unique categories.


We now have everything we need to move on to the next step, and analyze the neighborhoods.

---

## 5. Analyze Each Neighborhood

The first thing we should do is create a new dataframe using the One-Hot Encoding technique.
This method will create a boolean column for each venue category.

In [49]:
# One hot encoding
london_onehot = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
london_onehot['Neighborhood'] = london_venues['Neighborhood'] 

# Move neighborhood column to the first column
fixed_columns = ['Neighborhood']  + [col for col in london_onehot if col != 'Neighborhood']
london_onehot = london_onehot[fixed_columns]

# Check results
print(london_onehot.shape)
print('\n')
london_onehot.head()

NameError: name 'london_venues' is not defined

Now that we have created the dummy columns, let's group the data by neighborhood and take the mean of each column, which gives the frequency of each venue category:

In [None]:
london_grouped = london_onehot.groupby('Neighborhood').mean().reset_index()

# Check results
print(london_grouped.shape)
print('\n')
london_grouped
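
To make the two steps above more concrete, here is a plain-Python sketch of what one-hot encoding followed by a per-neighborhood mean computes. It does not use pandas, and the `venues` list is made-up toy data, not Foursquare results. The point is simply that averaging 0/1 columns within a neighborhood yields the share of venues belonging to each category.

```python
from collections import Counter, defaultdict

# Made-up toy data: (neighborhood, venue category)
venues = [
    ('Aldgate', 'Coffee Shop'),
    ('Aldgate', 'Coffee Shop'),
    ('Aldgate', 'Food Truck'),
    ('Aldgate', 'Gym'),
    ('Temple', 'Pub'),
    ('Temple', 'Coffee Shop'),
]

categories = sorted({category for _, category in venues})

# One-hot encoding: one 0/1 value per category for every venue
onehot = [(hood, [1 if category == c else 0 for c in categories])
          for hood, category in venues]

# Group by neighborhood and average each column
sums = defaultdict(lambda: [0] * len(categories))
counts = Counter()
for hood, row in onehot:
    counts[hood] += 1
    sums[hood] = [s + v for s, v in zip(sums[hood], row)]

frequencies = {
    hood: dict(zip(categories, [s / counts[hood] for s in sums[hood]]))
    for hood in sums
}

for hood, freq in frequencies.items():
    print(hood, freq)
```

In this toy example, half of the venues in Aldgate are coffee shops, so the mean of its `Coffee Shop` column is 0.5.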

In its current form, the dataset is hard to read, so let's have a look at the top 5 venues per neighborhood:

In [None]:
num_top_venues = 5

for hood in london_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = london_grouped[london_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Once again, while this data is interesting, it could be reworked to be more human-friendly.
Let's define a function to return the top N venues for a given neighborhood:

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
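
The same ranking idea can be shown without pandas. The sketch below is an assumption-laden illustration: `most_common_venues` is a hypothetical dictionary-based counterpart of the function above, and the frequencies in `row` are made-up values.

```python
def most_common_venues(frequencies, num_top_venues):
    """Return the category names with the highest frequencies (illustrative helper)."""
    # Sort (category, frequency) pairs by frequency, highest first
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:num_top_venues]]

# Made-up frequencies for one neighborhood
row = {'Gym': 0.15, 'Coffee Shop': 0.5, 'Pub': 0.10, 'Food Truck': 0.25}

top = most_common_venues(row, 2)
print(top)  # ['Coffee Shop', 'Food Truck']
```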

With this function, we can now create a new dataframe listing the top 10 venues per neighborhood:

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# Create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no dedicated suffix beyond 'rd': fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = london_grouped['Neighborhood']

for ind in np.arange(london_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(london_grouped.iloc[ind, :], num_top_venues)

# Check results
print(neighborhoods_venues_sorted.shape)
print('\n')
neighborhoods_venues_sorted.head()

It is already much better and easier to analyze.
With this dataset, we can proceed to the next step and cluster neighborhoods.

---

## 6. Cluster Neighborhoods

Using the _k_-means method, we will create 10 clusters of the neighborhoods:

In [None]:
# Set number of clusters
kclusters = 10

# Defining the dataframe
london_grouped_clustering = london_grouped.drop('Neighborhood', axis=1)

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(london_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
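
For readers unfamiliar with _k_-means, the following plain-Python sketch shows the two alternating steps of the basic algorithm (assign each point to its closest center, then move each center to the mean of its points). It is not scikit-learn's implementation: the naive initialization with the first `k` points, the fixed number of iterations, and the toy frequency vectors are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=10):
    """Basic k-means sketch: returns (labels, centers)."""
    # Naive initialization (assumption): use the first k points as centers
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with the index of its closest center
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Made-up frequency vectors (two venue categories) for four neighborhoods
toy_rows = [(0.6, 0.1), (0.5, 0.2), (0.1, 0.7), (0.0, 0.8)]

labels, centers = kmeans(toy_rows, k=2)
print(labels)  # [0, 0, 1, 1]
```

The first two rows have similar venue profiles and end up in one cluster, while the last two form the other.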

Now let's merge the clusters with the venues; we will use an inner join to drop the neighborhoods for which Foursquare didn't return any venues:

In [None]:
# Add clustering labels
# Work on a copy so that neighborhoods_venues_sorted is left unchanged
neighborhoods_venues_clustered = neighborhoods_venues_sorted.copy()
neighborhoods_venues_clustered.insert(0, 'Cluster Labels', kmeans.labels_)

london_merged = london_data

# Merge grouped data with original data to add latitude/longitude for each neighborhood
london_merged = london_merged.join(neighborhoods_venues_clustered.set_index('Neighborhood'), on='Neighborhood', how='inner')

# Check results
print(london_merged.shape)
print('\n')
london_merged.head()

In [None]:
# Create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
for lat, lon, poi, cluster in zip(london_merged['Latitude'], london_merged['Longitude'], london_merged['Neighborhood'], london_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # Cluster labels start at 0, so they index the color list directly
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

# Display map
map_clusters



---

## 7. Examine Clusters



In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 0,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 1,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 2,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 3,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 4,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 5,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 6,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 7,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 8,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 9,
                  london_merged.columns[[2] + list(range(6, london_merged.shape[1]))]]