<h1><center>The Battle of the Neighborhoods</center></h1>
<h3><center>Applied Data Science Capstone by IBM/Coursera</center></h3>

## Table of contents
* [1. Introduction: Business Problem](#introduction)
* [2. Data: Data Sources](#data)

## 1. Introduction: Business Problem <a name="introduction"></a>

1.1.	Background<br>
London is one of the largest cities in the world, with a population of over 8 million people according to the 2011 Census. It has attracted people from all over the world and has become one of the most ethnically diverse cities anywhere: a melting pot of ethnicities, cultures and backgrounds. Asians are one of its largest minority ethnic groups, making up 18.4% of London's total population in the 2011 Census. This includes many Chinese residents, both immigrants and those born and raised in the UK. It is therefore no surprise that Chinese cuisine has been growing in popularity. London even has its own Chinatown, located in Westminster.

1.2.	Business Problem<br>
The business problem is how to capitalize on this growing demand for Chinese cuisine by opening a Chinese restaurant. The first thing to consider when opening a new restaurant is location. The purpose of this project is to determine which neighborhood would be the best place to open a new Chinese restaurant. To do so, we analyze demographic data for London's boroughs, explore nearby venues, and cluster the neighborhoods.

1.3.	Who would be interested?<br>
Any person/company looking to open a new Chinese restaurant in London.

## 2. Data: Data Sources <a name="data"></a>

**Data Sources**<br>
We will be using 3 main data sources:
* First, we want to find out which boroughs have the largest Chinese population, since one assumption of this project is that demand for Chinese food comes predominantly from the Chinese community. For this we need demographic data for London, which we obtain from the Wikipedia page "Ethnic groups in London".
* Next, we choose the five boroughs with the largest Chinese population and obtain a list of their neighborhoods and the corresponding postal codes. The postal codes are then used to get the coordinates (latitude and longitude) of each neighborhood, which we will need later on. We obtain this information from the Wikipedia page "List of areas of London".
* Lastly, we use the Foursquare Developer API to explore each neighborhood: which venues it contains, how many Chinese restaurants there are, and which venue types are most common. This will guide our final recommendation of the neighborhood best suited to a new Chinese restaurant.

**Data Cleaning**<br>
We start by web scraping the Wikipedia page "Ethnic groups in London" and identifying the table relevant to our project, titled Asian Population of London. The HTML table is converted into a pandas dataframe from which we extract only the two relevant columns, London Borough and Chinese Population. We then sort the dataframe in descending order of Chinese Population to get the 5 boroughs with the highest Chinese population.

Next, we need the neighborhoods and postal codes for these 5 boroughs, so we scrape the Wikipedia page "List of areas of London". From the HTML table we extract the relevant columns (Location, London borough, Post town, Postcode district) into a new pandas dataframe. In some cases a neighborhood is attached to more than one postcode, so we split these entries into one row per postcode and later drop repeated postcodes. After that, we keep only the rows whose post town contains 'LONDON', and then filter the resulting dataframe for the 5 London boroughs identified earlier.


<h3>2.1 Installing Required Libraries</h3>

In [1]:
!pip -q install geopy

!pip -q install geocoder

!pip install pgeocode

!pip -q install folium
print('Installation Completed')

Installation Completed


<h3> 2.2 Importing Required Libraries</h3>

In [2]:
# BeautifulSoup, for web scraping
from bs4 import BeautifulSoup

# Library to handle data in a vectorized manner
import numpy as np

# Library for data analysis
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Library to handle JSON files
import json

# Convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# Library to handle requests
import requests

# Transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)
from pandas import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means from clustering stage
from sklearn.cluster import KMeans

# Import the Geocoder
import geocoder

# Import pgeocode
import pgeocode

# Map rendering library
import folium

# Import matplotlib and seaborn for visualisation
import matplotlib.pyplot as plt
import seaborn as sns

print('Libraries Imported')

Libraries Imported


<h3>2.3 Getting Demographic Data for Ethnic Groups in London</h3>

To determine which neighborhood in London would be the best place to open a new Chinese restaurant, we assume that the Chinese population makes up the majority of the market for Chinese cuisine. In reality, people of other ethnicities also frequent Chinese restaurants. However, since market data for Chinese cuisine is not freely available online, we make this assumption for the purpose of this project.

In [3]:
# Submitting a GET request with requests and parsing the response with BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Ethnic_groups_in_London#:~:text=At%20the%202011%20census%2C%20London,24.5%25%20born%20outside%20of%20Europe.'
html_data  = requests.get(url).text
soup = BeautifulSoup(html_data, "html5lib")

In [4]:
tables = soup.find_all('table', {'class':'wikitable'})

In [5]:
asian_pop = tables[6].tbody

In [6]:
# Collect one record per borough from the HTML table
records = []
for i, row in enumerate(asian_pop.find_all('tr')):
    if i == 0:  # The first row is skipped as it only contains headers
        continue
    col = row.find_all("td")
    borough = col[1].text
    chinese = col[5].text.replace(',', '')  # Remove thousands separators
    records.append({"London Borough": borough, "Chinese Population": chinese})

# Define the column names and build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
columns = ['London Borough', 'Chinese Population']
chinese_pop = pd.DataFrame(records, columns=columns)

In [7]:
# Examine the first 5 rows of the dataframe
chinese_pop.head()

| | London Borough | Chinese Population |
|---|---|---|
| 0 | Newham | 3930 |
| 1 | Redbridge | 3000 |
| 2 | Brent | 3250 |
| 3 | Tower Hamlets | 8109 |
| 4 | Harrow | 2629 |


Now that we have our data, we want to sort the dataframe by 'Chinese Population' from the borough with the highest Chinese population to the lowest.

In [8]:
# Converting the values to type int so we can sort by population
chinese_pop['Chinese Population'] = chinese_pop['Chinese Population'].astype('int')

# Sorting the Chinese Population column in descending order 
chinese_pop_sort = chinese_pop.sort_values(by='Chinese Population', ascending = False).reset_index(drop = True)

# Returning the top 5 boroughs with the highest Chinese population
chinese_pop_sort.head()

| | London Borough | Chinese Population |
|---|---|---|
| 0 | Barnet | 8259 |
| 1 | Tower Hamlets | 8109 |
| 2 | Southwark | 8074 |
| 3 | Camden | 6493 |
| 4 | Westminster | 5917 |
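
The integer conversion in the cell above matters because scraped figures arrive as text. As a minimal plain-Python sketch (not the notebook's pandas code), the following shows how sorting the text values can misrank boroughs once the figures differ in length. The first four rows reuse values from the tables above; `Hypothetical Borough` and its figure are invented purely for illustration, and `to_int` is a helper name introduced here.

```python
# Population figures as they might look straight after scraping:
# strings with thousands separators.
raw_rows = [
    ("Newham", "3,930"),
    ("Tower Hamlets", "8,109"),
    ("Harrow", "2,629"),
    ("Brent", "3,250"),
    ("Hypothetical Borough", "12,000"),  # made-up row, for illustration only
]


def to_int(text):
    """Remove thousands separators and surrounding whitespace, then cast to int."""
    return int(text.replace(",", "").strip())


# Sorting the cleaned strings compares them character by character,
# so "8109" ranks above "12000".
by_string = sorted(raw_rows, key=lambda r: r[1].replace(",", ""), reverse=True)

# Sorting the integers gives the intended ranking.
by_number = sorted(raw_rows, key=lambda r: to_int(r[1]), reverse=True)

print("As strings:", [name for name, _ in by_string][:2])
print("As numbers:", [name for name, _ in by_number][:2])
```

With the real data every figure has four digits, so both orderings would agree, but the conversion keeps the ranking correct regardless of magnitude.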


<h3>2.4 Getting Neighborhood and Postal Code Data for the 5 Boroughs in London</h3>

Now that we have identified the 5 boroughs that we will be focusing on, we want to get a list of the neighborhoods and postal codes in those boroughs.

In [9]:
# Submitting a GET request with requests and parsing the response with BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'
html_data  = requests.get(url).text
soup = BeautifulSoup(html_data, "html5lib")

In [10]:
# This extracts the "tbody" within the table where class is "wikitable sortable"
table = soup.find('table', {'class':'wikitable sortable'}).tbody

# Extracts all table rows within the table above
rows = table.find_all('tr')

# Define the column names for the new dataframe
columns = ['Neighborhood', 'Borough', 'PostTown', 'PostCode']

# Extract the neighborhood, borough, post town and postcode from each table row
records = []
for row in rows[1:]:  # The first row is skipped as it only contains headers
    col = row.find_all("td")
    location = col[0].text
    # Strip a trailing footnote marker such as "[1]" from the borough name
    borough = col[1].text.rstrip(']').rstrip('0123456789').rstrip('[')
    posttown = col[2].text
    postcode = col[3].text
    records.append({"Neighborhood": location, "Borough": borough, 'PostTown': posttown, 'PostCode': postcode})

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
london = pd.DataFrame(records, columns=columns)

In [11]:
# Examine the first 5 rows of the dataframe
london.head()

| | Neighborhood | Borough | PostTown | PostCode |
|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | LONDON | SE2 |
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W3, W4 |
| 2 | Addington | Croydon | CROYDON | CR0 |
| 3 | Addiscombe | Croydon | CROYDON | CR0 |
| 4 | Albany Park | Bexley | BEXLEY, SIDCUP | DA5, DA14 |
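
The chained `rstrip` calls in the scraping cell remove a single trailing footnote marker such as `[1]` from the borough name. A regular expression is one possible alternative that also copes with several markers in a row. This is a sketch only: `FOOTNOTE_RE` and `strip_footnotes` are names introduced here, and the sample strings with markers are illustrative rather than taken from the Wikipedia page.

```python
import re

# Matches one or more footnote markers at the end of the text,
# e.g. "[7]" or "[12][13]", plus any trailing whitespace.
FOOTNOTE_RE = re.compile(r"(\[\d+\])+\s*$")


def strip_footnotes(cell_text):
    """Return the cell text without trailing footnote markers or whitespace."""
    return FOOTNOTE_RE.sub("", cell_text).strip()


# Illustrative inputs (the markers are invented for the example)
for sample in ["Bexley, Greenwich[7]", "Camden[1][2]", "Croydon"]:
    print(repr(sample), "->", repr(strip_footnotes(sample)))
```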


In the <code>london</code> dataframe, we note that in some cases a neighborhood is attached to more than one postcode. We split these comma-separated values so that each postcode gets its own row, with the neighborhood and borough repeated on each of those rows.

In [12]:
# Splitting the postcode values by ',' so that each postcode gets its own row
london = london.drop('PostCode', axis=1).join(london['PostCode'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('PostCode'))

In [13]:
# Examine the first 5 rows of the dataframe
london.head()

| | Neighborhood | Borough | PostTown | PostCode |
|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | LONDON | SE2 |
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W3 |
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W4 |
| 2 | Addington | Croydon | CROYDON | CR0 |
| 3 | Addiscombe | Croydon | CROYDON | CR0 |
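
The split step can be sketched in plain Python to make its effect on individual values visible. Note that splitting `"W3, W4"` on `','` alone leaves a leading space on the second postcode, so stripping each piece is a sensible precaution before comparing postcodes later. The helper name `split_postcodes` is introduced here for illustration; the two sample rows come from the table above.

```python
def split_postcodes(cell_text):
    """Split a comma-separated postcode cell into clean postcode districts."""
    return [part.strip() for part in cell_text.split(",") if part.strip()]


# Two rows taken from the london dataframe shown earlier
rows = [("Abbey Wood", "SE2"), ("Acton", "W3, W4")]

# One output row per (neighborhood, postcode) pair
exploded = [(name, code) for name, cell in rows for code in split_postcodes(cell)]
print(exploded)

# Without strip(), the second piece keeps its leading space
naive = "W3, W4".split(",")
print(naive)
```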


Now that each row holds a single postal code, we want to filter the dataframe for PostTown 'LONDON' and for the 5 boroughs as well.

In [14]:
# Filtering the data to include only neighborhoods in London
london = london[london['PostTown'].str.contains('LONDON')]

# Dropping the PostTown column after we have filtered the dataframe
# (reassigning avoids modifying a filtered slice in place)
london = london.drop(columns=['PostTown'])

# Filtering the data to include only the 5 boroughs
london = london[london['Borough'].str.contains('Barnet|Tower Hamlets|Southwark|Camden|Westminster')].reset_index(drop = True)

In [15]:
# Examine the first 10 rows of the dataframe
london.head(10)

| | Neighborhood | Borough | PostCode |
|---|---|---|---|
| 0 | Aldwych | Westminster | WC2 |
| 1 | Arkley | Barnet | EN5 |
| 2 | Arkley | Barnet | NW7 |
| 3 | Bankside | Southwark | SE1 |
| 4 | Barnet Gate | Barnet | NW7 |
| 5 | Barnet Gate | Barnet | EN5 |
| 6 | Bayswater | Westminster | W2 |
| 7 | Belgravia | Westminster | SW1 |
| 8 | Belsize Park | Camden | NW3 |
| 9 | Bermondsey | Southwark | SE1 |
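
The borough filter treats `'Barnet|Tower Hamlets|...'` as a regular expression and keeps any row whose Borough text contains one of the names, which means rows listing several boroughs are kept as long as one of them matches. A minimal sketch of the same matching logic with the standard `re` module follows; `TOP_BOROUGHS`, `BOROUGH_RE` and `in_top_borough` are names introduced here, and `"Camden, Islington"` is a hypothetical multi-borough value used only to show the behavior.

```python
import re

# The five boroughs identified in section 2.3
TOP_BOROUGHS = ["Barnet", "Tower Hamlets", "Southwark", "Camden", "Westminster"]

# Build an alternation pattern; re.escape guards against special characters
BOROUGH_RE = re.compile("|".join(re.escape(name) for name in TOP_BOROUGHS))


def in_top_borough(borough_cell):
    """Return True if any of the five borough names occurs in the cell text."""
    return BOROUGH_RE.search(borough_cell) is not None


for cell in ["Westminster", "Bexley, Greenwich", "Camden, Islington"]:
    print(cell, "->", in_top_borough(cell))
```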


We can see from the dataframe above that some postcodes appear more than once, because several neighborhoods share the same postcode district. The latitude and longitude values for these rows will be identical, so multiple copies of the same postcode add no value to our analysis and are dropped, keeping the first occurrence.

In [16]:
# Drop duplicated postcodes
london.drop_duplicates(subset ="PostCode", keep='first', inplace = True)

In [17]:
# Check the final number of rows of the dataframe
london.shape

(54, 3)
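
The "keep the first occurrence" rule can be sketched in plain Python as follows. One difference from the cell above is an assumption of this sketch: postcodes are stripped and upper-cased before comparison, whereas `drop_duplicates` compares the stored strings exactly, so values differing only by a stray space would count as distinct there. The function name `drop_duplicate_postcodes` is introduced here; the sample rows reuse values from the table above, with a leading space added to one postcode for illustration.

```python
def drop_duplicate_postcodes(rows):
    """Keep only the first row seen for each normalised postcode."""
    seen = set()
    kept = []
    for neighborhood, borough, postcode in rows:
        key = postcode.strip().upper()  # normalise before comparing
        if key not in seen:
            seen.add(key)
            kept.append((neighborhood, borough, key))
    return kept


rows = [
    ("Aldwych", "Westminster", "WC2"),
    ("Arkley", "Barnet", "EN5"),
    ("Arkley", "Barnet", " NW7"),  # leading space added for illustration
    ("Bankside", "Southwark", "SE1"),
    ("Barnet Gate", "Barnet", "NW7"),
    ("Barnet Gate", "Barnet", "EN5"),
]

unique_rows = drop_duplicate_postcodes(rows)
print(len(unique_rows), [code for _, _, code in unique_rows])
```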

<h3>2.5 Getting the Latitude and Longitude for each PostCode</h3>

Now that we have the names of the neighborhoods and their respective postal codes, we want to get the latitude and longitude values as well. To do so, we will be using the Python library <code>pgeocode</code>.

In [18]:
# Extracting the PostCode column from our london dataframe
postal_codes = london['PostCode']

# Create the pgeocode lookup for Great Britain once, outside the loop
nomi = pgeocode.Nominatim('gb')

# Look up each postcode and collect its latitude and longitude by field name
records = []
for postal_code in postal_codes:
    coordinate = nomi.query_postal_code(postal_code)
    records.append({"PostCode": postal_code, "Latitude": coordinate['latitude'], 'Longitude': coordinate['longitude']})

# Define the column names and build the coordinates dataframe in one step
columns = ['PostCode', 'Latitude', 'Longitude']
coordinates = pd.DataFrame(records, columns=columns)

# Examine the first 5 rows of the dataframe
coordinates.head()

| | PostCode | Latitude | Longitude |
|---|---|---|---|
| 0 | WC2 | 51.5142 | -0.123382 |
| 1 | EN5 | 51.6562 | -0.194317 |
| 2 | NW7 | 51.6143 | -0.2273 |
| 3 | SE1 | 51.4963 | -0.093038 |
| 4 | NW7 | 51.6143 | -0.2273 |


We now have 2 dataframes: <code>london</code>, containing the names of the neighborhoods and boroughs, and <code>coordinates</code>, containing the coordinates for these neighborhoods. We want to merge the two dataframes to get the main dataframe for our analysis.

In [19]:
# Perform a left join on london and coordinates dataframe
london = pd.merge(london, coordinates,on='PostCode', how='left')

# Examine the first 5 rows of the dataframe
london.head()

| | Neighborhood | Borough | PostCode | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Aldwych | Westminster | WC2 | 51.5142 | -0.123382 |
| 1 | Arkley | Barnet | EN5 | 51.6562 | -0.194317 |
| 2 | Arkley | Barnet | NW7 | 51.6143 | -0.2273 |
| 3 | Bankside | Southwark | SE1 | 51.4963 | -0.093038 |
| 4 | Barnet Gate | Barnet | NW7 | 51.6143 | -0.2273 |


In [20]:
# Check the shape of the dataframe london to ensure no information has been dropped or lost during the process
london.shape

(54, 5)
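
The shape check works because a left join keeps every row of the left table, filling in missing values where the right table has no match. A minimal plain-Python sketch of that behavior follows. It is not how `pd.merge` is implemented, and it differs in one respect: `pd.merge` repeats a left row for every matching right row, while this sketch keeps only the first match, so its row count never changes. `left_join` is a name introduced here, and the `ZZ9` row is made up to show an unmatched key.

```python
def left_join(left, right, key):
    """Attach the columns of the first matching right row to each left row."""
    right_columns = [c for c in (right[0] if right else {}) if c != key]
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], row)  # keep the first row per key
    merged = []
    for row in left:
        match = lookup.get(row[key], {})
        # Unmatched keys get None for every right-hand column
        merged.append({**row, **{c: match.get(c) for c in right_columns}})
    return merged


london_rows = [
    {"Neighborhood": "Aldwych", "Borough": "Westminster", "PostCode": "WC2"},
    {"Neighborhood": "Arkley", "Borough": "Barnet", "PostCode": "EN5"},
    {"Neighborhood": "Nowhere", "Borough": "Hypothetical", "PostCode": "ZZ9"},  # made-up
]
coordinate_rows = [
    {"PostCode": "WC2", "Latitude": 51.5142, "Longitude": -0.123382},
    {"PostCode": "EN5", "Latitude": 51.6562, "Longitude": -0.194317},
]

merged = left_join(london_rows, coordinate_rows, "PostCode")
print(len(merged), merged[2])
```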

<h3>2.6 Defining the Foursquare API Data</h3>

Finally, we define the credentials used to access the Foursquare API, along with the result limit and search radius (in meters) for our calls.

In [21]:
CLIENT_ID = 'W33WOM3RWJHP511UIYKJKJISIWDN5KAOWU3N2V3ECGQGYIYP'
CLIENT_SECRET = 'P22DORR32XWLTRDAZQJ2T24N1YUVPM3DLV3G0UR4RNJAULGP'
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
radius = 2000 # A default radius value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: W33WOM3RWJHP511UIYKJKJISIWDN5KAOWU3N2V3ECGQGYIYP
CLIENT_SECRET:P22DORR32XWLTRDAZQJ2T24N1YUVPM3DLV3G0UR4RNJAULGP
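
To show how these settings might fit together, here is a hedged sketch that assembles a request URL for one neighborhood. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `client_id`, `client_secret`, `v`, `ll`, `radius` and `limit` query parameters; the text above does not name the endpoint, so treat that choice as an assumption. `build_explore_url` and the placeholder credentials are introduced here for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; the notebook text does not state which one is used.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the query URL for venues around a latitude/longitude pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude,longitude
        "radius": radius,      # search radius in meters
        "limit": limit,        # maximum number of venues returned
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials; coordinates are those of WC2 from section 2.5
url = build_explore_url(
    "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20180605",
    51.5142, -0.123382, radius=2000, limit=100,
)
print(url)
```

The resulting URL could then be passed to `requests.get(url).json()`. Reading the credentials from environment variables rather than typing them into a cell would keep them out of the saved notebook.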
