# IBM Data Science Professional Certificate
## Applied Data Science Capstone - The Battle of Neighborhoods
_By Kevin Gilson._

---

## 1. Introduction
### 1a. Business problem

Young entrepreneurs are always in need of good advice, and always looking for the keys to success.
Data science can help them analyze the market and gather information such as:
* Which places of a city are the most prolific?
* Which types of business are present in which areas?
* Which types of business are lacking in which areas?
* ...

This kind of information is particularly relevant for people looking to open food-related businesses.

With almost 9 million people in 2020, London is a big city. Not only is it the capital of the United Kingdom, it is also one of the top financial centers in Europe and a major tourist destination.
With that in mind, what could help a young entrepreneur who wants to launch a successful food truck in London? What are the most prevalent types of shops? Where are the _hot_ zones of the city? Which location would maximize his profit?

Thanks to data science, we can analyze raw data and come up with suggestions to help this young man grow into a successful entrepreneur.
And who knows, this information might even help him make it big within 10 years.

### 1b. Data

London is split into boroughs, each containing several neighborhoods.

To build our geographical data, we will first scrape Wikipedia to get the complete list of boroughs and neighborhoods.
Then, we will use Python libraries to link each neighborhood to its geographical coordinates.
After that, we will use the Foursquare location data to retrieve the attributes of each neighborhood, cluster zones, and, most importantly, suggest the best places to set up a food truck.

Our primary data, scraped from Wikipedia, takes the following form:

|Location|London Borough|Post Town|Postcode District|Dial code|OS Grid Ref|
|--------|--------------|---------|-----------------|---------|-----------|
|Abbey Wood|Bexley, Greenwich [7]|LONDON|SE2|020|TQ465785|
|Acton|Ealing, Hammersmith and Fulham[8]|LONDON|W3, W4|020|TQ205805|
|Addington|Croydon[8]|CROYDON|CR0|020|TQ375645|
|Addiscombe|Croydon[8]|CROYDON|CR0|020|TQ345665|
|Albany Park|Bexley|BEXLEY, SIDCUP|DA5, DA14|020|TQ478728|

As we can see, the data will need to be cleaned. It is also worth noting that OS Grid References can be converted into coordinates quite easily.
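To make the structure of an OS Grid Reference concrete, here is a minimal sketch of the first stage of such a conversion: decoding the two letters and the digits into an easting and a northing in metres. The function name `gridref_to_easting_northing` is hypothetical and not part of any library used in this notebook. Going from easting/northing to latitude/longitude additionally requires a map projection and a datum shift, which this notebook delegates to the OSGridConverter library.

```python
def gridref_to_easting_northing(gridref):
    """Decode an OS grid reference such as 'TQ465785' into (easting, northing) in metres.

    Hypothetical helper for illustration only; it does not compute latitude/longitude.
    """
    ref = gridref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) < 2 or not letters.isalpha() or not digits.isdigit() or len(digits) % 2:
        raise ValueError("Invalid grid reference: {!r}".format(gridref))

    # Convert both letters to indices 0..24; the grid alphabet skips the letter 'I'
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # Position of the 100 km square, counted from the origin of the national grid
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = 19 - (l1 // 5) * 5 - (l2 // 5)

    # The digits split into two halves; pad each half to 5 digits (1 metre resolution)
    half = len(digits) // 2
    easting = e100km * 100000 + int(digits[:half].ljust(5, "0"))
    northing = n100km * 100000 + int(digits[half:].ljust(5, "0"))
    return easting, northing


# Grid reference of Abbey Wood, taken from the table above
print(gridref_to_easting_northing("TQ465785"))
```

An empty or malformed reference raises a `ValueError` in this sketch, which mirrors the fact that rows without a grid reference cannot be converted and need separate handling.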

### 1c. Preliminary steps

Before starting our analysis, let's first import the libraries we will need throughout the notebook:

In [None]:
# Regular expression
import re

# Handle data in a vectorized way
import numpy as np

# Data analysis
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Clustering
from sklearn.cluster import KMeans

# JSON files
import json

# Handle GET / POST requests
import requests

# Convert an address into coordinates
!pip install geopy
from geopy.geocoders import Nominatim

# Coordinates
!pip install geocoder
import geocoder # Geographical library

# OS Grid References converter
!pip install OSGridConverter
from OSGridConverter import grid2latlong

# Web scraping
!pip install beautifulsoup4
from bs4 import BeautifulSoup # web scraping library

# Parser
!pip install lxml
import lxml # parser library

# URL opening
import urllib.request as req # URL opening library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Map rendering
!pip install folium==0.5.0
import folium

# Python ENV file
!pip install python-dotenv
from dotenv import load_dotenv

# Password input
import getpass

# Importation confirmation
print('Libraries installed and imported.')

---

## 2. Retrieve the dataset

### 2a. Scrape the Wikipedia page

As the first step, we need to scrape the Wikipedia page to retrieve its table, and with it the data about the London districts and areas.
In order to do so, we will:
1. Scrape the page using BeautifulSoup;
2. Convert the OS Grid References found to coordinates (latitude, longitude) using the OSGridConverter library; missing OS Grid References will result in NaN values for their coordinates;
3. Check the shape of the recovered dataframe.

In [101]:
url_wiki = "https://en.wikipedia.org/wiki/List_of_areas_of_London"

# Scrape the page with BeautifulSoup
page = req.urlopen(url_wiki)
soup = BeautifulSoup(page, "lxml")
all_tables = soup.find_all("table")

# Initiate the dataframe
column_names = ['Location', 'London Borough', 'Post Town',
                'Post District', 'Dial Code', 'OS Grid Ref',
                'Latitude', 'Longitude']
wiki_data = pd.DataFrame(columns=column_names)

# Loop through the rows of the second table (the first one is the Contents)
for row in all_tables[1].find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 6:
        wiki_location = cells[0].text.strip()
        wiki_borough = cells[1].text.strip()
        wiki_town = cells[2].text.strip()
        wiki_district = cells[3].text.strip()
        wiki_dial = cells[4].text.strip()
        wiki_gridref = cells[5].text.strip()

        # Convert the grid reference; fall back to the 'NaN' marker when it is missing or invalid
        try:
            wiki_latlong = grid2latlong(wiki_gridref)
            wiki_lat, wiki_lng = wiki_latlong.latitude, wiki_latlong.longitude
        except Exception:
            wiki_lat, wiki_lng = 'NaN', 'NaN'

        # Add the row in column order (DataFrame.append was removed in pandas 2.0)
        wiki_data.loc[len(wiki_data)] = [wiki_location, wiki_borough, wiki_town,
                                         wiki_district, wiki_dial, wiki_gridref,
                                         wiki_lat, wiki_lng]

# Check the results
print("The resulting dataframe has a shape of: {}\n".format(wiki_data.shape))
wiki_data.head()

The resulting dataframe has a shape of: (533, 8)



| |Location|London Borough|Post Town|Post District|Dial Code|OS Grid Ref|Latitude|Longitude|
|-|--------|--------------|---------|-------------|---------|-----------|--------|---------|
|0|Abbey Wood|Bexley, Greenwich [7]|LONDON|SE2|020|TQ465785|51.4865|0.109318|
|1|Acton|Ealing, Hammersmith and Fulham[8]|LONDON|W3, W4|020|TQ205805|51.5106|-0.264585|
|2|Addington|Croydon[8]|CROYDON|CR0|020|TQ375645|51.3629|-0.0257799|
|3|Addiscombe|Croydon[8]|CROYDON|CR0|020|TQ345665|51.3816|-0.0681255|
|4|Albany Park|Bexley|BEXLEY, SIDCUP|DA5, DA14|020|TQ478728|51.4349|0.125663|
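The row filter used in the scraping step (keep only rows with exactly six `td` cells) can be tried offline on a small fragment. The sketch below deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; the class name `TableRowParser` and the HTML sample are assumptions written for illustration, modeled on the table shown in section 1b.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell texts
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a <td>; header cells (<th>) are ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written sample: one header row, one complete row, one incomplete row
sample_html = """
<table>
<tr><th>Location</th><th>London Borough</th><th>Post Town</th>
    <th>Postcode District</th><th>Dial code</th><th>OS Grid Ref</th></tr>
<tr><td>Abbey Wood</td><td>Bexley, Greenwich [7]</td><td>LONDON</td>
    <td>SE2</td><td>020</td><td>TQ465785</td></tr>
<tr><td>Incomplete row</td><td>Bexley</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)

# Same filter as in the notebook: only rows with exactly six data cells are kept
data_rows = [row for row in parser.rows if len(row) == 6]
print(data_rows)
```

The header row yields no `td` cells and the incomplete row yields only two, so both are discarded by the length check.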


### 2b. Upgrade the data

The resulting dataframe carries columns we will not use, and some of its column names are unclear; let's tidy it up:
1. First we will drop the columns that won't be used later;
2. Then we will give a few of the remaining ones clearer names;
3. Finally we will reorder the columns.

In [102]:
# Drop unused columns
wiki_data.drop(columns=['Post District'
                        ,'Dial Code'
                        ,'OS Grid Ref'], inplace=True)

# Rename columns
wiki_data.rename(columns={'Location':'Neighborhood'
                          ,'London Borough':'Borough'
                          ,'Post Town':'Town'}, inplace=True)

# Reorder
wiki_data = wiki_data[['Borough'
                       ,'Neighborhood'
                       ,'Town'
                       ,'Latitude'
                       ,'Longitude']]
# Check the results
wiki_data.head()

| |Borough|Neighborhood|Town|Latitude|Longitude|
|-|-------|------------|----|--------|---------|
|0|Bexley, Greenwich [7]|Abbey Wood|LONDON|51.4865|0.109318|
|1|Ealing, Hammersmith and Fulham[8]|Acton|LONDON|51.5106|-0.264585|
|2|Croydon[8]|Addington|CROYDON|51.3629|-0.0257799|
|3|Croydon[8]|Addiscombe|CROYDON|51.3816|-0.0681255|
|4|Bexley|Albany Park|BEXLEY, SIDCUP|51.4349|0.125663|


### 2c. Clean the data

We can see a few problems with our data, such as:
* Some values contain Wikipedia reference markers (e.g. "[1]");
* Some neighborhoods are assigned to two or more boroughs or towns;
* Some values are followed by explanations in parentheses;
* The case of the data is not normalized, mixing upper and lower case.

Let's define a function that uses regular expressions to strip brackets, parentheses, and additional values from the data, keeping only the first value when several are listed.
Then, let's capitalize each value (e.g. "VALUES" and "values" will become "Values").

In [103]:
# Cleaning function, based on RegEx patterns: cut the value at the first match
def Clean_DataEnd(raw_data, pattern):
    match = re.search(pattern, str(raw_data))
    if match:
        return raw_data[:match.start()]
    return raw_data

# Fix Wikipedia references, double values, and annotations
# Raw strings avoid invalid escape sequence warnings for '\[' and '\('
# Note: ' and .*' also shortens names that contain ' and '
# (e.g. 'Kensington and Chelsea' would become 'Kensington')
for col in wiki_data.select_dtypes(include='object').columns:
    wiki_data[col] = wiki_data[col].apply(Clean_DataEnd, pattern=r' \[.*')
    wiki_data[col] = wiki_data[col].apply(Clean_DataEnd, pattern=r'\[.*')
    wiki_data[col] = wiki_data[col].apply(Clean_DataEnd, pattern=r' \(.*')
    wiki_data[col] = wiki_data[col].apply(Clean_DataEnd, pattern=r', .*')
    wiki_data[col] = wiki_data[col].apply(Clean_DataEnd, pattern=r' and .*')

# Capitalize data - use title() to capitalize each word
for col in ['Borough','Neighborhood','Town']:
    wiki_data[col] = wiki_data[col].str.title()

# Check
wiki_data.head()

| |Borough|Neighborhood|Town|Latitude|Longitude|
|-|-------|------------|----|--------|---------|
|0|Bexley|Abbey Wood|London|51.4865|0.109318|
|1|Ealing|Acton|London|51.5106|-0.264585|
|2|Croydon|Addington|Croydon|51.3629|-0.0257799|
|3|Croydon|Addiscombe|Croydon|51.3816|-0.0681255|
|4|Bexley|Albany Park|Bexley|51.4349|0.125663|
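To see how the successive patterns act on individual values, the following standalone sketch applies the same truncation idea to a few raw strings taken from the scraped table. The helper names `truncate_at` and `clean_value` are hypothetical and exist only for this illustration; the pattern list mirrors the one used in the cleaning cell.

```python
import re

# Patterns applied in order; each one cuts the value at its first match
PATTERNS = [r" \[.*", r"\[.*", r" \(.*", r", .*", r" and .*"]


def truncate_at(value, pattern):
    """Return the part of value that precedes the first match of pattern."""
    match = re.search(pattern, value)
    return value[:match.start()] if match else value


def clean_value(value):
    """Apply every pattern in turn, then normalize the case with str.title()."""
    for pattern in PATTERNS:
        value = truncate_at(value, pattern)
    return value.title()


raw_values = [
    "Bexley, Greenwich [7]",
    "Croydon[8]",
    "BEXLEY, SIDCUP",
    "Ealing, Hammersmith and Fulham[8]",
]
cleaned = [clean_value(v) for v in raw_values]
print(cleaned)
```

Because every pattern truncates at its first match, only the first listed borough or town survives. This is a simplification: the remaining values are discarded rather than split into separate rows.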


### 2d. Handle missing values

Before going further, we want to make sure that we have the coordinates of every neighborhood.
Let's check the latitudes and longitudes:

In [104]:
print(wiki_data[wiki_data['Latitude'] == 'NaN'])
print(wiki_data[wiki_data['Longitude'] == 'NaN'])

     Borough Neighborhood       Town Latitude Longitude
53    Bexley      Blendon     Bexley      NaN       NaN
233  Bromley    Hazelwood  Orpington      NaN       NaN
     Borough Neighborhood       Town Latitude Longitude
53    Bexley      Blendon     Bexley      NaN       NaN
233  Bromley    Hazelwood  Orpington      NaN       NaN


We can see that two neighborhoods are missing their coordinates.

Using the geocoder library, let's retrieve their latitudes and longitudes:

In [105]:
# Define a function to retrieve the coordinates, using ArcGIS instead of Google for better performance
# max_tries is an added safeguard so the lookup cannot loop forever if the service returns nothing
def GetCoordinates(df, max_tries=5):
    bor = df['Borough']
    neigh = df['Neighborhood']
    town = df['Town']

    lat_lng_coords = None
    tries = 0
    while not lat_lng_coords and tries < max_tries:
        g = geocoder.arcgis('{}, London'.format(neigh))
        lat_lng_coords = g.latlng
        tries += 1
    # Keep the 'NaN' marker if every attempt failed
    lat, lng = lat_lng_coords if lat_lng_coords else ('NaN', 'NaN')
    lst = {'Borough': bor,
           'Neighborhood': neigh,
           'Town': town,
           'Latitude': lat,
           'Longitude': lng}
    return pd.Series(lst)

# Apply the function to the rows with missing coordinates
missing = wiki_data['Latitude'] == 'NaN'
wiki_data[missing] = wiki_data[missing].apply(GetCoordinates, axis=1)

# Check whether any NaN values remain
print(wiki_data[wiki_data['Latitude'] == 'NaN'])
print(wiki_data[wiki_data['Longitude'] == 'NaN'])

Empty DataFrame
Columns: [Borough, Neighborhood, Town, Latitude, Longitude]
Index: []
Empty DataFrame
Columns: [Borough, Neighborhood, Town, Latitude, Longitude]
Index: []
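A geocoding service can occasionally return no result, so the lookup is usually wrapped in a retry. The sketch below shows a bounded version of that pattern without any network access: `lookup_with_retries` and `flaky_lookup` are hypothetical names, and the coordinates returned by the stand-in function are placeholders, not real geocoder output.

```python
import time


def lookup_with_retries(lookup, query, max_tries=3, delay=0.0):
    """Call lookup(query) until it returns a result, at most max_tries times.

    Returns the first non-None result, or None if every attempt failed.
    """
    for attempt in range(max_tries):
        result = lookup(query)
        if result is not None:
            return result
        # Optional pause between attempts, skipped after the last one
        if delay and attempt < max_tries - 1:
            time.sleep(delay)
    return None


# Stand-in for a geocoder call: fails on the first call, succeeds afterwards
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    if calls["count"] < 2:
        return None
    return [51.5, 0.1]  # placeholder coordinates for illustration only


coords = lookup_with_retries(flaky_lookup, "Blendon, London")
print(coords, "after", calls["count"], "calls")
```

Capping the number of attempts means that a neighborhood the service cannot resolve ends up with no coordinates instead of blocking the notebook, and the check for remaining missing values then reveals it.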


---

## 3. Explore the dataset