# This is part of the Applied Data Science Capstone Project for the IBM Data Science Professional Certificate

## Date - 4th of March 2020

## Author - Adam Abr

### 1. The problem
London is a large multicultural city with opportunities for everyone. It is home to major tourist attractions and a wide range of food and drink. With 8.9 million residents, it also offers a great many places to look for a flat to rent.

With this much variety and opportunity, it's difficult to know where to search. 

Target user: Our target user is looking for a London neighbourhood in which to rent a flat. They are a great sushi lover, so their priorities are a place that is safe and that offers plenty of opportunities to enjoy their favourite food.

I intend to help them by first filtering down to the safest neighbourhoods (boroughs) of London. Then, for each remaining neighbourhood, I will find the sushi places available and cluster the neighbourhoods by those places, to help the user decide where best to live.

### 2. Data Acquisition and Preprocessing
The data used in this project consists of London's recorded crime for the past 2 years, the list of London boroughs and the Foursquare API.

The list of boroughs comes from Wikipedia:
https://en.wikipedia.org/wiki/List_of_London_boroughs
This data is used to determine the geographical location of each borough.

Crime statistics come from the London Datastore, with a breakdown by borough:
https://data.london.gov.uk/dataset/recorded_crime_summary
This data is used to filter London's boroughs based on safety.

A list of sushi restaurants is requested from the Foursquare API:
https://api.foursquare.com
This data is used to find the sushi restaurants around each borough.



In [799]:
#import appropriate libraries
import pandas as pd
import numpy as np

#for scraping wikipedia
import requests
import lxml  #parser backend used by BeautifulSoup below
from bs4 import BeautifulSoup

#for generating a map
import folium
from geopy.geocoders import Nominatim

#for K-Means Clustering
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [800]:
#importing crime data  
crime_data = pd.read_csv(r"...\GitHub\Coursera_Capstone\Capstone_Project\MPS Borough Level Crime.csv")

#making a list of unique boroughs
boroughs = crime_data['LookUp_BoroughName'].unique()

crime_data.head()

Unnamed: 0,MajorText,MinorText,LookUp_BoroughName,201903,201904,201905,201906,201907,201908,201909,...,202004,202005,202006,202007,202008,202009,202010,202011,202012,202101
0,Arson and Criminal Damage,Arson,Barking and Dagenham,5,5,11,3,5,3,6,...,2,2,4,4,6,2,7,4,2,4
1,Arson and Criminal Damage,Criminal Damage,Barking and Dagenham,138,130,140,113,134,118,109,...,80,86,121,122,114,116,120,100,107,100
2,Burglary,Burglary - Business and Community,Barking and Dagenham,29,27,21,27,31,35,37,...,29,16,16,28,24,32,21,18,24,20
3,Burglary,Burglary - Residential,Barking and Dagenham,99,96,114,96,71,67,80,...,57,42,63,72,63,54,68,90,91,69
4,Drug Offences,Drug Trafficking,Barking and Dagenham,6,5,9,6,11,8,7,...,16,15,11,21,9,12,14,17,15,9


In [801]:
#create a column 'Sum' which adds up the monthly crime counts (every column after the first three text columns)
crime_data['Sum'] = crime_data.iloc[:,3:].sum(axis=1)

#keep only the four columns we need
crime_data = crime_data[['MajorText','MinorText','LookUp_BoroughName','Sum']]

In [802]:
#group by borough name and sum the crime totals (only the numeric 'Sum' column is aggregated)
crime_data = crime_data.groupby('LookUp_BoroughName', as_index=False)['Sum'].sum()
crime_data.head()

Unnamed: 0,LookUp_BoroughName,Sum
0,Barking and Dagenham,37630
1,Barnet,55803
2,Bexley,31822
3,Brent,56196
4,Bromley,44735


In [803]:
#rename the borough column
crime_data.rename(columns={crime_data.columns[0]:'Borough Name'}, inplace=True)

#remove the London airports row, as it is not a borough (the remaining rows keep their original index labels)
crime_data = crime_data[crime_data['Borough Name'] != 'London Heathrow and London City Airports']

#sort alphabetically by borough name
ascending = crime_data.sort_values(by='Borough Name', ascending=True)
ascending.head()

Unnamed: 0,Borough Name,Sum
0,Barking and Dagenham,37630
1,Barnet,55803
2,Bexley,31822
3,Brent,56196
4,Bromley,44735


In [804]:
#sort by borough name in reverse alphabetical order and keep the first five rows
descending = crime_data.sort_values(by='Borough Name', ascending=False).head()
descending

Unnamed: 0,Borough Name,Sum
32,Westminster,120874
31,Wandsworth,48725
30,Waltham Forest,45520
29,Tower Hamlets,64101
28,Sutton,25553
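
The two cells above order the rows by borough name. To rank boroughs by the number of recorded crimes instead, the sort key would be the `Sum` column. The following is a minimal sketch that reuses the five totals shown in the grouped `head()` output earlier; the names `sample`, `safest` and `busiest` are introduced here for illustration only.

```python
import pandas as pd

# Five borough totals copied from the grouped output shown earlier
sample = pd.DataFrame({
    'Borough Name': ['Barking and Dagenham', 'Barnet', 'Bexley', 'Brent', 'Bromley'],
    'Sum': [37630, 55803, 31822, 56196, 44735],
})

# Fewest recorded crimes first
safest = sample.sort_values(by='Sum', ascending=True).reset_index(drop=True)

# Most recorded crimes first
busiest = sample.sort_values(by='Sum', ascending=False).reset_index(drop=True)

print(safest.head(3))
print(busiest.head(2))
```

Applied to the full `crime_data` frame, the same call would put the boroughs with the lowest totals at the top.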


In [805]:
# Obtaining data from Wikipedia
page_html = requests.get('https://en.wikipedia.org/wiki/List_of_London_boroughs').text
soup = BeautifulSoup(page_html, 'lxml')

In [806]:
borough_name = []
coordinates = []

#walk the rows of the first table on the page and collect the borough name (column 0) and coordinates (column 8) of each row
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 0:
        borough_name.append(cells[0].text.rstrip('\n'))
        coordinates.append(cells[8].text.rstrip('\n'))

In [807]:
# Form a dataframe
borough_dict = {'borough_name': borough_name,
                'coordinates': coordinates}
boroughs = pd.DataFrame.from_dict(borough_dict)
boroughs.head()


Unnamed: 0,borough_name,coordinates
0,Barking and Dagenham [note 1],51°33′39″N 0°09′21″E﻿ / ﻿51.5607°N 0.1557°E﻿ /...
1,Barnet,51°37′31″N 0°09′06″W﻿ / ﻿51.6252°N 0.1517°W﻿ /...
2,Bexley,51°27′18″N 0°09′02″E﻿ / ﻿51.4549°N 0.1505°E﻿ /...
3,Brent,51°33′32″N 0°16′54″W﻿ / ﻿51.5588°N 0.2817°W﻿ /...
4,Bromley,51°24′14″N 0°01′11″E﻿ / ﻿51.4039°N 0.0198°E﻿ /...


In [808]:
#clean the data: take the decimal-degree part between the '/' separators and extract latitude and longitude
for index,each in enumerate(boroughs['coordinates']):
    each = each.split('/')[1].split(' ')
    
    #drop the leading invisible character and the trailing '°N'
    latitude = float(each[1][1:-2])
    
    #drop the trailing '°E' or '°W' and invisible character; the hemisphere letter is discarded, so every longitude is stored as a positive number
    longitude = float(each[2][0:-3])
    
    boroughs.loc[index,'latitude'] = latitude
    boroughs.loc[index,'longitude'] = longitude

In [810]:
#filter the boroughs dataframe to the three necessary columns.
boroughs = boroughs[['borough_name','latitude','longitude']]
boroughs.head()

Unnamed: 0,borough_name,latitude,longitude
0,Barking and Dagenham [note 1],51.5607,0.1557
1,Barnet,51.6252,0.1517
2,Bexley,51.4549,0.1505
3,Brent,51.5588,0.2817
4,Bromley,51.4039,0.0198
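
The coordinate strings carry a hemisphere letter (N/S, E/W) that determines the sign of each value. Most of London lies west of the Greenwich meridian, so a western longitude such as Barnet's would normally be written as a negative number. The following is a minimal sketch of a sign-aware parser, assuming the strings follow the format visible in the `coordinates` column above; `parse_decimal_coords` and `DECIMAL_COORDS` are hypothetical names, not part of the notebook.

```python
import re

# Matches the decimal-degree part of the string, e.g. "51.6252°N 0.1517°W"
DECIMAL_COORDS = re.compile(r'(\d+(?:\.\d+)?)°([NS])\s+(\d+(?:\.\d+)?)°([EW])')

def parse_decimal_coords(raw):
    """Return (latitude, longitude) as signed floats; south and west are negative."""
    match = DECIMAL_COORDS.search(raw)
    if match is None:
        raise ValueError('no decimal coordinates found in: ' + repr(raw))
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) if ns == 'N' else -float(lat)
    longitude = float(lon) if ew == 'E' else -float(lon)
    return latitude, longitude

# Sample strings in the format shown in the output above (visible part only)
barnet = '51°37′31″N 0°09′06″W\ufeff / \ufeff51.6252°N 0.1517°W\ufeff /'
bexley = '51°27′18″N 0°09′02″E\ufeff / \ufeff51.4549°N 0.1505°E\ufeff /'

print(parse_decimal_coords(barnet))
print(parse_decimal_coords(bexley))
```

A regular expression is used here instead of fixed slice positions so that the invisible characters around the values do not need to be counted.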


In [811]:
#remove the '[note ...]' footnote marker from the name
for index,each in enumerate(boroughs['borough_name']):
    if 'note' in each:
        boroughs.loc[index,'borough_name'] = each.split('[')[0].strip()

#add crime_sum to boroughs df; the assignment matches rows by index label, not by borough name
boroughs['crime_sum'] = crime_data['Sum']

#sort by least amount of crimes
boroughs_filtered = boroughs.sort_values(by='crime_sum', ascending=True) 

#filter the dataframe to boroughs with fewer than 45000 crimes.
boroughs_filtered = boroughs_filtered[boroughs_filtered['crime_sum']<45000]

#reset index
boroughs_filtered.reset_index(drop=True, inplace=True)
boroughs_filtered

Unnamed: 0,borough_name,latitude,longitude,crime_sum
0,Kingston upon Thames,51.4085,0.3064,23228.0
1,Southwark,51.5035,0.0804,23905.0
2,Tower Hamlets,51.5099,0.0059,25553.0
3,Newham,51.5077,0.0469,26707.0
4,Harrow,51.5898,0.3346,31514.0
5,Bexley,51.4549,0.1505,31822.0
6,Havering,51.5812,0.1837,34076.0
7,Barking and Dagenham,51.5607,0.1557,37630.0
8,Hammersmith and Fulham,51.4927,0.2339,40623.0
9,Kensington and Chelsea,51.502,0.1947,40644.0
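
Assigning a column taken from another dataframe matches rows on their index labels. Once rows have been dropped from one of the frames, or if the two sources list the boroughs in a different order, a join keyed on the borough name is a more robust way to attach the crime totals. The following is a minimal sketch with made-up counts; `boroughs_demo`, `crime_demo`, `by_index` and `by_name` are illustrative names only.

```python
import pandas as pd

# Borough list as it might come from Wikipedia (index 0, 1, 2)
boroughs_demo = pd.DataFrame({'borough_name': ['Lewisham', 'Merton', 'Newham']})

# Crime totals with made-up numbers; dropping the airports row leaves index 0, 2, 3
crime_demo = pd.DataFrame({
    'Borough Name': ['Lewisham', 'London Heathrow and London City Airports',
                     'Merton', 'Newham'],
    'Sum': [300, 10, 100, 200],
})
crime_demo = crime_demo[
    crime_demo['Borough Name'] != 'London Heathrow and London City Airports'
]

# Index-based assignment: row 1 finds no partner and row 2 receives Merton's total
by_index = boroughs_demo.assign(crime_sum=crime_demo['Sum'])

# Name-based merge: every borough receives its own total
by_name = (
    boroughs_demo
    .merge(crime_demo, left_on='borough_name', right_on='Borough Name', how='left')
    .drop(columns='Borough Name')
    .rename(columns={'Sum': 'crime_sum'})
)

print(by_index)
print(by_name)
```

With `how='left'`, a borough that has no match in the crime data keeps its row and gets a missing value, which makes name mismatches easy to spot.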


In [812]:
#Foursquare API connection details (removed for security)
CLIENT_ID = 'hidden' # your Foursquare ID
CLIENT_SECRET = 'hidden' # your Foursquare Secret
ACCESS_TOKEN = 'hidden' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 50
search_query = 'SUSHI'
radius = 6000

In [813]:
#create an empty dataframe and a running row counter for the restaurants index
restaurants = pd.DataFrame() 
temp_index = 0

for index,each in enumerate(boroughs_filtered['borough_name']):
    #set latitude and longitude for each search query
    lat = boroughs_filtered['latitude'][index] 
    long = boroughs_filtered['longitude'][index]
    
    #create the search url for each borough
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat,long,ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
    
    #request data from the Foursquare API
    results = requests.get(url).json()
    
    #keep only the list of venues from the response
    venues = results['response']['venues']
    
    #transform venues into a dataframe
    dataframe = pd.json_normalize(venues)
    
    #add every venue returned by the Foursquare API to the restaurants df for further analysis.
    for index_df, each_df in enumerate(dataframe['name']):
        temp_index = temp_index+1
        restaurants.loc[temp_index,'borough'] = each
        restaurants.loc[temp_index,'name'] = each_df
        restaurants.loc[temp_index,'lat'] = dataframe['location.lat'][index_df]
        restaurants.loc[temp_index,'long'] = dataframe['location.lng'][index_df]
        restaurants.loc[temp_index,'distance'] = dataframe['location.distance'][index_df]
    
    #for each borough add the number of restaurants and their average distance from the borough coordinates.
    boroughs_filtered.loc[index,'no of restaurants'] = len(dataframe)
    boroughs_filtered.loc[index,'avg restaurant dist'] = dataframe['location.distance'].mean()

In [814]:
restaurants.head()

Unnamed: 0,borough,name,lat,long,distance
1,Kingston upon Thames,YO! Sushi,51.438786,0.26881,4263.0
2,Kingston upon Thames,Umami Sushi Box,51.440078,0.37044,5667.0
3,Southwark,Thames Barrier Sushi,51.500633,0.033207,3285.0
4,Southwark,Sushi Japanese,51.456053,0.010815,7153.0
5,Southwark,Sushi Ya,51.507601,0.02278,4018.0


In [815]:
#Get coordinates of London
address = 'London, UK'
geolocator = Nominatim(user_agent="explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London, UK are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London, UK are 51.5073219, -0.1276474.


In [816]:
#sort the dataframe by importance: crime count first, then average restaurant distance
boroughs_filtered = boroughs_filtered.sort_values(by=['crime_sum','avg restaurant dist'], ascending=[True, True])

In [817]:
#Make a map of London using folium
london_map = folium.Map(location=[latitude, longitude], zoom_start=11)

#Add markers of the filtered London boroughs to the map
for lat, lng, borough in zip(boroughs_filtered['latitude'], boroughs_filtered['longitude'], boroughs_filtered['borough_name']):
    popup = folium.Popup(str(borough), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=popup,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(london_map)

london_map

In [818]:
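# One hot encoding before clustering
onehot = pd.get_dummies(restaurants[['name']], prefix="", prefix_sep="", dtype=float)

#add the borough of each restaurant and average per borough to get the frequency of each restaurant name
onehot.insert(0, 'borough', restaurants['borough'])
grouped = onehot.groupby('borough').mean().reset_index()

In [819]:
#set the number of clusters and drop the borough name so that k-means sees only numeric columns
kclusters = 8
london_cluster = grouped.drop(columns='borough')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(london_cluster)

#store the cluster label of each borough
grouped.insert(0, 'Cluster Labels', kmeans.labels_)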
kmeans.labels_[0:10]

array([1, 6, 7, 0, 5, 1, 2, 4, 3, 3])

In [820]:
#merge the per-borough restaurant frequencies and cluster labels with the previously filtered borough data
london_merged = boroughs_filtered.join(grouped.set_index('borough'), on='borough_name')

london_merged.head()

Unnamed: 0,borough_name,latitude,longitude,crime_sum,no of restaurants,avg restaurant dist,Cluster Labels,Atomic Sushi,Bluefin Sushi,Ding Dong Sushi,...,Thames Barrier Sushi,Tokyo Sushi,Umami Sushi Box,WOW Sushi,Wasabi,YO! Sushi,Yo! Sushi,Yobun sushi,ichiban handmade sushi,sushi express
0,Kingston upon Thames,51.4085,0.3064,23228.0,2.0,4965.0,5,0.0,0.0,0.0,...,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.0
1,Southwark,51.5035,0.0804,23905.0,12.0,6092.416667,1,0.0,0.0,0.0,...,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Tower Hamlets,51.5099,0.0059,25553.0,46.0,4752.347826,1,0.021739,0.0,0.0,...,0.021739,0.0,0.0,0.021739,0.043478,0.021739,0.021739,0.021739,0.021739,0.021739
3,Newham,51.5077,0.0469,26707.0,20.0,5022.8,1,0.05,0.0,0.0,...,0.05,0.0,0.0,0.0,0.0,0.05,0.05,0.0,0.0,0.0
4,Harrow,51.5898,0.3346,31514.0,1.0,3706.0,4,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
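
The fractional values in the table above are per-borough frequencies: each restaurant name is one-hot encoded and the rows are then averaged within a borough, so a borough with two different restaurants shows 0.5 for each. The following is a minimal sketch on toy data of how such a table can be produced; the borough labels and the `restaurants_demo` frame are made up for illustration.

```python
import pandas as pd

# Toy data: two boroughs with two and four venues respectively
restaurants_demo = pd.DataFrame({
    'borough': ['Borough A', 'Borough A',
                'Borough B', 'Borough B', 'Borough B', 'Borough B'],
    'name': ['YO! Sushi', 'Umami Sushi Box',
             'YO! Sushi', 'Wasabi', 'Wasabi', 'Sushi Ya'],
})

# One column per restaurant name, 1.0 where the row is that restaurant
onehot_demo = pd.get_dummies(restaurants_demo['name'], dtype=float)
onehot_demo.insert(0, 'borough', restaurants_demo['borough'])

# Averaging the indicator columns per borough gives the share of each name
grouped_demo = onehot_demo.groupby('borough').mean().reset_index()
print(grouped_demo)
```

Each row of the result sums to 1, which is why a borough with many venues has small values spread over many columns.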


In [821]:
# create map of london
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add borough markers to the map, with large radius
for lat, lon, poi, cluster in zip(london_merged['latitude'], london_merged['longitude'], london_merged['borough_name'], london_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=15,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

#look up the cluster of each borough so that every restaurant takes the colour of its own borough
cluster_by_borough = dict(zip(london_merged['borough_name'], london_merged['Cluster Labels']))

#add markers of each sushi restaurant with small radius
for lat, lon, poi, borough in zip(restaurants['lat'], restaurants['long'], restaurants['name'], restaurants['borough']):
    cluster = cluster_by_borough[borough]
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)    

map_clusters

In [822]:
#Based on this we could recommend any of the boroughs from cluster 1
recommended = london_merged[london_merged['Cluster Labels']==1]
#keep only the columns with at least one non-zero value
recommended = recommended.loc[:, (recommended != 0).any(axis=0)]

#split the dataframe into general borough info and the numerical restaurant frequencies
info_columns = ['borough_name','latitude','longitude','crime_sum','no of restaurants', 'avg restaurant dist', 'Cluster Labels']
recommended_1 = recommended[info_columns]
recommended_2 = recommended.drop(columns=info_columns)

#Change numerical values to 'Yes' or 'No'
recommended_2 = pd.DataFrame(np.where(recommended_2>0, 'Yes','No'), index=recommended_2.index, columns=recommended_2.columns)

#combine the two dataframes together
recommended = pd.concat([recommended_1, recommended_2], axis=1, join="inner")

#output of the dataframe
recommended

Unnamed: 0,borough_name,latitude,longitude,crime_sum,no of restaurants,avg restaurant dist,Cluster Labels,Atomic Sushi,Dumo Sushi,Gourmet Sushi,...,Sushinoen,Takeshi sushi,Thames Barrier Sushi,WOW Sushi,Wasabi,YO! Sushi,Yo! Sushi,Yobun sushi,ichiban handmade sushi,sushi express
1,Southwark,51.5035,0.0804,23905.0,12.0,6092.416667,1,No,No,No,...,No,No,Yes,No,No,No,No,No,No,No
2,Tower Hamlets,51.5099,0.0059,25553.0,46.0,4752.347826,1,Yes,Yes,Yes,...,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes
3,Newham,51.5077,0.0469,26707.0,20.0,5022.8,1,Yes,No,No,...,No,Yes,Yes,No,No,Yes,Yes,No,No,No
