# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

This project aims to identify the safest borough in London based on total recorded crimes, explore the neighborhoods of that borough to find the 10 most common venues in each, and finally cluster the neighborhoods using k-means clustering.

This report is aimed at people who are looking to relocate to London. Safety is a top concern when choosing a neighborhood in which to look for an apartment: if you don't feel safe in your own home, you will not be able to enjoy living there. The crime statistics provide insight into this issue.

We will focus on the safest borough and explore its neighborhoods and the 10 most common venues in each, so that the neighborhood best suited to an individual's needs can be selected.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:

* The total number of crimes committed in each borough during the most recent year in the data set (2016).

* The most common venues in each neighborhood of the safest borough.

The following data sources will be needed to extract/generate the required information:

Part 1: Preprocessing a real-world data set from Kaggle showing London crimes from 2008 to 2016: a data set of crime statistics for each borough in London.

Part 2: Scraping additional information about the boroughs of London from a Wikipedia page: more information on the boroughs is scraped using the BeautifulSoup library.

Part 3: Creating a new data set of the neighborhoods of the safest borough in London and generating their coordinates: the coordinates of each neighborhood will be obtained using Google Maps API geocoding.

### Part 1: Preprocessing a real-world data set from Kaggle showing the London crimes from 2008 to 2016

London Crime Data

About this file

lsoa_code: code for Lower Super Output Area in Greater London.

borough: Common name for London borough.

major_category: High level categorization of crime

minor_category: Low level categorization of crime within major category.

value: monthly reported count of categorical crime in given borough

year: Year of reported counts, 2008-2016

month: Month of reported counts, 1-12

Data set URL: https://www.kaggle.com/jboysen/london-crime

In [4]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from bs4 import BeautifulSoup # library for web scraping

#!conda install -c conda-forge geocoder --yes
import geocoder

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming a JSON file into a pandas dataframe
from pandas import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')


ModuleNotFoundError: No module named 'geocoder'

In [3]:
CLIENT_ID = 'R01LINGO2WC45KLRLKT3ZHU2QENAO2IPRK2N2ELOHRNK4P3K' # your Foursquare ID
CLIENT_SECRET = '4JT1TWRMXMPLX5IOKNBAFU3L3ARXK4D5JJDPFK1CLRZM2ZVW' # your Foursquare Secret

VERSION = '20180604'
LIMIT = 30

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: R01LINGO2WC45KLRLKT3ZHU2QENAO2IPRK2N2ELOHRNK4P3K
CLIENT_SECRET:4JT1TWRMXMPLX5IOKNBAFU3L3ARXK4D5JJDPFK1CLRZM2ZVW
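
Printing API secrets in a notebook exposes them to anyone who reads the export. A minimal sketch of one alternative, assuming the credentials are supplied through environment variables; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical and not part of the original notebook:

```python
import os

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment-style variables."""
    # Hypothetical variable names; use whatever names your own setup defines.
    env = os.environ if env is None else env
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret

# Example with a plain dict standing in for os.environ
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("Credentials loaded:", bool(client_id and client_secret))
```

Calling the helper with no argument reads from the real environment, so the secret never has to appear in a cell or its output.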



In [7]:
# Read in the data 
df = pd.read_csv("Copy of london_crime_by_lsoa.csv")


In [8]:
df.head()


Unnamed: 0,lsoa_code,borough,major_category,minor_category,value,year,month
0,E01001116,Croydon,Burglary,Burglary in Other Buildings,0,2016,11
1,E01001646,Greenwich,Violence Against the Person,Other violence,0,2016,11
2,E01000677,Bromley,Violence Against the Person,Other violence,0,2015,5
3,E01003774,Redbridge,Burglary,Burglary in Other Buildings,0,2016,3
4,E01004563,Wandsworth,Robbery,Personal Property,0,2008,6



In [9]:
# Accessing the most recent crime rates (2016)

# Taking only the most recent year (2016) and dropping the rest
df.drop(df.index[df['year'] != 2016], inplace = True)

# Removing all the entries where the crime count is zero
df = df[df.value != 0]

# Reset the index and dropping the previous index
df = df.reset_index(drop=True)

In [9]:
# Shape of the data frame
df.shape


(1048575, 7)


In [10]:
# View the top of the dataset 
df.head()

Unnamed: 0,lsoa_code,borough,major_category,minor_category,value,year,month
0,E01001116,Croydon,Burglary,Burglary in Other Buildings,0,2016,11
1,E01001646,Greenwich,Violence Against the Person,Other violence,0,2016,11
2,E01000677,Bromley,Violence Against the Person,Other violence,0,2015,5
3,E01003774,Redbridge,Burglary,Burglary in Other Buildings,0,2016,3
4,E01004563,Wandsworth,Robbery,Personal Property,0,2008,6


In [11]:
df.columns = ['LSOA_Code', 'Borough','Major_Category','Minor_Category','No_of_Crimes','Year','Month']
df.head()

Unnamed: 0,LSOA_Code,Borough,Major_Category,Minor_Category,No_of_Crimes,Year,Month
0,E01001116,Croydon,Burglary,Burglary in Other Buildings,0,2016,11
1,E01001646,Greenwich,Violence Against the Person,Other violence,0,2016,11
2,E01000677,Bromley,Violence Against the Person,Other violence,0,2015,5
3,E01003774,Redbridge,Burglary,Burglary in Other Buildings,0,2016,3
4,E01004563,Wandsworth,Robbery,Personal Property,0,2008,6


In [12]:
# View the information of the dataset 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 7 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   LSOA_Code       1048575 non-null  object
 1   Borough         1048575 non-null  object
 2   Major_Category  1048575 non-null  object
 3   Minor_Category  1048575 non-null  object
 4   No_of_Crimes    1048575 non-null  int64 
 5   Year            1048575 non-null  int64 
 6   Month           1048575 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 56.0+ MB



In [13]:
df['Borough'].value_counts()

Croydon                   46642
Barnet                    44419
Ealing                    42648
Bromley                   40852
Lambeth                   40523
Enfield                   39698
Wandsworth                38893
Brent                     37921
Lewisham                  37662
Southwark                 37618
Newham                    36610
Redbridge                 34639
Hillingdon                34392
Greenwich                 32893
Hackney                   32524
Haringey                  32117
Tower Hamlets             31993
Waltham Forest            31801
Havering                  31173
Hounslow                  30809
Bexley                    29773
Camden                    29476
Westminster               28589
Harrow                    28441
Islington                 27813
Merton                    26217
Hammersmith and Fulham    25534
Sutton                    25008
Barking and Dagenham      24318
Richmond upon Thames      23454
Kensington and Chelsea    23166
Kingston


In [14]:
df['Major_Category'].value_counts()    

Theft and Handling             307992
Violence Against the Person    247061
Criminal Damage                159997
Drugs                           92169
Burglary                        81064
Robbery                         72852
Other Notifiable Offences       60400
Fraud or Forgery                18521
Sexual Offences                  8519
Name: Major_Category, dtype: int64

In [17]:
# Pivot the table to view the number of crimes for each major category in each borough

London_crime = pd.pivot_table(df,values=['No_of_Crimes'],
                               index=['Borough'],
                               columns=['Major_Category'],
                               aggfunc=np.sum,fill_value=0)
London_crime.head()

Unnamed: 0_level_0,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes
Major_Category,Burglary,Criminal Damage,Drugs,Fraud or Forgery,Other Notifiable Offences,Robbery,Sexual Offences,Theft and Handling,Violence Against the Person
Borough,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Barking and Dagenham,1400,1502,690,25,218,519,1,4040,3392
Barnet,2872,1586,724,29,231,534,3,6832,3539
Bexley,1120,1296,540,8,164,172,3,3091,2376
Brent,2133,1586,1947,6,298,989,3,5614,4881
Bromley,2120,1874,660,16,197,415,1,5501,3569
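
The pivot groups the per-LSOA rows by borough and major category and sums the counts. The same aggregation can be sketched in plain Python, which makes the grouping explicit. The records below are made-up illustration values, not rows from the Kaggle file, and `pivot_counts` is a hypothetical helper:

```python
from collections import defaultdict

def pivot_counts(records):
    """Sum crime counts per (borough, major category) pair."""
    table = defaultdict(lambda: defaultdict(int))
    for borough, category, count in records:
        table[borough][category] += count
    # Convert nested defaultdicts to plain dicts for easier inspection
    return {borough: dict(categories) for borough, categories in table.items()}

# Made-up records: (borough, major category, monthly count)
records = [
    ("Bexley", "Burglary", 2),
    ("Bexley", "Burglary", 3),
    ("Bexley", "Robbery", 1),
    ("Brent", "Drugs", 4),
]
pivot = pivot_counts(records)
print(pivot)
```

A category that never occurs in a borough is simply absent here, whereas `fill_value=0` in `pd.pivot_table` fills those cells with zero.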



In [18]:
# Reset the index
London_crime.reset_index(inplace = True)

In [19]:
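# Sum only the numeric crime columns (Borough is a text column after reset_index)
London_crime['Total'] = London_crime.sum(axis=1, numeric_only=True)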
London_crime.head(33)


Unnamed: 0_level_0,Borough,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,Total
Major_Category,Unnamed: 1_level_1,Burglary,Criminal Damage,Drugs,Fraud or Forgery,Other Notifiable Offences,Robbery,Sexual Offences,Theft and Handling,Violence Against the Person,Unnamed: 11_level_1
0,Barking and Dagenham,1400,1502,690,25,218,519,1,4040,3392,11787
1,Barnet,2872,1586,724,29,231,534,3,6832,3539,16350
2,Bexley,1120,1296,540,8,164,172,3,3091,2376,8770
3,Brent,2133,1586,1947,6,298,989,3,5614,4881,17457
4,Bromley,2120,1874,660,16,197,415,1,5501,3569,14353
5,Camden,2213,1417,1520,13,321,841,3,10581,3920,20829
6,City of London,2,1,4,0,0,7,0,40,10,64
7,Croydon,2472,2368,1563,14,311,913,3,7239,5256,20139
8,Ealing,2376,1989,1405,6,352,779,1,7321,5326,19555
9,Enfield,2323,1705,1066,7,255,750,1,5404,3549,15060
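
The Total column is the row-wise sum of the nine category counts, and the safest borough by this measure is the one with the smallest total. A small sketch of that step, using the category counts of the first five boroughs copied from the table above; it ranks only those five boroughs, not all 33:

```python
# Category counts copied from the first five rows of the table above
crime_counts = {
    "Barking and Dagenham": [1400, 1502, 690, 25, 218, 519, 1, 4040, 3392],
    "Barnet": [2872, 1586, 724, 29, 231, 534, 3, 6832, 3539],
    "Bexley": [1120, 1296, 540, 8, 164, 172, 3, 3091, 2376],
    "Brent": [2133, 1586, 1947, 6, 298, 989, 3, 5614, 4881],
    "Bromley": [2120, 1874, 660, 16, 197, 415, 1, 5501, 3569],
}

# Row-wise totals, then sort ascending so the lowest total comes first
totals = {borough: sum(counts) for borough, counts in crime_counts.items()}
ranked = sorted(totals.items(), key=lambda item: item[1])

safest_borough, safest_total = ranked[0]
print(safest_borough, safest_total)  # Bexley 8770
```

Raw totals do not account for population or area, so a per-capita rate may rank boroughs differently.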


In [20]:

London_crime.columns = London_crime.columns.map(''.join)
London_crime.head()
        

Unnamed: 0,Borough,No_of_CrimesBurglary,No_of_CrimesCriminal Damage,No_of_CrimesDrugs,No_of_CrimesFraud or Forgery,No_of_CrimesOther Notifiable Offences,No_of_CrimesRobbery,No_of_CrimesSexual Offences,No_of_CrimesTheft and Handling,No_of_CrimesViolence Against the Person,Total
0,Barking and Dagenham,1400,1502,690,25,218,519,1,4040,3392,11787
1,Barnet,2872,1586,724,29,231,534,3,6832,3539,16350
2,Bexley,1120,1296,540,8,164,172,3,3091,2376,8770
3,Brent,2133,1586,1947,6,298,989,3,5614,4881,17457
4,Bromley,2120,1874,660,16,197,415,1,5501,3569,14353


In [28]:
# Column order follows the pivot output above
London_crime.columns = ['Borough','Burglary','Criminal Damage','Drugs','Fraud or Forgery',
                        'Other Notifiable Offences','Robbery','Sexual Offences',
                        'Theft and Handling','Violence Against the Person','Total']
London_crime.head()

Unnamed: 0,Borough,Burglary,Criminal Damage,Drugs,Fraud or Forgery,Other Notifiable Offences,Robbery,Sexual Offences,Theft and Handling,Violence Against the Person,Total
0,Barking and Dagenham,1400,1502,690,25,218,519,1,4040,3392,11787
1,Barnet,2872,1586,724,29,231,534,3,6832,3539,16350
2,Bexley,1120,1296,540,8,164,172,3,3091,2376,8770
3,Brent,2133,1586,1947,6,298,989,3,5614,4881,17457
4,Bromley,2120,1874,660,16,197,415,1,5501,3569,14353


In [29]:
London_crime.shape

(33, 11)

In [30]:
# getting data from internet
wikipedia_link='https://en.wikipedia.org/wiki/List_of_London_boroughs'
raw_wikipedia_page= requests.get(wikipedia_link).text

# using beautiful soup to parse the HTML/XML codes.
soup = BeautifulSoup(raw_wikipedia_page,'xml')
print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="UTF-8"/>
  <title>
   List of London boroughs - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"f221994a-9e19-4844-ba0a-0237faeea6ca","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_London_boroughs","wgTitle":"List of London boroughs","wgCurRevisionId":958873870,"wgRevisionId":958873870,"wgArticleId":28092685,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from August 2015","Use British English from August 2015","Lists of 

In [32]:
London_table = pd.read_html(str(table[0]), index_col=None, header=0)[0]
London_table.head()

NameError: name 'table' is not defined


In [33]:
London_table1 = pd.read_html(str(table[1]), index_col=None, header=0)[0]

# Rename the columns to match the previous table to append the tables.

London_table1.columns = ['Borough','Inner','Status','Local authority','Political control',
                         'Headquarters','Area (sq mi)','Population (2013 est)[1]','Co-ordinates','Nr. in map']

# View the table
London_table1

NameError: name 'table' is not defined

In [34]:
London_table.tail()


NameError: name 'London_table' is not defined


In [35]:
London_table.info()


NameError: name 'London_table' is not defined

In [36]:
London_table = London_table.replace('note 1','', regex=True) 
London_table = London_table.replace('note 2','', regex=True) 
London_table = London_table.replace('note 3','', regex=True) 
London_table = London_table.replace('note 4','', regex=True) 
London_table = London_table.replace('note 5','', regex=True) 

# View the top of the data set
London_table.head()

NameError: name 'London_table' is not defined

In [37]:
type(London_table)


NameError: name 'London_table' is not defined

In [38]:
London_table.shape


NameError: name 'London_table' is not defined

In [39]:
set(df.Borough) - set(London_table.Borough)


NameError: name 'London_table' is not defined

In [40]:
print("The index of first borough is",London_table.index[London_table['Borough'] == 'Barking and Dagenham []'].tolist())
print("The index of second borough is",London_table.index[London_table['Borough'] == 'Greenwich []'].tolist())
print("The index of third borough is",London_table.index[London_table['Borough'] == 'Hammersmith and Fulham []'].tolist())

NameError: name 'London_table' is not defined

In [41]:
London_table.iloc[0,0] = 'Barking and Dagenham'
London_table.iloc[9,0] = 'Greenwich'
London_table.iloc[11,0] = 'Hammersmith and Fulham'

NameError: name 'London_table' is not defined

In [42]:
set(df.Borough) - set(London_table.Borough)


NameError: name 'London_table' is not defined

In [43]:
Ld_crime = pd.merge(London_crime, London_table, on='Borough')
Ld_crime.head(10)

NameError: name 'London_table' is not defined

In [44]:
Ld_crime.shape


NameError: name 'Ld_crime' is not defined

In [45]:
set(df.Borough) - set(Ld_crime.Borough)


NameError: name 'Ld_crime' is not defined

In [46]:
list(Ld_crime)


NameError: name 'Ld_crime' is not defined

In [47]:
columnsTitles = ['Borough','Local authority','Political control','Headquarters',
                 'Area (sq mi)','Population (2013 est)[1]',
                 'Inner','Status',
                 'Burglary','Criminal Damage','Drugs','Other Notifiable Offences',
                 'Robbery','Theft and Handling','Violence Against the Person','Total','Co-ordinates']

Ld_crime = Ld_crime.reindex(columns=columnsTitles)

Ld_crime = Ld_crime[['Borough','Local authority','Political control','Headquarters',
                 'Area (sq mi)','Population (2013 est)[1]','Co-ordinates',
                 'Burglary','Criminal Damage','Drugs','Other Notifiable Offences',
                 'Robbery','Theft and Handling','Violence Against the Person','Total']]

Ld_crime.head()

NameError: name 'Ld_crime' is not defined

## Methodology <a name="methodology"></a>

The methodology in this project consists of two parts:

* Exploratory Data Analysis: Visualise the crime rates in the London boroughs to identify the safest borough, then extract the neighborhoods in that borough to find the 10 most common venues in each neighborhood.
* Modelling: To help people find similar neighborhoods in the safest borough, we cluster the neighborhoods using k-means clustering, an unsupervised machine learning algorithm that partitions data into a predefined number of clusters. We use 5 clusters for this project, grouping the 15 neighborhoods into 5 clusters. Clustering neighborhoods with similar venues together lets people shortlist areas of interest based on the venues/amenities around each neighborhood.
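
The clustering cell itself is not shown in this section, so the following is only a plain-Python sketch of the standard k-means (Lloyd's) iteration, not the notebook's own implementation. The venue-frequency rows are made-up, the function `kmeans` is hypothetical, and seeding the centroids with the first k points is an assumption chosen to keep the run deterministic:

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Assumption: the first k points seed the centroids (deterministic start).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Made-up rows: (share of cafes, share of bus stops) per neighborhood
freqs = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centroids = kmeans(freqs, k=2)
print(labels)  # [0, 1, 0, 1]
```

In the project each row would be a neighborhood's vector of mean venue-category frequencies and k would be 5; library implementations usually add smarter initialisation and a convergence check.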


## Exploratory Data Analysis <a name="analysis"></a>

Descriptive statistics of the data


In [48]:
London_crime.describe()


Unnamed: 0,Burglary,Criminal Damage,Drugs,Fraud or Forgery,Other Notifiable Offences,Robbery,Sexual Offences,Theft and Handling,Violence Against the Person,Total
count,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0
mean,1780.424242,1459.454545,1087.30303,14.818182,250.666667,619.121212,2.575758,6203.424242,3685.909091,15103.69697
std,586.325263,456.202333,580.638575,11.463133,105.506418,357.893032,2.136444,3376.732287,1346.58337,6030.742442
min,2.0,1.0,4.0,0.0,0.0,7.0,0.0,40.0,10.0,64.0
25%,1400.0,1220.0,664.0,6.0,195.0,354.0,1.0,4252.0,2620.0,11787.0
50%,1953.0,1535.0,1077.0,12.0,261.0,616.0,3.0,5889.0,3920.0,16115.0
75%,2168.0,1761.0,1405.0,24.0,321.0,841.0,3.0,7239.0,4743.0,17821.0
max,2872.0,2368.0,2720.0,43.0,499.0,1447.0,10.0,20613.0,5765.0,34072.0


In [49]:
# use the inline backend to generate the plots within the browser
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot') # optional: for ggplot-like style

# check for latest version of Matplotlib
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

Matplotlib version:  3.1.3


In [50]:
Ld_crime.columns = list(map(str, Ld_crime.columns))

# let's check the column labels types now
all(isinstance(column, str) for column in Ld_crime.columns)

NameError: name 'Ld_crime' is not defined

In [51]:
Ld_crime.sort_values(['Total'], ascending = False, axis = 0, inplace = True )

df_top5 = Ld_crime.head() 
df_top5

NameError: name 'Ld_crime' is not defined

In [52]:
df_tt = df_top5[['Borough','Total']]

df_tt.set_index('Borough',inplace = True)

ax = df_tt.plot(kind='bar', figsize=(10, 6), rot=0)

ax.set_ylabel('Number of Crimes') # add y-label to the plot
ax.set_xlabel('Borough') # add x-label to the plot
ax.set_title('London Boroughs with the Highest no. of crime') # add title to the plot

# Annotate each bar with its height (the crime count)

for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 14
               )

plt.show()

NameError: name 'df_top5' is not defined

In [53]:
Ld_crime.sort_values(['Total'], ascending = True, axis = 0, inplace = True )

df_bot5 = Ld_crime.head() 
df_bot5

NameError: name 'Ld_crime' is not defined

In [54]:
df_bt = df_bot5[['Borough','Total']]

df_bt.set_index('Borough',inplace = True)

ax = df_bt.plot(kind='bar', figsize=(10, 6), rot=0)

ax.set_ylabel('Number of Crimes') # add y-label to the plot
ax.set_xlabel('Borough') # add x-label to the plot
ax.set_title('London Boroughs with the least no. of crime') # add title to the plot

# Annotate each bar with its height (the crime count)

for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 14
               )

plt.show()


NameError: name 'df_bot5' is not defined

In [55]:
df_col = df_bot5[df_bot5['Borough'] == 'City of London']
df_col = df_col[['Borough','Total','Area (sq mi)','Population (2013 est)[1]']]
df_col

NameError: name 'df_bot5' is not defined

In [56]:

df_bc = df_bc1[['Borough','Burglary','Criminal Damage','Drugs','Other Notifiable Offences',
                 'Robbery','Theft and Handling','Violence Against the Person']]


df_bc.set_index('Borough',inplace = True)

ax = df_bc.plot(kind='bar', figsize=(10, 6), rot=0)

ax.set_ylabel('Number of Crimes') # add y-label to the plot
ax.set_xlabel('Borough') # add x-label to the plot
ax.set_title('London Boroughs with the least no. of crime') # add title to the plot

# Annotate each bar with its height (the crime count)

for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 14
               )

plt.show()

NameError: name 'df_bc1' is not defined

In [57]:
Neighborhood = ['Berrylands','Canbury','Chessington','Coombe','Hook','Kingston upon Thames',
'Kingston Vale','Malden Rushett','Motspur Park','New Malden','Norbiton',
'Old Malden','Seething Wells','Surbiton','Tolworth']

Borough = ['Kingston upon Thames','Kingston upon Thames','Kingston upon Thames','Kingston upon Thames',
          'Kingston upon Thames','Kingston upon Thames','Kingston upon Thames','Kingston upon Thames',
          'Kingston upon Thames','Kingston upon Thames','Kingston upon Thames','Kingston upon Thames',
          'Kingston upon Thames','Kingston upon Thames','Kingston upon Thames']

Latitude = ['','','','','','','','','','','','','','','']
Longitude = ['','','','','','','','','','','','','','','']

df_neigh = {'Neighborhood': Neighborhood,'Borough':Borough,'Latitude': Latitude,'Longitude':Longitude}
kut_neig = pd.DataFrame(data=df_neigh, columns=['Neighborhood', 'Borough', 'Latitude', 'Longitude'], index=None)

kut_neig

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,Berrylands,Kingston upon Thames,,
1,Canbury,Kingston upon Thames,,
2,Chessington,Kingston upon Thames,,
3,Coombe,Kingston upon Thames,,
4,Hook,Kingston upon Thames,,
5,Kingston upon Thames,Kingston upon Thames,,
6,Kingston Vale,Kingston upon Thames,,
7,Malden Rushett,Kingston upon Thames,,
8,Motspur Park,Kingston upon Thames,,
9,New Malden,Kingston upon Thames,,


In [58]:
df_neigh = {'Neighborhood': Neighborhood,'Borough':Borough,'Latitude': Latitude,'Longitude':Longitude}
kut_neig = pd.DataFrame(data=df_neigh, columns=['Neighborhood', 'Borough', 'Latitude', 'Longitude'], index=None)

kut_neig

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,Berrylands,Kingston upon Thames,,
1,Canbury,Kingston upon Thames,,
2,Chessington,Kingston upon Thames,,
3,Coombe,Kingston upon Thames,,
4,Hook,Kingston upon Thames,,
5,Kingston upon Thames,Kingston upon Thames,,
6,Kingston Vale,Kingston upon Thames,,
7,Malden Rushett,Kingston upon Thames,,
8,Motspur Park,Kingston upon Thames,,
9,New Malden,Kingston upon Thames,,


In [59]:
address = 'Berrylands, London, United Kingdom'

geolocator = Nominatim(user_agent="ld_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Berrylands, London are {}, {}.'.format(latitude, longitude))

NameError: name 'Nominatim' is not defined
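
The `Latitude` and `Longitude` columns of `kut_neig` are created as empty strings, and the cell above geocodes only Berrylands. A minimal sketch of filling in coordinates for every neighborhood, assuming a geocoding callable that behaves like geopy's `geocode` (returns an object with `latitude` and `longitude`, or `None` when nothing is found). The helper `geocode_neighborhoods` is hypothetical, and the stand-in geocoder uses made-up coordinates so the example runs offline:

```python
from types import SimpleNamespace

def geocode_neighborhoods(names, geocode, suffix=", London, United Kingdom"):
    """Map each neighborhood name to (latitude, longitude), or None if not found."""
    coords = {}
    for name in names:
        location = geocode(name + suffix)  # geopy geocoders return None on no match
        if location is None:
            coords[name] = None
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords

# Stand-in geocoder with made-up coordinates (not real locations)
def fake_geocode(address):
    known = {
        "Berrylands, London, United Kingdom": SimpleNamespace(latitude=1.0, longitude=2.0),
    }
    return known.get(address)

coords = geocode_neighborhoods(["Berrylands", "Nowhere"], fake_geocode)
print(coords)
```

With geopy installed, `geolocator.geocode` could be passed as the `geocode` argument. Nominatim's usage policy limits the request rate, so a pause between calls may be needed.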

In [60]:
map_lon = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(kut_neig['Latitude'], kut_neig['Longitude'], kut_neig['Borough'], kut_neig['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_lon)  
    
map_lon

NameError: name 'folium' is not defined

In [61]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [62]:
kut_venues = getNearbyVenues(names=kut_neig['Neighborhood'],
                                   latitudes=kut_neig['Latitude'],
                                   longitudes=kut_neig['Longitude']
                                  )

Berrylands


KeyError: 'groups'
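
The `KeyError` above means the JSON returned for Berrylands had no `groups` entry under `response`, which is likely when the request carries empty coordinates. A hedged sketch of parsing the payload defensively, assuming the same response shape that `getNearbyVenues` indexes into; `extract_venues` is a hypothetical helper and the sample payloads are made-up:

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, or [] when no venue groups exist."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories") or [{}]  # tolerate a missing category
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name"),
        ))
    return rows

# Made-up payloads: one successful response and one without 'groups'
ok_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Cafe"}]}}
]}]}}
error_payload = {"meta": {"code": 400}, "response": {}}

print(extract_venues(ok_payload))
print(extract_venues(error_payload))
```

Returning an empty list lets the loop continue to the next neighborhood instead of stopping at the first failed request.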

In [63]:
print(kut_venues.shape)
kut_venues.head()


NameError: name 'kut_venues' is not defined

In [64]:
kut_venues.groupby('Neighborhood').count()


NameError: name 'kut_venues' is not defined

In [65]:
print('There are {} unique categories.'.format(len(kut_venues['Venue Category'].unique())))


NameError: name 'kut_venues' is not defined

In [66]:
kut_onehot = pd.get_dummies(kut_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
kut_onehot['Neighborhood'] = kut_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [kut_onehot.columns[-1]] + list(kut_onehot.columns[:-1])
kut_onehot = kut_onehot[fixed_columns]

kut_onehot.head()

NameError: name 'kut_venues' is not defined

In [67]:
kut_grouped = kut_onehot.groupby('Neighborhood').mean().reset_index()
kut_grouped

NameError: name 'kut_onehot' is not defined

In [68]:
kut_grouped.shape


NameError: name 'kut_grouped' is not defined

In [69]:
num_top_venues = 5

for hood in kut_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = kut_grouped[kut_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

NameError: name 'kut_grouped' is not defined
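
The loop above prints, for each neighborhood, the venue categories with the highest mean frequency. The same idea can be sketched without pandas by counting categories per neighborhood and dividing by the neighborhood's venue count. The venue rows are made-up and `top_categories` is a hypothetical helper:

```python
from collections import Counter, defaultdict

def top_categories(venues, n=3):
    """Return the n most frequent categories per neighborhood with their shares."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    result = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        result[neighborhood] = [
            (category, round(count / total, 2))
            for category, count in counter.most_common(n)
        ]
    return result

# Made-up (neighborhood, venue category) rows
venues = [
    ("Surbiton", "Pub"),
    ("Surbiton", "Pub"),
    ("Surbiton", "Cafe"),
    ("Tolworth", "Bus Stop"),
]
top = top_categories(venues, n=2)
print(top)
```

These shares correspond to the per-neighborhood means of the one-hot encoded `Venue Category` columns.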

## Results and Discussion <a name="results"></a>


The aim of this project is to help people who want to relocate to the safest borough in London. Expats can choose the neighborhoods to which they want to relocate based on the most common venues in each. For example, if a person is looking for a neighborhood with good connectivity and public transportation, Clusters 3 and 4 have train stations and bus stops as their most common venues. If a person is looking for a neighborhood with stores and restaurants in close proximity, the neighborhoods in the first cluster are suitable. For a family, I feel that the neighborhoods in Cluster 4 are the most suitable: their common venues include parks, gym/fitness centers, bus stops, restaurants, electronics stores and soccer fields, which is ideal for a family.

## Conclusion <a name="conclusion"></a>


This project helps a person get a better understanding of the neighborhoods with respect to their most common venues. It is always helpful to make use of technology to stay one step ahead, i.e. to find out more about a place before moving into a neighborhood. We have taken only safety as the primary concern when shortlisting a London borough. Future work includes taking other factors, such as the cost of living in each area, into consideration, so that boroughs can be shortlisted based on both safety and a predefined budget.