# Geocoding London neighbourhoods
This notebook is part of a Coursera capstone project; I split it out to keep the main project notebook readable. You can find the main project here: https://github.com/dedi400/Coursera_Capstone/blob/main/Capstone%20project%20-%20Location%20data.ipynb

The aim of this code is to get the list of London neighbourhoods from a wiki page and attach location data to each of them.

In [29]:
import pandas as pd
import googlemaps
import folium

## Reading data
Fortunately, Wikipedia puts every table into a proper *\<table>* tag, so pandas can read them directly. *read_html* returns a list of dataframes, so we have to pick the one we need (it's the second table in this case).

In [2]:
london_url='https://en.wikipedia.org/wiki/List_of_areas_of_London'
london=pd.read_html(london_url,flavor='bs4')
len(london)

5

In [3]:
df_london=london[1]
print(df_london.shape)
df_london.head()

(533, 6)


Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728
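Picking the table by position (`london[1]`) stops working if the page gains or reorders tables. A minimal sketch of a more defensive selection follows, assuming the wanted table is the only one whose header contains both `Location` and `Post town`. `pick_table` is a hypothetical helper, not part of pandas; it only relies on each table exposing a `.columns` attribute.

```python
def pick_table(tables, required_columns):
    """Return the first table whose header contains all required column names."""
    required = set(required_columns)
    for table in tables:
        # Header cells may contain non-breaking spaces (U+00A0); treat them as spaces.
        header = {str(col).replace('\xa0', ' ') for col in table.columns}
        if required <= header:
            return table
    raise ValueError(f'No table with columns {sorted(required)}')
```

With the variables of this notebook it could be called as `df_london = pick_table(london, ['Location', 'Post town'])`.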


## Data cleaning
I have to rename the columns because the header uses *\&nbsp;* instead of a normal space, which causes trouble when filtering.

In [4]:
df_london.columns=['Neighbourhood','Borough','Town','Postcode','Dial','OS']
df_london

Unnamed: 0,Neighbourhood,Borough,Town,Postcode,Dial,OS
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
528,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
529,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
530,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
531,Yeading,Hillingdon,HAYES,UB4,020,TQ115825
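Assigning a fixed list of names works as long as the column order on the page stays the same. An alternative sketch is to keep the original names and only replace the non-breaking spaces. `clean_header` is a hypothetical helper, and which header cells actually contain a non-breaking space is an assumption made for the demo list.

```python
def clean_header(name):
    """Replace non-breaking spaces (U+00A0) with ordinary spaces and trim the result."""
    return str(name).replace('\xa0', ' ').strip()

# Assumed raw header: '\xa0' marks where the page is taken to use &nbsp;.
raw_columns = ['Location', 'London\xa0borough', 'Post\xa0town',
               'Postcode\xa0district', 'Dial\xa0code', 'OS\xa0grid\xa0ref']
clean_columns = [clean_header(c) for c in raw_columns]
print(clean_columns)
```

On the dataframe itself this would read `df_london.columns = [clean_header(c) for c in df_london.columns]`.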


I'll keep only the neighbourhoods whose post town is London. Some rows list more than one town, so I'll use the `contains` function.

In [5]:
df_london=df_london[df_london['Town'].str.contains('LONDON')].drop(['Town','Dial','OS'],axis=1)
df_london

Unnamed: 0,Neighbourhood,Borough,Postcode
0,Abbey Wood,"Bexley, Greenwich [7]",SE2
1,Acton,"Ealing, Hammersmith and Fulham[8]","W3, W4"
6,Aldgate,City[10],EC3
7,Aldwych,Westminster[10],WC2
9,Anerley,Bromley[11],SE20
...,...,...,...
523,Woodford,Redbridge,"IG8, E18"
524,Woodford Green,"Redbridge, Waltham Forest",IG8
527,Woodside Park,Barnet,N12
528,Woolwich,Greenwich,SE18
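`str.contains('LONDON')` is a substring match, so it would also accept a post town that merely has "LONDON" inside a longer name. A stricter check, sketched here in plain Python, splits the cell on commas and compares whole entries. `lists_town` is a hypothetical helper; the sample cells come from the table above.

```python
def lists_town(cell, town='LONDON'):
    """True if `town` is one of the comma-separated post towns in `cell`."""
    return town in [part.strip() for part in cell.split(',')]

for cell in ['LONDON', 'BEXLEY, SIDCUP', 'CROYDON']:
    print(cell, '->', lists_town(cell))
```

As a filter it could be applied with `df_london[df_london['Town'].apply(lists_town)]`.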


Removing references and whitespace from the Borough column:

In [6]:
df_london['Borough']=df_london['Borough'].apply(lambda b: b.split('[',1)[0].strip())
df_london

Unnamed: 0,Neighbourhood,Borough,Postcode
0,Abbey Wood,"Bexley, Greenwich",SE2
1,Acton,"Ealing, Hammersmith and Fulham","W3, W4"
6,Aldgate,City,EC3
7,Aldwych,Westminster,WC2
9,Anerley,Bromley,SE20
...,...,...,...
523,Woodford,Redbridge,"IG8, E18"
524,Woodford Green,"Redbridge, Waltham Forest",IG8
527,Woodside Park,Barnet,N12
528,Woolwich,Greenwich,SE18
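`split('[', 1)[0]` drops everything from the first bracket onwards, which is fine as long as the reference marker sits at the end of the cell. A sketch that removes only the markers themselves, assuming they always look like `[7]` (digits in square brackets), could use a regular expression. `strip_refs` is a hypothetical helper.

```python
import re

# Optional whitespace followed by a numeric marker such as [7] or [10].
REF_MARKER = re.compile(r'\s*\[\d+\]')

def strip_refs(text):
    """Remove numeric reference markers and trim surrounding whitespace."""
    return REF_MARKER.sub('', text).strip()

for raw in ['Bexley, Greenwich [7]', 'Ealing, Hammersmith and Fulham[8]', 'Bexley']:
    print(repr(strip_refs(raw)))
```

The same pattern can be applied to the whole column with `df_london['Borough'].str.replace(r'\s*\[\d+\]', '', regex=True).str.strip()`.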


I ran some more checks on the neighbourhood list. Postcodes don't follow neighbourhood borders as nicely as in Toronto. As you can see below, there are several different situations:

1. Some neighbourhoods have multiple postal codes (55 times).
2. Also, postal codes span multiple neighbourhoods most of the time (220 out of 299!). Sometimes they span multiple boroughs (e.g. WC2 in Camden and Westminster).
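The two situations can be illustrated on a handful of rows taken from the tables in this notebook. This is a plain-Python sketch rather than the notebook's own check: it splits multi-code cells into individual postcodes first, so its counts would differ from the ones quoted above, which compare whole cells.

```python
from collections import defaultdict

# (neighbourhood, borough cell, postcode cell) rows copied from this notebook.
rows = [
    ('Wapping', 'Tower Hamlets', 'E1'),
    ('Stepney', 'Tower Hamlets', 'E1'),
    ('St Giles', 'Camden', 'WC2'),
    ('Covent Garden', 'Westminster', 'WC2'),
    ('Acton', 'Ealing, Hammersmith and Fulham', 'W3, W4'),
]

areas_by_postcode = defaultdict(set)
boroughs_by_postcode = defaultdict(set)
multi_postcode_areas = []

for name, boroughs, postcodes in rows:
    codes = [c.strip() for c in postcodes.split(',')]
    if len(codes) > 1:
        multi_postcode_areas.append(name)  # situation 1
    for code in codes:
        areas_by_postcode[code].add(name)
        boroughs_by_postcode[code].update(b.strip() for b in boroughs.split(','))

# Situation 2: postcodes shared by several neighbourhoods or several boroughs.
shared = sorted(c for c, areas in areas_by_postcode.items() if len(areas) > 1)
cross_borough = sorted(c for c, b in boroughs_by_postcode.items() if len(b) > 1)

print('Multiple postcodes:', multi_postcode_areas)
print('Shared postcodes:', shared)
print('Postcodes crossing boroughs:', cross_borough)
```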

In [7]:
print(df_london[df_london['Postcode'].str.contains(',')].shape)
df_london[df_london['Postcode'].str.contains(',')].sort_values('Borough')


(55, 3)


Unnamed: 0,Neighbourhood,Borough,Postcode
14,Arkley,Barnet,"EN5, NW7"
108,Colney Hatch,Barnet,"N11, N10"
60,Brent Cross,Barnet,"NW2, NW4"
24,Barnet Gate,Barnet,"NW7, EN5"
171,Finchley,Barnet,"N2, N3, N12"
292,Longlands,Bexley,"SE9, DA14, DA15"
45,Bexleyheath (also Bexley New Town),Bexley,"DA6, DA7, SE2"
168,Falconwood,"Bexley, Greenwich","SE9, DA16"
459,Thamesmead,"Bexley, Greenwich","SE28, SE2, DA18"
261,Kensal Green,Brent,"NW10, NW6"


In [8]:
df_london[df_london.duplicated(subset=['Postcode'],keep=False)].sort_values('Postcode')

Unnamed: 0,Neighbourhood,Borough,Postcode
492,Wapping,Tower Hamlets,E1
440,Stepney,Tower Hamlets,E1
381,Ratcliff,Tower Hamlets,E1
400,Shadwell,Tower Hamlets,E1
427,Spitalfields,Tower Hamlets,E1
...,...,...,...
269,King's Cross,Camden and Islington,WC1
431,St Giles,Camden,WC2
87,Charing Cross,Westminster,WC2
114,Covent Garden,Westminster,WC2


Based on these findings I decided that I am going to use Neighbourhood names instead of postal codes for this project.

Two neighbourhood names (Church End and Grove Park) each occur twice, so I'll add the postcode to the name so that geocoding can tell them apart:

In [9]:
df_london[df_london.duplicated(subset=['Neighbourhood'],keep=False)].sort_values('Neighbourhood')

Unnamed: 0,Neighbourhood,Borough,Postcode
100,Church End,Brent,NW10
101,Church End,Barnet,N3
197,Grove Park,Hounslow,W4
198,Grove Park,Lewisham,SE12


In [10]:
to_be_adjusted=[100,101,197,198]
  
df_london.loc[to_be_adjusted,'Neighbourhood']=df_london.loc[to_be_adjusted,'Postcode']+' '+ \
                                                df_london.loc[to_be_adjusted,'Neighbourhood']

df_london.loc[to_be_adjusted]

Unnamed: 0,Neighbourhood,Borough,Postcode
100,NW10 Church End,Brent,NW10
101,N3 Church End,Barnet,N3
197,W4 Grove Park,Hounslow,W4
198,SE12 Grove Park,Lewisham,SE12
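The index list `[100,101,197,198]` is tied to the current state of the Wikipedia table. A sketch of the same renaming driven by the data instead, in plain Python on a few rows from this notebook, is shown here. In pandas the equivalent selection is the `duplicated(subset=['Neighbourhood'], keep=False)` mask already used above.

```python
from collections import Counter

rows = [
    {'Neighbourhood': 'Church End', 'Postcode': 'NW10'},
    {'Neighbourhood': 'Church End', 'Postcode': 'N3'},
    {'Neighbourhood': 'Grove Park', 'Postcode': 'W4'},
    {'Neighbourhood': 'Grove Park', 'Postcode': 'SE12'},
    {'Neighbourhood': 'Acton', 'Postcode': 'W3, W4'},
]

# Count the original names once, before any of them is changed.
name_counts = Counter(row['Neighbourhood'] for row in rows)

for row in rows:
    if name_counts[row['Neighbourhood']] > 1:
        # Prefix only the ambiguous names with their postcode.
        row['Neighbourhood'] = f"{row['Postcode']} {row['Neighbourhood']}"

print([row['Neighbourhood'] for row in rows])
```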


## Geocoding
I'm using the Google Maps API to geocode the neighbourhoods; I found it more precise than OpenStreetMap when searching for areas rather than exact locations.

The googlemaps library returns a JSON-like data structure (actually a list of nested dictionaries). After some testing I found that `result[0]['geometry']['location']` gives a dictionary of the form `{'lat': xxx, 'lng': yyy}`, which holds the coordinates we need.
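To make the nesting concrete, here is a small sketch that walks a trimmed-down mock of such a response. The mock keeps only the keys named above (real responses carry many more fields), and `extract_location` is a hypothetical helper used for illustration.

```python
def extract_location(result):
    """Return the {'lat': ..., 'lng': ...} dict of the first match, or None if there is none."""
    try:
        return result[0]['geometry']['location']
    except (IndexError, KeyError, TypeError):
        # Empty list, missing key, or a response that is not a list of dicts.
        return None

# Mock response reduced to the path used in this notebook.
mock_result = [{'geometry': {'location': {'lat': 51.492612, 'lng': 0.118818}}}]

print(extract_location(mock_result))
print(extract_location([]))
```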

In [11]:
google=googlemaps.Client(key='xxx') #to be deleted before publishing!
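One way to avoid having to delete the key before publishing is to never put it in the notebook at all. The sketch below reads it from an environment variable; the variable name `GOOGLE_MAPS_API_KEY` and the helper `load_api_key` are assumptions for illustration, not something the googlemaps library prescribes.

```python
import os

def load_api_key(env_var='GOOGLE_MAPS_API_KEY'):
    """Read the API key from an environment variable and fail early if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable before running the notebook')
    return key
```

The client would then be created with `google = googlemaps.Client(key=load_api_key())`.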

In [18]:
def neighbourhood_geocode(s):
    print(s['Neighbourhood'])
    result=google.geocode(s['Neighbourhood']+' London UK')
    try:
        return result[0]['geometry']['location']
    except (IndexError, KeyError):
        # Empty result list (no match) or an unexpected response shape
        print('Error!',result)
        return None

In [19]:
geo=df_london.apply(neighbourhood_geocode,axis=1,result_type='expand')

Abbey Wood
Acton
Aldgate
Aldwych
Anerley
Angel
Archway
Arkley
Arnos Grove
Balham
Bankside
Barbican
Barnes
Barnet Gate
Barnsbury
Battersea
Bayswater
Beckenham
Beckton
Bedford Park
Belgravia
Bellingham
Belsize Park
Bermondsey
Bethnal Green
Bexleyheath (also Bexley New Town)
Blackfriars
Blackheath
Blackheath Royal Standard
Blackwall
Bloomsbury
Bounds Green
Bow
Bowes Park
Brent Cross
Brent Park
Brixton
Brockley
Bromley (also Bromley-by-Bow)
Brompton
Brondesbury
Brunswick Park
Burroughs, The
Camberwell
Cambridge Heath
Camden Town
Canary Wharf
Cann Hall
Canning Town
Canonbury
Castelnau
Catford
Chalk Farm
Charing Cross
Charlton
Chelsea
Childs Hill
Chinatown
Chinbrook
Chingford
Chiswick
NW10 Church End
N3 Church End
Clapham
Clerkenwell
Colindale
Colliers Wood
Colney Hatch
Covent Garden
Cricklewood
Crofton Park
Crossness
Crouch End
Crystal Palace
Cubitt Town
Custom House
Dalston
Dartford
De Beauvoir Town
Denmark Hill
Deptford
Dollis Hill
Dulwich
Ealing
Earls Court
Earlsfield
East Dulwich
East F

In [24]:
df_london=df_london.join(geo)
df_london.head()

Unnamed: 0,Neighbourhood,Borough,Postcode,lat,lng
0,Abbey Wood,"Bexley, Greenwich",SE2,51.492612,0.118818
1,Acton,"Ealing, Hammersmith and Fulham","W3, W4",51.508372,-0.27444
6,Aldgate,City,EC3,51.513438,-0.077171
7,Aldwych,Westminster,WC2,51.513266,-0.117183
9,Anerley,Bromley,SE20,51.411911,-0.067978


Occasionally a few lookups fail, so the result has to be checked and the geocoding re-run if necessary.

In [25]:
df_london.isna().sum()

Neighbourhood    0
Borough          0
Postcode         0
lat              0
lng              0
dtype: int64

In [26]:
df_london[df_london['lat'].isna()]

Unnamed: 0,Neighbourhood,Borough,Postcode,lat,lng
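In this run nothing was missing, but when some lookups do come back empty, retrying just those queries is cheaper than repeating the whole `apply`. The following is a minimal sketch, assuming a failed lookup shows up as an empty or malformed result rather than an exception from the client. `geocode_with_retry` and its `attempts` and `pause` parameters are hypothetical.

```python
import time

def geocode_with_retry(query, geocode_fn, attempts=3, pause=1.0):
    """Call geocode_fn(query) until it yields a location or the attempts run out."""
    for attempt in range(attempts):
        try:
            result = geocode_fn(query)
            return result[0]['geometry']['location']
        except (IndexError, KeyError):
            # Empty or malformed response: wait a little, then try again.
            if attempt < attempts - 1:
                time.sleep(pause)
    return None
```

With the client from this notebook, a single missing neighbourhood could be retried as `geocode_with_retry('Acton London UK', google.geocode)`.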


## Saving data for the main project
Before saving the dataframe to CSV, I rename the columns to be consistent with the other cities.

In [28]:
df_london.rename({'lat':'Lat','lng':'Lon'},axis=1,inplace=True)
df_london.head()

Unnamed: 0,Neighbourhood,Borough,Postcode,Lat,Lon
0,Abbey Wood,"Bexley, Greenwich",SE2,51.492612,0.118818
1,Acton,"Ealing, Hammersmith and Fulham","W3, W4",51.508372,-0.27444
6,Aldgate,City,EC3,51.513438,-0.077171
7,Aldwych,Westminster,WC2,51.513266,-0.117183
9,Anerley,Bromley,SE20,51.411911,-0.067978


GitHub's notebook viewer cannot render folium maps, but you can view the map with [nbviewer](https://nbviewer.jupyter.org/github/dedi400/Coursera_Capstone/blob/main/London_geocoding.ipynb).

In [32]:
# named london_map to avoid shadowing the built-in map()
london_map = folium.Map(location=[51.51,-0.11],zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_london['Lat'], df_london['Lon'],
                                            df_london['Borough'], df_london['Neighbourhood']):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        tooltip='{}, {}'.format(neighborhood, borough),
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(london_map)

london_map

In [31]:
df_london.to_csv('data/london_w_geocode.csv')
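`to_csv` does not create missing folders, so the call above fails on a fresh checkout without a `data` directory. A small sketch that creates the folder first is given below; `ensure_parent_dir` is a hypothetical helper built on the standard library.

```python
from pathlib import Path

def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed and return the path as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

The save step could then be written as `df_london.to_csv(ensure_parent_dir('data/london_w_geocode.csv'))`.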