This notebook will be mainly used for the capstone project of the IBM Data Science Professional Certificate. 

In [1]:
import pandas as pd
import numpy as np

In [2]:
print ('Hello Capstone Project Course!')

Hello Capstone Project Course!


# **_Moving Recommenations_**

## - From San Diego to London -

### Introduction:
___________________

My problem is a simple one: I'll be moving from San Diego (USA) to London (UK soon) and I'll like to find out what the best neighborhoods are to start looking for a place to live.

I would like to leverage all the information learned in this Professional Certificate to get specific recommendations according to my tastes. To this end I will apply a *Content-based recommendation system* as learned in the Machine Learning course, and combine it with *Foursquare location data* to. 

I envision my approach with the following stages:
1. Get a list of all neighborhoods in San Diego
2. Get location data using Geolocator
3. Get top venues information for each neighborhood using Foursquare data
4. Rank my top/bottom neighborhoods to create my "likes" profile
5. Generate a ranking profile for the venue categories in the Foursquare data for San Diego according to my ranking if the neighborhoods
6. Get a list of all neighborhoods in London
7. Get location dat using Geolocator
8. Get top venues information for each neighborhood using Forsquare data
9. Use my ranking profile to assign values to London neighborhoods
10. Identify top10 neighborhoods in London according to my preferences
11. Get the top venues on these top10 neighborhoods (to see what they have and understand what drove their ranking)
12. Map these neighborhoods

And finally:
_Use the map, ranking, and info to make my decision!_

### Data:
____________

I will be using 4 main sources of data:
1. [The wikipedia page for San Diego communities:](https://en.wikipedia.org/wiki/List_of_communities_and_neighborhoods_of_San_Diego) 
- Unlike other sources we used in this course, the html data show the community names are includede as nested lists inside a table, so extracting the data will require some creativity
- San Diego is located in Southern California, US (just north of the border with Mexico) and it has 118 Neighborhoods
2. [Nominatim from GeoPy Geolocator:](https://geopy.readthedocs.io/en/stable/)
- Geopy is a Python client for several popular geocoding web services. It makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.
- The Nominatim geocoder uses the geocoding service from OpenStreetMap
- I will be using this source to get the latitudes and longitudes for the neighborhoods of San Diego and London
3. [The wikipedia page for London neighborhoods:](https://en.wikipedia.org/wiki/List_of_areas_of_London)
- This list includes both the City of London, London, and the Greater London Metropolitan Area, totalling 531 neighborhoods
4. [Foursquare locator data:](https://foursquare.com/city-guide)
- As we learned, Foursquare is a platform that leverages crowdsourcing to collect location information to build a dynamic database and provide useful, up-to-date answers to location-based queries such as venue information, top ranked venues around a specific location, user tips, and even trending venues. 
- I will leverage Foursquare location data to identify the top venues for each Neighborhood in San Diego and London, to build a profile for each neighborhood based on what's available in each of them. 

### Methology
________

In [5]:
# Importe required libraries

import requests 
import random 

#!pip install geopy -already installed
from geopy.geocoders import Nominatim 

from pandas.io.json import json_normalize

!pip install folium 
import folium 
import bs4
from bs4 import BeautifulSoup

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
[K     |████████████████████████████████| 94 kB 3.6 MB/s eta 0:00:011
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1


In [78]:
# @hidden_cell
CLIENT_ID = 'ZOIY2E22NTMCSLLHOEEI2SYOWASR4RTSEY0ZJEESDWJN2NPT' # your Foursquare ID
CLIENT_SECRET = 'R4HASMXKBI2BXZSVT0KLFZNK0LBOU1O3MVN3RTGCOHQWXNBK' # your Foursquare Secret
ACCESS_TOKEN = 'KUGB0KIGK4S0HR5JBP1AK2FKM1Y5TPFB3131RXY53ZLGO1UJ' # your FourSquare Access Token

### First - get a list of all neighborhoods in San Diego

In [7]:
SanDiego_url='https://en.wikipedia.org/wiki/List_of_communities_and_neighborhoods_of_San_Diego'

#Scrape website with BeautifulSoup
pageSD=requests.get(SanDiego_url).text
soupSD=BeautifulSoup(pageSD, 'html5lib')
tableSD=soupSD.find('table', class_='multicol')

A=[]
B=[]
C=[]
D=[]

#utilizing HTML tags for rows <tr> and elements <td> to iterate through each row of data and append data elements to their appropriate lists:
for row in tableSD.find_all('tr'):
    cells = row.find_all('td')
    if len(cells)==5:
        A.append(cells[0].find_all('li', text=True))
        B.append(cells[1].find_all('li', text=True))
        C.append(cells[2].find_all('li', text=True))
        D.append(cells[3].find_all('li', text=True))

# Get all the elements in the generated series, and add them into a single list        
SD_neigh=[]
for i,name in enumerate(A[0]):
    temp=A[0][i]
    temp2=temp.string
    SD_neigh.append(temp2)
for i,name in enumerate(B[0]):
    temp=B[0][i]
    temp2=temp.string
    SD_neigh.append(temp2)
for i,name in enumerate(C[0]):
    temp=C[0][i]
    temp2=temp.string
    SD_neigh.append(temp2)
for i,name in enumerate(D[0]):
    temp=D[0][i]
    temp2=temp.string
    SD_neigh.append(temp2)

#Transform it into a dataframe    
SD_df=pd.DataFrame({'Neighborhood':SD_neigh})
print(SD_df.shape)
SD_df.head()


(118, 1)


Unnamed: 0,Neighborhood
0,Adams North
1,Balboa Park
2,Bankers Hill
3,Barrio Logan
4,Bay Terraces


In [8]:
# Preparea list for later processing
SD_list=[]
for i,name in enumerate(SD_df['Neighborhood']):
    address=name+', San Diego, USA'
    SD_list.append(address)
len(SD_list)

118

### Get neighborhood location data using Geolocator

In [24]:
# Initialize a new dataframe to add locator data
SD_df2=pd.DataFrame(columns=['Latitude', 'Longitude'])
SD_df2.head()

Unnamed: 0,Latitude,Longitude


In [25]:
# Loop to get the coordinates for each neighborhood - note that Nominatim is not infallible, so if it fails to assign coordinates, it'll write NaN instead

for i,name in enumerate(SD_list):
    try:
        address=str(SD_list[i])
        geolocator = Nominatim(user_agent="foursquare_agent")
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        SD_df2=SD_df2.append({'Latitude':latitude, 'Longitude':longitude}, ignore_index=True)
    except:
        latitude = 'NaN'
        longitude = 'NaN'
        SD_df2=SD_df2.append({'Latitude':latitude, 'Longitude':longitude}, ignore_index=True)
SD_df2['Neighborhood']=SD_df['Neighborhood']
print(SD_df2.shape)
SD_df2.head()

(118, 3)


Unnamed: 0,Latitude,Longitude,Neighborhood
0,,,Adams North
1,32.7314,-117.147,Balboa Park
2,32.7283,-117.162,Bankers Hill
3,32.6939,-117.138,Barrio Logan
4,32.6919,-117.037,Bay Terraces


In [26]:
 #drop Neighborhoods with no location data assigned
SD_df2=SD_df2.replace('NaN', np.NaN)
SD_df2.dropna(subset=['Latitude'], axis=0, inplace=True)
SD_df2.reset_index(drop=True, inplace=True)
print(SD_df2.shape)

(101, 3)


### Let's visualize all neighborhoods in San Diego

In [27]:
map_SD = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(SD_df2['Latitude'], SD_df2['Longitude'], SD_df2['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='lightblue',
        fill_opacity=0.7,
        parse_html=False).add_to(map_SD)  
    
map_SD

### Get top venues information for each neighborhood using Foursquare data

In [76]:
VERSION = '20180605' 
LIMIT = 100

# Define the function to get the top 100 enues within a 800 m radiums of each neighborhood center. 

def getNearbyVenues(names, latitudes, longitudes, radius=800):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [83]:
#Run the function to get venues and save into a dataframe
neighborhoods=SD_df2['Neighborhood']
latitudes=SD_df2['Latitude']
longitudes=SD_df2['Longitude']
SD_venues=getNearbyVenues(neighborhoods, latitudes, longitudes)
print(SD_venues.shape)
SD_venues.head()

(3463, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Balboa Park,32.731357,-117.146527,Balboa Park Fountain,32.731453,-117.146809,Fountain
1,Balboa Park,32.731357,-117.146527,San Diego Natural History Museum,32.732239,-117.147395,History Museum
2,Balboa Park,32.731357,-117.146527,San Diego Model Railroad Museum,32.731132,-117.148365,Museum
3,Balboa Park,32.731357,-117.146527,Botanical Building & Lily Pond,32.732237,-117.149288,Botanical Garden
4,Balboa Park,32.731357,-117.146527,Spanish Village Art Center,32.733395,-117.14737,Arts & Crafts Store


### Let's turn it into a one-hot encoded dataframe for analysis

In [31]:
# one hot encoding for late
SD_ohe = pd.get_dummies(SD_venues[['Venue Category']], prefix="", prefix_sep="")

#Add Neighborhood name and reorder columns so it shows on the first column
SD_ohe['Neighborhood'] = SD_venues['Neighborhood'] 
col=SD_ohe['Neighborhood']
SD_ohe.drop(labels='Neighborhood', axis=1, inplace=True)
SD_ohe.insert(0, 'Neighborhood', col)
print(SD_ohe.shape)
SD_ohe.head()


(3789, 324)


Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,African Restaurant,Airport,Airport Lounge,Airport Service,American Restaurant,Amphitheater,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Balboa Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Balboa Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Balboa Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Balboa Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Balboa Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Generate a new dataframe with my rankings of the top 5 and worse 5 neighborhoods in San Diego (scale: 0-10)

In [68]:
My_rankings=pd.DataFrame({'Neighborhood':['La Jolla', 'North Park', 'Hillcrest', 'University Heights', 'Gaslamp Quarter', 'Barrio Logan', 'San Ysidro', \
                                        'Tijuana River Valley', 'Kearny Mesa', 'East Village'], 'Ranking':[10, 9.8, 9.9, 9.5, 9.0, 0.1, 0.2, 0.1, 0.5, 0.8]})
#Sort by alphabetical order and clean-up index
My_rankings.sort_values(['Neighborhood'], ascending=True, inplace=True)
My_rankings.reset_index(inplace=True)
My_rankings.drop(labels='index', axis=1, inplace=True)
print(My_rankings.shape)
My_rankings

(10, 2)


Unnamed: 0,Neighborhood,Ranking
0,Barrio Logan,0.1
1,East Village,0.8
2,Gaslamp Quarter,9.0
3,Hillcrest,9.9
4,Kearny Mesa,0.5
5,La Jolla,10.0
6,North Park,9.8
7,San Ysidro,0.2
8,Tijuana River Valley,0.1
9,University Heights,9.5


### Get info for the ranked San Diego neighborhoods 

In [63]:

my_neighborhoods = SD_ohe[SD_ohe['Neighborhood'].isin(My_rankings['Neighborhood'].tolist())]
grouped_neigh = my_neighborhoods.groupby('Neighborhood').mean()

#Sort by alphabetical order and clean-up index
grouped_neigh.sort_values(['Neighborhood'], ascending=True, inplace=True)
grouped_neigh.reset_index(inplace=True)
print(grouped_neigh.shape)
grouped_neigh

(10, 324)


Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,African Restaurant,Airport,Airport Lounge,Airport Service,American Restaurant,Amphitheater,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Barrio Logan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,East Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
2,Gaslamp Quarter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,...,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0
3,Hillcrest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0
4,Kearny Mesa,0.019608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.039216,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,La Jolla,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,North Park,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.02,0.0,...,0.01,0.0,0.01,0.01,0.0,0.01,0.0,0.02,0.0,0.0
7,San Ysidro,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Tijuana River Valley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,University Heights,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.015625,0.0,...,0.0,0.0,0.0,0.015625,0.0,0.015625,0.0,0.0,0.0,0.0


### Generate my rating profile for the venue categories according to my ranking if the neighborhoods

In [124]:
#Dropping unnecessary columns to avoid issues
train_table = grouped_neigh.drop('Neighborhood', 1)
#Dot produt to get weights
my_profile = train_table.transpose().dot(My_rankings['Ranking'])
#Save in new df
my_profile_df=pd.DataFrame(my_profile)
#Transpose for future processing
my_profile_df=my_profile_df.transpose()
print(my_profile_df.shape)
my_profile_df.head()

(1, 323)


Unnamed: 0,ATM,Accessories Store,Adult Boutique,African Restaurant,Airport,Airport Lounge,Airport Service,American Restaurant,Amphitheater,Antique Shop,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,0.114256,0.098,0.098,0.148438,0.0,0.0,0.0,1.539104,0.0,0.0,...,0.216608,0.0,0.188,0.542438,0.09,0.246438,0.0,0.196,0.0,0.0


In [127]:
#Get a list of columns
my_profile_cat=my_profile_df.columns
len(my_profile_cat)

323

### With profile ready, now I can move to London. First lets scrape the list of London neighborhoods from the Wikipedia page

In [71]:
url_London='https://en.wikipedia.org/wiki/List_of_areas_of_London'

#Scrape website with BeautifulSoup
page=requests.get(url_London).text
soup=BeautifulSoup(page, 'html5lib')
table=soup.find('table', class_='wikitable sortable')
#Save desired table into a dataframe
df_Lon=pd.read_html(str(table), flavor='bs4')[0]
#Check initial df is correct and print size
print(df_Lon.shape)
df_Lon.head()

(531, 6)


Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728


### As before, use the geolocator to assigne coordinates to each Neighborhood

In [72]:
#get a list of the Neighborhoods
Lon_list=[]
for i,name in enumerate(df_Lon['Location']):
    address=name+', London, UK'
    Lon_list.append(address)
print(len(Lon_list))

#initialize new dataframe
Lon_df2=pd.DataFrame(columns=['Latitude', 'Longitude'])
Lon_df2.head()

531


Unnamed: 0,Latitude,Longitude


In [73]:
#Get location data - as before, assign NaN if location cannot be determined
for i,name in enumerate(Lon_list):
    try:
        address=str(Lon_list[i])
        geolocator = Nominatim(user_agent="foursquare_agent")
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        Lon_df2=Lon_df2.append({'Latitude':latitude, 'Longitude':longitude}, ignore_index=True)
    except:
        latitude = 'NaN'
        longitude = 'NaN'
        Lon_df2=Lon_df2.append({'Latitude':latitude, 'Longitude':longitude}, ignore_index=True)
Lon_df2['Neigborhood']=df_Lon['Location']
print(Lon_df2.shape)
Lon_df2.head()

(531, 3)


Unnamed: 0,Latitude,Longitude,Neigborhood
0,51.4876,0.11405,Abbey Wood
1,51.5081,-0.273261,Acton
2,51.3586,-0.0316347,Addington
3,51.3797,-0.0742821,Addiscombe
4,51.4354,0.125965,Albany Park


In [88]:
 #drop Neighborhoods with no location data assigned
Lon_df2=Lon_df2.replace('NaN', np.NaN)
Lon_df2.dropna(subset=['Latitude'], axis=0, inplace=True)
Lon_df2.reset_index(drop=True, inplace=True)
print(Lon_df2.shape)

(522, 3)


### Let's visualize all the neighborhoods in London

In [92]:
map_Lon = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(Lon_df2['Latitude'], Lon_df2['Longitude'], Lon_df2['Neigborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='lightred',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Lon)  
    
map_Lon

### Get top venues of each neighborhood using Foursquare data to build neighborhood profiles

In [93]:
#Define a new function, since London is more compact reduce the radius to 500 m and set the limit to 80

VERSION = '20180605' 

def getNearbyVenues2(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit=80'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [94]:
# The data set was too large to run with a single function, so I split the dataset into ~150 neighborhood groups

Lon_df2_a = Lon_df2.iloc[:150,:] 
Lon_df2_b = Lon_df2.iloc[151:300,:]
Lon_df2_c = Lon_df2.iloc[301:450,:]
Lon_df2_d = Lon_df2.iloc[451:,:]
print(Lon_df2_a.shape, ' & ', Lon_df2_b.shape, ' & ', Lon_df2_c.shape, ' & ', Lon_df2_d.shape)

(150, 3)  &  (149, 3)  &  (149, 3)  &  (71, 3)


In [95]:
#Run the function to get venues and save into a dataframe A
neighborhoods=Lon_df2_a['Neigborhood']
latitudes=Lon_df2_a['Latitude']
longitudes=Lon_df2_a['Longitude']
Lon_venues_a=getNearbyVenues2(neighborhoods, latitudes, longitudes)
print(Lon_venues_a.shape)
Lon_venues_a.head()

(3483, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Abbey Wood,51.487621,0.11405,Co-op Food,51.48765,0.11349,Grocery Store
1,Abbey Wood,51.487621,0.11405,Abbey Wood Caravan Club,51.485502,0.120014,Campground
2,Acton,51.50814,-0.273261,London Star Hotel,51.509624,-0.272456,Hotel
3,Acton,51.50814,-0.273261,Dragonfly Brewery at George & Dragon,51.507378,-0.271702,Brewery
4,Acton,51.50814,-0.273261,MrBakeme,51.508452,-0.268543,Creperie


In [96]:
#Run the function to get venues and save into a dataframe B
neighborhoods=Lon_df2_b['Neigborhood']
latitudes=Lon_df2_b['Latitude']
longitudes=Lon_df2_b['Longitude']
Lon_venues_b=getNearbyVenues2(neighborhoods, latitudes, longitudes)
print(Lon_venues_b.shape)
Lon_venues_b.head()

(3110, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Elephant and Castle,51.494888,-0.100573,Theo's Pizzeria,51.492847,-0.100349,Pizza Place
1,Elephant and Castle,51.494888,-0.100573,Gymbox,51.494664,-0.097604,Gym
2,Elephant and Castle,51.494888,-0.100573,Corsica Studios,51.493546,-0.098541,Music Venue
3,Elephant and Castle,51.494888,-0.100573,Sabor Peruano,51.492887,-0.101063,Peruvian Restaurant
4,Elephant and Castle,51.494888,-0.100573,Southwark Playhouse,51.497779,-0.098603,Theater


In [97]:
#Run the function to get venues and save into a dataframe C
neighborhoods=Lon_df2_c['Neigborhood']
latitudes=Lon_df2_c['Latitude']
longitudes=Lon_df2_c['Longitude']
Lon_venues_c=getNearbyVenues2(neighborhoods, latitudes, longitudes)
print(Lon_venues_c.shape)
Lon_venues_c.head()

(3169, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Mill Hill,51.615442,-0.233068,Mill Hill Sport Centre,51.619787,-0.232949,Athletics & Sports
1,Millbank,51.492612,-0.129044,Tate Britain,51.49119,-0.127872,Art Museum
2,Millbank,51.492612,-0.129044,Regency Cafe,51.493895,-0.132311,Café
3,Millbank,51.492612,-0.129044,Sapori Café & Restaurant,51.494805,-0.127654,Café
4,Millbank,51.492612,-0.129044,Burberry Global Headquarters,51.494705,-0.126743,Boutique


In [98]:
#Run the function to get venues and save into a dataframe D
neighborhoods=Lon_df2_d['Neigborhood']
latitudes=Lon_df2_d['Latitude']
longitudes=Lon_df2_d['Longitude']
Lon_venues_d=getNearbyVenues2(neighborhoods, latitudes, longitudes)
print(Lon_venues_d.shape)
Lon_venues_d.head()

(1552, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Tokyngton,51.550596,-0.284899,Station 31,51.552639,-0.285045,Indian Restaurant
1,Tokyngton,51.550596,-0.284899,Post Office,51.553378,-0.286397,Post Office
2,Tokyngton,51.550596,-0.284899,Londis,51.548622,-0.282476,Convenience Store
3,Tokyngton,51.550596,-0.284899,Wembley Stadium Railway Station (WCX),51.554374,-0.285007,Train Station
4,Tokyngton,51.550596,-0.284899,18 bus bus stop flamsted ave harrow rd,51.549646,-0.279003,Bus Stop


In [110]:
# Now combine all resulting df into a single one
Lon_venues=Lon_venues_a
Lon_venues=Lon_venues.append(Lon_venues_b)
Lon_venues=Lon_venues.append(Lon_venues_c)
Lon_venues=Lon_venues.append(Lon_venues_d)
print(Lon_venues.shape)

(11314, 7)


### Process the London dataframe for analysis - with one-hot encoding, and then grouping by Neighborhood to obtain profiles

In [111]:
# one hot encode it for processing
Lon_ohe = pd.get_dummies(Lon_venues[['Venue Category']], prefix="", prefix_sep="")
#Add Neighborhood name and reorder columns so it shows on the first column
Lon_ohe['Neighborhood'] = Lon_venues['Neighborhood'] 
col=Lon_ohe['Neighborhood']
Lon_ohe.drop(labels='Neighborhood', axis=1, inplace=True)
Lon_ohe.insert(0, 'Neighborhood', col)

# Group by neighborhood
grouped_Lon_neigh = Lon_ohe.groupby('Neighborhood').mean()
print(grouped_Lon_neigh.shape)
grouped_Lon_neigh.head()

(508, 413)


Unnamed: 0_level_0,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Arcade,...,Whisky Bar,Windmill,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo,Zoo Exhibit
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abbey Wood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Acton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.045455,0.045455,0.0,0.0,0.0,0.0,0.0,0.0
Addington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Addiscombe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Albany Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Before I can use my profile to rank the London neighborhoods, I need to make sure the tables have the same categories since there are some things found in San Diego that are not found in London, and viceversa, some things in London that are not found in San Diego. Therefore I need to clean up both the London venues (to include only the categories in my profile), and my profile (to include only categories found in London). 

In [129]:
#clean up London table
Lon_cat=grouped_Lon_neigh.columns
clean_Lon_df=pd.DataFrame()

for i,cat in enumerate(Lon_cat):
    if cat in my_profile_cat:
        clean_Lon_df['{}'.format(cat)]=grouped_Lon_neigh['{}'.format(cat)]

print(clean_Lon_df.shape)
clean_Lon_df.head()

(508, 273)


Unnamed: 0_level_0,Accessories Store,Adult Boutique,African Restaurant,Airport,Airport Service,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abbey Wood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Acton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.045455,0.045455,0.0,0.0,0.0,0.0,0.0
Addington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Addiscombe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Albany Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [134]:
#clean up my profile
clean_Lon_cat=clean_Lon_df.columns
clean_profile=pd.DataFrame()

for i,cat in enumerate(my_profile_cat):
    if cat in clean_Lon_cat:
        clean_profile['{}'.format(cat)]=my_profile_df['{}'.format(cat)]
clean_profile=clean_profile.transpose()
print(clean_profile.shape)
clean_profile.head()

(273, 1)


Unnamed: 0,0
Accessories Store,0.098
Adult Boutique,0.098
African Restaurant,0.148438
Airport,0.0
Airport Service,0.0


And now...




# Results:
_________

### _Now I can finally rank the London neighborhoods according to my profile rating of each category!!!_

In [155]:
#Get the list of London neighborhoods rated according to my profile
profile_list=clean_profile[0]
recommendation_list = ((clean_Lon_df*profile_list).sum(axis=1))/(profile_list.sum())
recommendations_df=pd.DataFrame(recommendation_list)
recommendations_df.columns=['Rating']
#sort it to get the top 10 and save into a dataframe
recommendations_df=recommendations_df.sort_values('Rating', ascending=False)
top10=recommendations_df.head(10)
top10

Unnamed: 0_level_0,Rating
Neighborhood,Unnamed: 1_level_1
Pratt's Bottom,0.03929
Oakleigh Park,0.028367
Freezywater,0.02834
Hampstead Garden Suburb,0.028139
Highams Park,0.023388
Brentford,0.022952
Eastcote,0.021922
Woodford,0.021584
Lamorbey,0.021457
Earls Court,0.021332


In [164]:
clean_Lon_df2=clean_Lon_df.reset_index()

### Now that I have my recommended neighborhoods, let's collect some data on them to make my decision

In [163]:
# Define the function to get the most common venues in each neighborhood
def return_most_common_venues(row, top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:top_venues]

top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Neighborhood']
for ind in np.arange(top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

neigh_venues_rank = pd.DataFrame(columns=columns)
neigh_venues_rank['Neighborhood'] = clean_Lon_df2['Neighborhood']

# Run the function to get the info on all neighborhoods
for ind in np.arange(clean_Lon_df2.shape[0]):
    neigh_venues_rank.iloc[ind, 1:] = return_most_common_venues(clean_Lon_df2.iloc[ind, :], top_venues)

print(neigh_venues_rank.shape)


(508, 11)


In [165]:
#Now let's generate a final df with all info (Neighborhood, rating, location, and most common venues)
top_10_info = top10
top_10_info = top_10_info.join(Lon_df2.set_index('Neigborhood'), on='Neighborhood')
top_10_info = top_10_info.join(neigh_venues_rank.set_index('Neighborhood'), on='Neighborhood')
top_10_info=top_10_info.reset_index()
print(top_10_info.shape)
top_10_info

(10, 14)


Unnamed: 0,Neighborhood,Rating,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Pratt's Bottom,0.03929,51.340884,0.111459,Bar,Coffee Shop,Zoo Exhibit,Farmers Market,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Falafel Restaurant,Fast Food Restaurant
1,Oakleigh Park,0.028367,51.637667,-0.166225,Café,Zoo Exhibit,Farmers Market,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Falafel Restaurant,Fast Food Restaurant,Cosmetics Shop
2,Freezywater,0.02834,51.675772,-0.03143,Café,Shoe Store,Pizza Place,Coffee Shop,Falafel Restaurant,Escape Room,Ethiopian Restaurant,Event Service,Event Space,Exhibit
3,Hampstead Garden Suburb,0.028139,51.580508,-0.180616,Park,Coffee Shop,Zoo Exhibit,Falafel Restaurant,Escape Room,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Farmers Market
4,Highams Park,0.023388,51.606544,-0.008336,Gym,Coffee Shop,Zoo Exhibit,Falafel Restaurant,Escape Room,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Farmers Market
5,Brentford,0.022952,51.486396,-0.321662,Coffee Shop,Gym,Convenience Store,Office,Furniture / Home Store,Pizza Place,Deli / Bodega,Sandwich Place,Ethiopian Restaurant,Event Service
6,Eastcote,0.021922,51.579542,-0.401653,Café,Hotel,Burger Joint,Coffee Shop,Sandwich Place,Grocery Store,Pharmacy,Zoo Exhibit,Exhibit,Ethiopian Restaurant
7,Woodford,0.021584,51.606806,0.034027,Coffee Shop,Hotel,Italian Restaurant,Café,Chinese Restaurant,Grocery Store,Restaurant,Bakery,Indian Restaurant,Pizza Place
8,Lamorbey,0.021457,51.435509,0.101805,Coffee Shop,Hotel,Burger Joint,Train Station,Grocery Store,Gym / Fitness Center,Dessert Shop,Restaurant,Mexican Restaurant,Pizza Place
9,Earls Court,0.021332,51.491612,-0.193903,Hotel,Café,Pub,Italian Restaurant,Garden,Grocery Store,Thai Restaurant,Historic Site,Pizza Place,Coffee Shop


### And finally, let's visualize the top10 recommended neighborhoods in London

In [190]:
map_Top10 = folium.Map(location=[latitude, longitude], zoom_start=10)

Neigh=folium.map.FeatureGroup()

# add markers to map
for lat, lng, label in zip(top_10_info['Latitude'], top_10_info['Longitude'], top_10_info['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    Neigh.add_child(folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_color='lightpurple',
        fill_opacity=0.7,
        parse_html=False))  

map_Top10.add_child(Neigh)

#Add a marker for my futur work location
Work=folium.map.FeatureGroup()
la=Lon_df2.loc[Lon_df2['Neigborhood']=='Bloomsbury', 'Latitude']
lo=Lon_df2.loc[Lon_df2['Neigborhood']=='Bloomsbury', 'Longitude']
Work.add_child(folium.Marker([la, lo], popup='Work'))
map_Top10.add_child(Work)
map_Top10

# Discussion
___________
* Based on the map, the recommendated neighborhoods spread accross all London. This is a good sign, indicating that London is very diverse, i.e. it is not clustering my prefered neighborhoods in a specific Borough. It also indicateds my analysis was able to pick specific neighborhoods that match my profile rather result in a cluster in a particular area.
* An advantage of this is that I can pick a neighborhood that is close to where I'll be work in Bloomsbury(work). Based on the map, the closest recommended neighborhoods are Hampstead Garden and Earls Court.
* Based on the information table, it seems that the criteria that contributed to higher rankings in the neighborhood are things like Parks, Cafes, Grocery Stores, Events venues(both Spaces, and Services), Restaurants (particularly Ethiopian :P, Falafel, and Pizza Places), Farmer Markets and Zoos... definitely all things I like, so the Recommendation system seemed to work.
* Finally, from the Closest recommended neighborhoods to work, Hampstead Garden is ranked higher (#4). Furthermore, it seems to have a good spread of top categories across its top venues!

# Conclusion
_________
In conclusion my _Machine Learning Recommendation System_, trained with my preferences of the San Diego neighborhoods, has done a good job at analysing the neighborhood profiles in London to recommend the ones that best match my preferences.
It has helped me solve my problem, by narrowing down my options and after looking at the recommended neighborhoods around my future job, and their top venues, I've selected the neighborhood to live as *Hampstead Garden Suburb*
### Mission Acomplished! :)