## Coursera Capstone Project
This notebook will mainly be used for the capstone project for the IBM Data Science Professional Certificate on Coursera.

In [1]:
import pandas as pd
import numpy as np

## Table of contents
* [Introduction - Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction - Business Problem <a name="introduction"></a>
***
When traveling, it is important to book a hotel in the right area. In a big city you have never visited, you may not want to work out how to get around by taxi, bus, or other public transportation. You may simply want to avoid spending money on transport and be able to walk to everything you want to see. For that reason, we can compare your preferences with what the different areas of your destination city have to offer and recommend where to book a hotel. These preferences could come from a previous trip: for example, if you visited Paris last year and want to go to London this year, you could compare the neighborhoods you enjoyed in Paris with the neighborhoods in London to see where you might prefer to stay.

## Data <a name="data"></a>
***
The data will come from Foursquare, based on the neighborhoods in London and Paris. The project is set up so that the two cities can be changed, but these are the two I will use for demonstration. First, I will get the latitude and longitude of each neighborhood, which are needed to gather the data from Foursquare. I will then use that data to determine which neighborhoods are similar, based on the types of venues located in each area, and make recommendations to the user from those findings. If the user enjoyed cafes, parks, and museums in Paris, they would receive recommendations for areas of London that have similar venues.

### Gathering Neighborhoods

First, I want to get the neighborhoods in both cities. This involves using the postal codes for the different areas of each city and then plotting them on a map to decide whether any editing is needed: how many neighborhoods to compare, how spread out they are, and any other issues that become noticeable once the neighborhoods are visualized.

#### London
I was able to find information on London and its postcodes on Wikipedia. I will scrape the table with the information and then narrow down the list to neighborhoods at the heart of the city as the neighborhoods for Paris are all pretty central. I want them to be similarly set up and not have one spread out more than the other.

First, I will install the necessary packages to scrape and store the data into a dataframe.

In [2]:
! pip install html-table-parser-python3

import urllib.request

from html_table_parser import HTMLTableParser

Collecting html-table-parser-python3
  Downloading html_table_parser_python3-0.1.5-py3-none-any.whl (3.5 kB)
Installing collected packages: html-table-parser-python3
Successfully installed html-table-parser-python3-0.1.5


Next, I fetch the table and store it in a dataframe. The first row of the parsed table holds the column names, so I promote it to the header here as well. Finally, I will not need the Dial code column, but I will use the OS grid ref to get the latitude and longitude.

In [3]:
def url_get_contents(url):
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    
    return f.read()

xhtml = url_get_contents('https://en.wikipedia.org/wiki/List_of_areas_of_London').decode('utf-8')

parser = HTMLTableParser()

parser.feed(xhtml)

london_df = pd.DataFrame(parser.tables[1])

london_df.columns = london_df.loc[0]
london_df.drop(0, inplace=True)

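# Keep every column except the Dial code; copy so later assignments act on an independent frame
london_df = london_df.iloc[:, [0,1,2,3,5]].copy()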

london_df

Unnamed: 0,Location,London borough,Post town,Postcode district,OS grid ref
1,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,TQ465785
2,Acton,"Ealing, Hammersmith and Fulham [8]",LONDON,"W3, W4",TQ205805
3,Addington,Croydon [8],CROYDON,CR0,TQ375645
4,Addiscombe,Croydon [8],CROYDON,CR0,TQ345665
5,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",TQ478728
...,...,...,...,...,...
527,Woolwich,Greenwich,LONDON,SE18,TQ435795
528,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,TQ225655
529,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,TQ225815
530,Yeading,Hillingdon,HAYES,UB4,TQ115825


I now want to get the latitude and longitude of each location from its OS grid ref and add them to the dataframe. A small number of neighborhoods have no OS grid ref, so I drop those rows first and reset the index. Then I iterate over every row, convert the OS grid ref to a latitude and longitude, and store both values in the dataframe.

In [4]:
! pip install OSGridConverter
from OSGridConverter import grid2latlong

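# Assign the result back: replace(inplace=True) on a selected column may not update london_df itself
london_df['OS grid ref'] = london_df['OS grid ref'].replace('', np.nan)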
london_df.dropna(subset=['OS grid ref'], inplace=True)
london_df.reset_index(drop=True, inplace=True)
for index, row in london_df.iterrows():
    os_grid_ref = row["OS grid ref"]
    
    lat_long_coords = grid2latlong(os_grid_ref)
    latitude = lat_long_coords.latitude
    longitude = lat_long_coords.longitude

    london_df.loc[london_df.index[index], 'Latitude'] = latitude
    london_df.loc[london_df.index[index], 'Longitude'] = longitude
    
    
london_df

Collecting OSGridConverter
  Downloading OSGridConverter-0.1.3-py3-none-any.whl (19 kB)
Installing collected packages: OSGridConverter
Successfully installed OSGridConverter-0.1.3


Unnamed: 0,Location,London borough,Post town,Postcode district,OS grid ref,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,TQ465785,51.486484,0.109318
1,Acton,"Ealing, Hammersmith and Fulham [8]",LONDON,"W3, W4",TQ205805,51.510591,-0.264585
2,Addington,Croydon [8],CROYDON,CR0,TQ375645,51.362934,-0.025780
3,Addiscombe,Croydon [8],CROYDON,CR0,TQ345665,51.381625,-0.068126
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",TQ478728,51.434929,0.125663
...,...,...,...,...,...,...,...
524,Woolwich,Greenwich,LONDON,SE18,TQ435795,51.496238,0.066504
525,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,TQ225655,51.375352,-0.240950
526,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,TQ225815,51.519148,-0.235411
527,Yeading,Hillingdon,HAYES,UB4,TQ115825,51.530413,-0.393669
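
To make the OS grid ref step less opaque, here is a minimal sketch of what a reference such as `TQ465785` encodes: two letters that pick a 100 km square of the National Grid, followed by easting and northing digits inside that square. The helper name `parse_os_grid_ref` is hypothetical and is not part of OSGridConverter. It only returns grid coordinates in metres; the conversion from those coordinates to latitude and longitude, which `grid2latlong` performs above, is not reproduced here.

```python
def parse_os_grid_ref(ref):
    """Turn an OS grid reference such as 'TQ465785' into (easting, northing) in metres.

    Hypothetical helper for illustration only.
    """
    ref = ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(digits) % 2:
        raise ValueError("expected an even number of digits: " + ref)

    # Grid letters run A-Z without I and are laid out on 5x5 blocks.
    def letter_index(ch):
        idx = ord(ch) - ord("A")
        return idx - 1 if idx > 7 else idx

    l1, l2 = letter_index(letters[0]), letter_index(letters[1])
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    half = len(digits) // 2
    # Pad each half to five digits so a 6-figure reference resolves to 100 m.
    easting = e100km * 100_000 + int(digits[:half].ljust(5, "0"))
    northing = n100km * 100_000 + int(digits[half:].ljust(5, "0"))
    return easting, northing


# Abbey Wood, the first row of the table above
print(parse_os_grid_ref("TQ465785"))  # (546500, 178500)
```

A 6-figure reference like the ones in the table only locates a point to within 100 m, which is precise enough for placing one marker per neighborhood.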


I now want to visualize the neighborhoods of London on a map so I can see if I want to eliminate any further rows from my data because the neighborhoods are not in the area I would like. First I install and import the necessary libraries, then center the map on London, and then graph the neighborhoods on the map to determine how spread out they are.

In [5]:
! pip install folium
import folium

! pip install geopy
from geopy.geocoders import Nominatim

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1


In [6]:
address = 'London, England'

geolocator = Nominatim(user_agent="london_explorer")
location = geolocator.geocode(address)
london_latitude = location.latitude
london_longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(london_latitude, london_longitude))

The geographical coordinates of London are 51.5073219, -0.1276474.


In [7]:
map_london = folium.Map(location=[london_latitude, london_longitude], zoom_start=10)

for lat, lng, neighborhood in zip(london_df['Latitude'], london_df['Longitude'], london_df['Location']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False
    ).add_to(map_london)

map_london

As the map above shows, there is a large number of neighborhoods, and they are spread over a wide area. Paris is split into far fewer neighborhoods, so I am going to narrow the London neighborhoods down to only those with a Post town of LONDON.

In [8]:
london_only_df = london_df[london_df["Post town"] == 'LONDON'].reset_index(drop=True)

london_only_df

Unnamed: 0,Location,London borough,Post town,Postcode district,OS grid ref,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,TQ465785,51.486484,0.109318
1,Acton,"Ealing, Hammersmith and Fulham [8]",LONDON,"W3, W4",TQ205805,51.510591,-0.264585
2,Aldgate,City [10],LONDON,EC3,TQ334813,51.514885,-0.078356
3,Aldwych,Westminster [10],LONDON,WC2,TQ307810,51.512819,-0.117388
4,Anerley,Bromley [11],LONDON,SE20,TQ345695,51.408585,-0.066989
...,...,...,...,...,...,...,...
292,Wood Green,Haringey,LONDON,N22,TQ305905,51.598237,-0.116745
293,Woodford,Redbridge,LONDON,"IG8, E18",TQ405915,51.604820,0.028068
294,Woodside Park,Barnet,LONDON,N12,TQ256925,51.617324,-0.186791
295,Woolwich,Greenwich,LONDON,SE18,TQ435795,51.496238,0.066504


In [9]:
map_london = folium.Map(location=[london_latitude, london_longitude], zoom_start=10)

for lat, lng, neighborhood in zip(london_only_df['Latitude'], london_only_df['Longitude'], london_only_df['Location']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False
    ).add_to(map_london)

map_london

This is better, but there are still roughly 300 neighborhoods and a lot of spread. I will narrow the list down further by limiting the distance from the city center.

In [10]:
import math

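def within_city_center(lat, long):
    # Straight-line distance from the city center, measured in degrees of latitude/longitude
    radius = math.sqrt(math.pow(lat - london_latitude, 2) + math.pow(long - london_longitude, 2))
    # Keep only neighborhoods closer than 0.04 degrees to the center
    return radius < .04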
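
The threshold above is expressed in degrees, and a degree of longitude covers less ground than a degree of latitude at London's latitude: 0.04 degrees is about 4.4 km north-south but only about 2.8 km east-west. As an alternative sketch (not what the notebook uses), the filter could be written in kilometres with the haversine formula. The names `haversine_km` and `within_radius_km` and the 3 km cutoff are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius_km(lat, lon, center_lat, center_lon, max_km=3.0):
    """True when the point lies closer than max_km to the center (3 km is an assumed cutoff)."""
    return haversine_km(lat, lon, center_lat, center_lon) < max_km


# Coordinates taken from the outputs above
center = (51.5073219, -0.1276474)
print(within_radius_km(51.512819, -0.117388, *center))  # Aldwych
print(within_radius_km(51.486484, 0.109318, *center))   # Abbey Wood
```

With a cutoff in kilometres, the same function can be reused for another city without adjusting for its latitude.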

In [11]:
for index, row in london_only_df.iterrows():
    london_only_df.loc[london_only_df.index[index], 'City Center'] = within_city_center(row['Latitude'], row['Longitude'])
    
london_only_df

Unnamed: 0,Location,London borough,Post town,Postcode district,OS grid ref,Latitude,Longitude,City Center
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,TQ465785,51.486484,0.109318,False
1,Acton,"Ealing, Hammersmith and Fulham [8]",LONDON,"W3, W4",TQ205805,51.510591,-0.264585,False
2,Aldgate,City [10],LONDON,EC3,TQ334813,51.514885,-0.078356,False
3,Aldwych,Westminster [10],LONDON,WC2,TQ307810,51.512819,-0.117388,True
4,Anerley,Bromley [11],LONDON,SE20,TQ345695,51.408585,-0.066989,False
...,...,...,...,...,...,...,...,...
292,Wood Green,Haringey,LONDON,N22,TQ305905,51.598237,-0.116745,False
293,Woodford,Redbridge,LONDON,"IG8, E18",TQ405915,51.604820,0.028068,False
294,Woodside Park,Barnet,LONDON,N12,TQ256925,51.617324,-0.186791,False
295,Woolwich,Greenwich,LONDON,SE18,TQ435795,51.496238,0.066504,False


In [12]:
london_city_center_df = london_only_df[london_only_df['City Center'] == True].reset_index(drop=True)

london_city_center_df

Unnamed: 0,Location,London borough,Post town,Postcode district,OS grid ref,Latitude,Longitude,City Center
0,Aldwych,Westminster [10],LONDON,WC2,TQ307810,51.512819,-0.117388,True
1,Bankside,Southwark [14],LONDON,SE1,TQ325795,51.498921,-0.092006,True
2,Barbican,City [14],LONDON,"EC1, EC2",TQ322818,51.51966,-0.095466,True
3,Barnsbury,Islington [17],LONDON,N1,TQ305845,51.544318,-0.118974,True
4,Belgravia,Westminster [22],LONDON,SW1,TQ283792,51.497193,-0.152637,True
5,Blackfriars,City [27],LONDON,EC4,TQ318808,51.510767,-0.101607,True
6,Bloomsbury,Camden [29],LONDON,WC1,TQ299820,51.52199,-0.12855,True
7,Camden Town,Camden [40],LONDON,NW1,TQ295845,51.544548,-0.133398,True
8,Charing Cross,Westminster,LONDON,WC2,TQ305805,51.508372,-0.120456,True
9,Chinatown,Westminster,LONDON,W1,TQ297808,51.511252,-0.131875,True


In [13]:
map_london = folium.Map(location=[london_latitude, london_longitude], zoom_start=12)

for lat, lng, neighborhood in zip(london_city_center_df['Latitude'], london_city_center_df['Longitude'], london_city_center_df['Location']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False
    ).add_to(map_london)

map_london

The above map looks much better for comparing with Paris. The number of neighborhoods is similar, and the location seems better suited to visiting London and not staying too far from the majority of the attractions.

#### Paris
For Paris, I will use the 20 Arrondissements. I will run through the list getting each latitude and longitude and adding them to the dataframe.

In [14]:
arrondissement = ["1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th", "10th", "11th", "12th", "13th", "14th", "15th", "16th", "17th", "18th", "19th", "20th"]
latitudes = []
longitudes = []
for arr in arrondissement:
    address = "{} arrondissement, Paris, France".format(arr)
    location = geolocator.geocode(address)
    latitudes.append(location.latitude)
    longitudes.append(location.longitude)

paris_df = pd.DataFrame(list(zip(arrondissement, latitudes, longitudes)), columns = ["Arrondissement", "Latitude", "Longitude"])

paris_df
    

Unnamed: 0,Arrondissement,Latitude,Longitude
0,1st,48.864614,2.334396
1,2nd,48.868743,2.341688
2,3rd,48.864212,2.360936
3,4th,48.856202,2.355619
4,5th,48.845973,2.34435
5,6th,48.850433,2.332951
6,7th,48.857028,2.320195
7,8th,48.870905,2.312152
8,9th,48.876019,2.339962
9,10th,48.876126,2.359839


In [15]:
address= 'Paris, France'
location = geolocator.geocode(address)
paris_latitude = location.latitude
paris_longitude = location.longitude

map_paris = folium.Map(location=[paris_latitude, paris_longitude], zoom_start=12)

for lat, lng, neighborhood in zip(paris_df['Latitude'], paris_df['Longitude'], paris_df['Arrondissement']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False
    ).add_to(map_paris)

map_paris

Now that I have comparable data for both cities, the next step is to start gathering data that can be used to compare the neighborhoods.

### Foursquare

Here, I will gather data on the venues in each neighborhood. I want to collect the types of venues and their locations, which will help determine clusters later on.

In [16]:
# The code was removed by Watson Studio for sharing.

In [17]:
import requests

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=800):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
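
The function above reads a nested JSON response. The sketch below shows that flattening step on its own, using a hand-written payload whose shape is inferred from the keys accessed in `getNearbyVenues` (`response` → `groups` → `items` → `venue`). The helper `extract_venue_rows`, the fallback category `"Unknown"`, and the second venue in the sample payload are assumptions added for illustration; they guard against a venue with no categories, which would raise an `IndexError` in the list comprehension above.

```python
def extract_venue_rows(payload, neighborhood):
    """Flatten an explore-style payload into (neighborhood, name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups") or [{}]
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        # Fall back to a placeholder when a venue carries no category at all.
        categories = venue.get("categories") or [{}]
        rows.append((
            neighborhood,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name", "Unknown"),
        ))
    return rows


# Hand-written sample; the first venue mirrors a row from the London output.
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Lyceum Theatre",
                           "location": {"lat": 51.511598, "lng": -0.119785},
                           "categories": [{"name": "Theater"}]}},
                {"venue": {"name": "Uncategorized Spot",  # hypothetical venue
                           "location": {"lat": 51.5120, "lng": -0.1180},
                           "categories": []}},
            ]
        }]
    }
}

for row in extract_venue_rows(sample_payload, "Aldwych"):
    print(row)
```

Separating the parsing from the HTTP call also makes it possible to check the parsing logic without API credentials.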

In [19]:
paris_venues = getNearbyVenues(names=paris_df['Arrondissement'], latitudes=paris_df['Latitude'], longitudes=paris_df['Longitude'])
paris_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,1st,48.864614,2.334396,Musée des Arts Décoratifs,48.863077,2.333393,Art Museum
1,1st,48.864614,2.334396,Ellsworth,48.865528,2.337057,French Restaurant
2,1st,48.864614,2.334396,Place des Pyramides,48.863924,2.332224,Plaza
3,1st,48.864614,2.334396,Kosyuen 華修園,48.864163,2.333567,Tea Room
4,1st,48.864614,2.334396,Jardin du Palais Royal,48.864941,2.337728,Garden


In [20]:
london_venues = getNearbyVenues(names=london_city_center_df['Location'], latitudes=london_city_center_df['Latitude'], longitudes=london_city_center_df['Longitude'])
london_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Aldwych,51.512819,-0.117388,Lyceum Theatre,51.511598,-0.119785,Theater
1,Aldwych,51.512819,-0.117388,Somerset House,51.510786,-0.117899,Event Space
2,Aldwych,51.512819,-0.117388,Novello Theatre,51.51228,-0.119322,Theater
3,Aldwych,51.512819,-0.117388,The Delaunay,51.513181,-0.117988,Restaurant
4,Aldwych,51.512819,-0.117388,Lundenwic,51.512823,-0.118343,Coffee Shop


## Methodology <a name="methodology"></a>
***

I will be comparing neighborhoods based on their attractions and a user's preferences. If a user enjoyed art, museums, and parks, I will use the data from Foursquare to find the areas that have the most options for those preferences. Then I can show the user which neighborhoods match up by showing the maps side by side.

Once we have the neighborhoods for both cities narrowed down by preference, we can create clusters using k-means clustering. I can then create side-by-side maps to show the neighborhoods that are similar which would allow a user to choose the best place to stay on their trip.
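
To illustrate what k-means does with venue data, here is a minimal sketch of the algorithm in plain Python. It is not the scikit-learn implementation used later in the notebook: the function `kmeans_sketch`, the seeding of centroids with the first `k` points, and the toy frequency vectors are all assumptions chosen to keep the example small and deterministic.

```python
import math


def kmeans_sketch(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (an arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy venue frequencies per neighborhood: (cafe, park, museum)
neighborhoods = {
    "A": (0.60, 0.30, 0.10),
    "B": (0.10, 0.20, 0.70),
    "C": (0.55, 0.35, 0.10),
    "D": (0.15, 0.15, 0.70),
}
labels, _ = kmeans_sketch(list(neighborhoods.values()), k=2)
print(dict(zip(neighborhoods, labels)))
```

Neighborhoods with similar venue mixes (A and C, B and D) end up sharing a label, which is the property the recommendation relies on.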

## Analysis <a name="analysis"></a>
***

First, let's get a breakdown of the different venues in each neighborhood.

In [21]:
london_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Aldwych,100,100,100,100,100,100
Bankside,100,100,100,100,100,100
Barbican,100,100,100,100,100,100
Barnsbury,53,53,53,53,53,53
Belgravia,100,100,100,100,100,100
Blackfriars,100,100,100,100,100,100
Bloomsbury,100,100,100,100,100,100
Camden Town,94,94,94,94,94,94
Charing Cross,100,100,100,100,100,100
Chinatown,100,100,100,100,100,100


In [22]:
paris_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
10th,100,100,100,100,100,100
11th,100,100,100,100,100,100
12th,80,80,80,80,80,80
13th,100,100,100,100,100,100
14th,100,100,100,100,100,100
15th,100,100,100,100,100,100
16th,87,87,87,87,87,87
17th,100,100,100,100,100,100
18th,100,100,100,100,100,100
19th,100,100,100,100,100,100


In [23]:
print('There are {} unique categories.'.format(len(london_venues['Venue Category'].unique())))

There are 271 unique categories.


In [24]:
print('There are {} unique categories.'.format(len(paris_venues['Venue Category'].unique())))

There are 220 unique categories.


At this point, I want to combine the data into one dataframe so that I can cluster neighborhoods together across the two cities. I will need to get the top 10 types of venues in each neighborhood and then use k-means clustering.

In [25]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
combined_neighborhoods_venues = pd.concat([paris_venues, london_venues])

combined_neighborhoods_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,1st,48.864614,2.334396,Musée des Arts Décoratifs,48.863077,2.333393,Art Museum
1,1st,48.864614,2.334396,Ellsworth,48.865528,2.337057,French Restaurant
2,1st,48.864614,2.334396,Place des Pyramides,48.863924,2.332224,Plaza
3,1st,48.864614,2.334396,Kosyuen 華修園,48.864163,2.333567,Tea Room
4,1st,48.864614,2.334396,Jardin du Palais Royal,48.864941,2.337728,Garden
...,...,...,...,...,...,...,...
3635,Westminster,51.499615,-0.135236,Gianni's,51.494861,-0.127139,Café
3636,Westminster,51.499615,-0.135236,Market Hall Victoria,51.496378,-0.144286,Food Court
3637,Westminster,51.499615,-0.135236,The Goring Bar & Lounge,51.497385,-0.145475,Bar
3638,Westminster,51.499615,-0.135236,Costa Coffee,51.500934,-0.124805,Coffee Shop


In [26]:
print('There are {} unique categories.'.format(len(combined_neighborhoods_venues['Venue Category'].unique())))

There are 318 unique categories.


The number of restaurants, and especially of French restaurants in Paris, has a large impact on the clustering. I have therefore removed the restaurant columns to get a better idea of how neighborhoods are similar based on the other attractions that a traveler is more likely to care about. I have also removed the hotel columns, as they will not be needed.

In [27]:
combined_neighborhoods_onehot = pd.get_dummies(combined_neighborhoods_venues[['Venue Category']], prefix="", prefix_sep="")
combined_neighborhoods_onehot['Neighborhood'] = combined_neighborhoods_venues['Neighborhood']
cols = [c for c in combined_neighborhoods_onehot.columns if 'restaurant' not in c.lower() and 'hotel' not in c.lower()]

combined_neighborhoods_onehot = combined_neighborhoods_onehot[cols]

combined_neighborhoods_onehot.head()

Unnamed: 0,Accessories Store,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Athletics & Sports,Auto Garage,...,Warehouse Store,Watch Shop,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
combined_grouped = combined_neighborhoods_onehot.groupby('Neighborhood').mean().reset_index()
combined_grouped

Unnamed: 0,Neighborhood,Accessories Store,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Athletics & Sports,...,Warehouse Store,Watch Shop,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,10th,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0
1,11th,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0
2,12th,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0125,0.0,0.0,0.0,0.0,0.0,0.0
3,13th,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,14th,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,...,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0
5,15th,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,16th,0.0,0.0,0.0,0.0,0.0,0.0,0.011494,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,17th,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0
8,18th,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
9,19th,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
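
The two cells above turn a list of venues into one row of category frequencies per neighborhood. A minimal sketch of the same idea on toy data may make the numbers in the table easier to read. The neighborhood names `A` and `B` and their venues are made up for illustration.

```python
import pandas as pd

# Toy data: one row per venue, as returned by the venue lookup
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bakery", "Bakery", "Park", "Museum", "Park", "Park"],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category
frequencies = onehot.groupby("Neighborhood").mean()
print(frequencies)
```

Because each value is a share rather than a count, neighborhoods with fewer venues returned (such as Barnsbury with 53) remain comparable with those that hit the limit of 100.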


In [29]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [30]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # Ranks beyond 3rd fall outside `indicators` and take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = combined_grouped['Neighborhood']

for ind in np.arange(combined_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(combined_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,10th,Coffee Shop,Pizza Place,Breakfast Spot,Bakery,Cocktail Bar,Wine Shop,Bistro,Bar,Sandwich Place,Hostel
1,11th,Bar,Cocktail Bar,Bistro,Pizza Place,Café,Record Shop,Food & Drink Shop,Beer Bar,Music Venue,Bakery
2,12th,Bistro,Supermarket,Bakery,Bus Stop,Gym,Grocery Store,Farmers Market,Bar,Food & Drink Shop,Fruit & Vegetable Store
3,13th,Bakery,Creperie,Bistro,Brasserie,Bar,Park,Burger Joint,Museum,Food & Drink Shop,Movie Theater
4,14th,Bar,Bakery,Bistro,Theater,Coffee Shop,Beer Store,Art Museum,Plaza,Hookah Bar,Pizza Place
5,15th,Bistro,Park,Plaza,Bakery,Coffee Shop,Bar,Café,Brasserie,Sports Bar,Sandwich Place
6,16th,Bakery,Supermarket,Sandwich Place,Café,Garden,Train Station,Pizza Place,Tennis Court,Tea Room,Shopping Mall
7,17th,Wine Bar,Farmers Market,Bar,Pastry Shop,Bakery,Bookstore,Coffee Shop,Steakhouse,Breakfast Spot,Café
8,18th,Bar,Bistro,Plaza,Pizza Place,Café,Sandwich Place,Burger Joint,Art Gallery,Bookstore,Convenience Store
9,19th,Bar,Supermarket,Café,Bistro,Cocktail Bar,Pizza Place,Creperie,Beer Bar,Movie Theater,Concert Hall


In [31]:
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [32]:
kclusters = 5

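# The positional axis argument to drop() was removed in pandas 2.0; name the columns explicitly
combined_grouped_clustering = combined_grouped.drop(columns='Neighborhood')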

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(combined_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:] 

array([4, 2, 2, 0, 2, 0, 0, 0, 2, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1,
       4, 1, 0, 4, 4, 1, 3, 0, 4, 3, 1, 4, 4, 4, 3, 1, 4, 0, 1, 4, 0, 4,
       1, 1, 4, 4, 1, 4, 4, 3, 0, 4, 4, 3, 1, 1, 4], dtype=int32)

In [33]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

neighborhoods_venues_sorted

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,4,10th,Coffee Shop,Pizza Place,Breakfast Spot,Bakery,Cocktail Bar,Wine Shop,Bistro,Bar,Sandwich Place,Hostel
1,2,11th,Bar,Cocktail Bar,Bistro,Pizza Place,Café,Record Shop,Food & Drink Shop,Beer Bar,Music Venue,Bakery
2,2,12th,Bistro,Supermarket,Bakery,Bus Stop,Gym,Grocery Store,Farmers Market,Bar,Food & Drink Shop,Fruit & Vegetable Store
3,0,13th,Bakery,Creperie,Bistro,Brasserie,Bar,Park,Burger Joint,Museum,Food & Drink Shop,Movie Theater
4,2,14th,Bar,Bakery,Bistro,Theater,Coffee Shop,Beer Store,Art Museum,Plaza,Hookah Bar,Pizza Place
5,0,15th,Bistro,Park,Plaza,Bakery,Coffee Shop,Bar,Café,Brasserie,Sports Bar,Sandwich Place
6,0,16th,Bakery,Supermarket,Sandwich Place,Café,Garden,Train Station,Pizza Place,Tennis Court,Tea Room,Shopping Mall
7,0,17th,Wine Bar,Farmers Market,Bar,Pastry Shop,Bakery,Bookstore,Coffee Shop,Steakhouse,Breakfast Spot,Café
8,2,18th,Bar,Bistro,Plaza,Pizza Place,Café,Sandwich Place,Burger Joint,Art Gallery,Bookstore,Convenience Store
9,2,19th,Bar,Supermarket,Café,Bistro,Cocktail Bar,Pizza Place,Creperie,Beer Bar,Movie Theater,Concert Hall


In [34]:
london_merged = london_city_center_df

london_merged.rename(columns= {'Location': 'Neighborhood'}, inplace=True)

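# groupby sorted the neighborhoods alphabetically, so the 20 Paris arrondissements
# (whose names start with digits) fill the first 20 rows and London follows
london_venues_sorted = neighborhoods_venues_sorted[20:]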
london_merged = london_merged.join(london_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

london_merged

Unnamed: 0,Neighborhood,London borough,Post town,Postcode district,OS grid ref,Latitude,Longitude,City Center,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Aldwych,Westminster [10],LONDON,WC2,TQ307810,51.512819,-0.117388,True,3,Theater,Pub,Coffee Shop,Burger Joint,Steakhouse,Cocktail Bar,Café,Ice Cream Shop,Dessert Shop,Building
1,Bankside,Southwark [14],LONDON,SE1,TQ325795,51.498921,-0.092006,True,1,Pub,Coffee Shop,Garden,Bakery,Deli / Bodega,Gym / Fitness Center,Burger Joint,Street Food Gathering,Café,Music Venue
2,Barbican,City [14],LONDON,"EC1, EC2",TQ322818,51.51966,-0.095466,True,4,Gym / Fitness Center,Coffee Shop,Plaza,Café,Wine Bar,Food Truck,Bakery,Steakhouse,Roof Deck,Pub
3,Barnsbury,Islington [17],LONDON,N1,TQ305845,51.544318,-0.118974,True,1,Grocery Store,Pub,Café,Park,Coffee Shop,Brewery,Breakfast Spot,Theater,Liquor Store,Tennis Court
4,Belgravia,Westminster [22],LONDON,SW1,TQ283792,51.497193,-0.152637,True,0,Boutique,Plaza,Theater,Coffee Shop,Gastropub,Garden,Cocktail Bar,Café,Bakery,Ice Cream Shop
5,Blackfriars,City [27],LONDON,EC4,TQ318808,51.510767,-0.101607,True,4,Coffee Shop,Park,Pub,Art Museum,Cocktail Bar,Scenic Lookout,Theater,Gym / Fitness Center,Wine Bar,Bar
6,Bloomsbury,Camden [29],LONDON,WC1,TQ299820,51.52199,-0.12855,True,4,Coffee Shop,Bookstore,Café,Cocktail Bar,Beer Bar,Exhibit,Pizza Place,Plaza,Burger Joint,Gym / Fitness Center
7,Camden Town,Camden [40],LONDON,NW1,TQ295845,51.544548,-0.133398,True,1,Pub,Café,Grocery Store,Coffee Shop,Park,Bakery,Pizza Place,Food & Drink Shop,Pharmacy,Liquor Store
8,Charing Cross,Westminster,LONDON,WC2,TQ305805,51.508372,-0.120456,True,3,Theater,Burger Joint,Cocktail Bar,Scenic Lookout,Pub,Bakery,Steakhouse,Garden,Bookstore,Event Space
9,Chinatown,Westminster,LONDON,W1,TQ297808,51.511252,-0.131875,True,0,Bakery,Plaza,Steakhouse,Ice Cream Shop,Coffee Shop,Liquor Store,Lounge,Dessert Shop,Cocktail Bar,Comic Shop
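
With cluster labels attached to both cities, a recommendation reduces to a lookup: find the cluster of a neighborhood the user liked and list the neighborhoods of the other city that share it. The sketch below uses a handful of labels copied from the tables above. The function `matching_neighborhoods` is a hypothetical helper, not part of the notebook.

```python
# Cluster labels copied from the tables above (subset only)
paris_labels = {"10th": 4, "11th": 2, "13th": 0}
london_labels = {
    "Aldwych": 3, "Bankside": 1, "Barbican": 4, "Belgravia": 0,
    "Blackfriars": 4, "Bloomsbury": 4, "Chinatown": 0,
}


def matching_neighborhoods(liked, source_labels, target_labels):
    """Return target-city neighborhoods that share a cluster with the liked one."""
    cluster = source_labels[liked]
    return sorted(name for name, label in target_labels.items() if label == cluster)


print(matching_neighborhoods("10th", paris_labels, london_labels))
print(matching_neighborhoods("13th", paris_labels, london_labels))
```

An empty list is a valid result: the 11th arrondissement sits in cluster 2, which none of the London neighborhoods in this subset belong to.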


In [35]:
london_clusters = folium.Map(location=[london_latitude, london_longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(london_merged['Latitude'], london_merged['Longitude'], london_merged['Neighborhood'], london_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(london_clusters)

In [36]:
paris_merged = paris_df

paris_merged.rename(columns= {'Arrondissement': 'Neighborhood'}, inplace=True)

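# The first 20 rows are the Paris arrondissements, because their names start with digits
# and therefore sort ahead of the London neighborhoods
paris_venues_sorted = neighborhoods_venues_sorted[:20]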
paris_merged = paris_merged.join(paris_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

paris_merged

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1st,48.864614,2.334396,0,Plaza,Café,Wine Bar,Art Museum,Theater,Historic Site,Garden,Tea Room,Bakery,Coffee Shop
1,2nd,48.868743,2.341688,0,Wine Bar,Bakery,Pedestrian Plaza,Creperie,Coffee Shop,Bookstore,Cocktail Bar,Women's Store,Café,Bistro
2,3rd,48.864212,2.360936,0,Art Gallery,Cocktail Bar,Coffee Shop,Wine Bar,Sandwich Place,Gourmet Shop,Clothing Store,Bakery,Supermarket,Bookstore
3,4th,48.856202,2.355619,0,Plaza,Garden,Bistro,Coffee Shop,Gourmet Shop,Clothing Store,Art Museum,Pastry Shop,Ice Cream Shop,Bookstore
4,5th,48.845973,2.34435,0,Plaza,Bakery,Bar,Bookstore,Indie Movie Theater,Coffee Shop,Creperie,Wine Bar,Comic Shop,Pizza Place
5,6th,48.850433,2.332951,0,Plaza,Wine Bar,Chocolate Shop,Cocktail Bar,Pastry Shop,Fountain,Shoe Store,Tailor Shop,Garden,Tea Room
6,7th,48.857028,2.320195,0,Plaza,Art Museum,Bakery,Garden,Café,Historic Site,History Museum,Pedestrian Plaza,Bistro,Coffee Shop
7,8th,48.870905,2.312152,0,Garden,Plaza,Clothing Store,Boutique,Tailor Shop,Coffee Shop,Cocktail Bar,Department Store,Café,Shoe Store
8,9th,48.876019,2.339962,0,Bakery,Pizza Place,Wine Bar,Café,Cocktail Bar,Creperie,Cheese Shop,Pedestrian Plaza,Bistro,Plaza
9,10th,48.876126,2.359839,4,Coffee Shop,Pizza Place,Breakfast Spot,Bakery,Cocktail Bar,Wine Shop,Bistro,Bar,Sandwich Place,Hostel


In [37]:
paris_clusters = folium.Map(location=[paris_latitude, paris_longitude], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one marker per neighborhood, colored by its cluster label
for lat, lon, poi, cluster in zip(paris_merged['Latitude'], paris_merged['Longitude'], paris_merged['Neighborhood'], paris_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(paris_clusters)

In [38]:
from IPython.display import display, HTML

htmlmap = HTML('<iframe srcdoc="{}" style="float:left; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
           '<iframe srcdoc="{}" style="float:right; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
           .format(london_clusters.get_root().render().replace('"', '&quot;'),500,500,
                   paris_clusters.get_root().render().replace('"', '&quot;'),500,500))
display(htmlmap)



## Results and Discussion <a name="results"></a>

The clusters show some clear differences between Paris and London. London has no neighborhoods in Cluster 2, which is mainly bakeries and bistros (which makes sense for Paris). Paris has no neighborhoods in Cluster 1, which has many pubs (which makes sense for London), or in Cluster 3, which has many theatres (which also makes sense given London's West End).

The neighborhoods that are similar across the two cities fall into two groups: areas with plazas and parks or gardens (Cluster 0) and areas with coffee shops and cafes (Cluster 4).
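The comparison above can be checked by counting neighborhoods per cluster in each city. The following is a minimal sketch, not the notebook's own code: `london_sample` and `paris_sample` are small stand-ins built from a few rows of the tables shown earlier, `cluster_counts_by_city` is a hypothetical helper, and `kclusters=5` is an assumption based on the labels 0 through 4 that appear in the output.

```python
import pandas as pd

# Stand-ins for london_merged and paris_merged: a few rows copied from the
# tables above, keeping only the two columns this sketch needs.
london_sample = pd.DataFrame({
    "Neighborhood": ["Bloomsbury", "Camden Town", "Charing Cross", "Chinatown"],
    "Cluster Labels": [4, 1, 3, 0],
})
paris_sample = pd.DataFrame({
    "Neighborhood": ["1st", "2nd", "10th"],
    "Cluster Labels": [0, 0, 4],
})


def cluster_counts_by_city(frames, kclusters):
    """Hypothetical helper: count neighborhoods per cluster (rows) and city (columns)."""
    counts = {
        # reindex so that clusters with no neighborhoods show up as 0
        city: df["Cluster Labels"].value_counts().reindex(range(kclusters), fill_value=0)
        for city, df in frames.items()
    }
    return pd.DataFrame(counts)


# kclusters=5 is an assumption (labels 0-4 appear in the tables above)
counts = cluster_counts_by_city({"London": london_sample, "Paris": paris_sample}, kclusters=5)

# clusters present in both cities, and clusters present only in London
shared_clusters = counts[(counts > 0).all(axis=1)].index.tolist()
london_only = counts[(counts["London"] > 0) & (counts["Paris"] == 0)].index.tolist()

print(counts)
print("Shared clusters:", shared_clusters)
print("London-only clusters:", london_only)
```

Run against the full `london_merged` and `paris_merged` tables, the same counts would show which clusters are missing from each city.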

A user could decide which area to stay in based on the clusters shown in both maps. Even for the clusters that appear only in London, they can at least see where those types of venues are. For example, they may not have found and liked many pubs in Paris the year before, but the map shows where they can stay in London to be near them. If they want to stick to what they know from Paris, they can look for the two clusters the cities share and choose among those neighborhoods.
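The "stick to what they know" option can be sketched as a simple lookup: find the cluster of each Paris neighborhood the user liked, then list the London neighborhoods with the same cluster label. This is an illustrative sketch only. `recommend_neighborhoods` is a hypothetical helper, and the two small DataFrames stand in for `paris_merged` and `london_merged`, using a few rows from the tables shown earlier.

```python
import pandas as pd

# Stand-ins for paris_merged and london_merged (rows copied from the tables above)
paris_sample = pd.DataFrame({
    "Neighborhood": ["1st", "2nd", "10th"],
    "Cluster Labels": [0, 0, 4],
})
london_sample = pd.DataFrame({
    "Neighborhood": ["Bloomsbury", "Camden Town", "Charing Cross", "Chinatown"],
    "Cluster Labels": [4, 1, 3, 0],
})


def recommend_neighborhoods(liked, source_df, target_df):
    """Hypothetical helper: target-city neighborhoods that share a cluster
    with any of the liked source-city neighborhoods."""
    liked_rows = source_df["Neighborhood"].isin(liked)
    liked_clusters = source_df.loc[liked_rows, "Cluster Labels"].unique()
    matches = target_df[target_df["Cluster Labels"].isin(liked_clusters)]
    return matches[["Neighborhood", "Cluster Labels"]].reset_index(drop=True)


# A user who enjoyed the 10th arrondissement on last year's trip
recommendations = recommend_neighborhoods(["10th"], paris_sample, london_sample)
print(recommendations)
```

A liked neighborhood whose cluster has no counterpart in London returns an empty table, in which case the user would fall back on the maps.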

While the results are reasonable, gathering more data could help fine-tune the clusters. Also, cleaning the data further to include only major attractions, rather than every type of venue that Foursquare offers, might make the results more useful for tourists looking for a traditional sightseeing experience rather than for where the pubs, cafes, and plazas are.
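One way such cleaning might look is to filter the venue table on a list of attraction categories before clustering. The sketch below rests on several assumptions: a venue table with `Neighborhood` and `Venue Category` columns, an illustrative `ATTRACTION_CATEGORIES` list chosen from categories that appear in the tables above, and a hypothetical `keep_attractions` helper. Which categories count as "major attractions" is a judgment call.

```python
import pandas as pd

# Illustrative category list (an assumption, not part of the original analysis)
ATTRACTION_CATEGORIES = [
    "Art Museum", "History Museum", "Historic Site",
    "Theater", "Scenic Lookout", "Garden",
]

# Small stand-in for a venue table; the column names are assumed and the
# categories are taken from the Paris table shown earlier.
venues_sample = pd.DataFrame({
    "Neighborhood": ["1st", "1st", "1st", "7th", "7th"],
    "Venue Category": ["Art Museum", "Café", "Historic Site", "History Museum", "Bakery"],
})


def keep_attractions(venues, categories=ATTRACTION_CATEGORIES):
    """Hypothetical helper: keep only venues whose category is in the list."""
    return venues[venues["Venue Category"].isin(categories)].reset_index(drop=True)


attractions = keep_attractions(venues_sample)
attractions_per_neighborhood = attractions.groupby("Neighborhood").size()

print(attractions)
print(attractions_per_neighborhood)
```

The filtered table could then go through the same clustering steps as the full venue data.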

## Conclusion <a name="conclusion"></a>

The purpose of the project was to make recommendations to a user on where to stay in London based on their experiences in Paris the year before. By limiting the regions to search and the types of venues to look for, we determined clusters of neighborhoods with similar or dissimilar features that the user can use to decide where to stay. The user makes the final decision based on the side-by-side comparison and their own preferences.