
<h1 style="text-align:center">
            Tourism in London
</h1>
<h2 style="text-align:center"> 
    Edis Gradjan
    </h2>

<h3>Introduction</h3>

As the capital of England and a city with a rich history, London draws millions of visitors every year. Its beautiful buildings, museums and sites such as the London Eye attract people from around the world. London is also a very diverse city, with people from all over the world settling here, and this can be seen at a glance in the range of ethnic restaurants across the city.

In this project we will take a deep dive into London and draw insights from the analysis we complete.

<h3>Business Problem</h3>

The goal of this analysis is to explore London and its variety of tourist attractions so that people who are considering a visit gain insight into the area. The analysis will help them make informed decisions about the types of food, stores and other things London has to offer.

<h3>Data</h3>

For this project we will be using the geolocations of London. To start, we will use postal codes to extract information on the boroughs, neighborhoods and venues. We will then take a deeper look at the venues and the most popular venue categories.


<h4>London</h4>

The data for this analysis will be scraped from https://en.wikipedia.org/wiki/List_of_areas_of_London.

This page will allow us to extract information on the neighborhoods, boroughs, and postal codes in the city of London.
However, there are no geographical locations in this wiki page, so we will turn to the ArcGIS API.

<h3>ArcGIS API</h3>

The ArcGIS API for Python is a powerful, modern and easy to use Pythonic library to perform GIS visualization and analysis, spatial data management and GIS system administration tasks that can run both interactively, and using scripts.

From the ArcGIS API we will extract latitude and longitude for the neighborhoods.

<h3>Foursquare API</h3>

To complete our report and analysis we need data about venues in London, which we will obtain from the Foursquare API. For each venue we will gather the neighborhood, the coordinates of that neighborhood, the venue name, the venue coordinates and the venue category.

<h3>Method of Analysis</h3>

Our first step will be to import all the packages needed for our analysis.

In [None]:
import pandas as pd
import requests
import numpy as np
import folium  # used below to draw the maps
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

The packages above will allow us to collect and manipulate our data using pandas, handle HTTP requests, generate detailed maps and use k-means clustering for analysis.

<h1>Exploration and Analysis of London</h1>

<h2>Data collection</h2>

In this section we will scrape the list of areas of London wiki page to obtain neighborhood information.

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
london_url = requests.get(url)
london_url

In [None]:
london_data = pd.read_html(london_url.text)
london_data

In [None]:
london_data = london_data[1]
london_data

<h2>Data Selection</h2>

In this section we will extract only the data we need, which is the boroughs, postal codes, and post towns.

In [25]:
df1 = london_data.drop( [ london_data.columns[0], london_data.columns[4], london_data.columns[5] ], axis=1)

In [26]:
df1.head()

Unnamed: 0,London borough,Post town,Postcode district
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


In this next step we will clean the data up. First we will rename the columns, and then we will remove the "[ ]" footnote markers from the London borough column.

In [33]:
df1.columns = ['Borough','Town','Post Code']
df1

Unnamed: 0,Borough,Town,Post Code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Croydon,CROYDON,CR0
3,Croydon,CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
526,Greenwich,LONDON,SE18
527,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
528,Hammersmith and Fulham,LONDON,W12
529,Hillingdon,HAYES,UB4


In [34]:
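# Strip a trailing footnote marker such as "[8]", then any whitespace left behind (e.g. "Greenwich [7]")
df1['Borough'] = df1['Borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('[').strip())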
df1

Unnamed: 0,Borough,Town,Post Code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Croydon,CROYDON,CR0
3,Croydon,CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
526,Greenwich,LONDON,SE18
527,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
528,Hammersmith and Fulham,LONDON,W12
529,Hillingdon,HAYES,UB4
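
The chained `rstrip` calls above only remove a marker that sits at the very end of the string. As a hedged alternative, a regular expression can remove bracketed footnote markers wherever they appear. The helper name `strip_footnotes` is hypothetical and is not part of the original notebook.

```python
import re

# Matches optional whitespace followed by a bracketed number, e.g. "[8]" or " [7]"
FOOTNOTE_PATTERN = re.compile(r'\s*\[\d+\]')


def strip_footnotes(text):
    """Remove Wikipedia-style footnote markers such as "[8]" from a string."""
    return FOOTNOTE_PATTERN.sub('', text).strip()


for raw in ['Croydon[8]', 'Bexley, Greenwich [7]', 'Bexley']:
    print(repr(strip_footnotes(raw)))
```

It could be applied to the column with `df1['Borough'].map(strip_footnotes)`.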


In [36]:
df1.shape

(531, 3)

<h2>Data Statistics</h2>

In this section we will be eliminating all data from towns that are not London as we are focusing on London in this report. After this we will get descriptive statistics on the data we have.

In [38]:
df1 = df1[df1['Town'].str.contains('LONDON')]
df1

Unnamed: 0,Borough,Town,Post Code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
6,City,LONDON,EC3
7,Westminster,LONDON,WC2
9,Bromley,LONDON,SE20
...,...,...,...
521,Redbridge,LONDON,"IG8, E18"
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8
525,Barnet,LONDON,N12
526,Greenwich,LONDON,SE18
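
Note that `str.contains('LONDON')` is a substring match, which is why a row such as "LONDON, WOODFORD GREEN" survives the filter. The sketch below contrasts it with an exact match on a few made-up rows that mirror the shape of the Town column; it is an illustration, not a change to the analysis.

```python
import pandas as pd

# Made-up rows that mirror the shape of the Town column
towns = pd.DataFrame({
    'Town': ['LONDON', 'CROYDON', 'LONDON, WOODFORD GREEN'],
    'Post Code': ['SE2', 'CR0', 'IG8'],
})

# Substring match: keeps any town that mentions LONDON
contains_london = towns[towns['Town'].str.contains('LONDON')]

# Exact match: keeps only rows whose town is exactly LONDON
exactly_london = towns[towns['Town'] == 'LONDON']

print(len(contains_london), len(exactly_london))
```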


In [39]:
df1.shape

(308, 3)

In [40]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 308 entries, 0 to 528
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Borough    308 non-null    object
 1   Town       308 non-null    object
 2   Post Code  308 non-null    object
dtypes: object(3)
memory usage: 9.6+ KB


After filtering we have 308 rows, and the info() summary shows that none of the three columns contain missing values.

<h2>London Geolocation</h2>

In this section we will get the coordinates of the neighborhoods so that we can plot them on a map.

In [41]:
pip install arcgis

Note: you may need to restart the kernel to use updated packages.


In [42]:
from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()



In [None]:
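def get_x_y_uk(address1):
    """Geocode a London postal code and return it as a 'latitude,longitude' string."""
    # Take the first candidate returned by the ArcGIS geocoder
    g = geocode(address='{}, London, England, GBR'.format(address1))[0]
    lng_coords = g['location']['x']  # x is the longitude
    lat_coords = g['location']['y']  # y is the latitude
    return str(lat_coords) + "," + str(lng_coords)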

The function above takes a postal code and returns its coordinates as a "latitude,longitude" string.
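
Indexing the geocoder result with `[0]` would raise an `IndexError` if the service returned an empty list for some address. A minimal defensive sketch follows. The names `safe_lat_lng` and `fake_geocoder` are hypothetical, and `fake_geocoder` is a made-up stand-in so the example runs without network access; it assumes candidates shaped like the ones read above (`{'location': {'x': ..., 'y': ...}}`).

```python
def safe_lat_lng(address, geocoder):
    """Return (latitude, longitude) for an address, or None if nothing matched.

    `geocoder` is any callable returning a list of candidate dicts shaped
    like {'location': {'x': longitude, 'y': latitude}}.
    """
    candidates = geocoder(address)
    if not candidates:
        return None
    location = candidates[0]['location']
    return location['y'], location['x']


def fake_geocoder(address):
    # Stand-in for a real geocoding service; the lookup table is made up.
    known = {'SE2, London, England, GBR': [{'location': {'x': 0.12, 'y': 51.49}}]}
    return known.get(address, [])


print(safe_lat_lng('SE2, London, England, GBR', fake_geocoder))
print(safe_lat_lng('nowhere', fake_geocoder))
```

Rows that come back as `None` can then be inspected or dropped before plotting, rather than stopping the whole run.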

In [60]:
coordinates_uk = df1['Post Code']    
coordinates_uk

0           SE2
1        W3, W4
6           EC3
7           WC2
9          SE20
         ...   
521    IG8, E18
522         IG8
525         N12
526        SE18
528         W12
Name: Post Code, Length: 308, dtype: object

In [57]:
latlng_uk = coordinates_uk.apply(lambda x: get_x_y_uk(x))
latlng_uk

0       51.492450000000076,0.12127000000003818
1        51.51324000000005,-0.2674599999999714
6       51.51200000000006,-0.08057999999994081
7       51.51651000000004,-0.11967999999995982
9       51.41009000000008,-0.05682999999993399
                        ...                   
521    51.589770000000044,0.030520000000024083
522      51.50642000000005,-0.1272099999999341
525     51.615920000000074,-0.1767399999999384
526      51.48207000000008,0.07143000000002075
528      51.50645000000003,-0.2369099999999662
Name: Post Code, Length: 308, dtype: object

<h3>Latitude and Longitude</h3>

In [62]:
latitude_uk = latlng_uk.apply(lambda x: x.split(',')[0])
latitude_uk

0      51.492450000000076
1       51.51324000000005
6       51.51200000000006
7       51.51651000000004
9       51.41009000000008
              ...        
521    51.589770000000044
522     51.50642000000005
525    51.615920000000074
526     51.48207000000008
528     51.50645000000003
Name: Post Code, Length: 308, dtype: object

In [63]:
longitude_uk = latlng_uk.apply(lambda x: x.split(',')[1])
longitude_uk

0       0.12127000000003818
1       -0.2674599999999714
6      -0.08057999999994081
7      -0.11967999999995982
9      -0.05682999999993399
               ...         
521    0.030520000000024083
522     -0.1272099999999341
525     -0.1767399999999384
526     0.07143000000002075
528     -0.2369099999999662
Name: Post Code, Length: 308, dtype: object
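
The two `split` steps above can also be expressed as a single helper that returns both values as floats. This is a small sketch; the name `split_lat_lng` is hypothetical.

```python
def split_lat_lng(pair):
    """Convert a 'lat,lng' string into a (lat, lng) tuple of floats."""
    lat_text, lng_text = pair.split(',')
    return float(lat_text), float(lng_text)


lat, lng = split_lat_lng('51.5,-0.25')
print(lat, lng)
```

Converting to `float` at this point would make the later `astype(float)` calls unnecessary.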

<h3> Merging Coordinates with dataframe</h3>

We will now merge the coordinates we obtained with the dataframe that holds the borough, town and postal code.

In [71]:
london_merged = pd.concat([df1,latitude_uk.astype(float), longitude_uk.astype(float)], axis=1)
london_merged.columns= ['Borough','Town','Post Code','Latitude','Longitude']
london_merged

Unnamed: 0,Borough,Town,Post Code,Latitude,Longitude
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746
6,City,LONDON,EC3,51.51200,-0.08058
7,Westminster,LONDON,WC2,51.51651,-0.11968
9,Bromley,LONDON,SE20,51.41009,-0.05683
...,...,...,...,...,...
521,Redbridge,LONDON,"IG8, E18",51.58977,0.03052
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8,51.50642,-0.12721
525,Barnet,LONDON,N12,51.61592,-0.17674
526,Greenwich,LONDON,SE18,51.48207,0.07143


<h3>Visualization</h3>

In [72]:
london = geocode(address='London, England, GBR')[0]
london_lng_coords = london['location']['x']
london_lat_coords = london['location']['y']
london_lng_coords


-0.1272099999999341

In [73]:
london_lat_coords

51.50642000000005

In [75]:
# Creating the map of London, centred on the coordinates found above
map_London = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)

# adding markers to map
for Latitude, Longitude, Borough, Town in zip(london_merged['Latitude'], london_merged['Longitude'], london_merged['Borough'], london_merged['Town']):
    label = '{}, {}'.format(Town, Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [Latitude, Longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(map_London)  
    
map_London

<h2>Venues in London</h2>

In this section we will use the Foursquare API to get the venues near each location, along with the category of each venue.

In [112]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'  # replace with your own Foursquare client ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'  # keep real credentials out of shared notebooks
VERSION = '20180605'

In [117]:
LIMIT = 100  # maximum number of venues requested per location

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return one row per venue found within `radius` metres of each location."""
    venues_list = []
    for name, latitude, longitude in zip(names, latitudes, longitudes):
        print(name)

        # Build the Foursquare "explore" request for this location
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            latitude,
            longitude,
            radius,
            LIMIT
            )

        results = requests.get(url).json()["response"]['groups'][0]['items']

        # Keep the location details plus each venue's name and first category
        venues_list.append([(
            name,
            latitude,
            longitude,
            v['venue']['name'],
            v['venue']['categories'][0]['name']) for v in results])

    # Flatten the per-location lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood',
                  'Neighbourhood Latitude',
                  'Neighbourhood Longitude',
                  'Venue',
                  'Venue Category']

    return nearby_venues
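
The parsing step inside `getNearbyVenues` assumes that every venue lists at least one category. The sketch below shows one way to make that step tolerant of an empty category list. The helper name `extract_venues`, the `'Uncategorised'` placeholder and the sample payload are all made up for illustration; the payload only mirrors the keys the function above reads.

```python
def extract_venues(payload):
    """Return (venue name, category name) pairs from an explore-style payload."""
    items = payload['response']['groups'][0]['items']
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to a placeholder when a venue has no category listed
        category = categories[0]['name'] if categories else 'Uncategorised'
        rows.append((venue['name'], category))
    return rows


# Made-up payload mirroring the keys that getNearbyVenues reads
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Pub', 'categories': [{'name': 'Pub'}]}},
    {'venue': {'name': 'Mystery Place', 'categories': []}},
]}]}}

print(extract_venues(sample))
```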

In [None]:
venues_in_London = getNearbyVenues(london_merged['Borough'], london_merged['Latitude'], london_merged['Longitude'])

In [None]:
venues_in_London.groupby('Venue Category').max()

In [None]:
London_venue_cat = pd.get_dummies(venues_in_London[['Venue Category']], prefix="", prefix_sep="")

In [None]:
London_venue_cat['Neighbourhood'] = venues_in_London['Neighbourhood'] 

fixed_columns = [London_venue_cat.columns[-1]] + list(London_venue_cat.columns[:-1])
London_venue_cat = London_venue_cat[fixed_columns]

London_venue_cat.head()

<h3>Venue Grouping</h3>

In this section we group the venues by neighborhood, compute how often each venue category occurs, and rank the categories so that the neighborhoods can be clustered.

In [None]:
London_grouped = London_venue_cat.groupby('Neighbourhood').mean().reset_index()
London_grouped.head()
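
To see why taking the mean of the one-hot columns gives a category frequency, here is a small sketch on made-up data. The venue rows are invented, and `dtype=float` is passed explicitly so the indicator columns are numeric.

```python
import pandas as pd

# Made-up venues: neighbourhood A has a pub and a park, B has only a pub
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'B'],
    'Venue Category': ['Pub', 'Park', 'Pub'],
})

# One indicator column per category (1.0 where the venue has that category)
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of an indicator column is the share of venues in that category
frequencies = onehot.groupby('Neighbourhood').mean()
print(frequencies)
```

In this toy case half of A's venues are pubs, while all of B's are.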

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

This function returns the most common venue categories for a single neighborhood row; below we use it to list the top 10 for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd and 3rd take their own suffixes
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every later rank up to 10 takes "th"
        columns.append('{}th Most Common Venue'.format(ind+1))
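
The suffix logic above is correct for ranks 1 to 10 but would mislabel larger ranks such as 21. If more ranks were ever needed, a general helper could be sketched as follows; the name `ordinal` is hypothetical and not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are exceptions to the st/nd/rd rule
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[:4])
```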

In [None]:
neighborhoods_venues_sorted_london = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_london['Neighbourhood'] = London_grouped['Neighbourhood']

for ind in np.arange(London_grouped.shape[0]):
    neighborhoods_venues_sorted_london.iloc[ind, 1:] = return_most_common_venues(London_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted_london.head()

<h2>K Means Model Building</h2>

In [None]:

k_num_clusters = 5

# Drop the label column so only the category frequencies are clustered
London_grouped_clustering = London_grouped.drop('Neighbourhood', axis=1)

kmeans_london = KMeans(n_clusters=k_num_clusters, random_state=0).fit(London_grouped_clustering)
kmeans_london

In [None]:
kmeans_london.labels_

In [None]:
neighborhoods_venues_sorted_london.insert(0, 'Cluster Labels', kmeans_london.labels_ +1)
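
As a rough illustration of what the clustering step does, the sketch below runs k-means on a tiny made-up frequency matrix with two obvious groups and shifts the labels to start at 1, as in the cell above. The data and the choice of `n_clusters=2` are assumptions made for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category frequencies: rows 0-1 are mostly one category, rows 2-3 mostly the other
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

# random_state fixes the initialisation; n_init is set explicitly for clarity
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Shift labels from 0-based to 1-based, as done for the London data
labels = model.labels_ + 1
print(labels)
```

Rows with similar frequency profiles receive the same label; which group gets label 1 or 2 is arbitrary.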

In [None]:
london_data = london_merged

london_data = london_data.join(neighborhoods_venues_sorted_london.set_index('Neighbourhood'), on='Borough')

london_data.head()

In [None]:
london_data_nonan = london_data.dropna(subset=['Cluster Labels'])

<h3>Visualization of Clusters</h3>

In [None]:
# Centre the cluster map on London
map_clusters_london = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)

# One colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k_num_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Cluster Labels are already 1-based, so they are shown as-is and shifted back by one for indexing
for lat, lon, poi, cluster in zip(london_data_nonan['Latitude'], london_data_nonan['Longitude'], london_data_nonan['Borough'], london_data_nonan['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(int(cluster)) + '\n' + str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster) - 1],
        fill=True,
        fill_color=rainbow[int(cluster) - 1]
        ).add_to(map_clusters_london)

map_clusters_london


<h3>Sample Cluster Examination</h3>

In [None]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 1, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

<h2>Discussion and Conclusion</h2>

In this analysis we can see that London is a very diverse and multicultural city, which is reflected in its plethora of restaurants and shops representing all regions of the world. There are coffee shops, pubs and much more throughout the city. All neighborhoods have parks and general areas to hang outside during nice weath