# Finding a city "twin"

## Introduction
As an entrepreneur, you have successfully launched your business in one city. You did a lot of research to confirm it was the right place to launch, and now it is time to scale and expand to another city. You could repeat the same research to find another city that meets the same requirements as the first one, but what if you missed some other key characteristics? By looking for a "twin" of your original city, you automate part of the launch process and may discover further characteristics that make a city the right next place to launch. Choosing the right city is key to a successful new launch, and data science can help with that choice.


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Download other cities Dataset</a>

</font>
</div>

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [2]:
!pip install geopy



## 1. Download and Explore Dataset

A dataset of Paris neighborhoods, in JSON format, has been downloaded from https://opendata.paris.fr/ for our convenience.

In [8]:
# define the dataframe columns
column_names = ['Neighborhood', 'Latitude', 'Longitude']

# load the data
with open('paris_data.json') as f:
    d = json.load(f)

# collect one row per neighborhood; DataFrame.append was removed in pandas 2.0,
# so the rows are gathered in a list and the dataframe is built once
rows = [{'Neighborhood': dataset['fields']['l_aroff'],
         'Latitude': dataset['fields']['geom_x_y'][0],
         'Longitude': dataset['fields']['geom_x_y'][1]} for dataset in d]
paris = pd.DataFrame(rows, columns=column_names)

# check the resulting dataframe
paris.head()

|   | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Louvre | 48.862563 | 2.336443 |
| 1 | Bourse | 48.868279 | 2.342803 |
| 2 | Batignolles-Monceau | 48.887327 | 2.306777 |
| 3 | Observatoire | 48.829245 | 2.326542 |
| 4 | Ménilmontant | 48.863461 | 2.401188 |


#### Use the geopy library to get the latitude and longitude values of Paris (France).
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>pa_explorer</em>, as shown below.

In [10]:
address = 'Paris, France'

geolocator = Nominatim(user_agent="pa_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Paris are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Paris are 48.8566969, 2.3514616.


In [11]:
# create map of Paris using latitude and longitude values
map_paris = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(paris['Latitude'], paris['Longitude'], paris['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_paris)  
    
map_paris

#### Define Foursquare Credentials and Version

In [12]:
CLIENT_ID = 'PESQIFIZTNDBTMZPSPLZXDCQVGVQT50IE3K1RP3QCQALSDCV' # your Foursquare ID
CLIENT_SECRET = 'ODUYPM1K0ZG1Z0LZTM4GE5IRA0KSMBARQMVDJWAD0HC04UBW' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: PESQIFIZTNDBTMZPSPLZXDCQVGVQT50IE3K1RP3QCQALSDCV
CLIENT_SECRET:ODUYPM1K0ZG1Z0LZTM4GE5IRA0KSMBARQMVDJWAD0HC04UBW
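
Hardcoding and printing the credentials, as in the cell above, exposes them to anyone who can read the notebook. A minimal sketch of one alternative is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are assumptions made for this illustration, not part of the Foursquare API.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables."""
    # FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are hypothetical names
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)
```

With the variables set, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would replace the two hardcoded assignments, and `print(mask(CLIENT_SECRET))` would confirm the value is loaded without revealing it.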


In [17]:
# get Foursquare categories
url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION)
categories = requests.get(url).json()['response']['categories']
for category in categories:
    print('name: ', category['name'], ' id: ', category['id'])

name:  Arts & Entertainment  id:  4d4b7104d754a06370d81259
name:  College & University  id:  4d4b7105d754a06372d81259
name:  Event  id:  4d4b7105d754a06373d81259
name:  Food  id:  4d4b7105d754a06374d81259
name:  Nightlife Spot  id:  4d4b7105d754a06376d81259
name:  Outdoors & Recreation  id:  4d4b7105d754a06377d81259
name:  Professional & Other Places  id:  4d4b7105d754a06375d81259
name:  Residence  id:  4e67e38e036454776db1fb3a
name:  Shop & Service  id:  4d4b7105d754a06378d81259
name:  Travel & Transport  id:  4d4b7105d754a06379d81259


In [56]:
# we search venues for every neighborhood of Paris per main category (can take some time)
LIMIT = 100
intent = 'browse'

paris_venues = [] 
for i in range(len(paris)):
    neighborhood_latitude = paris.loc[i,"Latitude"]
    neighborhood_longitude = paris.loc[i,"Longitude"]
    for category in categories:
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&intent={}&limit={}&categoryId={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        neighborhood_latitude, 
        neighborhood_longitude, 
        intent,
        LIMIT,
        category['id'])
        response = requests.get(url).json()['response']
        if response and response['groups'] and response['groups'][0] and response['groups'][0]['items']:
            paris_venues.extend(response['groups'][0]['items'])
        else:
            print('no results for', paris.loc[i,"Neighborhood"], 'category', category['name'])



In [58]:
print('number of venues found for Paris: ', len(paris_venues))

number of venues found for Paris:  12966
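
The request URLs in this notebook are assembled by hand with `str.format`. As a sketch of an alternative, the query string can be built from a dictionary with the standard library's `urlencode`, which escapes each value. The helper name `build_explore_url` and the constant `EXPLORE_ENDPOINT` are introduced here for illustration only; the endpoint and parameter names are the ones used in the loop above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng,
                      category_id, intent='browse', limit=100):
    """Build the explore URL for one neighborhood and one category."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'intent': intent,
        'limit': limit,
        'categoryId': category_id,
    }
    # urlencode escapes each value, so special characters cannot break the query
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# placeholder credentials; the coordinates are those of the Louvre row
url = build_explore_url('ID', 'SECRET', '20180605', 48.862563, 2.336443,
                        '4d4b7105d754a06374d81259')
print(url)
```

Note that `urlencode` writes the comma in `ll` as `%2C`, which is the standard percent-encoded form of the same value.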


In [62]:
# Remove duplicates: searching neighborhood by neighborhood and category by category can return the same venue more than once
paris_venues = [i for n, i in enumerate(paris_venues) if i not in paris_venues[:n]] 
print('unique number of venues found for Paris: ', len(paris_venues))

unique number of venues found for Paris:  12961
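
The list comprehension above compares every item with all earlier items, so its cost grows quadratically with the number of venues. A possible alternative, sketched below, tracks venue IDs in a set and makes a single pass. It assumes each item carries its ID at `item['venue']['id']`; the helper name `dedupe_by_venue_id` and the sample data are made up for the illustration.

```python
def dedupe_by_venue_id(items):
    """Keep the first occurrence of each venue, preserving order."""
    seen = set()
    unique = []
    for item in items:
        venue_id = item['venue']['id']  # assumed location of the venue ID
        if venue_id not in seen:
            seen.add(venue_id)
            unique.append(item)
    return unique

# small made-up sample: the first and third items share the same ID
sample = [
    {'venue': {'id': 'a', 'name': 'Musée du Louvre'}},
    {'venue': {'id': 'b', 'name': 'Comédie-Française'}},
    {'venue': {'id': 'a', 'name': 'Musée du Louvre'}},
]
print(len(dedupe_by_venue_id(sample)))  # 2
```

The two approaches are not equivalent: the original comparison keeps two items that describe the same venue but differ in any other field, so deduplicating by ID can yield a lower count than the one printed above.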


In [64]:
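# function that extracts the category of the venue
def get_category_type(row):
    # the category list sits under 'categories' or, after flattening, 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # venues without any category yield None
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']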

In [65]:
# Now, we clean the json and structure it into a pandas dataframe.
paris_venues_df = json_normalize(paris_venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
paris_venues_df = paris_venues_df.loc[:, filtered_columns]

# filter the category for each row
paris_venues_df['venue.categories'] = paris_venues_df.apply(get_category_type, axis=1)

# clean columns
paris_venues_df.columns = [col.split(".")[-1] for col in paris_venues_df.columns]

paris_venues_df.head()

|   | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Musée du Louvre | Art Museum | 48.860847 | 2.33644 |
| 1 | Comédie-Française | Theater | 48.863088 | 2.336612 |
| 2 | Les Arts Décoratifs | Art Museum | 48.863077 | 2.333393 |
| 3 | La Vénus de Milo (Vénus de Milo) | Exhibit | 48.859943 | 2.337234 |
| 4 | 모나리자 / 라 조콘다 (Mona Lisa \| La Joconde) | Exhibit | 48.860139 | 2.335337 |


## 2. Download other cities Dataset

### Lyon, France

In [72]:
# define the dataframe columns
column_names = ['Neighborhood', 'Latitude', 'Longitude']

# load the data
with open('lyon_data.json') as f:
    d = json.load(f)

# collect one row per neighborhood; DataFrame.append was removed in pandas 2.0,
# so the rows are gathered in a list and the dataframe is built once
rows = [{'Neighborhood': feature['properties']['nom'],
         'Latitude': feature['properties']['coordinates'][1],
         'Longitude': feature['properties']['coordinates'][0]} for feature in d['features']]
lyon = pd.DataFrame(rows, columns=column_names)

# we search venues for every neighborhood of Lyon per main category (can take some time)
LIMIT = 100
intent = 'browse'

lyon_venues = [] 
for i in range(len(lyon)):
    neighborhood_latitude = lyon.loc[i,"Latitude"]
    neighborhood_longitude = lyon.loc[i,"Longitude"]
    for category in categories:
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&intent={}&limit={}&categoryId={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        neighborhood_latitude, 
        neighborhood_longitude, 
        intent,
        LIMIT,
        category['id'])
        response = requests.get(url).json()['response']
        if response and response['groups'] and response['groups'][0] and response['groups'][0]['items']:
            lyon_venues.extend(response['groups'][0]['items'])
        else:
            print('no results for', lyon.loc[i,"Neighborhood"], 'category', category['name'])

# Remove duplicates: searching neighborhood by neighborhood and category by category can return the same venue more than once
lyon_venues = [i for n, i in enumerate(lyon_venues) if i not in lyon_venues[:n]] 
print('unique number of venues found for Lyon: ', len(lyon_venues))

# Now, we clean the json and structure it into a pandas dataframe.
lyon_venues_df = json_normalize(lyon_venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
lyon_venues_df = lyon_venues_df.loc[:, filtered_columns]

# filter the category for each row
lyon_venues_df['venue.categories'] = lyon_venues_df.apply(get_category_type, axis=1)

# clean columns
lyon_venues_df.columns = [col.split(".")[-1] for col in lyon_venues_df.columns]

lyon_venues_df.head()

unique number of venues found for Lyon:  4643


|   | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Periscope | Music Venue | 45.746946 | 4.827476 |
| 1 | Le Comœdia | Indie Movie Theater | 45.747268 | 4.835553 |
| 2 | Musée des tissus et des arts décoratifs | Art Museum | 45.752959 | 4.831652 |
| 3 | Sonic | Rock Club | 45.750726 | 4.820247 |
| 4 | UGC Ciné Cité Confluence | Multiplex | 45.740905 | 4.818421 |
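
The Paris and Lyon cells repeat the same collection loop, and each further city would repeat it again. One way to factor it out is sketched below. The function `collect_venues` and its `fetch_items` hook are hypothetical names introduced for this illustration; the hook stands in for the Foursquare request so that the sketch runs offline.

```python
def collect_venues(neighborhoods, categories, fetch_items):
    """Gather venue items for every (neighborhood, category) pair.

    neighborhoods: iterable of (name, latitude, longitude) tuples
    categories: list of dicts with 'id' and 'name' keys
    fetch_items: callable(lat, lng, category_id) -> list of items (hypothetical hook)
    """
    venues = []
    for name, lat, lng in neighborhoods:
        for category in categories:
            items = fetch_items(lat, lng, category['id'])
            if items:
                venues.extend(items)
            else:
                print('no results for', name, 'category', category['name'])
    return venues

# stand-in for the Foursquare request: only the made-up 'food' category returns an item
def fake_fetch(lat, lng, category_id):
    if category_id == 'food':
        return [{'venue': {'name': 'demo', 'category': category_id}}]
    return []

demo = collect_venues([('Louvre', 48.862563, 2.336443)],
                      [{'id': 'food', 'name': 'Food'}, {'id': 'event', 'name': 'Event'}],
                      fake_fetch)
print(len(demo))  # 1
```

In this notebook, the neighborhoods argument could be `zip(lyon['Neighborhood'], lyon['Latitude'], lyon['Longitude'])`, and `fetch_items` would wrap the `requests.get(...)` call and return `response['groups'][0]['items']` when present.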
