# Data preparation for city similarity project (IBM Applied Data Science Capstone)

- We will characterize the main venues and services of each city and neighborhood using the Foursquare API.

- We will collect other features, such as available services and airports, by web scraping. Wikipedia will be one of the sources.

- The main feature we will use is the cost of living, which will narrow down the destination cities to compare. The site expatistan.com provides this data.

### Libraries and tools
- Pandas and numpy for data manipulation.
- BeautifulSoup and requests for web scraping.
- Foursquare API and geocoder for venues and geolocation.
- Folium for maps.
- Matplotlib for visualization.
- Dotenv for managing API keys.
- Scikit-learn for k-means clustering.

# Methods
First, we need to establish the country and city of origin. Second, we define the target country, i.e., the destination. With this information we will use a recommender engine to find similarities between the cities.

Once a couple of suitable destination cities are found, we will use k-means clustering to group their neighborhoods by similar features, obtained with the Foursquare API. We will perform the same analysis on the neighborhood of origin and locate the clusters on a Folium map.

With all this data we will report the destination neighborhoods that are similar to the neighborhood of origin.


In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
import numpy as np

In [3]:
import pandas as pd


## Scraping `expatistan` data

`expatistan.com` computes a collaborative international cost of living index, which is useful for comparing cities in terms of everyday living expenses. It offers worldwide and regional rankings of cities. It is also possible to compare two cities directly across several categories, such as food, housing, clothes, transportation, personal care, and entertainment.

We start with the ranking of cities in North America because we are interested in comparing US cities with Canadian cities.
The final dataframe will contain the North American cost of living index by city.

In [4]:
# index into regions: 0: worldwide, 1: europe, 2: north-america, ...
regions = ('', 'europe', 'north-america', 'latin-america', 'asia', 'middle-east', 'africa', 'oceania')

In [5]:
# north-america
reg = 2

In [6]:
# URL for index for region
URL = 'https://www.expatistan.com/cost-of-living/index/'+regions[reg]

In [7]:
# get the page
page = requests.get(URL)

In [8]:
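def parse_rank_entry(entry, cities):
    """Parse one row of the JavaScript `cities` array and append it to `cities`.

    Rows without a state have 8 fields; an empty state is inserted so that
    such rows end up with 9 fields like the others.
    """
    clean_entry = entry.strip()[1:-2]
    if len(clean_entry) > 0:
        elem = clean_entry.split(',')
        # rows with fewer than 9 fields lack the state: insert an empty one at position 3
        if len(elem) < 9:
            elem = elem[0:3] + [""] + elem[3:]
        # remove the surrounding single quotes from each field
        clean_elems = [e.strip("'") for e in elem]
        # keep only the numeric value of the index
        clean_elems[6] = clean_elems[6].replace("Cost of Living Index: ", "")
        cities.append(clean_elems)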

In [9]:
# flags to detect if we are in the javascript code and array cities
is_script = False
in_cities = False

cities = []
# iterate in lines
for line in page.text.splitlines():
    if "<script" in line:
        is_script = True   
    if "/script>" in line:
        is_script = False
    if "var cities" in line:
        in_cities = True
        # remove text
        line = line.replace("var cities = ", "")
    # check end of array cities
    if in_cities and '}' in line:
        in_cities = False
    # call function to parse entry
    if is_script and in_cities:
        parse_rank_entry(line, cities)        

In [10]:
# create a new dataframe from cities with proper column names
cities_df = pd.DataFrame(cities, columns=["Latitude", "Longitude", "City", "State", "Score", "Population", "Index", "Code", "Country"])

In [11]:
cities_df

Unnamed: 0,Latitude,Longitude,City,State,Score,Population,Index,Code,Country
0,32.293,-64.782,Hamilton,,0.005479543556370645,2000,294,hamilton-bermuda,BM
1,37.3928,-122.042,Mountain View,California,0.0042120264999792045,74066,259,mountain-view-california,US
2,40.7143,-74.006,New York City,,0.004095520327779454,8008278,256,new-york-city,US
3,37.7793,-122.419,San Francisco,California,0.0036154281359158523,808976,244,san-francisco,US
4,40.7114,-74.0647,Jersey City,New Jersey,0.00311111959964693,247000,232,jersey-city,US
...,...,...,...,...,...,...,...,...,...
63,44.652,-63.5968,Halifax,,-0.002083717265444118,359111,138,halifax,CA
64,45.5168,-73.6492,Montreal,,-0.0021564448587349182,3268513,137,montreal,CA
65,35.0845,-106.651,Albuquerque,New Mexico,-0.0026042808850046573,487378,131,albuquerque,US
66,35.1495,-90.049,Memphis,Tennessee,-0.0028359514778200003,650100,128,memphis,US


In [91]:
cities_df.describe()

Unnamed: 0,Latitude,Longitude,City,State,Score,Population,Index,Code,Country
count,68.0,68.0,68,68.0,68.0,68,68,68,68
unique,68.0,68.0,68,34.0,52.0,68,52,68,3
top,43.6136,-96.8067,Charleston,,-0.0002363762334507,330000,138,philadelphia,US
freq,1.0,1.0,1,14.0,3.0,1,3,1,56
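
The summary above reports `count`, `unique`, `top` and `freq` for every column, which suggests that all values are stored as strings. A minimal sketch of converting the numeric columns follows; it uses a small toy frame with the same column names rather than the scraped data, and the choice of which columns count as numeric is an assumption.

```python
import pandas as pd

# Toy rows in the same shape as cities_df; every value starts out as a string.
sample_df = pd.DataFrame(
    [
        ["32.293", "-64.782", "Hamilton", "", "0.0054", "2000", "294", "hamilton-bermuda", "BM"],
        ["45.5168", "-73.6492", "Montreal", "", "-0.0021", "3268513", "137", "montreal", "CA"],
    ],
    columns=["Latitude", "Longitude", "City", "State", "Score",
             "Population", "Index", "Code", "Country"],
)

# Assumed set of numeric columns.
numeric_cols = ["Latitude", "Longitude", "Score", "Population", "Index"]

# errors="coerce" turns anything unparsable into NaN instead of raising.
sample_df[numeric_cols] = sample_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(sample_df.dtypes)
```

After the conversion, `describe()` would report numeric statistics (mean, min, max) for these columns instead of string counts.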


In [108]:
cities_df.head().to_clipboard(sep=',')

In [109]:
URL = 'https://www.expatistan.com/cost-of-living/comparison/dallas/montreal'

In [110]:
page = requests.get(URL)

In [115]:
import re

In [122]:
cl_index = None
for line in page.text.splitlines():
    if "<title>" in line:
        print(line)
        # the first number in the title is the cost of living difference in percent
        cl_index = re.findall(r'\d+', line)[0]

                <title>Montreal is 14% cheaper than Dallas, Texas. Jul 2020 Cost of Living.</title>


In [123]:
cl_index

'14'
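
The number alone does not say which city is cheaper. A minimal sketch that also captures the direction is given below. It assumes the title always follows the pattern printed above; the alternative wording "more expensive" and the helper name `parse_comparison_title` are assumptions, not something confirmed from the site.

```python
import re

def parse_comparison_title(title):
    """Return (percent, direction) from a comparison page title, or None."""
    # "more expensive" is an assumed alternative to "cheaper".
    match = re.search(r"is (\d+)% (cheaper|more expensive) than", title)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

title = "<title>Montreal is 14% cheaper than Dallas, Texas. Jul 2020 Cost of Living.</title>"
print(parse_comparison_title(title))
```

Returning `None` for an unexpected title makes a change in the page layout visible instead of silently producing a wrong number.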

## Scraping `wikipedia.org`

Many websites offer airport data: `openflights.org`, `ourairports.com`, etc. Because we are interested only in international airports, it is easier to extract this information from `wikipedia.org`.
The final tables will list the cities with international airports in the two countries.

In [12]:
# International airports listed in wikipedia
URL = 'https://en.wikipedia.org/wiki/List_of_international_airports_by_country'
page = requests.get(URL)

In [13]:
# load the content in beautiful soup
soup = BeautifulSoup(page.content, 'html.parser')

In [14]:
# look for tables in page
tables = soup.find_all(class_="wikitable sortable")

In [15]:
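def parse_table(tables, first_city):
    """Collect the rows of the table that starts at `first_city`.

    The tables are ordered by country. Rows are collected from the row whose
    first cell equals `first_city` up to the end of that table; the header
    row is stored first.
    """
    data = []
    for table in tables:
        append_it = False
        rows = table.find_all('tr')

        col_name = [name.text.strip() for name in rows[0].find_all('th')]

        for row in rows[1:]:
            cols = [element.text.strip() for element in row.find_all('td')]
            # skip rows without data cells to avoid an IndexError on cols[0]
            if not cols:
                continue
            if cols[0] == first_city:
                data.append(col_name)
                append_it = True
            if append_it:
                # empty cells are dropped, so rows with missing values are shorter
                data.append([element for element in cols if element])
    return data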

In [16]:
#countries = set([country.find(class_='toctext').string for country in soup.findAll(class_="toclevel-{}".format(i)) for i in range(2, 6)])

In [17]:
# scrape list of countries with an international airport
countries = {}
bad_names = ('Passenger Roles (2011-2020)', 'Africa', 'Americas', 'Caribbean', 'Central America', 'North America', 'South America',
             'Asia', 'Central Asia', 'South Asia', 'Southeast Asia', 'Southwest Asia and the Middle East',
             'Europe', 'West Europe', 'Central Europe', 'Southern Europe', 'East Europe', 'Nordic region', 'United Kingdom',
             'Oceania', 'See also', 'References')
for country in soup.find_all(class_="mw-headline"):
    if country.string not in bad_names:
        countries[country.string] = country.string

In [18]:
# set the countries to study. A country without an international airport is not in the
# dictionary, so the lookup raises a KeyError
country1 = countries['Canada']

In [19]:
country2 = countries['United States']

In [20]:
# scrape the first city of each country in the table
city1 = soup.find(id=country1.replace(" ", "_")).find_all_next('a')[1].string

In [21]:
city2 = soup.find(id=country2.replace(" ", "_")).find_all_next('a')[1].string

In [22]:
# scrape the airport table for country1
country1_airports = parse_table(tables, city1)

In [23]:
# scrape the airport table for country2
country2_airports = parse_table(tables, city2)

In [24]:
# create new dataframes
country1_airports_df = pd.DataFrame(country1_airports)

In [25]:
c1_airports_df = country1_airports_df.rename(columns=country1_airports_df.iloc[0]).drop(country1_airports_df.index[0]).reset_index(drop=True)

In [26]:
c1_airports_df

Unnamed: 0,Location,Airport,IATA Code,Passenger Role,2015 Passengers,2014 Passengers
0,Calgary,Calgary International Airport,YYC,Medium,15475759,15261108.0
1,Edmonton,Edmonton International Airport,YEG,Medium,7981074,8240161.0
2,Whitehorse,Erik Nielsen Whitehorse International Airport,YXY,,,
3,Gander,Gander International Airport,YQX,,,
4,Halifax (Goffs),Halifax Stanfield International Airport,YHZ,Medium,10897234,3663039.0
5,Hamilton,John C. Munro Hamilton International Airport,YHM,Non-Hub,332378,
6,Kelowna,Kelowna International Airport,YLK,Small,,
7,London,London International Airport,YXU,Non-Hub,,
8,Moncton (Dieppe),Greater Moncton International Airport,YQM,Small,677159,
9,Montreal (Dorval),Montreal-Pierre Elliott Trudeau International ...,YUL,Medium,15517382,14840067.0


In [27]:
country2_airports_df = pd.DataFrame(country2_airports)

In [28]:
c2_airports_df = country2_airports_df.rename(columns=country2_airports_df.iloc[0]).drop(country2_airports_df.index[0]).reset_index(drop=True)

In [124]:
c1_airports_df.describe()

Unnamed: 0,City,Airport,IATA Code,Passenger Role,2015 Passengers,2014 Passengers
count,19,19,19,17,15,13
unique,19,19,19,4,15,13
top,Edmonton,Ottawa Macdonald-Cartier International Airport,YUL,Small,7981074,15261108
freq,1,1,1,7,1,1
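
Since the analysis focuses on destination cities with a nearby airport, the airport table can be used to filter the cost of living ranking. The sketch below shows the idea on toy stand-ins for `cities_df` and `c1_airports_df`. It assumes that city names match exactly in both tables, which for the scraped data only holds after the names have been cleaned.

```python
import pandas as pd

# Toy stand-ins for cities_df and c1_airports_df (values are illustrative).
ranking = pd.DataFrame({
    "City": ["Montreal", "Halifax", "Mountain View"],
    "Index": [137, 138, 259],
})
airports = pd.DataFrame({
    "City": ["Montreal", "Halifax", "Calgary"],
    "IATA Code": ["YUL", "YHZ", "YYC"],
})

# An inner join keeps only the cities present in both tables.
cities_with_airport = ranking.merge(airports, on="City", how="inner")
print(cities_with_airport)
```

A city that appears in only one table, such as Mountain View or Calgary here, is dropped from the result.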


In [95]:
c1_airports_df.to_clipboard(sep=',')

## Neighborhood segmentation for origin and destination cities

This is a trial run with hand-picked cities: in the full pipeline, the recommendation analysis comes first and selects the cities to cluster.

Let's assume that the destination city is Montreal (CA), and the city of origin is Akron (USA).

In [30]:
# load environment variables
import os
from dotenv import load_dotenv
load_dotenv()

True

In [31]:
# get foursquare key
CLIENT_ID = os.getenv('FOURSQUARE_CLIENT_ID')# your Foursquare ID
CLIENT_SECRET = os.getenv('FOURSQUARE_CLIENT_SECRET')# your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [32]:
# get bing api key
bing_api_key = os.environ.get('BING_API_KEY')

In [33]:
c1_airports_df.Location[9].strip()

'Montreal (Dorval)'

In [34]:
# regular expressions
import re

In [35]:
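def clean_city(city):
    """ Remove parenthesized or bracketed text from a city name
    """
    # raw string avoids invalid escape sequence warnings
    city = re.sub(r"[\(\[].*?[\)\]]", "", city)
    # strip only the outer whitespace so multi-word names keep their spaces
    return city.strip()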

In [36]:
# apply function to each city
c1_airports_df.Location = c1_airports_df.Location.apply(clean_city)

In [37]:
c1_airports_df = c1_airports_df.rename(columns={'Location': 'City'})

In [38]:
dest_city = c1_airports_df.City[9]

In [39]:
print(dest_city)

Montreal


In [40]:
# find postal codes
URL = 'https://worldpostalcode.com/canada/quebec/' + dest_city.lower()
page = requests.get(URL)

Load the content with `BeautifulSoup`

In [41]:
soup = BeautifulSoup(page.content, 'html.parser')

The table to scrape is identified by `class="codes"`

In [42]:
table = soup.find(class_="codes")

Parse the table

In [44]:
places = table.find_all(class_ = 'place')
codes = table.find_all(class_ = 'code')
#print(places)
data = []
for place, code in zip(places, codes):
    data.append((place.string, code.string))

Load the table into a dataframe

In [45]:
data

[('Beaconsfield', 'H9W'),
 ('Cote-Saint-Luc West', 'H4W'),
 ('Dorval Outskirts', 'H9P'),
 ('Downtown Montreal East', 'H3B'),
 ('Downtown Montreal North', 'H3A'),
 ('Downtown Montreal Northeast', 'H2Z'),
 ('Downtown Montreal South & West', 'H3H'),
 ('Downtown Montreal Southeast', 'H3G'),
 ('Hampstead', 'H3X'),
 ('Kirkland', 'H9J'),
 ('Montreal East', 'H1B'),
 ('Montreal North North', 'H1G'),
 ('Montreal North South', 'H1H'),
 ('Montreal West', 'H4X'),
 ('Old Montreal', 'H2Y'),
 ('Pointe-Claire', 'H9R'),
 ('Westmount East', 'H3Z'),
 ('Westmount West', 'H3Y')]

In [46]:
data = pd.DataFrame(data, columns=['Borough', 'PostalCode'])

In [47]:
data

Unnamed: 0,Borough,PostalCode
0,Beaconsfield,H9W
1,Cote-Saint-Luc West,H4W
2,Dorval Outskirts,H9P
3,Downtown Montreal East,H3B
4,Downtown Montreal North,H3A
5,Downtown Montreal Northeast,H2Z
6,Downtown Montreal South & West,H3H
7,Downtown Montreal Southeast,H3G
8,Hampstead,H3X
9,Kirkland,H9J


Check for repetitions for `PostalCode`
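
A minimal sketch of such a check, shown on a toy table with the same columns as `data`. The repeated code in the toy rows is deliberate and the borough names are illustrative.

```python
import pandas as pd

# Toy table with the same columns as `data`; the repeated code is deliberate.
postal_df = pd.DataFrame(
    [("Beaconsfield", "H9W"), ("Kirkland", "H9J"), ("Kirkland South", "H9J")],
    columns=["Borough", "PostalCode"],
)

# keep=False marks every row that shares its postal code with another row.
repeated = postal_df[postal_df["PostalCode"].duplicated(keep=False)]
print(repeated)
print("All postal codes unique:", postal_df["PostalCode"].is_unique)
```

An empty `repeated` frame means that each postal code maps to exactly one borough.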

### Geolocation data from Bing

In [48]:
import geocoder

Use geocoder and Bing to obtain the coordinates for each postal code

In [49]:
coords = []
for i in data.index:
    postal_code = data.at[i, 'PostalCode']

    lat_lng_coords = None

    g = []
    while(not g):
        g = geocoder.bing(location=dest_city, postalCode='{}'.format(postal_code), method='details', key=bing_api_key)
        
    
    lat_lng_coords = g.latlng

    coords.append([lat_lng_coords[0], lat_lng_coords[1]])
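
The `while` loop above keeps retrying until the service returns a result, so it never ends if a postal code cannot be resolved. A possible alternative is a bounded retry, sketched below. The helper `geocode_with_retry` and the stand-in `fake_lookup` are hypothetical; in the notebook, the lookup would wrap the geocoder call.

```python
import time

def geocode_with_retry(lookup, postal_code, max_attempts=3, delay=0.0):
    """Call lookup(postal_code) until it returns a truthy result or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(postal_code)
        if result:
            return result
        # wait before the next attempt (0 seconds by default)
        time.sleep(delay)
    return None

# Hypothetical stand-in for the geocoding service: fails once, then succeeds.
calls = {"count": 0}

def fake_lookup(postal_code):
    calls["count"] += 1
    return [45.5, -73.6] if calls["count"] > 1 else None

print(geocode_with_retry(fake_lookup, "H3B"))
```

The caller then has to handle a `None` result, for example by skipping the postal code or logging it for manual review.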

In [50]:
dest_city_data = pd.concat([data, pd.DataFrame(coords)], axis=1).sort_values(by='PostalCode').reset_index(drop=True)

In [51]:
dest_city_data.rename(columns={0: 'Latitude', 1: 'Longitude'}, inplace=True)

Table with the coordinates retrieved using Bing

In [85]:
dest_city_data.to_clipboard(sep=',')

In [54]:
import folium # map rendering library
import json

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas import json_normalize # pandas.io.json.json_normalize is deprecated

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

In [56]:
address = dest_city+','+country1

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of Montreal,Canada are 45.4972159, -73.6103642.


In [58]:

map_dest_city = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough in zip(dest_city_data['Latitude'], 
                             dest_city_data['Longitude'], 
                             dest_city_data['Borough']):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dest_city)  
    
map_dest_city

In [87]:
# to save a PNG of the map
import io
from PIL import Image

In [88]:
img_data = map_dest_city._to_png(5)
img = Image.open(io.BytesIO(img_data))
img.save('map_dest_city.png')

![image](map_dest_city.png)

In [59]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # the column name depends on how the JSON response was flattened
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

### Explore Neighborhoods

#### Reuse the function from the previous lab

In [60]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### Run the function

In [62]:
LIMIT = 100
radius = 500
# pass radius explicitly; otherwise the variable above is never used
dest_city_venues = getNearbyVenues(names=dest_city_data['Borough'],
                                   latitudes=dest_city_data['Latitude'],
                                   longitudes=dest_city_data['Longitude'],
                                   radius=radius
                                  )



Montreal East
Montreal North North
Montreal North South
Old Montreal
Downtown Montreal Northeast
Downtown Montreal North
Downtown Montreal East
Downtown Montreal Southeast
Downtown Montreal South & West
Hampstead
Westmount West
Westmount East
Cote-Saint-Luc West
Montreal West
Kirkland
Dorval Outskirts
Pointe-Claire
Beaconsfield


#### Let's check the results

In [63]:
print(dest_city_venues.shape)
dest_city_venues.head()

(492, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Montreal East,45.62912,-73.50468,Broadway Pizzeria,45.627372,-73.506175,Restaurant
1,Montreal East,45.62912,-73.50468,Centre Recreatif Edouard-Rivet,45.628681,-73.50058,Recreation Center
2,Montreal East,45.62912,-73.50468,Restaurant Sunshine,45.631091,-73.508728,Sushi Restaurant
3,Montreal East,45.62912,-73.50468,Deli Diane,45.626068,-73.502196,Deli / Bodega
4,Montreal East,45.62912,-73.50468,Couche-Tard,45.626031,-73.502286,Convenience Store


Let's check how many venues were returned for each neighborhood

In [65]:
dest_city_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Beaconsfield,1,1,1,1,1,1
Cote-Saint-Luc West,4,4,4,4,4,4
Dorval Outskirts,2,2,2,2,2,2
Downtown Montreal East,82,82,82,82,82,82
Downtown Montreal North,57,57,57,57,57,57
Downtown Montreal Northeast,100,100,100,100,100,100
Downtown Montreal South & West,9,9,9,9,9,9
Downtown Montreal Southeast,100,100,100,100,100,100
Hampstead,3,3,3,3,3,3
Kirkland,4,4,4,4,4,4


#### Let's find out how many unique categories can be curated from all the returned venues

In [66]:
print('There are {} unique categories.'.format(len(dest_city_venues['Venue Category'].unique())))

There are 152 unique categories.
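
The venue categories can be turned into features for k-means by one-hot encoding them and averaging per neighborhood. The sketch below illustrates this on a toy table that has the same two columns as `dest_city_venues`. The toy values, the number of clusters and the `random_state` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue table with the two columns used from dest_city_venues.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "Venue Category": ["Cafe", "Cafe", "Cafe", "Bar", "Park", "Park"],
})

# One-hot encode the categories, then average per neighborhood
# to get the frequency of each category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
profile = onehot.groupby("Neighborhood").mean()

# k=2 is an arbitrary choice for the toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profile)
profile["Cluster"] = kmeans.labels_
print(profile)
```

In this toy example, neighborhoods A and B share the cafe category and fall into the same cluster, while C, which has only parks, forms its own.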


# Data description and methods

From the tables above, we will extract the following features:

- We will focus on the destination cities which have a nearby airport.
- With the origin address and the destination country we will determine the most appropriate cities to migrate to.
- We will use postal code information to get the boroughs of the destination cities.
- We will use Bing to get the coordinates of the boroughs.
- We will use Folium to display the map and the boroughs.
- We will use the Foursquare API to get the most popular venues in the destination and origin cities.
- We will use k-means clustering to group the boroughs with common features.
- We will determine the similarity between the borough of origin and each cluster in the city of destination, using a recommendation engine.
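
The last step can be sketched with cosine similarity between the venue profile of the origin borough and each cluster centroid. This is only one possible similarity measure, chosen here as an assumption; the vectors are hypothetical category frequencies, not results from the data above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical venue-category frequencies (columns: Bar, Cafe, Park).
origin_profile = np.array([0.1, 0.7, 0.2])
cluster_centroids = np.array([
    [0.25, 0.75, 0.0],   # cluster 0
    [0.0, 0.0, 1.0],     # cluster 1
])

# Score each cluster against the origin borough and pick the closest one.
scores = [cosine_similarity(origin_profile, c) for c in cluster_centroids]
best_cluster = int(np.argmax(scores))
print(best_cluster, scores)
```

The boroughs assigned to the best-scoring cluster would then be reported as the ones most similar to the borough of origin.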