# Data preparation for city similarity project (IBM Applied Data Science Capstone)

- We will characterize the main venues and services of each city/neighborhood using the Foursquare API.

- We will collect other features, such as available services and airports, by web scraping. Wikipedia will be one of the sources.

- The main feature we will use is the cost of living, which narrows down the destination cities to compare. The site expatistan.com provides this data.

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
import numpy as np

In [3]:
import pandas as pd

Load the content with `BeautifulSoup`

## Scraping `expatistan` data

`expatistan.com` computes a collaborative international cost of living index, which is useful for comparing cities by everyday living expenses. It offers worldwide and regional rankings of cities, and it can also compare two cities directly across several categories: food, housing, clothes, transportation, personal care, and entertainment.

We start with the ranking of cities in North America because we are interested in comparing US cities with Canadian cities.
The final dataframe will contain the North American cost of living index by city.

In [4]:
# index into regions: 0: worldwide, 1: europe, 2: north-america, ...
regions = ('', 'europe', 'north-america', 'latin-america', 'asia', 'middle-east', 'africa', 'oceania')

In [5]:
# north-america
reg = 2

In [6]:
# URL for index for region
URL = 'https://www.expatistan.com/cost-of-living/index/'+regions[reg]

In [7]:
# get the page
page = requests.get(URL)

In [8]:
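def parse_rank_entry(entry, cities):
    """Parse one row of the JavaScript `cities` array and append it to `cities`.

    Rows with fewer than 9 fields get an empty "State" inserted at index 3.
    """
    # drop the first character and the last two (the row delimiters)
    clean_entry = entry.strip()[1:-2]
    if len(clean_entry) > 0:
        elem = clean_entry.split(',')
        # fewer than 9 fields: pad the row and leave "State" (index 3) empty
        if len(elem) < 9:
            e = [""]*9
            e[0:3] = elem[0:3]
            e[3] = ""
            e[4:] = elem[3:]
            elem = e
        # strip the single quotes around string fields
        clean_elems = [field.strip("'") for field in elem]
        # keep only the number from "Cost of Living Index: NNN"
        clean_elems[6] = clean_elems[6].replace("Cost of Living Index: ", "")
        cities.append(clean_elems)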
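
To make the padding step concrete, here is a minimal sketch that applies the same clean-up to a single row. The row string is a hypothetical reconstruction with illustrative values; the exact text served by the site is an assumption, and `split_row` is an illustrative helper, not part of the notebook.

```python
def split_row(raw, width=9, state_index=3):
    """Split one array row into `width` fields, inserting an empty state if needed."""
    # Drop the first character and the last two, as the notebook's parser does.
    fields = [f.strip("'") for f in raw.strip()[1:-2].split(",")]
    if len(fields) == width - 1:
        fields.insert(state_index, "")
    return fields


# Hypothetical row without a state field (assumed format, illustrative values).
raw = "[40.7143,-74.006,'New York City',0.0041,8008278,'Cost of Living Index: 257','new-york-city','US'],"
row = split_row(raw)
row[6] = row[6].replace("Cost of Living Index: ", "")
print(row)
```

Splitting on commas breaks for any field that itself contains a comma, so this approach relies on the scraped values not containing one.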

In [9]:
# flags to detect if we are in the javascript code and array cities
is_script = False
in_cities = False

cities = []
# iterate in lines
for line in page.text.splitlines():
    if "<script" in line:
        is_script = True   
    if "/script>" in line:
        is_script = False
    if "var cities" in line:
        in_cities = True
        # remove text
        line = line.replace("var cities = ", "")
    # check end of array cities
    if in_cities and '}' in line:
        in_cities = False
    # call function to parse entry
    if is_script and in_cities:
        parse_rank_entry(line, cities)        
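
The two flags above form a small state machine. The sketch below wraps the same logic in a generator and runs it on a made-up page fragment, so the behavior can be checked without a network request. The fragment is an assumption written only to exercise the flags; the real markup may differ, and `extract_city_lines` is a hypothetical name.

```python
def extract_city_lines(html):
    """Yield the lines of a <script> block that belong to the `var cities` array."""
    in_script = False
    in_cities = False
    for line in html.splitlines():
        if "<script" in line:
            in_script = True
        if "/script>" in line:
            in_script = False
        if "var cities" in line:
            in_cities = True
            line = line.replace("var cities = ", "")
        # The first line containing '}' is treated as the end of the array.
        if in_cities and "}" in line:
            in_cities = False
        if in_script and in_cities:
            yield line


# Made-up fragment (assumed structure, not copied from the site).
sample = "\n".join([
    "<script type='text/javascript'>",
    "var cities = [",
    "[1.0,2.0,'A','X',0.1,10,'Cost of Living Index: 100','a','US'],",
    "[3.0,4.0,'B','Y',0.2,20,'Cost of Living Index: 90','b','CA'],",
    "];",
    "var options = {zoom: 2};",
    "</script>",
])
lines = list(extract_city_lines(sample))
print(lines)
```

The bare `[` and `];` lines are yielded as well; the row parser ignores them because nothing is left after trimming the first character and the last two.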

In [10]:
# create a new dataframe from cities with proper column names
cities_df = pd.DataFrame(cities, columns=["Latitude", "Longitude", "City", "State", "Score", "Population", "Index", "Code", "Country"])
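
Because every field is cut out of the page text, all columns in this dataframe hold strings. A minimal sketch of converting the numeric columns follows, using two illustrative rows rather than the scraped data; the choice of which columns to convert is our assumption.

```python
import pandas as pd

# Two illustrative rows in the notebook's column layout; every value is a string.
columns = ["Latitude", "Longitude", "City", "State", "Score",
           "Population", "Index", "Code", "Country"]
rows = [
    ["32.293", "-64.782", "Hamilton", "", "0.0054", "2000", "294", "hamilton-bermuda", "BM"],
    ["45.5168", "-73.6492", "Montreal", "", "-0.0021", "3268513", "137", "montreal", "CA"],
]
df = pd.DataFrame(rows, columns=columns)

numeric_cols = ["Latitude", "Longitude", "Score", "Population", "Index"]
# errors="coerce" turns unparseable text into NaN instead of raising.
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(df.dtypes)
```

With numeric dtypes in place, sorting or filtering by `Index` compares numbers rather than strings.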

In [11]:
cities_df

Unnamed: 0,Latitude,Longitude,City,State,Score,Population,Index,Code,Country
0,32.293,-64.782,Hamilton,,0.005457073578892908,2000,294,hamilton-bermuda,BM
1,37.3928,-122.042,Mountain View,California,0.004189556522501467,74066,259,mountain-view-california,US
2,40.7143,-74.006,New York City,,0.004112036754458295,8008278,257,new-york-city,US
3,37.7793,-122.419,San Francisco,California,0.003633858010953368,808976,245,san-francisco,US
4,40.7114,-74.0647,Jersey City,New Jersey,0.0031316604411630953,247000,233,jersey-city,US
...,...,...,...,...,...,...,...,...,...
63,44.652,-63.5968,Halifax,,-0.002106187242921855,359111,138,halifax,CA
64,45.5168,-73.6492,Montreal,,-0.0021789148362126552,3268513,137,montreal,CA
65,35.0845,-106.651,Albuquerque,New Mexico,-0.0025507048686302003,487378,132,albuquerque,US
66,35.1495,-90.049,Memphis,Tennessee,-0.002780600050877187,650100,129,memphis,US


## Scraping `wikipedia.org`

Many websites offer airport data: `openflights.org`, `ourairports.com`, etc. Because we are only interested in international airports, it is easy to extract this information from `wikipedia.org`.
The final tables will list the cities with international airports in each of the two countries.

In [30]:
# International airports listed in wikipedia
URL = 'https://en.wikipedia.org/wiki/List_of_international_airports_by_country'
page = requests.get(URL)

In [89]:
# load the content in beautiful soup
soup = BeautifulSoup(page.content, 'html.parser')

In [90]:
# look for tables in page
tables = soup.find_all(class_="wikitable sortable")

In [91]:
def parse_table(tables, first_city):
    """Collect rows from the table that contains `first_city`.

    Rows are kept from the first row whose first cell equals `first_city`
    up to the end of that table; the table's header row is added first.
    """
    data = []
    for table in tables:
        append_it = False
        rows = table.find_all('tr')

        col_name = [name.text.strip() for name in rows[0].find_all('th')]

        for row in rows[1:]:
            cols = [element.text.strip() for element in row.find_all('td')]
            # skip rows without data cells to avoid an IndexError on cols[0]
            if not cols:
                continue
            if cols[0] == first_city:
                data.append(col_name)
                append_it = True
            if append_it:
                # note: empty cells are dropped, so later values shift left
                data.append([element for element in cols if element])
    return data
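
To show what this function returns, the sketch below runs a simplified single-table variant on a hypothetical HTML fragment. Unlike `parse_table`, the variant keeps empty cells so that columns stay aligned; `rows_from` and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def rows_from(table, first_city):
    """Return [header, row, ...] starting at the row whose first cell is first_city."""
    rows = table.find_all("tr")
    header = [th.text.strip() for th in rows[0].find_all("th")]
    data, keep = [], False
    for row in rows[1:]:
        cells = [td.text.strip() for td in row.find_all("td")]
        if cells and cells[0] == first_city:
            data.append(header)
            keep = True
        if keep:
            data.append(cells)
    return data


# Hypothetical table: one unrelated row followed by the rows we want.
html = """
<table class="wikitable sortable">
  <tr><th>Location</th><th>Airport</th><th>IATA Code</th></tr>
  <tr><td>Other City</td><td>Other Airport</td><td>XXX</td></tr>
  <tr><td>Calgary</td><td>Calgary International Airport</td><td>YYC</td></tr>
  <tr><td>Edmonton</td><td>Edmonton International Airport</td><td>YEG</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
result = rows_from(table, "Calgary")
for row in result:
    print(row)
```

Rows before the starting city are skipped, and a city that never appears yields an empty list.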

In [187]:
#countries = set([country.find(class_='toctext').string for country in soup.findAll(class_="toclevel-{}".format(i)) for i in range(2, 6)])

In [199]:
# scrape list of countries with an international airport
countries = {}
bad_names = ('Passenger Roles (2011-2020)', 'Africa', 'Americas', 'Caribbean', 'Central America', 'North America', 'South America',
             'Asia', 'Central Asia', 'South Asia', 'Southeast Asia', 'Southwest Asia and the Middle East',
             'Europe', 'West Europe', 'Central Europe', 'Southern Europe', 'East Europe', 'Nordic region', 'United Kingdom',
             'Oceania', 'See also', 'References')
for country in soup.find_all(class_="mw-headline"):
    if country.string not in bad_names:
        countries[country.string] = country.string

In [208]:
# set the countries to study. A country without an international airport is not in the
# dictionary, so the lookup raises a KeyError
country1 = countries['Canada']

In [209]:
country2 = countries['United States']
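
A bare `KeyError` does not say which names are available. As a sketch, a small lookup helper can report similar entries instead; `pick_country` is a hypothetical helper, and the dictionary below is an illustrative stand-in for the scraped one.

```python
def pick_country(countries, name):
    """Return the entry for `name`, or raise a KeyError that lists similar keys."""
    try:
        return countries[name]
    except KeyError:
        # Keys may be None when a heading has no plain string, so skip those.
        close = [c for c in countries if c and name.lower() in c.lower()]
        raise KeyError(f"{name!r} not found; similar entries: {close}") from None


# Illustrative stand-in for the scraped dictionary.
countries = {"Canada": "Canada", "United States": "United States"}
country1 = pick_country(countries, "Canada")
print(country1)
```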

In [211]:
# scrape the first city of each country in the table
city1 = soup.find(id=country1.replace(" ", "_")).find_all_next('a')[1].string

In [213]:
city2 = soup.find(id=country2.replace(" ", "_")).find_all_next('a')[1].string
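
These lookups rely on two assumptions: a section's `id` is its heading text with spaces replaced by underscores, and the second link after the heading is the first city in the table. The sketch below checks the positional lookup against a structural one on a hypothetical fragment; the markup is our guess at the page layout, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a section heading followed by its airport table.
html = """
<h3><span class="mw-headline" id="United_States">United States</span>
<span class="mw-editsection">[<a href="#">edit</a>]</span></h3>
<table class="wikitable sortable">
  <tr><th>Location</th><th>Airport</th></tr>
  <tr><td><a href="#">Akron</a></td><td><a href="#">Akron Executive Airport</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
heading = soup.find(id="United States".replace(" ", "_"))

# Positional lookup as in the notebook: here link 0 is the edit link, link 1 the city.
by_position = heading.find_all_next("a")[1].string

# Structural lookup: first data cell of the next table after the heading.
by_structure = heading.find_next("table").find("td").get_text(strip=True)
print(by_position, by_structure)
```

The structural lookup does not depend on how many links sit between the heading and the table, which makes it less sensitive to layout changes.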

In [215]:
# scrape the airport table for country1
country1_airports = parse_table(tables, city1)

In [216]:
# scrape the airport table for country2
country2_airports = parse_table(tables, city2)

In [218]:
# create new dataframes
country1_airports_df = pd.DataFrame(country1_airports)

In [219]:
c1_airports_df = country1_airports_df.rename(columns=country1_airports_df.iloc[0]).drop(country1_airports_df.index[0]).reset_index(drop=True)

In [225]:
c1_airports_df

Unnamed: 0,Location,Airport,IATA Code,Passenger Role,2015 Passengers,2014 Passengers
0,Calgary,Calgary International Airport,YYC,Medium,15475759,15261108.0
1,Edmonton,Edmonton International Airport,YEG,Medium,7981074,8240161.0
2,Whitehorse,Erik Nielsen Whitehorse International Airport,YXY,,,
3,Gander,Gander International Airport,YQX,,,
4,Halifax (Goffs),Halifax Stanfield International Airport,YHZ,Medium,10897234,3663039.0
5,Hamilton,John C. Munro Hamilton International Airport,YHM,Non-Hub,332378,
6,Kelowna,Kelowna International Airport,YLK,Small,,
7,London,London International Airport,YXU,Non-Hub,,
8,Moncton (Dieppe),Greater Moncton International Airport,YQM,Small,677159,
9,Montreal (Dorval),Montreal-Pierre Elliott Trudeau International ...,YUL,Medium,15517382,14840067.0


In [220]:
country2_airports_df = pd.DataFrame(country2_airports)

In [221]:
c2_airports_df = country2_airports_df.rename(columns=country2_airports_df.iloc[0]).drop(country2_airports_df.index[0]).reset_index(drop=True)

In [224]:
c2_airports_df

Unnamed: 0,Location,Airport,IATA Code,Passenger Role,2018 Passengers
0,Akron,Akron Executive Airport,AKC,Non-Hub/Reliever,No Commercial Service
1,Albany,Albany International Airport,ALB,Small,"2,848,000 [2]"
2,Albuquerque,Albuquerque International Sunport,ABQ,Medium,"5,258,775 [3]"
3,Anchorage,Ted Stevens Anchorage International Airport,ANC,Medium,"5,176,371[4]"
4,Appleton,Appleton International Airport,ATW,Small,"717,757 [5]"
...,...,...,...,...,...
110,"Washington, D.C.",Ronald Reagan Washington National Airport**,DCA,Large,"22,695,582[38]"
111,"Washington, D.C.",Washington Dulles International Airport,IAD,Large,24060709
112,West Palm Beach,Palm Beach International Airport,PBI,Medium,6513943
113,Wilkes-Barre/Scranton,Wilkes-Barre/Scranton International Airport,AVP,Small,"508,825[39]"
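
The passenger columns mix plain numbers, comma-separated numbers with footnote markers, and text such as "No Commercial Service". A minimal sketch of normalizing these values is shown below; `parse_passengers` is a hypothetical helper, and mapping non-numeric text to `None` is our assumption about how such entries should be treated.

```python
import re


def parse_passengers(value):
    """Return the passenger count as an int, or None if no number is present."""
    if value is None:
        return None
    # Drop bracketed footnote markers such as "[2]" or "[38]".
    text = re.sub(r"\[[^\]]*\]", "", str(value))
    # Remove thousands separators and whitespace.
    digits = re.sub(r"[,\s]", "", text)
    try:
        return int(float(digits))
    except (ValueError, OverflowError):
        return None


for raw in ["2,848,000 [2]", "5,176,371[4]", "24060709", "15261108.0", "No Commercial Service"]:
    print(repr(raw), "->", parse_passengers(raw))
```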
