# Applied Data Science Capstone Project

Welcome to my __Capstone Project__.  
I built this __notebook__ to be as straightforward as possible to follow: just scroll down, read the __markdown__ comments, and look at the images and graphics.  
A separate PDF report covers all the details and explanations.

In [240]:
import requests
import unicodedata
from bs4 import BeautifulSoup
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes
import folium

### Getting the data

This step reads a CSV file with thousands of world cities, downloaded from https://simplemaps.com/data/world-cities.  
I then filter it to keep only the capitals.  
  
_The next cell is hidden because it contains my IBM Cloud credentials._  


In [241]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
3,Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519


In [242]:
df_data_0.shape

(15493, 11)

In [257]:
capitals = df_data_0[df_data_0['capital'] == 'primary']

In [258]:
capitals = capitals[["city_ascii","lat","lng","country"]]
capitals.rename(columns={"city_ascii": "Capital", "country": "Country"}, inplace = True)
capitals.head()

Unnamed: 0,Capital,lat,lng,Country
0,Tokyo,35.685,139.7514,Japan
2,Mexico City,19.4424,-99.131,Mexico
9,Dhaka,23.7231,90.4086,Bangladesh
10,Buenos Aires,-34.6025,-58.3975,Argentina
12,Cairo,30.05,31.25,Egypt


In [259]:
capitals.shape

(212, 4)

__Scrape the capitals' populations from Wikipedia__  
I took the population figures from Wikipedia because some of the values in the CSV file are far off.  
  
_I wrote two functions to parse the capital names and the population values, because the raw table cells contain special characters, references and annotations._

In [271]:
def ParsePopulation(x):
    s=""
    for c in x:
        if not(c.isdigit()) and c!=',' and c!=' ':
            return s
        if c.isdigit():
            s=s+c
    return s

def ParseCapitals(s):
    a =''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')
    s=""
    for c in a:
        if c==',' or c=='(' or c=='[':
            return s.strip()
        s=s+c
    return s.strip()
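
As a quick check of the two helpers above, the sketch below repeats them in a self-contained form and runs them on a few made-up strings that imitate Wikipedia table cells (accents, thousands separators, footnote markers). The sample strings are illustrative assumptions, not values taken from the scraped table.

```python
import unicodedata

def ParsePopulation(x):
    # keep digits, skip commas and spaces, stop at the first other character
    s = ""
    for c in x:
        if not c.isdigit() and c != ',' and c != ' ':
            return s
        if c.isdigit():
            s = s + c
    return s

def ParseCapitals(s):
    # strip accents (combining marks), then cut at the first ',', '(' or '['
    a = ''.join(c for c in unicodedata.normalize('NFD', s)
                if unicodedata.category(c) != 'Mn')
    s = ""
    for c in a:
        if c == ',' or c == '(' or c == '[':
            return s.strip()
        s = s + c
    return s.strip()

# hypothetical cell values, for illustration only
population_samples = ["21,542,000[1]", "1 234 567 (estimate)"]
capital_samples = ["Bogotá", "Washington, D.C.", "Kuwait City (Al Kuwayt)"]

parsed_populations = [ParsePopulation(s) for s in population_samples]
parsed_capitals = [ParseCapitals(s) for s in capital_samples]
print(parsed_populations)
print(parsed_capitals)
```

Note that `ParsePopulation` returns a string, which is why later cells convert the value with `int(...)` before doing arithmetic.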

In [272]:
res = requests.get('https://en.wikipedia.org/wiki/List_of_national_capitals_by_population')
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[1]
# np.str was only an alias of the built-in str and has been removed from recent NumPy versions
wiki = pd.read_html(str(table))[0][['Capital','Population']]
wiki['Capital'] = [ParseCapitals(s) for s in wiki['Capital']]
wiki['Population'] = [ParsePopulation(s) for s in wiki['Population']]
wiki.head()

Unnamed: 0,Capital,Population
0,Beijing,21542000
1,Tokyo,13929286
2,Moscow,12506468
3,Kinshasa,11855000
4,Jakarta,10075310


In [273]:
wiki.shape

(244, 2)

__Merge the tables and drop rows without values__  
_We lose some rows, mostly due to geopolitics and the like._

In [277]:
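# 'Capital' is the only column the two tables share, so name it explicitly as the join key
capitals = capitals.merge(wiki, on='Capital', how='left').dropna()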

In [282]:
capitals.shape

(189, 5)
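
To see which capitals fall out of a left merge like this one, pandas can flag unmatched rows with `indicator=True`. The sketch below uses two tiny stand-in tables; `capitals_demo`, `wiki_demo` and the "Example City" row are hypothetical and exist only to show the mechanism.

```python
import pandas as pd

# toy stand-ins for the two tables (the last row is made up on purpose)
capitals_demo = pd.DataFrame({
    "Capital": ["Tokyo", "Moscow", "Example City"],
    "Country": ["Japan", "Russia", "Exampleland"],
})
wiki_demo = pd.DataFrame({
    "Capital": ["Tokyo", "Moscow"],
    "Population": ["13929286", "12506468"],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
merged = capitals_demo.merge(wiki_demo, on="Capital", how="left", indicator=True)

# rows with no match in the Wikipedia table
lost = merged.loc[merged["_merge"] == "left_only", "Capital"].tolist()

# rows that survive, without the helper column
kept = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(lost)
print(kept.shape)
```

Listing the `left_only` rows is one way to check whether a capital was dropped because of a naming mismatch rather than a genuinely missing entry.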

### Mapping the Capitals

Let's show the capitals on a world map, with colors ranging from yellow to red and a radius that grows with the population.


In [290]:
# create World map
world_map = folium.Map(location=[0, 0], zoom_start=2)
# 10 colors from yellow (index 0) to red (index 9)
rainbow = [colors.rgb2hex([1, 1 - i/9, 0, 1]) for i in range(10)]

# add markers to map
for lat, lng, city, country, population in zip(capitals['lat'], capitals['lng'], capitals['Capital'], capitals['Country'], capitals['Population']):
    label = '{}, {} ({})'.format(city, country, population)
    label = folium.Popup(label, parse_html=True)
    # population in millions, capped at 9 so it can index the 10-color palette
    popinM = min(int(population) / 1000000, 9)
    # the radius follows the population, with a minimum of 3
    radius = max(popinM, 3)
    folium.CircleMarker(
        [lat, lng],
        radius=radius,
        popup=label,
        color=rainbow[int(popinM)],
        fill=True,
        fill_color=rainbow[int(popinM)],
        fill_opacity=1,
        parse_html=False).add_to(world_map)

world_map
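
The population-to-marker mapping used above can be checked in isolation. The sketch below is a minimal restatement under stated assumptions: `marker_style` is a hypothetical helper name, and it builds the hex color by hand instead of calling matplotlib, so that it runs without any plotting library.

```python
def marker_style(population):
    """Return (radius, hex color) for a population given as an int or a numeric string."""
    pop_m = min(int(population) / 1000000, 9)   # millions, capped at 9
    radius = max(pop_m, 3)                      # never smaller than 3
    level = int(pop_m)                          # palette index, 0 to 9
    green = round(255 * (1 - level / 9))        # 255 (yellow) down to 0 (red)
    color = '#{:02x}{:02x}{:02x}'.format(255, green, 0)
    return radius, color

for pop in ["500000", "4500000", "21542000"]:
    print(pop, marker_style(pop))
```

Small capitals all share the minimum radius and the yellow end of the scale, while every capital above 9 million inhabitants gets the same maximum radius and pure red.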

### Get the venues

Let's get the venues in the vicinity of each capital.  
The search radius depends on the population: it starts at 5000 meters and grows by 500 meters for every million inhabitants.  
  
_The next cell is hidden because it contains my Foursquare credentials._  

In [316]:
# The code was removed by Watson Studio for sharing.
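
As a worked example of the radius rule described above (5000 meters plus 500 meters per million inhabitants), here is a small sketch. `search_radius` is a hypothetical helper name; the two population strings are the ones shown in the Wikipedia preview earlier in this notebook.

```python
def search_radius(population):
    """Radius in meters: 5000 plus 500 for every million inhabitants."""
    return int(5000 + 500 * int(population) / 1000000)

for city, pop in [("Tokyo", "13929286"), ("Beijing", "21542000")]:
    print(city, search_radius(pop))
```

With these inputs the radius comes out at roughly 12 km for Tokyo and just under 16 km for Beijing.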

In [317]:
def getNearbyVenues(names, latitudes, longitudes, population):
    
    venues_list=[]
    for name, lat, lng, pop in zip(names, latitudes, longitudes, population):
        # radius will be 5k plus 500m for each 1M of population
        radius=int(5000 + 500*int(pop)/1000000)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        # make the GET request, retrying up to 5 times if the answer has no venue groups
        results = []
        for _ in range(5):
            groups = requests.get(url).json().get('response', {}).get('groups')
            if groups:
                results = groups[0]['items']
                break

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Capital', 
                  'Venue', 
                  'lat', 
                  'lng', 
                  'Category']
    
    return nearby_venues
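
To make the flattening step easier to follow, the sketch below applies the same field lookups to a hand-written dictionary. `sample_json` only mimics the keys that `getNearbyVenues` reads (`response`, `groups`, `items`, `venue`, ...); it is an assumption for illustration, not real Foursquare output, and the venue names are made up.

```python
import pandas as pd

# hand-written sample shaped like the fields the function reads;
# it is NOT real Foursquare output
sample_json = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Cafe",
                           "location": {"lat": 35.68, "lng": 139.75},
                           "categories": [{"name": "Café"}]}},
                {"venue": {"name": "Example Museum",
                           "location": {"lat": 35.69, "lng": 139.76},
                           "categories": [{"name": "History Museum"}]}},
            ]
        }]
    }
}

# same navigation as in getNearbyVenues: first group, then its items
results = sample_json["response"]["groups"][0]["items"]

# one tuple per venue, tagged with the capital it was requested for
rows = [("Tokyo",
         v["venue"]["name"],
         v["venue"]["location"]["lat"],
         v["venue"]["location"]["lng"],
         v["venue"]["categories"][0]["name"]) for v in results]

venues_demo = pd.DataFrame(rows, columns=["Capital", "Venue", "lat", "lng", "Category"])
print(venues_demo)
```

Each venue becomes one row, and only the first category of a venue is kept, which matches the `categories[0]` lookup in the function above.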

In [318]:
capital_venues = getNearbyVenues(names=capitals['Capital'], latitudes=capitals['lat'], longitudes=capitals['lng'], population=capitals['Population'])
capital_venues.shape

(13253, 5)

In [322]:
print('There are {} unique categories.'.format(len(capital_venues['Category'].unique())))
capital_venues.sample(20)

There are 503 unique categories.


Unnamed: 0,Capital,Venue,lat,lng,Category
3387,Caracas,Kabuki Sushi + Salads La Campiña,10.49743,-66.873549,Sushi Restaurant
13222,Philipsburg,Beau Beau's Restaurant,18.052367,-63.015744,Caribbean Restaurant
7308,Yerevan,History Museum of Armenia | Հայաստանի Պատմությ...,40.178449,44.513588,History Museum
4072,Santo Domingo,Don Nestor Parrillada,18.477242,-69.883639,Steakhouse
11373,Windhoek,Craft Cafe,-22.572002,17.08382,Café
7808,Dublin,Mad Egg,53.333668,-6.264568,Fried Chicken Joint
732,Moscow,Клуб Алексея Козлова,55.757724,37.633843,Music Venue
6211,San Salvador,i-shi cha,13.678636,-89.238081,Bubble Tea Shop
7917,Amsterdam,Generator Amsterdam,52.360802,4.918966,Hostel
12466,Luxembourg,Chemin de la Corniche,49.610389,6.134496,Trail


In [323]:
capital_venues.to_csv('capital_venues.csv')

In [324]:
!dir

capital_venues.csv
