# Geneva House Renting Consultancy
Capstone Project, Applied Data Science Capstone by IBM/Coursera, Simone Lisi.

## Scraping Foursquare categories

In this notebook, we build a dictionary of Foursquare venue categories and store it in a JSON file.
It will be used later to group the venues we collect (e.g. grouping Chinese, Vietnamese, etc. 
restaurants as Asian restaurants).

The file "dic_all_cat.json" stores a dictionary in which each key is the categoryId of a 
<a href="https://developer.foursquare.com/docs/build-with-foursquare/categories/">Foursquare category</a> and the associated value is a list, referred to below as categories_l[index], containing:

categories_l[index][0] = the Foursquare category name;

categories_l[index][1] = a number (stored as a float) giving the level of the category in the hierarchy. 0 is a top-level category and higher numbers indicate subcategories;

categories_l[index][2:-1] = the categoryIds of the parent categories, from the top-level category down to the direct parent;

categories_l[index][-1] = the categoryId, same as the key.

As an example, the list corresponding to 'Exhibit' (categoryId = '56aa371be4b08b9a8d573532') looks like this:

['Exhibit', 1.0, '4d4b7104d754a06370d81259', '56aa371be4b08b9a8d573532']

where '4d4b7104d754a06370d81259' is the categoryId of 'Arts & Entertainment', its parent category.
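
As a minimal sketch of how such a list can be read, the helper below splits one entry into named fields. The function name describe_entry is hypothetical and not part of this notebook; the sample entry is the one the notebook prints for 'Exhibit' (categories_l[12]).

```python
def describe_entry(entry):
    """Split one category entry into its named parts (hypothetical helper)."""
    return {
        'name': entry[0],
        'level': int(entry[1]),      # levels are stored as floats such as 1.0
        'parent_ids': entry[2:-1],   # top-level ancestor first, direct parent last
        'category_id': entry[-1],    # same value as the dictionary key
    }


# entry as printed by the notebook for categories_l[12]
exhibit = ['Exhibit', 1.0, '4d4b7104d754a06370d81259', '56aa371be4b08b9a8d573532']
info = describe_entry(exhibit)
print(info['name'], info['level'], info['parent_ids'], info['category_id'])
```

For a top-level category such as 'Arts & Entertainment', the slice entry[2:-1] is empty, so parent_ids is an empty list.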


## Installing packages
Set this cell to 'code' if needed.
!conda install -c anaconda bs4 --yes

!conda install -c conda-forge geopy --yes 

!conda install -c conda-forge folium=0.5.0 --yes 


In [1]:
# importing libraries
from bs4 import BeautifulSoup
import json
import urllib.request as urllib2
import random
from random import choice
import time
import pandas as pd
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library
import seaborn as sns
%matplotlib inline

In [2]:
# urlquery from Achim Tack, available on github.
# https://github.com/ATack/GoogleTrafficParser/blob/master/google_traffic_parser.py
def urlquery(url):
    # function cycles randomly through different user agents and time intervals to simulate more natural queries
    try:
        sleeptime = float(random.randint(1,6))/5
        time.sleep(sleeptime)

        agents = ['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17',
        'Mozilla/5.0 (compatible; MSIE 10.6; Windows NT 6.1; Trident/5.0; InfoPath.2; SLCC1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET CLR 2.0.50727) 3gpp-gba UNTRUSTED/1.0',
        'Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02',
        'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
        'Mozilla/3.0',
        'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3',
        'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3',
        'Opera/9.00 (Windows NT 5.1; U; en)']

        agent = choice(agents)
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', agent)]

        html = opener.open(url).read()
        time.sleep(sleeptime)
        
        return html

    except Exception as e:
        # on failure, report the error and return None so the caller can check the result
        print('Something went wrong with Crawling:\n%s' % e)
        return None

In [3]:
### utility to explore a nested structure of dictionaries and lists
def structure_data(something, lev, keysearch = ''):
    #print(something, 'is a ', type(something))
    
    if(type(something) is dict):
        for key in something.keys():
            if(key == keysearch):
                print('\nHEEEEEEEEERE!!!!!!!\n')
                
            spaces = '    '*lev
            print(spaces, key, 'is a ', type(something[key]))
            structure_data(something[key], lev+1, keysearch)
                
    elif(type(something) is list):
        for i, element in enumerate(something):
            spaces = '    '*lev
            print(spaces, i, 'is a ', type(element))
            structure_data(element, lev+1, keysearch)
    else:
        spaces = '    '*lev
        print(spaces, something)

In [4]:
url ='https://developer.foursquare.com/docs/build-with-foursquare/categories/'
print(url)

soup = BeautifulSoup(urlquery(url), 'html.parser')

https://developer.foursquare.com/docs/build-with-foursquare/categories/


In [5]:
list_soup = soup.prettify().split('\n')

In [6]:
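categories_l = []
for i, element in enumerate(list_soup):
    if '<h3>' in element:
        # the category name is on the line after <h3>, its id three lines further down
        # the level is inferred from the indentation of the prettified HTML:
        # the formula assumes 11 spaces for a top-level name and 3 more per sub-level
        lv = ((len(list_soup[i+1]) - len(list_soup[i+1].lstrip()))-11)/3
        name = list_soup[i+1]
        cat_id = list_soup[i+4]
        categories_l.append([name.lstrip().replace('&amp;', '&'), lv, cat_id.lstrip()])
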

In [7]:
#### For each Foursquare category, we retrieve all parent categories and insert their ids in its list
### lvl is the hierarchical level of the category. Subcategories have higher levels

i= len(categories_l)-1
while i >= 0:
    lvl = categories_l[i][1]
    j = 1
    while i-j >= 0:
        lvl2 = categories_l[i-j][1]
        if(lvl2 < lvl):
            categories_l[i].insert(2, categories_l[i-j][2])
            lvl = lvl2
        if lvl2 == 0:
            break
        j += 1
        
    i -=1
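
The same parent lookup can also be sketched as a single forward pass that keeps a stack of the currently open ancestors. This is an alternative illustration under the assumption that the rows are in page order (each subcategory listed after its parent), not the code used in this notebook. The function name attach_parents, the short ids and the hierarchy of the sample rows are hypothetical.

```python
def attach_parents(rows):
    """Return new entries [name, level, *ancestor_ids, own_id] (hypothetical helper).

    rows is a list of [name, level, category_id] in page order.
    """
    stack = []   # (level, category_id) of the ancestors of the current row
    result = []
    for name, level, cat_id in rows:
        # drop every stacked category that is not shallower than this row
        while stack and stack[-1][0] >= level:
            stack.pop()
        ancestor_ids = [cid for _, cid in stack]   # top-level ancestor first
        result.append([name, level] + ancestor_ids + [cat_id])
        stack.append((level, cat_id))
    return result


# illustrative rows with made-up ids
rows = [
    ['Arts & Entertainment', 0.0, 'a'],
    ['Amphitheater', 1.0, 'b'],
    ['Museum', 1.0, 'c'],
    ['Art Museum', 2.0, 'd'],
    ['Food', 0.0, 'e'],
    ['Asian Restaurant', 1.0, 'f'],
]
with_parents = attach_parents(rows)
for entry in with_parents:
    print(entry)
```

Unlike the loop above, this sketch builds new lists instead of modifying the rows in place, and it visits each row once rather than scanning backwards for every category.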

In [8]:
## checking a few results
categories_l[0]

['Arts & Entertainment', 0.0, '4d4b7104d754a06370d81259']

In [9]:
## checking a few results
categories_l[12]

['Exhibit', 1.0, '4d4b7104d754a06370d81259', '56aa371be4b08b9a8d573532']

In [10]:
## checking a few results
categories_l[14][-1]

'52e81612bcbc57f1066b79ea'

In [11]:
# storing results in a dictionary
dic_all_cat = {}
for element in categories_l:
    key = element[-1]
    dic_all_cat[key] = element

In [12]:
### save the category dictionary to a file for later use
with open('dic_all_cat.json', 'w') as f:
    json.dump(dic_all_cat, f)
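
A minimal sketch of how the saved dictionary might be used later for grouping, assuming the list layout described at the top of this notebook and consecutive levels (0, 1, 2, ...). The helper group_name is hypothetical; the two sample entries are the ones shown in the checks above, and the JSON round trip stands in for reading dic_all_cat.json back from disk.

```python
import json


def group_name(category_id, dic_all_cat, level=0):
    """Return the name of the ancestor of category_id at the given level.

    If the category itself is shallower than `level`, its own name is returned.
    """
    entry = dic_all_cat[category_id]
    chain = entry[2:]   # ids from the top-level ancestor down to the category itself
    ancestor_id = chain[min(level, len(chain) - 1)]
    return dic_all_cat[ancestor_id][0]


# two entries taken from the notebook output
sample = {
    '4d4b7104d754a06370d81259': ['Arts & Entertainment', 0.0, '4d4b7104d754a06370d81259'],
    '56aa371be4b08b9a8d573532': ['Exhibit', 1.0, '4d4b7104d754a06370d81259',
                                 '56aa371be4b08b9a8d573532'],
}

# stand-in for json.load(open('dic_all_cat.json'))
loaded = json.loads(json.dumps(sample))

print(group_name('56aa371be4b08b9a8d573532', loaded, level=0))  # Arts & Entertainment
print(group_name('56aa371be4b08b9a8d573532', loaded, level=1))  # Exhibit
```

Grouping at level 0 collapses every venue into its top-level category, while level 1 would keep distinctions such as 'Asian Restaurant' for the cuisine-specific subcategories beneath it.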