<h1>Homework 03 - Interactive Viz</h1>

In [142]:
## Importing useful libraries
import pandas as pd
import numpy as np
import folium as fm
import re
import os
import simplejson as json
import urllib

<h2>1. Importation and data cleaning</h2>

We import the data and put it in a dataframe.

In [143]:
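# Fail early with a clear message if the export is missing
assert os.path.exists('./Data/P3_GrantExport.csv'), 'P3_GrantExport.csv not found in ./Data'
# Note: error_bad_lines was deprecated in pandas 1.3; newer versions use on_bad_lines='warn'
data = pd.read_csv('./Data/P3_GrantExport.csv', delimiter=';', na_values=['.'], error_bad_lines=False)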

We drop the columns that are not needed for this analysis.

In [144]:
data.drop(['Project Title', 'Project Title English','Responsible Applicant','Funding Instrument',
           'Funding Instrument Hierarchy', 'Institution', 'Discipline Number', 'Discipline Name', 
           'Discipline Name Hierarchy', 'Keywords'], axis=1, inplace=True)
data.head(3)

,Project Number,University,Start Date,End Date,Approved Amount
0,1,Nicht zuteilbar - NA,01.10.1975,30.09.1976,11619.0
1,4,Université de Genève - GE,01.10.1975,30.09.1976,41022.0
2,5,"NPO (Biblioth., Museen, Verwalt.) - NPO",01.03.1976,28.02.1985,79732.0


In [145]:
len(data)

63969

We drop projects with a missing start date or end date.

In [146]:
data = data[pd.notnull(data['Start Date'])]
data = data[pd.notnull(data['End Date'])]
len(data)

63968
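
The two filters above can also be written as a single `dropna` call. A minimal sketch on a made-up three-row frame (the rows are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy frame with the same two date columns; the values are made up.
toy = pd.DataFrame({
    'Start Date': ['01.10.1975', None, '01.03.1976'],
    'End Date': ['30.09.1976', '30.09.1976', None],
})

# Keep only the rows where both dates are present.
toy_clean = toy.dropna(subset=['Start Date', 'End Date'])
print(len(toy_clean))
```

Only the first row survives here, because each of the other two rows is missing one of the dates.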

We drop the projects whose university field is empty or contains "Nicht zuteilbar" (German for "not assignable").

In [147]:
data = data[data.University != 'Nicht zuteilbar - NA']
data = data.dropna(axis=0, subset=['University'])
len(data)

48392

***Observations and decisions about the data to use:***<br>
We observed that for a specific university, the first project appears only in 2012. We considered removing all projects before 2012, but this would leave us with too little data. It is also more interesting to look at the overall result than at the last few years only. As further work, it would be interesting to compute how the money received by each canton evolves over the years. We therefore decided to use the whole dataset (after cleaning) in this homework.

***Group by university:***<br>

Here, we will group the data by university, and sum over the "<i>Approved Amount</i>" column.<br>
To sum over it, we first need to convert it to float type.

In [148]:
def convert_to_float(nb):
    try:
        return float(nb)
    except ValueError:
        return 0.0

data['Approved Amount'] = data['Approved Amount'].apply(convert_to_float)
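
An alternative to the element-wise conversion, assuming the non-numeric entries are plain strings, is `pd.to_numeric` with `errors='coerce'`. It turns unparseable values into NaN, which `fillna(0.0)` then replaces. Note one difference from `convert_to_float`: this version also maps values that were already missing to 0.0. A sketch on made-up values:

```python
import pandas as pd

# Made-up amounts; the middle entry stands for any non-numeric text.
amounts = pd.Series(['11619.00', 'not a number', '41022.00'])

# Unparseable entries become NaN, then 0.0.
as_float = pd.to_numeric(amounts, errors='coerce').fillna(0.0)
print(as_float.tolist())
```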

We group the data by university. We then drop the "<i>Project Number</i>" column, since the sum of project numbers is meaningless.

In [149]:
universities_pd = data.groupby('University').sum()
universities_pd.drop('Project Number', axis=1, inplace=True)
universities_pd.head(3)

University,Approved Amount
AO Research Institute - AORI,3435621.0
Allergie- und Asthmaforschung - SIAF,19169965.0
Berner Fachhochschule - BFH,31028695.0
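
The grouping step can be illustrated on a toy frame. In this sketch (made-up universities and amounts), the amount column is selected before summing, which avoids adding up "Project Number" in the first place:

```python
import pandas as pd

# Made-up projects for two universities.
toy = pd.DataFrame({
    'University': ['A - AA', 'B - BB', 'A - AA'],
    'Project Number': [1, 2, 3],
    'Approved Amount': [100.0, 50.0, 25.0],
})

# Sum only the amounts, one total per university.
totals = toy.groupby('University')['Approved Amount'].sum()
print(totals)
```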


<h2>2. Finding cantons</h2><br>
In this part, we explain how we found the canton of each university. We first used the university name and the abbreviation after it to infer the canton. Next, we used the <i>geonames</i> service and, finally, the Google API to get the cantons of the remaining universities. As a last step, we manually added the remaining important universities for which we could not get any information.

We decided to store the canton information in a column of the dataframe, which will be called "canton". The code used to store the canton information will be the abbreviation of the canton as used in Switzerland (example: <i>Vaud</i> will be <i>VD</i>).

The first list contains the abbreviations of all Swiss cantons and is used to encode the canton.

In [150]:
cantons_id = ['ZH', 'BE', 'LU', 'UR', 'SZ', 'OW', 'NW', 'GL', 'ZG', 'FR',
              'SO', 'BS', 'BL', 'SH', 'AR', 'AI', 'SG', 'GR', 'AG', 'TG',
              'TI', 'VD', 'VS', 'NE', 'GE', 'JU']

The second list contains the names of the cantons, in the same order.

In [151]:
cantons_name = ['Zürich', 'Bern/Berne', 'Luzern', 'Uri', 'Schwyz', 'Obwalden',
                'Nidwalden', 'Glarus', 'Zug', 'Fribourg', 'Solothurn', 'Basel-Stadt',
                'Basel-Landschaft', 'Schaffhausen', 'Appenzell Ausserrhoden',
                'Appenzell Innerrhoden', 'St. Gallen', 'Graubünden/Grigioni', 'Aargau',
                'Thurgau', 'Ticino', 'Vaud', 'Valais/Wallis', 'Neuchâtel', 'Genève', 'Jura']

<h3>2.1 Finding cantons by using the university name</h3><br>
We noticed that some university names contain the name of the canton. In other cases, the abbreviation after the name is the canton's abbreviation. Here, we try to infer the canton from these two pieces of information.

We split each university's full name into its name and abbreviation and store both in the dataframe.

***Function: Split a full university name into its name and abbreviation:***

In [152]:
## Split a full university name and return a tuple of the university name and its abbreviation.
## Input: Full university name as received in the dataset
## Output: (university name, uni abbreviation)
## Example: "Allergie- und Asthmaforschung - SIAF"  ===> ("Allergie- und Asthmaforschung", "SIAF")

def split_university_in_name_and_abbrev(uni_full_name):
    split_list = str(uni_full_name).split(' - ')
    if len(split_list) == 1:
        split_list.append('')
    return (split_list[0], split_list[1])  # The first item is the name, the second is the abbreviation
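
As an illustration, here is a self-contained variant of the split with a few example inputs. The helper name `split_name_and_abbrev` is hypothetical, and it uses `rsplit(' - ', 1)` so that a name which itself contains " - " is cut only at the last separator. This is a design choice of the sketch, not the behaviour of the function above.

```python
def split_name_and_abbrev(uni_full_name):
    """Split 'Name - ABBR' at the last ' - '; the abbreviation is '' if absent."""
    parts = str(uni_full_name).rsplit(' - ', 1)
    if len(parts) == 1:
        parts.append('')
    return (parts[0], parts[1])

print(split_name_and_abbrev('Allergie- und Asthmaforschung - SIAF'))
print(split_name_and_abbrev('Name without abbreviation'))
```

The hyphen in "Allergie-" is not followed by " - " with spaces on both sides, so it is left untouched.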

***Function: Find a canton with a university name and the abbreviation:***

In [153]:
## Infer the canton ID from the abbreviation or, failing that, from the university name.
## Returns NaN if no canton, or more than one canton, is found.
def find_canton(uni_name, uni_abrev):
    uni_canton = np.nan
    if uni_abrev in cantons_id:
        # The abbreviation is itself a canton ID (e.g. "GE")
        uni_canton = uni_abrev
    else:
        found_match = False
        for canton_index, canton in enumerate(cantons_name):
            if canton in uni_name:
                if found_match:
                    # Second match: the name is ambiguous, so we report it and give up
                    print(uni_canton, ' - ', canton)
                    uni_canton = np.nan
                else:
                    found_match = True
                    uni_canton = cantons_id[canton_index]
    return uni_canton
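
One limitation of the name matching is that entries such as 'Bern/Berne' or 'Valais/Wallis' only match when the university name contains that exact string, slash included. A possible refinement, which is not used in this notebook, is to test each variant separately. The sketch below uses a hypothetical helper `match_cantons` and a shortened canton table. Plain substring matching can still produce false positives, so the result is a list of candidates rather than a final answer.

```python
# Shortened, illustrative table: canton ID -> name variants separated by '/'.
cantons = {'BE': 'Bern/Berne', 'VS': 'Valais/Wallis', 'TG': 'Thurgau'}

def match_cantons(uni_name, cantons):
    """Return the IDs of all cantons with at least one name variant in uni_name."""
    return [cid for cid, names in cantons.items()
            if any(variant in uni_name for variant in names.split('/'))]

print(match_cantons('Pädagogische Hochschule Wallis', cantons))
print(match_cantons('Berner Fachhochschule', cantons))
```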

In [156]:
name_list = []
abbrev_list = []
canton_list = []

# We iterate over each university full name
for uni in universities_pd.index.values:  
    uni_name, uni_abrev = split_university_in_name_and_abbrev(uni) # Splitting the name
    uni_canton = find_canton(uni_name, uni_abrev) # Getting the canton
    
    ## Appending to a list to put in the dataframe 
    name_list.append(uni_name)
    abbrev_list.append(uni_abrev)
    canton_list.append(uni_canton)

universities_pd['university name'] = name_list
universities_pd['university abbrev'] = abbrev_list
universities_pd['canton'] = canton_list

universities_pd.head()

University,Approved Amount,university name,university abbrev,canton
AO Research Institute - AORI,3435621.0,AO Research Institute,AORI,
Allergie- und Asthmaforschung - SIAF,19169965.0,Allergie- und Asthmaforschung,SIAF,
Berner Fachhochschule - BFH,31028695.0,Berner Fachhochschule,BFH,
Biotechnologie Institut Thurgau - BITG,2492535.0,Biotechnologie Institut Thurgau,BITG,TG
Centre de rech. sur l'environnement alpin - CREALP,1567678.0,Centre de rech. sur l'environnement alpin,CREALP,


<h3>2.2 Finding cantons with geonames</h3>

In [155]:
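## Replace a missing canton by a visible placeholder, to spot the universities still without a canton
def look_for_missing_canton(canton):
    if pd.isnull(canton):
        return 'aaaaa'
    else:
        return canton
test = universities_pd.copy()
test['canton'].apply(look_for_missing_canton).head()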

KeyError: 'canton'

In [95]:
universities = data['University'].unique()

university_names = universities.copy()

In [96]:
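## Map each university name to its abbreviation (the last word of the record)
ids = {}
for record in university_names:
    wds = str(record).split()
    name = ' '.join(wds[:-2])  # Drop the trailing "-" and the abbreviation
    if name:  # Skip records too short to contain a name
        ids[name] = wds[-1]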
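
To make the slicing concrete, here is the same parsing applied to one record from the dataset and to a record without an abbreviation (the string 'nan' stands for a missing university converted with `str`):

```python
record = 'Université de Genève - GE'
words = record.split()
name = ' '.join(words[:-2])   # drops the '-' separator and the abbreviation
abbrev = words[-1]
print(name, '|', abbrev)

# A one-word record leaves an empty name, so the loop above would skip it.
short_name = ' '.join('nan'.split()[:-2])
print(repr(short_name))
```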

In [97]:
import simplejson as json
import urllib.parse
import urllib.request

DOMAIN = 'http://api.geonames.org/'
USERNAME = 'shynkaru'  # geonames username

def geonames_query(method, params):
    uri = DOMAIN + '%s?%s&username=%s' % (method, urllib.parse.urlencode(params), USERNAME)
    resource = urllib.request.urlopen(uri).read()
    return json.loads(resource)

In [20]:
def prepare_params(university, format_type):
    return {
        'q': university,
        'country': 'ch',
        'lang': 'en',
        'featurecode': 'univ',
        'type': format_type,
    }

In [21]:
def try_query(university):
    if not university:
        return None
    
    params = prepare_params(university, 'json')
    response_json = geonames_query('search', params)
    
    if(not response_json['geonames']):
        return None
        #return try_query(' '.join(university.split()[:-1]))
    return response_json

In [22]:
def query_university(university):
    response_json = try_query(university)
    if not response_json:
        return None

    # One tuple per match: (name, canton code, canton name, latitude, longitude)
    return [(place['name'],
             place.get('adminCode1', 'Missing: adminCode1'),
             place['adminName1'],
             place['lat'],
             place['lng']) for place in response_json['geonames']]
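
The extraction step can be tried without a network call. The sketch below runs it on a hand-built response that mirrors the Université de Fribourg record printed in this notebook's output. The surrounding structure and the `totalResultsCount` value are assumptions about the shape of the geonames reply, not a captured response.

```python
# Hand-built stand-in for a geonames search reply (assumed structure).
sample_response = {
    'totalResultsCount': 1,
    'geonames': [{
        'name': 'Fribourg, Université',
        'adminCode1': 'FR',
        'adminName1': 'Fribourg',
        'lat': '46.80683',
        'lng': '7.15317',
    }],
}

# Same tuple layout as query_university: name, canton code, canton name, lat, lng.
rows = [(place['name'],
         place.get('adminCode1', 'Missing: adminCode1'),
         place['adminName1'],
         place['lat'],
         place['lng']) for place in sample_response['geonames']]
print(rows[0])
```

The `adminCode1` field is the one that carries the canton abbreviation we are after.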

In [23]:
def process_data():
    for university, canton in ids.items():
        print(university)
        possible_uni_data = query_university(university)
        if not possible_uni_data:
            print(None, '\n')
            continue
            
        for item in possible_uni_data:
            print(item)
        print('\n')

process_data()

Université de Fribourg
('Fribourg, Université', 'FR', 'Fribourg', '46.80683', '7.15317')


Pädagogische Hochschule Zürich
None 

Staatsunabh. Theologische Hochschule Basel
None 

Inst. de Hautes Etudes Internat. et du Dév
None 

Pädagogische Hochschule Thurgau
None 

Ente Ospedaliero Cantonale
None 

Haute école pédagogique fribourgeoise
None 

Eidg. Hochschulinstitut für Berufsbildung
None 

Facoltà di Teologia di Lugano
None 

Universität Basel
('University of Basel', 'BS', 'Basel-City', '47.55832', '7.58403')
('Universität', 'BS', 'Basel-City', '47.55707', '7.58405')


Interkant. Hochschule für Heilpädagogik ZH
None 

EPF Lausanne
None 

SUP della Svizzera italiana
None 

Pädagogische Hochschule Wallis
None 

Pädagogische Hochschule Nordwestschweiz
None 

Pädagogische Hochschule Luzern
None 

Fachhochschule Kalaidos
None 

Zürcher Fachhochschule (ohne PH)
None 

Inst. universit. romand de Santé au Travail
None 

Swiss Institute of Bioinformatics
None 

HES de Suisse occidentale
None

In [26]:
#Constants

URL = 'https://maps.googleapis.com/maps/api/place'
AUTOCOMPLETE_API = 'autocomplete'
DETAILS_API = 'details'

FORMAT = 'json'
KEY = ''  # ask me for the key

To get geographic data from a university name, we use the Place Autocomplete endpoint:

Request URI: 'https://maps.googleapis.com/maps/api/place/autocomplete/json' (the trailing 'json' selects the response format)

Request parameters:

input = < university_name >
components = country:< ISO 3166-1 Alpha-2 compatible country code >
key = < API key >

E.g.:
https://maps.googleapis.com/maps/api/place/autocomplete/json?input=Universit%C3%A4t+Luzern&components=country:CH&types=administrative_area_level_1&key=
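
A request URI of this kind can be assembled with `urllib.parse.urlencode`, which takes care of escaping the umlaut and the space. The sketch below only builds the string and sends nothing. `YOUR_API_KEY` is a placeholder, and the parameters are limited to `input` and `components`. Note that `urlencode` also escapes the colon in `country:CH` as `%3A`, which is the percent-encoded form of the same value.

```python
from urllib.parse import urlencode

URL = 'https://maps.googleapis.com/maps/api/place'
params = {'input': 'Universität Luzern', 'components': 'country:CH'}

# Path is <api>/<format>, followed by the encoded query string and the key.
uri = URL + '/%s/%s?%s&key=%s' % ('autocomplete', 'json', urlencode(params), 'YOUR_API_KEY')
print(uri)
```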

In [27]:
def google_maps_query(api, format_type, params, key):
    uri = URL + '/%s/%s?%s&key=%s' % (api, format_type, urllib.parse.urlencode(params), key)
    resource = urllib.request.urlopen(uri).read()
    return json.loads(resource)

In [28]:
def prepare_gm_uni_params(university):
    return {'input': university, 'components' : 'country:CH'}

def prepare_gm_placeid_params(place_id):
    return {'placeid': place_id, 'result_type':'administrative_area_level_1'}

In [29]:
def query_uni_gm(university, format_type, key):
    if not university:
        return None
    
    params = prepare_gm_uni_params(university)
    return google_maps_query(AUTOCOMPLETE_API, format_type, params, key)

def query_placeid_gm(place_id, format_type, key):
    if not place_id:
        return None
    
    params = prepare_gm_placeid_params(place_id)
    return google_maps_query(DETAILS_API, format_type, params, key)

In [30]:
def get_canton(university):
    uni_data = query_uni_gm(university, FORMAT, KEY)
    print(uni_data)
    if 'ZERO_RESULTS' == uni_data['status']:
        return None
       
    # Take the first (best-ranked) prediction and look up its details
    place_id = uni_data['predictions'][0]['place_id']
    place_id_data = query_placeid_gm(place_id, FORMAT, KEY)
    print(place_id_data)
    #'types':[ 'administrative_area_level_1', 'political']
    for node in place_id_data['result']['address_components']:
        types_node = node['types']
        if 'administrative_area_level_1' in types_node and 'political' in types_node:
            return (node['long_name'], node['short_name'])
    
    return None

In [31]:
def process_uni_data():
    i = 0
    for university, canton in ids.items():
        print(university)
        canton_data = get_canton(university)
        print(university, canton_data)
        if canton_data is not None:
            i+=1
    print(i)

process_uni_data()

Université de Fribourg
{'predictions': [], 'status': 'REQUEST_DENIED', 'error_message': 'This service requires an API key.'}


IndexError: list index out of range
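
The `IndexError` above comes from indexing `predictions[0]` when the response has status `REQUEST_DENIED` and an empty `predictions` list: only `ZERO_RESULTS` is handled. A minimal sketch of a more defensive lookup follows. The helper name `first_place_id` is hypothetical and not part of the notebook.

```python
def first_place_id(autocomplete_response):
    """Return the place_id of the first prediction, or None if there is none."""
    # Any status other than 'OK' (ZERO_RESULTS, REQUEST_DENIED, ...) means no usable result.
    if autocomplete_response.get('status') != 'OK':
        return None
    predictions = autocomplete_response.get('predictions', [])
    return predictions[0]['place_id'] if predictions else None

# The response printed above, reproduced as a dictionary.
denied = {'predictions': [], 'status': 'REQUEST_DENIED',
          'error_message': 'This service requires an API key.'}
print(first_place_id(denied))
```

With a check like this, a missing or invalid key leaves the university without a canton instead of stopping the whole loop.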