# HW3 Interactive Viz
#### Background
In this homework we will practice with interactive visualization, which is the key ingredient of many successful viz (especially when it comes to infographics).
You will be working with the P3 database of the [SNSF](http://www.snf.ch/en/Pages/default.aspx) (Swiss National Science Foundation).
As you can see from their [entry page](http://p3.snf.ch/), P3 already offers some ready-made viz, but we want to build a more advanced one for the sake
of quick data exploration. Therefore, start by [downloading the raw data](http://p3.snf.ch/Pages/DataAndDocumentation.aspx) (just for the Grant Export), and read carefully
the documentation to understand the schema. Install then [Folium](https://github.com/python-visualization/folium) to deal with geographical data (*HINT*: it is not
available in your standard Anaconda environment, therefore search on the Web how to install it easily!) The README file of Folium comes with very clear examples, and links 
to their own iPython Notebooks -- make good use of this information. For your own convenience, in this same directory you can already find a TopoJSON file with the 
geo-coordinates of each Swiss canton (which can be used as an overlay on the Folium maps).


#### Assignment
1. Build a [Choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) which shows intuitively (i.e., use colors wisely) how much grant money goes to each Swiss canton.
To do so, you will need to use the provided TopoJSON file, combined with the Choropleth map example you can find in the Folium README file.

*HINT*: the P3 database is formed by entries which assign a grant (and its approved amount) to a University name. Therefore you will need a smart strategy to go from University
to Canton name. The [Geonames Full Text Search API in JSON](http://www.geonames.org/export/web-services.html) can help you with this -- try to use it as much as possible
to build the canton mappings that you need. For those universities for which you cannot find a mapping via the API, you are then allowed to build it manually -- feel free to stop 
by the time you mapped the top-95% of the universities. I also recommend you to use an intermediate viz step for debugging purposes, showing all the universties as markers in your map (e.g., if you don't select the right results from the Geonames API, some of your markers might be placed on nearby countries...)

2. *BONUS*: using the map you have just built, and the geographical information contained in it, could you give a *rough estimate* of the difference in research funding
between the areas divided by the [Röstigraben](https://en.wikipedia.org/wiki/R%C3%B6stigraben)?

*HINT*: for those cantons cut through by the Röstigraben, [this viz](http://p3.snf.ch/Default.aspx?id=allcharts) can be helpful!


## Data exploration and cleaning
We will: 
- load the data
- look at it
- remove the columns we don't need
- handle NaN values (ie. remove them)
- aggregate grants per institution

At the end of that phase we want a DataFrame indicating the total amount of grants each institution has received based on the given data.

In [1]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import show

%matplotlib inline
sns.set_context('notebook')
pd.options.mode.chained_assignment = None  # default='warn'


Add a function to control the writes to csv file.

In [2]:
# set to true if you want to write the data to csv
do_persist = True

In [3]:
def write_to_csv(df, file_name):
    if do_persist:
        df.to_csv(file_name)
        print('done writing')
        return True
    else:
        return False

Load the data and check the types of the columns

In [4]:
data = pd.read_csv('data/GrantExport.csv', delimiter=';')
data.dtypes

﻿"Project Number"                int64
Project Title                   object
Project Title English           object
Responsible Applicant           object
Funding Instrument              object
Funding Instrument Hierarchy    object
Institution                     object
University                      object
Discipline Number                int64
Discipline Name                 object
Discipline Name Hierarchy       object
Start Date                      object
End Date                        object
Approved Amount                 object
Keywords                        object
dtype: object

Look at the DF

In [5]:
data.head()

Unnamed: 0,"﻿""Project Number""",Project Title,Project Title English,Responsible Applicant,Funding Instrument,Funding Instrument Hierarchy,Institution,University,Discipline Number,Discipline Name,Discipline Name Hierarchy,Start Date,End Date,Approved Amount,Keywords
0,1,Schlussband (Bd. VI) der Jacob Burckhardt-Biog...,,Kaegi Werner,Project funding (Div. I-III),Project funding,,Nicht zuteilbar - NA,10302,Swiss history,Human and Social Sciences;Theology & religious...,01.10.1975,30.09.1976,11619.0,
1,4,Batterie de tests à l'usage des enseignants po...,,Massarenti Léonard,Project funding (Div. I-III),Project funding,Faculté de Psychologie et des Sciences de l'Ed...,Université de Genève - GE,10104,Educational science and Pedagogy,"Human and Social Sciences;Psychology, educatio...",01.10.1975,30.09.1976,41022.0,
2,5,"Kritische Erstausgabe der ""Evidentiae contra D...",,Kommission für das Corpus philosophorum medii ...,Project funding (Div. I-III),Project funding,Kommission für das Corpus philosophorum medii ...,"NPO (Biblioth., Museen, Verwalt.) - NPO",10101,Philosophy,Human and Social Sciences;Linguistics and lite...,01.03.1976,28.02.1985,79732.0,
3,6,Katalog der datierten Handschriften in der Sch...,,Burckhardt Max,Project funding (Div. I-III),Project funding,Abt. Handschriften und Alte Drucke Bibliothek ...,Universität Basel - BS,10302,Swiss history,Human and Social Sciences;Theology & religious...,01.10.1975,30.09.1976,52627.0,
4,7,Wissenschaftliche Mitarbeit am Thesaurus Lingu...,,Schweiz. Thesauruskommission,Project funding (Div. I-III),Project funding,Schweiz. Thesauruskommission,"NPO (Biblioth., Museen, Verwalt.) - NPO",10303,Ancient history and Classical studies,Human and Social Sciences;Theology & religious...,01.01.1976,30.04.1978,120042.0,


We only need the columns 'University' and 'Approved Amount'.
All other columns are not relevant for this homework.

In [6]:
# take only the relevant cols and give them nicer names
grants = data[['University', 'Approved Amount']]
grants.rename(columns={'University': 'university', 'Approved Amount' : 'amount'}, inplace=True)
grants.dtypes

university    object
amount        object
dtype: object

If one of the two values is NaN, we can't use the entry, a grant without institution is as meanningless as a institution without grant. -> We drop it.
Note that there are entries with 'Nicht zuteilbar - NA' which are essentially also NaN values.

We drop almost 25% of all entries, which seems a lot, but what else can we do with uncomplete data?

In [7]:
nbr_entries = len(grants)
grants = grants.replace(to_replace='Nicht zuteilbar - NA', value=np.nan)
grants = grants.dropna()
print('Dropped '+str((100/nbr_entries) * (nbr_entries - len(grants)))+'% of all entries.')

Dropped 24.349294189372976% of all entries.


Then make the 'amount' column numeric in order to make the utilisation easier.

In [8]:
grants['amount'] = pd.to_numeric(grants.amount, errors='coerce')

And finally group by the institutions and sum the grants.
We show the sorted dataframe
It is interesting that neither ETHZ nor EPFL is has received the most grants. 

In [9]:
universities = grants.groupby(by='university', axis=0, as_index=False).sum()
universities.sort_values('amount', ascending=False)

Unnamed: 0,university,amount
70,Université de Genève - GE,1.838237e+09
68,Universität Zürich - ZH,1.826843e+09
6,ETH Zürich - ETHZ,1.635597e+09
65,Universität Bern - BE,1.519373e+09
64,Universität Basel - BS,1.352251e+09
71,Université de Lausanne - LA,1.183291e+09
5,EPF Lausanne - EPFL,1.175316e+09
69,Université de Fribourg - FR,4.575262e+08
72,Université de Neuchâtel - NE,3.832046e+08
39,"NPO (Biblioth., Museen, Verwalt.) - NPO",3.341306e+08


## Map the institutions to cantons
In this section we will map each institution to the canton it belongs to with following steps:
1. split the institution name into a 'university_name' and a 'abbreviation' part
2. query the google places api with the university name (stripped of some special characters) and the keyword 'switzerland'
3. then take the coorinated returned by google and use the geonames api to find the canton of the coordinates
4. for some reason no canton is found (google or geonames did not find any match) we look if the 'university_name' contains a canton name, if so we map it to that canton.

With that method we can map 61 out of the 76 institutions.
The rest we mapped by hand (about 8). 

Some institutions are present in several cantons, we just split the grants of those equally among the different cantons.

Note that there are several 'institutions' that can't be mapped such as 'Weitere Spitäler' (engl: 'other hospitals'). But they only account for ~4% of all grants, which is low enough for us to ignore.


Split the name:

In [10]:
delim = ' - '
universities['university_name'] = [fn.split(delim)[0].strip() for fn in universities['university'].values]
universities['abbrev'] = [fn.split(delim)[1].strip() if len(fn.split(delim)) > 1 else np.nan for fn in universities['university'].values]
universities.set_index('university', inplace=True)
universities.head(1)

Unnamed: 0_level_0,amount,university_name,abbrev
university,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AO Research Institute - AORI,3435621.0,AO Research Institute,AORI


The functions for using the APIs are situated in the file 'map_universities.py', so we run it.

In [11]:
# run the python file that defines the functions to access the api's
%run map_universities.py

Then match each institution:

In [12]:
# create a new column with the canton in it.
def canton_for_university_query(uni):
    # create the query
    q = str(uni.university_name) + ' ' + str(uni.abbrev) + ' Switzerland' 
    # remove some characters in the query
    to_remove = ['(', ')', ',', '.', '-', '+', '&']
    [q.replace(ch, ' ') for ch in to_remove]
    # execute the query
    return canton_for_university(q)

universities['canton'] = universities.apply(canton_for_university_query, axis=1)

***************************************
query: AO Research Institute AORI Switzerland
{'html_attributions': [], 'status': 'ZERO_RESULTS', 'results': []}
***************************************
query: Allergie- und Asthmaforschung SIAF Switzerland
{'html_attributions': [], 'status': 'OK', 'results': [{'formatted_address': 'Obere Str. 22, 7270 Davos Platz, Switzerland', 'place_id': 'ChIJL3RyxgGkhEcR24v33tCoi5A', 'reference': 'CmRSAAAAspUCvpFux3h8isDzNU5tRrJNQuGt_lS4FHVHOmHMhRG93--m79lSFZFPH7sISMQF395kq5rAOSBgARLz3r8l2XvMbbFKVyuTVP-22jU0MOpB7R0rcQDEDgjLvNak2RuhEhBNaBTKEsFN3qtckGByI80dGhR8dnaB1svik7nPuBSjkI2NmIbXTA', 'geometry': {'location': {'lng': 9.8200409, 'lat': 46.7954192}, 'viewport': {'northeast': {'lng': 9.82028165, 'lat': 46.79571619999999}, 'southwest': {'lng': 9.81931865, 'lat': 46.79532020000001}}}, 'name': 'Schweiz. Institut f. Allergie- u. Asthmaforschung', 'id': 'ea435d78b7507556ed24683ca26f040f93a8d840', 'icon': 'https://maps.gstatic.com/mapfiles/place_api/icons/generic_bu



***************************************
query: Ente Ospedaliero Cantonale EOC Switzerland
{'html_attributions': [], 'status': 'OK', 'results': [{'formatted_address': 'Viale Officina 3, Bellinzona, Switzerland', 'photos': [{'html_attributions': ['<a href="https://maps.google.com/maps/contrib/113223259282903616271/photos">Daniel Vasile Tishchenko</a>'], 'width': 3200, 'photo_reference': 'CoQBdwAAAIxVouxxncUsoqSIncohWEFpjWLA9SXYBwINexXlualBw07_0mWMvhFDS5-vFsIxbZiWM8TzT9UtrnFjF5GVwM07PdMaskiFnHYRkOno4DPhBUVB1j602a8Qw-ZGvZqTKd-HMC6wdjCY6jfzRieI_UW8r-i98AYI2RR9fhDoViu-EhDTwAPNz9ZS5vtBXdXnsp-uGhSiNCNUNhOUcQtoQKmA6YvOG85sPg', 'height': 2106}], 'reference': 'CmRSAAAAiUTCMvwC5h9u-_gnXIayI1jZUVy_GjIy8r32Vf_Ig5ZkPG95KDwNSDO4dngVu_s5XpU1fd8dgBkl92UI8hJbiKQKAzUhcUyBoHIXYhmH9KeJhTTLxGuQfhVG2-3C_xRoEhB9wLCeR8YT2pmmeDd7w_4CGhSCmqJFKChF7qcoAIHAGF2HLpKajw', 'geometry': {'location': {'lng': 9.026636499999999, 'lat': 46.1968913}, 'viewport': {'northeast': {'lng': 9.026905900000001, 'lat': 46.19694885}, 'so



***************************************
query: Pädagogische Hochschule Schaffhausen PHSH Switzerland
{'html_attributions': [], 'status': 'OK', 'results': [{'formatted_address': 'Ebnatstrasse 80, 8200 Schaffhausen, Switzerland', 'place_id': 'ChIJaz2FyteBmkcRVaBtCVv5Zqg', 'reference': 'CmRSAAAAjWXI_CODqka1fj-Og-nmtulZItIs1D9797aHpm76MOZXv1zWcnNhBaOZ87QFpXUDZPJBomNR7b3DDr5101tGpjZK-_wJIpw3FLVHbGA8H3ahHVT0tly3f_oors-158ChEhClibiWSdztTRHhCLJrPQU6GhR931ea_OVg_87TdrQRHPFDkmCzOQ', 'geometry': {'location': {'lng': 8.645286199999997, 'lat': 47.7073984}, 'viewport': {'northeast': {'lng': 8.645387299999998, 'lat': 47.70779065}, 'southwest': {'lng': 8.644982900000002, 'lat': 47.70726765000001}}}, 'name': 'Pädagogische Hochschule Schaffhausen (PHSH)', 'id': '2dd708f308987b7b0155b763b1f39ee53e03cdc0', 'icon': 'https://maps.gstatic.com/mapfiles/place_api/icons/school-71.png', 'types': ['point_of_interest', 'establishment']}]}
***************************************
query: Pädagogische Hochschule Schwy

How many did we match?

In [13]:
len(universities[~pd.isnull(universities['canton'])])

61

Show the ones we did not match:

In [14]:
universities[pd.isnull(universities['canton'])]

Unnamed: 0_level_0,amount,university_name,abbrev,canton
university,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AO Research Institute - AORI,3435621.0,AO Research Institute,AORI,
Eidg. Material und Prüfungsanstalt - EMPA,58574520.0,Eidg. Material und Prüfungsanstalt,EMPA,
Firmen/Privatwirtschaft - FP,111686700.0,Firmen/Privatwirtschaft,FP,
Forschungsanstalten Agroscope - AGS,33115720.0,Forschungsanstalten Agroscope,AGS,
Forschungskommission SAGW,100000.0,Forschungskommission SAGW,,
Istituto Svizzero di Roma - ISR,141000.0,Istituto Svizzero di Roma,ISR,
"NPO (Biblioth., Museen, Verwalt.) - NPO",334130600.0,"NPO (Biblioth., Museen, Verwalt.)",NPO,
Physikal.-Meteorolog. Observatorium Davos - PMOD,12098440.0,Physikal.-Meteorolog. Observatorium Davos,PMOD,
Pädagogische Hochschule Nordwestschweiz - PHFHNW,3476142.0,Pädagogische Hochschule Nordwestschweiz,PHFHNW,
Schweizer Kompetenzzentrum Sozialwissensch. - FORS,34735820.0,Schweizer Kompetenzzentrum Sozialwissensch.,FORS,


#### Map by hand (using google & wikipedia):
We found following informations for the unmapped institutions
- Schweizer Kompetenzzentrum Sozialwissensch. -> lausanne -> VD
- Weitere Institute -> translates to 'other institutes' -> nan
- Forschungsanstalten Agroscope -> not in one place -> nan
- Haute école pédagogique BE, JU, NE -> situated in JU but belongs to BE, JU & NE -> JU or 1/3 for each?
- Swiss Institute of Bioinformatics -> all over the place -> nan
- Firmen/Privatwirtschaft -> similar to 'other institutions' -> nan
- Forschungsinstitut für Opthalmologie -> in Sitten -> VS
- Eidg. Forschungsanstalt für Wald,Schnee,Land -> all over the place -> nan
- Istituto Svizzero di Roma -> in ROM (italy) -> nan
- Pädag. Hochschule Tessin (Teilschule SUPSI) -> TI
- Pädagogische Hochschule Nordwestschweiz -> office in Windisch -> AG
- Physikal.-Meteorolog. Observatorium Davos -> GR
- Instituto Ricerche Solari Locarno -> TI
- Staatsunabh. Theologische Hochschule Basel -> BS
- Fachhochschule Nordwestschweiz (ohne PH) -> same as 'Pädagogische Hochschule Nordwestschweiz' -> AG
- Forschungskommission SAGW -> found nothing (does it still exist?) -> nan
- NPO (Biblioth., Museen, Verwalt.) -> several institutions -> nan
- Swiss Center for Electronics and Microtech. -> Neuchâtel -> NE
- Eidg. Material und Prüfungsanstalt EMPA -> in 3 cantons (BE, ZH, SG) -> nan
- Weitere Spitäler -> several hospitals -> nan
- 'AO Research Institute - AORI' -> Davos -> GR
- Zürcher Fachhochschule (ohne PH) - ZFH -> ZH

Map them:

In [15]:
# do the manual mapping
manual_map = {
        'Schweizer Kompetenzzentrum Sozialwissensch. - FORS' : 'VD',
        'Pädag. Hochschule Tessin (Teilschule SUPSI) - ASP' : 'TI',
        'Pädagogische Hochschule Nordwestschweiz - PHFHNW' : 'AG',
        'Physikal.-Meteorolog. Observatorium Davos - PMOD' : 'GR',
        'Instituto Ricerche Solari Locarno - IRSOL' : 'TI',
        'Staatsunabh. Theologische Hochschule Basel - STHB' : 'BS',
        'AO Research Institute - AORI' : 'GR',
        'Zürcher Fachhochschule (ohne PH) - ZFH' : 'ZH'
    }
for uni_index, ctn in manual_map.items():
    if pd.isnull(universities.at[uni_index, 'canton']):
        universities.set_value(uni_index, 'canton', ctn)
universities[pd.isnull(universities['canton'])]

Unnamed: 0_level_0,amount,university_name,abbrev,canton
university,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Eidg. Material und Prüfungsanstalt - EMPA,58574520.0,Eidg. Material und Prüfungsanstalt,EMPA,
Firmen/Privatwirtschaft - FP,111686700.0,Firmen/Privatwirtschaft,FP,
Forschungsanstalten Agroscope - AGS,33115720.0,Forschungsanstalten Agroscope,AGS,
Forschungskommission SAGW,100000.0,Forschungskommission SAGW,,
Istituto Svizzero di Roma - ISR,141000.0,Istituto Svizzero di Roma,ISR,
"NPO (Biblioth., Museen, Verwalt.) - NPO",334130600.0,"NPO (Biblioth., Museen, Verwalt.)",NPO,
Swiss Institute of Bioinformatics - SIB,11583220.0,Swiss Institute of Bioinformatics,SIB,
Weitere Institute - FINST,9256736.0,Weitere Institute,FINST,
Weitere Spitäler - ASPIT,10749810.0,Weitere Spitäler,ASPIT,


In [16]:
canton_grants = universities.groupby(by='canton', axis=0).sum()
canton_grants.sort_values('amount', ascending=False)

Unnamed: 0_level_0,amount
canton,Unnamed: 1_level_1
ZH,3642140000.0
VD,2401656000.0
GE,1877102000.0
BE,1555048000.0
BS,1392498000.0
FR,459073700.0
NE,401897600.0
AG,126187500.0
TI,115262300.0
SG,91194100.0


The institution EMPA has presence in 3 cantons: BE, ZH, SG. So we split the grants for EMPA and add it to the 3 cantons (1/3 for each)

In [17]:
grants_empa = universities.at['Eidg. Material und Prüfungsanstalt - EMPA', 'amount']
grants_empa_third = grants_empa / 3
empa_cantons = ['BE', 'ZH', 'SG']
for c in empa_cantons:
    canton_grants = canton_grants.set_value(c, 'amount', canton_grants.at[c, 'amount'] + grants_empa_third)

##### Some numbers on how many we matched.

how many institutions did we match?

In [18]:
# the 1+len(...) accounts for the mapping of EMPA
print(str(round((100/ len(universities) ) * (1+len(universities[~pd.isnull(universities['canton'])]))) )+ '%')

89%


which is how many % of all grants?

In [19]:
total_grants = universities.amount.sum()
matched_grants = canton_grants.amount.sum()
matched_percent = (100/total_grants) * matched_grants
print(str(round(matched_percent, 2) )+ '%')

96.02%


Finally write the canton grants to a csv file

In [20]:
write_to_csv(canton_grants, 'all_canton_grants.csv')

done writing


True

In [21]:
pd.set_option('display.max_rows', None)
universities[['university_name', 'canton']]

Unnamed: 0_level_0,university_name,canton
university,Unnamed: 1_level_1,Unnamed: 2_level_1
AO Research Institute - AORI,AO Research Institute,GR
Allergie- und Asthmaforschung - SIAF,Allergie- und Asthmaforschung,GR
Berner Fachhochschule - BFH,Berner Fachhochschule,BE
Biotechnologie Institut Thurgau - BITG,Biotechnologie Institut Thurgau,TG
Centre de rech. sur l'environnement alpin - CREALP,Centre de rech. sur l'environnement alpin,VS
EPF Lausanne - EPFL,EPF Lausanne,VD
ETH Zürich - ETHZ,ETH Zürich,ZH
Eidg. Anstalt für Wasserversorgung - EAWAG,Eidg. Anstalt für Wasserversorgung,ZH
"Eidg. Forschungsanstalt für Wald,Schnee,Land - WSL","Eidg. Forschungsanstalt für Wald,Schnee,Land",ZH
Eidg. Hochschulinstitut für Berufsbildung - EHB,Eidg. Hochschulinstitut für Berufsbildung,BE
