# 03 - Interactive Viz

## Deadline
Friday October 28, 2016 at 11:59PM

## Important Notes
* Make sure you push your Notebook to GitHub with all the cells already evaluated
* Don't forget to add a textual description of your thought process, the assumptions you made, and the solution
you plan to implement!
* Please write all your comments in English, and use meaningful variable names in your code

## Background
In this homework we will practice with interactive visualization, which is the key ingredient of many successful viz (especially when it comes to infographics).
You will be working with the P3 database of the [SNSF](http://www.snf.ch/en/Pages/default.aspx) (Swiss National Science Foundation).
As you can see from their [entry page](http://p3.snf.ch/), P3 already offers some ready-made viz, but we want to build a more advanced one for the sake
of quick data exploration. Therefore, start by [downloading the raw data](http://p3.snf.ch/Pages/DataAndDocumentation.aspx) (just for the Grant Export), and carefully read
the documentation to understand the schema. Then install [Folium](https://github.com/python-visualization/folium) to deal with geographical data (*HINT*: it is not
available in your standard Anaconda environment, so search the Web for an easy way to install it!) The README file of Folium comes with very clear examples, and links
to their own IPython Notebooks -- make good use of this information. For your own convenience, in this same directory you can already find a TopoJSON file with the
geo-coordinates of each Swiss canton (which can be used as an overlay on the Folium maps).


## Assignment
1. Build a [Choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) which shows intuitively (i.e., use colors wisely) how much grant money goes to each Swiss canton.
To do so, you will need to use the provided TopoJSON file, combined with the Choropleth map example you can find in the Folium README file.

*HINT*: the P3 database consists of entries that assign a grant (and its approved amount) to a University name. Therefore you will need a smart strategy to go from University
to Canton name. The [Geonames Full Text Search API in JSON](http://www.geonames.org/export/web-services.html) can help you with this -- try to use it as much as possible
to build the canton mappings that you need. For those universities for which you cannot find a mapping via the API, you are then allowed to build it manually -- feel free to stop
once you have mapped the top 95% of the universities. I also recommend using an intermediate viz step for debugging purposes, showing all the universities as markers on your map (e.g., if you don't select the right results from the Geonames API, some of your markers might be placed in nearby countries...)

2. *BONUS*: using the map you have just built, and the geographical information contained in it, could you give a *rough estimate* of the difference in research funding
between the areas divided by the [Röstigraben](https://en.wikipedia.org/wiki/R%C3%B6stigraben)?

*HINT*: for those cantons cut through by the Röstigraben, [this viz](http://p3.snf.ch/Default.aspx?id=allcharts) can be helpful!


In [1]:
from bs4 import BeautifulSoup
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
sns.set_context('notebook')
import re
import requests as rq
import json

In [19]:
cantons_json_path = "data/ch-cantons.topojson.json"
# This file holds the grant export, not canton data, hence the name
grants_csv_path = "data/P3_GrantExport.csv"

data = pd.read_csv(grants_csv_path, sep=';')

### Drop columns we don't need

In [20]:
data = data.drop([ "Project Title", "Project Title English", "Responsible Applicant", "Funding Instrument", "Funding Instrument Hierarchy", "Discipline Number", "Discipline Name", "Discipline Name Hierarchy", "Start Date", "End Date", "Keywords"], axis=1)

In [21]:
data.head()

,Project Number,Institution,University,Approved Amount
0,1,,Nicht zuteilbar - NA,11619.0
1,4,Faculté de Psychologie et des Sciences de l'Ed...,Université de Genève - GE,41022.0
2,5,Kommission für das Corpus philosophorum medii ...,"NPO (Biblioth., Museen, Verwalt.) - NPO",79732.0
3,6,Abt. Handschriften und Alte Drucke Bibliothek ...,Universität Basel - BS,52627.0
4,7,Schweiz. Thesauruskommission,"NPO (Biblioth., Museen, Verwalt.) - NPO",120042.0


### Drop rows where the amount is missing or the University is not declared

In [22]:
data = data[data["Approved Amount"]!="data not included in P3"]
data = data[data["University"]!="Nicht zuteilbar - NA"]
# Real missing values are NaN floats, not the string "NaN", so use dropna here
data = data.dropna(subset=["University"])
data.head()

,Project Number,Institution,University,Approved Amount
1,4,Faculté de Psychologie et des Sciences de l'Ed...,Université de Genève - GE,41022.0
2,5,Kommission für das Corpus philosophorum medii ...,"NPO (Biblioth., Museen, Verwalt.) - NPO",79732.0
3,6,Abt. Handschriften und Alte Drucke Bibliothek ...,Universität Basel - BS,52627.0
4,7,Schweiz. Thesauruskommission,"NPO (Biblioth., Museen, Verwalt.) - NPO",120042.0
5,8,"Séminaire de politique économique, d'économie ...",Université de Fribourg - FR,53009.0


#### Reset index

In [23]:
data = data.reset_index(drop=True)
data

,Project Number,Institution,University,Approved Amount
0,4,Faculté de Psychologie et des Sciences de l'Ed...,Université de Genève - GE,41022.00
1,5,Kommission für das Corpus philosophorum medii ...,"NPO (Biblioth., Museen, Verwalt.) - NPO",79732.00
2,6,Abt. Handschriften und Alte Drucke Bibliothek ...,Universität Basel - BS,52627.00
3,7,Schweiz. Thesauruskommission,"NPO (Biblioth., Museen, Verwalt.) - NPO",120042.00
4,8,"Séminaire de politique économique, d'économie ...",Université de Fribourg - FR,53009.00
5,9,Institut für ökumenische Studien Université de...,Université de Fribourg - FR,25403.00
6,10,Ostasiatisches Seminar Universität Zürich,Universität Zürich - ZH,47100.00
7,11,,Université de Lausanne - LA,25814.00
8,13,Laboratoire de Didactique et Epistémologie des...,Université de Genève - GE,360000.00
9,14,Klinische Psychologie und Psychotherapie Insti...,Université de Fribourg - FR,153886.00


#### Find unique university names

In [26]:
universities = data.University.unique()
#universities.tolist
universities

array(['Université de Genève - GE',
       'NPO (Biblioth., Museen, Verwalt.) - NPO', 'Universität Basel - BS',
       'Université de Fribourg - FR', 'Universität Zürich - ZH',
       'Université de Lausanne - LA', 'Universität Bern - BE',
       'Eidg. Forschungsanstalt für Wald,Schnee,Land - WSL',
       'Université de Neuchâtel - NE', 'ETH Zürich - ETHZ',
       'Inst. de Hautes Etudes Internat. et du Dév - IHEID',
       'Universität St. Gallen - SG', 'Weitere Institute - FINST',
       'Firmen/Privatwirtschaft - FP',
       'Pädagogische Hochschule Graubünden - PHGR', 'EPF Lausanne - EPFL',
       'Pädagogische Hochschule Zürich - PHZFH', 'Universität Luzern - LU',
       'Schweiz. Institut für Kunstwissenschaft - SIK-ISEA',
       'SUP della Svizzera italiana - SUPSI',
       'HES de Suisse occidentale - HES-SO',
       'Robert Walser-Stiftung Bern - RWS', 'Paul Scherrer Institut - PSI',
       'Pädagogische Hochschule St. Gallen - PHSG',
       'Eidg. Anstalt für Wasserversorgun

#### Create a new dataframe with the universities and their canton's code (empty for the moment)

In [None]:
df = pd.DataFrame(universities, columns=["Université"])
df["Canton"]=""

### How to find the cantons for each university

We first noticed that some universities are labelled with their canton code (e.g. "Université de Genève - GE"). For those, we simply extract the canton code and store it in the `Canton` column.

Other universities carry no canton code in their name, so for them we fall back on GeoNames.

### Extract canton's code directly from University names given in data

First of all, we need to know which codes denote a canton. We can find them in the TopoJSON file.

We parse the JSON to retrieve `id` (the canton code) and `properties.name` (the canton name).

In [None]:
# Note: the value after the '-' in a university name can be extracted with
# re.findall(r'- (\w+)', data['University'][index])[0]

# Read the TopoJSON file; the with-block closes the file once it is parsed
with open(cantons_json_path, 'r', encoding='utf-8') as input_file:
    json_decode = json.load(input_file)

# Collect the id (canton code) and the name of every canton geometry
canton_ids = []
canton_names = []
for geometry in json_decode['objects']['cantons']['geometries']:
    canton_ids.append(geometry['id'])
    canton_names.append(geometry['properties']['name'])
print(canton_ids)
print(canton_names)

For cantons with two names (French/German/Italian), let's separate them

In [None]:
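# Build a new list rather than removing items from the list being iterated over,
# which would skip the element that follows each removal
split_names = []
for canton_name in canton_names:
    # 'A/B' becomes ['A', 'B']; names without '/' are kept unchanged
    split_names.extend(canton_name.split('/'))
canton_names = split_names
print(canton_names)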

Now we go through the data and check whether the name of each university contains one of the canton ids or names. If it does, we assume the university is located in that canton.

TODO: implement this lookup.
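
The lookup described above could be sketched as follows. This is only a minimal illustration, not the final implementation: `guess_canton` is a hypothetical helper, and the `canton_ids` and `canton_name_to_id` values are a small hand-written sample standing in for what is read from the TopoJSON file (the exact names in that file may differ).

```python
import re

# Hypothetical sample; in the notebook these would come from the TopoJSON file.
canton_ids = ["GE", "BS", "FR", "ZH", "BE", "SG"]
canton_name_to_id = {
    "Genève": "GE",
    "Basel": "BS",
    "Fribourg": "FR",
    "Zürich": "ZH",
    "Bern": "BE",
    "St. Gallen": "SG",
}


def guess_canton(university, canton_ids, canton_name_to_id):
    """Return a canton code guessed from the university label, or None."""
    # 1) Try the suffix after the last " - ", e.g. "... - GE" gives "GE".
    suffix = university.rsplit(" - ", 1)[-1].strip()
    if suffix in canton_ids:
        return suffix
    # 2) Otherwise look for a canton name appearing as a whole word.
    for name, code in canton_name_to_id.items():
        if re.search(r"\b" + re.escape(name) + r"\b", university):
            return code
    # No match: this university needs GeoNames or a manual mapping.
    return None


for label in ["Université de Genève - GE", "ETH Zürich - ETHZ", "Paul Scherrer Institut - PSI"]:
    print(label, "->", guess_canton(label, canton_ids, canton_name_to_id))
```

Checking the suffix first keeps the explicit code when there is one. The name search is only a fallback and can be wrong when an institution is named after a place it is not located in, so its results are worth reviewing by hand.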

### Use GeoNames to find cantons of University

In [None]:
for placename in universities:
    # Restrict the search to Switzerland so that homonyms abroad are ignored
    r = rq.get('http://api.geonames.org/postalCodeSearch',
               params={'placename': placename, 'country': 'CH', 'username': 'almil36'})
    soup = BeautifulSoup(r.text, 'lxml-xml')
    admin_code = soup.find('adminCode1')
    if admin_code is not None:
        # Store the canton code of the first result in the row of this university
        df.loc[df["Université"] == placename, "Canton"] = admin_code.text
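
As an alternative to the postal code search, the assignment hint points at the GeoNames full text search in JSON. The sketch below shows one way this could look, using only the standard library. `search_geonames` and `first_swiss_admin_code` are hypothetical helper names, the user name has to be your own GeoNames account, and `sample` is a hand-written payload shaped like a search response, not real API output.

```python
import json
import urllib.parse
import urllib.request

GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"


def search_geonames(query, username, max_rows=5):
    """Query the full text search endpoint, restricted to Switzerland."""
    params = urllib.parse.urlencode(
        {"q": query, "country": "CH", "maxRows": max_rows, "username": username}
    )
    with urllib.request.urlopen(GEONAMES_SEARCH_URL + "?" + params) as response:
        return json.load(response)


def first_swiss_admin_code(payload):
    """Return adminCode1 of the first Swiss hit in a search payload, else None."""
    for hit in payload.get("geonames", []):
        if hit.get("countryCode") == "CH" and hit.get("adminCode1"):
            return hit["adminCode1"]
    return None


# Hand-written sample (assumption), used here so the example runs offline.
sample = {
    "geonames": [
        {"name": "Geneva", "countryCode": "US", "adminCode1": "NY"},
        {"name": "Genève", "countryCode": "CH", "adminCode1": "GE"},
    ]
}
print(first_swiss_admin_code(sample))
```

Filtering on `countryCode` a second time is deliberate: it guards against the marker problem mentioned in the assignment, where a wrong result places a university in a nearby country.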

### Create new dataframe with amount and cantons

In [None]:
#TODO
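
One possible way to fill in this step, assuming a university-to-canton mapping is already available: map each grant to a canton, then sum the approved amounts per canton. The `university_to_canton` dictionary and the `grants` rows below are a small hypothetical sample taken from the table shown earlier. The `Cantons` column name is chosen to match the columns passed to the choropleth call.

```python
import pandas as pd

# Hypothetical mapping; in the notebook it would come from the df built above.
university_to_canton = {
    "Université de Genève - GE": "GE",
    "Universität Basel - BS": "BS",
    "Université de Fribourg - FR": "FR",
}

# A few rows copied from the grant table, with amounts as strings on purpose.
grants = pd.DataFrame(
    {
        "University": [
            "Université de Genève - GE",
            "NPO (Biblioth., Museen, Verwalt.) - NPO",
            "Universität Basel - BS",
            "Université de Fribourg - FR",
            "Université de Fribourg - FR",
            "Université de Genève - GE",
        ],
        "Approved Amount": ["41022.00", "79732.00", "52627.00", "53009.00", "25403.00", "360000.00"],
    }
)

# Convert amounts to numbers; anything unparsable becomes NaN.
grants["Approved Amount"] = pd.to_numeric(grants["Approved Amount"], errors="coerce")

# Attach the canton code; universities without a mapping get NaN.
grants["Cantons"] = grants["University"].map(university_to_canton)

# Keep only mapped rows with a valid amount, then sum per canton.
cantons_data = (
    grants.dropna(subset=["Cantons", "Approved Amount"])
    .groupby("Cantons", as_index=False)["Approved Amount"]
    .sum()
)
print(cantons_data)
```

Cantons that receive no grants do not appear in the result. Depending on the desired map, they could be added with an amount of 0 so that every canton gets a colour.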

In [None]:
# TEST
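# folium expects [latitude, longitude]; this is roughly the centre of Switzerland
# cantons_data is expected to hold one row per canton (columns 'Cantons' and 'Approved Amount')
swiss_map = folium.Map(location=[46.8, 8.3], zoom_start=8)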
swiss_map.choropleth(geo_path=cantons_json_path, 
                     data=cantons_data,
                     columns=['Cantons', 'Approved Amount'],
                     key_on='feature.id',
                     topojson='objects.cantons',
                     fill_color='YlGn',
                     legend_name='Approved Amount'
                    )
swiss_map.save('swiss_map.html')