# Sarah's notebook to clean the Institute name dataset

Before working with these data we will need to clean them. This involves identifying the correct institute name, i.e. one that we can geocode (relate the name to a location).

#### Notebook approach

Using this notebook to clean the data:

To do this, we create a dictionary that relates each string currently in the dataset to the string it should be written as.
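As a minimal sketch of the idea, a dictionary lookup with a fallback replaces known variants and leaves everything else untouched. The names `name_fixes` and `clean_name` are illustrative rather than part of this notebook, and treating 'LMU Munich' as the canonical spelling is only an assumption for the example:

```python
# Hypothetical mapping: keys are strings as they appear in the dataset,
# values are the spellings we want to geocode instead.
name_fixes = {
    "LMU Muenchen": "LMU Munich",
    "Ludwig Maximilians University Munich": "LMU Munich",
}


def clean_name(name, fixes):
    """Return the corrected name if one is mapped, otherwise the name unchanged."""
    return fixes.get(name, name)


raw_names = ["LMU Muenchen", "Kiel University", "LMU Munich"]
cleaned = [clean_name(name, name_fixes) for name in raw_names]
print(cleaned)
```

Using `dict.get` with the original name as the default means strings that have no mapping pass through as they are, so the dictionary only has to list the problem cases.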

Note that we can use Unicode characters directly here.

#### External text files

Alternatively, you can prepare some simple text files, and I will use those to clean the data:

The Data folder of this directory should contain `place_list.txt` and `mappings.json`.

* Pick a line
* Check you can geocode the location string - this can be done at http://www.gpsvisualizer.com/geocode. If you can't get a geocode response (from Google, not Bing), use Google Search to work out what the correct string should be. To replace/update a string, create a mapping by adding a relationship to the mapping file (mapping.json).
* If you see lines that are repeats/duplicates of the same location, you need to create a mapping for those too
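Spotting repeats by eye gets tedious on a long list. One possible aid, not part of the original workflow, is to flag look-alike strings with the standard library's `difflib`. The function name `near_duplicates` and the 0.85 cutoff are assumptions for illustration; the output is only a list of candidates to review by hand, and variants that are spelled very differently (for example an abbreviation versus a full name) will not be caught:

```python
from difflib import get_close_matches


def near_duplicates(places, cutoff=0.85):
    """Map each place to other entries in `places` that look similar to it."""
    candidates = {}
    for place in places:
        others = [p for p in places if p != place]
        matches = get_close_matches(place, others, n=5, cutoff=cutoff)
        if matches:
            candidates[place] = matches
    return candidates


sample = [
    "Karl-Franzens University of Graz",
    "Karl-Franzens-University of Graz",
    "Kiel University",
]
for place, matches in near_duplicates(sample).items():
    print(place, "->", matches)
```

Each flagged pair still needs a human decision about which spelling is the one to keep.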

To create the mapping:

Open a text editor (like the free and excellent Atom, https://atom.io/). Open the mapping.json file I provided in the text editor. Add a new line per mapping:
```
{
 "old_wrong_name":"correct_name",
 "another_old_wrong_name":"new_correct_name"
 }
```
To check that the contents of the JSON file are okay, you can copy them into https://jsonlint.com/ and click Validate JSON. If it comes up as 'Valid JSON', all is okay. If not, the change you added is malformed!
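If you would rather check the file locally, Python's built-in `json` module can do a similar job. The sketch below is an assumption about what "okay" means here (valid JSON, an object at the top level, and string values only); the helper name `check_mapping_text` is hypothetical:

```python
import json


def check_mapping_text(text):
    """Parse JSON text and confirm it is a flat string-to-string mapping."""
    try:
        mapping = json.loads(text)
    except json.JSONDecodeError as err:
        return False, "Invalid JSON: {}".format(err)
    if not isinstance(mapping, dict):
        return False, "Top level must be a JSON object"
    bad_keys = [k for k, v in mapping.items() if not isinstance(v, str)]
    if bad_keys:
        return False, "Non-string values for keys: {}".format(bad_keys)
    return True, "{} mappings look okay".format(len(mapping))


good = '{"old_wrong_name": "correct_name"}'
bad = '{"old_wrong_name": "correct_name",}'  # trailing comma is not valid JSON
print(check_mapping_text(good))
print(check_mapping_text(bad))
```

To check the real file, read it with `open(path, encoding='utf-8')` and pass the contents to the function, using whichever file name you were given for the mapping file.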


### Workflow

First, load the libraries needed

In [2]:
from collections import Counter
import numpy as np
import geocoder   #https://pypi.python.org/pypi/geocoder

Here is the function we need for geocoding.

In [19]:
def search_location(place):
    """Geocode an address string with geocoder (Google provider) and return its GeoJSON."""
    g = geocoder.google(place)
    return g.geojson

Read the listing of unique place names. Note that in this case the data should be in a sub-folder called Data.

In [14]:
place_list = []
with open('./Data/place_list.txt') as f:
    for line in f:
        place_list.append(line.strip('\n'))
        
sarahs_place_list = place_list[480:960]  # Sarah's subset of places to clean

In [23]:
for place in sarahs_place_list:
    print(place)

Karl-Franzens University of Graz
Karl-Franzens-University Graz
Karl-Franzens-University of Graz
Katholieke Universiteit Leuven
Kiel University
Kшbenhavns Universitet
LFU Innsbruck
LMU Muenchen
LMU Munich
La Sapienza - Universitа di Roma
La Sapienza Roma
La Sapienza Rome University
La Sapienza, University of Rome
Laboratoire Magmas et Volcans
Laboratoire des Fonctionnements et Evolution des systиmes ecologiques (UMR 7625)
Leibniz-Institute of Marine Sciences
Leibniz-Zentrum fьr Agrarlandschaftsforschung (ZALF) e.V.
Leiden University
Leiden University c/o Natural History Museum (Naturalis) Leiden
Leopold-Franzens-Universitдt Innsbruck
Liceo Scientifico G. Galilei
Liege
Lille University /CNRS
Liverpool John Moores University
Liиge University
Ludwig Maximilian University
Ludwig Maximilians University
Ludwig Maximilians University Munich
Ludwig Maximilians Universitдt Mьnchen
Ludwig-Maximilians University Munich, Department on Earth- and Environmental Science, Section Palaeontology
Ludwig-M

Test that individual strings can be geocoded below. Look at the address (check it makes sense) and, if needed, put the latitude and longitude into a web search to confirm this is the correct location (at least approximately the right city).

In [20]:
search_location('University of Zagreb')

{'bbox': [15.9684525197085,
  45.8092213197085,
  15.9711504802915,
  45.8119192802915],
 'geometry': {'coordinates': [15.9698015, 45.81057029999999], 'type': 'Point'},
 'properties': {'accuracy': 'GEOMETRIC_CENTER',
  'address': 'Trg maršala Tita 14, 10000, Zagreb, Croatia',
  'bbox': [15.9684525197085,
   45.8092213197085,
   15.9711504802915,
   45.8119192802915],
  'city': 'Zagreb',
  'confidence': 9,
  'country': 'HR',
  'county': 'Grad Zagreb',
  'encoding': 'utf-8',
  'lat': 45.81057029999999,
  'lng': 15.9698015,
  'location': 'University of Zagreb',
  'ok': True,
  'place': 'ChIJB8Q7U_vWZUcRLvq7EuDKXxY',
  'postal': '10000',
  'provider': 'google',
  'quality': 'establishment',
  'state': 'Grad Zagreb I Zagrebačka županija',
  'status': 'OK',
  'status_code': 200,
  'street': 'Trg maršala Tita 14'},
 'type': 'Feature'}
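The response holds a lot of fields, but for a sanity check only the address and coordinates matter. Here is a small sketch that pulls those out, assuming the response has the same keys as the output above. The helper name `summarise_result` is hypothetical, and since the notebook does not show what a failed lookup returns, treating a missing or false `ok` flag as a failure is an assumption:

```python
def summarise_result(geojson):
    """Pull the address and coordinates out of a geocoder-style GeoJSON dict.

    Returns None when the response is empty or not flagged as ok.
    """
    properties = (geojson or {}).get('properties', {})
    if not properties.get('ok'):
        return None
    return {
        'address': properties.get('address'),
        'lat': properties.get('lat'),
        'lng': properties.get('lng'),
    }


# Trimmed copy of the University of Zagreb response shown above
example = {
    'type': 'Feature',
    'properties': {
        'address': 'Trg maršala Tita 14, 10000, Zagreb, Croatia',
        'lat': 45.81057029999999,
        'lng': 15.9698015,
        'ok': True,
    },
}
print(summarise_result(example))
```

In the notebook this could be called as `summarise_result(search_location('University of Zagreb'))`.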

Here is the key bit: you will need to create a dictionary that maps keys (incorrect strings currently in the dataset) to values (correct strings that can be geocoded).

In [None]:
known_errors = {
    "Personal  Address": 'nil',
    "--": 'nil',
}

To double-check before you enter a string as a key, confirm it is written correctly by checking that it appears in the list, like so:

In [27]:
"Natural History Museum" in sarahs_place_list 

True
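Once the dictionary grows, checking keys one at a time gets slow. A possible shortcut is to list every key that does not appear verbatim in the place list. The helper name `missing_keys` and the stand-in data below are for illustration only; in the notebook you would pass `known_errors` and `sarahs_place_list`:

```python
def missing_keys(mapping, places):
    """Return mapping keys that do not appear verbatim in the place list."""
    known = set(places)
    return sorted(key for key in mapping if key not in known)


# Stand-in data for illustration only
example_places = ["Natural History Museum", "--", "Kiel University"]
example_errors = {"--": "nil", "Personal  Address": "nil"}
print(missing_keys(example_errors, example_places))
```

Any key that is listed is either mistyped (extra spaces count) or belongs to a different subset of the place list.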