# Clean the datasets

Before working with these data we will need to clean them. This involves identifying the correct institute name (i.e. one that we can use to geocode - relate the name to a location).

#### Notebook approach

Using this notebook to clean the data:

To do this we will basically be creating a dictionary, to relate what is currently in the dataset, to what it should be written as.

Note that we can use unicode characters directly here. 

#### External text files

Alternativley, you can prepare some simple text files, and I will use those to clean the data:

The Data folder, of this directory should contain `place_list.txt` and `mappings.json`.

* Pick a line
* Check you can geocode the location string - this can be done at http://www.gpsvisualizer.com/geocode
If you cant get a geocode response (from Google - not Bing) then you will have to use Google Search to work out what the correct string should be. To replace/update a string you need to create a mapping by adding a relationship to a mapping file (mapping.json).
* If you see lines that are repeats/duplicates of the same location you need to create a mapping for those too

To create the mapping:

Open a text editor (like the free and excelent Atom https://atom.io/). Open the mapping.json file I provided with the text editor. Add a new line per mapping:
```
{
 "old_wrong_name":"correct_name",
 "another_old_wrong_name":"new_correct_name"
 }
```
To check the contents of the json file are okay, you can copy them into this link, and click https://jsonlint.com/
Validate JSON. If it comes up as 'Valid Json', all is okay. If not, you added a bad change!


In [1]:
from collections import Counter
import pandas as pd
import numpy as np
from colorama import Fore, Style
import geocoder   #https://pypi.python.org/pypi/geocoder

In [2]:
df = pd.read_csv("./Data/FullExportAnon_v4.csv", encoding="mac_cyrillic")

In [3]:
#df.keys()
df.head()

Unnamed: 0,User_Code,Call_Submitted,Applicant_Age_Visit_Start,User_NHM.Gender,User_NHM.Researcher_status,HostInstName1,User_NHM.Home_Institution_Name,User_NHM.Home_Institution_Type,User_NHM.Home_Institution_Town,User_NHM.Home_Institution_Country_code,...,NHM_Installation_Use.Infrastructure_Short_Name,NHM_Installation_Use.Installation_ID,NHM_Installation_Use.Installation_Long_Name,NHM_Installation_Use.Installation_Short_Name,ProjectsView.Visit_Funded_Previously,SynthRound,TAF_ID,TAF_Name,Application_ID,UserProject_ID
0,User0,1,32,M,PDOC,NRM,Natural History Museum,RES,London,GB,...,SE-TAF,1,NRM,NRM,False,R2,10,SE-TAF,4009,570
1,User0,4,39,M,EXP,National Museum Prague,University of Cambridge,UNI,CAMBRIDGE,GB,...,CZ-TAF,1,NMP Collections and Facilities,NMP,False,R3,12,CZ-TAF,9493,6980
2,User0,4,39,M,EXP,NRM,University of Cambridge,UNI,CAMBRIDGE,GB,...,SE-TAF,1,NRM,NRM,False,R3,10,SE-TAF,9493,6889
3,User1,1,40,M,PGR,NHM,University of Basel,UNI,Basel,CH,...,GB-TAF,3,NHM Collections and Laboratories,NHM COL MOL,False,R2,11,GB-TAF,3705,200
4,User10,2,56,M,EXP,RMCA,National Museum Wales,RES,Cardiff,GB,...,BE-TAF,2,RMCA,RMCA,False,R2,2,BE-TAF,3051,1002


In [4]:
def given_inst_string_find_town_country(institute):
    """Give an institute string, return the town and country entries"""
    msk = df['User_NHM.Home_Institution_Name'] == institute
    df[msk]
    tmp_towns = df[msk]['User_NHM.Home_Institution_Town'].values[0]
    tmp_countries = df[msk]['User_NHM.Home_Institution_Country_code'].values[0]
    tmp_inst = df[msk]['User_NHM.Home_Institution_Type'].values[0]
    entry = 0
    for town, country in zip(tmp_towns, tmp_countries):
        print(f"{entry}: {tmp_inst[entry]}, {tmp_towns}, {tmp_countries}")
        entry += 1
        
def search_location(place):
    """Use Geopy to geocode an address string  https://github.com/geopy/geopy
    """
    g = geocoder.google(place)
    return g.geojson

In [5]:
# Note that any missing data should be mapped to the string 'nil'
# known_errors = {
#    "Personal  Address": 'nil',
#     "--": 'nil',
#     "-": 'nil',
#     "------": 'nil',
#     "Freelance": 'nil',
#     "free lance":'nil',
#     "freelance":'nil',
#     'Freelance researcher':'nil',
#     '(private researcher, not on staff)':'nil',
#     "[no current affiliation]":'nil',
#     "hosted by University Paris 6":'nil',
#     "in retirement, member of Czech Entomological Society":'nil',
#     "none":'nil',
#     "personal address":'nil',
#     "presently none":'nil',
#     "private home":'nil',
#     "unemployed":'nil',
#     '2 Rambler Close':'nil',
#     '6 Bramley Avenue':'nil',
#     "Univ. Of Vienna":"University of Vienna",
#     "?nonu University, Art and Science Faculty":"İnönü University",
#     '?nonu University, Faculty of Pharmacy':"İnönü University",
#     "> Queen's University":"Queen's University",
#     "ASL (NHS equivalent)":"Rossano italy",
#     "Academy of Science of the Czech Republic":"Academy of Sciences of the Czech Republic",
#     'A. Mickiewicz University':"Adam Mickiewicz University in Poznań",
#     'Adam Mickiewicz':"Adam Mickiewicz University in Poznań",
#     'Adam Mickiewicz University':"Adam Mickiewicz University in Poznań",
#     'Adam Mickiewicz University in Pozna?':"Adam Mickiewicz University in Poznań",
#     'Adam Mickiewicz University in Poznan':"Adam Mickiewicz University in Poznań",
#     'Adam Mickiewicz University in Poznan':"Adam Mickiewicz University in Poznań",
#     'Adam Mickiewicz University, Faculty of Biology':"Adam Mickiewicz University in Poznań",
#     'Abant ?zzet Baysal №niversitesi - Fen Bilimleri Enstitьsь':"Abant İzzet Baysal University",
#     'Abant Izzet Baysal University':"Abant İzzet Baysal University",
#     'A Coruсa':"Universidade da Coruña",
#     'Academy of Science':"Czech Academy of Sciences",
#     'Academy of Sciences of the Czech Republic':"Czech Academy of Sciences",
#     'Agencia Estatal Consejo Superior de Investigaciones Cientнficas':"Agencia Estatal Consejo Superior de Investigaciones Científicas",
#     'Agriculture Research':'University of Málaga',
#     'Akdeniz University, Faculty of Art &amp; Science':'Akdeniz Üniversitesi',
#     'Aksaray University Faculty of Science and Literature':"Aksaray University",
#     "Albert-Ludwigs-Universitдt":"Albert Ludwig University of Freiburg",
#     'AgroBioInstitute':"Sofia University",
#     'APM (formerly NBGB)':"Agentschap Plantentuin Meise",
#     'Alcalб University':"University of Alcalá",
#     'Agricultural University of Wroc?aw':"Wrocław University of Environmental and Life Sciences",
#     'Albrecht-von-Haller Institute, Georg-August-Universitдt Gцttingen University':"University of Göttingen",
#     'Albrecht-von-Haller-Institute for Plant Sciences':"University of Göttingen",
#     'Alexandru Ioan Cuza University from Iasi':"Alexandru Ioan Cuza University",
#     'Alexandru Ioan Cuza':"Alexandru Ioan Cuza University",
#     "Alicante":"University of Alicante",
#     'Alfred Wegener Institute for Polar &amp; Marine Research':"Alfred-Wegener-Institut",
#     'Alfred Wegener Institute for Polar and Marine Research':"Alfred-Wegener-Institut",
#     'Alfred-Wegener-Institute':"Alfred-Wegener-Institut",
#     'Alfred-Wegener-Institute for Polar and Marine Research':"Alfred-Wegener-Institut",
#     'Aristotle University of Thessaloniki, Greece':'Aristotle University of Thessaloniki',
#     'Association CATHARSIUS':"4 RUE BERNARD MULE 31400 TOULOUSE",
#     'Atatьrk University':"Atatürk Üniversitesi",
#     "Atatьrk University, Science and Art faculty":"Atatürk Üniversitesi",
#     'Austrian Agency for Health and food Safety':"Österreichische Agentur für Gesundheit und Ernährungssicherheit GmbH",
#     "Autonomous University of Barcelona":"Universitat Autònoma de Barcelona",
#     "Babes Bolyai University":"Babeș-Bolyai University",
#     "Babes Bolyai University, Institute for Interdisciplinary Experimental Research":"Babeș-Bolyai University",
#     'Babes-Bolyai':"Babeș-Bolyai University",
#     'Babes-Bolyai University':"Babeș-Bolyai University",
#     'Bal?kesir University, Necatibey Education Faculty':"Balıkesir Üniversitesi",
#     'Balikesir University, Necatibey Education Faculty':"Balıkesir Üniversitesi",
#     'Bar Ilan':"Bar-Ilan University",
#     'Bar Ilan University':"Bar-Ilan University",
#     "Bayerische Staatssammlung fur Palaeontologie und Geologie":"Bayerische Staatssammlung für Paläontologie und Geologie",
#     'Bayerische Staatssammlung fьr Palдontologie und Geologie':"Bayerische Staatssammlung für Paläontologie und Geologie",
#     'Bavarian State Collection':"Zoologische Staatssammlung München",
#     'Bavarian State Collection of Zoology':"Zoologische Staatssammlung München",
#     'BfN (Federal Agency for Nature Conservation)':"Bundesamt für Naturschutz",
#     'Ben Gurion University of the Negev (BGU)':'Ben Gurion University',
#     "Biological Museums":"Lund University",
#     "Botanical Museum":"Lund University",
#     "Biology Centre":"Biologické centrum AV ČR",
#     "Biology Centre AS CR, Institute of Entomology":"Biologické centrum AV ČR",
#     "Biology Centre of the Academy of Sciences of the Czech Republic":"Biologické centrum AV ČR",
#     "Biology Centre of the Czech Academy of Sciences, Institute of Entomology":"Biologické centrum AV ČR",
#     "Biology Centre, Academy of Science of the Czech Republic":"Biologické centrum AV ČR",
#     "Biology Centre, Academy of Sciences of the Czech Republic, Institute of Entomology":"Biologické centrum AV ČR",
#     "Biotechnical Faculty":"University of Ljubljana",
#     "Botanic Garden and Botanical Museum Berlin-Dahlem, Freie Universitдt Berlin":"Botanischer Garten und Botanisches Museum Berlin Freie Universität Berlin",
#     "BotanicAll Archaeobotanical Research":"Hortus Botanicus",
#     "Botanical Department of Science Faculty of Porto University":"Porto University",
#     "Botanical Garden of the University of Latvia":"Latvijas Universitātes botāniskais dārzs",
#     "Botanical Garden, National Museum of Natural History":"Jardim Botânico da Universidade de Lisboa",
#     'Charles University in Prague':"Charles University",
#     'Charles University in Prague, Faculty of Science':"Charles University",
#     'Charles University, Faculty of Natural Sciences':"Charles University",
#     'Charles University, Faculty of Science':"Charles University",
#     'Charles University, Faculty of Science, Prague':"Charles University",
#     'Charles University, Prague':"Charles University",
#     'Complutense':"Universidad Complutense",
#     'Complutense University of Madrid':"Universidad Complutense",
#     'Carl von Ossietzky University Oldenburg':"Carl von Ossietzky University",
#     'Carl von Ossietzky-University':"Carl von Ossietzky University",
#     'Carl von Ossietzsky Universitдt Oldenburg':"Carl von Ossietzky University",

#                }

Double check the town, and country strings. Note that I have seen problems in the town strings already, so these cant be trusted for automated location generation. We will need to geocode based on the name of the place.

In [6]:
cleaned_institutes = []

for n, inst in enumerate(df['User_NHM.Home_Institution_Name']):
    if inst == inst: # if the data are valid (not a nan)
        if inst in known_errors:
            cleaned_institutes.append(known_errors[inst])
        else:
            cleaned_institutes.append(inst)
    else:
        cleaned_institutes.append('nil')
        
places = sorted(set(cleaned_institutes))

for n, p in enumerate(places):
    print(n, p)

NameError: name 'known_errors' is not defined

In [None]:
given_inst_string_find_town_country('Botanical Museum')

In [None]:
"Botanical Museum":"Jardim Botânico da Universidade de Lisboa",


In [None]:
search_location("BGBM")

In [None]:
#g = geocoder.google('ÖKOTEAM - Institut für Tierökologie und Naturraumplanung OG')
#g.latlng

N.b. Looks like there are problems Location strings too (e.g. I saw an instance where  MЅLAGA, ES should be Málaga. We can't really use these strings for much. Will need to get this data from the geocoding.

In [None]:
#with open("/Users/Ben/Work/Vizzuality/SYNTHESIS/place_list.txt","w") as f:
#    for place in places:
#        f.write(place+'\n')

In [None]:
for vals in range(1,1438,479):
    print(vals)
    
Ben : lines 1 to 480
Sarah: lines 480 - 959
Katherine: lines 959 - 1438

In [None]:
#import json

In [None]:
#with open("/Users/Ben/Work/Vizzuality/SYNTHESIS/mappings.json") as json_data:
#    d = json.load(json_data)
#d

# BEN's cleaning task
The first 480 unique items

In [5]:
working_list = []
with open("/Users/Ben/Work/Vizzuality/SYNTHESIS/Data/place_list.txt") as f:
    for row in f:
        working_list.append(row.split("\n")[0])
bens_list = working_list[0:480]

In [6]:
#bens_list

In [103]:
known_errors = {
   "Personal  Address": 'nil',
    "--": 'nil',
    "-": 'nil',
    "------": 'nil',
    "Freelance": 'nil',
    "free lance":'nil',
    "freelance":'nil',
    'Freelance researcher':'nil',
    '(private researcher, not on staff)':'nil',
    "[no current affiliation]":'nil',
    "hosted by University Paris 6":'nil',
    "in retirement, member of Czech Entomological Society":'nil',
    "none":'nil',
    "personal address":'nil',
    "presently none":'nil',
    "private home":'nil',
    "unemployed":'nil',
    '2 Rambler Close':'nil',
    '6 Bramley Avenue':'nil',
    "Univ. Of Vienna":"University of Vienna",
    "?nonu University, Art and Science Faculty":"İnönü University",
    '?nonu University, Faculty of Pharmacy':"İnönü University",
    "> Queen's University":"Queen's University",
    "ASL (NHS equivalent)":"Rossano italy",
    "Academy of Science of the Czech Republic":"Academy of Sciences of the Czech Republic",
    'A. Mickiewicz University':"Adam Mickiewicz University in Poznań",
    'Adam Mickiewicz':"Adam Mickiewicz University in Poznań",
    'Adam Mickiewicz University':"Adam Mickiewicz University in Poznań",
    'Adam Mickiewicz University in Pozna?':"Adam Mickiewicz University in Poznań",
    'Adam Mickiewicz University in Poznan':"Adam Mickiewicz University in Poznań",
    'Adam Mickiewicz University in Poznan':"Adam Mickiewicz University in Poznań",
    'Adam Mickiewicz University, Faculty of Biology':"Adam Mickiewicz University in Poznań",
    'Abant ?zzet Baysal №niversitesi - Fen Bilimleri Enstitьsь':"Abant İzzet Baysal University",
    'Abant Izzet Baysal University':"Abant İzzet Baysal University",
    'A Coruсa':"Universidade da Coruña",
    'Academy of Science':"Czech Academy of Sciences",
    'Academy of Sciences of the Czech Republic':"Czech Academy of Sciences",
    'Agencia Estatal Consejo Superior de Investigaciones Cientнficas':"Agencia Estatal Consejo Superior de Investigaciones Científicas",
    'Agriculture Research':'University of Málaga',
    'Akdeniz University, Faculty of Art &amp; Science':'Akdeniz Üniversitesi',
    'Aksaray University Faculty of Science and Literature':"Aksaray University",
    "Albert-Ludwigs-Universitдt":"Albert Ludwig University of Freiburg",
    'AgroBioInstitute':"Sofia University",
    'APM (formerly NBGB)':"Agentschap Plantentuin Meise",
    'Alcalб University':"University of Alcalá",
    'Agricultural University of Wroc?aw':"Wrocław University of Environmental and Life Sciences",
    'Albrecht-von-Haller Institute, Georg-August-Universitдt Gцttingen University':"University of Göttingen",
    'Albrecht-von-Haller-Institute for Plant Sciences':"University of Göttingen",
    'Alexandru Ioan Cuza University from Iasi':"Alexandru Ioan Cuza University",
    'Alexandru Ioan Cuza':"Alexandru Ioan Cuza University",
    "Alicante":"University of Alicante",
    'Alfred Wegener Institute for Polar &amp; Marine Research':"Alfred-Wegener-Institut",
    'Alfred Wegener Institute for Polar and Marine Research':"Alfred-Wegener-Institut",
    'Alfred-Wegener-Institute':"Alfred-Wegener-Institut",
    'Alfred-Wegener-Institute for Polar and Marine Research':"Alfred-Wegener-Institut",
    'Aristotle University of Thessaloniki, Greece':'Aristotle University of Thessaloniki',
    'Association CATHARSIUS':"4 RUE BERNARD MULE 31400 TOULOUSE",
    'Atatьrk University':"Atatürk Üniversitesi",
    "Atatьrk University, Science and Art faculty":"Atatürk Üniversitesi",
    'Austrian Agency for Health and food Safety':"Österreichische Agentur für Gesundheit und Ernährungssicherheit GmbH",
    "Autonomous University of Barcelona":"Universitat Autònoma de Barcelona",
    "Babes Bolyai University":"Babeș-Bolyai University",
    "Babes Bolyai University, Institute for Interdisciplinary Experimental Research":"Babeș-Bolyai University",
    'Babes-Bolyai':"Babeș-Bolyai University",
    'Babes-Bolyai University':"Babeș-Bolyai University",
    'Bal?kesir University, Necatibey Education Faculty':"Balıkesir Üniversitesi",
    'Balikesir University, Necatibey Education Faculty':"Balıkesir Üniversitesi",
    'Bar Ilan':"Bar-Ilan University",
    'Bar Ilan University':"Bar-Ilan University",
    "Bayerische Staatssammlung fur Palaeontologie und Geologie":"Bayerische Staatssammlung für Paläontologie und Geologie",
    'Bayerische Staatssammlung fьr Palдontologie und Geologie':"Bayerische Staatssammlung für Paläontologie und Geologie",
    'Bavarian State Collection':"Zoologische Staatssammlung München",
    'Bavarian State Collection of Zoology':"Zoologische Staatssammlung München",
    'BfN (Federal Agency for Nature Conservation)':"Bundesamt für Naturschutz",
    'Ben Gurion University of the Negev (BGU)':'Ben Gurion University',
    "Biological Museums":"Lund University",
    "Botanical Museum":"Lund University",
    "Biology Centre":"Biologické centrum AV ČR",
    "Biology Centre AS CR, Institute of Entomology":"Biologické centrum AV ČR",
    "Biology Centre of the Academy of Sciences of the Czech Republic":"Biologické centrum AV ČR",
    "Biology Centre of the Czech Academy of Sciences, Institute of Entomology":"Biologické centrum AV ČR",
    "Biology Centre, Academy of Science of the Czech Republic":"Biologické centrum AV ČR",
    "Biology Centre, Academy of Sciences of the Czech Republic, Institute of Entomology":"Biologické centrum AV ČR",
    "Biotechnical Faculty":"University of Ljubljana",
    "Botanic Garden and Botanical Museum Berlin-Dahlem, Freie Universitдt Berlin":"Botanischer Garten und Botanisches Museum Berlin Freie Universität Berlin",
    "BotanicAll Archaeobotanical Research":"Hortus Botanicus",
    "Botanical Department of Science Faculty of Porto University":"Porto University",
    "Botanical Garden of the University of Latvia":"Latvijas Universitātes botāniskais dārzs",
    "Botanical Garden, National Museum of Natural History":"Universidade de Lisboa",
    "Botanische Staatssammlung Mьnchen":"Botanische Staatssammlung Muenchen",
    "Botanischer Garten Mьnchen-Nymphenburg":"Botanische Staatssammlung Muenchen",
    "Botanisches Staatssaammlung Mьnchen":"Botanische Staatssammlung Muenchen",
    "Botanischer Garten und Botanisches Museum":"Botanischer Garten und Botanisches Museum Berlin Freie Universität Berlin",
    "Bradford":"The University of Bradford",
    "Bulgarian Academy of Sciences, Institute of Biodiversity and Ecosystem Research":"Bulgarian Academy of Sciences",
    "C.U.T.G.A.N.A.":"Centro di ricerca dell'Università degli Studi di Catania",
    "CASP":"CASP cambridge",
    'Charles University in Prague':"Charles University",
    'Charles University in Prague, Faculty of Science':"Charles University",
    'Charles University, Faculty of Natural Sciences':"Charles University",
    'Charles University, Faculty of Science':"Charles University",
    'Charles University, Faculty of Science, Prague':"Charles University",
    'Charles University, Prague':"Charles University",
    'Complutense':"Universidad Complutense",
    'Complutense University of Madrid':"Universidad Complutense",
    'Carl von Ossietzky University Oldenburg':"Carl von Ossietzky University",
    'Carl von Ossietzky-University':"Carl von Ossietzky University",
    'Carl von Ossietzsky Universitдt Oldenburg':"Carl von Ossietzky University",
    "CIBIO":"CIBIO Portugal",
    "Centro de Investigaзгo em Biodiversidade e Recursos Genйticos (CIBIO)":"CIBIO Portugal",
    "CIBIO (Center for research on biodiversity and genetic resources - Portugal)":"CIBIO Portugal",
    "CIBIO (Centro de Investigaзгo em Biodiversidade e Recursos Genйticos), Porto University":"CIBIO Portugal",
    "CIBIO - Research Center in Biodiversity and Genetic Resources":"CIBIO Portugal",
    "CIBIO - Research Centre in Biodiversity and Genetic Resources":"CIBIO Portugal",
    "CIBIO, Centro de Investigaзгo em Biodiversidade e Recursos Genйticos":"CIBIO Portugal",
    "CIBIO, Centro de Investigaзгo em Biodiversidade e Recursos Genйticos - Universidade do Porto":"CIBIO Portugal",
    "CIBIO, Centro de Investigaзгo em Biodiversidade e Recursos Genйticos / InBio, Laboratуrio Associado, Universidade do Porto":"CIBIO Portugal",
    "CIBIO, Research Center in Biodiversity and Genetic Resources":"CIBIO Portugal",
    "CIBIO, Research Centre in Biodiversity and Genetic Resources":"CIBIO Portugal",
    "CIBIO/INBIO":"CIBIO Portugal",
    "CIRAD":"Centre de coopération internationale en recherche agronomique pour le développement",
    "CIRAD (Centre de Coopйration Internationale en Recherche Agronomique pour le Dйveloppement)":"Centre de coopération internationale en recherche agronomique pour le développement",
    "CIRAD-IRD":"Centre de coopération internationale en recherche agronomique pour le développement",
    "CNR (National Research Council)":"Consiglio Nazionale delle Ricerche",
    "CR2P, UMR CNRS 7207, Musйum national d'Histoire naturelle":"Muséum national d'Histoire naturelle",
    "CNRS":"Le Centre national de la recherche scientifique Toulouse",
    "Catholic University Leuven":"Catholic University of Leuven",
    "Center of Earth Sciences at the University of Gцttingen":"University of Göttingen",
    "Central Laboratory of General Ecology":"Bulgarian Academy of  Sciences",
    "Central Laboratory of General Ecology, Bulgarian Academy of  Sciences":"Bulgarian Academy of  Sciences",
    "Central Laboratory of General Ecology, Bulgarian Academy of Sciences":"Bulgarian Academy of  Sciences",
    "Central Science Laboratory":"Defra",
    "Centre d'йcologie fonctionnelle et йvolutive. CNRS. UMR 5175 Montpellier":"Centre d'Ecologie Fonctionnelle et Evolutive",
    "Centre de Recherches Pйtrographiques et Gйochimiques (CRPG-CNRS)":"Le Centre de Recherches Pétrographiques et Géochimiques",
    "Centro Mixto UCM-ISCIII de Evoluciуn y Comportamiento Humano":"Instituto de Salud Carlos III",
    "Centro Mixto UCM-ISCIII de Evoluciуn y Comportamiento Humanos":"Instituto de Salud Carlos III",
    "Centro de Estudios Avanzados de Blanes (CSIC)":"Centre for Advanced Studies of Blanes - Spanish National Research Council",
    "Centre of advanced studies of Blanes":"Centre for Advanced Studies of Blanes - Spanish National Research Council",
    "Centre for Functional Ecology":"Universidade de Coimbra",
    "Coimbra University":"Universidade de Coimbra",
    "Centro Nacional para la Investigaciуn de la Evoluciуn Humana (CENIEH)":"CENIEH - Centro Nacional de Investigación sobre la Evolución Humana",
    "Christian-Albrechts-Universitдt Kiel":"Christian-Albrechts University",
    "Christian-Albrechts-University":"Christian-Albrechts University",
    "City Museum of Zoology":"Museo Civico di Zoologia di Roma",
    "Comenius University":"Comenius University in Bratislava",
    "Consejo Superior de Investigaciones Cientificas":"Consejo Superior de Investigaciones Cientificas (CSIC)",
    "Consejo Superior de Investigaciones Cientнficas":"Consejo Superior de Investigaciones Cientificas (CSIC)",
    "Consejo Superior de Investigaciones Cientнficas (C.S.I.C.)":"Consejo Superior de Investigaciones Cientificas (CSIC)",
    "Consejo Superior de Investigaciones Cientнficas (CSIC)":"Consejo Superior de Investigaciones Cientificas (CSIC)",
    "Consiglio per la Ricerca e la sperimentazione in Agricoltura":"Centro di ricerca per l'agrobiologia e la pedologia (ABP)",
    "Consiglio per la ricerca in agricoltura e l'analisi dell'economia agraria":"Centro di ricerca per l'agrobiologia e la pedologia (ABP)",
    "Conservatoire et Jardin botaniques de la Ville de Genиve":"Conservatoire et Jardin botaniques de la Ville de Genève",
    "Croatian Biospeleological Society":"Zagreb, Croatia",
    "Croatian Geological Survey":"Zagreb, Croatia",
    "Croatian Institute for Biodiversity":"Zagreb, Croatia",
    "Crocodile Press":'nil',
    "Crop Research Institut":"Crop Research Institute prauge",
    "Crop Research Institute":"Crop Research Institute prauge",
    "Czech Institute of Nature Conservation (till my retirement in 2011)":"University of South Bohemia",
    "Czech University of Life Sciences Prague":"Czech University of Life Sciences",
    "Darmstadt University of Technology (TU Darmstadt)":"Darmstadt University of Technology",
    "Darmstadt University of Technology (TU Darmstadt) (":"Darmstadt University of Technology",
    "Department of Biological, Geological and Environmental Sciences":"University of Catania",
    "Department of Earth Sciences, Uppsala University":"Uppsala University",
    "Department of Geo- and Environmental Sciences, LMU":"Ludwig Maximilian University of Munich",
    "Department of Philosophy, History, Culture and Art Studies":"University of Helsinki",
    "Department of Zoology":"Stockholm University",
    "Dept. Animal and Human Biology":"Università degli Studi di Torino",
    "Dept. Sci. MGM":"University of Messina",
    "Dept. of Animal Ecology and Systematic Zoology":"Justus-Liebig-Universität Gießen",
    "Dexia Bank Belgium":"Dexia Bank Belgium Brussles",
    "Dipartimento di Biologia Animale e dellіUomo":"Università degli Studi di Torino",
    "Dipartimento di Biologia Vegetale":"University of Florence",
    "ECOFOG":"Umr Ecofog",
    "ELTE (Eцtvцs University)":"The Budapest University of Technology and Economics",
    "Eberhard Karls University Tuebingen":"Eberhard Karls Universitaet Tuebingen",
    "Eberhard Karls Universitдt Tьbingen":"Eberhard Karls Universitaet Tuebingen",
    "Eberhard Karls Universitдt Tьbingensity Munich, Department on Earth- and Environmental Science, Section Palaeontology":"Eberhard Karls Universitaet Tuebingen",
    "Ecole Pratique des Hautes Etudes &amp; Universitй de Bourgogne":"The University of Burgundy",
    "Ecology-Centre":"Christian-Albrechts University",
    "Eotvos Lorand":"Eotvos Lorand University",
    "Ernst-Moritz-Arndt Universitдt Greifswald":"Ernst Moritz Arndt University Greifswald",
    "Ernst-Moritz-Arndt-Universitaet Greifswald, Zoologisches Institut und Museum":"Ernst Moritz Arndt University Greifswald",
    "Ernst-Moritz-Arndt-University of Greifswald":"Ernst Moritz Arndt University Greifswald",
    "Erzincan University, Education Faculty,":"Erzincan University",
    "Estaciуn Biolуgica de Doсana":"Instituto de Biomedicina de Sevilla",
    "European Invertebrate Survey":"Leiden University",
    "Euskal Herriko Unibertsitatea/Universidad del Paнs Vasco":"Universidad del País Vasco",
    "Euskal Herriko Unibertsitatea/Universidad del Paнs Vasco":"University of Belgrade",
    "Faculty of Biology, University of Belgrade":"University of Belgrade",
    "Evolutionary Biology Centre, Uppsala University":"Uppsala University",
    "Faculdade Ciкncias Sociais e Humanas da Universidade de Lisboa":"Universidade de Lisboa",
    "Faculdade Ciкncias, Universidade Lisboa":"Universidade de Lisboa",
    "Faculdade de Ciкncias da Universidade de Lisboa":"Universidade de Lisboa",
    "Faculdade de Letras da Universidade de Lisboa (Faculty of Letters of the University of Lisbon)":"Universidade de Lisboa",
    "Facultad de Farmacia, Universidad Complutense de Madrid":"Universidad Complutense de Madrid",
    "Faculty of Biology":"Belgrade University",
    "Faculdade de Ciencias da Universidade do Porto":"Universidade do Porto",
    "Faculty of Life Sciences, University of Vienna":"University of Vienna",
    "Faculty of Science":"University of Zagreb",
    "Faculty of Science, University of Zagreb":"University of Zagreb",
    "Faculty of Sciences":"University of Novi Sad",
    "Faculty of Sciences, University of Novi Sad (Serbia)":"University of Novi Sad",
    "Faculty of Siences":"Universidade do Porto",
    "Freelance Zoologist":"nil",
    "Freie Universitдt Berlin":"Freie University of Berlin",
    "Freie University of Berlin":"Free University of Berlin",
    "Friedrich- Schiller- Universitдt Jena":"Friedrich Schiller University Jena",
    "Friedrich-Alexander Universitдt Erlangen-Nьrnberg":"Friedrich Schiller University Jena",
    "Friedrich-Alexander-University":"Friedrich Schiller University Jena",
    "Friedrich-Alexander-Universitдt Erlangen-Nьrnberg":"Friedrich Schiller University Jena",
    "Friedrich-Schiller-University":"Friedrich Schiller University Jena",
    "Friedrich-Schiller-Universitдt Jena":"Friedrich Schiller University Jena",
    "Gda?sk University":"Gdansk University",
    "Gdansk University of Technology":"Gdansk University",
    "Geoscience Centre of the University of Goettingen":"University of Goettingen",
    "Geoscience Centre, University of Goettingen":"University of Goettingen",
    "Geoscience Centre, University of Gцttingen":"University of Goettingen",
    "Geowissenschaftliches Zentrum der Universitдt Gцttingen":"University of Goettingen",
    "Hebrew University":"Hebrew University of Jerusalem",
    "Hacettepe":"Hacettepe University",
    "Humboldt Universitat":"Humboldt University",
    "Humboldt Universitдt":"Humboldt University",
    "Humboldt-Universitaet zu Berlin":"Humboldt University",
    "Humboldt-Universitдt zu Berlin":"Humboldt University",
    "Humboldt-Universtitдt zu Berlin":"Humboldt University",
    "I am independent researcher. There is no 'home institution'.":"nil",
    "Independent":"nil",
    "Independent Researcher":"nil",
    "Independent Researcher, currently unemployed":"nil",
    "Institut de Sciences de l'Evolution de Montpellier (I.S.E.M.)":"University de Montpellier",
    "Institut des Sciences de l'Evolution de Montpellier":"University de Montpellier",
    "Institut des Sciences de l'Evolution de Montpellier - Universitй de Montpellier":"University de Montpellier",
    "Institut des Sciences de l'Evolution de Montpellier, Universitй de Montpellier 2":"University de Montpellier",
    "Institute of Biodiversity and Ecosystem Research":"Bulgarian Academy of Sciences",
    "Institute of Biodiversity and Ecosystem Research - BAS":"Bulgarian Academy of Sciences",
    "Institute of Biodiversity and Ecosystem Research, Bulgarian Academy of Sciences":"Bulgarian Academy of Sciences",
    "Institute of Biodiversity and Ecosystems Research,  Bulgarian Academy of Sciences":"Bulgarian Academy of Sciences",
    "Institute of Nature Conservation Polish Academy of Sciences":"Polish Academy of Sciences",
    "Institute of Oceanology":"Polish Academy of Sciences",
    "Institute of Oceanology Polish Academy of Sciences":"Polish Academy of Sciences",
    "Institute of Oceanology, Polish Academy of Sciences":"Polish Academy of Sciences",
    "Institute of Paleobiology":"Polish Academy of Sciences",
    "Institute of Paleobiology Polish Academy of Sciences":"Polish Academy of Sciences",
    "Institute of Paleobiology, Polish Academy of Sciences":"Polish Academy of Sciences",
    "Jardim Botвnico - Museu Nacional de Histуria Natural, Universidade de Lisboa (Botanic Garden - Natural History Museum, University of Lisbon)":"Universidade de Lisboa",
    "Johannes Gutenberg Universitдt":"Johannes Gutenberg University",
    "Johannes Gutenberg-Universitдt":"Johannes Gutenberg University",
    "Johannes Gutenberg-Universitдt Mainz":"Johannes Gutenberg University",
    "Johannes-Gutenberg Universitдt Mainz":"Johannes Gutenberg University",
    "Justus Liebig University":"Justus-Liebig-Universität Gießen",
    "Justus Liebig University Giessen":"Justus-Liebig-Universität Gießen",
    "Justus Liebig University, Giessen":"Justus-Liebig-Universität Gießen",
    "Justus-Liebig-University":"Justus-Liebig-Universität Gießen",
    "Institute of Biotechnology, University of Helsinki":"University of Helsinki",
    "Institute of Biotechnology, Czech Academy of Science":"Czech Academy of Sciences",
    "Institute of Botany of the Czech Academy of Sciences":"Czech Academy of Sciences",
    "Institute of Botany, Academy of Sciences":"Czech Academy of Sciences",
    "Institute of Botany, Academy of Sciences of the Czech Republic":"Czech Academy of Sciences",
    "Institute of Botany, Czech Academy of Sciences":"Czech Academy of Sciences",
    "Johann Wolfgang Goethe Universitдt":"Johann Wolfgang Goethe University",
    "Johann Wolfgang Goethe Universitдt Frankfurt a.M.":"Johann Wolfgang Goethe University",
    "K.U.Leuven":"University of Leuven",
    "KU Leuven":"University of Leuven",
    "KU Leuven (University of Leuven)":"University of Leuven",
    "KULeuven Campus Kortrijk":"University of Leuven",
    "Institute of Zoology - BAS":"Bulgarian Academy of Sciences",
    "Institute of Zoology, Bulgarian Academy of Sciences":"Bulgarian Academy of Sciences",
    "Institute of Zoology. Bulgarian Academy of Sciences":"Bulgarian Academy of Sciences",
    "Institute of Botany, Bulgarian Academy of Sciences":"Bulgarian Academy of Sciences",
    "Georg August University Gцttingen":"Georg August University Gottingen",
    "Georg-August-Universitдt Gцttingen":"Georg August University Gottingen",
    "Florence University":"University of Florence",
}

In [104]:
good_ones = []
for k in known_errors:
    good_ones.append(known_errors[k])

clean_items = []

for item in bens_list:
    if item in known_errors:
        clean_items.append(known_errors[item])
    else:
        clean_items.append(item)

potentially_unchecked = 0
for n, i in enumerate(sorted(set(clean_items))):
    if i in good_ones:
        print(Fore.GREEN + f"{n} {i}")
    else:
        print(Fore.BLACK + f"{n} {i}")
        potentially_unchecked += 1
#print(f"Potentially Unchecked {potentially_unchecked}")

[32m0 4 RUE BERNARD MULE 31400 TOULOUSE
[30m1 ASL
[30m2 Aarhus University
[32m3 Abant İzzet Baysal University
[30m4 Aberdeen University
[32m5 Adam Mickiewicz University in Poznań
[30m6 Adnan Menderes University
[32m7 Agencia Estatal Consejo Superior de Investigaciones Científicas
[32m8 Agentschap Plantentuin Meise
[30m9 Agrarian University of Plovdiv
[30m10 Agricultural Institute of Slovenia
[30m11 Agricultural University - Plovdiv
[30m12 Agricultural University of Athens
[32m13 Akdeniz Üniversitesi
[32m14 Aksaray University
[32m15 Albert Ludwig University of Freiburg
[30m16 Alexander Koenig Research Museum of Zoology
[32m17 Alexandru Ioan Cuza University
[32m18 Alfred-Wegener-Institut
[30m19 Alma Mater Studiorum University of Bologna
[30m20 Ankara University
[30m21 Archaeological Institute of the Hungarian Academy of Sciences
[32m22 Aristotle University of Thessaloniki
[32m23 Atatürk Üniversitesi
[30m24 Aveiro University
[30m25 BGBM
[32m26 Babeș-Bolyai Unive

In [102]:
given_inst_string_find_town_country("Faculty of Natural Sciences")

0: U, Skopje, MK
1: N, Skopje, MK


In [None]:

"Florence University":"University of Florence",


In [207]:
search_location("Universidad del País Vasco")

{'bbox': [-2.969241080291502,
  43.32959431970851,
  -2.966543119708498,
  43.3322922802915],
 'geometry': {'coordinates': [-2.9678921, 43.3309433], 'type': 'Point'},
 'properties': {'accuracy': 'ROOFTOP',
  'address': 'Barrio Sarriena, s/n, 48940 Leioa, Bizkaia, Spain',
  'bbox': [-2.969241080291502,
   43.32959431970851,
   -2.966543119708498,
   43.3322922802915],
  'city': 'Leioa',
  'confidence': 9,
  'country': 'ES',
  'county': 'BI',
  'encoding': 'utf-8',
  'housenumber': 's/n',
  'lat': 43.3309433,
  'lng': -2.9678921,
  'location': 'Universidad del País Vasco ',
  'ok': True,
  'place': 'ChIJv4Wn9wJbTg0R0OzXHJ18IZo',
  'postal': '48940',
  'provider': 'google',
  'quality': 'establishment',
  'state': 'PV',
  'status': 'OK',
  'status_code': 200,
  'street': 'Barrio Sarriena'},
 'type': 'Feature'}

In [None]:
# Stopped here:
# 102 Centre for Advanced Studies of Blanes - Spanish National Research Council

## Jellyfish testing

http://jellyfish.readthedocs.io/en/latest/comparison.html

In [None]:
import jellyfish

In [32]:
bens_list[78:90]

['CEA-Grenoble',
 'CELAL BAYAR UNIVERSITY',
 'CIBIO',
 'CIBIO (Center for research on biodiversity and genetic resources - Portugal)',
 'CIBIO (Centro de Investigaзгo em Biodiversidade e Recursos Genйticos), Porto University',
 'CIBIO - Research Center in Biodiversity and Genetic Resources',
 'CIBIO - Research Centre in Biodiversity and Genetic Resources',
 'CIBIO, Centro de Investigaзгo em Biodiversidade e Recursos Genйticos',
 'CIBIO, Centro de Investigaзгo em Biodiversidade e Recursos Genйticos - Universidade do Porto',
 'CIBIO, Centro de Investigaзгo em Biodiversidade e Recursos Genйticos / InBio, Laboratуrio Associado, Universidade do Porto',
 'CIBIO, Research Center in Biodiversity and Genetic Resources',
 'CIBIO, Research Centre in Biodiversity and Genetic Resources']

In [None]:
for entry in bens_list[78:90]:
    print(f"{jellyfish.levenshtein_distance('CIBIO', entry)} = {entry}")

In [None]:
test_string = "CIBIO"

for entry in bens_list:
    jaro_score = jellyfish.jaro_distance(test_string, entry)
    if jaro_score > 0.65:
        print(f"{jaro_score:3.2f}: {entry}")