Normalising sample list to GRID IDs
========================

We now have a selected list of universities. To make full use of the GRID information dump, we need to normalise the university names to GRID IDs.

In [1]:
import os.path
import pandas as pd
import json
input_data_folder = 'data/input/'
output_data_folder = 'data/output/'
core_data_folder = '../general/core_data/'
grid_json_filepath = os.path.join(core_data_folder, 'grid20170926/grid.json')
sample100 = pd.read_csv(os.path.join(output_data_folder, 'uni_sample.csv'))



Normalising to GRID IDs
-----------------------

Load the GRID JSON file and process it in much the same way as the countries, to see what the issues are.

In [2]:
with open(grid_json_filepath) as f:
    grid = json.load(f)

In [3]:
grid_lookup = {}         # name or acronym -> index into grid['institutes']
alias_lookup = {}        # alias -> index (built here, not used below)
acronym_lookup = {}      # unused: acronyms are stored in grid_lookup
ambiguous_names = set()  # names that occur for more than one institute
for i, u in enumerate(grid['institutes']):
    if u.get('name'):
        if u['name'] in grid_lookup:
            # Duplicate name: record it; the later entry still wins below.
            ambiguous_names.add(u['name'])
        grid_lookup[u['name']] = i
        # `or []` guards against records where the key is missing.
        for name in u.get('aliases') or []:
            alias_lookup[name] = i
        for name in u.get('acronyms') or []:
            grid_lookup[name] = i


In [4]:
unmatched = []
matched = []
for i, uni in enumerate(sample100['name']):
    try:
        grid_index = grid_lookup[uni]
        g = grid['institutes'][grid_index]
        match = {'id' : g['id'],
                 'name' : g['name'],
                 'city' : g['addresses'][0]['city'],
                 'country' : g['addresses'][0]['country'],
                 'thes_int_rank' : sample100['int_rank'][i],
                 'subregion' : sample100['subregion'][i]}

        matched.append(match)

        # Flag exact matches on a name shared by several institutes.
        if uni in ambiguous_names:
            print(uni, 'ambiguous')

    except KeyError:
        # No exact name or acronym match: keep the name and its row index.
        unmatched.append((uni, i))

print(unmatched)

[('ETH Zurich – Swiss Federal Institute of Technology Zurich', 2), ('University of California, Los Angeles', 5), ('University of Illinois at Urbana-Champaign', 16), ('The University of Tokyo', 21), ('University of Texas at Austin', 24), ('Nanyang Technological University, Singapore', 31), ('Korea Advanced Institute of Science and Technology (KAIST)', 41), ('KTH Royal Institute of Technology', 46), ('TU Dresden', 52), ('Vrije Universiteit Amsterdam', 54), ('Lomonosov Moscow State University', 58), ('Justus Liebig University Giessen', 60), ('Queen’s University', 65), ('St George’s, University of London', 72), ('Bandung Institute of Technology (ITB)', 75), ('Kochi University', 76), ('University of the Andes, Colombia', 78), ('Pontifical Catholic University of Paraná', 83), ('ITMO University', 87), ('Pontifical Javeriana University', 93), ('Pontifical Catholic University of Paraná', 94), ('Indian Institute of Technology (Indian School of Mines) Dhanbad', 96), ('VIT University', 97), ('Indi

The next step uses the fuzzywuzzy library to find better matches for these universities. This needs a plain list of GRID names to search against.
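To see why a token-based scorer suits names like these, here is a rough stand-in that uses only the standard library. It is not fuzzywuzzy's implementation: `token_set_score` and `best_match` are hypothetical helpers that compare sorted sets of lowercase word tokens with `difflib.SequenceMatcher`, which is only loosely similar to what `fuzz.token_set_ratio` does.

```python
import re
from difflib import SequenceMatcher

def tokens(name):
    """Lowercase a name and return its set of alphanumeric word tokens."""
    return set(re.findall(r"\w+", name.lower()))

def token_set_score(a, b):
    """Hypothetical stand-in scorer: 0-100 similarity of the sorted token sets."""
    sa = " ".join(sorted(tokens(a)))
    sb = " ".join(sorted(tokens(b)))
    return round(100 * SequenceMatcher(None, sa, sb).ratio())

def best_match(query, choices):
    """Return the (choice, score) pair with the highest score."""
    scored = ((c, token_set_score(query, c)) for c in choices)
    return max(scored, key=lambda pair: pair[1])

# Word order and punctuation no longer matter once names are tokenised.
candidates = ["University of Tokyo", "Kyoto University", "University of Toronto"]
print(best_match("The University of Tokyo", candidates))
```

Because punctuation and word order are discarded, 'University of California, Los Angeles' and 'University of California Los Angeles' score as identical under this sketch.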

In [5]:
from fuzzywuzzy import process, fuzz
# Only keep institutes that have a name, matching the guard used for grid_lookup.
grid_names = [u['name'] for u in grid['institutes'] if u.get('name')]

In [6]:
fuzzmatch = []
for u in unmatched:
    # u is a (THES name, sample row index) tuple.
    d = {'thes_name': u[0],
         'thes_country' : sample100['country'][u[1]],
         'sample_index' : u[1]}
    # m is a (best matching GRID name, score) tuple.
    m = process.extractOne(u[0], grid_names, scorer=fuzz.token_set_ratio)
    grid_index = grid_lookup.get(m[0])
    d.update({'grid_name': m[0],
              'match_score' : m[1],
              'grid_country' : grid['institutes'][grid_index]["addresses"][0]['country'],
              'grid_index' : grid_index})
    fuzzmatch.append(d)

In [7]:
stilltomatch = []
for m in fuzzmatch:
    print("""
thes: {thes_name} ({thes_country})
grid: {grid_name} ({grid_country})            
             """.format(**m))
    resp = input('-->')
    if resp == 'y':
        g = grid['institutes'][m['grid_index']]
 
        match = {'id' : g['id'],
                 'name' : g['name'],
                 'city' : g['addresses'][0]['city'],
                 'country' : g['addresses'][0]['country'],
                 'thes_int_rank' : sample100['int_rank'][m['sample_index']],
                 'subregion' : sample100['subregion'][m['sample_index']]}
        
        matched.append(match)
    else:
        stilltomatch.append(m)



thes: ETH Zurich – Swiss Federal Institute of Technology Zurich (Switzerland)
grid: Swiss Federal Institute of Technology in Zurich (Switzerland)            
             
-->y

thes: University of California, Los Angeles (United States)
grid: University of California Los Angeles (United States)            
             
-->y

thes: University of Illinois at Urbana-Champaign (United States)
grid: University of Illinois at Urbana Champaign (United States)            
             
-->y

thes: The University of Tokyo (Japan)
grid: University of Tokyo (Japan)            
             
-->y

thes: University of Texas at Austin (United States)
grid: The University of Texas at Austin (United States)            
             
-->y

thes: Nanyang Technological University, Singapore (Singapore)
grid: Nanyang Technological University (Singapore)            
             
-->y

thes: Korea Advanced Institute of Science and Technology (KAIST) (South Korea)
grid: Korea Institute of Science and Tec

In [8]:
len(matched)

94

In [9]:
matched[90:94]

[{'city': 'Bogotá',
  'country': 'Colombia',
  'id': 'grid.41312.35',
  'name': 'Pontifical Xavierian University',
  'subregion': 'South America',
  'thes_int_rank': 600},
 {'city': 'Dhanbad',
  'country': 'India',
  'id': 'grid.417984.7',
  'name': 'Indian School of Mines',
  'subregion': 'Southern Asia',
  'thes_int_rank': 800},
 {'city': 'Bengaluru',
  'country': 'India',
  'id': 'grid.34980.36',
  'name': 'Indian Institute of Science Bangalore',
  'subregion': 'Southern Asia',
  'thes_int_rank': 300},
 {'city': 'Marrakesh',
  'country': 'Morocco',
  'id': 'grid.411840.8',
  'name': 'Cadi Ayyad University',
  'subregion': 'Northern Africa',
  'thes_int_rank': 1000}]

Final matching process
--------

This gives a reasonable set of matches, most of which were confirmed by checking online via web search. A few remain, and these are likely easiest to do manually. For each one I will try to disambiguate by hand using the GRID database. Copying and pasting the name from THES into the input box at https://grid.ac/disambiguate should give the correct GRID name, which can then be looked up in the GRID lookup.
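The manual steps below rebuild the same record each time. As a sketch of how that repetition could be factored out, the hypothetical helper `build_match` (not part of the original notebook) takes the GRID data, the sample table and the two indices. The toy data is made up and only mimics the shape of `grid` and `sample100`.

```python
def build_match(grid, sample, grid_index, sample_index):
    """Combine one GRID institute with the ranking data of one sample row."""
    g = grid['institutes'][grid_index]
    address = g['addresses'][0]  # assumes the first address is the main one
    return {'id': g['id'],
            'name': g['name'],
            'city': address['city'],
            'country': address['country'],
            'thes_int_rank': sample['int_rank'][sample_index],
            'subregion': sample['subregion'][sample_index]}

# Made-up toy data (not real GRID records), shaped like the objects used here.
toy_grid = {'institutes': [{'id': 'grid.example.1',
                            'name': 'Example University',
                            'addresses': [{'city': 'Exampleville',
                                           'country': 'Nowhere'}]}]}
toy_sample = {'int_rank': [42], 'subregion': ['Example Region']}

match = build_match(toy_grid, toy_sample, grid_index=0, sample_index=0)
print(match)
```

With the real objects, a call such as `build_match(grid, sample100, grid_index, sampleindex)` would produce the same dictionary as the hand-written cells.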

In [10]:
len(stilltomatch)

11

In [11]:
i=0
stilltomatch[i]['thes_name']

'Vrije Universiteit Amsterdam'

In [12]:
grid_index = grid_lookup.get('VU University Amsterdam')
grid_index

1015

In [13]:
sampleindex = stilltomatch[i]['sample_index']
g = grid['institutes'][grid_index]
match = {'id' : g['id'],
         'name' : g['name'],
         'city' : g['addresses'][0]['city'],
         'country' : g['addresses'][0]['country'],
         'thes_int_rank' : sample100['int_rank'][sampleindex],
         'subregion' : sample100['subregion'][sampleindex]}
match

{'city': 'Amsterdam',
 'country': 'Netherlands',
 'id': 'grid.12380.38',
 'name': 'VU University Amsterdam',
 'subregion': 'Western Europe',
 'thes_int_rank': 165}

In [14]:
matched.append(match)

In [15]:
i+=1
stilltomatch[i]['thes_name']

'Queen’s University'

In [16]:
grid_index = grid_lookup.get("Queen's University")
grid_index

4688

In [17]:
sampleindex = stilltomatch[i]['sample_index']
g = grid['institutes'][grid_index]
match = {'id' : g['id'],
         'name' : g['name'],
         'city' : g['addresses'][0]['city'],
         'country' : g['addresses'][0]['country'],
         'thes_int_rank' : sample100['int_rank'][sampleindex],
         'subregion' : sample100['subregion'][sampleindex]}
match

{'city': 'Kingston',
 'country': 'Canada',
 'id': 'grid.410356.5',
 'name': "Queen's University",
 'subregion': 'Northern America',
 'thes_int_rank': 300}

In [18]:
matched.append(match)

In [19]:
i+=1
stilltomatch[i]['thes_name']

'St George’s, University of London'

In [20]:
grid_index = grid_lookup.get("St George’s, University of London")
grid_index

The lookup returns nothing, possibly because GRID stores the name with a different apostrophe character. This record is completed manually from the data at grid.ac.
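If the failure comes down to character variants (an assumption; the notebook does not confirm the cause), names could be folded to a common form before lookup. The sketch below uses a hypothetical helper, `normalise_name`, that replaces typographic apostrophes, strips accents such as the ō in Kōchi, and ignores case.

```python
import unicodedata

def normalise_name(name):
    """Hypothetical helper: fold apostrophe variants, accents and case."""
    # Replace typographic single quotes with the plain ASCII apostrophe.
    name = name.replace('\u2019', "'").replace('\u2018', "'")
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', name)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

print(normalise_name('Queen\u2019s University') == normalise_name("Queen's University"))
print(normalise_name('K\u014dchi University') == normalise_name('Kochi University'))
```

A second lookup keyed on `normalise_name(u['name'])` could sit alongside `grid_lookup`. Folding makes more names collide, so an ambiguity check would matter even more there.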

In [21]:
sampleindex = stilltomatch[i]['sample_index']
#g = grid['institutes'][grid_index]
match = {'id' : 'grid.264200.2',
         'name' : "St George’s, University of London",
         'city' : 'London',
         'country' : 'United Kingdom',
         'thes_int_rank' : sample100['int_rank'][sampleindex],
         'subregion' : sample100['subregion'][sampleindex]}
match

{'city': 'London',
 'country': 'United Kingdom',
 'id': 'grid.264200.2',
 'name': 'St George’s, University of London',
 'subregion': 'Northern Europe',
 'thes_int_rank': 250}

In [22]:
matched.append(match)

In [23]:
i+=1
stilltomatch[i]['thes_name']

'Bandung Institute of Technology (ITB)'

In [24]:
grid_index = grid_lookup.get('Institut Teknologi Bandung')
grid_index

29120

In [25]:
sampleindex = stilltomatch[i]['sample_index']
g = grid['institutes'][grid_index]
match = {'id' : g['id'],
         'name' : g['name'],
         'city' : g['addresses'][0]['city'],
         'country' : g['addresses'][0]['country'],
         'thes_int_rank' : sample100['int_rank'][sampleindex],
         'subregion' : sample100['subregion'][sampleindex]}
matched.append(match)

In [26]:
i+=1
stilltomatch[i]['thes_name']

'Kochi University'

In [27]:
grid_index = grid_lookup.get('Kōchi University')
grid_index

4012

In [28]:
sampleindex = stilltomatch[i]['sample_index']
g = grid['institutes'][grid_index]
match = {'id' : g['id'],
         'name' : g['name'],
         'city' : g['addresses'][0]['city'],
         'country' : g['addresses'][0]['country'],
         'thes_int_rank' : sample100['int_rank'][sampleindex],
         'subregion' : sample100['subregion'][sampleindex]}
matched.append(match)

In [29]:
i+=1
stilltomatch[i]['thes_name']

'University of the Andes, Colombia'

In [30]:
grid_index = grid_lookup.get('Universidad de Los Andes')
grid_index

36011

The institute at this index is not the correct match, presumably because several institutions share this name, so this record is entered manually.

In [32]:
sampleindex = stilltomatch[i]['sample_index']
#g = grid['institutes'][grid_index]
# Entered by hand; spelling follows the other Bogotá record in `matched`.
match = {'id' : 'grid.7247.6',
         'name' : 'Universidad de Los Andes',
         'city' : 'Bogotá',
         'country' : 'Colombia',
         'thes_int_rank' : sample100['int_rank'][sampleindex],
         'subregion' : sample100['subregion'][sampleindex]}
match

{'city': 'Bogotá',
 'country': 'Colombia',
 'id': 'grid.7247.6',
 'name': 'Universidad de Los Andes',
 'subregion': 'South America',
 'thes_int_rank': 800}

In [33]:
matched.append(match)

In [34]:
i+=1
stilltomatch[i]['thes_name']

'Pontifical Catholic University of Paraná'

In [35]:
grid_index = grid_lookup.get('Pontifícia Universidade Católica do Paraná')
grid_index

6723

In [36]:
sampleindex = stilltomatch[i]['sample_index']
g = grid['institutes'][grid_index]
match = {'id' : g['id'],
         'name' : g['name'],
         'city' : g['addresses'][0]['city'],
         'country' : g['addresses'][0]['country'],
         'thes_int_rank' : sample100['int_rank'][sampleindex],
         'subregion' : sample100['subregion'][sampleindex]}
match

{'city': 'Curitiba',
 'country': 'Brazil',
 'id': 'grid.412522.2',
 'name': 'Pontifícia Universidade Católica do Paraná',
 'subregion': 'South America',
 'thes_int_rank': 1000}

In [37]:
matched.append(match)

In [38]:
i+=1
stilltomatch[i]['thes_name']

'ITMO University'

In [39]:
grid_index = grid_lookup.get('Saint Petersburg State University of Information Technologies, Mechanics and Optics')
grid_index

1769

In [40]:
sampleindex = stilltomatch[i]['sample_index']
g = grid['institutes'][grid_index]
match = {'id' : g['id'],
         'name' : g['name'],
         'city' : g['addresses'][0]['city'],
         'country' : g['addresses'][0]['country'],
         'thes_int_rank' : sample100['int_rank'][sampleindex],
         'subregion' : sample100['subregion'][sampleindex]}
matched.append(match)

In [41]:
i+=1
stilltomatch[i]['thes_name']

'Pontifical Catholic University of Paraná'

It is not clear why this one appears twice (it is at sample indices 83 and 94 in the unmatched list above), but it is the same university, with Brazil as the country in both cases, so I will ignore the repeat and move on to the next one.
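A quick way to spot repeats like this before matching is to count the names. The sketch below uses a hypothetical helper, `duplicated`, on a short list taken from the unmatched names above.

```python
from collections import Counter

def duplicated(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]

names = ['ITMO University',
         'Pontifical Catholic University of Paraná',
         'VIT University',
         'Pontifical Catholic University of Paraná']
print(duplicated(names))
```

On the real data the same check could be run as `duplicated(sample100['name'])`, or with pandas via `sample100['name'].duplicated()`.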

In [42]:
i+=1
stilltomatch[i]['thes_name']

'VIT University'

In [43]:
grid_index = grid_lookup.get('Vellore Institute of Technology University')
grid_index

7011

In [44]:
sampleindex = stilltomatch[i]['sample_index']
g = grid['institutes'][grid_index]
match = {'id' : g['id'],
         'name' : g['name'],
         'city' : g['addresses'][0]['city'],
         'country' : g['addresses'][0]['country'],
         'thes_int_rank' : sample100['int_rank'][sampleindex],
         'subregion' : sample100['subregion'][sampleindex]}
matched.append(match)

In [45]:
i+=1
stilltomatch[i]['thes_name']

'University of Tlemcen'

There is no obvious match for this university in the GRID database, so it is left unmatched for now. It may require further work to disambiguate.
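One option for that further work is a substring search over both names and aliases, since a university may be listed under a local-language name. The sketch below is a guess at how that could look: `search_institutes` is a hypothetical helper and the toy records are made up, not real GRID entries.

```python
def search_institutes(institutes, keyword):
    """Return (index, name) for institutes whose name or aliases contain keyword."""
    keyword = keyword.casefold()
    hits = []
    for index, inst in enumerate(institutes):
        # Collect every label for this record, tolerating missing keys.
        labels = [inst.get('name') or ''] + list(inst.get('aliases') or [])
        if any(keyword in label.casefold() for label in labels):
            hits.append((index, inst.get('name')))
    return hits

# Made-up records shaped like entries of grid['institutes'].
toy = [{'name': 'Example University', 'aliases': ['Université Exemple']},
       {'name': 'Sample Institute', 'aliases': []},
       {'name': 'Another College'}]
print(search_institutes(toy, 'exemple'))
```

Against the real dump this would be called as `search_institutes(grid['institutes'], 'Tlemcen')`. Whether it finds anything depends on how GRID records the institution.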

In [47]:
len(matched)

103

In [51]:
import csv
headers = ['id', 'name', 'subregion', 'country', 'city', 'thes_int_rank']
filepath = os.path.join(output_data_folder, 'sample100.csv')
# newline='' lets the csv module control line endings (avoids blank rows on Windows).
with open(filepath, 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    writer.writerows(matched)