In [1]:
import pandas as pd
import urllib.parse  # unquote() lives in the urllib.parse submodule
import csv

# To query Wikipedia API
import requests
import json
from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

## Leiden Ranking

[CWTS Leiden Ranking 2021](https://www.leidenranking.com/ranking/2021/list)

Leiden Ranking (2021) includes a total of 1225 universities.

In [2]:
df_leiden = pd.read_csv('data/leiden_ranking_2021.tsv', sep='\t')
df_leiden

Unnamed: 0,id,university,publications,publications_frac,top_pubs,top_pubs_frac,collaboration,oa
0,6,Univ Graz,3435,1468,413,138,2876,2273
1,8,Graz Univ Technol,3481,1545,444,177,2927,1917
2,10,Univ Innsbruck,5704,2244,705,216,4838,3957
3,14,Johannes Kepler Univ Linz,2628,1250,291,124,2124,1490
4,15,Univ Salzburg,1850,851,230,99,1507,1163
...,...,...,...,...,...,...,...,...
1220,19138,SRM Inst Sci & Technol,2543,1300,216,96,1918,480
1221,32912,Universidade Lisboa,17465,7342,2025,657,14904,8696
1222,33166,Univ Hlth Sci,2093,1311,52,25,1432,780
1223,101154,Soonchunhyang Univ,3193,1377,201,65,2512,1696


## GRID

[GRID](https://www.grid.ac/)

GRID includes a total of 13,158 universities with a link to a Wikipedia page.

In [3]:
df_grid = pd.read_csv('data/wikipedia_grid_univ_2021.csv', sep=';')
df_grid = df_grid[~df_grid['wikipedia_url'].isna()]
df_grid

Unnamed: 0,grid_id,organization_name,established_year,closed_year,wikipedia_url,website_url
0,grid.1001.0,Australian National University,1946.0,,http://en.wikipedia.org/wiki/Australian_Nation...,http://www.anu.edu.au/
1,grid.1002.3,Monash University,1958.0,,http://en.wikipedia.org/wiki/Monash_University,http://www.monash.edu/
2,grid.10025.36,University of Liverpool,1882.0,,http://en.wikipedia.org/wiki/University_of_Liv...,http://www.liv.ac.uk/
3,grid.1003.2,University of Queensland,1909.0,,http://en.wikipedia.org/wiki/University_of_Que...,http://www.uq.edu.au/
4,grid.1004.5,Macquarie University,1964.0,,http://en.wikipedia.org/wiki/Macquarie_University,http://mq.edu.au/
...,...,...,...,...,...,...
19634,grid.9966.0,University of Limoges,1968.0,,http://en.wikipedia.org/wiki/University_of_Lim...,http://www.unilim.fr/
19635,grid.9970.7,Johannes Kepler University of Linz,1966.0,,http://en.wikipedia.org/wiki/Johannes_Kepler_U...,http://www.jku.at/content
19636,grid.9982.a,Slovak Medical University,2002.0,,https://en.wikipedia.org/wiki/Slovak_Medical_U...,http://eng.szu.sk/
19637,grid.9983.b,University of Lisbon,1911.0,,http://en.wikipedia.org/wiki/University_of_Lisbon,http://www.ulisboa.pt/


Wikipedia URLs are cleaned in three steps: the scheme (`http://` or `https://`) is stripped, only English Wikipedia links are kept, and the `en.wikipedia.org/wiki/` prefix is removed so that only the Wikipedia page title remains.

In [4]:
df_grid.wikipedia_url = df_grid.wikipedia_url.str.replace('http://', '', regex=False)
df_grid.wikipedia_url = df_grid.wikipedia_url.str.replace('https://', '', regex=False)
df_grid = df_grid[df_grid.wikipedia_url.str.startswith('en.wikipedia.org/')].copy()
df_grid.wikipedia_url = df_grid.wikipedia_url.str.replace('en.wikipedia.org/wiki/', '', regex=False)

URLs are unquoted (percent-decoded) to avoid mismatches when joining on them.

In [5]:
df_grid.wikipedia_url = [urllib.parse.unquote(x) for x in df_grid.wikipedia_url]
df_grid

Unnamed: 0,grid_id,organization_name,established_year,closed_year,wikipedia_url,website_url
0,grid.1001.0,Australian National University,1946.0,,Australian_National_University,http://www.anu.edu.au/
1,grid.1002.3,Monash University,1958.0,,Monash_University,http://www.monash.edu/
2,grid.10025.36,University of Liverpool,1882.0,,University_of_Liverpool,http://www.liv.ac.uk/
3,grid.1003.2,University of Queensland,1909.0,,University_of_Queensland,http://www.uq.edu.au/
4,grid.1004.5,Macquarie University,1964.0,,Macquarie_University,http://mq.edu.au/
...,...,...,...,...,...,...
19634,grid.9966.0,University of Limoges,1968.0,,University_of_Limoges,http://www.unilim.fr/
19635,grid.9970.7,Johannes Kepler University of Linz,1966.0,,Johannes_Kepler_University_Linz,http://www.jku.at/content
19636,grid.9982.a,Slovak Medical University,2002.0,,Slovak_Medical_University,http://eng.szu.sk/
19637,grid.9983.b,University of Lisbon,1911.0,,University_of_Lisbon,http://www.ulisboa.pt/
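
The cleaning and unquoting steps above can also be sketched as a single function. This is only an illustration: `wikipedia_title` is a hypothetical helper that does not appear in the notebook, and unlike the cells above it discards any `#section` fragment because `urlsplit` separates it from the path.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit


def wikipedia_title(url: str) -> Optional[str]:
    """Return the English Wikipedia page title for a URL, or None otherwise."""
    # Tolerate scheme-less values such as 'en.wikipedia.org/wiki/X'.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    prefix = "/wiki/"
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith(prefix):
        return None
    # Percent-decoding turns e.g. '%C3%B4' back into 'ô'.
    return unquote(parts.path[len(prefix):])


print(wikipedia_title("http://en.wikipedia.org/wiki/Monash_University"))
```

Applied to a column, this could replace the chain of `str.replace` calls, with `None` marking the links that are not on English Wikipedia.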


## Intermediate table

We use an intermediate table to link the Leiden Ranking unambiguously with the GRID data. Its Wikipedia URLs are cleaned in the same way as the GRID ones.

In [6]:
df_link = pd.read_csv('data/leiden_grid_2021.csv', sep=';')
df_link.wikipedia_url = df_link.wikipedia_url.str.replace('http://', '', regex=False)
df_link.wikipedia_url = df_link.wikipedia_url.str.replace('https://', '', regex=False)
df_link = df_link[df_link.wikipedia_url.str.startswith('en.wikipedia.org/')].copy()
df_link.wikipedia_url = df_link.wikipedia_url.str.replace('en.wikipedia.org/wiki/', '', regex=False)
df_link.wikipedia_url = [urllib.parse.unquote(x) for x in df_link.wikipedia_url]

Since only the established year is needed from GRID, only that column is kept. Both tables are joined on the Wikipedia page title. Some universities that appear in both tables fail to match because their titles differ; even so, this is the most accurate key available.

In [7]:
df_link_grid = pd.merge(df_link[['id', 'country_iso_num_code', 'wikipedia_url']], df_grid[['wikipedia_url', 'established_year']].drop_duplicates(), how='left', on='wikipedia_url')
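
One way to see which rows fail to match in a left join like this is the `indicator` option of `merge`. The sketch below uses small stand-in tables (`link` and `grid` are hypothetical names, and the three rows are chosen only for illustration), so it shows the technique rather than the notebook's actual result.

```python
import pandas as pd

# Toy stand-ins for df_link and df_grid; rows chosen only for illustration.
link = pd.DataFrame({
    "id": [6, 8, 10],
    "wikipedia_url": [
        "University_of_Graz",
        "Graz_University_of_Technology",
        "University_of_Innsbruck",
    ],
})
grid = pd.DataFrame({
    "wikipedia_url": ["University_of_Graz", "University_of_Innsbruck"],
    "established_year": [1585.0, 1669.0],
})

# indicator=True adds a '_merge' column describing where each row came from.
merged = link.merge(grid, how="left", on="wikipedia_url", indicator=True)

# Rows tagged 'left_only' found no GRID record with the same title.
unmatched = merged.loc[merged["_merge"] == "left_only", "wikipedia_url"].tolist()
print(unmatched)
```

The unmatched titles can then be inspected by hand to decide whether the difference is a spelling variant or a genuinely missing record.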

The join produces some duplicated institutions, because several GRID records can share the same Wikipedia title, as in the following example.

In [8]:
df_grid[df_grid.wikipedia_url=='Istituto_Superiore_per_le_Industrie_Artistiche']

Unnamed: 0,grid_id,organization_name,established_year,closed_year,wikipedia_url,website_url
14505,grid.465958.3,Higher Institute for Artistic Industries Faenza,1980.0,,Istituto_Superiore_per_le_Industrie_Artistiche,http://www.isiafaenza.it/?lang=en
14507,grid.465961.9,Higher Institute for Artistic Industries Firenze,1975.0,,Istituto_Superiore_per_le_Industrie_Artistiche,http://www.isiadesign.fi.it/
14509,grid.465963.b,Higher Institute for Artistic Industries Roma,1965.0,,Istituto_Superiore_per_le_Industrie_Artistiche,http://www.isiaroma.it/
14510,grid.465964.c,Higher Institute for Artistic Industries Urbino,1974.0,,Istituto_Superiore_per_le_Industrie_Artistiche,http://www.isiaurbino.net/home/


The Leiden Ranking is then merged with the linked table.

In [9]:
df = df_leiden.merge(df_link_grid, how='left', on='id')

Only 9 institutions do not have a Wikipedia URL and they are removed.

In [10]:
df[df.wikipedia_url.isna()].shape

(9, 11)

In [11]:
df = df[~df.wikipedia_url.isna()].copy()

109 institutions do not have an established year (they could not be linked to GRID data).

In [12]:
df[df.established_year.isna()].shape

(109, 11)

Three pages are duplicated and need to be fixed manually.

In [13]:
df[(df.wikipedia_url.duplicated()) & (~df.wikipedia_url.isna())]

Unnamed: 0,id,university,publications,publications_frac,top_pubs,top_pubs_frac,collaboration,oa,country_iso_num_code,wikipedia_url,established_year
129,349,Univ Southampton,16280,6333,2766,872,13995,14591,826.0,University_of_Southampton,1952.0
567,1274,Carnegie Mellon Univ,7850,3424,1279,499,6379,4850,840.0,Carnegie_Mellon_University,2011.0
929,3484,Kent State Univ,3096,1447,380,159,2488,1386,840.0,Kent_State_University,1964.0


In all three cases, the earliest year is kept.

In [14]:
df.loc[df.wikipedia_url=='University_of_Southampton', 'established_year'] = 1952
df.loc[df.wikipedia_url=='Carnegie_Mellon_University', 'established_year'] = 1900
df.loc[df.wikipedia_url=='Kent_State_University', 'established_year'] = 1910

In [15]:
df = df.drop_duplicates().copy()
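
As an alternative to assigning the years by hand, the "keep the earliest year" rule could be applied to the GRID extract before merging. This is a sketch under that assumption, not the notebook's method; `grid` and `earliest` are hypothetical names and the table is a toy example.

```python
import pandas as pd

# Toy GRID extract: one title appears twice with different years.
grid = pd.DataFrame({
    "wikipedia_url": [
        "Kent_State_University",
        "Kent_State_University",
        "Monash_University",
    ],
    "established_year": [1964.0, 1910.0, 1958.0],
})

# One row per title, keeping the earliest year; min() skips missing years.
earliest = grid.groupby("wikipedia_url", as_index=False)["established_year"].min()
print(earliest)
```

Whether the earliest year is the right choice is a judgment call that depends on why the duplicates exist, so a manual check of the affected titles remains useful.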

Our sample (at this point) is composed of 1216 universities with a Wikipedia URL.

In [16]:
df.shape

(1216, 11)

## Wikipedia

The Wikipedia pages dataset is imported and restricted to article pages (namespace 0).

In [17]:
df_wiki = pd.read_csv('data/page.tsv', sep='\t', quoting=csv.QUOTE_NONE)
df_wiki = df_wiki[df_wiki['namespace']==0].copy()
df_wiki

Unnamed: 0,page_id,namespace,title,is_redirect,is_new,touched,links_updated,latest,len,content_model,page_edits,creation,editors,views,references
0,10,0,AccessibleComputing,1,0,20210607122734,2.021061e+13,1002250816,111,wikitext,14.0,2001-01-21,13.0,186.0,
1,12,0,Anarchism,0,0,20210701093040,2.021070e+13,1030472204,96584,wikitext,19819.0,2001-10-11,3773.0,237226.0,92.0
2,13,0,AfghanistanHistory,1,0,20210629133822,2.021061e+13,783865149,90,wikitext,6.0,2001-04-05,5.0,47.0,
3,14,0,AfghanistanGeography,1,0,20210607122734,2.021061e+13,783865160,92,wikitext,7.0,2001-01-21,7.0,23.0,
4,15,0,AfghanistanPeople,1,0,20210629123442,2.021061e+13,783865293,95,wikitext,8.0,2001-01-21,7.0,16.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53710509,68103359,0,Carrie_Flemmer,0,1,20210701094546,2.021070e+13,1031387168,1300,wikitext,1.0,2021-07-01,1.0,,
53710510,68103360,0,US_des_Forces_Armees,1,1,20210701094549,2.021070e+13,1031387177,35,wikitext,1.0,2021-07-01,1.0,,
53710512,68103362,0,Carrie_Flemmer-Marshall,1,1,20210701094611,2.021070e+13,1031387212,27,wikitext,1.0,2021-07-01,1.0,,
53710515,68103365,0,Dapp_Browsers,0,1,20210701094630,2.021070e+13,1031387241,2682,wikitext,3.0,2021-07-01,2.0,,


Rows without a title are removed.

In [18]:
df_wiki = df_wiki[~df_wiki.title.isna()].copy()

1213 university pages are found in the Wikipedia dataset.

In [19]:
df_wiki[df_wiki.title.isin(df.wikipedia_url)]

Unnamed: 0,page_id,namespace,title,is_redirect,is_new,touched,links_updated,latest,len,content_model,page_edits,creation,editors,views,references
1025,1859,0,Arizona_State_University,0,0,20210630203037,2.021063e+13,1031291521,153688,wikitext,5260.0,2001-09-23,1813.0,83094.0,241.0
2638,4157,0,Brown_University,0,0,20210630041419,2.021063e+13,1030113579,154347,wikitext,5366.0,2001-09-10,2081.0,197635.0,101.0
3814,5690,0,Chalmers_University_of_Technology,0,0,20210628143759,2.021063e+13,1030879897,19665,wikitext,620.0,2002-02-25,362.0,12169.0,10.0
3889,5786,0,California_Institute_of_Technology,0,0,20210630041419,2.021063e+13,1031095636,132023,wikitext,2791.0,2001-06-28,1363.0,119083.0,124.0
4279,6310,0,Columbia_University,0,0,20210701064340,2.021070e+13,1031366616,169071,wikitext,9899.0,2001-09-10,3453.0,282775.0,290.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52275519,66560028,0,Freiberg_University_of_Mining_and_Technology,1,1,20210607015344,2.021013e+13,1003914911,106,wikitext,1.0,2021-01-31,1.0,807.0,
53077146,67423223,0,University_of_Côte_d'Azur,1,1,20210623150859,2.021042e+13,1018320796,84,wikitext,1.0,2021-04-17,1.0,1626.0,
53080291,67426585,0,University_of_Rouen,1,1,20210607015848,2.021042e+13,1018381042,89,wikitext,1.0,2021-04-17,1.0,718.0,
53080992,67427342,0,Paul_Sabatier_University,1,1,20210607015848,2.021042e+13,1018396359,100,wikitext,1.0,2021-04-17,1.0,1293.0,


In [20]:
df[~df.wikipedia_url.isin(df_wiki.title)]

Unnamed: 0,id,university,publications,publications_frac,top_pubs,top_pubs_frac,collaboration,oa,country_iso_num_code,wikipedia_url,established_year
617,1371,HSE Univ,2986,1144,306,78,2521,1710,643.0,National_Research_University_–_Higher_School_o...,
1062,9596,Renmin Univ China,3546,1659,415,164,2972,1203,156.0,Renmin_University_of_China#Campuses,
1182,18040,JiLin Agr Univ,1641,820,109,52,1362,724,156.0,JiLin_Agriculture_University,


Three university pages are not found in the Wikipedia dataset. Two of them are fixed by correcting the title, but the third was deleted from Wikipedia ([more info](https://en.wikipedia.org/wiki/Special:Log?type=delete&user=&page=JiLin_Agriculture_University&wpdate=&tagfilter=&subtype=)).

In [21]:
df.loc[df.wikipedia_url=='National_Research_University_–_Higher_School_of_Economics#Regional_branches', 'wikipedia_url'] = 'Higher_School_of_Economics'
df.loc[df.wikipedia_url=='Renmin_University_of_China#Campuses', 'wikipedia_url'] = 'Renmin_University_of_China'
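
Both unmatched titles above carry a `#section` anchor. A small helper could remove such anchors in general; `strip_fragment` is a hypothetical name used only for this sketch.

```python
def strip_fragment(title: str) -> str:
    """Drop a trailing '#section' anchor from a Wikipedia page title."""
    return title.split("#", 1)[0]


print(strip_fragment("Renmin_University_of_China#Campuses"))
```

Stripping the anchor is enough for Renmin University, but not for the Higher School of Economics, where the page title itself also had to be changed.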

In [22]:
df = df.loc[df.wikipedia_url!='JiLin_Agriculture_University'].copy()

These changes do not introduce duplicated titles.

In [23]:
any(df.wikipedia_url.duplicated())

False

## Redirects

141 university pages are redirects, so these links need to be resolved using the Wikipedia API.

In [24]:
redirects = pd.DataFrame({'from':df_wiki[(df_wiki['is_redirect']==1) & (df_wiki.title.isin(df.wikipedia_url))]['page_id'],
                          'to':None})
redirects

Unnamed: 0,from,to
41801,55286,
117721,143732,
138426,171089,
243480,322640,
290638,393099,
...,...,...
52275519,66560028,
53077146,67423223,
53080291,67426585,
53080992,67427342,


In [25]:
count = 0
for i in redirects['from'].tolist():
    count+=1
    print(round(100*count/redirects.shape[0], 2), end='\r')
    url = 'https://en.wikipedia.org/w/api.php?action=query&format=json&redirects&pageids=' + str(i)
    query = requests.get(url, verify=False)
    response = json.loads(query.text)
    redirects.loc[redirects['from']==i,'to'] = int(list(response['query']['pages'].keys())[0])

100.0
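
The request and the parsing step in the loop above can be separated into two small functions, which makes the parsing testable without network access. This is a sketch: `redirect_query_url` and `target_page_id` are hypothetical helpers, and `sample` is a hand-written response with made-up titles that only mimics the shape of the API's JSON.

```python
from typing import Optional
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def redirect_query_url(page_id: int) -> str:
    """Build the query URL that asks the API to follow redirects for one page id."""
    params = {"action": "query", "format": "json", "redirects": 1, "pageids": page_id}
    return API_URL + "?" + urlencode(params)


def target_page_id(response: dict) -> Optional[int]:
    """Return the page id reported in the response, or None if the page is missing."""
    pages = response.get("query", {}).get("pages", {})
    for key, page in pages.items():
        if "missing" in page or int(key) < 0:
            return None
        return int(key)
    return None


# Hand-written example response (titles are made up).
sample = {
    "query": {
        "redirects": [{"from": "Old title", "to": "New title"}],
        "pages": {"38091": {"pageid": 38091, "ns": 0, "title": "New title"}},
    }
}
print(redirect_query_url(55286))
print(target_page_id(sample))
```

Returning `None` for a missing page, instead of raising, is a design choice for this sketch; it lets the caller collect unresolved ids and review them afterwards.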

In [26]:
redirects

Unnamed: 0,from,to
41801,55286,38091
117721,143732,142298
138426,171089,170430
243480,322640,32053
290638,393099,72595
...,...,...
52275519,66560028,66560028
53077146,67423223,47365029
53080291,67426585,13955537
53080992,67427342,13958387


One redirect is not resolved: the API returns the same page id.

In [27]:
redirects[redirects['from'] == redirects['to']]

Unnamed: 0,from,to
52275519,66560028,66560028


A manual search on Wikipedia shows that the actual page id is 1714346.

In [28]:
df_wiki[df_wiki.page_id==66560028]

Unnamed: 0,page_id,namespace,title,is_redirect,is_new,touched,links_updated,latest,len,content_model,page_edits,creation,editors,views,references
52275519,66560028,0,Freiberg_University_of_Mining_and_Technology,1,1,20210607015344,20210130000000.0,1003914911,106,wikitext,1.0,2021-01-31,1.0,807.0,


In [29]:
df_wiki[df_wiki.page_id==1714346]

Unnamed: 0,page_id,namespace,title,is_redirect,is_new,touched,links_updated,latest,len,content_model,page_edits,creation,editors,views,references
1140866,1714346,0,Technical_University_of_Bergakademie_Freiberg,0,0,20210626181757,20210620000000.0,1013038144,4697,wikitext,137.0,2005-04-09,83.0,1957.0,


In [30]:
redirects.loc[redirects['from'] == 66560028, 'to'] = 1714346

Before the redirects are applied, both datasets are joined.

In [31]:
df = df.merge(df_wiki, how='inner', left_on='wikipedia_url', right_on='title')
df

Unnamed: 0,id,university,publications,publications_frac,top_pubs,top_pubs_frac,collaboration,oa,country_iso_num_code,wikipedia_url,...,touched,links_updated,latest,len,content_model,page_edits,creation,editors,views,references
0,6,Univ Graz,3435,1468,413,138,2876,2273,40.0,University_of_Graz,...,20210630235338,2.021063e+13,1029166363,14288,wikitext,282.0,2003-10-29,187.0,5330.0,9.0
1,8,Graz Univ Technol,3481,1545,444,177,2927,1917,40.0,Graz_University_of_Technology,...,20210626181333,2.021062e+13,1019599621,11728,wikitext,266.0,2004-09-14,132.0,9628.0,14.0
2,10,Univ Innsbruck,5704,2244,705,216,4838,3957,40.0,University_of_Innsbruck,...,20210626181111,2.021062e+13,1018704964,14998,wikitext,272.0,2004-05-02,166.0,5561.0,3.0
3,14,Johannes Kepler Univ Linz,2628,1250,291,124,2124,1490,40.0,Johannes_Kepler_University_Linz,...,20210626181521,2.021062e+13,1016738358,13211,wikitext,156.0,2004-12-11,82.0,3196.0,9.0
4,15,Univ Salzburg,1850,851,230,99,1507,1163,40.0,University_of_Salzburg,...,20210626181522,2.021062e+13,1023529717,9565,wikitext,140.0,2004-12-11,97.0,4463.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1210,19133,IIEST Shibpur,1472,907,99,63,1024,214,356.0,Indian_Institute_of_Engineering_Science_and_Te...,...,20210626181444,2.021062e+13,1029533372,15593,wikitext,2887.0,2004-11-16,824.0,13663.0,17.0
1211,19138,SRM Inst Sci & Technol,2543,1300,216,96,1918,480,356.0,SRM_Institute_of_Science_and_Technology,...,20210626182619,2.021063e+13,1030347644,11414,wikitext,2490.0,2006-02-25,1025.0,39158.0,16.0
1212,32912,Universidade Lisboa,17465,7342,2025,657,14904,8696,620.0,University_of_Lisbon,...,20210627001535,2.021063e+13,1015299152,9753,wikitext,360.0,2004-04-09,193.0,6414.0,8.0
1213,101154,Soonchunhyang Univ,3193,1377,201,65,2512,1696,410.0,Soonchunhyang_University,...,20210623125139,2.021062e+13,1023052859,12214,wikitext,147.0,2007-01-14,75.0,1875.0,2.0


## Final check

The length of the Wikipedia pages is checked to identify problematic ones.

In [32]:
df.sort_values(by='len', ascending=True)[['wikipedia_url', 'len']].head(5)

Unnamed: 0,wikipedia_url,len
770,Indian_Institute_of_Technology_Delhi,23
764,Indian_Institute_of_Technology_Bombay,24
13,Universiteit_Gent,30
205,University_of_Utrecht,32
21,Masaryk_University_in_Brno,32


One such case is China University of Geosciences, whose title points to a disambiguation page. The actual page is `China_University_of_Geosciences_(Wuhan)`.

In [33]:
df_wiki[df_wiki['page_id']==60119299]

Unnamed: 0,page_id,namespace,title,is_redirect,is_new,touched,links_updated,latest,len,content_model,page_edits,creation,editors,views,references
46599704,60119299,0,China_University_of_Geosciences,0,0,20210614154603,20210610000000.0,886104655,194,wikitext,4.0,2019-03-02,2.0,56.0,


This case is handled together with the other redirects.

In [34]:
redirects = pd.concat([redirects, pd.DataFrame(data={'from':[60119299], 'to':[2790409]})])

The redirects are applied: the attributes of each redirecting page are replaced with those of its target page.

In [35]:
for i in redirects['from'].tolist():
    df.loc[df['page_id']==i, df_wiki.columns.tolist()] = df_wiki[df_wiki['page_id']==int(redirects.loc[redirects['from']==i,'to'].values[0])].values.flatten().tolist()
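
The same replacement could also be expressed as an id mapping followed by a merge, which avoids the row-by-row loop. The sketch below assumes this alternative; all table names and values are toy stand-ins, not data from the dump.

```python
import pandas as pd

# Toy stand-ins; ids, titles and lengths are illustrative only.
wiki = pd.DataFrame({
    "page_id": [1, 2, 3],
    "title": ["Old_Name", "New_Name", "Other"],
    "len": [30, 5000, 800],
})
universities = pd.DataFrame({"id": [10, 11], "page_id": [1, 3]})
redirects = pd.DataFrame({"from": [1], "to": [2]})

# Map redirecting ids to their targets; ids without a redirect stay unchanged.
mapping = dict(zip(redirects["from"], redirects["to"]))
universities["page_id"] = universities["page_id"].replace(mapping)

# Attach the page attributes of the resolved ids.
resolved = universities.merge(wiki, how="left", on="page_id")
print(resolved)
```

This variant requires the page attributes to be joined after the ids are resolved, whereas the notebook joins first and overwrites the affected rows afterwards.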

In [36]:
df

Unnamed: 0,id,university,publications,publications_frac,top_pubs,top_pubs_frac,collaboration,oa,country_iso_num_code,wikipedia_url,...,touched,links_updated,latest,len,content_model,page_edits,creation,editors,views,references
0,6,Univ Graz,3435,1468,413,138,2876,2273,40.0,University_of_Graz,...,20210630235338,2.021063e+13,1029166363,14288,wikitext,282.0,2003-10-29,187.0,5330.0,9.0
1,8,Graz Univ Technol,3481,1545,444,177,2927,1917,40.0,Graz_University_of_Technology,...,20210626181333,2.021062e+13,1019599621,11728,wikitext,266.0,2004-09-14,132.0,9628.0,14.0
2,10,Univ Innsbruck,5704,2244,705,216,4838,3957,40.0,University_of_Innsbruck,...,20210626181111,2.021062e+13,1018704964,14998,wikitext,272.0,2004-05-02,166.0,5561.0,3.0
3,14,Johannes Kepler Univ Linz,2628,1250,291,124,2124,1490,40.0,Johannes_Kepler_University_Linz,...,20210626181521,2.021062e+13,1016738358,13211,wikitext,156.0,2004-12-11,82.0,3196.0,9.0
4,15,Univ Salzburg,1850,851,230,99,1507,1163,40.0,University_of_Salzburg,...,20210626181522,2.021062e+13,1023529717,9565,wikitext,140.0,2004-12-11,97.0,4463.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1210,19133,IIEST Shibpur,1472,907,99,63,1024,214,356.0,Indian_Institute_of_Engineering_Science_and_Te...,...,20210626181444,2.021062e+13,1029533372,15593,wikitext,2887.0,2004-11-16,824.0,13663.0,17.0
1211,19138,SRM Inst Sci & Technol,2543,1300,216,96,1918,480,356.0,SRM_Institute_of_Science_and_Technology,...,20210626182619,2.021063e+13,1030347644,11414,wikitext,2490.0,2006-02-25,1025.0,39158.0,16.0
1212,32912,Universidade Lisboa,17465,7342,2025,657,14904,8696,620.0,University_of_Lisbon,...,20210627001535,2.021063e+13,1015299152,9753,wikitext,360.0,2004-04-09,193.0,6414.0,8.0
1213,101154,Soonchunhyang Univ,3193,1377,201,65,2512,1696,410.0,Soonchunhyang_University,...,20210623125139,2.021062e+13,1023052859,12214,wikitext,147.0,2007-01-14,75.0,1875.0,2.0


The data now look fine: the shortest pages are no longer redirect stubs.

In [37]:
df.sort_values(by='len', ascending=True)[['wikipedia_url', 'len']].head(5)

Unnamed: 0,wikipedia_url,len
1187,Shenyang_Agricultural_University,791
1166,Xinxiang_Medical_University,1259
980,Zagazig_University,1400
1064,Tianjin_University_of_Science_and_Technology,1506
1162,Shanxi_Medical_University,1816


Our final sample is composed of 1215 universities.

In [38]:
df_wiki[df_wiki.title.isin(df.wikipedia_url)]

Unnamed: 0,page_id,namespace,title,is_redirect,is_new,touched,links_updated,latest,len,content_model,page_edits,creation,editors,views,references
1025,1859,0,Arizona_State_University,0,0,20210630203037,2.021063e+13,1031291521,153688,wikitext,5260.0,2001-09-23,1813.0,83094.0,241.0
2638,4157,0,Brown_University,0,0,20210630041419,2.021063e+13,1030113579,154347,wikitext,5366.0,2001-09-10,2081.0,197635.0,101.0
3814,5690,0,Chalmers_University_of_Technology,0,0,20210628143759,2.021063e+13,1030879897,19665,wikitext,620.0,2002-02-25,362.0,12169.0,10.0
3889,5786,0,California_Institute_of_Technology,0,0,20210630041419,2.021063e+13,1031095636,132023,wikitext,2791.0,2001-06-28,1363.0,119083.0,124.0
4279,6310,0,Columbia_University,0,0,20210701064340,2.021070e+13,1031366616,169071,wikitext,9899.0,2001-09-10,3453.0,282775.0,290.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52275519,66560028,0,Freiberg_University_of_Mining_and_Technology,1,1,20210607015344,2.021013e+13,1003914911,106,wikitext,1.0,2021-01-31,1.0,807.0,
53077146,67423223,0,University_of_Côte_d'Azur,1,1,20210623150859,2.021042e+13,1018320796,84,wikitext,1.0,2021-04-17,1.0,1626.0,
53080291,67426585,0,University_of_Rouen,1,1,20210607015848,2.021042e+13,1018381042,89,wikitext,1.0,2021-04-17,1.0,718.0,
53080992,67427342,0,Paul_Sabatier_University,1,1,20210607015848,2.021042e+13,1018396359,100,wikitext,1.0,2021-04-17,1.0,1293.0,


## Language links

The number of language links is retrieved via the Wikipedia API.

In [39]:
df['langlinks'] = 0

In [40]:
count=0
for i in df['page_id']:
    count+=1
    print(round(100*count/df.shape[0],2), end='\r')
    url = 'https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllimit=max&pageids=' + str(i)
    query = requests.get(url, verify=False)
    response = json.loads(query.text)
    if ('langlinks' in response['query']['pages'][str(i)]):
        df.loc[df['page_id']==i,'langlinks'] = len(response['query']['pages'][str(i)]['langlinks'])
    else:
        df.loc[df['page_id']==i,'langlinks'] = 0

100.0
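
The counting step in the loop above can be isolated so that it works on an already parsed response. This is a sketch: `count_langlinks` and `is_truncated` are hypothetical helpers, and `sample` is a hand-written response that only mimics the shape of the API's JSON.

```python
def count_langlinks(response: dict, page_id: int) -> int:
    """Count the language links listed for page_id in a parsed API response."""
    page = response.get("query", {}).get("pages", {}).get(str(page_id), {})
    # Pages without interlanguage links simply lack the 'langlinks' key.
    return len(page.get("langlinks", []))


def is_truncated(response: dict) -> bool:
    """True when the API signals that more results are available."""
    return "continue" in response


# Hand-written example response with two language links.
sample = {
    "query": {
        "pages": {
            "1859": {
                "pageid": 1859,
                "ns": 0,
                "title": "Arizona State University",
                "langlinks": [
                    {"lang": "de", "*": "Arizona State University"},
                    {"lang": "es", "*": "Universidad Estatal de Arizona"},
                ],
            }
        }
    }
}
print(count_langlinks(sample, 1859))
```

Checking `is_truncated` is a precaution: if a response ever carried a `continue` key, the count from a single request would be incomplete.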

## Ages

Since the pages and the ranking are dated 2021, this year is taken as the reference year.

In [41]:
df['univ_age'] = 2021-df.established_year
df['page_age'] = 2021-df.creation.str[0:4].astype(int)
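
As a worked check of the two formulas, the sketch below recomputes the ages for a single record. The function names are hypothetical; the inputs are a founding year of 1585 and a creation date of 2003-10-29, used here purely as an example.

```python
REFERENCE_YEAR = 2021


def univ_age(established_year: float) -> float:
    """Years between the founding year and the reference year."""
    return REFERENCE_YEAR - established_year


def page_age(creation: str) -> int:
    """Years between page creation and the reference year."""
    # creation is a date string such as '2003-10-29'; only the year is used.
    return REFERENCE_YEAR - int(creation[:4])


print(univ_age(1585.0), page_age("2003-10-29"))
```

Both ages are differences between calendar years, so two pages created in January and December of the same year receive the same age, and a missing established year yields a missing age.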

## % Top publications

The share of top publications is calculated as a proportion (`top_pubs` divided by `publications`); it is not multiplied by 100.

In [42]:
df['pp_top'] = df['top_pubs']/df['publications']
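
A quick check with the figures for Univ Graz from the first table (413 top publications out of 3435) illustrates the scale of the resulting value. The function name `pp_top` is reused here only for this sketch.

```python
def pp_top(top_pubs: int, publications: int) -> float:
    """Share of publications that are top publications, as a fraction in [0, 1]."""
    return top_pubs / publications


share = pp_top(413, 3435)
print(round(share, 6))   # 0.120233
print(f"{share:.1%}")    # 12.0%
```

Formatting with `.1%` is one way to present the value as a percentage without changing the stored column.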

In [43]:
df

Unnamed: 0,id,university,publications,publications_frac,top_pubs,top_pubs_frac,collaboration,oa,country_iso_num_code,wikipedia_url,...,content_model,page_edits,creation,editors,views,references,langlinks,univ_age,page_age,pp_top
0,6,Univ Graz,3435,1468,413,138,2876,2273,40.0,University_of_Graz,...,wikitext,282.0,2003-10-29,187.0,5330.0,9.0,45,436.0,18,0.120233
1,8,Graz Univ Technol,3481,1545,444,177,2927,1917,40.0,Graz_University_of_Technology,...,wikitext,266.0,2004-09-14,132.0,9628.0,14.0,18,210.0,17,0.127550
2,10,Univ Innsbruck,5704,2244,705,216,4838,3957,40.0,University_of_Innsbruck,...,wikitext,272.0,2004-05-02,166.0,5561.0,3.0,37,352.0,17,0.123597
3,14,Johannes Kepler Univ Linz,2628,1250,291,124,2124,1490,40.0,Johannes_Kepler_University_Linz,...,wikitext,156.0,2004-12-11,82.0,3196.0,9.0,17,55.0,17,0.110731
4,15,Univ Salzburg,1850,851,230,99,1507,1163,40.0,University_of_Salzburg,...,wikitext,140.0,2004-12-11,97.0,4463.0,,23,399.0,17,0.124324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1210,19133,IIEST Shibpur,1472,907,99,63,1024,214,356.0,Indian_Institute_of_Engineering_Science_and_Te...,...,wikitext,2887.0,2004-11-16,824.0,13663.0,17.0,2,165.0,17,0.067255
1211,19138,SRM Inst Sci & Technol,2543,1300,216,96,1918,480,356.0,SRM_Institute_of_Science_and_Technology,...,wikitext,2490.0,2006-02-25,1025.0,39158.0,16.0,2,36.0,15,0.084939
1212,32912,Universidade Lisboa,17465,7342,2025,657,14904,8696,620.0,University_of_Lisbon,...,wikitext,360.0,2004-04-09,193.0,6414.0,8.0,38,110.0,17,0.115946
1213,101154,Soonchunhyang Univ,3193,1377,201,65,2512,1696,410.0,Soonchunhyang_University,...,wikitext,147.0,2007-01-14,75.0,1875.0,2.0,5,43.0,14,0.062950


## Two years views

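To validate the use of the views, the totals for a one-year period (July 2020 to July 2021) and a two-year period (July 2019 to July 2021) are downloaded from the Wikimedia pageviews API.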

In [44]:
df['1y_views'] = 0
df['2y_views'] = 0

In [45]:
count=0
for i in df['title']:
    count+=1
    print(round(100*count/df.shape[0],2), end='\r')
    url = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/' + i + '/monthly/2020070100/2021070100'
    query = requests.get(url, verify=False, headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
    response = json.loads(query.text)
    df.loc[df['title']==i,'1y_views'] = sum([x['views'] for x in response['items']])

100.0

In [46]:
count=0
for i in df['title']:
    count+=1
    print(round(100*count/df.shape[0],2), end='\r')
    url = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/' + i + '/monthly/2019070100/2021070100'
    query = requests.get(url, verify=False, headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
    response = json.loads(query.text)
    df.loc[df['title']==i,'2y_views'] = sum([x['views'] for x in response['items']])

100.0
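
The two loops above differ only in the start date, so the URL construction and the summation could be factored out. This is a sketch with hypothetical helper names; it also percent-encodes the title, which the loops above do not do, so that characters such as `/` are not read as path separators.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(title: str, start: str, end: str) -> str:
    """Build the monthly per-article pageviews URL for English Wikipedia."""
    # safe="" also encodes '/', which would otherwise split the path.
    article = quote(title, safe="")
    return f"{BASE}/en.wikipedia/all-access/all-agents/{article}/monthly/{start}/{end}"


def total_views(response: dict) -> int:
    """Sum the monthly counts; a response without 'items' counts as zero views."""
    return sum(item["views"] for item in response.get("items", []))


print(pageviews_url("Harvard_University", "2020070100", "2021070100"))
print(total_views({"items": [{"views": 10}, {"views": 5}]}))
```

Treating a response without `items` as zero views is an assumption of this sketch; a stricter script might raise an error instead, so that failed requests are not mistaken for unviewed pages.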

Top viewed Wikipedia pages.

In [47]:
df[['title', 'publications', 'publications_frac', 'top_pubs', 'top_pubs_frac', 'pp_top', 'collaboration', 'oa', 'len', 'page_edits', 'editors', 'views', '1y_views', '2y_views', 'references', 'langlinks', 'univ_age', 'page_age']].sort_values(by='views', ascending=False).head(20)

Unnamed: 0,title,publications,publications_frac,top_pubs,top_pubs_frac,pp_top,collaboration,oa,len,page_edits,editors,views,1y_views,2y_views,references,langlinks,univ_age,page_age
495,Harvard_University,82476,34234,18327,7246,0.22221,71265,59599,85406,10470.0,4395.0,536066.0,2743858,5596572,148.0,117,385.0,20
124,University_of_Oxford,41154,16088,8524,2991,0.207124,35001,35031,190150,6726.0,2935.0,363921.0,1697222,3262418,199.0,122,925.0,20
586,Stanford_University,39938,16454,9316,3563,0.233262,33366,27390,143495,7994.0,3222.0,336640.0,1829344,3967401,218.0,99,130.0,20
553,Columbia_University,33011,12558,6757,2221,0.204689,28386,22631,169071,9899.0,3453.0,282775.0,1456216,3015108,290.0,88,267.0,20
952,Baylor_University,4064,1465,593,149,0.145915,3454,2211,73040,4000.0,1623.0,272730.0,603949,1136138,64.0,24,176.0,18
551,Yale_University,28569,11716,5724,2130,0.200357,23964,19760,220239,7850.0,3446.0,271215.0,1376915,2894465,173.0,94,320.0,20
496,Massachusetts_Institute_of_Technology,29268,10507,7586,2616,0.259191,25571,22543,197542,7237.0,2710.0,247107.0,1272163,2758928,391.0,94,160.0,20
92,University_of_Cambridge,35202,14080,7302,2628,0.207431,29917,29673,154558,6586.0,2640.0,233540.0,1143017,2318268,165.0,114,812.0,20
568,Princeton_University,13702,5332,3090,1214,0.225515,11661,10360,144244,5319.0,2621.0,215139.0,1084928,2291041,218.0,90,275.0,20
564,University_of_Pennsylvania,32731,13568,6592,2336,0.201399,27387,22401,249678,7821.0,2643.0,208975.0,1182602,2245820,150.0,72,281.0,20


The dataset is exported.

In [48]:
df.to_csv('results/wiki_univ.tsv', sep='\t', index=False)