## Summary

This notebook extracts the list of universities in the Leiden Ranking 2021, together with their identifiers.

## Packages

In [None]:
import pandas as pd

## Leiden Ranking

<div class="alert alert-block alert-info"> <b>More info:</b> <a href="https://www.leidenranking.com/ranking/2021/list">CWTS Leiden Ranking 2021</a> </div>

The Leiden Ranking 2021 includes a total of 1225 universities. This dataset contains each university's short name and id; the id is needed to link each university to its Wikipedia page.

In [None]:
df_leiden = pd.read_csv('data/leiden_ranking_2021.tsv', sep='\t')
df_leiden

There are 1225 universities, and their names are unique.

In [None]:
len(df_leiden.university.drop_duplicates())

The full dataset includes the full name of each university along with a large number of indicators, but it does not include the university id. It is reduced here to the rows for the "All sciences" field and the 2016–2019 period.

In [None]:
df_leiden_total = pd.read_csv('data/CWTS_leiden_ranking_2021.csv', sep=';')
df_leiden_total = df_leiden_total[(df_leiden_total.Field=='All sciences') & (df_leiden_total.Period=='2016–2019')]
df_leiden_total

Then only total values (rows where `Frac_counting` is 0) are selected.

In [None]:
df_leiden_total = df_leiden_total[df_leiden_total.Frac_counting==0]
df_leiden_total
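
The two filtering steps above can also be written as a single boolean mask. The sketch below uses a small invented table (the rows and the university name are illustrative, not taken from the ranking) and assumes the same column names as the cells above. It also shows that the period label must be matched with an en dash (`\u2013`), as in the filter above; a plain hyphen would select nothing.

```python
import pandas as pd

# En dash between the years, as in the filter used in this notebook.
PERIOD = "2016\u20132019"

# Toy stand-in for the full dataset; all values are invented for illustration.
toy = pd.DataFrame({
    "University": ["Alpha University"] * 4,
    "Field": ["All sciences", "All sciences", "Some other field", "All sciences"],
    "Period": [PERIOD, PERIOD, PERIOD, "2015\u20132018"],
    "Frac_counting": [0, 1, 0, 0],
})

# Combine field, period and counting conditions in one mask.
mask = (
    (toy["Field"] == "All sciences")
    & (toy["Period"] == PERIOD)
    & (toy["Frac_counting"] == 0)
)
selected = toy[mask]

# A plain ASCII hyphen does not match the en-dash label.
hyphen_matches = int((toy["Period"] == "2016-2019").sum())

print(len(selected), hyphen_matches)
```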

There is a total of 1225 unique universities.

In [None]:
len(df_leiden_total.University.drop_duplicates())

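As the universities cannot be linked by name (one dataset has only the short name, the other only the full name), the numbers of publications and collaborations are used instead. For this to work, each pair of values must identify a single university, so the distinct pairs are listed below.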

In [None]:
df_leiden_total[['impact_P', 'P_collab']].drop_duplicates()
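
One way to confirm that the pair of columns works as a key is to look for rows that share the same pair. The following is a minimal sketch on invented data, assuming the column names `impact_P` and `P_collab` from the cell above; the university names and numbers are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_leiden_total; the values are invented for illustration.
toy_total = pd.DataFrame({
    "University": ["Alpha University", "Beta University", "Gamma University"],
    "impact_P": [1200, 1200, 850],
    "P_collab": [900, 640, 640],
})

key = ["impact_P", "P_collab"]

# Rows whose (impact_P, P_collab) pair occurs more than once.
# keep=False marks every member of a duplicated group, not only the repeats.
clashes = toy_total[toy_total.duplicated(subset=key, keep=False)]

# An empty result means the pair identifies each row uniquely.
key_is_unique = clashes.empty
print(key_is_unique)
```

In this toy table each single column contains a repeated value, yet the pairs are all distinct, which is the property the merge relies on.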

A new `DataFrame` is created with the id, full name and short name of the universities.

In [None]:
id_table = df_leiden.merge(
    df_leiden_total,
    how='inner',
    left_on=['publications', 'collaboration'],
    right_on=['impact_P', 'P_collab'],
)[['id', 'University', 'university']]
id_table
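
Because an inner merge silently drops rows without a match, it can be useful to check the join before relying on it. The sketch below is not part of the original workflow: it runs on two small invented tables and uses the `validate` and `indicator` options of `DataFrame.merge` to confirm that the keys are one-to-one and to list unmatched rows. The column names follow the cells above; all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_leiden and df_leiden_total; all values are invented.
short = pd.DataFrame({
    "id": [1, 2, 3],
    "university": ["Alpha Univ", "Beta Univ", "Gamma Univ"],
    "publications": [1200, 1200, 850],
    "collaboration": [900, 640, 640],
})
full = pd.DataFrame({
    "University": ["Alpha University", "Beta University", "Delta University"],
    "impact_P": [1200, 1200, 700],
    "P_collab": [900, 640, 500],
})

# An outer merge keeps unmatched rows from both sides.
# validate raises MergeError if a key pair repeats on either side;
# indicator adds a `_merge` column: 'both', 'left_only' or 'right_only'.
checked = short.merge(
    full,
    how="outer",
    left_on=["publications", "collaboration"],
    right_on=["impact_P", "P_collab"],
    validate="one_to_one",
    indicator=True,
)

unmatched = checked[checked["_merge"] != "both"]
matched = checked.loc[checked["_merge"] == "both", ["id", "University", "university"]]

print(len(matched), len(unmatched))
```

If `unmatched` is empty on the real data, the inner merge above loses nothing.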

Finally, the table is exported as a TSV file.

In [None]:
id_table.to_csv('data/universities_ids.tsv', sep='\t', index=False)