## Summary

The bibliometric indicators, mostly taken from the Leiden Ranking itself, are obtained for each university.

# Packages

In [None]:
import pandas as pd

# Leiden Ranking

Data from the Leiden Ranking (2021) is imported from a semicolon-separated CSV file.

In [None]:
df_l = pd.read_csv('data/CWTS_leiden_ranking_2021.csv', sep=';', encoding='UTF-8')
df_l

The main indicators are:
+ impact_P - Total number of publications of a university.
+ P_top10 - The number of a university’s publications that, compared with other publications in the same field and in the same year, belong to the top 10% most frequently cited.
+ collab_P - The number of a university’s publications that have been co-authored with one or more other organizations.
+ P_int_collab - The number of a university’s publications that have been co-authored by two or more countries.
+ P_OA - The number of open access publications of a university.
+ TCS - The total number of citations of the publications of a university.
+ TNCS - The total number of citations of the publications of a university, normalized for field and publication year.
+ P_industry_collab - The number of a university’s publications that have been co-authored with one or more industrial organizations.
+ PA_F_MF - The number of female authorships as a proportion of a university’s number of male and female authorships.

The indicators are filtered to the 2016–2019 period and to the values obtained without fractional counting (Frac_counting equal to 0).

In [None]:
df_l = df_l[(df_l.Period == '2016–2019') & (df_l.Frac_counting == 0)].copy()
df_l.shape
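The filter in the cell above compares Period against a label written with an en dash rather than an ASCII hyphen. As a minimal sketch on a toy frame (all values are hypothetical, only the column names come from the notebook), the following shows that a hyphenated label would silently match no rows:

```python
import pandas as pd

# Toy rows (hypothetical values) mimicking the Period and Frac_counting columns.
toy = pd.DataFrame({
    "Period": ["2015–2018", "2016–2019", "2016–2019"],
    "Frac_counting": [0, 0, 1],
    "impact_P": [100.0, 120.0, 80.5],
})

# An ASCII hyphen does not equal the en dash (U+2013) used in the labels.
with_hyphen = toy[toy.Period == "2016-2019"]

# Same condition as in the notebook: en dash label and full counting only.
with_en_dash = toy[(toy.Period == "2016–2019") & (toy.Frac_counting == 0)]

print(len(with_hyphen), len(with_en_dash))
```

Checking the shape after filtering, as the notebook does, is a cheap way to catch an empty result caused by this kind of mismatch.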

## Publications by categories

In [None]:
df_categories = df_l[['University', 'Country', 'Field', 'impact_P']].copy()
df_categories

In [None]:
df_categories = df_categories.pivot_table(columns='Field', index=['University', 'Country'], values='impact_P').reset_index()
df_categories
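To illustrate what the pivot in the cell above does, here is a small sketch on hypothetical data: the long table with one row per university and field becomes a wide table with one column per field. Note that `pivot_table` aggregates with the mean by default, which is harmless here only under the assumption that each university and field pair appears once.

```python
import pandas as pd

# Long format: one row per (university, field); values are hypothetical.
toy = pd.DataFrame({
    "University": ["A", "A", "B", "B"],
    "Country": ["X", "X", "Y", "Y"],
    "Field": ["All sciences", "Life and earth sciences"] * 2,
    "impact_P": [10, 4, 20, 7],
})

# Wide format: one row per university, one column per field.
wide = toy.pivot_table(
    columns="Field", index=["University", "Country"], values="impact_P"
).reset_index()

print(wide)
```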

In addition to the overall number of publications (impact_P), the number of publications is obtained for each of the broad areas of knowledge:
+ impact_P_Bio_Health - Biomedical and health sciences
+ impact_P_Life_Earth - Life and earth sciences
+ impact_P_Math_Comp - Mathematics and computer science
+ impact_P_Phy_Eng - Physical sciences and engineering
+ impact_P_Soc_Hum - Social sciences and humanities

In [None]:
df_categories.rename(columns={'All sciences':'impact_P',
                              'Biomedical and health sciences':'impact_P_Bio_Health',
                              'Life and earth sciences':'impact_P_Life_Earth',
                              'Mathematics and computer science':'impact_P_Math_Comp',
                              'Physical sciences and engineering':'impact_P_Phy_Eng',
                              'Social sciences and humanities':'impact_P_Soc_Hum'},
                    inplace=True)

## Selecting indicators

The remaining indicators are selected for all disciplines combined (the 'All sciences' field).

In [None]:
df_l = df_l[df_l.Field == 'All sciences'][['University', 'P_top10', 'collab_P', 'P_int_collab', 'P_OA', 'TCS', 'TNCS', 'P_industry_collab', 'PA_F_MF']].copy()
df_l

The publication counts by area of knowledge are merged with the selected indicators.

In [None]:
df = df_categories.merge(df_l, how='inner', on='University')
df

The PA_F_MF column is stored as text with a percent sign and a decimal comma, so it is converted to a float.

In [None]:
df['PA_F_MF'] = df['PA_F_MF'].str.replace('%', '')
df['PA_F_MF'] = df['PA_F_MF'].str.replace(',', '.')
df['PA_F_MF'] = df['PA_F_MF'].astype('float')
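The same conversion can be sketched on a toy Series (hypothetical values). This variant passes `regex=False` so the replacements are treated as literal strings, and uses `pd.to_numeric` with `errors="coerce"` so that missing or malformed entries become NaN instead of raising; this is an alternative, not what the notebook itself does.

```python
import pandas as pd

# Hypothetical percentage strings with a decimal comma; one value is missing.
s = pd.Series(["34,5%", "41,0%", None])

# Strip the percent sign and swap the decimal comma for a point (literal matches).
clean = s.str.replace("%", "", regex=False).str.replace(",", ".", regex=False)

# Convert to float; anything that cannot be parsed becomes NaN.
values = pd.to_numeric(clean, errors="coerce")

print(values)
```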

# GRID ID

The Leiden Ranking id and the GRID id are merged with the main dataset.

In [None]:
df_ids = pd.read_csv('data/universities_ids.tsv', sep='\t', encoding='UTF-8')
df_ids

This dataset was generated by curating the dataset from the previous notebook.

In [None]:
df_grid = pd.read_csv('data/leiden_grid_id.tsv', sep='\t', encoding='UTF-8')
df_grid

Both names (short and full) and both ids (Leiden and GRID) are combined in a dataframe.

In [None]:
df_grid = df_grid.merge(df_ids[['id','University']], how='inner', on='id')
df_grid = df_grid[['id', 'grid_id', 'university', 'University']]
df_grid.rename(columns={'university':'short_name',
                        'University':'full_name'},
              inplace=True)
df_grid

There is only one duplicate: two universities share the same GRID id.

In [None]:
df_grid[df_grid.grid_id.duplicated()]
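With its default `keep='first'`, `duplicated()` flags only the second and later occurrences, so the cell above shows one of the two universities. A minimal sketch on hypothetical rows shows how `keep=False` lists every row that shares an id:

```python
import pandas as pd

# Hypothetical ids: the first and third rows share the same GRID id.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "grid_id": ["grid.1", "grid.2", "grid.1"],
    "short_name": ["Univ A", "Univ B", "Univ C"],
})

# Default behaviour: only the later occurrence is flagged.
later_only = toy[toy.grid_id.duplicated()]

# keep=False flags every row whose grid_id appears more than once.
all_sharing = toy[toy.grid_id.duplicated(keep=False)]

print(later_only.short_name.tolist(), all_sharing.short_name.tolist())
```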

In [None]:
df_grid[df_grid.short_name.duplicated()]

In [None]:
df_grid[df_grid.full_name.duplicated()]

In [None]:
df[df.University.duplicated()]

In [None]:
df[df.University.isin(df_grid.full_name.tolist())]

In [None]:
df_grid = df_grid.merge(df, how='inner', left_on='full_name', right_on='University')
df_grid
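An inner merge on names silently drops any university whose name differs between the two tables. As a hedged sketch on hypothetical data, an outer merge with `indicator=True` can be used beforehand to list the names that would not match:

```python
import pandas as pd

# Hypothetical id table and indicator table with partially overlapping names.
ids = pd.DataFrame({
    "full_name": ["Univ A", "Univ B", "Univ C"],
    "grid_id": ["grid.1", "grid.2", "grid.3"],
})
indicators = pd.DataFrame({
    "University": ["Univ A", "Univ C", "Univ D"],
    "impact_P": [10, 20, 30],
})

# The _merge column records whether each row came from the left, right, or both.
check = ids.merge(
    indicators, how="outer", left_on="full_name", right_on="University", indicator=True
)
unmatched = check[check["_merge"] != "both"]

print(unmatched[["full_name", "University", "_merge"]])
```

An empty `unmatched` frame would indicate that the inner merge loses no rows.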

In [None]:
df_grid.drop(columns='University', inplace=True)

# Dimensions publications

In [None]:
df_dim_wos = pd.read_csv('data/grid_pubs_dim_wos.tsv', sep='\t', encoding='UTF-8')
df_dim_wos

In [None]:
df_dim_wos[df_dim_wos['grid_id'].isin(df_grid.grid_id.tolist())].shape

The percentage of Dimensions publications indexed in Web of Science is added to the main dataframe.

In [None]:
df_grid = df_grid.merge(df_dim_wos, how='inner', on='grid_id')
df_grid
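Because two universities share a GRID id, a merge on grid_id is many-to-one rather than one-to-one. The sketch below, on hypothetical data (the column name `pct_wos` is an assumption, not taken from the actual file), uses the `validate` argument of `merge` to make that expectation explicit:

```python
import pandas as pd

# Hypothetical left table: two universities share grid.1.
left = pd.DataFrame({
    "grid_id": ["grid.1", "grid.1", "grid.2"],
    "short_name": ["Univ A", "Univ B", "Univ C"],
})
# Hypothetical right table: one row per GRID id; pct_wos is an assumed column name.
right = pd.DataFrame({"grid_id": ["grid.1", "grid.2"], "pct_wos": [80.0, 75.5]})

# many_to_one passes: keys may repeat on the left but must be unique on the right.
merged = left.merge(right, how="inner", on="grid_id", validate="many_to_one")

# one_to_one fails because grid.1 appears twice on the left.
try:
    left.merge(right, on="grid_id", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(len(merged), one_to_one_ok)
```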

The final dataset is exported.

In [None]:
df_grid.to_csv('data/leiden_indicators.tsv', index=False, sep='\t')