# Librarian's quest to exhaustivity and openness

## Consolidation of AoU data

Project for the EAHIL 2024 conference: https://eahil2024.rsu.lv/

Authors: **Floriane Muller & Pablo Iriarte**, University of Geneva  
Last update: 31.05.2024  

This notebook merges the data exported from AoU with PubMed and OpenAlex information.

### Sources

* Files extracted with notebook 1:

    * DOIs: data/temp/[year]/aou_dois.tsv
    * PMIDs: data/temp/[year]/aou_pmids.tsv
    * Links: data/temp/[year]/aou_856.tsv
    * UNIGE structures: data/temp/[year]/aou_928.tsv
    * Publication dates: data/temp/[year]/aou_dates.tsv
    * Funders: data/temp/[year]/aou_988f.tsv
    * Document types: data/temp/[year]/aou_980a.tsv

* Mapping of UNIGE biomedical units to general structures: data/sources/Structures_AoU_EAHIL2024_V2.xlsx

### Results

* data/temp/[year]/aou_structures_non_biomed.tsv
* data/temp/[year]/aou_ids_structures_non_biomed.tsv
* data/temp/[year]/aou_ids_structures_non_biomed.xlsx
* data/temp/[year]/aou_dois_duplicates.tsv
* data/temp/[year]/aou_pmids_duplicates.tsv
* results/[year]/aou_all.tsv
* results/[year]/aou_all.xlsx
* results/[year]/aou_biomed_2015_2022.tsv
* results/[year]/aou_biomed_2015_2022.xlsx
* results/[year]/aou_856.tsv
* results/[year]/aou_856.xlsx

In [1]:
import pandas as pd
import csv
#from tqdm.auto import tqdm

# parameters
structures_biomed = 'data/sources/Structures_AoU_EAHIL2024_V2.xlsx'
myfolder_temp = 'data/temp/2024/'
myfolder_results = 'results/2024/'
export_aou_856 = myfolder_temp + 'aou_856.tsv'
export_aou_988 = myfolder_temp + 'aou_988f.tsv'
export_aou_928 = myfolder_temp + 'aou_928.tsv'
export_aou_dois = myfolder_temp + 'aou_dois.tsv'
export_aou_pmids = myfolder_temp + 'aou_pmids.tsv'
export_aou_980a = myfolder_temp + 'aou_980a.tsv'
export_aou_dates = myfolder_temp + 'aou_dates.tsv'

# display all columns
pd.set_option('display.max_columns', None)
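
The source list above uses a `[year]` placeholder in every path, while the parameters cell hard-codes `2024`. As a minimal sketch, the paths could be derived from a single year value. The helper name `build_paths` and its arguments are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def build_paths(year, temp_root="data/temp", results_root="results"):
    """Return the input file paths for one export year (hypothetical helper)."""
    temp = Path(temp_root) / str(year)
    # file stems as listed in the Sources section
    names = ["aou_856", "aou_988f", "aou_928", "aou_dois",
             "aou_pmids", "aou_980a", "aou_dates"]
    paths = {name: temp / f"{name}.tsv" for name in names}
    # folder that receives the final result files
    paths["results"] = Path(results_root) / str(year)
    return paths

paths = build_paths(2024)
print(paths["aou_856"].as_posix())
```

Changing the year in one place then updates every input and output location together.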

## Import AoU data

### 1. Clean and prepare Data about supplementary files and associated datasets

In [2]:
aou_856 = pd.read_csv(export_aou_856, encoding='utf-8', header=0, sep='\t')
aou_856

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z
0,unige:1,Article (Published version),unige1.pdf,https://archive-ouverte.unige.ch/download/ffe9...,Restricted access,
1,unige:3,Article (Published version),unige3.pdf,https://archive-ouverte.unige.ch/download/c0ba...,Restricted access,
2,unige:4,Article (Published version),1-s2.0-S0022354916324418-main.pdf,https://archive-ouverte.unige.ch/download/e6dc...,Restricted access,
3,unige:5,Article (Published version),11095_2007_Article_9429.pdf,https://archive-ouverte.unige.ch/download/a32f...,Public access,OA NATIONAL
4,unige:6,Article,fulltext.pdf,https://archive-ouverte.unige.ch/download/470f...,Restricted access,
...,...,...,...,...,...,...
148000,unige:167401,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/8e3d...,Public access,CC BY-4.0
148001,unige:167401,Alternate edition,,https://www.mdpi.com/1422-0067/23/23/14585,,
148002,unige:167402,Article (Published version),A 578. Assessing the androgenic and.pdf,https://archive-ouverte.unige.ch/download/2e32...,Restricted access,
148003,unige:167402,Appendix,cen14847-sup-0001-supmat.docx,https://archive-ouverte.unige.ch/download/0cbf...,Public access,


In [3]:
# check values for 170420
aou_856.loc[aou_856['id'] == 'unige:170420']

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z
5746,unige:170420,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/a5ba...,Public access,CC BY
5747,unige:170420,Appendix,41467_2020_15706_MOESM1_ESM.pdf,https://archive-ouverte.unige.ch/download/4448...,Public access,
5748,unige:170420,Appendix,41467_2020_15706_MOESM3_ESM.pdf,https://archive-ouverte.unige.ch/download/18a9...,Public access,
5749,unige:170420,Supplemental data,41467_2020_15706_MOESM4_ESM.xlsx,https://archive-ouverte.unige.ch/download/a849...,Public access,
5750,unige:170420,Supplemental data,41467_2020_15706_MOESM5_ESM.xlsx,https://archive-ouverte.unige.ch/download/8298...,Public access,
5751,unige:170420,Supplemental data,41467_2020_15706_MOESM6_ESM.xlsx,https://archive-ouverte.unige.ch/download/b6b3...,Public access,
5752,unige:170420,Supplemental data,41467_2020_15706_MOESM7_ESM.xlsx,https://archive-ouverte.unige.ch/download/d534...,Public access,
5753,unige:170420,Supplemental data,41467_2020_15706_MOESM8_ESM.xlsx,https://archive-ouverte.unige.ch/download/83bf...,Public access,
5754,unige:170420,Supplemental data,41467_2020_15706_MOESM9_ESM.xlsx,https://archive-ouverte.unige.ch/download/2a86...,Public access,
5755,unige:170420,Supplemental data,41467_2020_15706_MOESM10_ESM.xlsx,https://archive-ouverte.unige.ch/download/ad2e...,Public access,


In [4]:
aou_856['856_3'].value_counts()

Article (Published version)                61595
Alternate edition                          35871
Thesis                                      6968
Master thesis                               6642
Article (Accepted version)                  5800
Book chapter (Published version)            5220
Article                                     4820
Appendix                                    3505
Proceedings chapter (Published version)     2666
Supplemental data                           2665
Book (Published version)                    1445
Dataset                                     1397
Report                                      1297
Book chapter (Accepted version)             1069
Presentation                                1017
Extract                                     1016
Book chapter                                 920
Proceedings chapter (Accepted version)       666
Proceedings chapter                          585
Article (Submitted version)                  433
Working paper       

In [5]:
# flag rows typed as Supplemental data, Appendix or Dataset (DATA = 1), then fill every remaining NaN with 0
aou_856.loc[aou_856['856_3'] == 'Supplemental data', 'DATA'] = 1
aou_856.loc[aou_856['856_3'] == 'Appendix', 'DATA'] = 1
aou_856.loc[aou_856['856_3'] == 'Dataset', 'DATA'] = 1
aou_856 = aou_856.fillna(0)
aou_856

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z,DATA
0,unige:1,Article (Published version),unige1.pdf,https://archive-ouverte.unige.ch/download/ffe9...,Restricted access,0,0.0
1,unige:3,Article (Published version),unige3.pdf,https://archive-ouverte.unige.ch/download/c0ba...,Restricted access,0,0.0
2,unige:4,Article (Published version),1-s2.0-S0022354916324418-main.pdf,https://archive-ouverte.unige.ch/download/e6dc...,Restricted access,0,0.0
3,unige:5,Article (Published version),11095_2007_Article_9429.pdf,https://archive-ouverte.unige.ch/download/a32f...,Public access,OA NATIONAL,0.0
4,unige:6,Article,fulltext.pdf,https://archive-ouverte.unige.ch/download/470f...,Restricted access,0,0.0
...,...,...,...,...,...,...,...
148000,unige:167401,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/8e3d...,Public access,CC BY-4.0,0.0
148001,unige:167401,Alternate edition,0,https://www.mdpi.com/1422-0067/23/23/14585,0,0,0.0
148002,unige:167402,Article (Published version),A 578. Assessing the androgenic and.pdf,https://archive-ouverte.unige.ch/download/2e32...,Restricted access,0,0.0
148003,unige:167402,Appendix,cen14847-sup-0001-supmat.docx,https://archive-ouverte.unige.ch/download/0cbf...,Public access,0,1.0
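
As the output shows, `fillna(0)` on the whole frame also writes `0` into the empty text columns (`856_f`, `856_x`, `856_z`). A possible alternative, sketched here on a small made-up frame, builds the flag directly with `isin` so that the other columns keep their missing values. The sample rows and the names `links` and `data_labels` are assumptions for illustration only.

```python
import pandas as pd

# toy rows, not taken from the real export
links = pd.DataFrame({
    "id": ["unige:1", "unige:2", "unige:2", "unige:3"],
    "856_3": ["Article", "Appendix", "Dataset", "Supplemental data"],
    "856_z": [None, None, "CC BY", None],
})

# the three link types counted as data in the cell above
data_labels = ["Supplemental data", "Appendix", "Dataset"]

# True/False membership test converted to a 0/1 integer flag
links["DATA"] = links["856_3"].isin(data_labels).astype(int)
print(links)
```

The flag comes out as an integer column rather than a float, and `856_z` still distinguishes a missing licence from a real value.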


In [6]:
# Categorize shared data, data supplements or appendixes
aou_856.loc[aou_856['856_3'] == 'Supplemental data', 'DATA_TYPE'] = 'data supplements'
aou_856.loc[aou_856['856_3'] == 'Appendix', 'DATA_TYPE'] = 'appendixes'
aou_856.loc[aou_856['856_3'] == 'Dataset', 'DATA_TYPE'] = 'shared data'
aou_856['DATA_TYPE'] = aou_856['DATA_TYPE'].fillna('')
aou_856

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z,DATA,DATA_TYPE
0,unige:1,Article (Published version),unige1.pdf,https://archive-ouverte.unige.ch/download/ffe9...,Restricted access,0,0.0,
1,unige:3,Article (Published version),unige3.pdf,https://archive-ouverte.unige.ch/download/c0ba...,Restricted access,0,0.0,
2,unige:4,Article (Published version),1-s2.0-S0022354916324418-main.pdf,https://archive-ouverte.unige.ch/download/e6dc...,Restricted access,0,0.0,
3,unige:5,Article (Published version),11095_2007_Article_9429.pdf,https://archive-ouverte.unige.ch/download/a32f...,Public access,OA NATIONAL,0.0,
4,unige:6,Article,fulltext.pdf,https://archive-ouverte.unige.ch/download/470f...,Restricted access,0,0.0,
...,...,...,...,...,...,...,...,...
148000,unige:167401,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/8e3d...,Public access,CC BY-4.0,0.0,
148001,unige:167401,Alternate edition,0,https://www.mdpi.com/1422-0067/23/23/14585,0,0,0.0,
148002,unige:167402,Article (Published version),A 578. Assessing the androgenic and.pdf,https://archive-ouverte.unige.ch/download/2e32...,Restricted access,0,0.0,
148003,unige:167402,Appendix,cen14847-sup-0001-supmat.docx,https://archive-ouverte.unige.ch/download/0cbf...,Public access,0,1.0,appendixes


In [7]:
aou_856['DATA'].value_counts()

0.0    140438
1.0      7567
Name: DATA, dtype: int64

In [8]:
aou_856_data = aou_856[['id', 'DATA', 'DATA_TYPE']].loc[aou_856['DATA'] == 1]
# aou_856_data = aou_856_data.drop_duplicates(subset='id')
aou_856_data

Unnamed: 0,id,DATA,DATA_TYPE
337,unige:1047,1.0,data supplements
338,unige:1047,1.0,data supplements
339,unige:1047,1.0,data supplements
358,unige:1059,1.0,data supplements
359,unige:1059,1.0,data supplements
...,...,...,...
147976,unige:167391,1.0,data supplements
147977,unige:167391,1.0,appendixes
147990,unige:167397,1.0,appendixes
147998,unige:167400,1.0,appendixes


In [9]:
# dedup categorized 856 data
aou_856_data = aou_856_data.drop_duplicates()
aou_856_data

Unnamed: 0,id,DATA,DATA_TYPE
337,unige:1047,1.0,data supplements
358,unige:1059,1.0,data supplements
374,unige:1068,1.0,data supplements
619,unige:167403,1.0,appendixes
621,unige:167403,1.0,shared data
...,...,...,...
147976,unige:167391,1.0,data supplements
147977,unige:167391,1.0,appendixes
147990,unige:167397,1.0,appendixes
147998,unige:167400,1.0,appendixes


In [10]:
# check lines with several data types
aou_856_data.loc[aou_856_data.duplicated(subset='id', keep=False)]

Unnamed: 0,id,DATA,DATA_TYPE
619,unige:167403,1.0,appendixes
621,unige:167403,1.0,shared data
774,unige:167507,1.0,data supplements
776,unige:167507,1.0,shared data
1122,unige:167758,1.0,data supplements
...,...,...,...
147658,unige:167177,1.0,appendixes
147959,unige:167385,1.0,data supplements
147961,unige:167385,1.0,shared data
147976,unige:167391,1.0,data supplements


In [11]:
# check values for 170420
aou_856_data.loc[aou_856_data['id'] == 'unige:170420']

Unnamed: 0,id,DATA,DATA_TYPE
5747,unige:170420,1.0,appendixes
5749,unige:170420,1.0,data supplements
5774,unige:170420,1.0,shared data


In [12]:
# check values to convert into columns
aou_856_data['DATA_TYPE'].value_counts()

appendixes          1941
data supplements    1723
shared data         1081
Name: DATA_TYPE, dtype: int64

In [13]:
# separate values 
aou_856_data_appendixes = aou_856_data[['id', 'DATA_TYPE']].loc[aou_856_data['DATA_TYPE'] == 'appendixes']
aou_856_data_supplements = aou_856_data[['id', 'DATA_TYPE']].loc[aou_856_data['DATA_TYPE'] == 'data supplements']
aou_856_data_shared = aou_856_data[['id', 'DATA_TYPE']].loc[aou_856_data['DATA_TYPE'] == 'shared data']

In [14]:
# add new columns
aou_856_data_appendixes = aou_856_data_appendixes.rename(columns={'DATA_TYPE' : 'DATA_TYPE_appendixes'})
aou_856_data_supplements = aou_856_data_supplements.rename(columns={'DATA_TYPE' : 'DATA_TYPE_data_supplements'})
aou_856_data_shared = aou_856_data_shared.rename(columns={'DATA_TYPE' : 'DATA_TYPE_shared_data'})

In [15]:
# remove duplicates
aou_856_data_dedup = aou_856_data[['id', 'DATA']].drop_duplicates(subset='id')
aou_856_data_dedup

Unnamed: 0,id,DATA
337,unige:1047,1.0
358,unige:1059,1.0
374,unige:1068,1.0
619,unige:167403,1.0
715,unige:167476,1.0
...,...,...
147969,unige:167388,1.0
147976,unige:167391,1.0
147990,unige:167397,1.0
147998,unige:167400,1.0


In [16]:
# add data types
aou_856_data_dedup = aou_856_data_dedup.merge(aou_856_data_appendixes, on='id', how='left')
aou_856_data_dedup = aou_856_data_dedup.merge(aou_856_data_supplements, on='id', how='left')
aou_856_data_dedup = aou_856_data_dedup.merge(aou_856_data_shared, on='id', how='left')
aou_856_data_dedup

Unnamed: 0,id,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data
0,unige:1047,1.0,,data supplements,
1,unige:1059,1.0,,data supplements,
2,unige:1068,1.0,,data supplements,
3,unige:167403,1.0,appendixes,,shared data
4,unige:167476,1.0,appendixes,,
...,...,...,...,...,...
4187,unige:167388,1.0,appendixes,,
4188,unige:167391,1.0,appendixes,data supplements,
4189,unige:167397,1.0,appendixes,,
4190,unige:167400,1.0,appendixes,,


In [17]:
# change values
aou_856_data_dedup.loc[aou_856_data_dedup['DATA_TYPE_appendixes'] == 'appendixes', 'DATA_TYPE_appendixes'] = 1
aou_856_data_dedup.loc[aou_856_data_dedup['DATA_TYPE_data_supplements'] == 'data supplements', 'DATA_TYPE_data_supplements'] = 1
aou_856_data_dedup.loc[aou_856_data_dedup['DATA_TYPE_shared_data'] == 'shared data', 'DATA_TYPE_shared_data'] = 1
aou_856_data_dedup

Unnamed: 0,id,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data
0,unige:1047,1.0,,1,
1,unige:1059,1.0,,1,
2,unige:1068,1.0,,1,
3,unige:167403,1.0,1,,1
4,unige:167476,1.0,1,,
...,...,...,...,...,...
4187,unige:167388,1.0,1,,
4188,unige:167391,1.0,1,1,
4189,unige:167397,1.0,1,,
4190,unige:167400,1.0,1,,


In [18]:
# fill nas
aou_856_data_dedup = aou_856_data_dedup.fillna(0)
aou_856_data_dedup

Unnamed: 0,id,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data
0,unige:1047,1.0,0,1,0
1,unige:1059,1.0,0,1,0
2,unige:1068,1.0,0,1,0
3,unige:167403,1.0,1,0,1
4,unige:167476,1.0,1,0,0
...,...,...,...,...,...
4187,unige:167388,1.0,1,0,0
4188,unige:167391,1.0,1,1,0
4189,unige:167397,1.0,1,0,0
4190,unige:167400,1.0,1,0,0
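
The steps above (split by type, rename, merge back, replace labels with 1, fill with 0) produce one row per record with three indicator columns. The same shape can probably be reached in a single pass with `pd.get_dummies` followed by a group-wise maximum. This is only a sketch on invented rows, with `data_rows` and `indicators` as hypothetical names.

```python
import pandas as pd

# toy rows: one line per (record, data type), duplicates allowed
data_rows = pd.DataFrame({
    "id": ["unige:10", "unige:10", "unige:11", "unige:12", "unige:12"],
    "DATA_TYPE": ["appendixes", "shared data", "data supplements",
                  "appendixes", "appendixes"],
})

indicators = (
    pd.get_dummies(data_rows["DATA_TYPE"], prefix="DATA_TYPE", dtype=int)
    .groupby(data_rows["id"])   # collapse to one row per record
    .max()                      # 1 if the record has at least one row of that type
    .reset_index()
)
# match the column naming used in the notebook (underscores instead of spaces)
indicators.columns = indicators.columns.str.replace(" ", "_")
# every record kept here has at least one data file
indicators.insert(1, "DATA", 1)
print(indicators)
```

Because the maximum is taken per record, repeated rows of the same type do not need to be deduplicated first.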


### 2. Clean and prepare data about funding with data sharing requirements
1. import file <span style="color:blue">*OK*</span> 
2. check occurrences and identify variant spellings of funders with data requirements <span style="color:blue">*OK*</span>
3. add a column with EU or SNSF funding <span style="color:blue">*OK*</span>


In [19]:
aou_988 = pd.read_csv(export_aou_988, encoding='utf-8', header=0, sep='\t')
aou_988

Unnamed: 0,id,988f
0,unige:167403,UCB Pharma
1,unige:167403,Amgen Inc
2,unige:167416,NIGMS NIH HHS
3,unige:167419,UK Research and Innovation
4,unige:167424,Fondation Franklinia
...,...,...
13493,unige:167391,Jacobs Foundation and Federal Office of public...
13494,unige:167392,Alexion Pharmaceutical
13495,unige:167397,Swiss National Science Foundation
13496,unige:167398,Institut de Recherches Servier


In [20]:
# check values for 166493, which has multiple funding streams mentioned
aou_988.loc[aou_988['id'] == 'unige:166493']

Unnamed: 0,id,988f
13213,unige:166493,Swiss National Science Foundation
13214,unige:166493,Swiss National Science Foundation
13215,unige:166493,European Commission


In [21]:
# check values for 176795, which has multiple funding streams mentioned and is more recent (2024 update)
aou_988.loc[aou_988['id'] == 'unige:176795']  

Unnamed: 0,id,988f
2788,unige:176795,Swiss National Science Foundation
2789,unige:176795,Swiss National Science Foundation
2790,unige:176795,Swiss National Science Foundation
2791,unige:176795,Swiss National Science Foundation
2792,unige:176795,European Commission
2793,unige:176795,Fondation Privée des Hôpitaux Universitaires d...
2794,unige:176795,Fondation Pôle Autisme
2795,unige:176795,Swiss National Science Foundation


In [22]:
aou_988 = aou_988.rename(columns={'988f' : 'funder'})

In [23]:
aou_988['funder'].value_counts()

Swiss National Science Foundation                              8333
European Commission                                            1336
Autre                                                          1303
Wellcome Trust                                                   92
UK Research and Innovation                                       90
                                                               ... 
NCCRâOn the Move                                                1
Alzheimer’s Research UK                                           1
Ministero dell'Istruzione, dell'Università e della Ricerca        1
NCCR ON THE MOVE                                                  1
SystemsX.ch initiative                                            1
Name: funder, Length: 1021, dtype: int64

Note: In the previous version of the IR, the field Autre (Other) was used for anything that wasn't SNSF, EU, FP7 or H2020. It may therefore contain values that also appear in the list (e.g. Wellcome Trust). As a result, we can only analyse the presence or absence of funding in general, and the presence or absence of SNSF or EU funding.

In [24]:
aou_988['funder'].value_counts().reset_index().to_csv(myfolder_temp + 'aou_funder.tsv', sep='\t')

In [25]:
# check values for National Science Foundation to see whether it is the Swiss one or another
aou_988.loc[aou_988['funder'] == 'National Science Foundation']  

Unnamed: 0,id,funder
102,unige:167832,National Science Foundation
223,unige:168439,National Science Foundation
224,unige:168439,National Science Foundation
592,unige:169375,National Science Foundation
669,unige:169713,National Science Foundation
1328,unige:172867,National Science Foundation
1393,unige:173150,National Science Foundation
1394,unige:173150,National Science Foundation
1395,unige:173150,National Science Foundation
1523,unige:173588,National Science Foundation


Conclusion after evaluating the publications: it is the US one.

In [26]:
# check values for Swiss National Science Foundation with bad spelling
aou_988.loc[aou_988['funder'] == 'Swiss National Science Fondation']  

Unnamed: 0,id,funder
12645,unige:164712,Swiss National Science Fondation


<span style="color:orange">*Conclusion: we need to correct those in our repository: CORRECTIONS TO BE MADE*</span>

In [27]:
# check values for variant names in French identified
aou_988.loc[(aou_988['funder'] == 'Fonds national suisse (FNS)') | (aou_988['funder'] == 'Fonds National Suisse') | (aou_988['funder'] == 'Fonds National Suisse de la recherche scientifique')]  

Unnamed: 0,id,funder
986,unige:170951,Fonds National Suisse de la recherche scientif...
987,unige:170951,Fonds National Suisse de la recherche scientif...
2346,unige:175755,Fonds National Suisse
2383,unige:175941,Fonds National Suisse
2385,unige:175943,Fonds National Suisse
11082,unige:159104,Fonds National Suisse
11603,unige:161206,Fonds National Suisse
12219,unige:163001,Fonds national suisse (FNS)
13161,unige:166285,Fonds national suisse (FNS)
13162,unige:166288,Fonds national suisse (FNS)


<span style="color:orange">*Conclusion: we need to correct those in our repository: CORRECTIONS TO BE MADE*</span>

In [28]:
# check values for variant names in English identified
aou_988.loc[(aou_988['funder'] == 'Swiss National Science Foundation (SNSF)') | (aou_988['funder'] == 'Swiss National Research Foundation') | (aou_988['funder'] == 'Swiss National Foundation') | (aou_988['funder'] == 'SNF') | (aou_988['funder'] == 'Swiss National Science Foundation Grants')]  

Unnamed: 0,id,funder
72,unige:167762,SNF
1697,unige:174047,SNF
1937,unige:174846,SNF
2444,unige:176127,Swiss National Foundation
11222,unige:159654,Swiss National Science Foundation (SNSF)
11511,unige:161050,Swiss National Research Foundation
11522,unige:161060,Swiss National Foundation
11523,unige:161060,Swiss National Foundation
11585,unige:161096,Swiss National Research Foundation
11594,unige:161135,Swiss National Science Foundation Grants


<span style="color:orange">*Conclusion: we need to correct those in our repository: CORRECTIONS TO BE MADE*</span>

In [29]:
# check values for specific grant name identified
aou_988.loc[aou_988['funder'] == 'the Swiss National Centre of Competence in Research LIVES, Overcoming Vulnerability: Life Course Perspectives, Swiss National Science Foundation']  

Unnamed: 0,id,funder
12905,unige:165416,the Swiss National Centre of Competence in Res...


<span style="color:orange">*Conclusion: we need to correct those in our repository: CORRECTIONS TO BE MADE*</span>

In [30]:
# check values for 2 variant names in German identified
aou_988.loc[(aou_988['funder'] == 'Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (SNF)') | (aou_988['funder'] == 'Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung')]  

Unnamed: 0,id,funder
106,unige:167917,Schweizerischer Nationalfonds zur Förderung de...
12195,unige:162926,Schweizerischer Nationalfonds zur Förderung de...


<span style="color:orange">*Conclusion: we need to correct those in our repository: CORRECTIONS TO BE MADE*</span>
Also, we could continue to explore the file for other variant names to correct.
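
The variant spellings found in the checks above could be collected in a lookup table and applied in one step, which would also give a list to hand over for the repository corrections. The sketch below uses a few of the variants identified so far. The dictionary `canonical` and the sample series are illustrative assumptions, not the full list.

```python
import pandas as pd

# variant -> canonical name, limited to spellings seen in the checks above
canonical = {
    "Swiss National Science Fondation": "Swiss National Science Foundation",
    "Fonds National Suisse": "Swiss National Science Foundation",
    "Fonds national suisse (FNS)": "Swiss National Science Foundation",
    "Swiss National Foundation": "Swiss National Science Foundation",
    "SNF": "Swiss National Science Foundation",
}

# toy sample; "National Science Foundation" is the US funder and must stay untouched
funders = pd.Series(
    ["SNF", "Fonds National Suisse", "National Science Foundation", "Wellcome Trust"],
    name="funder",
)

# exact-match replacement: values absent from the dictionary are left as they are
normalized = funders.replace(canonical)
print(normalized.value_counts())
```

Exact matching is deliberate here: a substring or fuzzy match would risk folding the US National Science Foundation into the Swiss one.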

In [31]:
# check values for EU funder variant names identified
aou_988.loc[(aou_988['funder'] == 'European Research Council') | (aou_988['funder'] == 'European 7th Framework') | (aou_988['funder'] == 'European commission') | (aou_988['funder'] == 'European Union Horizon 2020') | (aou_988['funder'] == 'European Union') | (aou_988['funder'] == 'European Union Seventh Framework Program') | (aou_988['funder'] == 'Eurostars program of the European commission')]  

Unnamed: 0,id,funder
32,unige:167495,European Research Council
338,unige:168806,European Research Council
697,unige:169865,European Union
744,unige:169934,European Research Council
1163,unige:171898,European Research Council
...,...,...
12775,unige:165124,European Research Council
12817,unige:165180,European Research Council
13102,unige:166079,European Research Council
13193,unige:166400,European Research Council


<span style="color:orange">*Conclusion: we need to correct those in our repository: CORRECTIONS TO BE MADE*</span>

In [32]:
# mark records with FNS and EU funding mentions (or known funders with OA requirements)
aou_988.loc[aou_988['funder'] == 'Swiss National Science Foundation', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Swiss National Science Fondation', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Fonds national suisse (FNS)', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Fonds National Suisse', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Fonds National Suisse de la recherche scientifique', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Swiss National Research Foundation', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Swiss National Foundation', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'SNF', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Swiss National Science Foundation Grants', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (SNF)', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung', 'FUNDER'] = 1

aou_988.loc[aou_988['funder'] == 'European Commission', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'European Research Council', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'European 7th Framework', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'European Commission', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'European Union Horizon 2020', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'European Union', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'European Union Seventh Framework Program', 'FUNDER'] = 1
aou_988.loc[aou_988['funder'] == 'Eurostars program of the European commission', 'FUNDER'] = 1
aou_988 = aou_988.fillna(0)
aou_988

Unnamed: 0,id,funder,FUNDER
0,unige:167403,UCB Pharma,0.0
1,unige:167403,Amgen Inc,0.0
2,unige:167416,NIGMS NIH HHS,0.0
3,unige:167419,UK Research and Innovation,0.0
4,unige:167424,Fondation Franklinia,0.0
...,...,...,...
13493,unige:167391,Jacobs Foundation and Federal Office of public...,0.0
13494,unige:167392,Alexion Pharmaceutical,0.0
13495,unige:167397,Swiss National Science Foundation,1.0
13496,unige:167398,Institut de Recherches Servier,0.0


In [33]:
# Categorize FNS and EU funding mentions
aou_988.loc[aou_988['funder'] == 'Swiss National Science Foundation', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'Swiss National Science Fondation', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'Fonds national suisse (FNS)', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'Fonds National Suisse', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'Fonds National Suisse de la recherche scientifique', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'Swiss National Research Foundation', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'Swiss National Foundation', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'SNF', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'Swiss National Science Foundation Grants', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (SNF)', 'FUNDER_TYPE'] = 'FNS'
aou_988.loc[aou_988['funder'] == 'Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung', 'FUNDER_TYPE'] = 'FNS'

aou_988.loc[aou_988['funder'] == 'European Commission', 'FUNDER_TYPE'] = 'EU'
aou_988.loc[aou_988['funder'] == 'European Research Council', 'FUNDER_TYPE'] = 'EU'
aou_988.loc[aou_988['funder'] == 'European 7th Framework', 'FUNDER_TYPE'] = 'EU'
aou_988.loc[aou_988['funder'] == 'European Commission', 'FUNDER_TYPE'] = 'EU'
aou_988.loc[aou_988['funder'] == 'European Union Horizon 2020', 'FUNDER_TYPE'] = 'EU'
aou_988.loc[aou_988['funder'] == 'European Union', 'FUNDER_TYPE'] = 'EU'
aou_988.loc[aou_988['funder'] == 'European Union Seventh Framework Program', 'FUNDER_TYPE'] = 'EU'
aou_988.loc[aou_988['funder'] == 'Eurostars program of the European commission', 'FUNDER_TYPE'] = 'EU'
aou_988['FUNDER_TYPE'] = aou_988['FUNDER_TYPE'].fillna('')
aou_988

Unnamed: 0,id,funder,FUNDER,FUNDER_TYPE
0,unige:167403,UCB Pharma,0.0,
1,unige:167403,Amgen Inc,0.0,
2,unige:167416,NIGMS NIH HHS,0.0,
3,unige:167419,UK Research and Innovation,0.0,
4,unige:167424,Fondation Franklinia,0.0,
...,...,...,...,...
13493,unige:167391,Jacobs Foundation and Federal Office of public...,0.0,
13494,unige:167392,Alexion Pharmaceutical,0.0,
13495,unige:167397,Swiss National Science Foundation,1.0,FNS
13496,unige:167398,Institut de Recherches Servier,0.0,
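
The two cells above repeat the same list of funder names, once for the `FUNDER` flag and once for `FUNDER_TYPE`. One way to keep both in sync, sketched here under the assumption that the lists are maintained by hand, is a single name-to-type dictionary applied with `Series.map`. The names `fns_names`, `eu_names`, `funder_type` and `grants` are hypothetical, and only a subset of the variants is shown.

```python
import pandas as pd

# subsets of the funder names used in the cells above
fns_names = ["Swiss National Science Foundation", "Fonds National Suisse", "SNF"]
eu_names = ["European Commission", "European Research Council", "European Union"]

# one lookup table drives both the category and the flag
funder_type = {**{name: "FNS" for name in fns_names},
               **{name: "EU" for name in eu_names}}

# toy rows, not taken from the real export
grants = pd.DataFrame({
    "id": ["unige:30", "unige:31", "unige:32", "unige:33"],
    "funder": ["Swiss National Science Foundation", "European Research Council",
               "UCB Pharma", "SNF"],
})

# unknown funders map to NaN, which becomes an empty category
grants["FUNDER_TYPE"] = grants["funder"].map(funder_type).fillna("")
# the flag is derived from the category, so the two cannot drift apart
grants["FUNDER"] = (grants["FUNDER_TYPE"] != "").astype(int)
print(grants)
```

Adding a newly discovered variant then means editing one list instead of two parallel blocks of assignments.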


In [34]:
aou_988['FUNDER'].value_counts()

1.0    9758
0.0    3740
Name: FUNDER, dtype: int64

In [35]:
aou_988['FUNDER_TYPE'].value_counts()

FNS    8357
       3740
EU     1401
Name: FUNDER_TYPE, dtype: int64

In [36]:
aou_988_funder = aou_988[['id', 'FUNDER', 'FUNDER_TYPE']].loc[aou_988['FUNDER'] == 1]
aou_988_funder

Unnamed: 0,id,FUNDER,FUNDER_TYPE
5,unige:167427,1.0,FNS
10,unige:167460,1.0,FNS
11,unige:167461,1.0,EU
19,unige:167491,1.0,FNS
20,unige:167492,1.0,EU
...,...,...,...
13487,unige:167330,1.0,FNS
13488,unige:167356,1.0,FNS
13489,unige:167367,1.0,EU
13491,unige:167387,1.0,FNS


In [37]:
# dedup categorized funder data
aou_988_funder = aou_988_funder.drop_duplicates()
aou_988_funder

Unnamed: 0,id,FUNDER,FUNDER_TYPE
5,unige:167427,1.0,FNS
10,unige:167460,1.0,FNS
11,unige:167461,1.0,EU
19,unige:167491,1.0,FNS
20,unige:167492,1.0,EU
...,...,...,...
13487,unige:167330,1.0,FNS
13488,unige:167356,1.0,FNS
13489,unige:167367,1.0,EU
13491,unige:167387,1.0,FNS


In [38]:
# check lines with both FNS and EU funder types
aou_988_funder.loc[aou_988_funder.duplicated(subset='id', keep=False)]

Unnamed: 0,id,FUNDER,FUNDER_TYPE
22,unige:167493,1.0,FNS
23,unige:167493,1.0,EU
27,unige:167495,1.0,EU
34,unige:167495,1.0,FNS
37,unige:167507,1.0,FNS
...,...,...,...
13328,unige:166947,1.0,EU
13339,unige:166969,1.0,FNS
13340,unige:166969,1.0,EU
13381,unige:167014,1.0,FNS


In [39]:
# check values for unige:166493
aou_988_funder.loc[aou_988_funder['id'] == 'unige:166493']

Unnamed: 0,id,FUNDER,FUNDER_TYPE
13213,unige:166493,1.0,FNS
13215,unige:166493,1.0,EU


In [40]:
# check values to convert into columns
aou_988_funder['FUNDER_TYPE'].value_counts()

FNS    6317
EU     1184
Name: FUNDER_TYPE, dtype: int64

In [41]:
# separate values 
aou_988_funder_FNS = aou_988_funder[['id', 'FUNDER_TYPE']].loc[aou_988_funder['FUNDER_TYPE'] == 'FNS']
aou_988_funder_EU = aou_988_funder[['id', 'FUNDER_TYPE']].loc[aou_988_funder['FUNDER_TYPE'] == 'EU']

In [42]:
# add new columns
aou_988_funder_FNS = aou_988_funder_FNS.rename(columns={'FUNDER_TYPE' : 'FNS_FUNDER'})
aou_988_funder_EU = aou_988_funder_EU.rename(columns={'FUNDER_TYPE' : 'EU_FUNDER'})

In [43]:
# remove duplicates
aou_988_funder_dedup = aou_988_funder[['id', 'FUNDER']].drop_duplicates(subset='id')
aou_988_funder_dedup

Unnamed: 0,id,FUNDER
5,unige:167427,1.0
10,unige:167460,1.0
11,unige:167461,1.0
19,unige:167491,1.0
20,unige:167492,1.0
...,...,...
13487,unige:167330,1.0
13488,unige:167356,1.0
13489,unige:167367,1.0
13491,unige:167387,1.0


In [44]:
# add funder types
aou_988_funder_dedup = aou_988_funder_dedup.merge(aou_988_funder_FNS, on='id', how='left')
aou_988_funder_dedup = aou_988_funder_dedup.merge(aou_988_funder_EU, on='id', how='left')
aou_988_funder_dedup

Unnamed: 0,id,FUNDER,FNS_FUNDER,EU_FUNDER
0,unige:167427,1.0,FNS,
1,unige:167460,1.0,FNS,
2,unige:167461,1.0,,EU
3,unige:167491,1.0,FNS,
4,unige:167492,1.0,,EU
...,...,...,...,...
7043,unige:167330,1.0,FNS,
7044,unige:167356,1.0,FNS,
7045,unige:167367,1.0,,EU
7046,unige:167387,1.0,FNS,


In [45]:
# check values for unige:166493
aou_988_funder_dedup.loc[aou_988_funder_dedup['id'] == 'unige:166493']

Unnamed: 0,id,FUNDER,FNS_FUNDER,EU_FUNDER
6926,unige:166493,1.0,FNS,EU


In [46]:
# change values
aou_988_funder_dedup.loc[aou_988_funder_dedup['FNS_FUNDER'] == 'FNS', 'FNS_FUNDER'] = 1
aou_988_funder_dedup.loc[aou_988_funder_dedup['EU_FUNDER'] == 'EU', 'EU_FUNDER'] = 1
aou_988_funder_dedup

Unnamed: 0,id,FUNDER,FNS_FUNDER,EU_FUNDER
0,unige:167427,1.0,1,
1,unige:167460,1.0,1,
2,unige:167461,1.0,,1
3,unige:167491,1.0,1,
4,unige:167492,1.0,,1
...,...,...,...,...
7043,unige:167330,1.0,1,
7044,unige:167356,1.0,1,
7045,unige:167367,1.0,,1
7046,unige:167387,1.0,1,


In [47]:
# fill nas
aou_988_funder_dedup = aou_988_funder_dedup.fillna(0)
aou_988_funder_dedup

Unnamed: 0,id,FUNDER,FNS_FUNDER,EU_FUNDER
0,unige:167427,1.0,1,0
1,unige:167460,1.0,1,0
2,unige:167461,1.0,0,1
3,unige:167491,1.0,1,0
4,unige:167492,1.0,0,1
...,...,...,...,...
7043,unige:167330,1.0,1,0
7044,unige:167356,1.0,1,0
7045,unige:167367,1.0,0,1
7046,unige:167387,1.0,1,0
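
The per-record funder table built above is essentially a presence/absence cross-tabulation of `id` against `FUNDER_TYPE`. A compact sketch of that idea with `pd.crosstab` follows. The rows are invented and `funder_rows` and `flags` are hypothetical names.

```python
import pandas as pd

# toy rows: one line per funding mention that was categorized as FNS or EU
funder_rows = pd.DataFrame({
    "id": ["unige:20", "unige:20", "unige:20", "unige:21", "unige:22"],
    "FUNDER_TYPE": ["FNS", "FNS", "EU", "EU", "FNS"],
})

# count mentions per record and type, then cap the counts at 1 (presence/absence)
flags = pd.crosstab(funder_rows["id"], funder_rows["FUNDER_TYPE"]).clip(upper=1)
flags = flags.rename(columns={"FNS": "FNS_FUNDER", "EU": "EU_FUNDER"}).reset_index()
flags.columns.name = None
# every record present here has at least one FNS or EU mention
flags.insert(1, "FUNDER", 1)
flags = flags[["id", "FUNDER", "FNS_FUNDER", "EU_FUNDER"]]
print(flags)
```

Dropping the `clip` call would keep the number of mentions per funder type instead, if that were ever of interest.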


### 3. Clean and prepare Data about UNIGE Structures and Disciplines

In [48]:
# UNIGE structures
aou_structures = pd.read_csv(export_aou_928, encoding='utf-8', header=0, sep='\t')
aou_structures

Unnamed: 0,id,928
0,unige:1,Section des sciences pharmaceutiques
1,unige:2,"Département de chimie minérale, analytique et ..."
2,unige:3,Section des sciences pharmaceutiques
3,unige:4,Section des sciences pharmaceutiques
4,unige:5,Section des sciences pharmaceutiques
...,...,...
127105,unige:167398,Département de médecine
127106,unige:167399,Département de médecine
127107,unige:167400,Département de médecine
127108,unige:167401,Département de médecine


In [49]:
aou_structures = aou_structures.rename(columns={'928' : 'structure'})

In [50]:
aou_structures['structure'].value_counts()

Département de médecine                                 10334
Section des sciences de l'éducation                      5842
Section de psychologie                                   5534
Département de chirurgie                                 4624
Département de pédiatrie, gynécologie et obstétrique     4530
                                                        ...  
Centre pour la formation continue et à distance             2
Service des Relations Internationales & Partenariats        2
InZone                                                      2
Unité d'action sociale                                      1
Centre Maurice Chalumeau en sciences des sexualités         1
Name: structure, Length: 159, dtype: int64

In [51]:
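# load the mapping of biomedical structures (Medicine & Biology) to disciplines
# note: this rebinds structures_biomed from the file path to a DataFrame, so the cell cannot be re-run as-is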
structures_biomed = pd.read_excel(structures_biomed, skiprows=1, header=0)
structures_biomed

Unnamed: 0,actif,code,structure,discipline
0,Actif,16,Section de biologie,Biology
1,historique,161,Département de biologie moléculaire,Biology
2,Actif,162,Département de génétique et évolution,Biology
3,Actif,163,Département des sciences végétales,Biology
4,historique,164,Département d'anthropologie et écologie,Biology
5,historique,165,Département de biologie cellulaire,Biology
6,Actif,168,Département de biologie moléculaire et cellulaire,Biology
7,Actif,17,Section des sciences pharmaceutiques,Pharmaceutical sciences
8,Actif,2,Faculté de médecine,Medicine (general)
9,Actif,21,Section de médecine fondamentale,Basic Medicine


In [52]:
structures_biomed['BIOMED'] = 1
structures_biomed

Unnamed: 0,actif,code,structure,discipline,BIOMED
0,Actif,16,Section de biologie,Biology,1
1,historique,161,Département de biologie moléculaire,Biology,1
2,Actif,162,Département de génétique et évolution,Biology,1
3,Actif,163,Département des sciences végétales,Biology,1
4,historique,164,Département d'anthropologie et écologie,Biology,1
5,historique,165,Département de biologie cellulaire,Biology,1
6,Actif,168,Département de biologie moléculaire et cellulaire,Biology,1
7,Actif,17,Section des sciences pharmaceutiques,Pharmaceutical sciences,1
8,Actif,2,Faculté de médecine,Medicine (general),1
9,Actif,21,Section de médecine fondamentale,Basic Medicine,1


In [53]:
# merge by structure
aou_structures = pd.merge(aou_structures, structures_biomed, on='structure', how='left')
aou_structures

Unnamed: 0,id,structure,actif,code,discipline,BIOMED
0,unige:1,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
1,unige:2,"Département de chimie minérale, analytique et ...",,,,
2,unige:3,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
3,unige:4,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
4,unige:5,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
...,...,...,...,...,...,...
127105,unige:167398,Département de médecine,Actif,221,Clinical Medicine,1.0
127106,unige:167399,Département de médecine,Actif,221,Clinical Medicine,1.0
127107,unige:167400,Département de médecine,Actif,221,Clinical Medicine,1.0
127108,unige:167401,Département de médecine,Actif,221,Clinical Medicine,1.0
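
The left merge above silently leaves `NaN` for every structure name that is missing from the mapping file, and it would duplicate rows if a structure name appeared twice in the mapping. A minimal sketch of how both cases could be surfaced, using toy data frames (`records` and `mapping` are stand-ins introduced for this example, not the notebook's variables):

```python
import pandas as pd

# Toy stand-ins for aou_structures and structures_biomed (values are illustrative)
records = pd.DataFrame({
    'id': ['unige:1', 'unige:2', 'unige:3'],
    'structure': ['Section des sciences pharmaceutiques',
                  'Département de droit civil',
                  'Département de médecine'],
})
mapping = pd.DataFrame({
    'structure': ['Section des sciences pharmaceutiques', 'Département de médecine'],
    'discipline': ['Pharmaceutical sciences', 'Clinical Medicine'],
    'BIOMED': [1, 1],
})

# validate='many_to_one' raises a MergeError if a structure name appears twice in the mapping;
# indicator=True adds a '_merge' column telling whether each row found a match
merged = records.merge(mapping, on='structure', how='left',
                       validate='many_to_one', indicator=True)

# structures without a match in the mapping
unmatched = merged.loc[merged['_merge'] == 'left_only', 'structure'].tolist()
print(unmatched)
```

In the notebook itself, the unmatched structures are the ones later flagged with `BIOMED == 0`, so the `_merge` column would only be a cross-check.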


In [54]:
aou_structures['BIOMED'] = aou_structures['BIOMED'].fillna(0)
aou_structures

Unnamed: 0,id,structure,actif,code,discipline,BIOMED
0,unige:1,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
1,unige:2,"Département de chimie minérale, analytique et ...",,,,0.0
2,unige:3,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
3,unige:4,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
4,unige:5,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
...,...,...,...,...,...,...
127105,unige:167398,Département de médecine,Actif,221,Clinical Medicine,1.0
127106,unige:167399,Département de médecine,Actif,221,Clinical Medicine,1.0
127107,unige:167400,Département de médecine,Actif,221,Clinical Medicine,1.0
127108,unige:167401,Département de médecine,Actif,221,Clinical Medicine,1.0


In [55]:
# export counts of non-biomedical structures (tab-separated, to match the .tsv extension)
aou_structures.loc[aou_structures['BIOMED'] == 0, 'structure'].value_counts().to_csv(myfolder_temp + 'aou_structures_non_biomed.tsv', sep='\t')

  


In [56]:
# export of ids and structures not Biomed
aou_structures.loc[aou_structures['BIOMED'] == 0][['id', 'structure']].to_csv(myfolder_temp + 'aou_ids_structures_non_biomed.tsv', sep='\t', index=False)
aou_structures.loc[aou_structures['BIOMED'] == 0][['id', 'structure']].to_excel(myfolder_temp + 'aou_ids_structures_non_biomed.xlsx', index=False)

In [57]:
aou_structures

Unnamed: 0,id,structure,actif,code,discipline,BIOMED
0,unige:1,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
1,unige:2,"Département de chimie minérale, analytique et ...",,,,0.0
2,unige:3,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
3,unige:4,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
4,unige:5,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
...,...,...,...,...,...,...
127105,unige:167398,Département de médecine,Actif,221,Clinical Medicine,1.0
127106,unige:167399,Département de médecine,Actif,221,Clinical Medicine,1.0
127107,unige:167400,Département de médecine,Actif,221,Clinical Medicine,1.0
127108,unige:167401,Département de médecine,Actif,221,Clinical Medicine,1.0


In [58]:
# check lines with multiple structures
aou_structures.loc[aou_structures.duplicated(subset='id', keep=False)]

Unnamed: 0,id,structure,actif,code,discipline,BIOMED
12,unige:13,Département de génétique et évolution,Actif,162,Biology,1.0
13,unige:13,Section des sciences pharmaceutiques,Actif,17,Pharmaceutical sciences,1.0
29,unige:29,Département de droit civil,,,,0.0
30,unige:29,Département de droit international privé,,,,0.0
51,unige:51,Département de droit civil,,,,0.0
...,...,...,...,...,...,...
127081,unige:167377,Institut universitaire de formation des enseig...,,,,0.0
127095,unige:167391,Département de santé et médecine communautaires,Actif,22A,Clinical Medicine,1.0
127096,unige:167391,"Département de pédiatrie, gynécologie et obsté...",Actif,223,Clinical Medicine,1.0
127097,unige:167391,Département de médecine,Actif,221,Clinical Medicine,1.0


In [59]:
# keep only biomed structures
aou_structures_biomed = aou_structures[['id', 'structure', 'code', 'discipline', 'BIOMED']].loc[aou_structures['BIOMED'] == 1]
aou_structures_biomed

Unnamed: 0,id,structure,code,discipline,BIOMED
0,unige:1,Section des sciences pharmaceutiques,17,Pharmaceutical sciences,1.0
2,unige:3,Section des sciences pharmaceutiques,17,Pharmaceutical sciences,1.0
3,unige:4,Section des sciences pharmaceutiques,17,Pharmaceutical sciences,1.0
4,unige:5,Section des sciences pharmaceutiques,17,Pharmaceutical sciences,1.0
5,unige:6,Section des sciences pharmaceutiques,17,Pharmaceutical sciences,1.0
...,...,...,...,...,...
127105,unige:167398,Département de médecine,221,Clinical Medicine,1.0
127106,unige:167399,Département de médecine,221,Clinical Medicine,1.0
127107,unige:167400,Département de médecine,221,Clinical Medicine,1.0
127108,unige:167401,Département de médecine,221,Clinical Medicine,1.0


In [60]:
# check ref with multiple disciplines
aou_structures_biomed.loc[aou_structures_biomed['id'] == 'unige:160592']

Unnamed: 0,id,structure,code,discipline,BIOMED
119030,unige:160592,Enseignement de chirurgie orale et implantologie,24R,Dentistry,1.0
119031,unige:160592,Département de chirurgie,222,Clinical Medicine,1.0
119032,unige:160592,Département de radiologie et informatique médi...,227,Clinical Medicine,1.0
119033,unige:160592,Division d'orthodontie,24Q1,Dentistry,1.0


In [61]:
# check values to convert into columns
aou_structures_biomed['discipline'].value_counts()

Clinical Medicine          35601
Basic Medicine              9160
Biology                     6175
Pharmaceutical sciences     2851
Affective sciences          1620
Dentistry                   1482
Medicine (general)           915
Neurosciences                273
Name: discipline, dtype: int64

In [62]:
# separate values 
aou_structures_biomed_cm = aou_structures_biomed[['id', 'discipline']].loc[aou_structures_biomed['discipline'] == 'Clinical Medicine']
aou_structures_biomed_bm = aou_structures_biomed[['id', 'discipline']].loc[aou_structures_biomed['discipline'] == 'Basic Medicine']
aou_structures_biomed_b = aou_structures_biomed[['id', 'discipline']].loc[aou_structures_biomed['discipline'] == 'Biology']
aou_structures_biomed_ps = aou_structures_biomed[['id', 'discipline']].loc[aou_structures_biomed['discipline'] == 'Pharmaceutical sciences']
aou_structures_biomed_as = aou_structures_biomed[['id', 'discipline']].loc[aou_structures_biomed['discipline'] == 'Affective sciences']
aou_structures_biomed_d = aou_structures_biomed[['id', 'discipline']].loc[aou_structures_biomed['discipline'] == 'Dentistry']
aou_structures_biomed_m = aou_structures_biomed[['id', 'discipline']].loc[aou_structures_biomed['discipline'] == 'Medicine (general)']
aou_structures_biomed_n = aou_structures_biomed[['id', 'discipline']].loc[aou_structures_biomed['discipline'] == 'Neurosciences']

In [63]:
# add new columns
aou_structures_biomed_cm = aou_structures_biomed_cm.rename(columns={'discipline' : 'discipline_clinical'})
aou_structures_biomed_bm = aou_structures_biomed_bm.rename(columns={'discipline' : 'discipline_basic'})
aou_structures_biomed_b = aou_structures_biomed_b.rename(columns={'discipline' : 'discipline_biology'})
aou_structures_biomed_ps = aou_structures_biomed_ps.rename(columns={'discipline' : 'discipline_pharma'})
aou_structures_biomed_as = aou_structures_biomed_as.rename(columns={'discipline' : 'discipline_affective'})
aou_structures_biomed_d = aou_structures_biomed_d.rename(columns={'discipline' : 'discipline_dentistry'})
aou_structures_biomed_m = aou_structures_biomed_m.rename(columns={'discipline' : 'discipline_medicine_general'})
aou_structures_biomed_n = aou_structures_biomed_n.rename(columns={'discipline' : 'discipline_neurosciences'})

In [64]:
aou_structures_biomed_cm

Unnamed: 0,id,discipline_clinical
87,unige:85,Clinical Medicine
89,unige:86,Clinical Medicine
91,unige:87,Clinical Medicine
95,unige:90,Clinical Medicine
96,unige:91,Clinical Medicine
...,...,...
127105,unige:167398,Clinical Medicine
127106,unige:167399,Clinical Medicine
127107,unige:167400,Clinical Medicine
127108,unige:167401,Clinical Medicine


In [65]:
# remove duplicates
aou_structures_biomed_dedup = aou_structures_biomed[['id', 'BIOMED']].drop_duplicates(subset='id')
aou_structures_biomed_cm = aou_structures_biomed_cm.drop_duplicates(subset='id')
aou_structures_biomed_bm = aou_structures_biomed_bm.drop_duplicates(subset='id')
aou_structures_biomed_b = aou_structures_biomed_b.drop_duplicates(subset='id')
aou_structures_biomed_ps = aou_structures_biomed_ps.drop_duplicates(subset='id')
aou_structures_biomed_as = aou_structures_biomed_as.drop_duplicates(subset='id')
aou_structures_biomed_d = aou_structures_biomed_d.drop_duplicates(subset='id')
aou_structures_biomed_m = aou_structures_biomed_m.drop_duplicates(subset='id')
aou_structures_biomed_n = aou_structures_biomed_n.drop_duplicates(subset='id')
aou_structures_biomed_dedup

Unnamed: 0,id,BIOMED
0,unige:1,1.0
2,unige:3,1.0
3,unige:4,1.0
4,unige:5,1.0
5,unige:6,1.0
...,...,...
127105,unige:167398,1.0
127106,unige:167399,1.0
127107,unige:167400,1.0
127108,unige:167401,1.0


In [66]:
# add disciplines
aou_structures_biomed_dedup = aou_structures_biomed_dedup.merge(aou_structures_biomed_cm, on='id', how='left')
aou_structures_biomed_dedup = aou_structures_biomed_dedup.merge(aou_structures_biomed_bm, on='id', how='left')
aou_structures_biomed_dedup = aou_structures_biomed_dedup.merge(aou_structures_biomed_b, on='id', how='left')
aou_structures_biomed_dedup = aou_structures_biomed_dedup.merge(aou_structures_biomed_ps, on='id', how='left')
aou_structures_biomed_dedup = aou_structures_biomed_dedup.merge(aou_structures_biomed_as, on='id', how='left')
aou_structures_biomed_dedup = aou_structures_biomed_dedup.merge(aou_structures_biomed_d, on='id', how='left')
aou_structures_biomed_dedup = aou_structures_biomed_dedup.merge(aou_structures_biomed_m, on='id', how='left')
aou_structures_biomed_dedup = aou_structures_biomed_dedup.merge(aou_structures_biomed_n, on='id', how='left')
aou_structures_biomed_dedup

Unnamed: 0,id,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences
0,unige:1,1.0,,,,Pharmaceutical sciences,,,,
1,unige:3,1.0,,,,Pharmaceutical sciences,,,,
2,unige:4,1.0,,,,Pharmaceutical sciences,,,,
3,unige:5,1.0,,,,Pharmaceutical sciences,,,,
4,unige:6,1.0,,,,Pharmaceutical sciences,,,,
...,...,...,...,...,...,...,...,...,...,...
45423,unige:167398,1.0,Clinical Medicine,,,,,,,
45424,unige:167399,1.0,Clinical Medicine,,,,,,,
45425,unige:167400,1.0,Clinical Medicine,,,,,,,
45426,unige:167401,1.0,Clinical Medicine,,,,,,,


In [67]:
# change values
aou_structures_biomed_dedup.loc[aou_structures_biomed_dedup['discipline_clinical'] == 'Clinical Medicine', 'discipline_clinical'] = 1
aou_structures_biomed_dedup.loc[aou_structures_biomed_dedup['discipline_basic'] == 'Basic Medicine', 'discipline_basic'] = 1
aou_structures_biomed_dedup.loc[aou_structures_biomed_dedup['discipline_biology'] == 'Biology', 'discipline_biology'] = 1
aou_structures_biomed_dedup.loc[aou_structures_biomed_dedup['discipline_pharma'] == 'Pharmaceutical sciences', 'discipline_pharma'] = 1
aou_structures_biomed_dedup.loc[aou_structures_biomed_dedup['discipline_affective'] == 'Affective sciences', 'discipline_affective'] = 1
aou_structures_biomed_dedup.loc[aou_structures_biomed_dedup['discipline_dentistry'] == 'Dentistry', 'discipline_dentistry'] = 1
aou_structures_biomed_dedup.loc[aou_structures_biomed_dedup['discipline_medicine_general'] == 'Medicine (general)', 'discipline_medicine_general'] = 1
aou_structures_biomed_dedup.loc[aou_structures_biomed_dedup['discipline_neurosciences'] == 'Neurosciences', 'discipline_neurosciences'] = 1
aou_structures_biomed_dedup

Unnamed: 0,id,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences
0,unige:1,1.0,,,,1,,,,
1,unige:3,1.0,,,,1,,,,
2,unige:4,1.0,,,,1,,,,
3,unige:5,1.0,,,,1,,,,
4,unige:6,1.0,,,,1,,,,
...,...,...,...,...,...,...,...,...,...,...
45423,unige:167398,1.0,1,,,,,,,
45424,unige:167399,1.0,1,,,,,,,
45425,unige:167400,1.0,1,,,,,,,
45426,unige:167401,1.0,1,,,,,,,


In [68]:
# fill nas
aou_structures_biomed_dedup = aou_structures_biomed_dedup.fillna(0)
aou_structures_biomed_dedup

Unnamed: 0,id,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences
0,unige:1,1.0,0,0,0,1,0,0,0,0
1,unige:3,1.0,0,0,0,1,0,0,0,0
2,unige:4,1.0,0,0,0,1,0,0,0,0
3,unige:5,1.0,0,0,0,1,0,0,0,0
4,unige:6,1.0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
45423,unige:167398,1.0,1,0,0,0,0,0,0,0
45424,unige:167399,1.0,1,0,0,0,0,0,0,0
45425,unige:167400,1.0,1,0,0,0,0,0,0,0
45426,unige:167401,1.0,1,0,0,0,0,0,0,0


In [69]:
aou_structures_biomed_dedup['BIOMED'] = aou_structures_biomed_dedup['BIOMED'].fillna(0).astype(int)
aou_structures_biomed_dedup

Unnamed: 0,id,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences
0,unige:1,1,0,0,0,1,0,0,0,0
1,unige:3,1,0,0,0,1,0,0,0,0
2,unige:4,1,0,0,0,1,0,0,0,0
3,unige:5,1,0,0,0,1,0,0,0,0
4,unige:6,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
45423,unige:167398,1,1,0,0,0,0,0,0,0
45424,unige:167399,1,1,0,0,0,0,0,0,0
45425,unige:167400,1,1,0,0,0,0,0,0,0
45426,unige:167401,1,1,0,0,0,0,0,0,0


In [70]:
# check ref with multiple disciplines
aou_structures_biomed_dedup.loc[aou_structures_biomed_dedup['id'] == 'unige:160592']

Unnamed: 0,id,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences
42573,unige:160592,1,1,0,0,0,0,1,0,0
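
The split, rename, deduplicate and merge steps above build one 0/1 column per discipline. As a hedged alternative, the same kind of table can be sketched with `pd.crosstab`; the toy `long_df` below is an assumption made for illustration, and only three of the eight disciplines are shown:

```python
import pandas as pd

# Toy long-format table: one row per (id, structure), as in aou_structures_biomed
long_df = pd.DataFrame({
    'id': ['unige:1', 'unige:160592', 'unige:160592', 'unige:160592', 'unige:160592'],
    'discipline': ['Pharmaceutical sciences', 'Dentistry', 'Clinical Medicine',
                   'Clinical Medicine', 'Dentistry'],
})

# crosstab counts rows per (id, discipline); clip(upper=1) turns the counts into 0/1 flags
flags = pd.crosstab(long_df['id'], long_df['discipline']).clip(upper=1)

# Column names follow the discipline_* names used in the cells above
short_names = {
    'Clinical Medicine': 'discipline_clinical',
    'Pharmaceutical sciences': 'discipline_pharma',
    'Dentistry': 'discipline_dentistry',
}
flags = flags.rename(columns=short_names).reset_index()
flags.columns.name = None
flags['BIOMED'] = 1
print(flags)
```

One caveat of this approach: a discipline that never occurs in the data gets no column at all, so a fixed set of columns would need an explicit `reindex` on the columns.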


### 4. Clean and prepare Data about DOIs

In [71]:
# DOIs
aou_dois =  pd.read_csv(export_aou_dois, encoding='utf-8', header=0, sep='\t')
aou_dois

Unnamed: 0,id,doi
0,unige:1,10.1016/j.ijpharm.2007.09.007
1,unige:2,10.1039/b714861e
2,unige:3,10.1016/j.ejpb.2007.06.020
3,unige:4,10.1002/jps.20829
4,unige:5,10.1007/s11095-007-9429-7
...,...,...
66542,unige:167398,10.1016/j.bonr.2022.101623
66543,unige:167399,10.1016/j.jbspin.2022.105521
66544,unige:167400,10.1093/ejendo/lvad001
66545,unige:167401,10.3390/ijms232314585


In [72]:
# check duplicates
aou_dois.loc[aou_dois.duplicated(subset='doi', keep=False)].sort_values(by='doi')

Unnamed: 0,id,doi
66016,unige:166718,10.1016/j.ajodo.2022.03.014
66191,unige:166940,10.1016/j.ajodo.2022.03.014
2315,unige:170403,10.1093/eurheartj/ehab866
64779,unige:165129,10.1093/eurheartj/ehab866
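
DOI names are case-insensitive, so two records can point to the same DOI while the strings differ by case or stray whitespace. Whether the AoU export is already normalized is not stated here; the sketch below, on toy data, shows how such variants could be caught before deduplication (`doi_norm` is a helper column introduced for this example):

```python
import pandas as pd

# Toy data: the second and third DOIs differ only by case and surrounding whitespace
dois = pd.DataFrame({
    'id': ['unige:10', 'unige:11', 'unige:12'],
    'doi': ['10.1016/j.bonr.2022.101623',
            '10.1093/EurHeartJ/ehab866',
            ' 10.1093/eurheartj/ehab866 '],
})

# normalized copy used only for comparison; the original doi column is kept untouched
dois['doi_norm'] = dois['doi'].str.strip().str.lower()

duplicates = dois.loc[dois.duplicated(subset='doi_norm', keep=False)].sort_values(by='doi_norm')
print(duplicates[['id', 'doi']])
```

A comparison on the raw `doi` column would report no duplicates for this toy data.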


In [73]:
# export CSV for duplicates
aou_dois.loc[aou_dois.duplicated(subset='doi', keep=False)].sort_values(by='doi').to_csv(myfolder_temp + 'aou_dois_duplicates.tsv', sep='\t', index=False)

In [74]:
# dedup dois
aou_dois = aou_dois.drop_duplicates(subset='doi', keep='first')
aou_dois

Unnamed: 0,id,doi
0,unige:1,10.1016/j.ijpharm.2007.09.007
1,unige:2,10.1039/b714861e
2,unige:3,10.1016/j.ejpb.2007.06.020
3,unige:4,10.1002/jps.20829
4,unige:5,10.1007/s11095-007-9429-7
...,...,...
66542,unige:167398,10.1016/j.bonr.2022.101623
66543,unige:167399,10.1016/j.jbspin.2022.105521
66544,unige:167400,10.1093/ejendo/lvad001
66545,unige:167401,10.3390/ijms232314585


### 5. Clean and prepare Data about PMIDs

In [75]:
# PMIDs
aou_pmids =  pd.read_csv(export_aou_pmids, encoding='utf-8', header=0, sep='\t')
aou_pmids

Unnamed: 0,id,pmid
0,unige:1,17997238
1,unige:2,18092080
2,unige:3,17884402
3,unige:4,17497726
4,unige:5,17985216
...,...,...
37226,unige:167398,36213624
37227,unige:167399,36566976
37228,unige:167400,36762943
37229,unige:167401,36498911


In [76]:
# check duplicates
aou_pmids.loc[aou_pmids.duplicated(subset='pmid', keep=False)].sort_values(by='pmid')

Unnamed: 0,id,pmid


In [77]:
# export CSV for duplicates
aou_pmids.loc[aou_pmids.duplicated(subset='pmid', keep=False)].sort_values(by='pmid').to_csv(myfolder_temp + 'aou_pmids_duplicates.tsv', sep='\t', index=False)

In [78]:
# dedup pmids
aou_pmids = aou_pmids.drop_duplicates(subset='pmid', keep='first')
aou_pmids

Unnamed: 0,id,pmid
0,unige:1,17997238
1,unige:2,18092080
2,unige:3,17884402
3,unige:4,17497726
4,unige:5,17985216
...,...,...
37226,unige:167398,36213624
37227,unige:167399,36566976
37228,unige:167400,36762943
37229,unige:167401,36498911


### 6. Clean and prepare Data about publication type

In [79]:
# Document Type 980
aou_doctype = pd.read_csv(export_aou_980a, encoding='utf-8', header=0, sep='\t')
aou_doctype

Unnamed: 0,id,980a
0,unige:1,Article scientifique
1,unige:2,Article scientifique
2,unige:3,Article scientifique
3,unige:4,Article scientifique
4,unige:5,Article scientifique
...,...,...
109004,unige:167398,Article scientifique
109005,unige:167399,Article scientifique
109006,unige:167400,Article scientifique
109007,unige:167401,Article scientifique


In [80]:
aou_doctype = aou_doctype.rename(columns={'980a' : 'type'})

In [81]:
aou_doctype['type'].value_counts()

Article scientifique                                 71525
Chapitre de livre                                     8330
Thèse                                                 6543
Master                                                6264
Chapitre d'actes                                      4399
Article professionnel                                 2793
Livre                                                 2024
Présentation / Intervention                           1279
Rapport de recherche                                  1151
Autre article                                          816
Ouvrage collectif                                      771
Actes de conférence                                    447
Contribution à un dictionnaire / une encyclopédie      436
Numéro de revue                                        405
Master d'études avancées                               374
Working paper                                          367
Thèse de privat-docent                                 3

In [82]:
# check lines with several types (normally that's not possible)
aou_doctype.loc[aou_doctype.duplicated(subset='id', keep=False)]

Unnamed: 0,id,type
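
The empty result above confirms by eye that no record has several document types. A small sketch of the same check as an explicit assertion, which fails loudly if a later export breaks the assumption (the `doctype` frame is toy data):

```python
import pandas as pd

# Toy stand-in for aou_doctype
doctype = pd.DataFrame({
    'id': ['unige:1', 'unige:2', 'unige:3'],
    'type': ['Article scientifique', 'Thèse', 'Livre'],
})

# is_unique is True when no id value is repeated
ids_are_unique = doctype['id'].is_unique
assert ids_are_unique, 'some ids carry more than one document type'
```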


In [83]:
# dedup just in case
aou_doctype = aou_doctype.drop_duplicates(subset='id')
aou_doctype

Unnamed: 0,id,type
0,unige:1,Article scientifique
1,unige:2,Article scientifique
2,unige:3,Article scientifique
3,unige:4,Article scientifique
4,unige:5,Article scientifique
...,...,...
109004,unige:167398,Article scientifique
109005,unige:167399,Article scientifique
109006,unige:167400,Article scientifique
109007,unige:167401,Article scientifique


### 7. Merge all cleaned AoU data, starting with the base file that includes the publication year

In [84]:
aou = pd.read_csv(export_aou_dates, dtype={'date': 'int'}, encoding='utf-8', header=0, sep='\t')
aou

Unnamed: 0,id,date
0,unige:1,2008
1,unige:2,2008
2,unige:3,2008
3,unige:4,2008
4,unige:5,2008
...,...,...
109004,unige:167398,2022
109005,unige:167399,2023
109006,unige:167400,2023
109007,unige:167401,2022


In [85]:
aou['date'].value_counts()

2021    5427
2020    5340
2019    5305
2018    5252
2016    5233
        ... 
1844       1
1861       1
1841       1
1832       1
1856       1
Name: date, Length: 181, dtype: int64

In [86]:
aou

Unnamed: 0,id,date
0,unige:1,2008
1,unige:2,2008
2,unige:3,2008
3,unige:4,2008
4,unige:5,2008
...,...,...
109004,unige:167398,2022
109005,unige:167399,2023
109006,unige:167400,2023
109007,unige:167401,2022


In [87]:
# merge AoU with structures information
aou = aou.merge(aou_structures_biomed_dedup, on='id', how='left')
aou

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences
0,unige:1,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,unige:2,2008,,,,,,,,,
2,unige:3,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,unige:4,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,unige:5,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
109004,unige:167398,2022,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
109005,unige:167399,2023,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
109006,unige:167400,2023,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
109007,unige:167401,2022,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [88]:
# merge AoU with data information
aou = aou.merge(aou_856_data_dedup, on='id', how='left')
aou

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data
0,unige:1,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,
1,unige:2,2008,,,,,,,,,,,,,
2,unige:3,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,
3,unige:4,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,
4,unige:5,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109004,unige:167398,2022,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,
109005,unige:167399,2023,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,
109006,unige:167400,2023,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
109007,unige:167401,2022,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,


In [89]:
# merge AoU with funder information
aou = aou.merge(aou_988_funder_dedup, on='id', how='left')
aou

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER
0,unige:1,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,,
1,unige:2,2008,,,,,,,,,,,,,,,,
2,unige:3,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,,
3,unige:4,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,,
4,unige:5,2008,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109004,unige:167398,2022,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,
109005,unige:167399,2023,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,
109006,unige:167400,2023,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,,,
109007,unige:167401,2022,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,


In [90]:
aou['BIOMED'] = aou['BIOMED'].fillna(0).astype(int)
aou['DATA'] = aou['DATA'].fillna(0).astype(int)
aou['DATA_TYPE_appendixes'] = aou['DATA_TYPE_appendixes'].fillna(0).astype(int)
aou['DATA_TYPE_data_supplements'] = aou['DATA_TYPE_data_supplements'].fillna(0).astype(int)
aou['DATA_TYPE_shared_data'] = aou['DATA_TYPE_shared_data'].fillna(0).astype(int)
aou['FUNDER'] = aou['FUNDER'].fillna(0).astype(int)
aou['FNS_FUNDER'] = aou['FNS_FUNDER'].fillna(0).astype(int)
aou['EU_FUNDER'] = aou['EU_FUNDER'].fillna(0).astype(int)
aou['discipline_clinical'] = aou['discipline_clinical'].fillna(0).astype(int)
aou['discipline_basic'] = aou['discipline_basic'].fillna(0).astype(int)
aou['discipline_biology'] = aou['discipline_biology'].fillna(0).astype(int)
aou['discipline_pharma'] = aou['discipline_pharma'].fillna(0).astype(int)
aou['discipline_affective'] = aou['discipline_affective'].fillna(0).astype(int)
aou['discipline_dentistry'] = aou['discipline_dentistry'].fillna(0).astype(int)
aou['discipline_medicine_general'] = aou['discipline_medicine_general'].fillna(0).astype(int)
aou['discipline_neurosciences'] = aou['discipline_neurosciences'].fillna(0).astype(int)
aou

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER
0,unige:1,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,unige:2,2008,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,unige:3,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3,unige:4,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,unige:5,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109004,unige:167398,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
109005,unige:167399,2023,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
109006,unige:167400,2023,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0
109007,unige:167401,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
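
The cell above repeats the same `fillna(0).astype(int)` call for each flag column. A compact sketch of the same conversion applied to a list of columns at once, on toy data (`merged` and `flag_columns` are names introduced for this example):

```python
import pandas as pd

# Toy stand-in for the merged aou table: flags are floats with NaN after the left merges
merged = pd.DataFrame({
    'id': ['unige:1', 'unige:2'],
    'BIOMED': [1.0, float('nan')],
    'DATA': [float('nan'), 1.0],
    'FUNDER': [float('nan'), float('nan')],
})

# every column except the identifier is treated as a 0/1 flag
flag_columns = [c for c in merged.columns if c != 'id']

# fill missing flags with 0, then cast each flag column to int
merged = merged.fillna({c: 0 for c in flag_columns}).astype({c: int for c in flag_columns})
print(merged.dtypes)
```

In the notebook, `flag_columns` would have to exclude `date` as well as `id`, since `date` is not a flag.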


In [91]:
# merge AoU with type information
aou = aou.merge(aou_doctype, on='id', how='left')
aou

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type
0,unige:1,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
1,unige:2,2008,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
2,unige:3,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
3,unige:4,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
4,unige:5,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109004,unige:167398,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
109005,unige:167399,2023,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
109006,unige:167400,2023,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,Article scientifique
109007,unige:167401,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique


In [92]:
# keep only biomedical publications from 2015 to 2022 (inclusive)
aou_biomed_2015_2022 = aou.loc[(aou['date'] > 2014) & (aou['date'] < 2023) & (aou['BIOMED'] == 1)]
aou_biomed_2015_2022

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type
400,unige:167403,2022,1,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,Article scientifique
402,unige:167405,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Thèse
403,unige:167406,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Thèse
404,unige:167407,2022,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Thèse
406,unige:167409,2022,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109000,unige:167394,2020,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article professionnel
109001,unige:167395,2021,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article professionnel
109002,unige:167396,2021,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
109004,unige:167398,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
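
The year filter above uses two strict comparisons. An equivalent formulation with `Series.between`, which is inclusive on both ends by default, may make the 2015 to 2022 range easier to read. The sketch uses toy data with hypothetical ids:

```python
import pandas as pd

# Toy data with hypothetical ids, one publication per boundary year
pubs = pd.DataFrame({
    'id': ['unige:a', 'unige:b', 'unige:c', 'unige:d'],
    'date': [2014, 2015, 2022, 2023],
    'BIOMED': [1, 1, 1, 1],
})

# between() includes both 2015 and 2022
in_range = pubs['date'].between(2015, 2022)

# .copy() gives an independent frame, so later column assignments do not touch pubs
subset = pubs.loc[in_range & (pubs['BIOMED'] == 1)].copy()
print(subset)
```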


### 8. Prepare another detailed file containing only data about supplementary files and datasets, not deduplicated by publication, to analyse where data is shared (e.g. on which repository)

In [93]:
# add the AoU publication data to the 856 links
aou_856 = aou_856.merge(aou, on='id', how='left')
aou_856

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z,DATA_x,DATA_TYPE,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA_y,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type
0,unige:1,Article (Published version),unige1.pdf,https://archive-ouverte.unige.ch/download/ffe9...,Restricted access,0,0.0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
1,unige:3,Article (Published version),unige3.pdf,https://archive-ouverte.unige.ch/download/c0ba...,Restricted access,0,0.0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
2,unige:4,Article (Published version),1-s2.0-S0022354916324418-main.pdf,https://archive-ouverte.unige.ch/download/e6dc...,Restricted access,0,0.0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
3,unige:5,Article (Published version),11095_2007_Article_9429.pdf,https://archive-ouverte.unige.ch/download/a32f...,Public access,OA NATIONAL,0.0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
4,unige:6,Article,fulltext.pdf,https://archive-ouverte.unige.ch/download/470f...,Restricted access,0,0.0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148000,unige:167401,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/8e3d...,Public access,CC BY-4.0,0.0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
148001,unige:167401,Alternate edition,0,https://www.mdpi.com/1422-0067/23/23/14585,0,0,0.0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique
148002,unige:167402,Article (Published version),A 578. Assessing the androgenic and.pdf,https://archive-ouverte.unige.ch/download/2e32...,Restricted access,0,0.0,,2023,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,Article scientifique
148003,unige:167402,Appendix,cen14847-sup-0001-supmat.docx,https://archive-ouverte.unige.ch/download/0cbf...,Public access,0,1.0,appendixes,2023,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,Article scientifique


In [94]:
# simplify DATA field
del aou_856['DATA_y']
del aou_856['DATA_TYPE_appendixes']
del aou_856['DATA_TYPE_data_supplements']
del aou_856['DATA_TYPE_shared_data']
aou_856 = aou_856.rename(columns={'DATA_x' : 'DATA'})
aou_856['DATA'] = aou_856['DATA'].fillna(0).astype(int)
aou_856

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z,DATA,DATA_TYPE,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,FUNDER,FNS_FUNDER,EU_FUNDER,type
0,unige:1,Article (Published version),unige1.pdf,https://archive-ouverte.unige.ch/download/ffe9...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique
1,unige:3,Article (Published version),unige3.pdf,https://archive-ouverte.unige.ch/download/c0ba...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique
2,unige:4,Article (Published version),1-s2.0-S0022354916324418-main.pdf,https://archive-ouverte.unige.ch/download/e6dc...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique
3,unige:5,Article (Published version),11095_2007_Article_9429.pdf,https://archive-ouverte.unige.ch/download/a32f...,Public access,OA NATIONAL,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique
4,unige:6,Article,fulltext.pdf,https://archive-ouverte.unige.ch/download/470f...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148000,unige:167401,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/8e3d...,Public access,CC BY-4.0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique
148001,unige:167401,Alternate edition,0,https://www.mdpi.com/1422-0067/23/23/14585,0,0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique
148002,unige:167402,Article (Published version),A 578. Assessing the androgenic and.pdf,https://archive-ouverte.unige.ch/download/2e32...,Restricted access,0,0,,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique
148003,unige:167402,Appendix,cen14847-sup-0001-supmat.docx,https://archive-ouverte.unige.ch/download/0cbf...,Public access,0,1,appendixes,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique


In [95]:
# list of 856 URLs: extract domains and filetypes (if DOI: add information by prefixes, and add the final URL)
# extract the domain name on 856_u
aou_856.loc[aou_856['856_u'].notna(), '856_u_split'] = aou_856['856_u'].str.upper()
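# strip the scheme and a leading "WWW." (raw-string patterns; regex=True is required since pandas 2.0, where str.replace is literal by default)
aou_856['856_u_split'] = aou_856['856_u_split'].str.replace(r'HTTPS://WWW\.', '', regex=True)
aou_856['856_u_split'] = aou_856['856_u_split'].str.replace(r'HTTP://WWW\.', '', regex=True)
aou_856['856_u_split'] = aou_856['856_u_split'].str.replace(r'HTTPS://', '', regex=True)
aou_856['856_u_split'] = aou_856['856_u_split'].str.replace(r'HTTP://', '', regex=True)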
aou_856['856_u_split'] = aou_856['856_u_split'].str.split(pat='/')
aou_856['856_domain'] = aou_856['856_u_split'].str[0]
aou_856

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z,DATA,DATA_TYPE,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,FUNDER,FNS_FUNDER,EU_FUNDER,type,856_u_split,856_domain
0,unige:1,Article (Published version),unige1.pdf,https://archive-ouverte.unige.ch/download/ffe9...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, FFE949D9-...",ARCHIVE-OUVERTE.UNIGE.CH
1,unige:3,Article (Published version),unige3.pdf,https://archive-ouverte.unige.ch/download/c0ba...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, C0BA4635-...",ARCHIVE-OUVERTE.UNIGE.CH
2,unige:4,Article (Published version),1-s2.0-S0022354916324418-main.pdf,https://archive-ouverte.unige.ch/download/e6dc...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, E6DC92FA-...",ARCHIVE-OUVERTE.UNIGE.CH
3,unige:5,Article (Published version),11095_2007_Article_9429.pdf,https://archive-ouverte.unige.ch/download/a32f...,Public access,OA NATIONAL,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, A32F50BF-...",ARCHIVE-OUVERTE.UNIGE.CH
4,unige:6,Article,fulltext.pdf,https://archive-ouverte.unige.ch/download/470f...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 470F409D-...",ARCHIVE-OUVERTE.UNIGE.CH
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148000,unige:167401,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/8e3d...,Public access,CC BY-4.0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 8E3D7D5F-...",ARCHIVE-OUVERTE.UNIGE.CH
148001,unige:167401,Alternate edition,0,https://www.mdpi.com/1422-0067/23/23/14585,0,0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[MDPI.COM, 1422-0067, 23, 23, 14585]",MDPI.COM
148002,unige:167402,Article (Published version),A 578. Assessing the androgenic and.pdf,https://archive-ouverte.unige.ch/download/2e32...,Restricted access,0,0,,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 2E32D627-...",ARCHIVE-OUVERTE.UNIGE.CH
148003,unige:167402,Appendix,cen14847-sup-0001-supmat.docx,https://archive-ouverte.unige.ch/download/0cbf...,Public access,0,1,appendixes,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 0CBFB5CC-...",ARCHIVE-OUVERTE.UNIGE.CH
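
The domain extraction in the cell above can also be sketched with the standard library's URL parser instead of chained string replacements. This is only an illustration, not the notebook's method: `extract_domain` is a hypothetical helper name, and the sketch assumes that every usable value in `856_u` is an absolute `http(s)` URL.

```python
from urllib.parse import urlsplit

def extract_domain(url):
    """Return the upper-cased host of a URL without a leading 'WWW.', or None."""
    if not isinstance(url, str) or not url.strip():
        return None  # missing values (NaN, 0) are mapped to None
    try:
        host = urlsplit(url.strip()).hostname  # host part only, without port
    except ValueError:
        return None  # malformed URL
    if host is None:
        return None  # no scheme://host part, so not an absolute URL
    host = host.upper()
    if host.startswith("WWW."):
        host = host[len("WWW."):]
    return host

print(extract_domain("https://www.mdpi.com/1422-0067/23/23/14585"))
print(extract_domain("https://doi.org/10.5281/zenodo.5150443"))
```

Applied to the DataFrame, this could be used as `aou_856['856_u'].map(extract_domain)`. Unlike the replacement chain, it only removes the scheme and `WWW.` at the start of the URL.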


In [96]:
# extract the DOI prefix on 856_u
aou_856.loc[aou_856['856_domain'] == 'DOI.ORG', '856_doi_prefix'] = '/' + aou_856['856_u_split'].str[1]
aou_856.loc[aou_856['856_domain'] == 'DOI.ORG', '856_domain'] = aou_856['856_domain'] + aou_856['856_doi_prefix']
aou_856

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z,DATA,DATA_TYPE,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,FUNDER,FNS_FUNDER,EU_FUNDER,type,856_u_split,856_domain,856_doi_prefix
0,unige:1,Article (Published version),unige1.pdf,https://archive-ouverte.unige.ch/download/ffe9...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, FFE949D9-...",ARCHIVE-OUVERTE.UNIGE.CH,
1,unige:3,Article (Published version),unige3.pdf,https://archive-ouverte.unige.ch/download/c0ba...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, C0BA4635-...",ARCHIVE-OUVERTE.UNIGE.CH,
2,unige:4,Article (Published version),1-s2.0-S0022354916324418-main.pdf,https://archive-ouverte.unige.ch/download/e6dc...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, E6DC92FA-...",ARCHIVE-OUVERTE.UNIGE.CH,
3,unige:5,Article (Published version),11095_2007_Article_9429.pdf,https://archive-ouverte.unige.ch/download/a32f...,Public access,OA NATIONAL,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, A32F50BF-...",ARCHIVE-OUVERTE.UNIGE.CH,
4,unige:6,Article,fulltext.pdf,https://archive-ouverte.unige.ch/download/470f...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 470F409D-...",ARCHIVE-OUVERTE.UNIGE.CH,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148000,unige:167401,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/8e3d...,Public access,CC BY-4.0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 8E3D7D5F-...",ARCHIVE-OUVERTE.UNIGE.CH,
148001,unige:167401,Alternate edition,0,https://www.mdpi.com/1422-0067/23/23/14585,0,0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[MDPI.COM, 1422-0067, 23, 23, 14585]",MDPI.COM,
148002,unige:167402,Article (Published version),A 578. Assessing the androgenic and.pdf,https://archive-ouverte.unige.ch/download/2e32...,Restricted access,0,0,,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 2E32D627-...",ARCHIVE-OUVERTE.UNIGE.CH,
148003,unige:167402,Appendix,cen14847-sup-0001-supmat.docx,https://archive-ouverte.unige.ch/download/0cbf...,Public access,0,1,appendixes,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 0CBFB5CC-...",ARCHIVE-OUVERTE.UNIGE.CH,


In [97]:
aou_856.loc[aou_856['856_domain'].str.startswith('DOI.ORG', na=False)]

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z,DATA,DATA_TYPE,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,FUNDER,FNS_FUNDER,EU_FUNDER,type,856_u_split,856_domain,856_doi_prefix
660,unige:167429,Alternate edition,0,https://doi.org/10.5194/egusphere-egu23-6005,0,0,0,,2023,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre d'actes,"[DOI.ORG, 10.5194, EGUSPHERE-EGU23-6005]",DOI.ORG/10.5194,/10.5194
1182,unige:167783,Dataset,0,https://doi.org/10.5281/zenodo.5150443,0,0,1,shared data,2023,1,0,1,1,0,0,0,0,0,1,1,0,Article scientifique,"[DOI.ORG, 10.5281, ZENODO.5150443]",DOI.ORG/10.5281,/10.5281
1621,unige:168062,Dataset,0,https://doi.org/10.5281/zenodo.7509571,0,0,1,shared data,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[DOI.ORG, 10.5281, ZENODO.7509571]",DOI.ORG/10.5281,/10.5281
1636,unige:168071,Dataset,0,https://doi.org/10.5061/dryad.ngf1vhhvp,0,0,1,shared data,2023,1,0,1,0,0,0,0,0,1,1,1,0,Article scientifique,"[DOI.ORG, 10.5061, DRYAD.NGF1VHHVP]",DOI.ORG/10.5061,/10.5061
1829,unige:168196,Dataset,0,https://doi.org/10.1080/09658211.2023.2191901,0,0,1,shared data,2023,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[DOI.ORG, 10.1080, 09658211.2023.2191901]",DOI.ORG/10.1080,/10.1080
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147327,unige:167048,Dataset,0,https://doi.org/10.26037/yareta:ry65havvezhmnl...,0,0,1,shared data,2022,1,1,1,0,0,0,0,0,0,1,1,0,Article scientifique,"[DOI.ORG, 10.26037, YARETA:RY65HAVVEZHMNLWA2HZ...",DOI.ORG/10.26037,/10.26037
147339,unige:167053,Dataset,0,https://doi.org/10.5281/zenodo.6645256,0,0,1,shared data,2022,1,1,0,0,0,0,0,0,0,1,1,0,Article scientifique,"[DOI.ORG, 10.5281, ZENODO.6645256]",DOI.ORG/10.5281,/10.5281
147347,unige:167056,Dataset,0,https://doi.org/10.26037/yareta:naegejqvxzfwli...,0,0,1,shared data,2021,1,1,0,0,0,0,0,0,0,1,1,0,Article scientifique,"[DOI.ORG, 10.26037, YARETA:NAEGEJQVXZFWLIU236P...",DOI.ORG/10.26037,/10.26037
147815,unige:167278,Dataset,0,https://doi.org/10.1080/09538259.2018.1442784,0,0,1,shared data,2018,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[DOI.ORG, 10.1080, 09538259.2018.1442784]",DOI.ORG/10.1080,/10.1080
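
The DOI prefix identifies the registrant of the DOI, which is why it is appended to the `DOI.ORG` domain above. A minimal sketch of the same extraction with a regular expression, followed by a count per prefix, could look like this. The helper name `doi_prefix` and the simplified pattern are assumptions for illustration; the sample URLs are taken from the output above.

```python
import re
from collections import Counter

# Simplified pattern: "10." followed by digits/dots, up to the first slash.
DOI_URL_PATTERN = re.compile(
    r"^https?://(?:www\.)?doi\.org/(10\.[0-9.]+)/", re.IGNORECASE
)

def doi_prefix(url):
    """Return the DOI prefix (e.g. '10.5281') of a doi.org URL, or None."""
    if not isinstance(url, str):
        return None
    match = DOI_URL_PATTERN.match(url.strip())
    return match.group(1) if match else None

sample_urls = [
    "https://doi.org/10.5281/zenodo.5150443",
    "https://doi.org/10.5281/zenodo.7509571",
    "https://doi.org/10.5061/dryad.ngf1vhhvp",
    "https://doi.org/10.1080/09658211.2023.2191901",
    "https://doi.org/10.5281/zenodo.6645256",
]

# Count how many sample links fall under each prefix
prefix_counts = Counter(doi_prefix(u) for u in sample_urls)
print(prefix_counts.most_common())
```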


In [98]:
# extract the 856_f filetype
aou_856.loc[aou_856['856_f'].notna(), '856_f_split'] = aou_856['856_f'].str.upper()
aou_856['856_f_split'] = aou_856['856_f_split'].str.rsplit(pat='.')
aou_856['856_filetype'] = aou_856['856_f_split'].str[-1]
aou_856

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z,DATA,DATA_TYPE,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,FUNDER,FNS_FUNDER,EU_FUNDER,type,856_u_split,856_domain,856_doi_prefix,856_f_split,856_filetype
0,unige:1,Article (Published version),unige1.pdf,https://archive-ouverte.unige.ch/download/ffe9...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, FFE949D9-...",ARCHIVE-OUVERTE.UNIGE.CH,,"[UNIGE1, PDF]",PDF
1,unige:3,Article (Published version),unige3.pdf,https://archive-ouverte.unige.ch/download/c0ba...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, C0BA4635-...",ARCHIVE-OUVERTE.UNIGE.CH,,"[UNIGE3, PDF]",PDF
2,unige:4,Article (Published version),1-s2.0-S0022354916324418-main.pdf,https://archive-ouverte.unige.ch/download/e6dc...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, E6DC92FA-...",ARCHIVE-OUVERTE.UNIGE.CH,,"[1-S2, 0-S0022354916324418-MAIN, PDF]",PDF
3,unige:5,Article (Published version),11095_2007_Article_9429.pdf,https://archive-ouverte.unige.ch/download/a32f...,Public access,OA NATIONAL,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, A32F50BF-...",ARCHIVE-OUVERTE.UNIGE.CH,,"[11095_2007_ARTICLE_9429, PDF]",PDF
4,unige:6,Article,fulltext.pdf,https://archive-ouverte.unige.ch/download/470f...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 470F409D-...",ARCHIVE-OUVERTE.UNIGE.CH,,"[FULLTEXT, PDF]",PDF
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148000,unige:167401,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/8e3d...,Public access,CC BY-4.0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 8E3D7D5F-...",ARCHIVE-OUVERTE.UNIGE.CH,,"[ARTICLE-SCIENTIFIQUE-FROM-EUROPEPMC, PDF]",PDF
148001,unige:167401,Alternate edition,0,https://www.mdpi.com/1422-0067/23/23/14585,0,0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[MDPI.COM, 1422-0067, 23, 23, 14585]",MDPI.COM,,,
148002,unige:167402,Article (Published version),A 578. Assessing the androgenic and.pdf,https://archive-ouverte.unige.ch/download/2e32...,Restricted access,0,0,,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 2E32D627-...",ARCHIVE-OUVERTE.UNIGE.CH,,"[A 578, ASSESSING THE ANDROGENIC AND, PDF]",PDF
148003,unige:167402,Appendix,cen14847-sup-0001-supmat.docx,https://archive-ouverte.unige.ch/download/0cbf...,Public access,0,1,appendixes,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,"[ARCHIVE-OUVERTE.UNIGE.CH, DOWNLOAD, 0CBFB5CC-...",ARCHIVE-OUVERTE.UNIGE.CH,,"[CEN14847-SUP-0001-SUPMAT, DOCX]",DOCX
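
With the split-based approach above, a file name that contains no dot yields the whole upper-cased name as its file type. The following sketch shows one possible way to return an empty result in that case, using `pathlib`. The helper name `file_extension` is hypothetical, and the sketch assumes that `856_f` holds plain file names rather than paths.

```python
from pathlib import PurePosixPath

def file_extension(filename):
    """Return the upper-cased extension of a file name, or None if there is none."""
    if not isinstance(filename, str):
        return None  # missing values (NaN, 0) are mapped to None
    suffix = PurePosixPath(filename).suffix  # '' when the name has no extension
    return suffix[1:].upper() or None

print(file_extension("cen14847-sup-0001-supmat.docx"))
print(file_extension("fulltext"))
```

It could be applied with `aou_856['856_f'].map(file_extension)`.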


In [99]:
# clean the temporary columns
del aou_856['856_u_split']
del aou_856['856_doi_prefix']
del aou_856['856_f_split']
aou_856

Unnamed: 0,id,856_3,856_f,856_u,856_x,856_z,DATA,DATA_TYPE,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,FUNDER,FNS_FUNDER,EU_FUNDER,type,856_domain,856_filetype
0,unige:1,Article (Published version),unige1.pdf,https://archive-ouverte.unige.ch/download/ffe9...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,ARCHIVE-OUVERTE.UNIGE.CH,PDF
1,unige:3,Article (Published version),unige3.pdf,https://archive-ouverte.unige.ch/download/c0ba...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,ARCHIVE-OUVERTE.UNIGE.CH,PDF
2,unige:4,Article (Published version),1-s2.0-S0022354916324418-main.pdf,https://archive-ouverte.unige.ch/download/e6dc...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,ARCHIVE-OUVERTE.UNIGE.CH,PDF
3,unige:5,Article (Published version),11095_2007_Article_9429.pdf,https://archive-ouverte.unige.ch/download/a32f...,Public access,OA NATIONAL,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,ARCHIVE-OUVERTE.UNIGE.CH,PDF
4,unige:6,Article,fulltext.pdf,https://archive-ouverte.unige.ch/download/470f...,Restricted access,0,0,,2008,1,0,0,0,1,0,0,0,0,0,0,0,Article scientifique,ARCHIVE-OUVERTE.UNIGE.CH,PDF
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148000,unige:167401,Article (Published version),Article-scientifique-from-europePmc.pdf,https://archive-ouverte.unige.ch/download/8e3d...,Public access,CC BY-4.0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,ARCHIVE-OUVERTE.UNIGE.CH,PDF
148001,unige:167401,Alternate edition,0,https://www.mdpi.com/1422-0067/23/23/14585,0,0,0,,2022,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,MDPI.COM,
148002,unige:167402,Article (Published version),A 578. Assessing the androgenic and.pdf,https://archive-ouverte.unige.ch/download/2e32...,Restricted access,0,0,,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,ARCHIVE-OUVERTE.UNIGE.CH,PDF
148003,unige:167402,Appendix,cen14847-sup-0001-supmat.docx,https://archive-ouverte.unige.ch/download/0cbf...,Public access,0,1,appendixes,2023,1,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,ARCHIVE-OUVERTE.UNIGE.CH,DOCX


In [100]:
# export CSV & Excel
aou_856.to_csv(myfolder_results + 'aou_856.tsv', sep='\t', index=False)
aou_856.to_excel(myfolder_results + 'aou_856.xlsx', index=False)

  "65,530 URLS per worksheet." % force_unicode(url))

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
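
The truncated warnings above mention a limit of 65,530 URLs per worksheet. They appear to come from the Excel writer, which turns URL-like strings into hyperlinks. Assuming the XlsxWriter engine is installed and pandas is at version 1.3 or later (for `engine_kwargs`), one possible way to avoid both the warnings and the dropped links is to write URLs as plain text. This is a sketch, not what the notebook ran; `export_without_hyperlinks` is a hypothetical helper name.

```python
import pandas as pd

def export_without_hyperlinks(df, path):
    """Write df to an .xlsx file, keeping URL-like strings as plain text."""
    with pd.ExcelWriter(
        path,
        engine="xlsxwriter",
        # XlsxWriter workbook option: do not convert strings to hyperlinks
        engine_kwargs={"options": {"strings_to_urls": False}},
    ) as writer:
        df.to_excel(writer, index=False)
```

In this notebook the call would be `export_without_hyperlinks(aou_856, myfolder_results + 'aou_856.xlsx')`.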



In [101]:
# export CSV
aou.to_csv(myfolder_results + 'aou_all.tsv', sep='\t', index=False)
aou_biomed_2015_2022.to_csv(myfolder_results + 'aou_biomed_2015_2022.tsv', sep='\t', index=False)

In [102]:
# export Excel
aou.to_excel(myfolder_results + 'aou_all.xlsx', index=False)
aou_biomed_2015_2022.to_excel(myfolder_results + 'aou_biomed_2015_2022.xlsx', index=False)

In [103]:
print('AoU Biomed scientific articles (2015-2022): ' + str(aou_biomed_2015_2022.shape[0]))

AoU Biomed scientific articles (2015-2022): 19319


In [104]:
# count the publications with associated data, then report the count and its share of the corpus
n_with_data = int((aou_biomed_2015_2022['DATA'] == 1).sum())
print('AoU Biomed publications (2015-2022) with associated data: ' + str(n_with_data) + ' (' + str(n_with_data * 100 / aou_biomed_2015_2022.shape[0]) + '%)')

AoU Biomed publications (2015-2022) with associated data: 2459 (12.728402091205549%)


In [105]:
# count the publications with a known SNSF or EU funder, then report the count and its share of the corpus
n_with_funder = int((aou_biomed_2015_2022['FUNDER'] == 1).sum())
print('AoU Biomed publications (2015-2022) with known SNSF or EU funder: ' + str(n_with_funder) + ' (' + str(n_with_funder * 100 / aou_biomed_2015_2022.shape[0]) + '%)')

AoU Biomed publications (2015-2022) with known SNSF or EU funder: 2848 (14.741963869765517%)


In [106]:
# count the publications that have both a known funder and associated data
n_funder_and_data = int(((aou_biomed_2015_2022['FUNDER'] == 1) & (aou_biomed_2015_2022['DATA'] == 1)).sum())
print('AoU Biomed publications (2015-2022) with known SNSF or EU funder AND associated data: ' + str(n_funder_and_data) + ' (' + str(n_funder_and_data * 100 / aou_biomed_2015_2022.shape[0]) + '%)')

AoU Biomed publications (2015-2022) with known SNSF or EU funder AND associated data: 903 (4.674154976965681%)
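
The three percentages above all use the full set of biomedical publications as the denominator. A cross-tabulation is one way to also see the share of publications with data within the funded and non-funded groups. The sketch below uses a small invented DataFrame (not AoU figures) with the same `FUNDER` and `DATA` flag columns.

```python
import pandas as pd

# Toy data, invented for illustration only (not AoU figures)
toy = pd.DataFrame({
    "FUNDER": [1, 1, 0, 0, 1, 0],
    "DATA":   [1, 0, 0, 1, 1, 0],
})

# Number of publications for each FUNDER x DATA combination
counts = pd.crosstab(toy["FUNDER"], toy["DATA"])

# Share (in %) of publications with associated data within each funder group
share_with_data = toy.groupby("FUNDER")["DATA"].mean() * 100

print(counts)
print(share_with_data)
```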


Note: a further step would be to check which disciplines these publications belong to, how they are distributed across years, etc.
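
A possible starting point for that follow-up is sketched below on invented toy data. It assumes that the biomedical DataFrame carries a `date` column, the `DATA` flag and the `discipline_*` flag columns seen in the 856 table; none of the numbers are AoU results.

```python
import pandas as pd

# Toy data, invented for illustration only (not AoU figures)
toy = pd.DataFrame({
    "date": [2015, 2015, 2016, 2016, 2016],
    "DATA": [1, 0, 1, 1, 0],
    "discipline_clinical": [1, 0, 1, 0, 1],
    "discipline_basic": [0, 1, 1, 1, 0],
})

# Per year: number of publications, number with data, and share in %
by_year = toy.groupby("date")["DATA"].agg(publications="count", with_data="sum")
by_year["share_pct"] = by_year["with_data"] * 100 / by_year["publications"]

# Per discipline flag: a publication can count in several disciplines
discipline_cols = [c for c in toy.columns if c.startswith("discipline_")]
by_discipline = pd.DataFrame({
    col: {
        "publications": int(toy[col].sum()),
        "with_data": int(toy.loc[toy[col] == 1, "DATA"].sum()),
    }
    for col in discipline_cols
}).T

print(by_year)
print(by_discipline)
```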