# Librarian's quest to exhaustivity and openness

## Merging AoU with PubMed and OpenAlex data

Project for the EAHIL 2024 conference: https://eahil2024.rsu.lv/

Authors: **Floriane Muller & Pablo Iriarte**, University of Geneva  
Last update: 10.06.2024  

This notebook merges the combined AoU and PubMed data with information from OpenAlex.

### Sources

1. **AoU and PubMed data merged in Notebook 3**
2. **OpenAlex search for UNIGE / HUG affiliation export**: https://openalex.org/works?sort=cited_by_count%3Adesc&column=display_name,publication_year,type,open_access.is_oa,cited_by_count&page=1&filter=publication_year%3A2015-2022,authorships.institutions.lineage%3AI114457229%7CI4210106256
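
The export link above encodes two filters: publication years 2015-2022 and the two institution lineage IDs. As a minimal sketch, the same filters can be assembled into a query URL in Python. The `https://api.openalex.org/works` endpoint and the helper name `build_openalex_url` are assumptions for illustration; the CSV used in this notebook was exported from the web interface, not fetched this way.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Filters copied from the web export link above.
FILTERS = {
    "publication_year": "2015-2022",
    "authorships.institutions.lineage": "I114457229|I4210106256",
}

def build_openalex_url(filters, base="https://api.openalex.org/works"):
    """Build a works query URL; `base` is an assumed API endpoint."""
    filter_value = ",".join(f"{key}:{value}" for key, value in filters.items())
    query = urlencode({"filter": filter_value, "sort": "cited_by_count:desc"})
    return f"{base}?{query}"

url = build_openalex_url(FILTERS)

# Decode the query string again to check that the filters survive the round trip.
decoded = parse_qs(urlsplit(url).query)
print(decoded["filter"][0])
```

Keeping the filters in a dictionary makes it easier to change the year range or the institution IDs for a later update of the dataset.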

In [1]:
import pandas as pd
import csv

# parameters
myfolder_results = 'results/2024/'
myfolder_temp = 'data/temp/2024/'
export_openalex = 'data/sources/openalex_export_unige_hug_2015_2022_20240529.csv'
export_aou_and_pubmed = myfolder_results + 'aou_and_pubmed.tsv'
export_aou_not_pubmed = myfolder_results + 'aou_not_pubmed.tsv'
export_pubmed_and_aou = myfolder_results + 'pubmed_and_aou.tsv'
export_pubmed_not_aou = myfolder_results + 'pubmed_not_aou.tsv'

# display all columns
pd.set_option('display.max_columns', None)

## Open AoU and PubMed data

In [2]:
aou_and_pubmed = pd.read_csv(export_aou_and_pubmed, dtype={'id': 'str'}, encoding='utf-8', header=0, sep='\t')
aou_and_pubmed

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI_AOU,DOI,PMID,AOU,PUBMED,PMID_PUBMED,MERGED_BY_PMID,MERGED_BY_DOI,DATA_PUBMED
0,unige:167403,2022,1,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,Article scientifique,10.1007/s00198-022-06544-2,10.1007/S00198-022-06544-2,36173415,1,1,36173415,1,0,0
1,unige:167443,2020,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,,32877100,1,1,32877100,1,0,0
2,unige:167461,2021,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,Article scientifique,10.1016/j.scitotenv.2021.147871,10.1016/J.SCITOTENV.2021.147871,34098278,1,1,34098278,1,0,0
3,unige:167494,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,Article scientifique,10.1103/PhysRevLett.129.268101,10.1103/PHYSREVLETT.129.268101,36608212,1,1,36608212,1,0,0
4,unige:167495,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,1,Article scientifique,10.1371/journal.pcbi.1010586,10.1371/JOURNAL.PCBI.1010586,36251703,1,1,36251703,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13756,unige:167393,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1007/s40520-022-02100-4,10.1007/S40520-022-02100-4,35332506,1,1,35332506,1,0,0
13757,unige:167394,2020,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article professionnel,10.53738/REVMED.2020.16.676-77.0078,10.53738/REVMED.2020.16.676-77.0078,31961090,1,1,31961090,1,0,0
13758,unige:167395,2021,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article professionnel,10.53738/REVMED.2021.17.720-21.0063,10.53738/REVMED.2021.17.720-21.0063,33443834,1,1,33443834,1,0,0
13759,unige:167398,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1016/j.bonr.2022.101623,10.1016/J.BONR.2022.101623,36213624,1,1,36213624,1,0,0


In [3]:
aou_not_pubmed = pd.read_csv(export_aou_not_pubmed, dtype={'id': 'str'}, encoding='utf-8', header=0, sep='\t')
aou_not_pubmed

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI_AOU,DOI,PMID,AOU,PUBMED,PMID_PUBMED,MERGED_BY_PMID,MERGED_BY_DOI,DATA_PUBMED
0,unige:167405,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Thèse,10.13097/archive-ouverte/unige:167405,10.13097/ARCHIVE-OUVERTE/UNIGE:167405,0,1,0,0,0,0,0
1,unige:167406,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Thèse,10.13097/archive-ouverte/unige:167406,10.13097/ARCHIVE-OUVERTE/UNIGE:167406,0,1,0,0,0,0,0
2,unige:167407,2022,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Thèse,10.13097/archive-ouverte/unige:167407,10.13097/ARCHIVE-OUVERTE/UNIGE:167407,0,1,0,0,0,0,0
3,unige:167409,2022,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1016/B978-0-323-90999-0.00003-3,10.1016/B978-0-323-90999-0.00003-3,0,1,0,0,0,0,0
4,unige:167424,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1002/tax.12863,10.1002/TAX.12863,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5553,unige:167193,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1093/plphys/kiac605,10.1093/PLPHYS/KIAC605,36583226,1,0,0,0,0,0
5554,unige:167207,2022,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,Thèse,10.13097/archive-ouverte/unige:167207,10.13097/ARCHIVE-OUVERTE/UNIGE:167207,0,1,0,0,0,0,0
5555,unige:167233,2022,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,Article scientifique,10.53738/REVMED.2022.18.798.1880,10.53738/REVMED.2022.18.798.1880,36200968,1,0,0,0,0,0
5556,unige:167345,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1212/WNL.0000000000200133,10.1212/WNL.0000000000200133,35288476,1,0,0,0,0,0


In [4]:
pubmed_and_aou = pd.read_csv(export_pubmed_and_aou, dtype={'id': 'str'}, encoding='utf-8', header=0, sep='\t')
pubmed_and_aou

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DOI_PUBMED,DATA_PUBMED,DOI,PUBMED,AOU,id,MERGED_BY_PMID,MERGED_BY_DOI
0,25773637,Music and emotions: from enchantment to entrai...,"Vuilleumier P, Trost W.",Ann N Y Acad Sci. 2015 Mar;1337:212-22. doi: 1...,Vuilleumier P,Ann N Y Acad Sci,2015,2015/03/17,,,10.1111/nyas.12676,0,10.1111/NYAS.12676,1,1,unige:79934,1,0
1,26569380,Microbiota depletion promotes browning of whit...,"Suárez-Zamorano N, Fabbiano S, Chevalier C, St...",Nat Med. 2015 Dec;21(12):1497-1501. doi: 10.10...,Suárez-Zamorano N,Nat Med,2015,2015/11/17,PMC4675088,EMS65627,10.1038/nm.3994,1,10.1038/NM.3994,1,1,unige:78973,1,0
2,26586182,Sufficiency of Mesolimbic Dopamine Neuron Stim...,"Pascoli V, Terrier J, Hiver A, Lüscher C.",Neuron. 2015 Dec 2;88(5):1054-1066. doi: 10.10...,Pascoli V,Neuron,2015,2015/11/21,,,10.1016/j.neuron.2015.10.017,0,10.1016/J.NEURON.2015.10.017,1,1,unige:76777,1,0
3,29150871,Reversing dopaminergic sensitization,"Castrioto A, Carnicella S, Fraix V, Chabardes ...",Mov Disord. 2017 Dec;32(12):1679-1683. doi: 10...,Castrioto A,Mov Disord,2017,2017/11/19,,,10.1002/mds.27213,1,10.1002/MDS.27213,1,1,unige:100742,0,1
4,25758032,Response,"Delorenzi M, Tejpar S, Roth AD, Bosman FT; all...",J Natl Cancer Inst. 2015 Mar 10;107(5):djv056....,Delorenzi M,J Natl Cancer Inst,2015,2015/03/12,,,10.1093/jnci/djv056,0,10.1093/JNCI/DJV056,1,1,unige:74740,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13757,33449170,Making sense of missense variants in TTN-relat...,"Rees M, Nikoopour R, Fukuzawa A, Kho AL, Ferna...",Acta Neuropathol. 2021 Mar;141(3):431-453. doi...,Rees M,Acta Neuropathol,2021,2021/01/15,PMC7882473,,10.1007/s00401-020-02257-0,1,10.1007/S00401-020-02257-0,1,1,unige:170917,1,0
13758,34315238,Impact of Atrial Fibrillation on Outcome in Ta...,"El-Battrawy I, Cammann VL, Kato K, Szawan KA, ...",J Am Heart Assoc. 2021 Aug 3;10(15):e014059. d...,El-Battrawy I,J Am Heart Assoc,2021,2021/07/28,PMC8475688,,10.1161/JAHA.119.014059,1,10.1161/JAHA.119.014059,1,1,unige:165851,1,0
13759,35697829,Differential and shared genetic effects on kid...,"Winkler TW, Rasheed H, Teumer A, Gorski M, Row...",Commun Biol. 2022 Jun 13;5(1):580. doi: 10.103...,Winkler TW,Commun Biol,2022,2022/06/13,PMC9192715,,10.1038/s42003-022-03448-z,0,10.1038/S42003-022-03448-Z,1,1,unige:175209,1,0
13760,35099509,Prevalence Estimates of Amyloid Abnormality Ac...,"Jansen WJ, Janssen O, Tijms BM, Vos SJB, Ossen...",JAMA Neurol. 2022 Mar 1;79(3):228-243. doi: 10...,Jansen WJ,JAMA Neurol,2022,2022/01/31,,,10.1001/jamaneurol.2021.5216,0,10.1001/JAMANEUROL.2021.5216,1,1,unige:171654,1,0


In [5]:
pubmed_not_aou = pd.read_csv(export_pubmed_not_aou, dtype={'id': 'str'}, encoding='utf-8', header=0, sep='\t')
pubmed_not_aou

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DOI_PUBMED,DATA_PUBMED,DOI,PUBMED,AOU,id,MERGED_BY_PMID,MERGED_BY_DOI
0,28816886,Reply,"Assouline B, Tramèr MR, Elia N.",Pain. 2017 Sep;158(9):1839-1840. doi: 10.1097/...,Assouline B,Pain,2017,2017/08/18,,,10.1097/j.pain.0000000000000955,1,10.1097/J.PAIN.0000000000000955,1,0,,0,0
1,29039370,De-Identification of Medical Narrative Data,"Foufi V, Gaudet-Blavignac C, Chevrier R, Lovis C.",Stud Health Technol Inform. 2017;244:23-27.,Foufi V,Stud Health Technol Inform,2017,2017/10/18,,,,0,,1,0,,0,0
2,28703551,[Hospital readmissions: current problems and p...,"Blanc AL, Fumeaux T, Stirneman J, Bonnabry P, ...",Rev Med Suisse. 2017 Jan 11;13(544-545):117-120.,Blanc AL,Rev Med Suisse,2017,2017/07/14,,,,0,,1,0,,0,0
3,28837366,Uterine leiomyomata: the snowball effect,"Soave I, Marci R.",Curr Med Res Opin. 2017 Nov;33(11):1909-1911. ...,Soave I,Curr Med Res Opin,2017,2017/08/25,,,10.1080/03007995.2017.1372174,1,10.1080/03007995.2017.1372174,1,0,,0,0
4,28727356,[Management of carotid artery stenosis],"Puccinelli F, Roffi M, Murith N, Sztajzel R.",Rev Med Suisse. 2017 Apr 26;13(560):894-899.,Puccinelli F,Rev Med Suisse,2017,2017/07/21,,,,0,,1,0,,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10580,35086739,The era of reference genomes in conservation g...,"Formenti G, Theissinger K, Fernandes C, Bista ...",Trends Ecol Evol. 2022 Mar;37(3):197-202. doi:...,Formenti G,Trends Ecol Evol,2022,2022/01/28,,,10.1016/j.tree.2021.11.008,0,10.1016/J.TREE.2021.11.008,1,0,,0,0
10581,34895743,The global NAFLD policy review and preparednes...,"Lazarus JV, Mark HE, Villota-Rivas M, Palayew ...",J Hepatol. 2022 Apr;76(4):771-780. doi: 10.101...,Lazarus JV,J Hepatol,2022,2021/12/13,,,10.1016/j.jhep.2021.10.025,0,10.1016/J.JHEP.2021.10.025,1,0,,0,0
10582,33685583,Heterogeneous contributions of change in popul...,NCD Risk Factor Collaboration (NCD-RisC).,Elife. 2021 Mar 9;10:e60060. doi: 10.7554/eLif...,NCD Risk Factor Collaboration (NCD-RisC),Elife,2021,2021/03/09,PMC7943191,,10.7554/eLife.60060,1,10.7554/ELIFE.60060,1,0,,0,0
10583,36257718,Global Impact of the COVID-19 Pandemic on Stro...,"Nguyen TN, Qureshi MM, Klein P, Yamagami H, Mi...",Neurology. 2023 Jan 24;100(4):e408-e421. doi: ...,Nguyen TN,Neurology,2023,2022/10/18,PMC9897052,,10.1212/WNL.0000000000201426,1,10.1212/WNL.0000000000201426,1,0,,0,0


In [6]:
# check duplicates by AoU id
aou_and_pubmed.loc[aou_and_pubmed.duplicated(subset='id', keep=False)].sort_values(by='id')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI_AOU,DOI,PMID,AOU,PUBMED,PMID_PUBMED,MERGED_BY_PMID,MERGED_BY_DOI,DATA_PUBMED


In [7]:
# check duplicates by PMID
aou_and_pubmed.loc[aou_and_pubmed.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI_AOU,DOI,PMID,AOU,PUBMED,PMID_PUBMED,MERGED_BY_PMID,MERGED_BY_DOI,DATA_PUBMED


In [8]:
# check duplicates by AoU id
aou_not_pubmed.loc[aou_not_pubmed.duplicated(subset='id', keep=False)].sort_values(by='id')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI_AOU,DOI,PMID,AOU,PUBMED,PMID_PUBMED,MERGED_BY_PMID,MERGED_BY_DOI,DATA_PUBMED


In [9]:
# check duplicates by PMID, ignoring the placeholder 0 (record without PMID)
aou_not_pubmed.loc[aou_not_pubmed['PMID'] != 0].loc[aou_not_pubmed.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI_AOU,DOI,PMID,AOU,PUBMED,PMID_PUBMED,MERGED_BY_PMID,MERGED_BY_DOI,DATA_PUBMED


In [10]:
# check duplicates by AoU id (one AoU record matched to several PubMed records)
pubmed_and_aou.loc[pubmed_and_aou.duplicated(subset='id', keep=False)].sort_values(by='id')

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DOI_PUBMED,DATA_PUBMED,DOI,PUBMED,AOU,id,MERGED_BY_PMID,MERGED_BY_DOI
9405,31193955,Characterization of paralogous uncx transcript...,"Nittoli V, Fortunato AE, Fasano G, Coppola U, ...",Gene X. 2019 Jun;2:100011. doi: 10.1016/j.gene...,Nittoli V,Gene X,2019,2019/06/14,PMC6543554,,10.1016/j.gene.2019.100011,0,10.1016/J.GENE.2019.100011,1,1,unige:176664,0,1
9429,34530988,Characterization of paralogous uncx transcript...,"Nittoli V, Fortunato AE, Fasano G, Coppola U, ...",Gene. 2019;721S:100011. doi: 10.1016/j.gene.20...,Nittoli V,Gene,2019,2021/09/17,,,10.1016/j.gene.2019.100011,0,10.1016/J.GENE.2019.100011,1,1,unige:176664,0,1


In [11]:
# check duplicates by PMID
pubmed_and_aou.loc[pubmed_and_aou.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DOI_PUBMED,DATA_PUBMED,DOI,PUBMED,AOU,id,MERGED_BY_PMID,MERGED_BY_DOI


In [12]:
# check duplicates by PMID
pubmed_not_aou.loc[pubmed_not_aou.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DOI_PUBMED,DATA_PUBMED,DOI,PUBMED,AOU,id,MERGED_BY_PMID,MERGED_BY_DOI
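
The duplicate checks above all follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do it; `find_duplicates` and the toy data are hypothetical and not part of the original notebook. The `ignore` argument covers the case where a placeholder value such as `0` (no PMID) should not count as a duplicate.

```python
import pandas as pd

def find_duplicates(df, key, ignore=None):
    """Return all rows sharing a `key` value, skipping placeholder values."""
    subset = df if ignore is None else df.loc[~df[key].isin(ignore)]
    mask = subset.duplicated(subset=key, keep=False)
    return subset.loc[mask].sort_values(by=key)

# Toy data (hypothetical): 0 stands for "no PMID", as in aou_not_pubmed.
toy = pd.DataFrame({
    "id": ["unige:1", "unige:2", "unige:3", "unige:4", "unige:5"],
    "PMID": [0, 0, 111, 222, 111],
})

dupes = find_duplicates(toy, "PMID", ignore=[0])
print(dupes)
```

With `ignore=[0]`, only the two rows sharing PMID 111 are returned; without it, the two placeholder rows would be reported as well.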


In [13]:
# remove columns not used
del aou_and_pubmed['DOI_AOU']
del aou_and_pubmed['PMID_PUBMED']
del aou_and_pubmed['MERGED_BY_PMID']
del aou_and_pubmed['MERGED_BY_DOI']
del aou_not_pubmed['DOI_AOU']
del aou_not_pubmed['PMID_PUBMED']
del aou_not_pubmed['MERGED_BY_PMID']
del aou_not_pubmed['MERGED_BY_DOI']
del pubmed_and_aou['DOI_PUBMED']
del pubmed_and_aou['MERGED_BY_PMID']
del pubmed_and_aou['MERGED_BY_DOI']
del pubmed_not_aou['DOI_PUBMED']
del pubmed_not_aou['MERGED_BY_PMID']
del pubmed_not_aou['MERGED_BY_DOI']
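
As an alternative to the repeated `del` statements, the helper columns can be dropped in a single call per dataframe. This is only a sketch on a hypothetical one-row frame; the list name `unused` is an assumption.

```python
import pandas as pd

# Hypothetical miniature frame with the same helper columns as above.
toy = pd.DataFrame({
    "id": ["unige:1"],
    "DOI": ["10.1000/X"],
    "DOI_AOU": ["10.1000/x"],
    "MERGED_BY_PMID": [1],
    "MERGED_BY_DOI": [0],
})

unused = ["DOI_AOU", "PMID_PUBMED", "MERGED_BY_PMID", "MERGED_BY_DOI"]

# errors="ignore" skips labels that are absent (PMID_PUBMED here),
# so the cell can be re-run without raising a KeyError.
cleaned = toy.drop(columns=unused, errors="ignore")
print(list(cleaned.columns))
```

Unlike `del`, `drop` returns a new dataframe and leaves the original untouched, which makes the cell safe to execute more than once.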

## Merge with OpenAlex

In [14]:
openalex = pd.read_csv(export_openalex, usecols=['id', 'primary_location_id', 'primary_location_display_name',
                                                 'primary_location_issn_l', 'primary_location_is_oa',
                                                 'primary_location_version', 'primary_location_license',
                                                 'is_oa', 'oa_status', 'oa_url', 'doi', 'pmid', 'publication_year',
                                                 'type', 'is_retracted', 'biblio_issue',
                                                 'biblio_first_page', 'biblio_volume', 'biblio_last_page'], dtype={'id': 'str', 'pmid': 'str', 'doi': 'str'}, encoding='utf-8', header=0, sep=',')
openalex


Unnamed: 0,id,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi,pmid,publication_year,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page
0,https://openalex.org/W2155628349,https://openalex.org/S52395412,Bioinformatics,1367-4803,True,publishedVersion,,True,bronze,https://academic.oup.com/bioinformatics/articl...,https://doi.org/10.1093/bioinformatics/btv351,https://pubmed.ncbi.nlm.nih.gov/26059717,2015,article,False,19,3210,31,3212
1,https://openalex.org/W2897513125,https://openalex.org/S31768639,Age and ageing,0002-0729,True,publishedVersion,cc-by-nc,True,hybrid,https://academic.oup.com/ageing/article-pdf/48...,https://doi.org/10.1093/ageing/afy169,https://pubmed.ncbi.nlm.nih.gov/31081853,2018,article,False,1,16,48,31
2,https://openalex.org/W2798336535,https://openalex.org/S205231332,Astronomy & astrophysics,0004-6361,True,publishedVersion,,True,bronze,https://www.aanda.org/articles/aa/pdf/2018/08/...,https://doi.org/10.1051/0004-6361/201833051,,2018,article,False,,A1,616,A1
3,https://openalex.org/W2895486342,https://openalex.org/S137773608,Nature,0028-0836,True,publishedVersion,cc-by,True,hybrid,https://www.nature.com/articles/s41586-018-057...,https://doi.org/10.1038/s41586-018-0579-z,https://pubmed.ncbi.nlm.nih.gov/30305743,2018,article,False,7726,203,562,209
4,https://openalex.org/W2225937646,https://openalex.org/S76900504,Chest,0012-3692,False,,,False,closed,,https://doi.org/10.1016/j.chest.2015.11.026,https://pubmed.ncbi.nlm.nih.gov/26867832,2016,article,False,2,315,149,352
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46366,https://openalex.org/W4398680269,https://openalex.org/S4377196806,Harvard Dataverse,,True,,other-oa,True,gold,https://dataverse.harvard.edu/citation?persist...,https://doi.org/10.7910/dvn/f8okbf,,2021,dataset,False,,,,
46367,https://openalex.org/W4398825626,https://openalex.org/S4377196806,Harvard Dataverse,,True,,other-oa,True,gold,https://dataverse.harvard.edu/citation?persist...,https://doi.org/10.7910/dvn/rtaj0x,,2021,dataset,False,,,,
46368,https://openalex.org/W4398846673,https://openalex.org/S4377196806,Harvard Dataverse,,True,,other-oa,True,gold,https://dataverse.harvard.edu/citation?persist...,https://doi.org/10.7910/dvn/yny90h,,2021,dataset,False,,,,
46369,https://openalex.org/W4398972072,https://openalex.org/S4377196806,Harvard Dataverse,,True,,other-oa,True,gold,https://dataverse.harvard.edu/citation?persist...,https://doi.org/10.7910/dvn/vxrl7a,,2021,dataset,False,,,,


In [15]:
# keep only the rows with PMID (copy to avoid SettingWithCopyWarning when adding columns later)
openalex = openalex.loc[openalex['pmid'].notna()].copy()
openalex

Unnamed: 0,id,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi,pmid,publication_year,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page
0,https://openalex.org/W2155628349,https://openalex.org/S52395412,Bioinformatics,1367-4803,True,publishedVersion,,True,bronze,https://academic.oup.com/bioinformatics/articl...,https://doi.org/10.1093/bioinformatics/btv351,https://pubmed.ncbi.nlm.nih.gov/26059717,2015,article,False,19,3210,31,3212
1,https://openalex.org/W2897513125,https://openalex.org/S31768639,Age and ageing,0002-0729,True,publishedVersion,cc-by-nc,True,hybrid,https://academic.oup.com/ageing/article-pdf/48...,https://doi.org/10.1093/ageing/afy169,https://pubmed.ncbi.nlm.nih.gov/31081853,2018,article,False,1,16,48,31
3,https://openalex.org/W2895486342,https://openalex.org/S137773608,Nature,0028-0836,True,publishedVersion,cc-by,True,hybrid,https://www.nature.com/articles/s41586-018-057...,https://doi.org/10.1038/s41586-018-0579-z,https://pubmed.ncbi.nlm.nih.gov/30305743,2018,article,False,7726,203,562,209
4,https://openalex.org/W2225937646,https://openalex.org/S76900504,Chest,0012-3692,False,,,False,closed,,https://doi.org/10.1016/j.chest.2015.11.026,https://pubmed.ncbi.nlm.nih.gov/26867832,2016,article,False,2,315,149,352
5,https://openalex.org/W2504691963,https://openalex.org/S106963461,Nature biotechnology,1087-0156,True,publishedVersion,,True,bronze,https://www.nature.com/articles/nbt.3597.pdf,https://doi.org/10.1038/nbt.3597,https://pubmed.ncbi.nlm.nih.gov/27504778,2016,article,False,8,828,34,837
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46000,https://openalex.org/W4286904356,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34644020,2021,article,False,,,,
46007,https://openalex.org/W4286953485,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34585859,2021,article,False,,,,
46014,https://openalex.org/W4287020831,https://openalex.org/S4306525036,PubMed,,False,submittedVersion,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431641,2021,article,False,,,,
46015,https://openalex.org/W4287020871,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431637,2021,article,False,,,,


In [16]:
# normalize PMID & DOI (regex=False: the URL prefixes are literal strings, not patterns)
openalex = openalex.rename(columns={'id' : 'url_openalex', 'pmid' : 'pmid_openalex', 'doi' : 'doi_openalex', 'publication_year' : 'date_openalex'})
openalex['PMID'] = openalex['pmid_openalex'].str.replace('https://pubmed.ncbi.nlm.nih.gov/', '', regex=False)
# openalex['PMID'] = openalex['PMID'].fillna('').astype(str)
openalex['DOI'] = openalex['doi_openalex'].str.replace('https://doi.org/', '', regex=False)
# openalex['DOI'] = openalex['DOI'].fillna('').astype(str)
openalex['DOI'] = openalex['DOI'].str.upper()
openalex['ID_OPENALEX'] = openalex['url_openalex'].str.replace('https://openalex.org/', '', regex=False)
openalex

Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX
0,https://openalex.org/W2155628349,https://openalex.org/S52395412,Bioinformatics,1367-4803,True,publishedVersion,,True,bronze,https://academic.oup.com/bioinformatics/articl...,https://doi.org/10.1093/bioinformatics/btv351,https://pubmed.ncbi.nlm.nih.gov/26059717,2015,article,False,19,3210,31,3212,26059717,10.1093/BIOINFORMATICS/BTV351,W2155628349
1,https://openalex.org/W2897513125,https://openalex.org/S31768639,Age and ageing,0002-0729,True,publishedVersion,cc-by-nc,True,hybrid,https://academic.oup.com/ageing/article-pdf/48...,https://doi.org/10.1093/ageing/afy169,https://pubmed.ncbi.nlm.nih.gov/31081853,2018,article,False,1,16,48,31,31081853,10.1093/AGEING/AFY169,W2897513125
3,https://openalex.org/W2895486342,https://openalex.org/S137773608,Nature,0028-0836,True,publishedVersion,cc-by,True,hybrid,https://www.nature.com/articles/s41586-018-057...,https://doi.org/10.1038/s41586-018-0579-z,https://pubmed.ncbi.nlm.nih.gov/30305743,2018,article,False,7726,203,562,209,30305743,10.1038/S41586-018-0579-Z,W2895486342
4,https://openalex.org/W2225937646,https://openalex.org/S76900504,Chest,0012-3692,False,,,False,closed,,https://doi.org/10.1016/j.chest.2015.11.026,https://pubmed.ncbi.nlm.nih.gov/26867832,2016,article,False,2,315,149,352,26867832,10.1016/J.CHEST.2015.11.026,W2225937646
5,https://openalex.org/W2504691963,https://openalex.org/S106963461,Nature biotechnology,1087-0156,True,publishedVersion,,True,bronze,https://www.nature.com/articles/nbt.3597.pdf,https://doi.org/10.1038/nbt.3597,https://pubmed.ncbi.nlm.nih.gov/27504778,2016,article,False,8,828,34,837,27504778,10.1038/NBT.3597,W2504691963
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46000,https://openalex.org/W4286904356,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34644020,2021,article,False,,,,,34644020,,W4286904356
46007,https://openalex.org/W4286953485,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34585859,2021,article,False,,,,,34585859,,W4286953485
46014,https://openalex.org/W4287020831,https://openalex.org/S4306525036,PubMed,,False,submittedVersion,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431641,2021,article,False,,,,,34431641,,W4287020831
46015,https://openalex.org/W4287020871,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431637,2021,article,False,,,,,34431637,,W4287020871
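
The normalization step can be illustrated on two rows taken from the output above. This is a minimal sketch: `strip_prefix` is a hypothetical helper, and `regex=False` is passed so that the dots in the URL prefixes are treated literally. DOIs are upper-cased because the AoU and PubMed files store them that way in their `DOI` column.

```python
import pandas as pd

def strip_prefix(series, prefix):
    """Remove a literal URL prefix; missing values stay missing."""
    return series.str.replace(prefix, "", regex=False)

# Two rows modeled on the OpenAlex export; the second has no DOI.
toy = pd.DataFrame({
    "pmid_openalex": [
        "https://pubmed.ncbi.nlm.nih.gov/26059717",
        "https://pubmed.ncbi.nlm.nih.gov/34644020",
    ],
    "doi_openalex": ["https://doi.org/10.1093/bioinformatics/btv351", None],
})

toy["PMID"] = strip_prefix(toy["pmid_openalex"], "https://pubmed.ncbi.nlm.nih.gov/")
toy["DOI"] = strip_prefix(toy["doi_openalex"], "https://doi.org/").str.upper()
print(toy[["PMID", "DOI"]])
```

Rows without a DOI keep a missing value rather than an empty string, so they cannot accidentally match each other in a later merge on `DOI`.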


In [17]:
# look for PMCIDs stored by mistake in the PMID field
openalex.loc[openalex['PMID'].str.startswith('PMC')]

Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX
8597,https://openalex.org/W2569139640,https://openalex.org/S2004986,PLOS pathogens,1553-7366,True,publishedVersion,cc-by,True,gold,https://journals.plos.org/plospathogens/articl...,https://doi.org/10.1371/journal.ppat.1006092,https://pubmed.ncbi.nlm.nih.gov/PMC5218399,2017,article,False,1,e1006092,13,e1006092,PMC5218399,10.1371/JOURNAL.PPAT.1006092,W2569139640


In [18]:
# fix PMID error: OpenAlex gives a PMCID (PMC5218399) where the PMID is expected
openalex.loc[openalex['pmid_openalex'] == 'https://pubmed.ncbi.nlm.nih.gov/PMC5218399', 'PMID'] = '28060920'

In [19]:
openalex.loc[openalex['PMID'].str.startswith('PMC')]

Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX
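
The check above looks only for values starting with `PMC`. A slightly broader sanity check, sketched here on hypothetical toy data, flags any PMID that is not made of digits only. It is an optional complement to the check above, not a step from the original workflow.

```python
import pandas as pd

toy = pd.DataFrame({"PMID": ["26059717", "PMC5218399", None, "30305743"]})

# fullmatch is True only when the whole string consists of digits;
# na=False makes missing values evaluate to False instead of NaN.
is_numeric = toy["PMID"].str.fullmatch(r"\d+", na=False)

# Keep values that are present but not purely numeric.
suspect = toy.loc[~is_numeric & toy["PMID"].notna()]
print(suspect)
```

Any row returned here would need a manual correction like the one applied to `PMC5218399`.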


In [20]:
# openalex.loc[openalex['PMID'] == '', 'PMID'] = '0'
# openalex['PMID'] = openalex['PMID'].fillna(0).astype(int)
openalex['OPENALEX'] = 1
openalex

Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX,OPENALEX
0,https://openalex.org/W2155628349,https://openalex.org/S52395412,Bioinformatics,1367-4803,True,publishedVersion,,True,bronze,https://academic.oup.com/bioinformatics/articl...,https://doi.org/10.1093/bioinformatics/btv351,https://pubmed.ncbi.nlm.nih.gov/26059717,2015,article,False,19,3210,31,3212,26059717,10.1093/BIOINFORMATICS/BTV351,W2155628349,1
1,https://openalex.org/W2897513125,https://openalex.org/S31768639,Age and ageing,0002-0729,True,publishedVersion,cc-by-nc,True,hybrid,https://academic.oup.com/ageing/article-pdf/48...,https://doi.org/10.1093/ageing/afy169,https://pubmed.ncbi.nlm.nih.gov/31081853,2018,article,False,1,16,48,31,31081853,10.1093/AGEING/AFY169,W2897513125,1
3,https://openalex.org/W2895486342,https://openalex.org/S137773608,Nature,0028-0836,True,publishedVersion,cc-by,True,hybrid,https://www.nature.com/articles/s41586-018-057...,https://doi.org/10.1038/s41586-018-0579-z,https://pubmed.ncbi.nlm.nih.gov/30305743,2018,article,False,7726,203,562,209,30305743,10.1038/S41586-018-0579-Z,W2895486342,1
4,https://openalex.org/W2225937646,https://openalex.org/S76900504,Chest,0012-3692,False,,,False,closed,,https://doi.org/10.1016/j.chest.2015.11.026,https://pubmed.ncbi.nlm.nih.gov/26867832,2016,article,False,2,315,149,352,26867832,10.1016/J.CHEST.2015.11.026,W2225937646,1
5,https://openalex.org/W2504691963,https://openalex.org/S106963461,Nature biotechnology,1087-0156,True,publishedVersion,,True,bronze,https://www.nature.com/articles/nbt.3597.pdf,https://doi.org/10.1038/nbt.3597,https://pubmed.ncbi.nlm.nih.gov/27504778,2016,article,False,8,828,34,837,27504778,10.1038/NBT.3597,W2504691963,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46000,https://openalex.org/W4286904356,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34644020,2021,article,False,,,,,34644020,,W4286904356,1
46007,https://openalex.org/W4286953485,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34585859,2021,article,False,,,,,34585859,,W4286953485,1
46014,https://openalex.org/W4287020831,https://openalex.org/S4306525036,PubMed,,False,submittedVersion,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431641,2021,article,False,,,,,34431641,,W4287020831,1
46015,https://openalex.org/W4287020871,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431637,2021,article,False,,,,,34431637,,W4287020871,1


In [21]:
# check duplicates
openalex.loc[openalex.duplicated(subset='ID_OPENALEX', keep=False)].sort_values(by='ID_OPENALEX')

Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX,OPENALEX


In [22]:
# check duplicates by PMID
openalex.loc[openalex.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')

Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX,OPENALEX
1374,https://openalex.org/W1993454294,https://openalex.org/S174893293,Journal of pharmaceutical and biomedical analysis,0731-7085,False,,,False,closed,,https://doi.org/10.1016/j.jpba.2014.08.035,https://pubmed.ncbi.nlm.nih.gov/25459925,2015,article,False,,33,102,44,25459925,10.1016/J.JPBA.2014.08.035,W1993454294,1
1597,https://openalex.org/W4252726418,https://openalex.org/S174893293,Journal of pharmaceutical and biomedical analysis,0731-7085,False,,,False,closed,,https://doi.org/10.1016/j.jpba.2014.09.032,https://pubmed.ncbi.nlm.nih.gov/25459925,2015,article,False,,282,102,289,25459925,10.1016/J.JPBA.2014.09.032,W4252726418,1
2417,https://openalex.org/W2076014591,https://openalex.org/S198804558,Intensive care medicine,0342-4642,False,,,False,closed,,https://doi.org/10.1007/s00134-015-3735-z,https://pubmed.ncbi.nlm.nih.gov/25904180,2015,article,False,5,856,41,864,25904180,10.1007/S00134-015-3735-Z,W2076014591,1
36129,https://openalex.org/W4254559537,https://openalex.org/S198804558,Intensive care medicine,0342-4642,True,publishedVersion,,True,bronze,https://link.springer.com/content/pdf/10.1007/...,https://doi.org/10.1007/s00134-015-3782-5,https://pubmed.ncbi.nlm.nih.gov/25904180,2015,erratum,False,6,1177,41,1177,25904180,10.1007/S00134-015-3782-5,W4254559537,1
2243,https://openalex.org/W2163089896,https://openalex.org/S4210230012,"Attention, perception & psychophysics",1943-3921,True,publishedVersion,,True,bronze,https://link.springer.com/content/pdf/10.3758/...,https://doi.org/10.3758/s13414-015-0988-0,https://pubmed.ncbi.nlm.nih.gov/26530189,2015,article,False,1,218,78,241,26530189,10.3758/S13414-015-0988-0,W2163089896,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16871,https://openalex.org/W4206915781,,,,False,,,False,closed,,https://doi.org/10.4414/smw.2022.30049,https://pubmed.ncbi.nlm.nih.gov/35072393,2022,article,False,,w30049,152,w30049,35072393,10.4414/SMW.2022.30049,W4206915781,1
31006,https://openalex.org/W4206917249,,,,False,,,False,closed,,https://doi.org/10.4414/smw.2022.30108,https://pubmed.ncbi.nlm.nih.gov/35072415,2022,article,False,,w30108,152,w30108,35072415,10.4414/SMW.2022.30108,W4206917249,1
26997,https://openalex.org/W4226390186,https://openalex.org/S141475571,Schweizerische medizinische Wochenschrift,0036-7672,True,publishedVersion,cc-by,True,gold,https://smw.ch/index.php/smw/article/download/...,https://doi.org/10.4414/smw.2022.w30108,https://pubmed.ncbi.nlm.nih.gov/35072415,2022,article,False,0304,w30108,152,w30108,35072415,10.4414/SMW.2022.W30108,W4226390186,1
37946,https://openalex.org/W4221114753,https://openalex.org/S94555171,Rhinology (Amsterdam. Online)/Rhinology,0300-0729,True,publishedVersion,,True,bronze,https://www.rhinologyjournal.com/download.php?...,https://doi.org/10.4193/rhin.22.901,https://pubmed.ncbi.nlm.nih.gov/35157749,2022,article,False,1,1,60,1,35157749,10.4193/RHIN.22.901,W4221114753,1


In [23]:
# check duplicates by PMID
openalex.loc[openalex.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')['type'].value_counts()

type
article      149
erratum       15
preprint      10
review         4
editorial      2
Name: count, dtype: int64

In [24]:
# check duplicates by DOI
openalex.loc[openalex.duplicated(subset='DOI', keep=False)].sort_values(by='DOI')

Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX,OPENALEX
5905,https://openalex.org/W2270743897,https://openalex.org/S4306525036,PubMed,,False,,,True,hybrid,,https://doi.org/10.1007/s40266-017-0457-7,https://pubmed.ncbi.nlm.nih.gov/28349413,2017,erratum,False,5,413,34,413,28349413,10.1007/S40266-017-0457-7,W2270743897,1
26475,https://openalex.org/W4210578593,https://openalex.org/S37256505,Drugs & aging,1170-229X,True,publishedVersion,,True,bronze,https://link.springer.com/content/pdf/10.1007%...,https://doi.org/10.1007/s40266-017-0457-7,https://pubmed.ncbi.nlm.nih.gov/28349413,2017,erratum,False,5,413,34,413,28349413,10.1007/S40266-017-0457-7,W4210578593,1
3553,https://openalex.org/W4200246644,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,https://doi.org/10.7326/m21-2625,https://pubmed.ncbi.nlm.nih.gov/34904857,2022,article,False,2,244,175,255,34904857,10.7326/M21-2625,W4200246644,1
3554,https://openalex.org/W4200292381,https://openalex.org/S119722071,Annals of internal medicine,0003-4819,False,,,False,closed,,https://doi.org/10.7326/m21-2625,https://pubmed.ncbi.nlm.nih.gov/34904857,2022,preprint,False,2,244,175,255,34904857,10.7326/M21-2625,W4200292381,1
7839,https://openalex.org/W2195668156,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/26550543,2015,article,False,5,527,5,47,26550543,,W2195668156,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45992,https://openalex.org/W4286892578,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34704682,2021,article,False,,,,,34704682,,W4286892578,1
46000,https://openalex.org/W4286904356,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34644020,2021,article,False,,,,,34644020,,W4286904356,1
46007,https://openalex.org/W4286953485,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34585859,2021,article,False,,,,,34585859,,W4286953485,1
46014,https://openalex.org/W4287020831,https://openalex.org/S4306525036,PubMed,,False,submittedVersion,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431641,2021,article,False,,,,,34431641,,W4287020831,1
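
Many rows in the DOI duplicate listing above have no DOI at all. This is expected: `duplicated()` treats missing values as equal to each other, so every record without a DOI is flagged as a duplicate of the others. The following is a minimal sketch on an invented toy frame (the `toy` data and the `flagged_*` names are illustrative assumptions, not part of the notebook) showing how restricting the check to non-missing DOIs isolates the real duplicates.

```python
import pandas as pd

# Toy stand-in for the `openalex` frame; the values are invented for illustration.
toy = pd.DataFrame({
    'ID_OPENALEX': ['W1', 'W2', 'W3', 'W4', 'W5'],
    'DOI': ['10.1/A', '10.1/A', None, None, '10.2/B'],
})

# duplicated() treats missing values as equal to each other,
# so W3 and W4 are flagged although they share no real DOI.
flagged_all = toy.loc[toy.duplicated(subset='DOI', keep=False)]

# Restrict the check to rows that actually carry a DOI.
with_doi = toy.loc[toy['DOI'].notna()]
flagged_real = with_doi.loc[with_doi.duplicated(subset='DOI', keep=False)]

print(flagged_all['ID_OPENALEX'].tolist())
print(flagged_real['ID_OPENALEX'].tolist())
```

The same filter could be applied before the exports by DOI if only true DOI collisions are of interest.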


In [25]:
# export duplicates by PMID (computed once, written to TSV and Excel)
openalex_duplicates_by_pmid = openalex.loc[openalex.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')
openalex_duplicates_by_pmid.to_csv(myfolder_temp + 'openalex_duplicates_by_pmid.tsv', sep='\t', index=False)
openalex_duplicates_by_pmid.to_excel(myfolder_temp + 'openalex_duplicates_by_pmid.xlsx', index=False)

In [26]:
# export duplicates by DOI (computed once, written to TSV and Excel)
openalex_duplicates_by_doi = openalex.loc[openalex.duplicated(subset='DOI', keep=False)].sort_values(by='DOI')
openalex_duplicates_by_doi.to_csv(myfolder_temp + 'openalex_duplicates_by_doi.tsv', sep='\t', index=False)
openalex_duplicates_by_doi.to_excel(myfolder_temp + 'openalex_duplicates_by_doi.xlsx', index=False)

## AoU AND PubMed AND OpenAlex

In [27]:
aou_and_pubmed

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED
0,unige:167403,2022,1,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,Article scientifique,10.1007/S00198-022-06544-2,36173415,1,1,0
1,unige:167443,2020,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,32877100,1,1,0
2,unige:167461,2021,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,Article scientifique,10.1016/J.SCITOTENV.2021.147871,34098278,1,1,0
3,unige:167494,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,Article scientifique,10.1103/PHYSREVLETT.129.268101,36608212,1,1,0
4,unige:167495,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,1,Article scientifique,10.1371/JOURNAL.PCBI.1010586,36251703,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13756,unige:167393,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1007/S40520-022-02100-4,35332506,1,1,0
13757,unige:167394,2020,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article professionnel,10.53738/REVMED.2020.16.676-77.0078,31961090,1,1,0
13758,unige:167395,2021,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article professionnel,10.53738/REVMED.2021.17.720-21.0063,33443834,1,1,0
13759,unige:167398,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1016/J.BONR.2022.101623,36213624,1,1,0


In [28]:
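# keep only the rows that have a PMID; .copy() avoids a SettingWithCopyWarning on the next line
openalex_pmid = openalex.loc[openalex['PMID'].notna()].copy()
openalex_pmid['PMID'] = openalex_pmid['PMID'].astype(int)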
openalex_pmid

Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX,OPENALEX
0,https://openalex.org/W2155628349,https://openalex.org/S52395412,Bioinformatics,1367-4803,True,publishedVersion,,True,bronze,https://academic.oup.com/bioinformatics/articl...,https://doi.org/10.1093/bioinformatics/btv351,https://pubmed.ncbi.nlm.nih.gov/26059717,2015,article,False,19,3210,31,3212,26059717,10.1093/BIOINFORMATICS/BTV351,W2155628349,1
1,https://openalex.org/W2897513125,https://openalex.org/S31768639,Age and ageing,0002-0729,True,publishedVersion,cc-by-nc,True,hybrid,https://academic.oup.com/ageing/article-pdf/48...,https://doi.org/10.1093/ageing/afy169,https://pubmed.ncbi.nlm.nih.gov/31081853,2018,article,False,1,16,48,31,31081853,10.1093/AGEING/AFY169,W2897513125,1
3,https://openalex.org/W2895486342,https://openalex.org/S137773608,Nature,0028-0836,True,publishedVersion,cc-by,True,hybrid,https://www.nature.com/articles/s41586-018-057...,https://doi.org/10.1038/s41586-018-0579-z,https://pubmed.ncbi.nlm.nih.gov/30305743,2018,article,False,7726,203,562,209,30305743,10.1038/S41586-018-0579-Z,W2895486342,1
4,https://openalex.org/W2225937646,https://openalex.org/S76900504,Chest,0012-3692,False,,,False,closed,,https://doi.org/10.1016/j.chest.2015.11.026,https://pubmed.ncbi.nlm.nih.gov/26867832,2016,article,False,2,315,149,352,26867832,10.1016/J.CHEST.2015.11.026,W2225937646,1
5,https://openalex.org/W2504691963,https://openalex.org/S106963461,Nature biotechnology,1087-0156,True,publishedVersion,,True,bronze,https://www.nature.com/articles/nbt.3597.pdf,https://doi.org/10.1038/nbt.3597,https://pubmed.ncbi.nlm.nih.gov/27504778,2016,article,False,8,828,34,837,27504778,10.1038/NBT.3597,W2504691963,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46000,https://openalex.org/W4286904356,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34644020,2021,article,False,,,,,34644020,,W4286904356,1
46007,https://openalex.org/W4286953485,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34585859,2021,article,False,,,,,34585859,,W4286953485,1
46014,https://openalex.org/W4287020831,https://openalex.org/S4306525036,PubMed,,False,submittedVersion,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431641,2021,article,False,,,,,34431641,,W4287020831,1
46015,https://openalex.org/W4287020871,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431637,2021,article,False,,,,,34431637,,W4287020871,1


In [29]:
# keep only the rows that have a DOI; .copy() avoids a SettingWithCopyWarning on the next line
openalex_doi = openalex.loc[openalex['DOI'].notna()].copy()
openalex_doi['DOI'] = openalex_doi['DOI'].astype(str)
openalex_doi


Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX,OPENALEX
0,https://openalex.org/W2155628349,https://openalex.org/S52395412,Bioinformatics,1367-4803,True,publishedVersion,,True,bronze,https://academic.oup.com/bioinformatics/articl...,https://doi.org/10.1093/bioinformatics/btv351,https://pubmed.ncbi.nlm.nih.gov/26059717,2015,article,False,19,3210,31,3212,26059717,10.1093/BIOINFORMATICS/BTV351,W2155628349,1
1,https://openalex.org/W2897513125,https://openalex.org/S31768639,Age and ageing,0002-0729,True,publishedVersion,cc-by-nc,True,hybrid,https://academic.oup.com/ageing/article-pdf/48...,https://doi.org/10.1093/ageing/afy169,https://pubmed.ncbi.nlm.nih.gov/31081853,2018,article,False,1,16,48,31,31081853,10.1093/AGEING/AFY169,W2897513125,1
3,https://openalex.org/W2895486342,https://openalex.org/S137773608,Nature,0028-0836,True,publishedVersion,cc-by,True,hybrid,https://www.nature.com/articles/s41586-018-057...,https://doi.org/10.1038/s41586-018-0579-z,https://pubmed.ncbi.nlm.nih.gov/30305743,2018,article,False,7726,203,562,209,30305743,10.1038/S41586-018-0579-Z,W2895486342,1
4,https://openalex.org/W2225937646,https://openalex.org/S76900504,Chest,0012-3692,False,,,False,closed,,https://doi.org/10.1016/j.chest.2015.11.026,https://pubmed.ncbi.nlm.nih.gov/26867832,2016,article,False,2,315,149,352,26867832,10.1016/J.CHEST.2015.11.026,W2225937646,1
5,https://openalex.org/W2504691963,https://openalex.org/S106963461,Nature biotechnology,1087-0156,True,publishedVersion,,True,bronze,https://www.nature.com/articles/nbt.3597.pdf,https://doi.org/10.1038/nbt.3597,https://pubmed.ncbi.nlm.nih.gov/27504778,2016,article,False,8,828,34,837,27504778,10.1038/NBT.3597,W2504691963,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45741,https://openalex.org/W4225726066,https://openalex.org/S39948230,International archives of occupational and env...,0340-0131,True,publishedVersion,cc-by,True,hybrid,https://link.springer.com/content/pdf/10.1007/...,https://doi.org/10.1007/s00420-021-01758-z,https://pubmed.ncbi.nlm.nih.gov/34642804,2021,article,False,1,5,95,5,34642804,10.1007/S00420-021-01758-Z,W4225726066,1
45756,https://openalex.org/W4226099590,https://openalex.org/S13566964,The journal of nervous and mental disease,0022-3018,False,,,False,closed,,https://doi.org/10.1097/nmd.0000000000001391,https://pubmed.ncbi.nlm.nih.gov/34846355,2021,article,False,12,872,209,878,34846355,10.1097/NMD.0000000000001391,W4226099590,1
45758,https://openalex.org/W4226110831,https://openalex.org/S29620287,Journal of patient safety,1549-8417,False,,,False,closed,,https://doi.org/10.1097/pts.0000000000000638,https://pubmed.ncbi.nlm.nih.gov/32175966,2021,article,False,8,e1732,17,e1737,32175966,10.1097/PTS.0000000000000638,W4226110831,1
45768,https://openalex.org/W4226291153,https://openalex.org/S4210176812,JBJS case connector,2160-3251,False,,,False,closed,,https://doi.org/10.2106/jbjs.cc.20.00945,https://pubmed.ncbi.nlm.nih.gov/34788255,2021,article,False,4,,11,,34788255,10.2106/JBJS.CC.20.00945,W4226291153,1


In [30]:
# merge with AoU and PubMed data by PMID
aou_and_pubmed_openalex1 = aou_and_pubmed.merge(openalex_pmid[['OPENALEX', 'PMID', 'DOI', 'ID_OPENALEX', 'is_oa', 'oa_status', 'date_openalex']], on='PMID', how='outer')
# the DOI column exists on both sides: keep the AoU/PubMed one as DOI and the OpenAlex one as DOI_OPENALEX
aou_and_pubmed_openalex1 = aou_and_pubmed_openalex1.rename(columns={'DOI_x': 'DOI', 'DOI_y': 'DOI_OPENALEX'})
aou_and_pubmed_openalex1['AOU'] = aou_and_pubmed_openalex1['AOU'].fillna(0).astype(int)
aou_and_pubmed_openalex1['OPENALEX'] = aou_and_pubmed_openalex1['OPENALEX'].fillna(0).astype(int)
aou_and_pubmed_openalex1

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex
0,,,,,,,,,,,,,,,,,,,,,566224,0,,,1,10.1159/000400365,W2260754226,False,closed,2015.0
1,,,,,,,,,,,,,,,,,,,,,614034,0,,,1,10.1159/000446699,W2467949190,True,gold,2016.0
2,,,,,,,,,,,,,,,,,,,,,1658583,0,,,1,10.32388/WOB8WB,W4246486748,True,hybrid,2020.0
3,,,,,,,,,,,,,,,,,,,,,1666809,0,,,1,10.1002/JBMR.5650061114,W2012837775,False,closed,2020.0
4,,,,,,,,,,,,,,,,,,,,,2220455,0,,,1,10.1159/000418835,W2271345702,False,closed,2015.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23487,,,,,,,,,,,,,,,,,,,,,38624653,0,,,1,10.1007/S41244-021-00200-8,W3156941189,True,hybrid,2021.0
23488,unige:150518,2021.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1002/ANSA.202000151,38715737,1,1.0,0.0,1,10.1002/ANSA.202000151,W3109860721,True,gold,2020.0
23489,unige:146726,2020.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1002/ANSA.202000091,38715738,1,1.0,0.0,1,10.1002/ANSA.202000091,W3109700191,True,gold,2020.0
23490,unige:146720,2020.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1002/ANSA.202000131,38715742,1,1.0,0.0,1,10.1002/ANSA.202000131,W3111521613,True,gold,2020.0
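
The `AOU` and `OPENALEX` flags above are rebuilt after the outer merge by filling missing values with 0. An alternative, sketched below on invented toy frames (`left`, `right` and their values are assumptions for illustration only), is to let pandas report the origin of each row with `indicator=True` and derive the flags from the resulting `_merge` column.

```python
import pandas as pd

# Toy stand-ins for the AoU/PubMed frame and the OpenAlex frame.
left = pd.DataFrame({'id': ['unige:1', 'unige:2'], 'PMID': [111, 222]})
right = pd.DataFrame({'ID_OPENALEX': ['W1', 'W9'], 'PMID': [111, 999]})

# indicator=True adds a `_merge` column with the values
# 'both', 'left_only' or 'right_only'.
merged = left.merge(right, on='PMID', how='outer', indicator=True)

# Derive the presence flags from the origin of each row.
merged['AOU'] = merged['_merge'].isin(['both', 'left_only']).astype(int)
merged['OPENALEX'] = merged['_merge'].isin(['both', 'right_only']).astype(int)

print(merged[['PMID', 'AOU', 'OPENALEX']])
```

This keeps the flags independent of whether the source frames already carry `AOU` or `OPENALEX` columns.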


In [31]:
aou_and_pubmed.loc[aou_and_pubmed['DOI'].isna()]

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED
1,unige:167443,2020,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,32877100,1,1,0
209,unige:168996,2022,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,35522219,1,1,0
507,unige:171858,2021,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,34704683,1,1,0
508,unige:171859,2020,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,31995298,1,1,0
510,unige:171862,2020,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,32558457,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13137,unige:164736,2018,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,29722498,1,1,0
13240,unige:165126,2021,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,33538136,1,1,0
13290,unige:165328,2022,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,,36437129,1,1,0
13500,unige:166568,2021,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,,33515226,1,1,0


In [32]:
# merge with AoU and PubMed data by DOI (only AoU/PubMed records that have a DOI)
aou_and_pubmed_openalex2 = aou_and_pubmed.loc[aou_and_pubmed['DOI'].notna()].merge(openalex_doi[['OPENALEX', 'PMID', 'DOI', 'ID_OPENALEX', 'is_oa', 'oa_status', 'date_openalex']], on='DOI', how='outer')
# the PMID column exists on both sides: keep the AoU/PubMed one as PMID and the OpenAlex one as PMID_OPENALEX
aou_and_pubmed_openalex2 = aou_and_pubmed_openalex2.rename(columns={'PMID_x': 'PMID', 'PMID_y': 'PMID_OPENALEX'})
aou_and_pubmed_openalex2['AOU'] = aou_and_pubmed_openalex2['AOU'].fillna(0).astype(int)
aou_and_pubmed_openalex2['OPENALEX'] = aou_and_pubmed_openalex2['OPENALEX'].fillna(0).astype(int)
aou_and_pubmed_openalex2

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,PMID_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex
0,unige:73969,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1001/JAMA.2014.18482,25603492.0,1,1.0,0.0,1,25603492,W1980806879,False,closed,2015.0
1,unige:87944,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1001/JAMA.2015.3703,25919531.0,1,1.0,0.0,1,25919531,W2063717932,False,closed,2015.0
2,unige:75878,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1001/JAMA.2015.4668,25988462.0,1,1.0,1.0,1,25988462,W2168357345,True,green,2015.0
3,unige:87940,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1001/JAMA.2015.4970,26151269.0,1,1.0,0.0,1,26151269,W1488627510,False,closed,2015.0
4,unige:108131,2017.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1001/JAMA.2016.20029,28241362.0,1,1.0,1.0,1,28241362,W2592762700,False,closed,2017.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22749,,,,,,,,,,,,,,,,,,,,10.7717/PEERJ.7467,,0,,,1,31423359,W2967306995,True,gold,2019.0
22750,,,,,,,,,,,,,,,,,,,,10.7754/CLIN.LAB.2015.150617,,0,,,1,27012056,W2430034788,True,green,2016.0
22751,unige:55986,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.7785/TCRT.2012.500405,24694113.0,1,1.0,1.0,0,,,,,
22752,,,,,,,,,,,,,,,,,,,,10.9738/INTSURG-D-14-00026.1,,0,,,1,25785333,W2007164292,True,bronze,2015.0


In [33]:
# consolidate both merges
aou_and_pubmed_openalex = pd.concat([aou_and_pubmed_openalex1, aou_and_pubmed_openalex2], ignore_index=True)
aou_and_pubmed_openalex = aou_and_pubmed_openalex.drop_duplicates(subset='id', keep='first')
aou_and_pubmed_openalex

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
0,,,,,,,,,,,,,,,,,,,,,566224.0,0,,,1,10.1159/000400365,W2260754226,False,closed,2015.0,
14,unige:43442,2016.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1177/0962280212464542,23117409.0,1,1.0,0.0,0,,,,,,
15,unige:36639,2016.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1177/0962280212469716,23267027.0,1,1.0,0.0,0,,,,,,
16,unige:34302,2015.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1007/S10103-013-1337-Y,23660738.0,1,1.0,1.0,0,,,,,,
17,unige:89732,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1016/J.JALZ.2013.03.001,23706515.0,1,1.0,0.0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23471,unige:160667,2022.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.2533/CHIMIA.2022.90,38069754.0,1,1.0,0.0,1,10.2533/CHIMIA.2022.90,W4214655598,True,gold,2022.0,
23474,unige:166451,2022.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,Article scientifique,10.2533/CHIMIA.2022.954,38069791.0,1,1.0,0.0,1,10.2533/CHIMIA.2022.954,W4310999321,True,gold,2022.0,
23488,unige:150518,2021.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1002/ANSA.202000151,38715737.0,1,1.0,0.0,1,10.1002/ANSA.202000151,W3109860721,True,gold,2020.0,
23489,unige:146726,2020.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1002/ANSA.202000091,38715738.0,1,1.0,0.0,1,10.1002/ANSA.202000091,W3109700191,True,gold,2020.0,
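
One point worth checking in the consolidation step: after `pd.concat`, `drop_duplicates(subset='id', keep='first')` always keeps the row from the first frame (the PMID merge) when an `id` appears in both. A record matched in OpenAlex only through its DOI would then keep `OPENALEX == 0`. Rows without an `id` are also treated as duplicates of each other, so only one of them survives. Whether this affects the real counts has not been verified here. The sketch below uses invented toy frames (`by_pmid`, `by_doi`) to show the effect and one possible way to prefer the matched row.

```python
import pandas as pd

# Toy results of the two merges: unige:2 is found in OpenAlex only via its DOI.
by_pmid = pd.DataFrame({'id': ['unige:1', 'unige:2'],
                        'OPENALEX': [1, 0],
                        'ID_OPENALEX': ['W1', None]})
by_doi = pd.DataFrame({'id': ['unige:1', 'unige:2'],
                       'OPENALEX': [1, 1],
                       'ID_OPENALEX': ['W1', 'W2']})

stacked = pd.concat([by_pmid, by_doi], ignore_index=True)

# keep='first' keeps the PMID-merge row, so the DOI match for unige:2 is lost.
first_kept = stacked.drop_duplicates(subset='id', keep='first')

# Sorting matched rows first (stable sort) makes keep='first' prefer them.
best_kept = (stacked.sort_values(by='OPENALEX', ascending=False, kind='stable')
             .drop_duplicates(subset='id', keep='first')
             .sort_values(by='id'))

print(first_kept[['id', 'OPENALEX']])
print(best_kept[['id', 'OPENALEX']])
```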


In [34]:
# keep the records present in both AoU/PubMed and OpenAlex
aou_and_pubmed_and_openalex = aou_and_pubmed_openalex.loc[(aou_and_pubmed_openalex['AOU'] == 1) & (aou_and_pubmed_openalex['OPENALEX'] == 1)]
# .copy() avoids a SettingWithCopyWarning when the columns are reassigned below
aou_and_pubmed_and_openalex = aou_and_pubmed_and_openalex.drop_duplicates(subset='id', keep='first').copy()
# columns that became float during the outer merges: fill missing values with 0 and cast back to int
int_columns = ['date', 'BIOMED', 'discipline_clinical', 'discipline_basic', 'discipline_biology',
               'discipline_pharma', 'discipline_affective', 'discipline_dentistry',
               'discipline_medicine_general', 'discipline_neurosciences', 'DATA',
               'DATA_TYPE_appendixes', 'DATA_TYPE_data_supplements', 'DATA_TYPE_shared_data',
               'FUNDER', 'FNS_FUNDER', 'EU_FUNDER', 'PMID', 'PUBMED', 'DATA_PUBMED',
               'date_openalex', 'PMID_OPENALEX']
for col in int_columns:
    aou_and_pubmed_and_openalex[col] = aou_and_pubmed_and_openalex[col].fillna(0).astype(int)
aou_and_pubmed_and_openalex


Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
18,unige:32989,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,Article scientifique,10.1016/J.BIOPSYCH.2013.06.023,23993209,1,1,0,1,10.1016/J.BIOPSYCH.2013.06.023,W2108436957,True,green,2015,0
23,unige:84097,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2174/15701611113116660167,24188484,1,1,0,1,10.2174/15701611113116660167,W2293858562,False,closed,2015,0
24,unige:72702,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2174/15701611113116660164,24188487,1,1,1,1,10.2174/15701611113116660164,W301338818,True,green,2015,0
25,unige:36640,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2174/1570161111999131203092322,24188489,1,1,0,1,10.2174/1570161111999131203092322,W87985356,False,closed,2015,0
36,unige:77996,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1097/SLA.0000000000000426,24477161,1,1,0,1,10.1097/SLA.0000000000000426,W2026589533,False,closed,2015,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23471,unige:160667,2022,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2533/CHIMIA.2022.90,38069754,1,1,0,1,10.2533/CHIMIA.2022.90,W4214655598,True,gold,2022,0
23474,unige:166451,2022,1,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,Article scientifique,10.2533/CHIMIA.2022.954,38069791,1,1,0,1,10.2533/CHIMIA.2022.954,W4310999321,True,gold,2022,0
23488,unige:150518,2021,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1002/ANSA.202000151,38715737,1,1,0,1,10.1002/ANSA.202000151,W3109860721,True,gold,2020,0
23489,unige:146726,2020,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1002/ANSA.202000091,38715738,1,1,0,1,10.1002/ANSA.202000091,W3109700191,True,gold,2020,0
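
In the table above, `fillna(0).astype(int)` turns every missing `PMID_OPENALEX` into `0`, which can no longer be told apart from a real value. If that distinction matters, pandas' nullable integer dtype `Int64` is one option. The sketch below uses an invented toy column to compare both approaches; it is an illustration, not the method used in this notebook.

```python
import pandas as pd

# Toy column: float with one missing value, as produced by an outer merge.
toy = pd.DataFrame({'PMID_OPENALEX': [25603492.0, None, 25919531.0]})

# Approach used in the notebook: missing values become 0.
as_zero = toy['PMID_OPENALEX'].fillna(0).astype(int)

# Nullable integer dtype: integers are restored and missing values stay <NA>.
as_nullable = toy['PMID_OPENALEX'].astype('Int64')

print(as_zero.tolist())
print(as_nullable.tolist())
```

Missing entries of an `Int64` column are written as empty fields by `to_csv`, which may be preferable to a literal 0 in the exported files.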


In [35]:
# check duplicates by AoU id
aou_and_pubmed_and_openalex.loc[aou_and_pubmed_and_openalex.duplicated(subset='id', keep=False)].sort_values(by='id')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [36]:
# check duplicates by PMID
aou_and_pubmed_and_openalex.loc[aou_and_pubmed_and_openalex.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [37]:
# check duplicates by OpenAlex ID
aou_and_pubmed_and_openalex.loc[aou_and_pubmed_and_openalex.duplicated(subset='ID_OPENALEX', keep=False)].sort_values(by='ID_OPENALEX')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [38]:
# export duplicates by OpenAlex ID (computed once, written to TSV and Excel)
aou_and_pubmed_and_openalex_duplicates = aou_and_pubmed_and_openalex.loc[aou_and_pubmed_and_openalex.duplicated(subset='ID_OPENALEX', keep=False)].sort_values(by='ID_OPENALEX')
aou_and_pubmed_and_openalex_duplicates.to_csv(myfolder_temp + 'aou_and_pubmed_and_openalex_duplicates.tsv', sep='\t', index=False)
aou_and_pubmed_and_openalex_duplicates.to_excel(myfolder_temp + 'aou_and_pubmed_and_openalex_duplicates.xlsx', index=False)

## AoU AND PubMed NOT OpenAlex

In [39]:
# keep the AoU/PubMed records that were not found in OpenAlex
aou_and_pubmed_not_openalex = aou_and_pubmed_openalex.loc[(aou_and_pubmed_openalex['AOU'] == 1) & (aou_and_pubmed_openalex['OPENALEX'] == 0)]
# .copy() avoids a SettingWithCopyWarning when the columns are reassigned below
aou_and_pubmed_not_openalex = aou_and_pubmed_not_openalex.drop_duplicates(subset='id', keep='first').copy()
# columns that became float during the outer merges: fill missing values with 0 and cast back to int
int_columns = ['date', 'BIOMED', 'discipline_clinical', 'discipline_basic', 'discipline_biology',
               'discipline_pharma', 'discipline_affective', 'discipline_dentistry',
               'discipline_medicine_general', 'discipline_neurosciences', 'DATA',
               'DATA_TYPE_appendixes', 'DATA_TYPE_data_supplements', 'DATA_TYPE_shared_data',
               'FUNDER', 'FNS_FUNDER', 'EU_FUNDER', 'PMID', 'PUBMED', 'DATA_PUBMED',
               'date_openalex', 'PMID_OPENALEX']
for col in int_columns:
    aou_and_pubmed_not_openalex[col] = aou_and_pubmed_not_openalex[col].fillna(0).astype(int)
aou_and_pubmed_not_openalex


Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
14,unige:43442,2016,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1177/0962280212464542,23117409,1,1,0,0,,,,,0,0
15,unige:36639,2016,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1177/0962280212469716,23267027,1,1,0,0,,,,,0,0
16,unige:34302,2015,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,10.1007/S10103-013-1337-Y,23660738,1,1,1,0,,,,,0,0
17,unige:89732,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1016/J.JALZ.2013.03.001,23706515,1,1,0,0,,,,,0,0
19,unige:31354,2015,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,10.1111/GER.12079,24128078,1,1,1,0,,,,,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23450,unige:172727,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1007/978-3-030-81736-7_5,37494510,1,1,0,0,,,,,0,0
23451,unige:172723,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1007/978-3-030-81736-7_1,37494511,1,1,0,0,,,,,0,0
23452,unige:172728,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1007/978-3-030-81736-7_6,37494512,1,1,0,0,,,,,0,0
23453,unige:172724,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1007/978-3-030-81736-7_2,37494513,1,1,0,0,,,,,0,0


In [40]:
# check duplicates
aou_and_pubmed_not_openalex.loc[aou_and_pubmed_not_openalex.duplicated(subset='id', keep=False)].sort_values(by='id')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [41]:
# check duplicates
aou_and_pubmed_not_openalex.loc[aou_and_pubmed_not_openalex.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [42]:
# check duplicates
aou_and_pubmed_not_openalex.loc[aou_and_pubmed_not_openalex['DOI'].notna()].loc[aou_and_pubmed_not_openalex.duplicated(subset='DOI', keep=False)].sort_values(by='DOI')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [43]:
# union of both subsets (with and without OpenAlex match) to check that no record was lost or duplicated
aou_and_pubmed_test = pd.concat([aou_and_pubmed_and_openalex, aou_and_pubmed_not_openalex], ignore_index=True)
aou_and_pubmed_test

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
0,unige:32989,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,Article scientifique,10.1016/J.BIOPSYCH.2013.06.023,23993209,1,1,0,1,10.1016/J.BIOPSYCH.2013.06.023,W2108436957,True,green,2015,0
1,unige:84097,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2174/15701611113116660167,24188484,1,1,0,1,10.2174/15701611113116660167,W2293858562,False,closed,2015,0
2,unige:72702,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2174/15701611113116660164,24188487,1,1,1,1,10.2174/15701611113116660164,W301338818,True,green,2015,0
3,unige:36640,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2174/1570161111999131203092322,24188489,1,1,0,1,10.2174/1570161111999131203092322,W87985356,False,closed,2015,0
4,unige:77996,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1097/SLA.0000000000000426,24477161,1,1,0,1,10.1097/SLA.0000000000000426,W2026589533,False,closed,2015,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13756,unige:172727,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1007/978-3-030-81736-7_5,37494510,1,1,0,0,,,,,0,0
13757,unige:172723,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1007/978-3-030-81736-7_1,37494511,1,1,0,0,,,,,0,0
13758,unige:172728,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1007/978-3-030-81736-7_6,37494512,1,1,0,0,,,,,0,0
13759,unige:172724,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1007/978-3-030-81736-7_2,37494513,1,1,0,0,,,,,0,0


In [44]:
# check duplicates
aou_and_pubmed_test.loc[aou_and_pubmed_test.duplicated(subset='id', keep=False)].sort_values(by='id')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
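
The empty result above suggests that the two subsets do not share any AoU id. A minimal sketch of the same sanity check on toy data is given here. The frames `with_openalex` and `without_openalex` and their values are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-ins for the "AND OpenAlex" and "NOT OpenAlex" subsets.
with_openalex = pd.DataFrame({'id': ['unige:1', 'unige:2'], 'OPENALEX': [1, 1]})
without_openalex = pd.DataFrame({'id': ['unige:3'], 'OPENALEX': [0]})

# Stack both subsets, as in the union cell above.
union = pd.concat([with_openalex, without_openalex], ignore_index=True)

# A clean partition has no shared ids and no repeated id in the union.
overlap = set(with_openalex['id']) & set(without_openalex['id'])
n_duplicated = int(union.duplicated(subset='id').sum())
print(len(union), overlap, n_duplicated)
```

On the real data, comparing `len(union)` with the row count of the source table would additionally show whether any record was lost.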


In [45]:
# exports
aou_and_pubmed_and_openalex.to_csv(myfolder_results + 'aou_and_pubmed_and_openalex.tsv', sep='\t', index=False)
aou_and_pubmed_and_openalex.to_excel(myfolder_results + 'aou_and_pubmed_and_openalex.xlsx', index=False)

In [46]:
# exports
aou_and_pubmed_not_openalex.to_csv(myfolder_results + 'aou_and_pubmed_not_openalex.tsv', sep='\t', index=False)
aou_and_pubmed_not_openalex.to_excel(myfolder_results + 'aou_and_pubmed_not_openalex.xlsx', index=False)

## AoU NOT PubMed AND OpenAlex

In [47]:
aou_not_pubmed

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED
0,unige:167405,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Thèse,10.13097/ARCHIVE-OUVERTE/UNIGE:167405,0,1,0,0
1,unige:167406,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Thèse,10.13097/ARCHIVE-OUVERTE/UNIGE:167406,0,1,0,0
2,unige:167407,2022,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Thèse,10.13097/ARCHIVE-OUVERTE/UNIGE:167407,0,1,0,0
3,unige:167409,2022,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1016/B978-0-323-90999-0.00003-3,0,1,0,0
4,unige:167424,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1002/TAX.12863,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5553,unige:167193,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1093/PLPHYS/KIAC605,36583226,1,0,0
5554,unige:167207,2022,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,Thèse,10.13097/ARCHIVE-OUVERTE/UNIGE:167207,0,1,0,0
5555,unige:167233,2022,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,Article scientifique,10.53738/REVMED.2022.18.798.1880,36200968,1,0,0
5556,unige:167345,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1212/WNL.0000000000200133,35288476,1,0,0


In [48]:
# merge AoU records not in PubMed with OpenAlex by PMID (only records that have a PMID; 0 means no PMID)
aou_not_pubmed_openalex1 = aou_not_pubmed.loc[aou_not_pubmed['PMID'] != 0].merge(openalex_pmid[['OPENALEX', 'PMID', 'DOI', 'ID_OPENALEX', 'is_oa', 'oa_status', 'date_openalex']], on='PMID', how='outer')
aou_not_pubmed_openalex1 = aou_not_pubmed_openalex1.rename(columns={'DOI_x' : 'DOI'})
aou_not_pubmed_openalex1 = aou_not_pubmed_openalex1.rename(columns={'DOI_y' : 'DOI_OPENALEX'})
# add back the AoU records without PMID so that every AoU record stays in the table
aou_not_pubmed_openalex1 = pd.concat([aou_not_pubmed_openalex1, aou_not_pubmed.loc[aou_not_pubmed['PMID'] == 0]], ignore_index=True, sort=False)
aou_not_pubmed_openalex1['AOU'] = aou_not_pubmed_openalex1['AOU'].fillna(0).astype(int)
aou_not_pubmed_openalex1['OPENALEX'] = aou_not_pubmed_openalex1['OPENALEX'].fillna(0).astype(int)
aou_not_pubmed_openalex1

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex
0,,,,,,,,,,,,,,,,,,,,,566224,0,,,1,10.1159/000400365,W2260754226,False,closed,2015.0
1,,,,,,,,,,,,,,,,,,,,,614034,0,,,1,10.1159/000446699,W2467949190,True,gold,2016.0
2,,,,,,,,,,,,,,,,,,,,,1658583,0,,,1,10.32388/WOB8WB,W4246486748,True,hybrid,2020.0
3,,,,,,,,,,,,,,,,,,,,,1666809,0,,,1,10.1002/JBMR.5650061114,W2012837775,False,closed,2020.0
4,,,,,,,,,,,,,,,,,,,,,2220455,0,,,1,10.1159/000418835,W2271345702,False,closed,2015.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26474,unige:167053,2022.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,Article scientifique,10.21105/JOSS.04248,0,1,0.0,0.0,0,,,,,
26475,unige:167054,2022.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Chapitre de livre,10.1007/978-3-031-07121-8_12,0,1,0.0,0.0,0,,,,,
26476,unige:167072,2022.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Chapitre de livre,10.1515/9783110670851-020,0,1,0.0,0.0,0,,,,,
26477,unige:167097,2022.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article professionnel,10.4414/BMS.2022.21195,0,1,0.0,0.0,0,,,,,
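
The `AOU` and `OPENALEX` flags above are derived by filling the gaps left by the outer merge. As an alternative, the sketch below uses the `indicator=True` option of `merge`, which records the origin of each row in a `_merge` column. The toy frames `aou` and `openalex` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one record only in AoU, one in both, one only in OpenAlex.
aou = pd.DataFrame({'id': ['unige:1', 'unige:2'], 'PMID': [111, 222]})
openalex = pd.DataFrame({'ID_OPENALEX': ['W1', 'W2'], 'PMID': [222, 333]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = aou.merge(openalex, on='PMID', how='outer', indicator=True)

# Derive 0/1 flags from the indicator instead of from missing values.
merged['AOU'] = merged['_merge'].isin(['left_only', 'both']).astype(int)
merged['OPENALEX'] = merged['_merge'].isin(['right_only', 'both']).astype(int)

flags = merged.set_index('PMID')[['AOU', 'OPENALEX']]
print(flags)
```

This keeps the flags independent of whatever the source tables already contain in their own `AOU` or `OPENALEX` columns.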


In [49]:
aou_not_pubmed.loc[aou_not_pubmed['DOI'].isna()]

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED
8,unige:167616,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre d'actes,,0,1,0,0
10,unige:167618,2021,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Présentation / Intervention,,0,1,0,0
16,unige:167973,2022,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,,0,1,0,0
22,unige:168200,2022,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,Chapitre d'actes,,0,1,0,0
24,unige:168276,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Livre,,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5511,unige:166596,2018,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,,0,1,0,0
5518,unige:166670,2021,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,,0,1,0,0
5533,unige:166891,2022,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,,0,1,0,0
5536,unige:166902,2022,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,,36082659,1,0,0


In [50]:
# merge AoU records not in PubMed and without PMID with OpenAlex by DOI
aou_not_pubmed_openalex2 = aou_not_pubmed.loc[aou_not_pubmed['PMID'] == 0].merge(openalex_doi[['OPENALEX', 'PMID', 'DOI', 'ID_OPENALEX', 'is_oa', 'oa_status', 'date_openalex']], on='DOI', how='outer')
aou_not_pubmed_openalex2 = aou_not_pubmed_openalex2.rename(columns={'PMID_x' : 'PMID'})
aou_not_pubmed_openalex2 = aou_not_pubmed_openalex2.rename(columns={'PMID_y' : 'PMID_OPENALEX'})
aou_not_pubmed_openalex2['AOU'] = aou_not_pubmed_openalex2['AOU'].fillna(0).astype(int)
aou_not_pubmed_openalex2['OPENALEX'] = aou_not_pubmed_openalex2['OPENALEX'].fillna(0).astype(int)
aou_not_pubmed_openalex2

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,PMID_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex
0,,,,,,,,,,,,,,,,,,,,10.1001/JAMA.2014.18482,,0,,,1,25603492,W1980806879,False,closed,2015.0
1,,,,,,,,,,,,,,,,,,,,10.1001/JAMA.2015.3703,,0,,,1,25919531,W2063717932,False,closed,2015.0
2,,,,,,,,,,,,,,,,,,,,10.1001/JAMA.2015.4668,,0,,,1,25988462,W2168357345,True,green,2015.0
3,,,,,,,,,,,,,,,,,,,,10.1001/JAMA.2015.4970,,0,,,1,26151269,W1488627510,False,closed,2015.0
4,,,,,,,,,,,,,,,,,,,,10.1001/JAMA.2016.20029,,0,,,1,28241362,W2592762700,False,closed,2017.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24763,unige:166315,2022.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,Article scientifique,,0.0,1,0.0,0.0,0,,,,,
24764,unige:166596,2018.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,,0.0,1,0.0,0.0,0,,,,,
24765,unige:166670,2021.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Chapitre de livre,,0.0,1,0.0,0.0,0,,,,,
24766,unige:166891,2022.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,,0.0,1,0.0,0.0,0,,,,,
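
One point to keep in mind when merging on `DOI`: pandas matches missing keys with each other, so an AoU record without DOI would be paired with an OpenAlex record without DOI. Whether `openalex_doi` can contain missing DOIs is not visible in this section, and the output above suggests it does not. The sketch below only illustrates the behaviour on hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row of each frame has no DOI.
left = pd.DataFrame({'id': ['unige:1', 'unige:2'], 'DOI': ['10.1/A', np.nan]})
right = pd.DataFrame({'ID_OPENALEX': ['W1', 'W2'], 'DOI': ['10.1/A', np.nan]})

# Naive merge: the two missing DOIs are treated as equal keys.
naive = left.merge(right, on='DOI', how='outer')

# Safer variant: drop right-hand rows without DOI before merging.
safe = left.merge(right.dropna(subset=['DOI']), on='DOI', how='outer')

print(naive)
print(safe)
```

In `naive`, the record `unige:2` is linked to `W2` although neither has a DOI. In `safe` it stays unmatched.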


In [51]:
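# consolidate both merges
# NOTE: records without PMID are present in both frames; keep='first' retains their copy from
# aou_not_pubmed_openalex1 (OPENALEX = 0), so the DOI matches found in aou_not_pubmed_openalex2 are discarded here.
# Rows without AoU id (OpenAlex-only rows) also collapse into a single row; only rows with AOU == 1 are used below.
aou_not_pubmed_openalex = pd.concat([aou_not_pubmed_openalex1, aou_not_pubmed_openalex2], ignore_index=True)
aou_not_pubmed_openalex = aou_not_pubmed_openalex.drop_duplicates(subset='id', keep='first')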
aou_not_pubmed_openalex

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
0,,,,,,,,,,,,,,,,,,,,,566224.0,0,,,1,10.1159/000400365,W2260754226,False,closed,2015.0,
14,unige:94273,2017.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,Thèse,10.13097/ARCHIVE-OUVERTE/UNIGE:94273,23775053.0,1,0.0,0.0,0,,,,,,
15,unige:29525,2015.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1016/J.PSYCHRES.2013.06.032,23870492.0,1,0.0,0.0,0,,,,,,
21,unige:35523,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1111/VOP.12137,24373539.0,1,0.0,0.0,0,,,,,,
22,unige:74670,2015.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article scientifique,10.1597/13-145,24437562.0,1,0.0,0.0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26474,unige:167053,2022.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,Article scientifique,10.21105/JOSS.04248,0.0,1,0.0,0.0,0,,,,,,
26475,unige:167054,2022.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Chapitre de livre,10.1007/978-3-031-07121-8_12,0.0,1,0.0,0.0,0,,,,,,
26476,unige:167072,2022.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Chapitre de livre,10.1515/9783110670851-020,0.0,1,0.0,0.0,0,,,,,,
26477,unige:167097,2022.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Article professionnel,10.4414/BMS.2022.21195,0.0,1,0.0,0.0,0,,,,,,
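
When the same record appears in both merges, `drop_duplicates(keep='first')` keeps whichever copy comes first in the concatenation, matched or not. A minimal sketch of one possible way to prefer the matched copy is shown here. The frames `by_pmid` and `by_doi` are hypothetical toy data, and this is not the method used in the notebook.

```python
import pandas as pd

# Hypothetical toy data: 'unige:2' is unmatched by PMID but matched by DOI.
by_pmid = pd.DataFrame({'id': ['unige:1', 'unige:2'], 'OPENALEX': [1, 0]})
by_doi = pd.DataFrame({'id': ['unige:2'], 'OPENALEX': [1]})

stacked = pd.concat([by_pmid, by_doi], ignore_index=True)

# Order of concatenation decides: the unmatched copy of 'unige:2' wins.
first_wins = stacked.drop_duplicates(subset='id', keep='first')

# Sort matched rows first (stable sort), then deduplicate: the matched copy wins.
match_wins = (stacked.sort_values('OPENALEX', ascending=False, kind='mergesort')
                     .drop_duplicates(subset='id', keep='first')
                     .sort_index())

print(first_wins)
print(match_wins)
```

On real data, rows with a missing `id` would need to be set aside before deduplicating, because missing values count as duplicates of each other.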


In [52]:
# keep the AoU records (not in PubMed) that were matched in OpenAlex
# .copy() makes the subset independent of the parent frame, which avoids the SettingWithCopyWarning
aou_not_pubmed_and_openalex = aou_not_pubmed_openalex.loc[(aou_not_pubmed_openalex['AOU'] == 1) & (aou_not_pubmed_openalex['OPENALEX'] == 1)].copy()
# the outer merges turned these columns into floats with NaN: fill the gaps with 0 and cast back to int
int_columns = ['date', 'BIOMED', 'discipline_clinical', 'discipline_basic', 'discipline_biology', 'discipline_pharma',
               'discipline_affective', 'discipline_dentistry', 'discipline_medicine_general', 'discipline_neurosciences',
               'DATA', 'DATA_TYPE_appendixes', 'DATA_TYPE_data_supplements', 'DATA_TYPE_shared_data',
               'FUNDER', 'FNS_FUNDER', 'EU_FUNDER', 'PMID', 'PUBMED', 'DATA_PUBMED', 'date_openalex', 'PMID_OPENALEX']
for column in int_columns:
    aou_not_pubmed_and_openalex[column] = aou_not_pubmed_and_openalex[column].fillna(0).astype(int)
aou_not_pubmed_and_openalex

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
26,unige:74218,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1016/J.CLNU.2014.01.008,24485773,1,0,0,1,10.1016/J.CLNU.2014.01.008,W2025059392,False,closed,2015,0
38,unige:74671,2015,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,10.1597/13-272,24805871,1,0,0,1,10.1597/13-272,W2089513019,False,closed,2015,0
93,unige:54992,2015,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1016/J.JHEP.2014.09.018,25245893,1,0,0,1,10.1016/J.JHEP.2014.09.018,W1989074294,True,hybrid,2015,0
100,unige:74672,2015,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,10.1597/14-085,25275538,1,0,0,1,10.1597/14-085,W2145851361,False,closed,2015,0
115,unige:98743,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,Article scientifique,10.1016/J.NEUROSCIENCE.2014.09.059,25316409,1,0,0,1,10.1016/J.NEUROSCIENCE.2014.09.059,W2073673228,True,hybrid,2015,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22522,unige:166651,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,Article scientifique,10.1177/15910199221145745,36529940,1,0,0,1,10.1177/15910199221145745,W4311933681,True,hybrid,2022,0
22533,unige:173715,2022,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,Article scientifique,10.1530/REP-22-0312,36538648,1,0,0,1,10.1530/REP-22-0312,W4312066132,True,bronze,2022,0
22536,unige:166004,2022,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1093/INFDIS/JIAC487,36542509,1,0,0,1,10.1093/INFDIS/JIAC487,W4312065292,True,bronze,2022,0
22579,unige:166812,2022,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,Article scientifique,10.1093/STCLTM/SZAC081,36571216,1,0,0,1,10.1093/STCLTM/SZAC081,W4312194965,True,gold,2022,0
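
The casts above replace missing values with 0, so a PMID of 0 has to be read as "no PMID" in every later filter. A possible alternative, shown here only as a sketch on a hypothetical toy series, is the nullable `Int64` dtype of pandas, which keeps integers and missing values side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column as it looks after an outer merge: floats with a gap.
pmid = pd.Series([24485773.0, np.nan], name='PMID')

# Approach used in the notebook: 0 stands for "missing".
with_sentinel = pmid.fillna(0).astype(int)

# Alternative: nullable integers, the gap stays a missing value.
nullable = pmid.astype('Int64')

print(with_sentinel.tolist())
print(nullable.isna().tolist())
```

With the nullable dtype, a filter such as `PMID != 0` would become `PMID.notna()`.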


In [53]:
# check duplicates
aou_not_pubmed_and_openalex.loc[aou_not_pubmed_and_openalex.duplicated(subset='id', keep=False)].sort_values(by='id')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [54]:
# check duplicates by PMID (records without PMID carry the placeholder 0 and are ignored)
aou_not_pubmed_and_openalex.loc[aou_not_pubmed_and_openalex['PMID'] != 0].loc[aou_not_pubmed_and_openalex.duplicated(subset='PMID', keep=False)].sort_values(by='PMID')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [55]:
# check duplicates
aou_not_pubmed_and_openalex.loc[aou_not_pubmed_and_openalex.duplicated(subset='ID_OPENALEX', keep=False)].sort_values(by='ID_OPENALEX')

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [56]:
# export duplicates by OpenAlex ID for manual review
aou_not_pubmed_and_openalex_duplicates = aou_not_pubmed_and_openalex.loc[aou_not_pubmed_and_openalex.duplicated(subset='ID_OPENALEX', keep=False)].sort_values(by='ID_OPENALEX')
aou_not_pubmed_and_openalex_duplicates.to_csv(myfolder_temp + 'aou_not_pubmed_and_openalex_duplicates.tsv', sep='\t', index=False)
aou_not_pubmed_and_openalex_duplicates.to_excel(myfolder_temp + 'aou_not_pubmed_and_openalex_duplicates.xlsx', index=False)

In [57]:
# exports
aou_not_pubmed_and_openalex.to_csv(myfolder_results + 'aou_not_pubmed_and_openalex.tsv', sep='\t', index=False)
aou_not_pubmed_and_openalex.to_excel(myfolder_results + 'aou_not_pubmed_and_openalex.xlsx', index=False)

## AoU NOT PubMed NOT OpenAlex

In [58]:
# keep the AoU records (not in PubMed) that were not matched in OpenAlex
aou_not_pubmed_not_openalex = aou_not_pubmed_openalex.loc[(aou_not_pubmed_openalex['AOU'] == 1) & (aou_not_pubmed_openalex['OPENALEX'] == 0)]
# .copy() makes the subset independent of the parent frame, which avoids the SettingWithCopyWarning
aou_not_pubmed_not_openalex = aou_not_pubmed_not_openalex.drop_duplicates(subset='id', keep='first').copy()
# the outer merges turned these columns into floats with NaN: fill the gaps with 0 and cast back to int
int_columns = ['date', 'BIOMED', 'discipline_clinical', 'discipline_basic', 'discipline_biology', 'discipline_pharma',
               'discipline_affective', 'discipline_dentistry', 'discipline_medicine_general', 'discipline_neurosciences',
               'DATA', 'DATA_TYPE_appendixes', 'DATA_TYPE_data_supplements', 'DATA_TYPE_shared_data',
               'FUNDER', 'FNS_FUNDER', 'EU_FUNDER', 'PMID', 'PUBMED', 'DATA_PUBMED', 'date_openalex', 'PMID_OPENALEX']
for column in int_columns:
    aou_not_pubmed_not_openalex[column] = aou_not_pubmed_not_openalex[column].fillna(0).astype(int)
aou_not_pubmed_not_openalex


Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
14,unige:94273,2017,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,Thèse,10.13097/ARCHIVE-OUVERTE/UNIGE:94273,23775053,1,0,0,0,,,,,0,0
15,unige:29525,2015,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1016/J.PSYCHRES.2013.06.032,23870492,1,0,0,0,,,,,0,0
21,unige:35523,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1111/VOP.12137,24373539,1,0,0,0,,,,,0,0
22,unige:74670,2015,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,10.1597/13-145,24437562,1,0,0,0,,,,,0,0
23,unige:74675,2015,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,10.1597/13-080,24437563,1,0,0,0,,,,,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26474,unige:167053,2022,1,1,0,0,0,0,0,0,0,1,0,0,1,1,1,0,Article scientifique,10.21105/JOSS.04248,0,1,0,0,0,,,,,0,0
26475,unige:167054,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1007/978-3-031-07121-8_12,0,1,0,0,0,,,,,0,0
26476,unige:167072,2022,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,Chapitre de livre,10.1515/9783110670851-020,0,1,0,0,0,,,,,0,0
26477,unige:167097,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article professionnel,10.4414/BMS.2022.21195,0,1,0,0,0,,,,,0,0


In [59]:
# check duplicates
aou_not_pubmed_not_openalex.loc[aou_not_pubmed_not_openalex.duplicated(subset='id', keep=False)]

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [60]:
# exports
aou_not_pubmed_not_openalex.to_csv(myfolder_results + 'aou_not_pubmed_not_openalex.tsv', sep='\t', index=False)
aou_not_pubmed_not_openalex.to_excel(myfolder_results + 'aou_not_pubmed_not_openalex.xlsx', index=False)

## PubMed NOT AoU AND OpenAlex

In [61]:
pubmed_not_aou

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id
0,28816886,Reply,"Assouline B, Tramèr MR, Elia N.",Pain. 2017 Sep;158(9):1839-1840. doi: 10.1097/...,Assouline B,Pain,2017,2017/08/18,,,1,10.1097/J.PAIN.0000000000000955,1,0,
1,29039370,De-Identification of Medical Narrative Data,"Foufi V, Gaudet-Blavignac C, Chevrier R, Lovis C.",Stud Health Technol Inform. 2017;244:23-27.,Foufi V,Stud Health Technol Inform,2017,2017/10/18,,,0,,1,0,
2,28703551,[Hospital readmissions: current problems and p...,"Blanc AL, Fumeaux T, Stirneman J, Bonnabry P, ...",Rev Med Suisse. 2017 Jan 11;13(544-545):117-120.,Blanc AL,Rev Med Suisse,2017,2017/07/14,,,0,,1,0,
3,28837366,Uterine leiomyomata: the snowball effect,"Soave I, Marci R.",Curr Med Res Opin. 2017 Nov;33(11):1909-1911. ...,Soave I,Curr Med Res Opin,2017,2017/08/25,,,1,10.1080/03007995.2017.1372174,1,0,
4,28727356,[Management of carotid artery stenosis],"Puccinelli F, Roffi M, Murith N, Sztajzel R.",Rev Med Suisse. 2017 Apr 26;13(560):894-899.,Puccinelli F,Rev Med Suisse,2017,2017/07/21,,,0,,1,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10580,35086739,The era of reference genomes in conservation g...,"Formenti G, Theissinger K, Fernandes C, Bista ...",Trends Ecol Evol. 2022 Mar;37(3):197-202. doi:...,Formenti G,Trends Ecol Evol,2022,2022/01/28,,,0,10.1016/J.TREE.2021.11.008,1,0,
10581,34895743,The global NAFLD policy review and preparednes...,"Lazarus JV, Mark HE, Villota-Rivas M, Palayew ...",J Hepatol. 2022 Apr;76(4):771-780. doi: 10.101...,Lazarus JV,J Hepatol,2022,2021/12/13,,,0,10.1016/J.JHEP.2021.10.025,1,0,
10582,33685583,Heterogeneous contributions of change in popul...,NCD Risk Factor Collaboration (NCD-RisC).,Elife. 2021 Mar 9;10:e60060. doi: 10.7554/eLif...,NCD Risk Factor Collaboration (NCD-RisC),Elife,2021,2021/03/09,PMC7943191,,1,10.7554/ELIFE.60060,1,0,
10583,36257718,Global Impact of the COVID-19 Pandemic on Stro...,"Nguyen TN, Qureshi MM, Klein P, Yamagami H, Mi...",Neurology. 2023 Jan 24;100(4):e408-e421. doi: ...,Nguyen TN,Neurology,2023,2022/10/18,PMC9897052,,1,10.1212/WNL.0000000000201426,1,0,


In [62]:
# merge PubMed records not in AoU with OpenAlex by PMID
pubmed_not_aou_openalex1 = pubmed_not_aou.merge(openalex_pmid[['OPENALEX', 'PMID', 'DOI', 'ID_OPENALEX', 'is_oa', 'oa_status', 'date_openalex']], on='PMID', how='outer')
pubmed_not_aou_openalex1 = pubmed_not_aou_openalex1.rename(columns={'DOI_x' : 'DOI'})
pubmed_not_aou_openalex1 = pubmed_not_aou_openalex1.rename(columns={'DOI_y' : 'DOI_OPENALEX'})
pubmed_not_aou_openalex1['PUBMED'] = pubmed_not_aou_openalex1['PUBMED'].fillna(0).astype(int)
pubmed_not_aou_openalex1['OPENALEX'] = pubmed_not_aou_openalex1['OPENALEX'].fillna(0).astype(int)
pubmed_not_aou_openalex1

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex
0,566224,,,,,,,,,,,,0,,,1,10.1159/000400365,W2260754226,False,closed,2015.0
1,614034,,,,,,,,,,,,0,,,1,10.1159/000446699,W2467949190,True,gold,2016.0
2,1658583,,,,,,,,,,,,0,,,1,10.32388/WOB8WB,W4246486748,True,hybrid,2020.0
3,1666809,,,,,,,,,,,,0,,,1,10.1002/JBMR.5650061114,W2012837775,False,closed,2020.0
4,2220455,,,,,,,,,,,,0,,,1,10.1159/000418835,W2271345702,False,closed,2015.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23738,38624814,"Language of Religion, Religions as Languages. ...",Vestrucci A.,Sophia. 2022;61(1):1-7. doi: 10.1007/s11841-02...,Vestrucci A,Sophia,2022.0,2024/04/16,PMC8968769,,0.0,10.1007/S11841-022-00912-5,1,0.0,,0,,,,,
23739,38715737,,,,,,,,,,,,0,,,1,10.1002/ANSA.202000151,W3109860721,True,gold,2020.0
23740,38715738,,,,,,,,,,,,0,,,1,10.1002/ANSA.202000091,W3109700191,True,gold,2020.0
23741,38715742,,,,,,,,,,,,0,,,1,10.1002/ANSA.202000131,W3111521613,True,gold,2020.0
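
The outer merge above derives the `PUBMED` and `OPENALEX` flags from columns that each source already carries. As a minimal sketch of an alternative, pandas can report the origin of each row itself through `indicator=True`. The two small frames below are invented stand-ins for `pubmed_not_aou` and `openalex_pmid`, not real data.

```python
import pandas as pd

# toy stand-ins for pubmed_not_aou and openalex_pmid (values are invented)
pubmed = pd.DataFrame({'PMID': [1, 2], 'DOI': ['10.1/A', None]})
openalex = pd.DataFrame({'PMID': [2, 3],
                         'DOI': ['10.1/B', '10.1/C'],
                         'ID_OPENALEX': ['W2', 'W3']})

# suffixes=('', '_OPENALEX') names the overlapping DOI columns directly,
# indicator=True adds a _merge column: left_only, right_only or both
merged = pubmed.merge(openalex, on='PMID', how='outer',
                      suffixes=('', '_OPENALEX'), indicator=True)

# turn the indicator into the 0/1 flags used in this notebook
merged['PUBMED'] = merged['_merge'].isin(['left_only', 'both']).astype(int)
merged['OPENALEX'] = merged['_merge'].isin(['right_only', 'both']).astype(int)
merged = merged.drop(columns='_merge')
print(merged)
```

With this variant the flags cannot drift from the actual merge result, and the separate `rename` step is no longer needed.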


In [63]:
pubmed_not_aou.loc[pubmed_not_aou['DOI'].isna()]

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id
1,29039370,De-Identification of Medical Narrative Data,"Foufi V, Gaudet-Blavignac C, Chevrier R, Lovis C.",Stud Health Technol Inform. 2017;244:23-27.,Foufi V,Stud Health Technol Inform,2017,2017/10/18,,,0,,1,0,
2,28703551,[Hospital readmissions: current problems and p...,"Blanc AL, Fumeaux T, Stirneman J, Bonnabry P, ...",Rev Med Suisse. 2017 Jan 11;13(544-545):117-120.,Blanc AL,Rev Med Suisse,2017,2017/07/14,,,0,,1,0,
4,28727356,[Management of carotid artery stenosis],"Puccinelli F, Roffi M, Murith N, Sztajzel R.",Rev Med Suisse. 2017 Apr 26;13(560):894-899.,Puccinelli F,Rev Med Suisse,2017,2017/07/21,,,0,,1,0,
33,27800499,Anti-VEGF Agents for the Treatment of Pigment ...,"Panos GD, Gatzioufas Z.",Med Hypothesis Discov Innov Ophthalmol. 2015 W...,Panos GD,Med Hypothesis Discov Innov Ophthalmol,2015,2015/01/01,PMC5087102,,0,,1,0,
59,28905546,[Hypertension in people of African descent],"Cane F, Zisimopoulou S, Pechère-Bertschi A.",Rev Med Suisse. 2017 Sep 13;13(574):1576-1579.,Cane F,Rev Med Suisse,2017,2017/09/15,,,0,,1,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9259,33443823,[What's new in addiction medicine],"Zullino D, Daeppen JB, Seragnoli F, Favrod Cou...",Rev Med Suisse. 2021 Jan 13;17(720-1):10-12.,Zullino D,Rev Med Suisse,2021,2021/01/14,,,0,,1,0,
9486,34644020,[Recommendations for management of misuses and...,"Grandjean C, Crettol Wavre S, Khazaal Y, Sanch...",Rev Med Suisse. 2021 Oct 13;17(754):1754-1759.,Grandjean C,Rev Med Suisse,2021,2021/10/13,,,0,,1,0,
9631,34910408,[Gabapentinoids : misuses and addictions],"Grandjean C, Crettol Wavre S, Khazaal Y, Sanch...",Rev Med Suisse. 2021 Dec 15;17(763):2206-2208.,Grandjean C,Rev Med Suisse,2021,2021/12/15,,,0,,1,0,
9778,34755944,[Personalized medicine and chronic disease pre...,"Cohidon C, Desvergne B, Widmer D, Cerqui D, Gu...",Rev Med Suisse. 2021 Nov 10;17(758):1939-1942.,Cohidon C,Rev Med Suisse,2021,2021/11/10,,,0,,1,0,
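
Listing the records without a DOI is a useful check before merging on that column, because pandas matches missing keys with each other (unlike a SQL join). The sketch below illustrates the effect on invented data and one way to guard against it. Whether `openalex_doi` still contains rows without a DOI at this point is not visible here, so treat the guard as a precaution rather than a required fix.

```python
import numpy as np
import pandas as pd

# toy frames (invented values): one record on each side has no DOI
left = pd.DataFrame({'PMID': [1, 2], 'DOI': ['10.1/A', np.nan]})
right = pd.DataFrame({'DOI': ['10.1/A', np.nan], 'ID_OPENALEX': ['W1', 'W9']})

# naive merge: the two missing DOIs are treated as the same key
naive = left.merge(right, on='DOI', how='left')

# guard: drop rows without a DOI on the lookup side before merging
safe = left.merge(right.dropna(subset=['DOI']), on='DOI', how='left')

print(naive)
print(safe)
```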


In [64]:
# merge PubMed records not in AoU with OpenAlex data by DOI
pubmed_not_aou_openalex2 = pubmed_not_aou.merge(openalex_doi[['OPENALEX', 'PMID', 'DOI', 'ID_OPENALEX', 'is_oa', 'oa_status', 'date_openalex']], on='DOI', how='outer')
# both sides have a PMID column: keep the PubMed one as PMID and the OpenAlex one as PMID_OPENALEX
pubmed_not_aou_openalex2 = pubmed_not_aou_openalex2.rename(columns={'PMID_x' : 'PMID', 'PMID_y' : 'PMID_OPENALEX'})
# rows present on one side only have NaN flags after the outer merge: set them to 0
pubmed_not_aou_openalex2['PUBMED'] = pubmed_not_aou_openalex2['PUBMED'].fillna(0).astype(int)
pubmed_not_aou_openalex2['OPENALEX'] = pubmed_not_aou_openalex2['OPENALEX'].fillna(0).astype(int)
pubmed_not_aou_openalex2

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id,OPENALEX,PMID_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex
0,,,,,,,,,,,,10.1001/JAMA.2014.18482,0,,,1,25603492,W1980806879,False,closed,2015.0
1,,,,,,,,,,,,10.1001/JAMA.2015.3703,0,,,1,25919531,W2063717932,False,closed,2015.0
2,,,,,,,,,,,,10.1001/JAMA.2015.4668,0,,,1,25988462,W2168357345,True,green,2015.0
3,,,,,,,,,,,,10.1001/JAMA.2015.4970,0,,,1,26151269,W1488627510,False,closed,2015.0
4,,,,,,,,,,,,10.1001/JAMA.2016.20029,0,,,1,28241362,W2592762700,False,closed,2017.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23504,33443823.0,[What's new in addiction medicine],"Zullino D, Daeppen JB, Seragnoli F, Favrod Cou...",Rev Med Suisse. 2021 Jan 13;17(720-1):10-12.,Zullino D,Rev Med Suisse,2021.0,2021/01/14,,,0.0,,1,0.0,,0,,,,,
23505,34644020.0,[Recommendations for management of misuses and...,"Grandjean C, Crettol Wavre S, Khazaal Y, Sanch...",Rev Med Suisse. 2021 Oct 13;17(754):1754-1759.,Grandjean C,Rev Med Suisse,2021.0,2021/10/13,,,0.0,,1,0.0,,0,,,,,
23506,34910408.0,[Gabapentinoids : misuses and addictions],"Grandjean C, Crettol Wavre S, Khazaal Y, Sanch...",Rev Med Suisse. 2021 Dec 15;17(763):2206-2208.,Grandjean C,Rev Med Suisse,2021.0,2021/12/15,,,0.0,,1,0.0,,0,,,,,
23507,34755944.0,[Personalized medicine and chronic disease pre...,"Cohidon C, Desvergne B, Widmer D, Cerqui D, Gu...",Rev Med Suisse. 2021 Nov 10;17(758):1939-1942.,Cohidon C,Rev Med Suisse,2021.0,2021/11/10,,,0.0,,1,0.0,,0,,,,,


In [65]:
# consolidate both merges
pubmed_not_aou_openalex = pd.concat([pubmed_not_aou_openalex1, pubmed_not_aou_openalex2], ignore_index=True)
pubmed_not_aou_openalex = pubmed_not_aou_openalex.drop_duplicates(subset='PMID', keep='first')
pubmed_not_aou_openalex

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
0,566224.0,,,,,,,,,,,,0,,,1,10.1159/000400365,W2260754226,False,closed,2015.0,
1,614034.0,,,,,,,,,,,,0,,,1,10.1159/000446699,W2467949190,True,gold,2016.0,
2,1658583.0,,,,,,,,,,,,0,,,1,10.32388/WOB8WB,W4246486748,True,hybrid,2020.0,
3,1666809.0,,,,,,,,,,,,0,,,1,10.1002/JBMR.5650061114,W2012837775,False,closed,2020.0,
4,2220455.0,,,,,,,,,,,,0,,,1,10.1159/000418835,W2271345702,False,closed,2015.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23739,38715737.0,,,,,,,,,,,,0,,,1,10.1002/ANSA.202000151,W3109860721,True,gold,2020.0,
23740,38715738.0,,,,,,,,,,,,0,,,1,10.1002/ANSA.202000091,W3109700191,True,gold,2020.0,
23741,38715742.0,,,,,,,,,,,,0,,,1,10.1002/ANSA.202000131,W3111521613,True,gold,2020.0,
23742,38716386.0,Exploring CHEMeDATA. An interview with Damien ...,"Jeannerat D, Trevorrow P.",Anal Sci Adv. 2020 May 15;1(4):254-257. doi: 1...,Jeannerat D,Anal Sci Adv,2020.0,2024/05/08,PMC10989152,,0.0,10.1002/ANSA.202000041,1,0.0,,1,10.1002/ANSA.202000041,W3025354823,True,gold,2020.0,
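
One point worth checking in the consolidation above: `drop_duplicates(..., keep='first')` keeps whichever row comes first in the concatenation, and the PMID merge comes first. A PubMed record that matched OpenAlex only through its DOI would therefore keep its unmatched row from the PMID merge. The sketch below reproduces this on invented data and shows one possible remedy, sorting matched rows first. It is an illustration under that assumption, not a statement about what the real data contains.

```python
import pandas as pd

# toy frames (invented values): PMID 1 matched OpenAlex only through its DOI
by_pmid = pd.DataFrame({'PMID': [1, 2], 'PUBMED': [1, 1], 'OPENALEX': [0, 1]})
by_doi = pd.DataFrame({'PMID': [1, 2], 'PUBMED': [1, 1], 'OPENALEX': [1, 1]})
both = pd.concat([by_pmid, by_doi], ignore_index=True)

# keep='first' on the raw concat keeps the PMID-merge row: PMID 1 looks unmatched
naive = both.drop_duplicates(subset='PMID', keep='first')

# a stable sort that puts matched rows first makes keep='first' prefer them
preferred = (both.sort_values('OPENALEX', ascending=False, kind='mergesort')
                 .drop_duplicates(subset='PMID', keep='first')
                 .sort_index())

print(naive)
print(preferred)
```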


In [66]:
# consolidate merge: keep records found in both PubMed and OpenAlex
pubmed_not_aou_and_openalex = pubmed_not_aou_openalex.loc[(pubmed_not_aou_openalex['PUBMED'] == 1) & (pubmed_not_aou_openalex['OPENALEX'] == 1)]
# .copy() gives an independent frame, so the assignments below do not raise SettingWithCopyWarning
pubmed_not_aou_and_openalex = pubmed_not_aou_and_openalex.drop_duplicates(subset='PMID', keep='first').copy()
# cast float columns back to integers (missing values become 0)
pubmed_not_aou_and_openalex['Publication Year'] = pubmed_not_aou_and_openalex['Publication Year'].fillna(0).astype(int)
pubmed_not_aou_and_openalex['PMID'] = pubmed_not_aou_and_openalex['PMID'].fillna(0).astype(int)
pubmed_not_aou_and_openalex['AOU'] = pubmed_not_aou_and_openalex['AOU'].fillna(0).astype(int)
pubmed_not_aou_and_openalex['DATA_PUBMED'] = pubmed_not_aou_and_openalex['DATA_PUBMED'].fillna(0).astype(int)
pubmed_not_aou_and_openalex['date_openalex'] = pubmed_not_aou_and_openalex['date_openalex'].fillna(0).astype(int)
pubmed_not_aou_and_openalex['PMID_OPENALEX'] = pubmed_not_aou_and_openalex['PMID_OPENALEX'].fillna(0).astype(int)
pubmed_not_aou_and_openalex


Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
18,24368644,Dissecting EASL/AASLD Recommendations With a M...,"Mazzaferro V, Roayaie S, Poon R, Majno PE.",Ann Surg. 2015 Jul;262(1):e17-8. doi: 10.1097/...,Mazzaferro V,Ann Surg,2015,2013/12/26,,,0,10.1097/SLA.0000000000000398,1,0,,1,10.1097/SLA.0000000000000398,W1977259095,False,closed,2015,0
35,24831652,Erlotinib: another candidate for the therapeut...,"Petit-Jean E, Buclin T, Guidi M, Quoix E, Gour...",Ther Drug Monit. 2015 Feb;37(1):2-21. doi: 10....,Petit-Jean E,Ther Drug Monit,2015,2014/05/17,,,0,10.1097/FTD.0000000000000097,1,0,,1,10.1097/FTD.0000000000000097,W2318275056,False,closed,2015,0
38,24892797,"Neurobiology of DHEA and effects on sexuality,...","Pluchino N, Drakopoulos P, Bianchi-Demicheli F...",J Steroid Biochem Mol Biol. 2015 Jan;145:273-8...,Pluchino N,J Steroid Biochem Mol Biol,2015,2014/06/04,,,0,10.1016/J.JSBMB.2014.04.012,1,0,,1,10.1016/J.JSBMB.2014.04.012,W2078298098,False,closed,2015,0
45,25023584,ESPGHAN position paper on management of percut...,"Heuschkel RB, Gottrand F, Devarajan K, Poole H...",J Pediatr Gastroenterol Nutr. 2015 Jan;60(1):1...,Heuschkel RB,J Pediatr Gastroenterol Nutr,2015,2014/07/16,,,0,10.1097/MPG.0000000000000501,1,0,,1,10.1097/MPG.0000000000000501,W2066786861,True,bronze,2015,0
49,25058504,Standards for definitions and use of outcome m...,"Jammer I, Wickboldt N, Sander M, Smith A, Schu...",Eur J Anaesthesiol. 2015 Feb;32(2):88-105. doi...,Jammer I,Eur J Anaesthesiol,2015,2014/07/25,,,0,10.1097/EJA.0000000000000118,1,0,,1,10.1097/EJA.0000000000000118,W1687620054,True,bronze,2015,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23734,38550944,Impact of structural racism on inclusion and d...,"Geneviève LD, Elger BS, Wangmo T.",Camb Prism Precis Med. 2022 Oct 26;1:e5. doi: ...,Geneviève LD,Camb Prism Precis Med,2022,2024/03/29,PMC10953740,,0,10.1017/PCM.2022.4,1,0,,1,10.1017/PCM.2022.4,W4307391990,True,gold,2022,0
23735,38585625,Intersecting vulnerabilities: climatic and dem...,"Rohat G, Monaghan A, Hayden MH, Ryan SJ, Charr...",Environ Res Lett. 2020 Aug;15(8):084046. doi: ...,Rohat G,Environ Res Lett,2020,2024/04/08,PMC10997346,NIHMS1932265,0,10.1088/1748-9326/AB9141,1,0,,1,10.1088/1748-9326/AB9141,W2969452336,True,gold,2020,0
23736,38620847,How COVID-19 changed Italian consumers' behavior,"Cervellati EM, Stella GP, Filotto U, Maino A.",Glob Financ J. 2022 Feb;51:100680. doi: 10.101...,Cervellati EM,Glob Financ J,2022,2024/04/15,PMC8570799,,0,10.1016/J.GFJ.2021.100680,1,0,,1,10.1016/J.GFJ.2021.100680,W3212349705,True,bronze,2022,0
23737,38624653,[Narrative Argumentation in Political Letters ...,Schröter J.,Z Literaturwissenschaft Linguist. 2021;51(2):2...,Schröter J,Z Literaturwissenschaft Linguist,2021,2024/04/16,PMC8045440,,0,10.1007/S41244-021-00200-8,1,0,,1,10.1007/S41244-021-00200-8,W3156941189,True,hybrid,2021,0
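
The cell above casts several float columns back to integers by first replacing missing values with 0, which is why `PMID_OPENALEX` shows 0 where OpenAlex has no PMID. As a hedged sketch of two related techniques: taking an explicit `.copy()` of a filtered frame avoids pandas' `SettingWithCopyWarning`, and the nullable `Int64` dtype keeps missing values as `<NA>` instead of 0. The frame below is an invented stand-in for `pubmed_not_aou_openalex`.

```python
import numpy as np
import pandas as pd

# toy stand-in for pubmed_not_aou_openalex (invented values)
merged = pd.DataFrame({
    'PMID': [101.0, 102.0, 103.0],
    'PUBMED': [1, 1, 0],
    'OPENALEX': [1, 0, 1],
    'date_openalex': [2015.0, np.nan, 2020.0],
})

# .copy() makes the selection an independent frame, so the assignment
# on the next line does not trigger SettingWithCopyWarning
subset = merged.loc[(merged['PUBMED'] == 1) & (merged['OPENALEX'] == 1)].copy()
subset['PMID'] = subset['PMID'].astype(int)

# nullable Int64 keeps missing years as <NA> instead of turning them into 0
years = merged['date_openalex'].astype('Int64')

print(subset)
print(years)
```

Keeping `<NA>` makes it harder to mistake a placeholder 0 for a real year or identifier in the exported files, at the cost of a dtype that some downstream tools may handle differently.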


In [67]:
# check duplicates by PMID
pubmed_not_aou_and_openalex.loc[pubmed_not_aou_and_openalex.duplicated(subset='PMID', keep=False)]

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [68]:
# check duplicates by OpenAlex ID
pubmed_not_aou_and_openalex.loc[pubmed_not_aou_and_openalex.duplicated(subset='ID_OPENALEX', keep=False)].sort_values(by='ID_OPENALEX')

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [69]:
# export duplicates
pubmed_not_aou_and_openalex.loc[pubmed_not_aou_and_openalex.duplicated(subset='ID_OPENALEX', keep=False)].sort_values(by='ID_OPENALEX').to_csv(myfolder_temp + 'pubmed_not_aou_and_openalex_duplicates.tsv', sep='\t', index=False)
pubmed_not_aou_and_openalex.loc[pubmed_not_aou_and_openalex.duplicated(subset='ID_OPENALEX', keep=False)].sort_values(by='ID_OPENALEX').to_excel(myfolder_temp + 'pubmed_not_aou_and_openalex_duplicates.xlsx', index=False)

In [70]:
# exports
pubmed_not_aou_and_openalex.to_csv(myfolder_results + 'pubmed_not_aou_and_openalex.tsv', sep='\t', index=False)
pubmed_not_aou_and_openalex.to_excel(myfolder_results + 'pubmed_not_aou_and_openalex.xlsx', index=False)

## PubMed NOT AoU NOT OpenAlex

In [71]:
# consolidate merge: keep PubMed records that were not found in OpenAlex
pubmed_not_aou_not_openalex = pubmed_not_aou_openalex.loc[(pubmed_not_aou_openalex['PUBMED'] == 1) & (pubmed_not_aou_openalex['OPENALEX'] == 0)]
# .copy() gives an independent frame, so the assignments below do not raise SettingWithCopyWarning
pubmed_not_aou_not_openalex = pubmed_not_aou_not_openalex.drop_duplicates(subset='PMID', keep='first').copy()
# cast float columns back to integers (missing values become 0)
pubmed_not_aou_not_openalex['Publication Year'] = pubmed_not_aou_not_openalex['Publication Year'].fillna(0).astype(int)
pubmed_not_aou_not_openalex['PMID'] = pubmed_not_aou_not_openalex['PMID'].fillna(0).astype(int)
pubmed_not_aou_not_openalex['AOU'] = pubmed_not_aou_not_openalex['AOU'].fillna(0).astype(int)
pubmed_not_aou_not_openalex['DATA_PUBMED'] = pubmed_not_aou_not_openalex['DATA_PUBMED'].fillna(0).astype(int)
pubmed_not_aou_not_openalex['date_openalex'] = pubmed_not_aou_not_openalex['date_openalex'].fillna(0).astype(int)
pubmed_not_aou_not_openalex['PMID_OPENALEX'] = pubmed_not_aou_not_openalex['PMID_OPENALEX'].fillna(0).astype(int)
pubmed_not_aou_not_openalex


Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
19,24385222,Sleep and emotional functions,"Perogamvros L, Schwartz S.",Curr Top Behav Neurosci. 2015;25:411-31. doi: ...,Perogamvros L,Curr Top Behav Neurosci,2015,2014/01/04,,,1,10.1007/7854_2013_271,1,0,,0,,,,,0,0
20,24448223,Do Inhibitory Control Demands Affect Event-Bas...,"Altgassen M, Koch A, Kliegel M.",J Atten Disord. 2019 Jan;23(1):51-56. doi: 10....,Altgassen M,J Atten Disord,2019,2014/01/23,,,1,10.1177/1087054713518236,1,0,,0,,,,,0,0
26,24553738,Negative and distorted attributions towards ch...,"Schechter DS, Moser DA, Reliford A, McCaw JE, ...",Child Psychiatry Hum Dev. 2015 Feb;46(1):10-20...,Schechter DS,Child Psychiatry Hum Dev,2015,2014/02/21,PMC4139484,NIHMS568871,1,10.1007/S10578-014-0447-5,1,0,,0,,,,,0,0
27,24612083,Factor validity and reliability of the aberran...,"Lehotkay R, Saraswathi Devi T, Raju MV, Bada P...",J Intellect Disabil Res. 2015 Mar;59(3):208-14...,Lehotkay R,J Intellect Disabil Res,2015,2014/03/12,,,1,10.1111/JIR.12128,1,0,,0,,,,,0,0
29,24676279,The Swiss NEHAP: why it ended,Forbat J.,Health Promot Int. 2015 Sep;30(3):716-24. doi:...,Forbat J,Health Promot Int,2015,2014/03/29,,,1,10.1093/HEAPRO/DAU014,1,0,,0,,,,,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23709,37645268,"FAIR4Health: Findable, Accessible, Interoperab...","Alvarez-Romero C, Martínez-García A, Sinaci AA...",Open Res Eur. 2022 May 31;2:34. doi: 10.12688/...,Alvarez-Romero C,Open Res Eur,2022,2023/08/30,PMC10446092,,0,10.12688/OPENRESEUROPE.14349.2,1,0,,0,,,,,0,0
23710,37665839,An extended validation of the Communal Coping ...,"Pété E, Chanal J, Doron J.",Psychol Sport Exerc. 2023 Mar;65:102367. doi: ...,Pété E,Psychol Sport Exerc,2023,2023/09/04,,,0,10.1016/J.PSYCHSPORT.2022.102367,1,0,,0,,,,,0,0
23711,37881579,Schizophrenia Polygenic Risk During Typical De...,"Kirschner M, Paquola C, Khundrakpam BS, Vainik...",Biol Psychiatry Glob Open Sci. 2022 Aug 24;3(4...,Kirschner M,Biol Psychiatry Glob Open Sci,2022,2023/10/26,PMC10593879,,0,10.1016/J.BPSGOS.2022.08.003,1,0,,0,,,,,0,0
23716,37969488,Arrhythmic Burden of Adult Survivors With Repa...,"Touray M, Ladouceur M, Bouchardy J, Schwerzman...",CJC Pediatr Congenit Heart Dis. 2022 Sep 8;1(6...,Touray M,CJC Pediatr Congenit Heart Dis,2022,2023/11/16,PMC10642084,,0,10.1016/J.CJCPC.2022.08.003,1,0,,0,,,,,0,0


In [72]:
# check duplicates by PMID
pubmed_not_aou_not_openalex.loc[pubmed_not_aou_not_openalex.duplicated(subset='PMID', keep=False)]

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX


In [73]:
# exports
pubmed_not_aou_not_openalex.to_csv(myfolder_results + 'pubmed_not_aou_not_openalex.tsv', sep='\t', index=False)
pubmed_not_aou_not_openalex.to_excel(myfolder_results + 'pubmed_not_aou_not_openalex.xlsx', index=False)

## OpenAlex NOT AoU NOT PubMed

In [74]:
aou_and_pubmed_and_openalex

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
18,unige:32989,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,Article scientifique,10.1016/J.BIOPSYCH.2013.06.023,23993209,1,1,0,1,10.1016/J.BIOPSYCH.2013.06.023,W2108436957,True,green,2015,0
23,unige:84097,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2174/15701611113116660167,24188484,1,1,0,1,10.2174/15701611113116660167,W2293858562,False,closed,2015,0
24,unige:72702,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2174/15701611113116660164,24188487,1,1,1,1,10.2174/15701611113116660164,W301338818,True,green,2015,0
25,unige:36640,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2174/1570161111999131203092322,24188489,1,1,0,1,10.2174/1570161111999131203092322,W87985356,False,closed,2015,0
36,unige:77996,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1097/SLA.0000000000000426,24477161,1,1,0,1,10.1097/SLA.0000000000000426,W2026589533,False,closed,2015,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23471,unige:160667,2022,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.2533/CHIMIA.2022.90,38069754,1,1,0,1,10.2533/CHIMIA.2022.90,W4214655598,True,gold,2022,0
23474,unige:166451,2022,1,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,Article scientifique,10.2533/CHIMIA.2022.954,38069791,1,1,0,1,10.2533/CHIMIA.2022.954,W4310999321,True,gold,2022,0
23488,unige:150518,2021,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1002/ANSA.202000151,38715737,1,1,0,1,10.1002/ANSA.202000151,W3109860721,True,gold,2020,0
23489,unige:146726,2020,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1002/ANSA.202000091,38715738,1,1,0,1,10.1002/ANSA.202000091,W3109700191,True,gold,2020,0


In [75]:
aou_not_pubmed_and_openalex

Unnamed: 0,id,date,BIOMED,discipline_clinical,discipline_basic,discipline_biology,discipline_pharma,discipline_affective,discipline_dentistry,discipline_medicine_general,discipline_neurosciences,DATA,DATA_TYPE_appendixes,DATA_TYPE_data_supplements,DATA_TYPE_shared_data,FUNDER,FNS_FUNDER,EU_FUNDER,type,DOI,PMID,AOU,PUBMED,DATA_PUBMED,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
26,unige:74218,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1016/J.CLNU.2014.01.008,24485773,1,0,0,1,10.1016/J.CLNU.2014.01.008,W2025059392,False,closed,2015,0
38,unige:74671,2015,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,10.1597/13-272,24805871,1,0,0,1,10.1597/13-272,W2089513019,False,closed,2015,0
93,unige:54992,2015,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1016/J.JHEP.2014.09.018,25245893,1,0,0,1,10.1016/J.JHEP.2014.09.018,W1989074294,True,hybrid,2015,0
100,unige:74672,2015,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Article scientifique,10.1597/14-085,25275538,1,0,0,1,10.1597/14-085,W2145851361,False,closed,2015,0
115,unige:98743,2015,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,Article scientifique,10.1016/J.NEUROSCIENCE.2014.09.059,25316409,1,0,0,1,10.1016/J.NEUROSCIENCE.2014.09.059,W2073673228,True,hybrid,2015,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22522,unige:166651,2022,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,Article scientifique,10.1177/15910199221145745,36529940,1,0,0,1,10.1177/15910199221145745,W4311933681,True,hybrid,2022,0
22533,unige:173715,2022,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,Article scientifique,10.1530/REP-22-0312,36538648,1,0,0,1,10.1530/REP-22-0312,W4312066132,True,bronze,2022,0
22536,unige:166004,2022,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,Article scientifique,10.1093/INFDIS/JIAC487,36542509,1,0,0,1,10.1093/INFDIS/JIAC487,W4312065292,True,bronze,2022,0
22579,unige:166812,2022,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,Article scientifique,10.1093/STCLTM/SZAC081,36571216,1,0,0,1,10.1093/STCLTM/SZAC081,W4312194965,True,gold,2022,0


In [76]:
pubmed_not_aou_and_openalex

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DATA_PUBMED,DOI,PUBMED,AOU,id,OPENALEX,DOI_OPENALEX,ID_OPENALEX,is_oa,oa_status,date_openalex,PMID_OPENALEX
18,24368644,Dissecting EASL/AASLD Recommendations With a M...,"Mazzaferro V, Roayaie S, Poon R, Majno PE.",Ann Surg. 2015 Jul;262(1):e17-8. doi: 10.1097/...,Mazzaferro V,Ann Surg,2015,2013/12/26,,,0,10.1097/SLA.0000000000000398,1,0,,1,10.1097/SLA.0000000000000398,W1977259095,False,closed,2015,0
35,24831652,Erlotinib: another candidate for the therapeut...,"Petit-Jean E, Buclin T, Guidi M, Quoix E, Gour...",Ther Drug Monit. 2015 Feb;37(1):2-21. doi: 10....,Petit-Jean E,Ther Drug Monit,2015,2014/05/17,,,0,10.1097/FTD.0000000000000097,1,0,,1,10.1097/FTD.0000000000000097,W2318275056,False,closed,2015,0
38,24892797,"Neurobiology of DHEA and effects on sexuality,...","Pluchino N, Drakopoulos P, Bianchi-Demicheli F...",J Steroid Biochem Mol Biol. 2015 Jan;145:273-8...,Pluchino N,J Steroid Biochem Mol Biol,2015,2014/06/04,,,0,10.1016/J.JSBMB.2014.04.012,1,0,,1,10.1016/J.JSBMB.2014.04.012,W2078298098,False,closed,2015,0
45,25023584,ESPGHAN position paper on management of percut...,"Heuschkel RB, Gottrand F, Devarajan K, Poole H...",J Pediatr Gastroenterol Nutr. 2015 Jan;60(1):1...,Heuschkel RB,J Pediatr Gastroenterol Nutr,2015,2014/07/16,,,0,10.1097/MPG.0000000000000501,1,0,,1,10.1097/MPG.0000000000000501,W2066786861,True,bronze,2015,0
49,25058504,Standards for definitions and use of outcome m...,"Jammer I, Wickboldt N, Sander M, Smith A, Schu...",Eur J Anaesthesiol. 2015 Feb;32(2):88-105. doi...,Jammer I,Eur J Anaesthesiol,2015,2014/07/25,,,0,10.1097/EJA.0000000000000118,1,0,,1,10.1097/EJA.0000000000000118,W1687620054,True,bronze,2015,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23734,38550944,Impact of structural racism on inclusion and d...,"Geneviève LD, Elger BS, Wangmo T.",Camb Prism Precis Med. 2022 Oct 26;1:e5. doi: ...,Geneviève LD,Camb Prism Precis Med,2022,2024/03/29,PMC10953740,,0,10.1017/PCM.2022.4,1,0,,1,10.1017/PCM.2022.4,W4307391990,True,gold,2022,0
23735,38585625,Intersecting vulnerabilities: climatic and dem...,"Rohat G, Monaghan A, Hayden MH, Ryan SJ, Charr...",Environ Res Lett. 2020 Aug;15(8):084046. doi: ...,Rohat G,Environ Res Lett,2020,2024/04/08,PMC10997346,NIHMS1932265,0,10.1088/1748-9326/AB9141,1,0,,1,10.1088/1748-9326/AB9141,W2969452336,True,gold,2020,0
23736,38620847,How COVID-19 changed Italian consumers' behavior,"Cervellati EM, Stella GP, Filotto U, Maino A.",Glob Financ J. 2022 Feb;51:100680. doi: 10.101...,Cervellati EM,Glob Financ J,2022,2024/04/15,PMC8570799,,0,10.1016/J.GFJ.2021.100680,1,0,,1,10.1016/J.GFJ.2021.100680,W3212349705,True,bronze,2022,0
23737,38624653,[Narrative Argumentation in Political Letters ...,Schröter J.,Z Literaturwissenschaft Linguist. 2021;51(2):2...,Schröter J,Z Literaturwissenschaft Linguist,2021,2024/04/16,PMC8045440,,0,10.1007/S41244-021-00200-8,1,0,,1,10.1007/S41244-021-00200-8,W3156941189,True,hybrid,2021,0


In [77]:
# concatenate the three dataframes that were matched with OpenAlex
openalex_merged = pd.concat([aou_and_pubmed_and_openalex[['ID_OPENALEX', 'id', 'PMID']], aou_not_pubmed_and_openalex[['ID_OPENALEX', 'id', 'PMID']]], ignore_index=True, sort=False)
openalex_merged = pd.concat([openalex_merged, pubmed_not_aou_and_openalex[['ID_OPENALEX', 'PMID']]], ignore_index=True, sort=False)
openalex_merged

Unnamed: 0,ID_OPENALEX,id,PMID
0,W2108436957,unige:32989,23993209
1,W2293858562,unige:84097,24188484
2,W301338818,unige:72702,24188487
3,W87985356,unige:36640,24188489
4,W2026589533,unige:77996,24477161
...,...,...,...
20262,W4307391990,,38550944
20263,W2969452336,,38585625
20264,W3212349705,,38620847
20265,W3156941189,,38624653


In [78]:
# check duplicates
openalex_merged.loc[openalex_merged.duplicated(subset='ID_OPENALEX', keep=False)]

Unnamed: 0,ID_OPENALEX,id,PMID


In [79]:
# remove duplicates
openalex_merged = openalex_merged.drop_duplicates(subset='ID_OPENALEX', keep='first')
openalex_merged

Unnamed: 0,ID_OPENALEX,id,PMID
0,W2108436957,unige:32989,23993209
1,W2293858562,unige:84097,24188484
2,W301338818,unige:72702,24188487
3,W87985356,unige:36640,24188489
4,W2026589533,unige:77996,24477161
...,...,...,...
20262,W4307391990,,38550944
20263,W2969452336,,38585625
20264,W3212349705,,38620847
20265,W3156941189,,38624653


In [80]:
# OpenAlex records not merged with AoU or PubMed
openalex_not_merged = openalex.loc[~openalex['ID_OPENALEX'].isin(openalex_merged['ID_OPENALEX'])]
openalex_not_merged

Unnamed: 0,url_openalex,primary_location_id,primary_location_display_name,primary_location_issn_l,primary_location_is_oa,primary_location_version,primary_location_license,is_oa,oa_status,oa_url,doi_openalex,pmid_openalex,date_openalex,type,is_retracted,biblio_issue,biblio_first_page,biblio_volume,biblio_last_page,PMID,DOI,ID_OPENALEX,OPENALEX
1,https://openalex.org/W2897513125,https://openalex.org/S31768639,Age and ageing,0002-0729,True,publishedVersion,cc-by-nc,True,hybrid,https://academic.oup.com/ageing/article-pdf/48...,https://doi.org/10.1093/ageing/afy169,https://pubmed.ncbi.nlm.nih.gov/31081853,2018,article,False,1,16,48,31,31081853,10.1093/AGEING/AFY169,W2897513125,1
8,https://openalex.org/W2024227324,https://openalex.org/S23181512,Nanoscale,2040-3364,True,publishedVersion,cc-by,True,hybrid,https://pubs.rsc.org/en/content/articlepdf/201...,https://doi.org/10.1039/c4nr01600a,https://pubmed.ncbi.nlm.nih.gov/25707682,2015,article,False,11,4598,7,4810,25707682,10.1039/C4NR01600A,W2024227324,1
53,https://openalex.org/W3102157866,https://openalex.org/S190924190,Journal of physics. Condensed matter,0953-8984,True,publishedVersion,cc-by,True,hybrid,https://doi.org/10.1088/1361-648x/ab51ff,https://doi.org/10.1088/1361-648x/ab51ff,https://pubmed.ncbi.nlm.nih.gov/31658458,2020,article,False,16,165902,32,165902,31658458,10.1088/1361-648X/AB51FF,W3102157866,1
56,https://openalex.org/W2195190137,https://openalex.org/S110447773,Cell,0092-8674,True,publishedVersion,publisher-specific-oa,True,hybrid,http://www.cell.com/article/S0092867415015044/pdf,https://doi.org/10.1016/j.cell.2015.11.024,https://pubmed.ncbi.nlm.nih.gov/26686651,2015,article,False,7,1611,163,1627,26686651,10.1016/J.CELL.2015.11.024,W2195190137,1
60,https://openalex.org/W2759086237,https://openalex.org/S64187185,Nature communications,2041-1723,True,publishedVersion,cc-by,True,gold,https://www.nature.com/articles/s41467-018-036...,https://doi.org/10.1038/s41467-018-03621-1,https://pubmed.ncbi.nlm.nih.gov/29739930,2018,article,False,1,,9,,29739930,10.1038/S41467-018-03621-1,W2759086237,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45991,https://openalex.org/W4286892492,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34704683,2021,article,False,,,,,34704683,,W4286892492,1
45992,https://openalex.org/W4286892578,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34704682,2021,article,False,,,,,34704682,,W4286892578,1
46007,https://openalex.org/W4286953485,https://openalex.org/S4306525036,PubMed,,False,,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34585859,2021,article,False,,,,,34585859,,W4286953485,1
46014,https://openalex.org/W4287020831,https://openalex.org/S4306525036,PubMed,,False,submittedVersion,,False,closed,,,https://pubmed.ncbi.nlm.nih.gov/34431641,2021,article,False,,,,,34431641,,W4287020831,1
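
The filter above is an anti-join: it keeps the OpenAlex records whose ID does not appear among the merged ones. The sketch below shows the same selection on invented IDs, next to an equivalent formulation with a left merge and `indicator=True`, which can serve as a cross-check. The frame names mirror the notebook, but the contents are made up.

```python
import pandas as pd

# toy stand-ins for openalex and openalex_merged (invented IDs)
openalex = pd.DataFrame({'ID_OPENALEX': ['W1', 'W2', 'W3', 'W4']})
openalex_merged = pd.DataFrame({'ID_OPENALEX': ['W2', 'W4']})

# isin-based filter, as in the cell above
not_merged = openalex.loc[~openalex['ID_OPENALEX'].isin(openalex_merged['ID_OPENALEX'])]

# equivalent anti-join: left merge, then keep rows found on the left side only
check = openalex.merge(openalex_merged.drop_duplicates(), on='ID_OPENALEX',
                       how='left', indicator=True)
anti = check.loc[check['_merge'] == 'left_only', ['ID_OPENALEX']]

# sanity check: merged and not-merged records add up to the full export
matched = openalex['ID_OPENALEX'].isin(openalex_merged['ID_OPENALEX']).sum()
print(len(not_merged) + matched == len(openalex))
```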


In [81]:
# exports
openalex_not_merged.to_csv(myfolder_results + 'openalex_not_aou_not_pubmed.tsv', sep='\t', index=False)
openalex_not_merged.to_excel(myfolder_results + 'openalex_not_aou_not_pubmed.xlsx', index=False)

  "65,530 URLS per worksheet." % force_unicode(url))

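The warning fragment in the output of the export cell looks like the message the XlsxWriter engine emits when a worksheet exceeds Excel's limit of 65,530 hyperlinks. The export contains several URL columns (`url_openalex`, `oa_url`, and others), which would explain it. A minimal sketch of one way around this, assuming the Excel file is written with XlsxWriter and pandas 1.3 or later: pass the `strings_to_urls` workbook option so that URLs are stored as plain text. The helper name `export_excel_plain_urls` is hypothetical.

```python
import pandas as pd

# options forwarded to xlsxwriter.Workbook; strings_to_urls=False writes
# URL-like strings as plain text instead of converting them to hyperlinks
XLSX_ENGINE_KWARGS = {'options': {'strings_to_urls': False}}


def export_excel_plain_urls(df, path):
    """Write df to an .xlsx file without hyperlink conversion (hypothetical helper).

    Requires the xlsxwriter package to be installed.
    """
    with pd.ExcelWriter(path, engine='xlsxwriter',
                        engine_kwargs=XLSX_ENGINE_KWARGS) as writer:
        df.to_excel(writer, index=False)
```

It could then replace the `to_excel` call of the export cell, for example `export_excel_plain_urls(openalex_not_merged, myfolder_results + 'openalex_not_aou_not_pubmed.xlsx')`. The TSV export is not affected by this limit.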