# Integrate Europe PMC and arXiv data

## 1. Introduction

This notebook combines the data for the 11,397 PubMed Central (PMC) articles retrieved with the Europe PMC API with the data for the 118 exceptions and the 301 arXiv articles whose text was extracted by GROBID. The combined data is then cleaned to remove missing values and duplicates.



## 2. Import libraries

In [None]:
import pandas as pd
import pickle

## 3. Import data for 11,397 PMC articles

Load the data for the 11,397 articles. The full text column was added after using Beautiful Soup to remove tags and strip markup. The DataFrame still includes the rows for the 118 exceptions, which have no full text.

In [None]:
with open('2023-01-06_search_results_full_text.pickle', 'rb') as f:
  search_results_full_text = pickle.load(f)
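
Since the pickled object is a DataFrame, pandas can also read and write it directly. The following is a minimal sketch of a round trip using `DataFrame.to_pickle` and `pd.read_pickle`; the toy data and the file name `example.pickle` are assumptions made for illustration and are not part of the project.

```python
import os
import tempfile

import pandas as pd

# Toy DataFrame standing in for the real search results (invented values).
df = pd.DataFrame({"pmcid": ["PMC1"], "text": ["body text"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pickle")  # hypothetical file name
    df.to_pickle(path)               # write the DataFrame as a pickle
    restored = pd.read_pickle(path)  # read it back in one call

print(restored.equals(df))
```

As with any pickle, this should only be used on files from a trusted source.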

In [None]:
search_results_full_text

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr..."
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co..."
...,...,...,...,...,...,...,...,...,...
11392,PMC6328940,2019-01-01,2020-03-09,β-RA reduces DMQ/CoQ ratio and rescues the enc...,EMBO molecular medicine,"Hidalgo-Gutiérrez A, Barriocanal-Casado E, Bak...",10.15252/emmm.201809466,https://europepmc.org/articles/PMC6328940?pdf=...,Mitochondria are the primary site of cellular ...
11393,PMC6598402,2019-06-21,2020-09-28,Alzheimer Disease Pathogenesis: Insights From ...,Frontiers in neuroscience,"Chen XQ, Mobley WC.",10.3389/fnins.2019.00659,https://europepmc.org/articles/PMC6598402?pdf=...,"AD is the most common cause of dementia, accou..."
11394,PMC6481739,2019-02-05,2020-09-28,Modeling cardiac complexity: Advancements in m...,APL bioengineering,"Callaghan NI, Hadipour-Lakmehsari S, Lee SH, G...",10.1063/1.5055873,https://europepmc.org/articles/PMC6481739?pdf=...,Compromised contractility of the heart is a ma...
11395,PMC6624471,2019-07-05,2020-09-28,Tissue Response to Neural Implants: The Use of...,Frontiers in neuroscience,"Gulino M, Kim D, Pané S, Santos SD, Pêgo AP.",10.3389/fnins.2019.00689,https://europepmc.org/articles/PMC6624471?pdf=...,Recent technological progress in the field of ...


Check for missing text by searching for empty strings.

In [None]:
search_results_full_text[search_results_full_text['text']=='']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
1772,PMC9364697,2022-07-06,2022-10-06,3<sup>a</sup> edizione Giornate della ricerca ...,Journal of preventive medicine and hygiene,0,10.15167/2421-4248/jpmh2022.63.1s1,https://europepmc.org/articles/PMC9364697?pdf=...,


The text column contains an empty string for PMC9364697 (an Italian edition), as GROBID currently only supports English.

We can also see the 118 missing values using the .isna() method.

In [None]:
search_results_full_text[search_results_full_text['text'].isna()]

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
64,PMC9538661,2022-10-13,2022-11-22,Recent Drug Development and Medicinal Chemistr...,ChemMedChem,"Ghosh AK, Mishevich JL, Mesecar A, Mitsuya H.",10.1002/cmdc.202200440,https://europepmc.org/articles/PMC9538661?pdf=...,
188,PMC9794394,2022-12-28,2023-01-02,Nonstructural protein 1 (nsp1) widespread RNA ...,iScience,"Bermudez Y, Miles J, Muller M.",10.1016/j.isci.2022.105887,https://europepmc.org/articles/PMC9794394?pdf=...,
212,PMC9538837,2022-10-10,2022-11-22,The Efficacy of Traditional Medicinal Plants i...,Chemistry & biodiversity,"Choe J, Har Yong P, Xiang Ng Z.",10.1002/cbdv.202200655,https://europepmc.org/articles/PMC9538837?pdf=...,
256,PMC9788990,2022-12-24,2023-01-02,Sleep and circadian rhythm disruption alters t...,iScience,"Taylor L, Von Lendenfeld F, Ashton A, Sanghani...",10.1016/j.isci.2022.105877,https://europepmc.org/articles/PMC9788990?pdf=...,
275,PMC9794516,2022-12-28,2023-01-02,MultiOMICs landscape of SARS-CoV-2-induced hos...,iScience,"Pinto SM, Subbannayya Y, Kim H, Hagen L, Górna...",10.1016/j.isci.2022.105895,https://europepmc.org/articles/PMC9794516?pdf=...,
...,...,...,...,...,...,...,...,...,...
9797,PMC7162151,2020-02-28,2020-04-21,News.,Chemistry & industry,0,10.1002/cind.842_3.x,https://europepmc.org/articles/PMC7162151?pdf=...,
9809,PMC9094125,2022-05-11,2022-07-16,Recent advances in metal-organic framework-bas...,Nano research,"Yang M, Zhang J, Wei Y, Zhang J, Tao C.",10.1007/s12274-022-4302-x,https://europepmc.org/articles/PMC9094125?pdf=...,
10399,PMC7492056,2020-09-15,2020-09-28,Full Issue PDF.,JACC. CardioOncology,0,10.1016/s2666-0873(20)30180-0,https://europepmc.org/articles/PMC7492056?pdf=...,
10767,PMC8459260,2021-02-25,2022-01-25,Getting in touch with your senses: Mechanisms ...,WIREs mechanisms of disease,"Gupta S, Butler SJ.",10.1002/wsbm.1520,https://europepmc.org/articles/PMC8459260?pdf=...,


The .info() method confirms this: the text column has 11,279 non-null values out of 11,397 rows, leaving 118 missing.

In [None]:
search_results_full_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11397 entries, 0 to 11396
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pmcid      11397 non-null  object
 1   published  11397 non-null  object
 2   revised    11397 non-null  object
 3   title      11397 non-null  object
 4   journal    11397 non-null  object
 5   authors    11397 non-null  object
 6   doi        11397 non-null  object
 7   pdf_url    11397 non-null  object
 8   text       11279 non-null  object
dtypes: object(9)
memory usage: 801.5+ KB
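
The `.info()` summary only counts NaN values as missing. The preview above also shows `0` in the authors and doi columns, which appears to be a placeholder and still counts as non-null. The sketch below tallies NaN, empty-string and placeholder values per column on invented toy data. It assumes the placeholder is the string `'0'`; the helper name `missing_summary` is hypothetical.

```python
import pandas as pd

# Toy frame; all values are invented for illustration only.
df = pd.DataFrame({
    "pmcid": ["PMC1", "PMC2", "PMC3"],
    "authors": ["Smith J.", "0", "Lee K."],
    "doi": ["10.1000/a", "0", "0"],
    "text": ["Some text", None, ""],
})

def missing_summary(frame, placeholder="0"):
    """Count NaN, empty-string and placeholder values in each column."""
    return pd.DataFrame({
        "nan": frame.isna().sum(),
        "empty": frame.eq("").sum(),
        "placeholder": frame.eq(placeholder).sum(),
    })

summary = missing_summary(df)
print(summary)
```

If the real placeholder turns out to be the integer 0 rather than the string, the `placeholder` argument can be changed accordingly.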


## 4. Import data for 118 PMC exceptions

Load the data for the 118 articles. The full text column was added after using GROBID to extract the text and Beautiful Soup to remove tags and strip markup.

In [None]:
with open('2023-01-06_pmc_search_results_full_text_grobid_v2.pickle', 'rb') as f:
  pmc_search_results_full_text_grobid = pickle.load(f)

In [None]:
pmc_search_results_full_text_grobid

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
64,PMC9538661,2022-10-13,2022-11-22,Recent Drug Development and Medicinal Chemistr...,ChemMedChem,"Ghosh AK, Mishevich JL, Mesecar A, Mitsuya H.",10.1002/cmdc.202200440,https://europepmc.org/articles/PMC9538661?pdf=...,\nSevere acute respiratory syndrome coronaviru...
188,PMC9794394,2022-12-28,2023-01-02,Nonstructural protein 1 (nsp1) widespread RNA ...,iScience,"Bermudez Y, Miles J, Muller M.",10.1016/j.isci.2022.105887,https://europepmc.org/articles/PMC9794394?pdf=...,\nThe past 20 years have seen the emergence of...
212,PMC9538837,2022-10-10,2022-11-22,The Efficacy of Traditional Medicinal Plants i...,Chemistry & biodiversity,"Choe J, Har Yong P, Xiang Ng Z.",10.1002/cbdv.202200655,https://europepmc.org/articles/PMC9538837?pdf=...,\nCoronavirus disease (Covid- 19) is a human r...
256,PMC9788990,2022-12-24,2023-01-02,Sleep and circadian rhythm disruption alters t...,iScience,"Taylor L, Von Lendenfeld F, Ashton A, Sanghani...",10.1016/j.isci.2022.105877,https://europepmc.org/articles/PMC9788990?pdf=...,\n\n\nJ o u r n a l P r e -p r o o f\nRespirat...
275,PMC9794516,2022-12-28,2023-01-02,MultiOMICs landscape of SARS-CoV-2-induced hos...,iScience,"Pinto SM, Subbannayya Y, Kim H, Hagen L, Górna...",10.1016/j.isci.2022.105895,https://europepmc.org/articles/PMC9794516?pdf=...,\nThe rapid emergence of the COVID-19 pandemic...
...,...,...,...,...,...,...,...,...,...
9797,PMC7162151,2020-02-28,2020-04-21,News.,Chemistry & industry,0,10.1002/cind.842_3.x,https://europepmc.org/articles/PMC7162151?pdf=...,\n\n\nRestricting the amount of the amino acid...
9809,PMC9094125,2022-05-11,2022-07-16,Recent advances in metal-organic framework-bas...,Nano research,"Yang M, Zhang J, Wei Y, Zhang J, Tao C.",10.1007/s12274-022-4302-x,https://europepmc.org/articles/PMC9094125?pdf=...,\n\n\n\n\n\n\n\n\n\n\n\n
10399,PMC7492056,2020-09-15,2020-09-28,Full Issue PDF.,JACC. CardioOncology,0,10.1016/s2666-0873(20)30180-0,https://europepmc.org/articles/PMC7492056?pdf=...,\nT he survival of children with cancer has co...
10767,PMC8459260,2021-02-25,2022-01-25,Getting in touch with your senses: Mechanisms ...,WIREs mechanisms of disease,"Gupta S, Butler SJ.",10.1002/wsbm.1520,https://europepmc.org/articles/PMC8459260?pdf=...,\nSomatosensation is essential for survival. I...


There are now no missing values in the text column.

In [None]:
pmc_search_results_full_text_grobid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 118 entries, 64 to 11329
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pmcid      118 non-null    object
 1   published  118 non-null    object
 2   revised    118 non-null    object
 3   title      118 non-null    object
 4   journal    118 non-null    object
 5   authors    118 non-null    object
 6   doi        118 non-null    object
 7   pdf_url    118 non-null    object
 8   text       118 non-null    object
dtypes: object(9)
memory usage: 9.2+ KB
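
A non-null value is not necessarily usable text. In the preview above, the text for PMC9094125 appears to consist only of newline characters, which neither `.isna()` nor a comparison with `''` would flag. A minimal sketch of a stricter check on invented toy data, treating NaN, empty strings and whitespace-only strings alike:

```python
import pandas as pd

# Toy text column: real text, newlines only, empty string, missing value.
texts = pd.Series(["\nSome body text", "\n\n\n\n", "", None], name="text")

# Treat NaN as empty, strip surrounding whitespace, then test for emptiness.
is_blank = texts.fillna("").str.strip().eq("")

print(is_blank.tolist())
print(int(is_blank.sum()), "blank rows")
```

Whether whitespace-only rows should be dropped or re-extracted is a separate decision; this check only makes them visible.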


View a few examples of rows from the 118 articles extracted by GROBID to check that they now have full text.

**PMC9538661**

In [None]:
pmc_search_results_full_text_grobid[pmc_search_results_full_text_grobid['pmcid']=='PMC9538661'].iloc[0]

pmcid                                               PMC9538661
published                                           2022-10-13
revised                                             2022-11-22
title        Recent Drug Development and Medicinal Chemistr...
journal                                            ChemMedChem
authors          Ghosh AK, Mishevich JL, Mesecar A, Mitsuya H.
doi                                     10.1002/cmdc.202200440
pdf_url      https://europepmc.org/articles/PMC9538661?pdf=...
text         \nSevere acute respiratory syndrome coronaviru...
Name: 64, dtype: object

**PMC7163523**

In [None]:
pmc_search_results_full_text_grobid[pmc_search_results_full_text_grobid['pmcid']=='PMC7163523'].iloc[0]

pmcid                                               PMC7163523
published                                           2019-08-01
revised                                             2021-02-16
title        Literature review of baseline information on n...
journal                           EFSA Supporting Publications
authors      Dávalos A, Henriques R, Latasa M, Laparra M, C...
doi                                                          0
pdf_url      https://europepmc.org/articles/PMC7163523?pdf=...
text         \nThis part provides baseline information on t...
Name: 11329, dtype: object

**PMC7640961**

In [None]:
pmc_search_results_full_text_grobid[pmc_search_results_full_text_grobid['pmcid']=='PMC7640961'].iloc[0]

pmcid                                               PMC7640961
published                                           2020-09-27
revised                                             2020-12-18
title        Repurposing of FDA-Approved Toremifene to Trea...
journal                           Journal of proteome research
authors                                    Martin WR, Cheng F.
doi                              10.1021/acs.jproteome.0c00397
pdf_url      https://europepmc.org/articles/PMC7640961?pdf=...
text         \nAs of August 4, 2020, there are over 18 mill...
Name: 2583, dtype: object

## 5. Concatenate PMC DataFrames

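Concatenate the two PMC DataFrames, then drop the duplicates on pmcid. Because the GROBID DataFrame is concatenated last, keep='last' retains the 118 rows that have text and discards their counterparts without text.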

In [None]:
pmc_search_results_full_text_merged = pd.concat([search_results_full_text, pmc_search_results_full_text_grobid]).drop_duplicates(subset=['pmcid'], keep='last')
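
To see why `keep='last'` retains the re-extracted rows, the same pattern can be run on a toy example. The data below is invented; only the `pd.concat(...).drop_duplicates(...)` call mirrors the cell above.

```python
import pandas as pd

# Original results: PMC2 has no text.
original = pd.DataFrame(
    {"pmcid": ["PMC1", "PMC2", "PMC3"],
     "text": ["full text 1", None, "full text 3"]}
)
# Hypothetical re-extracted row for PMC2, keeping its original index label.
reextracted = pd.DataFrame(
    {"pmcid": ["PMC2"], "text": ["text from GROBID"]}, index=[1]
)

# The re-extracted frame comes last, so keep='last' prefers its rows.
merged = pd.concat([original, reextracted]).drop_duplicates(
    subset=["pmcid"], keep="last"
)
print(merged)
```

The surviving PMC2 row sits at the end of the result with its original index label, which is why the merged DataFrame is no longer in index order and the index is reset later on.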

In [None]:
pmc_search_results_full_text_merged

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr..."
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co..."
...,...,...,...,...,...,...,...,...,...
9797,PMC7162151,2020-02-28,2020-04-21,News.,Chemistry & industry,0,10.1002/cind.842_3.x,https://europepmc.org/articles/PMC7162151?pdf=...,\n\n\nRestricting the amount of the amino acid...
9809,PMC9094125,2022-05-11,2022-07-16,Recent advances in metal-organic framework-bas...,Nano research,"Yang M, Zhang J, Wei Y, Zhang J, Tao C.",10.1007/s12274-022-4302-x,https://europepmc.org/articles/PMC9094125?pdf=...,\n\n\n\n\n\n\n\n\n\n\n\n
10399,PMC7492056,2020-09-15,2020-09-28,Full Issue PDF.,JACC. CardioOncology,0,10.1016/s2666-0873(20)30180-0,https://europepmc.org/articles/PMC7492056?pdf=...,\nT he survival of children with cancer has co...
10767,PMC8459260,2021-02-25,2022-01-25,Getting in touch with your senses: Mechanisms ...,WIREs mechanisms of disease,"Gupta S, Butler SJ.",10.1002/wsbm.1520,https://europepmc.org/articles/PMC8459260?pdf=...,\nSomatosensation is essential for survival. I...


In [None]:
with open('2023-01-15_pmc_search_results_full_text_merged.pickle', 'wb') as f:
  pickle.dump(pmc_search_results_full_text_merged, f)

## 6. Check combined PMC DataFrame for missing data

Check that there are no null values in the combined DataFrame.

In [None]:
pmc_search_results_full_text_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11397 entries, 0 to 11329
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pmcid      11397 non-null  object
 1   published  11397 non-null  object
 2   revised    11397 non-null  object
 3   title      11397 non-null  object
 4   journal    11397 non-null  object
 5   authors    11397 non-null  object
 6   doi        11397 non-null  object
 7   pdf_url    11397 non-null  object
 8   text       11397 non-null  object
dtypes: object(9)
memory usage: 890.4+ KB


Check using .isna() on the text column.

In [None]:
pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['text'].isna()]

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text


View a few of the 118 exceptions to see that they now have text in the combined DataFrame.

In [None]:
pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['pmcid']=='PMC9538661']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
64,PMC9538661,2022-10-13,2022-11-22,Recent Drug Development and Medicinal Chemistr...,ChemMedChem,"Ghosh AK, Mishevich JL, Mesecar A, Mitsuya H.",10.1002/cmdc.202200440,https://europepmc.org/articles/PMC9538661?pdf=...,\nSevere acute respiratory syndrome coronaviru...


In [None]:
pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['pmcid']=='PMC7163523']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
11329,PMC7163523,2019-08-01,2021-02-16,Literature review of baseline information on n...,EFSA Supporting Publications,"Dávalos A, Henriques R, Latasa M, Laparra M, C...",0,https://europepmc.org/articles/PMC7163523?pdf=...,\nThis part provides baseline information on t...


In [None]:
pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['pmcid']=='PMC7640961']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
2583,PMC7640961,2020-09-27,2020-12-18,Repurposing of FDA-Approved Toremifene to Trea...,Journal of proteome research,"Martin WR, Cheng F.",10.1021/acs.jproteome.0c00397,https://europepmc.org/articles/PMC7640961?pdf=...,"\nAs of August 4, 2020, there are over 18 mill..."


Check that there are no instances of '0' in the text column.

In [None]:
pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['text']=='0']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text


Check that there are no instances of 'Abstract' in the title column.

In [None]:
pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['title'].str.contains('Abstract')]

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text



When we first ran the parse_pdf function for the PMC data and created the futures_map dictionary, there were 13 proxy exceptions, each with an empty message string.

In [None]:
with open('2023-01-06_europepmc_ft_xml.pickle', 'rb') as f:
    dl_results = pickle.load(f)

In [None]:
dl_results.count('0')

13

In [None]:
with open('2023-01-06_europepmc_df_json_ft_urls_13_exceptions.pickle', 'rb') as f:
    exceptions = pickle.load(f)

In [None]:
exceptions

{'PMC8018918': Exception(''),
 'PMC7640961': Exception(''),
 'PMC8018905': Exception(''),
 'PMC7098069': Exception(''),
 'PMC7382535': Exception(''),
 'PMC7936759': Exception(''),
 'PMC8014535': Exception(''),
 'PMC7383733': Exception(''),
 'PMC7558230': Exception(''),
 'PMC8115429': Exception(''),
 'PMC7497212': Exception(''),
 'PMC7321661': Exception(''),
 'PMC8459260': Exception('')}

Check that a couple of the 13 exceptions now appear with text in the new combined DataFrame.

In [None]:
# Match on the pmcid column only, rather than searching every column for the IDs.
test_exceptions = pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['pmcid'].isin(['PMC8018918','PMC7558230'])]
test_exceptions

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
2353,PMC8018918,2021-04-03,2022-11-21,Virtual high throughput screening: Potential i...,European journal of pharmacology,"Jade D, Ayyamperumal S, Tallapaneni V, Joghee ...",10.1016/j.ejphar.2021.174082,https://europepmc.org/articles/PMC8018918?pdf=...,"\nIn December 2019, rare pneumonia, now called..."
5714,PMC7558230,2020-10-15,2021-01-10,Evaluation of mechanisms of action of re-purpo...,Cellular immunology,"Rajaiah R, Abhilasha KV, Shekar MA, Vogel SN, ...",10.1016/j.cellimm.2020.104240,https://europepmc.org/articles/PMC7558230?pdf=...,"\nIn the past two decades, the world populatio..."
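
Rather than spot-checking two IDs, all keys of the exceptions dictionary could be checked in one pass. The sketch below uses invented toy data; in the notebook, `exceptions` maps PMC IDs to Exception objects as shown above.

```python
import pandas as pd

# Toy stand-ins for the exceptions dictionary and the merged DataFrame.
exceptions = {"PMC2": Exception(""), "PMC3": Exception("")}
merged = pd.DataFrame(
    {"pmcid": ["PMC1", "PMC2", "PMC3"], "text": ["a", "\nb", "\nc"]}
)

# Rows whose pmcid is one of the exception keys.
recovered = merged[merged["pmcid"].isin(list(exceptions))]

# IDs that raised an exception but are absent from the merged frame.
missing_ids = set(exceptions) - set(recovered["pmcid"])

# True where the recovered row has non-blank text.
has_text = recovered["text"].fillna("").str.strip().ne("")

print(len(recovered), missing_ids, bool(has_text.all()))
```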


Check for any more rows with an empty string in the text column.

In [None]:
pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['text']=='']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
1772,PMC9364697,2022-07-06,2022-10-06,3<sup>a</sup> edizione Giornate della ricerca ...,Journal of preventive medicine and hygiene,0,10.15167/2421-4248/jpmh2022.63.1s1,https://europepmc.org/articles/PMC9364697?pdf=...,
1305,PMC8010379,2021-03-31,2022-11-08,Repurposing antiviral drugs on recently emerge...,Materials today. Proceedings,"Swathi K, Nikitha B, Chandrakala B, Lakshmanad...",10.1016/j.matpr.2021.03.143,https://europepmc.org/articles/PMC8010379?pdf=...,


The two rows identified earlier are present: PMC9364697, the Italian-language journal edition, and PMC8010379, the withdrawn article.

In [None]:
pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['pmcid']=='PMC9364697']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
1772,PMC9364697,2022-07-06,2022-10-06,3<sup>a</sup> edizione Giornate della ricerca ...,Journal of preventive medicine and hygiene,0,10.15167/2421-4248/jpmh2022.63.1s1,https://europepmc.org/articles/PMC9364697?pdf=...,


In [None]:
pmc_search_results_full_text_merged[pmc_search_results_full_text_merged['pmcid']=='PMC8010379']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
1305,PMC8010379,2021-03-31,2022-11-08,Repurposing antiviral drugs on recently emerge...,Materials today. Proceedings,"Swathi K, Nikitha B, Chandrakala B, Lakshmanad...",10.1016/j.matpr.2021.03.143,https://europepmc.org/articles/PMC8010379?pdf=...,


Create a copy of the combined DataFrame and drop the two rows.

In [None]:
pmc_search_results_full_text_merged_new = pmc_search_results_full_text_merged.copy()

In [None]:
pmc_search_results_full_text_merged_new.drop([1772,1305], axis=0, inplace=True)
pmc_search_results_full_text_merged_new

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr..."
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co..."
...,...,...,...,...,...,...,...,...,...
9797,PMC7162151,2020-02-28,2020-04-21,News.,Chemistry & industry,0,10.1002/cind.842_3.x,https://europepmc.org/articles/PMC7162151?pdf=...,\n\n\nRestricting the amount of the amino acid...
9809,PMC9094125,2022-05-11,2022-07-16,Recent advances in metal-organic framework-bas...,Nano research,"Yang M, Zhang J, Wei Y, Zhang J, Tao C.",10.1007/s12274-022-4302-x,https://europepmc.org/articles/PMC9094125?pdf=...,\n\n\n\n\n\n\n\n\n\n\n\n
10399,PMC7492056,2020-09-15,2020-09-28,Full Issue PDF.,JACC. CardioOncology,0,10.1016/s2666-0873(20)30180-0,https://europepmc.org/articles/PMC7492056?pdf=...,\nT he survival of children with cancer has co...
10767,PMC8459260,2021-02-25,2022-01-25,Getting in touch with your senses: Mechanisms ...,WIREs mechanisms of disease,"Gupta S, Butler SJ.",10.1002/wsbm.1520,https://europepmc.org/articles/PMC8459260?pdf=...,\nSomatosensation is essential for survival. I...


In [None]:
len(pmc_search_results_full_text_merged_new)

11395
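
Dropping by hard-coded index labels works here because the labels are known and unique. As an alternative sketch on invented toy data, the same rows could be selected with a boolean mask, which avoids copying index labels by hand:

```python
import pandas as pd

# Toy frame with a non-consecutive index, as after concatenation.
merged = pd.DataFrame(
    {"pmcid": ["PMC1", "PMC2", "PMC3", "PMC4"],
     "text": ["body", "", "more", ""]},
    index=[0, 5, 2, 9],
)

# Mirror the notebook's check: rows whose text is exactly the empty string.
empty = merged["text"].eq("")

dropped_ids = merged.loc[empty, "pmcid"].tolist()  # record what is removed
cleaned = merged.loc[~empty].copy()                # keep everything else

print(dropped_ids, len(cleaned))
```

Recording `dropped_ids` keeps a trace of which articles were removed, which can be useful when reporting the final article count.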

Reset the index so that it runs consecutively from 0 after the concatenation and the dropped rows.

In [None]:
pmc_search_results_full_text_merged_new.reset_index(drop=True, inplace=True)

Check that one of the articles (PMC9364697, previously at row 1772) has been dropped and that the DataFrame has been reindexed.

In [None]:
pmc_search_results_full_text_merged_new.loc[1771:1773]

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
1771,PMC8803929,2022-02-01,2022-11-07,"CoVac501, a self-adjuvanting peptide vaccine c...",Cell discovery,"Long Y, Sun J, Song TZ, Liu T, Tang F, Zhang X...",10.1038/s41421-021-00370-2,https://europepmc.org/articles/PMC8803929?pdf=...,The uncontrolled transmission and ongoing evol...
1772,PMC8889327,2022-02-01,2022-03-11,Organoid Studies in COVID-19 Research.,International journal of stem cells,"Kim J, Koo BK, Clevers H.",10.15283/ijsc21251,https://europepmc.org/articles/PMC8889327?pdf=...,"Global pandemics, such as bubonic plague and S..."
1773,PMC8929333,2022-02-01,2022-03-25,The interacting physiology of COVID-19 and the...,Pharmacology research & perspectives,"Lumbers ER, Head R, Smith GR, Delforce SJ, Jar...",10.1002/prp2.917,https://europepmc.org/articles/PMC8929333?pdf=...,ACE2angiotensin‐converting enzyme 2ACEIsangiot...
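
The same check can be expressed as assertions instead of a visual inspection. A minimal sketch on invented toy data, confirming that the dropped ID is absent and that the index is consecutive after `reset_index`:

```python
import pandas as pd

# Toy frame after a row has been dropped: index labels 0 and 2 remain.
cleaned = pd.DataFrame(
    {"pmcid": ["PMC1", "PMC3"], "text": ["a", "c"]}, index=[0, 2]
)
cleaned = cleaned.reset_index(drop=True)

dropped_ids = ["PMC2"]  # hypothetical ID that was removed

# The dropped ID no longer appears in the frame.
assert not cleaned["pmcid"].isin(dropped_ids).any()
# The index now runs 0, 1, ..., len - 1 with no gaps.
assert cleaned.index.equals(pd.RangeIndex(len(cleaned)))

print(cleaned.index.tolist())
```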


In [None]:
with open('2023-01-15_pmc_search_results_full_text_merged_new.pickle', 'wb') as f:
  pickle.dump(pmc_search_results_full_text_merged_new, f)

In [None]:
len(pmc_search_results_full_text_merged_new)

11395

Rename the pmcid column in the PMC DataFrame to article_id before combining it with the arXiv DataFrame.

In [None]:
pmc_search_results_full_text_merged_article_id = pmc_search_results_full_text_merged_new.copy()

In [None]:
pmc_search_results_full_text_merged_article_id.rename({'pmcid': 'article_id'}, axis=1, inplace=True)
pmc_search_results_full_text_merged_article_id

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr..."
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co..."
...,...,...,...,...,...,...,...,...,...
11390,PMC7162151,2020-02-28,2020-04-21,News.,Chemistry & industry,0,10.1002/cind.842_3.x,https://europepmc.org/articles/PMC7162151?pdf=...,\n\n\nRestricting the amount of the amino acid...
11391,PMC9094125,2022-05-11,2022-07-16,Recent advances in metal-organic framework-bas...,Nano research,"Yang M, Zhang J, Wei Y, Zhang J, Tao C.",10.1007/s12274-022-4302-x,https://europepmc.org/articles/PMC9094125?pdf=...,\n\n\n\n\n\n\n\n\n\n\n\n
11392,PMC7492056,2020-09-15,2020-09-28,Full Issue PDF.,JACC. CardioOncology,0,10.1016/s2666-0873(20)30180-0,https://europepmc.org/articles/PMC7492056?pdf=...,\nT he survival of children with cancer has co...
11393,PMC8459260,2021-02-25,2022-01-25,Getting in touch with your senses: Mechanisms ...,WIREs mechanisms of disease,"Gupta S, Butler SJ.",10.1002/wsbm.1520,https://europepmc.org/articles/PMC8459260?pdf=...,\nSomatosensation is essential for survival. I...


## 7. Import arXiv data

Import the DataFrame for the 301 arXiv articles with full text. The rows for three unavailable PDFs have already been dropped.

In [None]:
with open('2023-01-06_article_results_arxiv_301_full_text.pickle', 'rb') as f:
    article_results_arxiv_301 = pickle.load(f)

In [None]:
article_results_arxiv_301

Unnamed: 0,arxiv-id,published,revised,title,journal,authors,doi,pdf_url,text
0,2109.06377v4,2021-09-14,2022-12-22,ASGARD: A Single-cell Guided pipeline to Aid R...,,"Bing He, Yao Xiao, Haodong Liang, Qianhui Huan...",,http://export.arxiv.org/pdf/2109.06377v4,"\nHeterogeneity, or more specifically, the div..."
1,2212.09867v1,2022-12-19,2022-12-19,Detecting Contradictory COVID-19 Drug Efficacy...,,"Daniel N. Sosa, Malavika Suresh, Christopher P...",,http://export.arxiv.org/pdf/2212.09867v1,\nThe COVID-19 pandemic caused by the novel SA...
2,2212.09610v1,2022-12-19,2022-12-19,Drying of Bio-colloidal Sessile Droplets: Adva...,,"Anusuya Pal, Amalesh Gope, Anupam Sengupta",,http://export.arxiv.org/pdf/2212.09610v1,\nVirus Emulating Particles (VEP) White Blood ...
3,2212.03911v1,2022-12-07,2022-12-07,Analysis of Drug repurposing Knowledge graphs ...,,Ajay Kumar Gogineni,,http://export.arxiv.org/pdf/2212.03911v1,\nIn the initial stages of a viral outbreak su...
4,2212.01575v1,2022-12-03,2022-12-03,Multi-view deep learning based molecule design...,,"Chao Pang, Yu Wang, Yi Jiang, Ruheng Wang, Ran...",,http://export.arxiv.org/pdf/2212.01575v1,\nDe novo drug design is a time-consuming and ...
...,...,...,...,...,...,...,...,...,...
296,2003.13665v1,2020-03-30,2020-03-30,Genomics-guided molecular maps of coronavirus ...,,Gennadi Glinsky,,http://export.arxiv.org/pdf/2003.13665v1,\nCoronavirus pandemic 2020 caused by the newl...
297,2003.14258v1,2020-03-30,2020-03-30,Nanomechanical sonification of the 2019-nCoV c...,,Markus J. Buehler,,http://export.arxiv.org/pdf/2003.14258v1,\nProteins are the building blocks of virtuall...
298,2003.12454v1,2020-03-26,2020-03-26,A Machine Learning alternative to placebo-cont...,,"Ezequiel Alvarez, Federico Lamagna, Manuel Szewc",,http://export.arxiv.org/pdf/2003.12454v1,\nCurrent and last decades research in drug di...
299,2003.04524v1,2020-03-10,2020-03-10,"Old Drugs for Newly Emerging Viral Disease, CO...",,Mohammad Reza Dayer,,http://export.arxiv.org/pdf/2003.04524v1,\nThe outbreak of coronavirus in 2019-2020 is...


There is one outstanding exception: GROBID was unable to parse the PDF for 2007.09186v3 and raised Exception('[GENERAL] An exception occurred while running Grobid.'), so this row has no text.

In [None]:
article_results_arxiv_301[article_results_arxiv_301['arxiv-id']=='2007.09186v3']

Unnamed: 0,arxiv-id,published,revised,title,journal,authors,doi,pdf_url,text
173,2007.09186v3,2020-07-17,2020-10-07,AWS CORD-19 Search: A Neural Search Engine for...,,"Parminder Bhatia, Lan Liu, Kristjan Arumae, Ni...",,http://export.arxiv.org/pdf/2007.09186v3,


Rename the arxiv-id column in the arXiv DataFrame to article_id before combining it with the PMC DataFrame.

In [None]:
article_results_arxiv_301_article_id = article_results_arxiv_301.copy()

In [None]:
article_results_arxiv_301_article_id.rename({'arxiv-id': 'article_id'}, axis=1, inplace=True)
article_results_arxiv_301_article_id

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text
0,2109.06377v4,2021-09-14,2022-12-22,ASGARD: A Single-cell Guided pipeline to Aid R...,,"Bing He, Yao Xiao, Haodong Liang, Qianhui Huan...",,http://export.arxiv.org/pdf/2109.06377v4,"\nHeterogeneity, or more specifically, the div..."
1,2212.09867v1,2022-12-19,2022-12-19,Detecting Contradictory COVID-19 Drug Efficacy...,,"Daniel N. Sosa, Malavika Suresh, Christopher P...",,http://export.arxiv.org/pdf/2212.09867v1,\nThe COVID-19 pandemic caused by the novel SA...
2,2212.09610v1,2022-12-19,2022-12-19,Drying of Bio-colloidal Sessile Droplets: Adva...,,"Anusuya Pal, Amalesh Gope, Anupam Sengupta",,http://export.arxiv.org/pdf/2212.09610v1,\nVirus Emulating Particles (VEP) White Blood ...
3,2212.03911v1,2022-12-07,2022-12-07,Analysis of Drug repurposing Knowledge graphs ...,,Ajay Kumar Gogineni,,http://export.arxiv.org/pdf/2212.03911v1,\nIn the initial stages of a viral outbreak su...
4,2212.01575v1,2022-12-03,2022-12-03,Multi-view deep learning based molecule design...,,"Chao Pang, Yu Wang, Yi Jiang, Ruheng Wang, Ran...",,http://export.arxiv.org/pdf/2212.01575v1,\nDe novo drug design is a time-consuming and ...
...,...,...,...,...,...,...,...,...,...
296,2003.13665v1,2020-03-30,2020-03-30,Genomics-guided molecular maps of coronavirus ...,,Gennadi Glinsky,,http://export.arxiv.org/pdf/2003.13665v1,\nCoronavirus pandemic 2020 caused by the newl...
297,2003.14258v1,2020-03-30,2020-03-30,Nanomechanical sonification of the 2019-nCoV c...,,Markus J. Buehler,,http://export.arxiv.org/pdf/2003.14258v1,\nProteins are the building blocks of virtuall...
298,2003.12454v1,2020-03-26,2020-03-26,A Machine Learning alternative to placebo-cont...,,"Ezequiel Alvarez, Federico Lamagna, Manuel Szewc",,http://export.arxiv.org/pdf/2003.12454v1,\nCurrent and last decades research in drug di...
299,2003.04524v1,2020-03-10,2020-03-10,"Old Drugs for Newly Emerging Viral Disease, CO...",,Mohammad Reza Dayer,,http://export.arxiv.org/pdf/2003.04524v1,\nThe outbreak of coronavirus in 2019-2020 is...
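The copy-then-rename step can also be written as a single non-mutating call, since `rename` returns a new DataFrame by default. A minimal sketch on a toy DataFrame (the identifiers and titles below are illustrative placeholders, not rows from the real data):

```python
import pandas as pd

# Toy stand-in for the arXiv DataFrame (values are illustrative only).
arxiv_toy = pd.DataFrame({
    'arxiv-id': ['0000.00001v1', '0000.00002v1'],
    'title': ['Toy title A', 'Toy title B'],
})

# rename() returns a new DataFrame unless inplace=True is passed,
# so the original is left untouched and no explicit .copy() is needed.
arxiv_toy_article_id = arxiv_toy.rename(columns={'arxiv-id': 'article_id'})

print(list(arxiv_toy.columns))             # original still has 'arxiv-id'
print(list(arxiv_toy_article_id.columns))  # renamed copy has 'article_id'
```

Either form gives the same result; the non-mutating version simply avoids keeping track of which object was modified in place.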


## 8. Combine PMC and arXiv data

Create a new merged dataset by concatenating the PMC and arXiv DataFrames, which now share a generic article_id column.

In [None]:
pmc_arxiv_full_text_merged = pd.concat([pmc_search_results_full_text_merged_article_id, article_results_arxiv_301_article_id], ignore_index=True)

In [None]:
pmc_arxiv_full_text_merged

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr..."
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co..."
...,...,...,...,...,...,...,...,...,...
11691,2003.13665v1,2020-03-30,2020-03-30,Genomics-guided molecular maps of coronavirus ...,,Gennadi Glinsky,,http://export.arxiv.org/pdf/2003.13665v1,\nCoronavirus pandemic 2020 caused by the newl...
11692,2003.14258v1,2020-03-30,2020-03-30,Nanomechanical sonification of the 2019-nCoV c...,,Markus J. Buehler,,http://export.arxiv.org/pdf/2003.14258v1,\nProteins are the building blocks of virtuall...
11693,2003.12454v1,2020-03-26,2020-03-26,A Machine Learning alternative to placebo-cont...,,"Ezequiel Alvarez, Federico Lamagna, Manuel Szewc",,http://export.arxiv.org/pdf/2003.12454v1,\nCurrent and last decades research in drug di...
11694,2003.04524v1,2020-03-10,2020-03-10,"Old Drugs for Newly Emerging Viral Disease, CO...",,Mohammad Reza Dayer,,http://export.arxiv.org/pdf/2003.04524v1,\nThe outbreak of coronavirus in 2019-2020 is...
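Because `pd.concat` fills NaN for any column present in only one input, it may be worth confirming that both frames have identical columns before concatenating. A minimal sketch, assuming two toy frames in place of the real PMC and arXiv DataFrames:

```python
import pandas as pd

# Toy stand-ins for the PMC and arXiv frames (illustrative values only).
pmc_toy = pd.DataFrame({
    'article_id': ['PMC0000001', 'PMC0000002'],
    'title': ['Toy PMC title 1', 'Toy PMC title 2'],
})
arxiv_toy = pd.DataFrame({
    'article_id': ['0000.00001v1'],
    'title': ['Toy arXiv title'],
})

# Fail early if the column names or their order differ between the frames.
assert list(pmc_toy.columns) == list(arxiv_toy.columns)

# ignore_index=True discards the original indices and numbers rows 0..n-1.
merged_toy = pd.concat([pmc_toy, arxiv_toy], ignore_index=True)

# The merged frame should contain every row from both inputs.
assert len(merged_toy) == len(pmc_toy) + len(arxiv_toy)
print(merged_toy)
```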


## 9. Check for duplicate titles

Check for duplicate titles, for example any titles appearing in both PMC and arXiv databases.

In [None]:
duplicate_titles = pmc_arxiv_full_text_merged[pmc_arxiv_full_text_merged.duplicated('title', keep=False)].sort_values('title')
duplicate_titles

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text
3964,PMC9040466,2022-01-01,2022-12-01,CKD: The burden of disease invisible to resear...,Nefrologia,"AIRG-E, EKPF, ALCER, FRIAT, REDINREN, RICORS20...",10.1016/j.nefroe.2021.09.005,https://europepmc.org/articles/PMC9040466?pdf=...,The present manuscript summarizes key features...
4449,PMC8596203,2021-11-17,2022-11-30,CKD: The burden of disease invisible to resear...,Nefrologia : publicacion oficial de la Socieda...,"AIRG-E, EKPF, ALCER, FRIAT, REDINREN, RICORS20...",10.1016/j.nefro.2021.09.004,https://europepmc.org/articles/PMC8596203?pdf=...,The present manuscript summarizes key features...
11369,PMC7247787,2020-05-25,2020-09-28,Full Issue PDF.,JACC. Basic to translational science,0,10.1016/s2452-302x(20)30205-9,https://europepmc.org/articles/PMC7247787?pdf=...,\nprocedure and indicate that the institutiona...
11392,PMC7492056,2020-09-15,2020-09-28,Full Issue PDF.,JACC. CardioOncology,0,10.1016/s2666-0873(20)30180-0,https://europepmc.org/articles/PMC7492056?pdf=...,\nT he survival of children with cancer has co...
5528,PMC7232076,2020-05-18,2021-01-27,Is the anti-filarial drug diethylcarbamazine u...,Medical hypotheses,"Abeygunasekera A, Jayasinghe S.",10.1016/j.mehy.2020.109843,https://europepmc.org/articles/PMC7232076?pdf=...,SARS-CoV-2 virus has caused a pandemic with ap...
11677,2004.08491v1,2020-04-18,2020-04-18,Is the anti-filarial drug diethylcarbamazine u...,,"Anuruddha Abeygunasekera, Saroj Jayasinghe",,http://export.arxiv.org/pdf/2004.08491v1,\nSARS-CoV-2 virus has caused a pandemic with ...
11353,PMC7280571,2020-05-22,2020-06-15,News.,Chemistry & industry,0,10.1002/cind.3_845.x,https://europepmc.org/articles/PMC7280571?pdf=...,\n\nA smartphone test for Covid-19 is being de...
11367,PMC7280669,2020-05-22,2020-06-15,News.,Chemistry & industry,0,10.1002/cind.5_845.x,https://europepmc.org/articles/PMC7280669?pdf=...,\nA fundamental physics technique shows promis...
11390,PMC7162151,2020-02-28,2020-04-21,News.,Chemistry & industry,0,10.1002/cind.842_3.x,https://europepmc.org/articles/PMC7162151?pdf=...,\n\n\nRestricting the amount of the amino acid...
8729,PMC7489339,2020-09-14,2021-06-29,Targeting the Linear Ubiquitin Assembly Comple...,Archivos de bronconeumologia,"Brazee PL, Sznajder JI.",10.1016/j.arbr.2020.04.008,https://europepmc.org/articles/PMC7489339?pdf=...,Seasonal influenza A viral infection affects a...


In [None]:
len(duplicate_titles)

11

There are 11 rows with duplicated titles, covering five distinct titles, that require further investigation.

*   An English and a Spanish edition of the same article appear in PMC - we will drop the Spanish edition (PMC8596203).
*   Two entries entitled 'Full Issue PDF.' in PMC are different documents - both will be kept.
*   One title appears in both PMC and arXiv - we will keep the PMC version, which has a journal reference, a DOI and more recent published and revised dates, and drop the arXiv version (2004.08491v1).
*   Three entries from the same journal are entitled 'News.' in PMC but are different articles - we will keep all three.
*   Another article appears in PMC in both English and Spanish editions - we will drop the Spanish edition (PMC7218391).
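A per-title summary can make this kind of review quicker by showing how many rows share each title and which sources they come from. A minimal sketch on toy data, assuming (for illustration only) that PMC identifiers start with 'PMC' and everything else is arXiv:

```python
import pandas as pd

# Toy data: one title duplicated within PMC, one shared between PMC and arXiv.
toy = pd.DataFrame({
    'article_id': ['PMC0000001', 'PMC0000002', '0000.00001v1',
                   'PMC0000003', 'PMC0000004'],
    'title': ['Same title', 'Same title', 'Shared title',
              'Shared title', 'Unique title'],
})

# keep=False flags every member of a duplicated group, not just the repeats.
dupes = toy[toy.duplicated('title', keep=False)].copy()

# Assumption: a 'PMC' prefix identifies PMC rows; all others are arXiv.
is_pmc = dupes['article_id'].str.startswith('PMC')
dupes['source'] = is_pmc.map({True: 'PMC', False: 'arXiv'})

# One line per duplicated title: row count and the sources involved.
summary = dupes.groupby('title')['source'].agg(
    rows='count',
    sources=lambda s: ', '.join(sorted(set(s))),
)
print(summary)
```

The summary only narrows down where to look; deciding which rows are true duplicates still needs a manual check of journals, DOIs and text, as in the list above.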




In [None]:
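# Match on the article_id column only, so an identifier appearing in another column cannot select a row by accident
rows_to_drop = pmc_arxiv_full_text_merged[pmc_arxiv_full_text_merged['article_id'].isin(['PMC8596203', '2004.08491v1', 'PMC7218391'])]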
rows_to_drop

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text
4449,PMC8596203,2021-11-17,2022-11-30,CKD: The burden of disease invisible to resear...,Nefrologia : publicacion oficial de la Socieda...,"AIRG-E, EKPF, ALCER, FRIAT, REDINREN, RICORS20...",10.1016/j.nefro.2021.09.004,https://europepmc.org/articles/PMC8596203?pdf=...,The present manuscript summarizes key features...
9000,PMC7218391,2020-05-13,2021-06-24,Targeting the Linear Ubiquitin Assembly Comple...,Archivos de bronconeumologia,"Brazee PL, Sznajder JI.",10.1016/j.arbres.2020.04.019,https://europepmc.org/articles/PMC7218391?pdf=...,Seasonal influenza A viral infection affects a...
11677,2004.08491v1,2020-04-18,2020-04-18,Is the anti-filarial drug diethylcarbamazine u...,,"Anuruddha Abeygunasekera, Saroj Jayasinghe",,http://export.arxiv.org/pdf/2004.08491v1,\nSARS-CoV-2 virus has caused a pandemic with ...


Create a new combined DataFrame dropping the three rows.

In [None]:
# Drop the selected rows by index label, then renumber the remaining rows from 0
pmc_arxiv_full_text_merged_new = pmc_arxiv_full_text_merged.drop(index=rows_to_drop.index).reset_index(drop=True)
pmc_arxiv_full_text_merged_new

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr..."
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co..."
...,...,...,...,...,...,...,...,...,...
11688,2003.13665v1,2020-03-30,2020-03-30,Genomics-guided molecular maps of coronavirus ...,,Gennadi Glinsky,,http://export.arxiv.org/pdf/2003.13665v1,\nCoronavirus pandemic 2020 caused by the newl...
11689,2003.14258v1,2020-03-30,2020-03-30,Nanomechanical sonification of the 2019-nCoV c...,,Markus J. Buehler,,http://export.arxiv.org/pdf/2003.14258v1,\nProteins are the building blocks of virtuall...
11690,2003.12454v1,2020-03-26,2020-03-26,A Machine Learning alternative to placebo-cont...,,"Ezequiel Alvarez, Federico Lamagna, Manuel Szewc",,http://export.arxiv.org/pdf/2003.12454v1,\nCurrent and last decades research in drug di...
11691,2003.04524v1,2020-03-10,2020-03-10,"Old Drugs for Newly Emerging Viral Disease, CO...",,Mohammad Reza Dayer,,http://export.arxiv.org/pdf/2003.04524v1,\nThe outbreak of coronavirus in 2019-2020 is...
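Dropping rows by identifier can also be written as a boolean mask on article_id. A minimal sketch on toy data (identifiers are placeholders, not the real ones):

```python
import pandas as pd

# Toy frame with two pairs of duplicate titles (illustrative values only).
toy = pd.DataFrame({
    'article_id': ['PMC0000001', 'PMC0000002', '0000.00001v1', 'PMC0000003'],
    'title': ['Toy A', 'Toy A', 'Toy B', 'Toy B'],
})
ids_to_drop = ['PMC0000002', '0000.00001v1']

# Keep every row whose article_id is NOT in the drop list,
# then renumber the remaining rows from 0.
keep_mask = ~toy['article_id'].isin(ids_to_drop)
toy_new = toy[keep_mask].reset_index(drop=True)
print(toy_new)
```

Unlike an approach based on `drop_duplicates`, a mask on article_id cannot remove rows that merely happen to be identical to one another.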


Check that one of the dropped articles (2004.08491v1, previously at row 11677) has been removed and that the DataFrame has been reindexed.

In [None]:
pmc_arxiv_full_text_merged_new.loc[11676:11678]

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text
11676,2004.07750v1,2020-04-16,2020-04-16,Extracting the effective contact rate of COVID...,,"Gaurav Goswami, Jayanti Prasad, Mansi Dhuria",,http://export.arxiv.org/pdf/2004.07750v1,"\nWithin a few months of its first outbreak, C..."
11677,2003.10642v2,2020-03-24,2020-04-15,In Silico Investigations on the Potential Inhi...,,"Ambrish Kumar Srivastava, Abhishek Kumar, Garg...",,http://export.arxiv.org/pdf/2003.10642v2,"\nAt the beginning of this year, the coronavir..."
11678,2004.07086v1,2020-04-15,2020-04-15,Prediction of potential inhibitors for RNA-dep...,International Journal of Biological Macromolec...,"Md. Sorwer Alam Parvez, Md. Adnan Karim, Mahmu...",10.1016/j.ijbiomac.2020.09.098,http://export.arxiv.org/pdf/2004.07086v1,\nThe pandemic Corona Virus Disease 19 (COVID-...
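The visual check above can be complemented by assertions that cover all dropped identifiers at once. A minimal sketch, assuming a toy frame that stands in for the cleaned DataFrame:

```python
import pandas as pd

# Toy stand-in for the cleaned frame (illustrative identifiers only).
toy_new = pd.DataFrame({
    'article_id': ['PMC0000001', 'PMC0000003', '0000.00002v1'],
})
dropped_ids = ['PMC0000002', '0000.00001v1']

# None of the dropped identifiers should remain in the cleaned frame.
still_present = toy_new['article_id'].isin(dropped_ids)
assert not still_present.any()

# After reset_index(drop=True) the index should run 0..n-1 with no gaps.
assert toy_new.index.equals(pd.RangeIndex(len(toy_new)))
print('all checks passed')
```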


In [None]:
len(pmc_arxiv_full_text_merged_new)

11692

In [None]:
with open('2023-01-15_pmc_arxiv_full_text_merged_new.pickle', 'wb') as f:
  pickle.dump(pmc_arxiv_full_text_merged_new, f)
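pandas also provides `to_pickle` and `read_pickle`, and a quick round trip can confirm that a saved file reloads to an identical DataFrame. A minimal sketch using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the merged data (illustrative values only).
toy = pd.DataFrame({
    'article_id': ['PMC0000001'],
    'text': ['Some body text'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this example.
    path = os.path.join(tmp_dir, 'toy_merged.pickle')
    toy.to_pickle(path)
    reloaded = pd.read_pickle(path)

# DataFrame.equals compares values, dtypes and index.
round_trip_ok = toy.equals(reloaded)
print(round_trip_ok)
```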

This leaves only the one arXiv article for which GROBID did not extract any text (2007.09186v3), which will be handled separately using more manual methods.

In [None]:
pmc_arxiv_full_text_merged_new.loc[pmc_arxiv_full_text_merged_new['text'].isna()]

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text
11566,2007.09186v3,2020-07-17,2020-10-07,AWS CORD-19 Search: A Neural Search Engine for...,,"Parminder Bhatia, Lan Liu, Kristjan Arumae, Ni...",,http://export.arxiv.org/pdf/2007.09186v3,
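`isna()` only catches true missing values, so rows whose text is an empty or whitespace-only string would not be flagged. A minimal sketch of a broader check, on toy data (whether such rows exist in the real dataset is not established here):

```python
import numpy as np
import pandas as pd

# Toy frame: one real text, one NaN, one whitespace-only string.
toy = pd.DataFrame({
    'article_id': ['PMC0000001', '0000.00001v1', '0000.00002v1'],
    'text': ['Some body text', np.nan, '   '],
})

# True missing values only.
is_nan = toy['text'].isna()

# Treat NaN as '' and strip whitespace, so NaN, '' and '   ' all count as blank.
is_blank = toy['text'].fillna('').str.strip().eq('')

missing = toy[is_blank]
print(missing['article_id'].tolist())
```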
