### PubMed Data Extraction and DataFrame Creation for Post-acute sequelae of SARS-CoV-2 Infection Publications
This notebook is designed to retrieve information about scientific articles related to "post-COVID-19 syndrome" or "post-acute sequelae of SARS-CoV-2 infection". It achieves this by:

1. **Accessing PubMed:** It utilizes the PubMed database to search for relevant articles using specific keywords and filters (availability, language, publication type, etc.).
2. **Extracting Key Identifiers:** It retrieves PubMed IDs (PMIDs) and PubMed Central IDs (PMCIDs) for the identified articles.
3. **Collecting Article Details:** For each article, it extracts the title and attempts to find a link to the full-text PDF.
4. **Creating a Structured Dataset:** The extracted information (PMID, PMCID, title, PDF link, and article category) is organized into a pandas DataFrame.
5. **Data Cleaning:** Duplicate entries and entries with missing PMIDs are removed from the DataFrame.

The resulting DataFrame provides a structured overview of the collected documents, serving as a valuable resource before further processing or transformation into a vector database.

#### Imports and Installation
This section imports the Python libraries used for web scraping, data processing, and file operations: requests, bs4, pandas, regex, math, concurrent.futures, urllib.parse, time, shutil, and os.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import regex as re
import math
import concurrent.futures
from urllib.parse import urljoin

In [2]:
import time
import shutil
import os

#### Source and Filter Initiation
This section defines the base URL for querying the PubMed database and sets up various filters to refine the search results. The filters include:

1. **Availability:** Limiting results to free full-text articles.
2. **Language:** Restricting results to English publications.
3. **Display Settings:** Specifying the number of results per page.
4. **Format Display:** Setting the output format to include PubMed IDs (PMIDs).
5. **MEDLINE Filter:** Including articles indexed in MEDLINE.
6. **Sort Order:** Ordering results by publication date.

These filters are concatenated with the base URL to create the final source URL used for retrieving article information. The constructed source URL is then printed.
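
As an alternative to manual string concatenation, the same kind of URL can be assembled with `urllib.parse.urlencode`, which handles quoting and keeps repeated `filter` keys. This is a minimal sketch, not the notebook's method: `build_search_url` is a hypothetical helper, the parameter order differs from the string built in the cells that follow, and it is an assumption (untested here) that PubMed treats the order as irrelevant.

```python
from urllib.parse import urlencode, parse_qsl, urlsplit

BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"

def build_search_url(term, filters, size=200, fmt="pmid", sort="pubdate", page=None):
    """Assemble a PubMed search URL; repeated `filter` keys are kept in order."""
    params = [("term", term)]
    params += [("filter", f) for f in filters]
    params += [("size", str(size)), ("format", fmt), ("sort", sort)]
    if page is not None:
        params.append(("page", str(page)))
    # urlencode percent-encodes quotes and parentheses and turns spaces into "+"
    return BASE_URL + "?" + urlencode(params)

term = '"post-covid-19 syndrome" OR ("post-acute sequelae of SARS-CoV-2 infection" AND "PACS")'
example_url = build_search_url(
    term,
    filters=["simsearch2.ffrft", "lang.english", "other.medline"],
)
print(example_url)
```

Passing `page=2` appends the pagination parameter, so the same helper could serve both the first request and later pages.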

In [3]:
url = """https://pubmed.ncbi.nlm.nih.gov/""" +\
      """?term="post-covid-19+syndrome"+""" +\
      """or+%28"post-acute+sequelae+of+SARS-CoV-2+infection"+and+"PACS"%29"""

In [4]:
availability_filter   = "&filter=simsearch2.ffrft"
language_filter       = "&filter=lang.english"
display_settings      = "&size=200"
format_display        = "&format=pmid"
medline_filter        = "&filter=other.medline"
sort_by_date          = "&sort=pubdate"

# named search_filters to avoid shadowing Python's built-in filter()
search_filters = availability_filter+language_filter+display_settings+format_display+medline_filter+sort_by_date

In [5]:
source = url+search_filters
print(source)

https://pubmed.ncbi.nlm.nih.gov/?term="post-covid-19+syndrome"+or+%28"post-acute+sequelae+of+SARS-CoV-2+infection"+and+"PACS"%29&filter=simsearch2.ffrft&filter=lang.english&size=200&format=pmid&filter=other.medline&sort=pubdate


#### PMID, PMCID, and Article Title Retrieval

This section focuses on extracting the core information about the relevant articles from PubMed. It performs the following steps:

1.  **Retrieve PMIDs by category:** The `get_pmids` function is called for each defined article type (e.g., "case reports", "clinical study"). This function constructs specific PubMed query URLs and extracts lists of PubMed IDs (PMIDs) for each category, handling pagination to retrieve results across multiple pages.
2.  **Fetch PMCIDs and Titles:** The `get_pmcid_and_titles` function takes the lists of PMIDs and queries PubMed again for each individual PMID. It parses the returned information to extract the PubMed Central ID (PMCID) and the article title. The function includes error handling and a delay to manage requests effectively.
3.  **Structure the data:** For each article type, the retrieved PMIDs, PMCIDs, and titles are stored as a list of dictionaries (one dictionary per article), providing a structured representation of the initial data.
4.  **Assign Categories:** The `get_category` function adds a 'category' key to each dictionary, clearly labeling the article type for later use.
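
The PMID-parsing part of step 1 can be tried offline. The sketch below assumes, as the notebook's code does, that a `format=pmid` results page carries one PMID per line inside a `<pre>` element. `PreTextParser` and `parse_pmids` are hypothetical names, the sample pages are hand-written stand-ins for real responses, and the standard-library `html.parser` is used here in place of BeautifulSoup so the example has no dependencies.

```python
from html.parser import HTMLParser

class PreTextParser(HTMLParser):
    """Collect the text inside the first <pre> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_pre = False
        self._done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self._done:
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_pre:
            self._in_pre = False
            self._done = True

    def handle_data(self, data):
        if self._in_pre:
            self.chunks.append(data)

def parse_pmids(html):
    """Return the PMIDs found in the page; an empty list means 'stop paging'."""
    parser = PreTextParser()
    parser.feed(html)
    text = "".join(parser.chunks)
    # splitlines() copes with "\r\n" and "\n"; blank or non-numeric lines are dropped
    return [line.strip() for line in text.splitlines() if line.strip().isdigit()]

# Hand-written stand-ins for a results page and a page past the last result
sample_page = "<html><body><pre>39621236\r\n39988835\r\n\r\n39093769</pre></body></html>"
empty_page = "<html><body><p>No results</p></body></html>"

print(parse_pmids(sample_page))
print(parse_pmids(empty_page))
```

Returning an empty list instead of raising lets a pagination loop stop with a plain `if not pmids: break`, and filtering with `isdigit()` keeps stray blank lines out of the PMID list.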

In [6]:
def get_pmids(article_type):
  list_of_pmids = []

  # extract the publication-type name (e.g. "casereports") once, for logging
  match    = re.search(r"pubt\.(.*)", article_type)
  category = match.group(1) if match else None

  # cap the crawl at 30 pages per category because of limited compute
  for i in range(30):
    pagination    = f"&page={i+1}"
    new_source    = source+article_type+pagination
    get_source    = requests.get(new_source, timeout=10)
    soup          = BeautifulSoup(get_source.content, 'html.parser')

    try:
      # with format=pmid the page holds one PMID per line inside a <pre> tag
      pmids = soup.find("pre").find(string=True).split("\r\n")
      list_of_pmids.extend(pmids)

      print(f"processing publications of {category}")
      print(f"retrieving page-{i+1}")
    except AttributeError:
      # no <pre> tag: the category has no results or the last page was passed
      print("done!\n")
      break
  return list_of_pmids

In [7]:
article_types = ["&filter=pubt.casereports",
                 "&filter=pubt.clinicalstudy",
                 "&filter=pubt.clinicaltrial",
                 "&filter=pubt.clinicaltrialphaseiii",
                 "&filter=pubt.clinicaltrialphaseiv",
                 "&filter=pubt.guideline",
                 "&filter=pubt.meta-analysis",
                 "&filter=pubt.patienteducationhandout",
                 "&filter=pubt.review",
                 "&filter=pubt.systematicreview"]

In [8]:
list_of_pmids_per_category = []

for i in article_types:
  cache = get_pmids(i)
  list_of_pmids_per_category.append(cache)

processing publications of casereports
retrieving page-1
done!

processing publications of clinicalstudy
retrieving page-1
done!

processing publications of clinicaltrial
retrieving page-1
done!

done!

done!

done!

processing publications of meta-analysis
retrieving page-1
done!

done!

processing publications of review
retrieving page-1
done!

processing publications of systematicreview
retrieving page-1
done!



In [9]:
len(list_of_pmids_per_category)

10

In [10]:
# labels follow the order of article_types (index 3 = phase III, ..., index 6 = meta-analysis)
print(f"successfully retrieving: {len(list_of_pmids_per_category[0])} pmids of case-report")
print(f"successfully retrieving: {len(list_of_pmids_per_category[1])} pmids of clinical-study")
print(f"successfully retrieving: {len(list_of_pmids_per_category[2])} pmids of clinical-trial")
print(f"successfully retrieving: {len(list_of_pmids_per_category[3])} pmids of clinical-trial-phase-three")
print(f"successfully retrieving: {len(list_of_pmids_per_category[4])} pmids of clinical-trial-phase-four")
print(f"successfully retrieving: {len(list_of_pmids_per_category[5])} pmids of guideline")
print(f"successfully retrieving: {len(list_of_pmids_per_category[6])} pmids of meta-analysis")
print(f"successfully retrieving: {len(list_of_pmids_per_category[7])} pmids of patient-education-handout")
print(f"successfully retrieving: {len(list_of_pmids_per_category[8])} pmids of review")
print(f"successfully retrieving: {len(list_of_pmids_per_category[9])} pmids of systematic-review")
print(f"\ntotal documents available: {sum([len(per_category) for per_category in list_of_pmids_per_category])}")

successfully retrieving: 13 pmids of case-report
successfully retrieving: 43 pmids of clinical-study
successfully retrieving: 11 pmids of clinical-trial
successfully retrieving: 0 pmids of clinical-trial-phase-three
successfully retrieving: 0 pmids of clinical-trial-phase-four
successfully retrieving: 0 pmids of guideline
successfully retrieving: 7 pmids of meta-analysis
successfully retrieving: 0 pmids of patient-education-handout
successfully retrieving: 65 pmids of review
successfully retrieving: 11 pmids of systematic-review

total documents available: 150
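
Positional indexing makes it easy for a label to drift away from the list it describes: in `article_types`, index 3 is the phase III filter and index 6 is the meta-analysis filter. One way to avoid hand-written indices is to key the results by the name embedded in each filter string. The sketch below is illustrative only: `category_from_filter` and `fake_results` are hypothetical, and the placeholder lists merely reproduce the per-index counts printed above rather than real PMIDs.

```python
import re

article_types = ["&filter=pubt.casereports",
                 "&filter=pubt.clinicalstudy",
                 "&filter=pubt.clinicaltrial",
                 "&filter=pubt.clinicaltrialphaseiii",
                 "&filter=pubt.clinicaltrialphaseiv",
                 "&filter=pubt.guideline",
                 "&filter=pubt.meta-analysis",
                 "&filter=pubt.patienteducationhandout",
                 "&filter=pubt.review",
                 "&filter=pubt.systematicreview"]

def category_from_filter(article_type):
    """Pull the publication-type name out of a '&filter=pubt.<name>' string."""
    match = re.search(r"pubt\.(.+)", article_type)
    return match.group(1) if match else None

# Placeholder lists whose lengths mirror the counts printed above, by index
fake_results = [["x"] * n for n in (13, 43, 11, 0, 0, 0, 7, 0, 65, 11)]

# Pair each filter with its result list so the label travels with the data
pmids_by_category = {
    category_from_filter(article_type): pmids
    for article_type, pmids in zip(article_types, fake_results)
}

for name, pmids in pmids_by_category.items():
    print(f"{name}: {len(pmids)} pmids")
```

With a dictionary like this, later cells could look up `pmids_by_category["meta-analysis"]` instead of remembering which index holds which publication type.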


In [11]:
def get_pmcid_and_titles(pmids, batch_size=10, delay=3):
    results = []
    session = requests.Session()
    total_batches = math.ceil(len(pmids) / batch_size)

    # Pre-compile regex patterns for efficiency
    pmc_pattern = re.compile(r"PMC - (PMC\d+)")
    title_pattern = re.compile(r"TI  - (.*?\.)(?=\n|\r|$)", re.DOTALL)

    for batch_idx in range(total_batches):
        start_idx = batch_idx * batch_size
        batch = pmids[start_idx:start_idx + batch_size]
        print(f"Processing batch {batch_idx + 1}/{total_batches}")

        for pmid in batch:
            url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/?format=pubmed"
            result = {'pmid': pmid, 'pmcid': None, 'title': None}

            try:
                response = session.get(url, timeout=10)
                response.raise_for_status()
                soup = BeautifulSoup(response.content, 'html.parser')
                text = soup.get_text()

                # Search the same page text with both patterns
                if pmc_match := pmc_pattern.search(text):
                    result['pmcid'] = pmc_match.group(1).strip()

                if title_match := title_pattern.search(text):
                    result['title'] = ' '.join(title_match.group(1).split())

            except requests.exceptions.RequestException as e:
                print(f"Request failed for PMID {pmid}: {str(e)[:100]}")
            except Exception as e:
                print(f"Error processing PMID {pmid}: {str(e)[:100]}")
            finally:
                results.append(result)

        # Add delay between batches (except after last batch)
        if batch_idx < total_batches - 1:
            print(f"Waiting {delay} seconds before next batch...")
            time.sleep(delay)

    return results
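
The title pattern above only matches when the title's final period sits at the end of a line, so a title ending in "?" or another character would be missed or over-captured. A line-based parser is one possible alternative. The sketch below rests on an assumption about the `format=pubmed` text layout: each field starts with a tag padded to four characters followed by `- `, and wrapped values continue on lines indented by six spaces. `parse_medline_fields` is a hypothetical helper and the sample record is invented for illustration.

```python
import re

# A field line looks like "TI  - value" or "PMID- value"
FIELD_RE = re.compile(r"^([A-Z]{2,4}) *- (.*)$")

def parse_medline_fields(text, wanted=("TI", "PMC")):
    """Return the first occurrence of each wanted field, joining wrapped lines."""
    fields = {}
    current = None  # tag whose continuation lines are being collected
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            tag, value = match.groups()
            current = tag if tag in wanted and tag not in fields else None
            if current:
                fields[current] = value.strip()
        elif current and line.startswith("      "):
            # continuation of the field being collected
            fields[current] += " " + line.strip()
        else:
            current = None
    return fields

# Invented record: the title wraps onto a second line and ends with "?"
sample_record = (
    "PMID- 00000000\n"
    "TI  - Does a hypothetical multi-line title that ends with a question mark still\n"
    "      parse correctly?\n"
    "AB  - Abstract text. More text.\n"
    "PMC - PMC0000000\n"
)

print(parse_medline_fields(sample_record))
```

Because the parser keys on the field tag rather than on punctuation, it does not depend on how a title ends.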

In [12]:
case_report_dict = get_pmcid_and_titles(list_of_pmids_per_category[0])

Processing batch 1/2
Waiting 3 seconds before next batch...
Processing batch 2/2


In [13]:
case_report_dict

[{'pmid': '39621236',
  'pmcid': 'PMC11971158',
  'title': 'Syncopes, paresis and loss of vision after COVID-19 mRNA-based vaccination and SARS-CoV-2 infection.'},
 {'pmid': '39988835',
  'pmcid': 'PMC11868965',
  'title': 'Severe COVID-19 Pneumonia, Opportunistic Candida krusei Infection, and Acute Respiratory Distress Syndrome with Pulmonary Arterial Hypertension Treated with Bosentan: A Case Report.'},
 {'pmid': '39093769',
  'pmcid': 'PMC11296404',
  'title': 'Integrative personalized medicine care for adjustment disorder of a post-COVID-19 patient: A CARE-compliant case report.'},
 {'pmid': '37606967',
  'pmcid': 'PMC10469423',
  'title': 'Severe pigeon paramyxovirus 1 infection in a human case with probable post-COVID-19 condition.'},
 {'pmid': '36617353',
  'pmcid': 'PMC9826535',
  'title': 'Post-COVID-19 syndrome increased the requirement for corticosteroids in a dialysis patient with preexisting adrenal insufficiency.'},
 {'pmid': '36581538',
  'pmcid': 'PMC9767890',
  'title'

In [14]:
clinicalstudy_dict = get_pmcid_and_titles(list_of_pmids_per_category[1])

Processing batch 1/5
Waiting 3 seconds before next batch...
Processing batch 2/5
Waiting 3 seconds before next batch...
Processing batch 3/5
Waiting 3 seconds before next batch...
Processing batch 4/5
Waiting 3 seconds before next batch...
Processing batch 5/5


In [15]:
clinicalstudy_dict

[{'pmid': '40065313',
  'pmcid': 'PMC11895281',
  'title': 'The legacy of the COVID-19 pandemic for the healthcare environment: the establishment of long COVID/ Post-COVID-19 condition follow-up outpatient clinics in Germany.'},
 {'pmid': '40003486',
  'pmcid': 'PMC11855376',
  'title': 'Comparative Analysis of Submaximal and Maximal Effort Capacities in Patients Post-COVID-19 and Individuals with Chronic Restrictive Lung Diseases.'},
 {'pmid': '39905419',
  'pmcid': 'PMC11792378',
  'title': 'Balneotherapy for the treatment of post-COVID syndrome: a randomized controlled trial.'},
 {'pmid': '39665835',
  'pmcid': 'PMC11922198',
  'title': 'Exercise rehabilitation in post COVID-19 patients: a randomized controlled trial of different training modalities.'},
 {'pmid': '39423759',
  'pmcid': None,
  'title': 'Analysis of fat oxidation capacity during cardiopulmonary exercise testing indicates long-lasting metabolic disturbance in patients with post-covid-19 syndrome.'},
 {'pmid': '3951934

In [16]:
clinicaltrial_dict = get_pmcid_and_titles(list_of_pmids_per_category[2])

Processing batch 1/2
Waiting 3 seconds before next batch...
Processing batch 2/2


In [17]:
clinicaltrial_dict

[{'pmid': '39905419',
  'pmcid': 'PMC11792378',
  'title': 'Balneotherapy for the treatment of post-COVID syndrome: a randomized controlled trial.'},
 {'pmid': '39665835',
  'pmcid': 'PMC11922198',
  'title': 'Exercise rehabilitation in post COVID-19 patients: a randomized controlled trial of different training modalities.'},
 {'pmid': '38937986',
  'pmcid': 'PMC11446691',
  'title': 'Telerehabilitation improves cardiorespiratory and muscular fitness and body composition in older people with post-COVID-19 syndrome.'},
 {'pmid': '38528512',
  'pmcid': 'PMC10964649',
  'title': 'Effect of home-based pulmonary rehabilitation on exercise capacity in post COVID-19 patients: a randomized controlled trail.'},
 {'pmid': '38213055',
  'pmcid': None,
  'title': 'Effectiveness of Internet-Based Group Supportive Psychotherapy on Psychic and Somatic Symptoms, Neutrophil-Lymphocyte Ratio, and Heart Rate Variability in Post COVID-19 Syndrome Patients.'},
 {'pmid': '37055254',
  'pmcid': 'PMC9981521',

In [18]:
meta_analysis_dict = get_pmcid_and_titles(list_of_pmids_per_category[3])

In [19]:
meta_analysis_dict  # empty: index 3 is the "&filter=pubt.clinicaltrialphaseiii" result, which returned no PMIDs

[]

In [20]:
guideline_dict = get_pmcid_and_titles(list_of_pmids_per_category[4])

In [21]:
guideline_dict  # empty: index 4 is the "&filter=pubt.clinicaltrialphaseiv" result, which returned no PMIDs

[]

In [22]:
clinicalfour_dict = get_pmcid_and_titles(list_of_pmids_per_category[5])

In [23]:
clinicalfour_dict  # empty: index 5 is the "&filter=pubt.guideline" result, which returned no PMIDs

[]

In [24]:
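# index 6 is the "&filter=pubt.meta-analysis" result, so despite its name this variable holds meta-analysis records
clinicalthree_dict = get_pmcid_and_titles(list_of_pmids_per_category[6])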

Processing batch 1/1


In [25]:
clinicalthree_dict

[{'pmid': '40305533',
  'pmcid': None,
  'title': 'Predictors of post-COVID-19 syndrome: a meta-analysis.'},
 {'pmid': '38834107',
  'pmcid': None,
  'title': 'Early use of oral antiviral drugs and the risk of post COVID-19 syndrome: A systematic review and network meta-analysis.'},
 {'pmid': '38321404',
  'pmcid': 'PMC10848453',
  'title': 'The global prevalence of depression, anxiety, and sleep disorder among patients coping with Post COVID-19 syndrome (long COVID): a systematic review and meta-analysis.'},
 {'pmid': '36990297',
  'pmcid': 'PMC10067136',
  'title': 'COVID-19 vaccination for the prevention and treatment of long COVID: A systematic review and meta-analysis.'},
 {'pmid': '35339066',
  'pmcid': 'PMC8934180',
  'title': 'Call for correction: Mid and long-term neurological and neuropsychiatric manifestations of post-COVID-19 syndrome: A meta-analysis.'},
 {'pmid': '35121209',
  'pmcid': 'PMC8798975',
  'title': 'Mid and long-term neurological and neuropsychiatric manifesta

In [26]:
patienthandsout_dict = get_pmcid_and_titles(list_of_pmids_per_category[7])

In [27]:
patienthandsout_dict

[]

In [28]:
review_dict = get_pmcid_and_titles(list_of_pmids_per_category[8])

Processing batch 1/7
Request failed for PMID 37834270: 429 Client Error: Too Many Requests for url: https://pubmed.ncbi.nlm.nih.gov/error/429.shtml
Waiting 3 seconds before next batch...
Processing batch 2/7
Waiting 3 seconds before next batch...
Processing batch 3/7
Waiting 3 seconds before next batch...
Processing batch 4/7
Waiting 3 seconds before next batch...
Processing batch 5/7
Waiting 3 seconds before next batch...
Processing batch 6/7
Waiting 3 seconds before next batch...
Processing batch 7/7
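
The log above shows one request rejected with HTTP 429, which leaves that record without a title. Retrying with an increasing delay is a common way to recover from this. The sketch below is not part of the notebook: `fetch_with_retry`, `TooManyRequests`, and `flaky_fetch` are hypothetical names, and the fetch function is simulated so the example runs without network access. With `requests`, the equivalent check would be on `response.status_code == 429`.

```python
import time

class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""

def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), doubling the wait after each 429-style failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Simulated fetcher: fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TooManyRequests(url)
    return f"record for {url}"

waits = []  # record the delays instead of actually sleeping
result = fetch_with_retry(
    flaky_fetch,
    "https://pubmed.ncbi.nlm.nih.gov/37834270/?format=pubmed",
    sleep=waits.append,
)
print(result)
print(waits)
```

Injecting `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` would apply the delays.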


In [29]:
review_dict  # three titles still need cleaning up (PMIDs 37106076, 36066294, 34140635)

[{'pmid': '40305533',
  'pmcid': None,
  'title': 'Predictors of post-COVID-19 syndrome: a meta-analysis.'},
 {'pmid': '39362575',
  'pmcid': None,
  'title': 'Immune Response and Cognitive Impairment in Post-COVID Syndrome: A Systematic Review.'},
 {'pmid': '39934846',
  'pmcid': 'PMC11818037',
  'title': 'Assessment of psychosocial aspects in adults in post-COVID-19 condition: the EURONET-SOMA recommendations on core outcome domains for clinical and research use.'},
 {'pmid': '39183058',
  'pmcid': 'PMC11436955',
  'title': 'Long COVID among healthcare workers: a narrative review of definitions, prevalence, symptoms, risk factors and impacts.'},
 {'pmid': '38834107',
  'pmcid': None,
  'title': 'Early use of oral antiviral drugs and the risk of post COVID-19 syndrome: A systematic review and network meta-analysis.'},
 {'pmid': '38392036',
  'pmcid': 'PMC10886368',
  'title': 'The Growing Understanding of the Pituitary Implication in the Pathogenesis of Long COVID-19 Syndrome: A Narra

In [30]:
system_review_dict = get_pmcid_and_titles(list_of_pmids_per_category[9])

Processing batch 1/2
Waiting 3 seconds before next batch...
Processing batch 2/2


In [31]:
system_review_dict

[{'pmid': '40305533',
  'pmcid': None,
  'title': 'Predictors of post-COVID-19 syndrome: a meta-analysis.'},
 {'pmid': '39362575',
  'pmcid': None,
  'title': 'Immune Response and Cognitive Impairment in Post-COVID Syndrome: A Systematic Review.'},
 {'pmid': '38834107',
  'pmcid': None,
  'title': 'Early use of oral antiviral drugs and the risk of post COVID-19 syndrome: A systematic review and network meta-analysis.'},
 {'pmid': '38321404',
  'pmcid': 'PMC10848453',
  'title': 'The global prevalence of depression, anxiety, and sleep disorder among patients coping with Post COVID-19 syndrome (long COVID): a systematic review and meta-analysis.'},
 {'pmid': '36990297',
  'pmcid': 'PMC10067136',
  'title': 'COVID-19 vaccination for the prevention and treatment of long COVID: A systematic review and meta-analysis.'},
 {'pmid': '36708608',
  'pmcid': 'PMC9840228',
  'title': 'Towards evidence-based and inclusive models of peer support for long covid: A hermeneutic systematic review.'},
 {'

In [32]:
def get_category(cat_dict, category_name):
  for item in cat_dict:
    item.update({"category": category_name})

In [33]:
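# clinicalthree_dict was built from index 6 ("&filter=pubt.meta-analysis"), so the rows
# labeled "clinical trial phase three" are meta-analysis records
get_category(clinicalthree_dict, "clinical trial phase three")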
get_category(clinicalfour_dict, "clinical trial phase four")
get_category(case_report_dict, "case report")
get_category(guideline_dict, "guideline")
get_category(patienthandsout_dict, "patient education hands out")
get_category(clinicalstudy_dict, "clinical study")
get_category(clinicaltrial_dict, "clinical trial")
get_category(review_dict, "review")
get_category(system_review_dict, "systematic review")

In [34]:
database = pd.DataFrame(case_report_dict+guideline_dict+clinicalfour_dict+clinicalthree_dict+patienthandsout_dict+clinicalstudy_dict+clinicaltrial_dict+review_dict+system_review_dict)
database.sample(10)

Unnamed: 0,pmid,pmcid,title,category
115,35309344,PMC8924116,NETosis and Neutrophil Extracellular Traps in ...,review
2,39093769,PMC11296404,Integrative personalized medicine care for adj...,case report
37,38055548,PMC10695477,Unsupervised natural language processing in th...,clinical study
70,36476156,PMC9829459,"Effects of a concurrent training, respiratory ...",clinical trial
141,38834107,,Early use of oral antiviral drugs and the risk...,systematic review
93,37047556,PMC10094973,COVID-19 and Diarylamidines: The Parasitic Con...,review
18,35121209,PMC8798975,Mid and long-term neurological and neuropsychi...,clinical trial phase three
30,38436080,,Low-intensity rehabilitation in persistent pos...,clinical study
98,36534494,,Assessment of life quality and health percepti...,review
114,35121209,PMC8798975,Mid and long-term neurological and neuropsychi...,review


In [35]:
database.isna().sum()

Unnamed: 0,0
pmid,0
pmcid,28
title,1
category,0


#### Creating PMC Source Links

This section generates the full links to the articles hosted on PubMed Central (PMC) using the extracted PMC IDs.

1.  **Iterating and Generating Links:** The code iterates through the list of PMC IDs obtained in the previous step. For each valid PMC ID, it constructs the complete URL to the article on the PMC website.
2.  **Handling Missing Links:** If an article does not have a corresponding PMC ID, the link for that article is set to `None`, and a message is printed to indicate the missing link.
3.  **Storing the Links:** The generated PMC links (or `None` for missing entries) are collected in a list.
4.  **Adding Links to DataFrame:** This list of PMC links is then added as a new column (`pmc_link`) to the `database` DataFrame, associating each article with its potential PMC source URL.
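
The link construction in steps 1 and 2 can be expressed as a small pure function. This is a minimal sketch under the assumption that a missing PMCID may show up as `None`, `NaN`, or an empty string; `build_pmc_link` is a hypothetical helper, and the URL pattern is the one used in the cell that follows.

```python
import math

PMC_BASE = "https://pmc.ncbi.nlm.nih.gov/articles/"

def build_pmc_link(pmcid):
    """Return the PMC article URL, or None when the PMCID is missing."""
    if pmcid is None:
        return None
    if isinstance(pmcid, float) and math.isnan(pmcid):
        return None
    pmcid = str(pmcid).strip()
    if not pmcid:
        return None
    return f"{PMC_BASE}{pmcid}/"

# Two PMCIDs from the records above, plus two kinds of missing value
sample_pmcids = ["PMC11971158", None, float("nan"), "PMC8798975"]
links = [build_pmc_link(p) for p in sample_pmcids]
print(links)
```

A function like this could also be applied column-wise, for example with `database["pmcid"].map(build_pmc_link)`, instead of building a separate list.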

In [36]:
pmcid_links = []
for i, pmcid in enumerate(database.pmcid):
  # pd.notna treats both None and NaN as missing
  if pd.notna(pmcid):
    link = f"https://pmc.ncbi.nlm.nih.gov/articles/{pmcid}/"
  else:
    link = None
    print(f"No PMC links found for index-{i}")
  pmcid_links.append(link)

No PMC links found for index-13
No PMC links found for index-14
No PMC links found for index-24
No PMC links found for index-30
No PMC links found for index-33
No PMC links found for index-39
No PMC links found for index-40
No PMC links found for index-42
No PMC links found for index-59
No PMC links found for index-67
No PMC links found for index-74
No PMC links found for index-75
No PMC links found for index-78
No PMC links found for index-83
No PMC links found for index-84
No PMC links found for index-85
No PMC links found for index-86
No PMC links found for index-92
No PMC links found for index-98
No PMC links found for index-102
No PMC links found for index-106
No PMC links found for index-116
No PMC links found for index-122
No PMC links found for index-133
No PMC links found for index-134
No PMC links found for index-139
No PMC links found for index-140
No PMC links found for index-141


In [37]:
pmcid_links

['https://pmc.ncbi.nlm.nih.gov/articles/PMC11971158/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC11868965/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC11296404/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC10469423/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC9826535/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC9767890/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC10102822/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC9943558/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC11132361/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC8996041/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC8855328/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC8561376/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC8050227/',
 None,
 None,
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC10848453/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC10067136/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC8934180/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC8798975/',
 'https://pmc.ncbi.nlm.nih.gov/articles/PMC8715665/',
 'http

In [38]:
database = pd.concat([database, pd.DataFrame({"pmc_link":pmcid_links})], axis=1)

In [39]:
database.sample(10)

Unnamed: 0,pmid,pmcid,title,category,pmc_link
117,34973396,PMC8715665,Fatigue and cognitive impairment in Post-COVID...,review,https://pmc.ncbi.nlm.nih.gov/articles/PMC8715665/
144,36708608,PMC9840228,Towards evidence-based and inclusive models of...,systematic review,https://pmc.ncbi.nlm.nih.gov/articles/PMC9840228/
102,35428040,,An overview of post COVID sequelae.,review,
75,39362575,,Immune Response and Cognitive Impairment in Po...,review,
148,34619491,PMC8482840,Onset and frequency of depression in post-COVI...,systematic review,https://pmc.ncbi.nlm.nih.gov/articles/PMC8482840/
79,38392036,PMC10886368,The Growing Understanding of the Pituitary Imp...,review,https://pmc.ncbi.nlm.nih.gov/articles/PMC10886...
145,35177424,PMC9258567,Molecular Imaging Findings on Acute and Long-T...,systematic review,https://pmc.ncbi.nlm.nih.gov/articles/PMC9258567/
4,36617353,PMC9826535,Post-COVID-19 syndrome increased the requireme...,case report,https://pmc.ncbi.nlm.nih.gov/articles/PMC9826535/
19,34973396,PMC8715665,Fatigue and cognitive impairment in Post-COVID...,clinical trial phase three,https://pmc.ncbi.nlm.nih.gov/articles/PMC8715665/
37,38055548,PMC10695477,Unsupervised natural language processing in th...,clinical study,https://pmc.ncbi.nlm.nih.gov/articles/PMC10695...


In [40]:
database.describe()

Unnamed: 0,pmid,pmcid,title,category,pmc_link
count,150,122,149,150,122
unique,122,100,121,6,100
top,36990297,PMC8715665,COVID-19 vaccination for the prevention and tr...,review,https://pmc.ncbi.nlm.nih.gov/articles/PMC8715665/
freq,3,3,3,65,3


In [41]:
database.duplicated(subset=['pmid']).sum()

np.int64(28)

In [42]:
database[database.duplicated(subset=['pmid'], keep=False)]

Unnamed: 0,pmid,pmcid,title,category,pmc_link
13,40305533,,Predictors of post-COVID-19 syndrome: a meta-a...,clinical trial phase three,
14,38834107,,Early use of oral antiviral drugs and the risk...,clinical trial phase three,
15,38321404,PMC10848453,"The global prevalence of depression, anxiety, ...",clinical trial phase three,https://pmc.ncbi.nlm.nih.gov/articles/PMC10848...
16,36990297,PMC10067136,COVID-19 vaccination for the prevention and tr...,clinical trial phase three,https://pmc.ncbi.nlm.nih.gov/articles/PMC10067...
18,35121209,PMC8798975,Mid and long-term neurological and neuropsychi...,clinical trial phase three,https://pmc.ncbi.nlm.nih.gov/articles/PMC8798975/
19,34973396,PMC8715665,Fatigue and cognitive impairment in Post-COVID...,clinical trial phase three,https://pmc.ncbi.nlm.nih.gov/articles/PMC8715665/
22,39905419,PMC11792378,Balneotherapy for the treatment of post-COVID ...,clinical study,https://pmc.ncbi.nlm.nih.gov/articles/PMC11792...
23,39665835,PMC11922198,Exercise rehabilitation in post COVID-19 patie...,clinical study,https://pmc.ncbi.nlm.nih.gov/articles/PMC11922...
27,38937986,PMC11446691,Telerehabilitation improves cardiorespiratory ...,clinical study,https://pmc.ncbi.nlm.nih.gov/articles/PMC11446...
34,38528512,PMC10964649,Effect of home-based pulmonary rehabilitation ...,clinical study,https://pmc.ncbi.nlm.nih.gov/articles/PMC10964...


In [43]:
clean_database = database.drop_duplicates(subset='pmid', keep='first')
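
Dropping duplicates with `keep='first'` retains only the category of the first occurrence, so an article returned under several publication types loses its other labels. If those labels matter downstream, one alternative is to merge the categories per PMID before building the DataFrame. The sketch below is a hypothetical variation, not the notebook's approach: `merge_by_pmid` is an invented helper, and the three sample records reuse PMIDs and categories that appear in the lists above.

```python
def merge_by_pmid(records):
    """Collapse records sharing a PMID, collecting their categories in order."""
    merged = {}
    for rec in records:
        # first occurrence supplies pmid/pmcid; later ones only add categories
        entry = merged.setdefault(rec["pmid"], {**rec, "categories": []})
        if rec["category"] not in entry["categories"]:
            entry["categories"].append(rec["category"])
    for entry in merged.values():
        entry.pop("category", None)  # replaced by the "categories" list
    return list(merged.values())

records = [
    {"pmid": "39905419", "pmcid": "PMC11792378", "category": "clinical study"},
    {"pmid": "39905419", "pmcid": "PMC11792378", "category": "clinical trial"},
    {"pmid": "39621236", "pmcid": "PMC11971158", "category": "case report"},
]

merged_records = merge_by_pmid(records)
for rec in merged_records:
    print(rec["pmid"], rec["categories"])
```

The merged list can be passed to `pd.DataFrame` in the same way as the per-category lists, giving one row per PMID with a list-valued `categories` column.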

In [44]:
clean_database.describe()

Unnamed: 0,pmid,pmcid,title,category,pmc_link
count,122,100,121,122,100
unique,122,100,121,4,100
top,39621236,PMC11971158,"Syncopes, paresis and loss of vision after COV...",review,https://pmc.ncbi.nlm.nih.gov/articles/PMC11971...
freq,1,1,1,59,1


In [45]:
clean_database.category.unique()

array(['case report', 'clinical trial phase three', 'clinical study',
       'review'], dtype=object)

In [46]:
clean_database.sample(10)

Unnamed: 0,pmid,pmcid,title,category,pmc_link
104,35489015,PMC9055372,Evidence mapping and review of long-COVID and ...,review,https://pmc.ncbi.nlm.nih.gov/articles/PMC9055372/
129,34024217,PMC8146298,Long COVID or post-COVID-19 syndrome: putative...,review,https://pmc.ncbi.nlm.nih.gov/articles/PMC8146298/
103,35138001,PMC9111040,Neurological complications associated with Cov...,review,https://pmc.ncbi.nlm.nih.gov/articles/PMC9111040/
125,34319569,PMC8317481,"Long COVID, a comprehensive systematic scoping...",review,https://pmc.ncbi.nlm.nih.gov/articles/PMC8317481/
24,39423759,,Analysis of fat oxidation capacity during card...,clinical study,
109,35632823,PMC9147674,Dysregulated Immune Responses in SARS-CoV-2-In...,review,https://pmc.ncbi.nlm.nih.gov/articles/PMC9147674/
130,34384972,PMC8317446,"Post COVID-19 Syndrome (""Long COVID"") and Diab...",review,https://pmc.ncbi.nlm.nih.gov/articles/PMC8317446/
1,39988835,PMC11868965,"Severe COVID-19 Pneumonia, Opportunistic Candi...",case report,https://pmc.ncbi.nlm.nih.gov/articles/PMC11868...
81,38139400,PMC10743535,Resveratrol and Gut Microbiota Synergy: Preven...,review,https://pmc.ncbi.nlm.nih.gov/articles/PMC10743...
7,36822507,PMC9943558,Leveraging Serologic Testing to Identify Child...,case report,https://pmc.ncbi.nlm.nih.gov/articles/PMC9943558/


#### Scraping the PDF Download Links

This section resolves a direct PDF download link for each article identified in the previous steps:

1.  **Setting up Selenium:** `make_driver` configures a headless Chrome driver (`--headless=new`, `--no-sandbox`, `--disable-dev-shm-usage`) with the "eager" page-load strategy, so navigation returns once the DOM is ready instead of waiting for every resource to load. A `headers` dictionary with a browser-like user agent is also defined.
2.  **Defining the `get_final_pdf_url` function:** This function takes a PMC article link, appends "/pdf/" to it, loads that address in the browser, and returns the URL the browser ends up on. This matters because the "/pdf/" address on PMC redirects to the actual PDF file location. A 15-second page-load timeout prevents hangs, a `WebDriverException` is reported and yields `None`, and the driver is always closed afterwards.
3.  **Executing the scraping:** The PMC links in `clean_database` are processed one at a time, with a 2-second pause after each request to be polite to the server. Rows without a PMC link receive `NaN`.
4.  **Adding download links to the DataFrame:** The resolved PDF links (or `None` if resolution failed) are collected in a list and added as a new column (`download_link`) next to the columns of `clean_database`, producing `database_pdfs`.
5.  **Handling missing PMIDs:** Rows with missing PMIDs are dropped from the resulting DataFrame.

The resulting DataFrame (`database_pdfs`) now includes a column with direct links to the PDF versions of the articles, where available, obtained by following the redirect to the actual PDF location.
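
The URL handling in this step can be sketched without a browser. The example assumes, based on the resolved links printed further down, that a successful redirect ends on a path with a `.pdf` suffix; `pdf_entry_url` and `looks_like_pdf` are hypothetical helpers for building the entry address and sanity-checking whatever URL a resolver returns.

```python
from urllib.parse import urlsplit

def pdf_entry_url(pmc_url):
    """Build the '/pdf/' entry address for a PMC article link."""
    return pmc_url.rstrip("/") + "/pdf/"

def looks_like_pdf(resolved_url):
    """True when the resolved URL's path ends in '.pdf' (query string ignored)."""
    if not resolved_url:
        return False
    return urlsplit(resolved_url).path.lower().endswith(".pdf")

article = "https://pmc.ncbi.nlm.nih.gov/articles/PMC9767890/"
entry = pdf_entry_url(article)
resolved = "https://pmc.ncbi.nlm.nih.gov/articles/PMC9767890/pdf/main.pdf"

print(entry)
print(looks_like_pdf(resolved))  # redirect reached a PDF file
print(looks_like_pdf(entry))     # still on the entry address: no redirect happened
```

A check like `looks_like_pdf` makes it possible to flag rows where the browser stayed on the "/pdf/" address or landed on an error page, rather than storing that URL as if it were a download link.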

In [68]:
headers = {
    "content-type": "text/html; charset=utf-8",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "pragma": "no-cache",
}

In [77]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException
import time

In [78]:
def make_driver():
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.page_load_strategy = "eager"  # Faster than default "normal"
    return webdriver.Chrome(options=options)

In [79]:
def get_final_pdf_url(pmc_url: str) -> str | None:
    driver = None  # keeps the cleanup below safe if make_driver() itself fails
    try:
        driver = make_driver()
        driver.set_page_load_timeout(15)  # Prevent infinite hangs

        # PMC redirects ".../pdf/" to the actual PDF file
        pdf_url = pmc_url.rstrip("/") + "/pdf/"
        driver.get(pdf_url)

        # URL the browser ended up on after any redirects
        return driver.current_url
    except WebDriverException as e:
        print(f"Error on {pmc_url}: {e}")
        return None
    finally:
        if driver is not None:
            try:
                driver.quit()  # Always clean up!
            except Exception:
                pass

In [101]:
pdfs = []
for link in clean_database.pmc_link.tolist():
  # pd.notna treats both None and NaN as missing
  if pd.notna(link):
    print(f"Processing: {link}")
    resolved = get_final_pdf_url(link)
    print(f"Resolved: {resolved}")
    pdfs.append(resolved)
    time.sleep(2)  # Be polite to the server
  else:
    # keep list positions aligned with the DataFrame rows
    pdfs.append(math.nan)

Processing: https://pmc.ncbi.nlm.nih.gov/articles/PMC11971158/
Resolved: https://pmc.ncbi.nlm.nih.gov/articles/PMC11971158/pdf/15010_2024_Article_2439.pdf
Processing: https://pmc.ncbi.nlm.nih.gov/articles/PMC11868965/
Resolved: https://pmc.ncbi.nlm.nih.gov/articles/PMC11868965/pdf/amjcaserep-26-e946400.pdf
Processing: https://pmc.ncbi.nlm.nih.gov/articles/PMC11296404/
Resolved: https://pmc.ncbi.nlm.nih.gov/articles/PMC11296404/pdf/medi-103-e39121.pdf
Processing: https://pmc.ncbi.nlm.nih.gov/articles/PMC10469423/
Resolved: https://pmc.ncbi.nlm.nih.gov/articles/PMC10469423/pdf/TEMI_12_2251600.pdf
Processing: https://pmc.ncbi.nlm.nih.gov/articles/PMC9826535/
Resolved: https://pmc.ncbi.nlm.nih.gov/articles/PMC9826535/pdf/13730_2023_Article_772.pdf
Processing: https://pmc.ncbi.nlm.nih.gov/articles/PMC9767890/
Resolved: https://pmc.ncbi.nlm.nih.gov/articles/PMC9767890/pdf/main.pdf
Processing: https://pmc.ncbi.nlm.nih.gov/articles/PMC10102822/
Resolved: https://pmc.ncbi.nlm.nih.gov/articles/P

In [102]:
len(clean_database.pmc_link.tolist())

122

In [127]:
clean_database = clean_database.reset_index(drop=True)

In [128]:
df_pdfs = pd.DataFrame({"download_link":pdfs})
df_pdfs

| | download_link |
|---|---|
| 0 | https://pmc.ncbi.nlm.nih.gov/articles/PMC11971... |
| 1 | https://pmc.ncbi.nlm.nih.gov/articles/PMC11868... |
| 2 | https://pmc.ncbi.nlm.nih.gov/articles/PMC11296... |
| 3 | https://pmc.ncbi.nlm.nih.gov/articles/PMC10469... |
| 4 | https://pmc.ncbi.nlm.nih.gov/articles/PMC98265... |
| ... | ... |
| 117 | |
| 118 | https://pmc.ncbi.nlm.nih.gov/articles/PMC81561... |
| 119 | https://pmc.ncbi.nlm.nih.gov/articles/PMC81516... |
| 120 | https://pmc.ncbi.nlm.nih.gov/articles/PMC80905... |


In [144]:
database_pdfs = pd.concat([clean_database, pd.DataFrame({"download_link":pdfs})],
                          axis=1).reset_index(drop=True)
database_pdfs

| | pmid | pmcid | title | category | pmc_link | download_link |
|---|---|---|---|---|---|---|
| 0 | 39621236 | PMC11971158 | Syncopes, paresis and loss of vision after COV... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC11971... | https://pmc.ncbi.nlm.nih.gov/articles/PMC11971... |
| 1 | 39988835 | PMC11868965 | Severe COVID-19 Pneumonia, Opportunistic Candi... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC11868... | https://pmc.ncbi.nlm.nih.gov/articles/PMC11868... |
| 2 | 39093769 | PMC11296404 | Integrative personalized medicine care for adj... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC11296... | https://pmc.ncbi.nlm.nih.gov/articles/PMC11296... |
| 3 | 37606967 | PMC10469423 | Severe pigeon paramyxovirus 1 infection in a h... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC10469... | https://pmc.ncbi.nlm.nih.gov/articles/PMC10469... |
| 4 | 36617353 | PMC9826535 | Post-COVID-19 syndrome increased the requireme... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC9826535/ | https://pmc.ncbi.nlm.nih.gov/articles/PMC98265... |
| ... | ... | ... | ... | ... | ... | ... |
| 117 | 34042167 | | Post-COVID-19 syndrome: epidemiology, diagnost... | review | | |
| 118 | 34067776 | PMC8156194 | Post-COVID-19 Syndrome and the Potential Benef... | review | https://pmc.ncbi.nlm.nih.gov/articles/PMC8156194/ | https://pmc.ncbi.nlm.nih.gov/articles/PMC81561... |
| 119 | 34066174 | PMC8151698 | A Review of Prolonged Post-COVID-19 Symptoms a... | review | https://pmc.ncbi.nlm.nih.gov/articles/PMC8151698/ | https://pmc.ncbi.nlm.nih.gov/articles/PMC81516... |
| 120 | 33941272 | PMC8090526 | COVID-19 and Alzheimer's disease: how one cris... | review | https://pmc.ncbi.nlm.nih.gov/articles/PMC8090526/ | https://pmc.ncbi.nlm.nih.gov/articles/PMC80905... |
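
`pd.concat(..., axis=1)` aligns rows by index label rather than by position, which is why `clean_database` has its index reset before the join. The small sketch below uses made-up values (the frames `left` and `right` are illustrative only) to show what can happen when the index is not reset first.

```python
import pandas as pd

# Toy data: `left` keeps the gaps in its index that row filtering leaves behind.
left = pd.DataFrame({"pmid": ["111", "222", "333"]}, index=[0, 2, 5])
right = pd.DataFrame({"download_link": ["a.pdf", "b.pdf", "c.pdf"]})

# Labels 0, 2, 5 on one side and 0, 1, 2 on the other: the result is built
# on the union of labels, so rows are mismatched and padded with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first gives a positional, row-by-row join.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned)
print(aligned)
```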


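#### DataFrame Cleaning
This section counts the missing values, sets aside the rows that still lack a value, corrects malformed titles, and keeps only the complete records for export.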

In [145]:
clean_database.isna().sum()

| | 0 |
|---|---|
| pmid | 0 |
| pmcid | 22 |
| title | 1 |
| category | 0 |
| pmc_link | 22 |


In [173]:
# Rows with at least one missing value; these cannot be downloaded automatically
manual_pdfs = database_pdfs[database_pdfs.isnull().any(axis=1)]
manual_pdfs

| | pmid | pmcid | title | category | pmc_link | download_link |
|---|---|---|---|---|---|---|
| 13 | 40305533 | | Predictors of post-COVID-19 syndrome: a meta-a... | clinical trial phase three | | |
| 14 | 38834107 | | Early use of oral antiviral drugs and the risk... | clinical trial phase three | | |
| 24 | 39423759 | | Analysis of fat oxidation capacity during card... | clinical study | | |
| 30 | 38436080 | | Low-intensity rehabilitation in persistent pos... | clinical study | | |
| 33 | 37201931 | | Vision impairment is common in non-hospitalise... | clinical study | | |
| 39 | 38213055 | | Effectiveness of Internet-Based Group Supporti... | clinical study | | |
| 40 | 37690207 | | Factors associated with mental health outcomes... | clinical study | | |
| 42 | 37184376 | | Long-term consequences in Covid-19 and Non-Cov... | clinical study | | |
| 59 | 34719599 | | Post COVID-19 sequelae: A prospective observat... | clinical study | | |
| 63 | 39362575 | | Immune Response and Cognitive Impairment in Po... | review | | |


##### Correcting Erroneous Title Values

In [163]:
# PMID 37106076
database_pdfs.loc[database_pdfs.pmid == '37106076'].title.values[0]

'Multisystem involvement in COVID-19: what have we learnt?'

In [164]:
# Assign to this row's title only, rather than replacing the string everywhere in the frame
database_pdfs.loc[database_pdfs.pmid == '37106076', 'title'] = "Post-COVID-More than chronic fatigue?"

In [165]:
database_pdfs.loc[database_pdfs.pmid == '37106076'].title.values[0]

'Post-COVID-More than chronic fatigue?'

In [166]:
# PMID 36066294
database_pdfs.loc[database_pdfs.pmid == '36066294'].title.values[0]

"Multisystem involvement in COVID-19: what have we learnt? PG - 1-5 LID - 10.12968/hmed.2022.0290 [doi] AB - The COVID-19 illness trajectory involves persistent cardio-renal inflammation, activation of the haemostatic pathway and lung involvement. Results of a study carried out by the authors' team demonstrate a link between post-COVID-19 syndrome (people who have long COVID) and multisystem disease, which partly explains the lingering impairments in patient-reported health-related quality of life, physical function and psychological wellbeing after COVID-19. This article discusses what hospital physicians need to be aware of when considering the likelihood of myocarditis in patients with post-COVID-19 syndrome and the implications in the longer term."

In [167]:
# Keep only the real title; the PG/LID/AB fields had leaked into the title string
database_pdfs.loc[database_pdfs.pmid == '36066294', 'title'] = "Multisystem involvement in COVID-19: what have we learnt?"

In [168]:
# Re-check PMID 37106076 to confirm its title is unchanged by the replacement above
database_pdfs.loc[database_pdfs.pmid == '37106076'].title.values[0]

'Post-COVID-More than chronic fatigue?'

In [169]:
# PMID 34140635
database_pdfs.loc[database_pdfs.pmid == '34140635'].title.values[0]

'Chronic post-COVID-19 syndrome and chronic fatigue syndrome: Is there a role for extracorporeal apheresis? PG - 34-37 LID - 10.1038/s41380-021-01148-4 [doi] AB - As millions of patients have been infected by SARS-CoV-2 virus a vast number of individuals complain about continuing breathlessness and fatigue even months after the onset of the disease. This overwhelming phenomenon has not been well defined and has been called "post-COVID syndrome" or "long-COVID" [1]. There are striking similarities to myalgic encephalomyelitis also called chronic fatigue syndrome linked to a viral and autoimmune pathogenesis. In both disorders neurotransmitter receptor antibodies against ß-adrenergic and muscarinic receptors may play a key role. We found similar elevation of these autoantibodies in both patient groups. Extracorporeal apheresis using a special filter seems to be effective in reducing these antibodies in a significant way clearly improving the debilitating symptoms of patients with chronic

In [170]:
# Keep only the real title; the PG/LID/AB fields had leaked into the title string
database_pdfs.loc[database_pdfs.pmid == '34140635', 'title'] = "Chronic post-COVID-19 syndrome and chronic fatigue syndrome: Is there a role for extracorporeal apheresis?"

In [172]:
database_pdfs.loc[database_pdfs.pmid == '34140635'].title.values[0]

'Chronic post-COVID-19 syndrome and chronic fatigue syndrome: Is there a role for extracorporeal apheresis?'
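
Two of the malformed titles above share a pattern: the real title is followed by MEDLINE field tags (`PG -`, `LID -`, `AB -`) that leaked into the title field. The following is a minimal sketch of an automated clean-up, assuming the first such tag always marks the end of the title. The helper `strip_medline_tail` and its tag list are hypothetical and cover only the tags seen in this notebook.

```python
import re

# Field tags observed in the malformed titles; extend the list if others appear.
_TAG_PATTERN = re.compile(r"\s+(?:PG|LID|AB)\s+-\s")


def strip_medline_tail(title):
    """Return the title truncated at the first leaked MEDLINE field tag."""
    if not isinstance(title, str):
        return title  # leave missing values (None/NaN) untouched
    return _TAG_PATTERN.split(title, maxsplit=1)[0].strip()


# Shortened version of the malformed title of PMID 36066294 (abstract elided).
raw = ("Multisystem involvement in COVID-19: what have we learnt? "
       "PG - 1-5 LID - 10.12968/hmed.2022.0290 [doi] AB - The COVID-19 illness ...")
print(strip_medline_tail(raw))
```

The helper could be applied to the whole column with `database_pdfs["title"].map(strip_medline_tail)`. It would not catch PMID 37106076, whose title was another article's title rather than a title with trailing fields, so that case still needs a manual fix.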

In [180]:
# Keep only rows with no missing values
database_final = database_pdfs.dropna().reset_index(drop=True)
database_final

| | pmid | pmcid | title | category | pmc_link | download_link |
|---|---|---|---|---|---|---|
| 0 | 39621236 | PMC11971158 | Syncopes, paresis and loss of vision after COV... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC11971... | https://pmc.ncbi.nlm.nih.gov/articles/PMC11971... |
| 1 | 39988835 | PMC11868965 | Severe COVID-19 Pneumonia, Opportunistic Candi... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC11868... | https://pmc.ncbi.nlm.nih.gov/articles/PMC11868... |
| 2 | 39093769 | PMC11296404 | Integrative personalized medicine care for adj... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC11296... | https://pmc.ncbi.nlm.nih.gov/articles/PMC11296... |
| 3 | 37606967 | PMC10469423 | Severe pigeon paramyxovirus 1 infection in a h... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC10469... | https://pmc.ncbi.nlm.nih.gov/articles/PMC10469... |
| 4 | 36617353 | PMC9826535 | Post-COVID-19 syndrome increased the requireme... | case report | https://pmc.ncbi.nlm.nih.gov/articles/PMC9826535/ | https://pmc.ncbi.nlm.nih.gov/articles/PMC98265... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 34175230 | PMC8180841 | Insights from myalgic encephalomyelitis/chroni... | review | https://pmc.ncbi.nlm.nih.gov/articles/PMC8180841/ | https://pmc.ncbi.nlm.nih.gov/articles/PMC81808... |
| 96 | 34067776 | PMC8156194 | Post-COVID-19 Syndrome and the Potential Benef... | review | https://pmc.ncbi.nlm.nih.gov/articles/PMC8156194/ | https://pmc.ncbi.nlm.nih.gov/articles/PMC81561... |
| 97 | 34066174 | PMC8151698 | A Review of Prolonged Post-COVID-19 Symptoms a... | review | https://pmc.ncbi.nlm.nih.gov/articles/PMC8151698/ | https://pmc.ncbi.nlm.nih.gov/articles/PMC81516... |
| 98 | 33941272 | PMC8090526 | COVID-19 and Alzheimer's disease: how one cris... | review | https://pmc.ncbi.nlm.nih.gov/articles/PMC8090526/ | https://pmc.ncbi.nlm.nih.gov/articles/PMC80905... |


In [181]:
database_final.describe()

| | pmid | pmcid | title | category | pmc_link | download_link |
|---|---|---|---|---|---|---|
| count | 100 | 100 | 100 | 100 | 100 | 100 |
| unique | 100 | 100 | 100 | 4 | 100 | 100 |
| top | 39621236 | PMC11971158 | Syncopes, paresis and loss of vision after COV... | review | https://pmc.ncbi.nlm.nih.gov/articles/PMC11971... | https://pmc.ncbi.nlm.nih.gov/articles/PMC11971... |
| freq | 1 | 1 | 1 | 46 | 1 | 1 |


##### Exporting to CSV

In [183]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [184]:
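# Note: a bare filename is written to the current working directory;
# use a path under /content/drive to save the file to the mounted Drive
database_final.to_csv("pasc_pubmed.csv", index=False)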
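
When the exported file is read back, pandas infers column types, so the `pmid` column would come back as integers and comparisons such as `pmid == '37106076'` would no longer match. The sketch below shows a round trip that keeps the identifiers as strings. It uses a temporary file and toy rows rather than the real dataset.

```python
import os
import tempfile

import pandas as pd

# Toy rows standing in for database_final.
toy = pd.DataFrame({"pmid": ["39621236", "39988835"],
                    "pmcid": ["PMC11971158", "PMC11868965"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pasc_pubmed.csv")
    toy.to_csv(path, index=False)

    inferred = pd.read_csv(path)                      # pmid parsed as integers
    as_text = pd.read_csv(path, dtype={"pmid": str})  # pmid kept as strings

print(inferred["pmid"].dtype, as_text["pmid"].dtype)
```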