Import the orchestrator and load environment variables from the `.env` file.

In [1]:
from data_gatherer.orchestrator import Orchestrator
from data_gatherer.data_fetcher import DataFetcher, WebScraper, DatabaseFetcher, APIClient
from data_gatherer.selenium_setup import create_driver
from data_gatherer.parser import LLMParser
import requests
import pandas as pd
import time
from dotenv import load_dotenv

load_dotenv()

True

Instantiate the orchestrator as `data_gatherer`.

In [2]:
data_gatherer = Orchestrator()

orchestrator.py - line 40 - INFO - Data_Gatherer Orchestrator initialized. Extraction step Model: gemini-2.0-flash


## Step I: Fetch data

The Fetcher retrieves raw data about scientific publications from the internet. The parent class is `DataFetcher`, and it has the following child classes:

- **WebScraper**: given a URL, scrapes the page and extracts its HTML content.
- **DatabaseFetcher**: given a key, fetches raw data (HTML or XML) from a local DataFrame.
- **APIClient**: given a URI, fetches raw data (XML) from an API.

In [3]:
data_gatherer.setup_data_fetcher()

### Fetch from API

In [4]:
API_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466"
api_fetcher = APIClient(requests,'PMC_API',data_gatherer.config, data_gatherer.logger)
api_fetcher

<data_gatherer.data_fetcher.APIClient at 0x1456cf410>

In [5]:
raw_data_API = api_fetcher.fetch_data(API_supported_input)
raw_data_API

data_fetcher.py - line 549 - INFO - Fetching data from request: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC6141466&retmode=xml


<Element pmc-articleset at 0x14554b740>
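The log line above shows the E-utilities request that was issued for the PMC article URL. As a rough illustration of that mapping, the sketch below extracts the PMC ID and builds the same `efetch` URL. The helper name `pmc_url_to_efetch` is hypothetical and this is not the `APIClient` implementation, only a minimal reconstruction of the URL seen in the log.

```python
import re
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pmc_url_to_efetch(article_url: str) -> str:
    """Build an efetch URL for a PMC article URL (hypothetical helper)."""
    # Look for "PMC" followed by digits, ignoring case (e.g. "pmc11425778").
    match = re.search(r"PMC\d+", article_url, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"No PMC ID found in {article_url!r}")
    pmc_id = match.group(0).upper()
    # Same query parameters as in the logged request.
    query = urlencode({"db": "pmc", "id": pmc_id, "retmode": "xml"})
    return f"{EFETCH_BASE}?{query}"


print(pmc_url_to_efetch("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466"))
```

The resulting URL could then be requested with `requests.get` and parsed as XML, which is presumably how the `pmc-articleset` element shown above was obtained.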

### Fetch from Local Data

In [6]:
local_fetch_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778"

In [7]:
df_fetcher = DatabaseFetcher(data_gatherer.config, data_gatherer.logger)
df_fetcher

<data_gatherer.data_fetcher.DatabaseFetcher at 0x145765010>

In [8]:
raw_data_local = df_fetcher.fetch_data(local_fetch_supported_input)
print(f"Length of raw content: {len(raw_data_local)}")

data_fetcher.py - line 466 - INFO - Fetching data for pmc11425778
data_fetcher.py - line 470 - INFO - Fetching data from ../input/fetched_data.parquet


Length of raw content: 205661


### Fetch Raw HTML from a web page

In [9]:
API_unsupported_input = "https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/"

In [10]:
driver = create_driver(data_gatherer.config['DRIVER_PATH'], data_gatherer.config['BROWSER'], data_gatherer.config['HEADLESS'], data_gatherer.logger)
web_fetcher = WebScraper(driver, data_gatherer.config, data_gatherer.logger)

selenium_setup.py - line 11 - INFO - Creating WebDriver for browser: Firefox
selenium_setup.py - line 41 - INFO - No driver path provided, using GeckoDriverManager to auto-install Firefox driver.
selenium_setup.py - line 43 - INFO - Using GeckoDriverManager to auto-install Firefox driver <selenium.webdriver.firefox.service.Service object at 0x144f94690>.


In [11]:
raw_html = web_fetcher.fetch_data(API_unsupported_input)
print(f"Length of raw content: {len(raw_html)}")

Length of raw content: 564424


## Step II: Parse data

The Parser extracts the relevant information from the raw documents returned by the Fetcher. It supports two discovery methods:

- **Retrieve-Then-Read**: the parser passes only the relevant sections of the raw document to the LLM, which extracts the relevant information from them.
- **Full-Document-Read**: a large-context LLM reads the entire raw document and extracts the relevant information.
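To make the difference between the two methods concrete, here is a toy sketch of how the text sent to the LLM might be assembled in each mode. Everything in it is an assumption for illustration: the keyword list, the function names `retrieve_relevant_sections` and `build_llm_input`, and the section dictionary are hypothetical and do not reflect how `LLMParser` works internally.

```python
from typing import Dict, List, Sequence

# Hypothetical retrieval keywords; the real parser's patterns may differ.
KEYWORDS = ("data availability", "accession", "supplementary")


def retrieve_relevant_sections(
    sections: Dict[str, str], keywords: Sequence[str] = KEYWORDS
) -> List[str]:
    """Keep only the sections whose title or text mentions a keyword."""
    selected = []
    for title, text in sections.items():
        haystack = f"{title} {text}".lower()
        if any(keyword in haystack for keyword in keywords):
            selected.append(text)
    return selected


def build_llm_input(sections: Dict[str, str], full_document_read: bool) -> str:
    """Return the text that would be handed to the LLM in each mode."""
    if full_document_read:
        # Full-Document-Read: the whole document goes to the model.
        return "\n".join(sections.values())
    # Retrieve-Then-Read: only the retrieved sections go to the model.
    return "\n".join(retrieve_relevant_sections(sections))


sections = {
    "Introduction": "Background and motivation for the study.",
    "Methods": "Samples were processed with a standard protocol.",
    "Data availability": "Proteomics data can be found at the MassIVE database via MSV000092944.",
}
rtr_input = build_llm_input(sections, full_document_read=False)
fdr_input = build_llm_input(sections, full_document_read=True)
print(len(rtr_input), len(fdr_input))
```

The trade-off this illustrates is that Retrieve-Then-Read sends less text to the model but depends on the retrieval step finding the right sections, while Full-Document-Read needs a model with a large enough context window.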

### Retrieve-Then-Read

In [12]:
parser = LLMParser(data_gatherer.config['parser_config_path'], data_gatherer.logger, full_document_read=False)

parser.py - line 132 - INFO - Parser initialized.


In [13]:
input_url = [API_supported_input, local_fetch_supported_input, API_unsupported_input]
input_format = ["XML", "full_HTML", "full_HTML"]
input_cont = [raw_data_API, raw_data_local, raw_html]

dfs_dict_RTR = {}
for url, content, fmt in zip(input_url, input_cont, input_format):
    append_data = parser.parse_data(content, "PMC", url, raw_data_format=fmt)
    dfs_dict_RTR[url] = append_data
    print(f"Parsed data from {url}. Found {len(append_data)} datasets.")
    time.sleep(1)  # To print the output in the right order

parser.py - line 382 - INFO - Function call: parse_data(api_data(<class 'lxml.etree._Element'>), PMC, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466, additional_data, XML)
parser.py - line 396 - INFO - Extracted title:'Recurrent WNT pathway alterations are frequent in relapsed small cell lung cancer'
parser.py - line 806 - INFO - Function_call: extract_href_from_supplementary_material(api_xml, current_url_address)
parser.py - line 817 - INFO - Found 1 supplementary material sections .//sec[@sec-type='supplementary-material']. cont: [<Element sec at 0x1457a7280>]
parser.py - line 817 - INFO - Found 15 supplementary material sections .//supplementary-material. cont: [<Element supplementary-material at 0x1457a7080>, <Element supplementary-material at 0x1457a7500>, <Element supplementary-material at 0x1457a7400>, <Element supplementary-material at 0x1457a7580>, <Element supplementary-material at 0x1457a7300>, <Element supplementary-material at 0x1457a7100>, <Element supplementary-mat

Parsed data from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466. Found 32 datasets.


parser.py - line 382 - INFO - Function call: parse_data(api_data(<class 'str'>), PMC, https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778, additional_data, full_HTML)
parser.py - line 494 - INFO - Chunking the HTML content for the parsing step.
parser.py - line 730 - INFO - Function_call: extract_href_from_html_supplementary_material(tree, https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778)
parser.py - line 790 - INFO - Extracted 14 unique supplementary material links from HTML.
parser.py - line 541 - INFO - Function_call: normalize_full_DOM(api_data). Length of raw api data: 71801 tokens
parser.py - line 1453 - INFO - Extracting data availability elements from HTML
parser.py - line 1464 - INFO - Using selector: section.data-availability-statement
parser.py - line 1500 - INFO - Extracted data availability element: {'retrieval_pattern': 'section.data-availability-statement', 'text': 'Proteomics data can be found at the MassIVE database via MSV000092944. \n', 'html': '<p> Proteomics

Parsed data from https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778. Found 3 datasets.


parser.py - line 382 - INFO - Function call: parse_data(api_data(<class 'str'>), PMC, https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/, additional_data, full_HTML)
parser.py - line 494 - INFO - Chunking the HTML content for the parsing step.
parser.py - line 730 - INFO - Function_call: extract_href_from_html_supplementary_material(tree, https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/)
parser.py - line 790 - INFO - Extracted 61 unique supplementary material links from HTML.
parser.py - line 541 - INFO - Function_call: normalize_full_DOM(api_data). Length of raw api data: 197202 tokens
parser.py - line 1453 - INFO - Extracting data availability elements from HTML
parser.py - line 1464 - INFO - Using selector: section.data-availability-statement
parser.py - line 1500 - INFO - Extracted data availability element: {'retrieval_pattern': 'section.data-availability-statement', 'text': 'Processed data tables are available inTable S2. Data used for the manuscript are also available through a 

Parsed data from https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/. Found 10 datasets.


### Full-Document-Read

In [14]:
parser = LLMParser(data_gatherer.config['parser_config_path'], data_gatherer.logger, full_document_read=True)

parser.py - line 132 - INFO - Parser initialized.


In [15]:
input_url = [API_supported_input, local_fetch_supported_input, API_unsupported_input]
input_format = ["XML", "full_HTML", "full_HTML"]
input_cont = [raw_data_API, raw_data_local, raw_html]

dfs_dict_FDR = {}
for url, content, fmt in zip(input_url, input_cont, input_format):
    append_data = parser.parse_data(content, "PMC", url, raw_data_format=fmt)
    dfs_dict_FDR[url] = append_data
    print(f"Parsed data from {url}. Found {len(append_data)} datasets.")
    time.sleep(1)  # To print the output in the right order

parser.py - line 382 - INFO - Function call: parse_data(api_data(<class 'lxml.etree._Element'>), PMC, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466, additional_data, XML)
parser.py - line 396 - INFO - Extracted title:'Recurrent WNT pathway alterations are frequent in relapsed small cell lung cancer'
parser.py - line 806 - INFO - Function_call: extract_href_from_supplementary_material(api_xml, current_url_address)
parser.py - line 817 - INFO - Found 1 supplementary material sections .//sec[@sec-type='supplementary-material']. cont: [<Element sec at 0x146f1d4c0>]
parser.py - line 817 - INFO - Found 15 supplementary material sections .//supplementary-material. cont: [<Element supplementary-material at 0x146f1d5c0>, <Element supplementary-material at 0x146f1dc80>, <Element supplementary-material at 0x146f1d240>, <Element supplementary-material at 0x146f1d540>, <Element supplementary-material at 0x146f1d580>, <Element supplementary-material at 0x146f1d440>, <Element supplementary-mat

Parsed data from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466. Found 17 datasets.


parser.py - line 382 - INFO - Function call: parse_data(api_data(<class 'str'>), PMC, https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778, additional_data, full_HTML)
parser.py - line 474 - INFO - Extracting links from full HTML content.
parser.py - line 730 - INFO - Function_call: extract_href_from_html_supplementary_material(tree, https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778)
parser.py - line 790 - INFO - Extracted 14 unique supplementary material links from HTML.
parser.py - line 541 - INFO - Function_call: normalize_full_DOM(api_data). Length of raw api data: 71801 tokens
parser.py - line 1099 - INFO - Loading prompt: retrieve_datasets_simple_JSON_gemini
prompt_manager.py - line 29 - INFO - Loading prompt: retrieve_datasets_simple_JSON_gemini from user_prompt_dir: None, subdir: 
parser.py - line 1108 - INFO - static_prompt: [{'role': 'model', 'parts': [{'text': "You are a specialized assistant that extracts dataset references from the content of scientific papers. You mu

Parsed data from https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778. Found 3 datasets.


parser.py - line 382 - INFO - Function call: parse_data(api_data(<class 'str'>), PMC, https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/, additional_data, full_HTML)
parser.py - line 474 - INFO - Extracting links from full HTML content.
parser.py - line 730 - INFO - Function_call: extract_href_from_html_supplementary_material(tree, https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/)
parser.py - line 790 - INFO - Extracted 61 unique supplementary material links from HTML.
parser.py - line 541 - INFO - Function_call: normalize_full_DOM(api_data). Length of raw api data: 197202 tokens
parser.py - line 1099 - INFO - Loading prompt: retrieve_datasets_simple_JSON_gemini
prompt_manager.py - line 29 - INFO - Loading prompt: retrieve_datasets_simple_JSON_gemini from user_prompt_dir: None, subdir: 
parser.py - line 1108 - INFO - static_prompt: [{'role': 'model', 'parts': [{'text': "You are a specialized assistant that extracts dataset references from the content of scientific papers. You must outp

Parsed data from https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/. Found 10 datasets.
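The two runs report different dataset counts for the first publication (32 with Retrieve-Then-Read, 17 with Full-Document-Read) and identical counts for the other two. A small sketch like the one below can make such differences explicit. The counts are copied from the printed output above, and the helper `count_differences` is a hypothetical name introduced for illustration.

```python
# Dataset counts copied from the printed output of the two runs above.
rtr_counts = {
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466": 32,
    "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778": 3,
    "https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/": 10,
}
fdr_counts = {
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466": 17,
    "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778": 3,
    "https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/": 10,
}


def count_differences(rtr: dict, fdr: dict) -> dict:
    """Return RTR minus FDR dataset counts for URLs present in both runs."""
    return {url: rtr[url] - fdr[url] for url in rtr if url in fdr}


for url, diff in count_differences(rtr_counts, fdr_counts).items():
    print(f"{url}: RTR - FDR = {diff:+d}")
```

Inside the notebook, the same dictionaries could be built from the parser results, for example with `{url: len(df) for url, df in dfs_dict_RTR.items()}`.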


## Step III: Classify data

For now, we consider only raw data files, i.e., objects that can be accessed from a repository through an accession code. The classifier can, however, be extended to cover other types of objects, such as supplementary materials.

In [16]:
print(f"Fetched from {len(dfs_dict_FDR)} publications.")
raw_data_files = data_gatherer.classifier.get_raw_data_files(dfs_dict_FDR)
print(f"Fetched {len(raw_data_files)} raw data files:")

classifier.py - line 106 - INFO - Processing DataFrame for URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466
classifier.py - line 106 - INFO - Processing DataFrame for URL: https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778
classifier.py - line 106 - INFO - Processing DataFrame for URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/


Fetched from 3 publications.
Fetched 4 raw data files:


In [17]:
raw_data_files

| | publication_url | dataset_identifier | data_repository | dataset_webpage |
|---|---|---|---|---|
| 0 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6... | PHS001049 | dbGAP | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-... |
| 1 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6... | PRJNA306801 | SRA | https://www.ncbi.nlm.nih.gov/bioproject/?term=... |
| 17 | https://www.ncbi.nlm.nih.gov/pmc/articles/pmc1... | MSV000092944 | MassIVE database | https://massive.ucsd.edu/ProteoSAFe/dataset.js... |
| 20 | https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/ | phs001287 | dbGaP | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-... |
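The identifiers returned here follow recognizable accession formats, and their casing varies between rows (`PHS001049` versus `phs001287`). The sketch below shows one way to recognize such accession codes case-insensitively with regular expressions. The patterns and the helper `guess_repository` are assumptions for illustration and are not the rules used by `data_gatherer.classifier`.

```python
import re
from typing import Optional

# Illustrative accession patterns (assumptions, not the classifier's rules).
ACCESSION_PATTERNS = {
    # dbGaP study accession, optionally with version and participant set.
    "dbGaP": re.compile(r"^phs\d{6}(\.v\d+\.p\d+)?$", re.IGNORECASE),
    # BioProject accession (the table above lists this one under SRA).
    "BioProject": re.compile(r"^PRJ(NA|EB|DB)\d+$", re.IGNORECASE),
    # MassIVE dataset accession.
    "MassIVE": re.compile(r"^MSV\d{9}$", re.IGNORECASE),
}


def guess_repository(identifier: str) -> Optional[str]:
    """Return the first repository whose pattern matches, else None."""
    cleaned = identifier.strip()
    for repository, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(cleaned):
            return repository
    return None


for identifier in ["PHS001049", "PRJNA306801", "MSV000092944", "phs001287", "Table S2"]:
    print(identifier, "->", guess_repository(identifier))
```

A check of this kind separates accession-coded objects from references such as supplementary tables, which is the distinction the classifier step draws in this notebook.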
