Import the orchestrator and load environment variables from the ".env" file

In [1]:
from data_gatherer.orchestrator import Orchestrator
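
The import above does not show how the ".env" file is read. Assuming it is not loaded automatically by the package, the following is a minimal standard-library sketch of loading `KEY=VALUE` pairs into the environment. The helper name `load_env_file` is hypothetical and not part of data_gatherer.

```python
import os
from pathlib import Path


def load_env_file(path=".env", override=False):
    """Read KEY=VALUE lines from a .env-style file into os.environ.

    Hypothetical helper for illustration; not part of data_gatherer.
    Returns a dict of the variables that were actually set.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded  # nothing to load

    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Keep existing environment values unless override is requested.
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling `load_env_file()` before instantiating the orchestrator would make the keys available through `os.environ`. The `python-dotenv` package offers `load_dotenv()` for the same purpose with fuller syntax support.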

Instantiate the orchestrator as `data_gatherer`

In [2]:
data_gatherer = Orchestrator(llm_name="gemini-2.0-flash", log_level="INFO")

orchestrator.py - line 82 - INFO - Data_Gatherer Orchestrator initialized. Extraction Model: gemini-2.0-flash


## Step I: Fetch data

In [3]:
fetched_data = {}

The Fetcher is responsible for fetching raw data about scientific publications available on the internet. The parent class is DataFetcher, and its child classes are:

- **WebScraper**: given a URL, scrapes the page and extracts the HTML content.
- **DatabaseFetcher**: given a key, fetches raw data (HTML or XML) from a local DataFrame.
- **APIClient**: given a URI, fetches raw data (XML) from an API.
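
The choice among the three fetchers can be sketched as a small dispatch function. This is an illustration only, not the library's actual logic: the function name `choose_fetcher`, the precedence order (local data first, then API, then scraping), and the use of the last URL segment as the local lookup key are all assumptions, loosely based on the log output shown in the following sections.

```python
import re

# Assumption: PubMed Central article URLs are the only API-supported inputs.
PMC_URL = re.compile(r"ncbi\.nlm\.nih\.gov/pmc/articles/pmc\d+", re.IGNORECASE)


def choose_fetcher(url, local_keys=frozenset()):
    """Return the name of the fetcher a dispatcher might pick for `url`.

    `local_keys` is a hypothetical set of lowercase article keys that are
    available in the local DataFrame.
    """
    # Assumption: the local key is the last path segment, lowercased.
    key = url.rstrip("/").rsplit("/", 1)[-1].lower()
    if key in local_keys:
        return "DatabaseFetcher"
    if PMC_URL.search(url):
        return "APIClient"
    return "WebScraper"


print(choose_fetcher("https://www.nature.com/articles/s41467-024-51831-7"))
# WebScraper
```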

In [4]:
data_gatherer.setup_data_fetcher()

orchestrator.py - line 219 - INFO - Setting up data fetcher...
orchestrator.py - line 249 - INFO - Data fetcher setup completed.


### Fetch from API

When the publisher's API is supported (currently only PubMed Central), the fetcher uses the APIClient to retrieve the article in XML format.
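
The log output for the fetch in this section shows the request that is sent to the NCBI E-utilities `efetch` endpoint. A minimal sketch of how such a request URL could be derived from the article URL follows. The function name `pmc_article_to_efetch_url` is hypothetical, and extracting the identifier with a regular expression is an assumption, not necessarily how the library does it.

```python
import re
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pmc_article_to_efetch_url(article_url):
    """Build an efetch request URL (XML) for a PubMed Central article URL."""
    # Look for an identifier such as PMC6141466 anywhere in the URL.
    match = re.search(r"PMC\d+", article_url, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"No PMC identifier found in {article_url!r}")
    pmc_id = match.group(0).upper()
    query = urlencode({"db": "pmc", "id": pmc_id, "retmode": "xml"})
    return f"{EFETCH_BASE}?{query}"


print(pmc_article_to_efetch_url(
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466"
))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC6141466&retmode=xml
```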

In [5]:
API_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466"

In [6]:
raw_data_API = data_gatherer.fetch_data(API_supported_input)
fetched_data.update(raw_data_API)

orchestrator.py - line 219 - INFO - Setting up data fetcher...
orchestrator.py - line 249 - INFO - Data fetcher setup completed.
orchestrator.py - line 132 - INFO - Setting target URL to: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466
data_fetcher.py - line 147 - INFO - Initializing EntrezFetcher(('requests', 'self.config'))
orchestrator.py - line 140 - INFO - Fetching data from URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466
data_fetcher.py - line 484 - INFO - Fetching data from request: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC6141466&retmode=xml


### Fetch from Local Data

In [7]:
local_fetch_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778"

Fetch the article from the local DataFrame, stored here as a Parquet file

In [8]:
raw_data_local = data_gatherer.fetch_data(local_fetch_supported_input, local_fetch_file="../scripts/exp_input/Local_fetched_data.parquet")
fetched_data.update(raw_data_local)

orchestrator.py - line 219 - INFO - Setting up data fetcher...
orchestrator.py - line 249 - INFO - Data fetcher setup completed.
orchestrator.py - line 132 - INFO - Setting target URL to: https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778
data_fetcher.py - line 143 - INFO - URL https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778 found in DataFrame. Using DatabaseFetcher.
orchestrator.py - line 140 - INFO - Fetching data from URL: https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778
data_fetcher.py - line 428 - INFO - Fetching data for pmc11425778
data_fetcher.py - line 431 - INFO - Fetching data from ../scripts/exp_input/Local_fetched_data.parquet


### Fetch Raw HTML from a web page

In [9]:
API_unsupported_input = "https://www.nature.com/articles/s41467-024-51831-7"

In [10]:
raw_html = data_gatherer.fetch_data(API_unsupported_input, browser='Firefox', headless=True)
fetched_data.update(raw_html)

orchestrator.py - line 219 - INFO - Setting up data fetcher...
orchestrator.py - line 249 - INFO - Data fetcher setup completed.
orchestrator.py - line 132 - INFO - Setting target URL to: https://www.nature.com/articles/s41467-024-51831-7
data_fetcher.py - line 155 - INFO - WebScraper instance: True
data_fetcher.py - line 156 - INFO - EntrezFetcher instance: False
data_fetcher.py - line 157 - INFO - scraper_tool attribute: True
data_fetcher.py - line 159 - INFO - Initializing new selenium driver.
selenium_setup.py - line 11 - INFO - Creating WebDriver for browser: Firefox, with driver located at driver_path: None
selenium_setup.py - line 41 - INFO - No driver path provided, using GeckoDriverManager to auto-install Firefox driver.
selenium_setup.py - line 43 - INFO - Using GeckoDriverManager to auto-install Firefox driver <selenium.webdriver.firefox.service.Service object at 0x1a8b8a350>.
orchestrator.py - line 140 - INFO - Fetching data from URL: https://www.nature.com/articles/s41467-

## Step II: Parse data

The Parser is responsible for extracting the relevant information from the raw documents fetched by the Fetcher. It has two main discovery methods:

- **Retrieve-Then-Read**: the parser passes only the relevant sections of the raw document to the LLM, which extracts the relevant information.
- **Full-Document-Read**: a large-context LLM reads the entire raw document and extracts the relevant information.
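
The difference between the two methods comes down to what text reaches the LLM. The sketch below illustrates this under simplifying assumptions: the document is represented as a mapping of section titles to text, and relevance is decided by keyword matching on the title. The function `build_llm_input`, its parameters, and the toy document are hypothetical and do not reflect the library's internals.

```python
def build_llm_input(sections, full_document_read=False,
                    keywords=("data availability", "supplementary")):
    """Assemble the text that would be sent to the LLM.

    sections: mapping of section title -> section text (hypothetical).
    full_document_read: if True, keep every section (Full-Document-Read);
        otherwise keep only sections whose title matches a keyword
        (Retrieve-Then-Read).
    """
    if full_document_read:
        selected = list(sections.items())
    else:
        selected = [
            (title, text)
            for title, text in sections.items()
            if any(keyword in title.lower() for keyword in keywords)
        ]
    return "\n\n".join(f"{title}\n{text}" for title, text in selected)


# Toy document for illustration only.
toy_document = {
    "Introduction": "Background on the study.",
    "Methods": "Samples were sequenced.",
    "Data Availability": "Data are deposited under accession ACC123.",
}

rtr_input = build_llm_input(toy_document)
fdr_input = build_llm_input(toy_document, full_document_read=True)
print(len(rtr_input) < len(fdr_input))  # True
```

Retrieve-Then-Read keeps the prompt short at the risk of missing datasets mentioned outside the retrieved sections, while Full-Document-Read trades a longer prompt for full coverage.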

### Retrieve-Then-Read

Version 0.1.0 supports this parsing method only for PMC XML documents and for HTML documents.

In [11]:
results_RTR = data_gatherer.parse_data(API_supported_input, raw_data_API, publisher="PMC", use_portkey_for_gemini=True, prompt_name="retrieve_datasets_simple_JSON")
results_RTR

orchestrator.py - line 183 - INFO - Parsing data from URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466 with publisher: PMC
base_parser.py - line 181 - INFO - LLMParser initialized.
xml_parser.py - line 23 - INFO - Initializing xmlRetriever
xml_parser.py - line 141 - INFO - Function call: parse_data(api_data(<class 'lxml.etree._Element'>), PMC, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466, additional_data, XML)
xml_parser.py - line 440 - INFO - Function_call: extract_href_from_supplementary_material(api_xml, current_url_address)
base_parser.py - line 286 - INFO - Function_call: load_patterns_for_tgt_section(supplementary_material_sections)
base_parser.py - line 287 - INFO - Consider migrating this function to the BaseRetriever class.
xml_parser.py - line 451 - INFO - Found 15 supplementary material sections .//supplementary-material. cont: [<Element supplementary-material at 0x13e363dc0>, <Element supplementary-material at 0x13e229240>, <Element supplementary-material a

Unnamed: 0,dataset_identifier,data_repository,dataset_webpage,source_section,retrieval_pattern,access_mode,link,source_url,download_link,title,content_type,id,surrounding_text,description,file_extension,pub_title
0,PHS001049,dbgap,https://www.ncbi.nlm.nih.gov/projects/gap/cgi-...,data_availability,data availability,Application to access,,,,,,,,,,Recurrent WNT pathway alterations are frequent...
1,PRJNA306801,sra,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,data_availability,data availability,Complex download,,,,,,,,,,Recurrent WNT pathway alterations are frequent...
0,,,,supplementary material,.//supplementary-material,,41467_2018_6162_MOESM1_ESM.pdf,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM1,Supplementary Information,Supplementary Information,pdf,Recurrent WNT pathway alterations are frequent...
1,,,,supplementary material,.//supplementary-material,,41467_2018_6162_MOESM2_ESM.pdf,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM2,Peer Review File,Peer Review File,pdf,Recurrent WNT pathway alterations are frequent...
2,,,,supplementary material,.//supplementary-material,,41467_2018_6162_MOESM3_ESM.pdf,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM3,Description of Additional Supplementary Files,Description of Additional Supplementary Files,pdf,Recurrent WNT pathway alterations are frequent...
3,,,,supplementary material,.//supplementary-material,,41467_2018_6162_MOESM4_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM4,Supplementary Data 1,Supplementary Data 1,xlsx,Recurrent WNT pathway alterations are frequent...
4,,,,supplementary material,.//supplementary-material,,41467_2018_6162_MOESM5_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM5,Supplementary Data 2,Supplementary Data 2,xlsx,Recurrent WNT pathway alterations are frequent...
5,,,,supplementary material,.//supplementary-material,,41467_2018_6162_MOESM6_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM6,Supplementary Data 3,Supplementary Data 3,xlsx,Recurrent WNT pathway alterations are frequent...
6,,,,supplementary material,.//supplementary-material,,41467_2018_6162_MOESM7_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM7,Supplementary Data 4,Supplementary Data 4,xlsx,Recurrent WNT pathway alterations are frequent...
7,,,,supplementary material,.//supplementary-material,,41467_2018_6162_MOESM8_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM8,Supplementary Data 5,Supplementary Data 5,xlsx,Recurrent WNT pathway alterations are frequent...


In [12]:
results_RTR = data_gatherer.parse_data(local_fetch_supported_input, raw_data_local, publisher="PMC", use_portkey_for_gemini=True, prompt_name="retrieve_datasets_simple_JSON", raw_data_format="HTML")
results_RTR

orchestrator.py - line 183 - INFO - Parsing data from URL: https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778 with publisher: PMC
base_parser.py - line 181 - INFO - LLMParser initialized.
html_parser.py - line 82 - INFO - Initializing htmlRetriever
html_parser.py - line 345 - INFO - Function call: parse_data(api_data(html_str, PMC, https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778, additional_data, HTML)
html_parser.py - line 426 - INFO - Function_call: extract_href_from_html_supplementary_material(tree, https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778)
html_parser.py - line 490 - INFO - Extracted 14 unique supplementary material links from HTML.
html_parser.py - line 372 - INFO - Chunking the HTML content for the parsing step.
html_retriever.py - line 171 - INFO - Extracting data availability elements from HTML
html_retriever.py - line 182 - INFO - Using selector: section.data-availability-statement
html_retriever.py - line 182 - INFO - Using selector: section.associated-d

Unnamed: 0,dataset_identifier,data_repository,dataset_webpage,access_mode,link,title,file_info,description,source_section,section_class,download_link,file_extension,a_attr_href,a_attr_class,a_attr_data-ga-action,a_attr_target,a_attr_rel,source_url,pub_title
0,MSV000092944,massive.ucsd.edu,https://massive.ucsd.edu/ProteoSAFe/dataset.js...,Complex download,,,,,,,,,,,,,,https://www.ncbi.nlm.nih.gov/pmc/articles/pmc1...,No title found
0,,,,,https://doi.org/10.1021/acs.jproteome.4c00338,10.1021/acs.jproteome.4c00338,,,,pmc-layout__citation font-secondary font-xs,,,https://doi.org/10.1021/acs.jproteome.4c00338,usa-link usa-link--external,click_feat_suppl,_blank,noopener noreferrer,https://www.ncbi.nlm.nih.gov/pmc/articles/pmc1...,No title found
8,,,,,/articles/instance/11425778/bin/NIHMS2020672-s...,NIHMS2020672-supplement-SI.pdf,"(1.4MB, pdf)",,d67e776,media p,https://www.ncbi.nlm.nih.gov/pmc/articles/inst...,pdf,/articles/instance/11425778/bin/NIHMS2020672-s...,usa-link,click_feat_suppl,,,https://www.ncbi.nlm.nih.gov/pmc/articles/pmc1...,No title found


In [13]:
results_RTR = data_gatherer.parse_data(API_unsupported_input, raw_html, publisher="Nature", use_portkey_for_gemini=True, prompt_name="retrieve_datasets_simple_JSON", raw_data_format="HTML", semantic_retrieval=True)
results_RTR

orchestrator.py - line 183 - INFO - Parsing data from URL: https://www.nature.com/articles/s41467-024-51831-7 with publisher: Nature
base_parser.py - line 181 - INFO - LLMParser initialized.
html_parser.py - line 82 - INFO - Initializing htmlRetriever
html_parser.py - line 345 - INFO - Function call: parse_data(api_data(html_str, Nature, https://www.nature.com/articles/s41467-024-51831-7, additional_data, HTML)
html_parser.py - line 426 - INFO - Function_call: extract_href_from_html_supplementary_material(tree, https://www.nature.com/articles/s41467-024-51831-7)
html_parser.py - line 490 - INFO - Extracted 4 unique supplementary material links from HTML.
html_parser.py - line 372 - INFO - Chunking the HTML content for the parsing step.
html_retriever.py - line 171 - INFO - Extracting data availability elements from HTML
html_retriever.py - line 182 - INFO - Using selector: section.data-availability-statement
html_retriever.py - line 182 - INFO - Using selector: section.associated-data



embeddings_retriever.py - line 51 - INFO - Searching for top-5 passages similar to the query by embeddings.
embeddings_retriever.py - line 33 - INFO - Computing L2 distances using numpy.
base_parser.py - line 432 - INFO - Function_call: extract_datasets_info_from_content(...)
prompt_manager.py - line 32 - INFO - Loading prompt: retrieve_datasets_simple_JSON from user_prompt_dir: None, subdir: 
base_parser.py - line 1190 - INFO - Expected string but got list. Converting list to string.
base_parser.py - line 440 - INFO - Content length: 19082
base_parser.py - line 1190 - INFO - Expected string but got list. Converting list to string.
base_parser.py - line 451 - INFO - Prompt messages total length: 5113 tokens
base_parser.py - line 457 - INFO - Prompt ID: gemini-2.0-flash-0-ff14d72780bfa11968c20480e7528540c5731a2d1b12e3406dae2c8b79987faa
base_parser.py - line 1190 - INFO - Expected string but got list. Converting list to string.
base_parser.py - line 476 - INFO - Requesting datasets from 

Unnamed: 0,dataset_identifier,data_repository,dataset_webpage,access_mode,link,title,file_info,description,source_section,section_class,...,file_extension,a_attr_class,a_attr_data-track,a_attr_data-track-action,a_attr_data-test,a_attr_data-track-label,a_attr_href,a_attr_data-supp-info-image,source_url,pub_title
0,GSE269782,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
1,GSE31210,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
2,GSE106765,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
3,GSE60189,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
4,GSE59239,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
5,GSE122005,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
6,GSE38121,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
7,GSE71587,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
8,GSE37699,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
9,PXD051771,www.proteomexchange.org,https://www.proteomexchange.org/cgi/GetDataset...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
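
The log for the cell above reports a search for the top-5 passages most similar to the query, using L2 distances between embeddings. A minimal sketch of that ranking step follows. The log indicates the library uses numpy, whereas this sketch uses only the standard library. The function name `top_k_by_l2` and the toy two-dimensional vectors are assumptions for illustration.

```python
import math


def top_k_by_l2(query_vec, passage_vecs, k=5):
    """Return indices of the k passages closest to the query (L2 distance)."""
    distances = [
        (math.dist(query_vec, vec), idx)
        for idx, vec in enumerate(passage_vecs)
    ]
    distances.sort()  # smallest distance first; ties broken by index
    return [idx for _, idx in distances[:k]]


# Toy embeddings; real embeddings have hundreds of dimensions.
passages = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (5.0, 5.0)]
print(top_k_by_l2((0.0, 0.0), passages, k=2))  # [0, 2]
```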


### Full-Document-Read

In [14]:
results_FDR = data_gatherer.parse_data(API_unsupported_input, fetched_data[API_unsupported_input], publisher="Nature", prompt_name="GPT_from_full_input_Examples", full_document_read=True, raw_data_format="HTML")

orchestrator.py - line 183 - INFO - Parsing data from URL: https://www.nature.com/articles/s41467-024-51831-7 with publisher: Nature
base_parser.py - line 181 - INFO - LLMParser initialized.
html_parser.py - line 82 - INFO - Initializing htmlRetriever
html_parser.py - line 345 - INFO - Function call: parse_data(api_data(html_str, Nature, https://www.nature.com/articles/s41467-024-51831-7, additional_data, HTML)
html_parser.py - line 426 - INFO - Function_call: extract_href_from_html_supplementary_material(tree, https://www.nature.com/articles/s41467-024-51831-7)
html_parser.py - line 490 - INFO - Extracted 4 unique supplementary material links from HTML.
html_parser.py - line 355 - INFO - Extracting links from full HTML content.
base_parser.py - line 432 - INFO - Function_call: extract_datasets_info_from_content(...)
prompt_manager.py - line 32 - INFO - Loading prompt: GPT_from_full_input_Examples from user_prompt_dir: None, subdir: 
base_parser.py - line 1190 - INFO - Expected string 

In [15]:
results_FDR

Unnamed: 0,dataset_identifier,data_repository,dataset_webpage,access_mode,link,title,file_info,description,source_section,section_class,...,file_extension,a_attr_class,a_attr_data-track,a_attr_data-track-action,a_attr_data-test,a_attr_data-track-label,a_attr_href,a_attr_data-supp-info-image,source_url,pub_title
0,GSE31210,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
1,GSE269782,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
2,GSE60189,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
3,GSE59239,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
4,GSE122005,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
5,GSE38121,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
6,GSE71587,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
7,GSE37699,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
8,PXD051771,www.proteomexchange.org,https://www.proteomexchange.org/cgi/GetDataset...,Complex download,,,,,,,...,,,,,,,,,https://www.nature.com/articles/s41467-024-518...,No title found
0,,,,,https://static-content.springer.com/esm/art%3A...,Supplementary Information,,,MOESM1,c-article-supplementary__item,...,pdf,print-link,click,view supplementary info,supp-info-link,supplementary information,https://static-content.springer.com/esm/art%3A...,,https://www.nature.com/articles/s41467-024-518...,No title found


In [16]:
parsed_data = {"results_RTR" : results_RTR, "results_FDR": results_FDR}
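
Both results stored here come from the same Nature article, so comparing them shows where the two parse methods disagree. The sketch below uses the dataset identifiers copied from the rows displayed above. It is a plain set comparison and assumes the displayed rows are the complete set of identifiers.

```python
# Identifiers from the displayed Retrieve-Then-Read output (cell 13).
rtr_ids = {
    "GSE269782", "GSE31210", "GSE106765", "GSE60189", "GSE59239",
    "GSE122005", "GSE38121", "GSE71587", "GSE37699", "PXD051771",
}
# Identifiers from the displayed Full-Document-Read output (cell 15).
fdr_ids = {
    "GSE31210", "GSE269782", "GSE60189", "GSE59239", "GSE122005",
    "GSE38121", "GSE71587", "GSE37699", "PXD051771",
}

only_rtr = sorted(rtr_ids - fdr_ids)
only_fdr = sorted(fdr_ids - rtr_ids)
shared = rtr_ids & fdr_ids

print(only_rtr)     # ['GSE106765']
print(only_fdr)     # []
print(len(shared))  # 9
```

If the results are pandas DataFrames, as the outputs suggest, the same sets could be built with `set(results_RTR["dataset_identifier"].dropna())`.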

## Step III: Classify data

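For now, only raw data files are considered, i.e. objects that can be accessed from a repository with an accession code. The classifier can be extended to other types of objects, such as supplementary materials. Note that `parsed_data` is keyed here by parse method rather than by publication URL, and both entries come from the same Nature article, so the "2 publications" count and the `publication_url` values in the output refer to the two parse results.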
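
The selection described above can be sketched as a filter that keeps only rows carrying an accession code. This is an illustration under stated assumptions, not the classifier's implementation: the function `select_raw_data_files` is hypothetical, rows are plain dicts rather than DataFrame rows, and the toy values are invented. The output columns mirror those of `raw_data_files`.

```python
RAW_DATA_COLUMNS = ("dataset_identifier", "data_repository", "dataset_webpage")


def select_raw_data_files(parsed):
    """Keep rows that have a dataset identifier (accession code).

    parsed: mapping of label -> list of row dicts (hypothetical structure).
    """
    selected = []
    for label, rows in parsed.items():
        for row in rows:
            if not row.get("dataset_identifier"):
                continue  # e.g. supplementary files carry no accession code
            record = {"publication_url": label}
            record.update({col: row.get(col) for col in RAW_DATA_COLUMNS})
            selected.append(record)
    return selected


# Toy input: one repository dataset and one supplementary file.
toy_parsed = {
    "https://example.org/article": [
        {"dataset_identifier": "EXAMPLE123", "data_repository": "example-repo",
         "dataset_webpage": "https://example.org/dataset/EXAMPLE123"},
        {"dataset_identifier": None, "title": "Supplementary Information",
         "file_extension": "pdf"},
    ]
}
print(len(select_raw_data_files(toy_parsed)))  # 1
```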

In [17]:
print(f"Fetched from {len(parsed_data)} publications.")
raw_data_files = data_gatherer.classifier.get_raw_data_files(parsed_data)
print(f"Fetched {len(raw_data_files)} raw data files:")

classifier.py - line 106 - INFO - Processing DataFrame for URL: results_RTR
classifier.py - line 106 - INFO - Processing DataFrame for URL: results_FDR


Fetched from 2 publications.
Fetched 19 raw data files:


In [18]:
raw_data_files

Unnamed: 0,publication_url,dataset_identifier,data_repository,dataset_webpage
0,results_RTR,GSE269782,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
1,results_RTR,GSE31210,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
2,results_RTR,GSE106765,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
3,results_RTR,GSE60189,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
4,results_RTR,GSE59239,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
5,results_RTR,GSE122005,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
6,results_RTR,GSE38121,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
7,results_RTR,GSE71587,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
8,results_RTR,GSE37699,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
9,results_RTR,PXD051771,www.proteomexchange.org,https://www.proteomexchange.org/cgi/GetDataset...
