Import the Orchestrator and load environment variables from the `.env` file.

In [1]:
from data_gatherer.orchestrator import Orchestrator
import time
from dotenv import load_dotenv

load_dotenv()

True

Instantiate the Orchestrator as `data_gatherer`.

In [2]:
data_gatherer = Orchestrator(llm_name="gemini-2.0-flash", log_level="ERROR")

## Step I: Fetch data

In [3]:
fetched_data = {}

The Fetcher retrieves raw data about scientific publications. The parent class is DataFetcher, and it has the following child classes:

- **WebScraper**: given a URL, scrapes the page and extracts its HTML content.
- **DatabaseFetcher**: given a key, fetches raw data (HTML or XML) from a local DataFrame.
- **APIClient**: given a URI, fetches raw data (XML) from an API.
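
To make the division of labor between the three fetchers concrete, here is a minimal sketch of how a URL could be routed to one of them. The function `choose_fetcher`, its `local_keys` parameter, and the routing rule itself are assumptions made for illustration; the library's actual selection logic may differ.

```python
from urllib.parse import urlparse


def choose_fetcher(url, local_keys=frozenset()):
    """Return the name of the fetcher a URL would be routed to (hypothetical rule)."""
    parsed = urlparse(url)
    # Normalize the article identifier, e.g. "pmc11425778" -> "PMC11425778".
    article_id = parsed.path.rstrip("/").split("/")[-1].upper()
    if article_id in local_keys:
        # Raw data for this article is already stored locally.
        return "DatabaseFetcher"
    if parsed.netloc.endswith("ncbi.nlm.nih.gov") and article_id.startswith("PMC"):
        # Assumption: PMC articles can be retrieved as XML through an API.
        return "APIClient"
    # Anything else falls back to scraping the HTML page.
    return "WebScraper"


for example_url in (
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466",
    "https://www.nature.com/articles/s41467-024-51831-7",
):
    print(example_url, "->", choose_fetcher(example_url))
```

Under these assumptions the PMC link is routed to the API client and the Nature link to the web scraper, which mirrors the three fetch examples that follow.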

In [4]:
data_gatherer.setup_data_fetcher()

### Fetch from API

In [5]:
API_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466"

In [6]:
raw_data_API = data_gatherer.fetch_data(API_supported_input)
fetched_data.update(raw_data_API)

### Fetch from Local Data

In [7]:
local_fetch_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778"

In [8]:
raw_data_local = data_gatherer.fetch_data(local_fetch_supported_input, local_fetch_file="../scripts/exp_input/Local_fetched_data.parquet")
fetched_data.update(raw_data_local)

### Fetch Raw HTML from a web page

In [9]:
API_unsupported_input = "https://www.nature.com/articles/s41467-024-51831-7"

In [10]:
raw_html = data_gatherer.fetch_data(API_unsupported_input, browser='Firefox', headless=True)
fetched_data.update(raw_html)

## Step II: Parse data

The Parser extracts the relevant information from the raw documents fetched by the Fetcher. It has two main discovery methods:

- **Retrieve-Then-Read**: the parser passes only the relevant sections of the raw document to the LLM, which extracts the relevant information from them.
- **Full-Document-Read**: a large-context LLM reads the entire raw document and extracts the relevant information.
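
The difference between the two methods comes down to what text is handed to the LLM. The sketch below illustrates this on a toy XML document using only the standard library. The function `build_llm_input`, the toy document, and the `data-availability` pattern are assumptions for illustration and not the library's implementation; the supplementary-material XPath is the one that appears in the parser output further down.

```python
import xml.etree.ElementTree as ET

# Toy document standing in for a PMC XML article (not real data).
TOY_XML = """<article>
  <body>
    <sec sec-type="intro"><p>Background text.</p></sec>
    <sec sec-type="data-availability"><p>Data are deposited under accession ABC123.</p></sec>
    <sec sec-type="supplementary-material"><p>Supplementary Data 1</p></sec>
  </body>
</article>"""

# Hypothetical retrieval patterns selecting the sections worth reading.
RETRIEVAL_PATTERNS = [
    ".//sec[@sec-type='data-availability']",
    ".//sec[@sec-type='supplementary-material']",
]


def section_text(element):
    """Collapse all text under an element into one whitespace-normalized string."""
    return " ".join("".join(element.itertext()).split())


def build_llm_input(xml_text, full_document_read=False):
    """Return the text that would be sent to the LLM under each method."""
    root = ET.fromstring(xml_text)
    if full_document_read:
        # Full-Document-Read: the whole document goes to the model.
        return section_text(root)
    # Retrieve-Then-Read: only sections matching a retrieval pattern are kept.
    chunks = []
    for pattern in RETRIEVAL_PATTERNS:
        chunks.extend(section_text(sec) for sec in root.findall(pattern))
    return "\n".join(chunks)


print(build_llm_input(TOY_XML))
print(build_llm_input(TOY_XML, full_document_read=True))
```

Retrieve-Then-Read keeps the prompt short but depends on the patterns matching the document layout, while Full-Document-Read avoids that dependency at the cost of a much larger prompt.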

### Retrieve-Then-Read

Version 0.1.0 supports this parsing method only for PMC XML and HTML documents.

In [11]:
results_RTR = data_gatherer.parse_data(API_supported_input, fetched_data, publisher="PMC", use_portkey_for_gemini=False, prompt_name="retrieve_datasets_simple_JSON_gemini")

In [12]:
results_RTR

Unnamed: 0,dataset_identifier,data_repository,source_section,retrieval_pattern,dataset_webpage,access_mode,link,source_url,download_link,title,content_type,id,surrounding_text,description,file_extension
0,PHS001049,dbGAP,data_availability,data availability,https://www.ncbi.nlm.nih.gov/projects/gap/cgi-...,Application to access,,,,,,,,,
1,PRJNA306801,SRA,data_availability,data availability,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,Complex download,,,,,,,,,
0,,,supplementary material,.//sec[@sec-type='supplementary-material'],,,41467_2018_6162_MOESM1_ESM.pdf,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM1,Supplementary Information,Supplementary Information,pdf
1,,,supplementary material,.//sec[@sec-type='supplementary-material'],,,41467_2018_6162_MOESM2_ESM.pdf,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM2,Peer Review File,Peer Review File,pdf
2,,,supplementary material,.//sec[@sec-type='supplementary-material'],,,41467_2018_6162_MOESM3_ESM.pdf,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM3,Description of Additional Supplementary Files,Description of Additional Supplementary Files,pdf
3,,,supplementary material,.//sec[@sec-type='supplementary-material'],,,41467_2018_6162_MOESM4_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM4,Supplementary Data 1,Supplementary Data 1,xlsx
4,,,supplementary material,.//sec[@sec-type='supplementary-material'],,,41467_2018_6162_MOESM5_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM5,Supplementary Data 2,Supplementary Data 2,xlsx
5,,,supplementary material,.//sec[@sec-type='supplementary-material'],,,41467_2018_6162_MOESM6_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM6,Supplementary Data 3,Supplementary Data 3,xlsx
6,,,supplementary material,.//sec[@sec-type='supplementary-material'],,,41467_2018_6162_MOESM7_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM7,Supplementary Data 4,Supplementary Data 4,xlsx
7,,,supplementary material,.//sec[@sec-type='supplementary-material'],,,41467_2018_6162_MOESM8_ESM.xlsx,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://pmc.ncbi.nlm.nih.gov/articles/instance...,No Title,local-data,MOESM8,Supplementary Data 5,Supplementary Data 5,xlsx


### Full-Document-Read

In [13]:
results_FDR = data_gatherer.parse_data(API_unsupported_input, fetched_data[API_unsupported_input], publisher="Nature", use_portkey_for_gemini=False, prompt_name="retrieve_datasets_simple_JSON_gemini", full_document_read=True, raw_data_format="HTML")

In [14]:
results_FDR

Unnamed: 0,dataset_identifier,data_repository,dataset_webpage,access_mode,link,title,file_info,description,source_section,section_class,download_link,file_extension,a_attr_class,a_attr_data-track,a_attr_data-track-action,a_attr_data-test,a_attr_data-track-label,a_attr_href,a_attr_data-supp-info-image,source_url
0,GSE31210,GEO dataset,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...
1,1ALU,Protein Data Bank,,,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...
2,1P9M,Protein Data Bank,,,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...
3,GSE106765,NCBI GEO,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...
4,PXD051771,ProteomeXchange,https://www.proteomexchange.org/cgi/GetDataset...,Complex download,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...
5,GSE269782,NCBI Gene Expression Omnibus (GEO),,,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...
6,GSE60189,NCBI GEO,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...
7,GSE59239,NCBI GEO,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...
8,GSE122005,NCBI GEO,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...
9,GSE38121,NCBI GEO,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,Complex download,,,,,,,,,,,,,,,,https://www.nature.com/articles/s41467-024-518...


In [15]:
parsed_data = {"results_RTR" : results_RTR, "results_FDR": results_FDR}

## Step III: Classify data

For now, only raw data files are considered, i.e., objects that can be accessed from a repository through an accession code. The classifier can be extended to cover other types of objects, such as supplementary materials.
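
As a rough illustration of that rule, the sketch below keeps only parsed rows that carry a dataset identifier and drops supplementary files. It uses plain dictionaries instead of DataFrames, and the helper `keep_raw_data_rows` is hypothetical; it is not the implementation of `get_raw_data_files`. The two sample rows are taken from the Retrieve-Then-Read output above.

```python
def keep_raw_data_rows(parsed):
    """Keep rows that have an accession code, tagged with the result set they came from."""
    rows = []
    for source, records in parsed.items():
        for record in records:
            identifier = record.get("dataset_identifier")
            if not identifier:
                # Supplementary files have no accession code and are skipped.
                continue
            rows.append(
                {
                    "publication_url": source,
                    "dataset_identifier": identifier,
                    "data_repository": record.get("data_repository"),
                }
            )
    return rows


toy_parsed = {
    "results_RTR": [
        {"dataset_identifier": "PHS001049", "data_repository": "dbGAP"},
        {"dataset_identifier": None, "link": "41467_2018_6162_MOESM1_ESM.pdf"},
    ],
}

for row in keep_raw_data_rows(toy_parsed):
    print(row)
```

Note that a DataFrame marks missing identifiers as NaN rather than `None`, so a pandas version of this filter would need an explicit null check instead of the truthiness test used here.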

In [16]:
print(f"Fetched from {len(parsed_data)} publications.")
raw_data_files = data_gatherer.classifier.get_raw_data_files(parsed_data)
print(f"Fetched {len(raw_data_files)} raw data files:")

Fetched from 2 publications.
Fetched 14 raw data files:


In [17]:
raw_data_files

Unnamed: 0,publication_url,dataset_identifier,data_repository,dataset_webpage
0,results_RTR,PHS001049,dbGAP,https://www.ncbi.nlm.nih.gov/projects/gap/cgi-...
1,results_RTR,PRJNA306801,SRA,https://www.ncbi.nlm.nih.gov/bioproject/?term=...
17,results_FDR,GSE31210,GEO dataset,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
18,results_FDR,1ALU,Protein Data Bank,
19,results_FDR,1P9M,Protein Data Bank,
20,results_FDR,GSE106765,NCBI GEO,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
21,results_FDR,PXD051771,ProteomeXchange,https://www.proteomexchange.org/cgi/GetDataset...
22,results_FDR,GSE269782,NCBI Gene Expression Omnibus (GEO),
23,results_FDR,GSE60189,NCBI GEO,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
24,results_FDR,GSE59239,NCBI GEO,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...
