# Workflows Example

In [1]:
import pandas as pd

from data_gatherer.data_gatherer import DataGatherer

Instantiate the DataGatherer orchestrator

In [2]:
data_gatherer = DataGatherer(llm_name="gemini-2.0-flash", log_level="ERROR")

## Step I: Fetch data

The Fetcher is responsible for fetching raw data about scientific publications from the internet. The parent class is DataFetcher, and its child classes are:

- **WebScraper**: given a URL, scrapes the page and extracts its HTML content.
- **DatabaseFetcher**: given a key, fetches raw data (HTML or XML) from a local DataFrame.
- **APIClient**: given a URI, fetches raw data (XML) from an API.
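
The choice among the three fetchers can be pictured as a small routing step. The sketch below is only an illustration under stated assumptions: `choose_fetcher`, `API_HOST`, and the routing rules are hypothetical and are not taken from the DataGatherer source. It returns the name of the fetcher class that such a router would select.

```python
from urllib.parse import urlparse

# Hypothetical routing rule for illustration only; not DataGatherer's actual logic.
API_HOST = "www.ncbi.nlm.nih.gov"  # assumed host of the only supported API (PMC)


def choose_fetcher(url: str, local_keys: frozenset = frozenset()) -> str:
    """Return the name of the fetcher class a simple router might pick."""
    parsed = urlparse(url)
    # Use the last path segment (e.g. "pmc11425778") as the lookup key.
    key = parsed.path.rstrip("/").split("/")[-1].lower()
    if key in local_keys:
        return "DatabaseFetcher"  # raw data already stored locally
    if parsed.netloc == API_HOST and "/pmc/" in parsed.path.lower():
        return "APIClient"  # supported API: fetch XML
    return "WebScraper"  # fall back to scraping the HTML page


print(choose_fetcher("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466"))
print(choose_fetcher("https://www.nature.com/articles/s41467-024-51831-7"))
```

In the notebook itself, this decision is made inside `data_gatherer.fetch_data`, so none of this needs to be written by hand.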

### Fetch from API

When the publication comes from a source with a supported API (currently only PubMed Central), the fetcher uses the APIClient to fetch the data in XML format.

In [4]:
API_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466"

In [5]:
raw_data_API = data_gatherer.fetch_data(API_supported_input)
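
An API request for a PMC article generally needs the article's PMCID rather than the full URL. As a minimal sketch, assuming the identifier is simply the `PMC<digits>` token in the URL, it could be extracted as follows. The helper `extract_pmcid` is hypothetical and not part of the DataGatherer API.

```python
import re
from typing import Optional

# Assumption: a PMCID is the literal prefix "PMC" followed by digits.
PMCID_PATTERN = re.compile(r"PMC\d+", re.IGNORECASE)


def extract_pmcid(url: str) -> Optional[str]:
    """Return the upper-cased PMCID found in the URL, or None if absent."""
    match = PMCID_PATTERN.search(url)
    return match.group(0).upper() if match else None


print(extract_pmcid("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466"))
```

A URL without a PMCID, such as the Nature link used later in this notebook, yields `None`, which is one plausible signal for falling back to another fetcher.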

### Fetch from Local Data

The local DataFrame has the following structure:

In [6]:
pd.read_parquet("../scripts/exp_input/Local_fetched_data.parquet").head(2)

| | file_name | raw_cont | format | length | path | publication |
|---|---|---|---|---|---|---|
| 0 | miR-33b-3p Acts as a Tumor Suppressor by Targe... | `<html lang="en" class=""><head>\n\n <me...` | html | 205313 | ../html_xml_samples/PMC/miR-33b-3p Acts as a T... | pmc8595470 |
| 1 | Murine neuronatin deficiency is associated wit... | `<html lang="en" class=""><head>\n\n <me...` | html | 238825 | ../html_xml_samples/PMC/Murine neuronatin defi... | pmc8413370 |


In [7]:
local_fetch_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778"

In [8]:
raw_data_local = data_gatherer.fetch_data(local_fetch_supported_input, local_fetch_file="../scripts/exp_input/Local_fetched_data.parquet")
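
Conceptually, a local fetch is a keyed lookup in this DataFrame. The sketch below assumes the lookup key is the lower-cased last segment of the URL matched against the `publication` column. That matching rule and the helper `lookup_local` are assumptions, and the toy DataFrame uses placeholder content in place of the real parquet file.

```python
import pandas as pd

# Toy stand-in for the parquet file, using a subset of the columns shown above.
# The raw_cont values are placeholders, not real article content.
local_df = pd.DataFrame(
    {
        "publication": ["pmc8595470", "pmc8413370"],
        "format": ["html", "html"],
        "raw_cont": ["<html>first article</html>", "<html>second article</html>"],
    }
)


def lookup_local(df: pd.DataFrame, url: str):
    """Return the stored format and raw content for a URL, or None if not stored."""
    key = url.rstrip("/").split("/")[-1].lower()
    hits = df.loc[df["publication"] == key]
    if hits.empty:
        return None
    row = hits.iloc[0]
    return {"format": row["format"], "raw_cont": row["raw_cont"]}


print(lookup_local(local_df, "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc8413370"))
```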

### Fetch Raw HTML from a web page

In [9]:
API_unsupported_input = "https://www.nature.com/articles/s41467-024-51831-7"

In [10]:
raw_html = data_gatherer.fetch_data(API_unsupported_input, browser='Firefox', headless=True)

## Step II: Parse data

The Parser is responsible for extracting the relevant information from the raw documents fetched by the Fetcher. It has two main discovery methods:

- **Retrieve-Then-Read**: the parser passes only the relevant sections of the raw document to the LLM, which extracts the relevant information from them.
- **Full-Document-Read**: a large-context LLM reads the entire raw document and extracts the relevant information.
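
The difference between the two methods lies in what text reaches the LLM. The following sketch illustrates this with the standard library only. The sample XML, its made-up accession `ABC123`, the keyword rule, and both helper functions are assumptions for illustration, not DataGatherer's implementation, and no LLM is called.

```python
import xml.etree.ElementTree as ET

# Invented miniature article; the accession "ABC123" is made up.
SAMPLE_XML = (
    "<article>"
    "<sec><title>Introduction</title><p>Background text.</p></sec>"
    "<sec><title>Data Availability</title>"
    "<p>Data are deposited under accession ABC123.</p></sec>"
    "</article>"
)


def retrieve_sections(xml_text: str, keyword: str = "data availability") -> list:
    """Retrieve-Then-Read input: only sections whose title contains the keyword."""
    root = ET.fromstring(xml_text)
    selected = []
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        if keyword in title.lower():
            selected.append(" ".join(sec.itertext()).strip())
    return selected


def full_document(xml_text: str) -> str:
    """Full-Document-Read input: all text in the document."""
    root = ET.fromstring(xml_text)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


print(retrieve_sections(SAMPLE_XML))
print(full_document(SAMPLE_XML))
```

The first call yields a short prompt context limited to the data availability section, while the second passes everything, which is only practical with a large-context model.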

### Retrieve-Then-Read

Version 0.1.0 supports this parsing method only for PMC XML and HTML documents.

In [11]:
results_RTR = data_gatherer.parse_data(raw_data_API, current_url_address=API_supported_input, publisher="PMC", use_portkey_for_gemini=True, prompt_name="retrieve_datasets_simple_JSON", section_filter="data_availability_statement")
results_RTR

| | dataset_identifier | data_repository | dataset_webpage | source_section | retrieval_pattern | access_mode | pub_title |
|---|---|---|---|---|---|---|---|
| 0 | PHS001049 | dbgap | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-... | data_availability | data availability | Application to access | Recurrent WNT pathway alterations are frequent... |
| 1 | PRJNA306801 | sra | https://www.ncbi.nlm.nih.gov/bioproject/?term=... | data_availability | data availability | Complex download | Recurrent WNT pathway alterations are frequent... |


In [12]:
results_RTR = data_gatherer.parse_data(raw_data_local, current_url_address=local_fetch_supported_input, publisher="PMC", use_portkey_for_gemini=True, prompt_name="retrieve_datasets_simple_JSON", raw_data_format="HTML", section_filter="data_availability_statement")
results_RTR

| | dataset_identifier | data_repository | dataset_webpage | access_mode | source_url | pub_title |
|---|---|---|---|---|---|---|
| 0 | MSV000092944 | massive.ucsd.edu | https://massive.ucsd.edu/ProteoSAFe/dataset.js... | Complex download | https://www.ncbi.nlm.nih.gov/pmc/articles/pmc1... | No title found |


In [13]:
results_RTR = data_gatherer.parse_data(raw_html, current_url_address=API_unsupported_input, publisher="Nature", use_portkey_for_gemini=True, prompt_name="retrieve_datasets_simple_JSON", raw_data_format="HTML", semantic_retrieval=True, section_filter="data_availability_statement")
results_RTR

| | dataset_identifier | data_repository | dataset_webpage | access_mode | source_url | pub_title |
|---|---|---|---|---|---|---|
| 0 | GSE269782 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 1 | GSE31210 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 2 | GSE106765 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 3 | GSE60189 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 4 | GSE59239 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 5 | GSE122005 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 6 | GSE38121 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 7 | GSE71587 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 8 | GSE37699 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 9 | PXD051771 | www.proteomexchange.org | https://www.proteomexchange.org/cgi/GetDataset... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |


### Full-Document-Read

In [14]:
results_FDR = data_gatherer.parse_data(raw_html, current_url_address=API_unsupported_input, publisher="Nature", prompt_name="GPT_from_full_input_Examples", full_document_read=True, raw_data_format="HTML", section_filter="data_availability_statement")
results_FDR

| | dataset_identifier | data_repository | dataset_webpage | access_mode | source_url | pub_title |
|---|---|---|---|---|---|---|
| 0 | GSE31210 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 1 | GSE269782 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 2 | GSE60189 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 3 | GSE59239 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 4 | GSE122005 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 5 | GSE38121 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 6 | GSE71587 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 7 | GSE37699 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 8 | GSE106765 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 9 | PXD051771 | www.proteomexchange.org | https://www.proteomexchange.org/cgi/GetDataset... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |


In [15]:
parsed_data = {"results_RTR": results_RTR, "results_FDR": results_FDR}
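
Since both parsing methods were applied to the same Nature article, their outputs can be compared directly. As a small sketch, the identifiers below are copied from the two result tables above, and an outer merge with `indicator=True` flags any identifier found by only one method. The variable names are illustrative.

```python
import pandas as pd

# Identifiers copied from the Retrieve-Then-Read and Full-Document-Read outputs above.
rtr_ids = ["GSE269782", "GSE31210", "GSE106765", "GSE60189", "GSE59239",
           "GSE122005", "GSE38121", "GSE71587", "GSE37699", "PXD051771"]
fdr_ids = ["GSE31210", "GSE269782", "GSE60189", "GSE59239", "GSE122005",
           "GSE38121", "GSE71587", "GSE37699", "GSE106765", "PXD051771"]

rtr = pd.DataFrame({"dataset_identifier": rtr_ids})
fdr = pd.DataFrame({"dataset_identifier": fdr_ids})

# The "_merge" column is "both" when an identifier appears in both result sets.
merged = rtr.merge(fdr, on="dataset_identifier", how="outer", indicator=True)
only_one = merged[merged["_merge"] != "both"]

print(f"{len(merged)} distinct identifiers, {len(only_one)} found by only one method")
```

For this article the two methods return the same ten identifiers, differing only in row order.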

## Step III: Classify data

For now, the classifier considers only raw data files, i.e., objects that can be accessed from a repository with an accession code. It can, however, be extended to include other types of objects, such as supplementary materials.
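
One way to picture "accessible with an accession code" is a pattern check on the identifier. The sketch below is an assumption-laden illustration: the prefixes are those seen in the outputs of this notebook, the repository labels mirror the `data_repository` column, and `ACCESSION_PATTERNS` and `match_repository` are hypothetical names. It is not how `get_raw_data_files` is known to work.

```python
import re
from typing import Optional

# Hypothetical prefix rules based on identifiers that appear in this notebook.
ACCESSION_PATTERNS = {
    "geo": re.compile(r"^GSE\d+$"),
    "www.proteomexchange.org": re.compile(r"^PXD\d+$"),
    "sra": re.compile(r"^PRJNA\d+$"),
    "dbgap": re.compile(r"^PHS\d+$", re.IGNORECASE),
    "massive.ucsd.edu": re.compile(r"^MSV\d+$"),
}


def match_repository(identifier: str) -> Optional[str]:
    """Return the repository label whose accession pattern matches, else None."""
    candidate = identifier.strip()
    for repository, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(candidate):
            return repository
    return None


for identifier in ["GSE269782", "PXD051771", "Supplementary Table 1"]:
    print(identifier, "->", match_repository(identifier))
```

An object such as a supplementary table matches no pattern here, which corresponds to the kind of item the classifier currently leaves out.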

In [16]:
print(f"Fetched from {len(parsed_data)} publications.")
raw_data_files = data_gatherer.classifier.get_raw_data_files(parsed_data)
print(f"Fetched {len(raw_data_files)} raw data files:")

Fetched from 2 publications.
Fetched 20 raw data files:


In [17]:
raw_data_files

| | publication_url | dataset_identifier | data_repository | dataset_webpage |
|---|---|---|---|---|
| 0 | results_RTR | GSE269782 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... |
| 1 | results_RTR | GSE31210 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... |
| 2 | results_RTR | GSE106765 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... |
| 3 | results_RTR | GSE60189 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... |
| 4 | results_RTR | GSE59239 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... |
| 5 | results_RTR | GSE122005 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... |
| 6 | results_RTR | GSE38121 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... |
| 7 | results_RTR | GSE71587 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... |
| 8 | results_RTR | GSE37699 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... |
| 9 | results_RTR | PXD051771 | www.proteomexchange.org | https://www.proteomexchange.org/cgi/GetDataset... |
