# Workflow Example

In [1]:
import pandas as pd

from data_gatherer.data_gatherer import DataGatherer

Instantiate the `DataGatherer` orchestrator and select the LLM you intend to use. Make sure the API keys for the target LLM are already set as environment variables.
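A quick pre-flight check can catch a missing key before the first LLM call. The sketch below is only illustrative: the variable name `GEMINI_KEY` is an assumption, so substitute whatever name your setup actually reads.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Hypothetical variable name; replace it with the one your setup expects.
missing = missing_env_vars(["GEMINI_KEY"])
if missing:
    print(f"Set these environment variables first: {', '.join(missing)}")
```

Passing `environ` explicitly keeps the helper easy to test without touching the real environment.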

In [None]:
data_gatherer = DataGatherer(llm_name="gemini-2.0-flash")

## Step I: Obtaining Papers

The Fetcher retrieves scientific publications from the web when you do not have them locally. The publications should be open access. The following fetch methods are supported:

- **HttpGetRequest**: given a URL, fetches the static HTML of a webpage with a simple GET request.
- **WebScraper**: given a URL, retrieves the raw HTML of a dynamic webpage with a Selenium WebDriver.
- **DatabaseFetcher**: given a key, fetches raw data (HTML or XML) from a local DataFrame.
- **EntrezFetcher**: given a PMCID, fetches the raw data (XML) from the Entrez E-Utils API.
- **PdfFetcher**: given a URL pointing to a PDF file on the internet, downloads the PDF to your local system. If you have already downloaded the PDF, you can skip the fetching step and go to the parsing step.
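To make the differences between the fetch methods concrete, here is a minimal routing sketch. It is not DataGatherer's actual selection logic: the function `choose_fetch_method` and its `dynamic_page` flag are hypothetical, and the rules simply mirror the descriptions in the list above.

```python
import re

PMCID_PATTERN = re.compile(r"^PMC\d+$", re.IGNORECASE)


def choose_fetch_method(source, local_fetch_file=None, dynamic_page=False):
    """Pick a fetch method name for `source` (illustrative routing only)."""
    if local_fetch_file is not None:
        return "DatabaseFetcher"  # key looked up in a local DataFrame
    if PMCID_PATTERN.match(source):
        return "EntrezFetcher"  # bare PMCID, served by the Entrez E-Utils API
    if source.lower().split("?")[0].endswith(".pdf"):
        return "PdfFetcher"  # URL pointing to a PDF file
    if dynamic_page:
        return "WebScraper"  # page needs a Selenium WebDriver to render
    return "HttpGetRequest"  # static HTML via a simple GET request


print(choose_fetch_method("PMC6141466"))
print(choose_fetch_method("pmc8595470", local_fetch_file="Local_fetched_data.parquet"))
```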

### Fetch Raw HTML with WebScraper

In this case, the Selenium WebDriver navigates to the page and fetches the raw data from the URL provided.

In [None]:
article_url = "https://www.nature.com/articles/s41467-024-51831-7"

In [None]:
raw_html = data_gatherer.fetch_data(article_url, browser='Firefox', headless=True)

### Fetch Data from a Local File

The cell below shows the structure of the local DataFrame. Pass the path to this file as the value of the `local_fetch_file` parameter.


In [6]:
pd.read_parquet("../scripts/exp_input/Local_fetched_data.parquet").head(2)

| | file_name | raw_cont | format | length | path | publication |
|---|---|---|---|---|---|---|
| 0 | miR-33b-3p Acts as a Tumor Suppressor by Targe... | `<html lang="en" class=""><head>\n\n <me...` | html | 205313 | ../html_xml_samples/PMC/miR-33b-3p Acts as a T... | pmc8595470 |
| 1 | Murine neuronatin deficiency is associated wit... | `<html lang="en" class=""><head>\n\n <me...` | html | 238825 | ../html_xml_samples/PMC/Murine neuronatin defi... | pmc8413370 |


In [None]:
publication_key = "pmc8595470"

In [None]:
raw_data_local = data_gatherer.fetch_data(publication_key, local_fetch_file="../scripts/exp_input/Local_fetched_data.parquet")
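Conceptually, a local fetch is a keyed lookup in that DataFrame. The sketch below illustrates the idea on a toy table; it assumes the `publication` column is the lookup key, and both the helper `lookup_raw_content` and the sample rows are hypothetical rather than part of DataGatherer.

```python
import pandas as pd

# Toy stand-in for the parquet file, using a subset of the columns shown above.
local_df = pd.DataFrame(
    {
        "file_name": ["paper_a", "paper_b"],
        "raw_cont": ["<html>A</html>", "<html>B</html>"],
        "format": ["html", "html"],
        "publication": ["pmc0000001", "pmc0000002"],
    }
)


def lookup_raw_content(df, publication_key):
    """Return (raw content, format) for the row matching `publication_key`."""
    rows = df.loc[df["publication"] == publication_key.lower()]
    if rows.empty:
        raise KeyError(f"No local record for {publication_key!r}")
    first = rows.iloc[0]
    return first["raw_cont"], first["format"]


raw, fmt = lookup_raw_content(local_df, "pmc0000002")
```

Raising `KeyError` for an unknown key makes a missing local record visible immediately instead of passing empty content on to the parser.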

### Fetch from Entrez E-Utils API

You can provide either the URL of a PubMed Central article or its PMC ID as input.

In [None]:
article_uri = "PMC6141466"

In [None]:
raw_data_API = data_gatherer.fetch_data(article_uri)
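Since both a URL and a bare PMC ID are accepted, the input has to be normalized at some point. The sketch below shows one way to do that and to build an E-utilities `efetch` request URL. The helper names are hypothetical, and DataGatherer may handle this differently internally.

```python
import re
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def extract_pmcid(article_uri):
    """Return the PMCID found in a bare ID or a PubMed Central URL."""
    match = re.search(r"PMC\d+", article_uri, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"No PMCID found in {article_uri!r}")
    return match.group(0).upper()


def build_efetch_url(pmcid):
    """Build an efetch URL requesting the article XML from the pmc database."""
    numeric_id = pmcid[3:]  # drop the "PMC" prefix and keep the digits
    params = {"db": "pmc", "id": numeric_id, "retmode": "xml"}
    return f"{EFETCH_URL}?{urlencode(params)}"


pmcid = extract_pmcid("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466/")
print(build_efetch_url(pmcid))
```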

## Step II: Parsing Data

The Parser is responsible for extracting the relevant information from the raw documents obtained by the Fetcher. It has two main discovery methods:


- **Retrieve-Then-Read**: the parser filters the paper content before passing it to the LLM, which shortens the input, usually to the order of $10^3$ tokens. This lets you use Small Language Models and reduce inference cost while keeping precision high, at the price of a possible decrease in recall.
- **Full-Document-Read**: a large-context LLM reads the entire raw document and extracts the relevant information. Input size usually ranges between $10^4$ and $10^5$ tokens. On average this recalls more dataset information, but with lower precision.
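The choice between the two methods largely comes down to whether the whole document fits the model's context window. Here is a rough sizing sketch, assuming about 4 characters per token (a common rule of thumb, not an exact tokenizer) and reading the `length` value 205313 from the local DataFrame above as a character count. The function names and the `reserved_tokens` margin are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate based on an assumed characters-per-token ratio."""
    return max(1, len(text) // chars_per_token)


def choose_strategy(document, context_window_tokens, reserved_tokens=2_000):
    """Use Full-Document-Read only when the whole document fits the model."""
    budget = context_window_tokens - reserved_tokens  # room left for the prompt and answer
    if estimate_tokens(document) <= budget:
        return "full_document_read"
    return "retrieve_then_read"


page = "x" * 205_313  # stand-in for a raw HTML page of that length
print(estimate_tokens(page), choose_strategy(page, context_window_tokens=8_192))
```

Under these assumptions the page comes to roughly 5 × 10^4 tokens, which falls inside the Full-Document-Read range quoted above and is far too large for a small-context model.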


### Retrieve-Then-Read 

How well curated the default retrieval rules are depends on the publisher:
- PMC retrieval patterns are well curated.
- For other publishers, you can add the desired patterns to the config file `retrieval_patterns.json`.
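Pattern-based retrieval can be pictured as matching section titles against a set of regular expressions. The sketch below is a guess at the idea only: the `RETRIEVAL_PATTERNS` dictionary and `match_sections` helper are hypothetical, and the real structure of `retrieval_patterns.json` may differ.

```python
import re

# Hypothetical patterns; the actual config file may be organized differently.
RETRIEVAL_PATTERNS = {
    "data_availability": [r"data\s+availability", r"availability\s+of\s+data"],
    "supplementary": [r"supplementary\s+(data|material|information)"],
}


def match_sections(sections, patterns=RETRIEVAL_PATTERNS):
    """Return (label, title, text) for sections whose title matches a pattern."""
    hits = []
    for title, text in sections.items():
        for label, regexes in patterns.items():
            if any(re.search(rx, title, flags=re.IGNORECASE) for rx in regexes):
                hits.append((label, title, text))
                break  # one label per section is enough
    return hits


sections = {
    "Introduction": "Background and motivation.",
    "Data Availability Statement": "Data are deposited in a public repository.",
    "Supplementary Information": "Additional tables and figures.",
}
hits = match_sections(sections)
```

Only the matching sections would then be passed to the LLM, which is what keeps the Retrieve-Then-Read input short.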

Alternatively, set `semantic_retrieval=True` and `top_k=3` to select the 3 most similar sections in the article without hard-coding any patterns. You can choose the embedding model by name (default: `sentence-transformers/all-MiniLM-L6-v2`); any Hugging Face sentence-transformers model is supported.
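The top-k selection itself is a similarity ranking. To keep the sketch below dependency-free, a bag-of-words vector stands in for the sentence-transformers embedding, so the scores are lexical rather than semantic. All function names and the example query are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_sections(query, sections, top_k=3):
    """Return the titles of the `top_k` sections most similar to `query`."""
    query_vec = bag_of_words(query)
    scored = [
        (cosine_similarity(query_vec, bag_of_words(text)), title)
        for title, text in sections.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]


sections = {
    "Methods": "cells were cultured and sequenced",
    "Data availability": "the datasets are available in a public repository under an accession code",
    "Acknowledgements": "we thank the funding agency",
}
best = top_k_sections("datasets available repository accession", sections, top_k=1)
```

Swapping `bag_of_words` for a real embedding model changes the vectors but not the ranking logic.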

In [11]:
results_RTR = data_gatherer.parse_data(raw_data_API, current_url_address=article_uri, publisher="PMC", use_portkey_for_gemini=True, prompt_name="retrieve_datasets_simple_JSON", section_filter="data_availability_statement")
results_RTR

| | dataset_identifier | data_repository | dataset_webpage | source_section | retrieval_pattern | access_mode | pub_title |
|---|---|---|---|---|---|---|---|
| 0 | PHS001049 | dbgap | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-... | data_availability | data availability | Application to access | Recurrent WNT pathway alterations are frequent... |
| 1 | PRJNA306801 | sra | https://www.ncbi.nlm.nih.gov/bioproject/?term=... | data_availability | data availability | Complex download | Recurrent WNT pathway alterations are frequent... |


In [12]:
# Set local_fetch_supported_input to the URL of the locally stored article before running this cell.
results_RTR = data_gatherer.parse_data(raw_data_local, current_url_address=local_fetch_supported_input, publisher="PMC", use_portkey_for_gemini=True, prompt_name="retrieve_datasets_simple_JSON", raw_data_format="HTML", section_filter="data_availability_statement")
results_RTR

| | dataset_identifier | data_repository | dataset_webpage | access_mode | source_url | pub_title |
|---|---|---|---|---|---|---|
| 0 | MSV000092944 | massive.ucsd.edu | https://massive.ucsd.edu/ProteoSAFe/dataset.js... | Complex download | https://www.ncbi.nlm.nih.gov/pmc/articles/pmc1... | No title found |


In [13]:
results_RTR = data_gatherer.parse_data(raw_html, current_url_address=article_url, publisher="Nature", use_portkey_for_gemini=True, prompt_name="retrieve_datasets_simple_JSON", raw_data_format="HTML", semantic_retrieval=True, section_filter="data_availability_statement")
results_RTR

| | dataset_identifier | data_repository | dataset_webpage | access_mode | source_url | pub_title |
|---|---|---|---|---|---|---|
| 0 | GSE269782 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 1 | GSE31210 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 2 | GSE106765 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 3 | GSE60189 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 4 | GSE59239 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 5 | GSE122005 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 6 | GSE38121 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 7 | GSE71587 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 8 | GSE37699 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 9 | PXD051771 | www.proteomexchange.org | https://www.proteomexchange.org/cgi/GetDataset... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |


### Full-Document-Read

In [14]:
results_FDR = data_gatherer.parse_data(raw_html, current_url_address=article_url, publisher="Nature", prompt_name="GPT_from_full_input_Examples", full_document_read=True, raw_data_format="HTML", section_filter="data_availability_statement")
results_FDR

| | dataset_identifier | data_repository | dataset_webpage | access_mode | source_url | pub_title |
|---|---|---|---|---|---|---|
| 0 | GSE31210 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 1 | GSE269782 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 2 | GSE60189 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 3 | GSE59239 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 4 | GSE122005 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 5 | GSE38121 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 6 | GSE71587 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 7 | GSE37699 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 8 | GSE106765 | geo | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |
| 9 | PXD051771 | www.proteomexchange.org | https://www.proteomexchange.org/cgi/GetDataset... | Complex download | https://www.nature.com/articles/s41467-024-518... | No title found |


In [15]:
parsed_data = {"results_RTR": results_RTR, "results_FDR": results_FDR}
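With both result sets in hand, a natural next step is to check how much the two methods agree. The sketch below compares the dataset identifiers copied from the Nature article outputs above; the helper `compare_identifier_sets` is hypothetical. With the live DataFrames you could build the sets with `set(results_RTR["dataset_identifier"])` instead.

```python
# Dataset identifiers copied from the Retrieve-Then-Read and Full-Document-Read outputs.
rtr_ids = {
    "GSE269782", "GSE31210", "GSE106765", "GSE60189", "GSE59239",
    "GSE122005", "GSE38121", "GSE71587", "GSE37699", "PXD051771",
}
fdr_ids = {
    "GSE31210", "GSE269782", "GSE60189", "GSE59239", "GSE122005",
    "GSE38121", "GSE71587", "GSE37699", "GSE106765", "PXD051771",
}


def compare_identifier_sets(first, second):
    """Summarize agreement between two sets of dataset identifiers."""
    shared = first & second
    union = first | second
    return {
        "shared": sorted(shared),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


summary = compare_identifier_sets(rtr_ids, fdr_ids)
print(summary["jaccard"], summary["only_first"], summary["only_second"])
```

For this article the two methods return the same ten identifiers, just in a different order, so the Jaccard index is 1.0.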