Import the orchestrator and load environment variables from the ".env" file.

In [1]:
from data_gatherer.orchestrator import Orchestrator
from data_gatherer.data_fetcher import DataFetcher, WebScraper, DatabaseFetcher, APIClient
from data_gatherer.selenium_setup import create_driver
from data_gatherer.parser import LLMParser
import requests
import pandas as pd
import time
from dotenv import load_dotenv

load_dotenv()

True

Instantiate the orchestrator as the data gatherer.

In [2]:
data_gatherer = Orchestrator()

## Step I: Fetch data

The Fetcher retrieves raw data about scientific publications, either from the internet or from a local store. The parent class is DataFetcher, and it has three child classes:

- **WebScraper**: given a URL, it scrapes the page and extracts the HTML content.
- **DatabaseFetcher**: given a key, it fetches raw data (HTML or XML) from a local DataFrame.
- **APIClient**: given a URI, it fetches raw data (XML) from an API.
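The parent/child layout described above can be sketched in a few lines. This is only an illustration of the pattern, not the library's code: `BaseFetcher`, `InMemoryFetcher`, `EchoFetcher`, and `fetch_all` are hypothetical names, and nothing here touches the network or the real `DataFetcher` classes.

```python
from abc import ABC, abstractmethod


class BaseFetcher(ABC):
    """Hypothetical stand-in for a parent fetcher: one method, many sources."""

    @abstractmethod
    def fetch_data(self, source: str) -> str:
        """Return the raw document for `source`."""


class InMemoryFetcher(BaseFetcher):
    """Toy analogue of a local lookup: raw documents keyed by identifier."""

    def __init__(self, table: dict[str, str]):
        self.table = table

    def fetch_data(self, source: str) -> str:
        # Normalise case so "pmc1" and "PMC1" hit the same entry.
        return self.table[source.lower()]


class EchoFetcher(BaseFetcher):
    """Toy analogue of a remote fetch: fabricates a document, no network access."""

    def fetch_data(self, source: str) -> str:
        return f"<html><body>content of {source}</body></html>"


def fetch_all(fetcher: BaseFetcher, sources: list[str]) -> dict[str, str]:
    # Callers depend only on the shared fetch_data interface,
    # so any subclass can be swapped in without changing this function.
    return {src: fetcher.fetch_data(src) for src in sources}
```

The point of the shared `fetch_data` method is that downstream steps do not need to know where a raw document came from.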

In [3]:
data_gatherer.setup_data_fetcher()

### Fetch from API

In [4]:
API_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466"
api_fetcher = APIClient(requests, 'PMC_API', data_gatherer.config, data_gatherer.logger)
api_fetcher

<data_gatherer.data_fetcher.APIClient at 0x142f4db10>

In [5]:
raw_data_API = api_fetcher.fetch_data(API_supported_input)
raw_data_API

<Element pmc-articleset at 0x13e6ff8c0>

### Fetch from Local Data

In [6]:
local_fetch_supported_input = "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778"

In [7]:
df_fetcher = DatabaseFetcher(data_gatherer.config, data_gatherer.logger)
df_fetcher

<data_gatherer.data_fetcher.DatabaseFetcher at 0x13e0220d0>

In [8]:
raw_data_local = df_fetcher.fetch_data(local_fetch_supported_input)
print(f"Length of raw content: {len(raw_data_local)}")

Length of raw content: 205661


### Fetch Raw HTML from a web page

In [9]:
API_unsupported_input = "https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/"

In [10]:
driver = create_driver(
    data_gatherer.config['DRIVER_PATH'],
    data_gatherer.config['BROWSER'],
    data_gatherer.config['HEADLESS'],
    data_gatherer.logger,
)
web_fetcher = WebScraper(driver, data_gatherer.config, data_gatherer.logger)

In [11]:
raw_html = web_fetcher.fetch_data(API_unsupported_input)
print(f"Length of raw content: {len(raw_html)}")

Length of raw content: 564434


## Step II: Parse data

The Parser extracts the relevant information from the raw documents returned by the Fetcher. It supports two discovery methods:

- **Retrieve-Then-Read**: the parser passes only the relevant sections of the raw document to the LLM, which extracts the relevant information from them.
- **Full-Document-Read**: a large-context LLM reads the entire raw document and extracts the relevant information.
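The difference between the two methods lies in what text reaches the model. The sketch below illustrates that idea only; it is not how `LLMParser` selects sections. The keyword list, the blank-line section splitting, and all function names are assumptions made for the example.

```python
import re

# Hypothetical keywords marking sections likely to mention datasets.
KEYWORDS = ("data availability", "accession", "deposited", "repository")


def split_sections(document: str) -> list[str]:
    # Treat blank-line-separated blocks as sections (a simplification).
    return [s.strip() for s in re.split(r"\n\s*\n", document) if s.strip()]


def retrieve_then_read_input(document: str) -> str:
    # Keep only the sections that mention a keyword, then join them for the model.
    relevant = [
        section
        for section in split_sections(document)
        if any(keyword in section.lower() for keyword in KEYWORDS)
    ]
    return "\n\n".join(relevant)


def full_document_read_input(document: str) -> str:
    # Pass everything through; this relies on a model with a large context window.
    return document
```

Retrieve-Then-Read sends less text to the model but can miss datasets mentioned outside the retrieved sections, while Full-Document-Read avoids that risk at the cost of a much longer prompt.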

### Retrieve-Then-Read

In [12]:
parser = LLMParser(data_gatherer.config['parser_config_path'], data_gatherer.logger, full_document_read=False)

In [13]:
input_url = [API_supported_input, local_fetch_supported_input, API_unsupported_input]
input_format = ["XML", "full_HTML", "full_HTML"]
input_cont = [raw_data_API, raw_data_local, raw_html]

dfs_dict_RTR = {}
for url, content, fmt in zip(input_url, input_cont, input_format):
    append_data = parser.parse_data(content, "PMC", url, raw_data_format=fmt)
    dfs_dict_RTR[url] = append_data
    print(f"Parsed data from {url}. Found {len(append_data)} candidate datasets.")
    time.sleep(1)  # Pause so the printed output appears in the right order



Parsed data from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466. Found 32 candidate datasets.
Parsed data from https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778. Found 3 candidate datasets.
Parsed data from https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/. Found 10 candidate datasets.


### Full-Document-Read

In [14]:
parser = LLMParser(data_gatherer.config['parser_config_path'], data_gatherer.logger, full_document_read=True)

In [15]:
input_url = [API_supported_input, local_fetch_supported_input, API_unsupported_input]
input_format = ["XML", "full_HTML", "full_HTML"]
input_cont = [raw_data_API, raw_data_local, raw_html]

dfs_dict_FDR = {}
for url, content, fmt in zip(input_url, input_cont, input_format):
    append_data = parser.parse_data(content, "PMC", url, raw_data_format=fmt)
    dfs_dict_FDR[url] = append_data
    print(f"Parsed data from {url}. Found {len(append_data)} candidate datasets.")
    time.sleep(1)  # Pause so the printed output appears in the right order



Parsed data from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466. Found 17 candidate datasets.
Parsed data from https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778. Found 3 candidate datasets.
Parsed data from https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/. Found 12 candidate datasets.
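
The two strategies report different candidate counts for two of the three publications. One way to put them side by side is sketched below, using the counts printed in the outputs of this tutorial; the helper name `count_deltas` is hypothetical.

```python
# Candidate counts copied from the printed outputs of the two parsing runs.
rtr_counts = {
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466": 32,
    "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778": 3,
    "https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/": 10,
}
fdr_counts = {
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6141466": 17,
    "https://www.ncbi.nlm.nih.gov/pmc/articles/pmc11425778": 3,
    "https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/": 12,
}


def count_deltas(rtr: dict[str, int], fdr: dict[str, int]) -> dict[str, int]:
    # Positive delta: Full-Document-Read found more candidates than Retrieve-Then-Read.
    return {url: fdr[url] - rtr[url] for url in rtr if url in fdr}


for url, delta in count_deltas(rtr_counts, fdr_counts).items():
    print(f"{delta:+d}  {url}")
```

In a live session, the same dictionaries could be built from the parser results, for example with `{url: len(df) for url, df in dfs_dict_RTR.items()}`. A difference in counts says nothing by itself about which strategy is more accurate; the candidates still need to be inspected.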


## Step III: Classify data

For now, we are only considering raw data files, i.e. those objects that can be accessed from a repository with an accession code. However, the classifier can be extended to include other types of objects, such as supplementary materials.
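
To make "accession code" concrete, here is a minimal sketch that recognises three identifier formats appearing in this tutorial. The patterns and the function `match_repository` are illustrative assumptions, not the classifier's actual rules; the classifier used here evidently applies broader criteria, since its output also keeps entries identified by a URL.

```python
import re
from typing import Optional

# Illustrative patterns only (assumptions, not the classifier's actual rules).
ACCESSION_PATTERNS = {
    "dbGaP": re.compile(r"phs\d{6}", re.IGNORECASE),        # e.g. phs001287
    "BioProject": re.compile(r"PRJNA\d+", re.IGNORECASE),   # e.g. PRJNA306801
    "MassIVE": re.compile(r"MSV\d{9}", re.IGNORECASE),      # e.g. MSV000092944
}


def match_repository(identifier: str) -> Optional[str]:
    """Return the repository whose pattern matches the whole identifier, else None."""
    candidate = identifier.strip()
    for repository, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(candidate):
            return repository
    return None
```

Using `fullmatch` rather than `search` keeps a URL that merely contains an accession-like substring from being treated as an accession code.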

In [16]:
print(f"Fetched from {len(dfs_dict_FDR)} publications.")
raw_data_files = data_gatherer.classifier.get_raw_data_files(dfs_dict_FDR)
print(f"Fetched {len(raw_data_files)} raw data files:")

Fetched from 3 publications.
Fetched 6 raw data files:


In [17]:
raw_data_files

| | publication_url | dataset_identifier | data_repository | dataset_webpage |
|---|---|---|---|---|
| 0 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6... | PHS001049 | dbGAP | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-... |
| 1 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6... | PRJNA306801 | SRA | https://www.ncbi.nlm.nih.gov/bioproject/?term=... |
| 17 | https://www.ncbi.nlm.nih.gov/pmc/articles/pmc1... | MSV000092944 | MassIVE database | https://massive.ucsd.edu/ProteoSAFe/dataset.js... |
| 20 | https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/ | phs001287 | dbGaP | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-... |
| 21 | https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/ | https://cptac-data-portal.georgetown.edu/cptac... | CPTAC Data Portal | |
| 22 | https://pmc.ncbi.nlm.nih.gov/articles/PMC7233456/ | http://www.linkedomics.org/ | LinkedOmics | |
