
Rename integration.prm.unified_nodes (and edges) datasets to integration.prm.unified_nodes@spark #1982

Merged
matentzn merged 4 commits into hf-kg-upload from use-spark-name-hf-upload
Dec 5, 2025

Conversation


@matentzn matentzn commented Dec 5, 2025

Description of the changes

This PR renames the integration.prm.unified_nodes (and edges) datasets to integration.prm.unified_nodes@spark, in line with Kedro naming conventions. This is the last request from the review of #1967.
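For context, Kedro's `@suffix` transcoding convention lets one logical dataset be read and written through different backends. A minimal, hypothetical catalog sketch of the pattern (paths and dataset types are illustrative, not the project's actual configuration):

```yaml
# Hypothetical sketch of Kedro transcoding: both entries point at the same
# files, but nodes can consume the data as Spark or pandas by name.
integration.prm.unified_nodes@spark:
  type: spark.SparkDataset
  filepath: data/prm/unified_nodes
  file_format: parquet

integration.prm.unified_nodes@pandas:
  type: pandas.ParquetDataset
  filepath: data/prm/unified_nodes
```

Kedro treats everything before the `@` as the same logical dataset, so the suffix only selects which loader/saver a given node uses.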

Fixes / Resolves the following issues:

Aligns with Kedro coding conventions; see @pascalwhoop's request in #1967 (comment)

@JacquesVergine this PR seems to have a lot of consequences everywhere, and I don't know how to test it. When you review this, can you make sure I:

  1. did not accidentally rename a wrong dataset (remember, pipelines/matrix/conf/sample/integration/catalog.yml has a dataset with the same name, which I did NOT rename, as per previous conversations). A good candidate where I might have made a mistake: pipelines/matrix/src/matrix/pipelines/create_sample/pipeline.py; but note that the pipeline is called create_sample, not sample (as in the catalog above), so I have no idea whether they are the same thing.
  2. was right in updating the blog post

Checklist:

  • Added label to PR (e.g. enhancement or bug)
  • Ensured the PR is named descriptively. FYI: This name is used as part of our changelog & release notes.
  • Looked at the diff on github to make sure no unwanted files have been committed.
  • Made corresponding changes to the documentation
  • Added tests that prove my fix is effective or that my feature works
  • Any dependent changes have been merged and published in downstream modules
  • If breaking changes occur or you need everyone to run a command locally after
    pulling in latest main, uncomment the below "Merge Notification" section and
    describe steps necessary for people
  • Ran on sample data using kedro run -e sample -p test_sample (see sample environment guide)

…@spark

This is to be more aligned with the convention.

Changed references from 'integration.prm.unified_nodes' to 'integration.prm.unified_nodes@spark' in documentation, in line with the renaming of the dataset.
@matentzn matentzn requested a review from a team as a code owner December 5, 2025 10:47
@matentzn matentzn requested review from matwasilewski and removed request for a team December 5, 2025 10:47
@matentzn matentzn changed the title from "Use spark name hf upload" to "Rename integration.prm.unified_nodes (and edges) datasets to integration.prm.unified_nodes@spark" Dec 5, 2025

@JacquesVergine JacquesVergine left a comment


Looks good to me!

Introduces a new CI step to run 'kedro catalog resolve' with the 'sample' environment, ensuring catalog resolution is tested for this configuration.
Renamed 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' in sample pipeline to include '@spark' suffix.
@matentzn matentzn merged commit 0a21fe5 into hf-kg-upload Dec 5, 2025
1 check passed
@matentzn matentzn deleted the use-spark-name-hf-upload branch December 5, 2025 12:08
matentzn added a commit that referenced this pull request Jan 8, 2026
* Register data_publication pipeline

Added import and registration for the data_publication pipeline in pipeline_registry.py to enable its use within the project.

* Create a new pipeline data_publication, for publishing HF datasets

This pipeline is a minimal variant of @pascalwhoop's draft supplied in #1932.

The pipeline takes the released integrated graph (edges and nodes separately) and publishes them on the Hugging Face Hub.

* Remove 'credentials' field from HuggingFace dataset configs

Eliminated the 'credentials: hf' field from multiple HuggingFace dataset entries in catalog.yml to simplify configuration and rely on default authentication mechanisms.

* Update pandas dataset references for integration PRM

Removed pandas dataset definitions from data_publication catalog and added them to integration catalog with '@pandas' suffix to leverage Kedro transcoding. Updated pipeline to use new '@pandas' dataset names for publishing nodes and edges to Hugging Face.

* Replace publish_dataset_to_hf function with a lambda

Eliminated the publish_dataset_to_hf function from nodes.py and replaced its usage in the pipeline with a passthrough lambda function. This simplifies the pipeline since dataset publishing is handled via catalog configuration.
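The idea behind that simplification can be sketched in a few lines. This is a hypothetical illustration, not the project's actual pipeline code; the node wiring in the comment uses illustrative names:

```python
# Hypothetical sketch: when publishing is handled entirely by the catalog's
# dataset class (e.g. an HFIterableDataset configured on the output), the
# node itself only needs to pass the dataframe through unchanged.
passthrough = lambda df: df

# In a Kedro pipeline this would be wired up roughly as (names illustrative):
#   node(func=passthrough,
#        inputs="integration.prm.unified_nodes",
#        outputs="data_publication.prm.kg_nodes_hf_published")
```

Saving the node's output then triggers the upload as a side effect of the catalog entry, so no publishing logic lives in nodes.py.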

* Remove dataset publication verification nodes

Eliminates verification nodes and related config for published datasets in the data publication pipeline. Adds upload verification logic to HFIterableDataset. This streamlines the pipeline by removing redundant read-back checks and consolidates verification within the dataset class.

* Run ruff

* Update Hugging Face dataset keys for data publication

Renamed dataset keys in catalog.yml and pipeline.py from 'kg_edges_hf_published' and 'kg_nodes_hf_published' to 'data_publication.kg_edges_hf_published' and 'data_publication.kg_nodes_hf_published' to fulfill project requirements.

* Run ruff formatting

* Delete parameters.yml

Empty parameter files are not allowed in the Matrix monorepo.

* Update data publication catalog and pipeline keys

Renamed catalog and pipeline output keys to include 'prm.' prefix for consistency.

* Rename `integration.prm.unified_nodes` (and edges) datasets to `integration.prm.unified_nodes@spark` (#1982)

* Rename integration.prm.unified_nodes to integration.prm.unified_nodes@spark

This is to be more aligned with the convention.

* Update dataset references to use Spark variant

Changed references from 'integration.prm.unified_nodes' to 'integration.prm.unified_nodes@spark' in documentation in line with the renaming of the dataset.

* Add kedro catalog resolve for sample environment

Introduces a new CI step to run 'kedro catalog resolve' with the 'sample' environment, ensuring catalog resolution is tested for this configuration.
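As a rough illustration, such a CI step might look like the following in a GitHub Actions workflow; the step name, working directory, and use of uv are assumptions, not the repository's actual workflow:

```yaml
# Hypothetical CI step: fail fast if the sample catalog cannot be resolved.
- name: Resolve catalog for sample environment
  working-directory: pipelines/matrix
  run: uv run kedro catalog resolve -e sample
```

Because `kedro catalog resolve` materializes every dataset factory and entry, it catches naming mismatches like a missing `@spark` suffix without running any pipeline.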

* Update catalog source names for Spark integration in sample pipeline

Renamed 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' in sample pipeline to include '@spark' suffix.

* Remove '@spark' suffix from prefiltered_nodes input

Updated the input name for the filter_unified_kg_edges node to use 'filtering.prm.prefiltered_nodes' instead of 'filtering.prm.prefiltered_nodes@spark'. This fixes a typo.

* Update Hugging Face repo IDs in catalog config

Changed the repo_id values for kg_edges_hf_published and kg_nodes_hf_published from 'matentzn' test repositories to 'everycure' production repositories in the data publication catalog configuration.

* Fix dataframe type handling in HFIterableDataset

Replaces multiple 'if' statements with 'elif' and 'else' to ensure only one dataframe type branch is executed and to improve error handling for unsupported types.
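A minimal sketch of that dispatch pattern (the function name and conversions are illustrative assumptions, not the actual HFIterableDataset code):

```python
def to_records(df):
    """Convert a supported dataframe to a list of dicts (hypothetical sketch).

    Using if/elif/else guarantees exactly one branch runs, and the final
    else raises clearly for unsupported types instead of falling through.
    """
    root = type(df).__module__.split(".")[0]
    if root == "pandas":
        return df.to_dict("records")
    elif root == "pyspark":
        return [row.asDict() for row in df.collect()]
    elif root == "polars":
        return df.to_dicts()
    else:
        raise TypeError(f"Unsupported dataframe type: {type(df).__name__}")
```

With chained `if` statements, a dataframe type matching more than one check could be converted twice; the `elif` chain makes the branches mutually exclusive.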

* Update uv.lock

* Simplify dataframe conversion in HFIterableDataset

Refactored the Hugging Face dataset loading logic to remove try/except blocks and fallback conversions for Spark and Polars dataframe types. Now assumes required libraries are installed and uses direct conversion methods, improving code clarity and reducing complexity.

* Remove transcoding from unified nodes/edges

Removes kedro transcoding logic by removing '@spark' and '@pandas' suffixes from 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' across configuration files and pipeline code. Updates documentation and all pipeline references to use the new unified dataset names, simplifying catalog management and usage.

* Switch dataframe_type from pandas to spark in catalog.yml

Updated the dataframe_type for both kg_edges_hf_published and kg_nodes_hf_published datasets from 'pandas' to 'spark' to enable Spark-based processing.

* Remove extra blank line in catalog.yml

Deleted an unnecessary blank line between integration.prm.unified_edges and integration.prm.unified_edges_simplified for improved readability.

* Add token support to HuggingFace Hub verification methods

Updated internal methods to accept and use an optional token parameter for authenticated API requests to the HuggingFace Hub. This improves support for private datasets and ensures all verification steps can operate with proper authorization.
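A rough sketch of what a token-aware verification request could look like, using only the standard library; the endpoint follows the public Hugging Face Hub API, but the helper name and shape are assumptions, not the actual implementation:

```python
from urllib.request import Request

HF_API = "https://huggingface.co/api/datasets"


def build_verify_request(repo_id, token=None):
    """Build a GET request to check that a dataset repo exists on the Hub.

    Passing a token adds a Bearer Authorization header, which is required
    when the repository is private. (Hypothetical sketch.)
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return Request(f"{HF_API}/{repo_id}", headers=headers)
```

Threading the optional token through every verification step like this keeps anonymous access working for public repos while letting private uploads be verified with the same code path.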

* Add data publication pipeline readme
