
Rename integration.prm.unified_nodes (and edges) datasets to integration.prm.unified_nodes@spark #1982

Merged
matentzn merged 4 commits into hf-kg-upload from use-spark-name-hf-upload
Dec 5, 2025

Conversation


@matentzn matentzn commented Dec 5, 2025

Description of the changes

This PR renames the integration.prm.unified_nodes (and edges) datasets to integration.prm.unified_nodes@spark, in line with Kedro naming conventions. This is the last request from the review of #1967.
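For context, Kedro's `@suffix` transcoding convention lets one logical dataset be read and written through different backends. A minimal, hypothetical catalog sketch of the pattern (paths and dataset types are illustrative, not the project's actual configuration):

```yaml
# Hypothetical sketch of Kedro transcoding: both entries point at the same
# files, but nodes can consume the data as Spark or pandas by name.
integration.prm.unified_nodes@spark:
  type: spark.SparkDataset
  filepath: data/prm/unified_nodes
  file_format: parquet

integration.prm.unified_nodes@pandas:
  type: pandas.ParquetDataset
  filepath: data/prm/unified_nodes
```

Kedro treats everything before the `@` as the same logical dataset, so the suffix only selects which loader/saver a given node uses.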

Fixes / Resolves the following issues:

Aligns with Kedro coding conventions; see @pascalwhoop's request in #1967 (comment)

@JacquesVergine this PR seems to have a lot of consequences everywhere, and I don't know how to test it. When you review this, can you make sure I:

  1. did not accidentally rename a wrong dataset (remember, pipelines/matrix/conf/sample/integration/catalog.yml has a dataset with the same name, which I did NOT rename, as per previous conversations). A good candidate where I might have made a mistake: pipelines/matrix/src/matrix/pipelines/create_sample/pipeline.py; but note that the pipeline is called create_sample, not sample (as in the catalog above), so I have no idea whether they are the same thing.
  2. was right in updating the blog post

Checklist:

  • Added label to PR (e.g. enhancement or bug)
  • Ensured the PR is named descriptively. FYI: This name is used as part of our changelog & release notes.
  • Looked at the diff on github to make sure no unwanted files have been committed.
  • Made corresponding changes to the documentation
  • Added tests that prove my fix is effective or that my feature works
  • Any dependent changes have been merged and published in downstream modules
  • If breaking changes occur or you need everyone to run a command locally after
    pulling in latest main, uncomment the below "Merge Notification" section and
    describe steps necessary for people
  • Ran on sample data using kedro run -e sample -p test_sample (see sample environment guide)

…@spark

This is to be more aligned with the convention.

Changed references from 'integration.prm.unified_nodes' to 'integration.prm.unified_nodes@spark' in documentation, in line with the renaming of the dataset.
@matentzn matentzn requested a review from a team as a code owner December 5, 2025 10:47
@matentzn matentzn requested review from matwasilewski and removed request for a team December 5, 2025 10:47
@matentzn matentzn changed the title from "Use spark name hf upload" to "Rename integration.prm.unified_nodes (and edges) datasets to integration.prm.unified_nodes@spark" Dec 5, 2025

@JacquesVergine JacquesVergine left a comment


Looks good to me!

Introduces a new CI step to run 'kedro catalog resolve' with the 'sample' environment, ensuring catalog resolution is tested for this configuration.
Renamed 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' in sample pipeline to include '@spark' suffix.
@matentzn matentzn merged commit 0a21fe5 into hf-kg-upload Dec 5, 2025
1 check passed
@matentzn matentzn deleted the use-spark-name-hf-upload branch December 5, 2025 12:08
matentzn added a commit that referenced this pull request Jan 8, 2026
* Register data_publication pipeline

Added import and registration for the data_publication pipeline in pipeline_registry.py to enable its use within the project.

* Create a new pipeline data_publication, for publishing HF datasets

This pipeline is a minimal variant of @pascalwhoop's draft supplied in #1932.

The pipeline takes the released integrated graph (edges and nodes separately) and publishes them on the Hugging Face Hub.

* Remove 'credentials' field from HuggingFace dataset configs

Eliminated the 'credentials: hf' field from multiple HuggingFace dataset entries in catalog.yml to simplify configuration and rely on default authentication mechanisms.

* Update pandas dataset references for integration PRM

Removed pandas dataset definitions from data_publication catalog and added them to integration catalog with '@pandas' suffix to leverage Kedro transcoding. Updated pipeline to use new '@pandas' dataset names for publishing nodes and edges to Hugging Face.

* Replace publish_dataset_to_hf function with a lambda

Eliminated the publish_dataset_to_hf function from nodes.py and replaced its usage in the pipeline with a passthrough lambda function. This simplifies the pipeline since dataset publishing is handled via catalog configuration.
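The idea behind that simplification can be sketched in a few lines. This is a hypothetical illustration, not the project's actual pipeline code; the node wiring in the comment uses illustrative names:

```python
# Hypothetical sketch: when publishing is handled entirely by the catalog's
# dataset class (e.g. an HFIterableDataset configured on the output), the
# node itself only needs to pass the dataframe through unchanged.
passthrough = lambda df: df

# In a Kedro pipeline this would be wired up roughly as (names illustrative):
#   node(func=passthrough,
#        inputs="integration.prm.unified_nodes",
#        outputs="data_publication.prm.kg_nodes_hf_published")
```

Saving the node's output then triggers the upload as a side effect of the catalog entry, so no publishing logic lives in nodes.py.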

* Remove dataset publication verification nodes

Eliminates verification nodes and related config for published datasets in the data publication pipeline. Adds upload verification logic to HFIterableDataset. This streamlines the pipeline by removing redundant read-back checks and consolidates verification within the dataset class.

* Run ruff

* Update Hugging Face dataset keys for data publication

Renamed dataset keys in catalog.yml and pipeline.py from 'kg_edges_hf_published' and 'kg_nodes_hf_published' to 'data_publication.kg_edges_hf_published' and 'data_publication.kg_nodes_hf_published' to fulfill project requirements.

* Run ruff formatting

* Delete parameters.yml

Empty parameter files are not allowed in the Matrix monorepo.

* Update data publication catalog and pipeline keys

Renamed catalog and pipeline output keys to include 'prm.' prefix for consistency.

* Rename `integration.prm.unified_nodes` (and edges) datasets to `integration.prm.unified_nodes@spark` (#1982)

* Rename integration.prm.unified_nodes to integration.prm.unified_nodes@spark

This is to be more aligned with the convention.

* Update dataset references to use Spark variant

Changed references from 'integration.prm.unified_nodes' to 'integration.prm.unified_nodes@spark' in documentation in line with the renaming of the dataset.

* Add kedro catalog resolve for sample environment

Introduces a new CI step to run 'kedro catalog resolve' with the 'sample' environment, ensuring catalog resolution is tested for this configuration.
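As a rough illustration, such a CI step might look like the following in a GitHub Actions workflow; the step name, working directory, and use of uv are assumptions, not the repository's actual workflow:

```yaml
# Hypothetical CI step: fail fast if the sample catalog cannot be resolved.
- name: Resolve catalog for sample environment
  working-directory: pipelines/matrix
  run: uv run kedro catalog resolve -e sample
```

Because `kedro catalog resolve` materializes every dataset factory and entry, it catches naming mismatches like a missing `@spark` suffix without running any pipeline.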

* Update catalog source names for Spark integration in sample pipeline

Renamed 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' in sample pipeline to include '@spark' suffix.

* Remove '@spark' suffix from prefiltered_nodes input

Updated the input name for the filter_unified_kg_edges node to use 'filtering.prm.prefiltered_nodes' instead of 'filtering.prm.prefiltered_nodes@spark'. This fixes a typo.

* Update Hugging Face repo IDs in catalog config

Changed the repo_id values for kg_edges_hf_published and kg_nodes_hf_published from 'matentzn' test repositories to 'everycure' production repositories in the data publication catalog configuration.

* Fix dataframe type handling in HFIterableDataset

Replaces multiple 'if' statements with 'elif' and 'else' to ensure only one dataframe type branch is executed and to improve error handling for unsupported types.
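A minimal sketch of that dispatch pattern (the function name and conversions are illustrative assumptions, not the actual HFIterableDataset code):

```python
def to_records(df):
    """Convert a supported dataframe to a list of dicts (hypothetical sketch).

    Using if/elif/else guarantees exactly one branch runs, and the final
    else raises clearly for unsupported types instead of falling through.
    """
    root = type(df).__module__.split(".")[0]
    if root == "pandas":
        return df.to_dict("records")
    elif root == "pyspark":
        return [row.asDict() for row in df.collect()]
    elif root == "polars":
        return df.to_dicts()
    else:
        raise TypeError(f"Unsupported dataframe type: {type(df).__name__}")
```

With chained `if` statements, a dataframe type matching more than one check could be converted twice; the `elif` chain makes the branches mutually exclusive.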

* Update uv.lock

* Simplify dataframe conversion in HFIterableDataset

Refactored the Hugging Face dataset loading logic to remove try/except blocks and fallback conversions for Spark and Polars dataframe types. Now assumes required libraries are installed and uses direct conversion methods, improving code clarity and reducing complexity.

* Remove transcoding from unified nodes/edges

Removes kedro transcoding logic by removing '@spark' and '@pandas' suffixes from 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' across configuration files and pipeline code. Updates documentation and all pipeline references to use the new unified dataset names, simplifying catalog management and usage.

* Switch dataframe_type from pandas to spark in catalog.yml

Updated the dataframe_type for both kg_edges_hf_published and kg_nodes_hf_published datasets from 'pandas' to 'spark' to enable Spark-based processing.

* Remove extra blank line in catalog.yml

Deleted an unnecessary blank line between integration.prm.unified_edges and integration.prm.unified_edges_simplified for improved readability.

* Add token support to HuggingFace Hub verification methods

Updated internal methods to accept and use an optional token parameter for authenticated API requests to the HuggingFace Hub. This improves support for private datasets and ensures all verification steps can operate with proper authorization.
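A rough sketch of what a token-aware verification request could look like, using only the standard library; the endpoint follows the public Hugging Face Hub API, but the helper name and shape are assumptions, not the actual implementation:

```python
from urllib.request import Request

HF_API = "https://huggingface.co/api/datasets"


def build_verify_request(repo_id, token=None):
    """Build a GET request to check that a dataset repo exists on the Hub.

    Passing a token adds a Bearer Authorization header, which is required
    when the repository is private. (Hypothetical sketch.)
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return Request(f"{HF_API}/{repo_id}", headers=headers)
```

Threading the optional token through every verification step like this keeps anonymous access working for public repos while letting private uploads be verified with the same code path.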

* Add data publication pipeline readme
