Rename integration.prm.unified_nodes (and edges) datasets to integration.prm.unified_nodes@spark #1982
Merged

matentzn merged 4 commits into hf-kg-upload on Dec 5, 2025
Conversation
* Rename integration.prm.unified_nodes (and edges) to integration.prm.unified_nodes@spark, to be more aligned with the convention.
* Update dataset references: changed references from 'integration.prm.unified_nodes' to 'integration.prm.unified_nodes@spark' in the documentation, in line with the renaming of the dataset.
JacquesVergine (Collaborator) approved these changes on Dec 5, 2025, leaving a comment:
Looks good to me!
* Add kedro catalog resolve for the sample environment: introduces a new CI step to run 'kedro catalog resolve' with the 'sample' environment, ensuring catalog resolution is tested for this configuration.
* Update catalog source names in the sample pipeline: renamed 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' to include the '@spark' suffix.
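The CI step described above might look roughly like the following GitHub Actions fragment. The step name and surrounding workflow structure are assumptions for illustration, not the repo's actual workflow file:

```yaml
# Hypothetical CI fragment; the real workflow file and wrapper commands may differ.
- name: Resolve catalog (sample environment)
  run: kedro catalog resolve --env sample
```

Running `catalog resolve` against each environment is a cheap way to catch broken or mistyped dataset entries (such as a missed `@spark` rename) before any pipeline actually runs.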
matentzn added a commit that referenced this pull request on Jan 8, 2026:
* Register data_publication pipeline: added import and registration for the data_publication pipeline in pipeline_registry.py to enable its use within the project.
* Create a new pipeline, data_publication, for publishing HF datasets: a minimal variant of @pascalwhoop's draft supplied in #1932. The pipeline takes the released integrated graph (edges and nodes separately) and publishes them on the Hugging Face Hub.
* Remove 'credentials' field from HuggingFace dataset configs: eliminated the 'credentials: hf' field from multiple HuggingFace dataset entries in catalog.yml to simplify configuration and rely on default authentication mechanisms.
* Update pandas dataset references for integration PRM: removed pandas dataset definitions from the data_publication catalog and added them to the integration catalog with an '@pandas' suffix to leverage Kedro transcoding. Updated the pipeline to use the new '@pandas' dataset names for publishing nodes and edges to Hugging Face.
* Replace publish_dataset_to_hf function with a lambda: eliminated the publish_dataset_to_hf function from nodes.py and replaced its usage in the pipeline with a passthrough lambda. This simplifies the pipeline since dataset publishing is handled via catalog configuration.
* Remove dataset publication verification nodes: eliminates verification nodes and related config for published datasets in the data publication pipeline, and adds upload verification logic to HFIterableDataset. This streamlines the pipeline by removing redundant read-back checks and consolidates verification within the dataset class.
* Run ruff.
* Update Hugging Face dataset keys for data publication: renamed dataset keys in catalog.yml and pipeline.py from 'kg_edges_hf_published' and 'kg_nodes_hf_published' to 'data_publication.kg_edges_hf_published' and 'data_publication.kg_nodes_hf_published' to fulfill project requirements.
* Run ruff formatting.
* Delete parameters.yml: empty parameter files are not allowed in the Matrix monorepo.
* Update data publication catalog and pipeline keys: renamed catalog and pipeline output keys to include a 'prm.' prefix for consistency.
* Rename `integration.prm.unified_nodes` (and edges) datasets to `integration.prm.unified_nodes@spark` (#1982):
  * Rename integration.prm.unified_nodes to integration.prm.unified_nodes@spark, to be more aligned with the convention.
  * Update dataset references to use the Spark variant: changed references from 'integration.prm.unified_nodes' to 'integration.prm.unified_nodes@spark' in documentation, in line with the renaming of the dataset.
  * Add kedro catalog resolve for the sample environment: introduces a new CI step to run 'kedro catalog resolve' with the 'sample' environment, ensuring catalog resolution is tested for this configuration.
  * Update catalog source names for Spark integration in the sample pipeline: renamed 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' to include the '@spark' suffix.
  * Remove the '@spark' suffix from the prefiltered_nodes input: updated the input name for the filter_unified_kg_edges node to use 'filtering.prm.prefiltered_nodes' instead of 'filtering.prm.prefiltered_nodes@spark' (typo).
* Update Hugging Face repo IDs in catalog config: changed the repo_id values for kg_edges_hf_published and kg_nodes_hf_published from 'matentzn' test repositories to 'everycure' production repositories in the data publication catalog configuration.
* Fix dataframe type handling in HFIterableDataset: replaces multiple 'if' statements with 'elif' and 'else' so that only one dataframe-type branch is executed, and improves error handling for unsupported types.
* Update uv.lock.
* Simplify dataframe conversion in HFIterableDataset: refactored the Hugging Face dataset loading logic to remove try/except blocks and fallback conversions for Spark and Polars dataframe types. Now assumes the required libraries are installed and uses direct conversion methods, improving code clarity and reducing complexity.
* Remove transcoding from unified nodes/edges: removes Kedro transcoding by dropping the '@spark' and '@pandas' suffixes from 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' across configuration files and pipeline code. Updates documentation and all pipeline references to use the new unified dataset names, simplifying catalog management and usage.
* Switch dataframe_type from pandas to spark in catalog.yml: updated the dataframe_type for both kg_edges_hf_published and kg_nodes_hf_published datasets from 'pandas' to 'spark' to enable Spark-based processing.
* Remove extra blank line in catalog.yml: deleted an unnecessary blank line between integration.prm.unified_edges and integration.prm.unified_edges_simplified for improved readability.
* Add token support to HuggingFace Hub verification methods: updated internal methods to accept and use an optional token parameter for authenticated API requests to the HuggingFace Hub. This improves support for private datasets and ensures all verification steps can operate with proper authorization.
* Add data publication pipeline readme.
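The "Fix dataframe type handling in HFIterableDataset" commit describes an if/elif/else dispatch in which exactly one dataframe-type branch runs and unsupported types raise an error. A minimal sketch of that pattern follows; the function name and branch set are illustrative assumptions, not the actual HFIterableDataset code:

```python
# Illustrative sketch of single-branch dataframe-type dispatch, as described
# in the commit message above; not the actual HFIterableDataset implementation.
def to_pandas(df, dataframe_type: str):
    if dataframe_type == "pandas":
        return df                      # already pandas: pass through unchanged
    elif dataframe_type == "spark":
        return df.toPandas()           # direct conversion, assumes pyspark is installed
    elif dataframe_type == "polars":
        return df.to_pandas()          # direct conversion, assumes polars is installed
    else:
        # elif/else guarantees only one branch executes; unknown types fail loudly
        raise ValueError(f"Unsupported dataframe_type: {dataframe_type!r}")
```

Compared with a chain of independent `if` statements, the `elif`/`else` form makes it impossible for two conversions to fire on the same input and gives every unhandled type an explicit error.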
Description of the changes
This PR renames the integration.prm.unified_nodes (and edges) datasets to integration.prm.unified_nodes@spark, to be in line with Kedro coding conventions on naming. This is the last request from the review of #1967.

Fixes / Resolves the following issues:
* Being aligned with Kedro coding conventions, see @pascalwhoop's request in #1967 (comment)
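Under Kedro's transcoding convention, the two renamed entries would look roughly like the catalog sketch below. The dataset types, filepaths, and file format here are assumptions for illustration, not the repo's actual conf:

```yaml
# Illustrative only: actual entries under pipelines/matrix/conf may differ.
integration.prm.unified_nodes@spark:
  type: spark.SparkDataset
  filepath: data/integration/prm/unified_nodes
  file_format: parquet

integration.prm.unified_nodes@pandas:
  type: pandas.ParquetDataset
  filepath: data/integration/prm/unified_nodes
```

Both entries point at the same files; Kedro treats them as one logical dataset with two load/save implementations, so Spark-based and pandas-based nodes can share the data without explicit conversion nodes.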
@JacquesVergine this PR seems to have a lot of consequences everywhere. I don't know how to test this. When you review this, can you make sure I:
pipelines/matrix/conf/sample/integration/catalog.yml has a dataset with the same name - I did not rename THAT, as per previous conversations. A good candidate where I might have made a mistake: pipelines/matrix/src/matrix/pipelines/create_sample/pipeline.py - but note that the pipeline is called create_sample, not sample (as in the catalog above), so no idea if they are the same thing.

Checklist:
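When cross-checking names between the sample catalog and the create_sample pipeline, it may help to recall that in Kedro's transcoding convention the part after `@` only selects a load/save implementation, while the base name identifies the dataset. A rough sketch of that splitting rule (Kedro's own implementation differs in its details):

```python
# Rough sketch of Kedro's transcoding name convention: 'x@spark' and
# 'x@pandas' share the base dataset name 'x'. Not Kedro's actual code.
TRANSCODING_SEPARATOR = "@"

def split_transcoded_name(name: str) -> tuple[str, str]:
    """Return (base_name, transcoding_suffix); the suffix is '' if absent."""
    base, _, suffix = name.partition(TRANSCODING_SEPARATOR)
    return base, suffix
```

So a pipeline input written as `integration.prm.unified_nodes@spark` and a catalog entry for `integration.prm.unified_nodes@pandas` still refer to the same underlying dataset, which is why only the suffixed form needs to match between pipeline code and catalog.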
* … (enhancement or bug)
* After pulling in latest main, uncomment the below "Merge Notification" section and describe steps necessary for people
* kedro run -e sample -p test_sample (see the sample environment guide)