Closed
Labels
enhancement (New feature or request)
Description
Feature Request
Implement a column-mapping function in the VIPER data-cleaning stage to ensure all incoming data sources are standardized to the pipeline’s required column names.
Problem
Different health units may send spreadsheets with slightly different column names. This causes downstream failures because the pipeline expects canonical column names.
A reliable method is needed to map variable column names to standardized pipeline names without manually hardcoding every variation.
Proposed Solution
- Create a column-mapping utility that:
  - Reads input column names.
  - Normalizes them (lowercasing, trimming whitespace, unifying spaces and underscores, etc.).
  - Uses string similarity (fuzzy matching) to match each input column name to the required pipeline columns.
  - Renames dataframe columns based on the best match.
  - Logs any unmapped or unexpected columns.
This can be implemented using rapidfuzz for similarity scoring, with optional synonym definitions stored in a YAML file.
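A minimal sketch of how such a utility might look, not a definitive implementation. The canonical column names, the synonym table, the threshold of 80, and the function names `normalize` and `build_column_mapping` are all hypothetical and chosen for illustration. The synonyms are hardcoded here, where the real utility would presumably load them from the YAML file. Scoring uses `rapidfuzz.fuzz.ratio` when rapidfuzz is installed and falls back to the standard library's `difflib` otherwise, so the sketch runs without extra dependencies.

```python
import logging
import re
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        """Similarity score in the range 0-100 (rapidfuzz)."""
        return fuzz.ratio(a, b)
except ImportError:
    def similarity(a: str, b: str) -> float:
        """Similarity score in the range 0-100 (stdlib fallback)."""
        return 100.0 * SequenceMatcher(None, a, b).ratio()

logger = logging.getLogger("column_mapping")

# Hypothetical canonical names; the real pipeline defines its own list.
CANONICAL_COLUMNS = ("patient_id", "date_of_birth", "postal_code")

# Hypothetical synonyms; assumed to come from a YAML file in practice.
SYNONYMS = {"dob": "date_of_birth"}


def normalize(name: str) -> str:
    """Lowercase, trim, and collapse any run of separators into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def build_column_mapping(columns, canonical=CANONICAL_COLUMNS,
                         synonyms=SYNONYMS, threshold=80.0):
    """Return (mapping, unmapped) for the given input column names.

    mapping maps each original name to a canonical name; unmapped lists
    the original names that matched nothing at or above the threshold.
    """
    mapping = {}
    unmapped = []
    for original in columns:
        key = normalize(original)
        # Explicit synonyms take priority over fuzzy matching.
        target = synonyms.get(key)
        if target is None:
            best = max(canonical, key=lambda c: similarity(key, c))
            if similarity(key, best) >= threshold:
                target = best
        if target is None:
            unmapped.append(original)
            logger.warning("Unmapped column: %r", original)
        else:
            mapping[original] = target
    return mapping, unmapped


if __name__ == "__main__":
    cols = ["Pateint ID", "DOB", "Postal-Code", "Favourite Colour"]
    mapping, unmapped = build_column_mapping(cols)
    print(mapping)
    print(unmapped)
```

With a pandas dataframe, the returned dictionary could be applied with `df.rename(columns=mapping)`. The sketch does not handle two input columns resolving to the same canonical name; a real implementation would need to detect and log that case.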