Add semantic column mapping function to standardize VIPER input data #107

@kassyray

Feature Request

Implement a column-mapping function in the VIPER data-cleaning stage to ensure all incoming data sources are standardized to the pipeline’s required column names.

Problem

Different health units may send spreadsheets with slightly different column names. This causes downstream failures because the pipeline expects canonical column names.

A reliable method is needed to map variable column names to standardized pipeline names without manually hardcoding every variation.

Proposed Solution

Create a column-mapping utility that:

  • Reads input column names.
  • Normalizes them (lowercases them, trims whitespace, and unifies separators such as spaces and underscores).
  • Uses semantic similarity to match input column names to the required pipeline columns.
  • Renames dataframe columns based on the best match.
  • Logs any unmapped or unexpected columns.
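
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the canonical names in `CANONICAL_COLUMNS`, the function names, and the 0.8 threshold are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for rapidfuzz so the sketch has no dependencies. Both score string similarity rather than meaning, so truly semantic matches would need something extra, such as a synonym list.

```python
import logging
import re
from difflib import SequenceMatcher

logger = logging.getLogger("viper.column_mapping")  # hypothetical logger name

# Hypothetical canonical names; the pipeline's real required columns are not
# listed in this issue.
CANONICAL_COLUMNS = ["patient_id", "date_of_birth", "health_unit", "test_result"]


def normalize(name):
    """Lowercase, trim, and collapse spaces, hyphens, and underscores into one underscore."""
    collapsed = re.sub(r"[\s_\-]+", "_", name.strip().lower())
    return collapsed.strip("_")


def build_column_mapping(input_columns, canonical=CANONICAL_COLUMNS, threshold=0.8):
    """Return {input_name: canonical_name} for each input column with a confident match."""
    mapping = {}
    claimed = set()  # canonical names already assigned to an input column
    for original in input_columns:
        cleaned = normalize(original)
        # Score against every canonical name and keep the best one.
        best_name, best_score = None, 0.0
        for target in canonical:
            score = SequenceMatcher(None, cleaned, target).ratio()
            if score > best_score:
                best_name, best_score = target, score
        if best_name is not None and best_score >= threshold and best_name not in claimed:
            mapping[original] = best_name
            claimed.add(best_name)
        else:
            logger.warning(
                "Unmapped column %r (best guess %r, score %.2f)",
                original, best_name, best_score,
            )
    missing = [name for name in canonical if name not in claimed]
    if missing:
        logger.warning("Required columns with no match: %s", missing)
    return mapping
```

The returned dictionary can be passed to pandas as `df.rename(columns=mapping)`. Returning a mapping instead of renaming in place keeps the matching logic testable without a dataframe, and the threshold would need tuning against real health-unit spreadsheets.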

This can be implemented using rapidfuzz for similarity scoring, with optional synonym definitions stored in a YAML file.
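
One possible shape for the synonym lookup is sketched below. String similarity tends to score abbreviations such as "DOB" poorly against a name like "date_of_birth", so an explicit alias table can be checked first, with fuzzy matching as the fallback. The `SYNONYMS` table, its entries, and the function names are hypothetical. A plain dictionary stands in here for the parsed YAML content, which could be loaded with, for example, PyYAML's `yaml.safe_load`.

```python
import re

# Hypothetical synonym table keyed by canonical column name. In practice this
# would be the parsed content of the YAML file mentioned above.
SYNONYMS = {
    "date_of_birth": ["dob", "birth_date", "birthdate"],
    "health_unit": ["phu", "unit_name"],
}


def normalize(name):
    """Lowercase, trim, and collapse spaces, hyphens, and underscores into one underscore."""
    return re.sub(r"[\s_\-]+", "_", name.strip().lower()).strip("_")


def build_synonym_index(synonyms):
    """Invert {canonical: [aliases]} into {normalized_alias: canonical}."""
    index = {}
    for canonical, aliases in synonyms.items():
        index[normalize(canonical)] = canonical
        for alias in aliases:
            index[normalize(alias)] = canonical
    return index


def lookup_synonym(column, index):
    """Return the canonical name for a known alias, or None so the caller can fall back to fuzzy matching."""
    return index.get(normalize(column))
```

Keeping the aliases in a data file rather than in code would let new variations be added without a code change, which matches the goal of not hardcoding every variation.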

Labels

enhancement
