Add semantic column mapping function to standardize VIPER input data

**Feature Request**

Implement a column-mapping function in the VIPER data-cleaning stage to ensure all incoming data sources are standardized to the pipeline’s required column names.

**Problem**

Different health units may send spreadsheets with slightly different column names. This causes downstream failures because the pipeline expects canonical column names.

A reliable method is needed to map variable column names to standardized pipeline names without manually hardcoding every variation.

**Proposed Solution**

- Create a column-mapping utility that:
  - Reads input column names.
  - Normalizes them (lowercase, replace spaces/underscores, etc.).
  - Uses semantic similarity to match input column names to the required pipeline columns.
  - Renames dataframe columns based on the best match.
  - Logs any unmapped or unexpected columns.
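
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `REQUIRED_COLUMNS`, `normalize`, `build_column_mapping`, the logger name, and the `0.8` threshold are all hypothetical and not taken from the VIPER codebase. To keep the sketch dependency-free, it scores names with the standard library's `difflib.SequenceMatcher` as a stand-in for the similarity scorer; this measures string similarity, not meaning.

```python
import difflib
import logging
import re

logger = logging.getLogger("viper.column_mapping")  # hypothetical logger name

# Hypothetical canonical names; the pipeline's real required columns may differ.
REQUIRED_COLUMNS = ["patient_id", "date_of_birth", "health_unit", "test_result"]


def normalize(name):
    """Lowercase, trim, and collapse runs of spaces, hyphens, and underscores."""
    return re.sub(r"[\s_\-]+", "_", name.strip().lower()).strip("_")


def build_column_mapping(input_columns, required_columns=REQUIRED_COLUMNS, threshold=0.8):
    """Return (mapping, unmapped).

    mapping  -- dict of {input_name: canonical_name} for matches at or above threshold
    unmapped -- input names that matched no canonical column
    """
    # Score every (input, canonical) pair on the normalized names (0.0 to 1.0).
    scored = []
    for col in input_columns:
        for target in required_columns:
            score = difflib.SequenceMatcher(None, normalize(col), normalize(target)).ratio()
            scored.append((score, col, target))

    # Assign greedily from the best score down, so each canonical name is used once.
    scored.sort(key=lambda item: item[0], reverse=True)
    mapping, used_targets = {}, set()
    for score, col, target in scored:
        if score < threshold:
            break
        if col in mapping or target in used_targets:
            continue
        mapping[col] = target
        used_targets.add(target)

    # Log input columns that were not mapped and canonical columns that were not found.
    unmapped = [col for col in input_columns if col not in mapping]
    for col in unmapped:
        logger.warning("Unmapped or unexpected input column: %r", col)
    for target in required_columns:
        if target not in used_targets:
            logger.warning("No input column matched required column: %r", target)

    return mapping, unmapped
```

The returned dict can be passed to pandas as `df.rename(columns=mapping)`. Swapping the scorer for rapidfuzz's `fuzz.ratio` should be a small change, keeping in mind that it returns scores on a 0 to 100 scale, so the threshold would need adjusting.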

This can be implemented using rapidfuzz for similarity scoring, with optional synonym definitions stored in a YAML

Add semantic column mapping function to standardize VIPER input data #107

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add semantic column mapping function to standardize VIPER input data #107

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions