feat: stringify dataset column contents during data loading #168

danielezhu · 2024-01-11T21:34:26Z

Issue #, if available:

Description of changes:
Currently, the JSON parsing code extracts raw data from the input dataset using user-provided JMESPath expressions. This raw data can be of any data type, such as booleans, floats, etc.

Certain columns (e.g. target output) should always be strings, even if the raw data from the dataset is not in string form. For example, the dataset could contain target outputs that are booleans, but due to the nature of the evaluation algorithms, the target output needs to be something like "True" or "False".

This PR adds a function cast_to_string, and updates the _parse_column method to call cast_to_string where appropriate. Additionally, this PR renames class ColumnNames(Enum) in constants.py to class DatasetColumns(Enum). Instead of enumerating just the names of the columns, it now enumerates Column objects, where Column is a new class introduced in this PR.
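A minimal sketch of the casting step, for illustration only. The signatures, the parse_column stand-in, the use of ValueError, and the column names below are assumptions, not the code in this PR; the only detail taken from the PR history is that the error message names the failing column.

```python
from typing import Any, List, Union


def cast_to_string(result: Any, column_name: str) -> Union[str, List[str]]:
    """Stringify a parsed value, or each element if the value is a list."""
    try:
        if isinstance(result, list):
            return [str(item) for item in result]
        return str(result)
    except Exception as exc:
        # Name the failing column so the user knows which field to fix.
        raise ValueError(f"Failed to cast column '{column_name}' to string") from exc


def parse_column(raw: Any, column_name: str, should_cast: bool) -> Any:
    """Hypothetical stand-in for _parse_column: cast only when asked to."""
    return cast_to_string(raw, column_name) if should_cast else raw


# Boolean targets become "True"/"False"; numeric columns are left alone.
print(parse_column([True, False], "target_output", should_cast=True))
print(parse_column([-0.25, -1.5], "input_log_prob", should_cast=False))
```

The should_cast flag is passed explicitly here only to keep the sketch self-contained.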

Column contains a name attribute, which holds the original string that was enumerated by class ColumnNames(Enum), and a should_cast attribute, which indicates whether the contents of this column should be cast to string during data loading.

The reason I added this class is so that we don't create a separate list for tracking which columns should/shouldn't be cast. Doing so would require updating this list manually whenever new column types are introduced, which is a source of human error that should be avoided. PR #171 was raised precisely because we were doing something similar in the past.
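The idea can be sketched roughly as follows. The member names, string values, and the columns_to_cast helper are hypothetical and chosen only for illustration; the actual contents of constants.py may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


@dataclass(frozen=True)
class Column:
    """A dataset column: its name plus whether to stringify its contents."""
    name: str
    should_cast: bool = False


class DatasetColumns(Enum):
    # Hypothetical members; each one carries its own casting decision.
    MODEL_INPUT = Column(name="model_input", should_cast=True)
    TARGET_OUTPUT = Column(name="target_output", should_cast=True)
    INPUT_LOG_PROB = Column(name="input_log_prob", should_cast=False)


def columns_to_cast() -> List[str]:
    """Derive the list from the enum, so there is no second list to maintain."""
    return [col.value.name for col in DatasetColumns if col.value.should_cast]


print(columns_to_cast())
```

Because the flag lives on each member, adding a new column forces the author to decide on casting at the point of definition.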

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

franluca

Thanks @danielezhu! Just one question: will this convert every field of the dataset into a string?
If so, I would strongly recommend against it.
We should only convert fields that the eval algorithms require to be strings (most notably inputs and targets, if present).

danielezhu · 2024-01-12T17:22:54Z

Luca raised a good point about not casting every column. It is clear why we shouldn't do this if we consider the log probability columns. I will update this PR after Luca sends me the list of columns that should be converted to strings, and those that should not.
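To make the log probability concern concrete, here is a small sketch. The value and the to_probability helper are made up for illustration; the point is only that numeric code downstream fails once a float has been turned into a string.

```python
import math


def to_probability(log_prob: float) -> float:
    """Convert a log probability back to a probability."""
    return math.exp(log_prob)


raw = -0.6931471805599453  # hypothetical log probability, roughly ln(0.5)
print(to_probability(raw))  # works: the value is still a float

try:
    # What blanket stringification would hand to downstream code.
    to_probability(str(raw))
except TypeError as err:
    print(f"TypeError: {err}")
```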

lucfra · 2024-01-12T18:21:11Z

Looking at the data config, in principle we should convert the following columns (if present):

    model_input_location: Optional[str] = None
    model_output_location: Optional[str] = None
    target_output_location: Optional[str] = None
    category_location: Optional[str] = None
    sent_more_input_location: Optional[str] = None
    sent_less_input_location: Optional[str] = None

lucfra · 2024-01-12T18:22:20Z

However, I have another question. Can the fields in principle also be lists, e.g. a list of target outputs? Or do we prohibit this?

franluca

2 non-blocking comments. Thanks

src/fmeval/data_loaders/json_parser.py

Change requests are stale

danielezhu mentioned this pull request Jan 11, 2024

feat: conversion of targets to strings #167

Closed

danielezhu requested review from xiaoyi-cheng, keerthanvasist and franluca January 11, 2024 21:36

xiaoyi-cheng previously approved these changes Jan 12, 2024

keerthanvasist previously approved these changes Jan 12, 2024

franluca previously requested changes Jan 12, 2024

danielezhu dismissed stale reviews from keerthanvasist and xiaoyi-cheng via 0a8554b January 17, 2024 06:56

danielezhu requested a review from franluca January 17, 2024 07:00

feat: stringify dataset column contents during data loading

61816ec

danielezhu force-pushed the cast_string branch from 22e4103 to 61816ec Compare January 18, 2024 19:58

Update type hinting

9d05b3d

franluca reviewed Jan 19, 2024

src/fmeval/data_loaders/json_parser.py (two resolved review threads)

xiaoyi-cheng previously approved these changes Jan 19, 2024

Update _cast_to_string error message to include failed column name

35d47ea

danielezhu dismissed xiaoyi-cheng’s stale review via 35d47ea January 19, 2024 19:15

Merge branch 'main' into cast_string

a527049

xiaoyi-cheng approved these changes Jan 19, 2024

Merge branch 'main' into cast_string

2a81934

oyangz approved these changes Jan 19, 2024

danielezhu merged commit e9bee8b into aws:main Jan 19, 2024
2 of 3 checks passed

danielezhu deleted the cast_string branch January 19, 2024 23:03
