Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: stringify dataset column contents during data loading #168

Merged
merged 5 commits into from
Jan 19, 2024

Conversation

danielezhu
Copy link
Contributor

@danielezhu danielezhu commented Jan 11, 2024

Issue #, if available:

Description of changes:
Currently, the JSON parsing code extracts raw data from the input dataset using user-provided JMESPath expressions. This raw data can be of any data type, such as booleans, floats, etc.

Certain columns (e.g. target output) should always be strings, even if the raw data from the dataset is not in string form. For example, the dataset could contain target outputs that are booleans, but due to the nature of the evaluation algorithms, the target output needs to be something like "True" or "False".

This PR adds a function cast_to_string, and updates the _parse_column method to call cast_to_string if appropriate. Additionally, this PR changes class ColumnNames(Enum) in constants.py to class DatasetColumns(Enum), where instead of enumerating just the names of the columns, it enumerates Column objects, where Column is a new class that I introduce.

Column contains a name attribute which represents the original strings that were being enumerated by class ColumnNames(Enum) and a should_cast attribute which represents whether the contents of this column should be casted to string during data loading.

The reason I added this class is so that we don't create a separate list for tracking which columns should/shouldn't be casted. Doing so would require updating this list manually whenever new column types are introduced, which is a source of human error that should be avoided. PR #171 was raised precisely because we were doing something similar in the past.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

xiaoyi-cheng
xiaoyi-cheng previously approved these changes Jan 12, 2024
keerthanvasist
keerthanvasist previously approved these changes Jan 12, 2024
lucfra

This comment was marked as duplicate.

lucfra

This comment was marked as duplicate.

Copy link
Contributor

@franluca franluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @danielezhu ! Just got a question. Will this convert any field of the dataset into a string?
If so, I would strongly recommend against it.
We should only convert fields that are required to be strings by the eval algorithms (most notably input and targets, if present).

@danielezhu
Copy link
Contributor Author

Luca raised a good point about not casting every column. It is clear why we shouldn't do this if we consider the log probability columns. I will update this PR after Luca sends me the list of columns that should be converted to strings, and those that should not.

@lucfra
Copy link

lucfra commented Jan 12, 2024

Looking at the data config in principle we should convert the following columns (if present):

    model_input_location: Optional[str] = None
    model_output_location: Optional[str] = None
    target_output_location: Optional[str] = None
    category_location: Optional[str] = None
    sent_more_input_location: Optional[str] = None
    sent_less_input_location: Optional[str] = None

@lucfra
Copy link

lucfra commented Jan 12, 2024

However I have another question. Can in principle be the fields also lists? like having a list of target output(s) ? Or do we prohibit this?

Copy link
Contributor

@franluca franluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 non-blocking comments. Thanks

src/fmeval/data_loaders/json_parser.py Show resolved Hide resolved
src/fmeval/data_loaders/json_parser.py Show resolved Hide resolved
xiaoyi-cheng
xiaoyi-cheng previously approved these changes Jan 19, 2024
@danielezhu danielezhu dismissed franluca’s stale review January 19, 2024 18:36

Change requests are stale

@danielezhu danielezhu merged commit e9bee8b into aws:main Jan 19, 2024
2 of 3 checks passed
@danielezhu danielezhu deleted the cast_string branch January 19, 2024 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants