-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: stringify dataset column contents during data loading #168
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @danielezhu ! Just got a question. Will this convert any field of the dataset into a string?
If so, I would strongly recommend against it.
We should only convert fields that are required to be strings by the eval algorithms (most notably input and targets, if present).
Luca raised a good point about not casting every column. It is clear why we shouldn't do this if we consider the log probability columns. I will update this PR after Luca sends me the list of columns that should be converted to strings, and those that should not. |
Looking at the data config in principle we should convert the following columns (if present):
|
However I have another question. Can in principle be the fields also lists? like having a list of target output(s) ? Or do we prohibit this? |
0a8554b
22e4103
to
61816ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 non-blocking comments. Thanks
Issue #, if available:
Description of changes:
Currently, the JSON parsing code extracts raw data from the input dataset using user-provided JMESPath expressions. This raw data can be of any data type, such as booleans, floats, etc.
Certain columns (e.g. target output) should always be strings, even if the raw data from the dataset is not in string form. For example, the dataset could contain target outputs that are booleans, but due to the nature of the evaluation algorithms, the target output needs to be something like
"True"
or"False"
.This PR adds a function
cast_to_string
, and updates the_parse_column
method to callcast_to_string
if appropriate. Additionally, this PR changesclass ColumnNames(Enum)
inconstants.py
toclass DatasetColumns(Enum)
, where instead of enumerating just the names of the columns, it enumeratesColumn
objects, whereColumn
is a new class that I introduce.Column
contains aname
attribute which represents the original strings that were being enumerated byclass ColumnNames(Enum)
and ashould_cast
attribute which represents whether the contents of this column should be casted to string during data loading.The reason I added this class is so that we don't create a separate list for tracking which columns should/shouldn't be casted. Doing so would require updating this list manually whenever new column types are introduced, which is a source of human error that should be avoided. PR #171 was raised precisely because we were doing something similar in the past.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.