Drop augmented datapoints with variable labels #18

bdzyubak · 2024-04-01T20:53:51Z

In issue #14, the dataset came heavily resampled: "A series that was amazing" and "A series that was horrible" were each truncated into fragments as short as a single letter, with every fragment inheriting the label of its parent sentence. As a result, fragments such as "A" and "A series" appear with conflicting labels, which would get in the way of training.

Enhancement: Implement code that runs on augmented data, finds datapoints that share the same input but carry different labels, and drops them. If we augment the data ourselves and save it to disk (rather than augmenting on the fly), we need to keep track of which cases are augmented and which are real. The code could then use a boolean flag marking augmented rows to restrict the drop to generated cases only. For the specific sentiment task with pregenerated data, we can detect augmented strings by checking whether they are fully contained in other strings.
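A minimal sketch of the idea in plain Python, not the final implementation. The function names `flag_contained_fragments` and `drop_conflicting_augmented`, the `(text, label)` row layout, and the toy data are all assumptions made for illustration. The sketch also assumes that label conflicts are counted across all rows while only rows flagged as augmented are dropped. The containment check is quadratic in the number of unique strings, so it would likely need an index or a length-based prefilter on a full dataset.

```python
from collections import defaultdict


def flag_contained_fragments(texts):
    """Return one bool per text: True if it is a strict substring of another text.

    Hypothetical heuristic for pregenerated data where no augmentation flag exists.
    """
    unique = set(texts)
    return [any(t != other and t in other for other in unique) for t in texts]


def drop_conflicting_augmented(rows, is_augmented):
    """Drop augmented rows whose text appears with more than one label.

    rows: list of (text, label) tuples.
    is_augmented: parallel list of bools; rows flagged False are always kept.
    """
    # Collect the set of labels seen for each distinct input text.
    labels_by_text = defaultdict(set)
    for text, label in rows:
        labels_by_text[text].add(label)

    return [
        (text, label)
        for (text, label), augmented in zip(rows, is_augmented)
        if not (augmented and len(labels_by_text[text]) > 1)
    ]


# Toy data modelled on the example in this issue (1 = positive, 0 = negative).
rows = [
    ("A series that was amazing", 1),
    ("A series that was horrible", 0),
    ("A series", 1),
    ("A series", 0),
    ("A", 1),
    ("A", 0),
    ("amazing", 1),
]

flags = flag_contained_fragments([text for text, _ in rows])
cleaned = drop_conflicting_augmented(rows, flags)
```

With this toy data, "A series" and "A" are dropped because they are flagged as fragments and carry both labels, while "amazing" is kept because its label is consistent.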

Experiment: See if removing augmented datapoints with variable labels improves validation performance.

bdzyubak added the enhancement (New feature or request) and Experiment (Run an evaluation and compare multiple approaches in the same data) labels Apr 1, 2024

bdzyubak self-assigned this Apr 1, 2024

bdzyubak added this to the May Release milestone Apr 1, 2024

bdzyubak changed the title ~~Drop augmented datapoints with vairable labels~~ Drop augmented datapoints with variable labels Apr 9, 2024
