Drop augmented datapoints with variable labels #18
Labels
enhancement
New feature or request
Experiment
Run an evaluation and compare multiple approaches in the same data
In issue #14, the dataset came heavily resampled: "A series that was amazing" and "A series that was horrible" were each resampled down to fragments as short as a single letter, with every fragment inheriting the label of its source sentence. As a result, fragments such as "A" and "A series" end up with highly variable labels, which would get in the way of training.
Enhancement: Implement code that runs on augmented data, finds datapoints that share the same input but carry different labels, and drops them. If we augment the data ourselves and save it to disk (rather than augmenting on the fly), we need to keep track of which cases are augmented and which are real. The code could then use a boolean parameter indicating whether a datapoint was augmented, so that the mask applies only to generated cases and real datapoints are never dropped. In the specific sentiment task with pregenerated data, no such flag exists, but we can detect augmented strings as those fully contained in other strings.
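A minimal sketch of what this could look like, in plain Python. The function names (`flag_contained_strings`, `drop_conflicting_augmented`), the `(text, label, is_augmented)` row layout, and the sample data are all hypothetical and only meant to illustrate the idea. It also assumes that "variable labels" means more than one distinct label for the same input string, counted across all rows.

```python
from collections import defaultdict


def flag_contained_strings(texts):
    """Return one boolean per text: True if it is contained in a longer text.

    Assumption: in the pregenerated sentiment data, a string that is fully
    contained in another string is treated as augmented.
    """
    unique = sorted(set(texts), key=len)
    contained = {
        short
        for i, short in enumerate(unique)
        # Only strings later in the length-sorted list can contain `short`.
        if any(short in longer for longer in unique[i + 1:])
    }
    return [text in contained for text in texts]


def drop_conflicting_augmented(rows):
    """Drop augmented rows whose input text occurs with more than one label.

    `rows` is a list of (text, label, is_augmented) tuples (hypothetical layout).
    Real rows are always kept, even if their labels conflict.
    """
    labels_by_text = defaultdict(set)
    for text, label, _ in rows:
        labels_by_text[text].add(label)
    return [
        (text, label, is_augmented)
        for text, label, is_augmented in rows
        if not (is_augmented and len(labels_by_text[text]) > 1)
    ]


# Made-up sample mirroring the example from the issue description.
samples = [
    ("A series that was amazing", 1),
    ("A series that was horrible", 0),
    ("A series", 1),
    ("A series", 0),
    ("A", 1),
    ("A", 0),
    ("horrible", 0),
]
flags = flag_contained_strings([text for text, _ in samples])
rows = [(text, label, flag) for (text, label), flag in zip(samples, flags)]
cleaned = drop_conflicting_augmented(rows)
```

Here "A" and "A series" are dropped because they are flagged as augmented and appear with both labels, while "horrible" is kept because its label is consistent. The containment check compares every string against all longer ones, so it is quadratic in the number of unique strings; a large dataset would likely need something smarter.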
Experiment: Check whether removing augmented datapoints with variable labels improves validation performance.