Drop augmented datapoints with variable labels #18

Open
bdzyubak opened this issue Apr 1, 2024 · 0 comments

bdzyubak commented Apr 1, 2024

In issue #14, the dataset came heavily resampled: phrases such as "A series that was amazing" and "A series that was horrible" were each cut down to fragments as short as a single letter, with every fragment inheriting the label of its source phrase. As a result, shared fragments such as "A" and "A series" appear with conflicting labels, which gets in the way of training.

Enhancement: Implement code that runs on augmented data, finds datapoints that have identical inputs but different labels, and drops them. If we augment the data ourselves and save it to disk (as opposed to augmenting on the fly), we need to keep track of which cases are augmented and which are real. The implemented code could then use a boolean parameter indicating whether a datapoint was augmented, so that only generated cases are masked out. In the specific sentiment task with pregenerated data, no such flag exists, but we can detect augmented strings because they are fully contained in other strings.
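
A minimal sketch of what this could look like, using only the standard library. Everything named here is an assumption for illustration and not existing project code: the `Datapoint` record, its `is_augmented` field, the integer labels, and both function names. The substring check is character-level and quadratic in the number of unique strings, so it is only meant to show the idea.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Set


class Datapoint(NamedTuple):
    """Hypothetical record layout; field names are assumptions."""
    text: str
    label: int
    is_augmented: bool  # True for generated cases, False for real ones


def drop_conflicting_augmented(data: Iterable[Datapoint]) -> List[Datapoint]:
    """Drop augmented datapoints whose text appears with more than one label."""
    data = list(data)
    labels_by_text = defaultdict(set)
    for dp in data:
        labels_by_text[dp.text].add(dp.label)
    conflicting = {t for t, labels in labels_by_text.items() if len(labels) > 1}
    # Real datapoints are always kept; only generated cases are masked out.
    return [dp for dp in data if not (dp.is_augmented and dp.text in conflicting)]


def find_contained_strings(texts: Iterable[str]) -> Set[str]:
    """Return texts that are fully contained in some other, different text.

    Heuristic for pregenerated data with no augmentation flag. O(n^2) and
    character-level, so short real strings can be flagged by accident.
    """
    unique = set(texts)
    return {t for t in unique if any(t != other and t in other for other in unique)}


# Example with made-up rows (labels: 1 = positive, 0 = negative).
raw = [
    ("A series that was amazing", 1),
    ("A series that was horrible", 0),
    ("A series", 1),
    ("A series", 0),
    ("amazing", 1),
]
contained = find_contained_strings(text for text, _ in raw)
data = [Datapoint(text, label, text in contained) for text, label in raw]
cleaned = drop_conflicting_augmented(data)
for dp in cleaned:
    print(dp.text, dp.label)
```

In this example both copies of "A series" are dropped because they carry different labels, while "amazing" is kept: it is flagged as augmented but its label is consistent.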

Experiment: See if removing augmented datapoints with variable labels improves validation performance.
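
One possible shape for the comparison, as a hedged sketch: train once on the unfiltered training set and once on the filtered one, and score both runs on the same untouched validation set. The names `compare_filtering`, `train_and_evaluate`, and `filter_fn` are hypothetical placeholders. The real training loop and metric would be plugged in by the caller, and keeping the validation set fixed across both arms is an assumption about how the experiment should be run.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[str, int]  # (text, label); hypothetical row layout


def compare_filtering(
    train: Sequence[Row],
    val: Sequence[Row],
    train_and_evaluate: Callable[[Sequence[Row], Sequence[Row]], float],
    filter_fn: Callable[[Sequence[Row]], List[Row]],
) -> Dict[str, float]:
    """Run both arms of the experiment against the same validation set."""
    baseline_score = train_and_evaluate(train, val)
    filtered_train = filter_fn(train)
    filtered_score = train_and_evaluate(filtered_train, val)
    return {
        "baseline": baseline_score,
        "filtered": filtered_score,
        "rows_dropped": float(len(train) - len(filtered_train)),
    }


if __name__ == "__main__":
    # Stand-ins so the sketch runs end to end; neither is a real model or filter.
    def dummy_train_and_evaluate(train: Sequence[Row], val: Sequence[Row]) -> float:
        seen = set(train)
        return sum(row in seen for row in val) / len(val)

    def dummy_filter(train: Sequence[Row]) -> List[Row]:
        return [row for row in train if len(row[0]) > 1]

    train_rows = [("A", 1), ("A", 0), ("A series that was amazing", 1)]
    val_rows = [("A series that was amazing", 1), ("A film that was dull", 0)]
    print(compare_filtering(train_rows, val_rows, dummy_train_and_evaluate, dummy_filter))
```

Fixing the random seed inside `train_and_evaluate` would help attribute any difference between the two scores to the filtering and not to run-to-run noise.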

@bdzyubak added the "enhancement" (New feature or request) and "Experiment" (Run an evaluation and compare multiple approaches in the same data) labels Apr 1, 2024
@bdzyubak self-assigned this Apr 1, 2024
@bdzyubak added this to the May Release milestone Apr 1, 2024
@bdzyubak changed the title from "Drop augmented datapoints with vairable labels" to "Drop augmented datapoints with variable labels" Apr 9, 2024