Dataset Cleaning Process #208

Open
7 tasks
KennethEnevoldsen opened this issue Jan 4, 2024 · 2 comments


@KennethEnevoldsen
Contributor

Get an overview of filters:

  • Create an overview of cleaning taggers #207
    • Figure out the difference between the PII taggers (which ones are most relevant) - we might just start using the fastest one
    • Test NSFW filters (is what they filter out actually problematic in Danish?) (minor)
    • Test hate-speech filters (is what they filter out actually problematic in Danish?) (minor)

Check that the filters are valid

  • Apply all filters to DAGW (whole dataset) and see if any of the filters are problematic (none should filter out an extreme amount of data, as we consider the dataset fairly clean)

    • E.g. we might worry that the current Gopher filters use "English" thresholds, which might be problematic for us.
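
A minimal sketch of the check described above: run each filter over the whole dataset and report the share of documents it would remove. The two filters here are hypothetical stand-ins, not the actual taggers, and the 3 to 10 character bounds on mean word length are an illustrative assumption for a Gopher-style rule that may not suit Danish.

```python
from typing import Callable, Dict, Iterable


def too_short(text: str, min_words: int = 5) -> bool:
    """Hypothetical filter: flag documents with fewer than `min_words` words."""
    return len(text.split()) < min_words


def mean_word_length_out_of_range(
    text: str, low: float = 3.0, high: float = 10.0
) -> bool:
    """Hypothetical Gopher-style filter; the bounds are assumed, not tuned for Danish."""
    words = text.split()
    if not words:
        return True
    mean_length = sum(len(word) for word in words) / len(words)
    return not (low <= mean_length <= high)


def removal_rates(
    docs: Iterable[str], filters: Dict[str, Callable[[str], bool]]
) -> Dict[str, float]:
    """Return, per filter, the fraction of documents it would remove."""
    docs = list(docs)
    if not docs:
        return {name: 0.0 for name in filters}
    return {
        name: sum(1 for doc in docs if flag(doc)) / len(docs)
        for name, flag in filters.items()
    }


FILTERS = {
    "too_short": too_short,
    "mean_word_length": mean_word_length_out_of_range,
}
```

A filter whose removal rate on DAGW is unexpectedly high would be the first candidate for a closer look at its thresholds.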

Start applying filters to the dataset

  • Format the datasets as jsonl.gz and apply the filters to them.
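
A rough sketch of the jsonl.gz conversion using only the standard library. The record fields shown in the usage note (`id`, `text`, `source`) are an assumption about what the taggers expect, and the helper names are made up for illustration.

```python
import gzip
import json
from typing import Any, Dict, Iterable, Iterator


def write_jsonl_gz(records: Iterable[Dict[str, Any]], path: str) -> None:
    """Write one JSON object per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps Danish characters (æ, ø, å) readable.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield the JSON objects stored in a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

For example, `write_jsonl_gz([{"id": "1", "text": "Hej verden", "source": "dagw"}], "dagw.jsonl.gz")` would produce a one-document file that `read_jsonl_gz` can stream back.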

Decide on reasonable thresholds

  • Once we have run the analysis, we would like to set a reasonable set of starting thresholds, which we can then vary (specified in the config).
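
One possible shape for config-driven thresholds, as a sketch only. The tagger names and the starting values are hypothetical placeholders, not results of the analysis, and the rule "keep a document when every score is at or below its threshold" is an assumption.

```python
from typing import Dict, Mapping

# Hypothetical starting thresholds; real values would come out of the DAGW analysis.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "nsfw": 0.5,
    "hate_speech": 0.5,
    "pii": 0.0,
}


def merge_thresholds(
    defaults: Mapping[str, float], overrides: Mapping[str, float]
) -> Dict[str, float]:
    """Apply config overrides on top of the defaults, rejecting unknown names."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown threshold names: {sorted(unknown)}")
    return {**defaults, **overrides}


def keep_document(
    scores: Mapping[str, float], thresholds: Mapping[str, float]
) -> bool:
    """Keep a document if every tagger score is at or below its threshold.

    A missing score is treated as 0.0, i.e. the tagger found nothing.
    """
    return all(
        scores.get(name, 0.0) <= limit for name, limit in thresholds.items()
    )
```

Varying a threshold is then a matter of passing an override such as `{"nsfw": 0.9}` from the config rather than editing code.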

@peterbjorgensen does this seem like a reasonable approach to you as well?

@KennethEnevoldsen KennethEnevoldsen changed the title Selection of cleaning steps to use Dataset Cleaning Process Jan 4, 2024
@peterbjorgensen
Collaborator

One of the taggers in Dolma uses Microsoft Presidio for PII: https://microsoft.github.io/presidio/
It is a bit unclear to me (also after reading their documentation) whether it works in Danish.
But I think it boils down to whether a PII classifier exists for Danish. The framework can use spaCy, stanza and transformers models.

@KennethEnevoldsen
Contributor Author

^ We can definitely use a spaCy pipeline (large embeddings model); I believe everything else is too slow. But there is no PII classifier for Danish (only NER).
