Dataset Cleaning Process #208

KennethEnevoldsen · 2024-01-04T15:07:22Z

Get an overview of filters:

Create an overview of cleaning taggers #207
- Figure out the difference between the PII taggers (which ones are most relevant) - we might just start using the fastest one
- Test NSFW filters (is what they filter out problematic for Danish?) (minor)
- Test hate-speech filters (is what they filter out problematic for Danish?) (minor)

Check that the filters are valid

Apply all filters to DAGW (whole dataset) and see if any of the filters are problematic (none should filter out an extreme amount of data, as we consider the dataset fairly clean)
- E.g. I am concerned that the current Gopher filters use "English" thresholds, which might be problematic for us.
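A minimal sketch of this sanity check, assuming each filter can be expressed as a function that returns `True` when a document should be removed. The filter names, their thresholds, and the `max_rate` cut-off are hypothetical placeholders, not the actual Dolma or Gopher taggers:

```python
from collections import Counter


def removal_rates(docs, filters):
    """Return the fraction of documents that each filter would remove."""
    removed = Counter()
    total = 0
    for doc in docs:
        total += 1
        for name, should_remove in filters.items():
            if should_remove(doc):
                removed[name] += 1
    return {name: (removed[name] / total if total else 0.0) for name in filters}


def suspicious_filters(rates, max_rate):
    """Names of filters that remove more than max_rate of the data."""
    return sorted(name for name, rate in rates.items() if rate > max_rate)


# Hypothetical stand-ins for real taggers; the thresholds are illustrative only.
def too_short(doc, min_words=3):
    return len(doc["text"].split()) < min_words


def low_alpha_ratio(doc, min_ratio=0.8):
    words = doc["text"].split()
    if not words:
        return True
    with_letters = sum(any(ch.isalpha() for ch in word) for word in words)
    return with_letters / len(words) < min_ratio


docs = [
    {"text": "Dette er en helt almindelig dansk sætning."},
    {"text": "Kort tekst"},
    {"text": "123 456 789 --- !!!"},
    {"text": "Endnu en almindelig sætning på dansk."},
]
filters = {"too_short": too_short, "low_alpha_ratio": low_alpha_ratio}

rates = removal_rates(docs, filters)
flagged = suspicious_filters(rates, max_rate=0.2)  # assumed cut-off
```

On a dataset we consider fairly clean, any filter that ends up in `flagged` would be a candidate for a closer look at its thresholds.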

Start applying filters to the dataset

Formatting datasets to jsonl.gz and applying filters to them.
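As a rough sketch of the formatting step, the standard library is enough to write and read gzip-compressed JSON Lines. The field names (`id`, `text`, `source`) and the file name are assumptions for illustration and should be checked against the format the taggers actually expect:

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(docs, path):
    """Write one JSON object per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for doc in docs:
            # ensure_ascii=False keeps Danish characters (æ, ø, å) readable
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Yield one document per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Illustrative documents; the field names are assumed, not confirmed.
docs = [
    {"id": "doc-0", "text": "Rødgrød med fløde.", "source": "example"},
    {"id": "doc-1", "text": "En anden tekst.", "source": "example"},
]

# Round trip through a temporary directory to check nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "documents.jsonl.gz")
    write_jsonl_gz(docs, path)
    loaded = list(read_jsonl_gz(path))
```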

Decide on reasonable thresholds

Once we have run the analysis, we would like to set a reasonable set of starting thresholds, which we can then vary (specified in the config).

@peterbjorgensen does this seem like a reasonable approach to you as well?
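One way this could look, as a sketch only: keep the starting thresholds in a single place and let a config override individual values. The threshold names, the default values, and the attribute names below are hypothetical placeholders, not recommendations and not the project's actual config schema:

```python
# Hypothetical starting thresholds; real values should come from the analysis.
DEFAULT_THRESHOLDS = {
    "min_words": 50,
    "max_symbol_ratio": 0.1,
}


def load_thresholds(overrides=None):
    """Merge config overrides into the defaults, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_THRESHOLDS)
    if unknown:
        raise KeyError(f"Unknown threshold(s): {sorted(unknown)}")
    return {**DEFAULT_THRESHOLDS, **overrides}


def keep_document(attributes, thresholds):
    """Decide whether a document passes, given precomputed tagger attributes."""
    return (
        attributes["word_count"] >= thresholds["min_words"]
        and attributes["symbol_ratio"] <= thresholds["max_symbol_ratio"]
    )


# The same document under the defaults and under a stricter override.
attributes = {"word_count": 120, "symbol_ratio": 0.05}
default_thresholds = load_thresholds()
strict_thresholds = load_thresholds({"min_words": 200})

kept_by_default = keep_document(attributes, default_thresholds)
kept_when_strict = keep_document(attributes, strict_thresholds)
```

Keeping the decision separate from the tagging means the thresholds can be varied without re-running the taggers.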

peterbjorgensen · 2024-01-23T09:12:08Z

One of the taggers in Dolma uses Microsoft Presidio for PII: https://microsoft.github.io/presidio/
It is a bit unclear to me (even after reading their documentation) whether it works for Danish.
But I think it boils down to whether a PII classifier exists for Danish. The framework can use spaCy, stanza and transformers models.

KennethEnevoldsen · 2024-01-23T10:24:56Z

^We can definitely use a spaCy pipeline (large embeddings model); I believe everything else is too slow. But there is no PII classifier for Danish (only NER).

KennethEnevoldsen assigned KennethEnevoldsen and TTTTao725 Jan 4, 2024

KennethEnevoldsen changed the title ~~Selection of cleaning steps to use~~ Dataset Cleaning Process Jan 4, 2024

KennethEnevoldsen added no-stale overview labels Jan 4, 2024
