Improvements to corpus #27

Open
4 tasks
Glitchy-Tozier opened this issue Mar 28, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

Contributor

Glitchy-Tozier commented Mar 28, 2022

A random collection of ideas that would improve usage or optimization.

  • Add a README to /corpus, which explains where the sentences for those word-lists came from and what was done to them.
  • Add a README to every folder, explaining which files were used and in what proportions (for example, in which percentages "deu_mixed_wiki_web…" incorporates "wiki" and "web").
  • Include different countries for existing languages: Some collections (for example "web-public") separate the results into different countries.
    • English: I suggest taking files from the US, Great Britain, and Australia and weighting them 1:1:1.
    • German: As there are far fewer Austrians than Germans, we could tilt the weighting in favor of German corpora when adding "Austrian German". While I'd prefer going 1:1, I know many would not agree with this. Currently, the ratio of the two populations is roughly 9:1, so this might be an acceptable starting point.
  • In http://www.adnw.de/index.php?n=Main.Bewertungsverfahren, multiple sources of n-gram frequencies are mentioned. It would be interesting to incorporate some of them, to further solidify the validity of our corpus.
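
To make the weighting idea concrete, here is a minimal sketch of how per-country n-gram counts could be combined with a ratio such as 1:1:1 or 9:1. It assumes each corpus is already available as a plain mapping from n-gram to count; the function name `merge_weighted`, the keys `"deu"` and `"aut"`, and the toy counts are all hypothetical and not taken from this project's code or data. Each corpus is normalized to relative frequencies first, so a larger file does not outweigh a smaller one beyond the chosen ratio.

```python
from collections import Counter


def merge_weighted(corpora, weights):
    """Combine per-corpus n-gram counts into one relative-frequency table.

    corpora: mapping of corpus name -> {ngram: count}
    weights: mapping of corpus name -> relative weight (e.g. 9 and 1)
    """
    total_weight = sum(weights.values())
    merged = Counter()
    for name, counts in corpora.items():
        corpus_total = sum(counts.values())
        if corpus_total == 0:
            continue  # an empty corpus contributes nothing
        share = weights[name] / total_weight
        for ngram, count in counts.items():
            # Normalize within the corpus, then scale by the corpus weight.
            merged[ngram] += share * count / corpus_total
    return dict(merged)


# Made-up toy counts, only to show the arithmetic.
german = {"en": 6, "er": 4}
austrian = {"en": 1, "er": 1}

merged = merge_weighted(
    {"deu": german, "aut": austrian},
    {"deu": 9, "aut": 1},  # the roughly 9:1 population ratio suggested above
)
```

Switching to the 1:1 variant only requires changing the weights to `{"deu": 1, "aut": 1}`; the corpora themselves stay untouched.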
Glitchy-Tozier added the enhancement label Mar 28, 2022

iandoug commented Oct 19, 2022

Can't help with German but I did a corpus exercise last year for English. Get the latest versions of the PDF and .zip here:
https://zenodo.org/record/5501838

There is also one for Spanish; it is not as cleaned up as the English one, but better than taking the data straight from Leipzig:
https://zenodo.org/record/5501931
