Missing countries in german language #140

Closed
pokecheater opened this issue Nov 18, 2022 · 4 comments · Fixed by #165

Comments

@pokecheater

For example, Türkei (Turkey) is corrected to Türme (towers). I think proper country names are missing from the dictionary.

@pokecheater
Author

I had a look at the German dictionary, and country names are indeed missing.
The funny part is that türkisch (Turkish), for example, does exist. I also tested Polen (Poland) and Indien (India), with the same result: the country names are not in the dictionary.


@pokecheater
Author

Another error I noticed: the word "Tieren" (animals) is corrected to "tiefen" (depths). Also, in my opinion "fuer" should become "für" (fuer is not strictly wrong, since it is the fallback spelling used when the letter ü is unavailable).

Input: sag das mal den Milliarden Tieren die fuer Fleisch getötet werden
Output: sag das mal den Milliarden tiefen die fuer Fleisch getötet werden
(English: "tell that to the billions of animals that are killed for meat")
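
A substitution like Tieren -> tiefen is what one would expect from a frequency-based corrector when the intended word is absent from its dictionary. The toy sketch below is not the library's code; it assumes a Norvig-style approach (generate all strings one edit away, keep the known ones, pick the most frequent). The word counts, `WORD_FREQ`, `edits1`, and `correct` are all made up for illustration.

```python
import string

# Toy frequency table with invented counts. "tieren" is deliberately absent,
# mirroring the missing dictionary entry described above.
WORD_FREQ = {"tiefen": 120, "fleisch": 300, "milliarden": 90}

ALPHABET = string.ascii_lowercase + "äöüß"


def edits1(word):
    """Return every string one delete, swap, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def correct(word, freq):
    """Return word if known, else the most frequent known word one edit away."""
    word = word.lower()
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    if not candidates:
        return word  # nothing close enough; leave the word unchanged
    return max(candidates, key=freq.get)


print(correct("Tieren", WORD_FREQ))                      # tiefen
print(correct("Tieren", {**WORD_FREQ, "tieren": 50}))    # tieren
```

Once "tieren" is present in the table, the word is recognised and left alone, which is why adding the missing entries should resolve this kind of miscorrection.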

@barrust
Owner

barrust commented Nov 18, 2022

Thank you for this information. I am not a German speaker, so this is very helpful. The data used to build the word frequencies comes from the OpenSubtitles project, so these issues originate there.

Any help maintaining and updating the languages would be much appreciated. The easiest method is to look in the scripts folder where you can see how the dictionaries are generated. There are a few files that can be used to ensure that certain words are removed, and others to ensure that missing words are added.

  • scripts/data/{lang}_exclude.txt
  • scripts/data/{lang}_include.txt
  • scripts/data/{lang}_full.json.gz
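
As a rough illustration of what include and exclude lists like these do to a word-frequency dictionary, here is a minimal sketch. It is not the code from the scripts folder: the function name `apply_word_lists`, the `default_count` parameter, and the order of operations (exclude first, then include) are assumptions made for the example.

```python
def apply_word_lists(freq, include_words, exclude_words, default_count=1):
    """Return a copy of freq with excluded words dropped and included words added.

    freq maps lowercase words to counts. Words from include_words that are
    already present keep their existing count (assumed behaviour).
    """
    result = dict(freq)
    for word in exclude_words:
        result.pop(word.strip().lower(), None)
    for word in include_words:
        word = word.strip().lower()
        if word:  # skip blank lines from the text file
            result.setdefault(word, default_count)
    return result


# Hypothetical contents, one word per line as in a plain txt file.
freq = {"türkisch": 40, "fuer": 12, "für": 1000}
include = ["Türkei\n", "Polen\n", "Indien\n", "\n"]
exclude = ["fuer\n"]

print(apply_word_lists(freq, include, exclude))
```

With these inputs the country names are added with the default count and "fuer" is dropped, while "für" and "türkisch" keep their original counts.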

A pull request updating the exclude and include txt files would make the next build of the dictionaries reflect these changes. Some code could also be written in scripts/build_dictionary.py (likely in the clean_german() function) to make this more automatic. For example, if ü is often written as ue, finding all word pairs where that is the only difference would make it easier to exclude the variants with the ue spelling. I am not sure whether that holds, but something along those lines for finding common errors could be useful.
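
The idea of finding word pairs that differ only by an umlaut transliteration could be sketched roughly as follows. This is a standalone illustration, not part of build_dictionary.py or clean_german(); the names `TRANSLIT`, `transliterate`, and `find_transliterated_variants` are hypothetical, and the sketch assumes the dictionary keys are already lowercase.

```python
# Common ASCII fallbacks for German letters (lowercase only).
TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def transliterate(word):
    """Replace umlauts and ß with their ASCII fallback spellings."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word)


def find_transliterated_variants(freq):
    """Map each ASCII fallback spelling in freq to its umlaut form.

    Only pairs where BOTH spellings occur in freq are returned, so the
    result is a list of candidates for the exclude file, not a verdict.
    """
    variants = {}
    for word in freq:
        ascii_form = transliterate(word)
        if ascii_form != word and ascii_form in freq:
            variants[ascii_form] = word
    return variants


# Invented counts for illustration.
freq = {"für": 1000, "fuer": 12, "türkisch": 40, "feuer": 90, "tuer": 3}
print(find_transliterated_variants(freq))  # {'fuer': 'für'}
```

Starting from the umlaut form rather than from every "ue" avoids flagging words such as "feuer", where "ue" is part of the normal spelling. The output should still be reviewed by a German speaker before anything is excluded, because some pairs are both valid words (for example "maße" and "masse").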

Thanks!
