
Parse data from cldr #19

Merged · 10 commits · Jul 6, 2020
Conversation

arnavkapoor (Collaborator):

Contains the script for parsing CLDR data and adds the merged '.py' files. The output '.py' files follow the same structure as dateparser.
Information about some of the issues in the individual language files that were mentioned in #17:

  • The raw-data file for Lithuanian, number_parser_data/raw_cldr_translation_data/lt.json, had the keyword "ERROR" for some numbers; those numbers are now explicitly skipped in the script.
  • The files with Arabic script (ar, fa, fa-AF) appeared not to be created properly. However, I believe the issue is with how GitHub and some other editors render the file: opening the same file with the cat command, vim, or Sublime renders it properly, while VS Code does not. I am not sure what the root cause is.

noviluni (Contributor) left a comment:

Hi @arnavkapoor!

I know that I added a lot of comments, but don't be afraid, most of them are really simple to fix 😄

You did a really good job studying the dateparser script. Even though that script has a lot of things that should be fixed/improved, looking at other people's code is a good way to learn new things and (hopefully) pick up some useful code snippets.

It was a good idea to use the if __name__ == '__main__': guard, as it prevents the script from running when it is merely imported, and using a leading underscore to indicate private functions was a good idea too.
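
A minimal sketch of both patterns (the function name here is illustrative, not taken from the PR):

def _write_language_files():
    # Leading underscore: internal helper, not part of the public interface.
    print("writing language files...")

if __name__ == '__main__':
    # Runs only when the script is executed directly,
    # not when it is imported by another module or a test.
    _write_language_files()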

Apart from the inline comments, there are three things I would like to mention:

  1. Tests
    I know that testing this script is really hard because it involves files, etc., and as it's not as important as other parts, we could probably skip it for now. However, if you want to add a simple test checking that there isn't any '%' sign in the generated files, go ahead (if not, don't worry); see the sketch after this list. In dateparser we recently added a test checking that the committed .py files match the content generated by the script, to avoid merging wrong PRs. It's not necessary for now; maybe we can open an issue for it.

  2. Docstrings
    I would like to see simple docstrings describing each function's purpose. This makes it easier to understand the current code and to refactor it. You did a great job with this in the parser.py file in the other PR.

  3. I added some ideas for changing the data-point names (MTENS to TENS, etc.). Apart from that, I think the current VALID_TOKENS doesn't need to contain a key called tokens; it could be implemented directly as a list, i.e.:

"VALID_TOKENS": ["and", "-"]

However, both changes (in point 3) require changing the "supplementary data" files, so I would ask you to keep them as-is for now and submit a separate PR with those two changes after merging this one.
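
A minimal sketch of the '%' test from point 1 (pytest style; the translation_data directory is a hypothetical location for the generated files):

import glob

def test_no_percent_sign_in_generated_files():
    # Hypothetical path; adjust to wherever the generated .py files live.
    for path in glob.glob("number_parser_data/translation_data/*.py"):
        with open(path, encoding="utf-8") as fh:
            assert "%" not in fh.read(), f"'%' sign found in {path}"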

Of course, if you don't agree with anything I wrote, don't hesitate to let me know and share your opinion.

Good job again! 🚀 😄

"asuon": 7,
"awɔtwe": 8,
"akron": 9,
"": 0,
noviluni (Contributor):

I think that this zero wouldn't work.

# -*- coding: utf-8 -*-
info = {
"UNIT_NUMBERS": {
"s": 0,
noviluni (Contributor):

I can speak/write Catalan, and this "s" for zero is wrong... 🤔

Do you know why this could happen? What does spellout-numbering-cents mean?

The other numbers for this language look good 👍

# -*- coding: utf-8 -*-
info = {
"UNIT_NUMBERS": {
"صفر": 0,
noviluni (Contributor), Jun 30, 2020:

I can confirm that the files with Arabic and Hebrew characters are correctly generated, but they are not displayed properly on GitHub. I have already reported this issue to the GitHub support team: https://support.github.com/contact

"kymmeniin": 10,
"kymmenillä": 10,
"kymmeniltä": 10,
"kymmenille": 10
noviluni (Contributor):

Wow, how crazy is this!

"septante": 70,
"quatre-vingt>%%cents-m>": 80,
"nonante": 90,
"quatre-vingt>%%cents-f>": 80
noviluni (Contributor), Jun 30, 2020:

There is an error here for quatre-vingt (as it's duplicated and contains incorrect % symbols).

language_data["BASE_NUMBERS"][word] = number


def _find_zeroes(number):
noviluni (Contributor):

What do you think about changing this to _count_zeroes instead?


def _find_zeroes(number):
zero_count = 0
while(number > 0):
noviluni (Contributor):

You will save some iterations if you change number > 0 to number > 9, as we don't need to check single-digit numbers.
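
With both suggestions applied, the helper could look like this (the loop body is reconstructed from the excerpt under the assumption that it counts trailing zeroes to derive a power of ten, so treat it as a sketch):

def _count_zeroes(number):
    """Count the trailing zeroes of number, i.e. its power of ten."""
    zero_count = 0
    while number > 9:  # single-digit numbers have no trailing zeroes to count
        if number % 10 != 0:
            break
        zero_count += 1
        number //= 10
    return zero_count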



def write_complete_data():
for files in os.listdir(SOURCE_PATH):
noviluni (Contributor):

This files variable should be renamed to file (or filename or file_name), as it only represents one file (the JSON file).

@@ -0,0 +1,143 @@
"""
noviluni (Contributor):

This script is not only parsing data from CLDR but also adding the "supplementary data", so I think the name of the file should be changed. You can use the same name as in dateparser ("write_complete_data") or find another (better) one.

language_data_populated[keys].update(data[keys])

encoding_comment = "# -*- coding: utf-8 -*-\n"
translation_data = json.dumps(language_data_populated, indent=4, ensure_ascii=False)
noviluni (Contributor):

I think that it would be a really good idea to "order" each section by value.

So, instead of having

   "UNIT_NUMBERS": {
        "cero": 0,
        "uno": 1,
        "dos": 2,
        "tres": 3,
      ...
        "un": 1,
        "una": 1
    },

We will have:

   "UNIT_NUMBERS": {
        "cero": 0,
        "uno": 1,
        "un": 1,
        "una": 1
        "dos": 2,
        "tres": 3,
      ...
    },

or similar.

This would make the files easier to read. I think the easiest place to do it is here, after having all the language_data, but it's up to you.
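
The gist of that ordering, as a sketch (the PR itself builds an OrderedDict; since Python 3.7 a plain dict also keeps insertion order):

for key in language_data:
    language_data[key] = dict(
        sorted(language_data[key].items(), key=lambda item: item[1])
    )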

arnavkapoor linked an issue on Jul 5, 2020 that may be closed by this pull request.
arnavkapoor (Collaborator, Author):

Hi @noviluni, I have updated the code with the relevant changes. It took a bit longer than expected, as I was still trying to understand the CLDR language files in more detail. I had planned to fix the normalization issue #13 with this PR, and I implemented a solution here that didn't use any external library. However, it caused a lot of changes across loads of files, and I could see that for Hindi it wrongly changed words to a different, invalid form. I am assuming it did the same for some other languages too. The best way to proceed would probably be to apply this solution only to languages using Latin script and then incrementally see what is needed for the other languages.
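
For context, the usual library-free normalization looks something like the sketch below (an assumption about the approach, not the exact code from this PR). Stripping combining marks after NFKD decomposition is exactly what breaks Devanagari, where signs such as the virama and nukta are combining characters:

import unicodedata

def _normalize(text):
    # NFKD decomposition, then drop combining marks.
    # Fine for Latin diacritics ('é' -> 'e'), but destructive for Hindi:
    # removing the virama turns conjuncts into different, invalid words.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))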

noviluni (Contributor) left a comment:

Hi @arnavkapoor! Good job again!

Now the code is really solid! I added some suggestions (mostly aesthetic).

I think we can merge this and open a PR just changing the naming of the variables in the file. Don't worry about the normalization: I think it can be performed when comparing the entered string with the data from the language files, instead of saving the strings already normalized. Let's leave this to be handled in a future PR.
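
A sketch of that idea, reusing a hypothetical _normalize helper: normalize both sides on the fly at lookup time, so the data files keep their original spellings:

def _matches(token, word):
    # Hypothetical comparison-time normalization; nothing is stored normalized.
    return _normalize(token) == _normalize(word)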

if power_of_10 >= 2:
language_data["MULTIPLIERS"][valid_word] = power_of_10_num
return
valid_word = ""
noviluni (Contributor):

You can remove this line (valid_word = ""). As you are using an if/else structure below, there will always be a value for the valid_word variable, so you don't need to initialize it here.

else:
root_word = word.split(">")[0].strip()

if(root_word == ""):
noviluni (Contributor):

Suggested change
if(root_word == ""):
if root_word == "":

language_data[keys].update(data[keys])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
for items in sorted_tuples:
word, number = items[0],items[1]
noviluni (Contributor):

Suggested change
word, number = items[0],items[1]
word, number = items[0], items[1]

def _is_valid(key):
"""Identifying whether the given key of the source language file needs to be extracted."""
needed = False
for valid_key in VALID_KEYS:
noviluni (Contributor):

I would rename needed to is_valid for consistency 👍
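
With the rename, the flag can even be folded into an any() call (a sketch; the excerpt doesn't show the loop body, so the containment check is an assumption):

def _is_valid(key):
    """Identify whether the given key of the source language file needs to be extracted."""
    return any(valid_key in key for valid_key in VALID_KEYS)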

Comment on lines 136 to 138
for keys in REQUIRED_DATA_POINTS:
language_data[keys] = {}
ordered_language_data[keys] = {}
noviluni (Contributor):

Suggested change
for keys in REQUIRED_DATA_POINTS:
language_data[keys] = {}
ordered_language_data[keys] = {}
language_data = dict.fromkeys(REQUIRED_DATA_POINTS, {})
ordered_language_data = OrderedDict.fromkeys(REQUIRED_DATA_POINTS, {})

Or:

Suggested change
for keys in REQUIRED_DATA_POINTS:
language_data[keys] = {}
ordered_language_data[keys] = {}
language_data = {key: {} for key in REQUIRED_DATA_POINTS}
ordered_language_data = OrderedDict((key, {}) for key in REQUIRED_DATA_POINTS)
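
One caveat about the dict.fromkeys variant: with a mutable default, every key shares the same dict object, so updating one section would update them all; the comprehension variant avoids this. A quick demonstration:

shared = dict.fromkeys(["UNIT_NUMBERS", "TENS"], {})
shared["UNIT_NUMBERS"]["uno"] = 1
print(shared["TENS"])    # {'uno': 1} -- the same dict object everywhere

separate = {key: {} for key in ["UNIT_NUMBERS", "TENS"]}
separate["UNIT_NUMBERS"]["uno"] = 1
print(separate["TENS"])  # {}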

data = json.load(supplementary_data)
for keys in REQUIRED_DATA_POINTS:
language_data[keys].update(data[keys])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
Gallaecio (Member):

What about sorting by value and then by word?

Suggested change
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: (x[1], x[0]))

language_data[keys].update(data[keys])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
for items in sorted_tuples:
word, number = items[0],items[1]
Gallaecio (Member):

💄

Suggested change
word, number = items[0],items[1]
word, number = items[0], items[1]

arnavkapoor (Collaborator, Author):

Updated the scripts with the relevant changes; ordering by value and then by word does bring added consistency. Thanks @noviluni @Gallaecio for the input. There was an option to commit the suggestions directly, which I hadn't seen earlier. Is there any harm in doing that (an increased number of commits, perhaps)? Also, shall I go ahead and merge with master then?

Gallaecio (Member):

Is there any harm in doing that (an increased number of commits, perhaps)?

None. In fact, one of these changes required regenerating all the files again, so you did well not to just commit the suggestion from GitHub.

Committing suggestions can be easy, but there’s nothing wrong with applying the suggestions manually.

As for the commit history, I usually squash pull requests instead of merging, so that there is a single commit for the whole pull request. If you do that, commits created during the review process are not a problem.

noviluni (Contributor) commented on Jul 6, 2020:

Just to add something to @Gallaecio's comment: even if people try their best when suggesting changes, it sometimes happens that there is a typo, an error, or a side effect. I remember once accepting a suggestion and, after that commit, the pipelines broke because of a missing ', so I also prefer to implement the changes manually, except when there is just one suggestion and it's really obvious.
noviluni (Contributor) left a comment:

BTW, @Gallaecio @arnavkapoor, good job implementing this new order; it's really consistent now :)

arnavkapoor merged commit e2c102a into master on Jul 6, 2020.