Parse data from cldr #19
Conversation
Hi @arnavkapoor!
I know that I added a lot of comments, but don't be afraid, most of them are really simple to fix 😄
You did a really good job looking at the dateparser script. Even though that script has a lot of things that should be fixed/improved, when looking at code from others it's easy to learn new things and (hopefully) take advantage of some code snippets.
It was a good idea to use the if __name__ == '__main__': guard, as it prevents this script from running on import, and using an underscore to indicate private functions was a good idea too.
Apart from the inline comments, there are a few things I would like to mention:
- Tests
  I know that testing this script is really hard because it involves files, etc., and as it's not as important as other parts we could probably skip it for now. However, if you want to add a simple test checking that there isn't any '%' sign in the generated files, go ahead (if not, don't worry). In dateparser we recently added a test checking that the currently generated .py files contain the same content produced by this script, to avoid merging wrong PRs. It's not necessary for now; maybe we can add an issue for this.
- Docstrings
  I would like to see simple docstrings describing each function's purpose. This makes it easier to understand the current code and to refactor it. You did a great job with this in the other PR, in the parser.py file.
- I added some ideas for changing the data point names (MTENS to TENS, etc.). Apart from that, I think that the current VALID_TOKENS doesn't need to contain a key called tokens and could be implemented directly as a list, i.e.:
  "VALID_TOKENS": ["and", "-"]
  However, both changes in this third point require changing the "supplementary data" files, so I would ask you to keep them as-is for now and to submit a separate PR with these two changes after merging this one.
Of course, if you don't agree with anything I wrote, don't hesitate to let me know and share your opinion.
Good job again! 🚀 😄
number_parser/data/ak.py
"asuon": 7,
"awɔtwe": 8,
"akron": 9,
"": 0,
I think that this zero wouldn't work.
number_parser/data/ca.py
# -*- coding: utf-8 -*-
info = {
    "UNIT_NUMBERS": {
        "s": 0,
I can speak/write Catalan and this "s" for zero is wrong... 🤔
Do you know why this could happen? What does spellout-numbering-cents mean?
The other numbers for this language look good 👍
# -*- coding: utf-8 -*-
info = {
    "UNIT_NUMBERS": {
        "صفر": 0,
I can confirm that the files with Arabic and Hebrew characters are correctly generated, but they are not displayed properly on GitHub. I have already reported this issue to the GitHub support team: https://support.github.com/contact
number_parser/data/fi.py
"kymmeniin": 10,
"kymmenillä": 10,
"kymmeniltä": 10,
"kymmenille": 10
Wow, how crazy is this!
number_parser/data/fr-BE.py
"septante": 70,
"quatre-vingt>%%cents-m>": 80,
"nonante": 90,
"quatre-vingt>%%cents-f>": 80
There is an error here for quatre-vingt (it's duplicated and contains incorrect % symbols).
scripts/parse_data_from_cldr.py
language_data["BASE_NUMBERS"][word] = number

def _find_zeroes(number):
What do you think about changing this to _count_zeroes instead?
scripts/parse_data_from_cldr.py
def _find_zeroes(number):
    zero_count = 0
    while(number > 0):
You will save some iterations if you change number > 0 to number > 9, as we don't need to check one-digit numbers.
scripts/parse_data_from_cldr.py
def write_complete_data():
    for files in os.listdir(SOURCE_PATH):
This files variable should be renamed to file (or filename, or file_name), as it only represents one file (the JSON file).
scripts/parse_data_from_cldr.py
@@ -0,0 +1,143 @@
"""
This script is not only parsing data from CLDR, it's also adding the "supplementary data", so I think the name of the file should be changed. You can use the same name as in dateparser ("write_complete_data") or find another (better) name.
scripts/parse_data_from_cldr.py
language_data_populated[keys].update(data[keys])

encoding_comment = "# -*- coding: utf-8 -*-\n"
translation_data = json.dumps(language_data_populated, indent=4, ensure_ascii=False)
I think that it would be a really good idea to order each section by value.
So, instead of having:
"UNIT_NUMBERS": {
    "cero": 0,
    "uno": 1,
    "dos": 2,
    "tres": 3,
    ...
    "un": 1,
    "una": 1
},
we would have:
"UNIT_NUMBERS": {
    "cero": 0,
    "uno": 1,
    "un": 1,
    "una": 1,
    "dos": 2,
    "tres": 3,
    ...
},
or similar.
This would make the files easier to read. I think the easiest place to do it is here, after having all the language_data, but it's up to you.
Hi @noviluni, I have updated the code with the relevant changes. It took a bit longer than expected, as I was still trying to understand the CLDR language files in more detail. I planned to fix the normalization issue #13 with this PR; I implemented the solution here without using any external library. However, it caused a lot of changes across many files, and I could see that for Hindi it wrongly changed words to a different, invalid form. I assume it did the same for some other languages too. The best way to proceed would probably be to apply this solution only to languages using Latin script and then incrementally see what is needed for other languages.
Hi @arnavkapoor! Good job again!
Now the code is really solid! I added some suggestions (mostly aesthetic).
I think we can merge this and open a PR just changing the naming of the variables in the file. Don't worry about the normalization; I think it can be performed when comparing the entered string with the data from the language files, instead of saving the strings already normalized. Let's leave this to be handled in a future PR.
if power_of_10 >= 2:
    language_data["MULTIPLIERS"][valid_word] = power_of_10_num
    return
valid_word = ""
You can remove this line (valid_word = ""). As you are using an if/else structure below, there will always be a value for the valid_word variable, so you don't need to initialize it here.
else:
    root_word = word.split(">")[0].strip()

if(root_word == ""):
- if(root_word == ""):
+ if root_word == "":
language_data[keys].update(data[keys])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
for items in sorted_tuples:
    word, number = items[0],items[1]
- word, number = items[0],items[1]
+ word, number = items[0], items[1]
def _is_valid(key):
    """Identifying whether the given key of the source language file needs to be extracted."""
    needed = False
    for valid_key in VALID_KEYS:
I would rename needed to is_valid for consistency 👍
for keys in REQUIRED_DATA_POINTS:
    language_data[keys] = {}
    ordered_language_data[keys] = {}
- for keys in REQUIRED_DATA_POINTS:
-     language_data[keys] = {}
-     ordered_language_data[keys] = {}
+ language_data = dict.fromkeys(REQUIRED_DATA_POINTS, {})
+ ordered_language_data = OrderedDict.fromkeys(REQUIRED_DATA_POINTS, {})
Or:
- for keys in REQUIRED_DATA_POINTS:
-     language_data[keys] = {}
-     ordered_language_data[keys] = {}
+ language_data = {key: {} for key in REQUIRED_DATA_POINTS}
+ ordered_language_data = OrderedDict((key, {}) for key in REQUIRED_DATA_POINTS)
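One caveat worth noting about the first variant: dict.fromkeys with a mutable default reuses a single dict object for every key, so the comprehension form is the safer of the two suggestions. A small demonstration (the REQUIRED_DATA_POINTS list here is an illustrative made-up subset):

```python
from collections import OrderedDict

REQUIRED_DATA_POINTS = ["UNIT_NUMBERS", "TENS"]  # illustrative subset

# dict.fromkeys shares ONE dict object across all keys...
shared = dict.fromkeys(REQUIRED_DATA_POINTS, {})
shared["UNIT_NUMBERS"]["uno"] = 1
print(shared["TENS"])        # {'uno': 1} -- the entry leaked into the other key

# ...while a comprehension creates an independent dict per key.
independent = {key: {} for key in REQUIRED_DATA_POINTS}
independent["UNIT_NUMBERS"]["uno"] = 1
print(independent["TENS"])   # {} -- unaffected
```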
data = json.load(supplementary_data)
for keys in REQUIRED_DATA_POINTS:
    language_data[keys].update(data[keys])
    sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
What about sorting by value and then by word?
- sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
+ sorted_tuples = sorted(language_data[keys].items(), key=lambda x: (x[1], x[0]))
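On a tiny made-up Spanish sample, the extra tie-break makes entries with equal values come out in alphabetical order:

```python
data = {"uno": 1, "cero": 0, "dos": 2, "un": 1, "una": 1}

# Sort by value first, then by word, so equal values are ordered
# deterministically by spelling.
sorted_tuples = sorted(data.items(), key=lambda x: (x[1], x[0]))
print(sorted_tuples)
# [('cero', 0), ('un', 1), ('una', 1), ('uno', 1), ('dos', 2)]
```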
language_data[keys].update(data[keys])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
for items in sorted_tuples:
    word, number = items[0],items[1]
💄
- word, number = items[0],items[1]
+ word, number = items[0], items[1]
Updated the scripts with the relevant changes; ordering by value and then by word does bring added consistency. Thanks @noviluni @Gallaecio for the input. There was an option to directly commit the suggestions which I hadn't seen earlier. Is there any harm in doing it directly (an increased number of commits, perhaps)? Also, should I go ahead and merge with master then?
None. In fact, one of these changes required regenerating all files again, so you did well not just committing the suggestion from GitHub. Committing suggestions can be easy, but there's nothing wrong with applying the suggestions manually. As for the commit history, I usually squash pull requests instead of merging, so that there is a single commit for the whole pull request. If you do that, commits created during the review process are not a problem.
Just to add something to @Gallaecio's comment: even if people try to do their best when suggesting changes, it sometimes happens that there is a typo, error, or side effect. I remember once I accepted a suggestion and after that commit the pipelines broke because there was a missing …
BTW, @Gallaecio @arnavkapoor, good job implementing this new order, it's really consistent right now :)
Contains the script for parsing CLDR data and adds the merged '.py' files. The output '.py' files follow the same structure as dateparser.
Information about some of the issues in the individual language files which were mentioned in #17:
- number_parser_data/raw_cldr_translation_data/lt.json had the keyword "ERROR" for some numbers; those are now explicitly skipped in the script.
- The ar, fa, and fa-AF files were not being created properly. However, I believe the issue is with how GitHub/other editors display these files. Opening the same file using the cat command or vim/sublime seems to render it properly (however, VS Code doesn't). I am not sure what the root cause is.