
Parse data from cldr #19

Merged · 10 commits · Jul 6, 2020
Conversation

arnavkapoor (Collaborator):

Contains the script for parsing CLDR data and adds the merged '.py' files. The output '.py' files follow the same structure as dateparser.
Information about some of the issues in the individual language files that were mentioned in #17:

  • The raw-data file for Lithuanian, number_parser_data/raw_cldr_translation_data/lt.json, had the keyword "ERROR" for some numbers; those numbers are now explicitly skipped in the script.
  • The files with Arabic script (ar, fa, fa-AF) appeared not to be created properly. However, I believe the issue is with how GitHub and some other editors render the file: opening the same file with the cat command, vim, or Sublime renders it properly, while VS Code does not. I am not sure what the root cause is.

noviluni (Contributor) left a comment:

Hi @arnavkapoor!

I know that I added a lot of comments, but don't be afraid, most of them are really simple to fix 😄

You did a really good job studying the dateparser script. Even though that script has a lot of things that should be fixed/improved, looking at other people's code is a good way to learn new things and (hopefully) pick up some useful code snippets.

It was a good idea to use the if __name__ == '__main__': guard, as it prevents the script from running when it is merely imported, and using a leading underscore to indicate private functions was a good idea too.
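
A minimal sketch of both patterns (the function name here is illustrative, not taken from the PR):

def _write_language_files():
    # Leading underscore: internal helper, not part of the public interface.
    print("writing language files...")

if __name__ == '__main__':
    # Runs only when the script is executed directly,
    # not when it is imported by another module or a test.
    _write_language_files()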

Apart from the inline comments, there are three things I would like to mention:

  1. Tests
    I know that testing this script is really hard because it involves files, etc., and as it's not as important as other parts, we could probably skip it for now. However, if you want to add a simple test checking that there isn't any '%' sign in the generated files, go ahead (if not, don't worry); see the sketch after this list. In dateparser we recently added a test checking that the committed .py files match the content generated by the script, to avoid merging wrong PRs. It's not necessary for now; maybe we can open an issue for it.

  2. Docstrings
    I would like to see simple docstrings describing each function's purpose. This makes it easier to understand the current code and to refactor it. You did a great job with this in the parser.py file in the other PR.

  3. I added some ideas for changing the data-point names (MTENS to TENS, etc.). Apart from that, I think the current VALID_TOKENS doesn't need to contain a key called tokens; it could be implemented directly as a list, i.e.:

"VALID_TOKENS": ["and", "-"]

However, both changes (in point 3) require changing the "supplementary data" files, so I would ask you to keep them as-is for now and submit a separate PR with those two changes after merging this one.
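
A minimal sketch of the '%' test from point 1 (pytest style; the translation_data directory is a hypothetical location for the generated files):

import glob

def test_no_percent_sign_in_generated_files():
    # Hypothetical path; adjust to wherever the generated .py files live.
    for path in glob.glob("number_parser_data/translation_data/*.py"):
        with open(path, encoding="utf-8") as fh:
            assert "%" not in fh.read(), f"'%' sign found in {path}"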

Of course, if you don't agree with anything I wrote, don't hesitate to let me know and share your opinion.

Good job again! 🚀 😄

"asuon": 7,
"awɔtwe": 8,
"akron": 9,
"": 0,
noviluni (Contributor):

I think that this zero wouldn't work.

# -*- coding: utf-8 -*-
info = {
"UNIT_NUMBERS": {
"s": 0,
noviluni (Contributor):

I can speak/write Catalan, and this "s" for zero is wrong... 🤔

Do you know why this could happen? What does spellout-numbering-cents mean?

The other numbers for this language look good 👍

# -*- coding: utf-8 -*-
info = {
"UNIT_NUMBERS": {
"صفر": 0,
noviluni (Contributor), Jun 30, 2020:

I can confirm that the files with Arabic and Hebrew characters are correctly generated, but they are not displayed properly on GitHub. I have already reported this issue to the GitHub support team: https://support.github.com/contact

"kymmeniin": 10,
"kymmenillä": 10,
"kymmeniltä": 10,
"kymmenille": 10
noviluni (Contributor):

Wow, how crazy is this!

"septante": 70,
"quatre-vingt>%%cents-m>": 80,
"nonante": 90,
"quatre-vingt>%%cents-f>": 80
noviluni (Contributor), Jun 30, 2020:

There is an error here for quatre-vingt (as it's duplicated and contains incorrect % symbols).

language_data["BASE_NUMBERS"][word] = number


def _find_zeroes(number):
noviluni (Contributor):

What do you think about changing this to _count_zeroes instead?


def _find_zeroes(number):
zero_count = 0
while(number > 0):
noviluni (Contributor):

You will save some iterations if you change number > 0 to number > 9, as we don't need to check single-digit numbers.
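
With both suggestions applied, the helper could look like this (the loop body is reconstructed from the excerpt under the assumption that it counts trailing zeroes to derive a power of ten, so treat it as a sketch):

def _count_zeroes(number):
    """Count the trailing zeroes of number, i.e. its power of ten."""
    zero_count = 0
    while number > 9:  # single-digit numbers have no trailing zeroes to count
        if number % 10 != 0:
            break
        zero_count += 1
        number //= 10
    return zero_count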



def write_complete_data():
for files in os.listdir(SOURCE_PATH):
noviluni (Contributor):

This files variable should be renamed to file (or filename or file_name), as it only represents one file (the JSON file).

@@ -0,0 +1,143 @@
"""
noviluni (Contributor):

This script is not only parsing data from CLDR but also adding the "supplementary data", so I think the name of the file should be changed. You can use the same name as in dateparser ("write_complete_data") or find another (better) one.

language_data_populated[keys].update(data[keys])

encoding_comment = "# -*- coding: utf-8 -*-\n"
translation_data = json.dumps(language_data_populated, indent=4, ensure_ascii=False)
noviluni (Contributor):

I think that it would be a really good idea to "order" each section by value.

So, instead of having

   "UNIT_NUMBERS": {
        "cero": 0,
        "uno": 1,
        "dos": 2,
        "tres": 3,
      ...
        "un": 1,
        "una": 1
    },

We will have:

   "UNIT_NUMBERS": {
        "cero": 0,
        "uno": 1,
        "un": 1,
        "una": 1
        "dos": 2,
        "tres": 3,
      ...
    },

or similar.

This would make the files easier to read. I think the easiest place to do it is here, after having all the language_data, but it's up to you.
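
The gist of that ordering, as a sketch (the PR itself builds an OrderedDict; since Python 3.7 a plain dict also keeps insertion order):

for key in language_data:
    language_data[key] = dict(
        sorted(language_data[key].items(), key=lambda item: item[1])
    )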

arnavkapoor linked an issue on Jul 5, 2020 that may be closed by this pull request.
arnavkapoor (Collaborator, Author):

Hi @noviluni, I have updated the code with the relevant changes. It took a bit longer than expected, as I was still trying to understand the CLDR language files in more detail. I had planned to fix the normalization issue #13 with this PR, and I implemented a solution here that didn't use any external library. However, it caused a lot of changes across loads of files, and I could see that for Hindi it wrongly changed words to a different, invalid form. I am assuming it did the same for some other languages too. The best way to proceed would probably be to apply this solution only to languages using Latin script and then incrementally see what is needed for the other languages.
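
For context, the usual library-free normalization looks something like the sketch below (an assumption about the approach, not the exact code from this PR). Stripping combining marks after NFKD decomposition is exactly what breaks Devanagari, where signs such as the virama and nukta are combining characters:

import unicodedata

def _normalize(text):
    # NFKD decomposition, then drop combining marks.
    # Fine for Latin diacritics ('é' -> 'e'), but destructive for Hindi:
    # removing the virama turns conjuncts into different, invalid words.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))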

noviluni (Contributor) left a comment:

Hi @arnavkapoor! Good job again!

Now the code is really solid! I added some suggestions (mostly aesthetic).

I think we can merge this and open a PR just changing the naming of the variables in the file. Don't worry about the normalization: I think it can be performed when comparing the entered string with the data from the language files, instead of saving the strings already normalized. Let's leave this to be handled in a future PR.
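
A sketch of that idea, reusing a hypothetical _normalize helper: normalize both sides on the fly at lookup time, so the data files keep their original spellings:

def _matches(token, word):
    # Hypothetical comparison-time normalization; nothing is stored normalized.
    return _normalize(token) == _normalize(word)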

if power_of_10 >= 2:
language_data["MULTIPLIERS"][valid_word] = power_of_10_num
return
valid_word = ""
noviluni (Contributor):

You can remove this line (valid_word = ""). As you are using an if/else structure below, there will always be a value for the valid_word variable, so you don't need to initialize it here.

else:
root_word = word.split(">")[0].strip()

if(root_word == ""):
noviluni (Contributor):

Suggested change
if(root_word == ""):
if root_word == "":

language_data[keys].update(data[keys])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
for items in sorted_tuples:
word, number = items[0],items[1]
noviluni (Contributor):

Suggested change
word, number = items[0],items[1]
word, number = items[0], items[1]

def _is_valid(key):
"""Identifying whether the given key of the source language file needs to be extracted."""
needed = False
for valid_key in VALID_KEYS:
noviluni (Contributor):

I would rename needed to is_valid for consistency 👍
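
With the rename, the flag can even be folded into an any() call (a sketch; the excerpt doesn't show the loop body, so the containment check is an assumption):

def _is_valid(key):
    """Identify whether the given key of the source language file needs to be extracted."""
    return any(valid_key in key for valid_key in VALID_KEYS)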

Comment on lines 136 to 138
for keys in REQUIRED_DATA_POINTS:
language_data[keys] = {}
ordered_language_data[keys] = {}
noviluni (Contributor):

Suggested change
for keys in REQUIRED_DATA_POINTS:
language_data[keys] = {}
ordered_language_data[keys] = {}
language_data = dict.fromkeys(REQUIRED_DATA_POINTS, {})
ordered_language_data = OrderedDict.fromkeys(REQUIRED_DATA_POINTS, {})

Or:

Suggested change
for keys in REQUIRED_DATA_POINTS:
language_data[keys] = {}
ordered_language_data[keys] = {}
language_data = {key: {} for key in REQUIRED_DATA_POINTS}
ordered_language_data = OrderedDict((key, {}) for key in REQUIRED_DATA_POINTS)
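
One caveat about the dict.fromkeys variant: with a mutable default, every key shares the same dict object, so updating one section would update them all; the comprehension variant avoids this. A quick demonstration:

shared = dict.fromkeys(["UNIT_NUMBERS", "TENS"], {})
shared["UNIT_NUMBERS"]["uno"] = 1
print(shared["TENS"])    # {'uno': 1} -- the same dict object everywhere

separate = {key: {} for key in ["UNIT_NUMBERS", "TENS"]}
separate["UNIT_NUMBERS"]["uno"] = 1
print(separate["TENS"])  # {}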

data = json.load(supplementary_data)
for keys in REQUIRED_DATA_POINTS:
language_data[keys].update(data[keys])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
Gallaecio (Member):

What about sorting by value and then by word?

Suggested change
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: (x[1], x[0]))

language_data[keys].update(data[keys])
sorted_tuples = sorted(language_data[keys].items(), key=lambda x: x[1])
for items in sorted_tuples:
word, number = items[0],items[1]
Gallaecio (Member):

💄

Suggested change
word, number = items[0],items[1]
word, number = items[0], items[1]

arnavkapoor (Collaborator, Author):

Updated the scripts with the relevant changes; ordering by value and then by word does bring added consistency. Thanks @noviluni @Gallaecio for the input. There was an option to commit the suggestions directly, which I hadn't seen earlier. Is there any harm in doing that (an increased number of commits, perhaps)? Also, shall I go ahead and merge with master then?

Gallaecio (Member):

Is there any harm in doing that (an increased number of commits, perhaps)?

None. In fact, one of these changes required regenerating all the files again, so you did well not to just commit the suggestion from GitHub.

Committing suggestions can be easy, but there’s nothing wrong with applying the suggestions manually.

As for the commit history, I usually squash pull requests instead of merging, so that there is a single commit for the whole pull request. If you do that, commits created during the review process are not a problem.

noviluni (Contributor) commented on Jul 6, 2020:

Just to add something to @Gallaecio's comment: even if people try their best when suggesting changes, it sometimes happens that there is a typo, an error, or a side effect. I remember once accepting a suggestion and, after that commit, the pipelines broke because of a missing ', so I also prefer to implement the changes manually, except when there is just one suggestion and it's really obvious.
noviluni (Contributor) left a comment:

BTW, @Gallaecio @arnavkapoor, good job implementing this new order; it's really consistent now :)

arnavkapoor merged commit e2c102a into master on Jul 6, 2020.