Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse number feature #9

Merged
merged 7 commits into from
Jun 22, 2020
Merged

Parse number feature #9

merged 7 commits into from
Jun 22, 2020

Conversation

arnavkapoor
Copy link
Collaborator

Added the feature 'parse_number' #7 that parses a single number.

Copy link
Contributor

@noviluni noviluni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job @arnavkapoor! That was quick!

I added some comments to improve the code, let me know if you need to clarify something.

Don't worry too much if you see a lot of comments. These early stages are really important to set a good codebase, as it will have a really big impact on the future of this library.

On the other side, I would ask you to add a simple one-line docstring in each function explaining what it does. It's not necessary to add the arguments, etc. just the description. In that way, it will be easier to understand the code and we would find better names for some functions (like check_validity, which I think that it's not a good name). If you see that any of these functions do a lot of things you can separate part of the logic to another function (even if it's not reused) to improve readability (single responsibility).

And again, good job! I think we really have a good PoC! Keep in that way! 🚀

if previous_token in MTENS:
return False
if previous_token in MTENS:
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some unnecessarily nested ifs. It would be better to rewrite them with an and to avoid too many levels:

Example:

if current_token in HUNDRED and previous_token in MTENS:
    return False

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -75,8 +72,9 @@ def number_builder(token_list):
value_list = []

for each_token in token_list:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename each_token to simply token.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

# Basically implying beginning of a new number hence resetting values.
if each_token.isspace():
continue
valid = check_validity(each_token, previous_token)
if not valid:
total_value += current_grp_value
value_list.append(str(total_value))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This comment refers to the code below, but I can't attach a comment there)

The token can't be (or shouldn't be) in more than one type (BASE_NUMBERS, MTENS, HUNDRED, MULTIPLIERS) at the same time, so I would change the if to elif to avoid doing additional unnecessary checkings.

Example:

if each_token in BASE_NUMBERS:
   ...
elif each_token in MTENS:
   ...
elif each_token in HUNDRED:
   ...
elif each_token in MULTIPLIERS:
   ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

def parse(input_string):
# Fails when two numbers have no SENTENCE_SEPERATORS or words between them
# eg) 'one two three' doesn't work but 'one,two,three' and 'one apple , two mango , three' works.
def tokeniser(input_string):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, it's not a class/object but a function (action) so it should be renamed to tokenise. On the other hand, dateparser and other libraries as nltk use the "tokenize/tokenizer" spelling, so I would change this function name to tokenize().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, make sense Will surely think a bit now before naming variables, I didn't use to give too much thought to it. Updated


def parse_number(input_string):
if input_string.isnumeric():
return int(input_string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this approach is not good enough, as it doesn't work with float numbers.

Please, add these examples to the tests and fix it:

>>> parse_number("1,000,000")                                                                                                                                                          
1000000

>>> parse_number("4.3")                                                                                                                                                                
4.3

>>> parse_number("4")                                                                                                                                                                  
4

>>> parse_number("1,000,000.25")                                                                                                                                                       
1000000.25

Something like this could work:

input_string = input_string.replace(',', '')
if input_string.replace('.', '').isnumeric():
    try:
        return int(input_string)
    except ValueError:
        return float(input_string)

But it will fail if we add multiple points (example: parse_number("10.00.25")). Maybe you find a better approach. We should probably add a setting or option to change the , and . behavior, as in some locales the commas are used as a decimal separator.

>>> parse_number("1.000,25", decimal_separator=',')                                                                                                                                                       
1000000.25

But we don't need to spend time on this right now. You can open a new issue to handle this in the future.

Copy link
Collaborator Author

@arnavkapoor arnavkapoor Jun 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did realize that it won't work with float numbers. But was planning to fix it in a later iteration once the support for decimals is added. The main parse function would need to handle the 'point' keyword and the number_builder logic would also change. Even more, modifications would be needed for negative numbers as ' - ' (minus sign) would need to be treated separately. I guess it's better to raise a separate issue for decimal support as you suggested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, open a new ticket and it will be fixed in a future iteration. If you can, please, link to this comments (URL: https://github.com//pull/9#discussion_r442400344).


all_tokens = tokeniser(input_string)
for index, each_token in enumerate(all_tokens):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename all_tokens to tokens and each_token to token.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Contributor

@noviluni noviluni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we will rewrite most of these functions when implementing the data from CLDR, but for now, I think it works and looks pretty well.

Good job @arnavkapoor! 💪

@arnavkapoor arnavkapoor merged commit a26a452 into master Jun 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants