-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse number feature #9
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job @arnavkapoor! That was quick!
I added some comments to improve the code, let me know if you need to clarify something.
Don't worry too much if you see a lot of comments. These early stages are really important to set a good codebase, as it will have a really big impact on the future of this library.
On the other side, I would ask you to add a simple one-line docstring in each function explaining what it does. It's not necessary to add the arguments, etc. just the description. In that way, it will be easier to understand the code and we would find better names for some functions (like check_validity
, which I think that it's not a good name). If you see that any of these functions do a lot of things you can separate part of the logic to another function (even if it's not reused) to improve readability (single responsibility).
And again, good job! I think we really have a good PoC! Keep in that way! 🚀
number_parser/parser.py
Outdated
if previous_token in MTENS: | ||
return False | ||
if previous_token in MTENS: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some unnecessarily nested ifs
. It would be better to rewrite them with an and
to avoid too many levels:
Example:
if current_token in HUNDRED and previous_token in MTENS:
return False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
number_parser/parser.py
Outdated
@@ -75,8 +72,9 @@ def number_builder(token_list): | |||
value_list = [] | |||
|
|||
for each_token in token_list: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename each_token
to simply token
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
# Basically implying beginning of a new number hence resetting values. | ||
if each_token.isspace(): | ||
continue | ||
valid = check_validity(each_token, previous_token) | ||
if not valid: | ||
total_value += current_grp_value | ||
value_list.append(str(total_value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This comment refers to the code below, but I can't attach a comment there)
The token can't be (or shouldn't be) in more than one type (BASE_NUMBERS
, MTENS
, HUNDRED
, MULTIPLIERS
) at the same time, so I would change the if
to elif
to avoid doing additional unnecessary checkings.
Example:
if each_token in BASE_NUMBERS:
...
elif each_token in MTENS:
...
elif each_token in HUNDRED:
...
elif each_token in MULTIPLIERS:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
number_parser/parser.py
Outdated
def parse(input_string): | ||
# Fails when two numbers have no SENTENCE_SEPERATORS or words between them | ||
# eg) 'one two three' doesn't work but 'one,two,three' and 'one apple , two mango , three' works. | ||
def tokeniser(input_string): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually, it's not a class/object but a function (action) so it should be renamed to tokenise
. On the other hand, dateparser
and other libraries as nltk
use the "tokenize/tokenizer" spelling, so I would change this function name to tokenize()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, make sense Will surely think a bit now before naming variables, I didn't use to give too much thought to it. Updated
|
||
def parse_number(input_string): | ||
if input_string.isnumeric(): | ||
return int(input_string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this approach is not good enough, as it doesn't work with float numbers.
Please, add these examples to the tests and fix it:
>>> parse_number("1,000,000")
1000000
>>> parse_number("4.3")
4.3
>>> parse_number("4")
4
>>> parse_number("1,000,000.25")
1000000.25
Something like this could work:
input_string = input_string.replace(',', '')
if input_string.replace('.', '').isnumeric():
try:
return int(input_string)
except ValueError:
return float(input_string)
But it will fail if we add multiple points (example: parse_number("10.00.25")
). Maybe you find a better approach. We should probably add a setting or option to change the ,
and .
behavior, as in some locales the commas are used as a decimal separator.
>>> parse_number("1.000,25", decimal_separator=',')
1000000.25
But we don't need to spend time on this right now. You can open a new issue to handle this in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did realize that it won't work with float numbers. But was planning to fix it in a later iteration once the support for decimals is added. The main parse
function would need to handle the 'point' keyword and the number_builder
logic would also change. Even more, modifications would be needed for negative numbers as ' - ' (minus sign) would need to be treated separately. I guess it's better to raise a separate issue for decimal support as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, open a new ticket and it will be fixed in a future iteration. If you can, please, link to this comments (URL: https://github.com//pull/9#discussion_r442400344).
number_parser/parser.py
Outdated
|
||
all_tokens = tokeniser(input_string) | ||
for index, each_token in enumerate(all_tokens): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename all_tokens
to tokens
and each_token
to token
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we will rewrite most of these functions when implementing the data from CLDR, but for now, I think it works and looks pretty well.
Good job @arnavkapoor! 💪
Added the feature 'parse_number' #7 that parses a single number.