Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle commas better #74

Closed
cannawen opened this issue Oct 7, 2017 · 16 comments
Closed

Handle commas better #74

cannawen opened this issue Oct 7, 2017 · 16 comments

Comments

@cannawen
Copy link
Owner

cannawen commented Oct 7, 2017

Enhancement request summary:

Use Numeral.js to localize number recognition (i.e. 4,32 vs. 4.32)

Initial discussion:

Now that we have a preprocess step, we can remove commas from the input string but it might be a bit tricky because we don't want to blanket get rid of all commas (for example "40 miles,foo" would turn into "40 milesfoo" which would not be processed as miles )

Another problem:
Numbers formatted like 4,32 might be an alternate way to say 4.32 but 4,321 is more commonly the number "4321". Currently, we do not match numbers that are formatted like "4,32"

Perhaps there is a library out there for parsing numbers from different locales? If not perhaps we can spin off a new project to create such a library?

Issue is open for discussion

@cannawen
Copy link
Owner Author

cannawen commented Oct 7, 2017

Numeral.js looks like it has some locale features, but I'm not sure if it is as flexible as we need

@nicolasferraro
Copy link
Collaborator

Since we are looking to add more code to preprocessing, should we factor out the preprocessing code into a separate module and then export preprocessComment to avoid polluting conversion_helper?

@cannawen
Copy link
Owner Author

cannawen commented Oct 7, 2017

I like that idea! Perhaps "conversion.js" can use some "preprocess" module before passing the result to "conversion_helper"? Do you want to handle this as part of PR #73, or should we create a new issue?

@nicolasferraro
Copy link
Collaborator

Sure!, I'll handle it in #73.

@nalinbhardwaj
Copy link
Collaborator

I really like the idea of using numeral.js for parsing/handling preprocessing. But this would probably require a major refactor for preprocessing. We can assume locale based on many things(subreddit, general number of active users, etc.) and parse numbers accordingly, remove commas and push it forward. I can look into handling this once #73 is merged. If we plan to do this however, I suggest moving preprocess even outside conversion.js, for the sake of minimising complexity, since it would probably have lots of corner cases and stuff in itself.

@cannawen
Copy link
Owner Author

cannawen commented Oct 7, 2017

Ohhh yes, maybe it's a good idea move it out of anything related to conversions. Assigning issue to @nalinbhardwaj and marking issue as blocked on #73

@nalinbhardwaj
Copy link
Collaborator

Some thoughts after further analysis.

numeral.js is actually very lacking in it's locale handling. For example, for the Arabic/Indian system, the 2nd most popular after american system, it doesn't contain a locale. And locales are mainly just better(language translated) handlers for thousand, million, billion.

Also, some manual stats gathering on r/india shows that a lot more people use American/none(not comma seperate) their comments than those that do use the Indian system. False positives in such cases might lead to very bad results(it's better not to answer than answer wrongly IMO).

We probably need to find better heuristics, and/or a library that's much better at parsing automatically.

@cannawen
Copy link
Owner Author

cannawen commented Oct 9, 2017

Instead of looking at the subreddit to determine locale, maybe we can just handle some basic types of numbers for all subreddits.

So the following is a list of what a Reddit user writes, and what the bot parses in "standard" notation:

  1. x.xx -> x.xx
  2. x.xxx.xxx -> x,xxx,xxx
  3. x.xxx,xxx -> x,xxx.xxx
  4. x'xxx -> x,xxx

So these are just some rough ideas/guidelines. We can break down the different types of formats into different issues if you think that makes sense? Or just tackle it all in one go. Whatever we do, we should keep this section of code very self-contained. I think we can create our own "locale" files in Numeral, if we feel like they are lacking something we need, but I think it is also possible to do this without using numeral and just using regular expressions. Lots and lots of regular expressions.

@nalinbhardwaj
Copy link
Collaborator

nalinbhardwaj commented Oct 9, 2017

Yes @cannawen, actually, Numeral.js doesn't allow for a different number of distance in commas in a number(at least not trivially), which is the case with Arabic numeral system, where-in numbers are 105 = 1 lakh, 107 = 1 crore and so on, so basically, we use(1,00,000 instead of 100,000). Doing this in-house would be much easier.

We can break down the different types of formats into different issues if you think that makes sense?

This is optimal since that'll allow us expertise from individual user groups.

Whatever we do, we should keep this section of code very self-contained.

Agreed. Best to have a separate folder for keeping the parsers, and parse in a main preprocess file.

Also, I wanted to know, what are your thoughts on what we should do for multiple acceptable formats? As in user types 100,1000 this can represent a valid Arabic system 105 or 100 and 1000 in standard.

@cannawen
Copy link
Owner Author

OK. Let's first create an issue for "refactor number parsing" and first we will just parse standard numbers, like:

000.00
0,000.00
0.00
.00

And then we can talk about adding support for different number systems on a case-by-case basis after the refactor is in.

@nalinbhardwaj Do you still want to work on this?

@nalinbhardwaj
Copy link
Collaborator

nalinbhardwaj commented Oct 12, 2017

@cannawen yes, that can be done, but again more or less first we need to figure out how to deal with this as I mentioned:

Also, I wanted to know, what are your thoughts on what we should do for multiple acceptable formats? As in user types 100,1000 this can represent a valid Arabic system 105 or 100 and 1000 in standard.

I can make a PR soon that would add all suffix multipliers(thousand, million etc.) from numeral.js . I think it's a good start. We can work on the number parsing regex side by side. I can also make a PR adding the new formats, but that can be kept open if someone else wants a stab at it.

@nalinbhardwaj
Copy link
Collaborator

I thought about this a bit, and I now feel like the downside to making a much plainer regex, something like this would be extremely low. I mean, we know it is a number, and followed by a unit/following the format, isn't it almost guaranteed to be a number of units? @cannawen thoughts?

@cannawen
Copy link
Owner Author

cannawen commented Oct 12, 2017

We can add a parsing for "100,1000" (or 50e-5) but I think it makes sense to first focus on creating a good architecture for "standard" numbers so we can easily add new ones in the future. I think we can leave the "1k -> 1000" parsing where it is for now, and perhaps refactor later on if we feel like it makes sense in the number parsing part of the code.

I created issue #100 for how I am envisioning the initial refactor, does that issue look clear enough/make sense to you? Anything you think should be changed?

Also, I think it makes sense to keep this issue the discussion area, so the refactor thread itself is cleaner

@nalinbhardwaj
Copy link
Collaborator

nalinbhardwaj commented Oct 12, 2017

@cannawen did you see that regex I posted? My basic idea was that if we just allow for any numbers with any amount of separation using , or ., I don’t think we'll run into many false positives, since we are already checking if it’s surrounded by a unit etc. Can you think of a case where a valid sentence is written, assuming proper punctuation(commas in sentence followed by space) where this parser would give sub optimal results?
If not, this would easily give extremely good results in parsing(in any and every format)

@cannawen
Copy link
Owner Author

If we "capture" every number with periods and commas, we still need to "figure out" what it means (ex. the string "1.234,56" means 1234.56) so I don't see the benefit in capturing the string "1.234,56" alone, without parsing it to a javascript float as well. But perhaps I am missing something?

I also think it would be best if the number-parsing was completely separate form the unit-parsing (so perhaps we could release it as a separate library for others to use), but this is open for discussion.

@cannawen
Copy link
Owner Author

Closing for inactivity - Perhaps a future enhancement to look into

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants