Handle commas better #74

cannawen · 2017-10-07T06:56:57Z

Enhancement request summary:

Use Numeral.js to localize number recognition (i.e. 4,32 vs. 4.32)

Initial discussion:

Now that we have a preprocess step, we can remove commas from the input string but it might be a bit tricky because we don't want to blanket get rid of all commas (for example "40 miles,foo" would turn into "40 milesfoo" which would not be processed as miles )

Another problem:
Numbers formatted like 4,32 might be an alternate way to say 4.32 but 4,321 is more commonly the number "4321". Currently, we do not match numbers that are formatted like "4,32"

Perhaps there is a library out there for parsing numbers from different locales? If not perhaps we can spin off a new project to create such a library?

Issue is open for discussion

cannawen · 2017-10-07T07:08:07Z

Numeral.js looks like it has some locale features, but I'm not sure if it is as flexible as we need

nicolasferraro · 2017-10-07T07:34:56Z

Since we are looking to add more code to preprocessing, should we factor out the preprocessing code into a separate module and then export preprocessComment to avoid polluting conversion_helper?

cannawen · 2017-10-07T07:40:15Z

I like that idea! Perhaps "conversion.js" can use some "preprocess" module before passing the result to "conversion_helper"? Do you want to handle this as part of PR #73, or should we create a new issue?

nicolasferraro · 2017-10-07T07:43:34Z

Sure!, I'll handle it in #73.

nalinbhardwaj · 2017-10-07T13:27:37Z

I really like the idea of using numeral.js for parsing/handling preprocessing. But this would probably require a major refactor for preprocessing. We can assume locale based on many things(subreddit, general number of active users, etc.) and parse numbers accordingly, remove commas and push it forward. I can look into handling this once #73 is merged. If we plan to do this however, I suggest moving preprocess even outside conversion.js, for the sake of minimising complexity, since it would probably have lots of corner cases and stuff in itself.

cannawen · 2017-10-07T13:44:30Z

Ohhh yes, maybe it's a good idea move it out of anything related to conversions. Assigning issue to @nalinbhardwaj and marking issue as blocked on #73

nalinbhardwaj · 2017-10-09T04:11:17Z

Some thoughts after further analysis.

numeral.js is actually very lacking in it's locale handling. For example, for the Arabic/Indian system, the 2nd most popular after american system, it doesn't contain a locale. And locales are mainly just better(language translated) handlers for thousand, million, billion.

Also, some manual stats gathering on r/india shows that a lot more people use American/none(not comma seperate) their comments than those that do use the Indian system. False positives in such cases might lead to very bad results(it's better not to answer than answer wrongly IMO).

We probably need to find better heuristics, and/or a library that's much better at parsing automatically.

cannawen · 2017-10-09T05:50:46Z

Instead of looking at the subreddit to determine locale, maybe we can just handle some basic types of numbers for all subreddits.

So the following is a list of what a Reddit user writes, and what the bot parses in "standard" notation:

x.xx -> x.xx
x.xxx.xxx -> x,xxx,xxx
x.xxx,xxx -> x,xxx.xxx
x'xxx -> x,xxx

So these are just some rough ideas/guidelines. We can break down the different types of formats into different issues if you think that makes sense? Or just tackle it all in one go. Whatever we do, we should keep this section of code very self-contained. I think we can create our own "locale" files in Numeral, if we feel like they are lacking something we need, but I think it is also possible to do this without using numeral and just using regular expressions. Lots and lots of regular expressions.

nalinbhardwaj · 2017-10-09T05:58:10Z

Yes @cannawen, actually, Numeral.js doesn't allow for a different number of distance in commas in a number(at least not trivially), which is the case with Arabic numeral system, where-in numbers are 10⁵ = 1 lakh, 10⁷ = 1 crore and so on, so basically, we use(1,00,000 instead of 100,000). Doing this in-house would be much easier.

We can break down the different types of formats into different issues if you think that makes sense?

This is optimal since that'll allow us expertise from individual user groups.

Whatever we do, we should keep this section of code very self-contained.

Agreed. Best to have a separate folder for keeping the parsers, and parse in a main preprocess file.

Also, I wanted to know, what are your thoughts on what we should do for multiple acceptable formats? As in user types 100,1000 this can represent a valid Arabic system 10⁵ or 100 and 1000 in standard.

cannawen · 2017-10-12T00:45:48Z

OK. Let's first create an issue for "refactor number parsing" and first we will just parse standard numbers, like:

000.00
0,000.00
0.00
.00

And then we can talk about adding support for different number systems on a case-by-case basis after the refactor is in.

@nalinbhardwaj Do you still want to work on this?

nalinbhardwaj · 2017-10-12T00:49:07Z

@cannawen yes, that can be done, but again more or less first we need to figure out how to deal with this as I mentioned:

Also, I wanted to know, what are your thoughts on what we should do for multiple acceptable formats? As in user types 100,1000 this can represent a valid Arabic system 10⁵ or 100 and 1000 in standard.

I can make a PR soon that would add all suffix multipliers(thousand, million etc.) from numeral.js . I think it's a good start. We can work on the number parsing regex side by side. I can also make a PR adding the new formats, but that can be kept open if someone else wants a stab at it.

nalinbhardwaj · 2017-10-12T08:07:56Z

I thought about this a bit, and I now feel like the downside to making a much plainer regex, something like this would be extremely low. I mean, we know it is a number, and followed by a unit/following the format, isn't it almost guaranteed to be a number of units? @cannawen thoughts?

cannawen · 2017-10-12T12:48:40Z

We can add a parsing for "100,1000" (or 50e-5) but I think it makes sense to first focus on creating a good architecture for "standard" numbers so we can easily add new ones in the future. I think we can leave the "1k -> 1000" parsing where it is for now, and perhaps refactor later on if we feel like it makes sense in the number parsing part of the code.

I created issue #100 for how I am envisioning the initial refactor, does that issue look clear enough/make sense to you? Anything you think should be changed?

Also, I think it makes sense to keep this issue the discussion area, so the refactor thread itself is cleaner

nalinbhardwaj · 2017-10-12T13:54:38Z

@cannawen did you see that regex I posted? My basic idea was that if we just allow for any numbers with any amount of separation using , or ., I don’t think we'll run into many false positives, since we are already checking if it’s surrounded by a unit etc. Can you think of a case where a valid sentence is written, assuming proper punctuation(commas in sentence followed by space) where this parser would give sub optimal results?
If not, this would easily give extremely good results in parsing(in any and every format)

cannawen · 2017-10-12T14:15:41Z

If we "capture" every number with periods and commas, we still need to "figure out" what it means (ex. the string "1.234,56" means 1234.56) so I don't see the benefit in capturing the string "1.234,56" alone, without parsing it to a javascript float as well. But perhaps I am missing something?

I also think it would be best if the number-parsing was completely separate form the unit-parsing (so perhaps we could release it as a separate library for others to use), but this is open for discussion.

cannawen · 2017-10-30T14:58:09Z

Closing for inactivity - Perhaps a future enhancement to look into

cannawen added discussion maintainer approved labels Oct 7, 2017

cannawen mentioned this issue Oct 7, 2017

Parse number abbreviations [#46] #73

Merged

cannawen assigned nalinbhardwaj Oct 7, 2017

cannawen added blocked feature request and removed discussion labels Oct 7, 2017

nalinbhardwaj removed the blocked label Oct 12, 2017

cannawen added discussion and removed feature request labels Oct 12, 2017

cannawen closed this as completed Oct 30, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle commas better #74

Handle commas better #74

cannawen commented Oct 7, 2017 •

edited

Loading

cannawen commented Oct 7, 2017 •

edited

Loading

nicolasferraro commented Oct 7, 2017

cannawen commented Oct 7, 2017

nicolasferraro commented Oct 7, 2017

nalinbhardwaj commented Oct 7, 2017

cannawen commented Oct 7, 2017 •

edited

Loading

nalinbhardwaj commented Oct 9, 2017

cannawen commented Oct 9, 2017

nalinbhardwaj commented Oct 9, 2017 •

edited

Loading

cannawen commented Oct 12, 2017

nalinbhardwaj commented Oct 12, 2017 •

edited

Loading

nalinbhardwaj commented Oct 12, 2017

cannawen commented Oct 12, 2017 •

edited

Loading

nalinbhardwaj commented Oct 12, 2017 •

edited

Loading

cannawen commented Oct 12, 2017

cannawen commented Oct 30, 2017

Handle commas better #74

Handle commas better #74

Comments

cannawen commented Oct 7, 2017 • edited Loading

Enhancement request summary:

Initial discussion:

cannawen commented Oct 7, 2017 • edited Loading

nicolasferraro commented Oct 7, 2017

cannawen commented Oct 7, 2017

nicolasferraro commented Oct 7, 2017

nalinbhardwaj commented Oct 7, 2017

cannawen commented Oct 7, 2017 • edited Loading

nalinbhardwaj commented Oct 9, 2017

cannawen commented Oct 9, 2017

nalinbhardwaj commented Oct 9, 2017 • edited Loading

cannawen commented Oct 12, 2017

nalinbhardwaj commented Oct 12, 2017 • edited Loading

nalinbhardwaj commented Oct 12, 2017

cannawen commented Oct 12, 2017 • edited Loading

nalinbhardwaj commented Oct 12, 2017 • edited Loading

cannawen commented Oct 12, 2017

cannawen commented Oct 30, 2017

cannawen commented Oct 7, 2017 •

edited

Loading

cannawen commented Oct 7, 2017 •

edited

Loading

cannawen commented Oct 7, 2017 •

edited

Loading

nalinbhardwaj commented Oct 9, 2017 •

edited

Loading

nalinbhardwaj commented Oct 12, 2017 •

edited

Loading

cannawen commented Oct 12, 2017 •

edited

Loading

nalinbhardwaj commented Oct 12, 2017 •

edited

Loading