-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle commas better #74
Comments
Numeral.js looks like it has some locale features, but I'm not sure if it is as flexible as we need |
Since we are looking to add more code to preprocessing, should we factor out the preprocessing code into a separate module and then export preprocessComment to avoid polluting conversion_helper? |
I like that idea! Perhaps "conversion.js" can use some "preprocess" module before passing the result to "conversion_helper"? Do you want to handle this as part of PR #73, or should we create a new issue? |
Sure!, I'll handle it in #73. |
I really like the idea of using numeral.js for parsing/handling preprocessing. But this would probably require a major refactor for preprocessing. We can assume locale based on many things(subreddit, general number of active users, etc.) and parse numbers accordingly, remove commas and push it forward. I can look into handling this once #73 is merged. If we plan to do this however, I suggest moving preprocess even outside conversion.js, for the sake of minimising complexity, since it would probably have lots of corner cases and stuff in itself. |
Ohhh yes, maybe it's a good idea move it out of anything related to conversions. Assigning issue to @nalinbhardwaj and marking issue as blocked on #73 |
Some thoughts after further analysis. numeral.js is actually very lacking in it's locale handling. For example, for the Arabic/Indian system, the 2nd most popular after american system, it doesn't contain a locale. And locales are mainly just better(language translated) handlers for Also, some manual stats gathering on r/india shows that a lot more people use American/none(not comma seperate) their comments than those that do use the Indian system. False positives in such cases might lead to very bad results(it's better not to answer than answer wrongly IMO). We probably need to find better heuristics, and/or a library that's much better at parsing automatically. |
Instead of looking at the subreddit to determine locale, maybe we can just handle some basic types of numbers for all subreddits. So the following is a list of what a Reddit user writes, and what the bot parses in "standard" notation:
So these are just some rough ideas/guidelines. We can break down the different types of formats into different issues if you think that makes sense? Or just tackle it all in one go. Whatever we do, we should keep this section of code very self-contained. I think we can create our own "locale" files in Numeral, if we feel like they are lacking something we need, but I think it is also possible to do this without using numeral and just using regular expressions. Lots and lots of regular expressions. |
Yes @cannawen, actually, Numeral.js doesn't allow for a different number of distance in commas in a number(at least not trivially), which is the case with Arabic numeral system, where-in numbers are 105 = 1 lakh, 107 = 1 crore and so on, so basically, we use(1,00,000 instead of 100,000). Doing this in-house would be much easier.
This is optimal since that'll allow us expertise from individual user groups.
Agreed. Best to have a separate folder for keeping the parsers, and parse in a main preprocess file. Also, I wanted to know, what are your thoughts on what we should do for multiple acceptable formats? As in user types |
OK. Let's first create an issue for "refactor number parsing" and first we will just parse standard numbers, like: 000.00 And then we can talk about adding support for different number systems on a case-by-case basis after the refactor is in. @nalinbhardwaj Do you still want to work on this? |
@cannawen yes, that can be done, but again more or less first we need to figure out how to deal with this as I mentioned:
I can make a PR soon that would add all suffix multipliers(thousand, million etc.) from numeral.js . I think it's a good start. We can work on the number parsing regex side by side. I can also make a PR adding the new formats, but that can be kept open if someone else wants a stab at it. |
We can add a parsing for "100,1000" (or 50e-5) but I think it makes sense to first focus on creating a good architecture for "standard" numbers so we can easily add new ones in the future. I think we can leave the "1k -> 1000" parsing where it is for now, and perhaps refactor later on if we feel like it makes sense in the number parsing part of the code. I created issue #100 for how I am envisioning the initial refactor, does that issue look clear enough/make sense to you? Anything you think should be changed? Also, I think it makes sense to keep this issue the discussion area, so the refactor thread itself is cleaner |
@cannawen did you see that regex I posted? My basic idea was that if we just allow for any numbers with any amount of separation using , or ., I don’t think we'll run into many false positives, since we are already checking if it’s surrounded by a unit etc. Can you think of a case where a valid sentence is written, assuming proper punctuation(commas in sentence followed by space) where this parser would give sub optimal results? |
If we "capture" every number with periods and commas, we still need to "figure out" what it means (ex. the string "1.234,56" means 1234.56) so I don't see the benefit in capturing the string "1.234,56" alone, without parsing it to a javascript float as well. But perhaps I am missing something? I also think it would be best if the number-parsing was completely separate form the unit-parsing (so perhaps we could release it as a separate library for others to use), but this is open for discussion. |
Closing for inactivity - Perhaps a future enhancement to look into |
Enhancement request summary:
Use Numeral.js to localize number recognition (i.e. 4,32 vs. 4.32)
Initial discussion:
Now that we have a preprocess step, we can remove commas from the input string but it might be a bit tricky because we don't want to blanket get rid of all commas (for example "40 miles,foo" would turn into "40 milesfoo" which would not be processed as miles )
Another problem:
Numbers formatted like 4,32 might be an alternate way to say 4.32 but 4,321 is more commonly the number "4321". Currently, we do not match numbers that are formatted like "4,32"
Perhaps there is a library out there for parsing numbers from different locales? If not perhaps we can spin off a new project to create such a library?
Issue is open for discussion
The text was updated successfully, but these errors were encountered: