Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/improve parsing #81

Merged
merged 9 commits into from
Jun 26, 2015

Conversation

maherkassim
Copy link
Contributor

Added the ability to parse multiple quantities (eg. 1ft 1in)

@angularsen
Copy link
Owner

As already stated, I think this is a fair approach, but I have a hunch we might run into some edge cases in the parsing, especially for unit abbreviations with numbers in them. We don't have any today, since we normally go with m² and m³, but it's fair to assume we will have abbreviations like m^2 and m^3 at some point. I think we need more tests to be sure though, and I guess adding abbreviations like that is needed to fully test it.

I just added a couple of new tests in b215ac5 that might serve as a reference on how to temporarily add new unit abbreviations in a test, to test the outcome. Since UnitSystem is a singleton-per-culture, it will also affect parsing done in unit classes.

This way we can test the parse behavior for both realistic and contrived cases, such as "5 m^2 2m²" equals 7 m². While writing that I just realized an edge case. If we allow removing the space you get a pretty ambiguous expression "5m^22m²". This is why I don't want to allow space between them, but for some cases like 5"2' we do want to allow that. So I'm starting to question whether we should allow 2 units for all types of units, or maybe we should only allow them for the known cases, like feet/inches and stone/pounds?

@maherkassim
Copy link
Contributor Author

I'm leaning more towards handling imperial units like feet/inches and stone/pounds separately.

The updated regex will be able to parse 1' 1" (not 1'1"), but I believe that to be sufficient for now. I've reverted the unit matching back, so it will allow cases like 1m^2 again.

@maherkassim
Copy link
Contributor Author

I haven't yet been able to find a way to modify the regex to parse both 1'1" and 1m^2. Right now it can parse 1' 1" and 1m^2, but it needs that space between the two quantities in the Imperial units.

One approach we can take is to attempt both regex's. Specifically, if it fails to parse the string using the initial regex then it can try an alternate regex for Imperial units where it stops matching the unit once it reaches a number. This should match all imperial quantities of the form 1'1".
The only case I believe won't work with this approach is the one you mentioned with a string of the form "5m^22m²", but I believe that is simply poorly formatted input and should fail (it's not particularly human-readable).

A potential concern with this approach may be performance. However, if we only run the second regex when the first fails, the only times where both are run would be for invalid input or for Imperial units with no space in between. So I'm not sure how big of a concern it is, but we may want to take it into consideration.

What do you think?

@angularsen
Copy link
Owner

I'm still not agreeing with myself on whether we want to allow 2 units for all our unit classes, or only implement for the few known cases such as 5" 3' and possibly pounds/stones (not too familiar with that one).

I do think it is a nice that we can parse strings like "5m 10cm". Possibly even "5m and 10cm", which should be trivial to support with regex. However, it feels strange that this would also parse "5Pa 3 bars" or "5m 2yd". But then again, why not? As long as it parses the correct quantity, why limit the option. That way we don't have to keep implementing all the special cases of 2 units in the future when we discover them.

So in the end, I'm leaning towards keeping the current implementation, but we need to figure out how to support parsing 5"2' , because I believe that will be commonly used. Maybe as you said, a second pass of regex? I don't know what the performance hit is here, but if it's starting to become significant, then I do want to optimize for the 80% of usages, and that is single quantity/unit.

How about this:

  • 1st pass: single quantity/unit regex (most common)
  • 2nd pass: feet/inches special case regex (quite common)
  • 3rd pass: dual quantity/unit regex (rare)

@maherkassim
Copy link
Contributor Author

I believe that having a second pass should be sufficient. I don't think we'll see a significant performance improvement from splitting the current regex into two separate passes.

I think another option might be to just separate the functionality completely into a separate "ParseImperial" method which would be dedicated to handling the special parsing rules. This might also help clarify how the input is expected to be formatted for each method (ie. only allowing multiple quantities like "1ft2in" and "1ft 2in" in ParseImperial).

It might also make sense to limit ParseImperial to only be implemented for specific classes like:

  • Length (eg. inches, feet, yards)
  • Mass (eg. pounds, ounces)
    I don't think it makes much sense for most of the other classes (eg. Acceleration, Temperature).

@angularsen
Copy link
Owner

I can go with that. I too don't think performance will be a big problem for the short strings we are parsing. We can compile and cache the regexes too to further improve performance if need be.

I think ParseImperial may be a bit generic though. The way I see it, the only special case we need to handle is to allow 1"2' with no space between them. All other single/dual units can be handled by the current regex implementation, maybe extended with support for "and" between dual quantities. Am I right?

As per my last post, I am currently leaning towards allowing dual quantities for all units, simply because it is less maintenance if we later find we need dual quantities for units X and Y as well, and I don't see the harm in parsing two units that can logically be added together although they may not be commonly used together, such as "1L 2dm³" or "1L and 2dm³" parses to 3 liters. As long as we require a space between the two quantities, we should hopefully avoid many edge cases.

@maherkassim
Copy link
Contributor Author

Ok, so I will just have add the second regex that will attempt to match strings like 1'1" if the current regex fails. Also, using the current regex the "and" will just be ignored so it should work without any additional modification.

@angularsen
Copy link
Owner

Perfect, sounds good to me.

@maherkassim
Copy link
Contributor Author

If the input is "1ft and 1in" then that will work with the parsing, but so will nonsense like "1ft chicken cow sheep 1in" as the extra words simply won't be matched by the regex. Should we be concerned with this or ignore it and leave it up to the user? We should document the behavior we choose (ie. let users know that all invalid input mixed in with valid input will be ignored).

On one hand, we can ignore the invalid input and not limit the user. On the other hand, it is technically invalid, so maybe we should treat it as such? For example, how should we treat 1'1? It will currently ignore the second 1 and parse as 1 foot because it isn't matched by the regex. I came across this case when I was writing tests for the parsing and wasn't sure how it should be handled.

@angularsen
Copy link
Owner

After some thought, I think we should allow multiple quantities, but not allow invalid strings. Generally, I favor parsing the quantities as long as they can logically be parsed, and leave it up to user code to ensure what is parsed actually makes sense.

My thoughts are as follows

// Length.Parse() examples
// Multiple quantities are added together
// No invalid text is allowed
// Allowed delimiters between quantities: 1 comma, 1 word "and", N whitespace, or any combination 

// Valid strings

1m => 1 meter                       // single quantity
1m 1" => (1 meter + 1 inch)         // valid, but unconventional
1'1" => (1 foot + 1 inch)           // special case for feet/inches, allow no space
1dm³ 1L => 2 liters                 // 2 quantities, separated by space
1m 1m 1m => 3m                      // 3 quantities, separated by space
1m 1m ... 1m, N times => N m        // N quantities, separated by space
1m and 1cm => 1.011m                // 2 quantities, separated by "and"-word and whitespace
1m,1cm,1mm => 1.011m                // 3 quantities, separated by comma
1m, 1cm, 1mm => 1.011m              // 3 quantities, separated by comma
1m, 1cm and 1mm => 1.011m           // 3 quantities, separated by a mix of delimiters
1m, 1cm, and 1mm => 1.011m          // 3 quantities, separated by a mix of delimiters
1m     and      1cm => 1.011m       // 3 quantities, separated by a mix of delimiters
1m  ,   and,      1cm => 1.011m     // 3 quantities, separated by a mix of delimiters

// Invalid strings, throws exception

1m1cm           // 2 quantities, no space
1m monkey 1cm   // invalid word
1''             // invalid unit
1mmm            // invalid unit
1'1"2"          // only 2 quantities of exactly foot and inch is supported without space
1"1'            // only 2 quantities of exactly foot and inch is supported without space

It's possible we might run into trouble allowing comma as delimiter, since different cultures use comma differently in the number formatting, but to my knowledge none of them result in a number starting or ending with a comma, but I could be wrong. For instance US English often use ".5" to denote 0.5, so I suppose some cultures allow the same for comma. However, we could simply specify that number formats like ",5" and ".5" are not supported.

Topic of interesting complexity, this is. --Yoda

Update: Fixed " and ' for feet/inches.

@maherkassim
Copy link
Contributor Author

Minor clarification: I believe that 1' refers to 1 foot and 1" refers to 1 inch, although this may vary for other languages. This was how I entered it in the abbreviations for feet and inches in the unit definition. http://en.wikipedia.org/wiki/Prime_(symbol)

@maherkassim
Copy link
Contributor Author

Regarding the note about parsing ".5", I think it would be useful to observe the behavior of double.Parse() as that is ultimately what decides whether or not it can be parsed (unless we specifically pad it with a zero).

@angularsen
Copy link
Owner

You are right of course, I'm a metric person myself so got them mixed up for some reason :-)

@maherkassim
Copy link
Contributor Author

Ahh ok :) Sorry for nitpicking, I just wanted to make sure there wasn't an actual difference that we'd have to support. Although that's not the case here, I wonder if there are actually any units where the abbreviations differ by language?

@angularsen
Copy link
Owner

No idea, but I'm tempted to simply assume no for now until someone points out otherwise.

@angularsen
Copy link
Owner

I don't mean to nag, but any progress on this?

@maherkassim
Copy link
Contributor Author

Wow! Sorry about this. Was really preoccupied for a while and it completely slipped my mind to come back to it. I'll take a look to refresh my memory and then submit the remaining changes.

@angularsen
Copy link
Owner

Just a heads up, I am about to delete develop and stable branches, and returning to using a master branch. It seems pull requests can't be edited to change target branch, so I will leave develop up until this is merged in.

@maherkassim
Copy link
Contributor Author

Ok, so I've updated to simplify the parsing (splitting to have two regex calls seemed like a bit much). I added a capturing group to the regex for invalid strings that should capture nonsense like "monkey". I think the regex should now behave as expected, and any further limitations we want to place on the parsing might be done after the matching. For example, if we want to enforce that 1"1' is invalid (should be 1'1"). Please note that I have not added checks like that with these changes, but we may wish to add it in the future (perhaps we should add a ticket in the issue tracker).

@angularsen
Copy link
Owner

So mostly nitpicking on my end. Do you think we should just merge it in as-is or do you want to act on any of the feedback first?

@maherkassim
Copy link
Contributor Author

Ok, I think it can be merged unless there is a different alternative to the groups["invalid"].Value == "" check that I missed (couldn't find groups["invalid"].IsMatch and groups["invalid"].Success didn't work as expected).

angularsen added a commit that referenced this pull request Jun 26, 2015
Improve parsing of unit quantities.

Not all the cases below are implemented. For instance it happily parses unit quantity pairs with no space or delimiters between them. This is left for a future improvement.

---

// Length.Parse() examples
// Multiple quantities are added together
// No invalid text is allowed
// Allowed delimiters between quantities: 1 comma, 1 word "and", N whitespace, or any combination 

// Valid strings

1m => 1 meter                       // single quantity
1m 1" => (1 meter + 1 inch)         // valid, but unconventional
1'1" => (1 foot + 1 inch)           // special case for feet/inches, allow no space
1dm³ 1L => 2 liters                 // 2 quantities, separated by space
1m 1m 1m => 3m                      // 3 quantities, separated by space
1m 1m ... 1m, N times => N m        // N quantities, separated by space
1m and 1cm => 1.011m                // 2 quantities, separated by "and"-word and whitespace
1m,1cm,1mm => 1.011m                // 3 quantities, separated by comma
1m, 1cm, 1mm => 1.011m              // 3 quantities, separated by comma
1m, 1cm and 1mm => 1.011m           // 3 quantities, separated by a mix of delimiters
1m, 1cm, and 1mm => 1.011m          // 3 quantities, separated by a mix of delimiters
1m     and      1cm => 1.011m       // 3 quantities, separated by a mix of delimiters
1m  ,   and,      1cm => 1.011m     // 3 quantities, separated by a mix of delimiters

// Invalid strings, throws exception

1m1cm           // 2 quantities, no space
1m monkey 1cm   // invalid word
1''             // invalid unit
1mmm            // invalid unit
1'1"2"          // only 2 quantities of exactly foot and inch is supported without space
1"1'            // only 2 quantities of exactly foot and inch is supported without space
@angularsen angularsen merged commit 7353c4b into angularsen:develop Jun 26, 2015
@angularsen
Copy link
Owner

Ok, I haven't played too much with them myself. If it works, it's fine by me.

Merging in!

Thanks for the 1+ month effort :-) 👍

@maherkassim
Copy link
Contributor Author

Haha yeah, sorry about the extended delay. There's still room for improvement, but I think we've made some progress.

angularsen added a commit that referenced this pull request Jun 26, 2015
PR was based on develop, so have to manually merge it into master after
changing the branching style (again).
@angularsen
Copy link
Owner

Nuget is now out 3.14.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants