Add a new SuffixParser to handle UTF-8 characters with more than one byte #82

Merged: 5 commits, Mar 31, 2023

Conversation

baldawar
Collaborator

Issue #, if available: #73

Description of changes:

Add a new suffix parser that reverses utf-8 characters correctly.

Before, we would convert "雨 to [34, -23, -101, -88], while addSuffixMatch(...) in ByteMachine expected the value to be [34, -88, -101, -23].

This change alters how we build InputCharacter[] within the default parser by introducing SuffixParser for the current quirky "reverse and then match backwards" behaviour in the suffix and anything-but-suffix matchers.
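To make the two byte orders concrete, here is a minimal standalone sketch. It is not the SuffixParser code from this PR: the class and method names are hypothetical, and it assumes the old behaviour amounted to reversing the characters before encoding, while the matcher expects the UTF-8 bytes themselves in reverse order.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of the library.
public class SuffixReversalSketch {

    // Assumed old behaviour: reverse the characters, then encode as UTF-8.
    // Each multi-byte character keeps its internal byte order.
    public static byte[] reverseCharsThenEncode(String value) {
        String reversed = new StringBuilder(value).reverse().toString();
        return reversed.getBytes(StandardCharsets.UTF_8);
    }

    // What a byte-by-byte backwards match needs: encode as UTF-8,
    // then reverse the whole byte array.
    public static byte[] encodeThenReverseBytes(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[utf8.length];
        for (int i = 0; i < utf8.length; i++) {
            out[i] = utf8[utf8.length - 1 - i];
        }
        return out;
    }

    public static void main(String[] args) {
        // U+96E8 (the rain character from the description) followed by a closing quote.
        String value = "\u96E8\"";
        System.out.println(Arrays.toString(reverseCharsThenEncode(value)));
        System.out.println(Arrays.toString(encodeThenReverseBytes(value)));
    }
}
```

Running main prints [34, -23, -101, -88] and then [34, -88, -101, -23], matching the two arrays quoted in the description. For ASCII-only values both methods return the same bytes, which would explain why the problem only shows up with multi-byte characters.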

I decided against switching to a wildcard based pattern like *雨 within suffix because

  1. Patterns.suffixMatch() is a public method, so any changes can be breaking.
  2. Wildcard hasn't been significantly tested in the wild yet and introduces a new layer of complexity.
  3. There's a significant amount of refactoring (mostly deletion), which would lead to rebasing issues with #75 (Exponential memory improvement by re-using NameState across multiple patterns) that I don't want to introduce at the moment.

No issues detected within benchmarks.

Benchmark / Performance (for source code changes):

Before

EXACT events/sec: 201387.5
WILDCARD events/sec: 160201.5
PREFIX events/sec: 214138.7
SUFFIX events/sec: 197651.2
EQUALS_IGNORE_CASE events/sec: 175798.7
NUMERIC events/sec: 129840.3
ANYTHING-BUT events/sec: 127356.8
ANYTHING-BUT-IGNORE-CASE events/sec: 129524.6
ANYTHING-BUT-PREFIX events/sec: 135971.9
ANYTHING-BUT-SUFFIX events/sec: 136319.9
COMPLEX_ARRAYS events/sec: 4285.7
PARTIAL_COMBO events/sec: 56712.3
COMBO events/sec: 2421.7

After

EXACT events/sec: 205664.1
WILDCARD events/sec: 154062.2
PREFIX events/sec: 213281.3
SUFFIX events/sec: 209506.4
EQUALS_IGNORE_CASE events/sec: 183048.1
NUMERIC events/sec: 125852.3
ANYTHING-BUT events/sec: 127738.6
ANYTHING-BUT-IGNORE-CASE events/sec: 129367.3
ANYTHING-BUT-PREFIX events/sec: 129998.8
ANYTHING-BUT-SUFFIX events/sec: 131280.3
COMPLEX_ARRAYS events/sec: 4411.9
PARTIAL_COMBO events/sec: 44781.0
COMBO events/sec: 2360.7

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Machine m = new Machine();
String rule = "{\n" +
" \"status\": {\n" +
" \"weatherText\": [{\"suffix\": \"雨\"}]\n" +
Contributor

Would it be worth having a suffix with more than one multi-byte character?

" }\n" +
"}";
m.addRule("r1", rule);
List<String> matchRules = m.rulesForJSONEvent(eventStr);
Contributor

What I have been doing recently is adding rulesForJSONEvent tests to ACMachineTest and adding rulesForEvent tests (otherwise identical) to MachineTest. I think that is the intended difference between the two test classes? So change this and add a version to ACMachineTest? Of course, we should probably think through a path forward to stop duplicating all tests.

Collaborator Author

I hadn't thought about it. Let me follow it as well. I filed #83 to keep track of duplicate effort.

baldawar requested a review from jonessha on March 31, 2023 20:36
baldawar merged commit 7d588b1 into main on Mar 31, 2023
baldawar deleted the suffix-cn-bug branch on March 31, 2023 21:42