feat: named capture groups and reference (alternative approach) #78

Closed
wants to merge 9 commits into from


mdjastrzebski
Member

Summary

This is an alternative syntax for backreferences proposed by @Killavus.

// Define a capture and assign it to a variable.
const someCapture = capture([...], { name: 'some' });

const regex = buildRegExp([
  // Create a named capture group using the name from `someCapture`.
  someCapture,
  // ... some other elements ...
  // Match the same text as captured by `someCapture`.
  someCapture.ref(),
]);

It requires assigning a capture group to a JS variable, which can then generate a backreference by calling the .ref() method on it.
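To make the mechanics concrete, here is a minimal sketch of how a capture object could carry its own name and produce a matching backreference. The names `captureSketch` and `buildSketch` are hypothetical and are not part of this PR; the sketch works on raw regex source strings rather than the library's element types.

```typescript
// Hypothetical model of the idea, not the library's actual implementation.
interface CaptureSketch {
  pattern: string; // regex source of the named capturing group
  ref(): string;   // regex source of a backreference to that group
}

// `inner` is raw regex source here; the real builder takes regex elements.
function captureSketch(inner: string, options: { name: string }): CaptureSketch {
  return {
    pattern: `(?<${options.name}>${inner})`,
    // The backreference reuses the name stored in the capture itself,
    // so the user never has to type the name a second time.
    ref: () => `\\k<${options.name}>`,
  };
}

// Joins string parts and capture objects into a single RegExp.
function buildSketch(parts: Array<string | CaptureSketch>, flags = ''): RegExp {
  const source = parts
    .map((part) => (typeof part === 'string' ? part : part.pattern))
    .join('');
  return new RegExp(source, flags);
}

const word = captureSketch('[a-z]+', { name: 'word' });
// Matches a lowercase word, a space, and the same word again.
const repeated = buildSketch([word, ' ', word.ref()]);
```

Because `ref()` is a method on the capture, the only way to obtain a backreference is to have the capture in a variable first, which is the trade-off listed in the comparison with #66.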

Comparison to #66

Pros:

  • more natural flow: from capture to reference
  • more compact in the main expression, as capture needs to be assigned to a separate variable

Cons:

  • requires assigning the capture to a separate variable; it is not possible to define the capture inline in the main regex expression
  • requires assigning an explicit name to the capture, as we want to avoid creating a named capturing group with an automatically generated name when it is not needed
  • deviates from Swift Regex Builder pattern: https://developer.apple.com/documentation/regexbuilder/reference

Real-life example (matching HTML opening and closing tags):

test('example: html tag matching', () => {
  const tagName = capture(
    oneOrMore(/[a-z0-9]/),
    { name: 'tag' },
  );
  const tagContent = capture(
    zeroOrMore(any, { greedy: false }),
    { name: 'content' },
  );

  const tagMatcher = buildRegExp(['<', tagName, '>', tagContent, '</', tagName.ref(), '>'], {
    ignoreCase: true,
    global: true,
  });

  expect(tagMatcher).toMatchAllNamedGroups('<a>abc</a>', [{ tag: 'a', content: 'abc' }]);
});
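For comparison, the same pattern can be written with plain JavaScript named groups and a named backreference. This is only a sketch: it assumes that `any` corresponds to `.` and that the builder emits `(?<name>...)` and `\k<name>`. The helper `findTags` is hypothetical.

```typescript
// Hand-written equivalent of the builder example above (an assumption about
// the generated regex, not output copied from the library).
function findTags(html: string): Array<{ tag: string; content: string }> {
  const tagMatcher = new RegExp('<(?<tag>[a-z0-9]+)>(?<content>.*?)</\\k<tag>>', 'gi');
  const results: Array<{ tag: string; content: string }> = [];

  let match: RegExpExecArray | null;
  // With the global flag, exec() resumes after the previous match each time.
  while ((match = tagMatcher.exec(html)) !== null) {
    const groups = match.groups ?? {};
    results.push({ tag: groups.tag, content: groups.content });
  }
  return results;
}

const found = findTags('<a>abc</a>');
// found is [{ tag: 'a', content: 'abc' }]
```

The backreference is what rejects mismatched pairs such as `<a>abc</b>`; without it, the closing tag would need its own independent pattern.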

@PaulJPhilp pls take a look at this approach vs the one in #66 .

Test plan

Added automated tests & real-life example.

wip

chore: more tests

feat: improved refs

refactor: merge name and ref

refactor: self code review

refactor: tweaks

refactor: rename reference to ref

feat: example with html tags

chore: self code review
@PaulJPhilp
Contributor

One advantage I see with the ts-regex-builder approach is that it can be an external DSL, separate from JS/TS. An external DSL can facilitate:

Storing explicit refs in vars breaks this paradigm (in the Swift approach too).

Is it possible to get the benefits of the explicit refs by using implicit refs? E.g:

notes:
I have no preference between 'name' or 'as' ( or 'group', .....) so I use name/as below.
I am using "useGroup" as a placeholder only. The actual naming convention would need some
thought.

In Typescript:

const htmlTag = capture(
  zeroOrMore(alphabetical, digit),
  { name/as: "htmlTag" },
);

const tagContent = capture(
  zeroOrMore(any, { greedy: false }),
  { name/as: 'taggedContent' },
);

const tagMatcher = buildRegExp(['<', htmlTag, '>', tagContent, '</', useGroup("htmlTag"), '>'], {
  ignoreCase: true,
  global: true,
});

Or, in my imaginary DSL:
Regex {
'<'
Capture(name/as: "htmlTag") {
oneOrMore(alphabetical, digit)
}
'>'
Capture(name/as: "taggedContent") {
zeroOrMore(any, greedy:false)
}
'</'
useGroup("htmlTag")
}

As a first step, these group names could be scoped globally, meaning each name
can only be used once. It's a limitation, but perhaps better than automatic name generation.
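A rough sketch of what globally scoped names would require from the builder: each capture name may be defined only once, and every `useGroup` must point at a name that has already been defined. The element shapes and the `validateGroupNames` helper are hypothetical, and rejecting forward references is an assumption of this sketch rather than something stated above.

```typescript
// Hypothetical element shapes for illustration only.
type PatternElement =
  | { type: 'capture'; name: string }
  | { type: 'useGroup'; name: string };

// Returns a list of problems; an empty list means the names are consistent.
function validateGroupNames(elements: PatternElement[]): string[] {
  const errors: string[] = [];
  const defined = new Set<string>();

  for (const element of elements) {
    if (element.type === 'capture') {
      if (defined.has(element.name)) {
        errors.push(`duplicate capture name: ${element.name}`);
      }
      defined.add(element.name);
    } else if (!defined.has(element.name)) {
      // Assumption: a reference must come after the capture it names.
      errors.push(`reference to undefined group: ${element.name}`);
    }
  }
  return errors;
}

const htmlPattern: PatternElement[] = [
  { type: 'capture', name: 'htmlTag' },
  { type: 'capture', name: 'taggedContent' },
  { type: 'useGroup', name: 'htmlTag' },
];
const problems = validateGroupNames(htmlPattern); // no problems for this pattern
```

With string-based names, a check like this moves typos from "silently wrong regex" to an explicit error, which variable-based refs get for free from the JS/TS compiler.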

@mdjastrzebski
Member Author

@PaulJPhilp Hmmm, I'm not sure I get your point. Could you explain what would be your goal(s) in terms of API design, and how the above options realize/don't realize it?

As a brief summary, here is a recap of the two approaches:

  1. In the original PR (feat: named capture groups and reference #66), the standalone ref("abc") was just a holder for a name ({ type: "reference", name: "abc" }). When it was assigned to a capture group, capture([...], { name: ref }), the capture group did not actually "hold" the ref in any meaningful way; it just used the ref's name when encoding the capture regex. The goal was to avoid requiring the user to type the name twice: once in the capture group and again in the backreference.

So the following examples actually worked exactly the same from a technical point of view:

// Suggested syntax
const myRef = ref('my');
const myRegex = buildRegExp([capture([...], { name: myRef }), myRef]);

// Other option that actually worked
const myRegex = buildRegExp([capture([...], { name: "my" }), ref("my")]);

  2. This PR is somewhat similar: ref() is not a real reference, it just ensures that the capture and backreference names are the same. However, this PR removes the standalone ref("my") construct, so

const myRegex = buildRegExp([capture([...], { name: "my" }), ref("my")]);

would no longer be possible.
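The first point of the recap can be illustrated with a small sketch in which a ref is nothing more than a name holder, so sharing one ref object and typing the name twice encode to the same regex. `encodeCapture` and `encodeRef` are hypothetical helpers operating on raw regex source; they are not the encoders used in #66.

```typescript
// A ref is only a tagged name, as described in the recap.
interface Ref {
  type: 'reference';
  name: string;
}

const ref = (name: string): Ref => ({ type: 'reference', name });

// The capture only reads the name; it does not keep the ref object around.
function encodeCapture(inner: string, name: string | Ref): string {
  const groupName = typeof name === 'string' ? name : name.name;
  return `(?<${groupName}>${inner})`;
}

function encodeRef(reference: Ref): string {
  return `\\k<${reference.name}>`;
}

// Suggested syntax: one ref object shared by capture and backreference.
const myRef = ref('my');
const viaSharedRef = encodeCapture('\\d+', myRef) + encodeRef(myRef);

// Alternative that also worked: the same name typed twice.
const viaRepeatedName = encodeCapture('\\d+', 'my') + encodeRef(ref('my'));

// Both strings are identical regex source.
const sameSource = viaSharedRef === viaRepeatedName;
```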

@PaulJPhilp
Contributor

PaulJPhilp commented Apr 1, 2024 via email

@mdjastrzebski
Member Author

@PaulJPhilp The URL you have posted seems to be obfuscated. Perhaps this is some security feature from GitHub.

@PaulJPhilp
Contributor

PaulJPhilp commented Apr 2, 2024 via email

@mdjastrzebski
Member Author

Closing in favor of more restrictive syntax in #66.

@mdjastrzebski
Member Author

@PaulJPhilp I've followed you on Twitter, so you can send the draft via DM.

@PaulJPhilp
Contributor

PaulJPhilp commented Apr 5, 2024 via email
