How to use custom char check? #3

Closed
oovm opened this issue Sep 22, 2022 · 7 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@oovm

oovm commented Sep 22, 2022

My identifiers are defined as follows:

@string
@no_skip_ws
Ident = (XID_START | '_') (XID_CONTINUE)*

where XID_START represents an external function UnicodeXID::is_xid_start.

How should I capture my Ident token?

@badicsalex badicsalex added the question Further information is requested label Sep 22, 2022
@badicsalex
Owner

badicsalex commented Sep 22, 2022

Unfortunately this can only be done with quite a bit of hacking currently. Maybe I should implement proper "external function" support.

But if you really want to do it right now, there is a way.

You have to manually create a parse_XID_START function in the same namespace as the compiled grammar (should be easy if you use the macro, a bit harder if you use a buildscript).

Something like the following:

peginate!("
@export
Idents = {idents:Ident};

@string
@no_skip_ws
Ident = (XID_START | '_') {XID_CONTINUE};
");

pub fn parse_XID_START<'a, _CT>(
    state: ParseState<'a>,
    _tracer: impl ParseTracer,
    _cache: &_CT,
) -> ParseResult<'a, char> {
    // Boilerplate
    let result = state.s().chars().next().ok_or_else(|| {
        state
            .clone()
            .report_error(ParseErrorSpecifics::Other)
    })?;

    // Actual business logic
    if !result.is_xid_start() {
        return Err(state.report_error(ParseErrorSpecifics::Other));
    }

    // More boilerplate
    // We are skipping a full character, so we should be OK.
    let state = unsafe { state.advance(result.len_utf8()) };
    Ok(ParseOk {
        result,
        state,
        farthest_error: None,
    })
}

pub fn parse_XID_CONTINUE<'a, _CT>(
    state: ParseState<'a>,
    _tracer: impl ParseTracer,
    _cache: &_CT,
) -> ParseResult<'a, char> {
    // Boilerplate
    let result = state.s().chars().next().ok_or_else(|| {
        state
            .clone()
            .report_error(ParseErrorSpecifics::Other)
    })?;

    // Actual business logic
    // XID_CONTINUE must use the "continue" predicate, not the "start" one.
    if !result.is_xid_continue() {
        return Err(state.report_error(ParseErrorSpecifics::Other));
    }

    // More boilerplate
    // We are skipping a full character, so we should be OK.
    let state = unsafe { state.advance(result.len_utf8()) };
    Ok(ParseOk {
        result,
        state,
        farthest_error: None,
    })
}

#[test]
fn test_macro() {
    let s = Idents::parse("xyz áé8").unwrap();
    assert_eq!(s.idents, vec!["xyz", "áé8"]);
}

@badicsalex
Owner

badicsalex commented Sep 22, 2022

I understand the above is not convenient. What if I implemented a syntax like this:

@custom_char(crate::some_module::check_xid)
XID_START;

And then in some_module.rs, you could have a function like this:

use unicode_xid::UnicodeXID;

pub fn check_xid(c: char) -> bool {
    c.is_xid_start()
}

Or you could even use unicode_xid directly:

@custom_char(unicode_xid::UnicodeXID::is_xid_continue)
XID_CONTINUE;

Would it fit your use-case?
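To make the proposal concrete, here is a rough sketch of what a @custom_char rule might boil down to: take the next character, ask the user's function whether it is acceptable, and report how many bytes to advance. The names match_custom_char and is_ident_start are hypothetical, and is_ident_start is only a std-only approximation of UnicodeXID::is_xid_start, not the real Unicode property.

```rust
/// Hypothetical helper: consume one character if `check` accepts it.
/// Returns the character and the number of bytes it occupies in `input`.
fn match_custom_char(input: &str, check: impl Fn(char) -> bool) -> Option<(char, usize)> {
    let c = input.chars().next()?; // empty input: nothing to match
    if check(c) {
        Some((c, c.len_utf8()))
    } else {
        None
    }
}

/// Assumption: a std-only stand-in for UnicodeXID::is_xid_start.
/// It is close for common scripts but not identical to the Unicode property.
fn is_ident_start(c: char) -> bool {
    c == '_' || c.is_alphabetic()
}
```

The byte length comes from len_utf8, which is the same quantity the manual parse_XID_START function passes to state.advance.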

@badicsalex badicsalex added the enhancement New feature or request label Sep 22, 2022
@oovm
Author

oovm commented Sep 23, 2022

This hack meets my needs.

If it were stabilized as a feature, I would like it to look like this:

@custom_char(char_xid_start) // advance 1 char
XID_START = 'ANY';  // annotative description, do not use
@custom_string(keyword_where, 5) // advance 5 chars 
WHERE = 'case insensitive where'; // annotative description, do not use

@check_string(keyword_checker)
KEYWORD = Ident; // Requires successful capture of Ident and keyword_checker to return true

with these function signatures:

fn char_xid_start(char) -> bool;
fn keyword_where(&str) -> bool;
fn keyword_checker(&str) -> bool;

@badicsalex
Owner

badicsalex commented Sep 23, 2022

The syntax I'm currently thinking about is:

@char
@check(unicode_xid::UnicodeXID::is_xid_start)
XID_START = char; # In this case "char" is actually used

@extern(crate::keyword_where -> String)
WHERE; # no body, prefer comments

@check(crate::keyword_check)
KEYWORD = Ident;

There would be two new additions:

@check directive
The function gets whatever the rule spits out (char in case of @char rules, strings or structs in case of string or struct rules), and should return a bool.
So fn char_xid_start(char) -> bool and fn keyword_checker(&str) -> bool both fit here, but you could also run checks on more complex structures with multiple fields in the middle of parsing.
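As an illustration only, the @check idea can be sketched as "run the rule, then let a predicate veto the match". Everything below is hypothetical: parse_ident stands in for a generated rule, with_check for whatever the directive would generate, and the keyword list is invented for the example.

```rust
/// Hypothetical stand-in for a generated rule: parses an identifier prefix
/// and returns it together with the number of bytes consumed.
fn parse_ident(input: &str) -> Option<(&str, usize)> {
    let len = input
        .char_indices()
        .find(|&(i, c)| !(c == '_' || c.is_alphabetic() || (i > 0 && c.is_numeric())))
        .map_or(input.len(), |(i, _)| i);
    if len == 0 {
        None
    } else {
        Some((&input[..len], len))
    }
}

/// Hypothetical: apply `check` to the rule's result; `false` turns the match into a failure.
fn with_check<T>(parsed: Option<(T, usize)>, check: impl Fn(&T) -> bool) -> Option<(T, usize)> {
    parsed.filter(|(value, _)| check(value))
}

/// Example predicate with the `fn(&str) -> bool` shape discussed above.
fn keyword_checker(s: &str) -> bool {
    matches!(s, "where" | "let" | "fn")
}
```

Because with_check is generic over the rule's result type, the same shape would cover char, string, and struct rules.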

@extern directive

It is a completely external parse function with the following signature:

fn custom_fn(&str) -> Result<(T, usize), &'static str>

If the string parses successfully, you return a tuple containing the result and the number of bytes (!) the parser consumed from the input, wrapped in Ok. If it cannot be parsed according to the rule, you return a static error message string wrapped in Err.

In case of the keyword where, it would probably look something like this:

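fn keyword_where(s: &str) -> Result<(String, usize), &'static str> {
    // Only inspect the first five characters; the rest of the input follows.
    let result: String = s.chars().take(5).collect();
    if result.to_uppercase() == "WHERE" {
        // Length in bytes, taken before `result` is moved into the tuple.
        let consumed = result.len();
        Ok((result, consumed))
    } else {
        Err("Expected string 'where' (case insensitive)")
    }
}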

Or you could also return () or a named empty struct for efficiency.
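To show why the byte count matters, here is a minimal sketch of how a caller might use such a function. apply_extern is a hypothetical driver, not the generated code; keyword_where is repeated so the block stands alone.

```rust
/// Extern-style function: matches a case-insensitive "where" at the start of `s`.
fn keyword_where(s: &str) -> Result<(String, usize), &'static str> {
    let result: String = s.chars().take(5).collect();
    if result.to_uppercase() == "WHERE" {
        let consumed = result.len(); // bytes, not characters
        Ok((result, consumed))
    } else {
        Err("Expected string 'where' (case insensitive)")
    }
}

/// Hypothetical driver: call the function, then skip the bytes it reports as consumed.
fn apply_extern<'a, T>(
    input: &'a str,
    f: impl Fn(&str) -> Result<(T, usize), &'static str>,
) -> Result<(T, &'a str), &'static str> {
    let (value, consumed) = f(input)?;
    // `get` returns None if the count is out of range or splits a character.
    let rest = input
        .get(consumed..)
        .ok_or("extern function reported an invalid byte count")?;
    Ok((value, rest))
}
```

Checking the count with get means a miscounted return value is reported as an error rather than panicking on a slice.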

It could also be used to parse numbers in place with something like fn number(&str) -> Result<(i64, usize), &'static str>
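A possible body for that number function, assuming plain decimal integers with an optional leading minus sign (the exact number syntax is my assumption, not part of the proposal):

```rust
/// Sketch of an extern-style function: parse a decimal i64 from the start of `s`.
fn number(s: &str) -> Result<(i64, usize), &'static str> {
    // Optional leading minus sign (1 byte if present), then ASCII digits.
    let sign_len = usize::from(s.starts_with('-'));
    let digits = s[sign_len..]
        .bytes()
        .take_while(|b| b.is_ascii_digit())
        .count();
    if digits == 0 {
        return Err("Expected a number");
    }
    // Sign and digits are ASCII, so the character count equals the byte count.
    let consumed = sign_len + digits;
    s[..consumed]
        .parse::<i64>()
        .map(|value| (value, consumed))
        .map_err(|_| "Number does not fit in i64")
}
```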

You could also implement the requested r#"-string feature this way. In that case you would return the parsed string literal, but skip the starting and ending #s. (I really don't want to implement the stack; I don't think it's a good addition to PEGs.)
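A sketch of how the raw-string case could look with the same signature. The function name raw_string and the error messages are made up for illustration. No stack is needed: the number of opening #s is counted once and the matching closing delimiter is searched for directly.

```rust
/// Sketch: parse r"...", r#"..."#, r##"..."##, ... from the start of `s`.
/// Returns the content between the delimiters and the total bytes consumed.
fn raw_string(s: &str) -> Result<(String, usize), &'static str> {
    let rest = s.strip_prefix('r').ok_or("Expected 'r'")?;
    // Count the opening hashes; they are ASCII, so chars == bytes.
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let body = rest[hashes..]
        .strip_prefix('"')
        .ok_or("Expected opening quote")?;
    // The literal ends at the first quote followed by the same number of hashes.
    let closing = format!("\"{}", "#".repeat(hashes));
    let end = body.find(&closing).ok_or("Unterminated raw string")?;
    // 'r' + hashes + opening quote + content + closing delimiter
    let consumed = 1 + hashes + 1 + end + closing.len();
    Ok((body[..end].to_string(), consumed))
}
```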

Any comments?

@badicsalex
Owner

By the way, is the case insensitive match common?

Because I think adding a case insensitive string literal and char literal shouldn't be a big problem (the biggest problem is coming up with a good syntax for it).

badicsalex added a commit that referenced this issue Sep 23, 2022
badicsalex added a commit that referenced this issue Sep 23, 2022
@badicsalex
Owner

Please see if the newly added features satisfy your needs. If so, I'll close the issue.

@oovm
Author

oovm commented Sep 24, 2022

Good, this approach is very scalable.

@oovm oovm closed this as completed Sep 24, 2022