Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ragel v7 scanner does not set/reset ts/te after the loop ending in a full token #76

Open
gh-andre opened this issue Oct 27, 2021 · 6 comments
Assignees

Comments

@gh-andre
Copy link

This is a regression from Ragel 6.10, which set ts and te pointers as described in the manual, which says this:

This variable may be used in action code for retrieving the text of
the current match. Ragel ensures that in between tokens and outside of the longest-match machines
that this pointer is set to null. In between calls to the execute code the user must check if ts is
set and if so, ensure that the data it points to is preserved ahead of the next buffer block.

Here's the program to show this behavior:

#include <cstdio>
#include <cstdlib>
#include <string>

%%{
machine part_token;

main := |*
     ' '+;
     ',';
     'abcd' => { printf("token: %s\n", std::string(ts, te).c_str()); };
  *|;
}%%

%% write data;

int main(int argc, char *argv[])
{
  if(argc != 2) {
    printf("Usage: scan token\n");
    return EXIT_FAILURE;
  }

  std::string in(argv[1]);

  int cs = 0, act = 0;
  const char *p = in.c_str();
  const char *pe = in.c_str() + in.length();
  const char *ts = nullptr, *te = nullptr;
  const char *eof = pe;

  %% write init;
  %% write exec;

  printf("done\n");

  if(cs == part_token_error)
     printf("Error near %zd\n", p-in.data());
  else if(ts)
    printf("offsets: ts %zd te: %zd pe: %zd\n", ts-in.data(), te-in.data(), pe-in.data());
  else
    printf("ts is null\n");

  return EXIT_SUCCESS;
}

Compiled on CentOS 8 Stream with Ragel 7.0.0.12, which is available as a package, and with Ragel 6.10, built on the same machine from the source at http://www.colm.net/files/ragel/ragel-6.10.tar.gz.

ragel -o scan.cpp scan.rl
g++ -std=c++17 -o scan scan.cpp

Both scanners ran this way:

./scan abcd
./scan abcd,ab

Ragel 6.10 produces this output:

./scan abcd
token: abcd
done
ts is null

./scan abcd,ab
token: abcd
done
offsets: ts 5 te: 5 pe: 7

Ragel 7 produces this output:

./scan abcd
token: abcd
done
offsets: ts 4 te: 4 pe: 4

./scan abcd,ab
token: abcd
done
offsets: ts 5 te: 5 pe: 7

ts is no longer reset to a null pointer after the execution loop if the input ends in a full token.

@adrian-thurston
Copy link
Owner

Hi @gh-andre thanks for reporting this.

This is a consequence of

  1. Ragel using from-state actions to set ts = p on the start of a token (always the case).
  2. In ragel 7 from-state actions getting executed on EOF. This was done to better align with leaving-state actions getting executed on EOF. It helps with zero-length matches.

I'm not sure what to do about this at the moment, but thinking about it.

@adrian-thurston adrian-thurston self-assigned this Dec 27, 2021
adrian-thurston added a commit that referenced this issue Dec 28, 2021
The ts var is currently getting set on EOF due to ragel 7 executing from-state
actions on EOF. This is a regression from ragel 6. refs #76.
adrian-thurston added a commit that referenced this issue Dec 28, 2021
In ragel 7, from-state actions are executed on EOF, which means that after a
machine sees EOF ts will be set, which is unexpected. Instead, set TS on an
entering character-transition action.

refs #76
@adrian-thurston
Copy link
Owner

Note that this is only the case after EOF. It means you can check that EOF has not been reached yet. It's still not ideal though. It should stay at zero.

@adrian-thurston
Copy link
Owner

Proposed solution in df0a63a.

@gh-andre
Copy link
Author

@adrian-thurston Thank you for looking into this. I worked it around in my code by checking whether ts and pe are the same, in addition to checking whether ts is a null pointer or not. I'm not familiar with Ragel source and cannot comment on the proposed fix, but will be able to test it once I'm back in the office next week, if you don't mind the delay.

@gh-andre
Copy link
Author

gh-andre commented Jan 4, 2022

@adrian-thurston Sorry for the delay - colm was conflicting with the one installed via dnf and I needed how to build both from the commit.

I tested various scenarios and the new Ragel behaves the same as v6.10. I also tested the scenario I forgot to mention, when the first token is incomplete (i.e. ./scan abc) and verified that ts is set at the start of input, while te is not set and in this example remains nullptr, while pe in each case is advanced past the end of the last character (i.e. points to the null character).

Seems to be working well for the scenarios I can think of. Thanks again for looking into this.

@adrian-thurston
Copy link
Owner

@gh-andre great, thanks for trying it out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants