
Error-tolerant tokenization #29

Closed
zachallaun opened this issue Apr 28, 2024 · 5 comments

Comments

@zachallaun

I've been thinking more about error tolerance in Elixir parsing and wanted to start a discussion about errors that come from the tokenizer.

Currently, the behavior of the tokenizer is to stop tokenizing when an error is encountered, emitting the error and the token accumulator. This means that Spitfire can only be error-tolerant to the degree that the tokenizer is. Consider this bit of example code:

defmodule Foo do
  def bar(x) do
<<<<<<< HEAD
    x + 1
=======
    x + 2
>>>>>>> main
  end

  def baz(x), do: x + 3
end

When the tokenizer encounters the merge conflict, it bails out, so the best that Spitfire can do is to give us context about the module and function head, but nothing about the body and nothing about the definition for baz below.
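
For reference, the hard stop is easy to observe through Code.string_to_quoted/1, which goes through the same tokenizer (a minimal sketch; the exact error tuple contents vary by Elixir version):

source = """
defmodule Foo do
  def bar(x) do
<<<<<<< HEAD
    x + 1
=======
    x + 2
>>>>>>> main
  end

  def baz(x), do: x + 3
end
"""

Code.string_to_quoted(source)
# => {:error, {_location, _message, _token}}
# Tokenization halts at the first conflict marker, so nothing after that
# point (including the definition of baz) contributes tokens.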

If the tokenizer were to accumulate errors instead, Spitfire might be able to give us the AST for the following, as well as errors noting that VCS merge conflict markers were present on lines 3, 5, and 7:

defmodule Foo do
  def bar(x) do

    x + 1

    x + 2

  end

  def baz(x), do: x + 3
end

While I haven't reviewed every tokenizer error case, there are a significant number where an error could be accumulated and tokenization could continue. Alternatively, the tokenizer could emit "error tokens" of some kind, e.g. {{:error, :version_control_marker}, {3, 1, nil}, ~c"<<<<<<< HEAD"}, which would then be handled in the parser.
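
As a rough illustration of that second idea, parser-side handling of such error tokens might look something like this; the token shape and module/function names are assumptions for the sake of the sketch, not existing Spitfire code:

# Hypothetical recovery step, assuming the tokenizer emits error tokens
# shaped like {{:error, reason}, {line, column, nil}, original_text} in-line
# with ordinary tokens instead of halting.
defmodule ErrorTokenSketch do
  def next_token([{{:error, reason}, {line, col, _}, _text} | rest], errors) do
    # Record a diagnostic and keep consuming instead of stopping the parse.
    next_token(rest, [{reason, line, col} | errors])
  end

  def next_token([token | rest], errors), do: {token, rest, errors}
  def next_token([], errors), do: {:eof, [], errors}
end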

@mhanberg
Contributor

The way handwritten parsers generally tokenize is one token at a time as you parse, but Elixir's tokenizer is designed to work with a parser generator, so it tokenizes the entire document first.

I used the internal tokenizer just as a way to skip a bunch of work and make progress, but it's been on my radar to alter the tokenizer to work how I described above.

That way, as you parse, you can decide how to handle unknown tokens.

We can use this as the tracking issue.

Basically, the API of the tokenizer should be "next_token" rather than "tokenize".
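
For concreteness, a minimal sketch of that shape, assuming a next_token function that returns {token, new_state} or :eof; the names are placeholders, not the current Spitfire API:

defmodule NextTokenSketch do
  # Drains a pull-based tokenizer one token at a time. `next_token` is any
  # function that takes the tokenizer state and returns {token, new_state}
  # or :eof; a real parser would branch on each token here (including error
  # tokens) instead of just accumulating it.
  def drain(state, next_token, acc \\ []) do
    case next_token.(state) do
      :eof -> Enum.reverse(acc)
      {token, new_state} -> drain(new_state, next_token, [token | acc])
    end
  end
end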

Are you interested in taking this on?

@zachallaun
Author

The way handwritten parsers generally tokenize is one token at a time as you parse, but Elixir's tokenizer is designed to work with a parser generator, so it tokenizes the entire document first.

Makes sense!

Are you interested in taking this on?

Potentially, yes, but it's not something I can necessarily commit to at the moment. I'll share any progress here!

@mhanberg
Contributor

Potentially, yes, but it's not something I can necessarily commit to at the moment. I'll share any progress here!

No worries! Just wanted to clarify, as this is something that I was going to tackle soon if the priorities aligned correctly.

I'll go ahead and write up a hopefully detailed ticket that describes the needed changes and the desired outcome; I'll close this issue once I get that created.

And another issue with the current "total lexing" approach is that it bails if any unrecognized operators are lexed, which doesn't allow the parser to collect an error for that and continue on. Here is a pending test I have that describes that issue as well:

    # TODO: this needs a change to the tokenizer, I believe, or a way to splice out the unknown token
    @tag :skip
    test "unknown prefix operator" do
      code = "foo $bar, baz"

      assert Spitfire.parse(code) ==
               {:error, {:foo, [line: 1, column: 1], [{:__block__, [error: true, line: 1, column: 5], []}]},
                [{[line: 1, column: 5], "unknown token: $"}]}
    end

@zachallaun
Author

Reading the tokenizer, it seems like the fastest path would be modifying the existing tokenizer to return {token, next_tokenizer} (or nil), where next_tokenizer is a closure that will produce the next token.

This could be somewhat easily tested against Elixir's tokenizer for valid inputs by consuming all tokens and comparing the lists.

Then, as a second step, the tokenizer could start emitting error tokens.
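
A rough sketch of consuming that {token, next_tokenizer} shape, purely illustrative since the closure-returning tokenizer itself is the part that would need to be written:

defmodule ClosureTokenizerSketch do
  # `tokenizer_fun` is a zero-arity closure that returns {token, next_fun}
  # or nil when the input is exhausted, as described above.
  def consume_all(tokenizer_fun), do: consume(tokenizer_fun.(), [])

  defp consume(nil, acc), do: Enum.reverse(acc)

  defp consume({token, next_tokenizer}, acc) do
    consume(next_tokenizer.(), [token | acc])
  end
end

# For valid source, comparing the list from consume_all/1 with the token list
# produced by Elixir's tokenizer would be the equivalence check mentioned
# above; error tokens could then be layered on as the second step.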

@mhanberg
Contributor

mhanberg commented May 2, 2024

Closed in favor of #31

mhanberg closed this as not planned on May 2, 2024