Lexer is O(n^2) #10

Closed
cderwin opened this Issue Dec 15, 2016 · 3 comments

cderwin commented Dec 15, 2016

Hi there!

I was looking through your source code, specifically at src/lexer.rs, and noticed that Lexer.position only tracks the character index into Lexer.input, and that when you call read_char() you find the next char via self.input.chars().nth(self.read_pos). Because UTF-8 chars have variable length, that lookup is an O(n) operation in Rust, so read_char() is O(n); and since you call it every time you advance the lexer, lexing as a whole is O(n(n-1)/2) = O(n^2).
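
For context, the pattern being described boils down to something like this (a paraphrase of the above, not the exact code in src/lexer.rs):

fn read_char(&mut self) {
    // chars().nth(i) has to decode every UTF-8 char up to index i,
    // so each call is O(n) and a full lexing pass is O(n^2).
    self.ch = self.input.chars().nth(self.read_position);
    self.position = self.read_position;
    self.read_position += 1;
}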

There are two easy fixes here:

The first is to convert the input field from an &str to a Vec<char> once and then index that vec. However, this hogs a fair amount of memory, because the entire buffer has to be reallocated onto the heap, and each char has a size of four bytes (to accommodate non-ASCII values) instead of the single byte occupied by an ASCII character in the original &str.
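
A rough sketch of that option (field names assumed to match the existing struct; the vec would be built once with input.chars().collect()):

struct Lexer {
    input: Vec<char>,        // collected once up front: one O(n) pass and allocation
    position: usize,
    read_position: usize,
    ch: Option<char>,
}

impl Lexer {
    fn read_char(&mut self) {
        // Indexing a Vec<char> is O(1), at the cost of 4 bytes per char.
        self.ch = self.input.get(self.read_position).copied();
        self.position = self.read_position;
        self.read_position += 1;
    }
}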

The alternative (which is used by the Rust lexer and is a bit less intuitive) is to instead have self.position keep track of the starting byte position of the current char in the buffer, and then set self.read_position = self.position + self.ch.map_or(0, |c| c.len_utf8()). Then you can fetch the next char with self.input[self.read_position..].chars().next(), which is O(1). The whole `read_char` method would look something like this:

fn read_char(&mut self) {
    if self.read_position >= self.input.len() {
        // Past the end of the input: no current character.
        self.ch = None;
    } else {
        // Slicing at a byte offset and taking the first char is O(1).
        self.ch = self.input[self.read_position..].chars().next();
        self.position = self.read_position;
        // Advance by the byte length of the char we just read.
        self.read_position += self.ch.unwrap().len_utf8();
    }
}

That said, this method is much more verbose, so I could definitely understand opting for the first option instead. However, I think implementing an O(n^2) lexer in a tutorial is setting a poor example for the speed required of real-world interpreters.

Owner

chr4 commented Dec 16, 2016

Thanks for pointing that out!
I agree that performance should be important even in a lexer example.

In this pull request I've migrated the Lexer struct to use Rust's standard Peekable<Chars>. Does this issue still apply?
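
Roughly, the struct now holds the iterator directly, something along these lines (a simplified sketch, not the exact code from the pull request):

use std::iter::Peekable;
use std::str::Chars;

struct Lexer<'a> {
    input: Peekable<Chars<'a>>,
    ch: Option<char>,
}

impl<'a> Lexer<'a> {
    fn new(input: &'a str) -> Lexer<'a> {
        let mut lexer = Lexer { input: input.chars().peekable(), ch: None };
        lexer.read_char();
        lexer
    }

    fn read_char(&mut self) {
        // next() just advances the underlying Chars iterator: O(1) per char.
        self.ch = self.input.next();
    }

    fn peek_char(&mut self) -> Option<&char> {
        // peek() buffers at most one char of lookahead.
        self.input.peek()
    }
}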

cderwin commented Dec 16, 2016

Not anymore! Because you hold a single iterator in the struct and just call its next() method, instead of creating a new iterator and calling nth(self.read_position) on every call to read_char(), this is effectively the same as the first option I listed above. The trade-off is that you don't need any heap allocation, but you can't recover characters before self.ch and you can only peek ahead one character (which is probably all you need).

cderwin closed this Dec 16, 2016

Owner

chr4 commented Dec 17, 2016

Thanks for the insights and the explanation!
