
Eliminating whitespace from the parser logic #2076

@LucaCappelletti94

Description


Hi,

At this time, whitespace tokens are stored in the parser and are then filtered out at several distinct points in the parser logic, such as:

  • pub fn peek_tokens_with_location<const N: usize>(&self) -> [TokenWithSpan; N] {
        let mut index = self.index;
        core::array::from_fn(|_| loop {
            let token = self.tokens.get(index);
            index += 1;
            if let Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) = token
            {
                continue;
            }
            break token.cloned().unwrap_or(TokenWithSpan {
                token: Token::EOF,
                span: Span::empty(),
            });
        })
    }
  • pub fn peek_tokens_ref<const N: usize>(&self) -> [&TokenWithSpan; N] {
        let mut index = self.index;
        core::array::from_fn(|_| loop {
            let token = self.tokens.get(index);
            index += 1;
            if let Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) = token
            {
                continue;
            }
            break token.unwrap_or(&EOF_TOKEN);
        })
    }
  • pub fn peek_nth_token_ref(&self, mut n: usize) -> &TokenWithSpan {
        let mut index = self.index;
        loop {
            index += 1;
            match self.tokens.get(index - 1) {
                Some(TokenWithSpan {
                    token: Token::Whitespace(_),
                    span: _,
                }) => continue,
                non_whitespace => {
                    if n == 0 {
                        return non_whitespace.unwrap_or(&EOF_TOKEN);
                    }
                    n -= 1;
                }
            }
        }
    }
  • pub fn advance_token(&mut self) {
        loop {
            self.index += 1;
            match self.tokens.get(self.index - 1) {
                Some(TokenWithSpan {
                    token: Token::Whitespace(_),
                    span: _,
                }) => continue,
                _ => break,
            }
        }
    }
  • /// Seek back the last one non-whitespace token.
    ///
    /// Must be called after `next_token()`, otherwise might panic. OK to call
    /// after `next_token()` indicates an EOF.
    ///
    // TODO rename to backup_token and deprecate prev_token?
    pub fn prev_token(&mut self) {
        loop {
            assert!(self.index > 0);
            self.index -= 1;
            if let Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) = self.tokens.get(self.index)
            {
                continue;
            }
            return;
        }
    }

and many more.
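
Every one of these loops exists only to skip whitespace tokens. If whitespace were never stored in the parser, the same accessors would collapse to plain index arithmetic. As a rough sketch of the shape this could take (not a tested patch, and assuming the existing self.tokens / self.index fields and the EOF_TOKEN constant stay as they are):

    pub fn peek_nth_token_ref(&self, n: usize) -> &TokenWithSpan {
        // With no whitespace tokens stored, the n-th lookahead is a direct lookup.
        self.tokens.get(self.index + n).unwrap_or(&EOF_TOKEN)
    }

    pub fn advance_token(&mut self) {
        // Nothing to skip: every stored token is meaningful.
        self.index += 1;
    }

    pub fn prev_token(&mut self) {
        assert!(self.index > 0);
        self.index -= 1;
    }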

SQL, as far as I know, is not whitespace-sensitive the way Python is, so it should be safe to drop every notion of whitespace once tokenization is complete (see the sketch after this list). Doing so should:

  • Reduce memory requirements, since whitespace tokens would no longer be stored
  • Significantly simplify the parser by removing all of the scattered whitespace-skipping logic
  • Move the parser closer to a streaming design, though that will require many more PRs
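
For illustration only, the removal itself could be a single pass over the tokenizer output before the parser ever stores the tokens. The helper below is a minimal sketch (the function name is mine, not part of the current API), reusing the TokenWithSpan and Token::Whitespace types shown above, and assuming nothing downstream relies on whitespace tokens surviving past tokenization:

    /// Hypothetical helper: discard whitespace tokens once, right after
    /// tokenization, so the parser only ever stores meaningful tokens.
    fn strip_whitespace(tokens: Vec<TokenWithSpan>) -> Vec<TokenWithSpan> {
        tokens
            .into_iter()
            .filter(|t| !matches!(t.token, Token::Whitespace(_)))
            .collect()
    }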

Since such a PR would require quite a bit of effort on my part, I would appreciate some feedback before moving forward with it.

@iffyio do you happen to have any opinion regarding such a refactoring?

Ciao,
Luca
