
Eliminating whitespace from the parser logic #2076

@LucaCappelletti94

Description


Hi,

At this time, whitespace tokens are stored in the parser and are then filtered out at several distinct points in the parser logic, such as:

  • pub fn peek_tokens_with_location<const N: usize>(&self) -> [TokenWithSpan; N] {
        let mut index = self.index;
        core::array::from_fn(|_| loop {
            let token = self.tokens.get(index);
            index += 1;
            if let Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) = token
            {
                continue;
            }
            break token.cloned().unwrap_or(TokenWithSpan {
                token: Token::EOF,
                span: Span::empty(),
            });
        })
    }
  • pub fn peek_tokens_ref<const N: usize>(&self) -> [&TokenWithSpan; N] {
        let mut index = self.index;
        core::array::from_fn(|_| loop {
            let token = self.tokens.get(index);
            index += 1;
            if let Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) = token
            {
                continue;
            }
            break token.unwrap_or(&EOF_TOKEN);
        })
    }
  • pub fn peek_nth_token_ref(&self, mut n: usize) -> &TokenWithSpan {
        let mut index = self.index;
        loop {
            index += 1;
            match self.tokens.get(index - 1) {
                Some(TokenWithSpan {
                    token: Token::Whitespace(_),
                    span: _,
                }) => continue,
                non_whitespace => {
                    if n == 0 {
                        return non_whitespace.unwrap_or(&EOF_TOKEN);
                    }
                    n -= 1;
                }
            }
        }
    }
  • pub fn advance_token(&mut self) {
        loop {
            self.index += 1;
            match self.tokens.get(self.index - 1) {
                Some(TokenWithSpan {
                    token: Token::Whitespace(_),
                    span: _,
                }) => continue,
                _ => break,
            }
        }
    }
  • /// Seek back the last one non-whitespace token.
    ///
    /// Must be called after `next_token()`, otherwise might panic. OK to call
    /// after `next_token()` indicates an EOF.
    ///
    // TODO rename to backup_token and deprecate prev_token?
    pub fn prev_token(&mut self) {
        loop {
            assert!(self.index > 0);
            self.index -= 1;
            if let Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) = self.tokens.get(self.index)
            {
                continue;
            }
            return;
        }
    }

and many more.
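
Every one of these loops exists only to skip whitespace tokens. If whitespace were never stored in the parser, the same accessors would collapse to plain index arithmetic. As a rough sketch of the shape this could take (not a tested patch, and assuming the existing self.tokens / self.index fields and the EOF_TOKEN constant stay as they are):

    pub fn peek_nth_token_ref(&self, n: usize) -> &TokenWithSpan {
        // With no whitespace tokens stored, the n-th lookahead is a direct lookup.
        self.tokens.get(self.index + n).unwrap_or(&EOF_TOKEN)
    }

    pub fn advance_token(&mut self) {
        // Nothing to skip: every stored token is meaningful.
        self.index += 1;
    }

    pub fn prev_token(&mut self) {
        assert!(self.index > 0);
        self.index -= 1;
    }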

SQL, as far as I know, is not whitespace-sensitive the way Python is, so it should be safe to drop every notion of whitespace once tokenization is complete (see the sketch after this list). Doing so should:

  • Reduce memory requirements, since whitespace tokens would no longer be stored
  • Significantly simplify the parser by removing all of the scattered whitespace-skipping logic
  • Move the parser closer to a streaming design, though that will require many more PRs
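
For illustration only, the removal itself could be a single pass over the tokenizer output before the parser ever stores the tokens. The helper below is a minimal sketch (the function name is mine, not part of the current API), reusing the TokenWithSpan and Token::Whitespace types shown above, and assuming nothing downstream relies on whitespace tokens surviving past tokenization:

    /// Hypothetical helper: discard whitespace tokens once, right after
    /// tokenization, so the parser only ever stores meaningful tokens.
    fn strip_whitespace(tokens: Vec<TokenWithSpan>) -> Vec<TokenWithSpan> {
        tokens
            .into_iter()
            .filter(|t| !matches!(t.token, Token::Whitespace(_)))
            .collect()
    }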

Since such a PR would require quite a bit of effort on my part, I would appreciate some feedback before moving forward with it.

@iffyio do you happen to have any opinion regarding such a refactoring?

Ciao,
Luca
