Remove grammar duplication #62

Closed

@mingodad commented Jul 1, 2021

It seems that there is an unnecessary grammar duplication that this pull request removes.

@facebook-github-bot added the CLA Signed label Jul 1, 2021
@mingodad
Contributor Author

Any feedback on this pull request?

@ricomariani
Contributor

Hi sorry I didn't see this sooner! Something must have failed in the notification system.

IIRC the reason that part of the grammar is duplicated is so that there are no shift/reduce conflicts in the BETWEEN section. It's possible that this is no longer necessary.

@facebook-github-bot
Contributor

@ricomariani has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ricomariani
Contributor

It looks like it should work but an internal test is failing. I'll look into that. Your way is cleaner than my way and still avoids the conflict.

@ricomariani
Contributor

Minimal repro...

select CAST(1 AS REAL) + 1;

No longer parses

investigating...

@ricomariani
Contributor

Yeah this doesn't work... sigh.

The point of math_expr is that it creates a smaller universe of math operators that you can use inside of BETWEEN x AND y

To do this, math_expr is "closed": in general it can't reach back out to a normal expr except via parentheses.

Compare:

| expr[lhs] '+' expr[rhs] { $result = new_ast_add($lhs, $rhs); }

This allows the LHS of + to be any expr

| math_expr[lhs] '+' math_expr[rhs] { $result = new_ast_add($lhs, $rhs); }

This limits + operations to math_expr.

math_expr is its own little world, created so that there would be no conflicts in BETWEEN math_expr AND math_expr

In particular, logicals are not allowed in math_expr so AND for sure ends the math_expr of the between hence no shift/reduce conflict.
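
To make the idea concrete, here is a minimal recursive-descent sketch in Python. It is not CQL's bison grammar: the tokenizer, the `Parser` class, the method names, and the tuple-shaped AST are all assumptions made for illustration. The point it shows is that the BETWEEN bounds are parsed by a level (`math`) that has no rule for AND, so the AND keyword can only mean "end of the lower bound". Parentheses are the one way back into the full grammar.

```python
import re

KEYWORDS = {"AND", "BETWEEN"}

class Parser:
    """Toy parser: 'expr' is the open level, 'math' is the closed one."""

    def __init__(self, text):
        # Hypothetical tokenizer: integers, words, and a few symbols.
        self.toks = [t.upper() for t in re.findall(r"\d+|[A-Za-z]+|[-+()]", text)]
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def expr(self):
        # Open level: the only place a logical AND is recognized.
        node = self.between()
        while self.peek() == "AND":
            self.take()
            node = ("and", node, self.between())
        return node

    def between(self):
        node = self.math()
        if self.peek() == "BETWEEN":
            self.take()
            low = self.math()    # closed level: stops at AND
            self.take("AND")     # so this AND always belongs to BETWEEN
            high = self.math()
            node = ("between", node, low, high)
        return node

    def math(self):
        # Closed level: arithmetic only, no logical operators.
        node = self.atom()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.atom())
        return node

    def atom(self):
        tok = self.take()
        if tok == "(":
            node = self.expr()   # parentheses re-open the full grammar
            self.take(")")
            return node
        if tok.isdigit():
            return int(tok)
        if tok.isalpha() and tok not in KEYWORDS:
            return tok           # identifier (upper-cased by the tokenizer)
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    parser = Parser(text)
    node = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing token {parser.peek()!r}")
    return node

print(parse("x between 1 + 2 and 3 and y"))
```

In this sketch the first AND closes the lower bound and the second one is an ordinary logical AND, with no lookahead needed to tell them apart.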

@ricomariani
Contributor

I'll add a test case to force the issue.

@ricomariani
Contributor

The new test cases should be landing shortly. Any objection if I close the PR at this point?

If you can think of a way to remove that redundancy I'd be most grateful. I don't like it :D

@@ -848,17 +848,7 @@ math_expr[result]:
;

expr[result]:
basic_expr { $result = $basic_expr; }
| expr[lhs] '&' expr[rhs] { $result = new_ast_bin_and($lhs, $rhs); }

Yeah, the problem here is that if you use math_expr for all of these then the LHS and RHS are limited to math_expr, which means you can't parse something like CAST(1 as REAL) + 1

I'm adding new test cases to give a clean error on that.

@ricomariani
Contributor

1e67156

The above are some new simple test cases that cover this case.

I'm sorry those should have been there in the first place -- then this would have been reasonably obvious.

Thanks for helping me find that gap.

@mingodad
Contributor Author

Thank you for all the feedback!
I didn't realize the full implications. Looking through the PostgreSQL 13.3 grammar, I can see that they also have duplication (a lot more) for the expressions a_expr (https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/parser/gram.y;h=3adc087e3ffe59a5c4bf5443d7e49545f5f0ba51;hb=refs/heads/REL_13_STABLE#l13200) and b_expr (https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/parser/gram.y;h=3adc087e3ffe59a5c4bf5443d7e49545f5f0ba51;hb=refs/heads/REL_13_STABLE#l13616).

@mingodad
Contributor Author

This is a good place for a macro preprocessor to have been incorporated in bison/yacc.

@mingodad
Contributor Author

For my own work I normally use a simple in-place custom preprocessor for mechanical duplication/generation (it was inspired by one in Python, but I don't remember its origin right now):

-------------------------------- sources/cql.y --------------------------------
index d8d8d31d..fc79a512 100644
@@ -852,34 +852,63 @@ basic_expr:
   | '(' select_stmt IF NOTHING THROW')'  { $basic_expr = new_ast_select_if_nothing_throw_expr($select_stmt); }
   | EXISTS '(' select_stmt ')'  { $basic_expr = new_ast_exists_expr($select_stmt); }
   ;
+  
+/*SquiLu
+
+function commonExpr(pfx) {
+	auto base_txt = [==[
+  basic_expr  { $result = $basic_expr; }
+  | xpfx_expr[lhs] '&' xpfx_expr[rhs]  { $result = new_ast_bin_and($lhs, $rhs); }
+  | xpfx_expr[lhs] '|' xpfx_expr[rhs]  { $result = new_ast_bin_or($lhs, $rhs); }
+  | xpfx_expr[lhs] LS xpfx_expr[rhs]   { $result = new_ast_lshift($lhs, $rhs); }
+  | xpfx_expr[lhs] RS  xpfx_expr[rhs]  { $result = new_ast_rshift($lhs, $rhs); }
+  | xpfx_expr[lhs] '+' xpfx_expr[rhs]  { $result = new_ast_add($lhs, $rhs); }
+  | xpfx_expr[lhs] '-' xpfx_expr[rhs]  { $result = new_ast_sub($lhs, $rhs); }
+  | xpfx_expr[lhs] '*' xpfx_expr[rhs]  { $result = new_ast_mul($lhs, $rhs); }
+  | xpfx_expr[lhs] '/' xpfx_expr[rhs]  { $result = new_ast_div($lhs, $rhs); }
+  | xpfx_expr[lhs] '%' xpfx_expr[rhs]  { $result = new_ast_mod($lhs, $rhs); }
+  | '-' xpfx_expr[rhs] %prec UMINUS    { $result = new_ast_uminus($rhs); }
+  | xpfx_expr[lhs] CONCAT xpfx_expr[rhs]  { $result = new_ast_concat($lhs, $rhs); }	
+]==];
+  puts(base_txt.replace("xpfx_", pfx));
+}
+
+SquiLu*/
 
 math_expr[result]:
+//@commonExpr("math_")
+// generated-code:begin
   basic_expr  { $result = $basic_expr; }
   | math_expr[lhs] '&' math_expr[rhs]  { $result = new_ast_bin_and($lhs, $rhs); }
   | math_expr[lhs] '|' math_expr[rhs]  { $result = new_ast_bin_or($lhs, $rhs); }
-  | math_expr[lhs] LS math_expr[rhs]  { $result = new_ast_lshift($lhs, $rhs); }
+  | math_expr[lhs] LS math_expr[rhs]   { $result = new_ast_lshift($lhs, $rhs); }
   | math_expr[lhs] RS  math_expr[rhs]  { $result = new_ast_rshift($lhs, $rhs); }
   | math_expr[lhs] '+' math_expr[rhs]  { $result = new_ast_add($lhs, $rhs); }
   | math_expr[lhs] '-' math_expr[rhs]  { $result = new_ast_sub($lhs, $rhs); }
   | math_expr[lhs] '*' math_expr[rhs]  { $result = new_ast_mul($lhs, $rhs); }
   | math_expr[lhs] '/' math_expr[rhs]  { $result = new_ast_div($lhs, $rhs); }
   | math_expr[lhs] '%' math_expr[rhs]  { $result = new_ast_mod($lhs, $rhs); }
-  | '-' math_expr[rhs] %prec UMINUS  { $result = new_ast_uminus($rhs); }
-  | math_expr[lhs] CONCAT math_expr[rhs]  { $result = new_ast_concat($lhs, $rhs); }
+  | '-' math_expr[rhs] %prec UMINUS    { $result = new_ast_uminus($rhs); }
+  | math_expr[lhs] CONCAT math_expr[rhs]  { $result = new_ast_concat($lhs, $rhs); }	
+// generated-code:end
   ;
 
 expr[result]:
+//@commonExpr("")
+// generated-code:begin
   basic_expr  { $result = $basic_expr; }
   | expr[lhs] '&' expr[rhs]  { $result = new_ast_bin_and($lhs, $rhs); }
   | expr[lhs] '|' expr[rhs]  { $result = new_ast_bin_or($lhs, $rhs); }
-  | expr[lhs] LS expr[rhs]  { $result = new_ast_lshift($lhs, $rhs); }
-  | expr[lhs] RS expr[rhs]  { $result = new_ast_rshift($lhs, $rhs); }
+  | expr[lhs] LS expr[rhs]   { $result = new_ast_lshift($lhs, $rhs); }
+  | expr[lhs] RS  expr[rhs]  { $result = new_ast_rshift($lhs, $rhs); }
   | expr[lhs] '+' expr[rhs]  { $result = new_ast_add($lhs, $rhs); }
   | expr[lhs] '-' expr[rhs]  { $result = new_ast_sub($lhs, $rhs); }
   | expr[lhs] '*' expr[rhs]  { $result = new_ast_mul($lhs, $rhs); }
   | expr[lhs] '/' expr[rhs]  { $result = new_ast_div($lhs, $rhs); }
   | expr[lhs] '%' expr[rhs]  { $result = new_ast_mod($lhs, $rhs); }
-  | '-' expr[rhs] %prec UMINUS  { $result = new_ast_uminus($rhs); }
+  | '-' expr[rhs] %prec UMINUS    { $result = new_ast_uminus($rhs); }
+  | expr[lhs] CONCAT expr[rhs]  { $result = new_ast_concat($lhs, $rhs); }	
+// generated-code:end
   | NOT expr[rhs]  { $result = new_ast_not($rhs); }
   | '~' expr[rhs]  { $result = new_ast_tilde($rhs); }
   | expr[lhs] COLLATE name  { $result = new_ast_collate($lhs, $name); }
@@ -906,7 +935,6 @@ expr[result]:
   | expr[lhs] BETWEEN math_expr[me1] AND math_expr[me2]  { $result = new_ast_between($lhs, new_ast_range($me1,$me2)); }
   | expr[lhs] IS_NOT expr[rhs]  { $result = new_ast_is_not($lhs, $rhs); }
   | expr[lhs] IS expr[rhs]  { $result = new_ast_is($lhs, $rhs); }
-  | expr[lhs] CONCAT expr[rhs]  { $result = new_ast_concat($lhs, $rhs); }
   | CASE expr[cond] case_list END  { $result = new_ast_case_expr($cond, new_ast_connector($case_list, NULL)); }
   | CASE expr[cond1] case_list ELSE expr[cond2] END  { $result = new_ast_case_expr($cond1, new_ast_connector($case_list, $cond2));}
   | CASE case_list END  { $result = new_ast_case_expr(NULL, new_ast_connector($case_list, NULL));}

Here is the one I use most:

#!/home/mingo/bin/squilu
__max_print_stack_str_size <- 100;

function puts(s) {
	fd.write(s);
}

function putsnl(s) {
	fd.write(s);
	fd.write("\n");
}

function preprocess(file_name){
	local fd = file(file_name, "r");
	local code = fd.read(fd.len());
	fd.close();
	
	local function escape_re(str){
		local new_str = str.gsub("[-.$%%[%]^]", "%%%1")
		return new_str
	}
	local code_generation_begin = "// generated-code:begin";
	local code_generation_end = "// generated-code:end";

	local code_generation_begin_escaped = escape_re(code_generation_begin);
	local code_generation_end_escaped = escape_re(code_generation_end);

	//print(code_generation_begin, code_generation_begin_escaped);

	local new_code = code.gsub(code_generation_begin_escaped + ".-" + code_generation_end_escaped + "\n", "");

	new_code = new_code.gsub("(//@(.-)\n)", function(m, m2) {
			return format("%s%s\n}====})\n%s;\nputs({===={%s\n", m, code_generation_begin, m2, code_generation_end)
		});


	new_code = new_code.gsub("(/%*SquiLu(.-)SquiLu%*/)", function(m, m2) {
			return format("%s\n}====})\n%s\nputs({===={", m, m2)
		});

	local buffer = blob();
	buffer.write("puts({===={");
	buffer.write(new_code);
	buffer.write("}====})");
	local sqcode = buffer.tostring();
	
	//print(sqcode);
	
	local code_func = compilestring(sqcode, "sqcode-preprocessed");

	local bak_filename = file_name + ".pp.bak";
	os.rename(file_name, bak_filename);

	::fd <-  file(file_name, "w");
	code_func();
	::fd.close();
}

if(vargv.len() > 1){
	preprocess(vargv[1]);
}

The same in Lua (a bit more verbose):

#!/usr/bin/env lua

_fc_ = '' -- content of file being processed

function puts(s)
	print(s)
end	
function putsnl(s)
	print(s)
end	

function preprocess_str(txt)
	local luaStart = '/%*luapp'
	local luaEnd = 'luapp%*/'
	local luaHash = '//luapp'
	local luaCode
	local lastPos = 1
	local luaGeneratedCodeStart = '// generated-code:begin'
	local luaGeneratedCodeEnd = '// generated-code:end'
	local luaGeneratedCodePattern = string.gsub(luaGeneratedCodeStart, '%-', '%%-') ..  '.-' .. string.gsub(luaGeneratedCodeEnd, '%-', '%%-') .. '\n'
	local luaCodePosStart = 1
	local luaCodePosEnd = 1
	local luaHashPosStart = 1
	local luaHashPosEnd = 1
	local tbl = {}
	
	local cleanTxt =  string.gsub(txt, luaGeneratedCodePattern, '')

	function doLuaCode()
		luaCodePosStart, luaCodePosEnd, luaCode = string.find(cleanTxt, luaStart .. '(.-)' .. luaEnd, luaCodePosStart)
		table.insert(tbl, 'puts(' .. string.format('%q', string.sub(cleanTxt, lastPos, luaCodePosEnd)) .. ')')
		table.insert(tbl,  luaCode )
		lastPos = luaCodePosEnd +1
	end

	function doLuaHash()
		luaHashPosStart, luaHashPosEnd, luaCode = string.find(cleanTxt, luaHash .. '(.-)\n' , luaHashPosStart)
		table.insert(tbl, 'putsnl(' .. string.format('%q', string.sub(cleanTxt, lastPos, luaHashPosEnd-1)) .. ')')
		table.insert(tbl,  'putsnl("' .. luaGeneratedCodeStart .. '")' )
		table.insert(tbl,  luaCode )
		table.insert(tbl,  'putsnl("' .. luaGeneratedCodeEnd .. '")' )
		lastPos = luaHashPosEnd+1
	end
		
	while true do
		if luaCodePosStart then
			luaCodePosStart, luaCodePosEnd = string.find(cleanTxt, luaStart, luaCodePosEnd)
		end
		
		if luaHashPosStart then
			luaHashPosStart, luaHashPosEnd = string.find(cleanTxt, luaHash, luaHashPosEnd)
		end
		
		if (luaCodePosStart == nil) and (luaHashPosStart == nil) then
			-- we have finished the work
			table.insert(tbl, 'puts(' .. string.format('%q', string.sub(cleanTxt, lastPos)) .. ')')
			break
		end
		
		if luaCodePosStart and luaHashPosStart then
			-- we have both lua code and lua hash
			if luaCodePosStart < luaHashPosStart then
				--lua code comes first
				doLuaCode()
				luaHashPosEnd = lastPos
			else
				--lua hash comes first
				doLuaHash()				
				luaCodePosEnd = lastPos
			end
		elseif luaCodePosStart then 
			doLuaCode()
		elseif luaHashPosStart then
			doLuaHash()
		end
	end
	--print("#tbl", #tbl, table.concat(tbl, '\n'))
	return table.concat(tbl, '\n')
end

function preprocess_file(fn)
	local fn_bkp = fn .. '.lpb'
	local fn_tmp = fn .. '.lpp'
	--print(fn, fn_bkp, fn_tmp)
	-- local fn_prj = fn:split('.')
	-- if #fn_prj > 1 then table.remove(fn_prj) end
	-- fn_prj = table.concat(fn_prj) -- .. '.lua'

	-- if not PROJECT_SCRIPT_INCLUDED then
		-- require(fn_prj)
	-- end
	
	local fh, message = io.open(fn, 'r')
	if not fh then 
		print(message)
		return -1
	end
	_fc_ = fh:read('*a')
	fh:close()
	
	--print(_fc_)

	local fh, message = io.open(fn_tmp, 'w')
	if not fh then 
		print(message)
		return -1
	end
	function puts(s)
		fh:write(s)
	end	
	function putsnl(s)
		fh:write(s)
		fh:write('\n')
	end	
	--print(_fc_)
	local lc = preprocess_str(_fc_)
	--print(lc)
	loadstring(lc, 'pp')()
	fh:close()	
	
	os.remove(fn_bkp)
	os.rename(fn, fn_bkp)
	os.rename(fn_tmp, fn)
end

if arg[1] then
	if arg[2] then dofile(arg[2]) end
	preprocess_file(arg[1])
end
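
For readers who know neither SquiLu nor Lua, the same in-place regeneration idea can be sketched in Python. This is a simplified illustration, not a port of the scripts above: instead of executing code embedded in the grammar file, it looks up a generator in a Python dictionary, and it only understands directives of the form `//@name("arg")`. The names `GENERATORS`, `common_expr`, and `preprocess`, and the shortened three-rule template, are assumptions for the example.

```python
import re

BEGIN = "// generated-code:begin"
END = "// generated-code:end"

# Shortened template; "xpfx_" is the placeholder used in the diff above.
COMMON_EXPR = """\
  basic_expr  { $result = $basic_expr; }
  | xpfx_expr[lhs] '+' xpfx_expr[rhs]  { $result = new_ast_add($lhs, $rhs); }
  | xpfx_expr[lhs] '-' xpfx_expr[rhs]  { $result = new_ast_sub($lhs, $rhs); }
"""

def common_expr(prefix):
    """Instantiate the shared rules for one nonterminal prefix."""
    return COMMON_EXPR.replace("xpfx_", prefix)

# Hypothetical registry mapping directive names to generator functions.
GENERATORS = {"commonExpr": common_expr}

DIRECTIVE = re.compile(r'^//@(\w+)\("([^"]*)"\)[ \t]*\n', re.MULTILINE)
STALE = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END) + r"\n", re.DOTALL)

def preprocess(source):
    """Drop previously generated blocks, then regenerate one per directive."""
    cleaned = STALE.sub("", source)

    def expand(match):
        name, arg = match.group(1), match.group(2)
        body = GENERATORS[name](arg)
        # Keep the directive line so the file can be reprocessed later.
        return f"{match.group(0)}{BEGIN}\n{body}{END}\n"

    return DIRECTIVE.sub(expand, cleaned)

grammar = 'math_expr[result]:\n//@commonExpr("math_")\n  ;\n'
print(preprocess(grammar))
```

Because stale blocks are stripped before expansion, running `preprocess` on its own output leaves the text unchanged, which is what makes it safe to rerun on a checked-in grammar file.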

@ricomariani
Contributor

Actually we can totally fix this! We just didn't go far enough.

We need to move almost everything into math_expr so that only the things that are weird with BETWEEN are out.

We're left with this and no duplication. I do have to add a NOT BETWEEN token but that's not so bad.

expr[result]:
  math_expr  { $result = $math_expr; }
  | expr[lhs] COLLATE name  { $result = new_ast_collate($lhs, $name); }
  | expr[lhs] AND expr[rhs]  { $result = new_ast_and($lhs, $rhs); }
  | expr[lhs] OR expr[rhs]  { $result = new_ast_or($lhs, $rhs); } 
  | expr[lhs] NOT_BETWEEN math_expr[me1] AND math_expr[me2]  { $result = new_ast_not_between($lhs, new_ast_range($me1,$me2)); }
  | expr[lhs] BETWEEN math_expr[me1] AND math_expr[me2]  { $result = new_ast_between($lhs, new_ast_range($me1,$me2)); }
  ;

This also reveals a couple of lingering errors...

@ricomariani
Contributor

Well poop.

This almost works. If you pull everything into math_expr that isn't AND/OR/BETWEEN you're golden. The only problem is that if you put NOT into math_expr then it forces the order of operations of NOT to be wrong... NOT has to be weaker than BETWEEN. And btw this forced me to look hard at order of operations so I had to add more tests because there were bugs...

On the other hand, if you put NOT in expr and not in math_expr, then you have the problem that you can't parse

1 + not 1;

If NOT was tighter than between this would work nicely... but it isn't.

Meaning

NOT 1 BETWEEN -2 and 2  -> IS FALSE

-- because
NOT (1 BETWEEN -2 and 2)   --> IS FALSE

-- whereas
(NOT 1) BETWEEN -2 and 2 --> IS TRUE

Sigh...
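
One way to check this ordering against a reference implementation is to ask SQLite directly. The following is a small sketch using Python's standard `sqlite3` module with an in-memory database; the function name `between_precedence_row` is made up for the example.

```python
import sqlite3

def between_precedence_row():
    """Evaluate the three NOT/BETWEEN variants in SQLite and return the row."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(
            "SELECT NOT 1 BETWEEN -2 AND 2, "
            "NOT (1 BETWEEN -2 AND 2), "
            "(NOT 1) BETWEEN -2 AND 2"
        ).fetchone()
    finally:
        conn.close()

print(between_precedence_row())
```

If the first two columns agree and the third differs, NOT binds more weakly than BETWEEN in that engine, which is the behavior described above.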

@ricomariani
Contributor

I was able to make it work! I also had to move the BETWEEN node into math_expr, so the main expr node contains only AND/OR/COLLATE. That split works and requires no duplication. All we needed from math_expr was that it not have AND in it.

I had to do a little tweak to make BETWEEN left associative as expected but that was easy.

@ricomariani
Contributor

This PR turned out to be very good at finding latent bugs... thank you!

@facebook-github-bot
Contributor

@ricomariani merged this pull request in 3610bed.

@ricomariani
Contributor

I generalized what you did:

  • moved CAST+CASE into basic_expr (safe to do so)
  • kept only AND/OR/COLLATE in "expr"
  • everything else went into math_expr

This works well and has no duplication.
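
The effect of that layering on order of operations can be approximated with a small precedence-climbing (Pratt-style) evaluator in Python. This is only a sketch of the ordering discussed in this thread, not the bison grammar: it covers a handful of operators, and the binding-power numbers are assumptions chosen so that OR < AND < NOT < BETWEEN < +/- < unary minus.

```python
import re

# Assumed binding powers (higher binds tighter); illustration only.
BP = {"OR": 1, "AND": 2, "BETWEEN": 4, "+": 5, "-": 5}
NOT_BP, UMINUS_BP = 3, 6

class Evaluator:
    def __init__(self, text):
        self.toks = [t.upper() for t in re.findall(r"\d+|[A-Za-z]+|[-+()]", text)]
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def parse(self, min_bp=0):
        tok = self.take()
        if tok == "(":
            left = self.parse(0)
            self.take(")")
        elif tok == "NOT":
            # NOT is weaker than BETWEEN, so its operand may contain one.
            left = int(not self.parse(NOT_BP))
        elif tok == "-":
            left = -self.parse(UMINUS_BP)
        elif tok.isdigit():
            left = int(tok)
        else:
            raise SyntaxError(f"unexpected token {tok!r}")
        while self.peek() in BP and BP[self.peek()] > min_bp:
            op = self.take()
            if op == "BETWEEN":
                # Bounds are parsed above AND's level, so AND ends the bound.
                low = self.parse(BP["BETWEEN"])
                self.take("AND")
                high = self.parse(BP["BETWEEN"])
                left = int(low <= left <= high)
            else:
                right = self.parse(BP[op])
                if op == "+":
                    left += right
                elif op == "-":
                    left -= right
                elif op == "AND":
                    left = int(bool(left) and bool(right))
                else:  # OR
                    left = int(bool(left) or bool(right))
        return left

def evaluate(text):
    ev = Evaluator(text)
    value = ev.parse()
    if ev.peek() is not None:
        raise SyntaxError(f"trailing token {ev.peek()!r}")
    return value

for query in ("NOT 1 BETWEEN -2 AND 2", "(NOT 1) BETWEEN -2 AND 2", "1 + NOT 1"):
    print(query, "->", evaluate(query))
```

Because the bounds of BETWEEN are parsed at a level above AND, the same "no AND inside the bounds" property falls out of the binding powers rather than from a duplicated rule set.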

@ricomariani
Contributor

FWIW it seems like PostgreSQL had settled on a solution very much like the one I had: my basic_expr, math_expr, and expr correspond to their a_, b_, and c_ exprs (not in that order).

Interestingly, the order of operations split working out the way that it does lets you cleave out AND/OR and create a chain. I think they could do the same in their grammar if they were so inclined.

@ricomariani
Contributor

If you look at 9185fc8 you will see all the bugs this forced out. I did those fixes in a separate diff because they were unrelated to your refactor, but it was your PR that led me to find those bugs.

@mingodad
Contributor Author

Two people looking at the same thing do not always see the same thing!
It seems that I only saw the tip of the iceberg, and after I pointed at it others dived in and saw a lot more.
I'm glad it helped improve things!

@mingodad
Contributor Author

Looking at the sqlite3 railroad diagram (https://sqlite.org/forum/forumpost/c7a0c2a23231a27f9b746f99e390e1a89d83a4678eda306b4ae0415c471aa819) I can see that sqlite grammar doesn't have two kinds of expressions and I'm wondering if the bugs you've found in cql.y are also present in sqlite3.

@mingodad
Contributor Author

Testing this query in both sqlite3 and PostgreSQL gives:

select NOT 1 BETWEEN -2 and 2, NOT (1 BETWEEN -2 and 2), (NOT 1) BETWEEN -2 and 2
sqlite3 < "test-between.sql"
0|0|1

PostgreSQL:

argument of NOT must be type boolean, not type integer

@mingodad
Contributor Author

This query runs on both sqlite3 and PostgreSQL but gives different results; it seems that we have a bug in PostgreSQL:

select /*NOT*/ 1 BETWEEN -2 and 2, NOT (1 BETWEEN -2 and 2), (/*NOT*/ 1) BETWEEN -2 and 2
sqlite3 < "test-between.sql"
1|0|1

Postgresql at (https://extendsclass.com/postgresql-online.html):

?column? | ?column? | ?column?
-- | -- | --
true | true | true

@mingodad
Contributor Author

Here is the bug report on PostgreSQL just in case you want to follow it:

The following bug has been logged on the website:

Bug reference:      17109
Logged by:          Domingo Alvarez Duarte
Email address:      mingodad@gmail.com
PostgreSQL version: 11.10
Operating system:   Online at https://extendsclass.com/postgresql-onli
Description:        

When proposing a change to
https://github.com/facebookincubator/CG-SQL/pull/62 people there found
several problems on their project and one of the related to how
parse/evaluate expressions around "BETWEEN" keyword and they created a
simple test case to check it (adapted by me):

====
select /*NOT*/ 1 BETWEEN -2 and 2, NOT (1 BETWEEN -2 and 2), (/*NOT*/ 1)
BETWEEN -2 and 2
====

Here is the output of PostgreSQL where the second column is not negated (if
column 1 expression is true then "NOT" that expression should return false
):
====
?column? | ?column? | ?column?
-- | -- | --
true | true | true
====

Here is the output of sqlite3:
====
sqlite3 < "test-between.sql"
1|0|1
====

@mingodad
Contributor Author

It seems that I pulled the trigger too early. Although the output of that website does seem wrong, executing the query elsewhere with a different PostgreSQL version gives the expected output:

psql  template1
psql (13.3)
Type "help" for help.

template1=# select /*NOT*/ 1 BETWEEN -2 and 2, NOT (1 BETWEEN -2 and 2), (/*NOT*/ 1) BETWEEN -2 and 2
template1-# ;
 ?column? | ?column? | ?column? 
----------+----------+----------
 t        | f        | t
(1 row)
