Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Hive QL Parsing #235

Merged
merged 101 commits into from
Feb 4, 2021
Merged

Implement Hive QL Parsing #235

merged 101 commits into from
Feb 4, 2021

Conversation

hntd187
Copy link
Contributor

@hntd187 hntd187 commented Jul 23, 2020

This is a highly experimental start to implementing Hive QL in the parser. Luckily, the parser's design lends itself well to adding in some of these things, but there are still a few oddities and things to implement to make something for Hive complete. I'd like to get people reviewing this early because in my opinion I do some janky things here I should clean up and I'm hoping early review will help me nip that in the bud.

Also, my tests are a bit on the light side right now, but I have verified a lot of this functionality with some internal HQL that I cannot share publicly. I am confident that this covers the majority of edge cases on how not to write HQL.

@hntd187 hntd187 mentioned this pull request Jul 23, 2020
@coveralls
Copy link

coveralls commented Jul 24, 2020

Pull Request Test Coverage Report for Build 537958138

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 569 of 754 (75.46%) changed or added relevant lines in 13 files are covered.
  • 10 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-2.0%) to 90.085%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/tokenizer.rs 27 31 87.1%
src/ast/ddl.rs 16 21 76.19%
src/dialect/hive.rs 10 15 66.67%
src/ast/query.rs 25 33 75.76%
src/ast/mod.rs 101 176 57.39%
src/parser.rs 279 367 76.02%
Files with Coverage Reduction New Missed Lines %
src/ast/data_type.rs 1 77.78%
src/ast/ddl.rs 1 81.98%
src/ast/query.rs 1 87.56%
src/parser.rs 2 85.93%
src/ast/mod.rs 5 72.41%
Totals Coverage Status
Change from base Build 469532238: -2.0%
Covered Lines: 5324
Relevant Lines: 5910

💛 - Coveralls

src/ast/mod.rs Outdated
@@ -627,6 +753,25 @@ impl fmt::Display for Statement {
}
Ok(())
}
Statement::CreateDatabase {
db_name,
ine,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if_not_exists , better to write it out without abbreviations

@@ -61,6 +61,8 @@ pub enum DataType {
Regclass,
/// Text
Text,
/// String (Hive)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also some other dialects besides hive use string

} => {
write!(f, "TRUNCATE TABLE {}", table_name)?;
if let Some(ref parts) = partitions {
if !parts.is_empty() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is PARTITIONS () also possible? If so, then this check is not needed. Otherwise, we could check the partitions type to a vec instead of optional vec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe PARTITIONS () is valid, you have to specify at the very least a column.

src/ast/mod.rs Outdated
Msck {
table_name: ObjectName,
repair: bool,
add_partitions: bool,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add/drop/sync could be an enum I think? Also simplifies the parser code a bit and allows for easier patter matching.

@Dandandan
Copy link
Collaborator

Looks great, would love to have this! I added a few comments

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
pub enum PartitionAction {
ADD,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change naming to Add, Drop, Sync ?

@hntd187
Copy link
Contributor Author

hntd187 commented Jul 27, 2020

So @Dandandan I ran into a discussion point here. Hive allows you to have column aliases beginning with numbers such as SELECT a as 3b from table with no quotes or anything of the sort around it. This basically breaks most ident parsing, which led me to disable the hive dialect in the ansi tests for now. While I don't believe hive has ever conformed to anything close to ansi sql there might be a better way to do this or support numbers as beginning identifiers.

@hntd187 hntd187 changed the title Hive AST Updates Implement Hive QL Parsing Dec 7, 2020
@andygrove
Copy link
Collaborator

@hntd187 @Dandandan Seems like we should go ahead and merge this now? I will skim through and merge if there are no objections.

@@ -22,15 +22,17 @@ use std::fmt;
pub enum Value {
/// Numeric literal
#[cfg(not(feature = "bigdecimal"))]
Number(String),
Number(String, bool),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if a new type should have been added rather than change this type?

Copy link
Collaborator

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -35,7 +35,7 @@ pub enum Token {
/// A keyword (like SELECT) or an optionally quoted SQL identifier
Word(Word),
/// An unsigned numeric literal
Number(String),
Number(String, bool),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be better extracted to a named field, e.g. long: bool?.
This helps readability

@@ -22,15 +22,17 @@ use std::fmt;
pub enum Value {
/// Numeric literal
#[cfg(not(feature = "bigdecimal"))]
Number(String),
Number(String, bool),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For those values as well

@Dandandan
Copy link
Collaborator

@hntd187 @andygrove
To me it looks great as well, added some final comments, could also be done in a follow-up.
Thanks @hntd187, for the great addition and for your patience 😎

@andygrove andygrove merged commit 8a214f9 into sqlparser-rs:main Feb 4, 2021
@alamb alamb mentioned this pull request May 9, 2022
@bitemyapp
Copy link
Contributor

Hey y'all, I'm writing a utility that indexes SQL queries using this library as a parser.

I just bumped into this: 9d9d681#diff-4a04259da480a6b794a2e947e4cc03eff4d1aa9330836f5b91cac68c5398193fR2182-L2173

The support for HiveQL's use of a FROM clause in a WITH ... AS ... common table expression. The keywords involved are very overloaded and I am having a very difficult time figuring out what the FROM clause means in that instance.

Here's an example query:

WITH

brudda AS (
    SELECT
        id
    FROM test_table
) FROM a

SELECT
    t.id
FROM test_table AS t
INNER JOIN brudda as b
ON b.test_table_id = t.id

Could you please link documentation that explains what a is or what the FROM clause does in HiveQL if it exists? If not, could you please briefly explain it? There's nothing in the source code/docs for this library explaining it that I can tell either.

The feature doesn't seem to be explained in the HiveQL docs:

It demonstrates its use in:

Use a CTE to insert data.
CREATE TABLE s1 LIKE src; WITH q1 AS (SELECT key, value FROM src WHERE key = '5') FROM q1 INSERT OVERWRITE TABLE s1 SELECT *;

But doesn't explain what the clause is actually doing. I don't have a Hive instance to test right now but I'd expect this to work as well:

WITH q1 AS (SELECT key, value FROM src WHERE key = '5') FROM q1 INSERT OVERWRITE TABLE s1 SELECT *

Similarly from https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression

-- insert example
create table s1 like src;
with q1 as ( select key, value from src where key = '5')
from q1
insert overwrite table s1
select *;

Could someone please tell me what's going on here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants