Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

>,<,>=,<= checks fail on column names with special characters #89

Open
malcolmgreaves opened this issue Feb 8, 2019 · 2 comments
Open

Comments

@malcolmgreaves
Copy link
Contributor

The .isGreaterThan, .isLessThan, .isGreaterThanOrEqualTo, and .isLessThanOrEqualTo methods on the Check type will fail with a Spark SQL SyntaxError at runtime when applied to columns whose names contain special characters or keywords.

@malcolmgreaves
Copy link
Contributor Author

// ensure that `spark` exists as an instance of `SparkSession`

import com.amazon.deequ.analyzers.Analysis
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.checks.{ Check, CheckLevel, CheckResult } 

case class Wpair(`[this] is a value!!!11!#$`: Long, `[ also a column name:`: Long)
val rawValues = Seq(Long.MaxValue, Long.MaxValue, 25252L, 25252L, 66231L)
val weirdNums = spark.createDataFrame(
    rawValues.zip(rawValues.map{_-10L}).map { (Wpair.apply _).tupled }
)

val escapedCheck = Check(CheckLevel.Error, """escaped isGreaterThanOrEqualTo""").isGreaterThanOrEqualTo("`[this] is a value!!!11!#$`", "`[ also a column name:`")
val nonEscapedCheck = Check(CheckLevel.Error, """non-escaped isGreaterThanOrEqualTo""").isGreaterThanOrEqualTo("[this] is a value!!!11!#$", "[ also a column name:")

def analyze(df: DataFrame, check: Check): CheckResult = 
  check.evaluate(
    AnalysisRunner
      .onData(df)
      .addAnalyzers(Analysis().analyzers ++ check.requiredAnalyzers())
      .run()
    )

println(s"Check on weird column name, escaped:     ${analyze(weirdNums, escapedCheck).status}")
println(s"Check on weird column name, non-escaped: ${analyze(weirdNums, nonEscapedCheck).status}")

Observed output is:

Check on weird column name, escaped:     Success
Check on weird column name, non-escaped: Error

Investigating the nonEscapedCheck's application, we have:

print(analyze(weirdNums, nonEscapedCheck).constraintResults.head.message.get)

Showing:

org.apache.spark.sql.catalyst.parser.ParseException: 
extraneous input '[' expecting {'(', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'IF', '+', '-', '*', 'DIV', '~', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', STRING, BIGINT_LITERAL, SMALLINT_LITERAL, TINYINT_LITERAL, INTEGER_VALUE, DECIMAL_VALUE, DOUBLE_LITERAL, BIGDECIMAL_LITERAL, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)

== SQL ==
[this] is a value!!!11!#$ >= [ also a column name:
^^^

@malcolmgreaves
Copy link
Contributor Author

Fixed by PR #91

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant