DRILL-4980: Upgrading of the approach of parquet date correctness status detection #644
Conversation
vdiravka
commented
Nov 3, 2016
- Parquet writer version is added;
- Updated the detection method of parquet date correctness.
@parthchandra @paul-rogers @jaltekruse could you please review?
paul-rogers
left a comment
Nice improvement! Comments include questions and suggested changes.
public ParquetTableMetadata_v2(boolean isDateCorrect) {
  this.drillVersion = DrillVersionInfo.getVersion();
  this.isDateCorrect = isDateCorrect;
  this.writerVersion = ParquetWriter.WRITER_VERSION;
We set the writer version to the current version when we create the metadata. Is this same metadata used for both read and write? If so, we have the potential for a nasty bug. A (new) reader fails to set the writerVersion value from actual file metadata. The value will default to the latest, even if the file itself happens to be older.
I wonder if it makes sense to pass the version into the constructor. The Writer passes in the current writer version. The reader must pass in the value found in the file.
Or, is this metadata used only for writing, but not reading? If that is true perhaps we can document that in the code somewhere. (I looked but did not find anything.)
Or, is this metadata cached from scanning actual files? If so, isn't defaulting the writer version simply asking for trouble if someone forgets to set this field based on actual file version?
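For what it's worth, a minimal sketch of that suggestion (hypothetical names and constructor shape, not the PR's actual code): the caller supplies the version, so a reader can pass the value found in the file while the writer passes the current one.

class ParquetTableMetadataSketch {
  final String drillVersion;
  final int writerVersion;

  // Writer path would pass DrillVersionInfo.getVersion() and ParquetWriter.WRITER_VERSION;
  // a reader path would pass whatever values it actually found in the file metadata.
  ParquetTableMetadataSketch(String drillVersion, int writerVersion) {
    this.drillVersion = drillVersion;
    this.writerVersion = writerVersion;
  }
}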
@JsonProperty List<String> directories;
@JsonProperty String drillVersion;
@JsonProperty boolean isDateCorrect;
@JsonProperty int writerVersion;
This property is Jackson-serialized. How does Jackson handle older files written without this version? According to this post (http://stackoverflow.com/questions/8320993/jackson-what-happens-if-a-property-is-missing), "Setter methods are only invoked for properties with explicit values." This means that older files without the writerVersion set won't call the function to deserialize the writer version, and the version will default to the newest version. I suspect that either A) I'm misunderstanding what this code does, or B) we have a backward-compatibility issue. Please explain which.
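To illustrate the concern, a minimal sketch (hypothetical POJO, not Drill's class) of how Jackson handles a missing property: the field simply keeps its Java default or in-code initializer, so an old file without writerVersion can silently look like a new one.

import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MissingPropertySketch {
  static class Metadata {
    @JsonProperty String drillVersion;
    @JsonProperty int writerVersion = 2;  // in-code default wins when the JSON has no such field
  }

  public static void main(String[] args) throws Exception {
    // Old metadata written before the writerVersion field existed.
    Metadata m = new ObjectMapper().readValue("{\"drillVersion\":\"1.8.0\"}", Metadata.class);
    System.out.println(m.writerVersion);  // prints 2, even though the file never recorded a writer version
  }
}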
ParquetTableMetadata_v2 is used mostly when the parquet metadata cache file is created.
This class is not used for reading the metadata cache file or the footer metadata, except in the case when we read one parquet file or parquet folder: an empty instance of this class is created, but only the columnTypeInfo field is used, and drillVersion and writerVersion weren't used. That's why everything worked, but it was not right. I added an empty constructor (as I did earlier, actually the master version).
And added initialization of these fields via constructor parameters. I think this is more correct and looks like what you were talking about.
Not sure I completely follow... Here's what I understood:
- The class is used when reading a Parquet file footer. (In such a case, the writer version would presumably be the actual version from the footer, or null if the file is old, was not written by Drill, etc.)
- The class is used when reading metadata, but then some fields are not used.
- In the case where the metadata is for one file (only), is the writer version stored in the metadata? So, the object created when reading metadata will have the same information as the original file?
- If the metadata is for multiple files, then the writer version is never set? (Or, should the rule be that the writer version is set if all files have the same version?)
Since this is a bit hard to understand, it would be very helpful to describe all this in a Javadoc comment for the class.
I tried to figure this out and found that this class isn't used for reading the parquet footer. For that purpose ParquetMetadata from parquet-hadoop-1.8.1-drill-r0.jar is used.
This class is used for an entire parquet directory with a metadata cache file. When the cache file is absent, an empty instance of this class (with null fields) is created.
I added some javadoc. Hope it helps.
this.directories = directories;
this.columnTypeInfo = columnTypeInfo;
this.drillVersion = DrillVersionInfo.getVersion();
this.writerVersion = ParquetWriter.WRITER_VERSION;
Given a set of files and directories, we're assuming the writer version? This is supposed to be the version of the Drill Parquet writer that wrote the actual Parquet files, right? Not the version of the metadata files that hold information about the Parquet files? Or, are we using the same writer version for both purposes? If we need two versions (one for Parquet, another for metadata), then we should introduce a separate Parquet metadata version. (Or, I'm misunderstanding the code...)
  }
}

public static boolean isDrillVersionHasCorrectDates(String drillVersion) throws VersionParser.VersionParseException {
Might flow better as "drillVersionHasCorrectDates"
But, see comment above about whether we need this check.
try {
  if (drillVersion != null) {
    return (writerVersion != null && Integer.parseInt(writerVersion) >= 2) || Boolean.valueOf(isDateCorrect)
        || isDrillVersionHasCorrectDates(drillVersion)
Isn't the Drill version check redundant? We know that all Drill versions from 1.9.0 onwards will have a Drill Parquet writer version in the file. So, we only need to check the writer version, if we have it. If we don't have it, then the only info we have is the Drill version.
We might want to explain this logic in a comment.
The purpose of adding the writer version is that all future format decisions can be made on the writer version, independent of the Drill version. For example, suppose we change something in Drill 1.10. Drill 1.10.0-SNAPSHOT will start with writer version 2. Later we'll make a change and 1.10.0-SNAPSHOT will write files with writer version 3. The Drill version is ambiguous; the writer version is spot on.
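A rough sketch of the check order being described (assumed names and signatures, not the PR's exact code): the writer version, when present, is authoritative; the Drill version only matters for files that predate it.

class DateCorrectnessSketch {
  static boolean datesAreCorrect(String writerVersion, String isDateCorrect, String drillVersion) {
    if (writerVersion != null) {
      return Integer.parseInt(writerVersion) >= 2;    // 1.9.0+ files always carry the writer version
    }
    if (Boolean.parseBoolean(isDateCorrect)) {
      return true;                                    // interim flag written by 1.9.0-SNAPSHOT builds
    }
    return drillVersionHasCorrectDates(drillVersion); // only the Drill version is left to go on
  }

  static boolean drillVersionHasCorrectDates(String drillVersion) {
    return false; // placeholder: a real check would parse and compare the Drill version string
  }
}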
}

public static VersionParser.ParsedVersion parseDrillVersion(String drillVersion) throws VersionParser.VersionParseException {
  return VersionParser.parse("drill version " + drillVersion + " (build 1234)");
The " (build 1234)" deserves some explanation. Does the parse function demand it so that we have to make up something here just to get the version to parse?
public static final String DRILL_VERSION_PROPERTY = "drill.version";
public static final String IS_DATE_CORRECT_PROPERTY = "is.date.correct";
public static final String WRITER_VERSION_PROPERTY = "parquet-writer.version";
We know we are Drill, but others might be confused. Perhaps "drill-writer.version" or "drill.writer-version" (to be consistent with "drill.version".)
Done. Parquet files for unit tests were regenerated as well.
public class ParquetWriter extends AbstractWriter {
  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ParquetWriter.class);

  public static final int WRITER_VERSION = 2;
This deserves an explanation. Something like:
Version of Drill's Parquet writer. Increment this version (by 1) any time we make any format change to the file. Format changes include 1) supporting new data types, 2) changes to the format of data fields, 3) adding new metadata to the file footer, etc.
Newer readers must be able to read old files. The Writer version tells the Parquet reader how to interpret fields or metadata when that data changes format from one writer version to another.
I copied this comment into the project.
Force-pushed b3f5cea to 6c2d5ec
vdiravka
left a comment
GitHub swallowed some comments, but I answered on "jackson-serialization" in one comment. Regarding drill version checking - I agree with you; I deleted it.
Force-pushed ab69f7b to 1fda940
@paul-rogers Could you please review? I answered the last comment and rebased the branch, resolving conflicts (ParquetTableMetadata_v3 was added).
paul-rogers
left a comment
Just a few minor suggestions. Otherwise, LGTM.
}
ParquetTableMetadata_v3 parquetTableMetadata = new ParquetTableMetadata_v3(true);
ParquetTableMetadata_v3 parquetTableMetadata = new ParquetTableMetadata_v3(DrillVersionInfo.getVersion(),
    ParquetWriter.WRITER_VERSION);
I'm a bit confused. The writer version applies to the Parquet files which Drill writes. (Or, at least, that was the intention.)
Here, we're talking about metadata. There may well be a metadata writer, but that should be a different writer, with a different version.
Not sure we want to initialize the metadata object with the current writer version: there seems to be no correlation between the metadata object and the writer version.
On the other hand, the metadata can certainly hold the writer version, but it must be the value read from the Parquet file itself; not a value set by the code. Else, we have the difficult problem of making sure that the code-set version number agrees with the actual version number in the file.
is.date.correct or parquet-writer.version was needed in the metadata cache file for quick detection of date value correctness. Otherwise we need to check the files.rowGroups.columns.mxValue values from this cache file.
But having thought a little, I've understood that thanks to the newly added ParquetTableMetadata_v3 we can check:
if the version of the parquet metadata cache file is 3, the date values are definitely correct. Otherwise (when the parquet metadata cache file was generated earlier) we need to check the date values from this file.
So writerVersion is redundant in ParquetTableMetadataBase now. I deleted it. Does it make sense?
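Roughly, the rule described above could look like this (hypothetical names; the real code works on the parsed metadata object):

class CacheDateCheckSketch {
  static boolean cacheDatesAreCorrect(String metadataVersion, boolean minMaxDateValuesLookValid) {
    if ("v3".equals(metadataVersion)) {
      return true;                     // a v3 cache file can only have been written after the date fix
    }
    return minMaxDateValuesLookValid;  // older cache: fall back to inspecting files.rowGroups.columns.mxValue
  }
}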
(int) (UTC.getDateTimeMillis(5000, 1, 1, 0) / DateTimeConstants.MILLIS_PER_DAY);

/**
 * The version of drill parquet writer with date values corruption fix
Maybe explain this a bit better. Something like:
Version 2 (and later) of the Drill Parquet writer uses the date format described (in the Parquet spec? URL?). Prior versions had dates formatted (copy description from above.)
Done
/**
 * The version of drill parquet writer with date values corruption fix
 */
public static final int DRILL_WRITER_VERSION_WITHOUT_CORRUPTION = 2;
Maybe call this DRILL_WRITER_VERSION_STD_DATE_FORMAT
The old format was not "corrupted", it just used a date format that was non-standard.
I don't mind. It's a suitable name. Done.
String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY);
String writerVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY);
// This flag can be present in parquet files which were generated with 1.9.0-SNAPSHOT drill version
final String isDateCorrectFlag = "is.date.correct";
Maybe here you want to special case the "is.date.correct" flag.
- If the writer version is present, use it.
- If "is.date.correct" is present, set the writer version to "2".
- If neither are present, set the writer version to 1.
That way we don't have to have (much) extra logic for the "is.date.correct" handling.
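A sketch of that normalization (assumed property names; the writer-version property was later renamed as discussed above): everything is mapped onto a single writer-version number so no other code needs to know about is.date.correct.

import java.util.Map;

class WriterVersionSketch {
  static int effectiveWriterVersion(Map<String, String> keyValueMetaData) {
    String writerVersion = keyValueMetaData.get("drill-writer.version");
    if (writerVersion != null) {
      return Integer.parseInt(writerVersion);   // 1.9.0+ files: trust the recorded version
    }
    if (keyValueMetaData.containsKey("is.date.correct")) {
      return 2;                                 // interim 1.9.0-SNAPSHOT flag implies the fixed date format
    }
    return 1;                                   // neither present: an old (pre-fix) file
  }
}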
Done
 * 1) supporting new data types,
 * 2) changes to the format of data fields,
 * 3) adding new metadata to the file footer, etc.
 * Newer readers must be able to read old files. The Writer version tells the Parquet reader how to interpret fields
This is Javadoc, so needs HTML formatting. For the list:
<ul>
<li>Supporting new data types,</li>
...
</ul>
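Put together, the HTML-formatted Javadoc might read roughly like this (illustrative only):

/**
 * Version of Drill's Parquet writer. Increment this version (by 1) any time we make a format
 * change to the file. Format changes include:
 * <ul>
 *   <li>supporting new data types,</li>
 *   <li>changes to the format of data fields,</li>
 *   <li>adding new metadata to the file footer, etc.</li>
 * </ul>
 * Newer readers must be able to read old files. The writer version tells the Parquet reader
 * how to interpret fields or metadata when that data changes format from one writer version
 * to another.
 */
public static final int WRITER_VERSION = 2;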
Done.
@parthchandra Could you please review this PR?
DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness.
if (drillVersion != null) {
  return Boolean.valueOf(isDateCorrect) ? DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
      : DateCorruptionStatus.META_SHOWS_CORRUPTION;
int writerVersion = (stringWriterVersion != null) ? Integer.parseInt(stringWriterVersion)
Use if-then-else to make this a little easier to read?
Agree. It makes sense.
I even found one redundant check in the return statement.
Changes in a new commit.
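A sketch of the if-then-else shape being discussed (assumed names; the enum is simplified here):

class CorruptionCheckSketch {
  enum DateCorruptionStatus { META_SHOWS_CORRUPTION, META_SHOWS_NO_CORRUPTION }

  static DateCorruptionStatus check(String stringWriterVersion, String isDateCorrect) {
    int writerVersion;
    if (stringWriterVersion != null) {
      writerVersion = Integer.parseInt(stringWriterVersion);
    } else if (Boolean.parseBoolean(isDateCorrect)) {
      writerVersion = 2;  // interim 1.9.0-SNAPSHOT flag
    } else {
      writerVersion = 1;  // old file, written before the date fix
    }
    return writerVersion >= 2 ? DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
                              : DateCorruptionStatus.META_SHOWS_CORRUPTION;
  }
}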
DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness. This closes apache#644
+1
DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness. This closes apache#644
* Updated plugin version to 1.7.0-SNAPSHOT
* MD-789: Query with condition involving addition of DATE and INTERVAL types returns no results [MapR-DB JSON Tables]
+ Added `enablePushdown` option to enable/disable all filter pushdown, enabled by default.
* MD-813: Improve count(*) queries against MapR-DB Json tables.
+ Fail query on schema change.
+ Added a configuration option 'ignoreSchemaChange', which when enabled, drops the rows from the result
* Update MapR v5.1.0 artifacts version
* DRILL-4199: Add Support for HBase 1.X
* Updated plugin version to 1.9.0-SNAPSHOT
* Explicitly specify `hbase-server` dependency in Hive storage plugin.
Hive's HBaseStorageHandler uses HBase's TableInputFormat which is in hbase-server module.
* Exclude 'drill-memory-base' and 'tpch-sample-data' jars from jars/3rdparty folder.
* DRILL-4886: Modifying projects POMs to align with Drill's build and distribution.
The default build/test/packaging behavior for mapr-format-plugin module are
1. BUILD of mapr-format-plugin is ENABLED.
2. Unit tests of mapr-format-plugin module are DISABLED (use `-Pmapr` to enable).
3. Packaging of mapr-format-plugin is DISABLED (use `-Pmapr` to enable).
Please see LEGAL-251 for discussion/conclusion regarding inclusion of source code with non-open-source dependency.
* DRILL-4894: Fix unit test failure in 'storage-hive/core' module
Exclude 'hadoop-mapreduce-client-core' and 'hadoop-auth' as transitive dependencies from 'hbase-server'
* DRILL-3898 : Sort spill was modified to catch all errors, ignore repeated errors while closing the new group and issue a more detailed error message.
close apache/drill#591
* DRILL-4771: Drill should avoid doing the same join twice if count(distinct) exists
close apache/drill#588
* DRILL-4906: CASE Expression with constant generates class exception
Tests for different data types
close apache/drill#598
* DRILL-4911: Avoid plan serialization in SimpleParallelizer when debug logging is not enabled.
* DRILL-4618: Correct the usage of random flag in Hive function registry
+ Function visitor should not use previous function holder if this function is non-deterministic
closes #509
* DRILL-4862: Fix binary_string to use an injected buffer as out buffer
+ Previously, binary_string used the input buffer as output buffer. So after calling binary_string, the original content was destroyed. Other expressions/ functions that need to access the original input buffer get wrong results.
+ This fix also sets readerIndex and writerIndex correctly for the output buffer, otherwise the consumer of the output buffer will hit issues.
closes #604
* DRILL-4927: Add support for Null Equality Joins
These changes are a subset of the original pull request from DRILL-4539 (PR-462).
- Added changes to support Null Equality Joins;
- Created tests for it.
close apache/drill#603
* DRILL-4452: Uses Apache Calcite Avatica driver vs Optiq driver for Drill JDBC
Drill JDBC driver uses Optiq Avatica as its basis, but this dependency has
been moved over to Calcite, for quite some time without Drill code being
updated for it.
This patch updates Avatica version to the version from Calcite
(1.4.0-drill-r19). It also refactors Drill JDBC driver to comply with the
packages and API changes in Avatica. Finally it fixes the SQL types for
lists and structs, since Drill doesn't support java.sql.Array and
java.sql.Struct interfaces.
this closes #395
Change-Id: Ia608adf900e8708d9e6f6f58ed41e104321a9914
* DRILL-4880: Support JDBC driver registration using ServiceLoader
Support loading Drill driver using ServiceLoader. From the user perspective,
it means being able to use the driver without registering it first, like by using
Class.forName("org.apache.drill.jdbc.Driver") for example.
this closes #596
Change-Id: Id26922ee42bef5fbce46ac2bcbb83f1859e9bb48
* DRILL-4930: Fix Metadata results ordering
Change MetadataProvider to return metadata results ordered (following
convention used by ODBC and JDBC specs).
this closes #614
Change-Id: Iff59b7fada7040602f1735bccc13bc6bf5c9a252
* DRILL-4925: Add tableType filter to GetTables metadata query
- Adding tableType filter to GetTablesReq query (needed for JDBC and ODBC
drivers).
- Fix table type returned by sys and INFORMATION_SCHEMA tables
- Also fixes some protobuf typos to related classes.
this closes #612
Change-Id: If95246a312f6c6d64a88872936f516308874c2d2
* DRILL-4870 drill-config.sh sets JAVA_HOME incorrectly for the Mac
This closes #605
* DRILL-4203: Fix DrillVersionInfo to make it provide a valid version number even during the unit tests.
This is now a build-time generated class, rather than one that looks on the
classpath for META-INF files.
This pattern for file generation with parameters passed from the POM files
was borrowed from parquet-mr.
* DRILL-4203: Fix date values written in parquet files created by Drill
Drill was writing non-standard dates into parquet files for all releases
before 1.9.0. The values have been read correctly by Drill, but
external tools like Spark reading the files will see corrupted values for
all dates that have been written by Drill.
This change corrects the behavior of the Drill parquet writer to correctly
store dates in the format given in the parquet specification.
To maintain compatibility with old files, the parquet reader code has
been updated to check for the old format and automatically shift the
corrupted values into corrected ones.
The test cases included here should ensure that all files produced by
historical versions of Drill will continue to return the same values they
had in previous releases. For compatibility with external tools, any old
files with corrupted dates can be re-written using the CREATE TABLE AS
command (as the writer will now only produce the specification-compliant
values, even if after reading out of older corrupt files).
While the old behavior was a consistent shift into an unlikely range
to be used in a modern database (over 10,000 years in the future), these are still
valid date values. In the case where these may have been written into
files intentionally, and we cannot be certain from the metadata if Drill
produced the files, an option is included to turn off the auto-correction.
Use of this option is assumed to be extremely unlikely, but it is included
for completeness.
This patch was originally written against version 1.5.0, when rebasing
the corruption threshold was updated to 1.9.0.
Added regenerated binary files, updated metadata cache files accordingly.
One small fix in the ParquetGroupScan to accommodate changes in master that changed
when metadata is read.
Tests for bugs revealed by the regression suite.
Fix drill version number in metadata file generation
* DRILL-4203: Parquet File. Date is stored wrongly - Added new extra field in the parquet meta info "is.date.correct = true"; - Removed unnecessary double conversion of value with Julian day; - Added ability to correct corrupted dates for parquet files with the second version old metadata cache file as well.
This closes #595
* DRILL-4653: Malformed JSON should not stop the entire query from progressing
This closes #518
* DRILL-4726: Dynamic UDF Support
1) Configuration / parsing / options / protos
2) Zookeeper integration
3) Registration / unregistration / lazy-init
4) Unit tests
This closes #574
* DRILL-3178: csv reader should allow newlines inside quotes
This closes #593
* DRILL-4369: Exchange name and version infos during handshake
There's no name and version exchanged between client and server over the User RPC
channel.
On client side, having access to the server name and version is useful to expose it
to the user (through JDBC or ODBC api like DatabaseMetadata#getDatabaseProductVersion()),
or to implement fallback strategy when some recent API are not available (like
metadata API).
On the server side, having access to the client version might be useful for audit
purposes and eventually to implement fallback strategy if it doesn't require a RPC
version change.
this closes #622
* DRILL-4950: Remove incorrect false condition; consume all empty batches
closes #621
* DRILL-4921: Added -P to "cd" in scripts to allow for symbolic links
close apache/drill#623
* DRILL-4954: Fix printing Strings when `allTextMode` is ON for MapR-DB JSON tables.
close apache/drill#624
* DRILL-4905: Push down the LIMIT to the parquet reader scan to limit the numbers of records read
close apache/drill#597
* DRILL-4560: When new bits register, invoke listeners in ZKClusterCoordinator
close apache/drill#626
Original JIRA title: “ZKClusterCoordinator does not call
DrillbitStatusListener.drillbitRegistered for new bits”
* DRILL-4968: Add column size to ColumnMetadata
Add a column size to ColumnMetadata so that JDBC and ODBC clients share
the implementation and don't have to recompute it client side.
this closes #631
* DRILL-4826: Query against INFORMATION_SCHEMA.TABLES degrades as the number of views
increases
This closes #592
* DRILL-4964: Make Drill reconnect to hive metastore after hive metastore is restarted.
Drill fails to connect to hive metastore after hive metastore is restarted unless drillbits are restarted.
Changes: For methods DrillHiveMetaStoreClient.getAllDatabases() and DrillHiveMetaStoreClient.getAllTables(),
the HiveMetaStoreClient wraps MetaException and TException both into MetaException. In case of connection
failure which is thrown as TException it is difficult to categorize at DrillClient level. The fix is to
close the older connection and reconnect in the case of these 2 APIs. In all other cases a proper set of exceptions
are thrown where we can handle each one individually.
close apache/drill#628
* DRILL-4972: Remove setDaemon(true) call in WorkManager.StatusThread
closes #633
* DRILL-4974: Add missing null check in FindPartitionConditions.analyzeCall()
close apache/drill#634
* DRILL-4967: Adding template_name to source code generated using freemarker template.
close apache/drill#629
* DRILL-4884: Fix IOB exception in limit n query when n is beyond 65535.
close apache/drill#584
* DRILL-3423: Initial HTTPD log plugin. Needs tests. Would be good to improve the timestamp and cookies behaviors since we can make those more type specific.
* DRILL-3423: Adding HTTPd Log Parsing functionality including full pushdown, type remapping and wildcard support.
Pushed through the requested columns for push down to the parser. Added more tests to cover a few more use cases. Ensured that user query fields are now completely consistent with returned values.
* DRILL-3243: Added CSG mods. Fixed field names.
Removed old test files
Added Parse_url and parse_query() functions
Fix unit test
This closes #607
* DRILL-4373: Drill and Hive have incompatible timestamp representations in parquet - added sys/sess option "store.parquet.int96_as_timestamp"; - added int96 to timestamp converter for both readers; - added unit tests;
This closes #600
* DRILL-4420: C++ API for metadata access and prepared statements
Add support to the C++ client for metadata querying and prepared
statement requests.
Part of the metadata API, add methods to query for server capabilities.
As of now, this interface is not backed up by any RPC exchange so
the information is pretty much static, and matches Drill 1.8.0
current capabilities.
* DRILL-4853: Update C++ protobuf source files
Add support for prepared statements and metadata querying
* DRILL-1268: Add unit test to C++ native client
Add CppUnit unit test to the C++ native client
* DRILL-1996: Add cancel method to Drill C++ connector
This closes #602
* DRILL-4792: Include session options used for a query as part of the profile
closes #551
* DRILL-4951: Do Guava.patch earlier so we can run single Hbase's unit test through command line or IDE
closes #636
* DRILL-4927 (part 2): Add support for Null Equality Joins (mixed comparators)
This changes are a subset of the original pull request from DRILL-4539 (PR-462).
- Added changes to support mixed comparators;
- Added tests for it.
closes #635
* DRILL-4986: Allow users to customize the Drill log file name
Adds a new user-settable environment variable to choose the log file
base name.
closes #641
* DRILL-4989: Fix TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange
closes #643
* DRILL-4800: Use a buffering input stream in the Parquet reader
* DRILL-4800: Add AsyncPageReader to pipeline PageRead Use non tracking input stream for Parquet scans. Make choice between async and sync reader configurable. Make various options user configurable - choose between sync and async page reader, enable/disable fadvise Add Parquet Scan metrics to track time spent in various operations
* DRILL-4800: Parallelize column reading. Read/Decode fixed width fields in parallel Decoding var length columns in parallel Use simplified decompress method for Gzip and Snappy decompression. Avoids concurrency issue with Parquet decompression. (It's also faster). Stress test Parquet read write Parallel column reader is disabled by default (may perform less well under higher concurrency)
* DRILL-4800: Various fixes. Fix buffer underflow exception in BufferedDirectBufInputStream. Fix writer index for in64 dictionary encoded types. Added logging to help debug. Fix memory leaks. Work around issues with of InputStream.available() ( Do not use hasRemainder; Remove check for EOF in BufferedDirectBufInputStream.read() ). Finalize defaults. Remove commented code.
Addressed review comments
This closes #611
* DRILL-5400: Fix NPE in corrupt date detection
This closes #646
* DRILL-1950: Initial prototype patch for parquet filter pushdown.
Use three new classes from Adam's patch.
* DRILL-1950: Update parquet metadata cache format to include both min/max and additional column type information.
Parquet meta cache format change:
1. include both min/max in ColumnMetaData if column statistics is available,
2. include precision/scale/repetitionLevel/definitionLevel in ColumnTypeMetaData (precision/scale/definitionLevel is for future use).
* DRILL-1950: Parquet rowgroup level filter pushdown in query planning time.
Implement Parquet rowgroup level filter pushdown. The filter pushdown is performed in
in Drill physical planning phase.
Only a local filter, which refers to columns in a single table, is qualified for filter pushdown.
A filter may be qualified if it is a simple comparison filter, or a compound "and/or" filter consists of
simple comparison filter. Data types allowed in comparison filter are int, bigint, float, double, date,
timestamp, time. Comparison operators are =, !=, <, <=, >, >=. Operands have to be a column of the above
data types, or an explicit cast or implicit cast function, or a constant expressions.
This closes #637
* DRILL-4945: Report INTERVAL exact type as column type name
closes #618
* DRILL-4969: Basic implementation of display size for column metadata
Add a simple implementation of display size, based on the ODBC table
available at: https://msdn.microsoft.com/en-us/library/ms713974(v=vs.85).aspx
closes #632
* DRILL-4823: Fix OOM while trying to prune partitions with reasonable data size
closes #560
* DRILL-4674: Allow casting to boolean the same literals as in Postgres
closes #610
* Add Sudheesh's PGP Key
* DRILL-5007: Dynamic UDF lazy-init does not work correctly in multi-node cluster
closes #650
* DRILL-5009: Skip reading of empty row groups while reading Parquet metadata
+ We will no longer attempt to scan such row groups.
closes #651
* DRILL-5047: When session option is string, query profile is displayed incorrectly on Web UI
closes #655
* DRILL-4935: Allow drillbits to advertise a configurable host address to
Zookeeper
This closes #647
* DRILL-4979: Make port of the DataConnection configurable
This closes #649
* DRILL-4831: Running refresh table metadata concurrently randomly fails with JsonParseException
This closes #653
* DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness.
This closes #644
* DRILL-4995: Allow lazy init when dynamic UDF support is disabled
This closes #645
* Update version to 1.10.0-SNAPSHOT
* DRILL-4604: Generate warning on Web UI if drillbits version mismatch is detected
close apache/drill#482
* DRILL-5094: Comparator should guarantee transitive attribute.
close apache/drill#675
* DRILL-5086: Fix conversion of min and max values to appropriate data type.
close apache/drill#674
* DRILL-5050: C++ client library has symbol resolution issues when loaded by a process that already uses boost::asio
Build with Boost static libs and drill_boost namespace on mac. Added
readme with instructions
DRILL-5050: Addressed review comments
DRILL-5050: address more review comments
close apache/drill#659
* DRILL-4982: Separate Hive reader classes for different data formats to improve performance.
1, Separating Hive reader classes allows optimization to apply on different classes in optimized ways. This separation effectively avoids the performance degradation of scan.
2, Do not apply Skip footer/header mechanism on most Hive formats. This skip mechanism introduces extra checks on each incoming records.
close apache/drill#638
* DRILL-5015: Randomly select the drillbit from the list provided by user in connection string
Note: Improved the connection string validation and error handling during parsing.
Added unit test for the new parsing mechanism.
close apache/drill#648
* DRILL-5091: Handle new Java 8 JDBC functions and add missing test config parameter
* JDBC unit test fails to set up test storage plugin config on Java 8
* Cleans up compiler warnings seen in Eclipse.
* Added TODO's based on code reviews
closes #676
* DRILL-5108: Reduce output from Maven git-commit-id-plugin
The git-commit-id-plugin grabs information from Git to display
during a build. It prints many e-mail addresses and other generic
project information. As part of the effort to trim down unit test
output, we propose to turn off the verbose output from this plugin
by default.
closes #680
* DRILL-5113: Upgrade Maven RAT plugin to avoid annoying XML errors
Upgrade to eliminate XML compiler warnings that appear for each
sub-project in the Drill build.
Also fixes a Maven warning about a duplicated version error for
the dependency plugin (version is inherited from root pom.xml).
closes #682
* DRILL-5048: Fix type mismatch error in case statement with null timestamp
closes #657
* DRILL-5044: Fix retry logic to handle VersionMismatchException by not deleting jars in remote UDFs area
closes #669
* DRILL-5085: Update description for dynamic UDFs directories in drill-env.sh and drill-module.conf
closes #672
* DRILL-5112: Fix config in PopUnitTestBase
Tests rely on command-line settings in the pom.xml file. Those
settings are not available when tests are run in Eclipse.
Replicated required settings into the base test class (as in
BaseTestQuery).
closes #681
* DRILL-4812: Fix FileSelection#handleWildCard to use normalized path separator
closes #627
* DRILL-5056: Fix UserException to write full message to log
A case occurred in which the External Sort failed during spilling.
All that was written to the log was:
2016-11-18 ... INFO o.a.d.e.p.i.xsort.ExternalSortBatch - User Error Occurred
Modified the logging code to provide more information to aid tracking down problems that occur in the field.
closes #665
* DRILL-5119: Update MapR version to 5.2.0.40963-mapr
closes #688
* DRILL-4987: Use ImpersonationUtil to get process user’s groups in RemoteFunctionRegistry
closes #642
* DRILL-5081: Lower logging level for corrupt dates message
* introduced in DRILL-4203
closes #691
* DRILL-4938: Report UserException when constant expression reduction fails
closes #689
* DRILL-5065: Optimize count(*) queries on MapR-DB JSON Tables
In MapR-DB v5.2.0, we enabled '_id' only projection for JSON
tables. Hence, we can now optimize the following queries:
a. count(*) by projecting only the '_id' column.
b. '_id' only projections, including count(_id)
Change the format plugin config parameter name.
Fix setter of config parameter `disableCountOptimization` for drill-maprdb plugin
closes #678
* DRILL-5123: Write query profile after sending final response to client to improve latency
In testing a particular query, I used a test setup that does not write
to the "persistent store", causing query profiles to not be saved. I
then changed the config to save them (to local disk). This produced
about a 200ms difference in query run time as perceived by the client.
I then moved writing the query profile after sending the client the
final message. This resulted in an approximately 100ms savings, as
perceived by the client, in query run time on short (~3 sec.) queries.
close apache/drill#692
* DRILL-5117: Compile error when query a json file with 1000+columns
close apache/drill#686
* DRILL-5098: Improving fault tolerance for connection between client and foreman node.
Adding tries config option in connection string. Improving fault tolerance in Drill client when trying to make first connection with foreman. The client will try to connect to min(tries, num_drillbits) unique drillbits unless a successful connection is established.
HYGIENE: Refactoring BasicClient::close to call RemoteConnection::close
close apache/drill#679
* DRILL-5051: Fix incorrect computation of 'fetch' in LimitRecordBatch when 'offset' is specified
close apache/drill#662
* DRILL-5032: Drill query on hive parquet table failed with OutOfMemoryError: Java heap space
close apache/drill#654
* DRILL-5052: Option to debug generated Java code using an IDE
Provides a second compilation path for generated code: “plain old Java”
in which generated code inherits from its templates. Such code can be
compiled directly, allowing easy debugging of generated code.
Also shows how to generate two classes in the External Sort Batch as “plain
old Java” to enable IDE debugging of that generated code. Required
minor clean-up of the templates.
Fixes some broken toString( ) methods in code generation classes
Fixes a variety of small compilation warnings
Adds Java doc to a few classes
Includes clean-up from code review comments.
close apache/drill#660
* DRILL-5159: Drill's ProjectMergeRule should operate on RelNodes with same convention trait.
close apache/drill#705
* DRILL-5127: Revert the fix for DRILL-4831
close apache/drill#718
* DRILL-5121 Fix for memory leak. Changes fieldVectorMap in ScanBatch to a CaseInsensitiveMap
close apache/drill#690
* DRILL-5039: NPE - CTAS PARTITION BY (<char-type-column>)
close apache/drill#706
* DRILL-5116: Enable generated code debugging in each Drill operator
DRILL-5052 added the ability to debug generated code. The reviewer suggested
permitting the technique to be used for all Drill operators. This PR provides
the required fixes. Most were small changes, others dealt with the rather
clever way that the existing byte-code merge converted static nested classes
to non-static inner classes, with the way that constructors were inserted
at the byte-code level and so on. See the JIRA for the details.
This code passed the unit tests twice: once with the traditional byte-code
manipulations, a second time using "plain-old Java" code compilation.
Plain-old Java is turned off by default, but can be turned on for all
operators with a single config change: see the JIRA for info. Consider
the plain-old Java option to be experimental: very handy for debugging,
perhaps not quite tested enough for production use.
close apache/drill#716
* DRILL-4996: Parquet Date auto-correction is not working in auto-partitioned parquet files generated by drill-1.6
- Changed detection approach of corrupted date values for the case when parquet files are generated by drill:
the corruption status is determined by looking at the min/max values in the metadata;
- Appropriate refactoring of TestCorruptParquetDateCorrection.
This closes #687
* DRILL-5105: comment out unnecessary recursive buffer size check
This closes #715
* DRILL-5152: Enhance the mock data source: better data, SQL access
Provides an enhanced version of the mock data source. See the JIRA
entry for motivation, package-info.java for details of operation.
Revisions suggested by code review
Also includes additional comments and a few more compiler warning
cleanups.
This closes #708
* DRILL-4919: Fix select count(1) / count(*) on csv with header
This closes #714
* DRILL-4558: BSonReader should prepare buffer size as actual need
This closes #696
* DRILL-4868: fix how hive function set DrillBuf.
This closes #695
* DRILL-5172: Display elapsed time for queries in the UI
Displays the elapsed time for running queries and the total duration of completed/failed/cancelled queries in the list of query profiles displayed, and within a query's profile page as well.
The query runtime is displayed in '[hr] [min] sec'.
e.g. A duration of 25,254,321ms is displayed 7 hr 00 min 54.321 sec
This closes #721
* DRILL-4956: Temporary tables support
close apache/drill#666
* DRILL-5097: Using store.parquet.reader.int96_as_timestamp gives IOOB whereas convert_from works
close apache/drill#697
* DRILL-5104: Foreman should not set external sort memory for a physical plan
Physical plans include a plan for memory allocations. However, the code
path in Foreman replans external sort memory, even for a physical plan.
This makes it impossible to use a physical plan to test memory
configuration.
This change avoids changing memory settings in a physical plan; while
preserving the adjustments for logical plans or SQL queries.
Revised to put a property in the plan itself. Old plans, and those
generated from SQL, will have memory allocations applied. Plans
marked as already "resource management" planned will be used as-is.
Includes a unit test that demonstrates the new behavior.
close apache/drill#703
* DRILL-5164: Equi-join query results in CompileException when inputs have large number of columns.
close apache/drill#711
* DRILL-5218: Support optionally disabling heartbeats from C++ client
closes #726
* DRILL-4764: Parquet file with INT_16, etc. logical types not supported by simple SELECT
closes #673
* DRILL-5126: Provide simplified, unified "cluster fixture" for test
Drill provides a robust selection of test frameworks that have evolved to satisfy the needs of a variety of test cases.
However, some do some of what a given test needs, while others do other parts. Also, the various frameworks make
assumptions (in the form of boot-time configuration) that differs from what some test may need, forcing the test
to start, then stop, then restart a Drillbit - an expensive operation.
Also, many ways exist to run queries, but they all do part of the job. Several ways exist to change
runtime options.
This checkin shamelessly grabs the best parts from existing frameworks, adds a fluent builder facade
and provides a complete, versatile test framework for new tests. Old tests are unaffected by this
new code.
An adjustment was made to allow use of the existing TestBuilder mechanism. TestBuilder used to
depend on static members of BaseTestQuery. A "shim" allows the same code to work in the old
way for old tests, but with the new ClusterFixture for new tests.
Details are in the org.apache.drill.test.package-info.java file.
This commit modifies a single test case, TestSimpleExternalSort, to use the new framework.
More cases will follow once this framework itself is committed.
Also, the framework will eventually allow use of the extended mock data source
from SQL. However, that change must await checkin of the mock data source changes.
Includes a LogFixture that allows setting logger options per test to simplify debugging via tests.
Also includes a “summary listener” to run a query and return a summary of the
run. Handy to simply verify that a query runs and to time it.
Added an async query runner for tests that want to run multiple
concurrent queries.
closes #710
* DRILL-3562: Query fails when using flatten on JSON data where some documents have an empty array
closes #713
* DRILL-5043: Function that returns a unique id per session/connection similar to MySQL's CONNECTION_ID() #685
* DRILL-5207: Improve Parquet Scan pipelining. Add a configurable AsyncPageReader Queue. Enforce total size of parquet row group. Do not initialize BufferedDirectBufInputStream buffer in init. Wait for first read. Change default size of BufferedDirectBufInputStream. Do not invoke getOptions too many times in Parquet reader. Add metrics for processing time, and decoding time for varlen and fixedlen columns.
This closes #723
* DRILL-5215: CTTAS: disallow temp tables in view expansion logic
This closes #725
* DRILL-5220: Provide API to set application/client names in C++ connector
Add method to DrillClientConfig to set the client and the application names
in the C++ connector.
Allow the ODBC driver (or any user of the C++ connector) to provide more
specific information, like the application using the client.
This closes #728
* DRILL-5224: CTTAS: fix errors connected with system path delimiters (Windows)
This closes #731
* DRILL-5237: FlattenRecordBatch loses nested fields from the schema when returns empty batches for the first time
This closes #735
* DRILL-5238: CTTAS: unable to resolve temporary table if workspace is indicated without schema
This closes #736
* DRILL-5240: Parquet - fix unnecessary object creation while checking for null values in nullable var length columns
This closes #740
* DRILL-5223: Drill should ensure balanced workload assignment at node level in order to get better query performance
This closes #730
* DRILL-5241: JDBC proxy driver: Do not put null value in map
This closes #724
* DRILL-5243: Fix TestContextFunctions.sessionIdUDFWithinSameSession unit test
This closes #743
* DRILL-5219: Relax user properties validation in C++ client
Unlike Java client, C++ client only allows user properties present in a
whitelist. Relax this restriction so that user can add extra properties.
This closes #727
* DRILL-5040: Parquet writer unable to delete table folder on abort
close apache/drill#744
* DRILL-4864: Add ANSI format for date/time functions
DRILL-4864: Add ANSI format for date/time functions(review changes)
close apache/drill#581
* DRILL-5080: Memory-managed version of external sort
Please see JIRA entry for reasons for revision, design spec and list of
changes.
This PR covers the changes to the external sort itself. Tests for this
operator require the test framework in DRILL-5126 and the mock data
source in DRILL-5152. Tests for this operator will be issued as a
separate PR once those two dependencies are committed.
Until then, the new operator is disabled by default. It can be enabled
using drill.sort.external.disable_managed: false.
The operator now spills before receiving a new batch. Revised memory calcs and
merge calcs to make them a bit clearer and provide more margin of error
for the power-of-two allocations used when allocating vectors.
We have two external sort implementations, but only one operator code
for both. They can use only one Metrics enum between them. When adding
new metrics to the new version, didn’t add matching metrics to the old
one. This fixes that issue. (The issue will go away once the old one is
retired.)
Revised memory calculations to reflect limit of 16 MB per vector.
Current revision limits to 16 MB per output batch to be safe. Next
revision will enforce per-vector limits to allow the overall batch to
be larger when possible.
Also simplified the merge-time calculations.
Original code provided only crude methods to learn the size of a record
batch. Adds a "RecordBatchSizer" to provide detailed analysis so the
sort can know the amount of memory used to buffer a batch, the number
of rows, and the expected row width once the rows are copied to a
spill file or the output.
Moved generic spill classes to a separate package.
Created parameters for spill batch size and merge batch size. Separated
these values in code. Deprecated the min, max spill parameters as they
no longer add much value. Minor code rearranging.
Bug fix
Fixes a corner case of merging spilled files in a low-memory condition.
Fixes from code review
close apache/drill#717
* DRILL-5263: Prevent left NLJoin with non scalar subqueries
* DRILL-5230: Translation of millisecond duration into hours is incorrect
Fixed invalid representation of readable elapsed time using `TimeUnit` class in JDK.
e.g. 4545 sec is now correctly translated as `1h15m` instead of `17h15m`
* DRILL-5242: The UI breaks when trying to render profiles having unknown metrics
Skip any metrics whose metric ID is unknown, This prevents any ArrayIndexOutOfBoundsException from being thrown and breaking the UI rendering.
* DRILL-5157: Multiple Snappy versions on class path
Multiple Snappy versions on class path; causes unit test failures.
This fix updates the Snappy library and adds dependency management to
exclude older versions brought in by Avro and Parquet.
* DRILL-5273: CompliantTextReader exhausts 4 GB memory when reading 5000 small files
Please see JIRA for details of problem and fix.
closes #750
* DRILL-5275: Sort spill is slow due to repeated allocations
Rather than create a heap buffer per vector when writing and reading,
the revised code creates a single, shared buffer used for all I/O
within a particular container. This improves performance by reducing GC
and CPU costs during I/Os.
Move I/O buffer, and methods to allocator
Allows the buffer to be shared. Especially in the sort, this is
important, as the sort may have many serializations open at once.
closes #754
* DRILL-5274: Exception thrown in Drillbit shutdown in UDF cleanup code
closes #760
* DRILL-5260: Extend "Cluster Fixture" test framework
- Config option to suppress printing of CSV and other output. (Allows
printing for single tests, not printing when running from Maven.)
- Parsing of query profiles to extract plan and run time information.
- Fix bug in log fixture when enabling logging for a package.
- Improved ZK support.
- Set up the new CTTAS default temporary workspace for tests.
- Clean up persistent storage files on disk to avoid CTTAS startup
failures.
- Provides a set of examples for how to use the cluster fixture.
closes #753
* DRILL-5259: Allow listing a user-defined number of profiles
Allow changing default number of finished queries in web UI, when starting up Drillbits.
Option provided in drill-override.conf (default=100 ; defined in drill-module.conf)
Alternatively, the page can be loaded dynamically for the same.
e.g.
https://<hostname>:8047/profiles?max=100
closes #751
* DRILL-5257: Run-time control of query profiles
Adds a run-time option to save (default) or not save query profiles.
Adds a run-time option to save query profiles in "debug" mode:
that is, after returning the last client response. (Normal mode is
to return the response before writing the profile.)
Tests for normal case are normal unit tests. Tests for debug mode
case are unit tests using the new framework that parse profiles.
The test framework is extended to save query profiles using this
new option.
Modifies the test framework to use the new options when a test
asks to save query profiles.
closes #747
* DRILL-5088: Set client's codec for toJson
closes #702
* DRILL-5196: Init MongoDB cluster when run a single test case directly through command line or IDE
Other fixes include:
+ Sync mongo-java-driver versions to newer 3.2.0
+ update flapdoodle package to latest accordingly
closes #741
* DRILL-5190: Display planning and queued time for a query's profile page
Modified UserSharedBit protobuf for marking planning and wait-in-queue end times. This will allow for accurately reporting the planning, queued and actual execution times of a query.
Planning Time:
In the absence of the planning time's end, for older profiles, the root fragment's (i.e. SCREEN operator) start time is taken as the estimated end of planning time, and as the estimated start time of the execution phase.
QueueWait Time:
We do not estimate the queue time if the planning end time is not available.
Execution Time:
We calculate the execution time based on the availability of these 2 planning times. The computation is done the following way, and reflects a decreasing level of accuracy
1. Execution time = [end(QueueWait) - endTime(Query)]
2. Execution time = [end(Planning) - endTime(Query)]
3. Execution time = [start(rootFragment) - endTime(Query)] - {Estimated}
closes #738
* DRILL-5255: Remove default temporary workspace check at drillbit start up
closes #759
* DRILL-4280: HYGIENE
+ Ignore files generated by IntelliJ in RAT plugin
* DRILL-4280: CORE (Java protocol)
+ Define SaslStatus and SaslMessage messages in protocol
+ Add "authenticationMechanisms" field to all handshakes
+ Add "saslSupport” field to UserToBitHandshake
* DRILL-4280: REFACTOR
+ Extract RemoteConnection interface, and add AbstractRemoteConnection
+ Add ServerConnection and ClientConnection interfaces
+ Add RequestHandler interface to decouple connections from how requests are handled
+ Add NonTransientRpcException
+ Remove unused classes and methods
+ Code style changes
* DRILL-5195: Publish Operator and MajorFragment Stats in Profile page
Improved UI
1. Introduction of Tooltips
2. Share of each operator as a percentages of the major fragment and of the query
- This would help identify the most CPU intensive operators within a fragment and across the query
3. Rows emitted by each operator
4. For a running query, changes to 'last update' and 'last progress' now shows the elapsed time since.
closes #756
* DRILL-4280: CORE (service login)
+ Support Drillbit login to KDC using Hadoop's
UserGroupInformation library
+ Set hostname in BootstrapContext
+ Use process user's short name in ImpersonationUtil
+ Add KerberosUtil class
* DRILL-4280: HYGIENE
+ Do not recreate DrillConfig object in PamUserAuthenticator
+ Add new factory method to CaseInsensitiveMap
+ Clean documentation
* DRILL-4280: CORE (security package)
+ Add AuthenticatorFactory interface
+ Kerberos implementation
+ includes SaslServer and SaslClient wrappers
+ Plain implementation
+ PlainServer implements SaslServer (the JDK does not provide one for PLAIN)
for username/password based authentication
+ retrofit user authenticator
+ add logic for backward compatibility
+ Add AuthenticatorProvider interface to provide authenticator
factories, and add two implementations:
+ DrillConfig and ScanResult based AuthenticatorProviderImpl
+ Default and system property based ClientAuthenticatorProvider
+ FastSaslServerFactory caches SaslServer factories
+ FastSaslClientFactory caches SaslClient factories
+ ServerAuthenticationHandler handles authentication on server-side
+ FailingRequestHandler to fail any message received
+ AuthenticationOutcomeListener handles authentication on client-side
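For context, a minimal sketch of a PLAIN SASL client exchange using only the JDK's javax.security.sasl API; Drill's own wrappers (PlainServer, FastSaslClientFactory, ServerAuthenticationHandler) add caching and the server side, which are not shown here. Credentials, protocol and server name below are placeholders:

```java
import java.util.Collections;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.sasl.Sasl;
import javax.security.sasl.SaslClient;
import javax.security.sasl.SaslException;

// Generic JDK SASL usage for a PLAIN (username/password) exchange.
public class PlainSaslSketch {
  public static void main(String[] args) throws SaslException {
    CallbackHandler handler = callbacks -> {
      for (Callback cb : callbacks) {
        if (cb instanceof NameCallback) {
          ((NameCallback) cb).setName("alice");                    // placeholder user
        } else if (cb instanceof PasswordCallback) {
          ((PasswordCallback) cb).setPassword("secret".toCharArray()); // placeholder
        }
      }
    };

    SaslClient client = Sasl.createSaslClient(
        new String[] {"PLAIN"},      // requested mechanisms
        null,                        // authorization id
        "drill",                     // protocol name (placeholder)
        "server.example.com",        // server name (placeholder)
        Collections.emptyMap(),      // mechanism properties
        handler);

    // PLAIN sends the credentials in the initial response token.
    byte[] initial = client.hasInitialResponse()
        ? client.evaluateChallenge(new byte[0]) : new byte[0];
    System.out.println("Initial token length: " + initial.length);
    client.dispose();
  }
}
```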
* DRILL-4280: CORE (user to bit authentication, Java)
+ Add logic for authentication in UserClient and UserServer with
backward compatibility in both directions
+ Add abstract extension to ServerConnection and ClientConnection
+ Add concrete extensions to abstract connections:
BitToUserConnection and UserToBitConnection
+ Add ConnectionConfig interface with abstract and concrete
implementations to encapsulate configuration for server-side
connections
+ Encapsulate all requests handled by UserServer in
UserServerRequestHandler
+ Clear UserSession when connection is closed either by user or
bit
+ Add DrillProperties to encapsulate all connection properties
used during connection time
* DRILL-4280: CORE (revert DRILL-3242)
+ DRILL-3242 aimed to offload request handling to a secondary thread, but this feature is disabled by default due to concurrency issues
+ One of the implications of the feature was to ignore exceptions that were not of UserRpcException type. But exceptions must not be ignored; they should be handled properly, especially in the context of security
* DRILL-4280: CORE (bit to bit authentication, control)
+ Support authentication in ControlServer and ControlClient
+ Add AuthenticationCommand as an initial command after handshake
and before the command that initiates a connection
+ Add ControlConnectionConfig to encapsulate configuration
+ ControlMessageHandler now implements RequestHandler
* DRILL-4280: CORE (bit to bit authentication, data)
+ Support authentication in DataServer and DataClient
+ Add AuthenticationCommand as an initial command after handshake
and before the command that initiates a connection
+ Add DataConnectionConfig to encapsulate configuration
+ Add DataServerRequestHandler to encapsulate all handling of
requests to DataServer
* DRILL-4280: CORE (web client)
+ Disable the web server when authentication is enabled but the PLAIN mechanism
is not configured; log a warning
* DRILL-4280: CORE (C++ protocol)
* DRILL-4280: CORE (user to bit authentication, C++)
closes #578
* DRILL-4280: CORE (unit tests)
+ Modify existing tests to use new authentication configuration
+ Add TestUserBitKerberos and TestBitBitKerberos using Apache Kerby library
* Bump the max size of the jdbc-all jar to accommodate the increased size of the jar file due to new code.
* DRILL-4994: Refactor DrillCursor
Refactor DrillCursor to be more self-contained.
* DRILL-4994: Add back JDBC prepared statement for older servers
When the JDBC client is connected to an older Drill server, it always
attempted to use a server-side prepared statement, with no fallback.
With this change, the client checks the server version and falls back to the
previous client-side prepared statement (which is still limited to executing
queries and does not provide metadata).
close #613
* DRILL-4730: Update JDBC DatabaseMetaData implementation to use new Metadata APIs
Update JDBC driver to use Metadata APIs instead of executing SQL queries
close #613
* DRILL-5301: Server metadata API
Add a Server metadata API to the User protocol, to query server support
of various SQL features.
Add support to the client (DrillClient) to query this information.
Add support to the JDBC driver to query this information, if the server supports
the new API, or fallback to the previous behaviour (rely on Avatica defaults) otherwise.
close #764
* DRILL-5301: Add C++ client support for Server metadata API
Add support for the Server metadata API to the C++ client, if
available. If the API is not supported by the server, fall back
to the previous hard-coded values.
Update the querySubmitter example program to query the information.
close #764
* DRILL-5167: Send escape character for metadata queries
The escape character was not sent when doing metadata queries, which caused
the server to return incorrect results, as the pattern was interpreted
differently from what the user asked for.
close #712
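A generic JDBC sketch of why the escape character matters for metadata queries; the connection URL, schema and table names below are placeholders:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

// A literal underscore in a table-name pattern must be escaped with the
// driver's search-string escape, otherwise "_" acts as a one-character wildcard.
public class MetadataEscapeSketch {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local")) {
      DatabaseMetaData md = conn.getMetaData();
      String esc = md.getSearchStringEscape();

      // Match tables literally named "web_logs", not "web?logs".
      String pattern = "web" + esc + "_logs";
      try (ResultSet rs = md.getTables(null, "dfs.tmp", pattern, null)) {
        while (rs.next()) {
          System.out.println(rs.getString("TABLE_NAME"));
        }
      }
    }
  }
}
```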
* DRILL-5221: Send cancel message as soon as possible in C++ connector
In the C++ connector, try to send the cancel request to the server as soon as
possible: when receiving the queryId, or immediately when requested by the
user if the queryId has already been received.
close #733
* DRILL-5258: Access mock data definition from SQL
Extends the mock data source to allow using the full power of the mock
data source from an SQL query by referencing the JSON definition
file. See JIRA and package-info for details.
Adds a boolean data generator and a varying-length string generator.
Adds “mock” table stats for use in the planner.
Revisions based on code review comments
close #752
* DRILL-5284: Roll-up of final fixes for managed sort
See subtasks for details.
* Provide detailed, accurate estimate of size consumed by a record batch
* Managed external sort spills too often with Parquet data
* Managed External Sort fails with OOM
* External sort refers to the deprecated HDFS fs.default.name param
* Config param drill.exec.sort.external.batch.size is not used
* NPE in managed external sort while spilling to disk
* External Sort BatchGroup leaks memory if an OOM occurs during read
* DRILL-5294: Under certain low-memory conditions, the sort needs to be forced to merge
two batches to make progress, even though this is a bit more than
comfortably fits into memory.
close #761
* DRILL-5304: Queries fail intermittently when there is skew in data distribution
close #766
* DRILL-4963: Fix issues with dynamically loaded overloaded functions
close #701
* DRILL-5252: Fix a condition that always returns true
close #745
* DRILL-5266: Parquet returns low-density batches
Fixes one glaring problem related to bit/byte confusion.
Includes a few clean-up items found along the way.
Additional fixes from code review comments
More code clean up from code review
close #749
* DRILL-5034: Select timestamp from hive generated parquet always return in UTC
- The TIMESTAMP_IMPALA function is reverted to retain the local timezone
- TIMESTAMP_IMPALA_LOCALTIMEZONE is deleted
- Retain the local timezone for INT96 timestamp values in parquet files while
the PARQUET_READER_INT96_AS_TIMESTAMP option is on
Minor changes according to the review
Fix for the test, which relies on particular timezone
close #656
* DRILL-5290: Provide an option to build operator table once for built-in static functions and reuse it across queries.
close #757
* DRILL-5287: Provide option to skip updates of ephemeral state changes in Zookeeper
close #758
* DRILL-5208: Finding path to java executable should be deterministic
See DRILL-5208 for background. Instead of using “find” to locate the
java command, we use any information available, resorting to find
only if the “usual suspects” fail. The result is that we use the JDK
java when available, instead of randomly choosing the JDK or JRE java.
close #763
* DRILL-5313: Fix compilation issue in C++ connector
DRILL-5301 and DRILL-5167 have conflicting changes, which causes
the C++ connector to not compile: the static symbol for the search
escape string has been removed as the server might use a different one.
Fix the issue by using the current search escape string (injected from the
meta to the internal drill client when querying metadata).
close #769
* DRILL-5293: Change seed for distribution hash function to differ from that of the hash table
close #765
* DRILL-5326: Unit test failures related to the SERVER_METADATA
- added the sql type name for the "GENERIC_OBJECT";
- changed "NullCollation" in the "ServerMetaProvider" to the correct default value;
- changed RpcType to GET_SERVER_META in the appropriate ServerMethod
close #775
* Update version to 1.11.0-SNAPSHOT
* DRILL-5226: Managed external sort fixes
* Memory leak in managed sort if OOM during sv2 allocation
* "Record batch sizer" does not include overhead for variable-sized
vectors
* Paranoid double-checking of merge batch sizes to prevent OOM when the
sizes differ from expectations
* Revised logging
Addresses review comments
close apache/drill#767
* DRILL-5311: Check handshake result in C++ connector
In the C++ client connector, DrillClientImpl::recvHandshake always
returned success, even in case of a connection error (like a TCP
timeout). Only on the WIN32 platform was the error code
checked.
Remove the restriction to only check on WIN32, plus add some logging.
close apache/drill#770
* DRILL-5316: Check drillbits size before we attempt to access the vector element
close apache/drill#772
* DRILL-5330: NPE in FunctionImplementationRegistry
Fixes:
* DRILL-5330: NPE in
FunctionImplementationRegistry.functionReplacement()
* DRILL-5331:
NPE in FunctionImplementationRegistry.findDrillFunction() if dynamic
UDFs disabled
When running in a unit test, the dynamic UDF (DUDF) mechanism is not
available. When running in production, the DUDF mechanism is available,
but may be disabled.
One confusing aspect of this code is that the function registry
is given the option manager, but the option manager is not yet valid
(not yet initialized) in the function registry constructor. So, we
cannot access the option manager in the function registry constructor.
In any event, the existing system options cannot be used to disable DUDF
support. For obscure reasons, DUDF support is always enabled, even when
disabled by the user.
Instead, for DRILL-5331, we added a config option to "really" disable DUDFs.
The property, set only for tests, disables DUDF support.
Note that, in the future, this option could be generalized to
"off, read-only, on" to capture the full set of DUDF modes.
But, for now, just turning this off is sufficient.
For DRILL-5330, we use an existing option validator rather than
accessing the raw option directly.
Also includes a bit of code cleanup in the class in question.
The result is that the code now works when used in a sub-operator unit
test.
close apache/drill#777
* DRILL-5349: Fix TestParquetWriter unit tests when synchronous parquet reader is used.
close apache/drill#780
* DRILL-5352: Profile parser printing for multi fragments
Enhances the recently added ProfileParser to display run times for
queries that contain multiple fragments. (The original version handled
just a single fragment.)
Prints the query in “classic” mode if it is linear, or in the new
semi-indented mode if the query forms a tree.
Also cleans up formatting - removing spaces between parens.
Fixes from review
close apache/drill#782
* Fixed process time percent.
* Added support for getting operator profiles in a multi-fragment query.
* DRILL-5359: Fix ClassCastException when Drill pushes down filter on the output of flatten operator.
- Move findItemOrFlatten as a static method in DrillRelOptUtil.
- Exclude filter conditions if they contain item/flatten operator.
close apache/drill#786
* DRILL-4678: Tune metadata by generating a dispatcher at runtime
Main code changes are in the Calcite library.
Update Drill's Calcite version to 1.4.0-drill-r20.
close #793
* DRILL-5394: Optimize query planning for MapR-DB tables by caching row counts
close #802
* DRILL-5378: Put more information for schema change exception in hash join, hash agg, streaming agg and sort operator.
close #801
* DRILL-5369: Add initializer for ServerMetaContext
ServerMetaContext had no default constructor. The lack of one
might cause m_done to be set to true; the same goes for other members.
Add a default constructor to explicitly initialize its members.
close #791
* DRILL-5368: Fix memory leak issue in DrillClientImpl::processServerMetaResult
Fix a small memory leak by doing local allocation instead since the
object doesn't escape the function.
close #790
* DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader
close #781
* DRILL-5297: When the generated plan mismatches, PlanTest prints the generated plan along with the expected pattern
close #798
* DRILL-4971: Query encounters system error, when there aren't eval subexpressions of any function in boolean and/or expressions
- Newly evaluated blocks for boolean operators should always be enclosed in braces, since they use labels.
close #792
* DRILL-3510: Add ANSI_QUOTES option so that Drill's SQL Parser will recognize ANSI_SQL identifiers
- added support for quoting identifiers with DOUBLE_QUOTES or BRACKETS by setting the new
sys/sess EnumString option QUOTING_IDENTIFIERS;
- added the ability to set QUOTING_IDENTIFIERS via the JDBC connection URL string;
- added relevant unit tests;
close #520
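A hedged sketch of how the option might be supplied through the JDBC URL, per the commit description; the exact URL key ("quoting_identifiers") and the sample query are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustration only: asks the parser to accept ANSI double-quoted identifiers
// instead of backticks; the URL property key is an assumption from the commit text.
public class QuotingIdentifiersSketch {
  public static void main(String[] args) throws SQLException {
    String url = "jdbc:drill:zk=local;quoting_identifiers=\"";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT \"employee_id\" FROM cp.\"employee.json\" LIMIT 1")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}
```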
* DRILL-5373: Drill JDBC error in the process of connection via SQuirrel
- java.lang.NoClassDefFoundError: javax/validation/constraints/NotNull
* DRILL-5413: DrillConnectionImpl.isReadOnly() throws NullPointerException
The change is in CALCITE-843.
Update Drill's Calcite version to 1.4.0-drill-r21.
close #806
* DRILL-5375: Nested loop join: return correct result for left join closes #794
* DRILL-5315: Address small typo in the comment in drillClient.hpp closes #771
* DRILL-5355: Misc. code cleanup closes #784
* DRILL-5234: External sort's spilling functionality does not work when the spilled columns contains a map type column closes #799
* DRILL-5395: Query on MapR-DB table fails with NPE due to an issue with assignment logic closes #803
* DRILL-5387: Ignore TestBitBitKerberos and TestUserBitKerberos closes #810
* DRILL-5319: Refactor "contexts" for unit testing closes #787
This PR is purely a refactoring: no functionality is added or changed.
The refactoring splits various context and related classes into a set
of new interfaces needed for operator-level unit tests. The other,
Drillbit-related methods are left in the original interfaces. Most code
need not change.
The changes here allow operator-level unit tests to mock up the
exec-time methods so we can use them without firing up a Drillbit (or
using mocking libraries).
A later PR will provide the sub-operator test framework that uses this
refactoring.
Changes include:
* The OptionManager is split, with read-only methods moving to a new
OptionSet interface.
* The FragmentContext is split, with an exec-only FragmentExecContext
providing low-level methods.
* OperatorStats is split, with a new OperatorStatReceiver class
providing write-only support to operators.
* Several places that accepted an OperatorContext or FragmentContext,
but needed only an allocator, are changed to accept the allocator
directly.
Includes fixes for code review comments
Adds more comments. Postpones the suggested rename until all affected
code is in master, else it will be difficult to synchronize the rename
across multiple branches.
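An illustrative sketch of the kind of read-only interface extraction described above; the names below are hypothetical approximations, not the exact Drill interfaces:

```java
// Hypothetical shape of the split: operators see only a read-only view,
// while the full read/write manager extends it.
interface OptionSet {                              // read-only view used by operators
  long getLongOption(String name);
  boolean getBooleanOption(String name);
}

interface OptionManagerLike extends OptionSet {    // full read/write manager
  void setOption(String name, Object value);
}

// An operator-level unit test can then supply a trivial in-memory OptionSet
// without starting a Drillbit.
class FixedOptionSet implements OptionSet {
  private final java.util.Map<String, Object> values = new java.util.HashMap<>();

  FixedOptionSet set(String name, Object value) {
    values.put(name, value);
    return this;
  }

  @Override public long getLongOption(String name) {
    return ((Number) values.get(name)).longValue();
  }

  @Override public boolean getBooleanOption(String name) {
    return (Boolean) values.get(name);
  }
}
```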
* DRILL-5213: Prepared statement for actual query is missing the query text
close apache/drill#812
* DRILL-5409 - update MapR version to 5.2.1
close apache/drill#813
* DRILL-5415: Improve Fixture Builder to configure client properties and keep collection type properties for server
Updated with review feedback
close apache/drill#807
* DRILL-5424: Fix IOBE for reverse function
close apache/drill#815
* Test-specific column accessor implementation. Provides a simplified, unified set of access methods for value vectors, specifically for writing simple, compact unit test code.
* Interfaces for column readers and writers
* Interfaces for tuple (row and map) readers and writers
* Generated implementations
* Base implementation used by the generated code
* Factory class to create the proper reader or writer given a major
type (type and cardinality)
* Utilities for generic access, type conversions, etc.
Many vector types can be mapped to an int for get and set. One key
exception is the decimal types: decimals, by definition, require a
different representation. In Java, that is `BigDecimal`. Added get, set
and setSafe accessors as required for each decimal type that uses
`BigDecimal` to hold data.
The generated code builds on the `valueVectorTypes.tdd` file, adding
additional properties needed to generate the accessors.
The PR also includes a number of code cleanups done while reviewing
existing code. In particular `DecimalUtility` was very roughly
formatted and thus hard to follow.
Supports Drill’s interval types (INTERVAL, INTERVALDAY,
INTERVALYEAR) in the form of the Joda interval class.
Adds support for Map vectors. Maps are treated as nested tuples and are
expanded out to create a flattened row in the schema. The accessors
then access rows using the flattened column index or the combined name
(“a.b”).
Supports arrays via a writer interface that appends values as written,
and an indexed, random-access reader interface.
Removed HTTP log parser from JDBC jar to keep the JDBC jar from getting
too big.
close apache/drill#783
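A hypothetical sketch of the unified accessor shape described above; these are illustrative interfaces only, not the generated Drill classes:

```java
import java.math.BigDecimal;

// Illustrative shape of the unified test accessors; names are hypothetical.
interface ColumnWriter {
  void setInt(int value);            // most fixed-width types map to an int
  void setString(String value);      // VARCHAR and friends
  void setDecimal(BigDecimal value); // decimals need BigDecimal, as noted above
}

interface TupleWriter {
  ColumnWriter column(int colIndex);       // by flattened position
  ColumnWriter column(String dottedName);  // by combined name, e.g. "a.b" for maps
  void save();                             // finish the current row
}
```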
* DRILL-5323: Test tools for row sets
Provide test tools to create, populate and compare row sets
To simplify tests, we need a TestRowSet concept that wraps a
VectorContainer and provides easy ways to:
- Define a schema for the row set.
- Create a set of vectors that implement the schema.
- Populate the row set with test data via code.
- Add an SV2 to the row set.
- Pass the row set to operator components (such as generated code
blocks.)
- Examine the contents of a row set
- Compare the results of the operation with an expected result set.
- Dispose of the underlying direct memory when work is done.
This code builds on that in DRILL-5324 to provide a complete row set
API. See DRILL-5318 for the spec.
Note: this code can be reviewed as-is, but cannot be committed until
after DRILL-5324 is committed: this code has compile-time dependencies
on that code. This PR will be rebased once DRILL-5324 is pulled into
master.
Handles maps and intervals
The row set schema is refined to provide two forms of schema. A
physical schema shows the nested structure of the data with maps
expanding into their contents.
Updates the row set schema builder to easily build a schema with maps.
An access schema shows the row “flattened” to include just scalar
(non-map) columns, with all columns at a single level, with dotted
names identifying nested fields. This form makes for very simple access.
Then, provides tools for reading and writing batches with maps by
presenting the flattened view to the row reader and writer.
HyperVectors have a very complex structure for maps. The hyper row set
implementation takes a first crack at mapping that structure into the
standardized row set format.
Also provides a handy way to set an INTERVAL column from an int. There
is no good mapping from an int to an interval, so an arbitrary
convention is used. This convention is not generally useful, but is
very handy for quickly generating test data.
As before, this is a partial PR. The code here still depends on
DRILL-5324 to provide the column accessors needed by the row reader and
writer.
All this code is getting rather complex, so this commit includes a unit
test of the schema and row set code.
Revisions to support arrays
Arrays require a somewhat different API. Refactored to allow arrays to
appear as a field type.
While refactoring, moved interfaces to more logical locations.
Added more comments.
Rejiggered the row set schema to provide both a physical and flattened
(access) schema, both driven from the original batch schema.
Pushed some accessor and writer classes into the accessor layer.
Added tests for arrays.
Also added more comments where needed.
Moved tests to DRILL-5318
The test classes previously here depend on the new “operator fixture”.
To provide a non-cyclic checkin order, moved the tests to the PR with
the fixtures so that this PR is clear of dependencies. The tests were
reviewed in the context of DRILL-5318.
Also pulls in batch sizer support for map fields which are required by
the tests.
closes #785
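The kind of test code these tools are intended to enable, assuming builder-style classes along the lines named in this commit (SchemaBuilder, RowSetBuilder, RowSetComparison); exact packages and signatures may differ, and the allocator is assumed to come from the test fixture:

```java
// Illustrative test fragment: build two row sets with the same schema and
// compare them; verifyAndClearAll also frees the underlying vectors.
BatchSchema schema = new SchemaBuilder()
    .add("id", MinorType.INT)
    .add("name", MinorType.VARCHAR)
    .build();

SingleRowSet actual = new RowSetBuilder(allocator, schema)
    .add(1, "wilma")
    .add(2, "fred")
    .build();

SingleRowSet expected = new RowSetBuilder(allocator, schema)
    .add(1, "wilma")
    .add(2, "fred")
    .build();

new RowSetComparison(expected).verifyAndClearAll(actual);
```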
* DRILL-5318: Sub-operator test fixture
This commit depends on:
* DRILL-5323
This PR cannot be accepted (or built) until the above are pulled and
this PR is rebased on top of them. The PR is issued now so that reviews
can be done in parallel.
Provides the following:
* A new OperatorFixture to set up all the objects needed to test at the
sub-operator level. This relies on the refactoring to create the
required interfaces.
* Pulls the config builder code out of the cluster fixture builder so
that configs can be build for sub-operator tests.
* Modifies the QueryBuilder test tool to run a query and get back one
of the new row set objects to allow direct inspection of data returned
from a query.
* Modifies the cluster fixture to create a JDBC connection to the test
cluster. (Use requires putting the Drill JDBC project on the test class
path since exec does not depend on JDBC.)
Created a common subclass for the cluster and operator fixtures to
abstract out the allocator and config. Also provides temp directory
support to the operator fixture.
Merged with DRILL-5415 (Improve Fixture Builder to configure client
properties)
Moved row set tests here from DRILL-5323 so that DRILL-5323 is self
contained. (The tests depend on the fixtures defined here.)
Added comments where needed.
Puts code back as it was prior to a code review comment. The code is
redundant, but necessarily so due to code which is specific to several
primitive types.
closes #788
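A sketch of the intended usage described above; the factory method and accessor names are assumptions, not verified against the source:

```java
// The fixture is AutoCloseable, so a try-with-resources block releases the
// allocator and other resources when the test is done.
try (OperatorFixture fixture = OperatorFixture.standardFixture()) {
  BufferAllocator allocator = fixture.allocator();
  // Build vectors or row sets with the allocator and exercise the operator
  // or record-reader code under test directly, without starting a Drillbit.
}
```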
* DRILL-5391: CTAS: make folder and file permission configurable
close #820
* DRILL-5428: submit_plan fails after Drill 1.8 script revisions
When the other scripts were updated, submit_plan was not corrected.
After Drill 1.8, drill-config.sh consumes all command line arguments,
finds the --config and --site options, removes them, and places the rest
in the new args array.
This PR updates submit_plan to use the new args array.
The fix was tested on a test cluster: we verified that a physical plan
was submitted and ran.
closes #816
* DRILL-5385: Vector serializer fails to read saved SV2
Unit testing revealed that the VectorAccessorSerializable class claims
to serialize SV2s, but, in fact, does not. Actually, it writes them,
but does not read them, resulting in corrupted data on read.
Fortunately, no code appears to serialize sv2s at present. Still, it is
a bug and needs to be fixed.
First task is to add serialization code for the sv2.
That revealed that the recently-added code to save DrillBufs using a
shared buffer had a bug: it relied on the writer index to know how much
data is in the buffer. Turns out sv2 buffers don’t set this index. So,
new versions of the write function takes a write length.
Then, closer inspection of the read code revealed duplicated code. So,
DrillBuf allocation moved into a version of the read function that now
does reading and DrillBuf allocation.
Turns out that value vectors, but not SV2s, can be built from a
DrillBuf. Added a matching constructor to the SV2 class.
Finally, cleaned up the code a bit to make it easier to follow. Also
allowed test code to access the handy timer already present in the code.
closes #800
* DRILL-5344: External sort priority queue copier fails with an empty batch
Unit tests showed that the “priority queue copier” does not handle an
empty batch. This has not been an issue because code elsewhere in the
sort specifically works around this issue. This fix resolves the issue
at the source to avoid the need for future work-arounds.
closes #778
* DRILL-5423: Refactor ScanBatch to allow unit testing record readers
Refactors ScanBatch to allow unit testing of record reader
implementations, especially the “writer” classes.
See JIRA for details.
closes #811
* DRILL-4039: Query fails when non-ascii characters are used in string literals
closes #825
* DRILL-5419: Calculate return string length for literals & some string functions
1. Revisited calculation logic for string literals and some string functions
(cast, upper, lower, initcap, reverse, concat, concat operator, rpad, lpad, case statement,
coalesce, first_value, last_value, lag, lead).
Synchronized return type length calculation logic between limit 0 and regular queries.
2. Deprecated width and changed it to precision for string types in MajorType.
3. Revisited FunctionScope and split it into FunctionScope and ReturnType.
FunctionScope will indicate only function usage in terms of the number of in/out rows (n -> 1, 1 -> 1, 1 -> n).
The new UDF annotation ReturnType will indicate which return type strategy should be used.
4. Changed MAX_VARCHAR_LENGTH from 65536 to 65535…