DRILL-4980: Upgrading of the approach of parquet date correctness status detection #644
Conversation
vdiravka
commented
Nov 3, 2016
- Parquet writer version is added;
- Updated the detection method of parquet date correctness.
@parthchandra @paul-rogers @jaltekruse could you please review?
paul-rogers
left a comment
Nice improvement! Comments include questions and suggested changes.
public ParquetTableMetadata_v2(boolean isDateCorrect) {
  this.drillVersion = DrillVersionInfo.getVersion();
  this.isDateCorrect = isDateCorrect;
  this.writerVersion = ParquetWriter.WRITER_VERSION;
We set the writer version to the current version when we create the metadata. Is this same metadata used for both read and write? If so, we have the potential for a nasty bug. A (new) reader fails to set the writerVersion value from actual file metadata. The value will default to the latest, even if the file itself happens to be older.
I wonder if it makes sense to pass the version into the constructor. The Writer passes in the current writer version. The reader must pass in the value found in the file.
Or, is this metadata used only for writing, but not reading? If that is true perhaps we can document that in the code somewhere. (I looked but did not find anything.)
Or, is this metadata cached from scanning actual files? If so, isn't defaulting the writer version simply asking for trouble if someone forgets to set this field based on actual file version?
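For what it's worth, a minimal sketch of that suggestion (hypothetical names and constructor shape, not the PR's actual code): the caller supplies the version, so a reader can pass the value found in the file while the writer passes the current one.

class ParquetTableMetadataSketch {
  final String drillVersion;
  final int writerVersion;

  // Writer path would pass DrillVersionInfo.getVersion() and ParquetWriter.WRITER_VERSION;
  // a reader path would pass whatever values it actually found in the file metadata.
  ParquetTableMetadataSketch(String drillVersion, int writerVersion) {
    this.drillVersion = drillVersion;
    this.writerVersion = writerVersion;
  }
}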
@JsonProperty List<String> directories;
@JsonProperty String drillVersion;
@JsonProperty boolean isDateCorrect;
@JsonProperty int writerVersion;
This property is Jackson-serialized. How does Jackson handle older files written without this version? According to this post (http://stackoverflow.com/questions/8320993/jackson-what-happens-if-a-property-is-missing), "Setter methods are only invoked for properties with explicit values." This means that older files without the writerVersion set won't call the function to deserialize the writer version, and the version will default to the newest version. I suspect that either A) I'm misunderstanding what this code does, or B) we have a backward-compatibility issue. Please explain which.
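To illustrate the concern, a minimal sketch (hypothetical POJO, not Drill's class) of how Jackson handles a missing property: the field simply keeps its Java default or in-code initializer, so an old file without writerVersion can silently look like a new one.

import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MissingPropertySketch {
  static class Metadata {
    @JsonProperty String drillVersion;
    @JsonProperty int writerVersion = 2;  // in-code default wins when the JSON has no such field
  }

  public static void main(String[] args) throws Exception {
    // Old metadata written before the writerVersion field existed.
    Metadata m = new ObjectMapper().readValue("{\"drillVersion\":\"1.8.0\"}", Metadata.class);
    System.out.println(m.writerVersion);  // prints 2, even though the file never recorded a writer version
  }
}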
ParquetTableMetadata_v2 is used mostly when the parquet metadata cache file is created.
This class is not used for reading the metadata cache file or the footer metadata, except in the case when we read one parquet file or parquet folder: an empty instance of this class is created, but only the columnTypeInfo field is used, and drillVersion and writerVersion weren't used. That's why everything worked, but it was not right. I added an empty constructor (as I did earlier, actually the master version).
And added initialization of these fields via constructor parameters. I think this is more correct and looks like what you were talking about.
Not sure I completely follow... Here's what I understood:
- The class is used when reading a Parquet file footer. (In such a case, the writer version would presumably be the actual version from the footer, or null if the file is old, was not written by Drill, etc.)
- The class is used when reading metadata, but then some fields are not used.
- In the case where the metadata is for one file (only), is the writer version stored in the metadata? So, the object created when reading metadata will have the same information as the original file?
- If the metadata is for multiple files, then the writer version is never set? (Or, should the rule be that the writer version is set if all files have the same version?)
Since this is a bit hard to understand, it would be very helpful to describe all this in a Javadoc comment for the class.
I tried to figure this out and found that this class isn't used for reading the parquet footer. For that purpose ParquetMetadata from parquet-hadoop-1.8.1-drill-r0.jar is used.
This class is used for an entire parquet directory with a metadata cache file. When the cache file is absent, an empty instance of this class (with null fields) is created.
I added some javadoc. Hope it helps.
this.directories = directories;
this.columnTypeInfo = columnTypeInfo;
this.drillVersion = DrillVersionInfo.getVersion();
this.writerVersion = ParquetWriter.WRITER_VERSION;
Given a set of files and directories, we're assuming the writer version? This is supposed to be the version of the Drill Parquet writer that wrote the actual Parquet files, right? Not the version of the metadata files that hold information about the Parquet files? Or, are we using the same writer version for both purposes? If we need two versions (one for Parquet, another for metadata), then we should introduce a separate Parquet metadata version. (Or, I'm misunderstanding the code...)
  }
}

public static boolean isDrillVersionHasCorrectDates(String drillVersion) throws VersionParser.VersionParseException {
Might flow better as "drillVersionHasCorrectDates"
But, see comment above about whether we need this check.
try {
  if (drillVersion != null) {
    return (writerVersion != null && Integer.parseInt(writerVersion) >= 2) || Boolean.valueOf(isDateCorrect)
        || isDrillVersionHasCorrectDates(drillVersion)
Isn't the Drill version check redundant? We know that all Drill versions from 1.9.0 onwards will have a Drill Parquet writer version in the file. So, we only need to check the writer version, if we have it. If we don't have it, then the only info we have is the Drill version.
We might want to explain this logic in a comment.
The purpose of adding the writer version is that all future format decisions can be made on the writer version, independent of the Drill version. For example, suppose we change something in Drill 1.10. Drill 1.10.0-SNAPSHOT will start with writer version 2. Later we'll make a change and 1.10.0-SNAPSHOT will write files with writer version 3. The Drill version is ambiguous; the writer version is spot on.
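A rough sketch of the check order being described (assumed names and signatures, not the PR's exact code): the writer version, when present, is authoritative; the Drill version only matters for files that predate it.

class DateCorrectnessSketch {
  static boolean datesAreCorrect(String writerVersion, String isDateCorrect, String drillVersion) {
    if (writerVersion != null) {
      return Integer.parseInt(writerVersion) >= 2;    // 1.9.0+ files always carry the writer version
    }
    if (Boolean.parseBoolean(isDateCorrect)) {
      return true;                                    // interim flag written by 1.9.0-SNAPSHOT builds
    }
    return drillVersionHasCorrectDates(drillVersion); // only the Drill version is left to go on
  }

  static boolean drillVersionHasCorrectDates(String drillVersion) {
    return false; // placeholder: a real check would parse and compare the Drill version string
  }
}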
}

public static VersionParser.ParsedVersion parseDrillVersion(String drillVersion) throws VersionParser.VersionParseException {
  return VersionParser.parse("drill version " + drillVersion + " (build 1234)");
The " (build 1234)" deserves some explanation. Does the parse function demand it so that we have to make up something here just to get the version to parse?
public static final String DRILL_VERSION_PROPERTY = "drill.version";
public static final String IS_DATE_CORRECT_PROPERTY = "is.date.correct";
public static final String WRITER_VERSION_PROPERTY = "parquet-writer.version";
We know we are Drill, but others might be confused. Perhaps "drill-writer.version" or "drill.writer-version" (to be consistent with "drill.version".)
Done. Parquet files for unit tests were regenerated as well.
public class ParquetWriter extends AbstractWriter {
  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ParquetWriter.class);

  public static final int WRITER_VERSION = 2;
This deserves an explanation. Something like:
Version of Drill's Parquet writer. Increment this version (by 1) any time we make any format change to the file. Format changes include 1) supporting new data types, 2) changes to the format of data fields, 3) adding new metadata to the file footer, etc.
Newer readers must be able to read old files. The Writer version tells the Parquet reader how to interpret fields or metadata when that data changes format from one writer version to another.
I copied this comment into the project.
Force-pushed b3f5cea to 6c2d5ec
vdiravka
left a comment
GitHub swallowed some comments, but I answered on "jackson-serialization" in one comment. Regarding drill version checking - I agree with you; I deleted it.
Force-pushed ab69f7b to 1fda940
@paul-rogers Could you please review? I answered the last comment and rebased the branch, resolving conflicts (ParquetTableMetadata_v3 was added).
paul-rogers
left a comment
Just a few minor suggestions. Otherwise, LGTM.
}
ParquetTableMetadata_v3 parquetTableMetadata = new ParquetTableMetadata_v3(true);
ParquetTableMetadata_v3 parquetTableMetadata = new ParquetTableMetadata_v3(DrillVersionInfo.getVersion(),
    ParquetWriter.WRITER_VERSION);
I'm a bit confused. The writer version applies to the Parquet files which Drill writes. (Or, at least, that was the intention.)
Here, we're talking about metadata. There may well be a metadata writer, but that should be a different writer, with a different version.
Not sure we want to initialize the metadata object with the current writer version: there seems to be no correlation between the metadata object and the writer version.
On the other hand, the metadata can certainly hold the writer version, but it must be the value read from the Parquet file itself; not a value set by the code. Else, we have the difficult problem of making sure that the code-set version number agrees with the actual version number in the file.
is.date.correct or parquet-writer.version was needed in the metadata cache file for quick detection of date value correctness. Otherwise we need to check the files.rowGroups.columns.mxValue values from this cache file.
But having thought a little, I've understood that thanks to the newly added ParquetTableMetadata_v3 we can check:
if the version of the parquet metadata cache file is 3, the date values are definitely correct. Otherwise (when the parquet metadata cache file was generated earlier) we need to check the date values from this file.
So writerVersion is redundant in ParquetTableMetadataBase now. I deleted it. Does it make sense?
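Roughly, the rule described above could look like this (hypothetical names; the real code works on the parsed metadata object):

class CacheDateCheckSketch {
  static boolean cacheDatesAreCorrect(String metadataVersion, boolean minMaxDateValuesLookValid) {
    if ("v3".equals(metadataVersion)) {
      return true;                     // a v3 cache file can only have been written after the date fix
    }
    return minMaxDateValuesLookValid;  // older cache: fall back to inspecting files.rowGroups.columns.mxValue
  }
}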
(int) (UTC.getDateTimeMillis(5000, 1, 1, 0) / DateTimeConstants.MILLIS_PER_DAY);

/**
 * The version of drill parquet writer with date values corruption fix
Maybe explain this a bit better. Something like:
Version 2 (and later) of the Drill Parquet writer uses the date format described (in the Parquet spec? URL?). Prior versions had dates formatted (copy description from above.)
Done
/**
 * The version of drill parquet writer with date values corruption fix
 */
public static final int DRILL_WRITER_VERSION_WITHOUT_CORRUPTION = 2;
Maybe call this DRILL_WRITER_VERSION_STD_DATE_FORMAT
The old format was not "corrupted", it just used a date format that was non-standard.
I don't mind. It's a suitable name. Done.
String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY);
String writerVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY);
// This flag can be present in parquet files which were generated with 1.9.0-SNAPSHOT drill version
final String isDateCorrectFlag = "is.date.correct";
Maybe here you want to special case the "is.date.correct" flag.
- If the writer version is present, use it.
- If "is.date.correct" is present, set the writer version to "2".
- If neither are present, set the writer version to 1.
That way we don't have to have (much) extra logic for the "is.date.correct" handling.
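A sketch of that normalization (assumed property names; the writer-version property was later renamed as discussed above): everything is mapped onto a single writer-version number so no other code needs to know about is.date.correct.

import java.util.Map;

class WriterVersionSketch {
  static int effectiveWriterVersion(Map<String, String> keyValueMetaData) {
    String writerVersion = keyValueMetaData.get("drill-writer.version");
    if (writerVersion != null) {
      return Integer.parseInt(writerVersion);   // 1.9.0+ files: trust the recorded version
    }
    if (keyValueMetaData.containsKey("is.date.correct")) {
      return 2;                                 // interim 1.9.0-SNAPSHOT flag implies the fixed date format
    }
    return 1;                                   // neither present: an old (pre-fix) file
  }
}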
Done
 * 1) supporting new data types,
 * 2) changes to the format of data fields,
 * 3) adding new metadata to the file footer, etc.
 * Newer readers must be able to read old files. The Writer version tells the Parquet reader how to interpret fields
This is Javadoc, so needs HTML formatting. For the list:
<ul>
<li>Supporting new data types,</li>
...
</ul>
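Put together, the HTML-formatted Javadoc might read roughly like this (illustrative only):

/**
 * Version of Drill's Parquet writer. Increment this version (by 1) any time we make a format
 * change to the file. Format changes include:
 * <ul>
 *   <li>supporting new data types,</li>
 *   <li>changes to the format of data fields,</li>
 *   <li>adding new metadata to the file footer, etc.</li>
 * </ul>
 * Newer readers must be able to read old files. The writer version tells the Parquet reader
 * how to interpret fields or metadata when that data changes format from one writer version
 * to another.
 */
public static final int WRITER_VERSION = 2;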
Done.
@parthchandra Could you please review this PR?
DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness.
if (drillVersion != null) {
  return Boolean.valueOf(isDateCorrect) ? DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
      : DateCorruptionStatus.META_SHOWS_CORRUPTION;
int writerVersion = (stringWriterVersion != null) ? Integer.parseInt(stringWriterVersion)
Use if-then-else to make this a little easier to read?
Agree. It makes sense.
I even found one redundant check in the return statement.
Changes in a new commit.
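A sketch of the if-then-else shape being discussed (assumed names; the enum is simplified here):

class CorruptionCheckSketch {
  enum DateCorruptionStatus { META_SHOWS_CORRUPTION, META_SHOWS_NO_CORRUPTION }

  static DateCorruptionStatus check(String stringWriterVersion, String isDateCorrect) {
    int writerVersion;
    if (stringWriterVersion != null) {
      writerVersion = Integer.parseInt(stringWriterVersion);
    } else if (Boolean.parseBoolean(isDateCorrect)) {
      writerVersion = 2;  // interim 1.9.0-SNAPSHOT flag
    } else {
      writerVersion = 1;  // old file, written before the date fix
    }
    return writerVersion >= 2 ? DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
                              : DateCorruptionStatus.META_SHOWS_CORRUPTION;
  }
}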
DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness. This closes apache#644
+1
DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness. This closes apache#644
* Updated plugin version to 1.7.0-SNAPSHOT
* MD-789: Query with condition involving addition of DATE and INTERVAL types returns no results [MapR-DB JSON Tables]
+ Added `enablePushdown` option to enable/disable all filter pushdown, enabled by default.
* MD-813: Improve count(*) queries against MapR-DB Json tables.
+ Fail query on schema change.
+ Added a configuration option 'ignoreSchemaChange', which when enabled, drops the rows from the result
* Update MapR v5.1.0 artifacts version
* DRILL-4199: Add Support for HBase 1.X
* Updated plugin version to 1.9.0-SNAPSHOT
* Explicitly specify `hbase-server` dependency in Hive storage plugin.
Hive's HBaseStorageHandler uses HBase's TableInputFormat which is in hbase-server module.
* Exclude 'drill-memory-base' and 'tpch-sample-data' jars from jars/3rdparty folder.
* DRILL-4886: Modifying projects POMs to align with Drill's build and distribution.
The default build/test/packaging behavior for mapr-format-plugin module are
1. BUILD of mapr-format-plugin is ENABLED.
2. Unit tests of mapr-format-plugin module are DISABLED (use `-Pmapr` to enable).
3. Packaging of mapr-format-plugin is DISABLED (use `-Pmapr` to enable).
Please see LEGAL-251 for discussion/conclusion regarding inclusion of source code with non-open-source dependency.
* DRILL-4894: Fix unit test failure in 'storage-hive/core' module
Exclude 'hadoop-mapreduce-client-core' and 'hadoop-auth' as transitive dependencies from 'hbase-server'
* DRILL-3898 : Sort spill was modified to catch all errors, ignore repeated errors while closing the new group and issue a more detailed error message.
close apache/drill#591
* DRILL-4771: Drill should avoid doing the same join twice if count(distinct) exists
close apache/drill#588
* DRILL-4906: CASE Expression with constant generates class exception
Tests for different data types
close apache/drill#598
* DRILL-4911: Avoid plan serialization in SimpleParallelizer when debug logging is not enabled.
* DRILL-4618: Correct the usage of random flag in Hive function registry
+ Function visitor should not use previous function holder if this function is non-deterministic
closes #509
* DRILL-4862: Fix binary_string to use an injected buffer as out buffer
+ Previously, binary_string used the input buffer as output buffer. So after calling binary_string, the original content was destroyed. Other expressions/ functions that need to access the original input buffer get wrong results.
+ This fix also sets readerIndex and writerIndex correctly for the output buffer, otherwise the consumer of the output buffer will hit issues.
closes #604
* DRILL-4927: Add support for Null Equality Joins
These changes are a subset of the original pull request from DRILL-4539 (PR-462).
- Added changes to support Null Equality Joins;
- Created tests for it.
close apache/drill#603
* DRILL-4452: Uses Apache Calcite Avatica driver vs Optiq driver for Drill JDBC
Drill JDBC driver uses Optiq Avatica as its basis, but this dependency has
been moved over to Calcite, for quite some time without Drill code being
updated for it.
This patch updates Avatica version to the version from Calcite
(1.4.0-drill-r19). It also refactors Drill JDBC driver to comply with the
packages and API changes in Avatica. Finally it fixes the SQL types for
lists and structs, since Drill doesn't support java.sql.Array and
java.sql.Struct interfaces.
this closes #395
Change-Id: Ia608adf900e8708d9e6f6f58ed41e104321a9914
* DRILL-4880: Support JDBC driver registration using ServiceLoader
Support loading Drill driver using ServiceLoader. From the user perspective,
it means being able to use the driver without registering it first, like by using
Class.forName("org.apache.drill.jdbc.Driver") for example.
this closes #596
Change-Id: Id26922ee42bef5fbce46ac2bcbb83f1859e9bb48
* DRILL-4930: Fix Metadata results ordering
Change MetadataProvider to return metadata results ordered (following
convention used by ODBC and JDBC specs).
this closes #614
Change-Id: Iff59b7fada7040602f1735bccc13bc6bf5c9a252
* DRILL-4925: Add tableType filter to GetTables metadata query
- Adding tableType filter to GetTablesReq query (needed for JDBC and ODBC
drivers).
- Fix table type returned by sys and INFORMATION_SCHEMA tables
- Also fixes some protobuf typos to related classes.
this closes #612
Change-Id: If95246a312f6c6d64a88872936f516308874c2d2
* DRILL-4870 drill-config.sh sets JAVA_HOME incorrectly for the Mac
This closes #605
* DRILL-4203: Fix DrillVersionInfo to make it provide a valid version number even during the unit tests.
This is now a build-time generated class, rather than one that looks on the
classpath for META-INF files.
This pattern for file generation with parameters passed from the POM files
was borrowed from parquet-mr.
* DRILL-4203: Fix date values written in parquet files created by Drill
Drill was writing non-standard dates into parquet files for all releases
before 1.9.0. The values have been read correctly by Drill, but
external tools like Spark reading the files will see corrupted values for
all dates that have been written by Drill.
This change corrects the behavior of the Drill parquet writer to correctly
store dates in the format given in the parquet specification.
To maintain compatibility with old files, the parquet reader code has
been updated to check for the old format and automatically shift the
corrupted values into corrected ones.
The test cases included here should ensure that all files produced by
historical versions of Drill will continue to return the same values they
had in previous releases. For compatibility with external tools, any old
files with corrupted dates can be re-written using the CREATE TABLE AS
command (as the writer will now only produce the specification-compliant
values, even if after reading out of older corrupt files).
While the old behavior was a consistent shift into an unlikely range
to be used in a modern database (over 10,000 years in the future), these are still
valid date values. In the case where these may have been written into
files intentionally, and we cannot be certain from the metadata if Drill
produced the files, an option is included to turn off the auto-correction.
Use of this option is assumed to be extremely unlikely, but it is included
for completeness.
This patch was originally written against version 1.5.0, when rebasing
the corruption threshold was updated to 1.9.0.
Added regenerated binary files, updated metadata cache files accordingly.
One small fix in the ParquetGroupScan to accommodate changes in master that changed
when metadata is read.
Tests for bugs revealed by the regression suite.
Fix drill version number in metadata file generation
* DRILL-4203: Parquet File. Date is stored wrongly - Added new extra field in the parquet meta info "is.date.correct = true"; - Removed unnecessary double conversion of value with Julian day; - Added ability to correct corrupted dates for parquet files with the second version old metadata cache file as well.
This closes #595
* DRILL-4653: Malformed JSON should not stop the entire query from progressing
This closes #518
* DRILL-4726: Dynamic UDF Support
1) Configuration / parsing / options / protos
2) Zookeeper integration
3) Registration / unregistration / lazy-init
4) Unit tests
This closes #574
* DRILL-3178: csv reader should allow newlines inside quotes
This closes #593
* DRILL-4369: Exchange name and version infos during handshake
There's no name and version exchanged between client and server over the User RPC
channel.
On client side, having access to the server name and version is useful to expose it
to the user (through JDBC or ODBC api like DatabaseMetadata#getDatabaseProductVersion()),
or to implement fallback strategy when some recent API are not available (like
metadata API).
On the server side, having access to the client version might be useful for audit
purposes and eventually to implement fallback strategy if it doesn't require a RPC
version change.
this closes #622
* DRILL-4950: Remove incorrect false condition; consume all empty batches
closes #621
* DRILL-4921: Added -P to "cd" in scripts to allow for symbolic links
close apache/drill#623
* DRILL-4954: Fix printing Strings when `allTextMode` is ON for MapR-DB JSON tables.
close apache/drill#624
* DRILL-4905: Push down the LIMIT to the parquet reader scan to limit the numbers of records read
close apache/drill#597
* DRILL-4560: When new bits register, invoke listeners in ZKClusterCoordinator
close apache/drill#626
Original JIRA title: “ZKClusterCoordinator does not call
DrillbitStatusListener.drillbitRegistered for new bits”
* DRILL-4968: Add column size to ColumnMetadata
Add a column size to ColumnMetadata so that JDBC and ODBC clients share
the implementation and don't have to recompute it client side.
this closes #631
* DRILL-4826: Query against INFORMATION_SCHEMA.TABLES degrades as the number of views
increases
This closes #592
* DRILL-4964: Make Drill reconnect to hive metastore after hive metastore is restarted.
Drill fails to connect to hive metastore after hive metastore is restarted unless drillbits are restarted.
Changes: For methods DrillHiveMetaStoreClient.getAllDatabases() and DrillHiveMetaStoreClient.getAllTables(),
the HiveMetaStoreClient wraps MetaException and TException both into MetaException. In case of connection
failure which is thrown as TException it is difficult to categorize at DrillClient level. The fix is to
close the older connection and reconnect in the case of these 2 APIs. In all other cases a proper set of exceptions
are thrown where we can handle each one individually.
close apache/drill#628
* DRILL-4972: Remove setDaemon(true) call in WorkManager.StatusThread
closes #633
* DRILL-4974: Add missing null check in FindPartitionConditions.analyzeCall()
close apache/drill#634
* DRILL-4967: Adding template_name to source code generated using freemarker template.
close apache/drill#629
* DRILL-4884: Fix IOB exception in limit n query when n is beyond 65535.
close apache/drill#584
* DRILL-3423: Initial HTTPD log plugin. Needs tests. Would be good to improve the timestamp and cookies behaviors since we can make those more type specific.
* DRILL-3423: Adding HTTPd Log Parsing functionality including full pushdown, type remapping and wildcard support.
Pushed through the requested columns for push down to the parser. Added more tests to cover a few more use cases. Ensured that user query fields are now completely consistent with returned values.
* DRILL-3243: Added CSG mods. Fixed field names.
Removed old test files
Added Parse_url and parse_query() functions
Fix unit test
This closes #607
* DRILL-4373: Drill and Hive have incompatible timestamp representations in parquet - added sys/sess option "store.parquet.int96_as_timestamp"; - added int96 to timestamp converter for both readers; - added unit tests;
This closes #600
* DRILL-4420: C++ API for metadata access and prepared statements
Add support to the C++ client for metadata querying and prepared
statement requests.
Part of the metadata API, add methods to query for server capabilities.
As of now, this interface is not backed up by any RPC exchange so
the information is pretty much static, and matches Drill 1.8.0
current capabilities.
* DRILL-4853: Update C++ protobuf source files
Add support for prepared statements and metadata querying
* DRILL-1268: Add unit test to C++ native client
Add CppUnit unit test to the C++ native client
* DRILL-1996: Add cancel method to Drill C++ connector
This closes #602
* DRILL-4792: Include session options used for a query as part of the profile
closes #551
* DRILL-4951: Do Guava.patch earlier so we can run single Hbase's unit test through command line or IDE
closes #636
* DRILL-4927 (part 2): Add support for Null Equality Joins (mixed comparators)
This changes are a subset of the original pull request from DRILL-4539 (PR-462).
- Added changes to support mixed comparators;
- Added tests for it.
closes #635
* DRILL-4986: Allow users to customize the Drill log file name
Adds a new user-settable environment variable to choose the log file
base name.
closes #641
* DRILL-4989: Fix TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange
closes #643
* DRILL-4800: Use a buffering input stream in the Parquet reader
* DRILL-4800: Add AsyncPageReader to pipeline PageRead Use non tracking input stream for Parquet scans. Make choice between async and sync reader configurable. Make various options user configurable - choose between sync and async page reader, enable/disable fadvise Add Parquet Scan metrics to track time spent in various operations
* DRILL-4800: Parallelize column reading. Read/Decode fixed width fields in parallel Decoding var length columns in parallel Use simplified decompress method for Gzip and Snappy decompression. Avoids concurrency issue with Parquet decompression. (It's also faster). Stress test Parquet read write Parallel column reader is disabled by default (may perform less well under higher concurrency)
* DRILL-4800: Various fixes. Fix buffer underflow exception in BufferedDirectBufInputStream. Fix writer index for in64 dictionary encoded types. Added logging to help debug. Fix memory leaks. Work around issues with of InputStream.available() ( Do not use hasRemainder; Remove check for EOF in BufferedDirectBufInputStream.read() ). Finalize defaults. Remove commented code.
Addressed review comments
This closes #611
* DRILL-5400: Fix NPE in corrupt date detection
This closes #646
* DRILL-1950: Initial prototype patch for parquet filter pushdown.
Use three new classes from Adam's patch.
* DRILL-1950: Update parquet metadata cache format to include both min/max and additional column type information.
Parquet meta cache format change:
1. include both min/max in ColumnMetaData if column statistics is available,
2. include precision/scale/repetitionLevel/definitionLevel in ColumnTypeMetaData (precision/scale/definitionLevel is for future use).
* DRILL-1950: Parquet rowgroup level filter pushdown in query planning time.
Implement Parquet rowgroup level filter pushdown. The filter pushdown is performed in
in Drill physical planning phase.
Only a local filter, which refers to columns in a single table, is qualified for filter pushdown.
A filter may be qualified if it is a simple comparison filter, or a compound "and/or" filter consists of
simple comparison filter. Data types allowed in comparison filter are int, bigint, float, double, date,
timestamp, time. Comparison operators are =, !=, <, <=, >, >=. Operands have to be a column of the above
data types, or an explicit cast or implicit cast function, or a constant expressions.
This closes #637
* DRILL-4945: Report INTERVAL exact type as column type name
closes #618
* DRILL-4969: Basic implementation of display size for column metadata
Add a simple implementation of display size, based on the ODBC table
available at: https://msdn.microsoft.com/en-us/library/ms713974(v=vs.85).aspx
closes #632
* DRILL-4823: Fix OOM while trying to prune partitions with reasonable data size
closes #560
* DRILL-4674: Allow casting to boolean the same literals as in Postgres
closes #610
* Add Sudheesh's PGP Key
* DRILL-5007: Dynamic UDF lazy-init does not work correctly in multi-node cluster
closes #650
* DRILL-5009: Skip reading of empty row groups while reading Parquet metadata
+ We will no longer attempt to scan such row groups.
closes #651
* DRILL-5047: When session option is string, query profile is displayed incorrectly on Web UI
closes #655
* DRILL-4935: Allow drillbits to advertise a configurable host address to
Zookeeper
This closes #647
* DRILL-4979: Make port of the DataConnection configurable
This closes #649
* DRILL-4831: Running refresh table metadata concurrently randomly fails with JsonParseException
This closes #653
* DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness.
This closes #644
* DRILL-4995: Allow lazy init when dynamic UDF support is disabled
This closes #645
* Update version to 1.10.0-SNAPSHOT
* DRILL-4604: Generate warning on Web UI if drillbits version mismatch is detected
close apache/drill#482
* DRILL-5094: Comparator should guarantee transitive attribute.
close apache/drill#675
* DRILL-5086: Fix conversion of min and max values to appropriate data type.
close apache/drill#674
* DRILL-5050: C++ client library has symbol resolution issues when loaded by a process that already uses boost::asio
Build with Boost static libs and drill_boost namespace on mac. Added
readme with instructions
DRILL-5050: Addressed review comments
DRILL-5050: address more review comments
close apache/drill#659
* DRILL-4982: Separate Hive reader classes for different data formats to improve performance.
1, Separating Hive reader classes allows optimization to apply on different classes in optimized ways. This separation effectively avoids the performance degradation of scan.
2, Do not apply Skip footer/header mechanism on most Hive formats. This skip mechanism introduces extra checks on each incoming records.
close apache/drill#638
* DRILL-5015: Randomly select the drillbit from the list provided by user in connection string
Note: Improved the connection string validation and error handling during parsing.
Added unit test for the new parsing mechanism.
close apache/drill#648
* DRILL-5091: Handle new Java 8 JDBC functions and add missing test config parameter
* JDBC unit test fails to set up test storage plugin config on Java 8
* Cleans up compiler warnings seen in Eclipse.
* Added TODO's based on code reviews
closes #676
* DRILL-5108: Reduce output from Maven git-commit-id-plugin
The git-commit-id-plugin grabs information from Git to display
during a build. It prints many e-mail addresses and other generic
project information. As part of the effort to trim down unit test
output, we propose to turn off the verbose output from this plugin
by default.
closes #680
* DRILL-5113: Upgrade Maven RAT plugin to avoid annoying XML errors
Upgrade to eliminate XML compiler warnings that appear for each
sub-project in the Drill build.
Also fixes a Maven warning about a duplicated version error for
the dependency plugin (version is inherited from root pom.xml).
closes #682
* DRILL-5048: Fix type mismatch error in case statement with null timestamp
closes #657
* DRILL-5044: Fix retry logic to handle VersionMismatchException by not deleting jars in remote UDFs area
closes #669
* DRILL-5085: Update description for dynamic UDFs directories in drill-env.sh and drill-module.conf
closes #672
* DRILL-5112: Fix config in PopUnitTestBase
Tests rely on command-line settings in the pom.xml file. Those
settings are not available when tests are run in Eclipse.
Replicated required settings into the base test class (as in
BaseTestQuery).
closes #681
* DRILL-4812: Fix FileSelection#handleWildCard to use normalized path separator
closes #627
* DRILL-5056: Fix UserException to write full message to log
A case occurred in which the External Sort failed during spilling.
All that was written to the log was:
2016-11-18 ... INFO o.a.d.e.p.i.xsort.ExternalSortBatch - User Error Occurred
Modified the logging code to provide more information to aid tracking down problems that occur in the field.
closes #665
* DRILL-5119: Update MapR version to 5.2.0.40963-mapr
closes #688
* DRILL-4987: Use ImpersonationUtil to get process user’s groups in RemoteFunctionRegistry
closes #642
* DRILL-5081: Lower logging level for corrupt dates message
* introduced in DRILL-4203
closes #691
* DRILL-4938: Report UserException when constant expression reduction fails
closes #689
* DRILL-5065: Optimize count(*) queries on MapR-DB JSON Tables
In MapR-DB v5.2.0, we enabled '_id' only projection for JSON
tables. Hence, we can now optimize the following queries:
a. count(*) by projecting only the '_id' column.
b. '_id' only projections, including count(_id)
Change the format plugin config parameter name.
Fix setter of config parameter `disableCountOptimization` for drill-maprdb plugin
closes #678
* DRILL-5123: Write query profile after sending final response to client to improve latency
In testing a particular query, I used a test setup that does not write
to the "persistent store", causing query profiles to not be saved. I
then changed the config to save them (to local disk). This produced
about a 200ms difference in query run time as perceived by the client.
I then moved writing the query profile after sending the client the
final message. This resulted in an approximately 100ms savings, as
perceived by the client, in query run time on short (~3 sec.) queries.
close apache/drill#692
* DRILL-5117: Compile error when query a json file with 1000+columns
close apache/drill#686
* DRILL-5098: Improving fault tolerance for connection between client and foreman node.
Adding tries config option in connection string. Improving fault tolerance in Drill client when trying to make first connection with foreman. The client will try to connect to min(tries, num_drillbits) unique drillbits unless a successful connection is established.
HYGIENE: Refactoring BasicClient::close to call RemoteConnection::close
close apache/drill#679
* DRILL-5051: Fix incorrect computation of 'fetch' in LimitRecordBatch when 'offset' is specified
close apache/drill#662
* DRILL-5032: Drill query on hive parquet table failed with OutOfMemoryError: Java heap space
close apache/drill#654
* DRILL-5052: Option to debug generated Java code using an IDE
Provides a second compilation path for generated code: “plain old Java”
in which generated code inherits from its templates. Such code can be
compiled directly, allowing easy debugging of generated code.
Also shows how to generate two classes in the External Sort Batch as “plain
old Java” to enable IDE debugging of that generated code. Required
minor clean-up of the templates.
Fixes some broken toString( ) methods in code generation classes
Fixes a variety of small compilation warnings
Adds Java doc to a few classes
Includes clean-up from code review comments.
close apache/drill#660
* DRILL-5159: Drill's ProjectMergeRule should operate on RelNodes with same convention trait.
close apache/drill#705
* DRILL-5127: Revert the fix for DRILL-4831
close apache/drill#718
* DRILL-5121 Fix for memory leak. Changes fieldVectorMap in ScanBatch to a CaseInsensitiveMap
close apache/drill#690
* DRILL-5039: NPE - CTAS PARTITION BY (<char-type-column>)
close apache/drill#706
* DRILL-5116: Enable generated code debugging in each Drill operator
DRILL-5052 added the ability to debug generated code. The reviewer suggested
permitting the technique to be used for all Drill operators. This PR provides
the required fixes. Most were small changes, others dealt with the rather
clever way that the existing byte-code merge converted static nested classes
to non-static inner classes, with the way that constructors were inserted
at the byte-code level and so on. See the JIRA for the details.
This code passed the unit tests twice: once with the traditional byte-code
manipulations, a second time using "plain-old Java" code compilation.
Plain-old Java is turned off by default, but can be turned on for all
operators with a single config change: see the JIRA for info. Consider
the plain-old Java option to be experimental: very handy for debugging,
perhaps not quite tested enough for production use.
close apache/drill#716
* DRILL-4996: Parquet Date auto-correction is not working in auto-partitioned parquet files generated by drill-1.6
- Changed detection approach of corrupted date values for the case when parquet files are generated by drill:
the corruption status is determined by looking at the min/max values in the metadata;
- Appropriate refactoring of TestCorruptParquetDateCorrection.
This closes #687
* DRILL-5105: comment out unnecessary recursive buffer size check
This closes #715
* DRILL-5152: Enhance the mock data source: better data, SQL access
Provides an enhanced version of the mock data source. See the JIRA
entry for motivation, package-info.java for details of operation.
Revisions suggested by code review
Also includes additional comments and a few more compiler warning
cleanups.
This closes #708
* DRILL-4919: Fix select count(1) / count(*) on csv with header
This closes #714
* DRILL-4558: BSonReader should prepare buffer size as actual need
This closes #696
* DRILL-4868: fix how hive function set DrillBuf.
This closes #695
* DRILL-5172: Display elapsed time for queries in the UI
Displays the elapsed time for running queries and the total duration of completed/failed/cancelled queries in the list of query profiles displayed, and within a query's profile page as well.
The query runtime is displayed in '[hr] [min] sec'.
e.g. A duration of 25,254,321ms is displayed 7 hr 00 min 54.321 sec
This closes #721
* DRILL-4956: Temporary tables support
close apache/drill#666
* DRILL-5097: Using store.parquet.reader.int96_as_timestamp gives IOOB whereas convert_from works
close apache/drill#697
* DRILL-5104: Foreman should not set external sort memory for a physical plan
Physical plans include a plan for memory allocations. However, the code
path in Foreman replans external sort memory, even for a physical plan.
This makes it impossible to use a physical plan to test memory
configuration.
This change avoids changing memory settings in a physical plan; while
preserving the adjustments for logical plans or SQL queries.
Revised to put a property in the plan itself. Old plans, and those
generated from SQL, will have memory allocations applied. Plans
marked as already "resource management" planned will be used as-is.
Includes a unit test that demonstrates the new behavior.
close apache/drill#703
* DRILL-5164: Equi-join query results in CompileException when inputs have large number of columns.
close apache/drill#711
* DRILL-5218: Support optionally disabling heartbeats from C++ client
closes #726
* DRILL-4764: Parquet file with INT_16, etc. logical types not supported by simple SELECT
closes #673
* DRILL-5126: Provide simplified, unified "cluster fixture" for test
Drill provides a robust selection of test frameworks that have evolved to satisfy the needs of a variety of test cases.
However, some do some of what a given test needs, while others do other parts. Also, the various frameworks make
assumptions (in the form of boot-time configuration) that differs from what some test may need, forcing the test
to start, then stop, then restart a Drillbit - an expensive operation.
Also, many ways exist to run queries, but they all do part of the job. Several ways exist to change
runtime options.
This checkin shamelessly grabs the best parts from existing frameworks, adds a fluent builder facade
and provides a complete, versatile test framework for new tests. Old tests are unaffected by this
new code.
An adjustment was made to allow use of the existing TestBuilder mechanism. TestBuilder used to
depend on static members of BaseTestQuery. A "shim" allows the same code to work in the old
way for old tests, but with the new ClusterFixture for new tests.
Details are in the org.apache.drill.test.package-info.java file.
This commit modifies a single test case, TestSimpleExternalSort, to use the new framework.
More cases will follow once this framework itself is committed.
Also, the framework will eventually allow use of the extended mock data source
from SQL. However, that change must await checkin of the mock data source changes.
Includes a LogFixture that allows setting logger options per test to simplify debugging via tests.
Also includes a “summary listener” to run a query and return a summary of the
run. Handy to simply verify that a query runs and to time it.
Added an async query runner for tests that want to run multiple
concurrent queries.
closes #710
* DRILL-3562: Query fails when using flatten on JSON data where some documents have an empty array
closes #713
* DRILL-5043: Function that returns a unique id per session/connection similar to MySQL's CONNECTION_ID() #685
* DRILL-5207: Improve Parquet Scan pipelining. Add a configurable AsyncPageReader Queue. Enforce total size of parquet row group. Do not initialize BufferedDirectBufInputStream buffer in init. Wait for first read. Change default size of BufferedDirectBufInputStream. Do not invoke getOptions too many times in Parquet reader. Add metrics for processing time, and decoding time for varlen and fixedlen columns.
This closes #723
* DRILL-5215: CTTAS: disallow temp tables in view expansion logic
This closes #725
* DRILL-5220: Provide API to set application/client names in C++ connector
Add method to DrillClientConfig to set the client and the application names
in the C++ connector.
Allow the ODBC driver (or any user of the C++ connector) to provide more
specific information, like the application using the client.
This closes #728
* DRILL-5224: CTTAS: fix errors connected with system path delimiters (Windows)
This closes #731
* DRILL-5237: FlattenRecordBatch loses nested fields from the schema when returns empty batches for the first time
This closes #735
* DRILL-5238: CTTAS: unable to resolve temporary table if workspace is indicated without schema
This closes #736
* DRILL-5240: Parquet - fix unnecessary object creation while checking for null values in nullable var length columns
This closes #740
* DRILL-5223: Drill should ensure balanced workload assignment at node level in order to get better query performance
This closes #730
* DRILL-5241: JDBC proxy driver: Do not put null value in map
This closes #724
* DRILL-5243: Fix TestContextFunctions.sessionIdUDFWithinSameSession unit test
This closes #743
* DRILL-5219: Relax user properties validation in C++ client
Unlike Java client, C++ client only allows user properties present in a
whitelist. Relax this restriction so that user can add extra properties.
This closes #727
* DRILL-5040: Parquet writer unable to delete table folder on abort
close apache/drill#744
* DRILL-4864: Add ANSI format for date/time functions
DRILL-4864: Add ANSI format for date/time functions(review changes)
close apache/drill#581
* DRILL-5080: Memory-managed version of external sort
Please see JIRA entry for reasons for revision, design spec and list of
changes.
This PR covers the changes to the external sort itself. Tests for this
operator require the test framework in DRILL-5126 and the mock data
source in DRILL-5152. Tests for this operator will be issued as a
separate PR once those two dependencies are committed.
Until then, the new operator is disabled by default. It can be enabled
using drill.sort.external.disable_managed: false.
The operator now spills before receiving a new batch. Revised memory calcs and
merge calcs to make them a bit clearer and provide more margin of error
for the power-of-two allocations used when allocating vectors.
We have two external sort implementations, but only one operator code
for both. They can use only one Metrics enum between them. When adding
new metrics to the new version, didn’t add matching metrics to the old
one. This fixes that issue. (The issue will go away once the old one is
retired.)
Revised memory calculations to reflect limit of 16 MB per vector.
Current revision limits to 16 MB per output batch to be safe. Next
revision will enforce per-vector limits to allow the overall batch to
be larger when possible.
Also simplified the merge-time calculations.
Original code provided only crude methods to learn the size of a record
batch. Adds a "RecordBatchSizer" to provide detailed analysis so the
sort can know the amount of memory used to buffer a batch, the number
of rows, and the expected row width once the rows are copied to a
spill file or the output.
Moved generic spill classes to a separate package.
Created parameters for spill batch size and merge batch size. Separated
these values in code. Deprecated the min, max spill parameters as they
no longer add much value. Minor code rearranging.
Bug fix
Fixes a corner case of merging spilled files in a low-memory condition.
Fixes from code review
close apache/drill#717
* DRILL-5263: Prevent left NLJoin with non scalar subqueries
* DRILL-5230: Translation of millisecond duration into hours is incorrect
Fixed invalid representation of readable elapsed time using `TimeUnit` class in JDK.
e.g. 4545 sec is now correctly translated as `1h15m` instead of `17h15m`
* DRILL-5242: The UI breaks when trying to render profiles having unknown metrics
Skip any metrics whose metric ID is unknown, This prevents any ArrayIndexOutOfBoundsException from being thrown and breaking the UI rendering.
* DRILL-5157: Multiple Snappy versions on class path
Multiple Snappy versions on class path; causes unit test failures.
This fix updates the Snappy library and adds dependency management to
exclude older versions brought in by Avro and Parquet.
* DRILL-5273: CompliantTextReader exhausts 4 GB memory when reading 5000 small files
Please see JIRA for details of problem and fix.
closes #750
* DRILL-5275: Sort spill is slow due to repeated allocations
Rather than create a heap buffer per vector when writing and reading,
the revised code creates a single, shared buffer used for all I/O
within a particular container. This improves performance by reducing GC
and CPU costs during I/Os.
Move I/O buffer, and methods to allocator
Allows the buffer to be shared. Especially in the sort, this is
important, as the sort may have many serializations open at once.
closes #754
* DRILL-5274: Exception thrown in Drillbit shutdown in UDF cleanup code
closes #760
* DRILL-5260: Extend "Cluster Fixture" test framework
- Config option to suppress printing of CSV and other output. (Allows
printing for single tests, not printing when running from Maven.)
- Parsing of query profiles to extract plan and run time information.
- Fix bug in log fixture when enabling logging for a package.
- Improved ZK support.
- Set up the new CTTAS default temporary workspace for tests.
- Clean up persistent storage files on disk to avoid CTTAS startup
failures.
- Provides a set of examples for how to use the cluster fixture.
closes #753
* DRILL-5259: Allow listing a user-defined number of profiles
Allow changing default number of finished queries in web UI, when starting up Drillbits.
Option provided in drill-override.conf (default=100 ; defined in drill-module.conf)
Alternatively, the page can be loaded dynamically for the same.
e.g.
https://<hostname>:8047/profiles?max=100
closes #751
* DRILL-5257: Run-time control of query profiles
Adds a run-time option to save (default) or not save query profiles.
Adds a run-time option to save query profiles in "debug" mode:
that is, after returning the last client response. (Normal mode is
to return the response before writing the profile.)
Tests for normal case are normal unit tests. Tests for debug mode
case are unit tests using the new framework that parse profiles.
The test framework is extended to save query profiles using this
new option.
Modifies the test framework to use the new options when a test
asks to save query profiles.
closes #747
* DRILL-5088: Set client's codec for toJson
closes #702
* DRILL-5196: Init MongoDB cluster when run a single test case directly through command line or IDE
Other fixes include:
+ Sync mongo-java-driver versions to newer 3.2.0
+ update flapdoodle package to latest accordingly
closes #741
* DRILL-5190: Display planning and queued time for a query's profile page
Modified UserSharedBit protobuf for marking planning and wait-in-queue end times. This will allow for accurately reporting the planning, queued and actual execution times of a query.
Planning Time:
In the absence of the planning time's end, for older profiles, the root fragment's (i.e. SCREEN operator) start time is taken as the estimated end of planning time, and as the estimated start time of the execution phase.
QueueWait Time:
We do not estimate the queue time if the planning end time is not available.
Execution Time:
We calculate the execution time based on the availability of these 2 planning times. The computation is done the following way, and reflects a decreasing level of accuracy
1. Execution time = [end(QueueWait) - endTime(Query)]
2. Execution time = [end(Planning) - endTime(Query)]
3. Execution time = [start(rootFragment) - endTime(Query)] - {Estimated}
closes #738
* DRILL-5255: Remove default temporary workspace check at drillbit start up
closes #759
* DRILL-4280: HYGIENE
+ Ignore files generated by IntelliJ in RAT plugin
* DRILL-4280: CORE (Java protocol)
+ Define SaslStatus and SaslMessage messages in protocol
+ Add "authenticationMechanisms" field to all handshakes
+ Add "saslSupport” field to UserToBitHandshake
* DRILL-4280: REFACTOR
+ Extract RemoteConnection interface, and add AbstractRemoteConnection
+ Add ServerConnection and ClientConnection interfaces
+ Add RequestHandler interface to decouple connections from how requests are handled
+ Add NonTransientRpcException
+ Remove unused classes and methods
+ Code style changes
* DRILL-5195: Publish Operator and MajorFragment Stats in Profile page
Improved UI
1. Introduction of Tooltips
2. Share of each operator as a percentages of the major fragment and of the query
- This would help identify the most CPU intensive operators within a fragment and across the query
3. Rows emitted by each operator
4. For a running query, changes to 'last update' and 'last progress' now shows the elapsed time since.
closes #756
* DRILL-4280: CORE (service login)
+ Support Drillbit login to KDC using Hadoop's
UserGroupInformation library
+ Set hostname in BootstrapContext
+ Use process user's short name in ImpersonationUtil
+ Add KerberosUtil class
* DRILL-4280: HYGIENE
+ Do not recreate DrillConfig object in PamUserAuthenticator
+ Add new factory method to CaseInsensitiveMap
+ Clean documentation
* DRILL-4280: CORE (security package)
+ Add AuthenticatorFactory interface
+ Kerberos implementation
+ includes SaslServer and SaslClient wrappers
+ Plain implementation
+ PlainServer implements SaslServer (the JDK does not provide one for PLAIN)
for username/password based authentication
+ retrofit user authenticator
+ add logic for backward compatibility
+ Add AuthenticatorProvider interface to provide authenticator
factories, and add two implementations:
+ DrillConfig and ScanResult based AuthenticatorProviderImpl
+ Default and system property based ClientAuthenticatorProvider
+ FastSaslServerFactory caches SaslServer factories
+ FastSaslClientFactory caches SaslClient factories
+ ServerAuthenticationHandler handles authentication on server-side
+ FailingRequestHandler to fail any message received
+ AuthenticationOutcomeListener handles authentication on client-side
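For context, a minimal sketch of a PLAIN SASL client exchange using only the JDK's javax.security.sasl API; Drill's own wrappers (PlainServer, FastSaslClientFactory, ServerAuthenticationHandler) add caching and the server side, which are not shown here. Credentials, protocol and server name below are placeholders:

```java
import java.util.Collections;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.sasl.Sasl;
import javax.security.sasl.SaslClient;
import javax.security.sasl.SaslException;

// Generic JDK SASL usage for a PLAIN (username/password) exchange.
public class PlainSaslSketch {
  public static void main(String[] args) throws SaslException {
    CallbackHandler handler = callbacks -> {
      for (Callback cb : callbacks) {
        if (cb instanceof NameCallback) {
          ((NameCallback) cb).setName("alice");                    // placeholder user
        } else if (cb instanceof PasswordCallback) {
          ((PasswordCallback) cb).setPassword("secret".toCharArray()); // placeholder
        }
      }
    };

    SaslClient client = Sasl.createSaslClient(
        new String[] {"PLAIN"},      // requested mechanisms
        null,                        // authorization id
        "drill",                     // protocol name (placeholder)
        "server.example.com",        // server name (placeholder)
        Collections.emptyMap(),      // mechanism properties
        handler);

    // PLAIN sends the credentials in the initial response token.
    byte[] initial = client.hasInitialResponse()
        ? client.evaluateChallenge(new byte[0]) : new byte[0];
    System.out.println("Initial token length: " + initial.length);
    client.dispose();
  }
}
```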
* DRILL-4280: CORE (user to bit authentication, Java)
+ Add logic for authentication in UserClient and UserServer with
backward compatibility in both directions
+ Add abstract extension to ServerConnection and ClientConnection
+ Add concrete extensions to abstract connections:
BitToUserConnection and UserToBitConnection
+ Add ConnectionConfig interface with abstract and concrete
implementations to encapsulate configuration for server-side
connections
+ Encapsulate all requests handled by UserServer in
UserServerRequestHandler
+ Clear UserSession when connection is closed either by user or
bit
+ Add DrillProperties to encapsulate all connection properties
used during connection time
* DRILL-4280: CORE (revert DRILL-3242)
+ DRILL-3242 aimed to offload request handling to a secondary thread, but this feature is disabled by default due to concurrency issues
+ One of the implications of the feature was to ignore exceptions that were not of UserRpcException type. But exceptions must not be ignored; they should be handled properly, especially in the context of security
* DRILL-4280: CORE (bit to bit authentication, control)
+ Support authentication in ControlServer and ControlClient
+ Add AuthenticationCommand as an initial command after handshake
and before the command that initiates a connection
+ Add ControlConnectionConfig to encapsulate configuration
+ ControlMessageHandler now implements RequestHandler
* DRILL-4280: CORE (bit to bit authentication, data)
+ Support authentication in DataServer and DataClient
+ Add AuthenticationCommand as an initial command after handshake
and before the command that initiates a connection
+ Add DataConnectionConfig to encapsulate configuration
+ Add DataServerRequestHandler to encapsulate all handling of
requests to DataServer
* DRILL-4280: CORE (web client)
+ Disable the web server when authentication is enabled but the PLAIN mechanism
is not configured; log a warning
* DRILL-4280: CORE (C++ protocol)
* DRILL-4280: CORE (user to bit authentication, C++)
closes #578
* DRILL-4280: CORE (unit tests)
+ Modify existing tests to use new authentication configuration
+ Add TestUserBitKerberos and TestBitBitKerberos using Apache Kerby library
* Bump the max size of the jdbc-all jar to accommodate the increased size of the jar file due to new code.
* DRILL-4994: Refactor DrillCursor
Refactor DrillCursor to be more self-contained.
* DRILL-4994: Add back JDBC prepared statement for older servers
When the JDBC client is connected to an older Drill server, it always
attempted to use a server-side prepared statement, with no fallback.
With this change, the client checks the server version and falls back to the
previous client-side prepared statement (which is still limited to executing
queries and does not provide metadata).
close #613
* DRILL-4730: Update JDBC DatabaseMetaData implementation to use new Metadata APIs
Update JDBC driver to use Metadata APIs instead of executing SQL queries
close #613
* DRILL-5301: Server metadata API
Add a Server metadata API to the User protocol, to query server support
of various SQL features.
Add support to the client (DrillClient) to query this information.
Add support to the JDBC driver to query this information, if the server supports
the new API, or fallback to the previous behaviour (rely on Avatica defaults) otherwise.
close #764
* DRILL-5301: Add C++ client support for Server metadata API
Add support for the Server metadata API to the C++ client, if
available. If the API is not supported by the server, fall back
to the previous hard-coded values.
Update the querySubmitter example program to query the information.
close #764
* DRILL-5167: Send escape character for metadata queries
The escape character was not sent when doing metadata queries, which caused
the server to return incorrect results, as the pattern was interpreted
differently from what the user asked for.
close #712
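A generic JDBC sketch of why the escape character matters for metadata queries; the connection URL, schema and table names below are placeholders:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

// A literal underscore in a table-name pattern must be escaped with the
// driver's search-string escape, otherwise "_" acts as a one-character wildcard.
public class MetadataEscapeSketch {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local")) {
      DatabaseMetaData md = conn.getMetaData();
      String esc = md.getSearchStringEscape();

      // Match tables literally named "web_logs", not "web?logs".
      String pattern = "web" + esc + "_logs";
      try (ResultSet rs = md.getTables(null, "dfs.tmp", pattern, null)) {
        while (rs.next()) {
          System.out.println(rs.getString("TABLE_NAME"));
        }
      }
    }
  }
}
```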
* DRILL-5221: Send cancel message as soon as possible in C++ connector
In the C++ connector, try to send the cancel request to the server as soon as
possible: when receiving the queryId, or immediately when requested by the
user if the queryId has already been received.
close #733
* DRILL-5258: Access mock data definition from SQL
Extends the mock data source to allow using the full power of the mock
data source from an SQL query by referencing the JSON definition
file. See JIRA and package-info for details.
Adds a boolean data generator and a varying-length string generator.
Adds “mock” table stats for use in the planner.
Revisions based on code review comments
close #752
* DRILL-5284: Roll-up of final fixes for managed sort
See subtasks for details.
* Provide detailed, accurate estimate of size consumed by a record batch
* Managed external sort spills too often with Parquet data
* Managed External Sort fails with OOM
* External sort refers to the deprecated HDFS fs.default.name param
* Config param drill.exec.sort.external.batch.size is not used
* NPE in managed external sort while spilling to disk
* External Sort BatchGroup leaks memory if an OOM occurs during read
* DRILL-5294: Under certain low-memory conditions, the sort needs to be forced to merge
two batches to make progress, even though this is a bit more than
comfortably fits into memory.
close #761
* DRILL-5304: Queries fail intermittently when there is skew in data distribution
close #766
* DRILL-4963: Fix issues with dynamically loaded overloaded functions
close #701
* DRILL-5252: Fix a condition that always returns true
close #745
* DRILL-5266: Parquet returns low-density batches
Fixes one glaring problem related to bit/byte confusion.
Includes a few clean-up items found along the way.
Additional fixes from code review comments
More code clean up from code review
close #749
* DRILL-5034: Select timestamp from hive generated parquet always return in UTC
- The TIMESTAMP_IMPALA function is reverted to retain the local timezone
- TIMESTAMP_IMPALA_LOCALTIMEZONE is deleted
- Retain the local timezone for INT96 timestamp values in parquet files while
the PARQUET_READER_INT96_AS_TIMESTAMP option is on
Minor changes according to the review
Fix for the test, which relies on particular timezone
close #656
* DRILL-5290: Provide an option to build operator table once for built-in static functions and reuse it across queries.
close #757
* DRILL-5287: Provide option to skip updates of ephemeral state changes in Zookeeper
close #758
* DRILL-5208: Finding path to java executable should be deterministic
See DRILL-5208 for background. Instead of using “find” to locate the
java command, we use any information available, resorting to find
only if the “usual suspects” fail. The result is that we use the JDK
java when available, instead of randomly choosing the JDK or JRE java.
close #763
* DRILL-5313: Fix compilation issue in C++ connector
DRILL-5301 and DRILL-5167 have conflicting changes, which causes
the C++ connector to not compile: the static symbol for the search
escape string has been removed as the server might use a different one.
Fix the issue by using the current search escape string (injected from the
meta to the internal drill client when querying metadata).
close #769
* DRILL-5293: Change seed for distribution hash function to differ from that of the hash table
close #765
* DRILL-5326: Unit test failures related to the SERVER_METADATA
- added the sql type name for the "GENERIC_OBJECT";
- changed "NullCollation" in the "ServerMetaProvider" to the correct default value;
- changed RpcType to GET_SERVER_META in the appropriate ServerMethod
close #775
* Update version to 1.11.0-SNAPSHOT
* DRILL-5226: Managed external sort fixes
* Memory leak in managed sort if OOM during sv2 allocation
* "Record batch sizer" does not include overhead for variable-sized
vectors
* Paranoid double-checking of merge batch sizes to prevent OOM when the
sizes differ from expectations
* Revised logging
Addresses review comments
close apache/drill#767
* DRILL-5311: Check handshake result in C++ connector
In the C++ client connector, DrillClientImpl::recvHandshake always
returned success, even in case of a connection error (like a TCP
timeout). Only on the WIN32 platform was the error code
checked.
Remove the restriction to only check on WIN32, plus add some logging.
close apache/drill#770
* DRILL-5316: Check drillbits size before we attempt to access the vector element
close apache/drill#772
* DRILL-5330: NPE in FunctionImplementationRegistry
Fixes:
* DRILL-5330: NPE in
FunctionImplementationRegistry.functionReplacement()
* DRILL-5331:
NPE in FunctionImplementationRegistry.findDrillFunction() if dynamic
UDFs disabled
When running in a unit test, the dynamic UDF (DUDF) mechanism is not
available. When running in production, the DUDF mechanism is available,
but may be disabled.
One confusing aspect of this code is that the function registry
is given the option manager, but the option manager is not yet valid
(not yet initialized) in the function registry constructor. So, we
cannot access the option manager in the function registry constructor.
In any event, the existing system options cannot be used to disable DUDF
support. For obscure reasons, DUDF support is always enabled, even when
disabled by the user.
Instead, for DRILL-5331, we added a config option to "really" disable DUDFs.
The property, set only for tests, disables DUDF support.
Note that, in the future, this option could be generalized to
"off, read-only, on" to capture the full set of DUDF modes.
But, for now, just turning this off is sufficient.
For DRILL-5330, we use an existing option validator rather than
accessing the raw option directly.
Also includes a bit of code cleanup in the class in question.
The result is that the code now works when used in a sub-operator unit
test.
close apache/drill#777
* DRILL-5349: Fix TestParquetWriter unit tests when synchronous parquet reader is used.
close apache/drill#780
* DRILL-5352: Profile parser printing for multi fragments
Enhances the recently added ProfileParser to display run times for
queries that contain multiple fragments. (The original version handled
just a single fragment.)
Prints the query in “classic” mode if it is linear, or in the new
semi-indented mode if the query forms a tree.
Also cleans up formatting - removing spaces between parens.
Fixes from review
close apache/drill#782
* Fixed process time percent.
* Added support for getting operator profiles in a multi-fragment query.
* DRILL-5359: Fix ClassCastException when Drill pushes down filter on the output of flatten operator.
- Move findItemOrFlatten as a static method in DrillRelOptUtil.
- Exclude filter conditions if they contain item/flatten operator.
close apache/drill#786
* DRILL-4678: Tune metadata by generating a dispatcher at runtime
Main code changes are in the Calcite library.
Update Drill's Calcite version to 1.4.0-drill-r20.
close #793
* DRILL-5394: Optimize query planning for MapR-DB tables by caching row counts
close #802
* DRILL-5378: Put more information for schema change exception in hash join, hash agg, streaming agg and sort operator.
close #801
* DRILL-5369: Add initializer for ServerMetaContext
ServerMetaContext had no default constructor. The lack of one
might cause m_done to be set to true; the same goes for other members.
Add a default constructor to explicitly initialize its members.
close #791
* DRILL-5368: Fix memory leak issue in DrillClientImpl::processServerMetaResult
Fix a small memory leak by doing local allocation instead since the
object doesn't escape the function.
close #790
* DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader
close #781
* DRILL-5297: When the generated plan mismatches, PlanTest prints the generated plan along with the expected pattern
close #798
* DRILL-4971: Query encounters system error, when there aren't eval subexpressions of any function in boolean and/or expressions
- Newly evaluated blocks for boolean operators should always be enclosed in braces, since they use labels.
close #792
* DRILL-3510: Add ANSI_QUOTES option so that Drill's SQL Parser will recognize ANSI_SQL identifiers
- added support for quoting identifiers with DOUBLE_QUOTES or BRACKETS by setting the new
sys/sess EnumString option QUOTING_IDENTIFIERS;
- added the ability to set QUOTING_IDENTIFIERS via the JDBC connection URL string;
- added relevant unit tests;
close #520
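A hedged sketch of how the option might be supplied through the JDBC URL, per the commit description; the exact URL key ("quoting_identifiers") and the sample query are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustration only: asks the parser to accept ANSI double-quoted identifiers
// instead of backticks; the URL property key is an assumption from the commit text.
public class QuotingIdentifiersSketch {
  public static void main(String[] args) throws SQLException {
    String url = "jdbc:drill:zk=local;quoting_identifiers=\"";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT \"employee_id\" FROM cp.\"employee.json\" LIMIT 1")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}
```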
* DRILL-5373: Drill JDBC error in the process of connection via SQuirrel
- java.lang.NoClassDefFoundError: javax/validation/constraints/NotNull
* DRILL-5413: DrillConnectionImpl.isReadOnly() throws NullPointerException
The change is in CALCITE-843.
Update Drill's Calcite version to 1.4.0-drill-r21.
close #806
* DRILL-5375: Nested loop join: return correct result for left join closes #794
* DRILL-5315: Address small typo in the comment in drillClient.hpp closes #771
* DRILL-5355: Misc. code cleanup closes #784
* DRILL-5234: External sort's spilling functionality does not work when the spilled columns contains a map type column closes #799
* DRILL-5395: Query on MapR-DB table fails with NPE due to an issue with assignment logic closes #803
* DRILL-5387: Ignore TestBitBitKerberos and TestUserBitKerberos closes #810
* DRILL-5319: Refactor "contexts" for unit testing closes #787
This PR is purely a refactoring: no functionality is added or changed.
The refactoring splits various context and related classes into a set
of new interfaces needed for operator-level unit tests. The other,
Drillbit-related methods are left in the original interfaces. Most code
need not change.
The changes here allow operator-level unit tests to mock up the
exec-time methods so we can use them without firing up a Drillbit (or
using mocking libraries).
A later PR will provide the sub-operator test framework that uses this
refactoring.
Changes include:
* The OptionManager is split, with read-only methods moving to a new
OptionSet interface.
* The FragmentContext is split, with an exec-only FragmentExecContext
providing low-level methods.
* OperatorStats is split, with a new OperatorStatReceiver class
providing write-only support to operators.
* Several places that accepted an OperatorContext or FragmentContext,
but needed only an allocator, are changed to accept the allocator
directly.
Includes fixes for code review comments
Adds more comments. Postpones the suggested rename until all affected
code is in master, else it will be difficult to synchronize the rename
across multiple branches.
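An illustrative sketch of the kind of read-only interface extraction described above; the names below are hypothetical approximations, not the exact Drill interfaces:

```java
// Hypothetical shape of the split: operators see only a read-only view,
// while the full read/write manager extends it.
interface OptionSet {                              // read-only view used by operators
  long getLongOption(String name);
  boolean getBooleanOption(String name);
}

interface OptionManagerLike extends OptionSet {    // full read/write manager
  void setOption(String name, Object value);
}

// An operator-level unit test can then supply a trivial in-memory OptionSet
// without starting a Drillbit.
class FixedOptionSet implements OptionSet {
  private final java.util.Map<String, Object> values = new java.util.HashMap<>();

  FixedOptionSet set(String name, Object value) {
    values.put(name, value);
    return this;
  }

  @Override public long getLongOption(String name) {
    return ((Number) values.get(name)).longValue();
  }

  @Override public boolean getBooleanOption(String name) {
    return (Boolean) values.get(name);
  }
}
```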
* DRILL-5213: Prepared statement for actual query is missing the query text
close apache/drill#812
* DRILL-5409 - update MapR version to 5.2.1
close apache/drill#813
* DRILL-5415: Improve Fixture Builder to configure client properties and keep collection type properties for server
Updated with review feedback
close apache/drill#807
* DRILL-5424: Fix IOBE for reverse function
close apache/drill#815
* Test-specific column accessor implementation. Provides a simplified, unified set of access methods for value vectors, specifically for writing simple, compact unit test code.
* Interfaces for column readers and writers
* Interfaces for tuple (row and map) readers and writers
* Generated implementations
* Base implementation used by the generated code
* Factory class to create the proper reader or writer given a major
type (type and cardinality)
* Utilities for generic access, type conversions, etc.
Many vector types can be mapped to an int for get and set. One key
exception is the decimal types: decimals, by definition, require a
different representation. In Java, that is `BigDecimal`. Added get, set
and setSafe accessors as required for each decimal type that uses
`BigDecimal` to hold data.
The generated code builds on the `valueVectorTypes.tdd` file, adding
additional properties needed to generate the accessors.
The PR also includes a number of code cleanups done while reviewing
existing code. In particular `DecimalUtility` was very roughly
formatted and thus hard to follow.
Supports Drill’s interval types (INTERVAL, INTERVALDAY,
INTERVALYEAR) in the form of the Joda interval class.
Adds support for Map vectors. Maps are treated as nested tuples and are
expanded out to create a flattened row in the schema. The accessors
then access rows using the flattened column index or the combined name
(“a.b”).
Supports arrays via a writer interface that appends values as written,
and an indexed, random-access reader interface.
Removed HTTP log parser from JDBC jar to keep the JDBC jar from getting
too big.
close apache/drill#783
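A hypothetical sketch of the unified accessor shape described above; these are illustrative interfaces only, not the generated Drill classes:

```java
import java.math.BigDecimal;

// Illustrative shape of the unified test accessors; names are hypothetical.
interface ColumnWriter {
  void setInt(int value);            // most fixed-width types map to an int
  void setString(String value);      // VARCHAR and friends
  void setDecimal(BigDecimal value); // decimals need BigDecimal, as noted above
}

interface TupleWriter {
  ColumnWriter column(int colIndex);       // by flattened position
  ColumnWriter column(String dottedName);  // by combined name, e.g. "a.b" for maps
  void save();                             // finish the current row
}
```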
* DRILL-5323: Test tools for row sets
Provide test tools to create, populate and compare row sets
To simplify tests, we need a TestRowSet concept that wraps a
VectorContainer and provides easy ways to:
- Define a schema for the row set.
- Create a set of vectors that implement the schema.
- Populate the row set with test data via code.
- Add an SV2 to the row set.
- Pass the row set to operator components (such as generated code
blocks.)
- Examine the contents of a row set
- Compare the results of the operation with an expected result set.
- Dispose of the underlying direct memory when work is done.
This code builds on that in DRILL-5324 to provide a complete row set
API. See DRILL-5318 for the spec.
Note: this code can be reviewed as-is, but cannot be committed until
after DRILL-5324 is committed: this code has compile-time dependencies
on that code. This PR will be rebased once DRILL-5324 is pulled into
master.
Handles maps and intervals
The row set schema is refined to provide two forms of schema. A
physical schema shows the nested structure of the data with maps
expanding into their contents.
Updates the row set schema builder to easily build a schema with maps.
An access schema shows the row “flattened” to include just scalar
(non-map) columns, with all columns at a single level, with dotted
names identifying nested fields. This form makes for very simple access.
Then, provides tools for reading and writing batches with maps by
presenting the flattened view to the row reader and writer.
HyperVectors have a very complex structure for maps. The hyper row set
implementation takes a first crack at mapping that structure into the
standardized row set format.
Also provides a handy way to set an INTERVAL column from an int. There
is no good mapping from an int to an interval, so an arbitrary
convention is used. This convention is not generally useful, but is
very handy for quickly generating test data.
As before, this is a partial PR. The code here still depends on
DRILL-5324 to provide the column accessors needed by the row reader and
writer.
All this code is getting rather complex, so this commit includes a unit
test of the schema and row set code.
Revisions to support arrays
Arrays require a somewhat different API. Refactored to allow arrays to
appear as a field type.
While refactoring, moved interfaces to more logical locations.
Added more comments.
Rejiggered the row set schema to provide both a physical and flattened
(access) schema, both driven from the original batch schema.
Pushed some accessor and writer classes into the accessor layer.
Added tests for arrays.
Also added more comments where needed.
Moved tests to DRILL-5318
The test classes previously here depend on the new “operator fixture”.
To provide a non-cyclic checkin order, moved the tests to the PR with
the fixtures so that this PR is clear of dependencies. The tests were
reviewed in the context of DRILL-5318.
Also pulls in batch sizer support for map fields which are required by
the tests.
closes #785
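The kind of test code these tools are intended to enable, assuming builder-style classes along the lines named in this commit (SchemaBuilder, RowSetBuilder, RowSetComparison); exact packages and signatures may differ, and the allocator is assumed to come from the test fixture:

```java
// Illustrative test fragment: build two row sets with the same schema and
// compare them; verifyAndClearAll also frees the underlying vectors.
BatchSchema schema = new SchemaBuilder()
    .add("id", MinorType.INT)
    .add("name", MinorType.VARCHAR)
    .build();

SingleRowSet actual = new RowSetBuilder(allocator, schema)
    .add(1, "wilma")
    .add(2, "fred")
    .build();

SingleRowSet expected = new RowSetBuilder(allocator, schema)
    .add(1, "wilma")
    .add(2, "fred")
    .build();

new RowSetComparison(expected).verifyAndClearAll(actual);
```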
* DRILL-5318: Sub-operator test fixture
This commit depends on:
* DRILL-5323
This PR cannot be accepted (or built) until the above are pulled and
this PR is rebased on top of them. The PR is issued now so that reviews
can be done in parallel.
Provides the following:
* A new OperatorFixture to set up all the objects needed to test at the
sub-operator level. This relies on the refactoring to create the
required interfaces.
* Pulls the config builder code out of the cluster fixture builder so
that configs can be build for sub-operator tests.
* Modifies the QueryBuilder test tool to run a query and get back one
of the new row set objects to allow direct inspection of data returned
from a query.
* Modifies the cluster fixture to create a JDBC connection to the test
cluster. (Use requires putting the Drill JDBC project on the test class
path since exec does not depend on JDBC.)
Created a common subclass for the cluster and operator fixtures to
abstract out the allocator and config. Also provides temp directory
support to the operator fixture.
Merged with DRILL-5415 (Improve Fixture Builder to configure client
properties)
Moved row set tests here from DRILL-5323 so that DRILL-5323 is self
contained. (The tests depend on the fixtures defined here.)
Added comments where needed.
Puts code back as it was prior to a code review comment. The code is
redundant, but necessarily so due to code which is specific to several
primitive types.
closes #788
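A sketch of the intended usage described above; the factory method and accessor names are assumptions, not verified against the source:

```java
// The fixture is AutoCloseable, so a try-with-resources block releases the
// allocator and other resources when the test is done.
try (OperatorFixture fixture = OperatorFixture.standardFixture()) {
  BufferAllocator allocator = fixture.allocator();
  // Build vectors or row sets with the allocator and exercise the operator
  // or record-reader code under test directly, without starting a Drillbit.
}
```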
* DRILL-5391: CTAS: make folder and file permission configurable
close #820
* DRILL-5428: submit_plan fails after Drill 1.8 script revisions
When the other scripts were updated, submit_plan was not corrected.
After Drill 1.8, drill-config.sh consumes all command line arguments,
finds the --config and --site options, removes them, and places the rest
in the new args array.
This PR updates submit_plan to use the new args array.
The fix was tested on a test cluster: we verified that a physical plan
was submitted and ran.
closes #816
* DRILL-5385: Vector serializer fails to read saved SV2
Unit testing revealed that the VectorAccessorSerializable class claims
to serialize SV2s, but, in fact, does not. Actually, it writes them,
but does not read them, resulting in corrupted data on read.
Fortunately, no code appears to serialize sv2s at present. Still, it is
a bug and needs to be fixed.
First task is to add serialization code for the sv2.
That revealed that the recently-added code to save DrillBufs using a
shared buffer had a bug: it relied on the writer index to know how much
data is in the buffer. Turns out sv2 buffers don’t set this index. So,
new versions of the write function takes a write length.
Then, closer inspection of the read code revealed duplicated code. So,
DrillBuf allocation moved into a version of the read function that now
does reading and DrillBuf allocation.
Turns out that value vectors, but not SV2s, can be built from a
DrillBuf. Added a matching constructor to the SV2 class.
Finally, cleaned up the code a bit to make it easier to follow. Also
allowed test code to access the handy timer already present in the code.
closes #800
* DRILL-5344: External sort priority queue copier fails with an empty batch
Unit tests showed that the “priority queue copier” does not handle an
empty batch. This has not been an issue because code elsewhere in the
sort specifically works around this issue. This fix resolves the issue
at the source to avoid the need for future work-arounds.
closes #778
* DRILL-5423: Refactor ScanBatch to allow unit testing record readers
Refactors ScanBatch to allow unit testing of record reader
implementations, especially the “writer” classes.
See JIRA for details.
closes #811
* DRILL-4039: Query fails when non-ascii characters are used in string literals
closes #825
* DRILL-5419: Calculate return string length for literals & some string functions
1. Revisited calculation logic for string literals and some string functions
(cast, upper, lower, initcap, reverse, concat, concat operator, rpad, lpad, case statement,
coalesce, first_value, last_value, lag, lead).
Synchronized return type length calculation logic between limit 0 and regular queries.
2. Deprecated width and changed it to precision for string types in MajorType.
3. Revisited FunctionScope and split it into FunctionScope and ReturnType.
FunctionScope will indicate only function usage in terms of the number of in/out rows (n -> 1, 1 -> 1, 1 -> n).
The new UDF annotation ReturnType will indicate which return type strategy should be used.
4. Changed MAX_VARCHAR_LENGTH from 65536 to 65535…