
DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2 #2485

Merged: 4 commits into apache:master on Mar 24, 2022

Conversation

paul-rogers (Contributor):

Description

This PR covers the remaining bits of the original DRILL-8085 PR: the text and HTTPD log plugins.

The text plugin is surprisingly complex. Upgrading it is helpful technically, but it does not serve as a very good example. So this PR also upgrades the far simpler HTTPD log plugin to act as an example for other conversions.

Note that, when doing conversions, you can rip out the prior do-it-yourself limit code: EVF now handles this work.
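To make that concrete, here is a minimal sketch of the resulting reader loop. This is illustrative only: rowWriter and readRow() are invented names, not the literal plugin code. The point is that the old hand-rolled maxRecords counter disappears; the reader just fills the batch and the framework stops the scan.

  @Override
  public boolean next() {
    // No do-it-yourself limit counter: the framework enforces both
    // the batch size and any pushed-down LIMIT via isFull().
    while (!rowWriter.isFull()) {
      if (!readRow()) {
        return false;   // end of input
      }
    }
    return true;        // more input may remain
  }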

Also includes:

  • DRILL-8159: Convert the HTTPD reader to use EVF V2

Documentation

No changes: this is purely an internal change.

Testing

GitHub will helpfully run the unit tests. (Last time I tried, they didn't run on my machine.)

@luocooong added the performance, refactoring, and updates labels on Mar 7, 2022
@luocooong luocooong self-requested a review March 7, 2022 10:48
@cgivre (Contributor) left a comment:

@paul-rogers Thanks for this! These should serve as great exemplars for converting all the other formats to EVF2.

I had a few points which are more questions than anything.
Thanks!

// Drill 1.17 and before used "delimiter" in the
// bootstrap storage plugins file, assume many instances
// exist in the field.
@JsonAlias("delimiter")
Contributor:

That's cool... I didn't know you could do that.

paul-rogers (author):

To be honest, I don't even remember adding this, but I must have read up on Jackson at some point.
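For anyone following along, @JsonAlias is standard Jackson (com.fasterxml.jackson.annotation): aliases are accepted when deserializing, while only the canonical property name is used when serializing. A self-contained sketch; the Config class and the fieldDelimiter name here are illustrative stand-ins, not the plugin's actual class:

  import com.fasterxml.jackson.annotation.JsonAlias;
  import com.fasterxml.jackson.databind.ObjectMapper;

  public class AliasDemo {
    public static class Config {
      @JsonAlias("delimiter")   // legacy spelling, still accepted on input
      public String fieldDelimiter;
    }

    public static void main(String[] args) throws Exception {
      ObjectMapper mapper = new ObjectMapper();
      // Both spellings deserialize into the same field.
      Config legacy = mapper.readValue("{\"delimiter\":\"|\"}", Config.class);
      Config current = mapper.readValue("{\"fieldDelimiter\":\"|\"}", Config.class);
      System.out.println(legacy.fieldDelimiter);              // |
      System.out.println(current.fieldDelimiter);             // |
      // Serialization always emits the canonical name.
      System.out.println(mapper.writeValueAsString(legacy));  // {"fieldDelimiter":"|"}
    }
  }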

this.extractHeader = extractHeader == null ? false : extractHeader;
}

public TextFormatConfig() {
Contributor:

What is the purpose of this constructor? If it is meant as a default constructor, should it set common default values?

paul-rogers (author):

Heck if I know: I just moved the code from one file to another. A search showed that the only reference is here:

  public TextFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
    this(name, context, fsConf, storageConfig, new TextFormatConfig());
  }

Which itself has zero references.

I removed both. Let's see if tests still pass.

this.escape = Strings.isNullOrEmpty(escape) ? '"' : escape.charAt(0);
this.comment = Strings.isNullOrEmpty(comment) ? '#' : comment.charAt(0);
this.skipFirstLine = skipFirstLine == null ? false : skipFirstLine;
this.extractHeader = extractHeader == null ? false : extractHeader;
Contributor:

One of the major annoyances of Drill (IMHO) is its weird behavior with CSV files. We could give users a better experience by making the extractHeader field default to true, so that when you query tabular data you get what you'd expect... a table.

paul-rogers (author):

If we change this default, we risk breaking production systems. I would suggest doing this in Drill 2.0, if the team feels that the old big-data use cases are no longer common.
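(For context: users who want headers today can already opt in per format in the storage plugin config. A sketch of the relevant JSON fragment, using the field names from TextFormatConfig; exact placement depends on the plugin definition:)

  "formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "extractHeader": true
    }
  }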


private final static String PLUGIN_NAME = "text";

public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
  final static String PLUGIN_NAME = "text";
Contributor:

I know this would probably break things, but I'll ask anyway: do you think this plugin would be better named delimited? It really refers to delimited formats such as PSV, TSV, etc. If this is a major breaking change, don't worry about it. It's always bothered me, though.

paul-rogers (author):

Drill was originally designed for big data: huge CSV files that get split across readers. That huge file might have a header, but more typically did not. The CSV reader still handles both cases: if the file has a header, it will seek to the beginning of the file even if the "file work" says to start at an offset of 256MB, etc.

In your use case, you likely work with "small data", but with much variety, and you want to use Drill as a convenient way to work with such data sets. (Hence all the recent work on the REST API, which is decidedly not a big-data solution!)

If we change the default, then any "big data" users will find their apps break if they still happen to use CSV files without headers.

This might be something the team chooses to change in Drill 2.0 along with other breaking changes: shift the focus from old-style HDFS files to modern S3 files and your case of small, data-science-style files.

Contributor:

I wonder if it might make sense to follow the model of some of the other format readers where if the column headers are not known, it assigns a name of field_n to them. That way you still get columns instead of the columns array.

My view is that if we change the default but still allow users to revert with a config change, it's not that big of a deal.

@jnturton (Contributor), Mar 23, 2022:

@cgivre @paul-rogers the master branch is now rolling towards 2.0. If we don't take the chance in this cycle to improve user-facing stuff, it's a long time until the next chance.

I do think we should try to segregate our breaking changes into their own PRs, and therefore squashed commits, as much as we can, so that our ability to cherry-pick from master back to a 1.20.1 or, god forbid, a 1.21 is maximally preserved. We could also label such commits with "[BREAKING]", or apply a breaking tag to the PR in GitHub; I wonder if that would be helpful.

FWIW, I'm +1 for turning columns[n] into column_n (I think this terminology is more standard for SQL result sets than field), +1 for "text" becoming "delimited", and +0 on the default value of extractHeader, which is trivial for users to set for themselves.

paul-rogers (author):

The team can certainly change many things in the 2.0 branch. But this is not a 2.0 PR: the goal of this PR is to preserve functionality and simply upgrade the underlying reader mechanism to use EVF V2.

paul-rogers (author):

On the field_x vs. columns[] issue: that part is possible in the current code, as an option. There is a "columns" class in EVF that handles the columns[] column. I suspect that a parallel "fields" version could be created that generates the numbered fields instead of array indexes. An option on the plugin would choose one or the other for the no-headers case.

In fact, with this approach, the individual field_n columns could be typed, based on first-row type inference: something not possible with columns[].

Anyone up for giving it a shot? I'd be happy to offer pointers along the way.
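To make the idea concrete, a hypothetical sketch of the no-headers naming; makeFieldNames is invented for illustration and is not part of this PR:

  // Proposed option: when a delimited file has no header row, synthesize
  // names field_0 .. field_{n-1} instead of packing values into columns[].
  static String[] makeFieldNames(int fieldCount) {
    String[] names = new String[fieldCount];
    for (int i = 0; i < fieldCount; i++) {
      names[i] = "field_" + i;
    }
    return names;
  }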

@luocooong (Member) left a comment:

@paul-rogers Thanks for this great refactoring. In addition to Charles's question, LGTM +1.

paul-rogers (author):

Another thought on the CSV (and other) readers: since they were originally designed for an HDFS-like file system, they may not be ideal when Drill is used as a data science tool against "classic" desktop-style files. For one thing, there is no need for distribution when reading a 10MB (or 100MB) CSV file. For another, the compromises made in the old big-data days don't apply to desktop use. (In fact, much of Drill's machinery is overkill when Drill runs as a single process.)

Drill 2.0 is an opportunity to reorient Drill away from the fading big data space and toward the data science use cases that most PRs now seem to support. (It's not that big data itself is gone; it's just that most folks who need that kind of scale now run in the cloud, where Drill is not common.) As one of many examples, REST APIs make no sense at scale, but do make sense for a "small data" tool.

Or, have two Drill editions: the old-school "distributed systems" edition and the newer "data science" edition. Those who still need Drill to work distributed can keep that edition going (along with the big-data CSV quirks), while the data science folks can fork the data science edition, chuck the distributed-systems stuff that gets in the way, and focus on things that data scientists do (such as reading Excel and PDF files).

A step in that direction would be to create a "data science edition" of things like the CSV reader: have it require headers (to avoid the need for the odd columns column). Go ahead and sample the first 20 or 100 rows, like Pandas does, to infer types. Don't worry about distributed scans. And so on. Much of the distributed-systems weirdness can be ignored when running a single process on small files. (Drill was created for the opposite reason: the prior small-data tools don't work at scale.)
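(A hypothetical sketch of that sampling idea, invented for illustration and not part of this PR: scan the first N values of a column and demote the inferred type as needed.)

  import java.util.List;

  public class TypeSampler {
    enum InferredType { BIGINT, DOUBLE, VARCHAR }

    // All-integer samples infer BIGINT, all-numeric samples infer DOUBLE,
    // anything else falls back to VARCHAR.
    static InferredType inferType(List<String> sampledValues) {
      InferredType best = InferredType.BIGINT;
      for (String value : sampledValues) {
        if (best == InferredType.BIGINT && !value.matches("-?\\d+")) {
          best = InferredType.DOUBLE;
        }
        if (best == InferredType.DOUBLE
            && !value.matches("-?(\\d+(\\.\\d*)?|\\.\\d+)([eE][+-]?\\d+)?")) {
          return InferredType.VARCHAR;
        }
      }
      return best;
    }
  }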

Users at scale will use the "distributed" versions of the readers; those who run Drill embedded or single-server can enable the "desktop" versions.


lgtm-com bot commented Mar 21, 2022

This pull request introduces 3 alerts and fixes 3 when merging 82e1055 into 7ef203b - view on LGTM.com

new alerts:

  • 3 for Dereferenced variable may be null

fixed alerts:

  • 3 for Dereferenced variable may be null

paul-rogers (author):

Looks like some of the secondary builds failed due to issues outside this PR. Can someone who understands that stuff please double-check?

luocooong (Member):

@paul-rogers Hi. Could you please rebase on the master branch? It contains a patch for the Travis CI issue. Thanks.

cgivre (Contributor) commented Mar 23, 2022:

@paul-rogers I'd second @luocooong 's message. The Travis CI has been a pain and I think rebasing will solve the issues.

jnturton (Contributor):

> Drill 2.0 is an opportunity to reorient Drill away from the fading big data space and toward the data science use cases that most PRs now seem to support. (It's not that big data itself is gone; it's just that most folks who need that kind of scale now run in the cloud, where Drill is not common.) As one of many examples, REST APIs make no sense at scale, but do make sense for a "small data" tool.
> Or, have two Drill editions: the old-school "distributed systems" edition and the newer "data science" edition. Those who still need Drill to work distributed can keep that edition going (along with the big-data CSV quirks), while the data science folks can fork the data science edition, chuck the distributed-systems stuff that gets in the way, and focus on things that data scientists do (such as reading Excel and PDF files).

@paul-rogers supporting two editions of Drill would be even harder for our small band of developers than supporting one, surely? Also, I think it would be a major loss to unpick all of the MPP work done in Drill to make big data queryable, in any notional edition. Indeed, for a small-data-only query engine, I doubt that there would be any sense in starting from Drill at all. A fresh start based on Calcite, Pandas or Julia would be simpler and cleaner.

Many people do their big data processing in the cloud, but not all of them want the vendor lock-in of the SaaS products, so they prefer to deploy open source in the cloud. Others remain on-prem. In addition, I still contend that the worlds of small and big data are not disjoint, and that a single system that can query over many storage systems, formats and data sizes is valuable to a viable audience. If the contrib/ plugins can be sufficiently separated from the rest of Drill, then the variability in their scalability and behaviour is quarantined away from core Drill.

paul-rogers (author):

@jnturton you make good points. I somewhat worry that folks who build large distributed systems are a bit under-represented in our current wonderful group of contributors, so I do see a drift toward small-scale concerns. (For example, I've never seen any shop query 1000s of Excel files, but folks do that every day for JSON, Parquet and CSV.)

Similarly, the data science folks want Drill to do it all in SQL. But, at scale, people build out data pipelines so that the files which Drill queries are the result of a process that produces a query-optimized format, such as well-partitioned Parquet.

The idea of having two forks (or modes) was an attempt to preserve Drill's distributed system heritage, while also allowing the current contributors to go wild on the data science use cases.

Maybe a simpler solution is to just have different "bootstrap" files: the at-scale edition, the data science edition, etc. Each edition includes plugins and defaults optimized for the target use case.

paul-rogers (author):

Rebased on master. It is now failing on the C++ code check, which is a bit odd because I don't recall seeing much C++ code in Drill...


lgtm-com bot commented Mar 24, 2022

This pull request introduces 3 alerts and fixes 3 when merging 9466921 into ed3b1d3 - view on LGTM.com

new alerts:

  • 3 for Dereferenced variable may be null

fixed alerts:

  • 3 for Dereferenced variable may be null

jnturton (Contributor):

> Rebased on master. It is now failing on the C++ code check, which is a bit odd because I don't recall seeing much C++ code in Drill...

What are these new CodeQL actions here, a rebranding of LGTM? And what prompted this new analysis of our C++? We previously only analysed Java, JS and Python, IIRC.

(These are more questions for us, @cgivre @luocooong @vdiravka, than for you @paul-rogers)

luocooong (Member):

@paul-rogers @jnturton Don't worry, I think Charles accidentally pushed a local CodeQL commit. Let's wait for him for a few hours.

cgivre (Contributor) commented Mar 24, 2022:

@jnturton I added a codeql check to the repo. CodeQL suggested CPP, Java, and JavaScript as languages. I thought we'd try it and see if it adds any value or not. In any event, I removed cpp. I'm fine with merging at this point.

@cgivre cgivre merged commit fb016f4 into apache:master Mar 24, 2022
luocooong pushed a commit to luocooong/drill that referenced this pull request Mar 25, 2022
…pache#2485)

* DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Also includes:

* DRILL-8159: Convert the HTTPD reader to use EVF V2

* Build fix

* Changes from review comments

* Fix test issue
jnturton pushed a commit to jnturton/drill that referenced this pull request Jul 11, 2022
…pache#2485)

* DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Also includes:

* DRILL-8159: Convert the HTTPD reader to use EVF V2

* Build fix

* Changes from review comments

* Fix test issue