
DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2 #2485

Merged: 4 commits into apache:master on Mar 24, 2022

Conversation

paul-rogers (Contributor):

Description

This PR covers the remaining bits of the original DRILL-8085 PR: the text and HTTPD log plugins.

The text plugin is surprisingly complex. Upgrading it is helpful technically, but it does not serve as a very good example. So this PR also upgrades the far simpler HTTPD log plugin to act as an example for other conversions.

Note that, when doing conversions, you can rip out the prior do-it-yourself limit code: EVF now handles this work.
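To make that concrete, here is a minimal sketch of the resulting reader loop. This is illustrative only: rowWriter and readRow() are invented names, not the literal plugin code. The point is that the old hand-rolled maxRecords counter disappears; the reader just fills the batch and the framework stops the scan.

  @Override
  public boolean next() {
    // No do-it-yourself limit counter: the framework enforces both
    // the batch size and any pushed-down LIMIT via isFull().
    while (!rowWriter.isFull()) {
      if (!readRow()) {
        return false;   // end of input
      }
    }
    return true;        // more input may remain
  }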

Also includes:

  • DRILL-8159: Convert the HTTPD reader to use EVF V2

Documentation

No changes: this is purely an internal change.

Testing

GitHub will helpfully run the unit tests. (Last time I tried, they didn't run on my machine.)

@luocooong added the performance, refactoring, and updates labels on Mar 7, 2022
@luocooong luocooong self-requested a review March 7, 2022 10:48
@cgivre (Contributor) left a comment:

@paul-rogers Thanks for this! These should serve as great exemplars for converting all the other formats to EVF2.

I had a few points which are more questions than anything.
Thanks!

// Drill 1.17 and before used "delimiter" in the
// bootstrap storage plugins file, assume many instances
// exist in the field.
@JsonAlias("delimiter")
Contributor:

That's cool... I didn't know you could do that.

paul-rogers (author):

To be honest, I don't even remember adding this, but I must have read up on Jackson at some point.
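For anyone following along, @JsonAlias is standard Jackson (com.fasterxml.jackson.annotation): aliases are accepted when deserializing, while only the canonical property name is used when serializing. A self-contained sketch; the Config class and the fieldDelimiter name here are illustrative stand-ins, not the plugin's actual class:

  import com.fasterxml.jackson.annotation.JsonAlias;
  import com.fasterxml.jackson.databind.ObjectMapper;

  public class AliasDemo {
    public static class Config {
      @JsonAlias("delimiter")   // legacy spelling, still accepted on input
      public String fieldDelimiter;
    }

    public static void main(String[] args) throws Exception {
      ObjectMapper mapper = new ObjectMapper();
      // Both spellings deserialize into the same field.
      Config legacy = mapper.readValue("{\"delimiter\":\"|\"}", Config.class);
      Config current = mapper.readValue("{\"fieldDelimiter\":\"|\"}", Config.class);
      System.out.println(legacy.fieldDelimiter);              // |
      System.out.println(current.fieldDelimiter);             // |
      // Serialization always emits the canonical name.
      System.out.println(mapper.writeValueAsString(legacy));  // {"fieldDelimiter":"|"}
    }
  }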

this.extractHeader = extractHeader == null ? false : extractHeader;
}

public TextFormatConfig() {
Contributor:

What is the purpose of this constructor? If it is meant as a default constructor, should it set common default values?

paul-rogers (author):

Heck if I know: I just moved the code from one file to another. A search showed that the only reference is here:

  public TextFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
    this(name, context, fsConf, storageConfig, new TextFormatConfig());
  }

Which itself has zero references.

I removed both. Let's see if tests still pass.

this.escape = Strings.isNullOrEmpty(escape) ? '"' : escape.charAt(0);
this.comment = Strings.isNullOrEmpty(comment) ? '#' : comment.charAt(0);
this.skipFirstLine = skipFirstLine == null ? false : skipFirstLine;
this.extractHeader = extractHeader == null ? false : extractHeader;
Contributor:

One of the major annoyances of Drill (IMHO) is its weird behavior with CSV files. We could give users a better experience by making the extractHeader field default to true, so that when you query tabular data you get what you'd expect... a table.

paul-rogers (author):

If we change this default, we risk breaking production systems. I would suggest doing this in Drill 2.0, if the team feels that the old big-data use cases are no longer common.
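(For context: users who want headers today can already opt in per format in the storage plugin config. A sketch of the relevant JSON fragment, using the field names from TextFormatConfig; exact placement depends on the plugin definition:)

  "formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "extractHeader": true
    }
  }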


private final static String PLUGIN_NAME = "text";

public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
  final static String PLUGIN_NAME = "text";
Contributor:

I know this would probably break things, but I'll ask anyway: do you think this plugin would be better named delimited? It really refers to delimited formats such as PSV, TSV, etc. If this is a major breaking change, don't worry about it. It's always bothered me, though.

paul-rogers (author):

Drill was originally designed for big data: huge CSV files that get split across readers. That huge file might have a header, but more typically did not. The CSV reader still handles both cases: if the file has a header, it will seek to the beginning of the file even if the "file work" says to start at an offset of 256MB, etc.

In your use case, you likely work with "small data", but with much variety, and you want to use Drill as a convenient way to work with such data sets. (Hence all the recent work on the REST API, which is decidedly not a big-data solution!)

If we change the default, then any "big data" users will find their apps break if they still happen to use CSV files without headers.

This might be something the team chooses to change in Drill 2.0 along with other breaking changes: shift the focus from old-style HDFS files to modern S3 files and your case of small, data-science-style files.

Contributor:

I wonder if it might make sense to follow the model of some of the other format readers where if the column headers are not known, it assigns a name of field_n to them. That way you still get columns instead of the columns array.

My view is that if we change the default but still allow users to revert with a config change, it's not that big of a deal.

@jnturton (Contributor), Mar 23, 2022:

@cgivre @paul-rogers the master branch is now rolling towards 2.0. If we don't take the chance in this cycle to improve user-facing stuff, it's a long time until the next chance.

I do think we should try to segregate our breaking changes into their own PRs, and therefore squashed commits, as much as we can, so that our ability to cherry-pick from master back to a 1.20.1 or, god forbid, a 1.21 is maximally preserved. We could also label such commits with "[BREAKING]", or apply a breaking tag to the PR in GitHub; I wonder if that would be helpful.

FWIW, I'm +1 for turning columns[n] into column_n (I think this terminology is more standard for SQL result sets than field), +1 for "text" becoming "delimited", and +0 on the default value of extractHeader, which is trivial for users to set for themselves.

paul-rogers (author):

The team can certainly change many things in the 2.0 branch. But this is not a 2.0 PR: the goal of this PR is to preserve functionality and simply upgrade the underlying reader mechanism to use EVF V2.

paul-rogers (author):

On the field_x vs. columns[] issue: that part is possible in the current code, as an option. There is a "columns" class in EVF that handles the columns[] column. I suspect that a parallel "fields" version could be created that generates the numbered fields instead of array indexes. An option on the plugin would choose one or the other for the no-headers case.

In fact, with this approach, the individual field_n columns could be typed, based on first-row type inference: something not possible with columns[].

Anyone up for giving it a shot? I'd be happy to offer pointers along the way.
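To make the idea concrete, a hypothetical sketch of the no-headers naming; makeFieldNames is invented for illustration and is not part of this PR:

  // Proposed option: when a delimited file has no header row, synthesize
  // names field_0 .. field_{n-1} instead of packing values into columns[].
  static String[] makeFieldNames(int fieldCount) {
    String[] names = new String[fieldCount];
    for (int i = 0; i < fieldCount; i++) {
      names[i] = "field_" + i;
    }
    return names;
  }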

@luocooong (Member) left a comment:

@paul-rogers Thanks for this great refactoring. In addition to Charles's question, LGTM +1.

paul-rogers (author):

Another thought on the CSV (and other) readers: since they were originally designed for an HDFS-like file system, they may not be ideal when Drill is used as a data science tool against "classic" desktop-style files. For one thing, there is no need for distribution when reading a 10MB (or 100MB) CSV file. For another, the compromises made in the old big-data days don't apply to desktop use. (In fact, much of Drill's machinery is overkill when Drill runs as a single process.)

Drill 2.0 is an opportunity to reorient Drill away from the fading big data space and toward the data science use cases that most PRs now seem to support. (It's not that big data itself is gone; it's just that most folks who need that kind of scale now run in the cloud, where Drill is not common.) As one of many examples, REST APIs make no sense at scale, but do make sense for a "small data" tool.

Or, have two Drill editions: the old-school "distributed systems" edition and the newer "data science" edition. Those who still need Drill to work distributed can keep that edition going (along with the big-data CSV quirks), while the data science folks can fork the data science edition, chuck the distributed-systems stuff that gets in the way, and focus on things that data scientists do (such as reading Excel and PDF files).

A step in that direction would be to create a "data science edition" of things like the CSV reader: have it require headers (to avoid the need for the odd columns column). Go ahead and sample the first 20 or 100 rows, like Pandas does, to infer types. Don't worry about distributed scans. And so on. Much of the distributed-systems weirdness can be ignored when running a single process on small files. (Drill was created for the opposite reason: the prior small-data tools don't work at scale.)
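(A hypothetical sketch of that sampling idea, invented for illustration and not part of this PR: scan the first N values of a column and demote the inferred type as needed.)

  import java.util.List;

  public class TypeSampler {
    enum InferredType { BIGINT, DOUBLE, VARCHAR }

    // All-integer samples infer BIGINT, all-numeric samples infer DOUBLE,
    // anything else falls back to VARCHAR.
    static InferredType inferType(List<String> sampledValues) {
      InferredType best = InferredType.BIGINT;
      for (String value : sampledValues) {
        if (best == InferredType.BIGINT && !value.matches("-?\\d+")) {
          best = InferredType.DOUBLE;
        }
        if (best == InferredType.DOUBLE
            && !value.matches("-?(\\d+(\\.\\d*)?|\\.\\d+)([eE][+-]?\\d+)?")) {
          return InferredType.VARCHAR;
        }
      }
      return best;
    }
  }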

Users at scale will use the "distributed" versions of the readers; those who run Drill embedded or single-server can enable the "desktop" versions.


lgtm-com bot commented Mar 21, 2022

This pull request introduces 3 alerts and fixes 3 when merging 82e1055 into 7ef203b - view on LGTM.com

new alerts:

  • 3 for Dereferenced variable may be null

fixed alerts:

  • 3 for Dereferenced variable may be null

paul-rogers (author):

Looks like some of the secondary builds failed due to issues outside this PR. Can someone who understands that stuff please double-check?

luocooong (Member):

@paul-rogers Hi. Could you please rebase on the master branch? It contains a patch for the Travis CI issue. Thanks.

cgivre (Contributor) commented Mar 23, 2022:

@paul-rogers I'd second @luocooong 's message. The Travis CI has been a pain and I think rebasing will solve the issues.

jnturton (Contributor):

> Drill 2.0 is an opportunity to reorient Drill away from the fading big data space and toward the data science use cases that most PRs now seem to support. (It's not that big data itself is gone; it's just that most folks who need that kind of scale now run in the cloud, where Drill is not common.) As one of many examples, REST APIs make no sense at scale, but do make sense for a "small data" tool.
> Or, have two Drill editions: the old-school "distributed systems" edition and the newer "data science" edition. Those who still need Drill to work distributed can keep that edition going (along with the big-data CSV quirks), while the data science folks can fork the data science edition, chuck the distributed-systems stuff that gets in the way, and focus on things that data scientists do (such as reading Excel and PDF files).

@paul-rogers supporting two editions of Drill would be even harder for our small band of developers than supporting one, surely? Also, I think it would be a major loss to unpick all of the MPP work done in Drill to make big data queryable, in any notional edition. Indeed, for a small-data-only query engine, I doubt that there would be any sense in starting from Drill at all. A fresh start based on Calcite, Pandas or Julia would be simpler and cleaner.

Many people do their big data processing in the cloud, but not all of them want the vendor lock-in of the SaaS products, so they prefer to deploy open source in the cloud. Others remain on-prem. In addition, I still contend that the worlds of small and big data are not disjoint, and that a single system that can query over many storage systems, formats and data sizes is valuable to a viable audience. If the contrib/ plugins can be sufficiently separated from the rest of Drill, then the variability in their scalability and behaviour is quarantined away from core Drill.

paul-rogers (author):

@jnturton you make good points. I somewhat worry that folks who build large distributed systems are a bit under-represented in our current wonderful group of contributors, so I do see a drift toward small-scale concerns. (For example, I've never seen any shop query 1000s of Excel files, but folks do that every day for JSON, Parquet and CSV.)

Similarly, the data science folks want Drill to do it all in SQL. But, at scale, people build out data pipelines so that the files which Drill queries are the result of a process that produces a query-optimized format, such as well-partitioned Parquet.

The idea of having two forks (or modes) was an attempt to preserve Drill's distributed system heritage, while also allowing the current contributors to go wild on the data science use cases.

Maybe a simpler solution is to just have different "bootstrap" files: the at-scale edition, the data science edition, etc. Each edition includes plugins and defaults optimized for the target use case.

paul-rogers (author):

Rebased on master. It is now failing on the C++ code check, which is a bit odd because I don't recall seeing much C++ code in Drill...


lgtm-com bot commented Mar 24, 2022

This pull request introduces 3 alerts and fixes 3 when merging 9466921 into ed3b1d3 - view on LGTM.com

new alerts:

  • 3 for Dereferenced variable may be null

fixed alerts:

  • 3 for Dereferenced variable may be null

jnturton (Contributor):

> Rebased on master. It is now failing on the C++ code check, which is a bit odd because I don't recall seeing much C++ code in Drill...

What are these new CodeQL actions here, a rebranding of LGTM? And what prompted this new analysis of our C++? We previously only analysed Java, JS and Python, IIRC.

(These are more questions for us, @cgivre @luocooong @vdiravka, than for you @paul-rogers)

luocooong (Member):

@paul-rogers @jnturton Don't worry, I think Charles accidentally pushed a local CodeQL commit. Let's wait for him for a few hours.

cgivre (Contributor) commented Mar 24, 2022:

@jnturton I added a codeql check to the repo. CodeQL suggested CPP, Java, and JavaScript as languages. I thought we'd try it and see if it adds any value or not. In any event, I removed cpp. I'm fine with merging at this point.

@cgivre cgivre merged commit fb016f4 into apache:master Mar 24, 2022
luocooong pushed a commit to luocooong/drill that referenced this pull request Mar 25, 2022
…pache#2485)

* DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Also includes:

* DRILL-8159: Convert the HTTPD reader to use EVF V2

* Build fix

* Changes from review comments

* Fix test issue
jnturton pushed a commit to jnturton/drill that referenced this pull request Jul 11, 2022
…pache#2485)

* DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Also includes:

* DRILL-8159: Convert the HTTPD reader to use EVF V2

* Build fix

* Changes from review comments

* Fix test issue