DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2 #2485
Conversation
@paul-rogers Thanks for this! This should serve as a great exemplar for converting all the other formats to EVF2.
I had a few points which are more questions than anything.
Thanks!
// Drill 1.17 and before used "delimiter" in the
// bootstrap storage plugins file; assume many instances
// exist in the field.
@JsonAlias("delimiter")
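For readers unfamiliar with the annotation: `@JsonAlias` tells Jackson to accept the legacy property name during deserialization while the canonical name is still used everywhere else. A minimal plain-Java sketch of the behavior it provides (method and key names here are illustrative, not Drill's actual config code):

```java
import java.util.Map;

// Illustrative sketch only (not Drill code): what @JsonAlias("delimiter")
// provides, emulated by hand. Jackson maps either the current key
// "fieldDelimiter" or the legacy key "delimiter" onto the same field.
public class AliasDemo {

  static String readDelimiter(Map<String, String> json) {
    // Prefer the current name; fall back to the pre-1.18 alias,
    // then to a default of ",".
    return json.getOrDefault("fieldDelimiter",
        json.getOrDefault("delimiter", ","));
  }

  public static void main(String[] args) {
    System.out.println(readDelimiter(Map.of("delimiter", "|")));
    System.out.println(readDelimiter(Map.of("fieldDelimiter", "\t")));
  }
}
```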
That's cool... I didn't know you could do that.
To be honest, I don't even remember adding this, but I must have read up on Jackson at some point.
this.extractHeader = extractHeader == null ? false : extractHeader;
}

public TextFormatConfig() {
What is the purpose of this constructor? If this is some default constructor, should there be some common default values?
Heck if I know: I just moved the code from one file to another. A search showed that the only reference is here:
public TextFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
this(name, context, fsConf, storageConfig, new TextFormatConfig());
}
Which itself has zero references.
I removed both. Let's see if tests still pass.
this.escape = Strings.isNullOrEmpty(escape) ? '"' : escape.charAt(0);
this.comment = Strings.isNullOrEmpty(comment) ? '#' : comment.charAt(0);
this.skipFirstLine = skipFirstLine == null ? false : skipFirstLine;
this.extractHeader = extractHeader == null ? false : extractHeader;
One of the major annoyances of Drill (IMHO) is the weird behavior with CSV files. IMHO, we could give users a better experience by making the extractHeader field true by default; that way the user gets what one would expect when querying tabular data... a table.
If we change this default, we possibly break production systems. I would suggest doing this in Drill 2.0 if the team feels that the old big-data use cases are no longer common.
public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
  final static String PLUGIN_NAME = "text";
I know this would probably break things, but I'll ask anyway. Do you think this plugin would be better named delimited? It really refers to delimited formats such as PSV, TSV, etc. If this is a major breaking change, don't worry about it. It's always bothered me, though.
Drill was originally designed for big data: huge CSV files that get split across readers. That huge file might have a header, but more typically did not. The CSV reader still handles both cases: if the file has a header, it will seek to the beginning of the file even if the "file work" says to start at an offset of 256MB, etc.
In your use case, you likely work with "small data", but with much variety, and you want to use Drill as a convenient way to work with such data sets. (Hence all the recent work on the REST API, which is decidedly not a big data solution!)
If we change the default, then any "big data" users will find their apps break if they still happen to use CSV files without headers.
This might be something the team chooses to change in Drill 2.0 along with other breaking changes: shift the focus from old-style HDFS files to modern S3 files and your case of small, data-science-style files.
I wonder if it might make sense to follow the model of some of the other format readers where, if the column headers are not known, it assigns a name of field_n to them. That way you still get columns instead of the columns array.
My view is that if we change the default, but still allow users to revert with a config change, it's not that big of a deal.
@cgivre @paul-rogers the master branch is now rolling towards 2.0. If we don't take the chance in this cycle to improve user-facing stuff, it's a long time until the next chance.
I do think we should try to segregate our breaking changes into their own PRs, and therefore squashed commits, as much as we can, so that our ability to cherry-pick from master back to a 1.20.1 or, god forbid, a 1.21 is maximally preserved. We could also label such commits with "[BREAKING]", or apply a breaking tag to the PR in GitHub; I wonder if that would be helpful.
FWIW, I'm +1 for turning columns[n] into column_n (and I think this terminology is more standard for SQL result sets than field), +1 for "text" becoming "delimited", and +0 for the default value of extractHeaders, which is trivial for users to set for themselves.
The team can certainly change many things in the 2.0 branch. But this is not a 2.0 PR. The goal of this PR is to preserve functionality and simply upgrade the underlying reader mechanism to use EVF 2.
On the field_x vs. columns[] issue, that part is possible in the current code, as an option. There is a "columns" class in EVF that handles the columns[] column. I suspect that a parallel "fields" version could be created that generates the numbered fields instead of array indexes. An option on the plugin would choose one or the other for the no-headers case.
In fact, with this approach, the individual field_n columns could be typed, based on first-row type inference: something not possible with columns[].
Anyone up for giving it a shot? I'd be happy to offer pointers along the way.
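To make the idea concrete, here is a rough sketch, with hypothetical names (this is not EVF code), of generating `field_n` column names for a headerless file and inferring a type from the first row:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not EVF code: name headerless CSV columns
// field_1..field_n and guess a type from the first data row.
public class HeaderlessNaming {

  static List<String> nameColumns(int count) {
    List<String> names = new ArrayList<>();
    for (int i = 1; i <= count; i++) {
      names.add("field_" + i);
    }
    return names;
  }

  // Very naive first-row type inference: INT, then DOUBLE, else VARCHAR.
  static String inferType(String value) {
    try {
      Long.parseLong(value);
      return "INT";
    } catch (NumberFormatException ignored) { }
    try {
      Double.parseDouble(value);
      return "DOUBLE";
    } catch (NumberFormatException ignored) { }
    return "VARCHAR";
  }

  public static void main(String[] args) {
    String[] firstRow = {"10", "3.14", "bob"};
    List<String> names = nameColumns(firstRow.length);
    for (int i = 0; i < firstRow.length; i++) {
      System.out.println(names.get(i) + ": " + inferType(firstRow[i]));
    }
  }
}
```

A real implementation would live alongside EVF's existing "columns" handling and would need to deal with ragged rows and schema persistence, but the naming and inference core is this simple.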
@paul-rogers Thanks for this great refactoring. In addition to Charles's questions, LGTM. +1
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatConfig.java
Another thought on the CSV (and other) readers: since they were originally designed for an HDFS-like file system, they may not be ideal when Drill is used as a data science tool against "classic" desktop-style files. For one thing, there is no need for distribution when reading a 10MB (or 100MB) CSV file. For another, the compromises made in the days of old with big data don't apply to desktop use. (In fact, much of Drill's machinery is overkill when Drill is run as a single process.)

Drill 2.0 is an opportunity to reorient Drill away from the fading big data space and toward the data science use cases that most PRs now seem to support. (It's not that big data itself is gone, it's just that most folks who need that kind of scale now run in the cloud, where Drill is not common.) As one of many examples, REST APIs make no sense at scale, but do make sense for a "small data" tool.

Or, have two Drill editions: the old-school "distributed systems" edition and the newer "data science" edition. Those who still need Drill to work distributed can keep that edition going (along with the big data CSV quirks), while the data science folks can fork the data science edition, chuck the distributed systems stuff that gets in the way, and focus on things that data scientists do (such as reading Excel and PDF files).

A step in that direction would be to create a "data science edition" of things like the CSV reader: have it require headers (to avoid the need to use the odd columns array). Users at scale will use the "distributed" versions of the readers; those who run Drill embedded, or single-server, can enable the "desktop" versions.
This pull request introduces 3 alerts and fixes 3 when merging 82e1055 into 7ef203b - view on LGTM.com
Looks like some of the secondary builds failed due to issues outside this PR. Can someone who understands that stuff please double-check?
@paul-rogers Hi. Could you please do a rebase on the master branch? It contains a patch for the Travis CI issue. Thanks.
@paul-rogers I'd second @luocooong's message. The Travis CI has been a pain and I think rebasing will solve the issues.
@paul-rogers supporting two editions of Drill would be even harder for our small band of developers than supporting one, surely? Also, I think it would be a major loss to unpick all of the MPP work done in Drill to make big data queryable, in any notional edition. Indeed, for a small-data-only query engine, I doubt that there would be any sense in starting from Drill at all. A fresh start based on Calcite, Pandas or Julia would be simpler and cleaner.

Many people do their big data processing in the cloud but not all of them want the vendor lock-in of the SAAS products, so prefer to deploy open source in the cloud. Others remain on-prem. In addition, I still contend that the worlds of small and big data are not disjoint, and that a single system that can query over many storages, formats and data sizes is valuable to a viable audience. If the contrib/ plugins can be sufficiently separated away from the rest of Drill then the variability in their scalability and behaviour is quarantined away from core Drill.
Also includes: * DRILL-8159: Convert the HTTPD reader to use EVF V2
@jnturton you make good points. I somewhat worry that folks who build large distributed systems are a bit under-represented in our current wonderful group of contributors, so I do see a drift toward small-scale concerns. (For example, I've never seen any shop query 1000s of Excel files, but folks do that every day for JSON, Parquet and CSV.) Similarly, the data science folks want Drill to do it all in SQL. But, at scale, people build out data pipelines so that the files which Drill queries are the result of a process to produce a query-optimized format such as well-partitioned Parquet.

The idea of having two forks (or modes) was an attempt to preserve Drill's distributed system heritage, while also allowing the current contributors to go wild on the data science use cases. Maybe a simpler solution is to just have different "bootstrap" files: the at-scale edition, the data science edition, etc. Each edition includes plugins and defaults optimized for the target use case.
Rebased on master. It is now failing on the C++ code check, which is a bit odd because I don't recall seeing much C++ code in Drill...
This pull request introduces 3 alerts and fixes 3 when merging 9466921 into ed3b1d3 - view on LGTM.com
What are these new CodeQL actions here, a rebranding of LGTM? And what prompted this new analysis of our C++? We previously only analysed Java, JS and Python, IIRC. (These are more questions for us, @cgivre @luocooong @vdiravka, than for you @paul-rogers)
@paul-rogers @jnturton Don't worry, I think Charles accidentally pushed a local commit ("Pushed a CodeQL commit"). Let's wait for him for a few hours.
@jnturton I added a CodeQL check to the repo. CodeQL suggested CPP, Java, and JavaScript as languages. I thought we'd try it and see if it adds any value or not. In any event, I removed cpp. I'm fine with merging at this point.
…pache#2485)
* DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2. Also includes:
* DRILL-8159: Convert the HTTPD reader to use EVF V2
* Build fix
* Changes from review comments
* Fix test issue
DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2
Description
This PR is the remaining bits of the original DRILL-8085 PR: the Text and HTTPD log plugins.
The text plugin is surprisingly complex. Upgrading it is helpful technically, but it does not serve as a very good example. So, this PR also upgraded the far simpler HTTPD log plugin to act as an example for other conversions.
Note that, when doing conversions, you can rip out the prior do-it-yourself limit code: EVF now handles this work.
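As a sketch of what that means (hypothetical names, not actual Drill plugin code): before the conversion a reader typically tracked the pushed-down row limit itself inside its batch loop, while under EVF V2 the scan framework stops asking for rows once the limit is reached, so that bookkeeping can simply be deleted:

```java
// Hypothetical sketch, not actual Drill plugin code: the kind of
// do-it-yourself limit bookkeeping that EVF V2 makes unnecessary.
public class LimitSketch {

  // Old style: the reader checks the pushed-down limit on every row.
  // A negative limit means "no limit".
  static int readWithManualLimit(int availableRows, int limit) {
    int rowCount = 0;
    while (rowCount < availableRows) {
      if (limit >= 0 && rowCount >= limit) {
        break; // reader enforces the limit itself
      }
      rowCount++; // pretend we wrote a row to the batch
    }
    return rowCount;
  }

  // New style: just write rows; the framework (not shown here)
  // stops the scan once the limit is reached.
  static int readAllRows(int availableRows) {
    int rowCount = 0;
    while (rowCount < availableRows) {
      rowCount++;
    }
    return rowCount;
  }
}
```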
Also includes:
Documentation
No changes: this is purely an internal change.
Testing
GitHub will helpfully run the unit tests. (Last time I tried, they didn't run on my machine.)