New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-5498: Improve handling of CSV column headers #830
Conversation
* Given an invalid header, rewrite it to replace illegal characters | ||
* with valid ones. The header won't be what the user specified, | ||
* but it will be a valid SQL identifier. This solution avoids failing | ||
* queries due to corrupted of invalid header data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corrupted or invalid ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
// For the strange case of only one character, format | ||
// the same as an empty header. | ||
|
||
if (header.length() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
the check
(header.length() == 1)
can be moved to the very beginning of the function. -
Why not change:
if () {
} else {
if () {
}
else {}
}
to
if () {
} else if() {
} else {
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
*/ | ||
|
||
private String rewriteHeader(String header) { | ||
StringBuilder buf = new StringBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final StringBuilder buf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
// the same as an empty header. | ||
|
||
if (header.length() == 1) { | ||
return "column_" + (headers.size() + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we can use col_
everywhere instead of column
in this case ? Then we can just have a class constant equal to string literal col_
and use that everywhere.
Also it looks like valid header can be of maximum length 1024 chars, but if we are replacing first invalid char with col_
then we can exceed this length. Will this cause issues anywhere ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, created the constants. "col_" is used to replace a non-alpha leading character: $foo --> col_foo. "column_" is used for an anonymous column: "column_3".
The symbol limit is not really a hard limit since names are stored as strings. The extra few characters shouldn't be a problem. A related issue is that we limit field width to 1024 bytes. But, in UTF-8, up to four bytes make up a character, so the byte limit results in a lower character limit. Should not hit too many files in practice.
|
||
// Force headers to be unique. | ||
|
||
Set<String> idents = new HashSet<String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final Set<String> idents.
Please use final in other places as well like below for final String header
and final String rewritten
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
public List<String> headers = new ArrayList<>(); | ||
public ByteBuffer currentField = ByteBuffer.allocate(MAX_HEADER_LEN); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
for both the parameters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
|
||
public List<String> headers = new ArrayList<>(); | ||
public ByteBuffer currentField = ByteBuffer.allocate(MAX_HEADER_LEN); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
* Given an invalid header, rewrite it to replace illegal characters | ||
* with valid ones. The header won't be what the user specified, | ||
* but it will be a valid SQL identifier. This solution avoids failing | ||
* queries due to corrupted of invalid header data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
*/ | ||
|
||
private String rewriteHeader(String header) { | ||
StringBuilder buf = new StringBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
// For the strange case of only one character, format | ||
// the same as an empty header. | ||
|
||
if (header.length() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
// the same as an empty header. | ||
|
||
if (header.length() == 1) { | ||
return "column_" + (headers.size() + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, created the constants. "col_" is used to replace a non-alpha leading character: $foo --> col_foo. "column_" is used for an anonymous column: "column_3".
The symbol limit is not really a hard limit since names are stored as strings. The extra few characters shouldn't be a problem. A related issue is that we limit field width to 1024 bytes. But, in UTF-8, up to four bytes make up a character, so the byte limit results in a lower character limit. Should not hit too many files in practice.
|
||
// Force headers to be unique. | ||
|
||
Set<String> idents = new HashSet<String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
pom.xml
Outdated
<source>1.7</source> | ||
<target>1.7</target> | ||
<source>1.8</source> | ||
<target>1.8</target> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change needed ? It's failing on my setup. This will use JDK 1.8 but my environment is setup for 1.7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh... Checked in a private change. Mockito won't run tests unless compiled with JDK 7. Eclipse won't run tests unless I recompile for JDK 8. Reverted.
+1. LGTM |
+1 Please squash the commits. |
See DRILL-5498 for details. Replaced the repeated varchar reader for reading columns with a purpose built column parser. Implemented rules to recover from invalid column headers.
Commits squashed. |
See DRILL-5498 for details.
Replaced the repeated varchar reader for reading columns with a purpose
built column parser. Implemented rules to recover from invalid column
headers.