Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-5498: Improve handling of CSV column headers #830

Closed
wants to merge 1 commit into from

Conversation

paul-rogers
Copy link
Contributor

See DRILL-5498 for details.

Replaced the repeated varchar reader for reading columns with a purpose
built column parser. Implemented rules to recover from invalid column
headers.

* Given an invalid header, rewrite it to replace illegal characters
* with valid ones. The header won't be what the user specified,
* but it will be a valid SQL identifier. This solution avoids failing
* queries due to corrupted of invalid header data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrupted or invalid ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// For the strange case of only one character, format
// the same as an empty header.

if (header.length() == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. the check (header.length() == 1) can be moved to the very beginning of the function.

  2. Why not change:

if () {
} else {
     if () {
     }
    else {}
}

to

if () {
} else if() {
} else {
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

*/

private String rewriteHeader(String header) {
StringBuilder buf = new StringBuilder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final StringBuilder buf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// the same as an empty header.

if (header.length() == 1) {
return "column_" + (headers.size() + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we can use col_ everywhere instead of column in this case ? Then we can just have a class constant equal to string literal col_ and use that everywhere.

Also it looks like valid header can be of maximum length 1024 chars, but if we are replacing first invalid char with col_ then we can exceed this length. Will this cause issues anywhere ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, created the constants. "col_" is used to replace a non-alpha leading character: $foo --> col_foo. "column_" is used for an anonymous column: "column_3".

The symbol limit is not really a hard limit since names are stored as strings. The extra few characters shouldn't be a problem. A related issue is that we limit field width to 1024 bytes. But, in UTF-8, up to four bytes make up a character, so the byte limit results in a lower character limit. Should not hit too many files in practice.


// Force headers to be unique.

Set<String> idents = new HashSet<String>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final Set<String> idents. Please use final in other places as well like below for final String header and final String rewritten, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


public List<String> headers = new ArrayList<>();
public ByteBuffer currentField = ByteBuffer.allocate(MAX_HEADER_LEN);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final for both the parameters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor Author

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!


public List<String> headers = new ArrayList<>();
public ByteBuffer currentField = ByteBuffer.allocate(MAX_HEADER_LEN);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

* Given an invalid header, rewrite it to replace illegal characters
* with valid ones. The header won't be what the user specified,
* but it will be a valid SQL identifier. This solution avoids failing
* queries due to corrupted of invalid header data.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

*/

private String rewriteHeader(String header) {
StringBuilder buf = new StringBuilder();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// For the strange case of only one character, format
// the same as an empty header.

if (header.length() == 1) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

// the same as an empty header.

if (header.length() == 1) {
return "column_" + (headers.size() + 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, created the constants. "col_" is used to replace a non-alpha leading character: $foo --> col_foo. "column_" is used for an anonymous column: "column_3".

The symbol limit is not really a hard limit since names are stored as strings. The extra few characters shouldn't be a problem. A related issue is that we limit field width to 1024 bytes. But, in UTF-8, up to four bytes make up a character, so the byte limit results in a lower character limit. Should not hit too many files in practice.


// Force headers to be unique.

Set<String> idents = new HashSet<String>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

pom.xml Outdated
<source>1.7</source>
<target>1.7</target>
<source>1.8</source>
<target>1.8</target>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change needed ? It's failing on my setup. This will use JDK 1.8 but my environment is setup for 1.7

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh... Checked in a private change. Mockito won't run tests unless compiled with JDK 7. Eclipse won't run tests unless I recompile for JDK 8. Reverted.

@sohami
Copy link
Contributor

sohami commented May 19, 2017

+1. LGTM

@sudheeshkatkam
Copy link
Contributor

+1

Please squash the commits.

See DRILL-5498 for details.

Replaced the repeated varchar reader for reading columns with a purpose
built column parser. Implemented rules to recover from invalid column
headers.
@paul-rogers
Copy link
Contributor Author

Commits squashed.

@asfgit asfgit closed this in f21edb0 May 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants