Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CALCITE-2769] CSV Adapter does not handle - Empty and malformed csv lines, space before or after comma #1032

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

chiradip
Copy link

@chiradip chiradip commented Feb 6, 2019

CSV Adapter does not handle - Empty and malformed csv lines, space before or after comma

File: org.apache.calcite.adapter.csv.CsvEnumerator.java in example folder did not handle CSV with space between delimiter i.e."," and also did not handle empty lines or malformed lines in the middle of the file or to the end. The sqlline crashes is there is an empty line with java.lang.ArrayIndexOutOfBoundsException.

Also test cases fail. Refer jira: https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2769

CSV Adapter does not handle - Empty and malformed csv lines, space before or after comma
30,"Accounts"

31,"My Dept"
32, "My Another Depts"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this line should be accepted; the space seems to belong to a field which is the quoted - it becomes cloudy for me that this field is quoted or not

try {
current = rowConverter.convertRow(strings);
} catch (Exception e) {
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should skip records if the conversion fails; I think every row should contain a full record.
A more descriptive error message and some info about where is the line in the file which is incorrect would probably help the end user to correct the mistake.
https://tools.ietf.org/html/rfc4180

@@ -237,6 +241,9 @@ public void close() {
abstract E convertRow(String[] rows);

protected Object convert(CsvFieldType fieldType, String string) {
if (string != null) {
string = string.trim();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would probably mean that a space value is not anymore considered a value - it will be the same as an empty string
I think rfc4180 permits multiple spaces without quotes

@risdenk risdenk changed the title CALCITE-2769: CSV Adapter does not handle - Empty and malformed csv lines, space before or after comma [CALCITE-2769] CSV Adapter does not handle - Empty and malformed csv lines, space before or after comma Feb 28, 2019
@asereda-gs
Copy link
Member

Proposed solution looks very specific.

What people think about the following alternative:

  1. Add default method to Source interface:
  default Stream<String> lines() {
     // TODO add close() logic
     return new BufferedReader(reader()).lines();
   }
  1. Add / Replace constructor of CsvSchema to accept a list (stream) of sources (instead of parent dir as File) :
public CsvSchema(Stream<Source> sources, CsvTable.Flavor flavor) {
}

If user wants to modify / filter lines based on particular logic they can wrap existing source and override lines() method:

return lines().map(String::trim()).filter(line -> !line.isEmpty());

It also has the advantage of accepting generic sources of data like HDFS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants