csv-parse fails 'randomly' on multi-character column delimiter and row terminator #433

@metarama

Description

Describe the bug

Thank you for the great work on csv-parse.

I am trying to import a CSV file (over 850 rows with 51 columns) located in an AWS S3 bucket. At the moment I am processing one row at a time.

I see the column count mismatch error happening on different rows. The erroring row keeps moving around: sometimes the 11th row fails, sometimes the 439th. Whenever a row fails, I can see the delimiter '|@|' has been taken in as part of a field. For example, a failing row comes in as '11|@|543E9108-177F-4402-B423-AF9A006A6698', '9', '00000000014621DD', ... The delimiter was not processed correctly, the first two fields got fused together, and as a result the column count does not match, which triggers the error. The key observation is that the location of this error moves around.

Sometimes the delimiter all by itself becomes a field, as in this example of parsed row data: '2023-11-29 11:08:25.2150201 -05:00', '|@|', 'Boot', '', 'Testing, ...

I have not looked at the code, but based on my observations I would guess that delimiter (and terminator) detection misses the case where the delimiter (or terminator) character sequence spans the current buffer boundary. The partial delimiter sequence at the end of one buffer, or at the beginning of the next, gets handled like regular field characters and is fused into a field.

Single-character delimiters and terminators, on the other hand, are easy to process. Multi-character delimiters and row terminators need extra care, and that care seems to be missing.
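To make the suspected failure mode concrete, here is a simplified sketch (my own illustration, not csv-parse internals): a naive splitter that handles each chunk independently loses a delimiter that straddles a chunk boundary, while one that buffers unconsumed text across chunks does not.

```typescript
const DELIM = "|@|";

// Naive: each chunk is split on its own, so a delimiter straddling a
// chunk boundary is never matched and its characters fuse into fields.
function naiveSplit(chunks: string[]): string[] {
  const fields: string[] = [""];
  for (const chunk of chunks) {
    const parts = chunk.split(DELIM);
    fields[fields.length - 1] += parts[0]; // continue the previous field
    fields.push(...parts.slice(1));
  }
  return fields;
}

// Buffered: unconsumed text is carried over, so a delimiter split across
// chunks is still matched once the rest of it arrives.
function bufferedSplit(chunks: string[]): string[] {
  const fields: string[] = [""];
  let buffer = "";
  for (const chunk of chunks) {
    buffer += chunk;
    let idx: number;
    while ((idx = buffer.indexOf(DELIM)) !== -1) {
      fields[fields.length - 1] += buffer.slice(0, idx);
      fields.push("");
      buffer = buffer.slice(idx + DELIM.length);
    }
  }
  fields[fields.length - 1] += buffer; // flush the trailing field
  return fields;
}

// "11|@|543" arriving as two chunks that cut the delimiter in half:
console.log(naiveSplit(["11|", "@|543"]));    // ["11|@|543"] – fields fused
console.log(bufferedSplit(["11|", "@|543"])); // ["11", "543"]
```

This matches what I see: the first two fields fuse exactly when a chunk boundary happens to fall inside '|@|'.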

To Reproduce

```typescript
import { Readable } from "node:stream";
import { parse, Options } from "csv-parse";

const parser = parse({
  delimiter: '|@|',
  record_delimiter: '|$|',
  columns: true,
  from: 1,
  relax_quotes: false,
  encoding: "utf8",
  quote: '"',
  ignore_last_delimiters: true,
  trim: true,
  bom: true,
} as Options);

// response.Body comes from an AWS S3 GetObject call
const s3Stream = Readable.from(response.Body as AsyncIterable<Uint8Array>);
const csvStream = s3Stream.pipe(parser);

for await (const row of csvStream) {
  // process the row
}
```
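Since the failing row moves with wherever the chunk boundaries happen to fall, a deterministic reproduction could force a boundary into the middle of the delimiter. A sketch, assuming the same parser as above (`chunkString` is a hypothetical helper, not part of csv-parse):

```typescript
// Hypothetical helper: split a CSV string into fixed-size chunks so that
// a chunk boundary can be forced to land inside the '|@|' delimiter.
function chunkString(input: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < input.length; i += size) {
    chunks.push(input.slice(i, i + size));
  }
  return chunks;
}

// With size 4, "11|@|543" splits into ["11|@", "|543"], cutting the
// delimiter in half. Feeding such chunks through the parser should
// reproduce the fused-field error deterministically, e.g.:
// const stream = Readable.from(chunkString(csvText, 4)).pipe(parser);
```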

