Tabitha starting halfway through CSV instead of at the beginning #20

twilco · 2018-09-14T16:21:43Z

Using the attached CSV and iterating through it with Tabitha (version 0.2.0), it appears to be starting halfway through the document rather than at the beginning. Here's my code:

try (InputStream delimFileStream = fileUrl.openStream();
     RowReader reader = RowReaderFactory.open(delimFileStream)
                                        .orElseThrow(() -> new OpeningReaderException(String.format(
                                            "Could not open reader for URL (%s), MIME type of %s.",
                                            url,
                                            fileMime))
                                        )
                                        .withInlineHeaders()) {

    // Read the first row and log its header; chaining avoids a null dereference if the file is empty.
    reader.read()
          .flatMap(Row::header)
          .ifPresent(header -> log.info("Header: " + Arrays.toString(header.toArray())));

    reader.forEach(row -> {
        List<String> strs = new ArrayList<>();
        row.forEach(cell -> strs.add(cell.getString().orElse("empty")));
        log.info(strs.toString());
    });
}

This prints the following:

Header: [, CA, United States, 12/1/03 19:13, 2/5/09 22:11, 35.36583, -120.84889]
[1/4/09 16:59, Product1, 1200, Visa, Amy, Parramatta, New South Wales, Australia, 1/3/09 22:35, 2/5/09 22:44, -33.8166667, 151]
[1/30/09 11:56, Product1, 1200, Mastercard, Whitney, Dumbleton, England, United Kingdom, 7/31/08 13:46, 2/6/09 0:04, 52.0166667, -1.9666667]
[1/6/09 5:10, Product1, 1200, Visa, Astrid, Altlengbach, Lower Austria, Austria, 6/24/08 0:49, 2/6/09 0:37, 48.15, 15.9166667]
[1/14/09 3:39, Product1, 1200, Visa, jo, Ballincollig, Cork, Ireland, 12/10/08 7:41, 2/6/09 2:36, 51.8833333, -8.5833333]
.... roughly 500 more items...
[1/8/09 11:55, Product1, 1200, Diners, julie, Haverhill, England, United Kingdom, 11/29/06 13:31, 3/1/09 7:28, 52.0833333, 0.4333333]
[1/12/09 21:30, Product1, 1200, Visa, Julia , Madison                     , WI, United States, 11/17/08 22:24, 3/1/09 10:14, 43.07306, -89.40111]

What Tabitha is registering as the header (the first row it has found) is actually line 526 in the CSV.

Turning this CSV into an XLSX via Microsoft Excel and then using that as the input to the code above works correctly (ignore the "empty" strings - I realize that's my problem and not a Tabitha problem):

Header: [Transaction_date, Product, Price, Payment_Type, Name, City, State, Country, Account_Created, Last_Login, Latitude, Longitude]
[empty, Product1, empty, Visa, Betina, Parkville                   , MO, United States, empty, empty, empty, empty]
[empty, Product1, empty, Mastercard, Federica e Andrea, Astoria                     , OR, United States, empty, empty, empty, empty]
...all the rest of the items...

CSV is attached as a .txt, as GitHub does not allow CSVs to be added as attachments.

SalesJan2009.txt
salesjan2009xl.xlsx

sagebind · 2018-09-14T17:41:30Z

I am able to reproduce this behavior. Oddly enough, using a File instead of an InputStream fixes the issue. This works:

import com.widen.tabitha.RowReaderFactory

RowReaderFactory.open(new File("SalesJan2009.csv")).get().withInlineHeaders().withCloseable { reader ->
    def header = false

    reader.forEach {
        if (!header) {
            header = true
            println("header: " + it.header().get())
        }
        println("row: " + it)
    }
}

This does not:

import com.widen.tabitha.RowReaderFactory

RowReaderFactory.open(new FileInputStream("SalesJan2009.csv")).get().withInlineHeaders().withCloseable { reader ->
    def header = false

    reader.forEach {
        if (!header) {
            header = true
            println("header: " + it.header().get())
        }
        println("row: " + it)
    }
}

Ensure we pass in a rewindable InputStream to Tika so that we can start from the beginning of the stream when we do the actual file parsing. Fixes #20.
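
For anyone curious why the stream case skips rows, here is a minimal sketch of the mechanism using only the JDK. It is not the code from the actual fix: `sniff` is a hypothetical stand-in for Tika's type detection, and `unmarkable`, `rewindable`, and `parseAfterDetection` are names made up for this illustration. The assumption is that detection reads some leading bytes and can only put them back when the stream supports `mark`/`reset`, which a plain `FileInputStream` or URL stream does not.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class RewindableDetection {
    // Simulates a stream without mark/reset support (like FileInputStream).
    static InputStream unmarkable(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public synchronized void mark(int readlimit) { }
            @Override public synchronized void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    // Wraps the stream in a BufferedInputStream only when it cannot rewind on its own.
    static InputStream rewindable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Hypothetical detector: reads up to `limit` bytes, rewinding only if the stream allows it.
    static void sniff(InputStream in, int limit) throws IOException {
        boolean canRewind = in.markSupported();
        if (canRewind) {
            in.mark(limit);
        }
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        if (canRewind) {
            in.reset();
        }
    }

    // Runs detection, then returns the first line the "parser" would see.
    static String parseAfterDetection(InputStream in) throws IOException {
        sniff(in, 4);
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8)).readLine();
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "a,b\n1,2\n3,4\n".getBytes(StandardCharsets.UTF_8);
        // Detection consumed the header line, so parsing starts at the second row.
        System.out.println(parseAfterDetection(unmarkable(csv)));
        // Detection rewound the buffered stream, so parsing starts at the header.
        System.out.println(parseAfterDetection(rewindable(unmarkable(csv))));
    }
}
```

With the toy data above, the first call yields `1,2` and the second yields `a,b`. The real CSV loses far more than one line, presumably because the detector reads a much larger chunk before giving up the stream.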

sagebind self-assigned this Sep 14, 2018

sagebind added a commit that referenced this issue Sep 14, 2018

Ensure streams support marks during detection

82d29e5

sagebind mentioned this issue Sep 14, 2018

Ensure streams support marks during detection #21

Merged

sagebind closed this as completed in #21 Sep 14, 2018
