Tabitha starting halfway through CSV instead of at the beginning #20

Closed
twilco opened this issue Sep 14, 2018 · 1 comment

twilco commented Sep 14, 2018

Using the attached CSV and iterating through it with Tabitha (version 0.2.0), it appears to be starting halfway through the document rather than at the beginning. Here's my code:

try (InputStream delimFileStream = fileUrl.openStream();
     RowReader reader = RowReaderFactory.open(delimFileStream)
                                        .orElseThrow(() -> new OpeningReaderException(String.format(
                                            "Could not open reader for URL (%s), MIME type of %s.",
                                            url,
                                            fileMime))
                                        )
                                        .withInlineHeaders()) {

    Row firstRow = reader.read().orElse(null);
    firstRow.header().ifPresent(header -> log.info("Header: " + Arrays.toString(header.toArray())));

    reader.forEach(row -> {
        List<String> strs = new ArrayList<>();
        row.forEach(cell -> strs.add(cell.getString().orElse("empty")));
        log.info(strs.toString());
    });
}

This prints the following:

Header: [, CA, United States, 12/1/03 19:13, 2/5/09 22:11, 35.36583, -120.84889]
[1/4/09 16:59, Product1, 1200, Visa, Amy, Parramatta, New South Wales, Australia, 1/3/09 22:35, 2/5/09 22:44, -33.8166667, 151]
[1/30/09 11:56, Product1, 1200, Mastercard, Whitney, Dumbleton, England, United Kingdom, 7/31/08 13:46, 2/6/09 0:04, 52.0166667, -1.9666667]
[1/6/09 5:10, Product1, 1200, Visa, Astrid, Altlengbach, Lower Austria, Austria, 6/24/08 0:49, 2/6/09 0:37, 48.15, 15.9166667]
[1/14/09 3:39, Product1, 1200, Visa, jo, Ballincollig, Cork, Ireland, 12/10/08 7:41, 2/6/09 2:36, 51.8833333, -8.5833333]
.... roughly 500 more items...
[1/8/09 11:55, Product1, 1200, Diners, julie, Haverhill, England, United Kingdom, 11/29/06 13:31, 3/1/09 7:28, 52.0833333, 0.4333333]
[1/12/09 21:30, Product1, 1200, Visa, Julia , Madison                     , WI, United States, 11/17/08 22:24, 3/1/09 10:14, 43.07306, -89.40111]

What Tabitha is registering as the header (the first row it has found) is actually line 526 in the CSV.

Turning this CSV into an XLSX via Microsoft Excel and then using that as the input to the code above works correctly (ignore the "empty" strings - I realize that's my problem and not a Tabitha problem):

Header: [Transaction_date, Product, Price, Payment_Type, Name, City, State, Country, Account_Created, Last_Login, Latitude, Longitude]
[empty, Product1, empty, Visa, Betina, Parkville                   , MO, United States, empty, empty, empty, empty]
[empty, Product1, empty, Mastercard, Federica e Andrea, Astoria                     , OR, United States, empty, empty, empty, empty]
...all the rest of the items...

The CSV is attached as a .txt file, as GitHub does not allow CSVs to be added as attachments.

SalesJan2009.txt
salesjan2009xl.xlsx

@sagebind sagebind self-assigned this Sep 14, 2018
sagebind (Member) commented

I am able to reproduce this behavior. Oddly enough, using a File instead of an InputStream fixes the issue. This works:

import com.widen.tabitha.RowReaderFactory

RowReaderFactory.open(new File("SalesJan2009.csv")).get().withInlineHeaders().withCloseable { reader ->
    def header = false

    reader.forEach {
        if (!header) {
            header = true
            println("header: " + it.header().get())
        }
        println("row: " + it)
    }
}

This does not:

import com.widen.tabitha.RowReaderFactory

RowReaderFactory.open(new FileInputStream("SalesJan2009.csv")).get().withInlineHeaders().withCloseable { reader ->
    def header = false

    reader.forEach {
        if (!header) {
            header = true
            println("header: " + it.header().get())
        }
        println("row: " + it)
    }
}
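
Since the File-based call works while the InputStream-based call does not, one possible stopgap is to spool the stream to a temporary file and open that file instead. The following is only a sketch of that idea using the JDK; the class name `SpoolToFile`, the method `spool`, and the file-name prefix are hypothetical and not part of Tabitha.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tabitha.
final class SpoolToFile {
    /**
     * Copies the whole stream into a new temporary file and returns it.
     * The caller is responsible for deleting the file when done.
     * The suffix (for example ".csv") is an assumption supplied by the caller.
     */
    static File spool(InputStream in, String suffix) throws IOException {
        Path tmp = Files.createTempFile("tabitha-input-", suffix);
        // createTempFile already created the file, so overwrite it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp.toFile();
    }
}
```

The returned File could then be handed to `RowReaderFactory.open(File)`, the overload shown working above. This costs an extra copy of the data on disk, so it is only worth considering until a fixed release is available.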

sagebind added a commit that referenced this issue Sep 14, 2018
Ensure we pass in a rewindable InputStream to Tika so that we can start from the beginning of the stream when we do the actual file parsing.

Fixes #20.
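
The commit message suggests that bytes read during format detection were not being handed back before parsing. The sketch below illustrates that general failure mode and the mark/reset remedy using only the JDK. It does not use Tika or Tabitha code: `sniff` is a hypothetical stand-in for a detector, and the class and method names are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only; none of these names come from Tabitha or Tika.
final class RewindSketch {
    /** Hypothetical detector: consumes up to `limit` bytes to inspect them. */
    static int sniff(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) break; // end of stream
            total += n;
        }
        return total;
    }

    /** Stand-in for the parser: reads everything left in the stream. */
    static String readRest(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Detection consumes bytes, so parsing starts partway into the data. */
    static String detectThenParseNaive(InputStream in, int limit) throws IOException {
        sniff(in, limit);
        return readRest(in);
    }

    /** Mark before detection and reset afterwards, so parsing starts at byte 0. */
    static String detectThenParseRewound(InputStream in, int limit) throws IOException {
        // Wrap streams that cannot rewind on their own.
        InputStream rewindable = in.markSupported() ? in : new BufferedInputStream(in);
        rewindable.mark(limit);   // remember the start; valid for up to `limit` bytes
        sniff(rewindable, limit);
        rewindable.reset();       // go back to the remembered position
        return readRest(rewindable);
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "Transaction_date,Product\n1/2/09 6:17,Product1\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("naive:\n" + detectThenParseNaive(new ByteArrayInputStream(csv), 17));
        System.out.println("rewound:\n" + detectThenParseRewound(new ByteArrayInputStream(csv), 17));
    }
}
```

In the naive variant the parser never sees the bytes the detector consumed, which matches the symptom of a reader starting somewhere after the real header. The actual number of bytes consumed in the reported case, and how the fix wraps the stream, are not shown in this thread.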