
Fail parsing large files #25

Closed
bertrandmartel opened this issue Feb 28, 2017 · 1 comment
bertrandmartel commented Feb 28, 2017

There is an issue when parsing large files. I tested with a 1.4 GB JSON file and it throws:

buffer.js:490
    throw new Error('toString failed');
    ^

Error: toString failed
    at Buffer.toString (buffer.js:490:11)
    at StringDecoder.write (string_decoder.js:130:21)
    at StripBOMWrapper.write (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/bom-handling.js:35:28)
    at Object.decode (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/index.js:38:23)
    at /home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/bin/dsv2json:27:35
    at ReadStream.<anonymous> (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/rw/lib/rw/read-file.js:22:33)
    at emitNone (events.js:85:20)
    at ReadStream.emit (events.js:179:7)
    at endReadableNT (_stream_readable.js:913:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickCallback (internal/process/next_tick.js:98:9)

I've found this link, which illustrates the same issue with big files.

You can test it with:

wget http://download.geonames.org/export/dump/allCountries.zip
unzip allCountries.zip
sed -i '1s/^/geonameid\tname\tasciiname\talternatenames\tlatitude\tlongitude\tfeature_class\tfeature_code\tcountry_code\tcc2\tadmin1_code\tadmin2_code\tadmin3_code\tadmin4_code\tpopulation\televation\tdem\ttimezone\tmodification_date\n/' allCountries.txt
time tsv2json < allCountries.txt > allCountries-pre.json

Do you have a recommended way to parse big files, either from the command line or via the API?

Note that it works well with csv-parser:

cat allCountries.txt | csv-parser -s $'\t' > allCountries-pre.json
mbostock commented Mar 1, 2017

This is not a streaming parser, so it is subject to Node’s buffer size limitations. This failure is occurring before it even gets to parsing; it’s just trying to decode the input file bytes into a string.

The way to fix this is to rewrite this library to be streaming. That's doable, but it requires a new API. (The CLI could remain unchanged, however.) This request has already been filed at #20. It'd be a nice improvement; however, I have no immediate plans to work on it.

mbostock closed this as completed Mar 1, 2017