
Reading CSV files #14

Closed
fceller opened this issue Dec 14, 2011 · 9 comments
Labels: 1 Feature, 3 API HTTP / Driver API related

Comments

@fceller
Contributor

fceller commented Dec 14, 2011

  1. Reading data:
    The two methods processCsvFile and processJsonFile are very useful. Perhaps something for writing will be needed later as well.
    Maybe it makes sense to bundle these into a module? In Node, a similar module is simply called fs, which is certainly not a bad name!
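
For reference, the Node module mentioned here exposes file access roughly like this (a minimal Node sketch for comparison only, with made-up file names; it is not ArangoDB code):

// Node's fs module, cited above as a naming precedent; file names are placeholders.
var fs = require("fs");

// read a whole file into a string ...
var text = fs.readFileSync("data.csv", "utf8");

// ... and write one back out
fs.writeFileSync("copy.csv", text);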
@a2800276
Contributor

a2800276 commented May 1, 2013

I would really like to see something like this:

While the CSV and JSON importers are useful, they are not generic enough to import arbitrary data: I was playing around recently and wanted to do a bulk import of roughly 2 GB of data. It would be nice to load and process such a file directly from within arango, but the File.read function in the fs module (is this module even officially there? I had to dig around quite a lot to find it) always reads the entire file.

It would be great to have a more versatile read that can be given a length, a buffer, or a callback. That way the import tools could be rewritten in arango directly (I seem to recall the JSON and CSV importers import via the HTTP API?).
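
For illustration, the kind of call being asked for might look like this (a purely hypothetical sketch: open, readChunk, and close are invented names that do not exist in ArangoDB; they only show a length-limited, callback-friendly read):

// Purely hypothetical sketch of the requested API; none of these calls exist in ArangoDB.
var fs = require("fs");                      // the fs module mentioned above

function handleChunk(chunk) {
  // parse the chunk and save the extracted records into the database
}

var fd = fs.open("dump.sql", "r");           // invented: open without reading everything
var chunk;
while ((chunk = fs.readChunk(fd, 65536)) !== null) {   // invented: read at most 64 KB
  handleChunk(chunk);
}
fs.close(fd);                                // invented: release the handle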

@rotatingJazz

I pre-process my raw data to CSV and then use the importer. Works fine. ;) What would the benefit be from moving this to ArangoDB?

@frankmayer
Contributor

Hi, I would like to chime in, but I am not sure what you mean by "What would the benefit be from moving this to ArangoDB?" 😄 Can you elaborate on what you want to do, what you did, and what the last sentence means? 😃

@rotatingJazz

Hi Frank,

If I understood correctly, @a2800276 wants to be able to process his raw data and enter it into the db from within arangosh.

I was wondering why the devs should invest time in this feature, since one can easily process raw data into CSV/JSON (via any language, say PHP, Python, or Bash) and use the already working importer.
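
To illustrate that workflow (a minimal Node sketch; the file names and the tab-separated input format are made up, and the real transformation depends on the raw data), one could emit line-wise JSON and then hand the result to the existing importer:

// Minimal Node sketch: turn a raw, line-based file into line-wise JSON
// that the existing importer can load. File names are placeholders.
var fs = require("fs");
var readline = require("readline");

var out = fs.createWriteStream("out.json");
var rl = readline.createInterface({ input: fs.createReadStream("raw.txt") });

rl.on("line", function (line) {
  var parts = line.split("\t");              // adapt this to whatever the raw format is
  out.write(JSON.stringify({ from: parts[0], to: parts[1] }) + "\n");
});

rl.on("close", function () { out.end(); });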

@frankmayer
Contributor

Oh, yes, I didn't notice the different users 😄 . Yes, of course, I totally agree with @rotatingJazz on that.
@a2800276 is there a specific reason not to use the import? Is there some edge case that you're trying to tackle?
I have had no issues importing external data so far, so I am interested in your edge case 😄

@a2800276
Contributor

a2800276 commented May 2, 2013

I pre-process my raw data to csv and then use the importer.

To me it seems very elaborate to preprocess data that may or may not be in a form suitable for CSV/JSON, transform it into a different format, and throw that against a functionally restricted import script which then uses HTTP to import individual records into the database.

When instead:

I could be reading and transforming arbitrarily formatted files from within the DB and have a much more efficient workflow, both from the "programmer efficiency" point of view and in terms of performance.

What I was trying to do concretely:

re-implement a toy project to play with graph functionality that I have working for Neo4j in arango. I'd like to import the Wikipedia inter-page links and play around with that dataset. The dump of that data is 4 GB (in the form of MySQL INSERT statements). If I can avoid it, I don't want to preprocess 4 GB of data into 3 GB of some other data that I can import, when I could import it directly in roughly half the time.

More generally:

Since arango wants to become a general-purpose deployment platform with Foxx, it will certainly need some rudimentary file I/O implementation. As it's currently implemented, File.read is utterly useless apart from reading tiny toy files.

@mulderp

mulderp commented May 2, 2013

It might be interesting to have some reference data that one could use to 'feel' or confirm the problem with the current importer; maybe the problem can be shown with similar data from here, e.g.:

@jsteemann
Contributor

We'll eventually have an implementation of Buffer, which will allow us to read binary files and process them in chunks from JavaScript.

Until that's available, I think there are at least two alternatives for processing CSV and JSON files.
They work incrementally and process the input file line-wise (not quite true for CSV, but think of them as working line-wise). They allow supplying a callback function that is executed whenever a record has been read. You can then use the callback to process the data and put it into the database.

Example invocation for CSV files:

var internal = require("internal");
var options = { separator: ",", quote: "\"" };
internal.processCsvFile("test.csv", function (data) { internal.print(data); }, options);

And for a plain invocation without options (the defaults are used):

var internal = require("internal");
internal.processCsvFile("test.csv", function (data) { internal.print(data); });


Processing JSON files is similar:

var internal = require("internal");
internal.processJsonFile("test.json", function (data) { internal.print(data); } );


Note that the above functions aren't general-purpose file-processing functions, but are targeted at handling UTF-8 encoded CSV and JSON data.
For arbitrary file formats, we'll need an implementation of Buffer in JavaScript.
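
For completeness, here is a sketch of using such a callback to put the parsed records into a collection (the collection and field names are placeholders, and the way the db handle is obtained may differ between versions):

var internal = require("internal");
var db = internal.db;                        // arangosh database object (assumed to be reachable this way)
var links = db._create("links");             // placeholder collection

internal.processCsvFile("links.csv", function (data) {
  // "data" holds the parsed columns of one CSV record, as in the examples above
  links.save({ from: data[0], to: data[1] });
});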

@fceller
Contributor Author

fceller commented Dec 2, 2014

Closed because processCsvFile and processJsonFile are doing what I intended.
