arangoexport issues with exporting 25.6GB Database #3808

Closed
gabefair opened this issue Nov 26, 2017 · 22 comments

Labels: 1 Duplicate · 3 OOM (System runs out of memory / resources) · 3 RocksDB (Storage engine related)

gabefair commented Nov 26, 2017

My environment running ArangoDB:

I'm using the latest ArangoDB of the respective release series:

  • 3.2.7 (Community Edition)

ArangoDB 3.2.7 [win64] 64bit, using VPack 0.1.30, RocksDB 5.6.0, ICU 58.1, V8 5.7.492.77, OpenSSL 1.0.2a 19 Mar 2015

Mode:

  • Single-Server

Storage-Engine:

  • rocksdb

On this operating system:

  • Windows 10 Enterprise 64-bit, 32 GB RAM,
    AMD FX-6300 Six-Core Processor @ 3.50 GHz

I'm issuing AQL via:

  • this Driver: arangoexport

The issue can be reproduced using this dataset:

  1. Go to http://files.pushshift.io/reddit/comments/ and download the 2017-01 archive. Exact file here: http://files.pushshift.io/reddit/comments/RC_2017-01.bz2

  2. Import into ArangoDB: arangoimp --threads 6 --file "RC_2017-01" --type json --collection seventeen --server.database reddit --progress true --create-collection true --overwrite true
    -- You might get this issue.

  3. Create an index on "author" (see the sketch after this list).

  4. Once done, restart your computer for good measure.

  5. Export the database: arangoexport --server.database reddit --collection seventeen

  6. Notice how much memory the export takes. The export consumes more than twice as much RAM as the size of the database. While exporting, I saw it consume all 32 GB of my physical RAM and about 58 GB of virtual swap.
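
A minimal arangosh sketch of step 3 - the original report does not say which index type was used, so a plain (non-unique) hash index on "author" is assumed here:

// arangosh, connected to the 'reddit' database
// assumption: a plain hash index is what "an index on author" refers to
db.seventeen.ensureIndex({ type: "hash", fields: ["author"] });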

The following problem occurs (screenshot omitted; the error output is reproduced below):

While exporting there also appears to be a memory leak somewhere.
You can also expect the web interface's "Download documents as JSON file" feature not to work; the browser just freezes, no matter which one you use.

Connected to ArangoDB 'http+tcp://127.0.0.1:8529', version 3.2.7, database: 'reddit', username: 'root'

Exporting collection 'seventeen'...

2017-11-26T01:22:24Z [11036] ERROR {startup} startup aborted: caught exception during start of feature 'Export': got invalid response from server: Request timeout reached (exception location: C:\b\workspace\RELEASE__BuildPackages\arangosh\Export\ExportFeature.cpp:556). Please report this error to arangodb.com. shutting down
2017-11-26T01:22:24Z [11036] ERROR {startup} arangoexport terminated because of an unhandled exception: startup aborted: caught exception during start of feature 'Export': got invalid response from server: Request timeout reached (exception location: C:\b\workspace\RELEASE__BuildPackages\arangosh\Export\ExportFeature.cpp:556). Please report this error to arangodb.com (exception location: C:\b\workspace\RELEASE__BuildPackages\lib\ApplicationFeatures\ApplicationServer.cpp:636). Please report this error to arangodb.com

Instead I would be expecting:
A successful clean export

@maxkernbach added the labels "3 OOM System runs out of memory / resources" and "3 RocksDB Storage engine related" on Nov 27, 2017
@dothebart (Contributor)

Just downloading it and taking a look. Will also have a look at the other questions you've posted on SO.

@dothebart (Contributor)

First result of this effort: #3828, which fixes arangoimp.

@dothebart (Contributor)

Hm, creating the authors collection with the limited resources of my workstation is taking a while now.

db._query("FOR item IN RawSubReddits INSERT {_key: item.author, test: 'abc123' } INTO authors OPTIONS { ignoreErrors: true }")

@dothebart (Contributor)

This is related to #3437 - I will close this one as a duplicate once the other questions have been resolved.

@dothebart (Contributor)

OK,
here's the import procedure and how to create graphs from the reddit dataset.

Please note that until the release of ArangoDB 3.2.10, the web interface will be very resource-intensive when browsing parts of this graph. Alternatively, you can download
https://github.com/arangodb/arangodb/blob/bug-fix-3.2/aardvark-dont-use-any-for-graph-start/js/apps/system/_admin/aardvark/APP/aardvark.js
and place it at the corresponding path below /usr/share/arangodb3/ in your installation.

Please note that the following queries take a while to complete on this huge dataset; however, they should complete successfully after some hours.

We start with arangoimp to import our base dataset:

arangoimp --create-collection true  --collection RawSubReddits --type jsonl ./RC_2017-01 

We use arangosh to create the collections where our final data is going to live in:

db._create("authors")
db._createEdgeCollection("authorsToSubreddits")

We fill the authors collection by simply ignoring any subsequently occurring duplicate authors.
We calculate the _key of each author using the MD5() function,
so it obeys the restrictions on allowed characters in _key, and we can recompute it later on by calling MD5() again on the author field:

db._query(`
  FOR item IN RawSubReddits
    INSERT {
      _key: MD5(item.author),
      author: item.author
      } INTO authors
        OPTIONS { ignoreErrors: true }`);

After we have filled the second vertex collection (we keep the imported collection as the first vertex collection), we have to calculate the edges.
Since each author may have written several comments, there will most probably be several edges originating from each author. As previously mentioned,
we can use the MD5() function again to reference the author documents created above:

 db._query(`
   FOR onesubred IN RawSubReddits
     INSERT {
       _from: CONCAT('authors/', MD5(onesubred.author)),
       _to: CONCAT('RawSubReddits/', onesubred._key)
     } INTO authorsToSubreddits`);

After the edge collection is filled (which may again take a while - we're talking about roughly 40 million edges here), we create the graph description:

db._graphs.save({
  "_key": "reddits",
  "orphanCollections" : [ ],
  "edgeDefinitions" : [ 
    {
      "collection": "authorsToSubreddits",
      "from": ["authors"],
      "to": ["RawSubReddits"]
    }
  ]
})

We can now use the UI or AQL queries to browse the graph. Let's pick a more or less random author - the first one from that list:

db._query(`for author IN authors LIMIT 1 RETURN author`).toArray()
[ 
  { 
    "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
    "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
    "_rev" : "_W_Eu-----_", 
    "author" : "punchyourbuns" 
  } 
]

We identified an author, and now run a graph query for him:

db._query(`FOR vertex, edge, path IN 0..1
   OUTBOUND 'authors/1cec812d4e44b95e5a11f3cbb15f7980'
   GRAPH 'reddits'
   RETURN path`).toArray()

One of the resulting paths looks like this:

{ 
  "edges" : [ 
    { 
      "_key" : "128327199", 
      "_id" : "authorsToSubreddits/128327199", 
      "_from" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_to" : "RawSubReddits/38026350", 
      "_rev" : "_W_LOxgm--F" 
    } 
  ], 
  "vertices" : [ 
    { 
      "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
      "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_rev" : "_W_HAL-y--_", 
      "author" : "punchyourbuns" 
    }, 
    { 
      "_key" : "38026350", 
      "_id" : "RawSubReddits/38026350", 
      "_rev" : "_W-JS0na--b", 
      "distinguished" : null, 
      "created_utc" : 1484537478, 
      "id" : "dchfe6e", 
      "edited" : false, 
      "parent_id" : "t1_dch51v3", 
      "body" : "I don't understand tension at all."
         "Mine is set to auto."
         "I'll replace the needle and rethread. Thanks!", 
      "stickied" : false, 
      "gilded" : 0, 
      "subreddit" : "sewing", 
      "author" : "punchyourbuns", 
      "score" : 3, 
      "link_id" : "t3_5o66d0", 
      "author_flair_text" : null, 
      "author_flair_css_class" : null, 
      "controversiality" : 0, 
      "retrieved_on" : 1486085797, 
      "subreddit_id" : "t5_2sczp" 
    } 
  ] 
}

gabefair (Author) commented Dec 6, 2017

Thank you so much for that explanation and for the help! I copied over the js file and started the import (--threads 5). The first 20% finished in about 5 minutes, the next 20% finished 10 minutes later, and the last 60% took over two hours. So there seems to be some slight exponential slowdown, but that might be normal, I don't know.

Something I noticed is that all of my physical RAM (32 GB) was used during the import (which makes sense), but hardly any of my virtual RAM was used.

[screenshot of memory usage omitted]

@dothebart (Contributor)

Hi,
yes, up to a certain amount of data relative to your physical RAM, caches can be utilized.
The import has to do index lookups (i.e. to find duplicates), which may take longer once the working set exceeds your memory.


pkoppstein commented Dec 8, 2017

@dothebart - The fact that your solution incurs the potential disadvantages here of md5 seems to point to an easily-fixed omission from AQL: support for base64 encoding.

The main potential disadvantage of md5 here is that it is, by design, not readily invertible. Also, a fast implementation of base64 should be faster than md5.

Please note that I am not suggesting that using md5 here is wrong or necessarily ill-advised. Indeed, there may be some circumstances where it is entirely appropriate. However, if a user wanted ._key values to be “transparent” or “invertible”, it seems that the existing repertoire of AQL functions is inadequate.

base64 is by all accounts the way to go, and of course one would also want a companion function to perform the inversion when possible. If you agree, what would be the best way to ensure the enhancement is “fast-tracked”?

@dothebart (Contributor)

Hi,
base64 wouldn't work, since its output may contain /, which is forbidden in keys.
Hex encoding would work.
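
For illustration, a rough and untested sketch of what a hex-encoding helper could look like as a user-defined function (note that charCodeAt() yields UTF-16 code units, not UTF-8 bytes, so this only illustrates the idea):

var aqlfunctions = require("@arangodb/aql/functions");

aqlfunctions.register('STRING::hexencode', function (s) {
  'use strict';
  var out = "";
  for (var i = 0; i < s.length; i++) {
    // hex-encode each code unit, zero-padded to 4 hex digits;
    // the output only contains [0-9a-f], which is safe for _key values
    out += ("000" + s.charCodeAt(i).toString(16)).slice(-4);
  }
  return out;
}, true);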

However,
we've put a lot of effort into AQL and its JSON-based type system to be self-contained and to rule out
erroneous conditions, so queries can continue in almost all cases. It is defined over valid UTF-8 strings, where it is crystal clear how to handle them.

If you now add encoding/decoding of arbitrary (possibly broken or invalid) data - imagine a string containing a \0 halfway through - then you have put something into the database (saved as client-encoded hex strings) that you don't know how to handle; e.g. compare it to another similar-looking string with a blank or just different content after the \0, and strcmp() will just say "they're the same!"
This is one of the reasons why #107 is a long-standing issue - even though our current storage format (VelocyPack) could support pluggable data types.
Next issue - how do you handle the browser representation of such strings? How do you output them to regular JSON without breaking the client?

So, while the implementation of such a function would be fairly easy, the implications it brings are far-reaching and have to be thought through well - so this is not something that can simply be fast-tracked.

The simple and not too complicated approach is to keep the original author in a second field (as I did in my examples) next to its checksum.

If you want to, open a feature-request GitHub issue, so that once all these thoughts have been thought through, we remember to implement them.

Simran-B (Contributor) commented Dec 8, 2017

valid UTF8-strings

More specifically, valid and normalized UTF-8 strings. Which normalization form is it though?


pkoppstein commented Dec 8, 2017

@dothebart - Thanks for your detailed explanation. I had certainly overlooked the prohibition against "/", and I also understand that the restriction about key-length would require some care.

Now that I've looked more closely, I see also that keys cannot contain "~", so that rules out encodeURIComponent() :-(

However, apart from the length restriction, the combination of base64 and encodeURIComponent should work, shouldn't it?

In any case, since arangosh already has encodeURIComponent, I think it (and decodeURIComponent) would be worthy additions to AQL.

Which leads to the question: is there any compelling reason why all strictly-deterministic functions provided by arangosh at the top-level couldn't be made available in AQL?

Regarding the issue you raised about the inverse of base64, please note that I was careful to say:
"companion function to perform the inversion when possible". I have no problem with the "companion" function failing in AQL if given an inadmissible string.

p.s. encodeURIComponent() does not encode "~" but decodeURIComponent does decode "%7E" to "~".

Simran-B (Contributor) commented Dec 8, 2017

AQL could be extended rather easily with JS implementations, but all such functions should also have a C++ implementation for performance reasons - which is quite a lot more work.

If performance isn't much of a concern, user-defined functions may do the job just fine and can be created by the user without building custom binaries.

Not wanting to pollute AQL with too many functions might be another reason. Namespaces could help to mitigate this, e.g. ENCODE::BASE64(str).

@pkoppstein

@Simran-B - Thanks for your observations.

UDFs are fine for true "customization" but it should not be necessary to use them for standard functions. There are (depending on the user) a variety of potentially large costs associated with the care and maintenance of anything that is "customized".

I don't know enough about ArangoDB internals to know exactly when C++ implementations would be required, but the argument from cost or lack-of-resources (to implement the functions in C++) seems to me to be completely invalid: users who use UDFs already incur a performance penalty.

Perhaps I'm wrong, but it looks as though there should be a path forward that at every step improves things, from both the user's and the vendor's points of view:

  1. provide support in AQL for the standard strictly-deterministic functions that are already available at the top-level in arangosh (the small cost to the vendor would surely be offset by much happier users and hopefully by having to attend to fewer Enhancement Requests and posts such as this :-)

  2. as resources permit, incrementally add C++ implementations, in accordance with field reports about performance bottlenecks.

Regarding your point about pollution -- since users must currently provide a namespace, I am not quite sure what the issue here is; indeed, adding "ENCODE::BASE64" would run the risk of a collision with a UDF, wouldn't it?

But if the path forward requires some kind of additional namespace, then so be it!


pkoppstein commented Dec 8, 2017

The following seems to fill the bill:

var aqlfunctions = require("@arangodb/aql/functions");

aqlfunctions.register('STRING::urlencode', function(s) {
    'use strict';
    return encodeURIComponent(s).replace(/~/g, "%7E");
}, true);

aqlfunctions.register('STRING::urldecode', function(s) {
    'use strict';
    return decodeURIComponent(s);
}, true);

Example:

LET s="%7E %7E# % -_.!~*'(再见)"
LET e=STRING::URLENCODE(s)
LET u=STRING::URLDECODE(e)
RETURN [s,e, (s==u)]

Output:

[
  [
    "%7E %7E# % -_.!~*'(再见)",
    "%257E%20%257E%23%20%25%20-_.!%7E*'(%E5%86%8D%E8%A7%81)",
    true
  ]
]

Simran-B (Contributor) commented Dec 8, 2017

I don't know enough about ArangoDB internals to know exactly when C++ implementations would be required

It is not required. If there is a C++ implementation, it will be preferred, unless the data is already in V8 - in that case it's more efficient to use a JS implementation than to convert from V8 to native data structures to perform the operation. Most operations are carried out in native code. If a JS implementation is used, you will see /* V8 expression */ in the execution plan.

users who use UDFs already incur a performance penalty.

Processing speed would be more or less on par with an integrated JS implementation. Cluster behavior would be different to my knowledge: UDFs require intermediate data to be sent to the coordinator to execute UDF code there and the result needs to be sent back to the DB servers. With normal AQL functions, the processing happens on the DB servers AFAIK.

as resources permit, incrementally add C++ implementations

That is always the goal. Community contributions would be more than welcome for both JS and C++ implementations. In fact, it would be a nice challenge for a new contributor.

The relevant source code:

adding "ENCODE::BASE64" would run the risk of a collision with a UDF, wouldn't it?

True. Only the AQL:: namespace is reserved so far.

What would be your recommendation to call these functions? ENCODE_BASE64, ENCODE_URI_COMPONENT etc.? Or maybe TO_BASE64 (in the style of existing type casting functions), FROM_BASE64 etc.?

BTW: I remember that there are inconsistencies between how PHP encodes URI Components and how JS does it... I believe they differ for a certain character. Was it + vs. %20? We should probably follow whatever JS does, unless there are good reasons to do it differently.


pkoppstein commented Dec 8, 2017

@Simran-B -

What would be your recommendation to call these functions? ENCODE_BASE64, ENCODE_URI_COMPONENT etc.? Or maybe TO_BASE64 (in the style of existing type casting functions), FROM_BASE64 etc.?

Currently, AQL has MD5 so it would seem to me that either BASE64 or AQL::BASE64 would be appropriate, especially if there is to be no AQL function for decoding arbitrary base64-encodings. For the "left-inverse" of BASE64 (i.e. the function f() such that for UTF-8 strings, s, f(BASE64(s)) == s), I have seen "BASE64D", which at least has the advantage of being concise :-)

Regarding URI encoding and decoding: since JS prescribes a carefully-defined encodeURIComponent as a global function, it seems clear that AQL should provide the same function.

As for PHP's "urlencode" -- here is its specification:

Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the RFC 3986 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.

I think that answers your question about "+".

If AQL had functions with the same names and semantics of either of these PHP functions, that would seem to me to be eminently sensible, but I am sure there are sensible alternatives as well.

All of this unfortunately still fails to provide a single function for mapping strings to keys in a more transparent manner than MD5. One possibility would be to piggy-back off the fact that MD5 only chews up 32 bytes, and to define a function, say, TO_KEY, as follows: given a UTF-8 string, s, beginning with a character suitable for a document key, and a function that returns strings with characters suitable for ArangoDB keys such as urlencode, let t=urlencode(s); if length(t)<=222 then return t; otherwise return LEFT(t,222) + MD5(s).
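
A rough, untested sketch of this TO_KEY idea as a user-defined function (assuming the @arangodb/crypto module for MD5, and that 222 + 32 characters stay within ArangoDB's key-length limit; the precondition that the input starts with a key-admissible character is left to the caller):

var aqlfunctions = require("@arangodb/aql/functions");
var crypto = require("@arangodb/crypto");

aqlfunctions.register('STRING::to_key', function (s) {
  'use strict';
  // transparent, key-safe encoding of the input string
  var t = encodeURIComponent(s).replace(/~/g, "%7E");
  if (t.length <= 222) {
    return t;
  }
  // too long: keep a readable 222-character prefix and append the
  // 32-character MD5 of the full original string to keep keys distinct
  return t.substring(0, 222) + crypto.md5(s);
}, true);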

Simran-B (Contributor) commented Dec 8, 2017

There are also base58 and base62, which only use 0-9, A-Z and a-z. They are not that common, but wouldn't that be a solution for document keys?

For collection names, we would need a base52 however (A-Z, a-z), because leading numbers are not allowed (internally, collections have number-based identifiers and a check of the first character can determine whether a collection identifier is a collection number or a collection name).

Encoding spaces as + wouldn't make sense and I doubt we need an extra function to provide PHP-style encoding...


pkoppstein commented Dec 8, 2017

There are also base58 and base62

None of the base* family solves the problem of key length by itself, so for TO_KEY I much prefer the transparency and popularity of something in the "urlencode" family.

Regarding TO_KEY(s), I've updated my proposal, making it a requirement that the first character of s be admissible as the first character of a key. (The user can easily enough add a dummy character if need be.)

base62 might be handy, but seems to be relatively unpopular (no Wikipedia article).

Regarding base58, my understanding is that it's quite unlike the others, in that it covers a family of encodings. It all seems rather complex, unless you're into cryptocurrencies. See e.g. https://en.bitcoin.it/wiki/Base58Check_encoding

Simran-B (Contributor) commented Dec 9, 2017

Doesn't look too complex and supports arbitrary alphabets (base58, base62 etc.): https://github.com/cryptocoinjs/base-x

Downside: the algorithm produces non-standard outputs for hex and base64, apparently.

No encoding scheme can solve the key-length problem, if I'm not mistaken. You either use a reversible algorithm / lossless compression, where input length and output length correlate in some way, or you use a hash algorithm, which is a one-way street (kind of a lossy compression) but can make guarantees about the output length. If you combine both, you effectively end up with the latter, don't you (not fully reversible)? The question is whether there are any benefits to the combined approach. The only advantage that comes to my mind for something like encodeURIComponent is that humans would be able to interpret it, because it partially contains the verbatim input. So you might want to add "human readability" to the requirements.

@dothebart (Contributor)

@pkoppstein how is this related to importing the reddit dataset in the first place?

@pkoppstein

@dothebart - I am not sure what the referent of "this" is. I agree that the discussion has become a bit convoluted, but if it would help for me to create a separate "enhancement request" regarding the availability in AQL of deterministic functions that are already available in arangosh without any further "require"s, please say so.

@dothebart (Contributor)

@pkoppstein whatever you're talking about - does it have any connection with the original (closed, because solved) issue? If not, please open a new one, add all relevant information, and let this one rest in peace.
