DGraph Times Out Processing Graph #4231
What exactly are you doing? Is it a bulk load? A live load? The file
If you wanna change something, you should do it after the load via upsert, not during the load. What is the config of your cluster (how many groups and instances do you have)?
This is on a single box using one dgraph/zero/ratel server (e.g. the getting-started docker-compose). This uses the standard localhost:8080/mutate endpoint (with commitNow=true). The graph.gz is a dump of what I am posting to the endpoint.
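For reference, the request described above can be sketched roughly like this (a minimal sketch assuming Dgraph v1.1's HTTP API, where an RDF set mutation is POSTed to Alpha's /mutate endpoint with a Content-Type header; the N-Quad shown is a placeholder, not taken from the actual dataset):

```python
import urllib.request

ALPHA = "http://localhost:8080"  # single-box Alpha from the docker-compose setup


def build_mutate_request(nquads: str) -> urllib.request.Request:
    """Build an RDF set mutation for Alpha's /mutate endpoint with
    commitNow=true, as described in the thread. The Content-Type header
    follows the v1.1 HTTP API; check the docs for your version."""
    body = "{ set { %s } }" % nquads
    return urllib.request.Request(
        url=ALPHA + "/mutate?commitNow=true",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/rdf"},
        method="POST",
    )


# Building (not sending) the request; sending would be
# urllib.request.urlopen(req) against a running Alpha.
req = build_mutate_request('_:n <name> "example" .')
print(req.full_url)      # http://localhost:8080/mutate?commitNow=true
print(req.get_method())  # POST
```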
How much RAM does it consume while doing this?
It would do occasional spikes and then eventually sit at 6~10GB. I would have to run it again to make sure, and I won't be at my dev box for a while. But the graph.gz is only ~168 MB unzipped, so I am kind of surprised it was taking so much RAM.
@MichelDiz I updated the description with proper repro steps (sorry, should have done that to begin with). Got onto a laptop and performed the repro; I killed dgraph once it started to exceed 10GB of RAM. Also fixed traits.txt (now traits.txt.gz).
As you have a lot of RAM available, try running your cluster with these flags (just for this situation):
Ok, I will have to give it a shot tonight; my current laptop doesn't have nearly as much RAM :)
What I have indicated is a way of assessing whether it could be bottlenecked by I/O. I would then recommend that you have an actual cluster, where each group has its own resources. A single instance with a slow HDD tends to have poor performance. But if you have multiple instances with their own resources, even with slow HDDs and well-defined groups, there is no performance loss. Also, Dgraph/BadgerDB works better with SSD/NVMe.
So I am running on a Corsair MP600, which gets about 4250 MB/s in write throughput, so I doubt it's I/O bound. You won't even see those speeds in a small cluster setup. The only thing I can think of that might affect this is that I am using a Docker volume, and that could be I/O limited.
So it seems these flags did not make any difference. I've also noticed that since the request failed to fully process, dgraph tries to recover it when I restart the server. The recovered "job?" causes the server to start spinning again. The server also doesn't seem responsive to simple query commands.
Also, I've been monitoring the server trying to process the job for over 15 min, and the server is sitting at 7.5 GB of RAM use (which has stayed consistent since being reached, though slowly rising) and fluctuates between using 2 vcores and all 24 vcores. I don't see much I/O activity (via iotop). I suspect maybe Golang's GC is kicking in for a massive cleanup when I hit all 24 cores?
Quick question: how do you run it? 185MB via Ratel?
How do I run the request? I run all servers via docker-compose and send the request to localhost:8080/mutate, which is Alpha.
Looks like the job finally finished (but didn't commit graph changes).
Ah, just saw your update in the first comment.
FYI, the server is sitting at 8.8 GB of RAM use after job completion. That seems really high given how small the data set was.
Well, I've been running your dataset for the last 26 minutes. The only thing in the logs is a transaction being aborted (maybe some typo in the dataset). There's no issue as far as I can tell. (I'm not using Docker, just a local instance of Dgraph on Darwin.) It uses 10.13GB of RAM, which is normal, and has some spikes in CPU, which are also normal due to the background activities. Using an Upsert Block with 185MB of data (decompressed) was never tested, nor is it the intended use of that kind of query. Upsert is very powerful, and its use must be done with care. I believe these issues below cover what you need:
I would be surprised if there were a typo in the data set, since it is programmatically generated and works on smaller data sets (but I guess there is always the chance). Are you saying I should use the mutate endpoint instead and should expect better perf? I really just use upsert so that I can get the trait node IDs and insert new data with them in one call.
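The "look up trait UIDs and link new data in one call" pattern described above maps onto Dgraph v1.1's upsert block (a query block binds UIDs to a variable, and the mutation block references them with uid()). A minimal sketch, where the predicate names (trait_name, has_trait, name) are purely illustrative and not taken from the actual schema:

```python
def build_upsert(trait: str, person_name: str) -> str:
    """Sketch of a Dgraph v1.1 upsert block: look up an existing trait
    node's uid, then link a new node to it in a single request.
    Predicate names here are hypothetical placeholders."""
    return """upsert {
  query {
    t as var(func: eq(trait_name, "%s"))
  }
  mutation {
    set {
      _:p <name> "%s" .
      _:p <has_trait> uid(t) .
    }
  }
}""" % (trait, person_name)


print(build_upsert("brave", "alice"))
```

The resulting string is what would be POSTed to /mutate (with commitNow=true, as elsewhere in this thread).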
I would personally recommend that you do this in small chunks with longer intervals, and add load-balancing logic to your code. For example, instead of firing all chunks at just one instance, you can fire like this:
Sending chunks to each available instance. That way you split the load well across your cluster.
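The chunk-and-spread idea above can be sketched as follows (a minimal sketch; the Alpha URLs and batch size are hypothetical, and the actual sending is left out):

```python
from itertools import cycle


def chunk_nquads(lines, chunk_size=1000):
    """Split a list of N-Quad lines into batches of at most chunk_size,
    so each transaction stays small instead of one huge mutation."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def assign_round_robin(chunks, alphas):
    """Pair each chunk with the next Alpha endpoint in rotation,
    spreading the load across the cluster."""
    return list(zip(chunks, cycle(alphas)))


# Hypothetical cluster of three Alpha instances.
alphas = ["http://alpha1:8080", "http://alpha2:8080", "http://alpha3:8080"]
lines = ['_:n%d <name> "x" .' % i for i in range(2500)]
plan = assign_round_robin(chunk_nquads(lines, 1000), alphas)
for chunk, alpha in plan:
    print(len(chunk), alpha)  # each batch goes to a different Alpha
```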
So I changed graph.gz to be a single mutate and no change occurred (so far). If this is the expected behavior, then I think I might just explore some other options. Still a cool product; it just might not yet be up for my use cases.
Hey, btw. Don't use
@MichelDiz Not sure if you want to do anything with this issue. I feel like there is a bug here, but I can close it if you think otherwise. For comparison, I finally loaded my graph into another graph DB (ArangoDB) and it had a much easier time with this data set. Load time was ~3 min, but you can probably ignore that since the way data is loaded is fundamentally different. The final resident memory was ~3GB after a full ingest. So maybe Golang is just really relaxed about releasing memory back to the OS?
This week we start investigating this and other related things. We have some engineers trying to replicate similar experiences and comparing with other DBs. Tell me, is there an equivalent of Upsert in ArangoDB? Do you have docs about it? Did the upload take 3 minutes via some live upload? Given the size of your dataset, bulk load would load it within seconds. We are investigating this. I've been investigating since last week and now the cavalry has arrived. We'll get some direction on this. Let's keep it open. I think someone is tracking your issue too. Cheers.
The upload into ArangoDB was live. They support a bulk-upload format for large data sets, and it lets you do the bulk upload at any time into existing graphs. I could try the Dgraph bulk loader as a one-off to see if it makes any difference, but that doesn't serve my end purpose since I will be uploading many similarly sized data sets continuously. The data set I attached to this issue has 3M+ edges, so it's non-trivial, FWIW. I think my bigger concern for this issue is the memory outstanding after upload. Even if it takes 40 minutes, I would expect a much smaller active memory set afterwards.
Hmm, I'll check this tomorrow. Off the top: it is not recommended to have such a large transaction via a client (in your case, HTTP). If you take the code of Live Loader as an example, you will see that it splits the dataset into smaller parts and multiple transactions, making the load smoother. But since you want to use upsert, this could be done, though differently than with Live Loader. I would recommend that you build a pipeline with Live Loader and your code/env. If you have multiple instances of Dgraph, I recommend adding the URLs of all of them to the process (you can put them separated by commas in the -a flag), so Live Loader balances for you. If you are using HTTP only and have multiple Alpha instances, you could eventually put a load balancer in front of your instances. I have a setup for this with Traefik and Dgraph; I just need to find where it is to share it.
Hey @jm4games, just to update you. Here's the LB setup with Dgraph if you need it: https://github.com/MichelDiz/ItisTimetoReproduce
There's a set of other issues that cover this one.
GitHub issues have been deprecated.
Dgraph Version?
Dgraph version : v1.1.0
Dgraph SHA-256 : 7d4294a80f74692695467e2cf17f74648c18087ed7057d798f40e1d3a31d2095
Commit SHA-1 : ef7cdb2
Hardware Spec?
OS: Arch Linux, (running dgraph in docker env)
CPU: AMD 3900x
RAM: 64GB
Steps to reproduce the issue (command/config used to run Dgraph).
schema.txt - graph layout
trait.txt - initial seed data
graph.gz - (full graph)
Commands:
Update schema through dgraph web UI (copy and paste schema.txt into schema bulk edit)
Then run the following from shell in dir where downloaded trait.txt.gz and graph.gz exist.
Expected behaviour and actual result.
Dgraph should process the graph, but instead spins on it and never completes processing.
Other
I know the graph itself is sound, as I have processed smaller data sets with the exact same layout. It seems once a graph reaches a certain size, dgraph can no longer process it. Maybe I have exceeded some sort of internal limit. I have also attached the log of the dgraph server, and it looks like a raft consensus is timing out.
Files
graph.gz
error.log
schema.txt
trait.txt.gz