considerations for running the shasta server / how the intermediate graphs are stored on disk #73

ekg · 2019-11-07T09:42:18Z

I'd like to run the shasta server to look at intermediate graphs. I've not been keeping these around or working with them. I have just been using the default mode for memory management, and I've only tested --memoryBacking 2M once. (I actually didn't use it because it didn't seem to help my performance. My assembly jobs have only required tens of GB of RAM, so perhaps this is why?)

Can I copy or mount graphs from a remote server to a local directory and then load them in the shasta server? I noticed that I was unable to remove files that were generated with --memoryBacking 2M until running shasta --command cleanupBinaryData.

I am generally curious about how the disk-backed memory management works in practice, because I'd like to improve on what I'm doing in seqwish. Could you point me to the code I should read to understand how you're doing this?


paoloczi · 2019-11-07T15:01:30Z

All graphs and other HTML are generated on the fly and never go to disk. To display them you must have a process running shasta --command explore on your local or remote machine, and the assembly must have an intact Data directory containing assembly binary data (see below for more on this). In this mode of operation, Shasta operates as an HTTP server: it interprets requests from a browser and responds with HTML that displays what the browser is asking for.

If your assembly is local, just use shasta --command explore --assemblyDirectory xyz. This will start both the Shasta process running the HTTP server and a browser pointing to it.
If your assembly is on a remote machine, use shasta --command explore --exploreAccess unrestricted --assemblyDirectory xyz on the remote machine. This will tell you what port Shasta is using (usually port 17100). Then, on your local machine, start a browser session and point it to the IP address of the remote machine and that port number. Depending on the browser you use, enter a.b.c.d:17100 or http://a.b.c.d:17100 in the URL field, replacing a.b.c.d with the IP address of the remote machine (the same IP address you use to ssh to the machine; you can also use a host name instead if your machine has one). For this to work, access to that port must not be blocked by a firewall. If you are using AWS, you need to create a security group that opens TCP ports 17100-17110 and attach that security group to your AWS instance. Understand that in this mode of operation anybody on the Internet can access your data, so you should not do this if you are working with confidential data.
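The two modes above can be sketched as a few shell helpers. This is only an illustration: the SHASTA variable and the function names (explore_local, explore_remote, explore_url) are hypothetical and not part of Shasta; only the command-line flags are taken from the description above.

```shell
# Illustrative sketch: SHASTA and the function names are hypothetical helpers,
# not part of Shasta. Only the flags come from the description above.
SHASTA="${SHASTA:-shasta}"

# Local assembly: starts the server and a browser pointing to it.
explore_local() {
    "$SHASTA" --command explore --assemblyDirectory "$1"
}

# Remote assembly: run this on the remote machine. Anyone who can reach the
# port can see the data, so avoid this mode for confidential data.
explore_remote() {
    "$SHASTA" --command explore --exploreAccess unrestricted --assemblyDirectory "$1"
}

# URL to enter in the local browser. 17100 is only the usual port; pass the
# port that Shasta actually reports as the second argument if it differs.
explore_url() {
    printf 'http://%s:%s\n' "$1" "${2:-17100}"
}
```

For example, after running explore_remote xyz on the remote machine, explore_url a.b.c.d prints the address to open in the local browser.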

In both cases, the assembly must have an intact Data directory containing binary data. Depending on your situation, there are various ways to achieve that.

For a small run (I think this is your case), use --memoryMode filesystem --memoryBacking disk when running the assembly. This will create the Data directory on disk. This is not the best-performing approach but is fine for small runs.
For a large run, use --memoryMode filesystem --memoryBacking 2M when running the assembly. In this case, the Data directory is a filesystem backed by 2 MB pages. You can then run the HTTP server immediately after the assembly finishes, as long as the machine has not rebooted. The HTTP server will be working off the binary data in memory, and you will still be using that memory. If you want to save the data to disk for future use, run shasta --command saveBinaryData --assemblyDirectory xyz followed by shasta --command cleanupBinaryData --assemblyDirectory xyz. (The latter command frees up the memory you were using.) Now you can run the HTTP server, and it will work off the binary data on disk, which persists across reboots.
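The two cases above can be sketched in shell as follows. The SHASTA variable and the function names (memory_options, save_and_cleanup) are hypothetical conveniences, not part of Shasta; the flags and the order of the two commands follow the description above.

```shell
# Illustrative sketch: SHASTA and the function names are hypothetical helpers.
SHASTA="${SHASTA:-shasta}"

# Memory options to add to the assembly command line:
#   small -> Data directory created on disk
#   large -> Data directory is a filesystem backed by 2 MB pages
memory_options() {
    case "$1" in
        small) echo "--memoryMode filesystem --memoryBacking disk" ;;
        large) echo "--memoryMode filesystem --memoryBacking 2M" ;;
        *) echo "usage: memory_options small|large" >&2; return 1 ;;
    esac
}

# After a large run: save the binary data to disk, then release the memory.
# The cleanup step runs only if the save step succeeds.
save_and_cleanup() {
    "$SHASTA" --command saveBinaryData --assemblyDirectory "$1" &&
    "$SHASTA" --command cleanupBinaryData --assemblyDirectory "$1"
}
```

Chaining the two commands with && means the memory is not released unless the data has been saved first.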

Documentation for the HTTP server is sparse, and improving it is on my list of things to do.

paoloczi · 2019-11-07T15:06:07Z

Two more points:

As you suspect, the reason you don't benefit from using 2 MB pages is that your assembly is small. In large assemblies I have seen a 30% or better improvement in performance from using 2 MB pages versus the default mode of operation.
You can also copy the entire assembly directory locally, including the Data directory, and then run in the local mode of operation I described above.
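Copying the assembly and exploring it locally might look like the sketch below. It assumes rsync over ssh is available and that the remote Data directory already holds binary data on disk. The RSYNC and SHASTA variables, the function name, and the user@host:/path form of the remote location are all illustrative assumptions.

```shell
# Illustrative sketch: RSYNC, SHASTA, and fetch_and_explore are hypothetical.
RSYNC="${RSYNC:-rsync}"
SHASTA="${SHASTA:-shasta}"

# Copy the whole assembly directory, including Data, then explore it locally.
#   $1: remote location, e.g. user@host:/path/to/xyz (hypothetical)
#   $2: local destination directory
# The trailing slashes make rsync copy the directory contents into $2.
fetch_and_explore() {
    remote_spec="$1"
    local_dir="$2"
    "$RSYNC" -a "$remote_spec/" "$local_dir/" &&
    "$SHASTA" --command explore --assemblyDirectory "$local_dir"
}
```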

paoloczi · 2019-11-07T15:49:40Z

One more comment: the reason you can't delete the Data directory if you used --memoryBacking 2M is that there is a filesystem mounted there, and you have to unmount it before the system will let you remove the mount point Data. The Shasta cleanupBinaryData command does the unmount.
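To check whether a Data directory is still a mount point before trying to delete it, something like the following could be used. It relies on the mountpoint utility from util-linux, which is not mentioned in this thread, and the function name is hypothetical.

```shell
# Illustrative sketch: data_is_mounted is a hypothetical helper.
# Returns success if <assembly directory>/Data is a mount point.
data_is_mounted() {
    mountpoint -q "$1/Data"
}

# Example use (commented out so that loading this file runs nothing):
# if data_is_mounted xyz; then
#     shasta --command cleanupBinaryData --assemblyDirectory xyz
# fi
```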

paoloczi · 2019-11-18T15:11:45Z

I am closing this, but feel free to reopen or start a new issue if more questions come up.

ekg mentioned this issue Nov 7, 2019

Optimizing Shasta for short read lengths (<10kb) #61

Closed

paoloczi added the question label (Further information is requested) Nov 7, 2019

paoloczi closed this as completed Nov 18, 2019
