This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Considerations for running the shasta server / how the intermediate graphs are stored on disk #73

Closed
ekg opened this issue Nov 7, 2019 · 4 comments
Labels
question Further information is requested

Comments


ekg commented Nov 7, 2019

I'd like to run the shasta server to look at intermediate graphs. I've not been keeping these around or working with them. I have just been using the default mode for memory management, and I've only tested --memoryBacking 2M once. (I actually didn't use it because it didn't seem to help my performance. My assembly jobs have only required tens of GB of RAM, so perhaps this is why?)

Can I copy or mount graphs from a remote server to a local directory and then load them in the shasta server? I noticed that I was unable to remove files that were generated with --memoryBacking 2M until running shasta --command cleanupBinaryData.

I am generally curious about how the disk-backed memory management works in practice, because I'd like to improve on what I'm doing in seqwish. Could you point me to the code I should read to understand how you're doing this?

Contributor

paoloczi commented Nov 7, 2019

All graphs and other HTML are generated on the fly and are never written to disk. To display them, you must have a process running shasta --command explore on your local or remote machine, and the assembly must have an intact Data directory containing the assembly's binary data (see below for more on this). In this mode of operation, Shasta acts as an HTTP server: it interprets requests from a browser and responds with HTML that displays what the browser asked for.

  • If your assembly is local, just use shasta --command explore --assemblyDirectory xyz. This will start both the Shasta process running the http server and a browser pointing to it.

  • If your assembly is on a remote machine, use shasta --command explore --exploreAccess unrestricted --assemblyDirectory xyz on the remote machine. Shasta will report which port it is using (usually port 17100). Then, on your local machine, start a browser session and point it to the IP address of the remote machine and that port number. Depending on the browser you use, you can enter a.b.c.d:17100 or http://a.b.c.d:17100 in the URL field, replacing a.b.c.d with the IP address of the remote machine (the same IP address you use to ssh to the machine; you can also use a host name if your machine has one). For this to work, access to that port must not be blocked by a firewall. If you are using AWS, you need to create a security group that opens TCP ports 17100-17110 and attach that security group to your AWS instance. Be aware that in this mode of operation anybody on the Internet can access your data, so you should not do this if you are working with confidential data.
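The two invocations above can be sketched as small shell helpers. This is only an illustration: the helper names, the DRY_RUN switch, and the example address are hypothetical, and the flags are just the ones quoted in this comment. With DRY_RUN=1 the helpers print the command instead of running it.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Local assembly: $1 = assembly directory.
explore_local() {
    run shasta --command explore --assemblyDirectory "$1"
}

# Remote assembly, to be run on the remote machine: $1 = assembly directory.
# Note that unrestricted access exposes the data to anyone who can reach the port.
explore_remote() {
    run shasta --command explore --exploreAccess unrestricted --assemblyDirectory "$1"
}

# URL to enter in the local browser: $1 = remote IP or host name,
# $2 = port (defaults to 17100, the usual port mentioned above).
explore_url() {
    printf 'http://%s:%s\n' "$1" "${2:-17100}"
}
```

For example, `explore_remote xyz` on the remote machine followed by opening the output of `explore_url a.b.c.d` locally mirrors the remote workflow described above. If Shasta reports a different port, pass it as the second argument.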

In both cases, the assembly must have an intact Data directory containing binary data. Depending on your situation, there are various ways to achieve that.

  • For a small run (I think this is your case), use --memoryMode filesystem --memoryBacking disk when running the assembly. This will create the Data directory on disk. This is not the best-performing approach, but it is fine for small runs.

  • For a large run, use --memoryMode filesystem --memoryBacking 2M when running the assembly. In this case, the Data directory is a filesystem backed by 2 MB pages. You can then run the HTTP server immediately after the assembly finishes, as long as the machine has not been rebooted. The HTTP server will be working off the binary data in memory, and you will still be using that memory. If you want to save the data to disk for future use, run shasta --command saveBinaryData --assemblyDirectory xyz followed by shasta --command cleanupBinaryData --assemblyDirectory xyz. (The latter command frees the memory you were using.) Now you can run the HTTP server, and it will work off the binary data on disk, which of course persists across reboots.
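A rough sketch of these two workflows as shell helpers follows. The helper names and the DRY_RUN switch are hypothetical. Passing reads with --input and giving --assemblyDirectory to the assembly step are assumptions not stated in this comment, so check them against your Shasta version. The memory flags and the saveBinaryData / cleanupBinaryData commands are the ones described above.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Assemble with binary data kept in a filesystem-backed Data directory.
# $1 = reads file (assumed to be passed with --input)
# $2 = assembly directory
# $3 = memory backing: "disk" for small runs, "2M" for large runs
assemble() {
    run shasta --input "$1" --assemblyDirectory "$2" \
        --memoryMode filesystem --memoryBacking "$3"
}

# After a 2M-backed run: copy the binary data to disk, then free the memory.
# $1 = assembly directory. The cleanup only runs if the save succeeded.
persist_binary_data() {
    run shasta --command saveBinaryData --assemblyDirectory "$1" &&
        run shasta --command cleanupBinaryData --assemblyDirectory "$1"
}
```

For a large run this would be `assemble reads.fasta xyz 2M`, optionally an explore session while the data is still in memory, and then `persist_binary_data xyz` before the machine reboots.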

Documentation for the HTTP server is sparse, and improving it is on my list of things to do.

@paoloczi paoloczi added the question Further information is requested label Nov 7, 2019
Contributor

paoloczi commented Nov 7, 2019

Two more points:

  • As you suspect, you don't benefit from using 2 MB memory because your assembly is small. In large assemblies I have seen a 30% or better improvement in performance from using 2 MB memory versus the default mode of operation.

  • You can also copy the entire assembly directory locally, including the Data directory, and then run in the local mode of operation I described above.

Contributor

paoloczi commented Nov 7, 2019

One more comment: the reason you can't delete the Data directory if you used --memoryBacking 2M is that there is a filesystem mounted there, and you have to unmount it before the system will let you remove the mount point Data. The Shasta cleanupBinaryData command does the unmount.
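As an illustration of the point above, the sketch below checks whether Data is still a mount point before trying to delete it. It assumes a Linux system with the util-linux `mountpoint` utility. The helper names and the DRY_RUN switch are hypothetical, and whether cleanupBinaryData itself removes the Data directory is not stated here, which is why the final removal uses `rm -rf`.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Succeeds if the Data directory inside assembly directory $1 is a mount point.
data_is_mounted() {
    mountpoint -q "$1/Data" 2>/dev/null
}

# Remove the Data directory of assembly directory $1, unmounting it first
# through Shasta's cleanupBinaryData command when a filesystem is mounted there.
remove_data_dir() {
    if data_is_mounted "$1"; then
        run shasta --command cleanupBinaryData --assemblyDirectory "$1" || return 1
    fi
    run rm -rf "$1/Data"
}
```

Running `DRY_RUN=1 remove_data_dir xyz` first shows which commands would be executed without touching anything.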

@paoloczi
Contributor

I am closing this, but feel free to reopen or start a new issue if more questions come up.
