
GenePattern AWS File System Storage

This documentation is in response to GP-7039, "Disk usage/capacity/storage for AWS".

Overview of Storage on AWS

On AWS there are effectively three different storage locations under the current (04/2018) implementation.

  1. GenePattern Server - File system
  2. Long term storage and storage for transfer - AWS S3
  3. Compute Node Storage - File system

The first category is the file system storage for the GenePattern head node (i.e. the server running the GenePattern web application and dispatching jobs to an execution engine such as AWS Batch). This storage is provided by an AWS EBS volume that is mounted to the server at startup. The GenePattern server requires all job inputs and results to be on this disk, since it was never written to expect storage on anything other than a locally accessible file system (truly local or NFS mounted). At some point the GenePattern server should be rewritten to allow storage on other systems (such as S3 buckets), since EBS storage is roughly 5 times more expensive than S3 storage.
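
For orientation, the attached EBS volume and its mount point can be confirmed from the head node with standard Linux tools. This is only a sketch; the mount path below is a placeholder, not the actual GP Prod layout.

    # List the attached block devices (the EBS data disk will appear here, e.g. xvdf)
    lsblk

    # Check current capacity and usage of the GenePattern data volume (path is an example)
    df -h /opt/genepattern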

The second category is long term storage and storage for transfer, provided by an AWS S3 bucket (object store). It is meant to eventually be the sole location for all GenePattern files, but that won't happen until the GP web app is rewritten to not expect a local drive. In the meantime it is still used to transfer data between the GenePattern head node and the compute nodes that execute the actual jobs. This is done by syncing directories from the GP server into S3; when the compute node runs the job it syncs from S3 to its local file system, runs the job, and syncs back any output files. The GP server then syncs the job outputs back to its local file system.
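
A rough sketch of that round trip using the AWS CLI is shown below. The bucket name, job number, and directory paths are placeholders, not the actual GenePattern configuration.

    # On the GP head node: push the job's directory to S3 (bucket/paths are examples)
    aws s3 sync /opt/genepattern/jobResults/1234 s3://example-gp-transfer-bucket/jobResults/1234

    # On the compute node: pull inputs, run the module, push outputs back
    aws s3 sync s3://example-gp-transfer-bucket/jobResults/1234 /local/job/1234
    # ... run the module ...
    aws s3 sync /local/job/1234 s3://example-gp-transfer-bucket/jobResults/1234

    # Back on the head node: pull the outputs down to the local file system
    aws s3 sync s3://example-gp-transfer-bucket/jobResults/1234 /opt/genepattern/jobResults/1234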

The final category is the file system storage on the compute nodes. This is again provided as EBS disks, but unlike the GP server, the nodes and thus their disks are ephemeral and only need to hold the data for the job they are executing at any given moment. After a job completes the nodes typically shut down and release their EBS disks.

Expanding AWS Storage

Zone 1, the GenePattern Head node

This is the most common and most likely place to need rescaling, since the head node needs a copy of all files (even though they are probably also in the cheaper S3 storage). Resizing can be done without stopping the server or interrupting it in any way. For detailed instructions see the AWS documentation for resizing an EBS volume. The basic steps are as follows (a command-line sketch follows the list):

  1. Modify the volume size from the console. AWS doc
  2. Extend the Linux file system to use the new size. AWS doc. This requires logging into the server.
  3. Check that it worked ;)
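
A minimal command-line sketch of those three steps, assuming a volume attached as /dev/xvda with an ext4 file system on partition 1; the volume ID, device names, and target size are placeholders.

    # 1. Grow the EBS volume from the CLI instead of the console (ID and size are placeholders)
    aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 500

    # 2. On the server: extend the partition and file system to use the new space
    sudo growpart /dev/xvda 1       # grow partition 1 of /dev/xvda
    sudo resize2fs /dev/xvda1       # ext4; for XFS use xfs_growfs <mountpoint> instead

    # 3. Check that it worked
    df -h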

Zone 2, Transfer via S3

It is never necessary to expand S3 storage. Given the S3 infrastructure there is no limit to the number of objects (files) in a bucket, though the maximum size of a single object is 5 TB.

Zone 3, Compute node file systems

Compute nodes are auto-launched by AWS Batch using an AMI that we designate. To increase the amount of disk space available to the compute nodes it is necessary to create a new image using a larger EBS volume. For full details see the AWS documentation for launching a custom image.

The quick steps are:

  1. Launch an instance of the current AMI (ami-7155d20b).
    • Specify a larger size on the storage page for /dev/xvdcz (82 GB as of this writing).
  2. Create a new image from the instance via the console.
  3. Update the Auto Scaling group's launch configuration to use the new AMI ID.

After this, all new compute nodes will use the new image, which has the larger disk.
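
The same steps can also be scripted with the AWS CLI rather than the console. This is only a sketch: the instance type, new volume size, new AMI ID, and Auto Scaling group/launch configuration names are placeholders and need to match the real Batch compute environment.

    # 1. Launch an instance of the current AMI with a larger /dev/xvdcz volume (size is an example)
    aws ec2 run-instances --image-id ami-7155d20b --instance-type m4.large \
        --block-device-mappings '[{"DeviceName":"/dev/xvdcz","Ebs":{"VolumeSize":300,"VolumeType":"gp2"}}]'

    # 2. Create a new AMI from that instance (instance ID and image name are placeholders)
    aws ec2 create-image --instance-id i-0123456789abcdef0 --name "gp-compute-node-300gb"

    # 3. Point the Auto Scaling group at a new launch configuration that uses the new AMI
    aws autoscaling create-launch-configuration --launch-configuration-name gp-compute-300gb \
        --image-id ami-0fedcba9876543210 --instance-type m4.large
    aws autoscaling update-auto-scaling-group --auto-scaling-group-name gp-batch-compute \
        --launch-configuration-name gp-compute-300gb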

Determining the required storage capacity

Since S3 does not have any practical limits, this applies only to the EBS drives of the head node and compute nodes. As of this writing (4/27/2018) here is what GP Prod looks like:

    bash:vgpprod01:/xchip/gpprod/servers/genepattern 118 $ du -sh *
    2.5K    attachments
    349K    bin
    20G     gp
    1.2G    installer
    640G    jobResults
    326G    taskLib
    46G     temp
    2.5K    tmp
    34K     usage_report.txt
    3.4T    users

So it's a total of ~4.1 TB, which would cost about $430/month on EBS ($86/month on S3). Simply put, we cannot afford $430/month (and growing) as a regular charge. We can see, however, that over 80% of it is in the "users" directories, which probably corresponds to uploaded files. There is probably also some space to be saved in taskLib (drop old versions of modules on the AWS version) and in jobResults (640 GB = $67/month). Over 30 people have 20+ GB in their user directories, a few with over 100 GB, so we could crack down on people using this for long term storage.
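
To see which user directories are driving that number, something like the following works; the path is taken from the du listing above and would differ on another deployment.

    # Per-user disk usage, sorted smallest to largest, to spot the 20 GB+ accounts
    du -sh /xchip/gpprod/servers/genepattern/users/* | sort -h | tail -n 40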

At the moment (4/2018), the gp-beta-ami has only 200 GB of storage.