
Airavata 3348 final cut #260

Open · wants to merge 3 commits into base: AIRAVATA-3315-storage-quotas
Conversation

vivekshresta

This PR includes changes to validate the storage limit for a user. Airavata retrieves the storage usage of a user's directory given an experiment and validates it against the UserStorageQuota in the StoragePreference.

Contributor

@machristie left a comment

Thanks for this PR @vivekshresta. I have a couple of comments:

  • As I've mentioned on the mailing list, ideally the storage resource itself enforces the quota. That said, it is possible to have Airavata validate a quota external to the storage resource as you've implemented here. However, one issue I have with adding an API method to check the quota is that this means the quota will only be enforced if the API client chooses to do so. Who calls validateStorageLimit and what do they do with it if it throws an exception? The client could just ignore the exception or just not call validateStorageLimit at all. This seems like it would be better not as a public API method but as an internal one that is used by Airavata to check and enforce the quota, say perhaps as part of checks done before launching the experiment.
  • Also, since the storage quota check involves establishing an SSH connection and running a remote command, I think it would be better suited to being executed as a Helix task and in the background. This way it doesn't tie up API server resources and it will be able to tolerate transient network failures. It could either be incorporated into one of the existing workflows for launching an experiment, or, I think ideally, it could be implemented as a new workflow that gets queued up maybe after an experiment completes to get and store the amount of storage space used by a user (would need a database table to persist this).
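The internal enforcement suggested above could look roughly like the following. This is a minimal sketch under stated assumptions: `withinQuota` and `enforce` are illustrative names, not Airavata's actual API, and a quota of 0 is treated as "no quota configured", mirroring the `getUserStorageQuota() == 0` check elsewhere in this PR.

```java
// Minimal sketch of internal quota enforcement before launching an
// experiment. All names here are illustrative, not Airavata's real API.
public class QuotaCheck {

    // A quota of 0 means "no quota configured", so the check passes.
    public static boolean withinQuota(double usedMB, double quotaMB) {
        if (quotaMB == 0) {
            return true;
        }
        return usedMB <= quotaMB;
    }

    // Hypothetical call site: a launch path refuses to proceed when the
    // user's directory already exceeds the configured quota, so clients
    // cannot simply skip the check.
    public static void enforce(double usedMB, double quotaMB) {
        if (!withinQuota(usedMB, quotaMB)) {
            throw new IllegalStateException(
                    "User storage quota exceeded: " + usedMB + " MB used, "
                    + quotaMB + " MB allowed");
        }
    }
}
```

Because the check runs inside Airavata rather than behind a public API method, a gateway cannot opt out of it by ignoring the exception.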

@vivekshresta
Author

Hi @machristie ,

Thanks for reviewing the code.

  • From our last conversation on the dev mailing list, I assumed we agreed that it makes sense for Airavata to enforce the storage limit, since in the future Airavata could choose between multiple StoragePreferences, or fall back to the storage preference in GatewayResourceProfile or UserStoragePreference (which is about to be deprecated) when the storage preference id given by the gateway is invalid. That said, the gateway could also achieve these behaviors fairly easily.

I did want the validation to happen internally, but the problem I faced was that, during the experiment creation phase in Airavata, the experiment model does not carry any data about the StoragePreference under which the experiment is being created. Changing the createExperiment() method to accept another parameter would mean changes across all the gateways. And from my previous discussions with the team, I learned that in the future, similar to choosing compute preferences during experiment creation, we're going to develop a new feature where the user chooses the StoragePreference in which to create the experiment. With that in mind, I created a new API that any gateway can invoke if it chooses to use this feature.
But I just verified that, by calling '_set_storage_id_and_data_dir(experiment)' before creating an experiment, I can set the storageId and the experiment data directory, removing the need to pass the storageId explicitly.

Basically, the public API can be changed to an internal API now. Will make those changes soon.

  • I did consider this. The problems with this approach are:
    1. We would check the size limit only after the experiment creation is done.
    2. When we know the size limit is exceeded, Helix needs to communicate back to the API server to delete the created experiment entries and, if needed, the experiment directory.

    Considering these, and after discussing with Dimuthu, I thought this might be the better approach once we use 'StorageResourceAdaptor', but it does seem to complicate things.

Even if I remove the new public API and integrate the check into createExperiment(), this approach would still consume the API server's resources (though we're using pooled resources instead of creating a new SSH connection every time). Does it make sense to just stick with the original approach, i.e. the gateway worrying about the storage quotas?
Also, can you please elaborate a little on transient network failures in Helix?

Contributor

@DImuthuUpe left a comment

Didn't evaluate the algorithm yet. Need another pass after basic coding standards are met.


GatewayResourceProfile gatewayResourceProfile = regClient.getGatewayResourceProfile(gatewayId);
String token;
if(isValid(storagePreference.getResourceSpecificCredentialStoreToken()))
Contributor

if (

token = storagePreference.getResourceSpecificCredentialStoreToken();
else
token = gatewayResourceProfile.getCredentialStoreToken();

Contributor

Null check for the token

RegistryService.Client regClient = registryClientPool.getResource();
try {
StoragePreference storagePreference = regClient.getGatewayStoragePreference(gatewayId, storageResourceId);
if(storagePreference.getUserStorageQuota() == 0)
Contributor

if (


String groupResourceProfileId = experiment.getUserConfigurationData().getGroupResourceProfileId();
if (groupResourceProfileId == null) {
logger.error("Experiment not configured with a Group Resource Profile: {}", experiment.getExperimentId());
Contributor

Experiment {} not configured with a Group Resource Profile

public void validateStorageLimit(AuthzToken authzToken, ExperimentModel experiment, String storageResourceId)
throws InvalidRequestException, AiravataClientException, AiravataSystemException, TException {
String gatewayId = authzToken.getClaimsMap().get(Constants.GATEWAY_ID);
RegistryService.Client regClient = registryClientPool.getResource();
Contributor

Return the regClient back to the pool after use
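The try/finally discipline the reviewer is asking for can be sketched generically. `ClientPool` below is a stand-in for `registryClientPool`, not Airavata's actual pool class (the real one also distinguishes returning broken resources); the point is that the client is returned on every path, including exceptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Stand-in for registryClientPool; Airavata's real pool API may differ.
class ClientPool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    ClientPool(T client) { idle.push(client); }
    T getResource() { return idle.pop(); }
    void returnResource(T client) { idle.push(client); }
    int available() { return idle.size(); }
}

public class PoolDiscipline {
    // The finally block guarantees the client goes back to the pool even
    // when the work throws, so exceptions cannot slowly drain the pool.
    public static <T> void withClient(ClientPool<T> pool, Consumer<T> work) {
        T client = pool.getResource();
        try {
            work.accept(client);
        } finally {
            pool.returnResource(client);
        }
    }
}
```

Wrapping every registry call this way keeps leaked clients from starving later API requests.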

throw exception;
}

double userDirectorySize = Double.parseDouble(output.getStdOut().substring(0, output.getStdOut().length() - 2).trim())/1024.0;
Contributor

Print output.getStdOut() as a log before doing this. You don't know what's happening here in case of an error

Contributor

And do you think that the output format is the same for all operating systems (Linux)? If not, your algorithm might be wrong

Author

And do you think that the output format is the same for all operating systems (Linux)? If not, your algorithm might be wrong

Did not know about such a use case. Will test it and update it if necessary.
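One way to sidestep the output-format concern: run `du -sk` (plain kilobyte count, no human-readable suffix), which prints `<kilobytes><TAB><path>` on both GNU and BSD `du`, and parse the leading number instead of chopping a fixed-width suffix off the string. A sketch under that assumption, not the code in this PR:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuOutputParser {
    // `du -sk <dir>` emits "<kilobytes>\t<dir>". Parsing the leading
    // integer is more robust than substring-trimming a unit suffix.
    private static final Pattern DU_LINE = Pattern.compile("^\\s*(\\d+)\\s");

    public static double sizeInMB(String stdOut) {
        Matcher m = DU_LINE.matcher(stdOut);
        if (!m.find()) {
            // Surfacing the raw output makes failures diagnosable,
            // per the logging suggestion above.
            throw new IllegalArgumentException("Unexpected du output: " + stdOut);
        }
        return Double.parseDouble(m.group(1)) / 1024.0;
    }
}
```

An error message on stdout (e.g. a permission failure) then raises a descriptive exception instead of a silent NumberFormatException from a blind substring.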

throw exception;
}
} catch (Exception ex) {
logger.error(ex.getMessage(), ex);
Contributor

More detailed error message

String[] directories = experimentDataDirectory.split("/");
StringBuilder userDirectory = new StringBuilder();

for(int i = 0; i < directories.length - 2; i++)
Contributor

Possible index out of bound error here

StringBuilder userDirectory = new StringBuilder();

for(int i = 0; i < directories.length - 2; i++)
if(!directories[i].isEmpty())
Contributor

if (

}
}

private String getUserDirectory(String experimentDataDirectory) {
Contributor

This is a blind algorithm. At least provide your assumptions about the experimentDataDirectory value as comments
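For reference, `java.nio.file.Path` can express the same "two levels up" intent without manual splitting, and it fails loudly when the path is too shallow. This sketch assumes the `<userDir>/<project>/<experimentDir>` layout implied by the loop above; if that assumption is wrong, the method, not an index bound, is what needs changing.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class UserDirectoryResolver {
    // Assumes experimentDataDirectory looks like
    // <userDir>/<project>/<experimentDir>, so the user directory is the
    // grandparent. Path.getParent() returns null rather than throwing,
    // which lets us reject too-shallow paths explicitly instead of
    // risking an index-out-of-bounds on a manual split.
    public static String getUserDirectory(String experimentDataDirectory) {
        Path parent = Paths.get(experimentDataDirectory).getParent();
        Path userDir = (parent == null) ? null : parent.getParent();
        if (userDir == null) {
            throw new IllegalArgumentException(
                    "Cannot derive user directory from: " + experimentDataDirectory);
        }
        return userDir.toString();
    }
}
```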

@machristie
Contributor

@vivekshresta,

Regarding validating internally, I would add the check to launchExperiment instead of createExperiment.

We will check the size limit only after the experiment creation is done.

That's true, but the current approach only checks before the experiment runs and so doesn't account for experiment output files.

When we know the size limit is exceeded, helix needs to communicate back to APIServer for deleting the created experiment entries and if needed, deleting the experiment directory.
Considering these and after discussing with Dimuthu, I thought this might be the better approach when we use 'StorageResourceAdaptor', but this does seem to complicate things.

Well, my two cents, but I don't think the experiment needs to be deleted. We just need to set a flag in the database that the user is over quota on that storage resource and then prevent further file uploads/experiments on that storage resource.

Also can you please elaborate a little on transient network failure in helix.

Sure, Helix's task framework has built-in fault-tolerance support, for example retrying in the case of failure: https://helix.apache.org/0.8.0-docs/tutorial_task_framework.html. By transient network failure I mean some sort of short-lived network failure between Airavata and the SSH host that prevents the SSH connection from establishing or completing successfully.
