
Airavata 3348 final cut #260

Open · wants to merge 3 commits into base: AIRAVATA-3315-storage-quotas
Conversation

vivekshresta

This PR includes changes to validate the storage limit for a user. Airavata retrieves the storage usage of a user's directory given an experiment and validates it against the UserStorageQuota in the StoragePreference.

Contributor

@machristie left a comment

Thanks for this PR @vivekshresta. I have a couple of comments:

  • As I've mentioned on the mailing list, ideally the storage resource itself enforces the quota. That said, it is possible to have Airavata validate a quota external to the storage resource as you've implemented here. However, one issue I have with adding an API method to check the quota is that this means the quota will only be enforced if the API client chooses to do so. Who calls validateStorageLimit and what do they do with it if it throws an exception? The client could just ignore the exception or just not call validateStorageLimit at all. This seems like it would be better not as a public API method but as an internal one that is used by Airavata to check and enforce the quota, say perhaps as part of checks done before launching the experiment.
  • Also, since the storage quota check involves establishing an SSH connection and running a remote command, I think it would be better suited to being executed as a Helix task and in the background. This way it doesn't tie up API server resources and it will be able to tolerate transient network failures. It could either be incorporated into one of the existing workflows for launching an experiment, or, I think ideally, it could be implemented as a new workflow that gets queued up maybe after an experiment completes to get and store the amount of storage space used by a user (would need a database table to persist this).
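The internal enforcement suggested above could look roughly like the following. This is a minimal sketch under stated assumptions: `withinQuota` and `enforce` are illustrative names, not Airavata's actual API, and a quota of 0 is treated as "no quota configured", mirroring the `getUserStorageQuota() == 0` check elsewhere in this PR.

```java
// Minimal sketch of internal quota enforcement before launching an
// experiment. All names here are illustrative, not Airavata's real API.
public class QuotaCheck {

    // A quota of 0 means "no quota configured", so the check passes.
    public static boolean withinQuota(double usedMB, double quotaMB) {
        if (quotaMB == 0) {
            return true;
        }
        return usedMB <= quotaMB;
    }

    // Hypothetical call site: a launch path refuses to proceed when the
    // user's directory already exceeds the configured quota, so clients
    // cannot simply skip the check.
    public static void enforce(double usedMB, double quotaMB) {
        if (!withinQuota(usedMB, quotaMB)) {
            throw new IllegalStateException(
                    "User storage quota exceeded: " + usedMB + " MB used, "
                    + quotaMB + " MB allowed");
        }
    }
}
```

Because the check runs inside Airavata rather than behind a public API method, a gateway cannot opt out of it by ignoring the exception.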

@vivekshresta
Author

Hi @machristie ,

Thanks for reviewing the code.

  • From our last conversation on the dev mailing list, I assumed we agreed that it makes sense for Airavata to enforce the storage limit, since in the future Airavata could choose between multiple StoragePreferences, or fall back to the storage preference in GatewayResourceProfile or UserStoragePreference (which is about to be deprecated) when the storage preference id given by the gateway is invalid. That said, the gateway could also achieve these behaviors fairly easily.

I did want the validation to happen internally, but the problem I faced was that, during the experiment creation phase in Airavata, the experiment model does not carry any data about the StoragePreference under which the experiment is being created. Changing the createExperiment() method to accept another parameter would mean changes across all the gateways. And from my previous discussions with the team, I learned that in the future, similar to choosing compute preferences during experiment creation, we're going to develop a new feature where the user chooses the StoragePreference in which to create the experiment. With that in mind, I created a new API that any gateway can invoke if it chooses to use this feature.
But I just verified that, by calling '_set_storage_id_and_data_dir(experiment)' before creating an experiment, I can set the storageId and the experiment data directory, removing the need to pass the storageId explicitly.

Basically, the public API can be changed to an internal API now. Will make those changes soon.

  • I did consider this. The problems with this approach are:
    1. We would check the size limit only after the experiment creation is done.
    2. When we know the size limit is exceeded, Helix needs to communicate back to the API server to delete the created experiment entries and, if needed, the experiment directory.

    Considering these, and after discussing with Dimuthu, I thought this might be the better approach once we use 'StorageResourceAdaptor', but it does seem to complicate things.

Even if I remove the new public API and integrate the check into createExperiment(), this approach would still consume the API server's resources (though we're using pooled resources instead of creating a new SSH connection every time). Does it make sense to just stick with the original approach, i.e. the gateway worrying about the storage quotas?
Also, can you please elaborate a little on transient network failures in Helix?

Contributor

@DImuthuUpe left a comment

Didn't evaluate the algorithm yet. Need another pass after basic coding standards are met.


GatewayResourceProfile gatewayResourceProfile = regClient.getGatewayResourceProfile(gatewayId);
String token;
if(isValid(storagePreference.getResourceSpecificCredentialStoreToken()))
Contributor

if (

token = storagePreference.getResourceSpecificCredentialStoreToken();
else
token = gatewayResourceProfile.getCredentialStoreToken();

Contributor

Null check for the token

RegistryService.Client regClient = registryClientPool.getResource();
try {
StoragePreference storagePreference = regClient.getGatewayStoragePreference(gatewayId, storageResourceId);
if(storagePreference.getUserStorageQuota() == 0)
Contributor

if (


String groupResourceProfileId = experiment.getUserConfigurationData().getGroupResourceProfileId();
if (groupResourceProfileId == null) {
logger.error("Experiment not configured with a Group Resource Profile: {}", experiment.getExperimentId());
Contributor

Experiment {} not configured with a Group Resource Profile

public void validateStorageLimit(AuthzToken authzToken, ExperimentModel experiment, String storageResourceId)
throws InvalidRequestException, AiravataClientException, AiravataSystemException, TException {
String gatewayId = authzToken.getClaimsMap().get(Constants.GATEWAY_ID);
RegistryService.Client regClient = registryClientPool.getResource();
Contributor

Return the regClient back to the pool after use
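The try/finally discipline the reviewer is asking for can be sketched generically. `ClientPool` below is a stand-in for `registryClientPool`, not Airavata's actual pool class (the real one also distinguishes returning broken resources); the point is that the client is returned on every path, including exceptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Stand-in for registryClientPool; Airavata's real pool API may differ.
class ClientPool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    ClientPool(T client) { idle.push(client); }
    T getResource() { return idle.pop(); }
    void returnResource(T client) { idle.push(client); }
    int available() { return idle.size(); }
}

public class PoolDiscipline {
    // The finally block guarantees the client goes back to the pool even
    // when the work throws, so exceptions cannot slowly drain the pool.
    public static <T> void withClient(ClientPool<T> pool, Consumer<T> work) {
        T client = pool.getResource();
        try {
            work.accept(client);
        } finally {
            pool.returnResource(client);
        }
    }
}
```

Wrapping every registry call this way keeps leaked clients from starving later API requests.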

throw exception;
}

double userDirectorySize = Double.parseDouble(output.getStdOut().substring(0, output.getStdOut().length() - 2).trim())/1024.0;
Contributor

Print output.getStdOut() as a log before doing this. You don't know what's happening here in case of an error

Contributor

And do you think that the output format is the same for all operating systems (Linux)? If not, your algorithm might be wrong

Author

And do you think that the output format is the same for all operating systems (Linux)? If not, your algorithm might be wrong

Did not know about such a use case. Will test it and update it if necessary.
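One way to sidestep the output-format concern: run `du -sk` (plain kilobyte count, no human-readable suffix), which prints `<kilobytes><TAB><path>` on both GNU and BSD `du`, and parse the leading number instead of chopping a fixed-width suffix off the string. A sketch under that assumption, not the code in this PR:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuOutputParser {
    // `du -sk <dir>` emits "<kilobytes>\t<dir>". Parsing the leading
    // integer is more robust than substring-trimming a unit suffix.
    private static final Pattern DU_LINE = Pattern.compile("^\\s*(\\d+)\\s");

    public static double sizeInMB(String stdOut) {
        Matcher m = DU_LINE.matcher(stdOut);
        if (!m.find()) {
            // Surfacing the raw output makes failures diagnosable,
            // per the logging suggestion above.
            throw new IllegalArgumentException("Unexpected du output: " + stdOut);
        }
        return Double.parseDouble(m.group(1)) / 1024.0;
    }
}
```

An error message on stdout (e.g. a permission failure) then raises a descriptive exception instead of a silent NumberFormatException from a blind substring.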

throw exception;
}
} catch (Exception ex) {
logger.error(ex.getMessage(), ex);
Contributor

More detailed error message

String[] directories = experimentDataDirectory.split("/");
StringBuilder userDirectory = new StringBuilder();

for(int i = 0; i < directories.length - 2; i++)
Contributor

Possible index out of bound error here

StringBuilder userDirectory = new StringBuilder();

for(int i = 0; i < directories.length - 2; i++)
if(!directories[i].isEmpty())
Contributor

if (

}
}

private String getUserDirectory(String experimentDataDirectory) {
Contributor

This is a blind algorithm. At least provide your assumptions about the experimentDataDirectory value as comments
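For reference, `java.nio.file.Path` can express the same "two levels up" intent without manual splitting, and it fails loudly when the path is too shallow. This sketch assumes the `<userDir>/<project>/<experimentDir>` layout implied by the loop above; if that assumption is wrong, the method, not an index bound, is what needs changing.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class UserDirectoryResolver {
    // Assumes experimentDataDirectory looks like
    // <userDir>/<project>/<experimentDir>, so the user directory is the
    // grandparent. Path.getParent() returns null rather than throwing,
    // which lets us reject too-shallow paths explicitly instead of
    // risking an index-out-of-bounds on a manual split.
    public static String getUserDirectory(String experimentDataDirectory) {
        Path parent = Paths.get(experimentDataDirectory).getParent();
        Path userDir = (parent == null) ? null : parent.getParent();
        if (userDir == null) {
            throw new IllegalArgumentException(
                    "Cannot derive user directory from: " + experimentDataDirectory);
        }
        return userDir.toString();
    }
}
```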

@machristie
Contributor

@vivekshresta,

Regarding validating internally, I would add the check to launchExperiment instead of createExperiment.

We will check the size limit only after the experiment creation is done.

That's true, but the current approach only checks before the experiment runs and so doesn't account for experiment output files.

When we know the size limit is exceeded, helix needs to communicate back to APIServer for deleting the created experiment entries and if needed, deleting the experiment directory.
Considering these and after discussing with Dimuthu, I thought this might be the better approach when we use 'StorageResourceAdaptor', but this does seem to complicate things.

Well, my two cents, but I don't think the experiment needs to be deleted. We just need to set a flag in the database that the user is over quota on that storage resource and then prevent further file uploads/experiments on that storage resource.

Also can you please elaborate a little on transient network failure in helix.

Sure, Helix's task framework has built-in fault-tolerance support, for example retrying in the case of failure: https://helix.apache.org/0.8.0-docs/tutorial_task_framework.html. By transient network failure I mean some sort of short-lived network failure between Airavata and the SSH host that prevents the SSH connection from establishing or completing successfully.
