
Allow storing Avatars/Attachments in S3 #166

Open · wants to merge 12 commits into base: master

Conversation

MatthewCochrane

Adds a parameter to DataCenterFormula to store avatar and/or attachment data in an S3 bucket. For use with the Object Storage for Jira projects for avatars and attachments.

It uses a new CloudFormation stack for the bucket, which gets set up with appropriate permissions (based on the account, which is why I created this PR into aws-resources).

After the bucket is set up (in DataCenterFormula), the attachment/avatar data is then copied from the shared home into the bucket (in SharedHomeFormula). I played around with this for a while to get something that copies the large datasets quickly.

It also configures Jira to look for avatars/attachments in the newly created S3 bucket (in DataCenterNodeFormula) via the new filestore-config.xml file.
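
For orientation, here is a minimal sketch of how the new option might be wired up, based only on the changelog entries further down in this PR; the JiraSharedStorageConfig builder methods and the DataCenterFormula.Builder arguments are assumptions, not the actual API.

```kotlin
// Hypothetical usage sketch based on the changelog entries added by this PR.
// Only DataCenterFormula.Builder.jiraSharedStorageConfig and JiraSharedStorageConfig
// are confirmed names; the builder arguments and option methods below are assumed.
val jiraFormula = DataCenterFormula.Builder(productDistribution, jiraHomeSource, database)
    .jiraSharedStorageConfig(
        JiraSharedStorageConfig.Builder()
            .storeAttachmentsInS3(true) // assumed option name
            .storeAvatarsInS3(true)     // assumed option name
            .build()
    )
    .build()
```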

MatthewCochrane requested a review from a team as a code owner — May 24, 2023 23:22
@@ -12,7 +12,14 @@ import java.util.concurrent.ConcurrentHashMap
*
* @since 2.15.0
*/
class AwsCli {
class AwsCli (val cliVersion: String = "2.9.12") {

MatthewCochrane (Author) commented May 24, 2023:

I've changed the default version to 2.9.12. The bulk copying in SharedHomeFormula.kt wasn't reliable with the V1 CLI because it only supports legacy retry mode, not standard retry mode. The new code can install either a V1 or V2 CLI and I've tested both, though it now uses 2.9.12 by default everywhere, which I'm not sure you'll be OK with. Also note that I've added tests around this.
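
As a rough sketch of what the version-pinned installation could look like on a node (not part of the diff above); the `ensureAwsCli` call is assumed from the existing `AwsCli` API and may differ in this PR:

```kotlin
import com.atlassian.performance.tools.ssh.api.SshConnection
// import of AwsCli omitted: it lives in this module (package path not shown in the diff)

// Sketch only: pin the CLI version installed on the remote node before the bulk S3 copy.
// `ensureAwsCli` is assumed from the existing AwsCli API.
fun installAwsCliV2(ssh: SshConnection) {
    // The V2 CLI honours AWS_RETRY_MODE=standard, which the SharedHomeFormula copy relies on.
    AwsCli(cliVersion = "2.9.12").ensureAwsCli(ssh)
    // A parameterless AwsCli() now resolves to the same 2.9.12 default.
}
```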

Contributor:

Let's drop the old AWS CLI support, it's unnecessary complexity. Preserve one constructor without parameters: AwsCli()

.withParameterValue(s3StorageBucketName),
Parameter()
.withParameterKey("AWSAccount")
.withParameterValue(aws.callerIdentity.account)

MatthewCochrane (Author):

This is where we pass in the caller identity from the change in aws-resources.

Properties:
BucketName:
Ref: S3StorageBucketName
LifecycleConfiguration:

MatthewCochrane (Author):

The bucket contents get automatically deleted by AWS in 1 day using a lifecycle configuration. This reduces the complexity and increases the reliability of deleting the files in the bucket. The alternative of deleting files from the JPT code is both slow and error-prone (i.e. if JPT stops, the files would never get deleted).
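
For readers unfamiliar with S3 lifecycle rules, the sketch below expresses an equivalent 1-day expiration with the AWS SDK for Java v1 from Kotlin. It is purely illustrative; the PR itself declares the rule in the CloudFormation template, and the bucket name and rule id here are placeholders.

```kotlin
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter
import com.amazonaws.services.s3.model.lifecycle.LifecyclePrefixPredicate

// Illustrative only: the same "expire everything after 1 day" rule via the SDK.
fun expireBucketContentsAfterOneDay(bucketName: String) {
    val rule = BucketLifecycleConfiguration.Rule()
        .withId("expire-jpt-shared-storage")                        // placeholder rule id
        .withFilter(LifecycleFilter(LifecyclePrefixPredicate("")))  // apply to every object
        .withExpirationInDays(1)                                    // AWS deletes objects ~1 day after creation
        .withStatus(BucketLifecycleConfiguration.ENABLED)
    AmazonS3ClientBuilder.defaultClient()
        .setBucketLifecycleConfiguration(bucketName, BucketLifecycleConfiguration(listOf(rule)))
}
```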

MatthewCochrane (Author):

Another side effect of this is that we need permission to set the lifecycle configuration for buckets. This will require adding a couple of permissions to the role JPT runs as.

MatthewCochrane (Author) commented May 24, 2023:

I've hacked this in to run it, similar to the libs dir in JPT, but I can remove and update the library version when the aws-infrastructure PR is merged.

// Copy the attachment data from NFS onto S3
// Use xargs to split the work across 30 concurrent jobs. This is much faster than a single job.
it.safeExecute(
cmd = "export AWS_RETRY_MODE=standard; export AWS_MAX_ATTEMPTS=10; cd $localSharedHome/data && ( find attachments -mindepth 1 -maxdepth 1 -type d -print0 | xargs -n1 -0 -P30 -I {} aws s3 cp --recursive --only-show-errors {}/ s3://$s3StorageBucketName/{}/ )",

Contributor:

Not necessarily actionable in this PR - just an FYI / future-work note for other reviewers.

This is hard to unit test. I created a suggestion for us to consider: https://ecosystem.atlassian.net/browse/JPERF-1122

MatthewCochrane (Author):

Yeah, that line is a bit dense. We could likely achieve something similar if we could write JVM code to run on the remote, as in your suggestion. That would make it much more testable too.

Contributor:

This should be improved. Preference: effort/quality, starting with the lowest.

/**
* Describes a set of resources used by Jira and stored in shared storage (eg. shared home or S3).
*/
enum class JiraSharedStorageResource {

Contributor:

Can this enum change in the future? If so, then an enum is not a great option for it, as changes to it would break the API.

Usually polymorphism is a good way to avoid enum usage. If you want, I could help with figuring out how a polymorphic model could look here.

We could proceed with the enum if it's very unlikely for the value set to change.

MatthewCochrane (Author):

It's definitely possible for this to change. Is there an example of the polymorphic approach you're suggesting? I could have a go at implementing it.

pczuj (Contributor) commented May 26, 2023:

What I was thinking about was a model where a polymorphic replacement for JiraSharedStorageResource could be an extension to JiraHomeSource. Both operations based on it that you implemented so far (in DataCenterNodeFormula and in SharedHomeFormula) are done on the Jira home, so I believe it should logically fit there.

Each Jira resource type (avatars, attachments) could be implemented as a wrapper which would delegate to JiraHomeSource (from the Dataset which is passed to DataCenterFormula). It would first let the delegate configure the home on the remote, then create the S3 bucket and upload the selected artifacts from the Jira home there. Then it could append to (or create) filestore-config.xml.

When I think about it now, this may mean a fair amount of implementation change. I'd personally still go for it, though maybe there is a simpler way to avoid enums that the team can think of. Alternatively, maybe the enum is good enough for now - I still need to consult with the team (we have a standup in 1h).
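
A rough sketch of the wrapper idea described above, assuming JiraHomeSource exposes a `download(ssh)` method that returns the remote Jira home path; the package paths, signatures, and class name below are assumptions, not taken from this PR:

```kotlin
import com.atlassian.performance.tools.infrastructure.api.jira.JiraHomeSource
import com.atlassian.performance.tools.ssh.api.SshConnection

// Sketch: a per-resource wrapper that delegates Jira home preparation and then
// moves the selected resources (here: attachments) from the home into S3.
class S3AttachmentsHomeSource(
    private val delegate: JiraHomeSource,
    private val s3BucketName: String
) : JiraHomeSource {
    override fun download(ssh: SshConnection): String {
        val jiraHome = delegate.download(ssh)
        // Hypothetical follow-up steps:
        // 1. upload "$jiraHome/data/attachments" to "s3://$s3BucketName"
        // 2. append (or create) filestore-config.xml pointing attachments at the bucket
        return jiraHome
    }
}
```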

Contributor:

I was trying to show the idea with this change set, however I wasn't able to finish it: #168

pczuj (Contributor) commented May 29, 2023:

@MatthewCochrane your alternative solutions for getting rid of the enum usage that you posted in this comment should be fine. Please proceed with any solution that will work for you and don't be blocked by my suggestions. The main concern was about the enum in the API; usage of polymorphism is obviously not mandatory.

MatthewCochrane (Author):

Thanks, have pushed a change.

}

private fun installV2Cli(ssh: SshConnection) {
// Instructions from https://docs.aws.amazon.com/cli/latest/userguide/getting-started-version.html

Contributor:

Nit: Wouldn't it be better as a method KDoc? It would be clearer that the comment is about the whole method and not only line 61.

pczuj (Contributor) commented May 25, 2023:

I think those changes look good. Before I approve I'd like to consult with the team on whether the enum usage is acceptable, as it's currently my only concern. Hopefully we will be able to sync on that tomorrow.

pczuj (Contributor) commented May 26, 2023:

This won't be compatible with our method of saving datasets, as that method assumes that the Jira home from the local file system is enough for it.

// Copy the attachment data from NFS onto S3
// Use xargs to split the work across 30 concurrent jobs. This is much faster than a single job.
it.safeExecute(
cmd = "export AWS_RETRY_MODE=standard; export AWS_MAX_ATTEMPTS=10; cd $localSharedHome/data && ( find attachments -mindepth 1 -maxdepth 1 -type d -print0 | xargs -n1 -0 -P30 -I {} aws s3 cp --recursive --only-show-errors {}/ s3://$s3StorageBucketName/{}/ )",

Contributor:

Are we planning to share this upload method with customers? In JPT we like to have references to documentation in the code, as we want the automation to simulate the admin.

MatthewCochrane (Author):

At the moment we will be recommending that customers use AWS DataSync instead of this approach. If we do end up suggesting this approach to customers, then I'll be sure to update the code with a link to documentation.

Contributor:

Would it be possible to use AWS DataSync in our code? Our approach was always to imitate admins, which has many benefits.

MatthewCochrane (Author):

I'm open to your thoughts here, but I think using DataSync in JPT may be overkill. DataSync is recommended to customers because it's very robust and reliable for large instances (e.g. 1TB+ of attachments).
Setting up DataSync requires quite a lot of work, and in general a customer with only a few GB (or tens of GB) of data, like the JPT datasets, would use an approach like aws cp. We actually expect some customers with smaller instances to use this approach (i.e. aws cp) instead of DataSync. Perhaps this is more an issue with our documentation, and we should look at recommending this approach there as well.

Contributor:

We want our datasets in JPT to be of the scale of our customers.

If we want to recommend usage of aws cp to our customers then I think it should be fine for JPT as well. Let's just be consistent, because:

  • If we have problems with the recommendation here, then our customers may also have them.
  • If we improve our recommendation, we will also want to apply it in JPT to get the benefit.
  • If we use a different approach than the recommendation, we will face problems that our customers don't have.

Is there something problematic in DataSync that makes it hard to use in JPT specifically?

MatthewCochrane (Author):

I hear what you're saying about wanting to imitate Jira admins; it's a good form of added testing and of understanding what our customers experience. At the moment, our documentation recommends DataSync, however it doesn't provide a definitive solution. It leaves the decision of how to copy data to the admin and doesn't provide any specific commands. Here's the relevant documentation:
[screenshot of the documentation]

It's definitely possible to implement DataSync, though it will require more setup and configuration. In addition, it may be prohibitively slow for the JPT use case (i.e. take closer to 40 minutes to copy the default dataset). To set it up, we'll need to create a DataSync agent (on a new EC2 instance), configure source and destination locations, and set up task execution.

MatthewCochrane (Author) commented May 30, 2023:

Thanks for this comment, that's a good point that it won't be compatible with the dataset-saving code. That actually looks quite complex to update. For now, would it make sense to not allow dataset saving if attachments or avatars are stored in S3?
Our changes mean that avatars/attachments no longer necessarily reside in the shared home, which has some interesting implications for datasets.

pczuj (Contributor) commented May 30, 2023:

would it make sense to not allow dataset saving if attachments or avatars are stored in S3?

If we can't simply make it work I think it would be good to at least somehow detect it and throw an exception. My biggest concern is that someone wouldn't be aware of saving an incomplete dataset.
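
A minimal sketch of such a guard, assuming the dataset-saving code can see the JiraSharedStorageConfig used by DataCenterFormula; the function name and wiring are illustrative, only `isAnyResourceStoredInS3()` appears in this PR:

```kotlin
// Illustrative guard: refuse to save a dataset that would silently miss S3-backed resources.
fun ensureDatasetCanBeSaved(config: JiraSharedStorageConfig) {
    if (config.isAnyResourceStoredInS3()) {
        throw IllegalStateException(
            "Cannot save the dataset: avatars/attachments are stored in S3 rather than in the " +
                "shared home, so copying the Jira home alone would produce an incomplete dataset."
        )
    }
}
```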

/**
* Describes a set of resources used by Jira and stored in shared storage (eg. shared home or S3).
*/
enum class JiraSharedStorageResource {

Contributor:

With the recent change this is no longer used. Should we remove it?

MatthewCochrane (Author):

Absolutely, thanks.

pczuj (Contributor) left a comment:

  • Remove JiraSharedStorageResource
  • Consider using AWS DataSync as we recommend to our admins
  • Don't allow saving a dataset with missing attachments / avatars
  • Consider joining new configuration into one object
  • Update changelog
  • Wait for release of aws-resources

pczuj (Contributor) commented Jun 2, 2023:

Also, please add a CHANGELOG.md entry.

MatthewCochrane (Author):

@pczuj Is there a way to use the release-1.12.0-alpha release, or should I wait for a non-alpha release of aws-resources so I can remove the hacked lib?

pczuj (Contributor) commented Jun 2, 2023:

@MatthewCochrane we should release 1.12.0

pczuj (Contributor) commented Jun 2, 2023:

The release attempt already happened, but it failed: https://github.com/atlassian-labs/aws-resources/actions/runs/5088283216/jobs/9146117217

Needs investigation

MatthewCochrane (Author):

@pczuj As we discussed in your PR, I tried the copy operation again with max_concurrent_requests set to 300. I did some tests:

| Method | Throughput | CPU | Time to copy attachments (660k files, 6GB) |
| --- | --- | --- | --- |
| aws cp with max_concurrent_requests = 300 | ~20Mb/s | 1 core maxed, ~2% in other cores | ~40 mins |
| aws cp with max_concurrent_requests = 3000 | ~20Mb/s | 1 core maxed, ~2% in other cores | ~40 mins |
| aws cp with xargs | ~200Mb/s | High usage among all cores | ~3.5 minutes |

I can only speculate as to the cause, but it looks like even though aws cp offloads work to multiple threads, there's a main thread that becomes a bottleneck. So when we start many copies of aws cp, whatever work was bottlenecked gets spread out over multiple processes and we get higher throughput. Here are some resources from AWS that led me to this approach.

pczuj (Contributor) commented Jun 6, 2023:

aws-resources-1.12.0 is available: https://packages.atlassian.com/maven-central/com/atlassian/performance/tools/aws-resources/1.12.0/

pczuj (Contributor) commented Jun 6, 2023:

WRT the max_concurrent_requests tests:

Sad to see the results. Usage of the flag would be much simpler. Anyway, if the process-fork approach is faster, then we may want to use it. If this is an optimisation that is not part of the recommendation, I'd suggest documenting it somehow in the code, so that we can reconsider it in the future without any doubt about where the xargs usage came from.

MatthewCochrane (Author):

Hey @pczuj, I found some time to come back to this PR and update it.

I've rebased it on master and made a couple of small changes/fixes since the last time you looked at it.

  • I changed which role the bucket policy is granted to. It was updated to the role of the EC2 instance profile the tests run as. This ensures the instance under test can access (and delete objects in) the bucket with avatars/attachments.
  • I increased the lifespan of the stack with the avatar/attachment S3 bucket to four days. This lets the bucket lifecycle policy delete the objects before the housekeeping plan tries to clean it up. AWS can take multiple days to perform the actual deletion, thus the extra buffer here.

In addition to these changes, the JPT-dev role will also require the following permissions to run:

  • s3:GetBucketPolicy
  • s3:PutBucketPolicy
  • s3:DeleteBucketPolicy
  • s3:PutLifecycleConfiguration

Is there a process for adding these?

pczuj (Contributor) commented Oct 23, 2023:

Is there a process for adding these?

Yes, we store the permissions in https://github.com/atlassian-labs/aws-resources/blob/master/src/main/resources/iam-policy.json

MatthewCochrane (Author) commented Oct 26, 2023:

Is there a process for adding these?

Yes, we store the permissions in https://github.com/atlassian-labs/aws-resources/blob/master/src/main/resources/iam-policy.json

Thanks, I created a PR for this here.

pczuj (Contributor) left a comment:

Those changes look good to me. The new API should be fine to maintain and all the internal logic is refactorable if needed.

It would be good to see a review from @mgrzaslewicz and/or @dagguh, as they will maintain JPT further while the rest of the tech team will very likely move to other streams.

mgrzaslewicz (Contributor) left a comment:

There are a few improvements to be made. Happy to approve as soon as those are addressed.

@@ -12,7 +12,14 @@ import java.util.concurrent.ConcurrentHashMap
*
* @since 2.15.0
*/
class AwsCli {
class AwsCli (val cliVersion: String = "2.9.12") {

Contributor:

Let's drop the old AWS CLI support, it's unnecessary complexity. Preserve one constructor without parameters: AwsCli()

// Copy the attachment data from NFS onto S3
// Use xargs to split the work across 30 concurrent jobs. This is much faster than a single job.
it.safeExecute(
cmd = "export AWS_RETRY_MODE=standard; export AWS_MAX_ATTEMPTS=10; cd $localSharedHome/data && ( find attachments -mindepth 1 -maxdepth 1 -type d -print0 | xargs -n1 -0 -P30 -I {} aws s3 cp --recursive --only-show-errors {}/ s3://$s3StorageBucketName/{}/ )",

Contributor:

This should be improved. Preference: effort/quality, starting with the lowest.

@@ -25,6 +29,37 @@ internal class DataCenterNodeFormula(
it.execute("echo ehcache.object.port = 40011 >> $jiraHome/cluster.properties")
it.execute("echo jira.node.id = node$nodeIndex >> $jiraHome/cluster.properties")
it.execute("echo jira.shared.home = `realpath $localSharedHome` >> $jiraHome/cluster.properties")

if (s3StorageBucketName != null && jiraSharedStorageConfig.isAnyResourceStoredInS3()) {

Contributor:

Please extract it to a new method or, even better, inject a new tested dependency into the DataCenterFormula (replacing the jiraSharedStorageConfig with a jiraSharedStorageSetup that would handle the job).
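
A rough sketch of the suggested shape, where JiraSharedStorageSetup is the reviewer's hypothetical name; the constructor and method signatures below are assumptions, not part of the PR:

```kotlin
import com.atlassian.performance.tools.ssh.api.SshConnection

// Hypothetical injected collaborator that owns the S3-related node configuration,
// so DataCenterNodeFormula only calls it and the logic can be unit tested in isolation.
class JiraSharedStorageSetup(
    private val config: JiraSharedStorageConfig,
    private val s3StorageBucketName: String?
) {
    fun configureNode(ssh: SshConnection, jiraHome: String) {
        if (s3StorageBucketName != null && config.isAnyResourceStoredInS3()) {
            // Write filestore-config.xml under $jiraHome pointing avatars/attachments at the bucket.
        }
    }
}
```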

### Added
- Add `DataCenterFormula.Builder.jiraSharedStorageConfig` function to configure DC tests to store attachments/avatars in an S3 bucket.
- Add `JiraSharedStorageConfig`.
- Update AWSCli to support V2 and use V2.9.12 by default.

Contributor:

When support for the two CLI versions is dropped, this could become an implementation detail that could be skipped in the changelog.

dagguh (Contributor) commented Apr 19, 2024:

Yes, we store the permissions in https://github.com/atlassian-labs/aws-resources/blob/master/src/main/resources/iam-policy.json

Nope, aws-resources stores the permissions needed by its own APIs. If aws-infrastructure needs more permissions for its own APIs, it should have its own policy list here.
