
TechDocs: Support AWS S3 as TechDocs external storage #3794

Merged
19 commits merged into backstage:master on Jan 5, 2021

Conversation

ayshiff
Contributor

@ayshiff ayshiff commented Dec 20, 2020

Hey, I just made a Pull Request!

Closes #3714

This PR adds the feature to use AWS S3 as TechDocs external storage.

  • Added a new awsS3.ts publisher Link
  • Added tests and mocks Link
  • Added steps about how to use AWS S3 Link
  • Updated config reference documentation Link

✔️ Checklist

  • A changeset describing the change and affected packages. (more info)
  • Added or updated documentation
  • Tests for new functionality and regression tests for bug fixes
  • Screenshots attached (for UI changes)

@ayshiff ayshiff requested review from a team as code owners December 20, 2020 21:17
@changeset-bot

changeset-bot bot commented Dec 20, 2020

🦋 Changeset detected

Latest commit: 1517009

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
Name Type
@backstage/techdocs-common Patch
@backstage/plugin-techdocs-backend Patch


Member

@OrkoHunter OrkoHunter left a comment

Looking great so far @ayshiff ! Kudos for your work on this and supporting documentation as well.

Leaving some comments. Will take another look very soon. 🎉

.changeset/fast-colts-cross.md Outdated
app-config.yaml Outdated
docs/features/techdocs/configuration.md Outdated
docs/features/techdocs/using-cloud-storage.md Outdated
packages/techdocs-common/src/stages/publish/awsS3.ts Outdated
packages/techdocs-common/src/stages/publish/awsS3.ts Outdated
plugins/techdocs-backend/src/service/router.ts Outdated
@OrkoHunter OrkoHunter added this to Incoming in TechDocs project board via automation Dec 20, 2020
@OrkoHunter OrkoHunter moved this from Incoming to To Review/Review In Progress ⏳ in TechDocs project board Dec 20, 2020
Collaborator

@andrewthauer andrewthauer left a comment

Excited this is being added.

awsS3:
  # An API key is required to write to a storage bucket.
  credentials:
    $file: '/path/to/aws_application_credentials.json'
Collaborator

Curious, can the credentials be set with just environment variables? Having a physical secrets file might be difficult to set up in some environments.

Contributor Author

I agree with you.
At first I wanted to put the credentials as environment variables:

credentials:
    aws_access_key_id: ACCESS_KEY_ID
    aws_secret_access_key: SECRET_ACCESS_KEY

But to keep some consistency with the GCS configuration, I thought it might be interesting to keep the same credentials-file logic.

But since it is not very relevant for AWS, we can perhaps replace the file credentials with environment variables 👍

@OrkoHunter I would like to know your opinion on that.

Member

@OrkoHunter OrkoHunter Dec 21, 2020

Curious, can the credentials be set with just environment variables?

Yes! We can use the contents of the credentials.json file as the value for the environment variable (e.g. TECHDOCS_AWSS3_CREDENTIALS). And set

awsS3:
  credentials:
    $env: TECHDOCS_AWSS3_CREDENTIALS

I have used this with GCS on the demo site. But I see that it would be nice to document this ^ in the app-config.yaml comments and the config reference docs.

Collaborator

@andrewthauer andrewthauer Dec 21, 2020

Wouldn't the environment variable need to be a JSON string in this case? I'm personally not a big fan of managing secrets that way; it can make some secret management processes more challenging to deal with. I'd rather be able to split out each field separately, if possible.

Member

@andrewthauer I agree with you that JSON strings are not pretty in environment variables.

Here is my concern. With GCS, their credentials JSON looks like this

{
  "type": "",
  "project_id": "",
  "private_key_id": "",
  "private_key": "-----BEGIN PRIVATE KEY-----<Insert 1600 characters>-----END PRIVATE KEY-----\n",
  "client_email": "",
  "client_id": "",
  "auth_uri": "",
  "token_uri": "",
  "auth_provider_x509_cert_url": "",
  "client_x509_cert_url": ""
}

Even if we split these out into individual fields, the private_key will still be a huge string. So, I do not see much of an improvement here over the existing options, i.e.

  1. Copy over the credentials file as a huge $env variable.
  2. (Kubernetes) Inject the $file using Kubernetes secrets management. https://kubernetes.io/docs/concepts/configuration/secret/#overview-of-secrets
  3. ...other typical ways to make a secret file available on a server.

That being said, if AWS allows smaller alphanumeric token-type secrets that can be generated and set as environment variables, I too would prefer that over the huge JSON string. It's just that GCS is holding us back with how it authenticates service accounts.

Contributor

@mfrinnstrom mfrinnstrom Dec 22, 2020

If I run Backstage as a container in ECS, will I be able to skip all of this config and use the role that is assigned to the task?

Member

@mfrinnstrom Role and Region are not used as of this moment. Only bucket name and credentials are needed.

Contributor

@OrkoHunter that was my concern. If I add credentials to my task in AWS ECS, they would be picked up automatically by the AWS SDK. It seems like we bypass this here and require credentials to be provided from the app.

Member

If I add credentials to my task in AWS ECS

Is this the same as setting environment variables? We are open to suggestions. When the AWS SDK is initialized here, should it check for the existing environment variables?
May I also ask what the downsides are of explicitly setting those environment variables in app-config.yaml?

Contributor

@OrkoHunter not really, although the AWS SDK has some standard environment variables that are picked up automatically if available. Some more information can be found here.

We usually create roles and assign them to our different workloads, in this case an ECS task. These are then picked up by the AWS SDK in the ECS task without having to create a user, generate credentials for it, and then securely manage and supply those credentials to the workload.

This works great when running in AWS, but when running outside of AWS you will need something like what is implemented right now, with the access and secret keys as environment variables. It should probably be possible to use the default environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) and have the SDK pick those up automatically, but that becomes a bit hidden and maybe not too easy to follow in the config.
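
A minimal sketch (not from this PR) of what that default pickup looks like with aws-sdk v2; it only illustrates the SDK's default credential provider chain, and the bucket name is illustrative:

import AWS from 'aws-sdk';

// With no explicit credentials passed in, the v2 SDK resolves them through its
// default provider chain: AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars,
// the shared ~/.aws/credentials file, then the ECS task role or EC2 instance
// profile. Nothing has to be read from app-config.yaml.
const s3 = new AWS.S3();

// Quick check that the resolved credentials can reach the bucket.
s3.headBucket({ Bucket: 'techdocs-storage' })
  .promise()
  .then(() => console.log('Credentials resolved and bucket reachable'));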

Member

@OrkoHunter OrkoHunter left a comment

:shipit:

I have tested the setup with S3 and it works great! Thanks again @ayshiff for working on this.

@ayshiff
Contributor Author

ayshiff commented Dec 21, 2020

I will fix the tests and update the techdocs JSON schema by adding the region property 👍

@OrkoHunter
Member

Awesome!

@ayshiff
Contributor Author

ayshiff commented Dec 21, 2020

I don't know if you have reviewed the example diagram for AWS. Link

I used Excalidraw to draw it, but I saw that the other diagrams were made with draw.io.
Do you think it would be better to redo the diagram with draw.io to keep the documentation consistent?

[attached diagram: aws-s3]

@OrkoHunter
Member

@ayshiff Good that you noticed it! If you make it with draw.io and save as an svg, it will be easier to maintain/update! Try the draw.io VS Code extension and open a file with .drawio.svg extension to experience the magic. ✨

@andrewthauer
Collaborator

andrewthauer commented Dec 21, 2020

@ayshiff Good that you noticed it! If you make it with draw.io and save as an svg, it will be easier to maintain/update! Try the draw.io VS Code extension and open a file with .drawio.svg extension to experience the magic. ✨

Or plantuml if you'd rather not think or fiddle with layout 😄

Collaborator

@andrewthauer andrewthauer left a comment

Question - will publishers support nested directories? For instance, you might want to share an S3 bucket with other uses, with the documentation sites located under, say, /docs/sites as the root.

@@ -42,6 +42,7 @@
     "@google-cloud/storage": "^5.6.0",
     "@types/dockerode": "^3.2.1",
     "@types/express": "^4.17.6",
+    "aws-sdk": "^2.813.0",
Collaborator

It looks like v3 was just released. Is it worth taking a look? Seems like it has better TS support - https://aws.amazon.com/about-aws/whats-new/2020/12/aws-sdk-javascript-version-3-generally-available/.

Member

Nice! @ayshiff Would you give this a try? You can create an issue for the upgrade if you want to do it later or want someone else to do it - and that is totally fine.

Contributor Author

Yes, of course! I'll look into it 👍

plugins/techdocs/package.json Outdated
@OrkoHunter
Member

Will publishers support using nested directories? For instance, you want to share an s3 bucket for other purposes and the documentation sites are located in say /docs/sites as the root.

Great question @andrewthauer and big +1 for adding that support. I'll create an issue for this. I imagine it can be defined as a rootPath inside techdocs.publisher.awsS3. Will do for both GCS and S3.
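
A rough sketch of how a hypothetical rootPath could be applied when building object keys; none of these names exist in the merged code, and the entity directory layout is illustrative:

import path from 'path';

// Hypothetical config value, e.g. techdocs.publisher.awsS3.rootPath
const rootPath = 'docs/sites';

// TechDocs-style entity directory (illustrative: namespace/kind/name)
const entityRootDir = 'default/Component/my-service';

// Prefix every uploaded object key with the root path, using POSIX separators
// so the key looks the same regardless of the host OS.
const objectKey = path.posix.join(rootPath, entityRootDir, 'index.html');
// => 'docs/sites/default/Component/my-service/index.html'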

Comment on lines 149 to 155
const techdocsMetadataJson = file?.Body?.toString();

if (!techdocsMetadataJson) {
  throw new Error(
    `Unable to parse the techdocs metadata file ${entityRootDir}/techdocs_metadata.json.`,
  );
}
Contributor Author

Is there a better way to get the Readable | ReadableStream | Blob body here?

@ayshiff
Contributor Author

ayshiff commented Dec 22, 2020

@OrkoHunter Do you know why the lint and tsc commands fail on CI on a line that does not seem to exist?

@OrkoHunter
Member

Hey @ayshiff! Could you try pulling the latest remote master branch into your feature branch? (Or do a git rebase.) That may cause some conflicts.

This PR, #3764, was merged recently.

@OrkoHunter
Member

@ayshiff The failing test is gone (we had to rebase and fix a merge conflict, that's all). Now there seems to be a new tsc issue with the aws-sdk package.

I also had to run yarn again after copying over yarn.lock from the master branch to sync.

@ayshiff
Contributor Author

ayshiff commented Dec 22, 2020

Thanks @OrkoHunter 👍 I'll look into it.

@ayshiff
Contributor Author

ayshiff commented Dec 22, 2020

I wonder why I didn't have conflicts while doing my rebase 🤔
I was probably not up to date with the remote. My bad.

@ayshiff
Contributor Author

ayshiff commented Dec 22, 2020

It looks like an issue aws-sdk-js-v3 has: Related issue

@andrewthauer
Collaborator

It looks like an issue aws-sdk-js-v3 has: Related issue

That's unfortunate. Maybe not worth the version bump until that is sorted out.

@ayshiff
Contributor Author

ayshiff commented Dec 22, 2020

I'm going to move the v3 bump commits to another branch and open a draft PR while waiting for the issue to be fixed in #1812.

Does that sound good to you?

let secretAccessKey = null;
let bucketName = '';
try {
  accessKeyId = config.getString(
Contributor

@lowjoel lowjoel Dec 31, 2020

Possible improvement for a future iteration (happy to contribute if we get there first): this should use EC2 instance profiles or Kubernetes IRSA to remove the need to provide an access key. These credentials are automatically rotated, and support is built into the AWS SDK.

Typically in AWS SDKs, if your credentials are omitted, the default precedence rules are used, which (from memory) are env vars, then ~/.aws/credentials, then the instance profile/IRSA.

Contributor

Maybe make the credentials optional and, if they are not present, use the default credentials chain that @lowjoel mentions? Just new aws.S3(), basically.

We would also be happy to contribute this, since we have a policy to never create users and use access keys. I can understand that this might be necessary if you are not running in AWS but still want to use S3, though.
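
A hedged sketch of that suggestion; the helper name and config keys are illustrative, not the merged publisher's API:

import AWS from 'aws-sdk';
import { Config } from '@backstage/config';

// Illustrative helper, not part of this PR: treat credentials as optional.
export function createS3Client(config: Config): AWS.S3 {
  const accessKeyId = config.getOptionalString(
    'techdocs.publisher.awsS3.credentials.accessKeyId',
  );
  const secretAccessKey = config.getOptionalString(
    'techdocs.publisher.awsS3.credentials.secretAccessKey',
  );

  // Only pass explicit credentials when both values are set; otherwise fall back
  // to the SDK's default chain (env vars, ~/.aws/credentials, task/instance role).
  return accessKeyId && secretAccessKey
    ? new AWS.S3({ credentials: { accessKeyId, secretAccessKey } })
    : new AWS.S3();
}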

Contributor Author

I'd be happy to add this to the existing PR!

@OrkoHunter Do we keep this PR in its current state and iterate on it once merged to add optional credentials, or do we add that on this branch?

Member

@ayshiff IMO let us iterate in the future! We shouldn't delay this PR any further. Hopefully we can merge it next week. :)

@OrkoHunter
Member

@backstage/maintainers Please review when possible!

@@ -76,7 +76,7 @@ techdocs:
   generators:
     techdocs: 'docker' # Alternatives - 'local'
   publisher:
-    type: 'local' # Alternatives - 'googleGcs'. Read documentation for using alternatives.
+    type: 'local' # Alternatives - 'googleGcs' or 'awsS3'. Read documentation for using alternatives.
Member

Not for now, but I'm wondering if we can drop the provider prefix for these types? s3 and gcs read much better than awsS3 and googleGcs, respectively. Thoughts, @backstage/techdocs-core?

Thinking that maybe we would favour some form of configuration like the one below, where all the services for a provider live underneath that provider:

aws:
  s3:
    access_key_id: $S3_AWS_ACCESS_KEY_ID
    secret_access_key: $S3_AWS_SECRET_ACCESS_KEY
  redshift:
    access_key_id: $REDSHIFT_AWS_ACCESS_KEY_ID
    secret_access_key: $REDSHIFT_AWS_SECRET_ACCESS_KEY
  default:
    access_key_id: $AWS_ACCESS_KEY_ID
    secret_access_key: $AWS_SECRET_ACCESS_KEY

Member

@OrkoHunter OrkoHunter Jan 2, 2021

Hmm, interesting idea to unify the configs by provider as you describe above. I think we might do this just like we did the integrations with GitHub/GitLab/etc. ➕
As of now, for techdocs.publisher.type, IMHO googleGcs and awsS3 are slightly better than gcs and s3. :)

# Required when techdocs.publisher.type is set to 'awsS3'. Skip otherwise.

awsS3:
  # An API key is required to write to a storage bucket.
Member

Does the SDK also read from the instance profile or ~/.aws/credentials?

Member

Yep! There are some additional ways to handle S3 configuration apart from creating an access user (discussed above in various threads). I'm thinking we can address it in a separate PR (and also improve the GCS auth along similar lines).

packages/techdocs-common/src/stages/publish/awsS3.ts Outdated
packages/techdocs-common/src/stages/publish/awsS3.ts Outdated
packages/techdocs-common/src/stages/publish/awsS3.ts Outdated
@OrkoHunter
Member

@ayshiff Do you think this is good to go? I'll do a final Quality Assurance on this and merge. Could you also highlight some future fixes/improvements? We can create issues for them.

@ayshiff
Contributor Author

ayshiff commented Jan 5, 2021

I think this is good to go; I'll do a last check in the afternoon.

Here are the future fixes/improvements that I noted in the comments:

  • Handle S3 configuration without creating an access user (add the possibility to read from the instance profile or ~/.aws/credentials) Link
  • Drop provider prefixes? (awsS3 -> s3 and googleGcs -> gcs) Link
  • Unify configs by provider Link
  • aws-sdk version bump (already implemented on a branch, waiting for some TypeScript issues to be fixed in aws-sdk-js-v3)
  • Enable publishers to support nested directories (GCS + S3)

@OrkoHunter
Member

@ayshiff Do let me know when you think it's ready.
Thank you for summarising the improvements. I can go ahead and create separate issues for them once this is merged, but feel free to let me know if you want to create them yourself. :)

@ayshiff
Contributor Author

ayshiff commented Jan 5, 2021

It looks good to me!
Do you want me to clean up the commit history? (squash some commits...)

I'd be happy to create them. :)
For issues involving AWS S3 and GCS, do I create two issues or a single one that includes both?

@OrkoHunter
Member

Do you want me to clean up the commit history? (squash some commits...)

No, the commit history looks good to me!

I'd be happy to create them. :)
For issues involving AWS S3 and GCS, do I create two issues or a single one that includes both?

Awesome, and thank you! Here is what I think:

  1. (Only S3) Improved S3 credentials handling. Explicitly defining configs should not be required.
  2. (Only GCS) Same ^. One should be able to do new GCS(), and it should not complain about missing configs. The logic should be: during initialization, if the publisher is able to read the bucket's metadata -> connected; unable to read metadata -> throw an error (see the sketch after this list).
  3. (Only S3) aws-sdk version bump (already implemented on a branch, waiting for some TypeScript issues to be fixed in aws-sdk-js-v3).
  4. (Joint) Enable publishers to support nested directories in a bucket.
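
A minimal sketch of the initialization check described in items 1 and 2; the helper names are illustrative and none of this is in the merged PR:

import AWS from 'aws-sdk';
import { Storage } from '@google-cloud/storage';

// Reading the bucket's metadata at startup proves both reachability and
// permissions: report "connected" on success, throw otherwise.
export async function checkS3Connection(s3: AWS.S3, bucket: string): Promise<void> {
  await s3.headBucket({ Bucket: bucket }).promise();
}

export async function checkGcsConnection(
  storage: Storage,
  bucket: string,
): Promise<void> {
  await storage.bucket(bucket).getMetadata();
}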

@OrkoHunter OrkoHunter merged commit 0b9646e into backstage:master Jan 5, 2021
TechDocs project board automation moved this from To Review/Review In Progress ⏳ to Done ✅ Jan 5, 2021
@OrkoHunter
Member

You did an awesome job on this PR, @ayshiff! 👏 I appreciate your attention to detail in the changes you made and in the review comments you addressed (they sparked many good conversations). :)

@ayshiff
Contributor Author

ayshiff commented Jan 5, 2021

I think the problem with the failing tests on master is due to bad path delimiters.
Do we revert the PR and add the fix there, or do I create another PR that fixes the tests?

@OrkoHunter
Member

OrkoHunter commented Jan 5, 2021

@ayshiff New one would be good! (Windows paths 😬)

@OrkoHunter
Member

OrkoHunter commented Jan 5, 2021

You could use something like this in the tests. (Looking at existing examples)

const path = os.platform() === 'win32' ? 'C:\\rootDir' : '/rootDir';
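
A self-contained version of that suggestion (the entity directory layout is illustrative):

import os from 'os';
import path from 'path';

// Pick a root directory that is valid on the current platform.
const rootDir = os.platform() === 'win32' ? 'C:\\rootDir' : '/rootDir';

// Build nested paths with path.join so the delimiter matches the host OS
// instead of being hard-coded to '/'.
const entityRootDir = path.join(rootDir, 'default', 'Component', 'my-docs');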
