GCS offload support(4): add documentations for GCS #2152
Conversation
bfeeffd to 1fa337f
retest this please
@ivankelly can you review this?
Tiered storage currently supports S3 for long term storage. On the broker, the administrator must configure a S3 bucket and the AWS region where the bucket exists. Offloaded data will be placed into this bucket.
Pulsar users multi-part objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a life cycle rule your bucket to expire incomplete multi-part upload after a day or two to avoid getting charged for incomplete uploads.
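As an aside, the recommended lifecycle rule can be written as an S3 bucket lifecycle configuration. A minimal sketch (the rule ID is a placeholder and the one-day window is illustrative), which could be applied via the AWS console or CLI:

```json
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 1
      }
    }
  ]
}
```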
Pulsar uses multi-part objects to upload the segment data.
At a minimum, the user must configure the driver, the region and the bucket.
Currently we support driver of types: { "S3", "aws-s3", "google-cloud-storage" },
Don't mention the two variants of S3 in the docs, just "aws-s3" and "google-cloud-storage".
### Configuring the broker
At a minimum, the administrator must configure the driver, the bucket and the authenticating. There is also some other knobs to configure, like the bucket regions, the max block size in backed storage, etc.
-> the bucket, and authentication credentials.
-> There are also some other options to configure, ...
For AWS, region is a required configuration. I would guess it's the same for GCS, no?
Thanks. They both have default values for region, and neither is required.
US East (N. Virginia) is the default region for aws-s3.
us (multi-regional location) is the default location for GCS.
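Both region properties are quoted elsewhere in this thread; a sketch of overriding the defaults in `broker.conf` (values are illustrative, and both settings are optional):

```conf
# broker.conf — region settings are optional for both drivers
# aws-s3: defaults to US East (N. Virginia) when unset
s3ManagedLedgerOffloadRegion=eu-west-3
# google-cloud-storage: defaults to the "us" multi-regional location when unset
gcsManagedLedgerOffloadRegion=europe-west3
```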
Offloading is configured in ```broker.conf```.
### Configure the driver
-> Configuring the driver
The configured S3 bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.
## Configuring for S3 and GCS in the broker
-> Configuring Tiered Storage in the Broker
At a minimum, the user must configure the driver, the region and the bucket.
Currently we support driver of types: { "S3", "aws-s3", "google-cloud-storage" },
{% include admonition.html type="warning" content="The chars are case ignored for driver's name. "s3" and "aws-s3" are similar, with "aws-s3" you just don't need to define the url of the endpoint because it will know to use `s3.amazonaws.com`." %}
-> Driver names are case-insensitive.
Why is there different behaviour with s3 and aws-s3? Surely if the endpoint is defined it should be used.
Thanks. With s3, you must provide the endpoint url; with aws-s3, the endpoint url is optional.
s3ManagedLedgerOffloadRegion=eu-west-3
```
### Configure the Bucket
Configuring the Bucket
On the broker, the administrator must configure the bucket or credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.
-> bucket and credentials
For AWS, you should state that region is also required, and should match the region in which the bucket has been created.
- Regarding driver type "S3" or "aws-s3", the administrator should configure `s3ManagedLedgerOffloadBucket`.
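A sketch of the bucket settings in `broker.conf` (bucket names are placeholders; `gcsManagedLedgerOffloadBucket` is assumed here as the GCS counterpart and is not quoted in this thread):

```conf
# broker.conf — the configured bucket must already exist, or offload will fail
# aws-s3 driver (property quoted in this thread):
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
# google-cloud-storage driver (assumed counterpart property):
gcsManagedLedgerOffloadBucket=pulsar-topic-offload
```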
Again, having S3 and aws-s3 is going to be very confusing. Stick with aws-s3.
Pulsar also provides some knobs to configure the size of requests sent to S3.
Regarding AWS S3, the default region is `US East (N. Virginia)`. Page [AWS Regions and Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html) contains more information.
The page is jumping back and forth between GCS and AWS a lot. For a user, this is very confusing. A user is either going to care about GCS or AWS, and not give a fig about the other. So, for the things you need to get up and running, they should be groups together.
In other words, the sections should be
- Configuring the Driver // s3 or gcs
- S3
  - configuration of region, bucket and credentials
  - note at end about setting the endpoint explicitly
- GCS
  - configuration of region, bucket and credentials
- Extra options
  - for the block size stuff, etc, that people are rarely going to touch
Thanks. Agreed that most users care about one of the two types. The current sections follow your original view; it seems clear from the view of each setting: in each section, S3 is introduced first, then GCS. I would like to keep the current layout.
Most of the settings are similar (except the endpoint), and most of the words in each section explain the meaning of the settings, with fewer words for the settings themselves. Splitting S3 and GCS would duplicate some of the explanation.
The user doesn't care if there is duplication. They care how much they need to read to get up and running. And how much of what they read is useful to them. With the current layout, 50% of what they read is useless to them.
@ivankelly Thanks for the comments, updated this PR.
rerun integration tests
@ivankelly , updated again.
@@ -5,6 +5,8 @@ tags: [admin, tiered-storage]
Pulsar's **Tiered Storage** feature allows older backlog data to be offloaded to long term storage, thereby freeing up space in BookKeeper and reducing storage costs. This cookbook walks you through using tiered storage in your Pulsar cluster.
Tiered storage currently leverage [Apache Jclouds](https://jclouds.apache.org) to supports [S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage/)(GCS for short) for long term storage. And by Jclouds, it is easy to add more [supported](https://jclouds.apache.org/reference/providers/#blobstore-providers) cloud storage provider in the future.
-> currently uses Apache Jclouds to support Amazon S3.
And by Jclouds.. -> With jclouds, it is easy to add support for more cloud storage providers in the future.
jclouds always seem to write their name in all lowercase.
Pulsar users multipart objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a lifecycle rule your S3 bucket to expire incomplete multipart upload after a day or two to avoid getting charged for incomplete uploads.
### Configuring the broker
## Configuring the driver for "aws-s3" or "google-cloud-storage" in the broker
-> Configuring the offload driver
Offloading is configured in ```broker.conf```.
At a minimum, the user must configure the driver, the region and the bucket.
At a minimum, the administrator must configure the driver, the bucket and the authenticating. There is also some other knobs to configure, like the bucket regions, the max block size in backed storage, etc.
-> the bucket and authentication credentials.
-> bucket region
Currently we support driver of types: { "aws-s3", "google-cloud-storage" },
{% include admonition.html type="warning" content="Driver names are case-insensitive for driver's name. "s3" and "aws-s3" are similar, with "aws-s3" you just don't need to define the url of the endpoint because it is aligned with region, and default is `s3.amazonaws.com`; while with s3, you must provide the endpoint url by `s3ManagedLedgerOffloadServiceEndpoint`." %}
"s3" and "aws-s3" ... -> There is a third driver type, "s3", which is identical to "aws-s3", though it requires that you specify an endpoint url using s3ManagedLedgerOffloadServiceEndpoint
. This is useful if using a S3 compatible data store, other than AWS.
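A sketch of that configuration in `broker.conf` (the endpoint url and bucket name are placeholders, and `managedLedgerOffloadDriver` is assumed here as the driver-selection property; it is not quoted in this thread):

```conf
# broker.conf — the "s3" driver requires an explicit endpoint (placeholder url)
managedLedgerOffloadDriver=s3
s3ManagedLedgerOffloadServiceEndpoint=http://s3-compatible-store.example.com
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
```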
```
It is also possible to specify the s3 endpoint directly, using ```s3ManagedLedgerOffloadServiceEndpoint```. This is useful if you are using a non-AWS storage service which provides an S3 compatible API.
### Configuring for "aws-s3" driver
-> "aws-s3" Driver configuration
gcsManagedLedgerOffloadRegion=europe-west3
```
#### Configuring the Authenticating
As above, just "Authentication" is enough.
The administrator need configure `gcsManagedLedgerOffloadServiceAccountKeyFile` in `broker.conf` to get GCS service available. It is a Json file, which contains GCS credentials of service account key.
The administrator needs to configure `gcsManagedLedgerOffloadServiceAccountKeyFile` in `broker.conf` for the broker to be able to access the GCS service. `gcsManagedLedgerOffloadServiceAccountKeyFile` is a Json file, containing the GCS credentials of a service account.
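Sketched in `broker.conf` (the key file path is a placeholder):

```conf
# broker.conf — Json key file containing the GCS service account credentials
gcsManagedLedgerOffloadServiceAccountKeyFile=/path/to/gcs-service-account-key.json
```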
[This page](https://support.google.com/googleapi/answer/6158849) contains more information of how to create this key file for authentication. You could also get more information regarding google cloud [IAM](https://cloud.google.com/storage/docs/access-control/iam).
The Service Accounts section of this page contains more information...
More information about Google Cloud IAMs is available here.
2. If it's not already selected, select the project that you're creating credentials for.
3. To set up a new service account, click New credentials and then select Service account key.
4. Choose the service account to use for the key.
5. Choose whether to download the service account's public/private key as a JSON file that can be loaded by a Google API client library.
Remove "Choose whether to".
Here is an example:
Remove this line.
@ivankelly , updated again.
#lgtm
retest this please
retest this please |
fyi please hold on merging this until I cut over 2.1 website.
### Motivation
There is a syntax error in tiered storage doc introduced by apache#2152.
```
Liquid Exception: Invalid syntax for include tag: type="warning" content="Driver names are case-insensitive for driver's name. There is a third driver type, "s3", which is identical to "aws-s3", though it requires that you specify an endpoint url using `s3ManagedLedgerOffloadServiceEndpoint`. This is useful if using a S3 compatible data store, other than AWS." Valid syntax: {% include file.ext param='value' param2='value' %} in docs/latest/cookbooks/tiered-storage.md
```
### Changes
Fix the syntax error.
### Motivation
Cherry-pick apache#2152
This is the 4th part to support Google Cloud Storage offload.
It aims to add documentation for GCS, and it is based on PR #2151.
Master Issue: #2067