GCS offload support(4): add documentations for GCS #2152
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,8 @@ tags: [admin, tiered-storage] | |
|
||
Pulsar's **Tiered Storage** feature allows older backlog data to be offloaded to long term storage, thereby freeing up space in BookKeeper and reducing storage costs. This cookbook walks you through using tiered storage in your Pulsar cluster. | ||
|
||
Tiered storage currently leverage [Apache Jclouds](https://jclouds.apache.org) to supports [S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage/)(GCS for short) for long term storage. And by Jclouds, it is easy to add more [supported](https://jclouds.apache.org/reference/providers/#blobstore-providers) cloud storage provider in the future. | ||
|
||
## When should I use Tiered Storage? | ||
|
||
Tiered storage should be used when you have a topic for which you want to keep a very long backlog for a long time. For example, if you have a topic containing user actions which you use to train your recommendation systems, you may want to keep that data for a long time, so that if you change your recommendation algorithm you can rerun it against your full user history. | ||
|
@@ -17,42 +19,63 @@ A topic in Pulsar is backed by a log, known as a managed ledger. This log is com | |
|
||
The Tiered Storage offloading mechanism takes advantage of this segment oriented architecture. When offloading is requested, the segments of the log are copied, one-by-one, to tiered storage. All segments of the log, apart from the segment currently being written to can be offloaded. | ||
|
||
## Amazon S3 | ||
On the broker, the administrator must configure the bucket or credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail. | ||
|
||
Tiered storage currently supports S3 for long term storage. On the broker, the administrator must configure a S3 bucket and the AWS region where the bucket exists. Offloaded data will be placed into this bucket. | ||
Pulsar users multi-part objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a life cycle rule your bucket to expire incomplete multi-part upload after a day or two to avoid getting charged for incomplete uploads. | ||
Review comment: Pulsar uses multi-part objects to upload the segment data. |
||
|
||
The configured S3 bucket must exist before attempting to offload. If it does not exist, the offload operation will fail. | ||
## Configuring for S3 and GCS in the broker | ||
Review comment: -> Configuring Tiered Storage in the Broker |
||
|
||
Pulsar users multipart objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a lifecycle rule your S3 bucket to expire incomplete multipart upload after a day or two to avoid getting charged for incomplete uploads. | ||
Offloading is configured in ```broker.conf```. | ||
|
||
### Configuring the broker | ||
At a minimum, the administrator must configure the driver, the bucket and the authenticating. There is also some other knobs to configure, like the bucket regions, the max block size in backed storage, etc. | ||
Review comment: -> the bucket, and authentication credentials. -> There are also some other options to configure, ... For AWS, region is a required configuration. I would guess it's the same for GCS, no? |
Reply: Thanks. They both have default values for region, and neither is required. |
Review comment: -> the bucket and authentication credentials. |
||
|
||
Offloading is configured in ```broker.conf```. | ||
### Configure the driver | ||
Review comment: -> Configuring the driver |
||
|
||
At a minimum, the user must configure the driver, the region and the bucket. | ||
Currently we support driver of types: { "S3", "aws-s3", "google-cloud-storage" }, | ||
Review comment: Don't mention the two variants of S3 in the docs, just "aws-s3" and "google-cloud-storage". |
||
{% include admonition.html type="warning" content="The chars are case ignored for driver's name. "s3" and "aws-s3" are similar, with "aws-s3" you just don't need to define the url of the endpoint because it will know to use `s3.amazonaws.com`." %} | ||
Review comment: -> Driver names are case-insensitive. Why is there different behaviour with s3 and aws-s3? Surely if the endpoint is defined it should be used. |
Reply: Thanks. With s3, you must provide the endpoint URL, while with aws-s3 the endpoint URL is not required. |
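To make the distinction in the reply above concrete, here is a minimal sketch of a `broker.conf` fragment for the `s3` driver pointed at a non-AWS, S3-compatible service. The endpoint URL is a hypothetical placeholder, not a value from this pull request. The bucket name is reused from the diff. The region is left unset because the pre-change text in this diff warns that it must not be set when the endpoint is specified directly.

```conf
# Sketch only: the endpoint URL is a hypothetical placeholder.
managedLedgerOffloadDriver=s3
s3ManagedLedgerOffloadServiceEndpoint=http://storage.example.internal:9000
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
# s3ManagedLedgerOffloadRegion is intentionally not set, because the
# endpoint is specified directly.
```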
||
|
||
```conf | ||
managedLedgerOffloadDriver=S3 | ||
Review comment: change to aws-s3 |
||
s3ManagedLedgerOffloadRegion=eu-west-3 | ||
``` | ||
|
||
### Configure the Bucket | ||
Review comment: Configuring the Bucket |
||
|
||
On the broker, the administrator must configure the bucket or credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail. | ||
Review comment: -> bucket and credentials. For AWS, you should state that region is also required, and should match the region in which the bucket has been created. |
||
|
||
- Regarding driver type "S3" or "aws-s3", the administrator should configure `s3ManagedLedgerOffloadBucket`. | ||
Review comment: Again, having S3 and aws-s3 is going to be very confusing. Stick with aws-s3. |
||
|
||
```conf | ||
s3ManagedLedgerOffloadBucket=pulsar-topic-offload | ||
``` | ||
|
||
It is also possible to specify the s3 endpoint directly, using ```s3ManagedLedgerOffloadServiceEndpoint```. This is useful if you are using a non-AWS storage service which provides an S3 compatible API. | ||
- While regarding driver type "google-cloud-storage", the administrator should configure `gcsManagedLedgerOffloadBucket`. | ||
```conf | ||
gcsManagedLedgerOffloadBucket=pulsar-topic-offload | ||
``` | ||
|
||
{% include admonition.html type="warning" content="If the endpoint is specified directly, then the region must _not_ be set." %} | ||
### Configure the Bucket Region | ||
|
||
{% include admonition.html type="warning" content="The broker.conf of all brokers must have the same configuration for driver, region and bucket for offload to avoid data becoming unavailable as topics move from one broker to another." %} | ||
Bucket Region is the region where bucket located. | ||
Review comment: Add a note about whether bucket region is a required configuration. What happens if it is not configured. |
||
|
||
Pulsar also provides some knobs to configure the size of requests sent to S3. | ||
Regarding AWS S3, the default region is `US East (N. Virginia)`. Page [AWS Regions and Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html) contains more information. | ||
Review comment: The page is jumping back and forth between GCS and AWS a lot. For a user, this is very confusing. A user is either going to care about GCS or AWS, and not give a fig about the other. So, for the things you need to get up and running, they should be grouped together. In other words, the sections should be
Reply: Thanks. Agreed that most users care about only one of the two types. The current sections follow your original layout, which seems clear from the view of each setting: within each setting, S3 is introduced first, then GCS. I would like to keep the current layout. Most of the settings are similar (except the endpoint), and most of the text in each section explains the meaning of the settings rather than the settings themselves. If S3 and GCS are split, some of the explanation will be duplicated. |
Review comment: The user doesn't care if there is duplication. They care how much they need to read to get up and running. And how much of what they read is useful to them. With the current layout, 50% of what they read is useless to them. |
Review comment: With AWS S3, ... |
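As an illustration of the per-provider grouping discussed in this thread, the following sketch collects the settings named in this diff into one block per provider. It is not the layout the pull request adopted. The bucket name and region values are the examples already used in the diff, and the key file path is a hypothetical placeholder. Only one driver is active at a time, so the GCS variant is shown commented out.

```conf
# Option A: AWS S3 (values taken from the examples in this diff)
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadRegion=eu-west-3
s3ManagedLedgerOffloadBucket=pulsar-topic-offload

# Option B: Google Cloud Storage (use instead of Option A)
# The key file path below is a hypothetical placeholder.
# managedLedgerOffloadDriver=google-cloud-storage
# gcsManagedLedgerOffloadRegion=europe-west3
# gcsManagedLedgerOffloadBucket=pulsar-topic-offload
# gcsManagedLedgerOffloadServiceAccountKeyFile=/path/to/service-account-key.json
```

With this grouping, a reader who cares about only one provider can copy a single block and ignore the other.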
||
|
||
- ```s3ManagedLedgerOffloadMaxBlockSizeInBytes``` configures the maximum size of a "part" sent during a multipart upload. This cannot be smaller than 5MB. Default is 64MB. | ||
- ```s3ManagedLedgerOffloadReadBufferSizeInBytes``` configures the block size for each individual read when reading back data from S3. Default is 1MB. | ||
Regarding GCS, buckets are default created in the `us multi-regional location`, page [Bucket Locations](https://cloud.google.com/storage/docs/bucket-locations) contains more information. | ||
|
||
In both cases, these should not be touched unless you know what you are doing. | ||
- AWS S3 Region example: | ||
Review comment: Remove this line. |
||
|
||
{% include admonition.html type="warning" content="The broker must be rebooted for any changes in the configuration to take effect." %} | ||
```conf | ||
s3ManagedLedgerOffloadRegion=eu-west-3 | ||
``` | ||
|
||
### Authenticating with S3 | ||
- GCS Region example: | ||
|
||
```conf | ||
gcsManagedLedgerOffloadRegion=europe-west3 | ||
``` | ||
|
||
### Configure the Authenticating | ||
|
||
#### Authenticating with AWS S3 | ||
|
||
To be able to access S3, you need to authenticate with S3. Pulsar does not provide any direct means of configuring authentication for S3, but relies on the mechanisms supported by the [DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html). | ||
|
||
|
@@ -86,6 +109,33 @@ If you are running in EC2 you can also use instance profile credentials, provide | |
|
||
{% include admonition.html type="warning" content="The broker must be rebooted for credentials specified in pulsar_env to take effect." %} | ||
|
||
#### Authenticating with GCS | ||
|
||
The administrator need configure `gcsManagedLedgerOffloadServiceAccountKeyFile` in `broker.conf` to get GCS service available. It is a Json file, which contains GCS credentials of service account key. | ||
Review comment: The administrator needs to configure |
||
[This page](https://support.google.com/googleapi/answer/6158849) contains more information of how to create this key file for authentication. You could also get more information regarding google cloud [IAM](https://cloud.google.com/storage/docs/access-control/iam). | ||
Review comment: The Service Accounts section of this page contains more information... |
||
|
||
Usually these are the steps to create the authentication file: | ||
1. Open the API Console Credentials page. | ||
2. If it's not already selected, select the project that you're creating credentials for. | ||
3. To set up a new service account, click New credentials and then select Service account key. | ||
4. Choose the service account to use for the key. | ||
5. Choose whether to download the service account's public/private key as a JSON file that can be loaded by a Google API client library. | ||
Review comment: Remove "Choose whether to". |
||
|
||
Here is an example: | ||
Review comment: Remove this line. |
||
```conf | ||
gcsManagedLedgerOffloadServiceAccountKeyFile="/Users/jia/Downloads/project-804d5e6a6f33.json" | ||
``` | ||
|
||
### Configure the size of block read/write | ||
|
||
Pulsar also provides some knobs to configure the size of requests sent to S3/GCS. | ||
|
||
- ```s3ManagedLedgerOffloadMaxBlockSizeInBytes``` and ```gcsManagedLedgerOffloadMaxBlockSizeInBytes``` configures the maximum size of a "part" sent during a multipart upload. This cannot be smaller than 5MB. Default is 64MB. | ||
- ```s3ManagedLedgerOffloadReadBufferSizeInBytes``` and ```gcsManagedLedgerOffloadReadBufferSizeInBytes``` configures the block size for each individual read when reading back data from S3/GCS. Default is 1MB. | ||
|
||
In both cases, these should not be touched unless you know what you are doing. | ||
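For reference, a sketch of these settings written out explicitly at the defaults stated above. The byte values assume that "MB" means 1024 * 1024 bytes, which the diff does not spell out. The 5MB lower bound for a part would then be 5242880 bytes.

```conf
# Maximum size of one "part" in a multipart upload: 64MB (64 * 1024 * 1024)
s3ManagedLedgerOffloadMaxBlockSizeInBytes=67108864
gcsManagedLedgerOffloadMaxBlockSizeInBytes=67108864

# Block size for each individual read from long term storage: 1MB (1024 * 1024)
s3ManagedLedgerOffloadReadBufferSizeInBytes=1048576
gcsManagedLedgerOffloadReadBufferSizeInBytes=1048576
```

Only the pair matching the configured driver matters in practice. Both pairs are shown here because the diff documents them side by side.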
|
||
|
||
## Configuring offload to run automatically | ||
|
||
Namespace policies can be configured to offload data automatically once a threshold is reached. The threshold is based on the size of data that the topic has stored on the Pulsar cluster. Once the topic reaches the threshold, an offload operation will be triggered. Setting a negative value to the threshold will disable automatic offloading. Setting the threshold to 0 will cause the broker to offload data as soon as it possibly can. | ||
|
Review comment: -> currently uses Apache Jclouds to support Amazon S3.
And by Jclouds.. -> With jclouds, it is easy to add support for more cloud storage providers in the future.
jclouds always seem to write their name in all lowercase.