
# GCS offload support(4): add documentations for GCS #2152

Merged · 4 commits · Aug 8, 2018 · Changes from 1 commit
`site/docs/latest/cookbooks/tiered-storage.md` · 84 changes: 67 additions & 17 deletions
@@ -5,6 +5,8 @@ tags: [admin, tiered-storage]

Pulsar's **Tiered Storage** feature allows older backlog data to be offloaded to long term storage, thereby freeing up space in BookKeeper and reducing storage costs. This cookbook walks you through using tiered storage in your Pulsar cluster.

Tiered storage currently leverage [Apache Jclouds](https://jclouds.apache.org) to supports [S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage/)(GCS for short) for long term storage. And by Jclouds, it is easy to add more [supported](https://jclouds.apache.org/reference/providers/#blobstore-providers) cloud storage provider in the future.

**Contributor:**

-> currently uses Apache Jclouds to support Amazon S3.
And by Jclouds.. -> With jclouds, it is easy to add support for more cloud storage providers in the future.

jclouds always seems to write its name in all lowercase.


## When should I use Tiered Storage?

Tiered storage should be used when you have a topic for which you want to keep a very long backlog for a long time. For example, if you have a topic containing user actions which you use to train your recommendation systems, you may want to keep that data for a long time, so that if you change your recommendation algorithm you can rerun it against your full user history.
@@ -17,42 +19,63 @@ A topic in Pulsar is backed by a log, known as a managed ledger. This log is com

The Tiered Storage offloading mechanism takes advantage of this segment-oriented architecture. When offloading is requested, the segments of the log are copied, one-by-one, to tiered storage. All segments of the log, apart from the segment currently being written to, can be offloaded.

## Amazon S3
On the broker, the administrator must configure the bucket or credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.

Tiered storage currently supports S3 for long term storage. On the broker, the administrator must configure a S3 bucket and the AWS region where the bucket exists. Offloaded data will be placed into this bucket.
Pulsar users multi-part objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a life cycle rule your bucket to expire incomplete multi-part upload after a day or two to avoid getting charged for incomplete uploads.

**Contributor:**

Pulsar uses multi-part objects to upload the segment data.
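
As an illustration of the lifecycle recommendation above, a sketch using the AWS CLI (the bucket name and rule ID here are placeholders, not from the Pulsar docs):

```bash
# abort multipart uploads left incomplete for more than a day,
# so they stop accruing storage charges
aws s3api put-bucket-lifecycle-configuration \
  --bucket pulsar-topic-offload \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "abort-incomplete-multipart",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1}
    }]
  }'
```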


The configured S3 bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.
## Configuring for S3 and GCS in the broker

**Contributor:**

-> Configuring Tiered Storage in the Broker


Pulsar users multipart objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a lifecycle rule your S3 bucket to expire incomplete multipart upload after a day or two to avoid getting charged for incomplete uploads.
Offloading is configured in ```broker.conf```.

### Configuring the broker
At a minimum, the administrator must configure the driver, the bucket and the authenticating. There is also some other knobs to configure, like the bucket regions, the max block size in backed storage, etc.

**Contributor:**

-> the bucket, and authentication credentials.

-> There are also some other options to configure, ...

For AWS, region is a required configuration. I would guess it's the same for GCS, no?

**@zhaijack (Contributor, author), Jul 25, 2018:**

Thanks. They both have default values for region, and neither is required.
US East (N. Virginia) is the default region for aws-s3.
us (multi-regional location) is the default location for gcs.

**Contributor:**

-> the bucket and authentication credentials.
-> bucket region


Offloading is configured in ```broker.conf```.
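
Before walking through the individual settings, a minimal sketch of the S3 case may help orient; it combines the driver, region, and bucket settings described in the sections below (the values are the same examples used there):

```conf
# minimal broker.conf sketch for offloading to AWS S3
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadRegion=eu-west-3
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
```
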
### Configure the driver

**Contributor:**

-> Configuring the driver


At a minimum, the user must configure the driver, the region and the bucket.
Currently we support driver of types: { "S3", "aws-s3", "google-cloud-storage" },

**Contributor:**

Don't mention the two variants of S3 in the docs, just "aws-s3" and "google-cloud-storage".

{% include admonition.html type="warning" content="The chars are case ignored for driver's name. "s3" and "aws-s3" are similar, with "aws-s3" you just don't need to define the url of the endpoint because it will know to use `s3.amazonaws.com`." %}

**Contributor:**

-> Driver names are case-insensitive.

Why is there different behaviour with s3 and aws-s3? Surely if the endpoint is defined it should be used.

**Contributor (author):**

Thanks. With s3 you must provide the endpoint url, while with aws-s3 the endpoint url is optional.


```conf
managedLedgerOffloadDriver=S3
s3ManagedLedgerOffloadRegion=eu-west-3
```

**Contributor** (on the `managedLedgerOffloadDriver` line):

change to aws-s3

### Configure the Bucket

**Contributor:**

Configuring the Bucket


On the broker, the administrator must configure the bucket or credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.

**Contributor:**

-> bucket and credentials

For AWS, you should state that region is also required, and should match the region in which the bucket has been created.


- Regarding driver type "S3" or "aws-s3", the administrator should configure `s3ManagedLedgerOffloadBucket`.

**Contributor:**

Again, having S3 and aws-s3 is going to be very confusing. Stick with aws-s3.


```conf
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
```

It is also possible to specify the s3 endpoint directly, using ```s3ManagedLedgerOffloadServiceEndpoint```. This is useful if you are using a non-AWS storage service which provides an S3 compatible API.
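
For illustration, a sketch of pointing the offloader at an S3-compatible service (the endpoint URL here is hypothetical):

```conf
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
s3ManagedLedgerOffloadServiceEndpoint=http://s3-compatible.example.com:9000
```
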
- Regarding driver type "google-cloud-storage", the administrator should configure `gcsManagedLedgerOffloadBucket`.
```conf
gcsManagedLedgerOffloadBucket=pulsar-topic-offload
```

{% include admonition.html type="warning" content="If the endpoint is specified directly, then the region must _not_ be set." %}
### Configure the Bucket Region

{% include admonition.html type="warning" content="The broker.conf of all brokers must have the same configuration for driver, region and bucket for offload to avoid data becoming unavailable as topics move from one broker to another." %}
Bucket Region is the region where the bucket is located.

**Contributor:**

Add a note about whether bucket region is a required configuration. What happens if it is not configured.


Pulsar also provides some knobs to configure the size of requests sent to S3.
Regarding AWS S3, the default region is `US East (N. Virginia)`. Page [AWS Regions and Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html) contains more information.

**Contributor:**

The page is jumping back and forth between GCS and AWS a lot. For a user, this is very confusing. A user is either going to care about GCS or AWS, and not give a fig about the other. So, the things you need to get up and running should be grouped together.

In other words, the sections should be

- Configuring the Driver // s3 or gcs
- S3
  - configuration of region, bucket and credentials
  - note at end about setting the endpoint explicitly
- GCS
  - configuration of region, bucket and credentials
- Extra options
  - for the block size stuff, etc, that people are rarely going to touch

**Contributor (author):**

Thanks. Agreed that most users care about one of the two types. The current sections follow your original view; it seems clear from the view of each setting: in each section, S3 is introduced first, then GCS. I would like to keep the current layout.

Most of the settings are similar (except the endpoint), and most of the words in each section explain the meaning of the settings, with few words for the settings themselves. If S3 and GCS were split, some of the explanation would be duplicated.

**Contributor:**

The user doesn't care if there is duplication. They care how much they need to read to get up and running. And how much of what they read is useful to them. With the current layout, 50% of what they read is useless to them.

**Contributor:**

With AWS S3, ...


- ```s3ManagedLedgerOffloadMaxBlockSizeInBytes``` configures the maximum size of a "part" sent during a multipart upload. This cannot be smaller than 5MB. Default is 64MB.
- ```s3ManagedLedgerOffloadReadBufferSizeInBytes``` configures the block size for each individual read when reading back data from S3. Default is 1MB.
Regarding GCS, buckets are created by default in the `us multi-regional location`. The page [Bucket Locations](https://cloud.google.com/storage/docs/bucket-locations) contains more information.

In both cases, these should not be touched unless you know what you are doing.
- AWS S3 Region example:

**Contributor:**

Remove this line.


{% include admonition.html type="warning" content="The broker must be rebooted for any changes in the configuration to take effect." %}
```conf
s3ManagedLedgerOffloadRegion=eu-west-3
```

### Authenticating with S3
- GCS Region example:

```conf
gcsManagedLedgerOffloadRegion=europe-west3
```

### Configure the Authentication

#### Authenticating with AWS S3

To be able to access S3, you need to authenticate with S3. Pulsar does not provide any direct means of configuring authentication for S3, but relies on the mechanisms supported by the [DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html).

Expand Down Expand Up @@ -86,6 +109,33 @@ If you are running in EC2 you can also use instance profile credentials, provide

{% include admonition.html type="warning" content="The broker must be rebooted for credentials specified in pulsar_env to take effect." %}
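
For reference, a sketch of one common mechanism the provider chain supports: passing the credentials as Java system properties via `PULSAR_EXTRA_OPTS` in `conf/pulsar_env.sh` (the key values are placeholders):

```bash
# conf/pulsar_env.sh: pass AWS credentials to the broker JVM (placeholder values)
export PULSAR_EXTRA_OPTS="${PULSAR_EXTRA_OPTS} -Daws.accessKeyId=ABC123456789 -Daws.secretKey=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c"
```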

#### Authenticating with GCS

The administrator need configure `gcsManagedLedgerOffloadServiceAccountKeyFile` in `broker.conf` to get GCS service available. It is a Json file, which contains GCS credentials of service account key.

**Contributor:**

The administrator needs to configure gcsManagedLedgerOffloadServiceAccountKeyFile in broker.conf for the broker to be able to access the GCS service. gcsManagedLedgerOffloadServiceAccountKeyFile is a Json file, containing the GCS credentials of a service account.

[This page](https://support.google.com/googleapi/answer/6158849) contains more information of how to create this key file for authentication. You could also get more information regarding google cloud [IAM](https://cloud.google.com/storage/docs/access-control/iam).

**Contributor:**

The Service Accounts section of this page contains more information...
More information about Google Cloud IAMs is available here.


Usually these are the steps to create the authentication file:
1. Open the API Console Credentials page.
2. If it's not already selected, select the project that you're creating credentials for.
3. To set up a new service account, click New credentials and then select Service account key.
4. Choose the service account to use for the key.
5. Choose whether to download the service account's public/private key as a JSON file that can be loaded by a Google API client library.

**Contributor:**

Remove "Choose whether to".


Here is an example:

**Contributor:**

Remove this line.

```conf
gcsManagedLedgerOffloadServiceAccountKeyFile="/Users/jia/Downloads/project-804d5e6a6f33.json"
```

### Configure the read/write block sizes

Pulsar also provides some knobs to configure the size of requests sent to S3/GCS.

- ```s3ManagedLedgerOffloadMaxBlockSizeInBytes``` and ```gcsManagedLedgerOffloadMaxBlockSizeInBytes``` configure the maximum size of a "part" sent during a multipart upload. This cannot be smaller than 5MB. Default is 64MB.
- ```s3ManagedLedgerOffloadReadBufferSizeInBytes``` and ```gcsManagedLedgerOffloadReadBufferSizeInBytes``` configure the block size for each individual read when reading back data from S3/GCS. Default is 1MB.

In both cases, these should not be touched unless you know what you are doing.
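
If they do need to be changed, a sketch with the defaults written out explicitly (64MB and 1MB, in bytes):

```conf
# defaults: 64MB multipart upload parts, 1MB read buffer
s3ManagedLedgerOffloadMaxBlockSizeInBytes=67108864
s3ManagedLedgerOffloadReadBufferSizeInBytes=1048576
gcsManagedLedgerOffloadMaxBlockSizeInBytes=67108864
gcsManagedLedgerOffloadReadBufferSizeInBytes=1048576
```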


## Configuring offload to run automatically

Namespace policies can be configured to offload data automatically once a threshold is reached. The threshold is based on the size of data that the topic has stored on the Pulsar cluster. Once the topic reaches the threshold, an offload operation will be triggered. Setting a negative value to the threshold will disable automatic offloading. Setting the threshold to 0 will cause the broker to offload data as soon as it possibly can.
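
As a sketch of setting such a policy, assuming the `pulsar-admin namespaces set-offload-threshold` command and a hypothetical tenant/namespace:

```bash
# offload automatically once a topic in the namespace stores more than 10M in BookKeeper
bin/pulsar-admin namespaces set-offload-threshold --size 10M my-tenant/my-namespace
```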