add gcs related documents

apache · Jul 13, 2018 · 1fa337f · 1fa337f
1 parent e9d67bc
commit 1fa337f
Showing 1 changed file with 67 additions and 17 deletions.
diff --git a/site/docs/latest/cookbooks/tiered-storage.md b/site/docs/latest/cookbooks/tiered-storage.md
@@ -5,6 +5,8 @@ tags: [admin, tiered-storage]
 
 Pulsar's **Tiered Storage** feature allows older backlog data to be offloaded to long term storage, thereby freeing up space in BookKeeper and reducing storage costs. This cookbook walks you through using tiered storage in your Pulsar cluster.
 
+Tiered storage currently leverage [Apache Jclouds](https://jclouds.apache.org) to supports [S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage/)(GCS for short) for long term storage. And by Jclouds, it is easy to add more [supported](https://jclouds.apache.org/reference/providers/#blobstore-providers) cloud storage provider in the future.
+
 ## When should I use Tiered Storage?
 
 Tiered storage should be used when you have a topic for which you want to keep a very long backlog for a long time. For example, if you have a topic containing user actions which you use to train your recommendation systems, you may want to keep that data for a long time, so that if you change your recommendation algorithm you can rerun it against your full user history.
@@ -17,42 +19,63 @@ A topic in Pulsar is backed by a log, known as a managed ledger. This log is com
 
 The Tiered Storage offloading mechanism takes advantage of this segment oriented architecture. When offloading is requested, the segments of the log are copied, one-by-one, to tiered storage. All segments of the log, apart from the segment currently being written to can be offloaded.
 
-## Amazon S3
+On the broker, the administrator must configure the bucket or credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.
 
-Tiered storage currently supports S3 for long term storage. On the broker, the administrator must configure a S3 bucket and the AWS region where the bucket exists. Offloaded data will be placed into this bucket.
+Pulsar users multi-part objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a life cycle rule your bucket to expire incomplete multi-part upload after a day or two to avoid getting charged for incomplete uploads.
 
-The configured S3 bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.
+## Configuring for S3 and GCS in the broker
 
-Pulsar users multipart objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a lifecycle rule your S3 bucket to expire incomplete multipart upload after a day or two to avoid getting charged for incomplete uploads.
+Offloading is configured in ```broker.conf```. 
 
-### Configuring the broker
+At a minimum, the administrator must configure the driver, the bucket and the authenticating.  There is also some other knobs to configure, like the bucket regions, the max block size in backed storage, etc.
 
-Offloading is configured in ```broker.conf```. 
+### Configure the driver
 
-At a minimum, the user must configure the driver, the region and the bucket.
+Currently we support driver of types: { "S3", "aws-s3", "google-cloud-storage" }, 
+{% include admonition.html type="warning" content="The chars are case ignored for driver's name. "s3" and "aws-s3" are similar, with "aws-s3" you just don't need to define the url of the endpoint because it will know to use `s3.amazonaws.com`." %}
 
 ```conf
 managedLedgerOffloadDriver=S3
-s3ManagedLedgerOffloadRegion=eu-west-3
+```
+
+### Configure the Bucket
+
+On the broker, the administrator must configure the bucket or credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.
+
+- Regarding driver type "S3" or "aws-s3", the administrator should configure `s3ManagedLedgerOffloadBucket`.
+
+```conf
 s3ManagedLedgerOffloadBucket=pulsar-topic-offload
 ```
 
-It is also possible to specify the s3 endpoint directly, using ```s3ManagedLedgerOffloadServiceEndpoint```. This is useful if you are using a non-AWS storage service which provides an S3 compatible API. 
+- While regarding driver type "google-cloud-storage", the administrator should configure `gcsManagedLedgerOffloadBucket`.
+```conf
+gcsManagedLedgerOffloadBucket=pulsar-topic-offload
+```
 
-{% include admonition.html type="warning" content="If the endpoint is specified directly, then the region must _not_ be set." %}
+### Configure the Bucket Region
 
-{% include admonition.html type="warning" content="The broker.conf of all brokers must have the same configuration for driver, region and bucket for offload to avoid data becoming unavailable as topics move from one broker to another." %}
+Bucket Region is the region where bucket located. 
 
-Pulsar also provides some knobs to configure the size of requests sent to S3.
+Regarding AWS S3, the default region is `US East (N. Virginia)`. Page [AWS Regions and Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html) contains more information.
 
-- ```s3ManagedLedgerOffloadMaxBlockSizeInBytes``` configures the maximum size of a "part" sent during a multipart upload. This cannot be smaller than 5MB. Default is 64MB.
-- ```s3ManagedLedgerOffloadReadBufferSizeInBytes``` configures the block size for each individual read when reading back data from S3. Default is 1MB.
+Regarding GCS, buckets are default created in the `us multi-regional location`,  page [Bucket Locations](https://cloud.google.com/storage/docs/bucket-locations) contains more information.
 
-In both cases, these should not be touched unless you know what you are doing.
+- AWS S3 Region example:
 
-{% include admonition.html type="warning" content="The broker must be rebooted for any changes in the configuration to take effect." %}
+```conf
+s3ManagedLedgerOffloadRegion=eu-west-3
+```
 
-### Authenticating with S3
+- GCS Region example:
+
+```conf
+gcsManagedLedgerOffloadRegion=europe-west3
+```
+
+### Configure the Authenticating
+
+#### Authenticating with AWS S3
 
 To be able to access S3, you need to authenticate with S3. Pulsar does not provide any direct means of configuring authentication for S3, but relies on the mechanisms supported by the [DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html).
 
@@ -86,6 +109,33 @@ If you are running in EC2 you can also use instance profile credentials, provide
 
 {% include admonition.html type="warning" content="The broker must be rebooted for credentials specified in pulsar_env to take effect." %}
 
+#### Authenticating with GCS
+
+The administrator need configure `gcsManagedLedgerOffloadServiceAccountKeyFile` in `broker.conf` to get GCS service available.  It is a Json file, which contains GCS credentials of service account key. 
+[This page](https://support.google.com/googleapi/answer/6158849) contains more information of how to create this key file for authentication. You could also get more information regarding google cloud [IAM](https://cloud.google.com/storage/docs/access-control/iam).
+
+Usually these are the steps to create the authentication file:
+1. Open the API Console Credentials page.
+2. If it's not already selected, select the project that you're creating credentials for.
+3. To set up a new service account, click New credentials and then select Service account key.
+4. Choose the service account to use for the key.
+5. Choose whether to download the service account's public/private key as a JSON file that can be loaded by a Google API client library.
+
+Here is an example:
+```conf
+gcsManagedLedgerOffloadServiceAccountKeyFile="/Users/jia/Downloads/project-804d5e6a6f33.json"
+```
+
+### Configure the size of block read/write
+
+Pulsar also provides some knobs to configure the size of requests sent to S3/GCS.
+
+- ```s3ManagedLedgerOffloadMaxBlockSizeInBytes``` and ```gcsManagedLedgerOffloadMaxBlockSizeInBytes``` configures the maximum size of a "part" sent during a multipart upload. This cannot be smaller than 5MB. Default is 64MB.
+- ```s3ManagedLedgerOffloadReadBufferSizeInBytes``` and ```gcsManagedLedgerOffloadReadBufferSizeInBytes``` configures the block size for each individual read when reading back data from S3/GCS. Default is 1MB.
+
+In both cases, these should not be touched unless you know what you are doing.
+
+
 ## Configuring offload to run automatically
 
 Namespace policies can be configured to offload data automatically once a threshold is reached. The threshold is based on the size of data that the topic has stored on the pulsar cluster. Once the topic reaches the threshold, an offload operation will be triggered. Setting a negative value to the threshold will disable automatic offloading. Setting the threshold to 0 will cause the broker to offload data as soon as it possiby can.