GCS offload support(4): add documentations for GCS #2152

Merged: 4 commits merged on Aug 8, 2018

zhaijack
Contributor

This is the fourth part of the Google Cloud Storage (GCS) offload support.
It adds documentation for GCS and is based on PR #2151.

Master Issue: #2067

@zhaijack
Contributor Author

retest this please

@sijie added the doc, type/task, and area/tieredstorage labels on Jul 23, 2018
@sijie added this to the 2.2.0-incubating milestone on Jul 23, 2018
@sijie
Member

sijie commented Jul 23, 2018

@ivankelly can you review this?

@sijie requested a review from ivankelly on July 23, 2018 21:28

Tiered storage currently supports S3 for long term storage. On the broker, the administrator must configure a S3 bucket and the AWS region where the bucket exists. Offloaded data will be placed into this bucket.
Pulsar users multi-part objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a life cycle rule your bucket to expire incomplete multi-part upload after a day or two to avoid getting charged for incomplete uploads.
Contributor

Pulsar uses multi-part objects to upload the segment data.


At a minimum, the user must configure the driver, the region and the bucket.
Currently we support driver of types: { "S3", "aws-s3", "google-cloud-storage" },
Contributor

Don't mention the two variants of S3 in the docs, just "aws-s3" and "google-cloud-storage".


### Configuring the broker
At a minimum, the administrator must configure the driver, the bucket and the authenticating. There is also some other knobs to configure, like the bucket regions, the max block size in backed storage, etc.
Contributor

-> the bucket, and authentication credentials.

-> There are also some other options to configure, ...

For AWS, region is a required configuration. I would guess it's the same for GCS, no?

Contributor Author

@zhaijack commented Jul 25, 2018

Thanks. Both drivers have a default value for the region, so the region is not required for either.
US East (N. Virginia) is the default region for aws-s3.
`us` (a multi-regional location) is the default location for GCS.


Offloading is configured in ```broker.conf```.
### Configure the driver
Contributor

-> Configuring the driver


The configured S3 bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.
## Configuring for S3 and GCS in the broker
Contributor

-> Configuring Tiered Storage in the Broker


At a minimum, the user must configure the driver, the region and the bucket.
Currently we support driver of types: { "S3", "aws-s3", "google-cloud-storage" },
{% include admonition.html type="warning" content="The chars are case ignored for driver's name. "s3" and "aws-s3" are similar, with "aws-s3" you just don't need to define the url of the endpoint because it will know to use `s3.amazonaws.com`." %}
Contributor

-> Driver names are case-insensitive.

Why is there different behaviour between s3 and aws-s3? Surely if the endpoint is defined, it should be used.

Contributor Author

Thanks. With s3, you must provide the endpoint URL; with aws-s3, the endpoint URL is optional.
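
The driver rule described in this exchange can be sketched as a small check over `broker.conf` contents. This is an illustration only, not Pulsar's own validation code: the helper names are hypothetical, and the sketch assumes the driver is selected with a key named `managedLedgerOffloadDriver`, which is not shown in this thread.

```python
# Sketch only: mirrors the rule discussed in the thread, not Pulsar's implementation.
SUPPORTED_DRIVERS = {"s3", "aws-s3", "google-cloud-storage"}


def parse_conf(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf


def check_offload_conf(conf):
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    # Assumption: the driver is chosen via a key named managedLedgerOffloadDriver.
    driver = conf.get("managedLedgerOffloadDriver", "").lower()  # case-insensitive
    if driver not in SUPPORTED_DRIVERS:
        problems.append("unsupported driver: %r" % driver)
    # "s3" needs an explicit endpoint; "aws-s3" does not.
    if driver == "s3" and not conf.get("s3ManagedLedgerOffloadServiceEndpoint"):
        problems.append("driver 's3' requires s3ManagedLedgerOffloadServiceEndpoint")
    return problems


example = """
managedLedgerOffloadDriver=S3
s3ManagedLedgerOffloadRegion=eu-west-3
"""
print(check_offload_conf(parse_conf(example)))
```

In this example the driver is written as `S3`, so the case-insensitive comparison treats it as `s3` and the missing endpoint is reported. Switching the driver to `aws-s3` would make the same configuration pass.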

```
s3ManagedLedgerOffloadRegion=eu-west-3
```

### Configure the Bucket
Contributor

Configuring the Bucket


### Configure the Bucket

On the broker, the administrator must configure the bucket or credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.
Contributor

-> bucket and credentials

For AWS, you should state that region is also required, and should match the region in which the bucket has been created.


On the broker, the administrator must configure the bucket or credentials for the cloud storage service. The configured bucket must exist before attempting to offload. If it does not exist, the offload operation will fail.

- Regarding driver type "S3" or "aws-s3", the administrator should configure `s3ManagedLedgerOffloadBucket`.
Contributor

Again, having S3 and aws-s3 is going to be very confusing. Stick with aws-s3.


Pulsar also provides some knobs to configure the size of requests sent to S3.
Regarding AWS S3, the default region is `US East (N. Virginia)`. Page [AWS Regions and Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html) contains more information.
Contributor

The page is jumping back and forth between GCS and AWS a lot. For a user, this is very confusing. A user is either going to care about GCS or AWS, and not give a fig about the other. So the things you need to get up and running should be grouped together.

In other words, the sections should be

  • Configuring the Driver // s3 or gcs
  • S3
    • configuration of region, bucket and credentials
    • note at end about setting the endpoint explicitly
  • GCS
    • configuration of region, bucket and credentials
  • Extra options
    • for the block size stuff, etc, that people are rarely going to touch

Contributor Author

Thanks. Agreed that most users care about only one of the two types. The current sections follow your original layout, which seems clear from the point of view of each setting: for each setting, S3 is introduced first, then GCS. I would like to keep the current layout.

Most of the settings are similar (except the endpoint), and most of the text in each section explains the meaning of the settings rather than the settings themselves. If S3 and GCS are split, some of the explanation will be duplicated.

Contributor

The user doesn't care if there is duplication. They care how much they need to read to get up and running. And how much of what they read is useful to them. With the current layout, 50% of what they read is useless to them.

@zhaijack
Contributor Author

@ivankelly Thanks for the comments, updated this PR.

@zhaijack
Contributor Author

rerun integration tests

@zhaijack
Contributor Author

@ivankelly , updated again.

@@ -5,6 +5,8 @@ tags: [admin, tiered-storage]

Pulsar's **Tiered Storage** feature allows older backlog data to be offloaded to long term storage, thereby freeing up space in BookKeeper and reducing storage costs. This cookbook walks you through using tiered storage in your Pulsar cluster.

Tiered storage currently leverage [Apache Jclouds](https://jclouds.apache.org) to supports [S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage/)(GCS for short) for long term storage. And by Jclouds, it is easy to add more [supported](https://jclouds.apache.org/reference/providers/#blobstore-providers) cloud storage provider in the future.
Contributor

-> currently uses Apache Jclouds to support Amazon S3.
And by Jclouds.. -> With jclouds, it is easy to add support for more cloud storage providers in the future.

jclouds always seems to write its name in all lowercase.

Pulsar users multipart objects to update the segment data. It is possible that a broker could crash while uploading the data. We recommend you add a lifecycle rule your S3 bucket to expire incomplete multipart upload after a day or two to avoid getting charged for incomplete uploads.
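
The lifecycle recommendation in this paragraph could be expressed as an S3 lifecycle configuration. The following is a minimal sketch, assuming a two-day window: the helper name and rule ID are hypothetical, and the dictionary follows the shape accepted by the S3 lifecycle API (for example by boto3's `put_bucket_lifecycle_configuration`).

```python
import json


def abort_incomplete_uploads_rule(days=2, prefix=""):
    """Build a lifecycle configuration that aborts stale multipart uploads.

    The rule ID is a name chosen here for illustration only.
    """
    return {
        "Rules": [
            {
                "ID": "expire-incomplete-multipart-uploads",
                "Status": "Enabled",
                # An empty prefix applies the rule to every object in the bucket.
                "Filter": {"Prefix": prefix},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


config = abort_incomplete_uploads_rule(days=2)
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, this dictionary could be passed as `LifecycleConfiguration=config` together with the name of the offload bucket.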

### Configuring the broker
## Configuring the driver for "aws-s3" or "google-cloud-storage" in the broker
Contributor

-> Configuring the offload driver


Offloading is configured in ```broker.conf```.

At a minimum, the user must configure the driver, the region and the bucket.
At a minimum, the administrator must configure the driver, the bucket and the authenticating. There is also some other knobs to configure, like the bucket regions, the max block size in backed storage, etc.
Contributor

-> the bucket and authentication credentials.
-> bucket region

At a minimum, the administrator must configure the driver, the bucket and the authenticating. There is also some other knobs to configure, like the bucket regions, the max block size in backed storage, etc.

Currently we support driver of types: { "aws-s3", "google-cloud-storage" },
{% include admonition.html type="warning" content="Driver names are case-insensitive for driver's name. "s3" and "aws-s3" are similar, with "aws-s3" you just don't need to define the url of the endpoint because it is aligned with region, and default is `s3.amazonaws.com`; while with s3, you must provide the endpoint url by `s3ManagedLedgerOffloadServiceEndpoint`." %}
Contributor

"s3" and "aws-s3" ... -> There is a third driver type, "s3", which is identical to "aws-s3", though it requires that you specify an endpoint url using s3ManagedLedgerOffloadServiceEndpoint. This is useful if using a S3 compatible data store, other than AWS.

```

It is also possible to specify the s3 endpoint directly, using ```s3ManagedLedgerOffloadServiceEndpoint```. This is useful if you are using a non-AWS storage service which provides an S3 compatible API.
### Configuring for "aws-s3" driver
Contributor

-> "aws-s3" Driver configuration

```
gcsManagedLedgerOffloadRegion=europe-west3
```

#### Configuring the Authenticating
Contributor

As above, just "Authentication" is enough.


#### Configuring the Authenticating

The administrator need configure `gcsManagedLedgerOffloadServiceAccountKeyFile` in `broker.conf` to get GCS service available. It is a Json file, which contains GCS credentials of service account key.
Contributor

The administrator needs to configure gcsManagedLedgerOffloadServiceAccountKeyFile in broker.conf for the broker to be able to access the GCS service. gcsManagedLedgerOffloadServiceAccountKeyFile is a Json file, containing the GCS credentials of a service account.

#### Configuring the Authenticating

The administrator need configure `gcsManagedLedgerOffloadServiceAccountKeyFile` in `broker.conf` to get GCS service available. It is a Json file, which contains GCS credentials of service account key.
[This page](https://support.google.com/googleapi/answer/6158849) contains more information of how to create this key file for authentication. You could also get more information regarding google cloud [IAM](https://cloud.google.com/storage/docs/access-control/iam).
Contributor

The Service Accounts section of this page contains more information...
More information about Google Cloud IAMs is available here.

2. If it's not already selected, select the project that you're creating credentials for.
3. To set up a new service account, click New credentials and then select Service account key.
4. Choose the service account to use for the key.
5. Choose whether to download the service account's public/private key as a JSON file that can be loaded by a Google API client library.
Contributor

Remove "Choose whether to".

4. Choose the service account to use for the key.
5. Choose whether to download the service account's public/private key as a JSON file that can be loaded by a Google API client library.

Here is an example:
Contributor

Remove this line.
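
Once the JSON key file from the steps discussed above has been downloaded, it may be worth a quick sanity check before pointing `gcsManagedLedgerOffloadServiceAccountKeyFile` at it. The sketch below is an assumption-laden illustration, not part of Pulsar: the helper name is hypothetical, the sample values are placeholders, and the checked fields are the ones commonly present in a Google service account key.

```python
import json

# Fields commonly present in a Google service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(text):
    """Return a list of problems found in the key file contents."""
    try:
        key = json.loads(text)
    except ValueError as exc:
        return ["not valid JSON: %s" % exc]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]
    problems = ["missing field: " + name for name in REQUIRED_FIELDS if not key.get(name)]
    if key.get("type") and key["type"] != "service_account":
        problems.append("type is %r, expected 'service_account'" % key["type"])
    return problems


# Placeholder values for illustration only; a real key file comes from the console.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "offloader@my-project.iam.gserviceaccount.com",
})
print(check_service_account_key(sample))
```

In practice the text would be read from the path configured in `broker.conf` rather than from an inline sample.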

@zhaijack
Contributor Author

@ivankelly , updated again.

Contributor

@ivankelly left a comment

#lgtm

@srkukarni
Contributor

retest this please

@zhaijack
Contributor Author

zhaijack commented Aug 1, 2018

retest this please

@sijie
Member

sijie commented Aug 2, 2018

fyi please hold on merging this until I cut over 2.1 website.

@sijie merged commit 5e1ee37 into apache:master on Aug 8, 2018
sijie added a commit to sijie/pulsar that referenced this pull request Aug 8, 2018
 ### Motivation

There is a syntax error in tiered storage doc introduced by apache#2152.

```
Liquid Exception: Invalid syntax for include tag: type="warning" content="Driver names are case-insensitive for driver's name. There is a third driver type, "s3", which is identical to "aws-s3", though it requires that you specify an endpoint url using `s3ManagedLedgerOffloadServiceEndpoint`. This is useful if using a S3 compatible data store, other than AWS." Valid syntax: {% include file.ext param='value' param2='value' %} in docs/latest/cookbooks/tiered-storage.md
```

 ### Changes

Fix the syntax error.
sijie added a commit that referenced this pull request Aug 8, 2018
sijie added a commit to sijie/pulsar that referenced this pull request Aug 13, 2018