Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(glue): support partition index on tables #17998

Merged
merged 11 commits into from
Dec 29, 2021
Merged

feat(glue): support partition index on tables #17998

merged 11 commits into from
Dec 29, 2021

Conversation

kaizencc
Copy link
Contributor

@kaizencc kaizencc commented Dec 14, 2021

This PR adds support for creating partition indexes on tables via custom resources.
It offers two different ways to create indexes:

// via table definition
const table = new glue.Table(this, 'Table', {
  database,
  bucket,
  tableName: 'table',
  columns,
  partitionKeys,
  partitionIndexes: [{
    indexName: 'my-index',
    keyNames: ['month'],
  }],
  dataFormat: glue.DataFormat.CSV,
});
// or as a function
table.AddPartitionIndex([{
  indexName: 'my-other-index',
  keyNames: ['month', 'year'],
});

I also refactored the format of some tests, which is what accounts for the large diff in test.table.ts.

Motivation:
Creating partition indexes on a table is something you can do via the console, but is not an exposed property in cloudformation. In this case, I think it makes sense to support this feature via custom resources as it will significantly reduce the customer pain of either provisioning a custom resource with correct permissions or manually going into the console after resource creation. Supporting this feature allows for synth-time checks and dependency chaining for multiple indexes (reason detailed in the FAQ) which removes a rather sharp edge for users provisioning custom resource indexes themselves.

FAQ:

Why do we need to chain dependencies between different Partition Index Custom Resources?

  • Because Glue only allows 1 index to be created or deleted simultaneously per table. Without dependencies the resources will try to create partition indexes simultaneously and the second sdk call with be dropped.

Why is it called partitionIndexes? Is that really how you pluralize index?

  • Yesish. If you hate it it can be partitionIndices.

Why is keyNames of type string[] and not Column[]? PartitionKey is of type Column[] and partition indexes must be a subset of partition keys...

  • This could be a debate. But my argument is that the pattern I see for defining a Table is to define partition keys inline and not declare them each as variables. It would be pretty clunky from a UX perspective:
    const key1 = { name: 'mykey', type: glue.Schema.STRING };
    const key2 = { name: 'mykey2', type: glue.Schema.STRING };
    const key3 = { name: 'mykey3', type: glue.Schema.STRING };
    new glue.Table(this, 'table', {
      database,
      bucket,
      tableName: 'table',
      columns,
      partitionKeys: [key1, key2, key3],
      partitionIndexes: [key1, key2],
      dataFormat: glue.DataFormat.CSV,
    });

Why are there 2 different checks for having > 3 partition indexes?

  • It's possible someone decides to define 3 indexes in the definition and then try to add another with table.addPartitionIndex(). This would be a nasty deploy time error, its better if it is synth time. It's also possible someone decides to define 4 indexes in the definition. It's better to fast-fail here before we create 3 custom resources.

What if I deploy a table, manually add 3 partition indexes, and then try to call table.addPartitionIndex() and update the stack? Will that still be a synth time failure?

  • Sorry, no.

Why do we need to generate names?

  • We don't. I just thought it would be helpful.

Why is grantToUnderlyingResources public?

  • I thought it would be helpful. Some permissions need to be added to the table, the database, and the catalog.

Closes #17589.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@gitpod-io
Copy link

gitpod-io bot commented Dec 14, 2021

@github-actions github-actions bot added the @aws-cdk/aws-glue Related to AWS Glue label Dec 14, 2021
@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Dec 14, 2021
@kaizencc kaizencc requested a review from a team December 14, 2021 02:10
Copy link
Contributor

@njlynch njlynch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. A handful of questions and changes, plus a few typos.

packages/@aws-cdk/aws-glue/README.md Outdated Show resolved Hide resolved
packages/@aws-cdk/aws-glue/lib/table.ts Outdated Show resolved Hide resolved
packages/@aws-cdk/aws-glue/lib/table.ts Outdated Show resolved Hide resolved
packages/@aws-cdk/aws-glue/lib/table.ts Outdated Show resolved Hide resolved
packages/@aws-cdk/aws-glue/lib/table.ts Outdated Show resolved Hide resolved
Comment on lines +462 to +464
this.tableArn,
this.database.catalogArn,
this.database.databaseArn,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a source for these permissions? What does glue:UpdateTable mean for a table vs catalog vs database?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly from my own investigations (nothing documented that I could find).

Here's what I get with no permissions:

Received response status [FAILED] from custom resource. Message returned: User:
arn:aws:sts::489318732371:assumed-role/GluePartitionStack-AWS679f53fac002430cb0
da5b7982bd-1GZIV4NJQYJJ1/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-89Z
BzvCbg7Oi is not authorized to perform: glue:UpdateTable on resource: arn:aws:g
lue:us-east-1:489318732371:catalog (RequestId: ab6c2467-9c78-4d23-be59-953e4ab2
3144)

Here's what I get after I add glue:UpdateTable with permissions to the catalog:

Received response status [FAILED] from custom resource. Message returned: User:
arn:aws:sts::489318732371:assumed-role/GluePartitionStack-AWS679f53fac002430cb0
da5b7982bd-7U8A8EMIKP4E/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-nlw0
ulljEHZq is not authorized to perform: glue:UpdateTable on resource: arn:aws:gl
ue:us-east-1:489318732371:database/my_database (RequestId: d4caefa8-30d2-4b02-9
9e4-6b1305ab5aea)

So I then add the same permission to the database and I get:

Received response status [FAILED] from custom resource. Message returned: User:
arn:aws:sts::489318732371:assumed-role/GluePartitionStack-AWS679f53fac002430cb0
da5b7982bd-1N8TEEWC845G7/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-z2X
SZ4Zjykso is not authorized to perform: glue:UpdateTable on resource: arn:aws:g
lue:us-east-1:489318732371:table/my_database/json_table (RequestId: aa73a7ff-0f
f8-4c20-8338-b0f587298d6e)

The only valid permission is for the catalog, database, and table.

@mergify
Copy link
Contributor

mergify bot commented Dec 29, 2021

Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject89A8053A-LhjRyN9kxr8o
  • Commit ID: b98ea3f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@mergify mergify bot merged commit c071367 into aws:master Dec 29, 2021
@mergify
Copy link
Contributor

mergify bot commented Dec 29, 2021

Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

TikiTDO pushed a commit to TikiTDO/aws-cdk that referenced this pull request Feb 21, 2022
This PR adds support for creating partition indexes on tables via custom resources.
It offers two different ways to create indexes:

```ts
// via table definition
const table = new glue.Table(this, 'Table', {
  database,
  bucket,
  tableName: 'table',
  columns,
  partitionKeys,
  partitionIndexes: [{
    indexName: 'my-index',
    keyNames: ['month'],
  }],
  dataFormat: glue.DataFormat.CSV,
});
```

```ts
// or as a function
table.AddPartitionIndex([{
  indexName: 'my-other-index',
  keyNames: ['month', 'year'],
});
```

I also refactored the format of some tests, which is what accounts for the large diff in `test.table.ts`. 

Motivation: 
Creating partition indexes on a table is something you can do via the console, but is not an exposed property in cloudformation. In this case, I think it makes sense to support this feature via custom resources as it will significantly reduce the customer pain of either provisioning a custom resource with correct permissions or manually going into the console after resource creation. Supporting this feature allows for synth-time checks and dependency chaining for multiple indexes (reason detailed in the FAQ) which removes a rather sharp edge for users provisioning custom resource indexes themselves.

FAQ:

Why do we need to chain dependencies between different Partition Index Custom Resources? 
  - Because Glue only allows 1 index to be created or deleted simultaneously per table. Without dependencies the resources will try to create partition indexes simultaneously and the second sdk call with be dropped.

Why is it called `partitionIndexes`? Is that really how you pluralize index?
  - [Yesish](https://www.nasdaq.com/articles/indexes-or-indices-whats-the-deal-2016-05-12). If you hate it it can be `partitionIndices`.

Why is `keyNames` of type `string[]` and not `Column[]`? `PartitionKey` is of type `Column[]` and partition indexes must be a subset of partition keys...
  - This could be a debate. But my argument is that the pattern I see for defining a Table is to define partition keys inline and not declare them each as variables. It would be pretty clunky from a UX perspective:
    ```ts
    const key1 = { name: 'mykey', type: glue.Schema.STRING };
    const key2 = { name: 'mykey2', type: glue.Schema.STRING };
    const key3 = { name: 'mykey3', type: glue.Schema.STRING };
    new glue.Table(this, 'table', {
      database,
      bucket,
      tableName: 'table',
      columns,
      partitionKeys: [key1, key2, key3],
      partitionIndexes: [key1, key2],
      dataFormat: glue.DataFormat.CSV,
    });
    ```

Why are there 2 different checks for having > 3 partition indexes?
  - It's possible someone decides to define 3 indexes in the definition and then try to add another with `table.addPartitionIndex()`. This would be a nasty deploy time error, its better if it is synth time. It's also possible someone decides to define 4 indexes in the definition. It's better to fast-fail here before we create 3 custom resources.

What if I deploy a table, manually add 3 partition indexes, and then try to call `table.addPartitionIndex()` and update the stack? Will that still be a synth time failure?
  - Sorry, no. 

Why do we need to generate names?
  - We don't. I just thought it would be helpful.

Why is `grantToUnderlyingResources` public?
  - I thought it would be helpful. Some permissions need to be added to the table, the database, and the catalog.

Closes aws#17589.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@aws-cdk/aws-glue Related to AWS Glue contribution/core This is a PR that came from AWS.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(glue): add support for partition indexes
3 participants