Commit 3117cd3

Sam Goodwin authored
feat(glue): add L2 resources for Database and Table (#1988)
1 parent a1df717 commit 3117cd3

14 files changed: +76173 -11 lines changed

Lines changed: 185 additions & 0 deletions
@@ -1,2 +1,187 @@
## The CDK Construct Library for AWS Glue
This module is part of the [AWS Cloud Development Kit](https://github.com/awslabs/aws-cdk) project.

### Database

A `Database` is a logical grouping of `Tables` in the Glue Catalog.

```ts
new glue.Database(stack, 'MyDatabase', {
  databaseName: 'my_database'
});
```

By default, an S3 bucket is created and the Database is stored under `s3://<bucket-name>/`, but you can manually specify another location:

```ts
new glue.Database(stack, 'MyDatabase', {
  databaseName: 'my_database',
  locationUri: 's3://explicit-bucket/some-path/'
});
```

### Table

A Glue table describes a table of data in S3: its structure (column names and types), the location of its data (S3 objects with a common prefix in an S3 bucket), and the format of its files (JSON, Avro, Parquet, etc.):

```ts
new glue.Table(stack, 'MyTable', {
  database: myDatabase,
  tableName: 'my_table',
  columns: [{
    name: 'col1',
    type: glue.Schema.string,
  }, {
    name: 'col2',
    type: glue.Schema.array(glue.Schema.string),
    comment: 'col2 is an array of strings' // comment is optional
  }],
  dataFormat: glue.DataFormat.Json
});
```

By default, an S3 bucket will be created to store the table's data, but you can manually pass the `bucket` and `s3Prefix`:

```ts
new glue.Table(stack, 'MyTable', {
  bucket: myBucket,
  s3Prefix: 'my-table/',
  ...
});
```

#### Partitions

To improve query performance, a table can specify `partitionKeys` on which data is stored and queried separately. For example, you might partition a table by `year` and `month` to optimize queries based on a time window:

```ts
new glue.Table(stack, 'MyTable', {
  database: myDatabase,
  tableName: 'my_table',
  columns: [{
    name: 'col1',
    type: glue.Schema.string
  }],
  partitionKeys: [{
    name: 'year',
    type: glue.Schema.smallint
  }, {
    name: 'month',
    type: glue.Schema.smallint
  }],
  dataFormat: glue.DataFormat.Json
});
```

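To make "stored separately" concrete, the sketch below builds the S3 key prefix for one partition using the common Hive-style `<key>=<value>/` layout. This is an illustration only: `partitionPrefix` and `PartitionValue` are hypothetical names that are not part of this module, and Glue does not require this layout, since a partition can be registered at any location.

```typescript
// Hypothetical helper (not part of the glue module): builds the S3 key prefix
// for one partition using Hive-style `<key>=<value>/` path segments.
interface PartitionValue {
  name: string;           // partition key name, e.g. 'year'
  value: string | number; // value of that key for this partition
}

function partitionPrefix(tablePrefix: string, partition: PartitionValue[]): string {
  // Normalize the table prefix so that it ends with exactly one slash.
  const base = tablePrefix.replace(/\/*$/, '/');
  // Append one `<key>=<value>/` segment per partition key, in declaration order.
  return base + partition.map(p => `${p.name}=${p.value}/`).join('');
}

const prefix = partitionPrefix('my-table', [
  { name: 'year', value: 2019 },
  { name: 'month', value: 3 },
]);
console.log(prefix); // my-table/year=2019/month=3/
```

A query that filters on `year` and `month` then only needs to read objects under the matching prefix, which is where the performance gain comes from.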
### [Encryption](https://docs.aws.amazon.com/athena/latest/ug/encryption.html)

You can enable encryption on a Table's data:
* `Unencrypted` - files are not encrypted. This is the default encryption setting.
* [S3Managed](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html) - Server-side encryption (`SSE-S3`) with an Amazon S3-managed key.
```ts
new glue.Table(stack, 'MyTable', {
  encryption: glue.TableEncryption.S3Managed,
  ...
});
```
* [Kms](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) - Server-side encryption (`SSE-KMS`) with an AWS KMS Key managed by the account owner.

```ts
// KMS key is created automatically
new glue.Table(stack, 'MyTable', {
  encryption: glue.TableEncryption.Kms,
  ...
});

// with an explicit KMS key
new glue.Table(stack, 'MyTable', {
  encryption: glue.TableEncryption.Kms,
  encryptionKey: new kms.EncryptionKey(stack, 'MyKey'),
  ...
});
```
* [KmsManaged](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) - Server-side encryption (`SSE-KMS`), like `Kms`, except with an AWS KMS Key managed by the AWS Key Management Service.
```ts
new glue.Table(stack, 'MyTable', {
  encryption: glue.TableEncryption.KmsManaged,
  ...
});
```
* [ClientSideKms](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingClientSideEncryption.html#client-side-encryption-kms-managed-master-key-intro) - Client-side encryption (`CSE-KMS`) with an AWS KMS Key managed by the account owner.
```ts
// KMS key is created automatically
new glue.Table(stack, 'MyTable', {
  encryption: glue.TableEncryption.ClientSideKms,
  ...
});

// with an explicit KMS key
new glue.Table(stack, 'MyTable', {
  encryption: glue.TableEncryption.ClientSideKms,
  encryptionKey: new kms.EncryptionKey(stack, 'MyKey'),
  ...
});
```

*Note: you cannot provide a `Bucket` when creating the `Table` if you wish to use server-side encryption (`Kms`, `KmsManaged` or `S3Managed`)*.

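The rule in the note can be written as a small check. The sketch below is a standalone illustration, not the construct's actual validation code: the `TableEncryption` enum is a local stand-in whose member names mirror the list above, and `checkBucketAndEncryption` is a hypothetical function. The likely rationale, stated here as an assumption, is that server-side encryption is a property of the bucket, which the construct can only configure on a bucket it creates itself.

```typescript
// Local stand-in for the options listed above; the string values are
// placeholders, not necessarily the values used by the real module.
enum TableEncryption {
  Unencrypted = 'Unencrypted',
  S3Managed = 'S3Managed',
  Kms = 'Kms',
  KmsManaged = 'KmsManaged',
  ClientSideKms = 'ClientSideKms',
}

// The three server-side modes named in the note.
const SERVER_SIDE_MODES: TableEncryption[] = [
  TableEncryption.S3Managed,
  TableEncryption.Kms,
  TableEncryption.KmsManaged,
];

// Hypothetical check: rejects an explicit bucket combined with a server-side mode.
function checkBucketAndEncryption(hasExplicitBucket: boolean, encryption: TableEncryption): void {
  if (hasExplicitBucket && SERVER_SIDE_MODES.indexOf(encryption) !== -1) {
    throw new Error(`Cannot combine an explicit bucket with ${encryption} encryption`);
  }
}

// Allowed: no explicit bucket, so a bucket can be created with the right settings.
checkBucketAndEncryption(false, TableEncryption.Kms);
// Allowed: client-side encryption does not depend on bucket configuration.
checkBucketAndEncryption(true, TableEncryption.ClientSideKms);
```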
### Types

A table's schema is a collection of columns, each of which has a `name` and a `type`. Types are recursive structures, consisting of primitive and complex types:

```ts
new glue.Table(stack, 'MyTable', {
  columns: [{
    name: 'primitive_column',
    type: glue.Schema.string
  }, {
    name: 'array_column',
    type: glue.Schema.array(glue.Schema.integer),
    comment: 'array<integer>'
  }, {
    name: 'map_column',
    type: glue.Schema.map(
      glue.Schema.string,
      glue.Schema.timestamp),
    comment: 'map<string,timestamp>'
  }, {
    name: 'struct_column',
    type: glue.Schema.struct([{
      name: 'nested_column',
      type: glue.Schema.date,
      comment: 'nested comment'
    }]),
    comment: "struct<nested_column:date COMMENT 'nested comment'>"
  }],
  ...
});
```

#### Primitive

Numeric:
* `bigint`
* `decimal`
* `float`
* `integer`
* `smallint`
* `tinyint`

Date and Time:
* `date`
* `timestamp`

String Types:
* `string`
* `char`
* `varchar`

Misc:
* `boolean`
* `binary`

#### Complex

* `array` - array of some other type.
* `map` - map of some primitive key type to any value type.
* `struct` - nested structure containing individually named and typed columns.
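
To illustrate what "recursive" means here, the sketch below models a type as a value that knows its own Hive-style type string, so complex types are built by composing the strings of their children. It is a standalone approximation under stated assumptions, not the module's implementation: `Type`, `Column`, `primitive`, `array`, `map` and `struct` are local, hypothetical definitions, and the rendered strings follow the `comment` values used in the example above.

```typescript
// Hypothetical stand-ins, defined locally so that the sketch is self-contained.
interface Type {
  isPrimitive: boolean; // true for leaf types such as string or date
  inputString: string;  // Hive-style type string, e.g. 'array<integer>'
}

interface Column {
  name: string;
  type: Type;
  comment?: string;
}

function primitive(name: string): Type {
  return { isPrimitive: true, inputString: name };
}

// array of some other type
function array(itemType: Type): Type {
  return { isPrimitive: false, inputString: `array<${itemType.inputString}>` };
}

// map of some primitive key type to any value type
function map(keyType: Type, valueType: Type): Type {
  if (!keyType.isPrimitive) {
    throw new Error(`map key type must be primitive, got '${keyType.inputString}'`);
  }
  return { isPrimitive: false, inputString: `map<${keyType.inputString},${valueType.inputString}>` };
}

// nested structure containing individually named and typed columns
function struct(columns: Column[]): Type {
  const fields = columns.map(c => {
    const field = `${c.name}:${c.type.inputString}`;
    return c.comment === undefined ? field : `${field} COMMENT '${c.comment}'`;
  });
  return { isPrimitive: false, inputString: `struct<${fields.join(',')}>` };
}

const arrayType = array(primitive('integer'));
const mapType = map(primitive('string'), primitive('timestamp'));
const structType = struct([{ name: 'nested_column', type: primitive('date'), comment: 'nested comment' }]);

console.log(arrayType.inputString);  // array<integer>
console.log(mapType.inputString);    // map<string,timestamp>
console.log(structType.inputString); // struct<nested_column:date COMMENT 'nested comment'>
```

Because every constructor returns a `Type`, the types nest to any depth, for example `array(map(primitive('string'), arrayType))`.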
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
/**
 * Absolute class name of the Hadoop `InputFormat` to use when reading table files.
 */
export class InputFormat {
  /**
   * An InputFormat for plain text files. Files are broken into lines. Either linefeed or
   * carriage-return is used to signal end of line. Keys are the position in the file, and
   * values are the line of text.
   *
   * @see https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/TextInputFormat.html
   */
  public static readonly TextInputFormat = new InputFormat('org.apache.hadoop.mapred.TextInputFormat');

  constructor(public readonly className: string) {}
}

/**
 * Absolute class name of the Hadoop `OutputFormat` to use when writing table files.
 */
export class OutputFormat {
  /**
   * Writes text data with a null key (value only).
   *
   * @see https://hive.apache.org/javadocs/r2.2.0/api/org/apache/hadoop/hive/ql/io/HiveIgnoreKeyTextOutputFormat.html
   */
  public static readonly HiveIgnoreKeyTextOutputFormat = new OutputFormat('org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat');

  constructor(public readonly className: string) {}
}

/**
 * Serialization library to use when serializing/deserializing (SerDe) table records.
 *
 * @see https://cwiki.apache.org/confluence/display/Hive/SerDe
 */
export class SerializationLibrary {
  /**
   * The JSON SerDe that ships with Hive (HCatalog).
   *
   * @see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-JSON
   */
  public static readonly HiveJson = new SerializationLibrary('org.apache.hive.hcatalog.data.JsonSerDe');

  /**
   * The OpenX JSON SerDe.
   *
   * @see https://github.com/rcongiu/Hive-JSON-Serde
   */
  public static readonly OpenXJson = new SerializationLibrary('org.openx.data.jsonserde.JsonSerDe');

  constructor(public readonly className: string) {}
}

/**
 * Defines the input/output formats and ser/de for a single DataFormat.
 */
export interface DataFormat {
  /**
   * `InputFormat` for this data format.
   */
  inputFormat: InputFormat;

  /**
   * `OutputFormat` for this data format.
   */
  outputFormat: OutputFormat;

  /**
   * Serialization library for this data format.
   */
  serializationLibrary: SerializationLibrary;
}

export namespace DataFormat {
  /**
   * Stored as plain text files in JSON format.
   *
   * Uses OpenX JSON SerDe for serialization and deserialization.
   *
   * @see https://docs.aws.amazon.com/athena/latest/ug/json.html
   */
  export const Json: DataFormat = {
    inputFormat: InputFormat.TextInputFormat,
    outputFormat: OutputFormat.HiveIgnoreKeyTextOutputFormat,
    serializationLibrary: SerializationLibrary.OpenXJson
  };
}
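
Because `DataFormat` is a plain interface and each wrapper class accepts an arbitrary class name, a format other than `Json` can in principle be assembled by hand. The sketch below illustrates this with a CSV format. It makes two assumptions: the wrapper classes are re-declared locally in reduced form so the block compiles on its own, and `Csv` is a hypothetical constant that this commit does not define. The class names themselves are the standard Hadoop and Hive ones.

```typescript
// Reduced local copies of the wrappers from this file, so the sketch is self-contained.
class InputFormat {
  constructor(public readonly className: string) {}
}

class OutputFormat {
  constructor(public readonly className: string) {}
}

class SerializationLibrary {
  constructor(public readonly className: string) {}
}

interface DataFormat {
  inputFormat: InputFormat;
  outputFormat: OutputFormat;
  serializationLibrary: SerializationLibrary;
}

// Hypothetical: a CSV data format. Plain text in and out, as for JSON,
// with only the SerDe swapped for Hive's OpenCSVSerde.
const Csv: DataFormat = {
  inputFormat: new InputFormat('org.apache.hadoop.mapred.TextInputFormat'),
  outputFormat: new OutputFormat('org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'),
  serializationLibrary: new SerializationLibrary('org.apache.hadoop.hive.serde2.OpenCSVSerde'),
};

console.log(Csv.serializationLibrary.className); // org.apache.hadoop.hive.serde2.OpenCSVSerde
```

Wrapping the class names in small classes rather than passing bare strings keeps the three roles from being mixed up at compile time only if the classes differ structurally; with identical shapes, as here, TypeScript treats them as interchangeable, so the separation is mainly documentary.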
