Make Redshift part size configurable. #3053
Conversation
@@ -102,6 +102,14 @@
  "description": "The corresponding secret to the above access key id.",
  "title": "S3 Access Key",
  "airbyte_secret": true
},
/test connector=connectors/destination-redshift
approved % comments
@@ -2,7 +2,7 @@
   "destinationDefinitionId": "f7a7d195-377f-cf5b-70a5-be6b819019dc",
   "name": "Redshift",
   "dockerRepository": "airbyte/destination-redshift",
-  "dockerImageTag": "0.3.0",
+  "dockerImageTag": "0.3.1",
Heads up, you'll need to merge master and do 0.3.2
@@ -27,7 +27,7 @@
 - destinationDefinitionId: f7a7d195-377f-cf5b-70a5-be6b819019dc
   name: Redshift
   dockerRepository: airbyte/destination-redshift
-  dockerImageTag: 0.3.0
+  dockerImageTag: 0.3.1
same here
@@ -63,7 +63,7 @@
   // The smallest part size is 5MB. An S3 upload can be maximally formed of 10,000 parts. This gives
   // us an upper limit of 10,000 * 10 / 1000 = 100 GB per table with a 10MB part size limit.
   // WARNING: Too large a part size can cause potential OOM errors.
-  private static final int PART_SIZE_MB = 10;
+  public static final int DEFAULT_PART_SIZE_MB = 10;
Can we make this default larger? 10MB seems really small.
"part_size": { | ||
"type": "integer", | ||
"minimum": 10, | ||
"maximum": 100, |
Why 100mb maximum? Is that the Redshift maximum?
Do we need to make this configurable? Could we just make the part size really big? What do we lose by doing that?
@jrhizor @cgardens replying to both of you together since you had the same question. The tradeoff is the typical latency vs. throughput one with big data. A bigger part size gives us larger overall throughput (since we can send more data per part) and a larger maximum destination table size; however, it means higher latency, since data is buffered in memory until it reaches the part size before being sent off, i.e. the same sync takes longer to start uploading because it now needs to read more data per part. This also increases memory requirements: a 100 MB part size with 10 upload threads (as currently configured) in theory gives us a minimum requirement of 1 GB of RAM (I haven't tested this; it follows from my theoretical understanding of S3). I figured that even if we made this really large, we'd probably still eventually run into a case where a user requires an even larger part size, and making it configurable is best since users can decide for themselves. For this reason, I think we should keep this configurable.
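To make the memory and table-size numbers above concrete, here is a rough back-of-the-envelope sketch (illustrative only, not code from this PR; the 10-thread count and the one-buffered-part-per-thread behaviour are the assumptions stated in the comment):

public class PartSizeEstimates {

  public static void main(String[] args) {
    final int partSizeMb = 100;    // the proposed maximum part size
    final int uploadThreads = 10;  // upload threads currently configured for the copy
    final int maxS3Parts = 10_000; // S3 allows at most 10,000 parts per multipart upload

    // Each thread buffers roughly one part in memory before sending it,
    // so the buffers alone need on the order of:
    System.out.println("Buffered data: ~" + partSizeMb * uploadThreads + " MB"); // ~1000 MB, i.e. ~1 GB

    // The 10,000-part limit caps the per-table size at part size * 10,000:
    System.out.println("Max table size: ~" + (long) partSizeMb * maxS3Parts / 1000 + " GB"); // ~1000 GB
  }
}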
I think you need to add a section in the Redshift docs explaining this field in more detail. Sounds like there are some serious tradeoffs here, and if you start messing with it you're likely to run into memory limits unless you also adjust the container memory size. We should make it as safe as possible to play with this new value.
"minimum": 10, | ||
"maximum": 100, | ||
"examples": ["10"], | ||
"description": "Files are streamed to S3 in parts. This determines the size of each part, in MBs. As S3 has a limit of 10,000 parts per file, part size affects the table size. This is 10MB by default, resulting in a default limit of 100GB tables. Increase this for larger tables. A larger part size will result in larger memory requirements. Only relevant for COPY.", |
can you add in the description that this is optional? the average user shouldn't need to worry about setting this. dunno if we should make it optional or set a default? but i guess it should also be clear what the default is.
I'll make it optional. The default is 10MB and is handled in the Copier code. Will leave comments and all explaining this.
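For reference, a minimal sketch of how the optional field could be read with a fallback to the default (the class and method names here are made up for illustration; the real handling lives in the copier code):

import com.fasterxml.jackson.databind.JsonNode;

public class PartSizeConfigSketch {

  // Mirrors the default documented in the copier (DEFAULT_PART_SIZE_MB).
  public static final int DEFAULT_PART_SIZE_MB = 10;

  // Returns the user-supplied "part_size" from the destination config when present,
  // otherwise falls back to the 10 MB default so the field can stay optional.
  public static int resolvePartSizeMb(final JsonNode config) {
    final JsonNode partSize = config.get("part_size");
    return partSize == null || partSize.isNull() ? DEFAULT_PART_SIZE_MB : partSize.asInt();
  }
}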
/test connector=destination-redshift
/test connector=connectors/destination-snowflake
/publish connector=connectors/destination-snowflake
/publish connector=connectors/destination-redshift
What
The Redshift Copy strategy currently has its part size set to 10 MB. Since S3 allows a file to be broken into a maximum of 10,000 parts, this results in a 100 GB table limit. A user is trying to sync a 115 GB table and is running into this issue.
This makes the part size configurable so users can increase this size if needed.
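As a rough worked example of the limit described above (illustrative only, not code from this PR):

public class PartLimitExample {

  public static void main(String[] args) {
    final int maxS3Parts = 10_000; // S3 caps a multipart upload at 10,000 parts

    // With the current 10 MB default, the largest table that fits is roughly:
    System.out.println(10 * maxS3Parts / 1000 + " GB"); // 100 GB

    // To fit the ~115 GB table mentioned above, the part size must be at least:
    System.out.println((int) Math.ceil(115_000.0 / maxS3Parts) + " MB"); // 12 MB
  }
}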
How
Pre-merge Checklist
Recommended reading order
spec.json