
Make Redshift part size configurable. #3053

Merged · 9 commits · Apr 27, 2021

Conversation

@davinchia (Contributor) commented Apr 26, 2021

What

The Redshift COPY strategy currently has its part size hardcoded to 10 MB. Since S3 allows a multipart upload to consist of at most 10,000 parts, this results in a 100 GB table limit. A user trying to sync a 115 GB table is running into this limit.

This makes the part size configurable so users can increase this size if needed.

How

  • Add part_size as part of the spec.json.
  • Use part_size if present. Otherwise, use the default of 10 MB.
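
The fallback behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual RedshiftCopier code: the config is shown as a plain `Map` rather than Airbyte's JSON config node, and the method name is made up.

```java
import java.util.Map;
import java.util.Optional;

public class PartSizeConfig {

  // Mirrors DEFAULT_PART_SIZE_MB from the RedshiftCopier diff below.
  static final int DEFAULT_PART_SIZE_MB = 10;

  // Hypothetical helper: return part_size from the parsed connector config
  // if the user set it, otherwise fall back to the 10 MB default.
  static int resolvePartSizeMb(Map<String, Object> config) {
    return Optional.ofNullable((Integer) config.get("part_size"))
        .orElse(DEFAULT_PART_SIZE_MB);
  }

  public static void main(String[] args) {
    System.out.println(resolvePartSizeMb(Map.of()));                // prints 10
    System.out.println(resolvePartSizeMb(Map.of("part_size", 30))); // prints 30
  }
}
```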

Pre-merge Checklist

  • Run integration tests
  • Publish Docker images

Recommended reading order

  1. spec.json
  2. RedshiftCopier

@@ -102,6 +102,14 @@
"description": "The corresponding secret to the above access key id.",
"title": "S3 Access Key",
"airbyte_secret": true
},
@davinchia: [screenshot: Screen Shot 2021-04-26 at 8 16 11 PM]

@davinchia commented Apr 26, 2021

/test connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/785542618
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/785542618

@davinchia davinchia changed the title Davinchia/redshift configure part size Make Redshift part size configurable. Apr 26, 2021
@davinchia: I tested this locally with a part size of 30 MB to confirm this works:

[screenshots: Screen Shot 2021-04-26 at 8 39 16 PM, Screen Shot 2021-04-26 at 8 40 09 PM]

@davinchia davinchia marked this pull request as ready for review April 26, 2021 12:41
@davinchia davinchia removed the request for review from michel-tricot April 26, 2021 12:41
@jrhizor (Contributor) left a comment:

approved % comments

@@ -2,7 +2,7 @@
   "destinationDefinitionId": "f7a7d195-377f-cf5b-70a5-be6b819019dc",
   "name": "Redshift",
   "dockerRepository": "airbyte/destination-redshift",
-  "dockerImageTag": "0.3.0",
+  "dockerImageTag": "0.3.1",
Contributor: Heads up, you'll need to merge master and do 0.3.2

@@ -27,7 +27,7 @@
 - destinationDefinitionId: f7a7d195-377f-cf5b-70a5-be6b819019dc
   name: Redshift
   dockerRepository: airbyte/destination-redshift
-  dockerImageTag: 0.3.0
+  dockerImageTag: 0.3.1
Contributor: same here

@@ -63,7 +63,7 @@
 // The smallest part size is 5 MB. An S3 upload can be formed of at most 10,000 parts. This gives
 // us an upper limit of 10,000 * 10 / 1000 = 100 GB per table with a 10 MB part size limit.
 // WARNING: Too large a part size can cause potential OOM errors.
-private static final int PART_SIZE_MB = 10;
+public static final int DEFAULT_PART_SIZE_MB = 10;
Contributor: Can we make this default larger? 10 MB seems really small.

"part_size": {
"type": "integer",
"minimum": 10,
"maximum": 100,
Contributor: Why a 100 MB maximum? Is that the Redshift maximum?

@cgardens (Contributor): Do we need to make this configurable? Could we just make the part size really big? What do we lose by doing that?

@davinchia commented Apr 26, 2021

@jrhizor @cgardens replying to both of you together since you had the same question. The tradeoff is the typical latency-vs-throughput one with big data. A bigger part size gives us larger overall throughput (since we can send more data overall) and a larger maximum destination table size; however, it means higher latency, since data is buffered in memory until it hits the part size before being sent off, i.e. the same sync takes longer to begin since it now needs to read more data per part. It also increases memory requirements: a 100 MB part size with 10 upload threads (the current configuration) in theory gives us a minimum requirement of 1 GB of RAM (I haven't tested this; it's from my theoretical understanding of S3).
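
The arithmetic behind those numbers can be checked with a small sketch. The 10,000-part S3 limit and the 10-upload-thread figure come from the comments above; the class and method names here are hypothetical.

```java
public class PartSizeMath {

  // S3 multipart uploads are limited to 10,000 parts per object.
  static final int MAX_S3_PARTS = 10_000;

  // Maximum table size in GB for a given part size in MB.
  static long maxTableSizeGb(int partSizeMb) {
    return (long) partSizeMb * MAX_S3_PARTS / 1000;
  }

  // Minimum buffer memory in MB if every upload thread holds one full part.
  static long minBufferMb(int partSizeMb, int uploadThreads) {
    return (long) partSizeMb * uploadThreads;
  }

  public static void main(String[] args) {
    System.out.println(maxTableSizeGb(10));   // 100  -> 100 GB limit at the 10 MB default
    System.out.println(maxTableSizeGb(100));  // 1000 -> 1 TB limit at the 100 MB maximum
    System.out.println(minBufferMb(100, 10)); // 1000 -> ~1 GB RAM, matching the estimate above
  }
}
```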

I figured that even if we made this really large, we'd probably still eventually hit a case where a user requires an even larger part size; making it configurable is best since users can decide for themselves. For this reason, I think we should keep it configurable.

@cgardens (Contributor) left a comment:
I think you need to add a section in the Redshift docs explaining this field in more detail. It sounds like there are some serious tradeoffs here, and if you start messing with it you're likely to run into memory limits unless you also adjust the container memory size. We should make it as safe as possible to play with this new value.

"minimum": 10,
"maximum": 100,
"examples": ["10"],
"description": "Files are streamed to S3 in parts. This determines the size of each part, in MBs. As S3 has a limit of 10,000 parts per file, part size affects the table size. This is 10MB by default, resulting in a default limit of 100GB tables. Increase this for larger tables. A larger part size will result in larger memory requirements. Only relevant for COPY.",
Contributor: Can you add in the description that this is optional? The average user shouldn't need to worry about setting it. Not sure if we should make it optional or set a default, but either way it should be clear what the default is.

@davinchia: I'll make it optional. The default is 10 MB and is handled in the Copier code. I'll leave comments explaining this.
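
Assembled from the spec snippets quoted above, the optional field would presumably end up looking something like the sketch below. This is not the exact merged file; the `title` value is an assumption, and "optional" here simply means the field is omitted from the spec's `required` list.

```json
"part_size": {
  "type": "integer",
  "minimum": 10,
  "maximum": 100,
  "examples": ["10"],
  "title": "Part Size (MB)",
  "description": "Optional. Files are streamed to S3 in parts. This determines the size of each part, in MBs. As S3 has a limit of 10,000 parts per file, part size affects the table size. This is 10MB by default, resulting in a default limit of 100GB tables. Increase this for larger tables. A larger part size will result in larger memory requirements. Only relevant for COPY."
}
```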

@davinchia commented Apr 27, 2021

/test connector=destination-redshift

🕑 destination-redshift https://github.com/airbytehq/airbyte/actions/runs/787866068
✅ destination-redshift https://github.com/airbytehq/airbyte/actions/runs/787866068

@davinchia commented Apr 27, 2021

/test connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/787866608
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/787866608

@davinchia: Manual testing confirms this works:

[screenshot: Screen Shot 2021-04-27 at 12 43 40 PM]

@davinchia commented Apr 27, 2021

/publish connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/787896754
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/787896754

@davinchia commented Apr 27, 2021

/publish connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/787926542
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/787926542

@davinchia davinchia merged commit e63ab84 into master Apr 27, 2021
@davinchia davinchia deleted the davinchia/redshift-configure-part-size branch April 27, 2021 05:35
4 participants