Add configurations to allow tempdir and Redshift cluster to be in different AWS regions #87

Closed
JoshRosen opened this issue Sep 12, 2015 · 12 comments

@JoshRosen

By default, S3 <-> Redshift copies will not work if the S3 bucket and Redshift cluster are in different AWS regions. If you try to use a bucket in a different region, then you get a confusing error message; see https://forums.databricks.com/questions/1963/why-spark-redshift-can-not-write-s3-bucket.html for one example.

Note that it is technically possible to use a bucket in a different region if you pass an extra region parameter to the COPY command; see https://sqlhaven.wordpress.com/2014/09/07/common-errors-of-redshift-copy-command-and-how-to-solve-them-part-1/ for one example of this.
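For concreteness, here is a minimal sketch of what such a cross-region COPY looks like when issued manually over JDBC. This is illustrative only: the cluster endpoint, credentials, table, bucket, and region are placeholders, and it assumes the Redshift JDBC driver is on the classpath.

```scala
import java.sql.DriverManager

// Placeholder connection string; substitute your own cluster endpoint and credentials.
val conn = DriverManager.getConnection(
  "jdbc:redshift://example-cluster:5439/dev?user=user&password=pass")
try {
  // The REGION clause tells Redshift which region the S3 bucket lives in,
  // which is what allows the bucket and the cluster to be in different regions.
  val copySql =
    """COPY example_table
      |FROM 's3://bucket-in-another-region/temp/part-'
      |CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
      |REGION 'us-west-2'""".stripMargin
  conn.createStatement().execute(copySql)
} finally {
  conn.close()
}
```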


~~As a result, I think that we should document this limitation and possibly add some configuration validation to print a better error message when the S3 bucket is in the wrong region.~~

We should add a configuration option so that users can explicitly specify the `tempdir` region to enable cross-region copies.
@JoshRosen

Come to think of it, there are cases where we want to support cross-region transfers. Therefore, we might choose to split this into two separate issues: giving a more informative warning message and giving instructions on how to configure the cross-region UNLOAD command.

As far as I know, there's not an easy way to determine the cluster's region over JDBC, so I don't know that we'd be able to automatically figure out the correct UNLOAD command: http://stackoverflow.com/q/32545040/590203

@JoshRosen JoshRosen modified the milestone: 0.5.1 Sep 15, 2015
@cfeduke commented Sep 18, 2015

Having `aws_region` as a configuration parameter might be enough. That's the first thing I looked for when I ran into this problem. (I created an S3 bucket in Standard to work around it.)

@JoshRosen

That sounds reasonable to me; I was considering doing something similar. I think that `tempdir_aws_region` or `s3_aws_region` might be a slightly clearer configuration name, though.

@JoshRosen

I'm a bit overloaded with other work at the moment and this task isn't part of our current sprint, so it's up for grabs if anyone wants to work on it. I do have time to review / revise small patches to spark-redshift if they won't involve too many rounds of back-and-forth.

@JoshRosen JoshRosen changed the title Document requirement that S3 tempdir and Redshift cluster must be in same AWS region Add configurations to allow tempdir and Redshift cluster to be in different AWS regions Sep 30, 2015
@JoshRosen JoshRosen modified the milestones: 0.5.1, 0.5.2 Oct 5, 2015
@JoshRosen

Now that #35 has been merged, this can be worked around using the new `extracopyoptions` configuration; note, however, that we have not published a release containing that patch yet, so you'd have to build your own version of spark-redshift for this to work.
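For anyone building from source, a minimal sketch of that workaround might look like the following; it assumes a Spark 1.x `sqlContext`, the spark-redshift data source on the classpath, and placeholder values for the JDBC URL, table, bucket, and region.

```scala
import org.apache.spark.sql.SaveMode

// Any DataFrame will do; the input path here is a placeholder.
val df = sqlContext.read.json("/path/to/input.json")

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=user&password=pass")
  .option("dbtable", "example_table")
  .option("tempdir", "s3n://bucket-in-another-region/temp/")
  // Passed through to the generated COPY command, so the cluster reads the
  // staged files from the bucket's region instead of assuming its own.
  .option("extracopyoptions", "REGION 'us-west-2'")
  .mode(SaveMode.ErrorIfExists)
  .save()
```

As the later comments note, this only helps the write path; reads still go through UNLOAD, which does not accept the same clause.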

@JoshRosen JoshRosen modified the milestones: 0.5.2, 0.6.0 Oct 21, 2015
@JoshRosen

`extracopyoptions` doesn't entirely address this issue, since you need to be able to specify the same option during reads as well. Therefore, we should still add a configuration / patch for this.

@JoshRosen

Users: please comment on this thread to vote on this issue if it's important to you. I'd like to implement this but am holding off until I hear about more demand for this feature, since I have limited time to devote to spark-redshift and want to prioritize features.

@tristanreid

While the COPY command seems to support a region, it appears that the UNLOAD command doesn't. From http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html:

> **Important**
> The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.

I'm glad to add this functionality, but it appears it would only be one-way (write-only).

@JoshRosen

If you want this functionality yourself, I would be happy to accept a patch for it; the one-way limitation is fine.

@tristanreid

I was just finishing the unit test for this when I realized that there's already a trivial workaround using the existing `extracopyoptions` option. Maybe it's cleaner to just document the possibility of adding the region there. Should I do that, or open a PR for the distinct option? Either way seems fine to me.

@karanveerm

+1 We have a use case where we pull data from Redshift for several different clients and would prefer to use only a single S3 bucket instead of having an S3 bucket in every region. This would be very helpful.

@JoshRosen

@karanveerm, note the limitation described in #87 (comment):

> While the COPY command seems to support a region, it appears that the UNLOAD command doesn't. From http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html:
>
> **Important**
> The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.

Given this limitation, I don't think we'll be able to support your use case of using a single bucket to pull data from several clients. However, I believe that you could use a single bucket as the staging area for writes by using the `extracopyoptions` technique described upthread.

@JoshRosen JoshRosen added this to the 2.1.0 milestone Oct 19, 2016
@JoshRosen JoshRosen self-assigned this Oct 19, 2016
munk pushed a commit to ActionIQ-OSS/spark-redshift that referenced this issue May 4, 2021
Issue databricks#87: Creating a serializer per row has a really low performance.