Add configurations to allow tempdir and Redshift cluster to be in different AWS regions #87

Closed
JoshRosen opened this issue Sep 12, 2015 · 12 comments

@JoshRosen

By default, S3 <-> Redshift copies will not work if the S3 bucket and Redshift cluster are in different AWS regions. If you try to use a bucket in a different region, then you get a confusing error message; see https://forums.databricks.com/questions/1963/why-spark-redshift-can-not-write-s3-bucket.html for one example.

Note that it is technically possible to use a bucket in a different region if you pass an extra region parameter to the COPY command; see https://sqlhaven.wordpress.com/2014/09/07/common-errors-of-redshift-copy-command-and-how-to-solve-them-part-1/ for one example of this.
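For concreteness, here is a minimal sketch of what such a cross-region COPY looks like when issued manually over JDBC. This is illustrative only: the cluster endpoint, credentials, table, bucket, and region are placeholders, and it assumes the Redshift JDBC driver is on the classpath.

```scala
import java.sql.DriverManager

// Placeholder connection string; substitute your own cluster endpoint and credentials.
val conn = DriverManager.getConnection(
  "jdbc:redshift://example-cluster:5439/dev?user=user&password=pass")
try {
  // The REGION clause tells Redshift which region the S3 bucket lives in,
  // which is what allows the bucket and the cluster to be in different regions.
  val copySql =
    """COPY example_table
      |FROM 's3://bucket-in-another-region/temp/part-'
      |CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
      |REGION 'us-west-2'""".stripMargin
  conn.createStatement().execute(copySql)
} finally {
  conn.close()
}
```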


~~As a result, I think that we should document this limitation and possibly add some configuration validation to print a better error message when the S3 bucket is in the wrong region.~~

We should add a configuration option so that users can explicitly specify the `tempdir` region to enable cross-region copies.
@JoshRosen

Come to think of it, there are cases where we want to support cross-region transfers. Therefore, we might choose to split this into two separate issues: giving a more informative warning message and giving instructions on how to configure the cross-region UNLOAD command.

As far as I know, there's not an easy way to determine the cluster's region over JDBC, so I don't know that we'd be able to automatically figure out the correct UNLOAD command: http://stackoverflow.com/q/32545040/590203

@JoshRosen JoshRosen modified the milestone: 0.5.1 Sep 15, 2015
@cfeduke commented Sep 18, 2015

Having `aws_region` as a configuration parameter might be enough. That's the first thing I looked for when I ran into this problem. (I created an S3 bucket in Standard to work around it.)

@JoshRosen

That sounds reasonable to me; I was considering doing something similar. I think that `tempdir_aws_region` or `s3_aws_region` might be a slightly clearer configuration name, though.

@JoshRosen

I'm a bit overloaded with other work at the moment and this task isn't part of our current sprint, so it's up for grabs if anyone wants to work on it. I do have time to review / revise small patches to spark-redshift if they won't involve too many rounds of back-and-forth.

@JoshRosen JoshRosen changed the title Document requirement that S3 tempdir and Redshift cluster must be in same AWS region Add configurations to allow tempdir and Redshift cluster to be in different AWS regions Sep 30, 2015
@JoshRosen JoshRosen modified the milestones: 0.5.1, 0.5.2 Oct 5, 2015
@JoshRosen

Now that #35 has been merged, this can be worked around using the new `extracopyoptions` configuration; note, however, that we have not published a release containing that patch yet, so you'd have to build your own version of spark-redshift for this to work.
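For anyone building from source, a minimal sketch of that workaround might look like the following; it assumes a Spark 1.x `sqlContext`, the spark-redshift data source on the classpath, and placeholder values for the JDBC URL, table, bucket, and region.

```scala
import org.apache.spark.sql.SaveMode

// Any DataFrame will do; the input path here is a placeholder.
val df = sqlContext.read.json("/path/to/input.json")

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/dev?user=user&password=pass")
  .option("dbtable", "example_table")
  .option("tempdir", "s3n://bucket-in-another-region/temp/")
  // Passed through to the generated COPY command, so the cluster reads the
  // staged files from the bucket's region instead of assuming its own.
  .option("extracopyoptions", "REGION 'us-west-2'")
  .mode(SaveMode.ErrorIfExists)
  .save()
```

As the later comments note, this only helps the write path; reads still go through UNLOAD, which does not accept the same clause.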

@JoshRosen JoshRosen modified the milestones: 0.5.2, 0.6.0 Oct 21, 2015
@JoshRosen

`extracopyoptions` doesn't entirely address this issue, since you need to be able to specify the same option during reads as well. Therefore, we should still add a configuration / patch for this.

@JoshRosen

Users: please comment on this thread to vote on this issue if it's important to you. I'd like to implement this but am holding off until I hear about more demand for this feature, since I have limited time to devote to spark-redshift and want to prioritize features.

@tristanreid

While the COPY command seems to support a region, it appears that the UNLOAD command doesn't. From http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html:

> **Important**
> The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.

I'm glad to add this functionality, but it appears it would only be one-way (write-only).

@JoshRosen

If you want this functionality yourself, I would be happy to accept a patch for it; the one-way limitation is fine.

@tristanreid

I was just finishing the unit test for this when I realized that there's already a trivial workaround using the existing `extracopyoptions` option. Maybe it's cleaner to just document the possibility of adding the region there. Should I do that, or open a PR for the distinct option? Either way seems fine to me.

@karanveerm

+1 We have a use case where we pull data from Redshift for several different clients and would prefer to use only a single S3 bucket instead of having an S3 bucket in every region. This would be very helpful.

@JoshRosen

@karanveerm, note the limitation described in #87 (comment):

> While the COPY command seems to support a region, it appears that the UNLOAD command doesn't. From http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html:
>
> **Important**
> The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.

Given this limitation, I don't think we'll be able to support your use case of using a single bucket to pull data from several clients. However, I believe that you could use a single bucket as the staging area for writes by using the `extracopyoptions` technique described upthread.

@JoshRosen JoshRosen added this to the 2.1.0 milestone Oct 19, 2016
@JoshRosen JoshRosen self-assigned this Oct 19, 2016
munk pushed a commit to ActionIQ-OSS/spark-redshift that referenced this issue May 4, 2021
Issue databricks#87: Creating a serializer per row has a really low performance.