
v3.0.0-preview1

Pre-release
@JoshRosen released this on 01 Nov 00:45

⚠️ Important: If you are using Spark 1.x, then use v1.1.0 instead. 3.x releases of this library are only compatible with Spark 2.x.

This is the first preview release of the 3.x version of this library. This release includes several major performance and security enhancements.

Feedback on the changes in this release is welcome! Please open an issue on GitHub to share your comments.

Upgrading from a 2.x release

Version 3.0 now requires the forward_spark_s3_credentials option to be set explicitly before Spark's S3 credentials will be forwarded to Redshift. Users of the aws_iam_role or temporary_aws_* authentication mechanisms are unaffected by this change. Users who relied on the old default behavior must now explicitly set forward_spark_s3_credentials to true in order to keep using their previous Redshift-to-S3 authentication mechanism. For a discussion of the three authentication mechanisms and their security trade-offs, see the Authenticating to S3 and Redshift section of the README.
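
A minimal sketch of the opt-in, assuming an existing SparkSession named spark and the data source and option names documented in the README (the JDBC URL, table name, and tempdir below are placeholders):

```scala
// Minimal sketch: explicitly opt in to forwarding Spark's S3 credentials to Redshift.
// The JDBC URL, table name, and tempdir are placeholders for your own values.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/db?user=<user>&password=<pass>")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://example-bucket/tmp/")
  .option("forward_spark_s3_credentials", "true") // explicit opt-in required as of 3.0
  .load()
```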

Changes

New Features:

  • Add experimental support for using CSV as an intermediate data format when writing back to Redshift (#73 / #288). This can significantly speed up large writes and also allows saving tables whose column names are not supported by Avro's strict schema validation rules (#84); see the sketch below.
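
A minimal sketch of opting in to the CSV write path, assuming the tempformat option name documented in the README (connection details are placeholders, and an authentication option such as forward_spark_s3_credentials is still required):

```scala
// Minimal sketch: use CSV instead of Avro as the intermediate format for a write.
// The JDBC URL, table name, and tempdir are placeholders for your own values.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/db?user=<user>&password=<pass>")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://example-bucket/tmp/")
  .option("forward_spark_s3_credentials", "true") // or another authentication mechanism
  .option("tempformat", "CSV")                    // default intermediate format is Avro
  .mode("error")
  .save()
```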

Performance enhancements:

  • The read path is now based on Spark 2.0's new FileFormat-based data source, allowing it to benefit from performance improvements in FileScanRDD, such as automatic coalescing of partitions based on input size (#289).

Usability enhancements:

  • Attempt to automatically detect when the Redshift cluster and S3 bucket are in different regions in order to provide more informative error messages (#285).
  • Document this library's support for encrypted load / unload (#189).
  • Document this library's security-related configurations, including an extensive discussion of the different communication channels and data sources and how each may be authenticated and encrypted (#291). Forwarding of Spark credentials to Redshift now requires explicit opt-in; the aws_iam_role mechanism sketched below does not.
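
For comparison, a minimal sketch of the aws_iam_role mechanism, which avoids forwarding Spark's S3 credentials entirely (the role ARN and connection details are placeholders):

```scala
// Minimal sketch: authenticate the Redshift-to-S3 COPY/UNLOAD via an IAM role
// instead of forwarding Spark's S3 keys. All values below are placeholders.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example-cluster:5439/db?user=<user>&password=<pass>")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://example-bucket/tmp/")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/example-redshift-role")
  .load()
```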

Bug fixes:

  • Fix a NumberFormatException which occurred when reading the special floating-point values NaN and Infinity from Redshift (#261 / #269).
  • Pass AWSCredentialsProvider instances instead of AWSCredentials instances in order to avoid expiration of temporary AWS credentials between different steps of the read or write operation (#200 / #284).
  • IAM instance profile authentication no longer requires temporary STS keys (or regular AWS keys) to be explicitly acquired / supplied by user code (#173 / #274, #276 / #277).