
Releases: databricks/spark-redshift

v3.0.0-preview1

01 Nov 00:45
Pre-release

⚠️ Important: If you are using Spark 1.x, then use v1.1.0 instead. 3.x releases of this library are only compatible with Spark 2.x.

This is the first preview release of the 3.x version of this library. This release includes several major performance and security enhancements.

Feedback on the changes in this release is welcome! Please open an issue on GitHub to share your comments.

Upgrading from a 2.x release

Version 3.0 now requires forward_spark_s3_credentials to be set explicitly before Spark's S3 credentials are forwarded to Redshift. Users of the aws_iam_role or temporary_aws_* authentication mechanisms are unaffected by this change. Users who relied on the old default behavior must now explicitly set forward_spark_s3_credentials to true in order to keep using their previous Redshift-to-S3 authentication mechanism. For a discussion of the three authentication mechanisms and their security trade-offs, see the Authenticating to S3 and Redshift section of the README.
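
As a rough illustration, the opt-in is a single data source option. This sketch assumes a Spark 2.x SparkSession named spark; the connection URL, table name, and tempdir are placeholders:

```scala
// Hypothetical read after upgrading to 3.x. Only forward_spark_s3_credentials
// comes from this release; every other value below is a placeholder.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  // Explicitly opt in to the pre-3.0 default of forwarding Spark's S3 credentials:
  .option("forward_spark_s3_credentials", "true")
  .load()
```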

Changes

New Features:

  • Add experimental support for using CSV as an intermediate data format when writing back to Redshift (#73 / #288). This can significantly speed up large writes and also allows saving of tables whose column names are unsupported by Avro's strict schema validation rules (#84).
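
A minimal write-path sketch of the experimental CSV format follows. The option name tempformat is an assumption (these notes do not spell it out), and the URL, table, and tempdir values are placeholders:

```scala
// Sketch only: write back to Redshift using CSV as the intermediate format.
// The "tempformat" option name is assumed; check the README for the exact
// name and accepted values. Avro remains the default.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .option("tempformat", "CSV")
  .mode("error")
  .save()
```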

Performance enhancements:

  • The read path is now based on Spark 2.0's new FileFormat-based data source, allowing it to benefit from performance improvements in FileScanRDD, such as automatic coalescing of partitions based on input size (#289).

Usability enhancements:

  • Attempt to automatically detect when the Redshift cluster and S3 bucket are in different regions in order to provide more informative error messages (#285).
  • Document this library's support for encrypted load / unload (#189).
  • Document this library's security-related configurations, including an extensive discussion of the different communication channels and data sources and how each may be authenticated and encrypted (#291). Forwarding of Spark credentials to Redshift now requires explicit opt-in.

Bug fixes:

  • Fix a NumberFormatException which occurred when reading the special floating-point values NaN and Infinity from Redshift (#261 / #269).
  • Pass AWSCredentialsProvider instances instead of AWSCredentials instances in order to avoid expiration of temporary AWS credentials between different steps of the read or write operation (#200 / #284).
  • IAM instance profile authentication no longer requires temporary STS keys (or regular AWS keys) to be explicitly acquired or supplied by user code (#173 / #274, #276 / #277).

v1.1.0

21 Aug 22:26

⚠️ Important: If you are using Spark 2.x, then use v2.0.1 instead. 1.x releases of this library are only compatible with Spark 1.x.

The 1.1.0 release (which supports Spark 1.x) contains the following changes:

Bug fixes:

  • Provide a clearer error message when attempting to write BinaryType columns to Redshift (#251).
  • Automatically detect the JDBC 4.2 version of the Amazon Redshift JDBC driver (#258 / #259).
  • Restore compatibility with old versions of the AWS Java SDK (#254 / #135). This library now works with versions 1.7.4+ of the AWS Java SDK (and possibly earlier versions, but this has not been tested).

New Features:

  • Support for setting custom JDBC column types (#220)
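
A hedged sketch of the custom column type feature follows. The redshift_type column metadata key is an assumption based on the library's README; the DataFrame and column name are placeholders:

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Sketch: attach a custom Redshift column type via column metadata.
// The "redshift_type" metadata key is assumed; df and "name" are placeholders.
val meta = new MetadataBuilder()
  .putString("redshift_type", "VARCHAR(128)")
  .build()
val dfWithType = df.withColumn("name", df("name").as("name", meta))
```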

v2.0.1

20 Aug 20:27

⚠️ Important: If you are using Spark 1.x, then use v1.1.0 instead. 2.x releases of this library are only compatible with Spark 2.x.

The 2.0.1 release (which is compatible with Spark 2.x) includes the following bug fixes:

  • Provide a clearer error message when attempting to write BinaryType columns to Redshift (#251).
  • Automatically detect the JDBC 4.2 version of the Amazon Redshift JDBC driver (#258 / #259).
  • Restore compatibility with old versions of the AWS Java SDK (#254 / #135). This library now works with versions 1.7.4+ of the AWS Java SDK (and possibly earlier versions, but this has not been tested).

v2.0.0

02 Aug 02:44

This is the first non-preview release of this library which supports Spark 2.0.0+.

This release incorporates all of the changes from the v2.0.0-preview1 release, along with additional bug fixes.

v2.0.0-preview1

18 Jul 21:17
Pre-release

This is the first preview release of this library which supports Spark 2.x previews and release candidates.

A small number of deprecated features have been removed in this release; for a list, see #239.

New Features:

  • Spark 2.0 preview support (#221)
  • Support for setting custom JDBC column types (#220)

v1.0.0

11 Jul 18:24

This is the last planned major release of this library for Spark 1.x.

spark-redshift 1.x releases will remain compatible with Spark 1.4.x through 1.6.x, while spark-redshift 2.0.0+ will support only Spark 2.0.0+.

We will continue to fix minor bugs in 1.x maintenance releases but do not plan to add major new features in the 1.x line.

Bug Fixes:

  • Properly escape backslashes in queries passed to UNLOAD (#215 / #228).
  • Fix loss of sub-second precision when reading timestamps from Redshift (#214 / #227).
  • Fix a bug which led to incorrect results for queries that contained filters with date or timestamp literals (#152 / #156).
  • Fix a bug which broke the use of IAM instance profile credentials (#158 / #159).
  • Use MANIFEST to guard against eventually-consistent S3 bucket listing calls (#151).

Enhancements:

  • The Redshift username and password can now be specified as configuration options rather than being embedded in the URL (#132 / #162); see the sketch after this list. This should fix connectivity issues for users whose Redshift passwords contained non-URL-safe characters.
  • Support for using IAM roles to authorize Redshift <-> S3 connections (#199 / #219).
  • Support for specifying column comments and encodings (#164, #172, #178).
  • The COPY statement issued against Redshift is now logged in order to make debugging easier (#196).
  • Documentation enhancements: #150, #163.
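
A hedged sketch of the new connection options follows, combining the separate user/password options with IAM-role authorization for the Redshift <-> S3 copy. It assumes a Spark 1.x SQLContext named sqlContext; the cluster URL, credentials, and role ARN are placeholders:

```scala
// Sketch: pass the Redshift username and password as options instead of
// embedding them in the JDBC URL, and authorize the Redshift <-> S3 copy
// via an IAM role. All values below are placeholders.
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev")
  .option("user", "my_user")
  .option("password", "p@ss%word") // no URL-encoding required
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .option("aws_iam_role", "arn:aws:iam::123456789000:role/redshift_s3_access")
  .load()
```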

v0.6.0

06 Jan 21:30

Bug Fixes:

  • Properly handle special characters in JDBC connection strings (#132 / #134). This bug affected users whose Redshift passwords contained special characters that were not valid in URLs (e.g. a password containing a percentage-sign (%) character).
  • Restored compatibility with spark-avro 1.0.0 (#111 / #114).
  • Fix bugs related to using the PostgreSQL JDBC driver instead of Amazon's official Redshift JDBC driver (#126, #143, #147). If your classpath contains both the PostgreSQL and Amazon drivers, explicitly specifying a JDBC driver class via the jdbcdriver parameter will now force that driver class to be used (see the sketch after this list).
  • Give a better warning message for non-existent S3 buckets when attempting to read their bucket lifecycle configurations (#138 / #142).
  • Minor documentation fixes: #119, #120, #123, #137.
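
A sketch of pinning the driver class follows. The jdbcdriver option comes from these notes; the driver class shown is Amazon's JDBC 4.1 Redshift driver, and the remaining values are placeholders:

```scala
// Sketch: force a specific JDBC driver class when both the PostgreSQL and
// Amazon Redshift drivers are on the classpath. Values are placeholders.
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=user&password=pass")
  .option("jdbcdriver", "com.amazon.redshift.jdbc41.Driver") // pin Amazon's driver
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .load()
```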

Enhancements:

  • Redshift queries are now cancelled when the thread issuing the query is interrupted (#116 / #117). If you cancel a Databricks notebook shell while it is executing a spark-redshift query, the Spark REPL will no longer crash due to interrupts being swallowed.
  • When writing data back to Redshift, dates are now written in the default Redshift date format (yyyy-MM-dd) rather than a timestamp format (#122 / #130).
  • spark-redshift now implements Spark 1.6's new unhandledFilters API, which allows Spark to eliminate a duplicate layer of filtering for filters that are pushed down to Redshift (#128).

v0.5.2

23 Oct 17:37

spark-redshift 0.5.2 is a maintenance release that contains a handful of important bugfixes. We recommend that all users upgrade to this release.

Bug Fixes:

  • Fixed a thread-safety issue which could lead to errors or data corruption when processing date, timestamp, or decimal columns (#107 / #108).
  • Fixed bugs related to handling of S3 credentials when they are specified as part of the tempdir URL (#109).
  • Fixed a typo in the AWS credentials section of the README: the old text referred to sc.hadoopConfig instead of sc.hadoopConfiguration (#109); the corrected usage is sketched below.
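
For reference, a sketch of the corrected form, which sets S3 credentials on the SparkContext's Hadoop configuration. The fs.s3n.* keys assume the s3n filesystem, and the credential values are placeholders:

```scala
// Sketch: set AWS credentials on sc.hadoopConfiguration (not sc.hadoopConfig).
// Keys assume the s3n filesystem; the values are placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
```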

Enhancements:

  • Added a new extracopyoptions configuration, which allows advanced users to pass additional options to Redshift's COPY command (#35); see the sketch after this list.
  • Added an example of writing data back to Redshift using the SQL language API (#110).
  • Added documentation on how to configure the SparkContext's global hadoopConfiguration from Python (#109).
  • Added a tutorial (#101 and #106).
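
A sketch of passing extra COPY options on write follows. TRUNCATECOLUMNS and COMPUPDATE OFF are ordinary Redshift COPY options chosen for illustration; the other values are placeholders:

```scala
// Sketch: forward additional options to Redshift's COPY command when writing.
// The COPY options shown are illustrative; all other values are placeholders.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .option("extracopyoptions", "TRUNCATECOLUMNS COMPUPDATE OFF")
  .mode("append")
  .save()
```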

v0.5.1

05 Oct 23:17

spark-redshift 0.5.1 is a maintenance release which contains several bugfixes and usability improvements:

  • Improved JDBC quoting and escaping:
    • Column names are now properly quoted when saving tables to Redshift, allowing reserved words or names containing special characters to be used as column names (#80 / #85).
    • Table names that are qualified with schemas (e.g. myschema.mytable) or which contain special characters (such as spaces) are now supported (#97 / #102).
  • Improved dependency handling:
    • spark-redshift no longer has a binary dependency on the hadoop-aws artifact, which caused problems for EMR users (#92 / #94).
    • When using the Redshift JDBC driver, both the JDBC 4.0 and 4.1 versions of the driver can now be used without having to change the default jdbcdriver setting; the proper configuration will be automatically chosen depending on which version of the JDBC driver can be loaded (#83 / #90).
  • Misc. bugfixes:
    • Fix a bug which prevented tables with empty partitions from being saved to Redshift (#96 / #102).
    • Fix spurious exceptions when checking the S3 bucket lifecycle configuration when tempdir points to the root of the bucket (#91 / #95).
    • Fixed a bug in Utils.joinURLs which caused problems for Windows users (#93).