
maxlength metadata not working with pyspark #137

Closed
dokipen opened this issue Dec 18, 2015 · 5 comments

dokipen commented Dec 18, 2015

I made sure the table doesn't exist, then ran the following:

from pyspark.sql import Row

# Build a single-row DataFrame, then try to set maxlength on the column's
# metadata in place; this in-place assignment is what gets ignored on write.
df = sqlCtx.createDataFrame(sc.parallelize([Row(value="a" * 2048)]))
df.schema.fields[0].metadata['maxlength'] = 4096
df.write.format("com.databricks.spark.redshift") \
        .options(url=pgconn,
                 dbtable="tmptable",
                 tempdir=TMPDIR) \
        .save(mode='append')

The maxlength metadata is ignored and the column is created with the character varying(256) type. Any ideas?

JoshRosen (Contributor) commented

Hi @dokipen,

This is a known issue which stems from limitations in PySpark's column metadata APIs.

To change a column's metadata you need to create a new DataFrame that carries the new metadata; modifying the schema in place like this won't work. For this reason, the example in the README uses the DataFrame.withColumn() and Column.as() APIs to create a new DataFrame with updated column metadata, but those APIs are not currently available in the Python API.

It seems like you might be able to set column metadata by using SQLContext.createDataFrame with an explicit schema, but I think that's only a partial solution because it doesn't really address the case where you're trying to save a DataFrame that's been transformed. I suppose you could convert the DataFrame back to an RDD[Row] via .rdd and then call createDataFrame() with that RDD and the explicit schema, but this could be slow because it incurs multiple data format conversions.
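
Roughly, both workarounds would look something like this in PySpark (untested sketch; it reuses the sqlCtx and sc handles from your snippet, and transformed_df is just a hypothetical stand-in for a DataFrame you've already transformed):

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

# Declare the schema up front, attaching the maxlength metadata to the column.
schema = StructType([
    StructField("value", StringType(), nullable=True,
                metadata={"maxlength": 4096})
])

# Workaround 1: create the DataFrame with the explicit schema from the start.
df = sqlCtx.createDataFrame(sc.parallelize([Row(value="a" * 2048)]), schema)

# Workaround 2: for an already-transformed DataFrame, round-trip through its
# RDD so that the explicit schema (and its metadata) is re-applied.
df2 = sqlCtx.createDataFrame(transformed_df.rdd, schema)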

This limitation is documented at the bottom of the "Configuring the maximum size of string columns" section in the README, although I suppose I could put a ⚠️ or ℹ️ emoji to draw more attention to it.

See also: #54 (comment)

I'm open to the idea of adding new spark-redshift APIs for encoding these column length constraints in case you have any suggestions there.

/cc @marmbrus @rxin, FYI.


dokipen commented Dec 21, 2015

Thanks, and sorry I didn't read more carefully. FYI, both workarounds worked.

JoshRosen added a commit that referenced this issue Dec 22, 2015
JoshRosen (Contributor) commented

I have gone ahead and updated the README to make this caveat a little clearer: ed75de1

Therefore, I'm going to close this issue for now. When Spark expands its language support for column metadata operations, I'll be sure to update the README to include examples in other languages.


hrp commented Jul 26, 2016

@dokipen What were the workarounds that worked?


dokipen commented Aug 9, 2016

I don't remember at this point.
