
maxlength metadata not working with pyspark #137

Closed
dokipen opened this issue Dec 18, 2015 · 5 comments

dokipen commented Dec 18, 2015

I made sure the table doesn't exist, then ran the following:

from pyspark.sql import Row

# Build a single-row DataFrame, then try to set maxlength on the column's
# metadata in place; this in-place assignment is what gets ignored on write.
df = sqlCtx.createDataFrame(sc.parallelize([Row(value="a" * 2048)]))
df.schema.fields[0].metadata['maxlength'] = 4096
df.write.format("com.databricks.spark.redshift") \
        .options(url=pgconn,
                 dbtable="tmptable",
                 tempdir=TMPDIR) \
        .save(mode='append')

The maxlength metadata is ignored and the column is created with the character varying(256) type. Any ideas?

JoshRosen (Contributor) commented

Hi @dokipen,

This is a known issue which stems from limitations in PySpark's column metadata APIs.

To change a column's metadata you need to create a new DataFrame that carries the new metadata; modifying the schema in place like this won't work. For this reason, the example in the README uses the DataFrame.withColumn() and Column.as() APIs to create a new DataFrame with updated column metadata, but those APIs are not currently available in the Python API.

It seems like you might be able to set column metadata by using SQLContext.createDataFrame with an explicit schema, but I think that's only a partial solution because it doesn't really address the case where you're trying to save a DataFrame that's been transformed. I suppose you could convert the DataFrame back to an RDD[Row] via .rdd and then call createDataFrame() with that RDD and the explicit schema, but this could be slow because it incurs multiple data format conversions.
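
Roughly, both workarounds would look something like this in PySpark (untested sketch; it reuses the sqlCtx and sc handles from your snippet, and transformed_df is just a hypothetical stand-in for a DataFrame you've already transformed):

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

# Declare the schema up front, attaching the maxlength metadata to the column.
schema = StructType([
    StructField("value", StringType(), nullable=True,
                metadata={"maxlength": 4096})
])

# Workaround 1: create the DataFrame with the explicit schema from the start.
df = sqlCtx.createDataFrame(sc.parallelize([Row(value="a" * 2048)]), schema)

# Workaround 2: for an already-transformed DataFrame, round-trip through its
# RDD so that the explicit schema (and its metadata) is re-applied.
df2 = sqlCtx.createDataFrame(transformed_df.rdd, schema)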

This limitation is documented at the bottom of the "Configuring the maximum size of string columns" section in the README, although I suppose I could put a ⚠️ or ℹ️ emoji to draw more attention to it.

See also: #54 (comment)

I'm open to the idea of adding new spark-redshift APIs for encoding these column length constraints in case you have any suggestions there.

/cc @marmbrus @rxin, FYI.


dokipen commented Dec 21, 2015

Thanks, and sorry I didn't read more carefully. FYI, both workarounds worked.

JoshRosen added a commit that referenced this issue Dec 22, 2015
JoshRosen (Contributor) commented

I have gone ahead and updated the README to make this caveat a little clearer: ed75de1

Therefore, I'm going to close this issue for now. When Spark expands its language support for column metadata operations, I'll be sure to update the README to include examples in other languages.


hrp commented Jul 26, 2016

@dokipen What were the workarounds that worked?


dokipen commented Aug 9, 2016

I don't remember at this point.
