
Spark XML is not saving the dataframe in partitions #327

Closed · ivijay opened this issue Aug 20, 2018 · 11 comments

ivijay commented Aug 20, 2018

I want to save the DataFrame as multiple XML files based on its partitioning column, i.e., one file per partition.

```java
df.write()
  .format("com.databricks.spark.xml")
  .option("rootTag", "items")
  .option("rowTag", "item")
  .mode(org.apache.spark.sql.SaveMode.Overwrite)
  .partitionBy("author")
  .save(filePath);
```

The code above executes without any error, but partitions are not created according to the provided partition column.

BioQwer (Contributor) commented Nov 25, 2018

@ivijay I tried it, and it really does not create partitions. As a quick workaround, you can build a DataFrame for each partition value and save the XMLs yourself, as sketched below.
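A minimal sketch of that workaround in Java, assuming `df` is the source `Dataset<Row>` and `basePath` is the target directory (both hypothetical names): collect the distinct partition values, then filter and write each subset into its own `author=<value>` subdirectory.

```java
import java.util.List;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.col;

// Collect the distinct partition values (assumes the cardinality of
// "author" is small enough to bring to the driver).
List<Row> authors = df.select("author").distinct().collectAsList();

for (Row row : authors) {
    String author = row.getString(0);
    // Filter one partition's rows and write them into a Hive-style
    // subdirectory, mimicking what partitionBy would have produced.
    df.filter(col("author").equalTo(author))
      .write()
      .format("com.databricks.spark.xml")
      .option("rootTag", "items")
      .option("rowTag", "item")
      .mode(SaveMode.Overwrite)
      .save(basePath + "/author=" + author);
}
```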

HyukjinKwon (Member) commented

Thanks, @BioQwer. Let me leave this closed.

nmergia commented Apr 19, 2019

@HyukjinKwon @BioQwer Can you tell me why partitionBy is not a good thing to do? I was using partitionBy with Parquet files and it was working fine, but when requirements changed and I had to write the output as XML, it stopped working. @BioQwer Could you let me know what you mean by "you can build a DataFrame for each partition and save the XMLs yourself"? Because doing foreachPartition on the DataFrame object is not the same as partitionBy on the DataFrameWriter object.
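For reference, this is the Parquet pattern described above; the built-in Parquet sink does honor `partitionBy` and creates one `author=<value>` subdirectory per distinct key (the output path here is illustrative):

```java
import org.apache.spark.sql.SaveMode;

// With the Parquet sink, partitionBy works as expected and produces
// author=<value> subdirectories under the output path.
df.write()
  .mode(SaveMode.Overwrite)
  .partitionBy("author")
  .parquet("/data/out/items_parquet");
```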

srowen (Collaborator) commented Apr 19, 2019

I am not sure partitionBy is implemented for this XML data sink. You could try partitioning the source DataFrame first? It won't give partitioned subdirectories, but it may give separate files per partition key (I haven't tried it).
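A sketch of that suggestion, assuming the same `df` and `filePath` as in the original snippet: repartitioning by the column before writing sends all rows for a given author to a single task, although one part file can still hold several authors that hash to the same partition.

```java
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.col;

// Hash-partition the rows by "author" before writing; the XML sink
// then writes one part file per partition of the shuffled data.
df.repartition(col("author"))
  .write()
  .format("com.databricks.spark.xml")
  .option("rootTag", "items")
  .option("rowTag", "item")
  .mode(SaveMode.Overwrite)
  .save(filePath);
```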

BioQwer (Contributor) commented Apr 20, 2019

@nmergia On 25 Nov 2018 I checked this feature and saw that partitionBy was not working.

As I see it now, spark-xml has gone through a big refactoring.
@srowen, is it fixed by now?

srowen (Collaborator) commented Apr 20, 2019

I don't believe anything changed here. As I said, I don't see that partitionBy ever had an effect for this sink. It doesn't make as much sense given how XML is read.

nmergia commented Apr 20, 2019

@srowen The reason I wanted to use partitionBy in the first place is the subdirectories created by the operation. There are multiple processes that start off after this one, and each only cares about one partition of the data. For now I had to implement a loop that writes each partition out to a different subdirectory based on the partition columns, but if partitionBy were available the process would be much faster and more efficient.

srowen (Collaborator) commented Apr 20, 2019

I think you can just cache the DataFrame and then filter and write whatever subsets you want. I don't think it would be that inefficient, and it is more flexible.
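A short sketch of that approach, with the same hypothetical `df` and `basePath` as above: cache once, so each filtered write scans the cached data instead of recomputing the source.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.col;

df.cache();  // materialize once; the filtered writes below reuse it

// Write whichever subsets downstream consumers need; "someAuthor"
// is a placeholder for a real partition value.
Dataset<Row> subset = df.filter(col("author").equalTo("someAuthor"));
subset.write()
      .format("com.databricks.spark.xml")
      .option("rootTag", "items")
      .option("rowTag", "item")
      .mode(SaveMode.Overwrite)
      .save(basePath + "/author=someAuthor");

df.unpersist();  // release the cache once all subsets are written
```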

HyukjinKwon (Member) commented

BTW, it would be fixed if we moved to Data Source V2. There's nothing we can do on the spark-xml side.

RLashofRegas commented

@srowen @HyukjinKwon Old issue, but I wanted to bump it because I'm having the same problem. The workarounds mentioned are not great because I am trying to optimize for speed: partitioning the dataset prior to writing, caching, looping through, etc. all require a shuffle. I am specifically trying to use partitionBy on the writer to avoid doing a shuffle. Similar to @nmergia, I have downstream systems that will each load only one partition, and there's a requirement for the output to be XML.

srowen (Collaborator) commented Nov 13, 2021

I don't know how to make partitionBy work myself; it's something to do with how this DataSource is implemented. If anyone can take a shot, go ahead.
