Spark XML is not saving the dataframe in partitions #327
@ivijay I tried it.
Thanks, @BioQwer. Let me leave this closed.
@HyukjinKwon @BioQwer Can you tell me why partitionBy is not a good thing to do? I was using partitionBy with Parquet files and it worked fine, but when requirements changed and I had to write the output as XML, it stopped working. @BioQwer Could you let me know what you mean by "as a fast solution you can do dataFrame for each partition and save XMLs by yourself"?
I am not sure partitionBy is implemented for this XML data sink. You could try partitioning the source DataFrame first. It won't give partitioned subdirectories, but it may give separate files per partition key (I haven't tried it).
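A minimal sketch of that suggestion in Java, using the rowTag/rootTag options and the "author" column from this issue. The input path, output path, and class name are placeholders, not from the thread. The key point: repartitioning by the column routes all rows for one key into the same task, and therefore the same output part-file, but unlike partitionBy it does not create `author=<value>/` subdirectories. (Untested sketch; requires a running Spark application with the spark-xml package on the classpath.)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class RepartitionBeforeXmlWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("repartition-xml-sketch")
                .getOrCreate();

        // Placeholder input; any DataFrame with an "author" column works.
        Dataset<Row> df = spark.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", "item")
                .load("books.xml");

        // Repartition so all rows for one author land in the same partition,
        // hence the same part-file. This incurs a shuffle, and the resulting
        // files are not named by key.
        df.repartition(col("author"))
                .write()
                .format("com.databricks.spark.xml")
                .option("rootTag", "items")
                .option("rowTag", "item")
                .mode(SaveMode.Overwrite)
                .save("output/items-by-author");
    }
}
```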
I don't believe anything changed here. As I say, I don't see that partitionBy ever had an effect for this sink. It doesn't make as much sense for how XML is read.
@srowen The reason I wanted to use partitionBy in the first place is the subdirectories the operation creates. Multiple processes start after this one, and each cares about only one partition of the data. For now I had to implement a loop that writes each partition to a different subdirectory based on the partition columns, but if partitionBy were available the process would be much faster and more efficient.
I think you can just cache the DataFrame, and filter / write whatever subsets you want. I don't think it would be that inefficient, and it's more flexible.
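One way the cache-and-filter workaround might look, assuming the "author" column and the write options from this issue; the method name and base path are hypothetical. It reproduces partitionBy's `author=<value>/` directory layout by hand, at the cost of one Spark job per distinct key. (Untested sketch; needs an active SparkSession and the spark-xml package.)

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.col;

public class WritePerAuthor {
    // Writes each author's rows to basePath/author=<value>/ as XML.
    public static void writePerAuthor(Dataset<Row> df, String basePath) {
        df.cache();  // scan the source once; every filter below reuses it

        List<Row> authors = df.select("author").distinct().collectAsList();
        for (Row r : authors) {
            String author = r.getString(0);
            df.filter(col("author").equalTo(author))
              .write()
              .format("com.databricks.spark.xml")
              .option("rootTag", "items")
              .option("rowTag", "item")
              .mode(SaveMode.Overwrite)
              // one subdirectory per key, mimicking partitionBy's layout
              .save(basePath + "/author=" + author);
        }

        df.unpersist();
    }
}
```

Note the partition column value still appears inside each row here, whereas partitionBy would drop it from the data and encode it only in the directory name.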
BTW, it will be fixed if we move to DataSource V2. There's nothing we can do on the spark-xml side.
@srowen @HyukjinKwon Old issue, but I wanted to bump it because I'm having the same problem. The workarounds mentioned are not great because I am trying to optimize for speed: partitioning the dataset prior to writing, caching, looping through, etc. all require a shuffle. I am specifically trying to use
I don't know how to make partitionBy work, myself - something to do with how this DataSource is implemented. If anyone can take a shot, go ahead.
I want to save the DataFrame as multiple XML files based on its partitioning column, i.e., one file per partition.

```java
df.write()
  .format("com.databricks.spark.xml")
  .option("rootTag", "items")
  .option("rowTag", "item")
  .mode(org.apache.spark.sql.SaveMode.Overwrite)
  .partitionBy("author")
  .save(filePath);
```

The code above executes without any error, but partitions are not created according to the provided partition column.