Spark XML is not saving the dataframe in partitions #327
@ivijay I tried it.
Thanks, @BioQwer. Let me leave this closed.
@HyukjinKwon @BioQwer Can you tell me why partitionBy is not a good thing to do? I was using partitionBy with Parquet files and it worked fine, but when requirements changed and I had to write the output as XML, it stopped working. @BioQwer Could you let me know what you mean by "as a fast solution you can do dataFrame for each partition and save XMLs by yourself"?
I am not sure partitionBy is implemented for this XML data sink. You could try partitioning the source DataFrame first. It won't give partitioned subdirectories, but it may give separate files per partition key (I haven't tried it).
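A minimal sketch of that suggestion in Java, using the rowTag/rootTag options and the "author" column from this issue. The input path, output path, and class name are placeholders, not from the thread. The key point: repartitioning by the column routes all rows for one key into the same task, and therefore the same output part-file, but unlike partitionBy it does not create `author=<value>/` subdirectories. (Untested sketch; requires a running Spark application with the spark-xml package on the classpath.)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class RepartitionBeforeXmlWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("repartition-xml-sketch")
                .getOrCreate();

        // Placeholder input; any DataFrame with an "author" column works.
        Dataset<Row> df = spark.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", "item")
                .load("books.xml");

        // Repartition so all rows for one author land in the same partition,
        // hence the same part-file. This incurs a shuffle, and the resulting
        // files are not named by key.
        df.repartition(col("author"))
                .write()
                .format("com.databricks.spark.xml")
                .option("rootTag", "items")
                .option("rowTag", "item")
                .mode(SaveMode.Overwrite)
                .save("output/items-by-author");
    }
}
```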
I don't believe anything changed here. As I say, I don't see that partitionBy ever had an effect for this sink. It doesn't make as much sense for how XML is read.
@srowen The reason I wanted to use partitionBy in the first place is the subdirectories the operation creates. Multiple processes start after this one, and each cares about only one partition of the data. For now I had to implement a loop that writes each partition to a different subdirectory based on the partition columns, but if partitionBy were available the process would be much faster and more efficient.
I think you can just cache the DataFrame, and filter / write whatever subsets you want. I don't think it would be that inefficient, and it's more flexible.
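One way the cache-and-filter workaround might look, assuming the "author" column and the write options from this issue; the method name and base path are hypothetical. It reproduces partitionBy's `author=<value>/` directory layout by hand, at the cost of one Spark job per distinct key. (Untested sketch; needs an active SparkSession and the spark-xml package.)

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.col;

public class WritePerAuthor {
    // Writes each author's rows to basePath/author=<value>/ as XML.
    public static void writePerAuthor(Dataset<Row> df, String basePath) {
        df.cache();  // scan the source once; every filter below reuses it

        List<Row> authors = df.select("author").distinct().collectAsList();
        for (Row r : authors) {
            String author = r.getString(0);
            df.filter(col("author").equalTo(author))
              .write()
              .format("com.databricks.spark.xml")
              .option("rootTag", "items")
              .option("rowTag", "item")
              .mode(SaveMode.Overwrite)
              // one subdirectory per key, mimicking partitionBy's layout
              .save(basePath + "/author=" + author);
        }

        df.unpersist();
    }
}
```

Note the partition column value still appears inside each row here, whereas partitionBy would drop it from the data and encode it only in the directory name.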
BTW, it will be fixed if we move to DataSource V2. There's nothing we can do on the spark-xml side.
@srowen @HyukjinKwon Old issue, but I wanted to bump it because I'm having the same problem. The workarounds mentioned are not great because I am trying to optimize for speed: partitioning the dataset prior to writing, caching, looping through, etc. all require a shuffle. I am specifically trying to use
I don't know how to make partitionBy work, myself - something to do with how this DataSource is implemented. If anyone can take a shot, go ahead.
I want to save the DataFrame as multiple XML files based on its partitioning column, i.e., one file per partition.

```java
df.write()
  .format("com.databricks.spark.xml")
  .option("rootTag", "items")
  .option("rowTag", "item")
  .mode(org.apache.spark.sql.SaveMode.Overwrite)
  .partitionBy("author")
  .save(filePath);
```

The code above executes without any error, but partitions are not created according to the provided partition column.