
I am using the hadoopoffice dependency in my project to export a Hive table to an Excel file #64

Closed
nitinpawarIO opened this issue May 11, 2020 · 20 comments


@nitinpawarIO

I have a few questions regarding the performance and features of this library:

  1. How big can the data be that it converts to Excel? I have a requirement to process data in the GBs.
  2. I tried processing a 50 MB file... it works fine with multiple partitions, but it fails with a timeout error in the case of 1 partition.
  3. Is there any option to format the Excel data (bold, italic, font size) using HadoopOffice?
<dependency>
    <groupId>com.github.zuinnote</groupId>
    <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
    <version>1.3.8</version>
</dependency>

Dataset<Row> dataset2 = spark.read().parquet("C:/tmp/students_data");

System.out.println("Partitions used : " + dataset2.toJavaRDD().getNumPartitions());

dataset2.write()
    .format("org.zuinnote.spark.office.excel")
    .option("spark.write.useHeader", true)
    .option("hadoopoffice.write.locale.bcp47", "us")
    .option("hadoopoffice.write.lowFootprint", true) // low footprint mode that allows writing spreadsheet files with a low CPU and memory footprint; for Excel this is based on the Apache POI event API
    .save("F:\\JAVA\\Big Data\\StudentExcel_3");
@jornfranke
Member

Thanks for your report.
ad 1) There is technically no limit in the HadoopOffice library, but Excel has an internal limit of max. 1 million rows (1,048,576), see also https://support.office.com/en-us/article/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
ad 2) The original file is Parquet? How many rows does it have (see also 1)? A timeout is normally not an issue related to HadoopOffice, but to the Spark setup. Do you use a cluster? What is the setup? Do you give enough memory to the executors? Despite low footprint mode, it does not come for free.
ad 3) There is the possibility to use a template Excel file (but it does not work in lowFootprint mode): https://github.com/ZuInnoTe/hadoopoffice/wiki/Hadoop-File-Format#templates The idea is that you preformat the Excel file, use it as a template, and let HadoopOffice insert the data (a sketch follows below).
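A minimal sketch of the template approach with the Spark data source (illustrative, not from this thread; the option name hadoopoffice.write.template.file follows the wiki page linked above, and the file paths are placeholders):

// Hedged sketch: write a DataFrame using a preformatted Excel file as a template.
// Assumption: "hadoopoffice.write.template.file" is the option name per the wiki;
// note that templates do not work together with lowFootprint mode.
dataset2.write()
    .format("org.zuinnote.spark.office.excel")
    .option("hadoopoffice.write.locale.bcp47", "us")
    .option("hadoopoffice.write.template.file", "C:/tmp/MyTemplate.xlsx") // preformatted headers, fonts, colors
    .save("C:/tmp/StudentExcelFromTemplate");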

@jornfranke
Member

Another question: How do you change the number of partitions? Do you use repartition or coalesce?

@nitinpawarIO
Author

Thanks for the quick response @jornfranke :)

  1. Yes, the file format is Parquet and all 3 files have 2 million rows.
    I am not using a cluster; I am running the program in Eclipse on Windows with 8 GB RAM and an Intel i3 2-core processor.
    HivetableToExcel.txt

pom.txt

  • When I use low footprint mode the file gets processed within a few seconds (~30 seconds), but without low footprint mode there is no output even after waiting for 10 minutes, and then it throws an exception.
  • I am using coalesce to change the number of partitions.

Kindly suggest the best way to use HadoopOffice for my requirement below:

  • Need to process files of GB size
  • Need to apply text formatting at least to the column headers (bold / color / size)
  • Need all Hive table data in one Excel file (considering the record limit of 1 million per sheet); one Excel document with multiple sheets holding all the data will do the job.
  • Without low footprint mode my program does not produce any output; it breaks in between.

I am attaching my source files for reference. Kindly suggest the best approach; I would really appreciate it if you could provide a reference link with examples that suit my requirement.

@jornfranke
Member

I think the problem could be that the executors do not have enough memory, but this is probably not due to the HadoopOffice library; the coalesce to one is the issue. You would probably have exactly the same issue when writing the output to a Parquet file with coalesce to one. Can you please try to increase the executor memory (see the sketch below)? Can you also check the executor logfiles of Spark? The issue without low footprint mode could also be related to the memory allocation.
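For reference, a hedged sketch of where those memory settings live (standard Spark configuration, not specific to HadoopOffice; the app name and sizes are placeholders). Note that in local mode the driver and executors share a single JVM, so the driver memory is the relevant knob and must be set before the JVM starts (e.g. via -Xmx or --driver-memory in the run configuration), not programmatically afterwards:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("HiveToExcel")                // placeholder name
    .master("local[*]")
    .config("spark.driver.memory", "4g")   // in local mode: only effective if set before JVM start
    .config("spark.executor.memory", "4g") // relevant on a real cluster
    .getOrCreate();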

@jornfranke
Member

Additionally: Is there other software running on the machine? Maybe antivirus software, etc.?
Eclipse on Windows already consumes a lot of memory by itself, before any other running software.
Make sure that you also assign the right amount of memory to the executors and the driver, and check the Spark logfiles (driver and executor) to see errors/issues.

@nitinpawarIO
Author

OK. Thanks for the details. I will check and confirm.

Could you please help with a parameter configuration for creating multiple sheets in the Excel file once it exceeds the limit of max rows per sheet? I am getting a max rows limit error:
2020-05-14 10:09:53,622 INFO [Executor task launch worker for task 1] common.OfficeWriter (OfficeWriter.java:(72)) - Storing new Excel file ,xlsx in low footprint mode
2020-05-14 10:10:25,053 ERROR [Executor task launch worker for task 1] util.Utils (Logging.scala:logError(91)) - Aborting task
java.lang.IllegalArgumentException: Invalid row number (1048576) outside allowable range (0..1048575)

@jornfranke
Member

Great. I see. So in Spark simple mode of the HadoopOffice library only one sheet is possible (we can add a feature for multiple sheets, but we need a couple of days). In Spark non-simple mode, where you need to describe the cells yourself, you can add any number of sheets, as you specify per cell in which sheet it lands. The latter requires more complex logic on your side.
I will check how to also add the first option, but it will take a couple of days.

@nitinpawarIO
Author

Hello @jornfranke ,
Thanks for the quick response.
OK, really appreciate your efforts. Will wait for the update on writing a huge dataset to multiple sheets in the same Excel file.

I haven't found the example below useful for creating multiple sheets: it simply writes multiple DataFrames to a single Excel sheet.
https://github.com/ZuInnoTe/hadoopoffice/blob/master/examples/scala-spark2-excel-out-ds/src/main/scala/org/zuinnote/spark/office/example/excel/SparkScalaExcelOutDataSource.scala

@nitinpawarIO
Author

nitinpawarIO commented May 14, 2020

Hi @jornfranke ,
Can we also add the feature of using an Excel template in low footprint mode? That would add value to the functionality.

  • org.zuinnote.hadoop.office.format.common.writer.OfficeWriterException: Templates are not supported in low footprint write mode

@jornfranke
Member

> Hi @jornfranke ,
> Can we also add the feature of using an Excel template in low footprint mode? That would add value to the functionality.
>
>   • org.zuinnote.hadoop.office.format.common.writer.OfficeWriterException: Templates are not supported in low footprint write mode

This is more difficult to achieve due to the Excel format: it is very memory-inefficient. In low footprint mode we cannot keep all rows in memory, and thus the formatting from the template cannot be applied correctly. Out of curiosity: what formatting are you trying to apply? Maybe one could make it work, but it requires extensive testing etc.

You can always use the normal processing mode and provide more memory.

@jornfranke
Member

> Hello @jornfranke ,
> Thanks for the quick response.
> OK, really appreciate your efforts. Will wait for the update on writing a huge dataset to multiple sheets in the same Excel file.
>
> I haven't found the example below useful for creating multiple sheets: it simply writes multiple DataFrames to a single Excel sheet.
> https://github.com/ZuInnoTe/hadoopoffice/blob/master/examples/scala-spark2-excel-out-ds/src/main/scala/org/zuinnote/spark/office/example/excel/SparkScalaExcelOutDataSource.scala

In the example: yes. But as you can see, you can specify the sheet for each cell, so you can implement logic that automatically distributes rows to a new sheet once the number of rows exceeds the limit. This will, though, only be a workaround until I have implemented the configurable option (a sketch follows below).
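A rough Java sketch of that per-cell distribution (illustrative; it assumes the SpreadSheetCellDAO field order used in the linked Scala example - formattedValue, comment, formula, address, sheetName - and the toCell helper and sheet naming are hypothetical):

import org.zuinnote.hadoop.office.format.common.dao.SpreadSheetCellDAO;

static final long MAX_ROWS_PER_SHEET = 1000000L; // stay below Excel's limit of 1,048,576 rows

// Map an absolute row index of the dataset to a cell that lands on the right sheet.
static SpreadSheetCellDAO toCell(String value, long rowIndex, int columnIndex) {
    long sheetNumber = rowIndex / MAX_ROWS_PER_SHEET; // which sheet this row belongs to
    long rowOnSheet = rowIndex % MAX_ROWS_PER_SHEET;  // position within that sheet
    // Excel addresses are 1-based, e.g. "A1"; this simple helper only covers columns A..Z
    String address = (char) ('A' + columnIndex) + Long.toString(rowOnSheet + 1);
    return new SpreadSheetCellDAO(value, "", "", address, "Sheet" + (sheetNumber + 1));
}

A Dataset of such cells (e.g. built via zipWithIndex on the source rows) can then be written with the org.zuinnote.spark.office.excel format as in the linked example.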

@nitinpawarIO
Author

> Hello @jornfranke ,
> Thanks for the quick response.
> OK, really appreciate your efforts. Will wait for the update on writing a huge dataset to multiple sheets in the same Excel file.
> I haven't found the example below useful for creating multiple sheets: it simply writes multiple DataFrames to a single Excel sheet.
> https://github.com/ZuInnoTe/hadoopoffice/blob/master/examples/scala-spark2-excel-out-ds/src/main/scala/org/zuinnote/spark/office/example/excel/SparkScalaExcelOutDataSource.scala
>
> In the example: yes. But as you can see, you can specify the sheet for each cell, so you can implement logic that automatically distributes rows to a new sheet once the number of rows exceeds the limit. This will, though, only be a workaround until I have implemented the configurable option.

Hi @jornfranke ,
How can I incorporate this into my code below? I am reading data from a Hive table and want to write 1000 records per sheet. If one sheet exceeds 1k records, it should write to a new sheet in the same document.

Dataset<Row> excelDataset = spark.sql("select * from (SELECT *, ROW_NUMBER() over (Order by student_id) as rowid "
    + "FROM students_data1 ) t where rowid > 0 and rowid <= 1000");

//System.out.println("Partitions used : " + dataset2.toJavaRDD().getNumPartitions());

System.out.println(new Date().toString() + " : Excel file writing STARTED....");

excelDataset.write()
    .format("org.zuinnote.spark.office.excel")
    .option("write.spark.defaultsheetname", "MyData")
    //.option("hadoopoffice.write.mimeType", "application/vnd.ms-excel") // write in .xls format; row limit is 65K
    .option("hadoopoffice.write.header.write", true)
    .option("hadoopoffice.write.locale.bcp47", "us")
    .option("hadoopoffice.write.template.file", "F:/JAVA/Big Data/HiveImportTemplate.xlsx")
    //.option("hadoopoffice.write.lowFootprint", true) // low footprint mode; for Excel this is based on the Apache POI event API
    .mode("append")
    .save("F:\\JAVA\\Big Data\\output\\StudentExcel_16");

@jornfranke
Member

The issue is that Hadoop/HDFS (even when used with your local filesystem) does not allow editing existing documents. What you could do (see the sketch below):

  • Write each sheet into a separate file
  • Put the files in the same location
  • Point HadoopOffice to the folder and open the files with simple mode = false
  • Write the files with simple mode = false and coalesce to one Excel file
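A hedged sketch of the last two steps (illustrative; the read option name is assumed to be read.spark.simpleMode per the HadoopOffice documentation, so please verify it, and the paths are placeholders):

// Read all per-sheet Excel files from one folder at cell level (simple mode off) ...
Dataset<Row> allSheets = spark.read()
    .format("org.zuinnote.spark.office.excel")
    .option("read.spark.simpleMode", false) // assumption: option name per the wiki
    .option("hadoopoffice.read.locale.bcp47", "us")
    .load("F:/JAVA/Big Data/output/sheets/");

// ... and write them back out as a single Excel file.
allSheets.coalesce(1).write()
    .format("org.zuinnote.spark.office.excel")
    .option("hadoopoffice.write.locale.bcp47", "us")
    .save("F:/JAVA/Big Data/output/CombinedExcel");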

@nitinpawarIO
Author

Yes, the approach looks good; however, it will still write data to one sheet up to the Excel record limit and then throw an error, and we are back to the initial issue :).
Let's wait for the addition of the multiple-sheets feature.

@jornfranke
Member

It will come soon.

As a workaround you can automatically create one DataFrame per x rows and then write them to different sheets. Then combine the sheets as stated above (see the sketch below).
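A hedged Java sketch of that chunking workaround, along the lines of the ROW_NUMBER query already used above (illustrative; the chunk size, sheet names, column name, and paths are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.*;

long chunkSize = 1000000L; // rows per sheet/file, below the Excel limit
Dataset<Row> numbered = excelDataset
    .withColumn("rowid", row_number().over(Window.orderBy("student_id")));
long total = numbered.count();
for (long start = 0; start < total; start += chunkSize) {
    long sheetNo = start / chunkSize + 1;
    numbered.filter(col("rowid").gt(start).and(col("rowid").leq(start + chunkSize)))
        .drop("rowid")
        .write()
        .format("org.zuinnote.spark.office.excel")
        .option("write.spark.defaultsheetname", "Sheet" + sheetNo)
        .option("hadoopoffice.write.locale.bcp47", "us")
        .save("F:/JAVA/Big Data/output/sheets/part_" + sheetNo);
}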

@jornfranke
Member

Can you please check 1.3.9? It has support for automated spillover, see #65

@nitinpawarIO
Author

Thanks for the update @jornfranke ,
Really appreciate your efforts towards this.

Spillover rows are getting written to a new sheet.
Query:
Does the defined template get applied to the spillover sheets as well?

Observation:
The log below is now printed for every spillover row, increasing the logfile size.

2020-05-25 16:43:05,774 INFO [Executor task launch worker for task 2] msexcel.MSExcelWriter (MSExcelWriter.java:write(280)) - Maximum number of rows reached. Spilling over to additional sheet
2020-05-25 16:43:05,774 INFO [Executor task launch worker for task 2] msexcel.MSExcelWriter (MSExcelWriter.java:write(280)) - Maximum number of rows reached. Spilling over to additional sheet
2020-05-25 16:43:05,778 INFO [Executor task launch worker for task 2] msexcel.MSExcelWriter (MSExcelWriter.java:write(280)) - Maximum number of rows reached. Spilling over to additional sheet
2020-05-25 16:43:05,778 INFO [Executor task launch worker for task 2] msexcel.MSExcelWriter (MSExcelWriter.java:write(280)) - Maximum number of rows reached. Spilling over to additional sheet
...

@jornfranke
Member

jornfranke commented May 25, 2020 via email

@jornfranke
Member

Please check 1.3.10 - it removes the excessive logging.

@jornfranke
Member

Also check out 1.5.0 (note: the config options have changed naming!). If there is still an issue, then please open a new issue.
