
I am using the hadoopoffice dependency in my project to export a Hive table to an Excel file #64

Closed
nitinpawarIO opened this issue May 11, 2020 · 20 comments


@nitinpawarIO

I have a few questions regarding the performance and features of this library:

  1. How big can the data be that it converts to Excel? I have a requirement to process data in the GBs.
  2. I tried processing a 50 MB file... it works fine with multiple partitions, but it fails with a timeout error in the case of 1 partition.
  3. Is there any option to format the Excel data (bold, italic, font size) using HadoopOffice?
<dependency>
    <groupId>com.github.zuinnote</groupId>
    <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
    <version>1.3.8</version>
</dependency>

Dataset<Row> dataset2 = spark.read().parquet("C:/tmp/students_data");

System.out.println("Partitions used : " + dataset2.toJavaRDD().getNumPartitions());

dataset2.write()
    .format("org.zuinnote.spark.office.excel")
    .option("spark.write.useHeader", true)
    .option("hadoopoffice.write.locale.bcp47", "us")
    .option("hadoopoffice.write.lowFootprint", true) // low footprint mode that allows writing spreadsheet files with a low CPU and memory footprint; for Excel this is based on the Apache POI event API
    .save("F:\\JAVA\\Big Data\\StudentExcel_3");
@jornfranke
Member

Thanks for your report.
ad 1) There is technically no limit in the HadoopOffice library, but Excel has an internal limit of max. 1 million rows (1,048,576), see also https://support.office.com/en-us/article/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
ad 2) The original file is Parquet? How many rows does it have (see also 1)? A timeout is normally not an issue related to HadoopOffice, but to the Spark setup. Do you use a cluster? What is the setup? Do you give enough memory to the executors? Despite low footprint mode, it does not come for free.
ad 3) There is the possibility to use a template Excel file (but it does not work in lowFootprint mode): https://github.com/ZuInnoTe/hadoopoffice/wiki/Hadoop-File-Format#templates The idea is that you preformat the Excel file, use it as a template, and let HadoopOffice insert the data (a sketch follows below).
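A minimal sketch of the template approach with the Spark data source (illustrative, not from this thread; the option name hadoopoffice.write.template.file follows the wiki page linked above, and the file paths are placeholders):

// Hedged sketch: write a DataFrame using a preformatted Excel file as a template.
// Assumption: "hadoopoffice.write.template.file" is the option name per the wiki;
// note that templates do not work together with lowFootprint mode.
dataset2.write()
    .format("org.zuinnote.spark.office.excel")
    .option("hadoopoffice.write.locale.bcp47", "us")
    .option("hadoopoffice.write.template.file", "C:/tmp/MyTemplate.xlsx") // preformatted headers, fonts, colors
    .save("C:/tmp/StudentExcelFromTemplate");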

@jornfranke
Member

Another question: How do you change the number of partitions? Do you use repartition or coalesce?

@nitinpawarIO
Author

Thanks for the quick response @jornfranke :)

  1. Yes, the file format is Parquet and all 3 files have 2 million rows.
    I am not using a cluster; I am running the program in Eclipse on Windows with 8 GB RAM and an Intel i3 2-core processor.
    HivetableToExcel.txt

pom.txt

  • When I use low footprint mode the file gets processed within a few seconds (~30 seconds), but without low footprint mode there is no output even after waiting for 10 minutes, and then it throws an exception.
  • I am using coalesce to change the number of partitions.

Kindly suggest the best way to use HadoopOffice for my requirement below:

  • Need to process files of GB size
  • Need to apply text formatting at least to the column headers (bold / color / size)
  • Need all Hive table data in one Excel file (considering the record limit of 1 million per sheet); one Excel document with multiple sheets holding all the data will do the job.
  • Without low footprint mode my program does not produce any output; it breaks in between.

I am attaching my source files for reference. Kindly suggest the best approach; I would really appreciate it if you could provide a reference link with examples that suit my requirement.

@jornfranke
Member

I think the problem could be that the executors do not have enough memory, but this is probably not due to the HadoopOffice library; the coalesce to one is the issue. You would probably have exactly the same issue when writing the output to a Parquet file with coalesce to one. Can you please try to increase the executor memory (see the sketch below)? Can you also check the executor logfiles of Spark? The issue without low footprint mode could also be related to the memory allocation.
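For reference, a hedged sketch of where those memory settings live (standard Spark configuration, not specific to HadoopOffice; the app name and sizes are placeholders). Note that in local mode the driver and executors share a single JVM, so the driver memory is the relevant knob and must be set before the JVM starts (e.g. via -Xmx or --driver-memory in the run configuration), not programmatically afterwards:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("HiveToExcel")                // placeholder name
    .master("local[*]")
    .config("spark.driver.memory", "4g")   // in local mode: only effective if set before JVM start
    .config("spark.executor.memory", "4g") // relevant on a real cluster
    .getOrCreate();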

@jornfranke
Member

Additionally: Is there other software running on the machine? Maybe antivirus software, etc.?
Eclipse on Windows already consumes a lot of memory by itself, before any other running software.
Make sure that you also assign the right amount of memory to the executors and the driver, and check the Spark logfiles (driver and executor) to see errors/issues.

@nitinpawarIO
Author

OK. Thanks for the details. I will check and confirm.

Could you please help with a parameter configuration for creating multiple sheets in the Excel file once it exceeds the limit of max rows per sheet? I am getting a max rows limit error:
2020-05-14 10:09:53,622 INFO [Executor task launch worker for task 1] common.OfficeWriter (OfficeWriter.java:(72)) - Storing new Excel file ,xlsx in low footprint mode
2020-05-14 10:10:25,053 ERROR [Executor task launch worker for task 1] util.Utils (Logging.scala:logError(91)) - Aborting task
java.lang.IllegalArgumentException: Invalid row number (1048576) outside allowable range (0..1048575)

@jornfranke
Member

Great. I see. So in Spark simple mode of the HadoopOffice library only one sheet is possible (we can add a feature for multiple sheets, but we need a couple of days). In Spark non-simple mode, where you need to describe the cells yourself, you can add any number of sheets, as you specify per cell in which sheet it lands. The latter requires more complex logic on your side.
I will check how to also add the first option, but it will take a couple of days.

@nitinpawarIO
Author

Hello @jornfranke ,
Thanks for the quick response.
OK, really appreciate your efforts. Will wait for the update on writing a huge dataset to multiple sheets in the same Excel file.

I haven't found the example below useful for creating multiple sheets: it simply writes multiple DataFrames to a single Excel sheet.
https://github.com/ZuInnoTe/hadoopoffice/blob/master/examples/scala-spark2-excel-out-ds/src/main/scala/org/zuinnote/spark/office/example/excel/SparkScalaExcelOutDataSource.scala

@nitinpawarIO
Author

nitinpawarIO commented May 14, 2020

Hi @jornfranke ,
Can we also add the feature of using an Excel template in low footprint mode? That would add value to the functionality.

  • org.zuinnote.hadoop.office.format.common.writer.OfficeWriterException: Templates are not supported in low footprint write mode

@jornfranke
Member

> Hi @jornfranke ,
> Can we also add the feature of using an Excel template in low footprint mode? That would add value to the functionality.
>
>   • org.zuinnote.hadoop.office.format.common.writer.OfficeWriterException: Templates are not supported in low footprint write mode

This is more difficult to achieve due to the Excel format: it is very memory-inefficient. In low footprint mode we cannot keep all rows in memory, and thus the formatting from the template cannot be applied correctly. Out of curiosity: what formatting are you trying to apply? Maybe one could make it work, but it requires extensive testing etc.

You can always use the normal processing mode and provide more memory.

@jornfranke
Member

> Hello @jornfranke ,
> Thanks for the quick response.
> OK, really appreciate your efforts. Will wait for the update on writing a huge dataset to multiple sheets in the same Excel file.
>
> I haven't found the example below useful for creating multiple sheets: it simply writes multiple DataFrames to a single Excel sheet.
> https://github.com/ZuInnoTe/hadoopoffice/blob/master/examples/scala-spark2-excel-out-ds/src/main/scala/org/zuinnote/spark/office/example/excel/SparkScalaExcelOutDataSource.scala

In the example: yes. But as you can see, you can specify the sheet for each cell, so you can implement logic that automatically distributes rows to a new sheet once the number of rows exceeds the limit. This will, though, only be a workaround until I have implemented the configurable option (a sketch follows below).
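A rough Java sketch of that per-cell distribution (illustrative; it assumes the SpreadSheetCellDAO field order used in the linked Scala example - formattedValue, comment, formula, address, sheetName - and the toCell helper and sheet naming are hypothetical):

import org.zuinnote.hadoop.office.format.common.dao.SpreadSheetCellDAO;

static final long MAX_ROWS_PER_SHEET = 1000000L; // stay below Excel's limit of 1,048,576 rows

// Map an absolute row index of the dataset to a cell that lands on the right sheet.
static SpreadSheetCellDAO toCell(String value, long rowIndex, int columnIndex) {
    long sheetNumber = rowIndex / MAX_ROWS_PER_SHEET; // which sheet this row belongs to
    long rowOnSheet = rowIndex % MAX_ROWS_PER_SHEET;  // position within that sheet
    // Excel addresses are 1-based, e.g. "A1"; this simple helper only covers columns A..Z
    String address = (char) ('A' + columnIndex) + Long.toString(rowOnSheet + 1);
    return new SpreadSheetCellDAO(value, "", "", address, "Sheet" + (sheetNumber + 1));
}

A Dataset of such cells (e.g. built via zipWithIndex on the source rows) can then be written with the org.zuinnote.spark.office.excel format as in the linked example.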

@nitinpawarIO
Author

> Hello @jornfranke ,
> Thanks for the quick response.
> OK, really appreciate your efforts. Will wait for the update on writing a huge dataset to multiple sheets in the same Excel file.
> I haven't found the example below useful for creating multiple sheets: it simply writes multiple DataFrames to a single Excel sheet.
> https://github.com/ZuInnoTe/hadoopoffice/blob/master/examples/scala-spark2-excel-out-ds/src/main/scala/org/zuinnote/spark/office/example/excel/SparkScalaExcelOutDataSource.scala
>
> In the example: yes. But as you can see, you can specify the sheet for each cell, so you can implement logic that automatically distributes rows to a new sheet once the number of rows exceeds the limit. This will, though, only be a workaround until I have implemented the configurable option.

Hi @jornfranke ,
How can I incorporate this into my code below? I am reading data from a Hive table and want to write 1000 records per sheet. If one sheet exceeds 1k records, it should write to a new sheet in the same document.

Dataset<Row> excelDataset = spark.sql("select * from (SELECT *, ROW_NUMBER() over (Order by student_id) as rowid "
    + "FROM students_data1 ) t where rowid > 0 and rowid <= 1000");

//System.out.println("Partitions used : " + dataset2.toJavaRDD().getNumPartitions());

System.out.println(new Date().toString() + " : Excel file writing STARTED....");

excelDataset.write()
    .format("org.zuinnote.spark.office.excel")
    .option("write.spark.defaultsheetname", "MyData")
    //.option("hadoopoffice.write.mimeType", "application/vnd.ms-excel") // write in .xls format; row limit is 65K
    .option("hadoopoffice.write.header.write", true)
    .option("hadoopoffice.write.locale.bcp47", "us")
    .option("hadoopoffice.write.template.file", "F:/JAVA/Big Data/HiveImportTemplate.xlsx")
    //.option("hadoopoffice.write.lowFootprint", true) // low footprint mode; for Excel this is based on the Apache POI event API
    .mode("append")
    .save("F:\\JAVA\\Big Data\\output\\StudentExcel_16");

@jornfranke
Member

The issue is that Hadoop/HDFS (even when used with your local filesystem) does not allow editing existing documents. What you could do (see the sketch below):

  • Write each sheet into a separate file
  • Put the files in the same location
  • Point HadoopOffice to the folder and open the files with simple mode = false
  • Write the files with simple mode = false and coalesce to one Excel file
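A hedged sketch of the last two steps (illustrative; the read option name is assumed to be read.spark.simpleMode per the HadoopOffice documentation, so please verify it, and the paths are placeholders):

// Read all per-sheet Excel files from one folder at cell level (simple mode off) ...
Dataset<Row> allSheets = spark.read()
    .format("org.zuinnote.spark.office.excel")
    .option("read.spark.simpleMode", false) // assumption: option name per the wiki
    .option("hadoopoffice.read.locale.bcp47", "us")
    .load("F:/JAVA/Big Data/output/sheets/");

// ... and write them back out as a single Excel file.
allSheets.coalesce(1).write()
    .format("org.zuinnote.spark.office.excel")
    .option("hadoopoffice.write.locale.bcp47", "us")
    .save("F:/JAVA/Big Data/output/CombinedExcel");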

@nitinpawarIO
Author

Yes, the approach looks good; however, it will still write data to one sheet up to the Excel record limit and then throw an error, and we are back to the initial issue :).
Let's wait for the addition of the multiple-sheets feature.

@jornfranke
Member

It will come soon.

As a workaround you can automatically create one DataFrame per x rows and then write them to different sheets. Then combine the sheets as stated above (see the sketch below).
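A hedged Java sketch of that chunking workaround, along the lines of the ROW_NUMBER query already used above (illustrative; the chunk size, sheet names, column name, and paths are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.*;

long chunkSize = 1000000L; // rows per sheet/file, below the Excel limit
Dataset<Row> numbered = excelDataset
    .withColumn("rowid", row_number().over(Window.orderBy("student_id")));
long total = numbered.count();
for (long start = 0; start < total; start += chunkSize) {
    long sheetNo = start / chunkSize + 1;
    numbered.filter(col("rowid").gt(start).and(col("rowid").leq(start + chunkSize)))
        .drop("rowid")
        .write()
        .format("org.zuinnote.spark.office.excel")
        .option("write.spark.defaultsheetname", "Sheet" + sheetNo)
        .option("hadoopoffice.write.locale.bcp47", "us")
        .save("F:/JAVA/Big Data/output/sheets/part_" + sheetNo);
}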

@jornfranke
Member

Can you please check 1.3.9? It has support for automated spillover, see #65

@nitinpawarIO
Author

Thanks for the update @jornfranke ,
Really appreciate your efforts towards this.

Spillover rows are getting written to a new sheet.
Query:
Does the defined template get applied to the spillover sheets as well?

Observation:
The log below is now printed for every spillover row, increasing the logfile size.

2020-05-25 16:43:05,774 INFO [Executor task launch worker for task 2] msexcel.MSExcelWriter (MSExcelWriter.java:write(280)) - Maximum number of rows reached. Spilling over to additional sheet
2020-05-25 16:43:05,774 INFO [Executor task launch worker for task 2] msexcel.MSExcelWriter (MSExcelWriter.java:write(280)) - Maximum number of rows reached. Spilling over to additional sheet
2020-05-25 16:43:05,778 INFO [Executor task launch worker for task 2] msexcel.MSExcelWriter (MSExcelWriter.java:write(280)) - Maximum number of rows reached. Spilling over to additional sheet
2020-05-25 16:43:05,778 INFO [Executor task launch worker for task 2] msexcel.MSExcelWriter (MSExcelWriter.java:write(280)) - Maximum number of rows reached. Spilling over to additional sheet
...

@jornfranke
Member

jornfranke commented May 25, 2020 via email

@jornfranke
Member

Please check 1.3.10 - it removes the excessive logging.

@jornfranke
Member

Also check out 1.5.0 (note: the config options have changed naming!). If there is still an issue, then please open a new issue.
