
Export-Excel extremely slow on large spreadsheets #506

Closed
worc4021 opened this issue Dec 11, 2018 · 41 comments

@worc4021

Hi,

I've been trying to merge two Excel spreadsheets. The source xlsx files are just over 3 MB, but when I call Export-Excel -Verbose it seems to fly through the headers and then stall. These spreadsheets have about 1700 columns and 300 rows, i.e. their union has slightly more, but these numbers are not unheard-of. Is this expected to scale poorly, or am I doing something wrong? It stalls even when saving just one of them:

$sheet = Import-Excel -Path \\path\to\my\xlsx
$sheet | Export-Excel \\path\to\my\output -Verbose -Debug

I can see in Task Manager that it is trying to do something, but it makes no progress after

VERBOSE: Cell '1:1737' add header [...] 

Cheers,
Manuel

@dfinke
Owner

dfinke commented Dec 11, 2018

@worc4021 if you just do the import-Excel, does it return all the data? Would it be possible to scrub the xlsx file and post it?

@worc4021
Author

Hi!
Thank you for your quick response. Unfortunately, the reason I left out most of the useful bits from the verbose output is that the spreadsheets I am trying to combine hold simulation output metrics, i.e. they are work related, so I won't be able to share them. However, the spreadsheets hold very simple data: each row corresponds to one simulation, and each data field is either empty, a string, or a numerical value - no formulas or anything like that.
I hope this still helps.
Manuel

@dfinke
Owner

dfinke commented Dec 11, 2018

Ok, if you only do an import-Excel, does it work? What's the response time like?

So 1700x300 empty, string and numeric data?

@worc4021
Author

Import-Excel is fine; it takes about 1-2 seconds. 1700x300 was the one I tried. Naturally, since each row corresponds to one sim and each column to one metric, the size can vary, but that's the ballpark.

@dfinke
Owner

dfinke commented Dec 11, 2018

I created a sample set of objects; it took 22 minutes to export to Excel. There may be some low-hanging perf improvements to be had - not sure how much effort they would take or how much improvement they would bring.

@worc4021
Author

Thank you for looking into this!

  • So it is nothing I'm doing particularly wrong on my side.
  • Not sure whether this would be worthwhile. I just stumbled upon ImportExcel and thought it would be a great post-processing tool for when I compile my regressions to compare different sets of sims. The alternative is literally clicking 'Add rows' in Spotfire instead. So unless you see something that obviously makes it scale poorly AND know an alternative implementation, I would not advise you to change the existing version.

Sad times.. I prefer scripting over clicking :D

Thank you,
Manuel

@pkarunkar

pkarunkar commented Dec 11, 2018 via email

@dfinke
Owner

dfinke commented Dec 11, 2018

@worc4021 Yeah, friends don't let friends click. Exactly why I built this :)

@pkarunkar Good point, you're right. -Verbose slows things down a lot. Unfortunately, I wasn't using -Verbose, so this is probably the fastest it can be for now.

Maybe @jhoneill has some insight.

@pkarunkar

pkarunkar commented Dec 11, 2018 via email

@worc4021
Author

Maybe just a defence of the -Verbose switch: when you run a command at the prompt and nothing happens for over 5 minutes, you might try -Verbose to see whether it produces any output... ;)

@pkarunkar

pkarunkar commented Dec 11, 2018 via email

@jhoneill
Contributor

First, on Write-Verbose: it slows things down quite a lot, and I removed it from the tightest looping part of Export-Excel because (in this case) it would output over 500K messages. That's just too many to be helpful.
I just tried 1..500000 | foreach {$x = $_ } - it takes 7.7 seconds;
1..500000 | foreach {$x = $_; write-verbose -Message $_ } takes 45 seconds, even when it is not outputting anything.
@pkarunkar - are you getting these time differences with the current version, or are you getting lots of verbose output? I just did
ps | export-excel -now
and
ps | export-excel -now -verbose
and verbose was faster (but within experimental error).
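For anyone who wants to reproduce that comparison, here is a minimal sketch using Measure-Command (timings will of course vary by machine):

# Hedged sketch: measure the overhead of Write-Verbose in a tight loop.
Measure-Command { 1..500000 | ForEach-Object { $x = $_ } }                              # ~8 seconds in the test above
Measure-Command { 1..500000 | ForEach-Object { $x = $_; Write-Verbose -Message $_ } }   # ~45 seconds above, even with no verbose output shown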

Second: don't ever, ever do Import-Excel | Export-Excel, or $var = import ; $var | export, unless the data is very, very small and you don't mind losing any formatting you have on the spreadsheet.
I wrote Copy-ExcelWorksheet because it can copy a whole sheet within and between files VERY quickly.
I also wrote Join-Worksheet to copy whole sheets to a common destination - this might not be exactly what @worc4021 wants, but it may be possible to use some of that code.
In this case I'm not sure if there are perf implications to a PSCustomObject with 1700 properties, but that's more properties than I've ever seen before.
If you know that your numbers are in numeric types, not strings, you'll get a decent speed improvement from -NoNumberConversion, but ultimately 1700x300 is 510,000 cells, and Doug's test did them at about 400 cells/sec. Each value is checked to see if it is a number, a hyperlink, a date, a timespan or a formula (and, without -NoNumberConversion, a number in the form of a string) before assuming it is a string. It is then poked into a cell, and if it is a date, timespan or hyperlink, formatting is applied (formatting cells one by one is an expensive operation).
Where there are huge numbers of rows it may be faster to build a DataTable in memory and use Send-SQLDataToExcel, and you can get Excel data with a SQL query … but it should be faster to use the copy functionality which is already there.
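To make the copy route concrete, here is a hedged sketch along those lines (the paths are placeholders, and the exact parameter set may vary a little between module versions):

# Copy a whole sheet between workbooks instead of import | export - much faster, and it keeps formatting and formulas.
Copy-ExcelWorkSheet -SourceWorkbook '\\path\to\my\source.xlsx' -DestinationWorkbook '\\path\to\my\output.xlsx'

# If the data has to go through objects anyway, skipping number detection on every column helps:
Import-Excel -Path '\\path\to\my\source.xlsx' | Export-Excel -Path '\\path\to\my\output.xlsx' -NoNumberConversion *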

@pkarunkar

pkarunkar commented Dec 11, 2018 via email

@pkarunkar

pkarunkar commented Dec 12, 2018

@jhoneill: OK, I faked the test at home by copy-pasting data into cells, making it 70k rows and 32 cols.
As you said, verbose now looks faster - except when I used $var | Export-Excel, with or without verbose. (This is strange, as I do this all the time at the office with large Excel files and it has been working great without verbose. Note that I haven't tested the latest version with verbose.)
Attaching the Excel sheet for reuse and my Measure-Command outputs.

Using -NoHeader since a duplicate column header crept in while copy-pasting.

Measure-Command {Get-Service | Export-Excel $env:userprofile\Desktop\Newfolder\VerboseTest.xlsx}
Measure-Command {Get-Service | Export-Excel $env:userprofile\Desktop\Newfolder\VerboseTest2.xlsx -Verbose}

Measure-Command {Import-Excel $env:userprofile\Desktop\Newfolder\VerboseTestlargedata.xlsx -NoHeader }
Measure-Command {Import-Excel $env:userprofile\Desktop\Newfolder\VerboseTestlargedata.xlsx -NoHeader -Verbose}

Measure-Command {$Var = Import-Excel $env:userprofile\Desktop\Newfolder\VerboseTestlargedata.xlsx -NoHeader }
Measure-Command {$var1 = Import-Excel $env:userprofile\Desktop\Newfolder\VerboseTestlargedata.xlsx -NoHeader -Verbose}

Measure-Command {$var1 | Export-Excel $env:userprofile\Desktop\Newfolder\VerboseTest.xlsx}
Measure-Command {$var1 | Export-Excel $env:userprofile\Desktop\Newfolder\VerboseTest2.xlsx -Verbose}

The last 2 commands took 8 minutes each, with or without verbose. That is still very fast compared to my past experiences. At the office I have different data types - string, date, numeric... I do not know if that would make any difference.

VerboseTestlargedata.xlsx

ExcelTestOutput.txt

@jhoneill
Contributor

I don't know if the EPPlus library gets slower at inserting into the worksheet XML as the sheet gets bigger.
I might try this out during the day and see whether time goes up linearly or is more like a square law.

I made a 21,000-cell sheet (ps | export-excel).
import-excel <file> | export-excel -nonumberconversion * <file> took 6.4 seconds on my machine, which is a fairly quick 3270 cells per second.
Copy-ExcelWorkSheet -SourceWorkbook <file1> -DestinationWorkbook <file2>
takes 0.76 seconds - about 8 times faster. I think this factor increases with bigger sheets.

Hence my advice not to do import | export … it also loses formulas, which copy preserves.

@jhoneill
Contributor

jhoneill commented Dec 12, 2018

OK I tried a couple of things.
I made a spreadsheet with 50 Columns x 200 data rows = 10,000 data cells and imported it
$tenKCells = Import-Excel C:\temp\10KCells.xlsx

Then exported it

$tenKCells | export-excel c:\temp\10.xlsx -ReturnRange
$tenKCells + $tenKCells | export-excel c:\temp\10.xlsx -ReturnRange
$tenKCells + $tenKCells + $tenKCells + $tenKCells | export-excel c:\temp\10.xlsx -ReturnRange  

Double the rows and the time doubles, so EPPlus scales linearly: 1.3 seconds for 10K, 2.6 for 20K, 5.4 for 40K, 10.31 for 80K, 21.3 for 160K.

Next I switched to 200 columns x 50 rows. 10K cells should have taken ~1.3 secs; it took over 10 seconds. So I tried

$header = $tenkcells[1].PSObject.Properties.where( {$_.MemberType -like "*" }).Name  
$i = 0 ; foreach ($row in $tenKCells) { foreach ($Name in $Header) {$x = $row.$name ; $i++ }} 

7.4 seconds!

So, looking at each property of a PSCustomObject with hundreds of properties gets slow. (Although it didn't seem to get any worse when I changed to 25 rows by 400 columns. [Edit] There seems to be a step change somewhere in the region of 75 to 80 properties.)

All the more reason to use copy, and not import | export, when the data is hundreds of columns wide. (It's not that many versions since Excel lost the 256-column limit - this is the first time I've known of someone using > 256 columns in real-world use!)
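To see the property-enumeration cost in isolation (no Excel involved), here is a sketch along the lines of the loop above - the row and column counts are arbitrary:

# Build wide PSCustomObjects, then time walking every property of every row.
$colCount = 200; $rowCount = 50
$rows = 1..$rowCount | ForEach-Object {
    $h = [ordered]@{}
    1..$colCount | ForEach-Object { $h["Col$_"] = $_ }
    [pscustomobject]$h
}
$header = $rows[0].PSObject.Properties.Name
Measure-Command { foreach ($row in $rows) { foreach ($name in $header) { $null = $row.$name } } }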

@dfinke
Owner

dfinke commented Dec 12, 2018

Off the top of my head: I was wondering, what if there were an -AsIs switch on the export that bypassed all the property inspection and just pushed all the values into the sheet? Don't know without testing.

@jhoneill
Contributor

IIRC - no. Everything goes via Add-CellValue.

One can try
$t = $r | convertto-csv -NoTypeInformation
($r is a collection of rows)
$x = $excel.Workbook.Worksheets[1].cells["A1"].LoadFromText($t)

But this is very slow to save. There are other LoadFrom* methods which may also be worth looking at.
What slows it down is not so much the checking of types as getting 300 properties per object and poking them into cells one by one. It wouldn't take much to put some test code together to try skipping Add-CellValue, but my hunch is it won't change the time much.
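For anyone who wants to experiment with that route, here is a hedged sketch (assuming $r is a collection of row objects, as above; the file and sheet names are placeholders). LoadFromText expects a single CSV string, so the lines are joined first:

# Hedged sketch of the LoadFromText route - fast to load, but (as noted above) the save can be slow.
$t     = ($r | ConvertTo-Csv -NoTypeInformation) -join "`r`n"    # one CSV string for LoadFromText
$excel = Open-ExcelPackage -Path C:\temp\LoadFromText.xlsx -Create
$ws    = $excel.Workbook.Worksheets.Add('Sheet1')
$null  = $ws.Cells['A1'].LoadFromText($t)
Close-ExcelPackage $excel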

@ctrl78

ctrl78 commented Dec 22, 2018

Hi,
I had the same issue when I upgraded to the latest version; moving back to version 5.0.1 fixed the slowness.

@MarcelSchuermann

MarcelSchuermann commented Feb 14, 2019

I still have performance issues with Export-Excel in version 5.4.4.
Any news on how to improve performance?

Edit:
The approach of @jhoneill with Send-SQLDataToExcel was a lot faster than saving the data in a variable first.
I am getting the following error when using Send-SQLDataToExcel now:

At C:\Program Files\WindowsPowerShell\Modules\ImportExcel\5.4.4\Set-Column.ps1:141 char:9
         $columnName = [OfficeOpenXml.ExcelCellAddress]::new(1,$column).Address - ...
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : MethodNotFound

Any ideas?

The Excel file is created, but the conversions are also not as I wish:
int or decimal(38,5) values are converted to dates, for example.

Edit 2:
In my local environment I have PS version 5.1.x, and in my problem environment I have PS version 4.x.
Might that be the problem?

Edit 3:
I figured it out - it was caused by the old PSVersion 4.0. With 5.1 I have no problem.

@ericleigh007

ericleigh007 commented Mar 31, 2019

So just to kick this dead horse for a while longer.

If I do $rows = import-excel 'file'   # with a 15k-row file, that's about 8 seconds.
If I do $rows | export-excel 'file1' -autosize -autofilter -freezetoprow   # with those rows, that's about 2.08 minutes.
If I do $rows | export-excel 'file2'   # that's about 1.30 minutes.

To me, that means the save time is excessive. We must be doing a lot of chunking through the data in memory (since I don't see -AutoSize and -AutoFilter having any effect on write speed).

Am I crazy in hoping to see an improvement of 5-10 times here?

Thoughts?

@dfinke
Owner

dfinke commented Mar 31, 2019

Expected. There is no bulk export. Each value in the data is inspected for a number of things as it is set in the worksheet cell.

@jhoneill
Contributor

jhoneill commented Apr 1, 2019

I think we need to be clear that Export-Excel is not making any attempt to be optimized for speed. It is trying to be general purpose and to export small to medium volumes of data. It adds each cell individually, which you would expect to be slower than (for example) adding a database table in one block. It checks every value to see if it is a formula, a hyperlink, a date, or text that needs converting to a number, and processes it accordingly.

On my machine with 500 items in a folder this takes 1.2 seconds

dir | select -Property name,Length,LastWriteTime | export-excel \temp\foo.xlsx -BoldTopRow -FreezeTopRow -AutoSize -AutoFilter

The rate is 1000-2000 cells per second. It will vary a little, and obviously with machine speed, but that's the order of magnitude. [Edit] My test on the train this morning seems to have run at about half normal speed; at home I'm getting around 3000-4000 cells/sec.

By comparison this runs in 0.2 seconds

$t = New-Object -TypeName System.Data.DataTable
[void]$t.Columns.Add("Name",[string])
[void]$t.Columns.Add("Length",[int])
[void]$t.Columns.Add("Lastwrite",[datetime])
dir | foreach {$r = $t.NewRow() ; $r.name=$_.name; $r.length = $_.Length ; $r.lastWrite  = $_.LastWriteTime ; $t.rows.Add($r) }
Send-SQLDataToExcel -DataTable $t -Path \temp\foo.xlsx -BoldTopRow -FreezeTopRow -AutoSize -AutoFilter

If speed of execution is more important than getting something produced quickly and generically, AND you have the knowledge of the data, then there are optimizations you can do for your data …

However, I have found a significant optimization which I'm going to talk to @dfinke about.

@ericleigh007

The DataTable workaround works well in my case. I'll refactor my stuff to use DataTables instead, via ConvertTo-DataTable, which will speed it up by perhaps 10x.

Thanks for suggesting it.

Anything you can do with a "significant optimization" would also be welcome, of COURSE.

@jhoneill
Contributor

jhoneill commented Apr 2, 2019

@ericleigh007 You asked for a 5-10x improvement. I give you ... ~10x improvement - there was something I had looked at elsewhere which I thought was too hard to change. It makes the script harder to read/maintain, but the perf change is clear - getting up to 70K cells per second on my machine. There's some variability - the difference between runs is greater than the change with -NoNumberConversion and -AutoSize.

[screenshot: timing results]

@dfinke this passes all the build tests, and I should have it on my repo shortly - a couple of final things to clear up. It is worse from a reading/maintaining point of view, but not massively so.

@dfinke
Owner

dfinke commented Apr 2, 2019

Cool! Reminds me of the below, which never sat well with me. I'd prefer a different function so it can "more easily" be recognized, both in discovery and maintenance.

Haven't looked at what it would take to refactor though.

Get-Process | Get-Member

vs

Get-Member -InputObject Get-Process

@jhoneill
Contributor

jhoneill commented Apr 2, 2019

That opens up a whole can of worms.... In a lot of places, if $x is an array, passing it via -InputObject causes the command (e.g. Get-Member) to look at the array itself; in others (e.g. Format-Table) the members get processed.
I think command2 $(command1) and
command1 | command2 are about equal - it depends whether you're thinking step-by-step or end-product-first.
command1 | command2 | command3 | command4 is easier than
command4 (command3 (command2 (command1)))
Though that's often how my Excel formulas look - I'm a recovering LISP programmer, so that's to be expected.

What I've done is

  1. Renamed -TargetData to -InputObject (but with an alias so existing use of it doesn't break) and then do foreach ($targetData in $inputObject) - see the sketch after this list. This gives a small speed improvement over piping.
  2. Moved the Add-CellValue function into the main body - cutting out the function call gives a significant speed improvement.
  3. Moved the handling of "simple" types so that the moved code doesn't need to be duplicated.
  4. Taken the "Autoformat only the first X rows" idea from Send-SqlDataToExcel.ps1 #555 and also added special-case handling for a table (if $inputObject is a table, insert it in the begin block and remove it before the process block) - there was something about that in Send-SqlDataToExcel.ps1 #555 as well.
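A minimal sketch of the renamed-parameter pattern from item 1 (the function name and body here are illustrative, not the real Export-Excel):

function Export-ExcelSketch {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline = $true)]
        [Alias('TargetData')]       # existing calls that use -TargetData keep working
        $InputObject
    )
    process {
        # When -InputObject is passed directly as an array, this loop walks it;
        # when input is piped, each process block sees one item and the loop runs once.
        foreach ($targetData in $InputObject) {
            # per-row cell-writing work goes here, inlined rather than calling Add-CellValue
        }
    }
}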

@ericleigh007

ericleigh007 commented Apr 2, 2019 via email

@ili101
Contributor

ili101 commented Apr 2, 2019

@jhoneill Thank you for implementing some of the stuff from my fork. The changes sound good; I hope it goes well so I can finally use the official version and stop needing to maintain my fork.
Please post an update when it's ready so I can give it a look and do some testing.

@dfinke
Owner

dfinke commented Apr 2, 2019

@jhoneill Knew you were a LISP coder. What's next? cdr car?

Probably going to need to float this as a preview so hopefully we can get this exercised and shake out as many issues as possible.

@jhoneill
Contributor

jhoneill commented Apr 5, 2019

@dfinke ah yes, "You do anything with car, cdr, cond and setq". I haven't touched LISP or Pascal this century, or Occam, or Forth, or assembler. But "everything's a list" … er … hash table, and "we don't need no stinking variables, nest everything!" is still with me.
I've given it a reasonable shake through the existing tests, but there's enough of a change to make a preview a good idea.

@jhoneill
Contributor

jhoneill commented Apr 6, 2019

@ili101, you're welcome. Your ideas were all good; I was just nervous about getting them all in without breaking other things. I've also committed some changes to make Send-SQLDataToExcel a lot more compact: it now brings the parameters from Export-Excel in as dynamic parameters rather than declaring everything. That was a reminder of why I hate both dynamic parameters and parameter sets.
Because the table insert now happens in Export-Excel, some of the things around tables, ranges and auto-sizing which need to happen with the insert can go back to Export-Excel too - that was something else you made the case for.

@dfinke
Owner

dfinke commented Apr 7, 2019

The zip of the module and the updates can be downloaded here: http://bit.ly/2OTxoaO

@uSlackr
Contributor

uSlackr commented Apr 25, 2019

Wow, this is so much faster. Just yesterday I timed the export of 950 custom PS objects to Excel and it took 18 seconds. Since the update it's nearly instantaneous.
Thanks for this excellent tool.

@dfinke
Owner

dfinke commented Apr 25, 2019

Curious, how many properties on each?

@uSlackr
Contributor

uSlackr commented Apr 25, 2019 via email

@davidhigh

First of all -- thanks for this great package!

Here is just a quick write-up of my experience, in case anybody comes here for advice on how to speed up their script ... as this is basically the only post around the web which discusses this topic.

I had a long-running script where Export-Excel took a significant part of the total runtime. The script focuses on regression testing and has to process and inspect all exported data. The whole, relatively complex data-wrangling part took only about 5% of the total runtime, whereas Export-Excel took the other 95%.

After several trials (like converting to CSV, exporting and using text-to-columns, etc.) I tried the Send-SQLDataToExcel solution by @jhoneill, which solved my issue. With this, the export to Excel takes a time of the same order as the data-processing part, and this brought my runtime down from 20 minutes to about 1 minute.

I just wanted to highlight this point, as the statement by @jhoneill reads as "not so great", which motivated my own wicked trials ... in fact, the Send-SQLDataToExcel approach is great for performance.

@uSlackr
Contributor

uSlackr commented Mar 9, 2022

Might help to say how large the data set is, whether any special export features are being used, and what version of ImportExcel you have. I routinely process 10,000 lines with little impact from the export. Good tip on the Send-SQLDataToExcel.

@jhoneill
Contributor

jhoneill commented Mar 9, 2022

@davidhigh Thanks for the comments.
Export-Excel can manage thousands of cells per second, but when you have millions of cells - either many rows, many columns, or the product of both - execution time balloons. It does seem to be linear: it doesn't take longer to add row 100,000 than row 100.
I found that the library we call to do all the work underneath import and export has the ability to push a table straight in - so I added support for it. I don't know how the library does it, but it inserts very large amounts very quickly. This route wins if your data is already available in a table object; if you need to load data from something else into a table, that can use up all the time you're saving.

@davidhigh

I ran some further tests and here are the timings:

For 5000 rows and 55 columns:

  1. Export-Excel took 5.5 seconds
  2. Send-SQLDataToExcel took 0.5 seconds

For 30,000 rows and 150 columns (about a factor of 16 larger in the number of rows times columns):

  1. Export-Excel took 57 minutes (with -NoNumberConversion * the sample took 54 minutes)
  2. Send-SQLDataToExcel took 3 seconds (!)

I took care to measure only the runtimes of the pure commands, i.e. without the time needed to generate the data. For this I called the Export-Excel cmdlet from the pipeline as $myData | Export-Excel -ExcelPackage $pkg ..., but after inspecting the code this seems equivalent to using -InputObject (as long as the input is not a [DataTable]).

Two conclusions:

  • Send-SQLDataToExcel is way faster for large data sets and behaves sublinearly in the number of rows times columns (only a factor of ~6 in timing for a factor of 16 in workload)

  • At least in my case, Export-Excel does not seem to behave completely linearly -- otherwise it should take an expected time of about 90 seconds for the larger case. However, from the code, it's not obvious to me where this non-linear overhead comes from.

In my case, as I generate the comparison data during program execution, I could simply switch to [DataTable] instead of [PSObject] as the results object, with no significant impact on the runtime. But even if the data is only available as standard PowerShell objects, converting it to a [DataTable] first and then doing the export to Excel seems advantageous, which is basically trading memory for runtime.

My suggestion for improvement is thus to incorporate the pattern "ConvertTo-DataTable + export" into Export-Excel, possibly via a switch -AsDataTable. The suggested usage would be:

$obj = [PSCustomObject]@{ ... }
$obj | Export-Excel -AsDataTable -Path "test.xlsx"

The ConvertTo-DataTable part can be based on this function here.
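To make the suggested pattern concrete, here is a rough sketch (the helper below is illustrative, not part of the module; column types are taken naively from the first row, and nulls are simplified to [string]/[DBNull]):

# Illustrative helper: build a System.Data.DataTable from PSCustomObjects, then export it in one go.
function ConvertTo-SimpleDataTable {
    param([Parameter(Mandatory)][object[]]$InputObject)
    $table = New-Object System.Data.DataTable
    foreach ($p in $InputObject[0].PSObject.Properties) {
        $type = if ($null -ne $p.Value) { $p.Value.GetType() } else { [string] }
        [void]$table.Columns.Add($p.Name, $type)
    }
    foreach ($obj in $InputObject) {
        $row = $table.NewRow()
        foreach ($p in $obj.PSObject.Properties) {
            $row[$p.Name] = if ($null -ne $p.Value) { $p.Value } else { [DBNull]::Value }
        }
        $table.Rows.Add($row)
    }
    ,$table    # the comma stops PowerShell unrolling the table into individual rows
}

$table = ConvertTo-SimpleDataTable -InputObject $myData
Send-SQLDataToExcel -DataTable $table -Path .\test.xlsx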

@jhoneill
Contributor

Keep in mind that Export-Excel works like this.

  • Get a list of properties from the first object.
  • For each property in the list, write a heading.
  • For each object: foreach property { do any transformation, write the value to a cell }

The time to execute the PowerShell part should scale linearly. Poking cell by cell is going to get slow when you get to big numbers; I'm a little surprised that you got 50,000 cells per second with your first sample.
IIRC the xlsx format doesn't store repeating values in each cell; it stores each value once, and each cell with a value is a pointer - so every value added requires a lookup to see if that value already exists. Depending how that is implemented in the library we call, I can see how it might put an n-squared-time factor into the execution time, but then your 16.3x number of cells would have taken about 5.5 s * 16.3^2 ≈ 24 minutes, so it is even worse than that with large numbers. Quite what magic is being done to turn a 30,000 x 150 table into XML format so fast I don't know, but selecting the unique values from 4,500,000 cells and translating them into pointers in one pass is going to be quicker than doing them one by one.

Perhaps writing something to create tables before sending them is the way forward for people who need to export very large amounts of data.
