Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update EditEngine code to use 32 bit paragraph storage #164

Merged
merged 64 commits into from Feb 12, 2023

Conversation

DamjanJovanovic
Copy link
Contributor

Our EditEngine, in main/editeng, has a number of containers for paragraphs, paragraph portions, lines, etc. These are based on svl's "PTRARR" type classes, which is just an array of some caller-defined class that grows as needed, like ::std::vector but hardcoded to 16 bit limits. Worse, a lot of code, even outside of main/editeng, is written under these assumptions, and passes 16 bit types and constants to EditEngine classes and retrieves 16 bit results.

One particularly notorious bug emerging from this design, is that when you try to paste an HTML table of more than 65534 cells into Calc, the first 65534 cells are pasted, but all further cells are silently lost, something reported in at least 3 bugs in the last 18 years (57176, 110486 and 117225). This apparently happens because our HTML import is similar to XML DOM parsing, done in 2 phases, first parsing the results into memory, then processing those in-memory results to populate data into the spreadsheet (ScHTMLImport::WriteToDocument() in main/sc/source/filter/html/htmlimp.cxx). When parsing, each HTML cell () becomes a new EditEngine paragraph, and after 65534 paragraphs, all further cells are ignored, so when it's time to populate the spreadsheet, the data isn't there to add.

This enormous patchset changes the container for paragraphs to a new BaseList class, which wraps ::std::vector and uses 32 bit integers to access it. All EditEngine methods are changed to take 32 bit paragraph index parameters, all EditEngine classes store any paragraph indices as 32 bit fields, all 16 bit paragraph index constants like 0xffff are changed to 32 bit 0xffffffff (and to a more readable constant like EE_PARA_NOT_FOUND), and all (known) calling code everywhere in OpenOffice is updated to take these changes into account.

The bug is definitely fixed, all sample documents from Bugzilla paste fully and correctly. What is harder to prove is that nothing else broke. Much had to change, 136 files in 9 modules. OpenGrok helped tremendously in finding obscure places where EditEngine methods were getting called from. I've really tried to avoid introducing any new bugs, but it is hard to be sure with a change of this size. I am making a PR instead of directly pushing to allow you to test a lot ;).

Damjan Jovanovic added 30 commits January 3, 2023 04:53
which is limited to 16-bit indices and 2^16 entries.
For now, use our own std::vector-based compatible class.

Patch by: me
Get the ParaPortionList to use 32 bit indices too.

Patch by: me
Change the public API of the EditEngine class to 32 bit paragraph indices.

Patch by: me
MoveParagraphsInfo, PasteOrDropInfos, EENotify, DeletedNodeInfo,
EditUndoDelContent, EditUndoSplitPara, and EditUndoMoveParagraphs
classes with 32 bit paragraph indices.

Patch by: me
to 32 bit integers. Factor out the BaseList class into a separate
file and use it as the ContentInfoList.

Also increment the binary stream version to 603, as it now stores
a 32 bit paragraph count.

Patch by: me
to 32 bit that were missed before.

Patch by: me
Fix one wrong converted varible that should have been left as 16 bit.

Patch by: me
Audit all usage of "node" (sometimes used as a name for variables
storing paragraph indices) and start converting and fixing the problems.

Patch by: me
to use 32 bit paragraph indices, as per the compiler warnings.

Also convert EBulletInfo to 32 bit paragraph indices.

Patch by: me
indices, and update enough of the affected code to able to build
editeng.

Patch by: me
OLUndoExpand to 32 bit indices.

Patch by: me
tools module's 64 bit macros.

Patch by: me
to return a sal_uInt32 or sal_uLong, and fix any related problems
in all their callers.

Patch by: me
Damjan Jovanovic added 14 commits January 3, 2023 05:44
take a 32 bit paragraph index.

Patch by: me
…RangeEnumeration

to use 32 bit paragraph indices.

Patch by: me
in editeng/source/accessibility.

Patch by: me
and convert these to 32 bit constants. Add new EE_PARA_MAX and EE_INDEX_MAX
constants for this purpose.

Patch by: me
take 32 bit paragraph index parameters too.

Patch by: me
@Pilot-Pirx
Copy link
Member

Hi Damjan,
this is a great amount of work, thank you!
I will try to do a build on Windows and test some files...

@cbmarcum
Copy link
Contributor

cbmarcum commented Jan 6, 2023

H Damjan,
Thanks for your work on this.
I made a slight change to the html test file from i117635 by removing the first row so that cell A1 was in the first row.
See link [1].
the file is a simple html table with 20000 rows and 5 columns.
When I open it with my build of this PR on CentOS 7 it still only opens into a Writer/HTML document to the row 12801 before starting to just produce text beyond that. Same as with trunk.

[1] https://home.apache.org/~cmarcum/test-files/20000-row-html-table.html

@DamjanJovanovic
Copy link
Contributor Author

When I open it with my build of this PR on CentOS 7 it still only opens into a Writer/HTML document to the row 12801 before starting to just produce text beyond that. Same as with trunk.

[1] https://home.apache.org/~cmarcum/test-files/20000-row-html-table.html

Nice catch. Yes, apparently there is still a problem somewhere in Writer. I only tested copying from a web browser and pasting to Calc.

@DamjanJovanovic
Copy link
Contributor Author

I believe the Writer table limit of 64K cells is unrelated to this issue.

For example in main/sw/source/filter/html/htmltab.cxx:

class HTMLTable
{
...
    sal_uInt16 nRows;                   // Anzahl Zeilen
    sal_uInt16 nCols;                   // Anzahl Spalten

That's not EditEngine related, that's a separate issue specific to Writer.

Please test only copying from a web browser and pasting into Calc, and opening through the "File type" of "HTML document (OpenOffice Calc) (*html, *.htm)".

@cbmarcum
Copy link
Contributor

cbmarcum commented Jan 7, 2023

My latest testing:
File > Open
set file type filter to HTML Document (OpenOffice Calc) (.html;.htm)

trunk got to row 13106
with PR 164 got all 20000 rows

Copy/Paste from Firefox to Calc
trunk and this PR
all 20000 rows for both

Copy/Paste from Firefox to Writer
trunk and this PR
all 20000 rows as tab separated values per row.

@DamjanJovanovic
Copy link
Contributor Author

Copy/Paste from Firefox to Calc trunk and this PR all 20000 rows for both

There is no way all 20000 pasted successfully over HTML on trunk. Either the transfer format wasn't HTML, or you tested something other than trunk.

@cbmarcum
Copy link
Contributor

cbmarcum commented Jan 7, 2023

Copy/Paste from Firefox to Calc trunk and this PR all 20000 rows for both

There is no way all 20000 pasted successfully over HTML on trunk. Either the transfer format wasn't HTML, or you tested something other than trunk.

Hi Damjan,
You are correct. I found out what was wrong with my test.
I had copied from a Firefox outside of the VM and pasted to AOO inside the VM. This opened a Text Import dialog and did import all 20000 rows in both trunk and the PR.

I can confirm that copying from Firefox and pasting into AOO within the same VM results in:
trunk getting to row 13106 and inserting a lot (or all) of the rest into column D of the next row.
The PR build takes all 20000 rows.
Both without using the Text import dialog.
Sorry for the confusion.

@cbmarcum cbmarcum self-requested a review January 28, 2023 12:43
Copy link
Contributor

@cbmarcum cbmarcum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my test was successful

@DamjanJovanovic
Copy link
Contributor Author

Is everyone happy? Should we merge this to trunk? And 420?

@Pilot-Pirx
Copy link
Member

Pilot-Pirx commented Jan 28, 2023

Sorry, I forgot to build this one...

My Windows build is now running, looks good so far.
I will upload it to DropBox later, if anyone is interested in testing it.

Here it is:
https://www.dropbox.com/s/tmf97odgftuej67/Apache_OpenOffice_4.5.0_Win_x86_install_en-US-PR164.exe?dl=0

@Pilot-Pirx
Copy link
Member

My test on Windows with this table:
https://home.apache.org/~cmarcum/test-files/20000-row-html-table.html

was successful!

@Pilot-Pirx
Copy link
Member

Pilot-Pirx commented Feb 11, 2023

I am having problems with an xlsx file containing Cyrillic characters. Only the numbers are imported. It may only be my machine.

Update:
The file was from https://bz.apache.org/ooo/show_bug.cgi?id=126720 and the fixing commit is most likely not in this branch...

@DamjanJovanovic DamjanJovanovic merged commit d5edfd0 into apache:trunk Feb 12, 2023
asfgit pushed a commit that referenced this pull request Feb 12, 2023
Update EditEngine code to use 32 bit paragraph storage

(cherry picked from commit d5edfd0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants