Skip to content
This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Problem with xlsx with merged cells and multiline header #320

Closed
mcarans opened this issue May 13, 2020 · 8 comments · Fixed by #323
Closed

Problem with xlsx with merged cells and multiline header #320

mcarans opened this issue May 13, 2020 · 8 comments · Fixed by #323
Labels

Comments

@mcarans
Copy link
Contributor

mcarans commented May 13, 2020

Overview

When trying to read the file below, tabulator reads the headers on rows 10-12 incorrectly:
http://mapipcissprd.us-east-1.elasticbeanstalk.com/api/public/population-tracking-tool/data/2017,2020/?page=1&limit=1&condition=A&export=true&country=AF

I used options headers=[10,12], fill_merged_cells=True:
stream = Stream('http://mapipcissprd.us-east-1.elasticbeanstalk.com/api/public/population-tracking-tool/data/2017,2020/?page=1&limit=1&condition=A&export=true&country=AF', headers=[10,12], fill_merged_cells=True, format='xlsx')
I also tried fill_merged_cells=False.

An example of what's missing can be seen in column K of the spreadsheet. The header should be for that one "Current Phase 1 #" but instead is just "Phase 1 #".

Pandas is able to read it using:
df = pandas.read_excel(url, header=[9, 10, 11])


Please preserve this line to notify @roll (lead of this repository)

@roll
Copy link
Member

roll commented May 19, 2020

Hi @mcarans,

There is some magic behind this issue because Openpyxl 3.0.3 gives this list of merged ranges:

A10:E11
W10:AL10
G11:H11
M11:N11
Q11:R11
U11:V11
AA11:AB11
AE11:AF11
AI11:AJ11
AM11:AN11
AS11:AT11
AW11:AX11
BA11:BB11

while "Current" and "Second projection" are definitely merged but at the same time definitely no on this list although "First projection" is here "W10:AL10".

I'm closing it, for now, could you please:

  • try to re-create this table in case it's somehow broken
  • or create a bug report for Openpyxl

?

@roll roll closed this as completed in #323 May 19, 2020
@mcarans
Copy link
Contributor Author

mcarans commented May 19, 2020

@roll I got a different result from openpyxl. I saved the link I gave in the OP as test.xlsx then:

wb = load_workbook('/home/mcarans/Downloads/test.xlsx')
sheet_ranges = wb['IPC']
print(sheet_ranges.merged_cells.ranges)```

[<CellRange A10:E11>, <CellRange G10:V10>, <CellRange W10:AL10>, <CellRange AM10:BB10>, <CellRange G11:H11>, <CellRange K11:L11>, <CellRange M11:N11>, <CellRange O11:P11>, <CellRange Q11:R11>, <CellRange S11:T11>, <CellRange U11:V11>, <CellRange W11:X11>, <CellRange AA11:AB11>, <CellRange AC11:AD11>, <CellRange AE11:AF11>, <CellRange AG11:AH11>, <CellRange AI11:AJ11>, <CellRange AK11:AL11>, <CellRange AM11:AN11>, <CellRange AQ11:AR11>, <CellRange AS11:AT11>, <CellRange AU11:AV11>, <CellRange AW11:AX11>, <CellRange AY11:AZ11>, <CellRange BA11:BB11>, <CellRange A13:D13>]

Please can you check.

@roll
Copy link
Member

roll commented May 19, 2020 via email

@mcarans
Copy link
Contributor Author

mcarans commented May 19, 2020

@roll Yes I can confirm it is 3.0.3.

@roll
Copy link
Member

roll commented May 19, 2020

@mcarans
And with tabulator@1.46.0 the bug is still here?

@mcarans
Copy link
Contributor Author

mcarans commented May 19, 2020

@roll
The bug seems to still exist with 1.46 unfortunately. This still seems to be a problem:
An example of what's missing can be seen in column K of the spreadsheet. The header should be for that one "Current Phase 1 #" but instead is just "Phase 1 #".

@roll
Copy link
Member

roll commented May 19, 2020

@mcarans
I hope I have found the problem. It seems self.__sheet.merged_cells.ranges was changed in-place during merging so some ranges were skipped

Could you please try tabulator@1.46.1

@mcarans
Copy link
Contributor Author

mcarans commented May 19, 2020

Super, that looks good, thx @roll !

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants