@PeterStaar-IBM
Contributor

Avoid broken Markdown tables caused by newline characters inside table cells, e.g.:

input (https://en.wikipedia.org/wiki/IBM):

Screenshot 2024-10-24 at 10 49 35

current output:

| Year
         |      Revenue
(US$ bn)
 |      Net income
(US$ bn)
 | Employees
         |
|---------|------|------|---------|
| 2014    | 92.7 | 12   | 379,592 |
| 2015    | 81.7 | 13.1 | 377,757 |
| 2016    | 79.9 | 11.8 | 380,300 |
| 2017    | 79.1 |  5.7 | 366,600 |
| 2018    | 79.5 |  8.7 | 350,600 |
| 2019    | 77.1 |  9.4 | 352,600 |
| 2020    | 73.6 |  5.5 | 345,900 |
| 2021[a] | 57.3 |  5.7 | 282,100 |
| 2022    | 60.5 |  1.6 | 288,300 |
| 2023    | 61.8 |  7.5 | 282,200 |

desired output:

| Year |  Revenue (US$ bn)|  Net income (US$ bn) | Employees |
|---------|------|------|---------|
| 2014    | 92.7 | 12   | 379,592 |
| 2015    | 81.7 | 13.1 | 377,757 |
| 2016    | 79.9 | 11.8 | 380,300 |
| 2017    | 79.1 |  5.7 | 366,600 |
| 2018    | 79.5 |  8.7 | 350,600 |
| 2019    | 77.1 |  9.4 | 352,600 |
| 2020    | 73.6 |  5.5 | 345,900 |
| 2021[a] | 57.3 |  5.7 | 282,100 |
| 2022    | 60.5 |  1.6 | 288,300 |
| 2023    | 61.8 |  7.5 | 282,200 |

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
# make sure that md tables are not broken
# due to newline chars in the text
text = col.text
text = text.replace("\n", "")
Contributor

It would be better to replace the newline with a separator, at least " ", because cells in other tables might contain TEXT\nSUBTEXT, which would otherwise be fused into TEXTSUBTEXT.

Technically, the better replacement (only in Markdown) could be <br />.
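
A minimal sketch of the two options mentioned here (a space or `<br />` as the separator), not the code that was merged. The helper names `sanitize_cell` and `to_markdown_row` are hypothetical and are not part of docling-core; the sketch only assumes that each cell's text is available as a plain string, as `col.text` is in the snippet under review.

```python
from typing import Sequence


def sanitize_cell(text: str, separator: str = " ") -> str:
    """Join the lines of a cell with `separator` so the row stays on one line.

    Hypothetical helper for illustration only.
    """
    # Split on line breaks, trim each piece, and drop empty pieces so that
    # leading/trailing newlines do not produce stray separators.
    parts = [part.strip() for part in text.splitlines()]
    return separator.join(part for part in parts if part)


def to_markdown_row(cells: Sequence[str], separator: str = " ") -> str:
    """Render one Markdown table row from raw cell texts (hypothetical helper)."""
    return "| " + " | ".join(sanitize_cell(c, separator) for c in cells) + " |"


header = ["Year\n", "Revenue\n(US$ bn)", "Net income\n(US$ bn)", "Employees\n"]
print(to_markdown_row(header))
print(to_markdown_row(["TEXT\nSUBTEXT"], separator="<br />"))
```

With a space, `Revenue\n(US$ bn)` becomes `Revenue (US$ bn)` instead of breaking the row; with `<br />`, the line break is preserved visually in renderers that allow inline HTML.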

Contributor Author

good catch, that was my intention in the first place ... must have slipped

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Contributor

@dolfim-ibm dolfim-ibm left a comment

LGTM

@PeterStaar-IBM PeterStaar-IBM merged commit 673d45d into main Oct 24, 2024
7 checks passed
@PeterStaar-IBM PeterStaar-IBM deleted the fix/fix-output-in-md-tables branch October 24, 2024 10:49
muhark added a commit to muhark/docling-core that referenced this pull request Mar 19, 2025
* fix the output in markdown tables, remove newline in table-cells

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commits

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed with space

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>