Some websites format the text inside HTML tags (paragraphs, headings) with newline (\n) or carriage return (\r) characters, which the browser treats as ordinary whitespace when rendering. They only make the source code easier to read in an editor and have no meaning for the HTML in the browser.
I noticed that Trafilatura returns two forms of the extracted text: raw_text and text. The first always removes the \n and \r characters and does not preserve the logical division between headings and paragraphs; the second tries to preserve the division between paragraphs using the \n character.
My question is: is it possible to remove the useless carriage return and newline characters from the textual content inside heading and paragraph tags, since they have no semantic meaning? In these cases the "text" returned by the tool seems to always keep these newline and carriage return characters, mixing them with the more significant ones that delimit paragraphs and headings.
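To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written with the Python standard library only. It is not Trafilatura's code or API: the names `BlockTextExtractor` and `blocks_to_text` are hypothetical, and the sketch assumes that only p and h1 to h6 count as blocks. Whitespace runs inside a block are collapsed to a single space, and \n is used only between blocks.

```python
import re
from html.parser import HTMLParser

# Assumption: only these tags are treated as text blocks in this sketch.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}
# Runs of spaces, tabs, carriage returns, newlines and form feeds.
_WHITESPACE = re.compile(r"[ \t\r\n\f]+")


class BlockTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text of each block tag separately."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buffer = None  # None while we are outside a block tag

    def _flush(self):
        # Collapse source-formatting whitespace inside the finished block.
        if self._buffer is not None:
            text = _WHITESPACE.sub(" ", "".join(self._buffer)).strip()
            if text:
                self.blocks.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()  # an unclosed previous block ends here
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def close(self):
        super().close()
        self._flush()  # keep a block left open at the end of the input


def blocks_to_text(html):
    """Return one line per block; \n appears only as a block separator."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)


if __name__ == "__main__":
    sample = (
        "<h1>A title\r\nsplit in the source</h1>\n"
        "<p>First line\nsecond line\r\nthird line.</p>"
    )
    print(blocks_to_text(sample))
```

With this approach, the only \n characters left in the output are the ones that separate headings and paragraphs. Tags where whitespace is significant, such as pre, would need separate handling.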
Lucabenj changed the title from "Removal of useless new line and carriage return characters in paragraphs and Html headings" to "Removal of useless new line and carriage return characters in Html headings and paragraphs" on Apr 22, 2022
Hi @Lucabenj, thanks for the feedback! The raw_text and text formats do indeed differ with regard to text trimming. Nonetheless, the text output should filter out the additional characters you mention. It could be an issue with content filtering. Could you please give me an example so I can see if it's a bug?
Difficult case: here the newlines are not meaningful, but they could be in other documents. There is also a formatting problem here, with text present twice...