Removal of useless new line and carriage return characters in Html headings and paragraphs #199

Closed
Lucabenj opened this issue Apr 22, 2022 · 4 comments
Labels
question Further information is requested

Comments

Lucabenj (Author) commented Apr 22, 2022

Some websites format the text inside HTML tags (paragraphs, headings) with newline (\n) or carriage return (\r) characters. The browser collapses these characters when rendering, so they only serve to make the source code easier to read in an editor and carry no meaning for the rendered HTML.

I noticed that Trafilatura returns two forms of the extracted text: raw_text and text. The first always removes the \n and \r characters and does not preserve the logical division between headings and paragraphs. The second, by contrast, tries to preserve the division between paragraphs by using \n as a separator.

My question is this: is there a way to remove the useless carriage return and newline characters from the text inside heading and paragraph tags, since they have no semantic meaning? In these cases, the "text" output seems to keep these newline and carriage return characters, mixing them with the meaningful ones that delimit paragraphs and headings.
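To make the request concrete, here is a minimal sketch of the normalization described above, written as a standalone post-processing step. It is not Trafilatura's implementation: the function names collapse_whitespace and join_blocks are hypothetical, and the sketch assumes the text of each heading or paragraph is already available as a separate string.

```python
import re
from typing import Iterable

# ASCII whitespace that HTML rendering collapses into a single space.
_HTML_WS = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(fragment: str) -> str:
    """Collapse whitespace runs inside the text of one block element."""
    return _HTML_WS.sub(" ", fragment).strip()


def join_blocks(blocks: Iterable[str]) -> str:
    """Normalize each block, then separate blocks with a single newline."""
    cleaned = (collapse_whitespace(block) for block in blocks)
    # Drop blocks that contained nothing but whitespace.
    return "\n".join(text for text in cleaned if text)


# Hypothetical input: one heading and one paragraph with source-level line breaks.
blocks = [
    "A heading\r\nsplit over lines",
    "First paragraph,\n   wrapped in the source.",
    "  \n ",
]
result = join_blocks(blocks)
print(result)
```

With this approach the only \n characters left in the output are the ones that separate blocks. It would not be appropriate for elements where line breaks are meaningful, such as preformatted text.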

Lucabenj changed the title from "Removal of useless new line and carriage return characters in paragraphs and Html headings" to "Removal of useless new line and carriage return characters in Html headings and paragraphs" on Apr 22, 2022
adbar (Owner) commented Apr 25, 2022

Hi @Lucabenj, thanks for the feedback! The raw_text and text formats do indeed differ with regard to text trimming. Nonetheless, the text output should filter out the additional characters you mention. It could be an issue with content filtering. Could you please give me an example so I can see if it's a bug?

adbar added the "question" label (Further information is requested) on Apr 25, 2022
Lucabenj (Author) commented Apr 25, 2022

trafilatura -u "https://bit.ly/3k9Wjas" --json

adbar (Owner) commented Apr 25, 2022

Difficult case: here the newlines are not meaningful, but they could be in other documents. There is also a formatting problem here, with text present twice...

adbar (Owner) commented May 13, 2022

I am closing this since the issue is now mentioned in #4.

adbar closed this as completed on May 13, 2022