Add option to provide XPaths for content extraction #596

klvbdmh · 2024-05-16T01:05:53Z

I can't parse comments on Reddit pages (old.reddit.com). The issue is likely due to the comments being encapsulated within form elements, which the parser may not be handling correctly.

Steps to reproduce
Run trafilatura -u "https://old.reddit.com/r/programming/comments/1cnvy7y/how_stripe_prevents_double_payment_using/"

Expected Behavior
trafilatura should successfully parse and extract all visible Reddit comments.

Actual Behavior
Only user names, points, and number of children are extracted:

[–]SittingWave 665 points666 points667 points (73 children)
[–]barbouk 55 points56 points57 points (2 children)
[–]caltheon 15 points16 points17 points (0 children)
[–]TheSameTrain 19 points20 points21 points (0 children)
[–]WannaBeRichieRich 95 points96 points97 points (67 children)
[...]

Adding the --recall flag doesn't change anything.

Is it possible to manually specify which elements should be parsed?

adbar · 2024-05-16T11:13:03Z

You can add XPath expressions to remove elements, but you cannot explicitly add elements. That could be a useful improvement.
As for Reddit, the extractor is not made for social networks. You could use Reddit datasets directly instead.
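
Until an option for including elements exists, one possible workaround is to select the wanted nodes yourself before or instead of running the extractor. The following is only a minimal sketch using the Python standard library, not trafilatura's API: the sample markup is a hypothetical, simplified stand-in for a comment wrapped in a form element (the real old.reddit.com structure may differ), and `extract_by_xpath` is a helper name made up for illustration. `xml.etree.ElementTree` needs well-formed markup and supports only a subset of XPath, so real pages would likely call for a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: comment bodies nested inside <form> elements.
# This is an assumption for illustration, not the actual Reddit page structure.
SAMPLE = """
<html><body>
  <div class="entry">
    <p class="tagline">example_user 5 points</p>
    <form class="usertext">
      <div class="md"><p>First comment body.</p></div>
    </form>
  </div>
  <div class="entry">
    <p class="tagline">other_user 2 points</p>
    <form class="usertext">
      <div class="md"><p>Second comment,</p><p>two paragraphs.</p></div>
    </form>
  </div>
</body></html>
"""


def extract_by_xpath(markup: str, xpath: str) -> list[str]:
    """Return the whitespace-normalized text of every element matching xpath."""
    root = ET.fromstring(markup.strip())
    results = []
    for element in root.findall(xpath):
        # Join all text fragments below the element, then collapse whitespace.
        text = " ".join(" ".join(element.itertext()).split())
        if text:
            results.append(text)
    return results


# User-supplied expression naming the elements to keep.
comments = extract_by_xpath(SAMPLE, ".//form//div[@class='md']")
for comment in comments:
    print(comment)
```

Here the expression picks out only the comment bodies and skips the tagline with user names and points, which is the opposite of what the report above observed.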

klvbdmh · 2024-05-20T10:52:31Z

> You can add XPath expressions to remove elements but you cannot explicitly add elements, that could be a useful improvement.

Yes, that would be a great improvement. I have seen other issues where unusual elements contain desirable content (#573). A user-supplied XPath option could be better than hard-coding edge cases.

adbar added the question Further information is requested label May 17, 2024

adbar changed the title ~~Can't parse Reddit comments on old.reddit.com through CLI~~ Add option to provide XPaths for content extraction May 21, 2024

adbar added enhancement New feature or request and removed question Further information is requested labels May 21, 2024