Add option to provide XPaths for content extraction #596

Open
klvbdmh opened this issue May 16, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

klvbdmh commented May 16, 2024

I can't parse comments on Reddit pages (old.reddit.com). The issue is likely due to the comments being encapsulated within form elements, which the parser may not be handling correctly.

Steps to reproduce
Run trafilatura -u "https://old.reddit.com/r/programming/comments/1cnvy7y/how_stripe_prevents_double_payment_using/"

Expected Behavior
trafilatura should successfully parse and extract all visible Reddit comments.

Actual Behavior
Only user names, points, and number of children are extracted:

[–]SittingWave 665 points666 points667 points (73 children)
[–]barbouk 55 points56 points57 points (2 children)
[–]caltheon 15 points16 points17 points (0 children)
[–]TheSameTrain 19 points20 points21 points (0 children)
[–]WannaBeRichieRich 95 points96 points97 points (67 children)
[...]

Adding a --recall flag doesn't change anything.

Is it possible to manually specify which elements should be parsed?

Owner

adbar commented May 16, 2024

You can add XPath expressions to remove elements, but you cannot explicitly add elements; that could be a useful improvement.
As for Reddit, the extractor is not designed for social networks. You could use existing Reddit datasets directly instead.
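
To illustrate the distinction made above between removing elements and explicitly adding them, here is a minimal sketch using only the Python standard library. It is not trafilatura's API: the function name `extract_by_xpath`, the sample markup, and the class names in it are assumptions chosen for illustration, and the real old.reddit.com markup is larger and not guaranteed to parse as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one comment on the page.
SAMPLE = """<div class="comment">
  <p class="tagline">example_user 5 points (0 children)</p>
  <form class="usertext">
    <div class="usertext-body">
      <div class="md"><p>First paragraph of the comment.</p><p>Second paragraph.</p></div>
    </div>
    <ul class="buttons"><li>permalink</li></ul>
  </form>
</div>"""


def extract_by_xpath(markup, include, prune=()):
    """Return the text of elements matched by `include`, after removing `prune` matches.

    Both arguments are lists of path expressions in the limited XPath
    subset that ElementTree supports.
    """
    root = ET.fromstring(markup)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for path in prune:
        for node in root.findall(path):
            parent = parents.get(node)
            # Skip nodes already detached by an earlier prune expression.
            if parent is not None and node in list(parent):
                parent.remove(node)
    texts = []
    for path in include:
        for node in root.findall(path):
            # Join text fragments with spaces and normalize whitespace.
            text = " ".join(" ".join(node.itertext()).split())
            if text:
                texts.append(text)
    return texts


comments = extract_by_xpath(SAMPLE, include=[".//div[@class='md']"])
form_text = extract_by_xpath(
    SAMPLE, include=[".//form"], prune=[".//ul[@class='buttons']"]
)
```

The `include` list plays the role of the requested option, while `prune` mirrors the existing ability to remove elements. ElementTree matches the `class` attribute as an exact string, so pages whose elements carry several classes would need a fuller XPath engine such as lxml.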

@adbar added the question label (Further information is requested) on May 17, 2024
Author

klvbdmh commented May 20, 2024

> You can add XPath expressions to remove elements but you cannot explicitly add elements, that could be a useful improvement.

Yes, that would be a great improvement. Other issues also involve unusual elements that hold desirable content (#573). A user-facing option would be better than hard-coding edge cases.
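
One way to picture user-supplied XPaths instead of hard-coded edge cases is a small rule table keyed by hostname. This is only a sketch of the idea: `INCLUDE_RULES`, `include_xpaths_for`, and the XPath shown are hypothetical names and values, not part of trafilatura.

```python
from urllib.parse import urlsplit

# Hypothetical user-supplied rules: hostname -> XPath expressions to include.
INCLUDE_RULES = {
    "old.reddit.com": [".//div[@class='md']"],
}


def include_xpaths_for(url, rules=INCLUDE_RULES):
    """Return the include XPaths configured for the URL's host, or an empty list."""
    host = urlsplit(url).hostname or ""
    return list(rules.get(host, []))
```

A caller would look up the expressions for a URL and pass them to whatever extraction step accepts them; hosts without a rule fall back to the default behavior.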

@adbar changed the title from "Can't parse Reddit comments on old.reddit.com through CLI" to "Add option to provide XPaths for content extraction" on May 21, 2024
@adbar added the enhancement label (New feature or request) and removed the question label (Further information is requested) on May 21, 2024