feat: extract pagetype from og:type or ld+json #310

Merged
merged 5 commits into from
Feb 24, 2023

Conversation

andremacola
Contributor

@andremacola andremacola commented Feb 24, 2023

This implements #307

  • Extract from og:type
  • If no og:type tag, try to extract from ld+json
  • On some sites, the element_text variable contained badly formatted HTML, which caused parsing errors and forced a fallback to extract_json_parse_error. This patch fixes the problem with html.unescape()
  • Tests were added, but I'm not sure whether I should add validation to all extract_meta_json tests
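
The lookup order described in the list above can be sketched roughly as follows. This is an illustration using only the standard library, not the code from this PR: the class `PageTypeParser` and the function `sketch_extract_pagetype` are hypothetical names, and trafilatura's own implementation is structured differently.

```python
import html
import json
from html.parser import HTMLParser


class PageTypeParser(HTMLParser):
    """Collect the og:type meta value and the raw ld+json script bodies."""

    def __init__(self):
        super().__init__()
        self.og_type = None
        self.ld_json_blocks = []
        self._in_ld_json = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:type":
            # Keep the first og:type found in the document.
            self.og_type = self.og_type or attrs.get("content")
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_ld_json = True
            self.ld_json_blocks.append("")

    def handle_data(self, data):
        # Script content is passed through raw, so entities are still escaped.
        if self._in_ld_json:
            self.ld_json_blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld_json = False


def sketch_extract_pagetype(html_string):
    """Return og:type if present, otherwise the first string @type in ld+json."""
    parser = PageTypeParser()
    parser.feed(html_string)
    parser.close()
    if parser.og_type:
        return parser.og_type.strip()
    for raw in parser.ld_json_blocks:
        try:
            # html.unescape() repairs entity-escaped JSON such as &quot;
            data = json.loads(html.unescape(raw))
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and isinstance(data.get("@type"), str):
            return data["@type"]
    return None
```

Without the `html.unescape()` call, the entity-escaped JSON in a script body like `{&quot;@type&quot;: &quot;NewsArticle&quot;}` would not parse, which is the failure mode the third bullet refers to.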

Note 1: I noticed that json_metadata already had some @type handling, but I preferred not to change it to avoid breaking anything, so I implemented this with new conditions and lists (not sure if this is the ideal approach).

Note 2: I think the txttocsv function and test_txttocsv should be refactored. Every minor change in metadata breaks the test.

Note 3: The extract_meta_json function could be simplified and refactored.

Note 4: pagetype is only returned from ld+json if the page is some kind of article, category, home page, site, blog, etc. I took the types from https://schema.org/docs/full.html
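
The filtering described in the last remark above could look something like the sketch below. The allowlist `PAGE_LIKE_TYPES` and the function `sketch_pagetype_from_ld_json` are hypothetical; the entries are real schema.org type names, but the actual list used in this PR is not reproduced here and may differ.

```python
# Hypothetical allowlist of page-like schema.org types (lowercased for matching).
PAGE_LIKE_TYPES = {
    "article",
    "newsarticle",
    "blogposting",
    "blog",
    "website",
    "webpage",
    "collectionpage",
}


def sketch_pagetype_from_ld_json(data):
    """Return the first @type that names a page-like type, else None."""
    types = data.get("@type")
    if isinstance(types, str):
        # schema.org allows @type to be a single string or a list of strings.
        types = [types]
    if not isinstance(types, list):
        return None
    for candidate in types:
        if isinstance(candidate, str) and candidate.lower() in PAGE_LIKE_TYPES:
            return candidate
    return None
```

Types such as `Person` or `Organization` are skipped this way, so an author or publisher block in the ld+json does not end up reported as the page type.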

@codecov-commenter

codecov-commenter commented Feb 24, 2023

Codecov Report

Merging #310 (fed9da9) into master (dd2d212) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #310      +/-   ##
==========================================
- Coverage   96.17%   96.15%   -0.02%     
==========================================
  Files          21       21              
  Lines        3241     3253      +12     
==========================================
+ Hits         3117     3128      +11     
- Misses        124      125       +1     
Impacted Files                  Coverage Δ
trafilatura/utils.py            97.88% <ø> (-0.53%) ⬇️
trafilatura/json_metadata.py    96.80% <100.00%> (+0.24%) ⬆️
trafilatura/metadata.py         98.43% <100.00%> (+0.01%) ⬆️
trafilatura/core.py             98.10% <0.00%> (ø)

@adbar
Owner

adbar commented Feb 24, 2023

Thanks for the PR, I'm going to review it. As for your remarks:

  1. @type in metadata: @felipehertzer do you have an opinion on this?
  2. Yes, txttocsv() could use some refactoring; I'm open to a PR on this, by the way.
  3. I'm not sure how to refactor extract_meta_json(), could you provide a suggestion?
  4. Sounds good.

3 participants