feat: extract pagetype from og:type or ld+json #310

andremacola · 2023-02-24T05:37:55Z

This implements #307

Extract the page type from og:type.
If there is no og:type tag, try to extract it from ld+json.
On some sites the element_text variable arrived with badly formatted HTML, which caused a parsing error and forced a fallback to extract_json_parse_error. This patch fixes the problem with html.unescape().
Tests were written, but I'm not sure whether I should add this validation to all extract_meta_json tests.
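
The lookup order described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from this PR: the function name `sketch_pagetype` and both regular expressions are hypothetical, and a real implementation would work on the parsed HTML tree rather than on regexes.

```python
import html
import json
import re

# Hypothetical patterns for illustration only; they assume the attribute
# order property=... content=... and simple quoting.
OG_TYPE_RE = re.compile(
    r'<meta[^>]+property=["\']og:type["\'][^>]+content=["\']([^"\']+)["\']',
    re.I,
)
LD_JSON_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.I | re.S,
)


def sketch_pagetype(page_html):
    """Return og:type if present, else the first ld+json @type, else None."""
    match = OG_TYPE_RE.search(page_html)
    if match:
        return match.group(1).strip()
    for block in LD_JSON_RE.findall(page_html):
        try:
            # Unescape HTML entities (e.g. &quot;) before parsing the JSON.
            data = json.loads(html.unescape(block))
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and isinstance(data.get("@type"), str):
            return data["@type"]
    return None
```

Unescaping before `json.loads` is what lets an entity-encoded block such as `{&quot;@type&quot;: &quot;NewsArticle&quot;}` parse without going through an error path.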

OBS: I realized that json_metadata already had some @type handling, but I preferred not to change it to avoid breaking anything, so I implemented this with new conditions and lists (not sure if this is the ideal approach).

OBS2: I think the txttocsv function and test_txttocsv should be refactored. Every minor change in the metadata breaks the test.

OBS3: The extract_meta_json function could be simplified and refactored.

OBS4: pagetype is only returned from ld+json if the page is some kind of article/category/home/site/blog etc. I took the types from https://schema.org/docs/full.html
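
A rough sketch of the filtering idea in OBS4, assuming an allowlist of page-like schema.org types. The set below is an illustrative subset chosen for this example, not the list used in the PR, and `filter_pagetype` is a hypothetical helper name.

```python
# Illustrative subset of schema.org types that describe a page or article.
# The actual list in the PR may differ.
PAGE_LIKE_TYPES = {
    "article",
    "newsarticle",
    "blogposting",
    "blog",
    "webpage",
    "website",
    "collectionpage",
}


def filter_pagetype(ld_type):
    """Return the first page-like @type in lowercase, or None if there is none.

    In ld+json, @type may be a single string or a list of strings.
    """
    candidates = ld_type if isinstance(ld_type, list) else [ld_type]
    for candidate in candidates:
        if isinstance(candidate, str) and candidate.lower() in PAGE_LIKE_TYPES:
            return candidate.lower()
    return None
```

With this kind of filter, a type such as `Product` or `Organization` yields no pagetype instead of a misleading value.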

codecov-commenter · 2023-02-24T05:42:19Z

Codecov Report

Merging #310 (fed9da9) into master (dd2d212) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #310      +/-   ##
==========================================
- Coverage   96.17%   96.15%   -0.02%     
==========================================
  Files          21       21              
  Lines        3241     3253      +12     
==========================================
+ Hits         3117     3128      +11     
- Misses        124      125       +1

Impacted Files	Coverage Δ
trafilatura/utils.py	`97.88% <ø> (-0.53%)`	⬇️
trafilatura/json_metadata.py	`96.80% <100.00%> (+0.24%)`	⬆️
trafilatura/metadata.py	`98.43% <100.00%> (+0.01%)`	⬆️
trafilatura/core.py	`98.10% <0.00%> (ø)`

adbar · 2023-02-24T14:10:37Z

Thanks for the PR, I'm going to review it. As for your remarks:

@type in metadata: @felipehertzer do you have an opinion on this?
Yes, txttocsv() could use some refactoring; I'm open to a PR on this, by the way.
I'm not sure how to refactor extract_meta_json(), could you provide a suggestion?
Sounds good.

trafilatura/json_metadata.py

trafilatura/metadata.py

feat: extract pagetype from og:type or ld+json

4c0915d

andremacola added 2 commits February 24, 2023 03:07

improvement extracting ld+json from html

308385e

add jobposting to ogtype schema

a5ed5d1

adbar reviewed Feb 24, 2023

trafilatura/json_metadata.py (resolved)

adbar reviewed Feb 24, 2023

trafilatura/metadata.py (resolved)

andremacola added 2 commits February 24, 2023 13:07

added reference comments to ld+json

a00dec0

add ld+json test with html entities

fed9da9

adbar merged commit b461e23 into adbar:master Feb 24, 2023

andremacola mentioned this pull request Dec 3, 2023

Feat: extract pagetype from og:type or ld+json extractus/article-extractor#373

Closed
