feat: Add image urls to metadata #282

Merged
5 commits merged on Jan 20, 2023

Conversation

andremacola
Contributor

@andremacola andremacola commented Dec 28, 2022

Sometimes an image is not included in the text body, but it can still be extracted from some SEO tags.

Issue: #281

Unfortunately, I didn't have time to create the tests.
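
To illustrate the idea of reading an image URL from SEO tags, here is a minimal sketch using only the Python standard library. It is not the code merged in this PR: the helper names (`MetaImageParser`, `extract_meta_image`) and the list of meta keys are assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Assumed set of meta keys that commonly carry a lead image URL;
# the PR itself may check a different list.
IMAGE_META_KEYS = ("og:image", "og:image:url", "twitter:image", "twitter:image:src")


class MetaImageParser(HTMLParser):
    """Collect candidate image URLs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.candidates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph uses property=..., Twitter cards usually use name=...
        key = (attributes.get("property") or attributes.get("name") or "").lower()
        content = attributes.get("content")
        if key in IMAGE_META_KEYS and content:
            # Keep the first value seen for each key
            self.candidates.setdefault(key, content.strip())


def extract_meta_image(html):
    """Return the first image URL found, in IMAGE_META_KEYS priority order, or None."""
    parser = MetaImageParser()
    parser.feed(html)
    parser.close()
    for key in IMAGE_META_KEYS:
        if key in parser.candidates:
            return parser.candidates[key]
    return None
```

Calling `extract_meta_image(html_string)` returns a URL string or `None`. The priority order (Open Graph before Twitter) is a design choice made for this sketch only.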

@andremacola andremacola changed the title from "feat: Add image urls to metadata (https://github.com/adbar/trafilatura/issues/281)" to "feat: Add image urls to metadata" Dec 28, 2022
@adbar
Owner

adbar commented Dec 29, 2022

Hi @andremacola, that's basically the idea, thanks for the draft!
You still have to adapt the tests to the new metadata though, and please also add tests for the new lines of code.

@andremacola
Contributor Author

> Hi @andremacola, that's basically the idea, thanks for the draft! You still have to adapt the tests to the new metadata though, and please also add tests for the new lines of code.

Nice. For now this is working for my use case. I'll try to update the tests by the end of next week.

@adbar adbar linked an issue Jan 9, 2023 that may be closed by this pull request
@adbar
Owner

adbar commented Jan 18, 2023

@andremacola could you please update the tests?

@boxabirds

boxabirds commented Jan 19, 2023

I last used trafilatura a couple of years ago, and what I did to augment my extraction with metadata was to use extruct. Below is the data that I found (at least in 2021) was missing from trafilatura. Not sure if this has changed, but maybe it's worth considering some kind of merge or collaboration between these two libraries?

EXTRUCT_METADATA_MAP = {
    "textDescription": [
        "[opengraph][*]['og:description']",
        "[json-ld][*][description]",
    ],
    "authors": ["[json-ld][*][author][*][name]", "[json-ld][*][author][*]"],
    "publisherIconUrl": [
        "[json-ld][*][logo][url]",
        "[json-ld][*][publisher][logo][url]",
    ],
    "datePublished": ["[json-ld][*][datePublished]"],
    "image": [
        "[json-ld][*][image][url]",
        "[opengraph][*]['og:image']",
        "[json-ld][*][image][*]",
    ],
    "canonicalUrl": [
        "[json-ld][*][mainEntityOfPage][@id]",
        "[microdata-ld][*][mainEntityOfPage]",
        "[opengraph][*]['og:url']",
    ],
}
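
The comment does not show how these bracketed expressions are evaluated. One possible reading, sketched here, treats each `[key]` as a dictionary lookup and `[*]` as "every item of a list", trying the expressions in order until one yields a scalar. The functions `resolve` and `first_match` are hypothetical, and the input is assumed to be a plain nested dict shaped to match the expressions, which is not necessarily what extruct returns.

```python
import re

# Matches each bracketed segment, e.g. [json-ld], [*], ['og:image']
SEGMENT = re.compile(r"\[([^\]]+)\]")


def resolve(data, expression):
    """Return every value matched by a bracketed path such as "[json-ld][*][image][url]"."""
    # Split the expression into keys, dropping optional quotes around a key
    keys = [segment.strip("'\"") for segment in SEGMENT.findall(expression)]
    current = [data]
    for key in keys:
        following = []
        for node in current:
            if key == "*":
                # Wildcard: step into each item of a list, ignore anything else
                if isinstance(node, list):
                    following.extend(node)
            elif isinstance(node, dict) and key in node:
                following.append(node[key])
        current = following
    return current


def first_match(data, expressions):
    """Try each expression in order and return the first scalar hit, or None."""
    for expression in expressions:
        for value in resolve(data, expression):
            if isinstance(value, (str, int, float)):
                return value
    return None
```

With a map like the one above, a lookup could read `first_match(extracted, EXTRUCT_METADATA_MAP["image"])`, where `extracted` stands for whatever nested structure the metadata extractor produced.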

@adbar
Owner

adbar commented Jan 19, 2023

@boxabirds, thanks, we'll have a look

@andremacola
Contributor Author

Hey @adbar, sorry for the late response. I am very busy at work until next week. I'll try my best to update the tests when I'm free.

@andremacola
Contributor Author

andremacola commented Jan 19, 2023

Created image tests. Just trying to fix the test_txttocsv unit test.

Edit: Ok. I think I'm done. Could you please review?

@codecov-commenter

Codecov Report

Merging #282 (027e3d9) into master (14d9782) will decrease coverage by 0.20%.
The diff coverage is 78.94%.

@@            Coverage Diff             @@
##           master     #282      +/-   ##
==========================================
- Coverage   95.52%   95.32%   -0.20%     
==========================================
  Files          21       21              
  Lines        3288     3319      +31     
==========================================
+ Hits         3141     3164      +23     
- Misses        147      155       +8     
| Impacted Files | Coverage Δ |
|---|---|
| trafilatura/utils.py | 98.41% <ø> (ø) |
| trafilatura/metadata.py | 92.37% <78.94%> (-1.90%) ⬇️ |


@adbar
Owner

adbar commented Jan 20, 2023

Hi @andremacola, thanks for completing the PR and making the tests pass!

Before I review your additions, do you want to have a look at the expressions mentioned by @boxabirds above? I'm not primarily interested in images but I guess it would be interesting to add more ways to extract the relevant data?

@andremacola
Contributor Author

@adbar I'm already pulling from the OG, Twitter and image meta tags.

Using extruct to extract all the metadata would be easier, but there may be a performance hit.

As a next step, I could check inside the JSON-LD data, not only for images but also for other parameters.

But as I said, that's for a future version. I've been pretty busy lately, unfortunately.
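
As a rough idea of what a JSON-LD follow-up might look like, the sketch below collects `image` values from `<script type="application/ld+json">` blocks with the standard library only. It is an assumption-laden illustration, not part of this PR: the names `JsonLdParser`, `image_urls` and `extract_jsonld_images` are hypothetical, and nested containers such as `@graph` are not handled.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Decode the content of JSON-LD script blocks (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def image_urls(node):
    """Yield URLs from an "image" value: a string, a {"url": ...} dict, or a list of those."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        if isinstance(node.get("url"), str):
            yield node["url"]
    elif isinstance(node, list):
        for item in node:
            yield from image_urls(item)


def extract_jsonld_images(html):
    """Return all image URLs found at the top level of JSON-LD objects, in document order."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    urls = []
    for document in parser.documents:
        # A JSON-LD block may hold one object or a list of objects
        items = document if isinstance(document, list) else [document]
        for item in items:
            if isinstance(item, dict) and "image" in item:
                urls.extend(image_urls(item["image"]))
    return urls
```

The same traversal could be extended to other keys (for example `datePublished` or `author`), which is the kind of broader JSON-LD check mentioned above.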

@boxabirds

boxabirds commented Jan 20, 2023 via email

@adbar
Owner

adbar commented Jan 20, 2023

Nice, I'll review and integrate the PR by next week at the latest.

@adbar adbar merged commit 22696f8 into adbar:master Jan 20, 2023
Development

Successfully merging this pull request may close these issues.

Add image urls to metadata
4 participants