feat: Add image urls to metadata #282
Hi @andremacola, that's basically the idea, thanks for the draft!
Nice. For now this is working for my use case. I'll try to update the tests by the end of next week.
@andremacola could you please update the tests?
I last used trafilatura a couple of years ago and what I did to augment my extraction with metadata was to use
@boxabirds, thanks, we'll have a look.
Hey @adbar, sorry for the late response. I am very busy at work until next week. I'll try my best to update the tests when I'm free.
Created image tests. Just trying to fix Edit: OK, I think I'm done. Could you please review?
Codecov Report
@@            Coverage Diff             @@
##           master     #282      +/-   ##
==========================================
- Coverage   95.52%   95.32%   -0.20%
==========================================
  Files          21       21
  Lines        3288     3319      +31
==========================================
+ Hits         3141     3164      +23
- Misses        147      155       +8
Hi @andremacola, thanks for completing the PR and making the tests pass! Before I review your additions, do you want to have a look at the expressions mentioned by @boxabirds above? I'm not primarily interested in images but I guess it would be interesting to add more ways to extract the relevant data?
@adbar I'm already pulling from OG, Twitter, and image meta tags. Using extruct to extract all the metadata would be easier, but it might come with a performance hit. As a next step, I could check inside the JSON-LD data, not only for images but also for other parameters. But as I said, that's for a later version. I've been pretty busy lately, unfortunately.
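For reference, the JSON-LD check mentioned above as a possible next step could look roughly like the sketch below. It is not part of this PR and not trafilatura's code: the names `jsonld_image_url`, `_JsonLdParser`, and `_image_from_value` are hypothetical, and it only assumes that the `image` property may be a plain URL string, an object with a `url` key, or a list of either.

```python
import json
from html.parser import HTMLParser


class _JsonLdParser(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _image_from_value(value):
    """Return the first URL found in a JSON-LD "image" value, or None."""
    if isinstance(value, str):
        return value or None
    if isinstance(value, dict):
        return _image_from_value(value.get("url"))
    if isinstance(value, list):
        for item in value:
            url = _image_from_value(item)
            if url:
                return url
    return None


def jsonld_image_url(html):
    """Hypothetical helper: first image URL declared in any JSON-LD block."""
    parser = _JsonLdParser()
    parser.feed(html)
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD instead of failing the page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                url = _image_from_value(item.get("image"))
                if url:
                    return url
    return None
```

Nested structures such as `@graph` are deliberately left out to keep the sketch short.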
It’s really great what you already added — thanks!
(Email reply to André Mácola's comment of Fri, 20 Jan 2023 at 15:33.)
Nice, I'll review and integrate the PR by next week at the latest.
Sometimes an image is not included in the text body, but we can still extract it from SEO meta tags.
Issue: #281
Unfortunately, I didn't have time to create the tests.
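To illustrate the idea in the description, here is a minimal sketch of pulling an image URL from Open Graph and Twitter meta tags. It uses only the standard library and is not the code added in this PR; `extract_image_url`, `_MetaImageParser`, and the priority order in `IMAGE_META_KEYS` are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import Optional

# Assumed priority order for the example; the PR may check tags differently.
IMAGE_META_KEYS = ("og:image", "twitter:image", "twitter:image:src")


class _MetaImageParser(HTMLParser):
    """Collect candidate image URLs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph uses "property"; Twitter cards usually use "name".
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key in IMAGE_META_KEYS and content and key not in self.found:
            self.found[key] = content.strip()


def extract_image_url(html: str) -> Optional[str]:
    """Hypothetical helper: return the best image URL found, or None."""
    parser = _MetaImageParser()
    parser.feed(html)
    for key in IMAGE_META_KEYS:
        if key in parser.found:
            return parser.found[key]
    return None
```

Usage: `extract_image_url(page_html)` returns the `og:image` value when present, falls back to the Twitter tags, and returns `None` when the page declares no image.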