Fixed Extraction when Meta tag has an empty content #545

Merged: 5 commits, Apr 5, 2024

felipehertzer
Contributor

Hey @adbar,

I had a few cases where the meta tags below are empty, and when that happens the extraction stops working.

   <meta name="title" content=" " />
   <meta property="og:title" content=" " />
   <meta name="twitter:title" content=" " />

I checked, and the line below tests for None before trying to get the correct title, but that branch is never taken because the title is ' ' at this point.

if metadata.title is None:

Even the JSON parsing checks for None, but the current title is ' ':

if metadata.title is None:

Then this function converts the title ' ' to None:

metadata.clean_and_trim()

And then it will fail here:

if only_with_metadata is True and any(
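
The chain of events described above can be reproduced in isolation. The following is a minimal sketch, not the library's code: the `Metadata` class, `extract_title` and `passes_metadata_filter` are hypothetical stand-ins, and the only behaviour assumed is what this description states (fallbacks run only when the title is None, and cleaning later turns ' ' into None).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for the metadata object discussed above."""
    title: Optional[str] = None

    def clean_and_trim(self) -> None:
        # Strip whitespace; a value that ends up empty becomes None.
        if self.title is not None:
            self.title = self.title.strip() or None


def extract_title(meta_content: Optional[str], fallback: str) -> Optional[str]:
    metadata = Metadata(title=meta_content)
    # The fallback only runs for None, so a title of " " skips it.
    if metadata.title is None:
        metadata.title = fallback
    # Cleaning happens afterwards and turns " " into None.
    metadata.clean_and_trim()
    return metadata.title


def passes_metadata_filter(title: Optional[str]) -> bool:
    # Simplified only_with_metadata check: the title must be set.
    return title is not None


blank = extract_title(" ", fallback="Title from <h1>")
missing = extract_title(None, fallback="Title from <h1>")
print(passes_metadata_filter(blank))    # False: the document is rejected
print(passes_metadata_filter(missing))  # True: the fallback was used
```

A blank title is therefore worse than a missing one: it blocks the fallbacks and is still discarded in the end.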

This is an example of the problem:

Example

I ran the tests and comparison_small.py, and the results appear to be the same.

Thanks.

codecov bot commented Apr 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.26%. Comparing base (fb3e174) to head (b4f037a).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #545   +/-   ##
=======================================
  Coverage   97.26%   97.26%           
=======================================
  Files          22       22           
  Lines        3438     3438           
=======================================
  Hits         3344     3344           
  Misses         94       94           


@adbar
Owner

adbar commented Apr 5, 2024

Instead of trimming, maybe element.get("content").isspace()? This would be more lightweight.
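
A rough sketch of what such a check could look like, for illustration only. The helper name `usable_content` is hypothetical, and the standard library's `xml.etree.ElementTree` is used here just to keep the example dependency-free; the merged change may be written differently. Note that `"".isspace()` returns False and that `get` returns None when the attribute is absent, so both cases need to be handled alongside the whitespace test.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def usable_content(element: ET.Element) -> Optional[str]:
    """Return the content attribute, or None if it is missing, empty or blank."""
    content = element.get("content")
    # "".isspace() is False, so the emptiness check has to come first.
    if not content or content.isspace():
        return None
    return content


blank = ET.fromstring('<meta name="title" content=" " />')
empty = ET.fromstring('<meta name="title" content="" />')
absent = ET.fromstring('<meta name="title" />')
valid = ET.fromstring('<meta property="og:title" content="Some title" />')

for tag in (blank, empty, absent, valid):
    print(repr(usable_content(tag)))
```

Compared with trimming, `isspace()` only answers a yes/no question and does not build a new string, which is presumably what "more lightweight" refers to.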

@felipehertzer
Contributor Author

Oh, that makes sense, I updated it. Thanks.

@adbar
Owner

adbar commented Apr 5, 2024

Thanks!

@adbar adbar merged commit 8125043 into adbar:master Apr 5, 2024
16 checks passed