Good test cases #9
Comments
Hi Vlad, thanks for the input.
@adbar For gardeners.com, the date is given as
Hi, thank you for the suggestion, I believe a regex like
The same error as on almanac.com occurs on https://ozhanozturk.com/2017/09/30/halkbilim-sozlugu-folklor-sozlugu-kar-kilic/. The regex for JSON works for ozhanozturk.com but not for almanac.com, because of the ": " separator.
When fetching content with urllib, almanac.com gives a 404, while fetching from ozhanozturk.com is successful. Even so, both return None with find_date. Calling json_search directly, however, does not return None: it returns the date.
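To illustrate the ": " point above, here is a minimal sketch of a JSON date pattern that tolerates optional whitespace around the colon. It is not htmldate's actual regex: the pattern, the `datePublished` key, the function name `search_json_date`, and the sample strings are all assumptions made for illustration.

```python
import re

# Hypothetical pattern (not the one used by htmldate): \s* on both sides of
# the colon accepts "key":"value" as well as "key": "value".
JSON_DATE = re.compile(r'"datePublished"\s*:\s*"(\d{4}-\d{2}-\d{2})')


def search_json_date(html):
    """Return the first YYYY-MM-DD date found in JSON metadata, or None."""
    match = JSON_DATE.search(html)
    return match.group(1) if match else None


# Illustrative snippets, not copied from the pages discussed in this issue.
compact = '{"datePublished":"2017-09-30T10:00:00+03:00"}'
spaced = '{"datePublished": "2020-02-12T09:30:00-05:00"}'

print(search_json_date(compact))
print(search_json_date(spaced))
```

A pattern that hard-codes `":"` would match only the first snippet, which would be consistent with one site working and the other failing.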
Thanks for the bug report, I'll look into it.
gardeners.com is fixed in #13.
Hi everyone, I fixed a further bug caused by an updated page at gardeners.com in 97dee35. The two other issues come down to Python's default user-agent being blocked. Working around that requires user-agent rotation, which goes beyond the scope of this library. See trafilatura for more info.
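For anyone who wants to try this outside the library, here is a rough sketch of fetching a page with an explicit user-agent and a timeout using only the standard library. The user-agent string, the helper names `build_request` and `fetch_html`, and the 10-second timeout are assumptions, and a site may still block the request or respond slowly for other reasons.

```python
import urllib.request

# Assumed example value; any browser-like string could be substituted.
DEFAULT_UA = "Mozilla/5.0 (compatible; example-fetcher/0.1)"


def build_request(url, user_agent=DEFAULT_UA):
    """Build a request that sends a custom User-Agent instead of urllib's default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page as text; the timeout keeps a stalled connection from blocking forever."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be passed to find_date, which accepts an HTML string as well as a URL.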
Hi Adrien,
Here are a few test cases where the extraction gave a wrong answer:
https://www.gardeners.com/how-to/vegetable-gardening/5069.html
https://www.almanac.com/vegetable-gardening-for-beginners
Somewhat related, this one 'hangs':
https://www.homedepot.com/c/ah/how-to-start-a-vegetable-garden/9ba683603be9fa5395fab90d6de2854