Good test cases #9

Closed
vprelovac opened this issue Apr 9, 2020 · 9 comments

@vprelovac

Hi Adrien

here are a few test cases where the extraction gave a wrong answer:

https://www.gardeners.com/how-to/vegetable-gardening/5069.html
https://www.almanac.com/vegetable-gardening-for-beginners

Somewhat related, this one 'hangs':
https://www.homedepot.com/c/ah/how-to-start-a-vegetable-garden/9ba683603be9fa5395fab90d6de2854

@adbar
Owner

adbar commented Apr 21, 2020

Hi Vlad, thanks for the input.
The Home Depot case probably hangs because they ban automated user-agents. For gardeners.com I can't find a date (even as a human annotator), and almanac.com is indeed a deeply-rooted problem.

@evolutionoftheuniverse
Contributor

@adbar For gardeners.com, the date is given as <p class="bottom-spacer"><em>Last updated: 5/5/20</em></p>.

@adbar
Owner

adbar commented Jul 3, 2020

Hi, thank you for the suggestion. I believe a regex like updated: ... might do the trick. Do you want to try and write a PR?

@evolutionoftheuniverse
Contributor

((U|u)pdated|UPDATED): ?[0-9]{1,2}(\.|\/)[0-9]{1,2}(\.|\/)([0-9]{4}|[0-9]{2}) might work.
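
As a rough check, the pattern above can be tried on the gardeners.com snippet quoted earlier. This is only a sketch, not the code that ended up in the library: the function name find_updated_date is hypothetical, extra capture groups were added around the numbers, the month/day/year order is an assumption (US-style site), and two-digit years are assumed to belong to the 2000s.

```python
import re
from datetime import date

# Pattern as proposed in this thread, with capture groups added around
# the two numbers and the year so that they can be read back.
UPDATED_RE = re.compile(
    r"((U|u)pdated|UPDATED): ?([0-9]{1,2})(\.|\/)([0-9]{1,2})(\.|\/)([0-9]{4}|[0-9]{2})"
)


def find_updated_date(html):
    """Return a date for an 'updated: M/D/Y' mention, or None (hypothetical helper)."""
    match = UPDATED_RE.search(html)
    if match is None:
        return None
    month, day, year = int(match.group(3)), int(match.group(5)), int(match.group(7))
    if year < 100:
        # Assumption: two-digit years refer to the 2000s.
        year += 2000
    try:
        # Assumption: month comes first, then day.
        return date(year, month, day)
    except ValueError:
        # Not a valid month/day combination under the assumed order.
        return None


snippet = '<p class="bottom-spacer"><em>Last updated: 5/5/20</em></p>'
print(find_updated_date(snippet))
```

Under these assumptions a string such as 31/12/2019 is rejected instead of being guessed at, since the day/month order cannot be known from the pattern alone.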

@evolutionoftheuniverse
Contributor

The same error as on almanac.com occurs on https://ozhanozturk.com/2017/09/30/halkbilim-sozlugu-folklor-sozlugu-kar-kilic/. The regex for JSON works for ozhanozturk.com, but not for almanac.com because of the ": ".

When trying to fetch the content with urllib, almanac.com gives a 404. Fetching from ozhanozturk.com is successful. Still, find_date returns None. However, calling json_search does not return None; it returns the date.
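
If the ": " refers to whitespace after the colon in the JSON metadata (an assumption on my part), a pattern that tolerates it could look like the sketch below. This is not the library's actual json_search pattern; the name search_json_date and both sample strings are made up for illustration, while datePublished and dateModified are the usual schema.org property names.

```python
import re

# \s* on both sides of the colon accepts "key":"value" as well as "key": "value".
JSON_DATE_RE = re.compile(
    r'"date(?:Published|Modified)"\s*:\s*"([0-9]{4}-[0-9]{2}-[0-9]{2})'
)


def search_json_date(html):
    """Return the first YYYY-MM-DD date found in JSON metadata, or None (hypothetical helper)."""
    match = JSON_DATE_RE.search(html)
    return match.group(1) if match else None


# Illustrative inputs, not copied from either website.
compact = '{"@type":"Article","datePublished":"2017-09-30T10:00:00+00:00"}'
spaced = '{"@type": "Article", "datePublished": "2017-09-30T10:00:00+00:00"}'

print(search_json_date(compact))
print(search_json_date(spaced))
```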

@adbar
Owner

adbar commented Jul 8, 2020

Thanks for the bug report, I'll look into it.

@evolutionoftheuniverse
Contributor

gardeners.com is fixed in #13.

@adbar
Owner

adbar commented Apr 28, 2021

Hi everyone, I fixed a further bug with an updated page at gardeners.com in 97dee35.

The two other issues are now related to the user-agent sent from Python, which is blocked. Working around this requires user-agent rotation, which goes beyond the scope of this library. See trafilatura for more info.
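
For anyone who wants to download the pages themselves, a minimal sketch of user-agent rotation with the standard library could look like this. It is not part of this library or of trafilatura: the helper names build_request and fetch are hypothetical, the user-agent strings are only illustrative, and there is no guarantee that a given site accepts them.

```python
import random
import urllib.request
from urllib.error import HTTPError, URLError

# Illustrative browser-like user-agent strings (assumptions, replace as needed).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
]


def build_request(url, user_agents=USER_AGENTS):
    """Build a request whose User-Agent header is picked at random from the list."""
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)


def fetch(url, timeout=10):
    """Download a page and return its HTML, or None on HTTP or network errors."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (HTTPError, URLError):
        return None
```

The HTML string returned by fetch could then be passed to find_date instead of the URL, so that the download step stays under the caller's control.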

@adbar closed this as completed Apr 28, 2021