Good test cases #9

vprelovac · 2020-04-09T20:09:14Z

Hi Adrien,

Here are a few test cases where the extraction gave a wrong answer:

https://www.gardeners.com/how-to/vegetable-gardening/5069.html
https://www.almanac.com/vegetable-gardening-for-beginners

Somewhat related, this one 'hangs':
https://www.homedepot.com/c/ah/how-to-start-a-vegetable-garden/9ba683603be9fa5395fab90d6de2854

adbar · 2020-04-21T16:33:53Z

Hi Vlad, thanks for the input.
The Home Depot page probably hangs because the site blocks automated user-agents. For gardeners.com, I can't find a date (even as a human annotator), and almanac.com is indeed a deeply rooted problem.

evolutionoftheuniverse · 2020-07-02T23:25:57Z

@adbar For gardeners.com, the date is given as <p class="bottom-spacer"><em>Last updated: 5/5/20</em></p>.

adbar · 2020-07-03T17:59:59Z

Hi, thank you for the suggestion. I believe a regex like updated: ... might do the trick. Do you want to try writing a PR?

evolutionoftheuniverse · 2020-07-04T00:42:50Z

((U|u)pdated|UPDATED): ?[0-9]{1,2}(\.|\/)[0-9]{1,2}(\.|\/)([0-9]{4}|[0-9]{2}) might work.
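
For anyone who wants to try the pattern above, here is a minimal sketch. The capture groups around the numeric fields, the helper name parse_updated, the month-first reading of the date, and the rule that two-digit years belong to the 2000s are all assumptions made for illustration, not part of the library.

```python
import re
from datetime import date

# Pattern from the comment above, with capture groups added around the
# numeric fields (the added groups are an assumption for illustration).
UPDATED_RE = re.compile(
    r"(?:[Uu]pdated|UPDATED): ?([0-9]{1,2})[./]([0-9]{1,2})[./]([0-9]{4}|[0-9]{2})"
)


def parse_updated(text):
    """Return a date for the first 'updated: M/D/Y' match, or None."""
    match = UPDATED_RE.search(text)
    if match is None:
        return None
    # Assumption: month comes first, as in US-style dates.
    month, day, year = (int(group) for group in match.groups())
    if year < 100:
        # Assumption: two-digit years belong to the 2000s.
        year += 2000
    try:
        return date(year, month, day)
    except ValueError:
        # Numbers matched the pattern but do not form a valid date.
        return None


html = '<p class="bottom-spacer"><em>Last updated: 5/5/20</em></p>'
print(parse_updated(html))
```

A day-first site would need the two leading groups swapped, so a real fix would probably have to deal with that ambiguity.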

evolutionoftheuniverse · 2020-07-06T16:50:23Z

The same error as on almanac.com occurs on https://ozhanozturk.com/2017/09/30/halkbilim-sozlugu-folklor-sozlugu-kar-kilic/. However, the regex for JSON works for ozhanozturk.com but not for almanac.com, because of the ": " (a colon followed by a space).
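
One plausible reading of the ": " problem is that a pattern written for compact JSON ("key":"value") misses JSON serialized with a space after the colon. The sketch below illustrates that reading only. Both patterns, the helper find_json_date, and the two sample strings are hypothetical and are not taken from the library or from either site.

```python
import re

# Hypothetical patterns, not the library's actual ones.
STRICT = re.compile(r'"dateModified":"([0-9]{4}-[0-9]{2}-[0-9]{2})')
TOLERANT = re.compile(r'"dateModified":\s*"([0-9]{4}-[0-9]{2}-[0-9]{2})')

# Made-up sample strings: one compact, one with a space after each colon.
compact = '{"@type":"Article","dateModified":"2017-09-30T10:00:00+00:00"}'
spaced = '{"@type": "Article", "dateModified": "2020-06-15T10:00:00+00:00"}'


def find_json_date(text, pattern):
    """Return the YYYY-MM-DD string captured by pattern, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


for label, sample in (("compact", compact), ("spaced", spaced)):
    print(label, find_json_date(sample, STRICT), find_json_date(sample, TOLERANT))
```

Allowing optional whitespace after the colon lets one pattern cover both serializations.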

evolutionoftheuniverse · 2020-07-07T17:57:13Z

When trying to fetch content with urllib, almanac.com gives a 404. Fetching from ozhanozturk.com is successful. Still, they return None with find_date. However, calling json_search does not return None; it returns the date.
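
To tell a download failure apart from an extraction failure, it may help to check the HTTP status before looking at the date. The following is a rough sketch using only the standard library. The helper name fetch_html and its return convention are assumptions for illustration.

```python
import urllib.error
import urllib.request


def fetch_html(url, user_agent=None, timeout=10):
    """Return (status, html); html is None when the server answers with an HTTP error."""
    # Without an explicit header, urllib sends its default "Python-urllib/x.y" agent.
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.status, response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # E.g. 404 or 403: the page was never downloaded, so there is nothing to extract.
        return err.code, None
```

If the status is 200 and find_date still returns None on the downloaded HTML, the problem is more likely in the extraction than in the fetch. The timeout keeps a stalled connection from hanging forever, although it raises an exception that this sketch does not catch.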

adbar · 2020-07-08T17:13:34Z

Thanks for the bug report, I'll look into it.

evolutionoftheuniverse · 2020-07-12T12:04:34Z

gardeners.com is fixed in #13.

adbar · 2021-04-28T19:16:34Z

Hi everyone, I fixed a further bug with an updated page at gardeners.com in 97dee35.

The two other issues are now related to the Python user-agent, which is blocked. Working around this requires user-agent rotation, which goes beyond the scope of this library. See trafilatura for more info.
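
For readers who want to handle the download themselves, user-agent rotation could look roughly like the sketch below. It uses only the standard library. The user-agent strings and the helper name build_request are hypothetical, and this is not how trafilatura or this library does it.

```python
import itertools
import urllib.request

# Hypothetical user-agent strings, for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
]
_rotation = itertools.cycle(USER_AGENTS)


def build_request(url):
    """Build a request carrying the next user-agent in the rotation."""
    return urllib.request.Request(url, headers={"User-Agent": next(_rotation)})
```

The request can be passed to urllib.request.urlopen, and the HTML that comes back can then be handed to find_date as a string, so the library itself never needs to do the download.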

adbar closed this as completed Apr 28, 2021
