Crawling error rises as invalid URLs are extracted #108

Open
MrOrz opened this issue Oct 24, 2018 · 0 comments
MrOrz commented Oct 24, 2018

In Google Webmaster Tools, we are receiving crawl errors for weird URLs:
[Screenshot: 2018-10-24 1 21 57]

These URLs are extracted from pages like this:
https://cofacts.g0v.tw/article/2yje6no2cqv2v

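The text 「www.sushiexpress.com.tw,爭鮮也已經出面回應這是假訊息。」 ("www.sushiexpress.com.tw, Sushi Express has also come forward to say that this is false information.") is being extracted as a single URL, and "http://" is not prepended, so the crawler treats it as a relative link and goes to https://cofacts.g0v.tw/article/www.sushiexpress.com.tw,⋯⋯ instead.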
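
To illustrate why a scheme-less link ends up under /article/, here is a small sketch using Python's standard `urllib.parse.urljoin`, which follows the same relative-reference resolution rules that browsers and crawlers apply. The trailing Chinese text is left out for brevity; only the host part of the extracted string is used.

```python
from urllib.parse import urljoin

# Page on which the broken hyperlink appears (taken from this issue).
page_url = "https://cofacts.g0v.tw/article/2yje6no2cqv2v"

# The extracted "URL" has no scheme, so it is resolved as a relative path:
# the last path segment of the page URL is replaced by the link text.
broken_href = "www.sushiexpress.com.tw"
resolved = urljoin(page_url, broken_href)
print(resolved)

# With a scheme prepended, the link is absolute and resolves to itself.
fixed = urljoin(page_url, "http://" + broken_href)
print(fixed)
```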

We should fix the URL scraping logic, prepend http:// or https:// when needed, and rewrite the hyperlinks field of all articles and replies to remove the wrongly extracted URLs.
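
A minimal sketch of what such an extraction fix could look like, written in Python for illustration only; it is not the project's actual scraping code. The names `URL_PATTERN`, `TRAILING_PUNCTUATION`, and `extract_urls` are hypothetical, and defaulting to `http://` (rather than `https://`) is an assumption. The idea is to match only ASCII characters that are legal in URLs, so the match stops at CJK text and full-width punctuation such as 「,」, and then to prepend a scheme when none is present.

```python
import re

# Hypothetical pattern: an explicit http(s) scheme or a bare "www." prefix,
# followed only by ASCII characters that may appear in a URL. CJK characters
# and full-width punctuation are outside this class, so the match ends there.
URL_PATTERN = re.compile(
    r"(?:https?://|www\.)[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+",
    re.IGNORECASE,
)

# ASCII punctuation that usually belongs to the sentence, not the URL.
# Simplification: this also strips a closing ")" that is part of the URL.
TRAILING_PUNCTUATION = ".,;:!?)'\""


def extract_urls(text: str) -> list[str]:
    """Return absolute URLs found in text, prepending http:// when needed."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        candidate = match.group(0).rstrip(TRAILING_PUNCTUATION)
        # Assumption: scheme-less hosts default to http://.
        if not re.match(r"https?://", candidate, re.IGNORECASE):
            candidate = "http://" + candidate
        urls.append(candidate)
    return urls


sample = "www.sushiexpress.com.tw,爭鮮也已經出面回應這是假訊息。"
print(extract_urls(sample))
```

The same function could be run over the stored text of existing articles and replies to regenerate their hyperlinks, though how that migration is done depends on the storage layer, which this issue does not describe.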

@MrOrz MrOrz added the bug label Oct 24, 2018
@MrOrz MrOrz self-assigned this Oct 24, 2018
@MrOrz MrOrz changed the title from "Scrap urls rises as invalid URLs are extracted" to "Crawling error rises as invalid URLs are extracted" Oct 24, 2018