
-i docs doesn't ignore subdomains containing "docs" #79

Closed
cyb3rjerry opened this issue Nov 14, 2022 · 3 comments
Labels
bug Something isn't working Go good first issue Good for newcomers help wanted Extra attention is needed

Comments

@cyb3rjerry
Contributor

Describe the bug
When crawling https://mapbox.com/ we notice that "docs.mapbox.com" still gets crawled.

To Reproduce
Steps to reproduce the behavior:

  1. Run echo "https://mapbox.com/" | cariddi -s -intensive -i docs
  2. Look at output

Expected behavior
docs.* not to be crawled

Desktop (please complete the following information):

  • OS: Linux WKS-001772 5.19.0-23-generic 24-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct 14 15:39:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Version: v1.1.9
@edoardottt
Owner

Thank you so much @cyb3rjerry for your contribution, really appreciated.

Actually, you are right, it should not happen.
Investigating a bit, it seems that about 99% of the URLs that should not pass are indeed ignored, but a small fraction of them still get printed on the CLI. This is because the method used to print URLs on the CLI is:

```go
c.OnResponse(func(r *colly.Response) {
	fmt.Println(r.Request.URL.String())
...
```

This also prints ignored URLs when there is a long chain of redirects whose last call is made by an ignored URL.
Imagine something like this:

  • Link1 redirects to Link2
  • Link2 redirects to Link3
  • ...
  • LinkToIgnore redirects to LinkN

This behavior makes cariddi print LinkToIgnore.
I don't have a solution for this yet; if you want to propose something, I'm all ears.
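One possible direction (just a sketch with a hypothetical `shouldIgnore` helper, not cariddi's actual code): re-check each response's final URL against the ignore terms right before printing, so a redirect chain that ends on an ignored host is filtered out. In colly terms this check would sit inside the `OnResponse` callback; the standalone sketch below only shows the matching logic:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// shouldIgnore reports whether rawURL matches any of the ignore terms.
// Hypothetical helper: it checks both the host and the path, so that
// "docs.mapbox.com" is caught by the term "docs" even after redirects.
func shouldIgnore(rawURL string, ignores []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	for _, term := range ignores {
		if strings.Contains(u.Host, term) || strings.Contains(u.Path, term) {
			return true
		}
	}
	return false
}

func main() {
	ignores := []string{"docs"}
	for _, link := range []string{
		"https://mapbox.com/",
		"https://docs.mapbox.com/api/",
	} {
		// Re-check the final URL before printing, so redirect chains
		// that land on an ignored host are suppressed.
		if shouldIgnore(link, ignores) {
			continue
		}
		fmt.Println(link)
	}
}
```

With these inputs only `https://mapbox.com/` would be printed; the docs subdomain is filtered even though the crawler already made the request.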

For now you can rely on the output files; I'm quite sure they don't include to-be-ignored URLs.

@edoardottt edoardottt added bug Something isn't working help wanted Extra attention is needed good first issue Good for newcomers Go labels Nov 14, 2022
@cyb3rjerry
Contributor Author

Sounds good! I'll tinker with this and get back to you at some point :)

Thanks again for the great tool btw!

@edoardottt
Owner

Closed by #82
