
Not crawling all links? #7

Closed
ghost opened this issue Feb 28, 2016 · 11 comments

Comments

@ghost

ghost commented Feb 28, 2016

I pulled the source code down from this repo and decided to test it on Twitter. I edited main.py to suit my needs (just the project name and URL), but it only crawls a single URL. If you go to twitter.com without logging in, you can see that there are pages under the root domain. Why isn't this working?

@lwgray
Contributor

lwgray commented Feb 29, 2016

The purpose of this tool is only to gather links.

@ghost
Author

ghost commented Feb 29, 2016

I understand, but when I crawl thenewboston.com it gathers ~20-30 links. You can confirm in Chrome DevTools that there are links under that domain. The crawler works as expected there: it gathers all the links and puts them in the crawled.txt file. However, when I point it at twitter.com it doesn't get ANY links, just http://www.twitter.com itself; there are no other links in the crawled.txt file.

@lwgray
Contributor

lwgray commented Feb 29, 2016

Maybe we can solve this together... I will take a look at it and respond if I find something. buckyroberts might have a better idea of how to address it.

@lwgray
Contributor

lwgray commented Feb 29, 2016

I submitted a pull request and was able to get it to work... Let's just hope buckyroberts accepts it.

@ghost
Author

ghost commented Feb 29, 2016

That fixes the issue for Twitter, but when I try a site like https://github.com or https://youtube.com, it has the same problem. I removed the content checker altogether and that seemed to fix it, but it could cause issues for other sites down the road.
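For reference, the content checker being discussed is presumably something along these lines (a minimal sketch using urllib; the actual names and code in spider.py may differ). A strict equality test rejects headers like `text/html; charset=utf-8`, which many sites send, so the link finder never gets any HTML to parse:

```python
from urllib.request import urlopen

def fetch_html(page_url):
    # Sketch of the "content checker" under discussion; the function name is illustrative.
    response = urlopen(page_url)
    content_type = response.getheader('Content-Type') or ''
    # A strict check such as `content_type == 'text/html'` fails whenever the
    # server appends a charset (e.g. 'text/html; charset=utf-8'), so no links
    # get gathered. A substring or prefix check is more forgiving:
    if 'text/html' in content_type:
        return response.read().decode('utf-8', errors='replace')
    return ''
```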

@lwgray
Contributor

lwgray commented Feb 29, 2016

I doubt there are that many variations, so maybe we could add them to a list and loop through them.

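If the variations in question are Content-Type values, the list-and-loop idea might look something like this (the accepted types listed here are only examples):

```python
# Hypothetical whitelist of acceptable Content-Type values to loop through.
ACCEPTED_TYPES = ['text/html', 'application/xhtml+xml']

def is_acceptable(content_type):
    # Prefix match so 'text/html; charset=utf-8' still passes.
    return any(content_type.startswith(accepted) for accepted in ACCEPTED_TYPES)
```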

@buckyroberts
Owner

Agreed, a lot of sites use different techniques for determining what is a "bot", what is allowed to make a request to their server, how often requests are allowed to be made, etc...

So I am sure we will often come across sites that pose different types of problems, but I like that idea, lwgray. Before crawling, we can loop through spiders until we find one that works with that specific site. Once we find a variation that is compatible, we will use that.

What we could also do is develop a generic Spider class (like the original one). Then, any time a specific site had a problem (like Twitter), we could just inherit from the Spider class and override whatever methods we need to make it compatible with that site. That way we won't clutter up spider.py with a bunch of code that attempts to fix all the issues for every site.
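A rough sketch of that layout (class and method names here are illustrative, not necessarily what spider.py uses):

```python
class Spider:
    """Generic spider: default behaviour that works for most sites."""

    def __init__(self, base_url):
        self.base_url = base_url

    def accepts(self, content_type):
        # Default rule: only parse responses served as HTML.
        return content_type.startswith('text/html')

    def gather_links(self, page_url):
        ...  # fetch page_url and pull out every link under self.base_url


class TwitterSpider(Spider):
    """Site-specific spider: override only what Twitter needs."""

    def accepts(self, content_type):
        # Hypothetical tweak, e.g. tolerate extra parameters on the header.
        return 'text/html' in content_type
```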

Also, thanks for the bug fixes, guys!

@lwgray
Contributor

lwgray commented Feb 29, 2016

that does sound better

@ghost
Author

ghost commented Feb 29, 2016

I like the idea of a generic spider class. Maybe a temporary fix could be: instead of the spider checking whether a page's Content-Type is text/html, check that it isn't a PDF, an .exe file, etc., like a blacklist. But then again, that goes back to the problem of not wanting a long list of items...
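The blacklist variant might look something like this (the blocked types are only examples, not an exhaustive list):

```python
# Hypothetical blacklist: skip anything that is clearly not a web page.
BLOCKED_TYPES = ('application/pdf', 'application/octet-stream',
                 'application/x-msdownload', 'image/', 'video/')

def is_blocked(content_type):
    return any(content_type.startswith(blocked) for blocked in BLOCKED_TYPES)
```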

@lwgray
Contributor

lwgray commented Mar 1, 2016

I am not quite sure how to structure the generic spider... If you explain, I don't mind implementing it.

Cheers,
Larry

@buckyroberts
Owner

Basically like it is right now. Then later on, when we find a problem crawling some site, Facebook for example, we will just make a new class called FacebookSpider that inherits from Spider and change whatever functionality we need to in order to make it work with Facebook. That way we can solve specific problems for specific sites without cluttering up the Spider class.
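One way to wire that together (purely illustrative, not how the repo currently does it) is a small mapping from domain to spider class that falls back to the generic Spider; here Spider, TwitterSpider, and FacebookSpider are assumed to be defined as in the sketch above:

```python
from urllib.parse import urlparse

# Hypothetical registry of site-specific spiders; each value is a subclass of
# Spider that overrides whatever that site needs.
SITE_SPIDERS = {
    'twitter.com': TwitterSpider,
    'facebook.com': FacebookSpider,
}

def spider_for(url):
    domain = urlparse(url).netloc
    if domain.startswith('www.'):
        domain = domain[len('www.'):]
    spider_class = SITE_SPIDERS.get(domain, Spider)
    return spider_class(url)
```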
