This is an extensible, easy-to-use crawler that can be fully customized to your needs.
- Supports asynchronous requests
- Supports custom crawling strategies via custom spiders
- Lets you specify which HTML tags to scrape
- Gives access to full logs and detailed execution timing
- Minimizes RAM usage via BeautifulSoup's `decompose()` and Python's garbage collection
Install the package using pip:
```bash
pip install pysimplecrawler
```
Or download the source code using this command:
```bash
git clone https://github.com/benyaminkosari/pysimplecrawler.git
```
- Import the crawler:
```python
from pysimplecrawler import crawler
```
- Define the main URL to crawl:
```python
url = "https://example.com/"
```
- Add a custom config if needed:
```python
config = {
    "async": True,
    "workers": 5,
    "headers": {"Cookie": "my_cookie"}
}
```
- Execute the process:
```python
spider = crawler.Crawl(url, depth=3, config=config)
spider.start()
```
- Access the result:
```python
print(spider.data)
```
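Putting the steps together, a complete run looks like this (the exact structure of `spider.data` depends on which tags are scraped):
```python
from pysimplecrawler import crawler

# Crawl up to 3 levels deep with 5 concurrent requests
config = {
    "async": True,
    "workers": 5,
}

spider = crawler.Crawl("https://example.com/", depth=3, config=config)
spider.start()

# Scraped results are available on the instance once crawling finishes
print(spider.data)
```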
- url : str (required)
  Target address from which to start crawling.
- depth : int (optional), default=3
  Maximum level the crawler will go deep through the target's URLs.
- priority : str (optional), default='vertical'
  Strategy the crawler uses; it is the spider class's name, all lowercase.
  Note that prefixing the class name with 'Async' is handled automatically from the config.
- config : dict (optional)
  Changes defaults and adds extra settings:
  - tags : list = ["img", "h1", "meta", "title", ...]
    HTML tags to be scraped while crawling.
  - async : bool, default=False
    Whether requests must be sent concurrently.
  - workers : int, default=10
    Number of requests to be sent simultaneously.
  - logs : bool, default=True
    Whether logs must be shown.
  - headers : dict
    Headers of the requests. Note that some default headers are already set; any you provide will override them.
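For example, to scrape only a few tags with logging turned off (the tag list and header value here are illustrative):
```python
from pysimplecrawler import crawler

config = {
    "tags": ["title", "h1", "img"],         # scrape only these tags
    "async": True,                          # send requests concurrently
    "workers": 10,                          # up to 10 simultaneous requests
    "logs": False,                          # silence log output
    "headers": {"User-Agent": "my-agent"},  # overrides the matching default header
}

spider = crawler.Crawl("https://example.com/", depth=2, priority="vertical", config=config)
spider.start()
```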
The `Crawl` class of `pysimplecrawler.crawler` is connected to the spider classes of `pysimplecrawler.spiders` as a Factory to its Products. Building on `AbsSpider` as the abstract class and `SpiderBase` as a helper class, you can add your own custom strategies to the `/spiders` directory, as sketched below.
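A rough sketch of what such a spider could look like; the import paths and the `crawl()` hook below are assumptions for illustration, since the actual interface is defined by `AbsSpider` and `SpiderBase` in the package source:
```python
# Illustrative sketch only: the import paths and the crawl() hook are
# assumptions; check the /spiders directory for the actual interface.
from pysimplecrawler.spiders.abs import AbsSpider    # assumed path
from pysimplecrawler.spiders.base import SpiderBase  # assumed path


class Random(SpiderBase, AbsSpider):
    """Hypothetical strategy that visits queued URLs in random order."""

    def crawl(self):
        # The lowercase class name ("random") is what you would pass as
        # Crawl(..., priority="random"); an AsyncRandom variant would be
        # picked automatically when config["async"] is True.
        ...
```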