WebCrawl

A web scraper built using the Scrapy framework in Python.

  • The items file contains the containers that will be loaded with scraped data. They behave like regular Python dicts; modify them according to the data you need to scrape from a web page.
  • Spiders are the classes used to scrape information from a domain (or a group of domains). They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items. A minimal sketch of both pieces follows this list.
    Note: Remember to set DOWNLOAD_DELAY >= 2 seconds, or your IP could be blacklisted.
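
As a concrete illustration of the two pieces above, here is a minimal sketch of an item and a spider. The field names, spider name, start URL, and CSS selectors are hypothetical placeholders, not code from this repository:

```python
import scrapy

# items.py -- the container that gets loaded with scraped data
# (behaves like a regular Python dict).
class BlogPostItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

# spiders/blog_spider.py -- defines the start URLs, how to follow
# links, and how to parse pages into items.
class BlogSpider(scrapy.Spider):
    name = "blog"
    start_urls = ["https://example.com/"]

    # Keep DOWNLOAD_DELAY >= 2 so the target site does not blacklist your IP.
    custom_settings = {"DOWNLOAD_DELAY": 2}

    def parse(self, response):
        # Extract one item per post on the page.
        for post in response.css("article"):
            yield BlogPostItem(
                title=post.css("h2::text").get(),
                url=post.css("a::attr(href)").get(),
            )
        # Follow the pagination link and parse it the same way.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```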

How To Run: Go to the project's top-level directory and type in a terminal: scrapy crawl <spidername>
Replace <spidername> with the spider's name attribute (note that scrapy crawl takes the spider name, not the project name).
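
If you would rather launch the crawl from Python instead of the command line, Scrapy's CrawlerProcess does the equivalent. This sketch assumes it is run from the project directory and that a spider named "blog" (as in the example above) exists:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Equivalent to running `scrapy crawl blog` from the project's
# top-level directory; "blog" is the hypothetical spider name above.
process = CrawlerProcess(get_project_settings())
process.crawl("blog")
process.start()  # blocks until the crawl finishes
```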

Dependencies: Before running this project, or any Scrapy project, you must have Scrapy and its dependencies installed (pip install scrapy).

To Do: Store the scraped data in MongoDB and use Stack_spider for the nested crawling. A sketch of such a pipeline is below.
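
The reference below walks through exactly this MongoDB step. As a hedged outline of what the planned pipeline could look like (not code from this repository; the connection URI, database, collection, and module path are placeholder assumptions, and pymongo must be installed):

```python
import pymongo

class MongoDBPipeline:
    """Writes each scraped item to a MongoDB collection.

    Enable it in settings.py, e.g.:
        ITEM_PIPELINES = {"webcrawl.pipelines.MongoDBPipeline": 300}
    (the module path is a placeholder for wherever this class lives).
    """

    def __init__(self):
        # Placeholder connection details; adjust for your MongoDB instance.
        self.mongo_uri = "mongodb://localhost:27017"
        self.db_name = "webcrawl"

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store one MongoDB document per scraped item.
        self.db["items"].insert_one(dict(item))
        return item
```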

Reference: https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/

About

My experiment with basic web crawling, first tested on my own blog.
