Scraping-with-scrapy-in-python

A beginner-level ("baby") problem in scraping with Python

For an overview of Scrapy, see the official documentation at http://doc.scrapy.org/en/latest/intro/tutorial.html

I have tried to keep "baby_scraper.py" as basic and bare-bones as possible (I have not even used lxml). I will add more files to this repo on pagination and intelligent crawling, mainly the bits on recursive scraping, which Scrapy calls a CrawlSpider. With a well-designed CrawlSpider, you can navigate the pages of a website and extract only the elements of interest. A well-designed CrawlSpider will also keep track of what has already been crawled, so that the crawler does not get stuck in an infinite loop. Note also that crawling with appropriate rules can become a very memory-intensive activity; no wonder there are ways to run bots on your personal machine as a proxy for someone else's use case.
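To make the "keep track of what is already crawled" idea concrete, here is a minimal sketch in plain Python (no Scrapy involved). The `SITE` dictionary and the `crawl` function are hypothetical stand-ins for a real website and a real crawler; they are not part of this repo. Scrapy's scheduler filters duplicate requests by default, so in a real CrawlSpider you would normally not write this yourself.

```python
from collections import deque

# Hypothetical in-memory "website": each page maps to the links found on it.
# Note the cycles ("/" <-> "/page1"), which would trap a naive crawler.
SITE = {
    "/": ["/page1", "/page2"],
    "/page1": ["/", "/page2"],
    "/page2": ["/page1", "/page3"],
    "/page3": [],
}


def crawl(start, get_links):
    """Breadth-first crawl that remembers every page it has already queued."""
    seen = {start}          # the "memory" of what is already crawled or queued
    queue = deque([start])  # pages still waiting to be visited
    order = []              # pages in the order they were visited
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # skip anything we have met before
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/", lambda url: SITE.get(url, []))
print(visited)
```

Each page is visited exactly once even though the pages link back to each other. The `seen` set is also where the memory cost comes from: it grows with every new URL the crawler discovers.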

For now, all you need to understand this is three broad things:

What a scraper really does, plus an overview of the official tutorial
Building XPaths and trying them out with the XPath Checker plugin in Mozilla Firefox
Writing your spider to extract the items of interest from the webpage

For suggestions, comments, or collaboration, reach out to me at ekta1007@gmail.com
