cnsoft/isbullshit-crawler

This project contains the code for the spider described in my blog post "Crawl a website with Scrapy".

This spider crawls the website http://isbullsh.it and extracts the following information about each blog post:

  • title
  • author
  • tag(s)
  • release date
  • url
  • HTML formatted text
  • location
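
As a rough illustration of the fields listed above, here is a minimal sketch of how one scraped blog post could be represented. The class name `BlogPost`, the field names, and the example URL are assumptions made for this sketch and are not taken from the project's code. It uses a plain standard-library dataclass, which recent Scrapy versions (2.2 and later) accept as an item type.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class BlogPost:
    """One scraped blog post (hypothetical field names)."""

    title: str
    author: str
    url: str
    # A post can carry several tags, so this is a list.
    tags: List[str] = field(default_factory=list)
    # Kept as a string here; the date format on the site is not assumed.
    release_date: Optional[str] = None
    # HTML formatted text of the post body.
    text_html: str = ""
    location: Optional[str] = None


# Example values are placeholders, not real data from the site.
post = BlogPost(
    title="Example title",
    author="Example author",
    url="http://isbullsh.it/example",
)
print(asdict(post))
```

Converting the item with `asdict` gives a plain dictionary, which is convenient when the item is later stored in a document database.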

We implement the spider using Scrapy.

Requirements

  • Scrapy: pip install Scrapy
  • pymongo: pip install pymongo
  • An installed MongoDB server
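
The pymongo and MongoDB requirements suggest that scraped items are stored in a database. The sketch below shows one way a Scrapy item pipeline could do that; it is not the project's actual pipeline. The class name `MongoDBPipeline`, the connection URI, and the database and collection names are assumptions, and keying documents on `url` is a design choice made for this example.

```python
from dataclasses import asdict, is_dataclass


class MongoDBPipeline:
    """Hypothetical item pipeline that upserts each item into MongoDB."""

    def __init__(self, uri="mongodb://localhost:27017",
                 db_name="isbullshit", collection_name="blogposts"):
        # All three defaults are assumptions for this sketch.
        self.uri = uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts: connect once.
        import pymongo  # imported here so the module loads without pymongo
        self.client = pymongo.MongoClient(self.uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes: release the connection.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Accept either a dataclass item or a dict-like item.
        doc = asdict(item) if is_dataclass(item) else dict(item)
        # Upsert on the URL so that re-crawling does not create duplicates.
        self.collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
        return item
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting in the Scrapy project's `settings.py`, with the dotted path to wherever the class lives in the project.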

How do I test it?

Release the spider by entering

scrapy crawl isbullshit
