Scalable real-time Reddit crawler built with Catenae + Kafka (pull feed)

brunneis/reddit-crawler

 
 


reddit-crawler

This crawler collects every new submission and comment posted on Reddit in real time. Its topology is built from three Catenae modules, and the extracted texts are published to the Kafka topic new_texts, from which consumers can pull them.
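As a rough illustration of pulling texts from that topic, the sketch below reads new_texts with the third-party kafka-python client. Several details are assumptions rather than facts from this repository: the broker address localhost:9092, the choice of kafka-python as the client, and the UTF-8 payload encoding (the message serialization is not documented here, so the decoder is best-effort). The names decode_payload and stream_texts are hypothetical helpers, not part of the crawler.

```python
from typing import Iterator

TOPIC = "new_texts"  # topic name given in this README
BOOTSTRAP_SERVERS = "localhost:9092"  # assumption: point this at your broker


def decode_payload(raw: bytes) -> str:
    """Best-effort decoding; assumes UTF-8 text payloads (not confirmed here)."""
    return raw.decode("utf-8", errors="replace")


def stream_texts(
    bootstrap_servers: str = BOOTSTRAP_SERVERS, topic: str = TOPIC
) -> Iterator[str]:
    """Yield decoded messages from the topic as they arrive."""
    # Imported lazily so decode_payload works without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="latest",  # only read texts published from now on
    )
    try:
        for record in consumer:
            if record.value is not None:  # skip empty (tombstone) records
                yield decode_payload(record.value)
    finally:
        consumer.close()
```

With a broker running and kafka-python installed, iterating over stream_texts() and printing each item would show texts as they are published. If the payloads turn out not to be plain UTF-8, only decode_payload needs to change.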

Requirements

  • docker
  • docker-compose

Standalone mode

To launch the crawler in standalone mode with its own Kafka broker, run the launch.sh script.
