digitalmethodsinitiative/trawler

 
 

Trawler

A job scheduler and analysis tool for web-scraping (and other) tasks.

Datasources

Currently, the following datasources are implemented:

  • tiktok: get video metadata per hashtag, download the videos and analyse their text using easyOCR

  • gab (nazi-twitter): crawl the posts of a user

  • onionlist: download the Tor catalogue from onionlist.org

  • google dorking: find interesting files and download them

  • facebook posts and reactions: scrape Facebook posts, comments and reactions (like, heart, etc.)
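
As an illustration of the google dorking datasource, a dork is an ordinary search query that combines operators such as site:, filetype: and inurl: to narrow results down to specific files. The sketch below only shows how such a query string could be assembled. The function name buildDorkQuery and its options are assumptions made for this example and are not taken from Trawler's code.

```javascript
// Hypothetical helper, not part of Trawler: builds a search "dork"
// from a few widely used search operators.
function buildDorkQuery({ site, filetype, inurl, terms = [] }) {
  const parts = [];
  if (site) parts.push(`site:${site}`);
  if (filetype) parts.push(`filetype:${filetype}`);
  if (inurl) parts.push(`inurl:${inurl}`);
  // Quote multi-word terms so they are matched as a phrase.
  for (const term of terms) {
    parts.push(/\s/.test(term) ? `"${term}"` : term);
  }
  return parts.join(' ');
}

const query = buildDorkQuery({
  site: 'example.org',
  filetype: 'pdf',
  terms: ['annual report'],
});
console.log(query); // site:example.org filetype:pdf "annual report"
```

How the resulting query is submitted and how the matching files are downloaded is left out here, since that depends on the datasource implementation.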

Features

  • simple configuration of actions/datasources, including those from 3rd-party modules/repos
  • job scheduling and monitoring
  • SQLite, CSV and JSON browser
  • separation of datasets/artifacts (one archive per crawl)
  • scalable number of workers (also on other machines)
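
To make the "one archive per crawl" idea concrete, here is a minimal sketch of how each crawl could get its own directory. The layout (root/datasource/timestamp) and the function names crawlArchiveDir and createCrawlArchive are assumptions for illustration only, and Trawler's actual layout may differ.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical layout: <root>/<datasource>/<timestamp>/
// This is an assumption for illustration, not Trawler's actual scheme.
function crawlArchiveDir(root, datasource, startedAt = new Date()) {
  // Colons are not allowed in Windows file names, so replace them (and dots).
  const stamp = startedAt.toISOString().replace(/[:.]/g, '-');
  return path.join(root, datasource, stamp);
}

function createCrawlArchive(root, datasource, startedAt) {
  const dir = crawlArchiveDir(root, datasource, startedAt);
  fs.mkdirSync(dir, { recursive: true });
  return dir;
}

// Demo: create one archive in a temporary directory and store a result in it.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'trawler-demo-'));
const dir = createCrawlArchive(root, 'onionlist', new Date('2024-01-02T03:04:05.000Z'));
fs.writeFileSync(path.join(dir, 'result.json'), JSON.stringify({ items: [] }));
console.log(path.relative(root, dir));
```

Because every crawl writes into a fresh directory, the results of two runs of the same datasource never overwrite each other.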

Architecture

Frontend and API

  • GUI to create and schedule jobs
  • Displays pending, running and done jobs
  • Displays CSV and SQLite datasets

Worker(s)

  • Can be distributed (workers and C&C, i.e. command and control, in different locations/on different servers)
  • Jobs are managed through JSON files (and can be distributed with an adapter like PouchDB)
  • Multithreaded
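
The JSON-file approach to job management can be sketched roughly as follows. The job schema (id, datasource, params, status) and the functions writeJob, readJobs and claimNextJob are hypothetical and are not taken from Trawler's source. Only the pending/running/done states mirror the job states the frontend displays.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical job schema: { id, datasource, params, status }.
// status is one of 'pending', 'running' or 'done'.
function writeJob(dir, job) {
  fs.writeFileSync(path.join(dir, `${job.id}.json`), JSON.stringify(job, null, 2));
}

function readJobs(dir) {
  return fs.readdirSync(dir)
    .filter((name) => name.endsWith('.json'))
    .sort() // process job files in a stable, name-based order
    .map((name) => JSON.parse(fs.readFileSync(path.join(dir, name), 'utf8')));
}

// Pick the first pending job and mark it as running.
// Note: this simple version is not safe if two workers race for the same file.
function claimNextJob(dir) {
  const job = readJobs(dir).find((j) => j.status === 'pending');
  if (!job) return null;
  const claimed = { ...job, status: 'running' };
  writeJob(dir, claimed);
  return claimed;
}

// Demo: queue two jobs in a temporary directory and let a worker claim one.
const jobsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'trawler-jobs-'));
writeJob(jobsDir, { id: 'job-001', datasource: 'gab', params: { user: 'example' }, status: 'pending' });
writeJob(jobsDir, { id: 'job-002', datasource: 'tiktok', params: { hashtag: 'example' }, status: 'pending' });

const job = claimNextJob(jobsDir);
console.log(job.id, job.status); // job-001 running
```

Keeping each job in its own file means the job directory can be synchronised between machines, which is where an adapter such as PouchDB would come in. A real multi-worker setup would also need some form of locking or conflict handling, which this sketch omits.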

Install

Using Docker-Compose

Install using docker-compose by running:

docker-compose up
