Crawler implementation #8

brahle · 2023-09-07T20:59:45Z

Implementation of a YAML-configured web crawler.

Changes in more detail:

Configuration lives in a folder that can contain subfolders and multiple Markdown (.md) files.
YAML content is extracted from those .md files and combined to define what needs to be crawled.
RootCrawler reads the configuration and may hit multiple URLs depending on what it specifies.
A sqlite3 DB records what has already been crawled so that nothing gets crawled twice.
BrowserEmulator makes it easier to mimic a browser.
DownloadRateLimiter makes it easier to avoid overwhelming a server.
The variable context for each URL is stored in a DB.
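
To illustrate the sqlite3 de-duplication, the stored variable context, and the rate limiting described above, here is a minimal Python sketch. It is not the code from this PR: `SeenUrlStore`, `MinIntervalLimiter`, the `crawled` table layout, and the JSON encoding of the context are all hypothetical names and choices made for the example, and the real `DownloadRateLimiter` may work differently.

```python
import json
import sqlite3
import time


class SeenUrlStore:
    """Hypothetical store: remembers crawled URLs and their variable context."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # The PRIMARY KEY on url is what prevents duplicate rows.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS crawled (url TEXT PRIMARY KEY, context TEXT)"
        )

    def mark_crawled(self, url, context=None):
        """Record the URL. Returns True if it was new, False if already stored."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO crawled (url, context) VALUES (?, ?)",
            (url, json.dumps(context or {})),
        )
        self.conn.commit()
        return cur.rowcount == 1

    def get_context(self, url):
        """Return the stored context dict, or None if the URL is unknown."""
        row = self.conn.execute(
            "SELECT context FROM crawled WHERE url = ?", (url,)
        ).fetchone()
        return None if row is None else json.loads(row[0])


class MinIntervalLimiter:
    """Hypothetical limiter: enforces a minimum gap between downloads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None  # time at which the previous download was allowed

    def wait(self):
        """Block until the next download is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

In a crawl loop under these assumptions, one would call `limiter.wait()` before each request and skip any URL for which `store.mark_crawled(url, context)` returns `False`.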

brahle added 3 commits September 7, 2023 22:49

Crawler implementation

d3bcd61

Fixing lint

f0f0652

Storing variables

582d6a1

brahle merged commit 7fd6d20 into main Sep 8, 2023
4 checks passed