Crawler implementation #8

Merged
merged 3 commits into from
Sep 8, 2023

@brahle (Owner) commented Sep 7, 2023

Summary 📝

Implementation of a YAML-configured web crawler.

Details

The main changes are:

  1. Configuration lives in a folder that may contain subfolders and multiple Markdown (.md) files.
  2. YAML content is extracted from those .md files and combined to define what needs to be crawled.
  3. RootCrawler reads the configuration and may request multiple URLs based on it.
  4. A sqlite3 database records what has already been crawled so that it is not crawled again.
  5. BrowserEmulator makes it easier to mimic a browser.
  6. DownloadRateLimiter makes it easier to avoid overwhelming a server.
  7. The variable context for each URL is stored in a database.
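
To illustrate how items 4, 6, and 7 might fit together, here is a minimal Python sketch. It is not the code from this pull request: the names SeenUrlStore, MinIntervalLimiter, and crawl, the table layout, and the JSON encoding of the context are all assumptions made for the example, and the real DownloadRateLimiter may work differently.

```python
import json
import sqlite3
import time


class SeenUrlStore:
    """Hypothetical helper: remembers crawled URLs and their context in SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # Assumed schema: one row per URL, with the context stored as JSON text.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS crawled (url TEXT PRIMARY KEY, context TEXT)"
        )

    def mark_if_new(self, url, context=None):
        """Record the URL; return True only the first time it is seen."""
        with self.conn:  # commits on success, rolls back on error
            cur = self.conn.execute(
                "INSERT OR IGNORE INTO crawled (url, context) VALUES (?, ?)",
                (url, json.dumps(context or {})),
            )
        # rowcount is 0 when the PRIMARY KEY conflict caused the insert to be ignored.
        return cur.rowcount == 1


class MinIntervalLimiter:
    """Hypothetical helper: enforces a minimum delay between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def crawl(urls, fetch, store, limiter):
    """Fetch each URL at most once, pausing between requests."""
    fetched = []
    for url in urls:
        # Simplification: the URL is marked before fetching, so a failed
        # fetch is not retried in this sketch.
        if not store.mark_if_new(url, {"source": "example"}):
            continue
        limiter.wait()
        fetch(url)
        fetched.append(url)
    return fetched
```

Calling crawl(urls, fetch, SeenUrlStore("crawl.db"), MinIntervalLimiter(1.0)) with any fetch callable would skip URLs recorded by an earlier run, since the table persists on disk. The file name crawl.db is also just a placeholder.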

Checks

  • Stakeholder Approval

@brahle brahle merged commit 7fd6d20 into main Sep 8, 2023
4 checks passed