
# Harvest 🌾

A full-text search engine built from scratch in Rust. Crawls the web, builds an inverted index with positional data, and supports phrase queries.

*Demo video: harvest-full-demo_5Yvqjugl.mp4*

## What This Project Demonstrates

- **Web Crawling**: Concurrent BFS crawler with configurable depth, rate limiting via semaphores, and URL deduplication
- **Text Analysis Pipeline**: Modular design with character filters, tokenizers, and token filters (lowercase, stop words, Porter stemming)
- **SPIMI Indexing**: Single-Pass In-Memory Indexing with memory budgets, disk block persistence, and k-way merge
- **Positional Index**: Stores term positions per document for phrase/proximity query support
- **Phrase Queries**: Positional intersection algorithm to match exact phrases across documents (see the sketch after this list)
- **Incremental Indexing**: Re-running the indexer only processes new pages and appends to existing term buckets
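
The phrase-query step is the most interesting of these: after intersecting the candidate documents, the positions of consecutive terms must line up exactly. Below is a minimal in-memory sketch of that positional intersection. The type aliases and function names are illustrative only, not Harvest's actual API (the real postings live on disk and in MongoDB, not in a `HashMap`):

```rust
use std::collections::HashMap;

/// Illustrative postings layout: term -> (doc_id -> sorted term positions).
type Postings = HashMap<String, HashMap<u64, Vec<u32>>>;

/// Returns the IDs of documents containing `terms` as an exact phrase.
/// A start position p of the first term is a match when every following
/// term i occurs at position p + i in the same document.
fn phrase_match(postings: &Postings, terms: &[&str]) -> Vec<u64> {
    let (first, rest) = match terms.split_first() {
        Some(split) => split,
        None => return Vec::new(),
    };
    let first_docs = match postings.get(*first) {
        Some(docs) => docs,
        None => return Vec::new(),
    };
    let mut hits = Vec::new();
    'docs: for (&doc, starts) in first_docs {
        // Every remaining term must also occur in this document...
        let mut rest_positions = Vec::with_capacity(rest.len());
        for term in rest {
            match postings.get(*term).and_then(|d| d.get(&doc)) {
                Some(pos) => rest_positions.push(pos),
                None => continue 'docs,
            }
        }
        // ...and at least one start position must align for all terms.
        let aligned = starts.iter().any(|&p| {
            rest_positions
                .iter()
                .enumerate()
                .all(|(i, pos)| pos.binary_search(&(p + 1 + i as u32)).is_ok())
        });
        if aligned {
            hits.push(doc);
        }
    }
    hits.sort_unstable();
    hits
}

fn main() {
    let mut postings: Postings = HashMap::new();
    // Doc 1: "search engine ..."; doc 2: "engine search ...".
    postings.insert("search".into(), HashMap::from([(1, vec![0]), (2, vec![1])]));
    postings.insert("engine".into(), HashMap::from([(1, vec![1]), (2, vec![0])]));
    assert_eq!(phrase_match(&postings, &["search", "engine"]), vec![1]);
    println!("phrase found in doc 1 only");
}
```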

## Tech Stack

| Component | Technology |
| --- | --- |
| Language | Rust (2024 edition) |
| Async Runtime | Tokio |
| Database | MongoDB |
| Web Framework | Axum |
| HTML Parsing | scraper, html5ever |

## Quick Start

### Prerequisites

- Rust (latest stable)
- MongoDB running locally or accessible via `MONGO_URI`

### Setup

```bash
# Clone and build
git clone https://github.com/yourusername/harvest.git
cd harvest
cargo build --release

# Configure (optional)
cp .env.example .env
# Edit .env with your MongoDB URI if needed
```
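
For reference, a minimal `.env` could look like the following. `MONGO_URI` is the only variable this README mentions; the connection string and database name here are placeholders, so treat `.env.example` as authoritative:

```bash
# Placeholder values — copy from .env.example and adjust.
MONGO_URI=mongodb://localhost:27017/harvest
```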

### Usage

#### 1. Crawl websites

```bash
# Crawl Wikipedia starting from the search engine article, depth 2.
# Keep --max-depth small: BFS fans out quickly, so even a depth of 10
# sends the crawler very deep into the web and takes a long time.
# You can always stop the crawler early, though.
cargo run --release -- crawl \
  --url "https://en.wikipedia.org/wiki/Search_engine" \
  --max-depth 2 \
  --max-concurrent-fetches 100
```
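
As noted in the feature list, the crawler caps concurrency with a semaphore rather than spawning unbounded requests. A minimal sketch of that pattern with Tokio follows; `fetch` is a stand-in for the real fetch-and-extract step, not one of Harvest's actual functions:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Stand-in for the real HTTP request + link extraction.
async fn fetch(url: String) {
    println!("fetching {url}");
}

#[tokio::main]
async fn main() {
    // Mirrors --max-concurrent-fetches: only this many permits exist.
    let permits = Arc::new(Semaphore::new(3));
    let mut tasks = Vec::new();
    for i in 0..10 {
        let permits = Arc::clone(&permits);
        tasks.push(tokio::spawn(async move {
            // Tasks beyond the limit park here until a permit frees up.
            let _permit = permits.acquire().await.expect("semaphore closed");
            fetch(format!("https://example.com/page/{i}")).await;
        }));
    }
    for task in tasks {
        task.await.expect("task panicked");
    }
}
```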

#### 2. Build the index

```bash
# Index all crawled pages (100MB memory budget for SPIMI blocks)
cargo run --release -- index \
  --page-fetch-limit 10000 \
  --budget-bytes 100000000
```
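
The `--budget-bytes` flag controls when the in-memory SPIMI dictionary is flushed to a sorted on-disk block; the blocks are later combined with a k-way merge. A toy sketch of that loop, with illustrative names and a rough byte estimate rather than Harvest's real accounting:

```rust
use std::collections::BTreeMap;

/// Toy SPIMI indexer: postings accumulate in memory and are flushed
/// as one sorted block whenever the byte budget is exceeded.
struct SpimiIndexer {
    budget_bytes: usize,
    used_bytes: usize,
    dictionary: BTreeMap<String, Vec<(u64, u32)>>, // term -> (doc_id, position)
    blocks_written: usize,
}

impl SpimiIndexer {
    fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, used_bytes: 0, dictionary: BTreeMap::new(), blocks_written: 0 }
    }

    fn add(&mut self, term: &str, doc_id: u64, position: u32) {
        // Rough size estimate: term bytes plus one posting entry.
        self.used_bytes += term.len() + std::mem::size_of::<(u64, u32)>();
        self.dictionary.entry(term.to_owned()).or_default().push((doc_id, position));
        if self.used_bytes >= self.budget_bytes {
            self.flush_block();
        }
    }

    /// Persist the in-memory dictionary as one block, then reset.
    fn flush_block(&mut self) {
        if self.dictionary.is_empty() {
            return;
        }
        // A real implementation serializes the dictionary here: BTreeMap
        // iteration is already term-sorted, so every block file is sorted
        // and the final k-way merge is a simple streaming pass.
        self.blocks_written += 1;
        self.dictionary.clear();
        self.used_bytes = 0;
    }
}

fn main() {
    let mut idx = SpimiIndexer::new(64); // tiny budget to force flushes
    for (i, term) in ["harvest", "rust", "index", "search"].iter().enumerate() {
        idx.add(term, 1, i as u32);
    }
    idx.flush_block(); // final partial block
    println!("wrote {} blocks", idx.blocks_written);
}
```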

#### 3. Start the search server

```bash
cargo run --release -- serve --port 3000
```

Then open http://localhost:3000 in your browser.
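
You can also query the search API directly. The endpoint path and parameter name below are assumptions for illustration only; check the Axum routes behind the `serve` command for the real ones:

```bash
# Hypothetical route and query parameter — verify against the serve routes.
curl "http://localhost:3000/search?q=%22search+engine%22"
```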

## CLI Reference

```text
harvest <command> [options]

Commands:
  crawl   Crawl websites starting from a seed URL
  index   Build inverted index from crawled pages
  serve   Start the web server with search API and UI

crawl:
  -u, --url <URL>                    Seed URL to start crawling
  -d, --max-depth <N>                Maximum crawl depth [default: 0]
  -f, --max-concurrent-fetches <N>   Concurrent HTTP requests [default: 150]
  -s, --frontier-size <N>            URL frontier queue size [default: 500]

index:
  -p, --page-fetch-limit <N>         Pages per batch [default: 10000]
  -b, --budget-bytes <N>             Memory budget before flush [default: 100MB]

serve:
  -p, --port <N>                     Server port [default: 3000]
  -H, --host <ADDR>                  Bind address [default: 127.0.0.1]
```

## Architecture

See ARCHITECTURE.md for system design and component details.

## Running Tests

```bash
# Unit tests
cargo test

# Integration tests (requires MongoDB)
cargo test --test query_engine_tests
cargo test --test indexer_tests
```

## TODOs

- IP rotation service integration so the crawler doesn't get blacklisted by websites
- Spell correction ("Did you mean x?")
- Re-indexing of pages that are re-crawled after they were first indexed (currently only incremental indexing of new pages is supported)

## License

MIT
