Recall

{ 
  "project": "Recall", 
  "frameworks": ["Python", "LangGraph", "Ollama"],
  "manual_data_entry": null 
}

Recall is a highly autonomous, multi-layered data extraction and RAG (Retrieval-Augmented Generation) pipeline. You point it at a URL, and it doesn't just scrape the page—it reads it, cross-references it with live web searches, extracts structured JSON using local LLMs, and stores it in both local and cloud vector databases so you can chat with it later.

And one more thing: it doesn't have any UI, because it doesn't need one.

Explore the MIGRATED docs

Wanna start?

The straight-up way to get this pipeline running on your machine.

1. Create and Activate Virtual Environment

Keeping dependencies isolated is best practice. Run the following commands in your terminal:

For macOS/Linux:

python3.13 -m venv venv
source venv/bin/activate

Note: Once activated, you should see (venv) appear at the start of your terminal prompt.


2. Install Requirements

With the environment active, install all necessary libraries defined in the requirements.txt file:

pip install -r requirements.txt

Or, on a Fedora system:

python3 -m pip install -r requirements.txt

If this is your first time using Playwright, you may also need to install the browser binaries:

playwright install chromium

3. Run the Script

Now that your environment is configured, launch the main script:

python index.py

Output Data

After the script completes, you can find the extracted data (and HTML content, if configured) in the following directory:

  • storage/datasets/default/
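If you want to work with the output programmatically, a small loader does the trick. This is a minimal sketch: `load_dataset` is a hypothetical helper, and it assumes the common one-JSON-file-per-record layout for dataset directories — the actual file layout may differ.

```python
import json
from pathlib import Path

def load_dataset(dataset_dir="storage/datasets/default"):
    """Load every JSON record the crawler wrote to the dataset directory."""
    records = []
    for path in sorted(Path(dataset_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            records.append(json.load(f))
    return records
```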

4. Fire it up

Run the main scraper and enrichment script:

python main.py

Want to migrate your local ChromaDB data to the Zilliz cloud? Just pass the flag:

python main.py --zilliz --remove-unwanted
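For reference, flags like these are typically wired up with `argparse`. This is only a sketch of how the CLI described above might be parsed — the project's actual flag handling may differ.

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI flags mentioned in the README (a sketch, not the real parser)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--zilliz", action="store_true",
                        help="sync local ChromaDB data to the Zilliz cloud cluster")
    parser.add_argument("--remove-unwanted", action="store_true",
                        help="drop unwanted records during migration")
    return parser.parse_args(argv)
```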

[[[ How It Works ]]]

Recall is built like a factory assembly line using asynchronous Python. Here is what happens under the hood:

The Crawler (Producer): Crawl4AI visits the target website, bypasses caching, and rips out the raw Markdown. It splits massive pages into manageable chunks and tosses them into an async queue.
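The producer/queue shape can be sketched in a few lines. This is an illustrative example, not Recall's actual code: the chunk size and the sentinel convention are assumptions, and the consumer here just collects chunks where the real pipeline would enrich them.

```python
import asyncio

CHUNK_SIZE = 2000  # characters per chunk; an assumed value, not the project's setting

def split_markdown(markdown, chunk_size=CHUNK_SIZE):
    """Split a massive Markdown document into manageable fixed-size chunks."""
    return [markdown[i:i + chunk_size] for i in range(0, len(markdown), chunk_size)]

async def producer(markdown, queue):
    """Toss each chunk into the async queue for downstream workers."""
    for chunk in split_markdown(markdown):
        await queue.put(chunk)
    await queue.put(None)  # sentinel: no more work

async def consumer(queue, results):
    """Drain the queue until the sentinel arrives."""
    while (chunk := await queue.get()) is not None:
        results.append(chunk)  # the real pipeline would enrich the chunk here

async def run_pipeline(markdown):
    queue = asyncio.Queue()
    results = []
    await asyncio.gather(producer(markdown, queue), consumer(queue, results))
    return results
```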

The Search Layers: Before trusting the scraped data, the system attempts to ground it. It will try to use Gemini to search the web. If that fails, it cascades down to Serper, Tavily, Brave, and finally DuckDuckGo.
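The cascade pattern itself is simple: try each provider in priority order, treat an exception or an empty result as a miss, and fall through to the next. In this sketch the providers are hypothetical callables standing in for the Gemini, Serper, Tavily, Brave, and DuckDuckGo clients.

```python
def search_with_fallback(query, providers):
    """Try each (name, provider) pair in priority order; return the first success."""
    for name, provider in providers:
        try:
            results = provider(query)
            if results:
                return name, results
        except Exception:
            continue  # provider down or rate-limited: cascade to the next one
    return None, []  # every layer failed
```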

The Brain (Enrichment): The raw data and the web search results are fed into Ollama (Llama 3.1) locally via LangGraph. Ollama synthesizes the messy text into a perfectly structured JSON object (Pros, Cons, Descriptions, Pricing).
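Local models often wrap their JSON in conversational chatter, so a common defensive trick (illustrative here, not necessarily what Recall does) is to locate the first `{` and parse one balanced object out of the response:

```python
import json

def extract_json(llm_text):
    """Pull the first JSON object out of a possibly chatty LLM response."""
    start = llm_text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores trailing text
        obj, _ = json.JSONDecoder().raw_decode(llm_text[start:])
        return obj
    except json.JSONDecodeError:
        return None
```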

The Vault: The cleaned data is instantly embedded into a local ChromaDB database. It can also be synced to a Zilliz (Milvus) cloud cluster.

The Chat: Once the data is stored, you can run the LangGraph RAG terminal chat. You ask a question, it queries ChromaDB, retrieves the exact context, and Ollama talks to you like a friend about the tools it just researched.
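The retrieve-then-answer loop reduces to two steps: rank stored documents against the question, then stuff the winners into the prompt. The sketch below swaps ChromaDB's embedding search for naive word overlap so it stays self-contained — the function names and prompt wording are illustrative, not the project's.

```python
def retrieve(question, documents, k=2):
    """Naive stand-in for a ChromaDB query: rank docs by shared words."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question, context_docs):
    """Assemble the grounded prompt handed to the chat model."""
    context = "\n---\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```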


The Tech Stack

I didn't use off-the-shelf wrappers; I built a custom pipeline. Here are the core tools making this happen:

  • Crawl4AI: For hardcore, headless browser DOM extraction.

  • LangGraph: To build the cyclic graphs for my research agents and chat loops.

  • Ollama (llama3.1:8b): The local heavyweight champion doing all the JSON formatting and conversational generation for free.

  • ChromaDB: My local vector database for fast, offline RAG.

  • Zilliz (PyMilvus): The cloud vector database for production-level storage.

  • LiteLLM / Playwright: Used in the agentic extraction script to let the AI physically click buttons, navigate Cloudflare blocks, and hunt down data autonomously.


State & Duplicate Management

Scraping gets messy fast. To prevent extracting the same data twice or corrupting the database, Recall uses a strict StateManager.

  • Deterministic Hashing: Every extracted entity generates a unique SHA-256 hash based on its name.

  • Smart Upserting: If the crawler finds a tool it already knows about, it doesn't create a duplicate. It merges the new metrics and appends new sentiment analysis to the existing record.

  • Local Checkpointing: It writes processed names to processed_tools.txt so if the script crashes, you don't lose your API credits starting over.
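The three mechanisms above compose naturally into one small class. This is a minimal sketch under stated assumptions: the record fields (`metrics`, `sentiment`) and the name normalization are my inventions for illustration; only the SHA-256 hashing and the processed_tools.txt checkpoint file come from the description above.

```python
import hashlib

class StateManager:
    """Minimal sketch of dedup + checkpointing (field names are assumptions)."""

    def __init__(self, checkpoint_path="processed_tools.txt"):
        self.checkpoint_path = checkpoint_path
        self.records = {}

    @staticmethod
    def entity_hash(name):
        # Deterministic SHA-256 of the normalized tool name.
        return hashlib.sha256(name.strip().lower().encode("utf-8")).hexdigest()

    def upsert(self, name, metrics, sentiment):
        key = self.entity_hash(name)
        if key in self.records:
            record = self.records[key]
            record["metrics"].update(metrics)      # merge the new metrics
            record["sentiment"].append(sentiment)  # append new sentiment analysis
        else:
            self.records[key] = {"name": name,
                                 "metrics": dict(metrics),
                                 "sentiment": [sentiment]}
            # Checkpoint so a crash doesn't mean starting over.
            with open(self.checkpoint_path, "a", encoding="utf-8") as f:
                f.write(name + "\n")
        return self.records[key]
```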


[[[ Updates needed ]]]

The pipeline is currently a highly functional MVP, but here is what is on the roadmap for future updates:

  • Full Agentic UI Automation: Enhancing the human_click logic to let the AI solve CAPTCHAs and navigate deep paginated directories completely on its own.

  • Front-End Integration: Hooking up this massive backend vector database to a React Native / Next.js frontend for easy searching.

  • Dynamic Model Switching: Automatically swapping between smaller local models for simple data extraction and heavy cloud models (like Claude 3.5 or GPT-4o) for complex reasoning tasks depending on the API keys available.


About

An end-to-end agentic pipeline that crawls the web for AI tools/market gaps, enriches data via LangGraph, and serves it through a 100% offline React Native interface using MediaPipe
