A robust full-stack application that handles long-running web scraping tasks asynchronously using BullMQ, Puppeteer, and Google Gemini AI to turn any website into a summarized knowledge base.

dev-9820/ScrapeFlow-AI

AI Web Scraper with Queue Management

A full-stack application that scrapes websites, summarizes their content with the Google Gemini API, and runs long-running scraping jobs through an asynchronous queue so that requests do not time out.

Tech Stack

  • Frontend: Next.js 14 (App Router), TanStack Query, Tailwind CSS
  • Backend: Node.js, Express
  • Database: PostgreSQL + Drizzle ORM
  • Queue: BullMQ + Redis
  • AI/Scraping: Google Gemini API + Puppeteer

Key Features

  • Asynchronous Processing: Long-running scraping jobs are offloaded to a Redis queue to prevent request timeouts.
  • Live Polling: Frontend uses TanStack Query to poll the job status repeatedly, so the UI updates as a job moves through the queue.
  • Robust Scraping: Uses Puppeteer (headless Chrome) to render JavaScript-heavy sites before extracting their content.
  • Task History: Persists all jobs and results in PostgreSQL.
  • Dockerized: The entire stack (database, Redis, backend, and frontend) starts with a single docker-compose command.
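
The asynchronous processing and polling features above can be illustrated with a small sketch. It is not the repository's code: `JobStore` is a hypothetical in-memory stand-in for the Redis-backed BullMQ queue and the PostgreSQL job table, the handler is synchronous for brevity, and the status names and the 2000 ms interval are assumptions. It only shows the shape of the flow: the API records a job and returns immediately, a worker processes it later, and the client keeps polling until the job reaches a final state.

```typescript
// Hypothetical job model; these names are illustrative, not taken from the repository.
type JobStatus = "queued" | "active" | "completed" | "failed";

interface Job {
  id: number;
  url: string;
  status: JobStatus;
  result?: string;
  error?: string;
}

// In-memory stand-in for the real queue (BullMQ + Redis) and job table (PostgreSQL).
class JobStore {
  private jobs = new Map<number, Job>();
  private pending: number[] = [];
  private nextId = 1;

  // Called by the API route: records the job and returns at once,
  // so the HTTP request never waits for the scrape itself.
  submit(url: string): Job {
    const job: Job = { id: this.nextId++, url, status: "queued" };
    this.jobs.set(job.id, job);
    this.pending.push(job.id);
    return job;
  }

  // Called by the worker: takes the oldest queued job and runs the handler on it.
  processNext(handler: (url: string) => string): Job | undefined {
    const id = this.pending.shift();
    if (id === undefined) return undefined;
    const job = this.jobs.get(id)!;
    job.status = "active";
    try {
      job.result = handler(job.url);
      job.status = "completed";
    } catch (err) {
      job.error = err instanceof Error ? err.message : String(err);
      job.status = "failed";
    }
    return job;
  }

  // Called by the status endpoint that the frontend polls.
  get(id: number): Job | undefined {
    return this.jobs.get(id);
  }
}

// Polling rule for the client: keep polling while the job is unfinished.
// Returning `false` follows the convention of TanStack Query's `refetchInterval`
// option, which accepts a number of milliseconds or `false` to stop refetching.
function nextPollDelay(status: JobStatus | undefined, intervalMs = 2000): number | false {
  if (status === "completed" || status === "failed") return false;
  return intervalMs;
}
```

In a real frontend, a function like `nextPollDelay` could be passed the latest job status from inside a `refetchInterval` callback, so polling stops on its own once the job completes or fails.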

How to Run

Prerequisites: Docker Desktop installed.

  1. Create a .env file in the server folder:

    DATABASE_URL=postgres://myuser:mypassword@postgres:5432/scraper_db
    REDIS_HOST=redis
    REDIS_PORT=6379
    GEMINI_API_KEY=your_gemini_api_key_here

  2. Run the application:

    docker-compose up --build

  3. Open your browser:
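
As a side note on step 1, a server that depends on these four variables can fail fast when one is missing or malformed. The following is a minimal sketch under that assumption, not the repository's actual startup code; `ServerConfig` and `loadConfig` are hypothetical names introduced for illustration.

```typescript
// Hypothetical config shape; field names are illustrative, not from the repository.
interface ServerConfig {
  databaseUrl: string;
  redisHost: string;
  redisPort: number;
  geminiApiKey: string;
}

type Env = Record<string, string | undefined>;

// Validates the variables listed in the .env file and returns a typed config.
// Throws with a descriptive message instead of letting the server start half-configured.
function loadConfig(env: Env): ServerConfig {
  const required = ["DATABASE_URL", "REDIS_HOST", "REDIS_PORT", "GEMINI_API_KEY"];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }

  // REDIS_PORT arrives as a string and must be a valid TCP port.
  const redisPort = Number(env["REDIS_PORT"]);
  if (!Number.isInteger(redisPort) || redisPort < 1 || redisPort > 65535) {
    throw new Error(`REDIS_PORT must be a valid port number, got "${env["REDIS_PORT"]}"`);
  }

  return {
    databaseUrl: env["DATABASE_URL"] as string,
    redisHost: env["REDIS_HOST"] as string,
    redisPort,
    geminiApiKey: env["GEMINI_API_KEY"] as string,
  };
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, before connecting to PostgreSQL or Redis.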

Project Structure

  • /project: Next.js frontend application.
  • /server: Express API and Worker logic.
  • docker-compose.yml: Orchestration for DB, Redis, Backend, and Frontend.

Bonus implemented: The entire application is containerized with Docker Compose, and database schemas are applied automatically on container startup using Drizzle migrations.
