Google Maps Business Scraper & Dashboard

A production-ready system for scraping business information from Google Maps with a modern web dashboard interface. Deployed on AWS Lambda with MongoDB backend.

🚀 Live Demo

Production URL: generated when you deploy to AWS Lambda using the instructions below.

✨ Features

Web Dashboard

  • 📊 Real-time scraping progress - Live activity tracking during scraping
  • 🔍 Search & filter - Find businesses by name, category, or location
  • 📈 Statistics - Total businesses, average ratings, top categories
  • ✏️ CRUD operations - Edit, delete, and manage scraped data
  • 📥 Export to CSV - Download business data for analysis
  • 🌐 MongoDB integration - Persistent cloud storage

Scraping Engine

  • πŸ—ΊοΈ Google Maps automation - Playwright-based scraping
  • πŸ“± Comprehensive data extraction:
    • Business name, phone, website, address
    • Instagram and WhatsApp (including from Reserve/Order buttons)
    • Ratings and review counts
    • Geographic coordinates
  • 🎯 Smart duplicate detection - Name, phone, and URL matching
  • πŸ’Ύ Immediate database saves - No data loss on errors
  • πŸ›‘οΈ Robust error handling - Graceful recovery and retry logic

AWS Lambda Deployment

  • ⚡ Serverless architecture - No server management
  • 🐳 Dockerized - Consistent environments
  • 🔄 Async invocation - No API Gateway timeouts
  • 📊 CloudWatch logging - Full observability
  • 🌍 API Gateway - RESTful endpoint with /prod stage
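The async invocation pattern works by having the HTTP-facing handler re-invoke the same Lambda with `InvocationType="Event"`, so the long-running scrape escapes the API Gateway timeout. A minimal sketch of building that call (the payload shape and routing key are illustrative, not the repo's exact code):

```python
import json
import os

def build_self_invoke_kwargs(search_query, max_results=20):
    """Build kwargs for boto3's Lambda client invoke() so the function
    re-invokes itself asynchronously to run the scrape in the background.
    Actual invocation would be:
        boto3.client("lambda").invoke(**build_self_invoke_kwargs(...))
    """
    return {
        # AWS sets this env var to the running function's own name.
        "FunctionName": os.environ.get("AWS_LAMBDA_FUNCTION_NAME", "scraper"),
        "InvocationType": "Event",  # fire-and-forget: caller returns immediately
        "Payload": json.dumps({
            "action": "scrape",          # illustrative routing key
            "search_query": search_query,
            "max_results": max_results,
        }),
    }
```

With `"Event"`, AWS queues the invocation and returns right away, which is why the web request never waits on the scrape itself.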

📋 What's Working

| Component | Status | Description |
| --- | --- | --- |
| Web Dashboard | ✅ Working | Flask app with modern UI |
| MongoDB Integration | ✅ Working | External MongoDB @ easypanel.host |
| Google Maps Scraper | ✅ Working | 100% success rate (fixed Oct 22) |
| WhatsApp Extraction | ✅ Enhanced | Extracts from action buttons |
| Lambda Deployment | ✅ Working | 1024MB, 300s timeout |
| Async Scraping | ✅ Working | Self-invocation pattern |
| Real-time Progress | ✅ Working | Live activity updates |
| Error Handling | ✅ Production-ready | Graceful degradation |
| Database Errors | ✅ User-friendly | Clean messages |

πŸ“ Project Structure

scraper_playwright/
├── app.py                    # Main Flask application
├── scrape_businesses_maps.py # Google Maps scraper
├── database.py               # MongoDB operations
├── lambda_handler.py         # AWS Lambda entry point
├── extract_contact_info.py   # Contact extraction (deprecated in scraping flow)
├── json_database.py          # Fallback JSON storage
├── templates/                # Jinja2 templates
│   ├── index.html            # Scraping interface
│   └── dashboard.html        # Business management
├── static/                   # CSS, JS, images
├── infra/                    # Terraform configuration
│   ├── main.tf
│   ├── terraform.tfvars
│   └── outputs.tf
├── Dockerfile                # Lambda container image
├── deploy.sh                 # Deployment automation
└── requirements.txt          # Python dependencies

🔧 Data Structure

Each scraped business contains:

{
  "name": "Business Name",
  "phone": "011 1234-5678",
  "website": "https://example.com",
  "email": null,
  "address": "Street Address, City",
  "rating": "4.8",
  "reviews": 152,
  "instagram": "https://instagram.com/username",
  "whatsapp": "+5491123456789",
  "scraped_at": "2025-10-22T12:34:56.789000",
  "search_query": "plomero, caba"
}
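Before insertion, a scraped record can be normalized to the shape shown above so every document carries the same keys. A hedged sketch (the repo's `database.py` may implement this differently):

```python
from datetime import datetime, timezone

BUSINESS_FIELDS = ("name", "phone", "website", "email", "address",
                   "rating", "reviews", "instagram", "whatsapp")

def make_business_doc(raw, search_query):
    """Fill missing fields with None and stamp scraped_at/search_query,
    matching the document shape shown above."""
    doc = {field: raw.get(field) for field in BUSINESS_FIELDS}
    doc["scraped_at"] = datetime.now(timezone.utc).isoformat()
    doc["search_query"] = search_query
    return doc
```

Keeping absent fields as explicit `null`s (rather than omitting them) makes dashboard queries and CSV export columns predictable.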

🚀 Quick Start

Local Development

  1. Clone and install:
git clone https://github.com/cf2018/scraper_playwright.git
cd scraper_playwright
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install chromium
  2. Set environment variables:
cp .env.example .env
# Edit .env with your MongoDB credentials
  3. Run locally:
python app.py
# Dashboard: http://localhost:5000/

Command-Line Scraping

# Scrape specific business type and location
python scrape_businesses_maps.py "plomero, caba" --max-results 10

# Output saved to: json_output/plomero_caba_YYYYMMDD_HHMMSS.json
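Internally, a query like "plomero, caba" has to become a Google Maps search URL before Playwright can navigate to it. A minimal sketch of that step (the scraper's actual URL construction may differ):

```python
from urllib.parse import quote

def maps_search_url(search_query):
    """Turn a query like 'plomero, caba' into a Google Maps search URL.
    quote() percent-encodes spaces and commas so the path is valid."""
    return "https://www.google.com/maps/search/" + quote(search_query)
```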

AWS Lambda Deployment

  1. Configure AWS credentials:
aws configure
  2. Set MongoDB credentials in infra/terraform.tfvars:
mongodb_connection_string = "mongodb://user:pass@host:27017/scraper"
mongodb_database_name = "scraper"
  3. Deploy:
./deploy.sh deploy
  4. Access:
  • Dashboard: https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/

📊 Usage Examples

Scrape via Web Interface

  1. Navigate to / (scraping page)
  2. Enter search query (e.g., "restaurants, new york")
  3. Set max results (default: 20, Lambda max: 20)
  4. Click "Start Scraping"
  5. Watch live progress updates
  6. View results in /dashboard

Scrape via API

# Start scraping
curl -X POST https://your-api.amazonaws.com/prod/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"search_query": "plomero, caba", "max_results": 10}'

# Check status
curl https://your-api.amazonaws.com/prod/api/scraping-status/<task_id>

# Get businesses
curl https://your-api.amazonaws.com/prod/api/businesses
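The start-then-poll flow shown in the curl examples can be wrapped in a small client. The HTTP call is injected as a callable so the logic is easy to test; the endpoint paths follow the examples above, but the response shape (a dict with a `status` key) is an assumption:

```python
import time

def wait_for_scrape(fetch_status, task_id, poll_seconds=2, max_polls=150):
    """Poll /api/scraping-status/<task_id> until the scrape finishes.

    fetch_status: callable taking a task_id and returning a dict like
    {"status": "running" | "completed" | "failed", ...} (assumed shape).
    """
    for _ in range(max_polls):
        status = fetch_status(task_id)
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"scrape {task_id} did not finish in time")
```

In practice `fetch_status` would be a thin wrapper around an HTTP GET to the status endpoint.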

🔒 Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| MONGODB_CONNECTION_STRING | MongoDB URI | Yes |
| MONGODB_DATABASE_NAME | Database name | Yes |
| LAMBDA_ENVIRONMENT | Set to true in Lambda | Auto |
| API_PREFIX | API Gateway stage prefix | Auto |
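A typical way to read these variables at startup and fail fast on missing required values (a sketch; the app's actual config handling may differ):

```python
import os

def load_config(env=None):
    """Read the variables from the table above, raising early if a
    required one is missing rather than failing mid-request."""
    env = os.environ if env is None else env
    required = ("MONGODB_CONNECTION_STRING", "MONGODB_DATABASE_NAME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required env vars: {', '.join(missing)}")
    return {
        "mongodb_uri": env["MONGODB_CONNECTION_STRING"],
        "mongodb_db": env["MONGODB_DATABASE_NAME"],
        # These two are set automatically in the Lambda environment.
        "is_lambda": env.get("LAMBDA_ENVIRONMENT", "").lower() == "true",
        "api_prefix": env.get("API_PREFIX", ""),
    }
```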

🧪 Recent Improvements (Oct 22, 2025)

Fixed Critical Browser Stability Issue ✅

  • Problem: Scraper crashed after first business (browser context closed)
  • Solution: Disabled website contact extraction during multi-business scraping
  • Result: 100% success rate (was 10%)

Enhanced WhatsApp Extraction ✅

  • Extracts WhatsApp from Reserve/Order buttons with wa.me links
  • Handles URL-encoded parameters
  • Validates phone number format (10-15 digits)
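The wa.me extraction described above (URL-decoding plus a 10-15 digit check) can be sketched like this; the scraper's exact regex may differ:

```python
import re
from urllib.parse import unquote

def whatsapp_from_url(url):
    """Pull a phone number out of a wa.me link, e.g. from a Reserve/Order
    button href. Returns '+<digits>' or None if no valid number is found."""
    decoded = unquote(url or "")  # handle URL-encoded hrefs
    match = re.search(r"wa\.me/(?:\+)?(\d+)", decoded)
    if not match:
        return None
    digits = match.group(1)
    # Valid international numbers are 10-15 digits long.
    if 10 <= len(digits) <= 15:
        return "+" + digits
    return None
```

Decoding first matters because Reserve/Order button hrefs often carry the wa.me link as a percent-encoded redirect parameter.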

Improved Browser Fingerprinting ✅

  • Updated to Chrome 131 user agent
  • Argentine locale (es-AR) and timezone
  • Buenos Aires geolocation coordinates
  • Proper Sec-Fetch-* headers
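The fingerprinting settings above map onto Playwright `browser.new_context(...)` options. A sketch under the assumption that the user-agent string and header values below approximate what the repo actually uses:

```python
def fingerprint_context_kwargs():
    """Playwright new_context() options matching the settings listed above.
    Usage: context = browser.new_context(**fingerprint_context_kwargs())
    The UA string is an approximation of a Chrome 131 desktop agent."""
    return {
        "user_agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/131.0.0.0 Safari/537.36"),
        "locale": "es-AR",
        "timezone_id": "America/Argentina/Buenos_Aires",
        "geolocation": {"latitude": -34.6037, "longitude": -58.3816},  # Buenos Aires
        "permissions": ["geolocation"],  # required for geolocation to apply
        "extra_http_headers": {
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Dest": "document",
        },
    }
```

Keeping locale, timezone, and geolocation mutually consistent is the point: a mismatched combination is itself a bot signal.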

Robust Error Recovery ✅

  • Page validity checking before navigation
  • Fallback to re-navigate on errors
  • Graceful exit on unrecoverable failures
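The recovery flow above (check page validity, re-navigate once on error, then give up gracefully) can be sketched with the page object injected for clarity. This is a simplified illustration, not the repo's actual code:

```python
def extract_with_recovery(page, url, extract):
    """Try extract(page); on failure re-navigate once and retry, then
    give up gracefully instead of crashing the whole run.

    page: Playwright-like page exposing is_closed() and goto(url).
    extract: callable(page) -> dict of business fields.
    """
    for attempt in range(2):
        try:
            if page.is_closed():   # validity check before touching the page
                return None        # unrecoverable: browser context is gone
            if attempt > 0:
                page.goto(url)     # fallback: re-navigate and retry once
            return extract(page)
        except Exception:
            if attempt == 1:
                return None        # graceful exit after the retry also fails
    return None
```

Returning `None` for one failed business, rather than raising, is what keeps a single bad listing from aborting the rest of the run.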

πŸ› Known Limitations

  • Email extraction: Disabled to maintain stability (was ~40% coverage)
  • Lambda max results: Limited to 20 businesses due to 300s timeout
  • Website extraction: Disabled in multi-business scraping (stability over completeness)


🔗 Technology Stack

  • Backend: Python 3.12, Flask 3.1.2
  • Scraping: Playwright (Chromium)
  • Database: MongoDB (external cloud)
  • Deployment: AWS Lambda, API Gateway, ECR
  • IaC: Terraform
  • Frontend: Vanilla JS, Tailwind-inspired CSS

📈 Performance

  • Scraping speed: ~3-5 seconds per business
  • Lambda cold start: ~2-3 seconds
  • Lambda warm execution: ~1 second for dashboard
  • Database queries: <100ms average
  • Success rate: 100% (after stability fixes)

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is for educational and research purposes.

⚠️ Disclaimer

This scraper is for educational purposes. Always respect Google Maps Terms of Service and implement appropriate rate limiting. The authors are not responsible for misuse of this tool.
