
πŸ•·οΈ Crawler Web 2 β€” Sequential Domain HTML Archiver

A high-fidelity, memory-conscious web crawler built with Crawlee + Playwright that sequentially archives raw HTML from multiple domains. Optimized for stability and production use.

🎯 Project Overview

Crawler Web 2 is an HTML archiving system designed for high-quality content extraction from modern websites (React, Vue, Next.js). Unlike traditional web scrapers that try to parallelize across domains, this project prioritizes memory stability through per-domain isolation and sequential execution.

Highlights

  • ✅ Sequential Domain Processing — one domain finishes before the next starts → memory is fully released
  • ✅ Stealth Mode — built-in anti-bot fingerprinting via playwright-extra + puppeteer-extra-plugin-stealth
  • ✅ robots.txt Compliance — automatic validation against robots.txt before crawling
  • ✅ High-Fidelity Rendering — waits for JS frameworks to finish rendering (networkidle + manual delay) before extracting HTML
  • ✅ Hybrid Discovery — combines sitemap parsing with recursive link detection for maximum coverage
  • ✅ Health Monitoring — stops automatically if free disk space drops below 500 MB or the failure rate climbs too high
  • ✅ Atomic Writes — files are written via .tmp → rename to prevent data corruption

📋 Feature List

Feature                  Description
─────────────────────    ─────────────────────────────────────────────────────
Raw HTML Archival        Saves raw HTML to storage/<domain>/<sanitized_url>.html
Sitemap Priority         Downloads every URL from sitemap.xml up front for maximum coverage
robots.txt Respect       Refuses to crawl URLs disallowed by robots.txt
Per-Domain Error Logs    Each domain gets its own error.log for tracking problems
Disk Space Guard         Stops automatically when free disk space drops below 500 MB
Memory Efficiency        availableMemoryRatio: 0.8 for stability
Stealth Anti-Bot         Uses the playwright-extra stealth plugin to evade bot detection

βš™οΈ Instalasi

Prerequisites

  • Node.js 18+ (dengan npm 9+)
  • Disk space minimal 1 GB (untuk contoh 3 domain)

Setup

# 1. Clone the repository
git clone <your-repo> crawler-web-2
cd crawler-web-2

# 2. Install dependencies (postinstall automatically downloads the Playwright browsers)
npm install

# 3. (Optional) Update the target domains in src/main.js
# Edit the `domains` array to set the target URLs

🚀 How to Run

Command

npm start

This runs node src/main.js.


📊 What Happens When You Run npm start?

Below is the full execution flow, with an explanation of each stage:

1️⃣ Initialization Phase

✓ Load the Crawlee configuration (memory ratio = 0.8)
✓ Activate Playwright stealth fingerprinting
✓ Parse the target domains from src/main.js (see the sketch below)
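The sketch below shows how this initialization could look, assuming Crawlee's availableMemoryRatio configuration option and the documented playwright-extra stealth setup; it is illustrative, not the project's exact code.

import { Configuration } from 'crawlee';
import { chromium } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

// Let Crawlee's autoscaled pool use up to 80% of available memory
const config = new Configuration({ availableMemoryRatio: 0.8 });

// Patch Chromium with anti-bot fingerprint fixes; the patched `chromium`
// is later handed to PlaywrightCrawler via launchContext: { launcher: chromium }
chromium.use(stealthPlugin());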

2️⃣ Per-Domain Sequential Loop

For each domain in the list, the system runs the following stages:

┌─────────────────────────────────────────────────────────────┐
│  DOMAIN 1: https://sendokibu.com                            │
└─────────────────────────────────────────────────────────────┘

Phase A: Pre-Crawl Checks

Step                     Description                                Output
─────────────────────    ───────────────────────────────────────    ──────────────────────────────────
📁 Create Storage        Create the storage/sendokibu.com/ folder   ✓ Directory ready
💾 Disk Check            Check remaining disk space                 ✓ 250 GB free (> 500 MB threshold)
🌐 Fetch robots.txt      Download & parse robots.txt                ✓ Disallowed paths: /admin, /private
🔍 Validate Start URL    Check whether the start URL is allowed     ✓ / is allowed

Example output:

═══════════════════════════════════════════════════════════════
🌐  Starting crawl: https://sendokibu.com/
πŸ“  Storage dir  : /Users/fidaa/work/mygithub/crawler-web-2/storage/sendokibu.com
═══════════════════════════════════════════════════════════════

💾 Free disk: 250 GB (> 500 MB threshold) ✓
🔍  Attempting to fetch sitemap: https://sendokibu.com/sitemap.xml
📝  Found 248 URLs in sitemap.
🤖  robots.txt parsed: 2 disallowed paths
✅  Start URL is allowed by robots.txt

Phase B: URL Discovery

Sitemap-First Strategy:

  • Crawlee uses downloadListOfUrls() to extract every URL from the sitemap
  • All 248 URLs go straight into the request queue

Recursive Discovery (Fallback):

  • If no sitemap is available, the system will:
    • Crawl the start-URL page
    • Extract every internal link (<a href>)
    • Enqueue them with the same-domain strategy (staying within 1 domain), as sketched below
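A minimal sketch of this hybrid strategy, assuming a sitemap at /sitemap.xml; downloadListOfUrls lives in @crawlee/utils and is re-exported by the crawlee package.

import { downloadListOfUrls } from 'crawlee';

async function discoverStartUrls(domain) {
    try {
        // Sitemap-first: pull every URL listed in sitemap.xml
        const urls = await downloadListOfUrls({ url: `${domain}/sitemap.xml` });
        if (urls.length > 0) return urls;
    } catch {
        // No sitemap reachable; fall through to recursive discovery
    }
    // Fallback: start from the homepage; the request handler's
    // enqueueLinks({ strategy: 'same-domain' }) discovers links recursively
    return [domain];
}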

Example queue:

Request Queue (248 items):
  1. https://sendokibu.com/ [Discovery]
  2. https://sendokibu.com/about-us [Discovery]
  3. https://sendokibu.com/contact [Discovery]
  4. https://sendokibu.com/brand [Detail] 
  5. ... (244 more)

Phase C: Page Crawling & Rendering

For each URL in the queue, the system performs the following steps:

URL: https://sendokibu.com/about-us
│
├─ 1. Navigate to page (60s timeout)
│  └─ Playwright opens a browser tab
│
├─ 2. Wait for network idle
│  └─ Wait until all XHR/fetch requests have finished
│
├─ 3. Wait for body element (30s timeout)
│  └─ Ensures the DOM has rendered at least minimally
│
├─ 4. Extra 2 second pause
│  └─ Gives React/Vue/Next.js hydration time to finish
│  └─ JS frameworks can render their dynamic content
│
├─ 5. Extract page content
│  └─ Grab the raw HTML from the browser context
│
├─ 6. Validate HTML size
│  └─ If < 1 KB → likely an error page (log error)
│  └─ If >= 1 KB → proceed to save
│
└─ 7. Find & enqueue new links
   └─ Parse <a href> tags
   └─ Enqueue into the queue with label "detail"
   └─ Skip external URLs, fragments (#), non-HTML
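A minimal sketch of steps 1-7 as a Crawlee request handler; the timeouts mirror the diagram above, and saveHtml is a hypothetical helper (see the Phase D sketches below).

import { PlaywrightCrawler } from 'crawlee';

// Hypothetical placeholder; see the atomic-write sketch under Phase D
async function saveHtml(url, html) { /* write to storage/<domain>/... */ }

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 60,                                   // step 1
    async requestHandler({ page, request, enqueueLinks }) {
        await page.waitForLoadState('networkidle');              // step 2
        await page.waitForSelector('body', { timeout: 30_000 }); // step 3
        await page.waitForTimeout(2000);                         // step 4: hydration pause

        const html = await page.content();                       // step 5
        if (Buffer.byteLength(html, 'utf8') < 1024) {            // step 6
            throw new Error(`HTML too small: ${request.url} (likely an error page)`);
        }
        await saveHtml(request.url, html);

        await enqueueLinks({ strategy: 'same-domain', label: 'detail' }); // step 7
    },
});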

Example progress:

[sendokibu.com] Discovery: https://sendokibu.com/
  ✓ Network idle (2.3s)
  ✓ Body rendered (0.8s)
  ✓ HTML extracted (1.2 KB)
  ✓ Found 15 new links
  → Enqueued 14 links (1 was external)

[sendokibu.com] Discovery: https://sendokibu.com/about-us
  ✓ Network idle (1.1s)
  ✓ Body rendered (0.4s)
  ✓ HTML extracted (4.8 KB)
  ✓ Found 8 new links
  → Enqueued 8 links

[sendokibu.com] Discovery: https://sendokibu.com/contact
  ⚠ HTML size only 0.3 KB (likely error page)
  ✗ FAILED — HTML too small (probably 404)
  → Logged to error.log
... (245 more pages)

Phase D: File Storage

Each successfully extracted HTML page is saved using a path-based mapping:

Sitemap URL                        Storage Path
───────────────────────────────    ──────────────────────────────────
https://sendokibu.com/           → storage/sendokibu.com/index.html
https://sendokibu.com/about      → storage/sendokibu.com/about.html
https://sendokibu.com/about/     → storage/sendokibu.com/about/index.html
https://sendokibu.com/blog?id=1  → storage/sendokibu.com/blog_id=1.html
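A minimal sketch of a helper reproducing this mapping; urlToStoragePath is a hypothetical name, not necessarily the project's actual function.

import path from 'node:path';

function urlToStoragePath(rawUrl, storageRoot = 'storage') {
    const url = new URL(rawUrl);
    let p = url.pathname;
    if (p.endsWith('/')) p += 'index';              // "/" and "/about/" map to .../index.html
    if (url.search) p += '_' + url.search.slice(1); // "?id=1" becomes "_id=1"
    return path.join(storageRoot, url.hostname, `${p}.html`);
}

// urlToStoragePath('https://sendokibu.com/')          -> storage/sendokibu.com/index.html
// urlToStoragePath('https://sendokibu.com/blog?id=1') -> storage/sendokibu.com/blog_id=1.html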

Atomic Write Process:

1. Generate a temp file: /path/to/file.html.tmp
2. Write the raw HTML → file.html.tmp
3. Rename (atomic): file.html.tmp → file.html
   ↳ If the process crashes before the rename, file.html is never left corrupted
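A minimal sketch of the atomic write, assuming the temp file sits on the same filesystem so the rename is atomic.

import fs from 'node:fs/promises';
import path from 'node:path';

async function atomicWrite(filePath, html) {
    await fs.mkdir(path.dirname(filePath), { recursive: true }); // ensure parent dirs exist
    const tmpPath = `${filePath}.tmp`;
    await fs.writeFile(tmpPath, html, 'utf8'); // steps 1-2: write to the temp file
    await fs.rename(tmpPath, filePath);        // step 3: atomic swap into place
}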

Phase E: Health Monitoring & Error Logging

Every X requests, the system checks:

Metrics:
  Total requests: 248
  Successful pages: 240
  Failed pages: 8
  Failure rate: 3.2%

Status: ✅ HEALTHY (3.2% < 50% threshold)
  └─ Continue crawling

Storage: ✅ HEALTHY (250 GB >> 500 MB threshold)
  └─ Continue crawling
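A minimal sketch of how such a check could look; the counters are illustrative, and fs.statfs requires Node >= 18.15.

import { statfs } from 'node:fs/promises';

const DISK_THRESHOLD_BYTES = 500 * 1024 * 1024; // 500 MB
const FAILURE_RATE_THRESHOLD = 0.5;             // 50%

async function isHealthy(succeeded, failed, storageDir) {
    const total = succeeded + failed;
    const failureRate = total > 0 ? failed / total : 0;
    if (failureRate >= FAILURE_RATE_THRESHOLD) return false; // too many failures

    const stats = await statfs(storageDir);
    const freeBytes = stats.bavail * stats.bsize; // available blocks * block size
    return freeBytes >= DISK_THRESHOLD_BYTES;
}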

Failed URLs are logged to storage/sendokibu.com/error.log:

[2025-04-09T14:23:45.123Z] FAILED https://sendokibu.com/404 — HTTP 404
[2025-04-09T14:24:12.456Z] FAILED https://sendokibu.com/admin — Forbidden (robots.txt)
[2025-04-09T14:25:03.789Z] FAILED https://sendokibu.com/timeout — Navigation timeout 60s exceeded
[2025-04-09T14:26:15.321Z] FAILED https://sendokibu.com/blocked — Network blocked (CSP/CORS)

Phase F: Domain Completion

═══════════════════════════════════════════════════════════════
✅ CRAWL COMPLETE: sendokibu.com
═══════════════════════════════════════════════════════════════
📊 Summary:
   ✓ Total URLs crawled: 248
   ✓ Success: 240 pages
   ✗ Failures: 8 pages
   💾 Storage used: 45 MB
   ⏱️  Time elapsed: 12m 34s

📁 Files saved to: storage/sendokibu.com/
📝 Error log: storage/sendokibu.com/error.log
═══════════════════════════════════════════════════════════════

🧹 Cleanup:
   → Close browser instance
   → Release memory back to OS
   → Flush dataset cache

3️⃣ Next Domain (Repeat Loop)

Once domain 1 is finished, the system moves straight on to domain 2:

┌─────────────────────────────────────────────────────────────┐
│  DOMAIN 2: https://sequence.day                             │
└─────────────────────────────────────────────────────────────┘

[Fresh browser instance]
[Fresh memory allocation]
[Repeat Phase A → F]
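A minimal sketch of the outer loop, with crawlDomain standing in for phases A-F (a hypothetical wrapper with an abridged handler).

import { PlaywrightCrawler } from 'crawlee';

async function crawlDomain(startUrl) {
    const crawler = new PlaywrightCrawler({
        async requestHandler({ page, enqueueLinks }) {
            await page.content();                            // phase C (abridged)
            await enqueueLinks({ strategy: 'same-domain' }); // stay on this domain
        },
    });
    await crawler.run([startUrl]); // resolves only once the queue is drained
}

for (const domain of ['https://cmlabs.co', 'https://sequence.day', 'https://sendokibu.com/']) {
    await crawlDomain(domain); // one domain at a time: memory is released between domains
}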

4️⃣ Final Completion

═══════════════════════════════════════════════════════════════
🎉 ALL DOMAINS CRAWLED SUCCESSFULLY
═══════════════════════════════════════════════════════════════

📊 GLOBAL SUMMARY:
   Domain 1 (sendokibu.com):  240/248 pages ✓
   Domain 2 (sequence.day):   1205/1310 pages ✓
   Domain 3 (cmlabs.co):      530/560 pages ✓
   ──────────────────────────────────────────
   TOTAL:                     1975 pages archived

💾 Total storage: 845 MB

⏱️  Total time: 45m 23s

📁 Archive structure:
   storage/
   ├── sendokibu.com/
   │   ├── index.html
   │   ├── about/
   │   ├── contact.html
   │   └── error.log
   ├── sequence.day/
   │   └── ...
   └── cmlabs.co/
       └── ...

═══════════════════════════════════════════════════════════════

📸 Visual Flow Diagram

Here is the execution flow, visualized:

Execution Timeline

Execution Flow
Example: output when the sendokibu.com crawl completes

Domain Sequence
Example: output when the sequence.day crawl completes

Multiple Domains
Example: output when the cmlabs.co crawl completes


πŸ“ Output Structure

Setelah npm start selesai, struktur folder akan terlihat seperti:

storage/
├── sendokibu.com/
│   ├── index.html                          # Homepage
│   ├── about.html
│   ├── contact.html
│   ├── blog/
│   │   ├── post-1.html
│   │   ├── post-2.html
│   │   └── ...
│   ├── error.log                           # Failed URLs & reasons
│   └── ...
│
├── sequence.day/
│   ├── index.html
│   ├── auth_type=personal.html
│   ├── pricing/
│   └── error.log
│
└── cmlabs.co/
    ├── en/
    ├── id/
    ├── products/
    ├── error.log
    └── ...

datasets/
└── default/
    └── (crawler statistics & metadata)

request_queues/
└── default/
    └── (request state snapshots)

🛠️ Configuration

Edit src/main.js to adjust:

// Target domains
const domains = [
    'https://cmlabs.co',
    'https://sequence.day',
    'https://sendokibu.com/'
];

// Concurrency (default: 2-3 for ethical scraping)
const maxConcurrency = 3;

// Request timeout (default: 60s)
const navigationTimeoutSecs = 60;

// Failure rate threshold (default: 50%)
const failureRateThreshold = 0.5;

// Disk space threshold (default: 500 MB)
const DISK_THRESHOLD_BYTES = 500 * 1024 * 1024;

📚 File Reference

File                 Description
─────────────────    ─────────────────────────────────────────────────────────
src/main.js          Entry point: domain loop, robots.txt check, health monitoring
src/routes.js        Request handlers: link discovery, HTML extraction, file saving
storage/<domain>/    Per-domain output folder with HTML files & error.log
package.json         Dependencies & npm scripts
Dockerfile           Production deployment (pre-installed Playwright)

🐳 Docker (Production)

# Build image
docker build -t crawler-web-2 .

# Run container
docker run --rm \
  -v $(pwd)/storage:/app/storage \
  -e NODE_ENV=production \
  crawler-web-2

Base image: apify/actor-node-playwright-chrome:24-1.58.2 (Playwright pre-installed)


🔒 robots.txt Compliance

The system always validates robots.txt before crawling a domain:

// Check before crawling
const robots = await fetchRobots(startUrl);
const userAgent = 'CrawleeBot';

if (!robots.isAllowed(startUrl, userAgent)) {
    console.warn(`🚫  ${startUrl} is disallowed by robots.txt. Skipping.`);
    fs.appendFileSync(errorLogPath, `[timestamp] SKIPPED (robots.txt) ${startUrl}\n`);
    continue; // Skip domain
}
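fetchRobots itself is not shown in this README; a minimal sketch using the robots-parser package (an assumption about the implementation) could look like:

import robotsParser from 'robots-parser';

async function fetchRobots(startUrl) {
    const robotsUrl = new URL('/robots.txt', startUrl).href;
    const res = await fetch(robotsUrl);          // global fetch, Node 18+
    const body = res.ok ? await res.text() : ''; // missing file: allow everything
    return robotsParser(robotsUrl, body);        // exposes isAllowed(url, userAgent)
}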

📖 Resources


πŸ“ Lisensi

ISC


👤 Author

It's not you it's me
