A high-fidelity, memory-conscious web crawler built with Crawlee + Playwright that sequentially archives raw HTML from multiple domains. Optimized for stability and production use.
Crawler Web 2 is an HTML archival system designed for high-quality content extraction from modern websites (React, Vue, Next.js). Unlike traditional web scrapers that try to parallelize across domains, this project prioritizes memory stability through per-domain isolation and sequential execution.
- ✅ Sequential Domain Processing: one domain finishes before the next begins, so memory is fully released
- ✅ Stealth Mode: built-in anti-bot fingerprinting using `playwright-extra` + `puppeteer-extra-plugin-stealth`
- ✅ robots.txt Compliance: automatic validation against `robots.txt` before crawling
- ✅ High-Fidelity Rendering: waits for JS frameworks to finish rendering (network idle + manual delay) before extracting HTML
- ✅ Hybrid Discovery: combines sitemap + recursive link detection for maximum coverage
- ✅ Health Monitoring: auto-stops if free disk space drops below 500 MB or the failure rate gets too high
- ✅ Atomic Writes: files are written via `.tmp` → rename to prevent data corruption
| Feature | Description |
|---|---|
| Raw HTML Archival | Saves raw HTML to `storage/<domain>/<sanitized_url>.html` |
| Sitemap Priority | Downloads all URLs from `sitemap.xml` up front for maximum coverage |
| robots.txt Respect | Refuses to crawl URLs disallowed by `robots.txt` |
| Per-Domain Error Logs | Each domain gets its own `error.log` for tracking issues |
| Disk Space Guard | Stops automatically when free disk space drops below 500 MB |
| Memory Efficiency | `availableMemoryRatio: 0.8` for stability |
| Stealth Anti-Bot | Uses the `playwright-extra` stealth plugin to bypass bot detection |
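A minimal sketch of the stealth setup listed above, following the `playwright-extra` plugin pattern (the exact wiring into Crawlee in `src/main.js` may differ):

```javascript
// Sketch: launch a stealth-patched Chromium with playwright-extra.
// The actual integration with Crawlee's PlaywrightCrawler (via its
// launchContext option) may look different in this project.
const { chromium } = require('playwright-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Register the stealth plugin so its fingerprinting evasions apply
// to every browser launched through this module.
chromium.use(StealthPlugin());

async function launchStealthBrowser() {
  // Headless Chromium with the stealth evasions active.
  return chromium.launch({ headless: true });
}
```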
- Node.js 18+ (with npm 9+)
- At least 1 GB of free disk space (for the 3-domain example)
```bash
# 1. Clone the repository
git clone <your-repo> crawler-web-2
cd crawler-web-2

# 2. Install dependencies (postinstall downloads the Playwright browsers automatically)
npm install

# 3. (Optional) Update the target domains in src/main.js
#    Edit the `domains` array to set the target URLs
```

Then run:

```bash
npm start
```

This executes `node src/main.js`.
Below is the full execution flow, with each stage explained:
- Load Crawlee configuration (memory ratio = 0.8)
- Activate Playwright stealth fingerprinting
- Parse the target domains from `src/main.js`
For each domain in the list, the system runs the following stages:
```
┌──────────────────────────────────────────────────────────────┐
│ DOMAIN 1: https://sendokibu.com                              │
└──────────────────────────────────────────────────────────────┘
```
| Step | Description | Output |
|---|---|---|
| 📁 Create Storage | Create the `storage/sendokibu.com/` folder | ✅ Directory ready |
| 💾 Disk Check | Check remaining disk space | ✅ 250 GB free (> 500 MB threshold) |
| 🌐 Fetch robots.txt | Download & parse `robots.txt` | ✅ Disallowed paths: `/admin`, `/private` |
| 🔍 Validate Start URL | Check whether the start URL is allowed | ✅ `/` is allowed |
Example output:
```
───────────────────────────────────────────────────────────────
🌐 Starting crawl: https://sendokibu.com/
📁 Storage dir : /Users/fidaa/work/mygithub/crawler-web-2/storage/sendokibu.com
───────────────────────────────────────────────────────────────
💾 Free disk: 250 GB (> 500 MB threshold) ✅
🔍 Attempting to fetch sitemap: https://sendokibu.com/sitemap.xml
🔍 Found 248 URLs in sitemap.
🤖 robots.txt parsed: 2 disallowed paths
✅ Start URL is allowed by robots.txt
```
Sitemap-First Strategy:
- Crawlee uses `downloadListOfUrls()` to extract every URL from the sitemap
- All 248 URLs are pushed straight into the request queue

Recursive Discovery (Fallback):
- If no sitemap is available, the system will:
  - Crawl the start URL page
  - Extract all internal links (`<a href>`)
  - Enqueue them with the `same-domain` strategy (stays within one domain)
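The hybrid discovery flow can be sketched with Crawlee's `downloadListOfUrls()` and `enqueueLinks()` helpers (the queue label and overall structure here are illustrative, not the exact code in `src/main.js`):

```javascript
// Sketch: sitemap-first discovery with a recursive same-domain fallback.
// downloadListOfUrls() and enqueueLinks() are real Crawlee helpers;
// the label and control flow are illustrative.
import { PlaywrightCrawler, downloadListOfUrls } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ enqueueLinks }) {
    // Fallback path: recursively discover internal <a href> links,
    // staying on the same domain as the current page.
    await enqueueLinks({ strategy: 'same-domain', label: 'detail' });
  },
});

// Sitemap-first path: try to seed the queue from sitemap.xml.
const sitemapUrls = await downloadListOfUrls({
  url: 'https://sendokibu.com/sitemap.xml',
}).catch(() => []); // no sitemap → fall back to recursive discovery

if (sitemapUrls.length > 0) {
  await crawler.run(sitemapUrls);               // all URLs enter the queue up front
} else {
  await crawler.run(['https://sendokibu.com/']); // recursive discovery only
}
```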
Example queue:

```
Request Queue (248 items):
1. https://sendokibu.com/          [Discovery]
2. https://sendokibu.com/about-us  [Discovery]
3. https://sendokibu.com/contact   [Discovery]
4. https://sendokibu.com/brand     [Detail]
5. ... (244 more)
```
For every URL in the queue, the system performs:
```
URL: https://sendokibu.com/about-us
│
├─ 1. Navigate to page (60s timeout)
│    └─ Playwright opens a browser tab
│
├─ 2. Wait for network idle
│    └─ Wait until all XHR/fetch requests have finished
│
├─ 3. Wait for body element (30s timeout)
│    └─ Ensures the DOM has rendered at least minimally
│
├─ 4. Extra 2-second pause
│    ├─ Gives React/Vue/Next.js hydration time to finish
│    └─ Lets the JS framework render dynamic content
│
├─ 5. Extract page content
│    └─ Grab the raw HTML from the browser context
│
├─ 6. Validate HTML size
│    ├─ If < 1 KB → likely an error page (log error)
│    └─ If >= 1 KB → proceed to save
│
└─ 7. Find & enqueue new links
     ├─ Parse <a href> tags
     ├─ Enqueue to the queue with label "detail"
     └─ Skip external URLs, fragments (#), non-HTML
```
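The seven steps above can be sketched as a Crawlee request handler. The timeouts mirror the values in the diagram; `saveHtml()` is a hypothetical helper, and the 1 KB check is an assumption based on the described flow:

```javascript
// Sketch of the per-URL pipeline (steps 1-7 above) as a
// PlaywrightCrawler requestHandler. saveHtml() is hypothetical.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  navigationTimeoutSecs: 60,                // step 1: 60s navigation timeout
  async requestHandler({ page, request, enqueueLinks, log }) {
    // Step 2: wait until all XHR/fetch traffic has settled.
    await page.waitForLoadState('networkidle');
    // Step 3: make sure at least the <body> has rendered.
    await page.waitForSelector('body', { timeout: 30_000 });
    // Step 4: extra pause so React/Vue/Next.js hydration can finish.
    await page.waitForTimeout(2_000);
    // Step 5: grab the raw HTML from the browser context.
    const html = await page.content();
    // Step 6: tiny pages are probably error pages.
    if (html.length < 1024) {
      log.warning(`HTML too small for ${request.url}, skipping save`);
    } else {
      await saveHtml(request.url, html);    // hypothetical save helper
    }
    // Step 7: discover and enqueue new same-domain links.
    await enqueueLinks({ strategy: 'same-domain', label: 'detail' });
  },
});
```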
Example progress:

```
[sendokibu.com] Discovery: https://sendokibu.com/
  ✓ Network idle (2.3s)
  ✓ Body rendered (0.8s)
  ✓ HTML extracted (1.2 KB)
  ✓ Found 15 new links
  → Enqueued 14 links (1 was external)

[sendokibu.com] Discovery: https://sendokibu.com/about-us
  ✓ Network idle (1.1s)
  ✓ Body rendered (0.4s)
  ✓ HTML extracted (4.8 KB)
  ✓ Found 8 new links
  → Enqueued 8 links

[sendokibu.com] Discovery: https://sendokibu.com/contact
  ✗ HTML size only 0.3 KB (likely error page)
  ✗ FAILED: HTML too small (probably 404)
  → Logged to error.log

... (245 more pages)
```
Every successfully extracted HTML page is saved using path-based mapping:
```
Sitemap URL                         Storage Path
─────────────────────────────       ──────────────────────────────────
https://sendokibu.com/          →   storage/sendokibu.com/index.html
https://sendokibu.com/about     →   storage/sendokibu.com/about.html
https://sendokibu.com/about/    →   storage/sendokibu.com/about/index.html
https://sendokibu.com/blog?id=1 →   storage/sendokibu.com/blog_id=1.html
```
Atomic Write Process:

```
1. Generate temp file: /path/to/file.html.tmp
2. Write raw HTML → file.html.tmp
3. Rename (atomic): file.html.tmp → file.html
   ↳ If the process crashes before the rename, file.html is never corrupted
```
Every X requests, the system checks:
```
Metrics:
  Total requests:   248
  Successful pages: 240
  Failed pages:     8
  Failure rate:     3.2%

Status:  ✅ HEALTHY (3.2% < 50% threshold)
         └─ Continue crawl
Storage: ✅ HEALTHY (250 GB >> 500 MB threshold)
         └─ Continue crawl
```
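The health check reduces to two comparisons; a sketch using the thresholds stated in this document (50% failure rate, 500 MB of free disk):

```javascript
// Sketch: combined failure-rate and disk-space health check.
// Thresholds mirror the document; the function name is illustrative.
const FAILURE_RATE_THRESHOLD = 0.5;                // 50%
const DISK_THRESHOLD_BYTES = 500 * 1024 * 1024;    // 500 MB

function isHealthy({ total, failed, freeDiskBytes }) {
  const failureRate = total > 0 ? failed / total : 0;
  return failureRate < FAILURE_RATE_THRESHOLD
      && freeDiskBytes > DISK_THRESHOLD_BYTES;
}

// 8 failures out of 248 requests with 250 GB free: healthy.
console.log(isHealthy({ total: 248, failed: 8, freeDiskBytes: 250e9 })); // true
```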
Failed URLs are logged to `storage/sendokibu.com/error.log`:
```
[2025-04-09T14:23:45.123Z] FAILED https://sendokibu.com/404 → HTTP 404
[2025-04-09T14:24:12.456Z] FAILED https://sendokibu.com/admin → Forbidden (robots.txt)
[2025-04-09T14:25:03.789Z] FAILED https://sendokibu.com/timeout → Navigation timeout 60s exceeded
[2025-04-09T14:26:15.321Z] FAILED https://sendokibu.com/blocked → Network blocked (CSP/CORS)
```
```
───────────────────────────────────────────────────────────────
✅ CRAWL COMPLETE: sendokibu.com
───────────────────────────────────────────────────────────────
📊 Summary:
   • Total URLs crawled: 248
   • Success: 240 pages
   • Failures: 8 pages

   💾 Storage used: 45 MB
   ⏱️ Time elapsed: 12m 34s
   📁 Files saved to: storage/sendokibu.com/
   📄 Error log: storage/sendokibu.com/error.log
───────────────────────────────────────────────────────────────
🧹 Cleanup:
   ✓ Close browser instance
   ✓ Release memory back to OS
   ✓ Flush dataset cache
```
After domain 1 finishes, the system moves straight on to domain 2:
```
┌──────────────────────────────────────────────────────────────┐
│ DOMAIN 2: https://sequence.day                               │
└──────────────────────────────────────────────────────────────┘
[Fresh browser instance]
[Fresh memory allocation]
[Repeat Phase A → F]
```
```
───────────────────────────────────────────────────────────────
🎉 ALL DOMAINS CRAWLED SUCCESSFULLY
───────────────────────────────────────────────────────────────
📊 GLOBAL SUMMARY:
   Domain 1 (sendokibu.com): 240/248 pages ✅
   Domain 2 (sequence.day): 1205/1310 pages ✅
   Domain 3 (cmlabs.co): 530/560 pages ✅
   ──────────────────────────────────────────
   TOTAL: 1975 pages archived

   💾 Total storage: 845 MB
   ⏱️ Total time: 45m 23s

📁 Archive structure:
   storage/
   ├── sendokibu.com/
   │   ├── index.html
   │   ├── about/
   │   ├── contact.html
   │   └── error.log
   ├── sequence.day/
   │   └── ...
   └── cmlabs.co/
       └── ...
───────────────────────────────────────────────────────────────
```
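The per-domain isolation above boils down to a sequential loop that builds a fresh crawler for each domain. A sketch, assuming Crawlee's `PlaywrightCrawler` (queue naming and handler details omitted):

```javascript
// Sketch: sequential per-domain execution. Each domain gets a brand-new
// crawler (and browser pool), so memory is fully released between runs.
import { PlaywrightCrawler } from 'crawlee';

const domains = [
  'https://sendokibu.com/',
  'https://sequence.day',
  'https://cmlabs.co',
];

for (const startUrl of domains) {
  // Fresh crawler instance per domain: fresh browser, fresh state.
  const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
      // ... extraction & saving as described in the phases above ...
      await enqueueLinks({ strategy: 'same-domain' });
    },
  });

  await crawler.run([startUrl]); // blocks until this domain is done;
  // run() shuts the crawler down when it finishes, releasing memory
  // before the next domain starts.
}
```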
The execution flow, visualized:

*(screenshot)* Example: output when the crawl of domain sendokibu.com completes

*(screenshot)* Example: output when the crawl of domain sequence.day completes

*(screenshot)* Example: output when the crawl of domain cmlabs.co completes
After `npm start` finishes, the folder structure will look like:
```
storage/
├── sendokibu.com/
│   ├── index.html              # Homepage
│   ├── about.html
│   ├── contact.html
│   ├── blog/
│   │   ├── post-1.html
│   │   ├── post-2.html
│   │   └── ...
│   ├── error.log               # Failed URLs & reasons
│   └── ...
│
├── sequence.day/
│   ├── index.html
│   ├── auth_type=personal.html
│   ├── pricing/
│   └── error.log
│
└── cmlabs.co/
    ├── en/
    ├── id/
    ├── products/
    ├── error.log
    └── ...

datasets/
└── default/
    └── (crawler statistics & metadata)

request_queues/
└── default/
    └── (request state snapshots)
```
Edit `src/main.js` to adjust:
```javascript
// Target domains
const domains = [
  'https://cmlabs.co',
  'https://sequence.day',
  'https://sendokibu.com/'
];

// Concurrency (default: 2-3 for ethical scraping)
const maxConcurrency = 3;

// Request timeout (default: 60s)
const navigationTimeoutSecs = 60;

// Failure rate threshold (default: 50%)
const failureRateThreshold = 0.5;

// Disk space threshold (default: 500 MB)
const DISK_THRESHOLD_BYTES = 500 * 1024 * 1024;
```

| File | Description |
|---|---|
| `src/main.js` | Entry point: domain loop, robots.txt check, health monitoring |
| `src/routes.js` | Request handlers: link discovery, HTML extraction, file saving |
| `storage/<domain>/` | Output folder per domain with HTML files & `error.log` |
| `package.json` | Dependencies & npm scripts |
| `Dockerfile` | Production deployment (pre-installed Playwright) |
```bash
# Build the image
docker build -t crawler-web-2 .

# Run the container
docker run --rm \
  -v $(pwd)/storage:/app/storage \
  -e NODE_ENV=production \
  crawler-web-2
```

Base image: `apify/actor-node-playwright-chrome:24-1.58.2` (Playwright pre-installed)
The system validates `robots.txt` as a mandatory step before crawling:
```javascript
// Check before crawling
const robots = await fetchRobots(startUrl);
const userAgent = 'CrawleeBot';

if (!robots.isAllowed(startUrl, userAgent)) {
  console.warn(`🚫 ${startUrl} is disallowed by robots.txt. Skipping.`);
  fs.appendFileSync(errorLogPath, `[timestamp] SKIPPED (robots.txt) ${startUrl}\n`);
  continue; // skip this domain
}
```

ISC