
πŸ•·οΈ Crawler Web 2 β€” Sequential Domain HTML Archiver

A high-fidelity, memory-conscious web crawler built with Crawlee + Playwright that sequentially archives raw HTML from multiple domains. Optimized for stability and production use.

🎯 Project Overview

Crawler Web 2 is an HTML archiving system designed for high-quality content extraction from modern websites (React, Vue, Next.js). Unlike traditional web scrapers that try to parallelize across domains, this project prioritizes memory stability through per-domain isolation and sequential execution.

Highlights

  • ✅ Sequential Domain Processing — one domain finishes before the next starts → memory is fully released
  • ✅ Stealth Mode — built-in anti-bot fingerprinting via playwright-extra + puppeteer-extra-plugin-stealth
  • ✅ robots.txt Compliance — automatic validation against robots.txt before crawling
  • ✅ High-Fidelity Rendering — waits for JS frameworks to finish rendering (networkidle + manual delay) before extracting HTML
  • ✅ Hybrid Discovery — combines sitemap parsing with recursive link detection for maximum coverage
  • ✅ Health Monitoring — stops automatically if free disk space drops below 500 MB or the failure rate climbs too high
  • ✅ Atomic Writes — files are written via .tmp → rename to prevent data corruption

📋 Feature List

Feature                  Description
─────────────────────    ─────────────────────────────────────────────────────
Raw HTML Archival        Saves raw HTML to storage/<domain>/<sanitized_url>.html
Sitemap Priority         Downloads every URL from sitemap.xml up front for maximum coverage
robots.txt Respect       Refuses to crawl URLs disallowed by robots.txt
Per-Domain Error Logs    Each domain gets its own error.log for tracking problems
Disk Space Guard         Stops automatically when free disk space drops below 500 MB
Memory Efficiency        availableMemoryRatio: 0.8 for stability
Stealth Anti-Bot         Uses the playwright-extra stealth plugin to evade bot detection

βš™οΈ Instalasi

Prerequisites

  • Node.js 18+ (dengan npm 9+)
  • Disk space minimal 1 GB (untuk contoh 3 domain)

Setup

# 1. Clone the repository
git clone <your-repo> crawler-web-2
cd crawler-web-2

# 2. Install dependencies (postinstall automatically downloads the Playwright browsers)
npm install

# 3. (Optional) Update the target domains in src/main.js
# Edit the `domains` array to set the target URLs

🚀 How to Run

Command

npm start

This runs node src/main.js.


📊 What Happens When You Run npm start?

Below is the full execution flow, with an explanation of each stage:

1️⃣ Initialization Phase

✓ Load the Crawlee configuration (memory ratio = 0.8)
✓ Activate Playwright stealth fingerprinting
✓ Parse the target domains from src/main.js (see the sketch below)
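The sketch below shows how this initialization could look, assuming Crawlee's availableMemoryRatio configuration option and the documented playwright-extra stealth setup; it is illustrative, not the project's exact code.

import { Configuration } from 'crawlee';
import { chromium } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

// Let Crawlee's autoscaled pool use up to 80% of available memory
const config = new Configuration({ availableMemoryRatio: 0.8 });

// Patch Chromium with anti-bot fingerprint fixes; the patched `chromium`
// is later handed to PlaywrightCrawler via launchContext: { launcher: chromium }
chromium.use(stealthPlugin());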

2️⃣ Per-Domain Sequential Loop

For each domain in the list, the system runs the following stages:

┌─────────────────────────────────────────────────────────────┐
│  DOMAIN 1: https://sendokibu.com                            │
└─────────────────────────────────────────────────────────────┘

Phase A: Pre-Crawl Checks

Step                     Description                                Output
─────────────────────    ───────────────────────────────────────    ──────────────────────────────────
📁 Create Storage        Create the storage/sendokibu.com/ folder   ✓ Directory ready
💾 Disk Check            Check remaining disk space                 ✓ 250 GB free (> 500 MB threshold)
🌐 Fetch robots.txt      Download & parse robots.txt                ✓ Disallowed paths: /admin, /private
🔍 Validate Start URL    Check whether the start URL is allowed     ✓ / is allowed

Example output:

═══════════════════════════════════════════════════════════════
🌐  Starting crawl: https://sendokibu.com/
πŸ“  Storage dir  : /Users/fidaa/work/mygithub/crawler-web-2/storage/sendokibu.com
═══════════════════════════════════════════════════════════════

💾 Free disk: 250 GB (> 500 MB threshold) ✓
🔍  Attempting to fetch sitemap: https://sendokibu.com/sitemap.xml
📝  Found 248 URLs in sitemap.
🤖  robots.txt parsed: 2 disallowed paths
✅  Start URL is allowed by robots.txt

Phase B: URL Discovery

Sitemap-First Strategy:

  • Crawlee uses downloadListOfUrls() to extract every URL from the sitemap
  • All 248 URLs go straight into the request queue

Recursive Discovery (Fallback):

  • If no sitemap is available, the system will:
    • Crawl the start-URL page
    • Extract every internal link (<a href>)
    • Enqueue them with the same-domain strategy (staying within 1 domain), as sketched below
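A minimal sketch of this hybrid strategy, assuming a sitemap at /sitemap.xml; downloadListOfUrls lives in @crawlee/utils and is re-exported by the crawlee package.

import { downloadListOfUrls } from 'crawlee';

async function discoverStartUrls(domain) {
    try {
        // Sitemap-first: pull every URL listed in sitemap.xml
        const urls = await downloadListOfUrls({ url: `${domain}/sitemap.xml` });
        if (urls.length > 0) return urls;
    } catch {
        // No sitemap reachable; fall through to recursive discovery
    }
    // Fallback: start from the homepage; the request handler's
    // enqueueLinks({ strategy: 'same-domain' }) discovers links recursively
    return [domain];
}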

Example queue:

Request Queue (248 items):
  1. https://sendokibu.com/ [Discovery]
  2. https://sendokibu.com/about-us [Discovery]
  3. https://sendokibu.com/contact [Discovery]
  4. https://sendokibu.com/brand [Detail] 
  5. ... (244 more)

Phase C: Page Crawling & Rendering

For each URL in the queue, the system performs the following steps:

URL: https://sendokibu.com/about-us
│
├─ 1. Navigate to page (60s timeout)
│  └─ Playwright opens a browser tab
│
├─ 2. Wait for network idle
│  └─ Wait until all XHR/fetch requests have finished
│
├─ 3. Wait for body element (30s timeout)
│  └─ Ensures the DOM has rendered at least minimally
│
├─ 4. Extra 2 second pause
│  └─ Gives React/Vue/Next.js hydration time to finish
│  └─ JS frameworks can render their dynamic content
│
├─ 5. Extract page content
│  └─ Grab the raw HTML from the browser context
│
├─ 6. Validate HTML size
│  └─ If < 1 KB → likely an error page (log error)
│  └─ If >= 1 KB → proceed to save
│
└─ 7. Find & enqueue new links
   └─ Parse <a href> tags
   └─ Enqueue into the queue with label "detail"
   └─ Skip external URLs, fragments (#), non-HTML
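A minimal sketch of steps 1-7 as a Crawlee request handler; the timeouts mirror the diagram above, and saveHtml is a hypothetical helper (see the Phase D sketches below).

import { PlaywrightCrawler } from 'crawlee';

// Hypothetical placeholder; see the atomic-write sketch under Phase D
async function saveHtml(url, html) { /* write to storage/<domain>/... */ }

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 60,                                   // step 1
    async requestHandler({ page, request, enqueueLinks }) {
        await page.waitForLoadState('networkidle');              // step 2
        await page.waitForSelector('body', { timeout: 30_000 }); // step 3
        await page.waitForTimeout(2000);                         // step 4: hydration pause

        const html = await page.content();                       // step 5
        if (Buffer.byteLength(html, 'utf8') < 1024) {            // step 6
            throw new Error(`HTML too small: ${request.url} (likely an error page)`);
        }
        await saveHtml(request.url, html);

        await enqueueLinks({ strategy: 'same-domain', label: 'detail' }); // step 7
    },
});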

Example progress:

[sendokibu.com] Discovery: https://sendokibu.com/
  ✓ Network idle (2.3s)
  ✓ Body rendered (0.8s)
  ✓ HTML extracted (1.2 KB)
  ✓ Found 15 new links
  → Enqueued 14 links (1 was external)

[sendokibu.com] Discovery: https://sendokibu.com/about-us
  ✓ Network idle (1.1s)
  ✓ Body rendered (0.4s)
  ✓ HTML extracted (4.8 KB)
  ✓ Found 8 new links
  → Enqueued 8 links

[sendokibu.com] Discovery: https://sendokibu.com/contact
  ⚠ HTML size only 0.3 KB (likely error page)
  ✗ FAILED — HTML too small (probably 404)
  → Logged to error.log
... (245 more pages)

Phase D: File Storage

Each successfully extracted HTML page is saved using a path-based mapping:

Sitemap URL                        Storage Path
───────────────────────────────    ──────────────────────────────────
https://sendokibu.com/           → storage/sendokibu.com/index.html
https://sendokibu.com/about      → storage/sendokibu.com/about.html
https://sendokibu.com/about/     → storage/sendokibu.com/about/index.html
https://sendokibu.com/blog?id=1  → storage/sendokibu.com/blog_id=1.html
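A minimal sketch of a helper reproducing this mapping; urlToStoragePath is a hypothetical name, not necessarily the project's actual function.

import path from 'node:path';

function urlToStoragePath(rawUrl, storageRoot = 'storage') {
    const url = new URL(rawUrl);
    let p = url.pathname;
    if (p.endsWith('/')) p += 'index';              // "/" and "/about/" map to .../index.html
    if (url.search) p += '_' + url.search.slice(1); // "?id=1" becomes "_id=1"
    return path.join(storageRoot, url.hostname, `${p}.html`);
}

// urlToStoragePath('https://sendokibu.com/')          -> storage/sendokibu.com/index.html
// urlToStoragePath('https://sendokibu.com/blog?id=1') -> storage/sendokibu.com/blog_id=1.html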

Atomic Write Process:

1. Generate a temp file: /path/to/file.html.tmp
2. Write the raw HTML → file.html.tmp
3. Rename (atomic): file.html.tmp → file.html
   ↳ If the process crashes before the rename, file.html is never left corrupted
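A minimal sketch of the atomic write, assuming the temp file sits on the same filesystem so the rename is atomic.

import fs from 'node:fs/promises';
import path from 'node:path';

async function atomicWrite(filePath, html) {
    await fs.mkdir(path.dirname(filePath), { recursive: true }); // ensure parent dirs exist
    const tmpPath = `${filePath}.tmp`;
    await fs.writeFile(tmpPath, html, 'utf8'); // steps 1-2: write to the temp file
    await fs.rename(tmpPath, filePath);        // step 3: atomic swap into place
}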

Phase E: Health Monitoring & Error Logging

Every X requests, the system checks:

Metrics:
  Total requests: 248
  Successful pages: 240
  Failed pages: 8
  Failure rate: 3.2%

Status: ✅ HEALTHY (3.2% < 50% threshold)
  └─ Continue crawling

Storage: ✅ HEALTHY (250 GB >> 500 MB threshold)
  └─ Continue crawling
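A minimal sketch of how such a check could look; the counters are illustrative, and fs.statfs requires Node >= 18.15.

import { statfs } from 'node:fs/promises';

const DISK_THRESHOLD_BYTES = 500 * 1024 * 1024; // 500 MB
const FAILURE_RATE_THRESHOLD = 0.5;             // 50%

async function isHealthy(succeeded, failed, storageDir) {
    const total = succeeded + failed;
    const failureRate = total > 0 ? failed / total : 0;
    if (failureRate >= FAILURE_RATE_THRESHOLD) return false; // too many failures

    const stats = await statfs(storageDir);
    const freeBytes = stats.bavail * stats.bsize; // available blocks * block size
    return freeBytes >= DISK_THRESHOLD_BYTES;
}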

Failed URLs are logged to storage/sendokibu.com/error.log:

[2025-04-09T14:23:45.123Z] FAILED https://sendokibu.com/404 — HTTP 404
[2025-04-09T14:24:12.456Z] FAILED https://sendokibu.com/admin — Forbidden (robots.txt)
[2025-04-09T14:25:03.789Z] FAILED https://sendokibu.com/timeout — Navigation timeout 60s exceeded
[2025-04-09T14:26:15.321Z] FAILED https://sendokibu.com/blocked — Network blocked (CSP/CORS)

Phase F: Domain Completion

═══════════════════════════════════════════════════════════════
✅ CRAWL COMPLETE: sendokibu.com
═══════════════════════════════════════════════════════════════
📊 Summary:
   ✓ Total URLs crawled: 248
   ✓ Success: 240 pages
   ✗ Failures: 8 pages
   💾 Storage used: 45 MB
   ⏱️  Time elapsed: 12m 34s

📁 Files saved to: storage/sendokibu.com/
📝 Error log: storage/sendokibu.com/error.log
═══════════════════════════════════════════════════════════════

🧹 Cleanup:
   → Close browser instance
   → Release memory back to OS
   → Flush dataset cache

3️⃣ Next Domain (Repeat Loop)

Once domain 1 is finished, the system moves straight on to domain 2:

┌─────────────────────────────────────────────────────────────┐
│  DOMAIN 2: https://sequence.day                             │
└─────────────────────────────────────────────────────────────┘

[Fresh browser instance]
[Fresh memory allocation]
[Repeat Phase A → F]
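A minimal sketch of the outer loop, with crawlDomain standing in for phases A-F (a hypothetical wrapper with an abridged handler).

import { PlaywrightCrawler } from 'crawlee';

async function crawlDomain(startUrl) {
    const crawler = new PlaywrightCrawler({
        async requestHandler({ page, enqueueLinks }) {
            await page.content();                            // phase C (abridged)
            await enqueueLinks({ strategy: 'same-domain' }); // stay on this domain
        },
    });
    await crawler.run([startUrl]); // resolves only once the queue is drained
}

for (const domain of ['https://cmlabs.co', 'https://sequence.day', 'https://sendokibu.com/']) {
    await crawlDomain(domain); // one domain at a time: memory is released between domains
}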

4️⃣ Final Completion

═══════════════════════════════════════════════════════════════
🎉 ALL DOMAINS CRAWLED SUCCESSFULLY
═══════════════════════════════════════════════════════════════

📊 GLOBAL SUMMARY:
   Domain 1 (sendokibu.com):  240/248 pages ✓
   Domain 2 (sequence.day):   1205/1310 pages ✓
   Domain 3 (cmlabs.co):      530/560 pages ✓
   ──────────────────────────────────────────
   TOTAL:                     1975 pages archived

💾 Total storage: 845 MB

⏱️  Total time: 45m 23s

📁 Archive structure:
   storage/
   ├── sendokibu.com/
   │   ├── index.html
   │   ├── about/
   │   ├── contact.html
   │   └── error.log
   ├── sequence.day/
   │   └── ...
   └── cmlabs.co/
       └── ...

═══════════════════════════════════════════════════════════════

📸 Visual Flow Diagram

Here is the execution flow, visualized:

Execution Timeline

Execution Flow
Example: output when the sendokibu.com crawl completes

Domain Sequence
Example: output when the sequence.day crawl completes

Multiple Domains
Example: output when the cmlabs.co crawl completes


πŸ“ Output Structure

Setelah npm start selesai, struktur folder akan terlihat seperti:

storage/
├── sendokibu.com/
│   ├── index.html                          # Homepage
│   ├── about.html
│   ├── contact.html
│   ├── blog/
│   │   ├── post-1.html
│   │   ├── post-2.html
│   │   └── ...
│   ├── error.log                           # Failed URLs & reasons
│   └── ...
│
├── sequence.day/
│   ├── index.html
│   ├── auth_type=personal.html
│   ├── pricing/
│   └── error.log
│
└── cmlabs.co/
    ├── en/
    ├── id/
    ├── products/
    ├── error.log
    └── ...

datasets/
└── default/
    └── (crawler statistics & metadata)

request_queues/
└── default/
    └── (request state snapshots)

🛠️ Configuration

Edit src/main.js to adjust:

// Target domains
const domains = [
    'https://cmlabs.co',
    'https://sequence.day',
    'https://sendokibu.com/'
];

// Concurrency (default: 2-3 for ethical scraping)
const maxConcurrency = 3;

// Request timeout (default: 60s)
const navigationTimeoutSecs = 60;

// Failure rate threshold (default: 50%)
const failureRateThreshold = 0.5;

// Disk space threshold (default: 500 MB)
const DISK_THRESHOLD_BYTES = 500 * 1024 * 1024;

📚 File Reference

File                 Description
─────────────────    ─────────────────────────────────────────────────────────
src/main.js          Entry point: domain loop, robots.txt check, health monitoring
src/routes.js        Request handlers: link discovery, HTML extraction, file saving
storage/<domain>/    Per-domain output folder with HTML files & error.log
package.json         Dependencies & npm scripts
Dockerfile           Production deployment (pre-installed Playwright)

🐳 Docker (Production)

# Build image
docker build -t crawler-web-2 .

# Run container
docker run --rm \
  -v $(pwd)/storage:/app/storage \
  -e NODE_ENV=production \
  crawler-web-2

Base image: apify/actor-node-playwright-chrome:24-1.58.2 (Playwright pre-installed)


🔒 robots.txt Compliance

The system always validates robots.txt before crawling a domain:

// Check before crawling
const robots = await fetchRobots(startUrl);
const userAgent = 'CrawleeBot';

if (!robots.isAllowed(startUrl, userAgent)) {
    console.warn(`🚫  ${startUrl} is disallowed by robots.txt. Skipping.`);
    fs.appendFileSync(errorLogPath, `[timestamp] SKIPPED (robots.txt) ${startUrl}\n`);
    continue; // Skip domain
}
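fetchRobots itself is not shown in this README; a minimal sketch using the robots-parser package (an assumption about the implementation) could look like:

import robotsParser from 'robots-parser';

async function fetchRobots(startUrl) {
    const robotsUrl = new URL('/robots.txt', startUrl).href;
    const res = await fetch(robotsUrl);          // global fetch, Node 18+
    const body = res.ok ? await res.text() : ''; // missing file: allow everything
    return robotsParser(robotsUrl, body);        // exposes isAllowed(url, userAgent)
}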

📖 Resources


πŸ“ Lisensi

ISC


👤 Author

It's not you it's me
