A comprehensive Python-based data extraction toolkit for scraping project and contract data from the Department of Public Works and Highways (DPWH) Transparency Portal API.
This repository contains two main extraction scripts designed to collect data from the DPWH Transparency API:
- Base Data Extractor (`base-data/`) - Pagination-based extraction for bulk project listings
- Projects Data Extractor (`projects-data/`) - Individual contract detail extraction with proxy support
- Base API: https://api.transparency.dpwh.gov.ph/projects
- Project Details: https://api.transparency.dpwh.gov.ph/projects/{contractId}
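A minimal sketch of querying the paginated endpoint with curl-cffi (the `page` and `limit` query parameter names are assumptions inferred from the CLI options further down):

```python
# Minimal sketch: fetch one page of project listings with curl-cffi.
# The `page`/`limit` parameter names are assumptions based on the CLI options below.
from curl_cffi import requests

BASE_URL = "https://api.transparency.dpwh.gov.ph/projects"

resp = requests.get(
    BASE_URL,
    params={"page": 1, "limit": 5000},
    impersonate="chrome120",  # present a browser TLS fingerprint
    timeout=60,
)
resp.raise_for_status()
# The saved JSON nests records under data.data (see the analysis one-liner near the end).
print(len(resp.json()["data"]["data"]), "records on page 1")
```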
- Concurrent page fetching with ThreadPoolExecutor
- Automatic TLS fingerprint rotation (70+ browser profiles)
- Smart fingerprint blacklisting (instantly removes unsupported TLS versions)
- Resume capability (tracks completed pages)
- Real-time progress tracking with 10-second snapshots
- Comprehensive retry logic with exponential backoff
- Rate limit detection and handling (403, 429, Cloudflare 1015)
- Detailed statistics and success rate reporting
- Proxy support (free + premium proxy rotation)
- Advanced proxy health monitoring and blacklisting
- TLS fingerprint diversity (Chrome, Firefox, Safari, Edge, Opera)
- Per-proxy and per-fingerprint statistics tracking
- Automatic retry with intelligent proxy selection
- Background progress logging
- Parquet dataset integration for contract ID sourcing
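The sketch below illustrates the proxy rotation and health-tracking idea used by the projects extractor; the `free_proxies.json` format and the thresholds shown are assumptions, not the script's exact logic.

```python
# Illustrative sketch: rotate proxies and blacklist ones that never succeed.
# Assumes free_proxies.json is a JSON list like ["http://1.2.3.4:8080", ...].
import json
import random
from collections import defaultdict

from curl_cffi import requests

proxies = json.load(open("free_proxies.json"))
health = defaultdict(lambda: {"success": 0, "fail": 0})
blacklist = set()

def fetch_with_proxy(url):
    proxy = random.choice([p for p in proxies if p not in blacklist])
    try:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy},
                            impersonate="chrome120", timeout=30)
        resp.raise_for_status()
        health[proxy]["success"] += 1
        return resp
    except Exception:
        health[proxy]["fail"] += 1
        if health[proxy]["fail"] >= 3 and health[proxy]["success"] == 0:
            blacklist.add(proxy)  # drop proxies that have never worked
        raise
```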
- Python 3.8+
- pip
pip install curl-cffi requests pandas
pip install pyarrow  # For parquet file support

curl-cffi>=0.5.0
requests>=2.28.0
pandas>=1.5.0
Extract bulk project listings with automatic pagination:
python base-data/fetch_dpwh_projects_paginated.py
python base-data/fetch_dpwh_projects_paginated.py --start 1 --end 10
python base-data/fetch_dpwh_projects_paginated.py --start 1 --end 50 --limit 5000 --workers 10

- `--start`: Starting page number (default: 1)
- `--end`: Ending page number (default: auto-calculated from total contracts)
- `--limit`: Items per page (default: 5000, API max)
- `--workers`: Concurrent workers (default: 10)
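When `--end` is omitted, the script derives it from the total contract count reported by the API. A sketch of that arithmetic (the total below comes from the dataset statistics at the end of this README):

```python
# Sketch: auto-calculate the last page from the total contract count.
import math

TOTAL_CONTRACTS = 247_187  # approximate total, see dataset statistics below
LIMIT = 5000               # API maximum items per page

end_page = math.ceil(TOTAL_CONTRACTS / LIMIT)
print(end_page)  # -> 50
```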
base-data/
├── base-data-json.tar.xz       # Bulk archive of all extracted JSON data
├── lists/                      # Tracking files
│   ├── successful_pages.txt    # Completed pages
│   ├── failed_pages.txt        # Failed pages
│   ├── dump-page-*-error.txt   # Error messages
│   └── dump-page-*-raw.txt     # Raw error responses
├── progress_stats.json         # Real-time progress
└── never_success_tls.json      # Blacklisted fingerprints
Extract detailed data for individual contracts using contract IDs:
cd projects-data/extraction-script
python fetch_dpwh_projects_curlcffi.py

Note: This script requires a parquet dataset with contract IDs. Adjust `PARQUET_PATH` in the script, and create `free_proxies.json` with your proxy list.
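A minimal sketch of sourcing contract IDs from a parquet dataset (the path and the `contract_id` column name are placeholders; match them to your file):

```python
# Sketch: load contract IDs from a parquet file for the projects extractor.
import pandas as pd

PARQUET_PATH = "contracts.parquet"  # placeholder; set to your dataset

df = pd.read_parquet(PARQUET_PATH)  # requires pyarrow
contract_ids = df["contract_id"].dropna().astype(str).unique().tolist()
print(f"{len(contract_ids)} contract IDs to fetch")
```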
projects-data/dpwh-projects-api/
├── projects-json.tar.xz.001    # Bulk archive part 1 of contract JSON files
├── projects-json.tar.xz.002    # Bulk archive part 2 of contract JSON files
├── lists/                      # Tracking files
│   ├── successful_ids.txt
│   ├── failed_ids.txt
│   ├── exception_ids.txt
│   └── blocked_ids.txt
├── raw/                        # Error responses
│   └── {contractId}_raw.txt
├── progress_stats.json         # Real-time stats
└── never_success_tls.json      # Blacklisted fingerprints
Both scripts use 70+ browser fingerprints for TLS diversity:
- Chrome 100-120
- Firefox 100-119
- Safari 14-17
- Edge 110-119
- Opera 95-102
Auto-Blacklisting: Fingerprints that return "not supported" errors are immediately blacklisted and saved to never_success_tls.json for exclusion in future runs.
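A simplified sketch of the rotation-and-blacklist behavior (the profile names are examples and must match the impersonation targets supported by your curl-cffi version; the error-message check is an approximation):

```python
# Simplified sketch: rotate TLS fingerprints and blacklist unsupported ones.
import json
import random

from curl_cffi import requests

FINGERPRINTS = ["chrome110", "chrome116", "chrome120", "safari15_5", "edge101"]
BLACKLIST_FILE = "never_success_tls.json"

try:
    blacklisted = set(json.load(open(BLACKLIST_FILE)))
except FileNotFoundError:
    blacklisted = set()

def fetch(url):
    fp = random.choice([f for f in FINGERPRINTS if f not in blacklisted])
    try:
        return requests.get(url, impersonate=fp, timeout=60)
    except Exception as exc:
        if "not supported" in str(exc).lower():  # unsupported impersonation target
            blacklisted.add(fp)
            with open(BLACKLIST_FILE, "w") as f:
                json.dump(sorted(blacklisted), f, indent=2)
        raise
```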
- Detection: Monitors HTTP 403, 429, and Cloudflare error 1015
- Handling: Exponential backoff (5-20 seconds per retry)
- Retry Logic: Up to 4 attempts per page/contract
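A condensed sketch of that retry loop; the backoff values mirror the ones listed above, while detecting Cloudflare 1015 via the response body is an approximation of what the scripts do:

```python
# Sketch: retry with exponential backoff on 403/429/Cloudflare 1015 responses.
import random
import time

from curl_cffi import requests

MAX_ATTEMPTS = 4

def fetch_with_retry(url):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        resp = requests.get(url, impersonate="chrome120", timeout=60)
        rate_limited = resp.status_code in (403, 429) or "error code: 1015" in resp.text.lower()
        if not rate_limited:
            resp.raise_for_status()
            return resp
        if attempt < MAX_ATTEMPTS:
            delay = min(5 * 2 ** (attempt - 1), 20) + random.uniform(0, 1)  # 5-20 s backoff
            time.sleep(delay)
    raise RuntimeError(f"Rate limited after {MAX_ATTEMPTS} attempts: {url}")
```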
Real-time statistics updated every 10 seconds:
{
"total": 50,
"success": 45,
"fail": 5,
"retries": {"33": 2, "46": 4},
"tls_stats": {
"chrome116": {"success": 10, "fail": 0, "block": 0},
"firefox119": {"success": 8, "exception": 2}
}
}

- Automatic Detection: Identifies unsupported TLS versions
- Instant Blacklisting: Removes failed fingerprints immediately
- Persistent Storage: Saves blacklist to `never_success_tls.json`
- No Retry Waste: Unsupported fingerprints don't count against retry limit
- Checks `successful_pages.txt` before fetching
- Skips already-downloaded pages automatically
- Safe to interrupt and resume at any time
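A sketch of the resume check, assuming `successful_pages.txt` holds one page number per line:

```python
# Sketch: skip pages already recorded in successful_pages.txt.
from pathlib import Path

tracking = Path("base-data/lists/successful_pages.txt")
done = set()
if tracking.exists():
    done = {int(tok) for tok in tracking.read_text().split() if tok.isdigit()}

pages_to_fetch = [p for p in range(1, 51) if p not in done]
print(f"{len(pages_to_fetch)} pages remaining")
```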
- ThreadPoolExecutor for parallel requests
- Configurable worker count (default: 10)
- Thread-safe statistics tracking with locks
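A minimal, self-contained sketch of that pattern (query parameter names and output filenames are assumptions; the real script adds retries, fingerprint rotation, and tracking files):

```python
# Sketch: concurrent page fetching with lock-protected statistics.
import json
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

from curl_cffi import requests

BASE_URL = "https://api.transparency.dpwh.gov.ph/projects"
stats = {"success": 0, "fail": 0}
stats_lock = threading.Lock()  # protects the shared counters across worker threads

def process_page(page):
    try:
        resp = requests.get(BASE_URL, params={"page": page, "limit": 5000},
                            impersonate="chrome120", timeout=60)
        resp.raise_for_status()
        with open(f"page_{page}.json", "w") as f:
            json.dump(resp.json(), f)
        with stats_lock:
            stats["success"] += 1
    except Exception:
        with stats_lock:
            stats["fail"] += 1

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(process_page, p) for p in range(1, 11)]
    for _ in as_completed(futures):
        pass
print(stats)
```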
- [FETCH] - Starting page/contract fetch
- [TLS] - Fingerprint selection per attempt
- [RETRY] - Retry attempts with reason
- [BLACKLIST] - Fingerprint removal
- [SAVE] - Successful data save
- [FAIL] - Final failure after all retries
If all fingerprints are blacklisted:
- Delete `never_success_tls.json`
- Update `curl-cffi` to the latest version: `pip install --upgrade curl-cffi`
- Reduce worker count: `--workers 5`
- Reduce concurrent workers: `--workers 5`
- Increase delays in the script (modify `MIN_DELAY`/`MAX_DELAY`)
- Check internet connection stability
- Consider using proxies (projects extractor only)
- The DPWH Transparency dashboard appears to limit requests to a maximum of about 300 per 10 minutes (roughly 1 request every 2 seconds).
- You can try going faster, but you risk being rate limited after exceeding about 1,000 fast requests (Cloudflare Error 1015). Error 1015 resets after 10 minutes.
- Script automatically handles rate limits with backoff
- For persistent rate limits, reduce `MAX_WORKERS`
- Projects extractor will switch to proxy mode automatically
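For reference, a back-of-the-envelope pacing calculation based on the observed limit:

```python
# Rough pacing from the observed limit of ~300 requests per 10 minutes.
LIMIT_REQUESTS = 300
WINDOW_SECONDS = 10 * 60

min_delay = WINDOW_SECONDS / LIMIT_REQUESTS
print(f"~{min_delay:.0f} s between requests to stay under the limit")  # -> ~2 s
```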
watch -n 2 cat base-data/progress_stats.json
wc -l base-data/lists/successful_pages.txt
python -c "import json; print(json.dumps(json.load(open('base-data/progress_stats.json'))['tls_stats'], indent=2))"ls base-data/json/*.json | wc -l
# If using tar.xz archives, extract and analyze as needed:
tar -xJf archive/base-data-json.tar.xz -C /tmp/json_extract/
ls /tmp/json_extract/*.json | wc -l
python -c "import json, glob; print(sum(len(json.load(open(f))['data']['data']) for f in glob.glob('/tmp/json_extract/*.json')))"dpwh-transparency-api-data/ βββ base-data/ # Pagination extractor β βββ fetch_dpwh_projects_paginated.py # Main script β βββ json/ # Output data β βββ lists/ # Tracking files β βββ progress_stats.json # Progress snapshot β βββ never_success_tls.json # Blacklisted fingerprints β βββ projects-data/ # Contract details extractor β βββ extraction-script/ β β βββ fetch_dpwh_projects_curlcffi.py β β βββ free_proxies.json β β βββ premium_proxies.json β β βββ generate_proxy_list.py β βββ dpwh-projects-api/ # Output directory β βββ samples/ # Sample data files βββ archive/ # Archived data (git-ignored) βββ .gitignore # Git ignore rules βββ README.md # This file
Contributions welcome! Areas for improvement:
- Additional error handling
- Performance optimizations
- Proxy rotation enhancements
- Data validation and cleaning
- Alternative API endpoint support
This tool is for educational and research purposes. Please:
- Respect the DPWH API rate limits
- Use responsibly and ethically
- Comply with the DPWH Transparency Portal Terms of Service
- Do not overload the API servers
This project is open source and available for educational purposes.
Total Contracts: ~247,187
Estimated Pages: ~50 (at 5000 items/page)
API Max Limit: 5000 items per request