# Introduction

This notebook benchmarks synchronous versus asynchronous retrieval of phishing entries from various feeds using the `phishing-web-collector` package. The asynchronous path uses the `asyncio` library to run all feed retrievals concurrently, while the synchronous path retrieves the feeds one after another. Each run measures wall-clock time for both paths, and the results are averaged over several runs.
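
To make the difference concrete before touching real feeds, here is a minimal sketch that replaces network calls with `asyncio.sleep`. The names `fake_feed`, `run_sequential`, `run_concurrent`, and `compare` are hypothetical and not part of `phishing-web-collector`; the delays are arbitrary stand-ins for download times.

```python
import asyncio
import time


async def fake_feed(delay: float) -> str:
    """Hypothetical stand-in for one feed download; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return f"feed finished after {delay:.2f}s"


async def run_sequential(delays: list[float]) -> float:
    """Await each simulated feed one after the other; return elapsed seconds."""
    start = time.perf_counter()
    for delay in delays:
        await fake_feed(delay)
    return time.perf_counter() - start


async def run_concurrent(delays: list[float]) -> float:
    """Start all simulated feeds together with asyncio.gather; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_feed(delay) for delay in delays))
    return time.perf_counter() - start


async def compare(delays: list[float]) -> tuple[float, float]:
    """Return (sequential_seconds, concurrent_seconds) for the same delays."""
    sequential = await run_sequential(delays)
    concurrent = await run_concurrent(delays)
    return sequential, concurrent
```

In a notebook cell, `await compare([0.05, 0.1, 0.15])` can be called directly; in a plain script, wrap it in `asyncio.run(...)`. The sequential time should approach the sum of the delays, while the concurrent time should approach the longest single delay.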


# Installation


In [1]:
!pip install "phishing-web-collector>=0.2.0"




# Import libraries

In [2]:
import asyncio
import shutil
import time
from pathlib import Path

from phishing_web_collector import FeedSource
from phishing_web_collector.feed_manager import SOURCES_MAP

# Configure experiment


In [3]:
sources = [
    FeedSource.AD_GUARD_HOME,
    FeedSource.BINARY_DEFENCE_IP,
    FeedSource.BLOCKLIST_DE_IP,
    FeedSource.BOTVRIJ,
    FeedSource.C2_INTEL_DOMAIN,
    FeedSource.C2_TRACKER_IP,
    FeedSource.CERT_PL,
    FeedSource.DANGEROUS_DOMAINS,
    FeedSource.GREEN_SNOW_IP,
    FeedSource.MALWARE_WORLD,
    FeedSource.MIRAI_SECURITY_IP,
    FeedSource.OPEN_PHISH,
    FeedSource.PHISHING_ARMY,
    FeedSource.PHISHING_DATABASE,
    FeedSource.PHISH_STATS,
    FeedSource.PHISH_TANK,
    FeedSource.PROOF_POINT_IP,
    FeedSource.THREAT_VIEW_DOMAIN,
    FeedSource.TWEET_FEED,
    FeedSource.URL_ABUSE,
    FeedSource.URL_HAUS,
    FeedSource.VALDIN,
]
N_RUNS = 5
SYNC_DIR = Path("sync_check")
ASYNC_DIR = Path("async_check")

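def clear_dir(path: Path) -> None:
    """Delete `path` if it exists, then recreate it so each run starts from an empty storage directory."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)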


# Async and sync functions

In [4]:

def retrieve_all_sync() -> float:
    """Retrieve all phishing entries from all feeds synchronously (sequentially)."""
    # Build one provider per feed, all writing to the shared sync storage directory.
    providers = [SOURCES_MAP[source](str(SYNC_DIR)) for source in sources]
    start = time.perf_counter()
    entries = []
    for provider in providers:
        entries.extend(provider.retrieve_sync())
    duration = time.perf_counter() - start
    print(f"Sync took {duration:.2f} seconds")
    return duration


async def retrieve_all() -> float:
    """Retrieve all phishing entries from all feeds asynchronously."""
    # Build one provider per feed, all writing to the shared async storage directory.
    providers = [SOURCES_MAP[source](str(ASYNC_DIR)) for source in sources]
    start = time.perf_counter()
    results = await asyncio.gather(*(provider.retrieve() for provider in providers))
    entries = [entry for result in results for entry in result]
    duration = time.perf_counter() - start
    print(f"Async took {duration:.2f} seconds")
    return duration
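
The log output later in this notebook suggests that the providers catch fetch errors themselves and simply return no data. If a coroutine did raise, a plain `asyncio.gather` would propagate the first exception to the caller. The sketch below shows one way to isolate failures with `return_exceptions=True`; `flaky_feed` and `gather_tolerant` are hypothetical names used only for illustration and are not part of `phishing-web-collector`.

```python
import asyncio


async def flaky_feed(name: str, fail: bool) -> list[str]:
    """Hypothetical feed coroutine that either returns entries or raises."""
    await asyncio.sleep(0.01)
    if fail:
        raise ConnectionError(f"could not fetch {name}")
    return [f"{name}-entry-1", f"{name}-entry-2"]


async def gather_tolerant(
    feeds: dict[str, bool],
) -> tuple[list[str], dict[str, BaseException]]:
    """Run all feeds concurrently; collect entries and per-feed errors separately."""
    names = list(feeds)
    results = await asyncio.gather(
        *(flaky_feed(name, feeds[name]) for name in names),
        return_exceptions=True,  # exceptions come back as values instead of being raised
    )
    entries: list[str] = []
    errors: dict[str, BaseException] = {}
    # gather preserves input order, so results line up with names.
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            errors[name] = result
        else:
            entries.extend(result)
    return entries, errors
```

With this pattern a single failing feed costs only its own entries, and the `errors` mapping records which feeds failed and why.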

# Benchmark logic

In [5]:

async def run_benchmark():
    sync_times = []
    async_times = []

    for i in range(N_RUNS):
        print(f"\n--- Run {i + 1} ---")
        clear_dir(SYNC_DIR)
        clear_dir(ASYNC_DIR)

        # Async
        async_time = await retrieve_all()
        async_times.append(async_time)
        print(f"Async took {async_time:.2f} s")

        # Sync
        sync_time = retrieve_all_sync()
        sync_times.append(sync_time)
        print(f"Sync took {sync_time:.2f} s")

    avg_async = sum(async_times) / N_RUNS
    avg_sync = sum(sync_times) / N_RUNS

    print(f"\nAverage async time: {avg_async:.2f} s")
    print(f"Average sync time: {avg_sync:.2f} s")
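
The benchmark above reports only total wall-clock time. Because concurrent retrieval finishes roughly when its slowest task does, per-feed timings can help explain the async total. The following is a minimal sketch under that assumption; `timed`, `simulated_feed`, and `profile` are hypothetical helpers, and the sleeps stand in for real downloads.

```python
import asyncio
import time
from typing import Awaitable, TypeVar

T = TypeVar("T")


async def timed(label: str, awaitable: Awaitable[T]) -> tuple[str, float, T]:
    """Await `awaitable` and return (label, elapsed seconds, result)."""
    start = time.perf_counter()
    result = await awaitable
    return label, time.perf_counter() - start, result


async def simulated_feed(delay: float) -> int:
    """Hypothetical stand-in for a feed download; returns a fake entry count."""
    await asyncio.sleep(delay)
    return int(delay * 1000)


async def profile(delays: dict[str, float]) -> list[tuple[str, float]]:
    """Run all simulated feeds concurrently; return (label, seconds), slowest first."""
    timings = await asyncio.gather(
        *(timed(label, simulated_feed(delay)) for label, delay in delays.items())
    )
    per_feed = [(label, seconds) for label, seconds, _ in timings]
    return sorted(per_feed, key=lambda item: item[1], reverse=True)
```

Applied to this notebook, each `provider.retrieve()` coroutine could presumably be wrapped as `timed(type(provider).__name__, provider.retrieve())` before being passed to `asyncio.gather`; that usage is an untested assumption about the provider objects.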

# Run the benchmark

In [6]:
await run_benchmark()


--- Run 1 ---


Error fetching https://raw.githubusercontent.com/Ealenn/AdGuard-Home-List/gh-pages/AdGuard-Home-List.Block.txt: 
Skipping save - No data fetched for AdGuardHomeFeed
No data found for feed: AdGuardHomeFeed
Error fetching https://raw.githubusercontent.com/ProKn1fe/phishtank-database/master/online-valid.json: 
Skipping save - No data fetched for PhishTank
No data found for feed: PhishTank


Async took 10.91 seconds
Async took 10.91 s
Sync took 36.07 seconds
Sync took 36.07 s

--- Run 2 ---


Error fetching https://hole.cert.pl/domains/domains.csv: 
Skipping save - No data fetched for CertPl
No data found for feed: CertPl
Error fetching https://raw.githubusercontent.com/ProKn1fe/phishtank-database/master/online-valid.json: 
Skipping save - No data fetched for PhishTank
No data found for feed: PhishTank
Error fetching https://raw.githubusercontent.com/Ealenn/AdGuard-Home-List/gh-pages/AdGuard-Home-List.Block.txt: 
Skipping save - No data fetched for AdGuardHomeFeed
No data found for feed: AdGuardHomeFeed


Async took 10.83 seconds
Async took 10.83 s
Sync took 49.15 seconds
Sync took 49.15 s

--- Run 3 ---


Error fetching https://raw.githubusercontent.com/Ealenn/AdGuard-Home-List/gh-pages/AdGuard-Home-List.Block.txt: 
Skipping save - No data fetched for AdGuardHomeFeed
No data found for feed: AdGuardHomeFeed


Async took 12.09 seconds
Async took 12.09 s


Error fetching https://dangerous.domains/list.txt: HTTPSConnectionPool(host='dangerous.domains', port=443): Read timed out. (read timeout=10)
Skipping save - No data fetched for DangerousDomains
No data found for feed: DangerousDomains


Sync took 31.21 seconds
Sync took 31.21 s

--- Run 4 ---


Error fetching https://raw.githubusercontent.com/Ealenn/AdGuard-Home-List/gh-pages/AdGuard-Home-List.Block.txt: 
Skipping save - No data fetched for AdGuardHomeFeed
No data found for feed: AdGuardHomeFeed
Error fetching https://dangerous.domains/list.txt: 
Skipping save - No data fetched for DangerousDomains
No data found for feed: DangerousDomains


Async took 10.59 seconds
Async took 10.59 s


Error fetching https://dangerous.domains/list.txt: HTTPSConnectionPool(host='dangerous.domains', port=443): Read timed out. (read timeout=10)
Skipping save - No data fetched for DangerousDomains
No data found for feed: DangerousDomains


Sync took 30.83 seconds
Sync took 30.83 s

--- Run 5 ---


Error fetching https://raw.githubusercontent.com/Ealenn/AdGuard-Home-List/gh-pages/AdGuard-Home-List.Block.txt: 
Skipping save - No data fetched for AdGuardHomeFeed
No data found for feed: AdGuardHomeFeed
Error fetching https://dangerous.domains/list.txt: 
Skipping save - No data fetched for DangerousDomains
No data found for feed: DangerousDomains


Async took 10.40 seconds
Async took 10.40 s


Error fetching https://dangerous.domains/list.txt: HTTPSConnectionPool(host='dangerous.domains', port=443): Read timed out. (read timeout=10)
Skipping save - No data fetched for DangerousDomains
No data found for feed: DangerousDomains


Sync took 34.07 seconds
Sync took 34.07 s

Average async time: 10.96 s
Average sync time: 36.27 s
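
The averages above can be turned into a speedup figure, and a median is worth checking as well because run 2 of the synchronous path (49.15 s) is a clear outlier. The sketch below copies the ten timings printed by the runs above; the `summarize` helper is a hypothetical name introduced here.

```python
import statistics

# Wall-clock timings in seconds, copied from the five runs printed above.
async_times = [10.91, 10.83, 12.09, 10.59, 10.40]
sync_times = [36.07, 49.15, 31.21, 30.83, 34.07]


def summarize(times: list[float]) -> dict[str, float]:
    """Return mean, median, and sample standard deviation for a list of timings."""
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times),
    }


async_summary = summarize(async_times)
sync_summary = summarize(sync_times)

# Speedup is the ratio of sync time to async time; higher means async is faster.
speedup_mean = sync_summary["mean"] / async_summary["mean"]
speedup_median = sync_summary["median"] / async_summary["median"]

print(f"Async: {async_summary}")
print(f"Sync:  {sync_summary}")
print(f"Speedup (mean):   {speedup_mean:.2f}x")
print(f"Speedup (median): {speedup_median:.2f}x")
```

On these five runs the asynchronous path is roughly 3.3 times faster by mean and roughly 3.1 times faster by median. With only five samples and feeds that fail intermittently, these ratios are indicative rather than precise.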
