# Introduction

This notebook benchmarks synchronous versus asynchronous retrieval of phishing entries from multiple feeds. The asynchronous variant uses the `asyncio` library to run the feed retrievals concurrently, dispatching each blocking `retrieve()` call to a worker thread with `asyncio.to_thread`. The synchronous variant processes the feeds one after another.
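
The difference between the two strategies can be illustrated without any network access. The sketch below is not the library's code: `fake_fetch` is a hypothetical stand-in for a blocking feed download, and the delays are arbitrary. It only shows why handing blocking calls to threads through `asyncio.to_thread` and awaiting them with `asyncio.gather` tends to finish sooner than a plain loop.

```python
import asyncio
import time


def fake_fetch(delay: float) -> str:
    """Hypothetical stand-in for a blocking feed download."""
    time.sleep(delay)
    return f"done after {delay:.1f}s"


def fetch_sequentially(delays: list[float]) -> float:
    """Run each blocking call one after the other; return elapsed seconds."""
    start = time.perf_counter()
    for delay in delays:
        fake_fetch(delay)
    return time.perf_counter() - start


async def fetch_concurrently(delays: list[float]) -> float:
    """Run each blocking call in a worker thread and wait for all of them."""
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.to_thread(fake_fetch, d) for d in delays))
    return time.perf_counter() - start


delays = [0.2, 0.2, 0.2]
sequential = fetch_sequentially(delays)
# Inside a notebook, where an event loop is already running,
# use `await fetch_concurrently(delays)` instead of asyncio.run().
concurrent = asyncio.run(fetch_concurrently(delays))
print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

The sequential total grows with the sum of the delays, while the concurrent total is closer to the longest single delay plus thread overhead.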


# Installation


In [1]:
!pip install phishing-web-collector

Collecting phishing-web-collector
  Downloading phishing_web_collector-0.1.2.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: phishing-web-collector
  Building wheel for phishing-web-collector (setup.py) ... done
  Created wheel for phishing-web-collector: filename=phishing_web_collector-0.1.2-py3-none-any.whl size=24191 sha256=f04f20f2a6d73f0363b4a168a2e79c2abcbf9e3d5fdaa86a9388cdf3758d0b11
  Stored in directory: /home/efraszczak/.cache/pip/wheels/c4/ac/0c/0b7e7229de2d955a358e2080f3a97c2d9fdd97cc54c1744dcf
Successfully built phishing-web-collector
Installing collected packages: phishing-web-collector
Successfully installed phishing-web-collector-0.1.2

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


# Import libraries  

In [2]:
import asyncio
import shutil
import time
from pathlib import Path

from phishing_web_collector import FeedSource
from phishing_web_collector.feed_manager import SOURCES_MAP

# Configure experiment


In [3]:
sources = list(FeedSource)
N_RUNS = 5
SYNC_DIR = Path("sync_check")
ASYNC_DIR = Path("async_check")

def clear_dir(path: Path) -> None:
    """Remove `path` if it exists, then recreate it as an empty directory."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)
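
As a quick check of what `clear_dir` does between runs, the sketch below repeats the helper so it runs on its own and applies it to a temporary directory. The directory name `demo_check` and the file `stale.json` are hypothetical and only serve the illustration.

```python
import shutil
import tempfile
from pathlib import Path


def clear_dir(path: Path) -> None:
    """Delete the directory if present, then recreate it empty."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo_check"  # hypothetical directory name
    clear_dir(target)                  # first call creates the directory
    (target / "stale.json").write_text("{}")
    clear_dir(target)                  # second call wipes the stale file
    remaining = list(target.iterdir())
    print(remaining)  # prints []
```

Starting every run from an empty directory means no run can reuse files written by a previous one.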


# Async and Sync functions

In [4]:

def retrieve_all_sync() -> float:
    """Retrieve all phishing entries from all feeds synchronously (sequentially)."""
    # One provider per feed source, all storing data under SYNC_DIR.
    providers = [SOURCES_MAP[source](str(SYNC_DIR)) for source in sources]
    start = time.perf_counter()
    entries = []  # collected only to exercise retrieval; not used afterwards
    for provider in providers:
        entries.extend(provider.retrieve())
    duration = time.perf_counter() - start
    print(f"Sync took {duration:.2f} seconds")
    return duration


async def retrieve_all() -> float:
    """Retrieve all phishing entries from all feeds concurrently.

    Each blocking `retrieve()` call runs in a worker thread via `asyncio.to_thread`.
    """
    # One provider per feed source, all storing data under ASYNC_DIR.
    providers = [SOURCES_MAP[source](str(ASYNC_DIR)) for source in sources]
    start = time.perf_counter()
    results = await asyncio.gather(
        *(asyncio.to_thread(provider.retrieve) for provider in providers)
    )
    # Flatten the per-feed result lists; not used afterwards.
    entries = [entry for result in results for entry in result]
    duration = time.perf_counter() - start
    print(f"Async took {duration:.2f} seconds")
    return duration

# Benchmark logic

In [5]:

async def run_benchmark() -> None:
    """Time N_RUNS rounds of async, then sync, retrieval and print the averages."""
    sync_times = []
    async_times = []

    for i in range(N_RUNS):
        print(f"\n--- Run {i + 1} ---")
        clear_dir(SYNC_DIR)
        clear_dir(ASYNC_DIR)

        # Async
        async_time = await retrieve_all()
        async_times.append(async_time)
        print(f"Async took {async_time:.2f} s")

        # Sync
        sync_time = retrieve_all_sync()
        sync_times.append(sync_time)
        print(f"Sync took {sync_time:.2f} s")

    avg_async = sum(async_times) / N_RUNS
    avg_sync = sum(sync_times) / N_RUNS

    print(f"\nAverage async time: {avg_async:.2f} s")
    print(f"Average sync time: {avg_sync:.2f} s")

# Run the benchmark

In [6]:
await run_benchmark()


--- Run 1 ---


Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishStats
No data found for feed: PhishTank
No data found for feed: ProofPointIp
No data found for feed: PhishingArmy
No data found for feed: TweetFeed
No data found for feed: UrlAbuse
No data found for feed: UrlHaus
No data found for feed: Valdin


Async took 10.37 seconds
Async took 10.37 s
Sync took 0.01 seconds
Sync took 0.01 s

--- Run 2 ---


Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishStats
No data found for feed: PhishTank
No data found for feed: ProofPointIp
No data found for feed: PhishingArmy
No data found for feed: TweetFeed
No data found for feed: UrlAbuse
No data found for feed: UrlHaus
No data found for feed: Valdin


Async took 9.22 seconds
Async took 9.22 s
Sync took 0.01 seconds
Sync took 0.01 s

--- Run 3 ---


Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
Error fetching https://hole.cert.pl/domains/domains.csv: 
Skipping save - No data fetched for CertPl
Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishStats
No data found for feed: PhishTank
No data found for feed: ProofPointIp
No data found for feed: PhishingArmy
No data found for feed: TweetFeed
No data found for feed: UrlAbuse
No data found for

Async took 7.14 seconds
Async took 7.14 s
Sync took 0.01 seconds
Sync took 0.01 s

--- Run 4 ---


Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
Error fetching https://hole.cert.pl/domains/domains.csv: 
Skipping save - No data fetched for CertPl
Error fetching https://raw.githubusercontent.com/ProKn1fe/phishtank-database/master/online-valid.json: 
Skipping save - No data fetched for PhishTank
No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishStats
No data found for feed: PhishTank
No data 

Async took 5.74 seconds
Async took 5.74 s
Sync took 0.01 seconds
Sync took 0.01 s

--- Run 5 ---


Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
Error fetching https://hole.cert.pl/domains/domains.csv: 
Skipping save - No data fetched for CertPl
No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishStats
No data found for feed: PhishTank
No data found for feed: ProofPointIp
No data found for feed: PhishingArmy
No data found for feed: TweetFeed
No data found for feed: UrlAbuse
No data found for

Async took 7.12 seconds
Async took 7.12 s
Sync took 0.01 seconds
Sync took 0.01 s

Average async time: 7.92 s
Average sync time: 0.01 s


Failed to fetch https://phishstats.info/phish_score.csv - Status: 404
Skipping save - No data fetched for PhishStats
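
The reported averages can be cross-checked from the per-run timings printed above. The sketch below copies those numbers by hand and adds the sample standard deviation, which the benchmark itself does not compute. A sync time of 0.01 s in every run, alongside "No data found" messages for each feed, is worth double-checking before drawing conclusions from the comparison.

```python
from statistics import mean, stdev

# Per-run timings in seconds, copied from the output above.
async_times = [10.37, 9.22, 7.14, 5.74, 7.12]
sync_times = [0.01, 0.01, 0.01, 0.01, 0.01]

avg_async = mean(async_times)
avg_sync = mean(sync_times)
spread_async = stdev(async_times)  # sample standard deviation across runs

print(f"async: mean {avg_async:.2f} s, stdev {spread_async:.2f} s")
print(f"sync:  mean {avg_sync:.2f} s")
```

The means agree with the 7.92 s and 0.01 s printed by `run_benchmark`. The spread across the async runs shows how much a single run can deviate from that average.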
