# Introduction

This notebook benchmarks synchronous versus asynchronous retrieval of phishing entries from several feeds. The asynchronous path uses the `asyncio` library to run all feed retrievals concurrently; the synchronous path retrieves the feeds one after another. Each path is timed over `N_RUNS` runs and the average duration is reported.
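To make the difference between the two paths concrete, here is a minimal sketch that uses only the standard library. `fake_fetch`, `DELAY`, and `N_FEEDS` are made-up stand-ins for a real feed download and are not part of `phishing_web_collector`; the sketch only illustrates why awaiting coroutines together with `asyncio.gather` can finish sooner than awaiting them one by one.

```python
import asyncio
import time

DELAY = 0.1  # simulated network latency per feed, in seconds (made-up value)
N_FEEDS = 5  # number of simulated feeds (made-up value)


async def fake_fetch(feed_id: int) -> str:
    """Hypothetical stand-in for one feed download; it only waits."""
    await asyncio.sleep(DELAY)
    return f"feed-{feed_id}"


async def fetch_sequentially() -> float:
    """Await each fetch before starting the next one."""
    start = time.perf_counter()
    for feed_id in range(N_FEEDS):
        await fake_fetch(feed_id)
    return time.perf_counter() - start


async def fetch_concurrently() -> float:
    """Start all fetches at once and wait for them together."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(i) for i in range(N_FEEDS)))
    return time.perf_counter() - start


# In a notebook cell, replace asyncio.run(...) with `await ...`,
# because the notebook already runs an event loop.
sequential_time = asyncio.run(fetch_sequentially())
concurrent_time = asyncio.run(fetch_concurrently())
print(f"sequential: {sequential_time:.2f} s, concurrent: {concurrent_time:.2f} s")
```

With simulated waits, the sequential total grows with the number of feeds, while the concurrent total stays close to the slowest single wait. Real feeds add parsing and disk I/O, so the gap in practice may be smaller.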


# Installation


In [14]:
!pip install "phishing-web-collector>=0.1.4"


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


# Import libraries  

In [15]:
import asyncio
import shutil
import time
from pathlib import Path

from phishing_web_collector import FeedSource
from phishing_web_collector.feed_manager import SOURCES_MAP

# Configure experiment


In [16]:
sources = [
        FeedSource.AD_GUARD_HOME,
        FeedSource.BINARY_DEFENCE_IP,
        FeedSource.BLOCKLIST_DE_IP,
        FeedSource.BOTVRIJ,
        FeedSource.C2_INTEL_DOMAIN,
        FeedSource.C2_TRACKER_IP,
        FeedSource.CERT_PL,
        FeedSource.DANGEROUS_DOMAINS,
        FeedSource.GREEN_SNOW_IP,
        FeedSource.MALWARE_WORLD,
        FeedSource.MIRAI_SECURITY_IP,
        FeedSource.OPEN_PHISH,
        FeedSource.PHISHING_ARMY,
        FeedSource.PHISHING_DATABASE,
        FeedSource.PHISH_STATS_API,
        FeedSource.PHISH_TANK,
        FeedSource.PROOF_POINT_IP,
        FeedSource.THREAT_VIEW_DOMAIN,
        FeedSource.TWEET_FEED,
        FeedSource.URL_ABUSE,
        FeedSource.URL_HAUS,
        FeedSource.VALDIN,
]
N_RUNS = 5
SYNC_DIR = Path("sync_check")
ASYNC_DIR = Path("async_check")

def clear_dir(path: Path) -> None:
    """Delete `path` if it exists, then recreate it as an empty directory."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)


# Async and Sync functions

In [17]:

def retrieve_all_sync() -> float:
    """Retrieve all phishing entries from all feeds synchronously (sequentially)."""
    # str(SYNC_DIR) == "sync_check"; reuse the constant from the configuration cell.
    providers = [SOURCES_MAP[source](str(SYNC_DIR)) for source in sources]
    start = time.perf_counter()
    entries = []
    for provider in providers:
        entries.extend(provider.retrieve_sync())
    duration = time.perf_counter() - start
    print(f"Sync took {duration:.2f} seconds")
    return duration


async def retrieve_all() -> float:
    """Retrieve all phishing entries from all feeds asynchronously."""
    # str(ASYNC_DIR) == "async_check"; reuse the constant from the configuration cell.
    providers = [SOURCES_MAP[source](str(ASYNC_DIR)) for source in sources]
    start = time.perf_counter()
    results = await asyncio.gather(*(provider.retrieve() for provider in providers))
    entries = [entry for result in results for entry in result]
    duration = time.perf_counter() - start
    print(f"Async took {duration:.2f} seconds")
    return duration
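By default, `asyncio.gather` raises the first exception from any awaited coroutine. Judging by the "Error fetching" line in the output further down, the library appears to log fetch failures and continue, but if a provider did raise, one option is `return_exceptions=True`. The sketch below assumes a hypothetical `fake_feed` coroutine in place of `provider.retrieve()`; the feed names are invented for illustration.

```python
import asyncio


async def fake_feed(name: str, fail: bool = False) -> list[str]:
    """Hypothetical stand-in for provider.retrieve(); not the library's API."""
    await asyncio.sleep(0.01)
    if fail:
        raise ConnectionError(f"could not reach {name}")
    return [f"{name}-entry-1", f"{name}-entry-2"]


async def gather_tolerant() -> tuple[list[str], list[str]]:
    """Collect entries from all feeds, recording failures instead of raising."""
    names = ["feed_a", "feed_b", "feed_c"]
    results = await asyncio.gather(
        fake_feed("feed_a"),
        fake_feed("feed_b", fail=True),
        fake_feed("feed_c"),
        return_exceptions=True,  # exceptions come back as ordinary results
    )
    entries: list[str] = []
    failed: list[str] = []
    # gather preserves argument order, so results line up with names.
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            failed.append(name)
        else:
            entries.extend(result)
    return entries, failed


# In a notebook cell, use `entries, failed = await gather_tolerant()` instead.
entries, failed = asyncio.run(gather_tolerant())
print(f"{len(entries)} entries, failed feeds: {failed}")
```

This way a single unreachable feed does not discard the results of the others, and the benchmark can still report a duration for the run.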

# Benchmark logic

In [18]:

async def run_benchmark():
    """Time both retrieval paths N_RUNS times and print the average durations."""
    sync_times = []
    async_times = []

    for i in range(N_RUNS):
        print(f"\n--- Run {i + 1} ---")
        clear_dir(SYNC_DIR)
        clear_dir(ASYNC_DIR)

        # Async
        async_time = await retrieve_all()
        async_times.append(async_time)
        print(f"Async took {async_time:.2f} s")

        # Sync
        sync_time = retrieve_all_sync()
        sync_times.append(sync_time)
        print(f"Sync took {sync_time:.2f} s")

    avg_async = sum(async_times) / N_RUNS
    avg_sync = sum(sync_times) / N_RUNS

    print(f"\nAverage async time: {avg_async:.2f} s")
    print(f"Average sync time: {avg_sync:.2f} s")

# Run the benchmark

In [19]:
await run_benchmark()


--- Run 1 ---


No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishTank
No data found for feed: ProofPointIp
No data found for feed: PhishingArmy
No data found for feed: TweetFeed
No data found for feed: UrlAbuse
No data found for feed: UrlHaus
No data found for feed: Valdin


Async took 5.02 seconds
Async took 5.02 s
Sync took 0.02 seconds
Sync took 0.02 s

--- Run 2 ---


No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishTank
No data found for feed: ProofPointIp
No data found for feed: PhishingArmy
No data found for feed: TweetFeed
No data found for feed: UrlAbuse
No data found for feed: UrlHaus
No data found for feed: Valdin


Async took 5.62 seconds
Async took 5.62 s
Sync took 0.01 seconds
Sync took 0.01 s

--- Run 3 ---


No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishTank
No data found for feed: ProofPointIp
No data found for feed: PhishingArmy
No data found for feed: TweetFeed
No data found for feed: UrlAbuse
No data found for feed: UrlHaus
No data found for feed: Valdin


Async took 5.28 seconds
Async took 5.28 s
Sync took 0.01 seconds
Sync took 0.01 s

--- Run 4 ---


Error fetching https://openphish.com/feed.txt: 
Skipping save - No data fetched for OpenPhish
No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishTank
No data found for feed: ProofPointIp
No data found for feed: PhishingArmy
No data found for feed: TweetFeed
No data found for feed: UrlAbuse
No data found for feed: UrlHaus
No data found for feed: Valdin


Async took 5.81 seconds
Async took 5.81 s
Sync took 0.01 seconds
Sync took 0.01 s

--- Run 5 ---


No data found for feed: BinaryDefenceIP
No data found for feed: BlocklistDeIP
No data found for feed: Botvrij
No data found for feed: C2IntelDomain
No data found for feed: C2TrackerIp
No data found for feed: CertPl
No data found for feed: GreenSnowIp
No data found for feed: MiraiSecurityIp
No data found for feed: OpenPhish
No data found for feed: PhishingArmy
No data found for feed: PhishingDatabase
No data found for feed: PhishStats
No data found for feed: PhishTank
No data found for feed: ProofPointIp
No data found for feed: PhishingArmy
No data found for feed: TweetFeed
No data found for feed: UrlAbuse
No data found for feed: UrlHaus
No data found for feed: Valdin


Async took 5.07 seconds
Async took 5.07 s
Sync took 0.01 seconds
Sync took 0.01 s

Average async time: 5.36 s
Average sync time: 0.01 s
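The reported averages can be checked against the per-run timings in the output above. The following sketch copies those timings by hand and adds the median and sample standard deviation from the standard-library `statistics` module, which the notebook itself does not compute.

```python
import statistics

# Per-run timings copied from the output above (seconds).
async_times = [5.02, 5.62, 5.28, 5.81, 5.07]
sync_times = [0.02, 0.01, 0.01, 0.01, 0.01]

for label, times in (("async", async_times), ("sync", sync_times)):
    mean = statistics.mean(times)
    median = statistics.median(times)
    stdev = statistics.stdev(times)  # sample standard deviation
    print(f"{label}: mean={mean:.2f} s, median={median:.2f} s, stdev={stdev:.2f} s")
```

The means agree with the averages printed by `run_benchmark`. The log pairs the near-zero sync timings with "No data found for feed" messages, so the sync figure may reflect reading an empty local directory rather than a network fetch. That is an inference from the output, not something the notebook states.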
