# Comprehensive Benchmarking of Anti-Phishing Datasets

## Overview
This notebook provides a **comparative analysis** of five major anti-phishing datasets across six measured dimensions, followed by a summary with recommendations. The datasets are **Phish360**, **PWD2016**, **PhishIntention**, **PILWD-134K**, and **VanNL126k**.

### Goal
To provide researchers with actionable insights for dataset selection beyond simple sample counts. We evaluate:
1. **Content Uniqueness** - Quality vs Quantity trade-offs
2. **Brand Coverage** - Target diversity
3. **Linguistic Diversity** - Global applicability
4. **Data Completeness** - Multimodal integrity
5. **URL Characteristics** - Infrastructure patterns
6. **Text Extraction Quality** - Parser effectiveness
7. **Summary & Recommendations** - Guidance for researchers

---
## 1. Content Uniqueness vs URL Uniqueness

### Why This Matters
Many datasets claim large sizes based on unique URLs, but contain highly repetitive content. This metric reveals the **true diversity** of web page content.

### Results
| Dataset        |   Total Samples |   Unique URLs |   URL Uniqueness % |   Unique Content (Text) |   Content Uniqueness % |
|:---------------|----------------:|--------------:|-------------------:|------------------------:|-----------------------:|
| Phish360       |            4332 |          4257 |              98.27 |                    4002 |                  92.38 |
| PWD2016        |           15000 |          5737 |              38.25 |                    1538 |                  10.25 |
| PhishIntention |           29496 |         25724 |              87.21 |                    6227 |                  21.11 |
| PILWD-134K     |           66964 |         58046 |              86.68 |                   24328 |                  36.33 |
| VanNL126k      |          100000 |        100000 |             100.00 |                   26243 |                  26.24 |

![Content Uniqueness](../../results/plots/plot1_uniqueness.png)

### Key Findings
- **Phish360:** Maintains **92% content uniqueness**, indicating nearly every URL leads to distinct web page content.
- **PWD2016:** Only **10% content uniqueness** (1,538 distinct texts across 15,000 samples). When near-duplicate pages land in both training and test splits, a model can memorize them, which inflates reported metrics.
- **VanNL126k:** **100% unique URLs** mask **74% duplicate content**, showing that URL uniqueness ≠ content diversity.
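
The percentages in the table are consistent with dividing each unique count by the total sample count (for example, 4002 / 4332 ≈ 92.38%). A minimal sketch of such a computation follows. The `(url, text)` input shape, the `normalize_text` helper, and the decision to hash lowercased, whitespace-collapsed text are assumptions made for illustration, not necessarily what the analysis scripts do.

```python
import hashlib
import re


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences are not counted as new content (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def uniqueness_report(samples):
    """samples: iterable of (url, text) pairs; text may be None or empty."""
    total = 0
    urls = set()
    content_hashes = set()
    for url, text in samples:
        total += 1
        urls.add(url)
        normalized = normalize_text(text or "")
        if normalized:  # assumption: empty text is not counted as content
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            content_hashes.add(digest)
    if total == 0:
        return {"total": 0, "unique_urls": 0, "url_uniqueness_pct": 0.0,
                "unique_content": 0, "content_uniqueness_pct": 0.0}
    return {
        "total": total,
        "unique_urls": len(urls),
        "url_uniqueness_pct": round(100 * len(urls) / total, 2),
        "unique_content": len(content_hashes),
        "content_uniqueness_pct": round(100 * len(content_hashes) / total, 2),
    }
```

Storing hashes rather than full page text keeps the working set small, which matters for the larger datasets.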

---
## 2. Brand Coverage

### Why This Matters
Phishing attacks target diverse brands. Datasets covering more brands enable models to generalize better to emerging threats.

### Results
| Dataset        |   Unique Brands |
|:---------------|----------------:|
| Phish360       |              47 |
| PWD2016        |               1 |
| PhishIntention |             282 |
| PILWD-134K     |             149 |
| VanNL126k      |              34 |

![Brand Coverage](../../results/plots/plot2_brands.png)

### Key Findings
- **PhishIntention** leads with **282 unique brands**, offering the widest target diversity.
- **PWD2016** has only **1 brand label** ("unknown"), making it unsuitable for brand-aware phishing detection research.
- **Phish360** covers **47 brands**, a moderate range for a curated dataset of about 4.3k samples.
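
The PWD2016 result shows why a raw distinct-label count can mislead: its single "brand" is the placeholder `unknown`. One way to separate raw and named brand counts is sketched below. The `PLACEHOLDER_LABELS` set and the case-insensitive matching are assumptions; whether the counts in the table for the other datasets include placeholder labels is not stated here.

```python
from collections import Counter

# Assumed placeholder values; adjust to the labels a dataset actually uses.
PLACEHOLDER_LABELS = {"", "unknown", "none", "n/a"}


def brand_coverage(labels):
    """Count distinct brand labels, with and without placeholders."""
    counts = Counter((label or "").strip().lower() for label in labels)
    named = {brand: n for brand, n in counts.items()
             if brand not in PLACEHOLDER_LABELS}
    # Most frequent named brands first; ties broken alphabetically.
    top = sorted(named.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
    return {
        "raw_unique": len(counts),    # every distinct label value
        "named_unique": len(named),   # placeholders excluded
        "top": top,
    }
```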

---
## 3. Linguistic Diversity

### Why This Matters
Monolingual datasets fail in global deployments. Linguistic diversity indicates readiness for multilingual phishing detection.

### Results
| Dataset        |   Unique Languages |
|:---------------|-------------------:|
| Phish360       |                 28 |
| PWD2016        |                 36 |
| PhishIntention |                 34 |
| PILWD-134K     |                 43 |
| VanNL126k      |                 44 |

![Linguistic Diversity](../../results/plots/plot3_languages.png)

### Key Findings
- **VanNL126k** and **PILWD-134K** offer the highest linguistic diversity (43-44 languages).
- **Phish360** covers **28 languages**, the fewest of the five, but pairs this with the highest content uniqueness (92%).
- Language count alone does not guarantee quality; it must be cross-referenced with content uniqueness.
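
The cross-referencing mentioned above could look roughly like the following sketch, which reports, per language, how many samples exist and how many distinct texts they contain. It assumes language codes have already been assigned to each sample (how languages were detected is not described here), and the `(language, text)` input shape is hypothetical.

```python
from collections import defaultdict


def language_diversity(samples):
    """samples: iterable of (language_code, text) pairs.

    Returns, per language, the sample count and the number of distinct
    texts after lowercasing and whitespace collapsing (assumed rule).
    """
    totals = defaultdict(int)
    distinct = defaultdict(set)
    for lang, text in samples:
        totals[lang] += 1
        distinct[lang].add(" ".join((text or "").split()).lower())
    return {
        lang: {"samples": totals[lang], "unique_texts": len(distinct[lang])}
        for lang in sorted(totals)
    }
```

A language represented by many samples but only a handful of distinct texts adds little real diversity, even though it raises the language count.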

---
## 4. Data Completeness (Multimodal Integrity)

### Why This Matters
"Multimodal" datasets claim to include images, HTML, and text. This metric verifies if those files actually exist and are parseable.

### Results
| Dataset        |   Image % |   HTML % |   Text % |
|:---------------|----------:|---------:|---------:|
| Phish360       |    100.00 |   100.00 |    99.79 |
| PWD2016        |     96.08 |    91.23 |    82.01 |
| PhishIntention |    100.00 |    98.47 |    98.40 |
| PILWD-134K     |    100.00 |   100.00 |    95.12 |
| VanNL126k      |     99.93 |   100.00 |    98.99 |

![Data Completeness](../../results/plots/plot4_completeness.png)

### Key Findings
- **Phish360** achieves near-perfect completeness across all modalities (100%, 100%, 99.79%).
- **PWD2016** shows concerning data loss: HTML is present for only 91% of samples and text can be extracted for only **82%**, likely due to corrupted or missing HTML files.
- **PhishIntention** and **VanNL126k** have excellent completeness, validating their multimodal claims.
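
A basic presence check per modality might be sketched as follows. The per-sample directory layout and the file names in `DEFAULT_FILENAMES` are hypothetical, and this sketch only tests that each file exists and is non-empty; it does not verify that an image decodes or that HTML parses.

```python
from pathlib import Path

# Hypothetical file names inside each sample directory.
DEFAULT_FILENAMES = {
    "image": "screenshot.png",
    "html": "page.html",
    "text": "text.txt",
}


def has_content(path: Path) -> bool:
    """True if the file exists and is not empty."""
    return path.is_file() and path.stat().st_size > 0


def completeness_report(sample_dirs, filenames=None):
    """Percentage of sample directories containing each modality."""
    filenames = filenames or DEFAULT_FILENAMES
    dirs = [Path(d) for d in sample_dirs]
    report = {}
    for modality, name in filenames.items():
        present = sum(has_content(d / name) for d in dirs)
        report[modality] = round(100 * present / len(dirs), 2) if dirs else 0.0
    return report
```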

---
## 5. URL Characteristics - SSL Usage

### Why This Matters
Modern phishing increasingly uses HTTPS to appear legitimate. Datasets with low SSL ratios may not represent current threats.

### Results
| Dataset        |   SSL % |
|:---------------|---------:|
| Phish360       |    42.66 |
| PWD2016        |     1.53 |
| PhishIntention |    33.06 |
| PILWD-134K     |    75.97 |
| VanNL126k      |    44.26 |

![SSL Usage](../../results/plots/plot5_ssl.png)

### Key Findings
- **PILWD-134K** best reflects modern phishing with **76% HTTPS usage**.
- **PWD2016** is severely outdated with only **1.5% HTTPS**, reflecting pre-2016 phishing tactics.
- **Phish360** (43%) and **VanNL126k** (44%) show moderate HTTPS adoption, well above PWD2016 but below PILWD-134K.
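
Assuming the SSL percentage is derived from the URL scheme alone (an assumption; certificate validity is not checked in this sketch), the ratio could be computed as follows. The function name `https_percentage` is illustrative.

```python
from urllib.parse import urlparse


def https_percentage(urls):
    """Share of URLs whose scheme is https, as a percentage.

    URLs without a scheme are counted as non-HTTPS.
    """
    urls = list(urls)
    if not urls:
        return 0.0
    https = sum(
        1 for url in urls
        if urlparse(url.strip()).scheme.lower() == "https"
    )
    return round(100 * https / len(urls), 2)
```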

---
## 6. Text Extraction Quality

### Why This Matters
Text-based detectors depend on how much text can be recovered from raw HTML. This metric compares the extraction success rates of two HTML parsers (BeautifulSoup and Trafilatura) across datasets.

### Results
| Dataset        |   BeautifulSoup % |   Trafilatura % |
|:---------------|------------------:|----------------:|
| Phish360       |             99.79 |           99.70 |
| PWD2016        |             82.01 |           81.34 |
| PhishIntention |             98.40 |           98.40 |
| PILWD-134K     |             95.12 |           95.05 |
| VanNL126k      |             98.99 |           98.94 |

![Text Extraction](../../results/plots/plot6_extraction.png)

### Key Findings
- Both parsers perform similarly on every dataset (differences of at most 0.67 percentage points), with **BeautifulSoup** having a slight edge.
- **Phish360** achieves near-perfect extraction (99.79%), validating HTML quality.
- **PWD2016** struggles with both parsers (82%), suggesting corrupted or malformed HTML files.
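
A parser comparison of this kind can be framed as one success-rate function applied to different extractors. The sketch below uses a standard-library `HTMLParser` extractor as a stand-in so that it runs without extra dependencies; it is not the parser used for the table. With the libraries installed, callables such as `lambda h: BeautifulSoup(h, "html.parser").get_text()` or `trafilatura.extract` could be passed instead. Counting any non-empty result as a success is an assumed criterion.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def stdlib_extract(html: str) -> str:
    """Stand-in extractor built on the standard library."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def success_rate(html_docs, extractor):
    """Percentage of documents for which extractor returns non-empty text."""
    docs = list(html_docs)
    ok = 0
    for html in docs:
        try:
            text = extractor(html)
        except Exception:  # a crashing parser counts as a failed extraction
            text = None
        if text and text.strip():
            ok += 1
    return round(100 * ok / len(docs), 2) if docs else 0.0
```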

---
## 7. Summary & Recommendations for Researchers

### Dataset Selection Guide

| Use Case | Recommended Dataset | Rationale |
|:---------|:-------------------|:----------|
| **High-Quality Evaluation** | Phish360 | Best content uniqueness (92%), near-perfect multimodal integrity |
| **Brand-Aware Detection** | PhishIntention | 282 unique brands, excellent brand diversity |
| **Multilingual Models** | VanNL126k, PILWD-134K | 43-44 languages, global coverage |
| **Modern SSL Patterns** | PILWD-134K | 76% HTTPS usage, reflects 2020+ phishing |
| **Avoid for Training** | PWD2016 | 10% content uniqueness, 18% text loss, outdated SSL patterns |

### Critical Insights

1. **Size ≠ Quality:** VanNL126k has 100k samples but only 26% unique content. Phish360 with 4.3k samples has 92% unique content.

2. **PWD2016 is Unsuitable:** Despite being widely cited, it suffers from:
   - 90% content duplication
   - 18% of samples without extractable text
   - Outdated infrastructure patterns (1.5% SSL)
   - No brand diversity (labeled as "unknown")

3. **Trade-offs Exist:** No single "best" dataset. Choose based on your research goals:
   - **Quality over Quantity?** → Phish360
   - **Maximum Diversity?** → PhishIntention (brands) + VanNL126k (languages)
   - **Modern Threats?** → PILWD-134K

### Methodological Note
All metrics were calculated using memory-optimized sequential column processing to handle large datasets efficiently. The analysis scripts are available in the repository.
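
One possible reading of sequential column processing is sketched below: the metadata file is scanned once per column, and only a set of short hashes is kept in memory at any time. The CSV storage format, the column names passed in, and the helper names are assumptions for illustration; the actual scripts may work differently.

```python
import csv
import hashlib

# Page text can exceed the csv module's default 128 KiB field limit.
csv.field_size_limit(1_000_000_000)


def unique_count_for_column(csv_path, column):
    """Stream one column and count rows and distinct non-empty values."""
    seen = set()
    rows = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            rows += 1
            value = row.get(column) or ""
            if value:
                # Keep a 16-byte digest instead of the full value.
                digest = hashlib.blake2b(
                    value.encode("utf-8"), digest_size=16
                ).digest()
                seen.add(digest)
    return rows, len(seen)


def column_report(csv_path, columns):
    """Process columns one at a time so only one hash set is alive."""
    report = {}
    for column in columns:
        rows, unique = unique_count_for_column(csv_path, column)
        pct = round(100 * unique / rows, 2) if rows else 0.0
        report[column] = {"rows": rows, "unique": unique, "pct": pct}
    return report
```

The trade-off is one extra pass over the file per column in exchange for a peak memory footprint bounded by the largest single column's hash set.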