## Collecting the Data

This project uses two datasets from Kaggle:

1. [Phishing Domain URL Dataset](https://www.kaggle.com/datasets/michellevp/dataset-phishing-domain-detection-cybersecurity)
    - This dataset provides URL-derived features intended for detecting phishing domains.
2. [Malicious URLs Dataset](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)
    - This dataset provides a broad selection of both malicious and safe website URLs.

**Focus**: The `Malicious URLs Dataset` also contains URLs for other kinds of malicious websites, so we filter it to keep only the rows labeled `phishing` or `benign`.

**Data Sampling**: Because both datasets are large, we draw a random sample of 1,000 rows from each, using a fixed seed so the draw is reproducible. This keeps the working data manageable. A uniform random sample of this size roughly preserves the class mix of the source data, though it does not guarantee it.
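
The filter-then-sample idea described above can be sketched on a small toy table. This is only an illustration: the rows, the `url` column name, and the seed value are assumptions made for the example, and only the `type` column and its `phishing`/`benign` labels come from this notebook.

```python
import pandas as pd

# Toy stand-in for the Malicious URLs dataset; these rows are invented for illustration.
toy = pd.DataFrame({
    "url": [f"example{i}.test/path" for i in range(12)],
    "type": ["benign", "phishing", "malware", "defacement"] * 3,
})

# Keep only the two classes of interest.
kept = toy[toy["type"].isin(["phishing", "benign"])]

# A fixed random_state makes the draw repeatable across runs.
first_draw = kept.sample(n=4, random_state=0)
second_draw = kept.sample(n=4, random_state=0)

print(sorted(kept["type"].unique()))
print(first_draw.equals(second_draw))
```

Drawing twice with the same `random_state` selects the same rows, which is why the notebook pins the seed when sampling.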

In [2]:
## Open the datasets
import pandas as pd
url_information = pd.read_csv('../datasets/url_information.csv')
malicious_urls = pd.read_csv('../datasets/malicious_urls.csv')

In [4]:
## Filtering Malicious URLs Dataset
filtered_malicious_urls = malicious_urls[malicious_urls['type'].isin(['phishing', 'benign'])]

## Draw 1000 random rows from each dataset (fixed seed for reproducibility)
sampled_malicious_urls = filtered_malicious_urls.sample(n=1000, random_state=69)
sampled_url_information = url_information.sample(n=1000, random_state=69)
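
One way to check that a random sample kept roughly the same class mix as its source is to compare label proportions side by side. The helper below is a hypothetical sketch, not part of the original notebook; `label_proportions` and the toy data are assumptions, and only the `type` column name is taken from the code above.

```python
import pandas as pd


def label_proportions(full: pd.DataFrame, sample: pd.DataFrame, column: str = "type") -> pd.DataFrame:
    """Return the share of each label in the full frame and in the sample, side by side."""
    return pd.DataFrame({
        "full": full[column].value_counts(normalize=True),
        "sample": sample[column].value_counts(normalize=True),
    }).fillna(0.0)  # a label missing from the sample shows up as 0.0


# Toy data: 80% benign, 20% phishing (invented for illustration).
toy = pd.DataFrame({"type": ["benign"] * 80 + ["phishing"] * 20})
toy_sample = toy.sample(n=50, random_state=0)

comparison = label_proportions(toy, toy_sample)
print(comparison)
```

In this notebook the same helper could be called as `label_proportions(filtered_malicious_urls, sampled_malicious_urls)`. If the two columns differ noticeably, a stratified draw may be a better fit than a plain random one.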

In [None]:
## Write to a new CSV
sampled_malicious_urls.to_csv('../datasets/sampled_malicious_urls.csv', index=False)
sampled_url_information.to_csv('../datasets/sampled_url_information.csv', index=False)
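
The `index=False` argument matters because `.sample()` keeps the original row labels. A minimal round-trip sketch, using an invented toy frame and a temporary directory rather than the project's `../datasets/` folder, shows what is written and what is read back:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with non-contiguous row labels, like those left behind by .sample().
toy = pd.DataFrame(
    {"url": ["a.test", "b.test", "c.test"], "type": ["benign", "phishing", "benign"]},
    index=[17, 4, 42],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_sample.csv"
    toy.to_csv(path, index=False)   # the old row labels are not written
    restored = pd.read_csv(path)    # a fresh 0..n-1 index is created on load

print(list(restored.columns))    # ['url', 'type']
print(restored.index.tolist())   # [0, 1, 2]
```

Without `index=False`, the old row labels would be written as an extra unnamed column and would reappear as data when the file is loaded.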