## Identifying the features of a malicious URL

According to a [study by the University of Huddersfield in the United Kingdom](https://eprints.hud.ac.uk/id/eprint/24330/6/MohammadPhishing14July2015.pdf), the following URL characteristics can signal a potential phishing website:

| Indicator              | Description                                                                                           |
|------------------------|-------------------------------------------------------------------------------------------------------|
| `IP Address Usage`       | Legitimate websites primarily use domain names. The presence of a raw IP address suggests an attempt to mask the site's true identity. |
| `Extended URL Length`    | Phishers craft long URLs to obscure the suspicious parts of the address, making the site harder to scrutinize. |
| `URL Shortening`         | Shortened links hide the full, potentially malicious, domain name.                                      |
| `'@' Symbol Inclusion`   | Browsers ignore everything that precedes an "@" symbol in the address, so the real destination is the site hidden behind the symbol. |
| `Multiple '//' Symbols`  | An extra "//" within a URL can indicate redirection to a different website, which is a common phishing tactic. |
| `Excessive Dots in Domain`  | A URL with more than two dots between the initial 'www' and the country-code Top-Level Domain (e.g., .sg, .uk, .ca) is suspicious, as it may reveal a multitude of nested subdomains. |
| `Lack of HTTPS`          | The absence of the secure HTTPS protocol can be a red flag, as legitimate sites often use HTTPS to ensure secure connections. |
| `Domain Dashes` | Phishers frequently insert prefixes or suffixes separated by dashes into domain names, making them appear legitimate to unsuspecting users. |
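
To make the indicators above concrete, here is a minimal sketch of how they could be computed from a raw URL string using only the Python standard library. The function name `extract_url_features`, the `SHORTENER_HOSTS` set, and the keys `extra_double_slash` and `uses_https` are assumptions made for illustration; the other keys borrow the dataset's column names, but the dataset's own extraction logic is not shown in this notebook and may differ.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical list of shortener hosts, for illustration only
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def extract_url_features(url: str) -> dict:
    """Compute simple lexical indicators for a single URL string."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # A host that parses as an IP address means no domain name was used
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "domain_in_ip": int(host_is_ip),
        "length_url": len(url),
        "url_shortened": int(host in SHORTENER_HOSTS),
        "qty_at_url": url.count("@"),
        # "//" appearing anywhere after the scheme separator
        "extra_double_slash": int("//" in url.split("://", 1)[-1]),
        "qty_dot_domain": host.count("."),
        "uses_https": int(parts.scheme == "https"),
        "qty_hyphen_domain": host.count("-"),
    }


features = extract_url_features(
    "http://192.168.10.5/login@secure-pay.example.com//verify"
)
print(features)
```

The example URL is made up. It combines a raw IP address, an "@" symbol, an extra "//", and a missing HTTPS scheme, so several indicators fire at once.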

<br>

In addition, our dataset includes features that are not derived from the URL itself, which the study also identifies as good indicators of potential phishing websites:

| Indicator          | Description                                                                                                  |
|--------------------|--------------------------------------------------------------------------------------------------------------|
| `Google-Index`       | Phishing sites are often newly registered and may not be indexed by the Google Search Engine.  |
| `Age of Domain`      | Unlike legitimate websites that often have long-established domains, phishing sites tend to use newly registered domains and disappear quickly. |
| `Domain Expiration`  | Similarly, legitimate domains are often renewed for extended periods, while phishing domains may have short expiration windows.    |
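
As a rough illustration of the two time-based indicators, the sketch below derives a domain's age and its remaining registration period from registration dates. The helper `days_between` and all dates are hypothetical. Measuring in days and using `-1` for unknown values are assumptions, the latter based on the `-1` entries that appear in the dataset's `time_domain_activation` and `time_domain_expiration` columns.

```python
from datetime import date
from typing import Optional


def days_between(start: Optional[date], end: Optional[date]) -> int:
    """Return whole days from start to end, or -1 if either date is unknown."""
    if start is None or end is None:
        return -1
    return (end - start).days


# Hypothetical registration record for one domain
today = date(2024, 1, 1)
registered_on = date(2023, 12, 2)
expires_on = date(2024, 12, 2)

time_domain_activation = days_between(registered_on, today)  # domain age in days
time_domain_expiration = days_between(today, expires_on)     # days until expiry
unknown = days_between(None, today)                          # missing record

print(time_domain_activation, time_domain_expiration, unknown)
```

A domain that is only a month old and registered for a single year, as in this made-up record, matches the profile the table describes for phishing sites.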


We therefore select the following features from the dataset:
-   URL Characteristics:
    -   `domain_in_ip`: Whether the URL uses an IP address in place of a domain name.
    -   `length_url`: Length of the URL, in characters.
    -   `url_shortened`: Whether the URL has been shortened.
    -   `qty_at_url`: Number of "@" symbols in the URL.
    -   `qty_dot_domain`: Number of dots in the domain.
    -   `tls_ssl_certificate`: Whether the site has a TLS/SSL certificate.
    -   `qty_hyphen_domain`: Number of hyphens in the domain.
-   Non-URL Characteristics:
    -   `url_google_index`: Whether the URL is indexed by Google.
    -   `domain_google_index`: Whether the domain is indexed by Google.
    -   `time_domain_activation`: Time since the domain was activated.
    -   `time_domain_expiration`: Time until the domain expires.

We exclude the `Multiple '//' Symbols` indicator from our analysis. Our dataset only counts single '/' characters, which are common in legitimate URLs as separators for directories and file paths, so that count cannot isolate the '//' redirection pattern.
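
The small sketch below shows why a plain slash count is a poor substitute for this indicator. Both helper functions and both URLs are hypothetical examples, not part of the dataset.

```python
def count_single_slashes(url: str) -> int:
    """Count every '/' character, including those in the scheme separator."""
    return url.count("/")


def has_embedded_double_slash(url: str) -> bool:
    """Check for '//' anywhere after the scheme separator ('://')."""
    remainder = url.split("://", 1)[-1]
    return "//" in remainder


benign = "https://www.example.com/docs/guides/setup/index.html"
suspicious = "http://www.example.com//http://phish.example.net"

for url in (benign, suspicious):
    print(count_single_slashes(url), has_embedded_double_slash(url), url)
```

Both URLs contain the same number of '/' characters, yet only the second one contains an embedded '//'. A slash count alone therefore cannot tell them apart.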

---

## Collecting the Data

We will leverage two key datasets for this project:

1. [Phishing Domain URL Dataset](https://www.kaggle.com/datasets/michellevp/dataset-phishing-domain-detection-cybersecurity) 
    - This dataset offers a rich array of URL-derived features, specifically tailored for detecting phishing domains.
2. [Malicious URLs Dataset](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)
    - This dataset provides a broad selection of both malicious and safe website URLs.

**Focus**: The `Malicious URLs Dataset` also contains URLs for other types of malicious websites, so we will filter it to keep only phishing and benign URLs.

**Data Sampling**: Because both datasets are large, we will randomly sample 1,000 entries from each. This keeps the data manageable while still preserving enough diversity for effective analysis.
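
One way to check whether a sample keeps the class mix of its source is to compare label proportions before and after sampling. The sketch below does this on a made-up toy frame. The stratified variant is shown only as an alternative for comparison and is not the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for a filtered dataset: 80% benign, 20% phishing
toy = pd.DataFrame({
    "url": [f"http://site{i}.example" for i in range(1000)],
    "type": ["benign"] * 800 + ["phishing"] * 200,
})

# Simple random sample, the approach taken in this notebook
simple = toy.sample(n=100, replace=False, random_state=0)

# Stratified alternative: draw the same fraction from each class
stratified = toy.groupby("type").sample(frac=0.1, random_state=0)

print(simple["type"].value_counts(normalize=True))
print(stratified["type"].value_counts(normalize=True))
```

A simple random sample only approximates the original proportions, while the stratified sample reproduces them exactly. This matters more when one class is rare.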

In [2]:
# Open the datasets
import pandas as pd
url_information = pd.read_csv('../datasets/url_information.csv')
malicious_urls = pd.read_csv('../datasets/malicious_urls.csv')

In [3]:
# Keep only phishing and benign URLs from the Malicious URLs Dataset, then draw 1000 random samples
filtered_malicious_urls = malicious_urls[malicious_urls['type'].isin(['phishing', 'benign'])]
sampled_malicious_urls = filtered_malicious_urls.sample(n=1000, replace=False, random_state=69)

# Keep only the selected features and the `phishing` label from the URL Information Dataset
selected_columns = [
    'domain_in_ip', 'length_url', 'url_shortened', 'qty_at_url', 'qty_dot_domain', 
    'tls_ssl_certificate', 'qty_hyphen_domain', 'url_google_index', 'domain_google_index', 
    'time_domain_activation', 'time_domain_expiration', 'phishing'
]
filtered_url_information = url_information[selected_columns]

A preliminary look at our dataset shows that the columns `domain_in_ip`, `url_shortened`, `qty_at_url`, `url_google_index`, and `domain_google_index` are almost entirely zero. To improve data quality and keep these features from being underrepresented, we will perform the following steps:

1. **Targeted Sampling**: Randomly select 500 entries in which **at least** one of these columns is non-zero, with no duplicate entries.

2. **General Sampling**: Randomly select the remaining 500 entries from the entire dataset, again with no duplicate entries.

In [4]:
# Observe that the first 10 entries are all 0 in the columns `domain_in_ip`, `url_shortened`, `qty_at_url`, `url_google_index`, and `domain_google_index`
filtered_url_information.head(10)

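|   | domain_in_ip | length_url | url_shortened | qty_at_url | qty_dot_domain | tls_ssl_certificate | qty_hyphen_domain | url_google_index | domain_google_index | time_domain_activation | time_domain_expiration | phishing |
|---|--------------|------------|---------------|------------|----------------|---------------------|-------------------|------------------|---------------------|------------------------|------------------------|----------|
| 0 | 0 | 13  | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1640 | 551  | 1 |
| 1 | 0 | 329 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | -1   | -1   | 1 |
| 2 | 0 | 24  | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 5355 | 123  | 0 |
| 3 | 0 | 23  | 0 | 0 | 1 | 0 | 1 | 0 | 0 | -1   | -1   | 1 |
| 4 | 0 | 23  | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 7865 | 1631 | 0 |
| 5 | 0 | 22  | 0 | 0 | 3 | 0 | 0 | 0 | 0 | -1   | -1   | 0 |
| 6 | 0 | 143 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | -1   | -1   | 1 |
| 7 | 0 | 33  | 0 | 0 | 2 | 1 | 0 | 0 | 0 | -1   | -1   | 1 |
| 8 | 0 | 17  | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1062 | 33   | 0 |
| 9 | 0 | 35  | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 3    | 361  | 1 |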


In [5]:
# Calculate the proportion of 0s in each column: between about 97.8% and 99.7% of the entries in each column are 0
columns_to_check = ['domain_in_ip', 'url_shortened', 'qty_at_url', 'url_google_index', 'domain_google_index']
proportion_zeros = filtered_url_information[columns_to_check].apply(lambda x: (x == 0).sum() / len(x))
print(proportion_zeros)

domain_in_ip           0.997309
url_shortened          0.993631
qty_at_url             0.977532
url_google_index       0.997217
domain_google_index    0.996577
dtype: float64
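
A related check, sketched below on a made-up toy frame, counts how many rows carry a non-zero value in at least one of the sparse columns. That is the pool the targeted sampling step draws from. The frame and the name `sparse_columns` are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for filtered_url_information
toy = pd.DataFrame({
    "domain_in_ip":  [0, 0, 1, 0, 0],
    "url_shortened": [0, 0, 0, 0, 1],
    "qty_at_url":    [0, 2, 0, 0, 0],
    "length_url":    [13, 329, 24, 23, 22],
})
sparse_columns = ["domain_in_ip", "url_shortened", "qty_at_url"]

# Share of zeros per column; the mean of a boolean column is its proportion of True
zero_share = toy[sparse_columns].eq(0).mean()

# Rows where at least one sparse column is non-zero
has_signal = toy[sparse_columns].ne(0).any(axis=1)

print(zero_share)
print(f"{has_signal.sum()} of {len(toy)} rows have at least one non-zero value")
```

Using `.eq(0).mean()` gives the same proportion as dividing the count of zeros by the column length, in a slightly more compact form.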


In [7]:
# Collecting the 1000 samples
sampled_url_information = pd.DataFrame()  # Dataframe to store our final samples
targeted_url_information = pd.DataFrame()  # Dataframe for entries with non-zero target columns

# Iterate over the underrepresented columns and collect every entry with a non-zero value
for column in columns_to_check:
    available_entries = filtered_url_information[(filtered_url_information[column] != 0)]  # Filter for rows with non-zero values in the current column
    targeted_url_information = pd.concat([targeted_url_information, available_entries])  # Add valid entries to our targeted dataframe

# Remove duplicates, so an entry that is non-zero in several target columns is kept only once
targeted_url_information = targeted_url_information.drop_duplicates()
targeted_samples = targeted_url_information.sample(n=500, replace=False, random_state=69)  # Randomly select 500 from the targeted entries

# Add the 500 targeted samples to the main sample dataframe
sampled_url_information = pd.concat([sampled_url_information, targeted_samples])

# Add general samples one at a time (with a new seed each time) until there are 1000 unique entries.
# This is inefficient, but my original method did not work (see below)
r = 1
while len(sampled_url_information.drop_duplicates()) < 1000:
    general_sample = filtered_url_information.sample(n=1, replace=False, random_state=r)  # Take a single random sample
    sampled_url_information = pd.concat([sampled_url_information, general_sample])  # Add it to the main dataframe
    r = r + 1

sampled_url_information = sampled_url_information.drop_duplicates()  # Ensure all entries are unique 


# My original method filtered by index: take all entries in filtered_url_information that are not in targeted_samples, then sample 500 of them.
# However, the final drop_duplicates() always removed some rows, so I could never get a final dataframe of 1000 entries.
# The likely cause is that drop_duplicates() compares row values, and distinct entries can share identical feature values.

# general_url_information = pd.DataFrame()
# general_url_information = filtered_url_information.loc[~filtered_url_information.index.isin(targeted_samples.index)]
# general_samples = general_url_information.sample(n=500, replace=False, random_state=69)
# sampled_url_information = pd.concat([targeted_samples, general_samples]).drop_duplicates()
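
For comparison, here is a minimal sketch of a loop-free alternative. It removes value-level duplicates first, so index-based exclusion can no longer let duplicates back in. The function name `sample_targeted_and_general` and the toy data are hypothetical. The sketch assumes the frame has a unique index, and it would select different rows from the cell above, so it is not a drop-in replacement.

```python
import pandas as pd


def sample_targeted_and_general(df, sparse_columns, n_targeted, n_general, seed=69):
    """Sample rows with signal first, then fill up from the remaining unique rows."""
    # Compare rows by value up front so later steps cannot reintroduce duplicates
    unique_rows = df.drop_duplicates()

    # Rows where at least one sparse column is non-zero
    has_signal = unique_rows[sparse_columns].ne(0).any(axis=1)
    targeted = unique_rows[has_signal].sample(n=n_targeted, random_state=seed)

    # Everything not already picked is eligible for the general sample
    remaining = unique_rows.drop(index=targeted.index)
    general = remaining.sample(n=n_general, random_state=seed)

    return pd.concat([targeted, general])


# Toy data: mostly zero rows (some duplicated) plus a few rows with signal
toy = pd.DataFrame({
    "domain_in_ip": [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    "qty_at_url":   [0, 0, 0, 0, 0, 0, 2, 3, 0, 0],
    "length_url":   [10, 10, 20, 30, 40, 40, 50, 60, 70, 80],
})
result = sample_targeted_and_general(toy, ["domain_in_ip", "qty_at_url"], 2, 3)
print(result)
```

Note that `sample` raises a `ValueError` if fewer unique rows are available than requested, which makes a shortfall visible instead of silently returning a smaller frame.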


In [8]:
# Re-calculate the proportion of 0s: every column improves, most notably `qty_at_url`
proportion_zeros = sampled_url_information[columns_to_check].apply(lambda x: (x == 0).sum() / len(x))
print(proportion_zeros)

domain_in_ip           0.962
url_shortened          0.960
qty_at_url             0.657
url_google_index       0.938
domain_google_index    0.928
dtype: float64


In [15]:
# Write the filtered and sampled datasets to new CSV files
filtered_malicious_urls.to_csv('../datasets/filtered_malicious_urls.csv', index=False)
sampled_malicious_urls.to_csv('../datasets/sampled_malicious_urls.csv', index=False)
sampled_url_information.to_csv('../datasets/sampled_url_information.csv', index=False)
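
A quick round-trip check can confirm that a file written this way reloads unchanged. The sketch below uses a made-up toy frame and a temporary directory instead of the project's `../datasets/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"length_url": [13, 329, 24], "phishing": [1, 1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sampled_toy.csv"
    toy.to_csv(csv_path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(csv_path)

# The reloaded frame should match the original in shape, columns, and values
assert reloaded.shape == toy.shape
assert list(reloaded.columns) == list(toy.columns)
assert reloaded.equals(toy)
```

Without `index=False`, the row index would be written as an extra unnamed column and would reappear as `Unnamed: 0` when the file is read back.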