## Identifying the features of a malicious URL

Based on a [study by the University of Huddersfield in the United Kingdom](https://eprints.hud.ac.uk/id/eprint/24330/6/MohammadPhishing14July2015.pdf), certain URL characteristics can signal a potential phishing website:

| Indicator              | Description                                                                                           |
|------------------------|-------------------------------------------------------------------------------------------------------|
| `IP Address Usage`       | Legitimate websites primarily use domain names. The presence of a raw IP address suggests an attempt to mask the site's true identity. |
| `Extended URL Length`    | Phishers craft long URLs to obscure the suspicious parts of the address, making the site harder to scrutinize. |
| `URL Shortening`         | Shortened links hide the full, potentially malicious, domain name.                                      |
| `'@' Symbol Inclusion`   | The "@" symbol causes browsers to ignore everything that precedes it, so the real destination is the address that follows the symbol. |
| `Multiple '//' Symbols`  | An extra "//" after the protocol portion of a URL may indicate that the user will be redirected to a different website, which is a common phishing tactic. |
| `Excessive Dots in Domain`  | A URL with more than two dots between the initial 'www' and the country-code Top-Level Domain (e.g., .sg, .uk, .ca) is suspicious, as it may reveal a multitude of nested subdomains. |
| `Lack of HTTPS`          | The absence of the secure HTTPS protocol can be a red flag, as legitimate sites often use HTTPS to ensure secure connections. |
| `Domain Dashes and Hyphens` | Phishers frequently insert prefixes or suffixes separated by dashes or hyphens into domain names, making them appear legitimate to unsuspecting users. |

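As a rough illustration of how several of the lexical indicators in the table could be computed from a raw URL string, here is a minimal sketch using only the Python standard library. The function name `extract_url_features`, the keys of the returned dictionary, and the `SHORTENER_HOSTS` list are assumptions made for this example; they are not the method used to build our dataset.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical, non-exhaustive list of shortening services (an assumption).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl", "ow.ly"}


def _is_ip_address(host: str) -> bool:
    """Return True if the host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Compute simple lexical indicators from a URL string."""
    parts = urlsplit(url)
    # `hostname` drops any "user@" prefix and port, leaving the real host.
    host = (parts.hostname or "").lower()
    return {
        "uses_ip_address": _is_ip_address(host),
        "url_length": len(url),
        "is_shortened": host in SHORTENER_HOSTS,
        "at_symbol_count": url.count("@"),
        "domain_dot_count": host.count("."),
        "uses_https": parts.scheme.lower() == "https",
        "domain_hyphen_count": host.count("-"),
    }


features = extract_url_features("https://www.example.com@evil-site.example/login")
print(features)
```

In this example the host that is actually contacted is `evil-site.example`, not `www.example.com`, which is why the "@" symbol is treated as a warning sign.
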
<br>

Additionally, our dataset includes non-URL-based features that the study also identifies as good indicators of potential phishing websites:

| Indicator          | Description                                                                                                  |
|--------------------|--------------------------------------------------------------------------------------------------------------|
| `Google-Index`       | Checks if the website appears in Google's search index. Phishing sites often have a short lifespan and may not be indexed.  |
| `Age of Domain`      | Determines how long the domain has been registered (retrieved from WHOIS data). Phishing sites tend to be newly registered. |
| `Domain Expiration`  | Examines the domain's expiration date. Legitimate domains are often renewed for extended periods, while phishing domains may have short expiration windows.    |

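To make the two WHOIS-based indicators concrete, the following sketch derives a domain's age and its remaining registration time from a creation date and an expiration date. The function name, the dictionary keys, and the dates are hypothetical; the WHOIS lookup itself is not shown, and the dataset's own columns may have been computed differently.

```python
from datetime import date


def domain_time_features(created: date, expires: date, observed: date) -> dict:
    """Return the domain age and remaining registration time, in days."""
    return {
        "days_since_registration": (observed - created).days,
        "days_until_expiration": (expires - observed).days,
    }


# Hypothetical domain registered for one year and observed a month later.
example = domain_time_features(
    created=date(2024, 1, 1),
    expires=date(2025, 1, 1),
    observed=date(2024, 1, 31),
)
print(example)  # {'days_since_registration': 30, 'days_until_expiration': 336}
```

A young domain with a short remaining registration window, as in this example, matches the pattern the table describes for phishing sites.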


We therefore select the following features from the dataset:
-   URL Characteristics:
    -   `domain_in_ip`: Whether the URL uses an IP address in place of a domain name.
    -   `length_url`: Number of characters in the URL.
    -   `url_shortened`: Whether the URL has been shortened.
    -   `qty_at_url`: Number of "@" symbols in the URL.
    -   `qty_dot_domain`: Number of dots in the domain.
    -   `tls_ssl_certificate`: Whether the site has a TLS/SSL certificate.
    -   `qty_hyphen_domain`: Number of hyphens in the domain.
    -   `qty_underline_domain`: Number of underscores in the domain.
-   Non-URL Characteristics:
    -   `url_google_index`: Whether the URL is indexed by Google.
    -   `domain_google_index`: Whether the domain is indexed by Google.
    -   `time_domain_activation`: Time since the domain was activated.
    -   `time_domain_expiration`: Time until the domain expires.

We excluded the `Multiple '//' Symbols` indicator from our analysis because our dataset only counts single '/' characters. These are common in legitimate URLs, where they separate directories and file paths, so the count alone cannot show whether a URL contains an embedded '//'.
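
The sketch below illustrates the difference with two made-up URLs that contain the same number of '/' characters, only one of which has a '//' after the protocol separator. The helper name `has_embedded_double_slash` is an assumption for this example and is not part of the dataset.

```python
def has_embedded_double_slash(url: str) -> bool:
    """Return True if '//' occurs anywhere after the scheme separator."""
    scheme_end = url.find("://")
    # Skip past "://" when a scheme is present; otherwise scan the whole string.
    start = scheme_end + 3 if scheme_end != -1 else 0
    return "//" in url[start:]


plain_url = "https://example.com/a/b/c/d"
redirect_url = "http://example.com//evil.example/x/y"

for url in (plain_url, redirect_url):
    print(url.count("/"), has_embedded_double_slash(url))
# 6 False
# 6 True
```

Both URLs contain six slashes, so a slash count alone would treat them identically.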

In [2]:
# Load the sampled datasets
import pandas as pd
sampled_url_information = pd.read_csv('../datasets/sampled_url_information.csv')
sampled_malicious_urls = pd.read_csv('../datasets/sampled_malicious_urls.csv')

In [5]:
# Keep only the selected feature columns and the `phishing` label
selected_columns = [
    'domain_in_ip', 'length_url', 'url_shortened', 'qty_at_url', 'qty_dot_domain', 'tls_ssl_certificate',
    'qty_hyphen_domain', 'qty_underline_domain', 'url_google_index', 'domain_google_index',
    'time_domain_activation', 'time_domain_expiration', 'phishing'
]

# .copy() makes the result an independent DataFrame rather than a view of the original
cleaned_url_information = sampled_url_information[selected_columns].copy()

In [6]:
# Write the cleaned dataset to a new CSV (without the index column)
cleaned_url_information.to_csv('../datasets/cleaned_url_information.csv', index=False)