# Phishing Website Classification: A Lightweight Bayesian Approach


---


90% of security breaches are a result of phishing attacks, in which attackers harvest the personal information of unsuspecting users under false pretenses ([source](https://spanning.com/blog/cyberattacks-2021-phishing-ransomware-data-breach-statistics/#:~:text=According%20to%20CISCO's%202021,14%20malicious%20emails%20every%20year.)). There was a 74% increase in phishing attacks per second in the last year, making the prevention of phishing attacks more pressing than ever ([source](https://www.phishingbox.com/resources/phishing-facts)). Phishing attacks rely on phishing websites — sites constructed to mimic real websites and steal victims' personal information — to breach their targets. These sites often look very similar to the real versions of sites, even using identical images, fonts, and other static content, as well as impersonating their URLs. 

So how can we prevent phishing attacks? One option is to stop phishing websites in their tracks, blocking them from loading in browsers before victims can give up their personal information. To implement a solution like this, we need a classification model that, given a website's features, can classify it as either phishing or legitimate.

In the following Colab, I will implement a lightweight naive Bayes classifier from scratch for phishing websites, using only the most impactful features. The dataset that I used for this project is from UC Irvine's Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/ml/datasets/phishing+websites#). **Adding complexity to this challenge, most of the features were not binary.** Additionally, I implemented this model before Problem Set 6 came out, so I took a different approach, opting to use the pure argmax definition and log probabilities. Using only this simple probabilistic model and only the top 5 most impactful features (out of 30 total), I was able to **classify phishing websites with an accuracy consistently in the range of 91%-93%**. No fancy gradient descent required!




In [4]:
import numpy as np
from sklearn.model_selection import train_test_split
from math import log

data_raw = np.genfromtxt(
    "./phishing-data.csv",
    delimiter=",",
    dtype=int,
    names=True,
)

# Split 30% of data off as test data, reserving 70% for training
data_train, data_test = train_test_split(data_raw, test_size=0.3, random_state=7) # Feel free to change random state
# Note: this slice drops the last test row, not the "Result" column.
# The labels stay in data_test because test() reads site["Result"].
data_test = data_test[:-1]
print(data_raw.dtype.names)
print(data_raw[:5])

('id', 'having_IP_Address', 'URL_Length', 'Shortining_Service', 'having_At_Symbol', 'double_slash_redirecting', 'Prefix_Suffix', 'having_Sub_Domain', 'SSLfinal_State', 'Domain_registeration_length', 'Favicon', 'port', 'HTTPS_token', 'Request_URL', 'URL_of_Anchor', 'Links_in_tags', 'SFH', 'Submitting_to_email', 'Abnormal_URL', 'Redirect', 'on_mouseover', 'RightClick', 'popUpWidnow', 'Iframe', 'age_of_domain', 'DNSRecord', 'web_traffic', 'Page_Rank', 'Google_Index', 'Links_pointing_to_page', 'Statistical_report', 'Result')
[(1, -1, 1,  1, 1, -1, -1, -1, -1, -1, 1, 1, -1,  1, -1,  1, -1, -1, -1, 0,  1, 1,  1, 1, -1, -1, -1, -1, 1,  1, -1, -1)
 (2,  1, 1,  1, 1,  1, -1,  0,  1, -1, 1, 1, -1,  1,  0, -1, -1,  1,  1, 0,  1, 1,  1, 1, -1, -1,  0, -1, 1,  1,  1, -1)
 (3,  1, 0,  1, 1,  1, -1, -1, -1, -1, 1, 1, -1,  1,  0, -1, -1, -1, -1, 0,  1, 1,  1, 1,  1, -1,  1, -1, 1,  0, -1, -1)
 (4,  1, 0,  1, 1,  1, -1, -1, -1,  1, 1, 1, -1, -1,  0,  0, -1,  1,  1, 0,  1, 1,  1, 1, -1, -1,  1, -1, 1, -

First, we have to load our data, and separate our training and test sets. The 30 columns indicating features of each website and the first 5 websites in the dataset are printed above. Importantly, note that most of these features are not binary, with their supports discretely indicating the degree of suspiciousness from -1 to 1. In fact, it required some sleuthing and analysis to determine that 1 indicated a phishing website and -1 indicated legitimate.

In [5]:
def calc_feature_impact(threshold, n_features):
    ratios = {}
    for i in range(1, len(data_raw[0]) - 1):
        feat_tot = phishing_feat = phishing_not_feat = 0
        for site in data_raw:
            if site[i] >= threshold:
                feat_tot += 1
                if site["Result"] == 1:
                    phishing_feat += 1
            else:
                if site["Result"] == 1:
                    phishing_not_feat += 1

        # Use Laplace smoothing to ensure no division by 0
        # Each ratio[i] is P(Y = 1 | X_i >= threshold) / P(Y = 1 | X_i < threshold)
        p_y1_given_xi_geq_thresh = ((phishing_feat + 1) / (feat_tot + 2))
        p_y1_given_xi_lt_thresh = ((phishing_not_feat + 1) / (len(data_raw) - feat_tot + 2))
        ratios[i] = p_y1_given_xi_geq_thresh / p_y1_given_xi_lt_thresh

    # Sort ratios dictionary by value
    sorted_ratios = sorted(ratios.items(), key=lambda x: x[1])
    result = sorted_ratios[len(sorted_ratios) - n_features :]

    print("\nTop {} Most Impactful Features (threshold = {})".format(n_features, threshold))
    for i in result:
        print(
            "P(phishing | {feat} >= {threshold}) / P(phishing | {feat} < {threshold}) = {val}".format(
                feat=data_raw.dtype.names[i[0]],
                threshold=threshold,
                val=i[1],
            )
        )
    
    return [data_raw.dtype.names[i[0]] for i in result]

My primary intention with this classifier was to make it as lightweight as possible while still being effective. I wanted a simple Bayesian classifier, without any more complex machine learning. I also wanted to minimize both the computation required to train the model and the amount of data required to store it. Therefore, my first step was to write a function that could tell me which features are the most impactful, so that I could include only their probabilities. A good measure of feature impact in this context is to compare the conditional probabilities that a website is a phishing website given a feature's different values. For every feature $X_i$ in the dataset, this function calculates $\frac{P(\text{phishing} \mid X_i \geq x)}{P(\text{phishing} \mid X_i < x)}$ for some threshold $x$. It then returns the top *n_features* most impactful features. I ultimately decided on a threshold of 0, meaning suspicious, because I wanted the features to be sorted by their impact given that the feature is rated as suspicious or greater. This choice, perhaps intuitively, also produced the greatest model accuracy.
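To make the ratio concrete, here is a minimal sketch that applies the same smoothed formula to a single made-up feature. The helper name `smoothed_ratio` and the toy rows are assumptions for illustration only and are not part of the notebook.

```python
def smoothed_ratio(rows, threshold):
    """rows: (feature_value, label) pairs, with label 1 = phishing and -1 = legitimate."""
    feat_tot = sum(1 for x, _ in rows if x >= threshold)
    phishing_feat = sum(1 for x, y in rows if x >= threshold and y == 1)
    phishing_not_feat = sum(1 for x, y in rows if x < threshold and y == 1)

    # Same add-one / add-two smoothing as calc_feature_impact
    p_geq = (phishing_feat + 1) / (feat_tot + 2)
    p_lt = (phishing_not_feat + 1) / (len(rows) - feat_tot + 2)
    return p_geq / p_lt

# Hypothetical data: 4 rows at or above the threshold (3 phishing),
# 4 rows below it (1 phishing)
toy_rows = [(1, 1), (1, 1), (0, 1), (0, -1), (-1, -1), (-1, -1), (-1, 1), (-1, -1)]
ratio = smoothed_ratio(toy_rows, threshold=0)
print(ratio)  # (4/6) / (2/6)
```

A ratio well above 1 means the feature being at or above the threshold goes together with a higher share of phishing sites, which is what the ranking sorts on.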


In [6]:
n_features = 5  # Feel free to try different values here!
most_impact = calc_feature_impact(threshold=0, n_features=n_features) # Impact given a feature is at least suspicious


Top 5 Most Impactful Features (threshold = 0)
P(phishing | Request_URL >= 0) / P(phishing | Request_URL < 0) = 1.6325491434634134
P(phishing | web_traffic >= 0) / P(phishing | web_traffic < 0) = 1.665135680769705
P(phishing | Prefix_Suffix >= 0) / P(phishing | Prefix_Suffix < 0) = 2.0425019147721932
P(phishing | SSLfinal_State >= 0) / P(phishing | SSLfinal_State < 0) = 5.290063905325444
P(phishing | URL_of_Anchor >= 0) / P(phishing | URL_of_Anchor < 0) = 69.88667072216911


In [7]:
def cond_prob(feat, feat_val, y, data):
    y_tot = 0
    feat_given_y = 0
    for site in data:
        if site["Result"] == y:
            y_tot += 1
            if site[feat] == feat_val:
                feat_given_y += 1

    # P(X_i = x | Y = y) with Laplace smoothing
    # Note: this adds one pseudo-count for X_i = x and one for X_i != x, regardless of the size of the support
    return (feat_given_y + 1) / (y_tot + 2)

This function uses Laplace smoothing to calculate $P(X_i=x \mid Y=y)$ for any feature $X_i$, any value $x$ of that feature, and any value $y$ of $Y$. $Y$ is the random variable representing whether a website is a phishing site ($Y=1$) or legitimate ($Y=-1$).
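Because `cond_prob` always adds 2 to the denominator, its estimates for a three-valued feature sum to slightly more than 1 within each class. That does not change which class wins by much, but a common variant adds the number of possible values instead. The sketch below shows that variant on made-up data; `smoothed_cond_prob` and its `n_values` parameter are hypothetical names, not the notebook's code.

```python
def smoothed_cond_prob(values, labels, x, y, n_values):
    """P(X = x | Y = y) with add-one smoothing over n_values possible values."""
    y_tot = sum(1 for label in labels if label == y)
    x_and_y = sum(1 for v, label in zip(values, labels) if label == y and v == x)
    return (x_and_y + 1) / (y_tot + n_values)

# Hypothetical feature column and labels (1 = phishing, -1 = legitimate)
values = [1, 1, 0, -1, -1, 1]
labels = [1, 1, 1, -1, -1, -1]

# With n_values=3 the three estimates for a class form a proper distribution
probs = [smoothed_cond_prob(values, labels, x, 1, n_values=3) for x in (-1, 0, 1)]
print(probs, sum(probs))

# With n_values=2 the formula matches the notebook's cond_prob
print(smoothed_cond_prob(values, labels, 1, 1, n_values=2))
```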

In [8]:
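def calc_support(X, data):
    # Caution: dict.fromkeys(X, set()) gives every key the SAME set object, so each
    # feature's "support" is the union of the values seen across all features in X.
    # Use {X_i: set() for X_i in X} if you want a separate support per feature.
    support = dict.fromkeys(X, set())
    for site in data:
        for X_i in X:
            support[X_i].add(site[X_i])
    return support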

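Because the features have varying supports, most of which are not binary, I created a function that collects the values each feature takes. This allowed me to later reference the supports in order to calculate the conditional probability for each possible value of each feature. One caveat: as written, all features share a single set, so every feature is assigned the union of the observed values ({-1, 0, 1} here); a value that a feature never takes simply receives a small smoothed probability.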
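Python's `dict.fromkeys` reuses one default object for every key, which matters for a function shaped like `calc_support`. A minimal sketch on made-up rows (the feature names `"a"` and `"b"` are hypothetical) contrasts that behavior with a dict comprehension that builds one set per feature:

```python
features = ["a", "b"]
rows = [{"a": 1, "b": -1}, {"a": 1, "b": 0}]

# One shared set: every key points at the same object
shared = dict.fromkeys(features, set())
for row in rows:
    for f in features:
        shared[f].add(row[f])

# One set per key
separate = {f: set() for f in features}
for row in rows:
    for f in features:
        separate[f].add(row[f])

print(shared["a"] is shared["b"])  # True: same set under both keys
print(shared["a"])                 # union of all values seen
print(separate)                    # per-feature supports
```

With per-feature supports the saved model would hold fewer entries, at the cost of needing a fallback for values that were never seen during training.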

In [9]:
def calc_probs_given_y(X, data):
    probs_given_y = {}
    support = calc_support(X, data)
    for X_i in X:
        probs_given_y[X_i] = {str(val): (cond_prob(X_i, val, -1, data), cond_prob(X_i, val, 1, data)) for val in support[X_i]}
    return probs_given_y

This function creates a dictionary that stores, for each feature and each value in its support, the probability of that value conditioned on the site being legitimate and on the site being a phishing website. Each entry in this data structure can be accessed as

```
probs_given_y[X_i][x][y]
```
which yields the conditional probability $P(X_i=x \mid Y=y)$ for any feature $X_i$ and any value $x$ of that feature. Here `x` is the value's string form (for example `"-1"`), and the last index is `0` for $Y=-1$ (legitimate) and `1` for $Y=1$ (phishing).
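As a small illustration of that layout, the sketch below rebuilds one feature's entry using the numbers from the model printout later in this notebook and reads it back. The `lookup` helper is a hypothetical convenience, not something the notebook defines.

```python
# Values copied from the notebook's printed model (SSLfinal_State only)
probs_given_y = {
    "SSLfinal_State": {
        "0": (0.23070200990387416, 0.0030169412856811324),
        "1": (0.14564520827264782, 0.9099559062427477),
        "-1": (0.6239440722400234, 0.08725922487816198),
    }
}

def lookup(probs_given_y, feature, x, y):
    """Hypothetical helper: P(feature = x | Y = y) for y in {-1, 1}."""
    y_index = 0 if y == -1 else 1  # tuple order is (Y = -1, Y = 1)
    return probs_given_y[feature][str(x)][y_index]

p_phish = lookup(probs_given_y, "SSLfinal_State", 1, 1)
p_legit = lookup(probs_given_y, "SSLfinal_State", 1, -1)
print(p_phish, p_legit)
```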


In [10]:
def calc_p_y(data):
    tot_phish = 0
    for i in data:
        if i["Result"] == 1:
            tot_phish += 1
    return tot_phish / len(data)

For our ultimate prediction, we need the values of $P(Y=1)$ and $P(Y=-1)$. Because $Y$ is binary, we only need to calculate $P(Y=1)$ because $P(Y=-1) = 1- P(Y=1)$.

In [11]:
def train(X, data):
    probs_given_y = calc_probs_given_y(X, data)
    p_y = calc_p_y(data)

    return (probs_given_y, p_y)

This function uses the supplied data to calculate all of the conditional probabilities for the given features, as well as the probability of websites being phishing sites. This comprises all of the necessary data for our classification model.

In [12]:
def prediction(site, X, probs_given_y, p_y):
    X_y_neg1 = log(1 - p_y)
    X_y_1 = log(p_y)

    for X_i in X:
        X_y_neg1 += log(probs_given_y[X_i][str(site[X_i])][0])
        X_y_1 += log(probs_given_y[X_i][str(site[X_i])][1])

    return X_y_1 > X_y_neg1

Here's where we actually implement naive Bayes! For each site, we want to determine whether it is more likely to be a phishing site or a legitimate site given its features, and use that judgment to classify it. To do so, we need to find $\text{argmax}_{y\in\{-1, 1\}} P(X_1, X_2, \ldots, X_n \mid Y=y)P(Y=y)$, where $X = (X_1, \ldots, X_n)$ is the vector of a site's features. To calculate that value, we make the naive Bayes assumption that the features are conditionally independent given $Y$, which is most likely not actually the case, but makes computation much easier. Given that assumption, we can simply calculate $\text{argmax}_{y\in\{-1, 1\}} P(Y=y)\prod_{i=1}^n P(X_i \mid Y=y)$. Then, because the logarithm is monotonically increasing, we can avoid multiplying many small numbers by instead calculating the equivalent $\text{argmax}_{y\in\{-1, 1\}} \log P(Y=y) + \sum_{i=1}^n \log P(X_i \mid Y=y)$. After calculating these sums, the function returns `True`, concluding that the site is a phishing website, if $P(X \mid Y=1)P(Y=1) > P(X \mid Y=-1)P(Y=-1)$, and `False` otherwise.
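With only five features, underflow is unlikely to bite here, but the following sketch shows why summing logs is the safer habit when many small probabilities are involved. The probabilities are made up for illustration.

```python
from math import log

# 100 hypothetical conditional probabilities, each equal to 1e-5
probs = [1e-5] * 100

# Direct product: 1e-500 is far below the smallest positive float
product = 1.0
for p in probs:
    product *= p

# Sum of logs: stays a perfectly ordinary number
log_sum = sum(log(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # about -1151.3
```

Two classes whose products both underflow to 0.0 cannot be compared, while their log sums still can.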

In [13]:
def test(X, train_data, test_data):
    correctly_phish = 0
    correctly_not_phish = 0
    predicted_phish = 0
    actual_phish = 0

    probs_given_y, p_y = train(X, train_data)

    for site in test_data:
        phishPredicted = prediction(site, X, probs_given_y, p_y)
        if site["Result"] == 1:
            actual_phish += 1
        if phishPredicted:
            predicted_phish += 1
        if phishPredicted and site["Result"] == 1:
            correctly_phish += 1
        if not phishPredicted and site["Result"] == -1:
            correctly_not_phish += 1

    print("Accuracy: {}".format((correctly_phish + correctly_not_phish) / len(test_data)))
    print("Precision: {}".format(correctly_phish / predicted_phish))
    print("Recall: {}".format(correctly_phish / actual_phish))

This function first trains the naive Bayes classifier using the training data, and then predicts the status of each website in the test data using that model, while counting how many classifications it got correct.

In [14]:
print("Test results using most impactful top {} features".format(n_features))
test(most_impact, data_train, data_test)

Test results using most impactful top 5 features
Accuracy: 0.9291314837153196
Precision: 0.9233997901364114
Recall: 0.9518658734451054


These results are fantastic! Accuracy is the fraction of sites that were classified correctly, precision is the fraction of sites flagged as phishing that really are phishing, and recall is the fraction of actual phishing sites that were flagged. According to these numbers (if you left all the variables at their defaults), this model classifies 93% of websites correctly and flags 95% of phishing websites. Best of all, the only data we needed to perform these predictions was $\sum_{i=1}^5 |\text{support}(X_i)| \cdot |\text{support}(Y)| = 5 \cdot 3 \cdot 2 = 30$ entries in our feature probability data, plus one more for $P(Y=1)$, totalling only 31 values stored to achieve this accuracy (note: for all features in the top 5, $|\text{support}(X_i)| = 3$, and because $Y$ is binary, $|\text{support}(Y)| = 2$).
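For readers who prefer to see the three metrics in terms of confusion-matrix counts, here is a short sketch. The function name and the counts are hypothetical and are not the notebook's actual tallies.

```python
def classification_metrics(tp, fp, fn, tn):
    """tp/fp/fn/tn: true positives, false positives, false negatives, true negatives."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total   # share of all sites classified correctly
    precision = tp / (tp + fp)     # share of flagged sites that are phishing
    recall = tp / (tp + fn)        # share of phishing sites that were flagged
    return accuracy, precision, recall

# Made-up counts for illustration
accuracy, precision, recall = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(accuracy, precision, recall)
```

In `test`, `correctly_phish` plays the role of `tp`, `predicted_phish` is `tp + fp`, and `actual_phish` is `tp + fn`.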

In [15]:
import json

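def save_model(probs_given_y, p_y):
    # Copy the top-level dict so adding "p_y" does not mutate the caller's probs_given_y
    model = dict(probs_given_y)
    model["p_y"] = p_y

    # JSON stores the (legitimate, phishing) tuples as two-element lists
    with open("model.txt", "w") as fp:
        json.dump(model, fp)

def load_model():
    with open("model.txt", "r") as fp:
        model = json.load(fp)

    return model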

probs_given_y, p_y = train(most_impact, data_train)
save_model(probs_given_y, p_y)
model = load_model()
print(model)

{'Request_URL': {'0': [0.00029129041654529564, 0.00023207240659085636], '1': [0.4614040198077483, 0.7017869575307496], '-1': [0.5385959801922516, 0.2982130424692504]}, 'web_traffic': {'0': [0.35217011360326245, 0.1413320956138315], '1': [0.304398485289834, 0.6992341610582502], '-1': [0.34372269152344886, 0.15966581573450916]}, 'Prefix_Suffix': {'0': [0.00029129041654529564, 0.00023207240659085636], '1': [0.00029129041654529564, 0.23044789974472035], '-1': [0.9997087095834547, 0.7695521002552796]}, 'SSLfinal_State': {'0': [0.23070200990387416, 0.0030169412856811324], '1': [0.14564520827264782, 0.9099559062427477], '-1': [0.6239440722400234, 0.08725922487816198]}, 'URL_of_Anchor': {'0': [0.3108068744538305, 0.6289162218612206], '1': [0.030002912904165454, 0.3641216059410536], '-1': [0.6594815030585494, 0.007194244604316547]}, 'p_y': 0.5566037735849056}


Here's the entire tiny model in all of its glory! I've exported it here as JSON, because a great application of this model would be a simple JavaScript web app for classifying websites. It's pretty cool that we can classify phishing websites with 93% accuracy using only these 31 values!
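The loaded model can be used for prediction directly, as long as the code accounts for the JSON round trip: the `"p_y"` entry sits next to the feature tables, and each tuple comes back as a two-element list. The sketch below assumes that layout; `predict_from_model` is a hypothetical helper, and the model here is a two-feature subset of the printed values, so its predictions are illustrative rather than those of the full five-feature model.

```python
import json
from math import log

def predict_from_model(model, site):
    """Return True if the site is classified as phishing (hypothetical helper)."""
    p_y = model["p_y"]
    score_legit = log(1 - p_y)
    score_phish = log(p_y)
    for feature, table in model.items():
        if feature == "p_y":
            continue  # the prior is stored alongside the feature tables
        p_legit, p_phish = table[str(site[feature])]
        score_legit += log(p_legit)
        score_phish += log(p_phish)
    return score_phish > score_legit

# Two-feature subset of the printed model, passed through JSON like the saved file
model = json.loads(json.dumps({
    "SSLfinal_State": {
        "0": [0.23070200990387416, 0.0030169412856811324],
        "1": [0.14564520827264782, 0.9099559062427477],
        "-1": [0.6239440722400234, 0.08725922487816198],
    },
    "URL_of_Anchor": {
        "0": [0.3108068744538305, 0.6289162218612206],
        "1": [0.030002912904165454, 0.3641216059410536],
        "-1": [0.6594815030585494, 0.007194244604316547],
    },
    "p_y": 0.5566037735849056,
}))

suspicious = predict_from_model(model, {"SSLfinal_State": 1, "URL_of_Anchor": 0})
benign = predict_from_model(model, {"SSLfinal_State": -1, "URL_of_Anchor": -1})
print(suspicious, benign)
```

The same loop translates almost line for line to JavaScript, which is what makes the JSON export convenient.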

In [16]:
# Make sure you have fields here for each top feature selected! Otherwise, you'll run into errors.
real_example = {
    "Request_URL":-1, 
    "web_traffic":1, 
    "Prefix_Suffix":1, 
    "SSLfinal_State":0, 
    "URL_of_Anchor":1
}

probs_given_y, p_y = train(most_impact, data_train)
print(prediction(real_example, most_impact, probs_given_y, p_y))

True


This *real_example* data is from a real phishing website linked in a phishing text I got earlier this year as part of a scam targeting Sam's Club customers, and the model classifies it correctly! Feel free to mess around with the values and see how they impact the prediction.

Thanks for reading! I think it's awesome that we can leverage foundational probabilistic concepts to be so effective at performing a classification that could greatly reduce the tangible damage of phishing attacks.