# Assignment: Decision trees and random forests

## 1. Business understanding

Phishing websites are a major cybersecurity threat, aiming to steal sensitive information such as passwords or credit card details. Detecting them early is essential to protect users from financial loss, identity theft, and reputational damage.

The goal of this project is to test whether phishing websites can be reliably identified using simple, automatically collected features such as SSL certificate status, URL patterns, web traffic, and link structure. A successful model could support the development of an automated warning system that alerts users before they enter a phishing site.

Two approaches will be used:

1. **Decision Trees** – to create a transparent, human-readable model that can be directly translated into analyst instructions.

2. **Random Forests** – to improve predictive accuracy through an ensemble of decision trees.

The project seeks to balance interpretability (decision tree rules that an internet analyst can understand) and accuracy (improved prediction through random forests). If successful, the models can form the foundation of a practical phishing detection tool and contribute to safer online interactions for end users.


We are using the Phishing dataset that is available at the UCI Machine Learning Repository: [Phishing Websites Data Set](https://archive.ics.uci.edu/dataset/327/phishing+websites). The target variable Result indicates whether a website is a phishing site or not.

Note: The dataset documentation does not seem to explain the -1 and 1 values in the Result column. It may be helpful to know that a `1` corresponds to a phishing site and a `-1` to a legitimate site.

Our goal is to find out whether a website can be reliably classified as phishing or legitimate based on easily obtainable information about it. If so, the result could be used to build an automated system that warns users when they are about to visit a phishing website.


## 2. Data understanding

Here we fetch the dataset with the `ucimlrepo` Python package, the import method provided by the repository. We also print the dataset metadata and information on the variables it contains.

The values of the Result variable are read as follows:

1. `-1` = legitimate
2. `0` = suspicious
3. `1` = phishing

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
phishing_websites = fetch_ucirepo(id=327)

# data (as pandas dataframes)
X = phishing_websites.data.features
y = phishing_websites.data.targets

# metadata
display(phishing_websites.metadata)

# variable information
display(phishing_websites.variables)


The variables contained in this dataset are:

| Feature                    | Explanation                                                                                   |
|-----------------------------|-----------------------------------------------------------------------------------------------|
| Using the IP Address        | If the URL uses an IP address instead of a domain name (phishers often hide domain names).    |
| URL-Length                  | Very long URLs can be suspicious (used to hide malicious parts).                              |
| Shortening-Service          | Use of services like bit.ly or tinyurl can hide the true destination.                         |
| having-At-Symbol            | An “@” symbol in a URL may redirect to a different site.                                      |
| double-slash-redirecting    | Extra `//` after the protocol may indicate redirection tricks.                                |
| Prefix-Suffix               | Use of a hyphen “-” in the domain (e.g., paypal-security.com) often mimics real sites.        |
| having-Sub-Domain           | Too many subdomains (e.g., login.bank.example.phish.com) can be a trick.                      |
| SSLfinal-State              | Checks if the SSL certificate is valid (fake or expired SSL is a warning sign).               |
| Domain-registration-length  | Domains registered for a very short time are more likely to be malicious.                     |
| Favicon                     | A favicon loaded from an external domain (not the main site) can signal phishing.             |
| port                        | Use of uncommon or suspicious ports instead of standard ones (80/443).                        |
| HTTPS-token                 | Misuse of “https” inside the domain name (e.g., https-login.com) to fake security.            |
| Request-URL                 | Percentage of external objects (images, scripts, etc.) loaded from outside domains.           |
| URL-of-Anchor               | Percentage of anchor (`<a>`) tags leading to outside or empty links.                          |
| Links-in-tags               | Percentage of links inside `<meta>`, `<script>`, and `<link>` tags pointing outside.          |
| SFH (Server Form Handler)   | Where a form submits data (if empty or external, suspicious).                                 |
| Submitting-to-email         | Forms that submit directly to an email instead of a server.                                   |
| Abnormal_URL                | Whether the host name in the URL matches the domain's WHOIS record (a mismatch suggests phishing). |
| Redirect                    | Number of times the site redirects (too many = suspicious).                                   |
| On-mouseover                | If hovering changes the link shown in the status bar (a common phishing trick).               |
| RightClick                  | Disabling right-click to prevent users from inspecting elements or code.                      |
| popUpWindow                 | Presence of pop-ups, often used in scams.                                                     |
| Iframe                      | Use of hidden frames (`<iframe>`) to load content secretly from other sites.                  |
| Age-of-domain               | Newly created domains are more likely to be phishing.                                         |
| DNSRecord                   | Missing or abnormal DNS records may indicate a fake site.                                     |
| Web-traffic                 | A very low traffic rank suggests the site is not popular and possibly not legitimate.         |
| Page-Rank                   | A low Google PageRank suggests the site is not widely trusted.                                |
| Google-Index                | If the site is not indexed by Google, it may be suspicious.                                   |
| Links-pointing-to-page      | Few or no inbound links suggest a fake site.                                                  |
| Statistical-report          | Matches known phishing/malware sites in public blacklists/statistical reports.                |







## 3. Data preparation

The first rows of the features and the target:

In [None]:
display(X.head())
display(y.head())

We check whether the data contains any invalid values by printing the minimum and maximum of each column.

In [None]:
display(X.min())
display(X.max())
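The min/max printout only shows the extremes of each column. A slightly stricter check is sketched below on a small made-up frame. The column names and the assumption that every feature is coded as -1, 0 or 1 are illustrative and should be confirmed against the variable information printed earlier.

```python
import pandas as pd

# Hypothetical stand-in for the feature frame X; the column names are made up.
toy = pd.DataFrame({
    "feature_a": [-1, 0, 1, 1],
    "feature_b": [1, 1, -1, 7],  # 7 is a deliberately invalid value
})

ALLOWED = [-1, 0, 1]  # assumed encoding of the features

# True for every cell whose value is one of the allowed codes
valid_mask = toy.isin(ALLOWED)

# Columns that contain at least one unexpected value
bad_columns = [col for col in toy.columns if not valid_mask[col].all()]

# Missing values per column (a NaN would also fail the isin check)
missing_per_column = toy.isna().sum()

print("Columns with unexpected values:", bad_columns)
print("Missing values per column:")
print(missing_per_column)
```

Applied to the real data, the same lines would run on `X` instead of `toy`.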


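The minimum and maximum values show that every feature stays within the expected range, so we conclude that the dataset is valid.
There is no need to standardize the data because we are building decision trees: a tree splits on one feature at a time by comparing it to a threshold, so rescaling a feature does not change which splits are possible.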
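The claim that standardization is unnecessary for decision trees can be checked with a small experiment. The sketch below uses synthetic data (not the phishing dataset) with a made-up target, fits one tree on the raw values and one on standardized values, and compares their predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 rows, 5 features coded as -1, 0 or 1
X_toy = rng.choice([-1, 0, 1], size=(200, 5)).astype(float)
# Made-up target that depends on the first two features
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)

# Standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_toy)

# Same hyperparameters and seed, fitted once on raw and once on scaled data
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_toy)

same_predictions = np.array_equal(
    tree_raw.predict(X_toy), tree_scaled.predict(X_scaled)
)
print("Identical predictions with and without scaling:", same_predictions)
```

Standardization keeps the ordering of the values within each feature, so both trees partition the rows in the same way.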

### Part 1 Decision tree

Your initial goal is to construct a small yet useful decision tree that predicts whether a website is a phishing site or not.

The outcome should contain the following:

1. An image of the final decision tree.

2. Evaluation metrics for the decision tree.

3. Written instructions for an internet analyst to manually make the decision of whether the website is likely to be a phishing site or not. The instructions must match one-to-one with your decision tree, and be written in a way that is understandable to an engineer who is aware of the basics of internet technologies.

Creating a tree classifier

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree  


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=20) # extract test set
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## 4. Modeling

Building a decision tree

In [None]:
tree = DecisionTreeClassifier(max_depth=5, random_state=666, min_impurity_decrease=0.02)
tree.fit(X_train, y_train)

Visualize tree

In [None]:
fig = plt.figure(figsize=(30, 10))
plot_tree(
    tree,
    feature_names=list(X.columns),  # a plain list works across scikit-learn versions
    class_names=["Legitimate (-1)", "Phishing (1)"],  # same order as the sorted class labels
    fontsize=12  # larger text → larger boxes
)
plt.show()
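As a text alternative to the plot, `sklearn.tree.export_text` prints the fitted rules line by line, which can make it easier to write instructions that match the tree one-to-one. The sketch below runs on a tiny made-up frame; the column names `ssl_state` and `url_of_anchor` are placeholders and not necessarily the names used in the dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data; the column names are placeholders.
toy_X = pd.DataFrame({
    "ssl_state":     [1, 1, 1, 1, 1, 1, -1, -1, -1, -1],
    "url_of_anchor": [1, -1, 1, -1, -1, -1, 1, 1, -1, -1],
})
# Assumed label coding: 1 = phishing, -1 = legitimate
toy_y = [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy_X, toy_y)

# One line per split and per leaf, indented by depth
rules = export_text(toy_tree, feature_names=list(toy_X.columns))
print(rules)
```

On the real model the call would be `export_text(tree, feature_names=list(X.columns))`.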

Written instructions for the analyst, following the tree from the root:

1. Check whether the web page has a trusted SSL certificate. If it does, treat the site as legitimate.

2. If the certificate is not trusted, inspect the `<a>` tags in the page's HTML. If they link to a domain other than that of the page itself, treat the site as phishing.

3. Otherwise, treat the site as legitimate.
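The written rule can also be expressed as a small function, which is a convenient way to confirm that the instructions cover every case. This is a minimal sketch: the function name and its two boolean arguments are hypothetical and stand for observations the analyst collects by hand.

```python
def classify_site(has_trusted_ssl: bool, anchors_point_elsewhere: bool) -> str:
    """Apply the two-step rule from the written instructions.

    Both arguments are hypothetical, hand-collected observations.
    """
    # Step 1: a trusted SSL certificate ends the check immediately
    if has_trusted_ssl:
        return "legitimate"
    # Step 2: without trusted SSL, anchors linking to other domains indicate phishing
    if anchors_point_elsewhere:
        return "phishing"
    # Step 3: no trusted SSL, but the anchors stay on the site's own domain
    return "legitimate"


# Enumerate all four combinations of the two observations
for ssl_ok in (True, False):
    for anchors_off in (True, False):
        print(ssl_ok, anchors_off, "->", classify_site(ssl_ok, anchors_off))
```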

## 5. Evaluation

Testing the performance of the classifier on the test set

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

y_pred2 = tree.predict(X_test)
accuracy_test = accuracy_score(y_test, y_pred2)
display(f"Accuracy of tree classifier on the test set: {accuracy_test:.2f}")
confusion_matrix(y_test, y_pred2)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Evaluate on the held-out test set so the matrix reflects unseen data
y_pred = tree.predict(X_test)
labels = tree.classes_  # class labels in sorted order
cm = confusion_matrix(y_test, y_pred, labels=labels)
cmd = ConfusionMatrixDisplay(cm, display_labels=labels)
cmd.plot(cmap="Reds")
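Accuracy alone does not show which kind of error the classifier makes, and for a warning system a missed phishing site and a false alarm have different costs. The sketch below computes precision and recall for the phishing class on a handful of made-up labels, assuming the coding 1 = phishing and -1 = legitimate. The label lists are illustrative, not results from the model.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for illustration: 1 = phishing, -1 = legitimate
y_true_toy = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred_toy = [1, 1, -1, -1, -1, 1, -1, 1]

accuracy = accuracy_score(y_true_toy, y_pred_toy)
# Precision: share of predicted phishing sites that really are phishing
precision = precision_score(y_true_toy, y_pred_toy, pos_label=1)
# Recall: share of real phishing sites that the classifier caught
recall = recall_score(y_true_toy, y_pred_toy, pos_label=1)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

For the fitted tree, the same calls would take `y_test` and the test-set predictions.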

A tree of depth 5 predicted worse, with more misclassified values. A tree of depth 10 predicted better, so we settled on a depth of 10.
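A depth comparison of this kind could be run as sketched below. The data here is synthetic (generated with `make_classification`), so the printed accuracies say nothing about the phishing dataset; only the loop structure is meant to carry over.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phishing data
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=30, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=20
)

# Fit one tree per candidate depth and record its held-out accuracy
depth_scores = {}
for depth in (5, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    depth_scores[depth] = accuracy_score(y_te, model.predict(X_te))
    print(f"max_depth={depth}: accuracy={depth_scores[depth]:.3f}")
```

If the depth is chosen this way, a separate validation split or cross-validation is usually preferable to reusing the test set, so that the final test accuracy remains an unbiased estimate.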
