In [None]:
# Project Description

# I chose the **Phishing Website Dataset** from Kaggle because it is directly relevant to cybersecurity, an increasingly important field that affects everyone. The dataset contains features extracted from real websites and is used to classify whether a given website is legitimate or a phishing attempt.

# This project is interesting because it shows how various small attributes (like the presence of an IP address, abnormal URLs, or HTTPS usage) can reveal malicious intent. Analyzing these features can help improve digital safety and deepen understanding of online threats.


In [None]:
# ## 2️⃣ System Stage – Phishing Website Dataset

# - **File name:** phishing.csv  
# - **File size:** Approximately a few MB  
# - **File type:** CSV (Comma-Separated Values)  
# - **Source:** Kaggle - Phishing Website Detector  
# - **Protocol:** Downloaded via HTTPS  
# - **Versioning:**  
#   - Only one file provided; no formal versions  
#   - Project version control is managed via Git  


In [None]:
# ## 3️⃣ Metadata

# - **Data Types:**  
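#   All features are integer-coded. Per the feature table below, values are `1` or `-1` (not 0/1); the unique-value check later in the notebook reports how many distinct values each column takes. The target column is named `class`:  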
#   - `1` = legitimate website  
#   - `-1` = phishing website  

# - **Missing Values:**  
#   No missing values were found in this dataset. All rows are complete and usable for modeling.

# - **Special Values:**  
#   No special placeholder values (like “unknown” or -999) were found. The data is clean and well-formatted for direct use in machine learning.

# ---

# ### Feature Explanation for Phishing Website Dataset

# | Feature Name        | Description                                                                                   |
# |---------------------|-----------------------------------------------------------------------------------------------|
# | **Index**           | A unique identifier or serial number for each sample (not used in modeling).                   |
# | **UsingIP**         | Whether the website URL uses an IP address instead of a domain name (1 = yes, -1 = no).        |
# | **LongURL**         | Whether the URL is unusually long (1 = yes, -1 = no).                                         |
# | **ShortURL**        | Whether the URL uses a URL-shortening service (1 = yes, -1 = no).                             |
# | **Symbol@**         | Presence of '@' symbol in URL (1 = yes, -1 = no).                                             |
# | **Redirecting//**   | Whether the URL contains '//' after the protocol part (1 = yes, -1 = no).                      |
# | **PrefixSuffix-**   | Use of hyphen '-' in the domain name (1 = yes, -1 = no).                                      |
# | **SubDomains**      | Number of subdomains (1 = more subdomains than usual, -1 = normal).                            |
# | **HTTPS**           | Whether the website uses HTTPS protocol (1 = yes, -1 = no).                                   |
# | **DomainRegLen**    | Length of domain registration (1 = short registration period, -1 = long).                      |
# | **Favicon**         | Whether the favicon is loaded from the same domain (1 = yes, -1 = no).                        |
# | **NonStdPort**      | Use of non-standard port in URL (1 = yes, -1 = no).                                           |
# | **HTTPSDomainURL**  | Whether HTTPS is present in the domain name part of the URL (1 = yes, -1 = no).               |
# | **RequestURL**      | Whether resources are loaded from an external domain (1 = yes, -1 = no).                      |
# | **AnchorURL**       | Whether anchor tags link to external domains (1 = yes, -1 = no).                             |
# | **LinksInScriptTags** | Presence of links inside script tags (1 = yes, -1 = no).                                    |
# | **ServerFormHandler** | Whether the form handler is on an external server (1 = yes, -1 = no).                       |
# | **InfoEmail**       | Whether the website contains email information (1 = yes, -1 = no).                           |
# | **AbnormalURL**     | Whether the URL has abnormalities (1 = yes, -1 = no).                                        |
# | **WebsiteForwarding** | Whether the website forwards to another URL (1 = yes, -1 = no).                            |
# | **StatusBarCust**   | Whether the status bar is customized (1 = yes, -1 = no).                                     |
# | **DisableRightClick** | Whether right-click is disabled (1 = yes, -1 = no).                                        |
# | **UsingPopupWindow** | Presence of popup windows (1 = yes, -1 = no).                                               |
# | **IframeRedirection** | Use of iframes for redirection (1 = yes, -1 = no).                                          |
# | **AgeofDomain**     | Domain age (1 = young domain, -1 = old).                                                     |
# | **DNSRecording**    | Whether the domain is recorded in DNS (1 = yes, -1 = no).                                   |
# | **WebsiteTraffic**  | Website traffic rank (1 = low traffic, -1 = high).                                          |
# | **PageRank**        | Page rank of the website (1 = low, -1 = high).                                              |
# | **GoogleIndex**     | Whether the website is indexed by Google (1 = no, -1 = yes).                                |
# | **LinksPointingToPage** | Number of links pointing to the page (1 = few, -1 = many).                              |
# | **StatsReport**     | Whether there are statistical reports about the site (1 = no, -1 = yes).                    |
# | **class**           | Target variable: `1` = legitimate website, `-1` = phishing website.                         |
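
The claims in the Metadata section (no missing values, a small set of integer codes per column) can be checked directly. The following is a minimal sketch, not the notebook's own code: `summarize_columns` is a hypothetical helper, and the `toy` frame is made-up data that only borrows two column names from the table above. On the real data one would pass `df` instead.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and sorted distinct values per column."""
    records = []
    for col in frame.columns:
        series = frame[col]
        records.append({
            "column": col,
            "dtype": str(series.dtype),
            "n_missing": int(series.isnull().sum()),
            "n_unique": int(series.nunique()),
            "values": sorted(series.dropna().unique().tolist()),
        })
    return pd.DataFrame(records).set_index("column")


# Hypothetical stand-in for phishing.csv (values are invented for illustration)
toy = pd.DataFrame({
    "UsingIP": [1, -1, 1, -1],
    "SubDomains": [-1, 0, 1, 0],
    "class": [1, -1, 1, -1],
})

summary = summarize_columns(toy)
print(summary)
```

Any column whose `values` list is not `[-1, 1]` is worth a second look before describing the features as binary.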



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
df = pd.read_csv("phishing.csv")
df.columns

In [None]:
# Compute correlation matrix on the current DataFrame (excluding non-numeric columns if any)
corr_matrix = df.corr(numeric_only=True)

# Plot heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

In [None]:
# `X` is not defined until a later cell, so inspect the raw DataFrame here
for col in df.columns:
    unique_vals = df[col].nunique()
    print(f"Feature '{col}' has {unique_vals} unique values.")

In [None]:
# The heatmap shows that Favicon and UsingPopupWindow are highly correlated, so one of them is redundant.
# PCA won't work well here because a lot of the features aren't correlated.
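
The visual reading of the heatmap can be cross-checked numerically by listing the feature pairs whose absolute correlation exceeds a cutoff. This is a sketch under stated assumptions: `correlated_pairs` is a hypothetical helper, the 0.8 threshold is an arbitrary choice, and the `toy` frame is synthetic data standing in for `df`.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs with |Pearson correlation| >= threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(mask).stack()
    return stacked[stacked >= threshold].sort_values(ascending=False)


# Synthetic stand-in: "b" is a near-copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=200)
b = a.copy()
b[:5] *= -1  # flip a few entries so the correlation is high but not exactly 1
toy = pd.DataFrame({"a": a, "b": b, "c": rng.choice([-1, 1], size=200)})

pairs = correlated_pairs(toy, threshold=0.8)
print(pairs)
```

On the real data, `correlated_pairs(df.drop(columns=["class"]))` would list the candidates for removal instead of relying on colour alone.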

In [None]:
## Outlier Analysis

### 1. Outliers in Individual Features

import matplotlib.pyplot as plt
import seaborn as sns

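# `X` is created in a later cell, so build the feature view from `df` here
feature_df = df.drop(columns=['class', 'Index'], errors='ignore')

plt.figure(figsize=(15, 20))
for i, col in enumerate(feature_df.columns, 1):
    plt.subplot(len(feature_df.columns)//3 + 1, 3, i)
    sns.boxplot(x=feature_df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()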


In [None]:
features_to_remove = [
    'Index',               # Just a row ID
    'AgeofDomain',         # May be missing or unreliable
    'DisableRightClick',   # Many legitimate sites also disable right-click
    'IframeRedirection',   # Used on both phishing and legit sites
    'StatusBarCust',       # Many legit sites customize the status bar
    'StatsReport',         # May contain outdated or irrelevant stats
    'NonStdPort',          # Rarely used and may not be useful
    'WebsiteForwarding',   # Unreliable for classification
    'InfoEmail',           # Just the presence of email, weak signal
    'Favicon',             # Correlated with UsingPopupWindow (comma added: without it the two strings were concatenated)
    'UsingPopupWindow'     # Many legitimate sites use popups for various reasons
]

# Separate target first before dropping
y = df['class']

# Drop only columns that exist and exclude 'class' from dropping here
cols_to_drop = [col for col in features_to_remove if col in df.columns]
X = df.drop(columns=cols_to_drop + ['class'])  # drop unwanted features and the target column
X.columns


In [None]:
# Optional: remove outliers using Isolation Forest (contamination=0.01)
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.01, random_state=42)
outlier_flags = iso.fit_predict(X)
mask_inliers = outlier_flags == 1

X, y = X.loc[mask_inliers], y.loc[mask_inliers]  # Use .loc for label-based filtering



In [None]:
X.dtypes

In [None]:
print("Dataset shape:", X.shape)
X.info()

In [None]:
X.describe()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Print class distribution
print("\nClass distribution:")
print(y.value_counts())

# Plot class distribution
plt.figure(figsize=(6, 4))
sns.countplot(x=y)
plt.title('Class Distribution: Legitimate (1) vs Phishing (-1)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.impute import SimpleImputer

# Count missing values before imputation
missing_before = X.isnull().sum()

# Impute missing values (returns NumPy array)
imputer = SimpleImputer(strategy='median')
X_imputed_array = imputer.fit_transform(X)

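# Convert back to DataFrame with original column names and row index
# (keeping the index keeps the rows aligned with `y` after outlier filtering)
X_imputed = pd.DataFrame(X_imputed_array, columns=X.columns, index=X.index)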

# Count missing values after imputation
missing_after = X_imputed.isnull().sum()

# Plot missing values
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
sns.barplot(x=missing_before.index, y=missing_before.values, color='skyblue')
plt.title("Missing Values Before Imputation")
plt.xticks(rotation=90)

plt.subplot(1, 2, 2)
sns.barplot(x=missing_after.index, y=missing_after.values, color='lightgreen')
plt.title("Missing Values After Imputation")
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()


In [None]:
binary_cols = [col for col in df.columns if df[col].nunique() == 2]

for col in binary_cols:
    sns.countplot(x=col, data=df)
    plt.title(f"Distribution of {col}")
    plt.show()


In [None]:
# Calculate proportion of 1s for each feature in X
feature_summary = (X == 1).sum() / len(X)

# Plot feature presence rate
plt.figure(figsize=(12, 6))
feature_summary.sort_values(ascending=False).plot(kind='bar')
plt.title('Feature Presence Rate (Proportion of 1s per Feature)')
plt.ylabel('Proportion of 1s')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt

# === Split features and target ===
# Assumes you already have `X` and `y` from the preprocessed DataFrame
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# === Standardize numeric features ===
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# === Train the KNN model ===
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# === Training-set error and accuracy ===
# KNN has no iterative training, so there are no real epochs. The model is
# evaluated once and the value is repeated only so the plot below has a series.
y_pred_train = knn.predict(X_train_scaled)
train_error = np.mean(y_pred_train != y_train)  # misclassification rate
train_loss = [train_error] * 10
train_accuracy = [accuracy_score(y_train, y_pred_train)] * 10

# === Evaluate on test set ===
y_pred_test = knn.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred_test)
print(f"Overall model accuracy: {test_accuracy:.4f}")

# === Feature Importance using Permutation Importance ===
result = permutation_importance(knn, X_test_scaled, y_test, n_repeats=10, random_state=42)
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': result.importances_mean
}).sort_values(by='importance', ascending=False)

# === Plot Feature Importances ===
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'], color='skyblue')
plt.xlabel("Mean Importance Score")
plt.title("Permutation Feature Importance (KNN)")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# === Plot Training Loss & Accuracy ===
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(train_loss) + 1), train_loss, marker='o', label='Training Loss')
plt.plot(range(1, len(train_accuracy) + 1), train_accuracy, marker='o', label='Training Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Value')
plt.title('KNN Training Loss and Accuracy (Simulated)')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
# === Split features and target ===
# Assumes you already have `X` and `y` from the preprocessed DataFrame
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# === Standardize numeric features ===
# (Random Forests don't require feature scaling, but scaling won't hurt)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# === Train Random Forest model ===
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

# === Evaluate on test set ===
y_pred_test = rf.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred_test)
print(f"Overall model accuracy: {test_accuracy:.4f}")

# === Feature Importances from Random Forest ===
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values(by='importance', ascending=False)

# === Plot Feature Importances ===
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'], color='forestgreen')
plt.xlabel("Feature Importance")
plt.title("Random Forest Feature Importance")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
# Feature Importance Plot
importances = rf.feature_importances_
indices = importances.argsort()[::-1]
features = X.columns

plt.figure(figsize=(12,6))
sns.barplot(x=importances[indices], y=features[indices])
plt.title('Random Forest Feature Importances')
plt.show()


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming you already have your trained model `rf` and test data X_test_scaled, y_test
y_pred_test = rf.predict(X_test_scaled)

# Calculate basic performance metrics
accuracy = accuracy_score(y_test, y_pred_test)
precision = precision_score(y_test, y_pred_test, pos_label=1)
recall = recall_score(y_test, y_pred_test, pos_label=1)
f1 = f1_score(y_test, y_pred_test, pos_label=1)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_test))

# Confusion Matrix plot
cm = confusion_matrix(y_test, y_pred_test)
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Phishing', 'Legit'], yticklabels=['Phishing', 'Legit'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

# ROC Curve and AUC (if your model provides probability predictions)
if hasattr(rf, "predict_proba"):
    y_scores = rf.predict_proba(X_test_scaled)[:,1]  # Probability of positive class (legit)
    fpr, tpr, thresholds = roc_curve(y_test, y_scores, pos_label=1)
    auc_score = roc_auc_score(y_test, y_scores)

    plt.figure(figsize=(6,5))
    plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
    plt.plot([0,1],[0,1],'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend()
    plt.show()


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming you have your trained KNN model `knn`, and test data X_test_scaled, y_test

# Predict test labels
y_pred_test = knn.predict(X_test_scaled)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred_test)
precision = precision_score(y_test, y_pred_test, pos_label=1)
recall = recall_score(y_test, y_pred_test, pos_label=1)
f1 = f1_score(y_test, y_pred_test, pos_label=1)

print(f"KNN Model Accuracy: {accuracy:.4f}")
print(f"KNN Model Precision: {precision:.4f}")
print(f"KNN Model Recall: {recall:.4f}")
print(f"KNN Model F1-Score: {f1:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_test))

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Phishing', 'Legit'], yticklabels=['Phishing', 'Legit'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('KNN Confusion Matrix')
plt.show()

# ROC Curve and AUC for KNN (if predict_proba available)
if hasattr(knn, "predict_proba"):
    y_scores = knn.predict_proba(X_test_scaled)[:,1]  # Probability for class '1' (legit)
    fpr, tpr, thresholds = roc_curve(y_test, y_scores, pos_label=1)
    auc_score = roc_auc_score(y_test, y_scores)

    plt.figure(figsize=(6,5))
    plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
    plt.plot([0,1],[0,1],'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('KNN ROC Curve')
    plt.legend()
    plt.show()
else:
    print("KNN model does not support probability prediction; ROC curve cannot be plotted.")


In [None]:
# ## Conclusion

# The **K-Nearest Neighbors (KNN)** classifier performed well on the phishing website dataset, achieving an accuracy of approximately **93.85%**, with precision, recall, and F1-scores all above **94%**. Using a **Random Forest** classifier improved the performance further, reaching an accuracy of about **96.11%** and similarly high precision, recall, and F1-scores.

# This indicates that both models are effective at distinguishing legitimate websites from phishing attempts. Random Forest shows better overall classification metrics, likely because it can capture more complex relationships between features.

# ---

# ## How Can We Improve the Model?
# To improve the model's performance further, consider the following strategies:
# 1. **Feature Engineering:**  
#    - Analyze feature importance and remove irrelevant or noisy features.  
#    - Create new features that might better capture phishing behavior.

# 2. **Hyperparameter Tuning:**  
#    - For KNN: Tune the number of neighbors (`k`), distance metrics, and weighting.  
#    - For Random Forest: Tune number of trees, max depth, minimum samples per leaf, etc.

# 3. **Handling Class Imbalance:**  
#    - Verify class balance; if imbalanced, consider SMOTE, class weighting, or balanced sampling.

# 4. **Cross-Validation:**  
#    - Use k-fold cross-validation to better evaluate model generalization.

# 5. **Model Comparison:**  
#    - Experiment with other algorithms like Gradient Boosting, SVM, or Neural Networks.
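
The hyperparameter tuning and cross-validation suggestions above can be sketched together. This is an illustrative example, not the notebook's method: `X_demo` and `y_demo` are synthetic data from `make_classification` standing in for `X` and `y`, and the grid values are arbitrary assumptions. Putting the scaler inside a `Pipeline` means it is refit on each training fold, so no information from the validation fold leaks into scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the phishing features and labels (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=42
)

# Scaling and KNN are tuned as one unit
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid; the ranges are not taken from the notebook
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.4f}")
```

The same pattern applies to the Random Forest by swapping the estimator and using keys such as `n_estimators`, `max_depth`, and `min_samples_leaf` in the grid.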

# ---

# ## Potential Problems and Limitations

# - **Data Quality and Feature Reliability:**  
#   Some features may be noisy or weak predictors, potentially impacting model generalization.

# - **Overfitting Risk:**  
#   Random Forests can overfit if hyperparameters are not carefully tuned.

# - **Real-world Applicability:**  
#   Dataset may not reflect all modern phishing techniques, requiring periodic model updates.

# - **Interpretability:**  
#   While KNN is simple, Random Forests are complex; interpretability tools like SHAP or feature importance should be used to understand predictions.
