# Overview & Imports

**Purpose of This Notebook:**  
This notebook defines two key classes for feature extraction from web pages:

- **Top20FeaturesExtraction:**  
    Quickly extracts a concise set of 20 features that capture the most important signals from both the HTML content and the URL of a web page. This is useful for fast, lightweight analysis or as input to machine learning models that require a small, informative feature set.

- **FeaturesExtraction:**  
    Extracts a comprehensive set of features from the web page, including:
    - **HTML static features:** Structure, tags, and content of the page.
    - **Dynamic JavaScript features:** Behaviors like mouse tracking, pop-ups, and redirects.
    - **Password-related features:** Presence and characteristics of password fields.
    - **Suspicious JavaScript features:** Detection of potentially malicious scripts.
    - **URL-based features:** Analysis of the web address for signs of phishing or abnormality.

**Imports:**  
- `FeatureHTML`: A helper for extracting features from the HTML and JavaScript of the page.
- `URLFeature`: A helper for extracting features based on the structure and content of the URL.

**Note:**  
All code is well-commented to explain the purpose and logic of each section, making it easier to follow and modify as needed.


In [5]:
import os

print(os.getcwd())

c:\Users\USER\Desktop\MLproject\ayman\dataset


In [1]:
# Load & execute the other notebooks so their classes/functions become available.
# %run executes each notebook in this kernel's namespace, so FeatureHTML and
# URLFeature should be defined here once the two lines below finish.
%run ./features_html.ipynb
%run ./features_url.ipynb

# No `from features_html import ...` is needed: the helpers live in .ipynb files,
# not importable .py modules, so a plain import raises ModuleNotFoundError.


An error occurred during feature extraction: URLFeature() takes no arguments


# Top20FeaturesExtraction

**Purpose:**  
Quickly compute a 20‐element feature vector that summarizes the most important characteristics of a web page for tasks like phishing detection or website classification.

---

## How It Works

- **14 HTML‐based features** (via `FeatureHTML`):  
    These capture the structure and content of the web page, such as:
    - Number of links (`<a href=...>`)
    - Number of lists (`<ul>`, `<ol>`)
    - Total text length
    - Number of images, forms, scripts, meta tags, etc.
    - Presence of certain tags (like `<link>`)
    - Counts of elements like `<div>`, `<span>`, `<p>`, etc.

- **6 URL‐based features** (via `URLFeature`):  
    These analyze the web address for signs of abnormality, including:
    - Presence of hyphens in the domain (often used in phishing)
    - Number of subdomains
    - URL length (long URLs can be suspicious)
    - Use of URL shorteners
    - Number of times the site forwards/redirects
    - Number of external links pointing to the page
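
As an illustration of the URL checks listed above, here is a minimal sketch of three of them using only the standard library. The function name `simple_url_signals`, the 75-character threshold, and the subdomain rule are assumptions made for this example; the actual `URLFeature` implementation lives in `features_url.ipynb` and may differ.

```python
from urllib.parse import urlparse


def simple_url_signals(url: str, long_threshold: int = 75) -> dict:
    """Compute three illustrative URL signals (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    labels = host.split(".") if host else []
    return {
        # A hyphen anywhere in the host name
        "has_hyphen": "-" in host,
        # Naive rule: everything left of "domain.tld" counts as a subdomain.
        # This ignores multi-part suffixes such as "co.uk".
        "subdomain_count": max(len(labels) - 2, 0),
        # Assumed cut-off; tune it for your own data
        "is_long": len(url) >= long_threshold,
    }


signals = simple_url_signals("http://secure-login.example.com/path")
print(signals)
```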

---

## Why Use This?

- **Fast & Lightweight:**  
    Extracts only the most informative features, making it suitable for real-time or large-scale analysis.
- **Model-Ready:**  
    Returns a simple Python list of 20 values, ready to be fed into machine learning models or further analysis pipelines.
- **Balanced:**  
    Combines both content-based (HTML) and address-based (URL) signals for a more complete picture of the web page.

---

**In summary:**  
`Top20FeaturesExtraction` is a quick, practical tool to turn a web page into a compact, informative feature vector for security and classification tasks.


In [2]:
class Top20FeaturesExtraction:
    def __init__(self, driver, soup, url):
        """
        Initialize the Top20FeaturesExtraction class.

        Why do we do this?
        - We need to extract features from both the HTML content and the URL of a web page.
        - To do this efficiently, we use two specialized helper classes:
            - FeatureHTML: extracts structural and content-based features from the HTML (using BeautifulSoup and Selenium for dynamic content).
            - URLFeature: extracts features based on the structure and characteristics of the URL.
        - By storing the driver, soup, and url, we ensure that both helpers have access to all necessary information for feature extraction.
        - This setup allows us to quickly compute a compact, informative feature vector for downstream analysis or machine learning.

        Parameters:
        - driver: Selenium WebDriver instance (for dynamic content and JS execution)
        - soup: BeautifulSoup object (for parsing and analyzing HTML)
        - url: The URL string of the web page
        """
        self.url = url
        self.driver = driver
        self.soup = soup
        self.html_features = FeatureHTML(driver, soup)
        self.url_features = URLFeature(url)

    def create_vector(self):
        # --- 2.1: Extract 14 HTML-based features from the page ---
        # Each feature is a numeric or boolean value describing some aspect of the HTML
        html_features = [
            self.html_features.number_of_href(),            # Number of <a href="..."> links
            self.html_features.number_of_list(),            # Number of <ul> and <ol> list elements
            self.html_features.length_of_text(),            # Total length of all visible text on the page
            self.html_features.number_of_a(),               # Number of <a> (anchor) tags
            self.html_features.has_link(),                  # Boolean: presence of any <link> tags (e.g., CSS, favicon)
            self.html_features.number_of_hidden_element(),  # Number of elements hidden from view (e.g., style="display:none")
            self.html_features.number_of_div(),             # Number of <div> elements (layout containers)
            self.html_features.number_of_forms(),           # Number of <form> elements (user input forms)
            self.html_features.number_of_images(),          # Number of <img> tags (images)
            self.html_features.number_of_script(),          # Number of <script> tags (JavaScript)
            self.html_features.number_of_meta(),            # Number of <meta> tags (metadata)
            self.html_features.length_of_title(),           # Length of the <title> tag text
            self.html_features.number_of_paragraph(),       # Number of <p> (paragraph) tags
            self.html_features.number_of_span()             # Number of <span> tags (inline containers)
        ]

        # --- 2.2: Extract 6 URL-based features from the page's URL ---
        # These features describe the structure and characteristics of the URL
        url_features = [
            self.url_features.prefixSuffix(),       # Boolean: is there a '-' (hyphen) in the domain? (often suspicious)
            self.url_features.WebsiteForwarding(),  # Number of times the site forwards/redirects
            self.url_features.SubDomains(),         # Number of subdomains in the URL
            self.url_features.longUrl(),            # Boolean: is the URL unusually long?
            self.url_features.shortUrl(),           # Boolean: is a URL shortener service used?
            self.url_features.LinksPointingToPage(),  # Number of external links pointing to this page
        ]

        # --- 2.3: Combine both HTML and URL features into a single 20-element vector ---
        # This vector can be used as input to machine learning models or for further analysis
        return html_features + url_features
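
Because `create_vector` returns a bare list, it can help to pair each value with a name when inspecting results. The sketch below assumes the order used in `create_vector` above; `TOP20_FEATURE_NAMES` and `label_vector` are hypothetical helpers, not part of the original classes.

```python
# Names follow the method calls in Top20FeaturesExtraction.create_vector(),
# in the same order (assumption: that order does not change).
TOP20_FEATURE_NAMES = [
    # 14 HTML-based features
    "number_of_href", "number_of_list", "length_of_text", "number_of_a",
    "has_link", "number_of_hidden_element", "number_of_div", "number_of_forms",
    "number_of_images", "number_of_script", "number_of_meta", "length_of_title",
    "number_of_paragraph", "number_of_span",
    # 6 URL-based features
    "prefixSuffix", "WebsiteForwarding", "SubDomains", "longUrl", "shortUrl",
    "LinksPointingToPage",
]


def label_vector(vector):
    """Return {feature_name: value}; fail loudly if the length is unexpected."""
    if len(vector) != len(TOP20_FEATURE_NAMES):
        raise ValueError(
            f"expected {len(TOP20_FEATURE_NAMES)} values, got {len(vector)}"
        )
    return dict(zip(TOP20_FEATURE_NAMES, vector))


# Example with placeholder values standing in for a real feature vector
labeled = label_vector(list(range(20)))
print(labeled["length_of_text"], labeled["shortUrl"])
```

The length check doubles as a cheap guard: if a feature is added to or removed from `create_vector`, the mismatch surfaces immediately instead of silently shifting columns.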


# FeaturesExtraction: HTML & JS Features

**Purpose:**  
Extract a comprehensive set of features from a web page, covering both its static HTML structure and dynamic JavaScript behaviors.

---

## What Features Are Extracted?

### 1. **HTML Static Features**
These describe the visible and structural elements of the page:
- **Title & Headers:**  
    - Presence and length of the `<title>` tag.
    - Existence of header tags (`<h1>`, `<h2>`, `<h3>`).
- **Inputs & Forms:**  
    - Counts of all `<input>` elements, including specific types like email and text.
    - Number of `<form>` tags and whether forms contain password fields.
    - Presence of submit buttons and text areas.
- **Links & Navigation:**  
    - Number of hyperlinks (`<a href=...>`), anchor tags, and external links.
    - Counts of navigation elements (`<nav>`, lists, menus).
- **Media & Layout:**  
    - Number of images (`<img>`), videos, sources, tables, divs, spans, and meta tags.
    - Presence of footers and iframes.
- **Scripts & Metadata:**  
    - Number of `<script>` and `<meta>` tags.
    - Links inside script tags and favicon usage.
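
To make the static features above concrete, here is a rough sketch that counts tags and measures the title. It deliberately swaps BeautifulSoup for the standard-library `html.parser` so it runs without extra dependencies; `TagCounter` and `static_html_signals` are hypothetical names, and the real `FeatureHTML` methods may compute these values differently.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags and collect the <title> text (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def static_html_signals(html: str) -> dict:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    return {
        "has_title": bool(title),
        "length_of_title": len(title),
        "number_of_inputs": parser.counts["input"],
        "number_of_forms": parser.counts["form"],
        "has_h1": parser.counts["h1"] > 0,
    }


page = (
    "<html><head><title>Login</title></head><body><h1>Hi</h1>"
    "<form><input type='text'><input type='password'></form></body></html>"
)
print(static_html_signals(page))
```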

### 2. **Dynamic JavaScript Features**
These capture behaviors that require JavaScript execution or dynamic analysis:
- **Mouse & Keyboard Monitoring:**  
    - Detection of scripts tracking mouse movements or keystrokes (potential keyloggers).
- **Pop-ups & Redirects:**  
    - Presence of pop-up windows, meta/JS redirects, and form-based redirections.
- **Hidden Elements:**  
    - Count of elements hidden via CSS or JavaScript.
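
One of the dynamic signals above, meta/JS redirects, can be approximated from the page source alone. The sketch below is an assumption about how such a check might look, not the notebook's `page_redirect()` implementation: `detect_redirects` and the regular expression are hypothetical, and a static scan like this will miss redirects built at runtime.

```python
import re
from html.parser import HTMLParser

# Matches assignments such as `window.location = ...` or `location.href = ...`
# and calls such as `location.replace(...)`; `==` comparisons are excluded.
JS_REDIRECT = re.compile(
    r"location(?:\.href)?\s*=(?!=)|location\.(?:replace|assign)\s*\("
)


class RedirectScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta_refresh = False
        self.js_redirect = False
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr_map.get("http-equiv", "").lower() == "refresh":
            self.meta_refresh = True
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Only inspect text that sits inside <script> elements
        if self._in_script and JS_REDIRECT.search(data):
            self.js_redirect = True


def detect_redirects(html: str) -> dict:
    scanner = RedirectScanner()
    scanner.feed(html)
    scanner.close()
    return {"meta_refresh": scanner.meta_refresh, "js_redirect": scanner.js_redirect}


sample = (
    '<meta http-equiv="refresh" content="0; url=http://example.com">'
    '<script>window.location.href = "http://example.com";</script>'
)
print(detect_redirects(sample))
```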

### 3. **Password-Related Features**
- **Password Inputs:**  
    - Number of password fields and hidden password inputs.
    - Detection of password fields by name or ID.
    - Whether a password field is inside a form.
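
The four password signals above map naturally onto a small dictionary. The following sketch reuses the key names that `create_vector` reads from `check_password_fields()`, but the logic is an assumption: `scan_password_fields`, the hint substrings, and the hidden-field rule are hypothetical, and the standard-library parser stands in for BeautifulSoup and Selenium.

```python
from html.parser import HTMLParser

PASSWORD_HINTS = ("pass", "pwd")  # assumed hint substrings for name/id


class PasswordFieldScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self._form_depth = 0
        self.result = {
            "password_type_count": 0,
            "password_name_id_count": 0,
            "hidden_password_count": 0,
            "form_with_password": False,
        }

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._form_depth += 1
            return
        if tag != "input":
            return
        attr_map = {name: (value or "").lower() for name, value in attrs}
        if attr_map.get("type") == "password":
            self.result["password_type_count"] += 1
            if self._form_depth > 0:
                self.result["form_with_password"] = True
            # Assumed rule: `hidden` attribute or inline display:none
            style = attr_map.get("style", "").replace(" ", "")
            if "hidden" in attr_map or "display:none" in style:
                self.result["hidden_password_count"] += 1
        # Any <input> whose name or id hints at a password is counted here,
        # even when its type is not "password"
        label = attr_map.get("name", "") + " " + attr_map.get("id", "")
        if any(hint in label for hint in PASSWORD_HINTS):
            self.result["password_name_id_count"] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._form_depth > 0:
            self._form_depth -= 1


def scan_password_fields(html: str) -> dict:
    scanner = PasswordFieldScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.result


sample = (
    '<form><input type="password" name="pwd">'
    '<input type="password" id="user_pass" style="display: none"></form>'
    '<input type="text" name="password2">'
)
print(scan_password_fields(sample))
```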

### 4. **Suspicious JavaScript Features**
- **Clipboard Monitoring:**  
    - Scripts that access or monitor the clipboard.
- **Form Data Collection:**  
    - JavaScript collecting form data outside normal submission.
- **Cookie Manipulation:**  
    - Scripts that read, write, or manipulate cookies.
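
A simple way to approximate the three checks above is keyword matching on script source. This is only a sketch under that assumption: the patterns and the `scan_script` helper are hypothetical, the dictionary keys mirror those read from `check_suspicious_js()` in `create_vector`, and plain pattern matching will both miss obfuscated code and flag benign uses.

```python
import re

# Assumed indicator patterns, one per feature key
SUSPICIOUS_PATTERNS = {
    "clipboard_monitoring": re.compile(
        r"navigator\.clipboard|clipboardData|execCommand\(\s*['\"](?:copy|cut|paste)['\"]",
        re.IGNORECASE,
    ),
    "form_data_collection": re.compile(
        r"new\s+FormData\s*\(|\.serialize\s*\(", re.IGNORECASE
    ),
    "cookie_manipulation": re.compile(r"document\.cookie", re.IGNORECASE),
}


def scan_script(js_source: str) -> dict:
    """Return one boolean per feature: True if its pattern occurs in the script."""
    return {
        name: bool(pattern.search(js_source))
        for name, pattern in SUSPICIOUS_PATTERNS.items()
    }


print(scan_script("var c = document.cookie; navigator.clipboard.readText();"))
```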
---
**In summary:**  
The `FeaturesExtraction` class transforms a web page into a detailed feature vector, capturing both what the page looks like (HTML) and how it behaves (JavaScript), as well as specific security-relevant signals. This enables robust analysis for security, classification, or research purposes.


In [3]:
class FeaturesExtraction:
    def __init__(self, driver, soup, url):
        self.url = url                              # the page URL
        self.driver = driver                        # Selenium driver instance
        self.soup = soup                            # BeautifulSoup parsed HTML
        self.html_features = FeatureHTML(driver, soup)  # HTML/JS feature helper
        self.url_features = URLFeature(url)         # URL feature helper

    def create_vector(self):
        # 3.1 HTML static features
        html_features = [
            self.html_features.has_title(),               # title tag exists?
            self.html_features.has_submit(),              # <input type="submit">?
            self.html_features.has_link(),                # <link> tags exist?
            self.html_features.has_email_input(),         # <input type="email">?
            self.html_features.number_of_inputs(),        # count all <input>
            self.html_features.number_of_buttons(),       # count <button>
            self.html_features.number_of_images(),        # count <img>
            self.html_features.number_of_option(),        # count <option>
            self.html_features.number_of_list(),          # count <ul>,<ol>
            self.html_features.number_of_href(),          # count <a href=...> links
            self.html_features.number_of_paragraph(),     # count <p>
            self.html_features.number_of_script(),        # count <script>
            self.html_features.length_of_title(),         # title length
            self.html_features.has_h1(),                  # <h1> tag exists?
            self.html_features.has_h2(),                  # <h2> tag exists?
            self.html_features.has_h3(),                  # <h3> tag exists?
            self.html_features.length_of_text(),          # length of all text
            self.html_features.number_of_clickable_button(),  # clickable buttons
            self.html_features.number_of_a(),             # count <a>
            self.html_features.number_of_div(),           # count <div>
            self.html_features.has_footer(),              # <footer> exists?
            self.html_features.number_of_forms(),         # count <form>
            self.html_features.has_text_area(),           # <textarea> exists?
            self.html_features.has_iframe(),              # <iframe> exists?
            self.html_features.has_text_input(),          # <input type="text">?
            self.html_features.number_of_meta(),          # count <meta>
            self.html_features.has_nav(),                 # <nav> exists?
            self.html_features.number_of_sources(),       # count <source>
            self.html_features.number_of_span(),          # count <span>
            self.html_features.number_of_table(),         # count <table>
            self.html_features.RequestURL(),              # percentage of external requests
            self.html_features.AnchorURL(),               # external anchor ratio
            self.html_features.Favicon(),                 # external favicon?
            self.html_features.LinksInScriptTags(),       # links inside scripts
            self.html_features.ServerFormHandler(),       # server‐side form action?
            self.html_features.InfoEmail(),               # mailto: usage
            # 3.2 Dynamic features
            self.html_features.has_mouse_tracking(),      # JS mouse tracking?
            self.html_features.has_keyboard_monitoring(), # JS keylogger?
            self.html_features.has_popups(),              # popup usage?
            self.html_features.number_of_hidden_element(),# hidden elements count
            self.html_features.page_redirect(),           # meta/JS redirects?
            self.html_features.form_redirect_behavior(),  # form redirect JS?
            self.html_features.check_external_form_action()  # external form post?
        ]


## FeaturesExtraction: Password, Suspicious JS & URL Features

### **Purpose:**  
This section adds more specialized and security-focused features to the feature vector, making the analysis more robust for detecting phishing, malicious behavior, or unusual web page characteristics.

---

### What’s Included?

### **1. Password Field Features**
- **Counts of password inputs:**  
    Measures how many `<input type="password">` fields are present.  
    *Why?* Phishing sites often use password fields to steal credentials.
- **Hidden password fields:**  
    Detects password fields that are hidden from the user (e.g., via CSS or JavaScript).  
    *Why?* Hidden fields can be used for malicious purposes, such as capturing passwords without user knowledge.
- **Forms containing passwords:**  
    Checks if password fields are inside a `<form>`.  
    *Why?* Legitimate sites usually wrap password fields in forms; absence may be suspicious.
- **Password name/id hints:**  
    Looks for password fields identified by name or ID attributes (e.g., "pwd", "pass").  
    *Why?* Attackers may use misleading names to hide malicious fields.
---
### **2. Suspicious JavaScript Features**
- **Clipboard monitoring:**  
    Detects scripts that access or monitor the clipboard.  
    *Why?* Malicious sites may try to steal copied passwords or sensitive data.
- **Form data collection:**  
    Finds JavaScript that collects form data outside normal submission.  
    *Why?* This can indicate attempts to steal user input.
- **Cookie manipulation:**  
    Identifies scripts that read, write, or manipulate cookies.  
    *Why?* Manipulating cookies can be used for session hijacking or tracking.
---
### **3. URL-Based Features**
- **IP address in URL:**  
    Checks if the URL uses a raw IP instead of a domain name.  
    *Why?* Phishing sites often use IPs to avoid detection.
- **URL length and shortener usage:**  
    Flags unusually long URLs or use of URL shorteners.  
    *Why?* Both can be used to obfuscate malicious links.
- **Suspicious symbols:**  
    Looks for characters like `@`, `-`, or multiple subdomains.  
    *Why?* These are common in deceptive URLs.
- **Redirection and forwarding:**  
    Counts how many times the page redirects or forwards.  
    *Why?* Excessive redirection can hide the true destination.
- **Domain age and registration length:**  
    Measures how old the domain is and how long it’s registered for.  
    *Why?* New or short-lived domains are often used for attacks.
- **Traffic, PageRank, Google index status:**  
    Checks if the site is popular, ranked, or indexed by Google.  
    *Why?* Legitimate sites usually have some web presence.
- **Backlinks and stats reports:**  
    Counts external links pointing to the page and checks for Alexa/statistics reports.  
    *Why?* More backlinks and stats indicate legitimacy.
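
Several of the URL checks above need nothing more than string parsing. The sketch below shows four of them with the standard library; `lexical_url_checks` and its rules are assumptions for illustration, not the code behind `UsingIp()`, `symbol()`, `redirecting()`, or `NonStdPort()`. Checks such as domain age, traffic, or indexing depend on external lookups and are not covered here.

```python
import ipaddress
from urllib.parse import urlparse


def lexical_url_checks(url: str) -> dict:
    """Four string-level URL checks (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Raw IP address instead of a domain name
    try:
        ipaddress.ip_address(host)
        using_ip = True
    except ValueError:
        using_ip = False

    # "//" anywhere after the scheme separator suggests an embedded redirect
    after_scheme = url.split("://", 1)[-1]

    return {
        "using_ip": using_ip,
        "has_at_symbol": "@" in url,
        "double_slash_redirect": "//" in after_scheme,
        # Assumed rule: any explicit port other than 80 or 443
        "non_standard_port": parsed.port is not None and parsed.port not in (80, 443),
    }


print(lexical_url_checks("http://user@192.168.0.1:8080/login//next"))
```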
---
**In summary:**  
These features help detect subtle signs of phishing, fraud, or abnormal web behavior by analyzing both the page content (passwords, scripts) and the URL itself. This comprehensive approach strengthens security analysis and classification.


In [4]:
# Continuation of FeaturesExtraction.create_vector() from the previous cell.
# These lines use `self` and `html_features`, so they only work inside that method;
# run as a standalone cell they raise "NameError: name 'self' is not defined".
# Append them to the method body in the class cell above.

        # 3.3 Password field features (check_password_fields() returns a dict)
        password_features = self.html_features.check_password_fields()
        html_features.extend([
            password_features['password_type_count'],     # count of password inputs
            password_features['password_name_id_count'],  # name/id hints for pwd
            password_features['hidden_password_count'],   # hidden password inputs
            password_features['form_with_password'],      # form wrapping password?
        ])

        # 3.4 Suspicious JavaScript features (check_suspicious_js() returns a dict)
        js_features = self.html_features.check_suspicious_js()
        html_features.extend([
            js_features['clipboard_monitoring'],          # clipboard JS?
            js_features['form_data_collection'],          # form data JS?
            js_features['cookie_manipulation'],           # JS cookie ops?
        ])

        # 3.5 URL-based features
        url_features = [
            self.url_features.UsingIp(),                  # IP address in URL?
            self.url_features.longUrl(),                  # long URL length?
            self.url_features.shortUrl(),                 # URL shortener?
            self.url_features.symbol(),                   # suspicious symbols?
            self.url_features.redirecting(),              # "//" redirection
            self.url_features.prefixSuffix(),             # "-" in domain
            self.url_features.SubDomains(),               # subdomain count
            self.url_features.DomainRegLen(),             # registration length
            self.url_features.NonStdPort(),               # non-standard port?
            self.url_features.HTTPSDomainURL(),           # HTTPS in domain?
            self.url_features.AbnormalURL(),              # abnormal URL pattern
            self.url_features.WebsiteForwarding(),        # forwarding count
            self.url_features.StatusBarCust(),            # status bar customization
            self.url_features.DisableRightClick(),        # right-click disabled?
            self.url_features.UsingPopupWindow(),         # popup windows?
            self.url_features.IframeRedirection(),        # iframe redirection?
            self.url_features.AgeofDomain(),              # domain age
            self.url_features.DNSRecording(),             # DNS record age
            self.url_features.WebsiteTraffic(),           # traffic rank
            self.url_features.PageRank(),                 # Google PageRank
            self.url_features.GoogleIndex(),              # indexed by Google?
            self.url_features.LinksPointingToPage(),      # backlinks count
            self.url_features.StatsReport(),              # Alexa/stats report
        ]

        # 3.6 Return final combined feature vector
        return html_features + url_features