# URL-Based Phishing Detection Feature Extraction

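This notebook demonstrates how to extract heuristic features from a URL to help detect phishing websites. The features follow a 30-item numbering scheme, of which 24 are computed here; the remaining ones are commented out in the original code. The workflow is organized as follows: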

---

## 1. Overview & Imports

We introduce the problem: **phishing detection using URL-based features**. The notebook lists all required libraries for lexical analysis, HTTP/WHOIS lookups, and web parsing.

---

## 2. URLFeature Class

A `URLFeature` class is defined to:
- Store the URL and its parsed components.
- Fetch HTTP and WHOIS data.
- Compute the features as 0/1 flags, one per heuristic. The meaning of 1 depends on the feature: the lexical checks return 1 when a suspicious pattern is found, while several later checks return 1 for the benign case.

---

## 3. Feature Extraction Methods

Features are grouped by type, using the numbering of the 30-item scheme:
- **Lexical Heuristics (1–7):** Analyze URL structure (e.g., length, use of an IP address, suspicious symbols).
- **Scheme, Registration & Port (8–12):** Check HTTPS usage, domain registration length, explicit ports, and `https` embedded in the domain. Feature 10 (favicon) is commented out.
- **HTML/JS Content (18–23):** Look for suspicious HTML/JavaScript patterns. Features 13–17 are commented out.
- **Domain Age & Traffic (24–26):** Use WHOIS and Alexa data for domain reputation.
- **External Checks (27–30):** PageRank, Google indexing, link count, and known bad domains and IPs.

Each feature is implemented as a method in the class.

---

## 4. Test Harness

A helper function allows you to quickly test feature extraction on any URL, printing the 0/1 results for all features.

---

**Summary:**  
This notebook provides a modular, extensible framework for extracting phishing-relevant features from URLs, suitable for use in machine learning or rule-based phishing detection systems.

# 1. Overview & Imports

**Purpose:**  
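We’ll build a `URLFeature` class that computes URL‑based signals (e.g. URL length, IP use, domain age) to help detect phishing. The signals follow a 30-item numbering scheme; 24 of them are computed here. Each signal is a 0/1 flag, but the polarity is not uniform: features 1–7, 18, and 20–23 return **1** when a suspicious pattern is found, whereas features 8, 9, 11, 12, 19, 24–28, and 30 return **1** for the benign case. Feature 29 returns **1** when the page contains no links.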

**Key Libraries:**
- `ipaddress`, `re`, `socket` for URL lexical checks  
- `requests`, `urllib`, `whois` for HTTP + WHOIS lookups  
- `BeautifulSoup` (optional) for parsing HTML if needed  
- `googlesearch` to check if the URL is indexed by Google  
- `datetime` to compute domain age  
- `urlparse` to break the URL into parts  
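
As a quick illustration of the last item, the sketch below shows what `urlparse` yields for a single URL. The URL is a made-up example, not one taken from the notebook's data.

```python
from urllib.parse import urlparse

# Hypothetical example URL, chosen only so that every component is populated.
parts = urlparse("http://login.example.com:8080/account/update?id=7")

scheme = parts.scheme      # URL scheme, e.g. "http"
netloc = parts.netloc      # network location: host plus port when one is given
hostname = parts.hostname  # host only, lowercased, with the port stripped
port = parts.port          # port as an int, or None when no port is given
path = parts.path          # path portion after the network location

print(scheme, netloc, hostname, port, path)
```

Note that `netloc` keeps the port while `hostname` drops it, which matters for any check that inspects only the domain name.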


## 2. Python Imports & URLFeature Class Definition

The first Python cell imports all the libraries needed for URL analysis and defines the skeleton of the `URLFeature` class.

**Imports:**
- `ipaddress`: Checks if a URL uses an IP address (common in phishing).
- `re`: Regular expressions for pattern matching in URLs and HTML.
- `urllib.request`: Used for fetching data from the web (e.g., Alexa rank).
- `BeautifulSoup`: Parses HTML content to extract features.
- `socket`: Handles DNS lookups and network operations.
- `requests`: Makes HTTP requests to fetch web pages.
- `googlesearch`: Checks if a URL is indexed by Google.
- `whois`: Retrieves domain registration info (WHOIS data).
- `datetime`: Handles date calculations (e.g., domain age).
- `urlparse`: Breaks URLs into components (scheme, domain, path, etc.).

**Class Skeleton:**
- Defines a `URLFeature` class to encapsulate all feature extraction logic.
- The `features` list will store the 30 heuristic features for each URL.
- The class will later be expanded with methods to extract each feature.

In [21]:
import ipaddress            # Library to handle and validate IP addresses
import re                   # Regular expressions for pattern matching
import urllib.request       # For opening URLs (used in Alexa rank lookup)
from bs4 import BeautifulSoup  # HTML parser to extract data from pages
import socket               # Network operations (e.g. DNS lookups)
import requests             # HTTP library to fetch page content
from googlesearch import search  # To check Google indexing via search queries
import whois                # To retrieve WHOIS domain registration info
from datetime import date, datetime  # Date handling for domain age calculations
from urllib.parse import urlparse  # To parse URL components


class URLFeature:            # Define the URLFeature class
    features = []           # Class variable to hold computed feature values


## URLFeature Class Constructor: Initialization & Feature Extraction

This cell defines the `__init__` method (constructor) for the `URLFeature` class. Here’s a breakdown of its workflow:

- **Initialization:**
    - Stores the input URL and prepares placeholders for the domain, WHOIS data, parsed URL, HTTP response, and HTML content.
    - Tries to fetch the web page (`requests.get`) and parse its HTML (`BeautifulSoup`). If this fails (e.g., network error), it continues without crashing.
    - Parses the URL into components (scheme, domain, path, etc.) using `urlparse`.
    - Extracts the domain (host) and attempts a WHOIS lookup for registration details.

- **Feature Extraction:**
    - Sequentially calls each feature extraction method (e.g., `UsingIp`, `longUrl`, `shortUrl`, etc.).
    - Appends the result (0 = benign, 1 = suspicious) to the `features` list for each heuristic.
    - This design ensures that all features are computed and stored as soon as a `URLFeature` object is created.

**Why this approach?**  
By handling network and parsing errors gracefully, the constructor ensures robust feature extraction even if some data sources are unavailable. This is important for large-scale or automated analysis (like on Kaggle), where some URLs may be unreachable or malformed.
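
One way to express this "continue on failure" pattern without repeating `try`/`except` blocks is a small wrapper. The helper name `safe_call` is an assumption made for this sketch; it is not part of the notebook. It catches `Exception` rather than using a bare `except`, so that `KeyboardInterrupt` still stops a long batch run.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def safe_call(fn: Callable[[], T], default: T) -> T:
    """Run fn() and return its result, or `default` if it raises an Exception."""
    try:
        return fn()
    except Exception:
        return default

def failing_lookup():
    # Stand-in for a network call that fails (hypothetical).
    raise ConnectionError("host unreachable")

whois_data = safe_call(failing_lookup, None)               # lookup failed: falls back to None
page_length = safe_call(lambda: len("<html></html>"), 0)   # call succeeded: real value is kept

print(whois_data, page_length)
```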


In [22]:
def __init__(self, url):                                      # Constructor takes a URL string
    self.features = []                                       # Initialize instance feature list
    self.url = url                                           # Store original URL
    self.domain = ""                                         # Placeholder for the domain part
    self.whois_response = ""                                 # Placeholder for WHOIS lookup data
    self.urlparse = ""                                       # Placeholder for parsed URL object
    self.response = ""                                       # Placeholder for HTTP response
    self.soup = ""                                           # Placeholder for BeautifulSoup object

    try:
        self.response = requests.get(url, timeout=10)        # Attempt HTTP GET (the 10 s timeout is an arbitrary choice)
        self.soup = BeautifulSoup(self.response.text, 'html.parser')  # Parse HTML on success
    except Exception:
        pass                                                # Ignore errors (e.g. network issues)

    try:
        self.urlparse = urlparse(url)                        # Parse URL into components
        self.domain = self.urlparse.netloc                   # Extract network location (host, plus port if given)
    except Exception:
        pass                                                # Ignore parsing errors

    try:
        self.whois_response = whois.whois(self.domain)       # Perform WHOIS lookup on domain
    except Exception:
        pass                                                # Ignore WHOIS failures

    # Sequentially compute each feature, appending its result to the features list
    self.features.append(self.UsingIp())                     
    self.features.append(self.longUrl())                     
    self.features.append(self.shortUrl())                    
    self.features.append(self.symbol())                      
    self.features.append(self.redirecting())                 
    self.features.append(self.prefixSuffix())                
    self.features.append(self.SubDomains())                  
    self.features.append(self.Hppts())                       
    self.features.append(self.DomainRegLen())                

    # … (favicon code commented out in original)
    self.features.append(self.NonStdPort())                  
    self.features.append(self.HTTPSDomainURL())              

    # … (several feature methods commented out in original)

    self.features.append(self.AbnormalURL())                 
    self.features.append(self.WebsiteForwarding())           
    self.features.append(self.StatusBarCust())               
    self.features.append(self.DisableRightClick())           
    self.features.append(self.UsingPopupWindow())            
    self.features.append(self.IframeRedirection())           
    self.features.append(self.AgeofDomain())                 
    self.features.append(self.DNSRecording())                
    self.features.append(self.WebsiteTraffic())              
    self.features.append(self.PageRank())                    
    self.features.append(self.GoogleIndex())                 
    self.features.append(self.LinksPointingToPage())         
    self.features.append(self.StatsReport())                 


## 3.1. URLFeature Class: Feature Extraction Methods

This cell and the next define the first set of feature extraction methods for the `URLFeature` class (the first three here, the remaining four in the following cell). Each method implements a specific heuristic to detect suspicious patterns in a URL:

- **UsingIp:** Checks if the URL uses a direct IP address instead of a domain name (often suspicious).
- **longUrl:** Flags URLs that are unusually long, which can be a sign of phishing.
- **shortUrl:** Detects if the URL uses a known URL-shortening service (e.g., bit.ly, tinyurl).
- **symbol:** Looks for the '@' symbol, which can be used to obscure the real destination.
- **redirecting:** Checks for multiple '//' in the URL path, which may indicate redirection tricks.
- **prefixSuffix:** Flags domains with hyphens, a common phishing tactic.
- **SubDomains:** Counts the number of subdomains; excessive subdomains can be suspicious.

Each function returns **1** if the heuristic is triggered (suspicious), or **0** if not (benign). These features are essential for building a robust phishing detection model, as they capture common tricks used by attackers to deceive users.
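
For the IP-address heuristic, it matters which string is tested: a full URL such as `http://192.168.0.1/login` is not itself a valid IP address, only its host is. The sketch below is a standalone version of the check under that assumption; the function name `host_is_ip` is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url: str) -> int:
    """Return 1 if the URL's host is a literal IPv4/IPv6 address, else 0."""
    host = urlparse(url).hostname  # None when the URL has no network location
    if host is None:
        return 0
    try:
        ipaddress.ip_address(host)  # raises ValueError for anything that is not an IP
        return 1
    except ValueError:
        return 0

print(host_is_ip("http://192.168.0.1/login"))
print(host_is_ip("https://example.com/"))
```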

In [23]:
# 1. UsingIp
def UsingIp(self):                                           # Check if the URL's host is a literal IP address
    try:
        ipaddress.ip_address(urlparse(self.url).hostname)   # Parse the host only; a full URL never parses as an IP
        return 1                                            # Treat IP-based URL as suspicious
    except ValueError:
        return 0                                            # Not an IP address (or no host present)

# 2. longUrl
def longUrl(self):                                          # Check if URL length is unusually long
    if len(self.url) <= 75:
        return 0                                            # Up to 75 characters → not suspicious
    return 1                                               # Longer than 75 characters → suspicious

# 3. shortUrl
def shortUrl(self):                                         # Detect use of URL-shortening services
    match = re.search('bit\\.ly|goo\\.gl|shorte\\.st|go2l\\.ink|x\\.co|ow\\.ly|t\\.co|tinyurl|tr\\.im|is\\.gd|cli\\.gs|'
                        'yfrog\\.com|migre\\.me|ff\\.im|tiny\\.cc|url4\\.eu|twit\\.ac|su\\.pr|twurl\\.nl|snipurl\\.com|'
                        'short\\.to|BudURL\\.com|ping\\.fm|post\\.ly|Just\\.as|bkite\\.com|snipr\\.com|fic\\.kr|loopt\\.us|'
                        'doiop\\.com|short\\.ie|kl\\.am|wp\\.me|rubyurl\\.com|om\\.ly|to\\.ly|bit\\.do|t\\.co|lnkd\\.in|'
                        'db\\.tt|qr\\.ae|adf\\.ly|goo\\.gl|bitly\\.com|cur\\.lv|tinyurl\\.com|ow\\.ly|bit\\.ly|ity\\.im|'
                        'q\\.gs|is\\.gd|po\\.st|bc\\.vc|twitthis\\.com|u\\.to|j\\.mp|buzurl\\.com|cutt\\.us|u\\.bb|yourls\\.org',
                        self.url)                            # Regex for known shorteners (dots escaped throughout)
    if match:
        return 1                                            # Use of shortener → suspicious
    return 0                                               # No shortener detected


## 3.1. URLFeature Class: Feature Extraction Methods (continued)

This cell continues the implementation of the `URLFeature` class by defining more feature extraction methods. These methods analyze the URL and its components for suspicious patterns commonly associated with phishing attacks. Specifically, they check for:

- The presence of the '@' symbol in the URL, which can obscure the real destination.
- Unusual use of '//' in the URL path, which may indicate redirection tricks.
- Hyphens in the domain name, a tactic often used by phishing sites.
- The number of subdomains, as excessive subdomains can be suspicious.

Each method returns **1** if the heuristic is triggered (indicating suspicion), or **0** if not (benign).

`prefixSuffix` wraps its check in a `try` block and returns **1** if an error occurs; the other three methods operate directly on the stored URL string.
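
`SubDomains` counts dots in the whole URL, so dots in the path or file name also count toward the total. A variation that counts dots in the host only is sketched below. The function name and the "more than two dots" threshold are assumptions for illustration, and a plain dot count still cannot tell `example.co.uk` from a real subdomain.

```python
from urllib.parse import urlparse

def subdomain_flag(url: str) -> int:
    """Return 1 if the host part of the URL contains more than two dots, else 0."""
    host = urlparse(url).hostname or ""  # empty string when there is no host
    return 1 if host.count(".") > 2 else 0

url = "https://example.com/index.v2.min.js"   # hypothetical URL with dots in the path
dots_in_url = url.count(".")                   # counts host and path dots together
dots_in_host = (urlparse(url).hostname or "").count(".")

print(dots_in_url, dots_in_host, subdomain_flag(url))
```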

In [24]:
# 4. Symbol@
def symbol(self):                                           # Detect '@' symbol in URL
    if re.findall("@", self.url):                         
        return 1                                            # '@' often used to obfuscate
    return 0                                               # No '@' present

# 5. Redirecting//
def redirecting(self):                                      # Check for '//' beyond protocol
    if self.url.rfind('//') > 6:                           
        return 1                                            # Extra '//' → suspicious
    return 0                                               # Normal URL structure

# 6. prefixSuffix
def prefixSuffix(self):                                     # Look for '-' in domain
    try:
        match = re.findall('\\-', self.domain)             
        if match:
            return 1                                        # '-' in domain → possibly phishing
        return 0                                           # No '-' → fine
    except:
        return 1                                           # On error, err toward suspicious

# 7. SubDomains
def SubDomains(self):                                       # Count number of dots in URL
    dot_count = len(re.findall("\\.", self.url))           
    if dot_count == 1:                                     
        return 0                                           # Only root domain
    elif dot_count == 2:                                   
        return 0                                           # One subdomain
    return 1                                               # Multiple subdomains → suspicious


## 3.2. URLFeature Class: HTTPS and Domain Registration Features

This cell adds two feature extraction methods to the `URLFeature` class:

- **Hppts:** Checks whether the URL uses the HTTPS scheme and returns **1** if it does. HTTPS URLs are generally more trustworthy, but phishing sites can also use HTTPS, so this signal is weak on its own.
- **DomainRegLen:** Measures the length of the domain’s registration period (creation to expiration) in months using WHOIS data, and returns **1** when it is at least 12 months. Short-term registrations can be a red flag for phishing.

Note the polarity: in both methods **1** marks the benign case. `DomainRegLen` returns **0** when WHOIS data is missing or malformed.
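
The month arithmetic behind the registration-length check can be sketched in isolation. The helper names below are hypothetical, and the dates are invented; the only assumption carried over from the notebook is that a WHOIS date field may be either a single `datetime` or a list of them.

```python
from datetime import datetime

def first_date(value):
    """Return a single date from a WHOIS field that may be a date or a list of dates."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value

def months_between(start, end):
    """Whole calendar months from start to end, ignoring the day of the month."""
    return (end.year - start.year) * 12 + (end.month - start.month)

created = first_date([datetime(2023, 1, 15), datetime(2023, 1, 16)])  # list-valued field
expires = first_date(datetime(2024, 3, 1))                            # single-valued field
registration_months = months_between(created, expires)

print(registration_months)
```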

In [25]:
# 8. HTTPS check
def Hppts(self):                                            # Check URL scheme for HTTPS
    try:
        if self.urlparse.scheme == 'https':
            return 1                                        # HTTPS scheme
        return 0                                           # Not HTTPS
    except Exception:
        return 1                                           # On error, returns 1 (the same value as the HTTPS case)

# 9. DomainRegLen
def DomainRegLen(self):                                    # Domain registration length in months
    try:
        expiration_date = self.whois_response.expiration_date  # WHOIS expiration
        creation_date = self.whois_response.creation_date    # WHOIS creation
        if isinstance(expiration_date, (list, tuple)):
            expiration_date = expiration_date[0]            # Handle list format: use the first date
        if isinstance(creation_date, (list, tuple)):
            creation_date = creation_date[0]                # Handle list format: use the first date

        age = (expiration_date.year - creation_date.year) * 12 + (expiration_date.month - creation_date.month)
        if age >= 12:
            return 1                                        # Long registration → likely benign
        return 0                                           # Short registration → suspicious
    except:
        return 0                                           # On error, default to benign


## 3.3. URLFeature Class: Network and Domain-Based Feature Extraction

This cell defines two additional methods for the `URLFeature` class:

- **NonStdPort:** Checks if the URL specifies a non-standard port (e.g., `:8080`). Phishing sites sometimes use unusual ports to evade detection. If a port is specified, the feature is flagged as suspicious.
- **HTTPSDomainURL:** Detects if the string `'https'` appears within the domain name itself (not just as the protocol). Attackers may embed `'https'` in the domain to trick users into thinking the site is secure.

Both methods return **1** for benign cases and **0** for suspicious ones. These features help strengthen the detection of phishing URLs by analyzing subtle tricks used by attackers.
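
`NonStdPort` treats any explicit port as suspicious. A stricter variation compares the port with the default for the URL's scheme, so that `https://example.com:443/` is not flagged. The sketch below is one possible form of that idea, not the notebook's method; the function name and the two-entry port table are assumptions, and it returns a boolean rather than a 0/1 flag.

```python
from urllib.parse import urlparse

# Default ports for the two schemes considered here (assumed to be sufficient).
STANDARD_PORTS = {"http": 80, "https": 443}

def has_nonstandard_port(url: str) -> bool:
    """True if the URL names a port that differs from its scheme's default."""
    try:
        parts = urlparse(url)
        port = parts.port          # accessing .port raises ValueError for a malformed port
    except ValueError:
        return True                # treat an unparsable port as non-standard
    if port is None:
        return False               # no explicit port given
    return STANDARD_PORTS.get(parts.scheme) != port

print(has_nonstandard_port("http://example.com:8080/"))
print(has_nonstandard_port("https://example.com:443/"))
```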

In [26]:
# 11. NonStdPort
def NonStdPort(self):                                       # Detect non-standard port in domain
    try:
        port = self.domain.split(":")                      
        if len(port) > 1:
            return 0                                        # Non‑std port specified → suspicious
        return 1                                           # No port → benign
    except:
        return 0                                           # On error, default to suspicious

# 12. HTTPSDomainURL
def HTTPSDomainURL(self):                                   # Check if 'https' appears in domain string itself
    try:
        if 'https' in self.domain:
            return 0                                        # Suspicious embedding of 'https'
        return 1                                           # Domain clean
    except:
        return 0                                           # On error, default to suspicious


---

## 3.4. URLFeature Class: Abnormal URL and Webpage Content Features

This cell implements several feature extraction methods that analyze the content and behavior of the web page associated with a URL. These features help detect suspicious activity by examining the page’s HTML, JavaScript, and HTTP response:

- **AbnormalURL:** Compares the web page content to WHOIS data to detect abnormal similarities.
- **WebsiteForwarding:** Checks how many times the page redirects, as excessive redirects can be suspicious.
- **StatusBarCust:** Looks for JavaScript that modifies the browser’s status bar, a common phishing trick.
- **DisableRightClick:** Detects scripts that disable right-click, often used to prevent users from inspecting the page.
- **UsingPopupWindow:** Flags the use of JavaScript pop-up alerts, which can be used for phishing.
- **IframeRedirection:** Checks for the presence of iframes, which can hide malicious content.

Each method returns **1** if the suspicious pattern is detected, or **0** otherwise, with one exception: `WebsiteForwarding` returns **1** when the response history holds at most one redirect and **0** when it holds more. All six methods return **0** if the page could not be fetched.
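
A regex detail is worth spelling out for the iframe check. A bracketed pattern such as `[<iframe>|<frameBorder>]` is a character class: it matches any single one of the listed characters, so it fires on almost every HTML page. The sketch below contrasts it with a pattern that looks for an actual `<iframe` tag; the names `IFRAME_TAG` and `has_iframe` are hypothetical.

```python
import re

# Matches an opening iframe tag, case-insensitively.
IFRAME_TAG = re.compile(r"<\s*iframe\b", re.IGNORECASE)

# Character class: matches any ONE of the characters between the brackets.
CHAR_CLASS = re.compile(r"[<iframe>|<frameBorder>]")

def has_iframe(html: str) -> int:
    """Return 1 if the HTML contains an <iframe ...> tag, else 0."""
    return 1 if IFRAME_TAG.search(html) else 0

plain_page = "<p>hello</p>"                      # no iframe at all
framed_page = '<IFRAME src="x.html"></iframe>'   # contains an iframe

print(has_iframe(plain_page), has_iframe(framed_page))
print(bool(CHAR_CLASS.search(plain_page)))       # the character class matches the "<" alone
```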

In [27]:
# 18. AbnormalURL
def AbnormalURL(self):                                      # Compare page content to WHOIS text
    try:
        if self.response.text == self.whois_response:      
            return 1                                        # Abnormal if identical
        else:
            return 0
    except:
        return 0

# 19. WebsiteForwarding
def WebsiteForwarding(self):                                # Check redirect history length
    try:
        if len(self.response.history) <= 1:
            return 1                                        # At most one redirect → benign
        return 0                                            # Two or more redirects → suspicious
    except Exception:
        return 0

# 20. StatusBarCust
def StatusBarCust(self):                                    # Detect JavaScript status‑bar modification
    try:
        if re.findall("<script>.+onmouseover.+</script>", self.response.text):
            return 1                                        # Found suspicious script
        else:
            return 0
    except:
        return 0

# 21. DisableRightClick
def DisableRightClick(self):                                # Detect script disabling right‑click
    try:
        if re.findall(r"event\.button ?== ?2", self.response.text):
            return 1                                        # Right‑click blocked
        else:
            return 0
    except:
        return 0

# 22. UsingPopupWindow
def UsingPopupWindow(self):                                 # Detect JavaScript alert pop‑ups
    try:
        if re.findall(r"alert\(", self.response.text):
            return 1                                        # Pop‑up usage → suspicious
        else:
            return 0
    except:
        return 0

# 23. IframeRedirection
def IframeRedirection(self):                                # Detect iframes in page
    try:
        if re.search(r"<iframe\b|frameBorder", self.response.text, re.IGNORECASE):
            return 1                                        # Iframe present → suspicious
        else:
            return 0
    except:
        return 0


---

## 3.5. URLFeature Class: Domain Age and DNS Record Methods

This cell defines two methods that estimate domain reputation from the WHOIS creation date:

- **AgeofDomain:** Calculates how many months have passed since the domain was registered. Older domains are generally more trustworthy.
- **DNSRecording:** Intended to measure the age of the DNS record. As implemented, it repeats the `AgeofDomain` calculation on the WHOIS creation date, so the two features always have the same value.

Both methods return **1** when the domain is at least 6 months old (benign) and **0** otherwise, including when WHOIS data is missing or malformed.
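
The age calculation compares calendar months and ignores the day of the month, which is easy to overlook. The sketch below makes the reference date a parameter so the behavior can be checked with fixed dates; the function names and the sample dates are assumptions for illustration.

```python
from datetime import date, datetime

def domain_age_months(creation, today):
    """Calendar-month difference between creation and today (day of month ignored)."""
    return (today.year - creation.year) * 12 + (today.month - creation.month)

def is_established(creation, today, min_months=6):
    """Return 1 if the domain is at least min_months old, else 0 (also 0 when unknown)."""
    if creation is None:
        return 0
    return 1 if domain_age_months(creation, today) >= min_months else 0

reference_day = date(2024, 7, 1)  # fixed "today" so the example is reproducible

# Created on 31 January: counted as 6 months, although only about 5 have elapsed.
print(domain_age_months(datetime(2024, 1, 31), reference_day))
print(is_established(datetime(2024, 1, 31), reference_day))
print(is_established(datetime(2024, 3, 1), reference_day))
```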


In [28]:
# 24. AgeofDomain
def AgeofDomain(self):                                      # Months since domain creation
    try:
        creation_date = self.whois_response.creation_date
        try:
            if (len(creation_date)):
                creation_date = creation_date[0]
        except:
            pass

        today = date.today()                               # Current date
        age = (today.year - creation_date.year) * 12 + (today.month - creation_date.month)
        if age >= 6:
            return 1                                        # Older than 6 months → benign
        return 0
    except:
        return 0

# 25. DNSRecording
def DNSRecording(self):                                     # Same as AgeofDomain (DNS record age)
    try:
        creation_date = self.whois_response.creation_date
        try:
            if (len(creation_date)):
                creation_date = creation_date[0]
        except:
            pass

        today = date.today()
        age = (today.year - creation_date.year) * 12 + (today.month - creation_date.month)
        if age >= 6:
            return 1
        return 0
    except:
        return 0


---

## 3.6. URLFeature Class: Website Traffic, PageRank, Google Index, and Reputation Features

This cell implements the final set of feature extraction methods for the `URLFeature` class, focusing on web reputation and popularity signals:

- **WebsiteTraffic:** Uses Alexa rank to estimate the website’s popularity. Sites ranked in the top 100,000 are treated as benign.
- **PageRank:** Queries an external service (`checkpagerank.net`) for the site’s global rank. A rank between 1 and 99,999 is treated as benign.
- **GoogleIndex:** Checks if the URL is indexed by Google, as legitimate sites are usually indexed.
- **LinksPointingToPage:** Counts the `<a href=` anchors in the fetched HTML. Unlike the other methods in this cell, it returns **1** when no links are found and **0** otherwise.
- **StatsReport:** Compares the URL and the domain’s resolved IP address against lists of known phishing hosts and IPs.

Apart from `LinksPointingToPage`, each method returns **1** for benign (trustworthy) cases and **0** for suspicious ones. On errors, `WebsiteTraffic`, `PageRank`, and `LinksPointingToPage` return **0**, while `GoogleIndex` and `StatsReport` return **1**.
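
For the blocklist comparison, an unanchored regex search matches substrings anywhere in the URL, so a pattern like `at\.ua` also fires on an unrelated path such as `/what.ua.html`. A possible alternative is exact matching on the parsed host and on the resolved IP. The sketch below uses a handful of entries copied from this notebook's lists; the function names and the suffix-matching rule are assumptions.

```python
import re
from urllib.parse import urlparse

# A few entries from the notebook's lists, kept short for illustration.
BAD_HOST_SUFFIXES = ("at.ua", "usa.cc", "esy.es")
BAD_IPS = {"146.112.61.108", "213.174.157.151"}

def host_is_blocklisted(url: str) -> bool:
    """True if the URL's host equals a listed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == s or host.endswith("." + s) for s in BAD_HOST_SUFFIXES)

def ip_is_blocklisted(ip: str) -> bool:
    """True only on an exact match against the listed IP addresses."""
    return ip in BAD_IPS

benign_url = "https://example.com/what.ua.html"  # hypothetical URL with a look-alike path
substring_hit = re.search(r"at\.ua", benign_url) is not None  # unanchored search fires here

print(substring_hit, host_is_blocklisted(benign_url))
print(host_is_blocklisted("http://shop.at.ua/login"))
```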

In [29]:
# 26. WebsiteTraffic
def WebsiteTraffic(self):                                   # Fetch Alexa reach rank
    try:
        # Note: Alexa's ranking service was retired in 2022, so this request may no longer succeed.
        rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + self.url).read(),
                                "xml").find("REACH")['RANK']     # Extract RANK attribute
        if (int(rank) < 100000):
            return 1                                        # Popular site → benign
        return 0                                           # Low traffic → suspicious
    except Exception:
        return 0

# 27. PageRank
def PageRank(self):                                         # Query external PageRank service
    try:
        rank_checker_response = requests.post("https://www.checkpagerank.net/index.php", {"name": self.domain})
        global_rank = int(re.findall(r"Global Rank: ([0-9]+)", rank_checker_response.text)[0])
        if global_rank > 0 and global_rank < 100000:
            return 1
        return 0
    except:
        return 0

# 28. GoogleIndex
def GoogleIndex(self):                                      # Use Google search to detect indexing
    try:
        site = next(iter(search(self.url, 5)), None)       # Take the first result, if any; a generator object itself is always truthy
        if site:
            return 1                                       # Indexed → benign
        else:
            return 0
    except:
        return 1                                           # On error, err toward benign

# 29. LinksPointingToPage
def LinksPointingToPage(self):                             # Count hyperlink tags in page
    try:
        number_of_links = len(re.findall(r"<a href=", self.response.text))
        if number_of_links == 0:
            return 1                                       # No links → suspicious
        return 0                                           # One or more links found
    except Exception:
        return 0

# 30. StatsReport
def StatsReport(self):                                      # Check against known phishing hosts/IPs
    try:
        url_match = re.search(
            'at\\.ua|usa\\.cc|baltazarpresentes\\.com\\.br|pe\\.hu|esy\\.es|hol\\.es|sweddy\\.com|myjino\\.ru|96\\.lt|ow\\.ly',
            self.url)                                     # Known bad domains
        ip_address = socket.gethostbyname(self.domain)     # Resolve domain to IP
        ip_match = re.search(
            '146\\.112\\.61\\.108|213\\.174\\.157\\.151|121\\.50\\.168\\.88|192\\.185\\.217\\.116|78\\.46\\.211\\.158|181\\.174\\.165\\.13|46\\.242\\.145\\.103|121\\.50\\.168\\.40|83\\.125\\.22\\.219|46\\.242\\.145\\.98|'
            '107\\.151\\.148\\.44|107\\.151\\.148\\.107|64\\.70\\.19\\.203|199\\.184\\.144\\.27|107\\.151\\.148\\.108|107\\.151\\.148\\.109|119\\.28\\.52\\.61|54\\.83\\.43\\.69|52\\.69\\.166\\.231|216\\.58\\.192\\.225|'
            '118\\.184\\.25\\.86|67\\.208\\.74\\.71|23\\.253\\.126\\.58|104\\.239\\.157\\.210|175\\.126\\.123\\.219|141\\.8\\.224\\.221|10\\.10\\.10\\.10|43\\.229\\.108\\.32|103\\.232\\.215\\.140|69\\.172\\.201\\.153|'
            '216\\.218\\.185\\.162|54\\.225\\.104\\.146|103\\.243\\.24\\.98|199\\.59\\.243\\.120|31\\.170\\.160\\.61|213\\.19\\.128\\.77|62\\.113\\.226\\.131|208\\.100\\.26\\.234|195\\.16\\.127\\.102|195\\.16\\.127\\.157|'
            '34\\.196\\.13\\.28|103\\.224\\.212\\.222|172\\.217\\.4\\.225|54\\.72\\.9\\.51|192\\.64\\.147\\.141|198\\.200\\.56\\.183|23\\.253\\.164\\.103|52\\.48\\.191\\.26|52\\.214\\.197\\.72|87\\.98\\.255\\.18|209\\.99\\.17\\.27|'
            '216\\.38\\.62\\.18|104\\.130\\.124\\.96|47\\.89\\.58\\.141|78\\.46\\.211\\.158|54\\.86\\.225\\.156|54\\.82\\.156\\.19|37\\.157\\.192\\.102|204\\.11\\.56\\.48|110\\.34\\.231\\.42',
            ip_address)                                  # Known bad IPs
        if url_match:
            return 0                                       # Known bad domain → phishing
        elif ip_match:
            return 0                                       # Known bad IP → phishing
        return 1                                           # Otherwise → benign
    except:
        return 1                                           # On error, err toward benign


---

## 4. Helper Functions: Feature Extraction and Testing

This cell provides utility functions to streamline feature extraction and testing for your phishing detection project:

- **getFeaturesList:**  
    Returns the list of computed features for a given URL, making it easy to use the extracted data in downstream analysis or machine learning models.

- **test_feature_extraction:**  
    A convenient function to quickly test feature extraction on any URL. It creates a `URLFeature` object, extracts all features, and prints each feature’s value in a readable format. This is especially useful for debugging or demonstrating your feature extraction pipeline.

- **Sample Usage:**  
    The cell includes an example that runs the feature extraction on `https://youtube.com` if the script is executed directly. This demonstrates how to use the helper for rapid prototyping or validation.

These helpers are essential for integrating your feature extraction logic into Kaggle notebooks, enabling efficient experimentation and model development.
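
Because the feature list is positional, it can help to pair each value with its method name before handing it to a model or a data frame. The sketch below assumes the order in which the constructor appends its 24 values; `FEATURE_NAMES` and `label_features` are hypothetical names, not part of the notebook.

```python
# Method names in the order the constructor appends their results.
FEATURE_NAMES = [
    "UsingIp", "longUrl", "shortUrl", "symbol", "redirecting", "prefixSuffix",
    "SubDomains", "Hppts", "DomainRegLen", "NonStdPort", "HTTPSDomainURL",
    "AbnormalURL", "WebsiteForwarding", "StatusBarCust", "DisableRightClick",
    "UsingPopupWindow", "IframeRedirection", "AgeofDomain", "DNSRecording",
    "WebsiteTraffic", "PageRank", "GoogleIndex", "LinksPointingToPage", "StatsReport",
]

def label_features(values):
    """Pair a positional feature list with its method names."""
    if len(values) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} values, got {len(values)}")
    return dict(zip(FEATURE_NAMES, values))

# Placeholder vector (all zeros except the first entry) standing in for real output.
example_vector = [1] + [0] * (len(FEATURE_NAMES) - 1)
labeled = label_features(example_vector)

print(labeled["UsingIp"], labeled["StatsReport"])
```

The length check guards against silently mislabeling values if methods are added to or removed from the constructor.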

In [30]:
def getFeaturesList(self):                                # Return the computed feature list
    return self.features                                 

def test_feature_extraction(url):                            # Helper to test on a single URL
    try:
        extractor = URLFeature(url)                          # Instantiate extractor
        print(f"Extracted features for URL: {url}")          # Header printout
        for i, feature in enumerate(extractor.features, start=1):
            print(f"Feature {i}: {feature}")                 # Print each feature value
    except Exception as e:
        print(f"An error occurred during feature extraction: {e}")  # Error handling

# Test with a sample URL
if __name__ == "__main__":
    sample_url = "https://youtube.com"                       # Example URL
    test_feature_extraction(sample_url)                      # Run the test


An error occurred during feature extraction: URLFeature() takes no arguments
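
This error is consistent with the methods having been defined at module level in separate cells rather than inside the class body, which leaves the class without a constructor that accepts a URL. One possible remedy, sketched below on a small stand-in class, is to attach the functions to the class after they are defined. `Extractor` and its single feature are hypothetical and only mirror the structure of `URLFeature`.

```python
class Extractor:              # hypothetical stand-in for URLFeature
    features = []

def __init__(self, url):      # defined at module level, outside the class body
    self.url = url
    self.features = [self.longUrl()]

def longUrl(self):
    return 1 if len(self.url) > 75 else 0

# Before attaching, the class only has object's default constructor.
try:
    Extractor("https://example.com")
    error_before = None
except TypeError as exc:
    error_before = str(exc)

# Attach each module-level function to the class under its own name.
for func in (__init__, longUrl):
    setattr(Extractor, func.__name__, func)

extractor = Extractor("https://example.com")
print(error_before)
print(extractor.features)
```

Placing all method definitions inside a single `class URLFeature:` cell would achieve the same result without the extra attachment step.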
