# **Website Feature Extraction with BeautifulSoup & Selenium**

This notebook provides a comprehensive framework for extracting both static and dynamic features from web pages.

---

## **Notebook Structure**

1. **Imports & Class Skeleton**
  - Essential libraries (`re`, Selenium) are imported.
  - The `features_html` class is initialized with a Selenium `driver` and a BeautifulSoup `soup` object, enabling both static and dynamic analysis.

2. **Static HTML Features**
  - Methods analyze the parsed HTML (`soup`) for:
    - Presence of key elements (title, submit button, links, email inputs, etc.).
    - Counts of tags (inputs, buttons, images, lists, etc.).
    - Text and attribute metrics (title length, total text length).

3. **External-Resource Features**
  - Methods assess the safety and locality of external resources:
    - Percentage of images/scripts/links from the same domain.
    - Safety of anchor URLs.
    - Favicon and script/link resource checks.
    - Form handler safety and email information leakage.

4. **Dynamic JS-Driven Features**
  - Selenium-driven methods execute JavaScript to detect:
    - Clipboard, form, and cookie monitoring.
    - Hidden elements and offscreen content.
    - Page redirects and form behaviors.
    - Mouse/keyboard event tracking.

5. **Password-Field & Popup Features**
  - Methods count password fields and forms containing them.
  - Methods detect popups and built-in dialogs (alert, confirm, prompt).

---

## **Usage**

- **Static features** are extracted using BeautifulSoup on the loaded HTML.
- **Dynamic features** require a live Selenium WebDriver session to interact with JavaScript and DOM events.
- The class is modular, allowing easy extension or integration into larger web analysis pipelines.

---

**In summary:**  
This notebook combines static HTML parsing (BeautifulSoup) with browser automation (Selenium) to extract a wide range of web page features.

# **1. Imports & Class Skeleton**

- We import `re` for regex matching and `urlparse` for deriving the page's domain.  
- `WebDriverWait` & `By` help us coordinate Selenium interactions for dynamic checks.  
- The `features_html` class bundles **all** static (BeautifulSoup) and dynamic (Selenium‑driven) page features.


In [11]:
import re  # Regular expressions for pattern matching (e.g., mailto: links, dots in URLs)
from urllib.parse import urlparse  # Used to derive the page's domain from its URL

from selenium.webdriver.common.by import By  # Locator strategies for finding elements
# WebDriverWait polls until a condition holds or a timeout expires, which matters on
# dynamic pages where elements may not be available immediately.
from selenium.webdriver.support.ui import WebDriverWait

class features_html:
    def __init__(self, driver, soup):
        self.driver = driver  # Selenium WebDriver instance for dynamic page interaction
        self.soup = soup  # BeautifulSoup object for static HTML parsing
        self.wait = WebDriverWait(driver, 2)  # Wait helper for Selenium actions (2 second timeout)
        # The external-resource methods read self.url and self.domain, which were never set.
        # Assumption: both are derived from the page currently loaded in the driver.
        self.url = driver.current_url
        self.domain = urlparse(self.url).netloc
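
As exported, the cells that follow define their functions at module level, each taking `self`. If they are run that way rather than pasted into the class body, they need to be attached to the class before use. A minimal sketch of one way to do that, using a hypothetical stand-in class `FeatureExtractor` and a fake soup object so it runs without a browser (neither name comes from the notebook):

```python
from types import SimpleNamespace


class FeatureExtractor:
    """Hypothetical stand-in for features_html, kept free of Selenium imports."""

    def __init__(self, driver, soup):
        self.driver = driver
        self.soup = soup


# Module-level functions written in the same style as the notebook's cells
def has_title(self):
    return 1 if self.soup.title else 0


def number_of_forms(self):
    return len(self.soup.find_all("form"))


# Attach each function to the class under its own name
for fn in (has_title, number_of_forms):
    setattr(FeatureExtractor, fn.__name__, fn)


def collect(extractor, names):
    """Call each named feature method and gather the results in one dict."""
    return {name: getattr(extractor, name)() for name in names}


# A fake soup exposing only what the two features touch
fake_soup = SimpleNamespace(
    title="Example",
    find_all=lambda tag: ["<form>"] if tag == "form" else [],
)
extractor = FeatureExtractor(driver=None, soup=fake_soup)
row = collect(extractor, ["has_title", "number_of_forms"])
print(row)
```

With the real class, the same `setattr` loop would target `features_html` instead.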


# **2. Static HTML Features**

These methods use **BeautifulSoup** on the already‑parsed `soup`:

- Presence checks (title, submit input, `<link>` tag, email input).  
- Tag counts (inputs, buttons, images, lists, etc.).  
- Text and attribute metrics (title length, total text length).


In [12]:
def has_title(self):
    return 1 if self.soup.title else 0

def has_submit(self):
    return 1 if self.soup.find('input', {'type': 'submit'}) else 0

def has_link(self):
    return 1 if self.soup.find('link') else 0

def has_email_input(self):
    email_inputs = self.soup.find_all('input', {'type': 'email'})
    id_inputs = self.soup.find_all('input', id=lambda x: x and 'email' in x.lower())
    name_inputs = self.soup.find_all('input', attrs={'name': lambda x: x and 'email' in x.lower()})
    return 1 if (email_inputs or id_inputs or name_inputs) else 0

def number_of_inputs(self):
    return len(self.soup.find_all('input'))

def number_of_buttons(self):
    return len(self.soup.find_all('button'))

def number_of_images(self):
    return len(self.soup.find_all('img'))

def number_of_option(self):
    return len(self.soup.find_all('option'))

def number_of_list(self):
    return len(self.soup.find_all('li'))

def number_of_href(self):
    links = self.soup.find_all('a', href=True)
    return len(links)

def number_of_paragraph(self):
    return len(self.soup.find_all('p'))

def number_of_script(self):
    return len(self.soup.find_all('script'))

def length_of_title(self):
    # .string is None for an empty <title> or one with nested tags, so use get_text()
    return len(self.soup.title.get_text()) if self.soup.title else 0

def has_h1(self):
    return 1 if self.soup.find('h1') else 0

def has_h2(self):
    return 1 if self.soup.find('h2') else 0

def has_h3(self):
    return 1 if self.soup.find('h3') else 0

def length_of_text(self):
    return len(self.soup.get_text())

def number_of_clickable_button(self):
    return len(self.soup.find_all('button', {'type': 'button'}))

def number_of_a(self):
    return len(self.soup.find_all('a'))

def number_of_div(self):
    return len(self.soup.find_all('div'))

def has_footer(self):
    return 1 if self.soup.find('footer') else 0

def number_of_forms(self):
    return len(self.soup.find_all('form'))

def has_text_area(self):
    return 1 if self.soup.find('textarea') else 0

def has_iframe(self):
    return 1 if self.soup.find('iframe') else 0

def has_text_input(self):
    inputs = self.soup.find_all('input', {'type': 'text'})
    return 1 if inputs else 0

def number_of_meta(self):
    return len(self.soup.find_all('meta'))

def has_nav(self):
    return 1 if self.soup.find('nav') else 0

def number_of_sources(self):
    return len(self.soup.find_all('source'))

def number_of_span(self):
    return len(self.soup.find_all('span'))

def number_of_table(self):
    return len(self.soup.find_all('table'))
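
To see what a few of these checks return, here is a small sketch that applies the same BeautifulSoup calls to an inline HTML string. The markup and the `features` dictionary are made up for illustration, and the built-in `html.parser` backend is an assumption (the notebook does not say which parser it uses):

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this illustration
html = """
<html>
  <head><title>Sign in</title><link rel="icon" href="/favicon.ico"></head>
  <body>
    <h1>Welcome</h1>
    <form action="/login">
      <input type="email" name="user_email">
      <input type="password" name="pw">
      <input type="submit" value="Go">
    </form>
    <a href="/about">About</a>
    <a>Anchor without href</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

features = {
    "has_title": 1 if soup.title else 0,
    "has_submit": 1 if soup.find("input", {"type": "submit"}) else 0,
    "has_h1": 1 if soup.find("h1") else 0,
    "number_of_inputs": len(soup.find_all("input")),
    "number_of_a": len(soup.find_all("a")),                  # every <a> tag
    "number_of_href": len(soup.find_all("a", href=True)),    # only <a> tags with href
    "length_of_title": len(soup.title.get_text()) if soup.title else 0,
}
print(features)
```

Note the difference between `number_of_a` and `number_of_href`: the second skips anchors that have no `href` attribute.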


# **3. External‑Resource Features**

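- **RequestURL:** 1 if at least 50% of `img`/`audio`/`embed`/`iframe` sources are local (they contain the page URL or domain, or have exactly one dot, as in a relative path such as `img/logo.png`); otherwise 0.  
- **AnchorURL:** 1 if fewer than 50% of `<a>` hrefs are “unsafe”, where unsafe means the href contains `#`, `javascript`, or `mailto`, or does not contain the page URL or domain.  
- **Favicon:** 1 if any `<link href>` is local. The check covers every `<link>` tag, not only `rel="icon"`.  
- **LinksInScriptTags:** same idea as RequestURL, but for `<link href>` and `<script src>`.  
- **ServerFormHandler:** 0 if any form `action` is blank, `about:blank`, or does not contain the page URL or domain (so relative actions such as `/login` also yield 0); otherwise 1.  
- **InfoEmail:** 0 if `mailto:` or `mail(` appears anywhere in the page source; otherwise 1.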
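
The "share of local resources" idea can be sketched with the standard library alone. This is an alternative to the substring and dot-count heuristic used in the cell below, not the notebook's own method: it resolves each URL against the page URL and compares hosts. The function names `is_local` and `local_share_flag` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def is_local(resource_url, page_url):
    """Return True when resource_url resolves to the same host as page_url."""
    # urljoin resolves relative paths and protocol-relative URLs (//host/path)
    absolute = urljoin(page_url, resource_url)
    return urlparse(absolute).netloc.lower() == urlparse(page_url).netloc.lower()


def local_share_flag(resource_urls, page_url, threshold=50.0):
    """Return 1 when at least `threshold` percent of the URLs are local, else 0."""
    if not resource_urls:
        return 0
    local = sum(is_local(u, page_url) for u in resource_urls)
    pct = 100.0 * local / len(resource_urls)
    return 1 if pct >= threshold else 0


page = "https://shop.example.com/cart"
sources = [
    "/static/logo.png",              # relative, same host
    "img/banner.jpg",                # relative, same host
    "https://cdn.other.net/ad.gif",  # different host
    "//tracker.other.net/p.js",      # protocol-relative, different host
]
flag = local_share_flag(sources, page)
print(flag)
```

Comparing parsed hosts avoids one weakness of substring matching: a URL such as `https://evil.test/?next=shop.example.com` contains the domain as text but is served from elsewhere.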


In [13]:
def RequestURL(self):
    try:
        success, i = 0, 0
        for tag in ['img','audio','embed','iframe']:
            for el in self.soup.find_all(tag, src=True):
                dots = [m.start() for m in re.finditer(r'\.', el['src'])]
                if self.url in el['src'] or self.domain in el['src'] or len(dots) == 1:
                    success += 1
                i += 1
        pct = (success / float(i)) * 100 if i>0 else 0
        return 1 if pct >= 50.0 else 0
    except:
        return 0

def AnchorURL(self):
    try:
        unsafe, total = 0, 0
        for a in self.soup.find_all('a', href=True):
            href = a['href'].lower()
            if ("#" in href or "javascript" in href or "mailto" in href or
                not (self.url in href or self.domain in href)):
                unsafe += 1
            total += 1
        pct = (unsafe / float(total))*100 if total>0 else 100
        return 1 if pct < 50.0 else 0
    except:
        return 0

def Favicon(self):
    try:
        for link in self.soup.find_all('link', href=True):
            if any(x in link['href'] for x in [self.url, self.domain]) or link['href'].count('.')==1:
                return 1
        return 0
    except:
        return 0

def LinksInScriptTags(self):
    try:
        success, total = 0, 0
        for el in self.soup.find_all('link', href=True) + self.soup.find_all('script', src=True):
            path = el.get('href', el.get('src'))  # <link> carries href, <script> carries src
            dots = path.count('.')
            if self.url in path or self.domain in path or dots == 1:
                success += 1
            total += 1
        pct = (success/float(total))*100 if total>0 else 0
        return 1 if pct >= 50.0 else 0
    except:
        return 0

def ServerFormHandler(self):
    try:
        forms = self.soup.find_all('form', action=True)
        if not forms:
            return 1
        for f in forms:
            action = f['action']
            if action in ("", "about:blank") or \
                (self.url not in action and self.domain not in action):
                return 0
        return 1
    except:
        return 0

def InfoEmail(self):
    try:
        return 0 if re.search(r"(mailto:|mail\()", str(self.soup)) else 1
    except:
        return 0
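
`ServerFormHandler` above treats any action that lacks the page URL or domain as external, which includes relative actions such as `/login`. A hedged sketch of a finer-grained classification follows; the function name `classify_form_action` and its four labels are assumptions made for illustration and are not part of the notebook:

```python
from urllib.parse import urljoin, urlparse


def classify_form_action(action, page_url):
    """Label a form action as 'blank', 'local', 'external', or 'other'."""
    action = (action or "").strip()
    if action in ("", "about:blank"):
        return "blank"
    # Resolve relative actions against the page URL before comparing hosts
    target = urlparse(urljoin(page_url, action))
    if target.scheme not in ("http", "https"):
        return "other"  # e.g. mailto: or javascript: handlers
    same_host = target.netloc.lower() == urlparse(page_url).netloc.lower()
    return "local" if same_host else "external"


page = "https://shop.example.com/cart"
labels = {
    action: classify_form_action(action, page)
    for action in ["", "/login", "https://collect.evil.test/post", "mailto:x@y.z"]
}
print(labels)
```

A binary feature in the style of `ServerFormHandler` could then map `'local'` to 1 and everything else to 0, if that is the intended semantics.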


# **4. Dynamic JS‑Driven Features**

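These execute small JS snippets via Selenium. One caveat applies to the listener-based checks (clipboard, form, cookie, mouse, keyboard): each snippet attaches its own handler and reads its flag in the same synchronous call, so as written it returns 0. No event can fire before the `return`, and the handler says nothing about listeners the page itself registered. The features are: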

- **Clipboard, Form, Cookie** monitoring detection.  
- **Hidden elements** count by style/display.  
- **Page redirect** detection (initial vs. final URL).  
- **Form redirect behavior** & **external form actions**.  
- **Mouse & Keyboard** event tracking.


In [14]:
def check_clipboard_access(self):
    try:
        js = """
        let accessed = false;
        ['copy','cut','paste'].forEach(e=>document.addEventListener(e,()=>accessed=true));
        return accessed?1:0;
        """
        return self.driver.execute_script(js)
    except:
        return 0

def check_form_data_collection(self):
    try:
        js = """
        let collected = false;
        document.querySelectorAll('form').forEach(f=>{
            f.addEventListener('submit',()=>collected=true);
            f.querySelectorAll('input').forEach(i=>i.addEventListener('change',()=>collected=true));
        });
        return collected?1:0;
        """
        return self.driver.execute_script(js)
    except:
        return 0

def check_cookie_manipulation(self):
    try:
        js = """
        let manipulated=false;
        const orig=Object.getOwnPropertyDescriptor(Document.prototype,'cookie');
        Object.defineProperty(document,'cookie',{
            get(){manipulated=true;return orig.get.call(this);},
            set(v){manipulated=true;return orig.set.call(this,v);}
        });
        return manipulated?1:0;
        """
        return self.driver.execute_script(js)
    except:
        return 0

def check_suspicious_js(self):
    return {
        'clipboard_monitoring': self.check_clipboard_access(),
        'form_data_collection': self.check_form_data_collection(),
        'cookie_manipulation': self.check_cookie_manipulation()
    }

def number_of_hidden_element(self):
    try:
        counts = self.driver.execute_script("""
            let els=[...document.getElementsByTagName('*')];
            return {
            display_none: els.filter(e=>getComputedStyle(e).display==='none').length,
            visibility_hidden: els.filter(e=>getComputedStyle(e).visibility==='hidden').length,
            hidden_inputs: [...document.getElementsByTagName('input')].filter(i=>i.type==='hidden').length,
            offscreen: els.filter(e=>{const r=e.getBoundingClientRect();return r.left<0||r.top<0;}).length
            };
        """)
        return sum(counts.values())
    except:
        return 0

def page_redirect(self):
    try:
        start = self.driver.current_url
        self.wait.until(lambda d: d.execute_script("return document.readyState")=='complete')
        return 1 if self.driver.current_url!=start else 0
    except:
        return 0

def form_redirect_behavior(self):
    try:
        forms = self.driver.find_elements(By.TAG_NAME,'form')
        cur = self.driver.current_url
        for f in forms:
            act = f.get_attribute('action')
            if act in ("","about:blank") or ("http" in act and cur not in act):
                return 1
        return 0
    except:
        return 0

def check_external_form_action(self):
    try:
        forms = self.driver.find_elements(By.TAG_NAME,'form')
        dom = self.driver.current_url.split('/')[2]
        for f in forms:
            act = f.get_attribute('action')
            if act and dom not in act:
                return 1
        return 0
    except:
        return 0

def has_mouse_tracking(self):
    # Wrapped in try/except like the other dynamic checks, so a driver error yields 0
    try:
        js = """
        let t=false;document.addEventListener('mousemove',()=>t=true);return t?1:0;
        """
        return self.driver.execute_script(js)
    except Exception:
        return 0

def has_keyboard_monitoring(self):
    try:
        js = """
        let k=false;document.addEventListener('keydown',()=>k=true);return k?1:0;
        """
        return self.driver.execute_script(js)
    except Exception:
        return 0
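
Detecting listeners that the page itself registers generally requires instrumenting `addEventListener` before the page's scripts run. One possible approach, sketched under the assumption of a Chromium-based driver (which exposes `execute_cdp_cmd`), is to inject a hook through the DevTools command `Page.addScriptToEvaluateOnNewDocument` and query it after the page has loaded. The names `HOOK_JS`, `window.__listenerTypes`, `install_listener_hook`, and `listener_flags` are hypothetical, and `_FakeDriver` stands in for a real browser session.

```python
# JavaScript that records every event type passed to addEventListener
HOOK_JS = """
(() => {
  const seen = new Set();
  const orig = EventTarget.prototype.addEventListener;
  EventTarget.prototype.addEventListener = function (type, ...rest) {
    seen.add(type);
    return orig.call(this, type, ...rest);
  };
  window.__listenerTypes = () => Array.from(seen);
})();
"""

# Event types grouped by the feature they would feed
EVENT_GROUPS = {
    "clipboard": ("copy", "cut", "paste"),
    "mouse": ("mousemove",),
    "keyboard": ("keydown",),
}


def install_listener_hook(driver):
    """Register the hook so it runs before page scripts. Call before driver.get()."""
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": HOOK_JS}
    )


def listener_flags(driver, groups=EVENT_GROUPS):
    """Return a 0/1 flag per group, based on the event types the page registered."""
    types = driver.execute_script(
        "return window.__listenerTypes ? window.__listenerTypes() : [];"
    )
    registered = set(types or [])
    return {name: int(bool(registered & set(events))) for name, events in groups.items()}


class _FakeDriver:
    """Stand-in that records CDP calls and pretends the page registered two listeners."""

    def __init__(self):
        self.cdp_calls = []

    def execute_cdp_cmd(self, cmd, params):
        self.cdp_calls.append((cmd, params))
        return {}

    def execute_script(self, script):
        return ["copy", "mousemove"]


driver = _FakeDriver()
install_listener_hook(driver)
flags = listener_flags(driver)
print(flags)
```

With a real session, the order would be `install_listener_hook(driver)`, then `driver.get(url)`, then `listener_flags(driver)`. Inline handlers such as `onkeydown="..."` attributes bypass `addEventListener` and would need a separate check.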


# **5. Password‑Field & Popup Features**

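- **check_password_fields:** for inputs inside `<form>` elements, counts `type="password"` inputs, inputs whose `name` or `id` contains "password", hidden inputs with such a name or id, and forms containing a password input.  
- **has_popups:** wraps `window.open` and the built‑in dialogs (`alert`, `confirm`, `prompt`) to flag calls. As written, each flag is read immediately after its wrapper is installed, so the method returns 0 unless the wrappers are installed before the page's scripts run and the flags are read later.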


In [15]:
def check_password_fields(self):
    empty = {'password_type_count': 0, 'password_name_id_count': 0,
             'hidden_password_count': 0, 'form_with_password': 0}
    try:
        feats = dict(empty)
        # Only inputs nested inside a <form> are inspected
        for form in self.driver.find_elements(By.TAG_NAME, 'form'):
            has_pwd = False
            for inp in form.find_elements(By.TAG_NAME, 'input'):
                input_type = inp.get_attribute('type')
                name = inp.get_attribute('name')
                _id = inp.get_attribute('id')
                # True when the name or id attribute mentions "password"
                named_password = bool((name and 'password' in name.lower()) or
                                      (_id and 'password' in _id.lower()))
                if input_type == 'password':
                    feats['password_type_count'] += 1
                    has_pwd = True
                if named_password:
                    feats['password_name_id_count'] += 1
                if input_type == 'hidden' and named_password:
                    feats['hidden_password_count'] += 1
            if has_pwd:
                feats['form_with_password'] += 1
        return feats
    except:
        return empty

def has_popups(self):
    try:
        js1 = """
            let p=false;const o=window.open;window.open=()=>p=true;return p?1:0;
        """
        js2 = """
            let d=false;['alert','confirm','prompt'].forEach(m=>{
            const orig=window[m];window[m]=()=>d=true;
            });return d?1:0;
        """
        return 1 if (self.driver.execute_script(js1) or self.driver.execute_script(js2)) else 0
    except:
        return 0
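
The counting rules in `check_password_fields` can be exercised without a browser by separating them from the Selenium lookups. The sketch below applies the same three per-input rules to plain dictionaries. The function `count_password_inputs` and the sample data are hypothetical, and the per-form counter is left out because it depends on form grouping.

```python
def count_password_inputs(inputs):
    """Apply the per-input password rules to dicts with 'type', 'name', and 'id' keys."""
    feats = {
        "password_type_count": 0,
        "password_name_id_count": 0,
        "hidden_password_count": 0,
    }
    for inp in inputs:
        input_type = inp.get("type")
        name = inp.get("name") or ""
        _id = inp.get("id") or ""
        # True when the name or id attribute mentions "password"
        named_password = "password" in name.lower() or "password" in _id.lower()
        if input_type == "password":
            feats["password_type_count"] += 1
        if named_password:
            feats["password_name_id_count"] += 1
        if input_type == "hidden" and named_password:
            feats["hidden_password_count"] += 1
    return feats


sample_inputs = [
    {"type": "password", "name": "pw", "id": "login-password"},
    {"type": "hidden", "name": "old_password", "id": ""},
    {"type": "text", "name": "user", "id": "user"},
]
counts = count_password_inputs(sample_inputs)
print(counts)
```

The first input is counted twice, once for its type and once for its id, which mirrors how the Selenium version treats the two rules as independent.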
