# 1. Imports & Global State

**Purpose:**  
Load all required libraries and set up shared variables:

- **Concurrency:** `threading` + `Queue` let us run many workers in parallel without stepping on each other’s toes.  
- **Browser Automation:** `selenium` + headless Chrome to render pages (including JavaScript).  
- **HTML Parsing:** `BeautifulSoup` to inspect the final DOM.  
- **Data I/O:** `pandas` for CSV read/write; `tqdm` for progress bars.  
- **Feature Extraction:** our custom `FeaturesExtraction` class.  
- **Counters & Lock:** global counters to track how many phishing vs legitimate URLs we’ve processed, plus a `lock` to serialize updates to the counters and to the shared results list.
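To make the role of the lock concrete, here is a minimal sketch (not the notebook's own code) of a capped counter shared by several threads. The names `CAP`, `records`, `record_if_below_cap`, and `producer` are hypothetical. The point is that the check against the cap and the increment sit inside one `with lock:` block, so two threads cannot both pass the check and overshoot the cap.

```python
import threading

CAP = 100                      # hypothetical stand-in for max_count
count = 0                      # shared counter, analogous to phishing_count
records = []                   # shared list, analogous to data_list
lock = threading.Lock()


def record_if_below_cap(item):
    """Append item and bump the counter atomically; return True if stored."""
    global count
    with lock:                 # check and update form one critical section
        if count >= CAP:
            return False
        records.append(item)
        count += 1
        return True


def producer(start):
    # Each thread tries to store 250 items; most attempts are rejected.
    for i in range(start, start + 250):
        record_if_below_cap(i)


threads = [threading.Thread(target=producer, args=(n * 250,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(count, len(records))     # both equal CAP, never more
```

If the check and the increment were done in two separate locked sections, another thread could slip in between them and the list could end up longer than the cap.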


--------------------------------------------------------
Purpose:
  Evaluate multithreaded URL feature extraction performance.
  Reads URLs, launches headless browsers in threads,
  extracts features, limits to max_count per label, and measures time.
--------------------------------------------------------

In [1]:

import threading                               # For concurrent worker threads
from queue import Queue                        # Thread-safe queue of URLs
from bs4 import BeautifulSoup                  # Parse HTML page source
from selenium import webdriver                 # Automate Chrome to render pages
from selenium.webdriver.chrome.options import Options  # Configure ChromeDriver
from selenium.common.exceptions import TimeoutException  # Handle slow page loads
import pandas as pd                            # Read/write CSVs and DataFrame handling
from features_extraction import FeaturesExtraction  # Extract feature vectors from pages
from urllib3.exceptions import InsecureRequestWarning  # Suppress SSL warnings
from urllib3 import disable_warnings            # Disable those warnings globally
import time                                    # Track elapsed time for performance
from tqdm import tqdm                          # Display progress bars

disable_warnings(InsecureRequestWarning)       # Turn off insecure request warnings

# Global counters for processed pages, protected by lock
phishing_count = 0                             # Total phishing pages processed
legitimate_count = 0                           # Total legitimate pages processed
lock = threading.Lock()                        # Mutex to synchronize shared state


# 2. Headless Chrome Setup

**Purpose:**  
Every worker needs its own Chrome browser instance. This helper:

1. Runs Chrome **headless** (no window).  
2. Disables the sandbox and the use of `/dev/shm` (shared memory) for compatibility in Docker/CI.  
3. Turns off images & notifications for speed.  

By isolating these options in one function, the worker code stays clean.


In [None]:
def initialize_driver():
    """
    Create and return a headless Chrome WebDriver optimized for speed.
    """
    # Configure headless Chrome options
    chrome_options = Options()                  # Initialize options container
    chrome_options.add_argument("--headless")  # Do not open browser window
    chrome_options.add_argument("--no-sandbox")  # Required in some Linux environments
    chrome_options.add_argument("--disable-dev-shm-usage")  # Avoid shared memory errors
    prefs = {                                   # Browser preferences
        "profile.managed_default_content_settings.images": 2,  # Disable images
        "profile.default_content_setting_values.notifications": 2  # Disable notifications
    }
    chrome_options.add_experimental_option("prefs", prefs)  # Apply preferences

    driver = webdriver.Chrome(options=chrome_options)  # Launch Chrome WebDriver

    return driver


# 3. Worker Logic

**Purpose:**  
Each thread runs this loop:

1. **Pull** a URL from the shared `queue`.  
2. **Stop** when it receives a `None` sentinel.  
3. **Skip** if that class already reached `max_count`.  
4. **Load** the page (2 s timeout).  
5. **Parse** the rendered HTML via BeautifulSoup.  
6. **Extract** numeric features via `FeaturesExtraction`.  
7. **Append** the URL and class label to the feature vector.  
8. **Append** the vector to the shared `data_list` under `lock` so the list and the counters stay consistent.  
9. **Update** the appropriate counter and print live progress.  
10. **Handle** timeouts by restarting the driver, and always `quit()` at the end.
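The control flow of these steps can be sketched without a browser. In the sketch below, `fake_load` is a hypothetical stand-in for `driver.get()` that raises `TimeoutError` for some URLs, and `run` is an illustrative harness. Only the loop structure is meant to carry over: a `None` sentinel ends the loop, and `task_done()` is called exactly once per dequeued item.

```python
import threading
from queue import Queue


def fake_load(url):
    """Hypothetical stand-in for driver.get(); fails for URLs containing 'slow'."""
    if "slow" in url:
        raise TimeoutError(f"timed out: {url}")
    return f"<html>{url}</html>"


def worker(queue, results, lock):
    while True:
        url = queue.get()
        if url is None:            # sentinel: no more work for this thread
            queue.task_done()
            break
        try:
            html = fake_load(url)
            with lock:             # shared list is only touched under the lock
                results.append((url, len(html)))
        except TimeoutError:
            pass                   # the real worker restarts its browser here
        finally:
            queue.task_done()      # exactly once per dequeued item


def run(urls, num_threads=3):
    queue, results, lock = Queue(), [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(queue, results, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for url in urls:
        queue.put(url)
    for _ in threads:
        queue.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(u for u, _ in results)


print(run(["http://a.example", "http://slow.example", "http://b.example"]))
```

One sentinel per thread is needed because each worker consumes exactly one `None` before it exits.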


In [None]:
def worker(queue, data_list, label, max_count):
    global phishing_count, legitimate_count

    driver = initialize_driver()
    driver.set_page_load_timeout(2)

    try:
        while True:
            url = queue.get()                   # Fetch next URL or None
            if url is None:                     # Termination signal
                break

            with lock:                          # Protect counters
                # Skip if reached max_count for this label
                if (label == 1 and phishing_count >= max_count) or \
                   (label == 0 and legitimate_count >= max_count):
                    queue.task_done()            # Mark task done
                    continue                     # Skip processing

            print(f"\nProcessing URL: {url}")  # Log URL
            try:
                driver.get(url)                 # Load page in browser
                print(f"Successfully loaded URL: {url}")

                # Parse rendered HTML and extract features
                soup = BeautifulSoup(driver.page_source, "html.parser")
                extractor = FeaturesExtraction(driver, soup, url)
                vector = extractor.create_vector()  # Feature list
                vector.append(str(url))            # Append URL metadata
                vector.append(label)               # Append class label

                with lock:                        # Synchronize shared list and counters
                    if label == 1 and phishing_count < max_count:
                        data_list.append(vector)   # Store phishing vector
                        phishing_count += 1        # Increment phishing count
                        print(f"\rProcessed phishing sites: {phishing_count}/{max_count}", end="")
                    elif label == 0 and legitimate_count < max_count:
                        data_list.append(vector)   # Store legitimate vector
                        legitimate_count += 1      # Increment legitimate count
                        print(f"\rProcessed legitimate sites: {legitimate_count}/{max_count}", end="")
            except TimeoutException:
                print(f"Timeout while loading URL: {url}")
                driver.quit()                    # Restart driver on timeout
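                driver = initialize_driver()     # chrome_options is local to initialize_driver()
                driver.set_page_load_timeout(2)  # Reapply the timeout on the new driver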
            except Exception as e:
                print(f"Error processing URL {url}: {str(e)}")
                continue                          # Skip on error
            finally:
                queue.task_done()                # Signal completion
    finally:
        driver.quit()                          # Clean up browser




# 4. Orchestrator: process_urls_threaded()

**Purpose:**  
Wrap everything together:

1. **Reset** the global counter for the chosen label.  
2. **Read** the URLs from the phishing CSV (label 1) or the Tranco CSV (label 0).  
3. **Initialize** the URL queue, in‑memory list, and worker threads.  
4. **Enqueue** URLs until the counter for the label reaches `max_count`, showing a progress bar.  
5. **Send** `None` to stop threads, then `join()` them.  
6. **Build** a final DataFrame from `data_list` and save it as a CSV.
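The two input layouts in step 2 differ in a way that is easy to miss: the Tranco file has no header row and holds bare domains, while the phishing file has a header with a `url` column. Here is a dependency-free sketch using the standard `csv` module instead of `pandas`. The function name `load_urls`, the sample rows, and the extra `source` column are illustrative assumptions.

```python
import csv
import io


def load_urls(csv_text, label):
    """Return a list of URLs from CSV text, mirroring the two input layouts."""
    reader = csv.reader(io.StringIO(csv_text))
    if label == 0:
        # Headerless "id,domain" rows: add a scheme so a browser can load them
        return ["http://" + row[1] for row in reader if len(row) >= 2]
    header = next(reader)              # phishing file: first row names the columns
    url_idx = header.index("url")
    return [row[url_idx] for row in reader if len(row) > url_idx]


tranco_sample = "1,example.com\n2,example.org\n"
phishing_sample = "url,source\nhttp://bad.example/login,feed\n"   # made-up rows

print(load_urls(tranco_sample, 0))
print(load_urls(phishing_sample, 1))
```

Treating the headerless file as if it had a header would silently drop its first URL, which is why the two cases are handled separately.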


In [None]:
def process_urls_threaded(input_file, output_file, label, num_threads=4, max_count=100):
    global phishing_count, legitimate_count

    # Reset counters based on label
    if label == 1:
        phishing_count = 0
    else:
        legitimate_count = 0

    # Read input list of URLs
    if label == 0:
        df = pd.read_csv(input_file, names=["id", "url"])  # No header
        URL_list = ["http://" + u for u in df['url'].tolist()]
    else:
        df = pd.read_csv(input_file)             # CSV with header
        URL_list = df['url'].tolist()

    url_queue = Queue()                         # Shared URL queue
    data_list = []                              # Collect feature vectors

    # Spawn worker threads
    threads = []                                # Keep references to threads
    for _ in range(num_threads):
        t = threading.Thread(target=worker, args=(url_queue, data_list, label, max_count))
        t.daemon = True                          # Daemon threads exit with main program
        t.start()                                # Begin execution
        threads.append(t)

    # Enqueue URLs until max_count reached
    with tqdm(total=max_count, desc="Processing URLs") as pbar:
        for url in URL_list:
            with lock:                            # Read the counter consistently
                current = phishing_count if label == 1 else legitimate_count
            if current >= max_count:
                break                             # Stop adding URLs
            url_queue.put(url)                    # Add URL to queue
            # update() expects an increment, so advance by the change since the last update
            pbar.update(current - pbar.n)

    # Signal threads to terminate
    for _ in range(num_threads):
        url_queue.put(None)
    for t in threads:
        t.join()                                 # Wait for all threads to finish

    # Write final CSV
    columns = get_feature_columns()             # Ordered header, defined in section 5
    df_out = pd.DataFrame(data=data_list, columns=columns)  # Create DataFrame
    df_out.to_csv(output_file, index=False)      # Save to CSV


# 5. Feature Column Names

**Purpose:**  
Return the exact ordered list of column names used in the CSV:

- HTML static features  
- Dynamic (JS) features  
- Password‐field features  
- Suspicious JS features  
- URL‐based features  
- Finally: `URL`, `label`
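Because the header is maintained by hand, a mismatch between the length of a feature vector and the number of column names may go unnoticed until the DataFrame is built at the end of a long run. Here is a small, hypothetical pre-check (`check_rows` is not part of the notebook, and the header is shortened for illustration) that flags rows of the wrong width:

```python
def check_rows(rows, columns):
    """Return indices of rows whose length differs from the header length."""
    expected = len(columns)
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Shortened, illustrative header; the real one lists every feature name
columns = ["has_title", "number_of_inputs", "URL", "label"]
rows = [
    [1, 3, "http://a.example", 0],
    [0, 5, "http://b.example"],          # label missing: would misalign the CSV
]

bad = check_rows(rows, columns)
print(bad)
```

Running a check like this on the first few vectors, before processing the full URL list, is one way to catch a header that has drifted from the extractor's output.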


In [None]:
# Build DataFrame header
def get_feature_columns():
    basic = [  # HTML static
        "has_title","has_submit","has_link","has_email_input",
        "number_of_inputs","number_of_buttons","number_of_images",
        "number_of_option","number_of_list","number_of_href",
        "number_of_paragraph","number_of_script","length_of_title",
        "has_h1","has_h2","has_h3","length_of_text",
        "number_of_clickable_button","number_of_a","number_of_div",
        "has_footer","number_of_forms","has_text_area","has_iframe",
        "has_text_input","number_of_meta","has_nav",
        "number_of_sources","number_of_span","number_of_table"
    ]
    dynamic = [  # Dynamic JS
        "has_mouse_tracking","has_keyboard_monitoring",
        "has_popups","number_of_hidden_element",
        "page_redirect","form_redirect_behavior","external_form_action"
    ]
    password = [  # Password features
        "password_type_count","password_name_id_count",
        "hidden_password_count","form_with_password"
    ]
    js_feats = ["clipboard_monitoring","form_data_collection","cookie_manipulation"]  # Suspicious JS
    metadata = ["URL","label"]               # Metadata columns
    return basic + dynamic + password + js_feats + metadata


# 6. Performance Testing & `__main__`

**Purpose:**  
- **`test_thread_performance()`**: Run the pipeline with different thread counts, measure total time, and return results.  
- **`__main__` block**: When you run `python this_script.py`, it tests `[10, 20, 50]` threads on the legitimate list and prints the timings.
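The measurement pattern itself can be tried without a browser. The sketch below swaps in `ThreadPoolExecutor` and `time.perf_counter()` for the manual threads and `time.time()` used in this notebook. `fake_fetch` is a hypothetical I/O-bound stand-in for a page load, and the thread counts are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(_):
    time.sleep(0.01)        # stand-in for an I/O-bound page load


def time_run(num_threads, num_tasks=40):
    """Run num_tasks fake fetches on num_threads workers; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(fake_fetch, range(num_tasks)))   # wait for all tasks
    return time.perf_counter() - start


results = [(n, time_run(n)) for n in (1, 4, 8)]
for n, elapsed in results:
    print(f"Threads: {n}, Time Taken: {elapsed:.2f} seconds")
```

With real browsers the curve is unlikely to be this clean, since each headless Chrome instance also consumes CPU and memory, so timings from a stand-in like this say nothing about the right thread count for the actual pipeline.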


In [None]:
# Function to test various thread counts
def test_thread_performance(input_file, output_file_template, label, thread_counts):
    results = []                                # Store (threads, time)
    for nt in thread_counts:
        print(f"Testing with {nt} threads...")
        start_time = time.time()                # Start timer
        process_urls_threaded(
            input_file,
            output_file_template.format(nt),
            label,
            num_threads=nt,
            max_count=100
        )
        elapsed = time.time() - start_time      # Compute elapsed time
        print(f"Threads: {nt}, Time Taken: {elapsed:.2f} seconds")
        results.append((nt, elapsed))
    return results

# Entry point when run as script
if __name__ == "__main__":
    thread_counts_to_test = [10, 20, 50]        # Thread counts to evaluate
    print("Testing legitimate websites processing speed...")
    legit_results = test_thread_performance(
        "tranco_list.csv",
        "legitimate_websites_{}_threads.csv",
        label=0,
        thread_counts=thread_counts_to_test
    )
    print("\nResults:")
    for nt, elapsed in legit_results:
        print(f"Threads: {nt}, Time Taken: {elapsed:.2f} seconds")

