## 1. Inputs: Two URL Lists

* **Phishing URLs (label = 1)**
  It expects a CSV (by default named `verified_online.csv`) with the PhishTank data. That file has columns like
  `phish_id, url, phish_detail_url, …, online, target`.
  When you call

  ```python
  process_urls_threaded(
      "verified_online.csv",      # input_file
      "phishing_websites.csv",    # output_file
      label=1,                    # tells it “these are phishing”
      …
  )
  ```

  it reads those columns (including the `url` field) into a DataFrame, takes the `url` column, and uses those as the list of pages to visit.

* **Legitimate URLs (label = 0)**
  It expects a simple CSV of popular domains (by default `tranco_list.csv`) with no header, just rank and domain name. When called with `label=0`, it reads each domain, prefixes it with `http://`, and uses that as the list of “legitimate” pages to visit.

---

## 2. Feature Extraction Worker

The heart of the script is a pool of worker threads that:

1. **Initialize a headless Chrome driver** (via Selenium) with images and notifications disabled.
2. **For each URL** (phishing or legit):

   * Calls `driver.get(url)` (with a 2 s timeout).
   * Grabs the rendered page HTML (`driver.page_source`) into BeautifulSoup.
   * Passes `driver`, the `soup`, and the `url` into the project’s `FeaturesExtraction` class.
   * Calls `create_vector()`, which returns a list of feature values (all the URL‑, HTML‑ and dynamic‑interaction features).
   * **Appends** the raw URL string and the label (0 or 1) to that feature vector.
   * **Immediately writes** that single row into the output CSV (either `phishing_websites.csv` or `legitimate_websites.csv`) in append mode, under a thread‑safe lock, so you end up with one big features+label table.

Any time a page times out or throws an exception, it logs the URL and continues.

---

## 3. Outputs

* **`phishing_websites.csv`**
  A CSV whose columns are exactly the 73 features returned by `get_feature_columns()`, plus two final columns:

  * `URL` (the string you visited)
  * `label` (1 for phishing)

* **`legitimate_websites.csv`**
  Same format, but `label` is 0.

---

## 4. How to Run It

At the bottom of the file there’s an `if __name__ == "__main__":` block where they set:

```python
PHISH_THREADS = 32
LEGITIMATE_THRESHOLD = 16

# Example (commented out) call for phishing:
# process_urls_threaded(
#   "verified_online.csv", "phishing_websites.csv", label=1,
#   start_row=21000, end_row=22000, num_threads=PHISH_THREADS,
# )

# Active call for legit:
process_urls_threaded(
  "tranco_list.csv", "legitimate_websites.csv", label=0,
  start_row=24100, end_row=24300, num_threads=LEGITIMATE_THRESHOLD,
)
```

You would typically:

1. Pick your row range (`start_row`, `end_row`) to control how many URLs to process.
2. Decide how many threads to use.
3. Run the script twice: once with `label=1` on the PhishTank CSV (to build `phishing_websites.csv`), then with `label=0` on the Tranco CSV (to build `legitimate_websites.csv`).

---

### Why This Matters

* **Inputs**: live lists of URLs (one phishing, one legit).
* **Processing**: headless browser + feature‑extraction class.
* **Outputs**: two large CSVs of feature vectors ready for model training.

Once you’ve generated those CSVs, you can load them in your notebook (or Python script) and begin training and evaluating your classifiers. That’s the foundation of the entire project’s “URL + content → feature vector → model” pipeline.
"""

# URL & Content-based Phishing Detection Pipeline
"""
**Purpose:**
This script reads lists of URLs (either verified phishing sites or top-ranked legitimate domains), loads each page in a headless browser,
 extracts a rich set of features (both URL‐based and HTML/JavaScript‐based), and writes them as feature vectors into a CSV. 
These vectors can then be used to train or evaluate machine learning models to distinguish phishing pages from legitimate ones.

**Key Components & Theory:**
1. **Feature Engineering:**
   - **URL Features:** Length, presence of IP address, number of subdomains, redirection behavior, age of domain, etc.
   - **HTML/JS Features:** Number of forms, input fields, scripts, hidden elements, use of popups, monitoring scripts (e.g., keylogging, clipboard access), etc.
   Good feature selection is critical: combining structural clues (URL patterns) with behavioral indicators (DOM/JS characteristics) often yields higher detection accuracy.

2. **Selenium + BeautifulSoup:**
   - **Selenium** automates a real browser, enabling execution of dynamic JavaScript and rendering complex pages.
   - **BeautifulSoup** parses the final rendered HTML for static analysis.
   This hybrid approach captures both static and dynamic content cues.

3. **Multithreading & I/O Bound Tasks:**
   - Loading pages over the network is I/O bound. Using a thread pool speeds up the crawl by overlapping downloads.
   - A `Queue` distributes URLs to worker threads; a `Lock` ensures thread‐safe writes to the CSV.

4. **Scalability Considerations:**
   - Headless mode and disabling images/notifications reduce resource usage.
   - Timeouts prevent hanging on slow or unresponsive sites.
   - Thread‐local counters track progress per thread without race conditions.




   Let’s break it down step by step in everyday language—no jargon:

---

## 1. What this script does, in a nutshell

1. **Reads** a list of website addresses (URLs).
2. **Opens** each page in a “headless” browser (a browser you can’t see) to let JavaScript run.
3. **Scans** the finished page for lots of clues (features) that hint at phishing or not.
4. **Saves** each page’s clues as one row in a CSV file.
5. **Repeats** this quickly by running many mini‑workers (threads) at once.



### Why it’s built this way

* **Speed:** Threads + headless browser + disabled images = faster crawling.
* **Safety:** Timeouts and exception catches mean one bad URL won’t crash everything.
* **Scalability:** You can increase threads or split input CSVs into chunks.
* **Modularity:** All feature logic lives in `FeaturesExtraction`, so the crawling code stays clean.

Hope this clears up how each piece fits together! Let me know if any part still feels fuzzy.

# Imports & Global Variables

* **threading**:allows the script to run multiple operations at the same time (threads), so you can process many URLs in parallel, making the crawl much faster.
* **Queue**: is a thread-safe data structure that lets threads share work (URLs to process) without interfering with each other or causing data corruption. A queue in Python is a linear data structure that follows the First-In, First-Out (FIFO) principle, meaning the first element added to the queue is the first one to be removed.
* **selenium**: Selenium is a powerful tool for automating browsers.
    * It launches a real Chrome browser (invisible/headless mode), loads web pages, and interacts with them just like a human would.
    * This is crucial for pages that use JavaScript, as it ensures all dynamic content is loaded before analysis.
    * **webdriver**  launch and interact with a browser session so that web pages can be fully loaded and analyzed.

* **BeautifulSoup**: BeautifulSoup is a Python library for parsing HTML and XML documents.
      * After Selenium loads a page, BeautifulSoup reads the HTML so you can easily extract and count elements like forms, buttons, links, and more.
      * It’s especially useful for web scraping and analyzing the structure of web pages.
* **pandas**: reads and writes CSV files where all our results get stored.
* **tqdm**: shows a progress bar so you know how far along we are.
* **url warnings**: turns off annoying SSL warnings so they don’t clutter our output.

In [1]:
import threading  
from queue import Queue  
from bs4 import BeautifulSoup  # Beautiful Soup is a Python library designed for parsing HTML and XML documents, including those with malformed or imperfect markup. It facilitates the extraction of data from these documents, making it a popular tool for web scraping. 

from selenium import webdriver  #  to write Python scripts that control a web browser, navigate to web pages, and interact with elements on those pages
from selenium.webdriver.chrome.options import (
    Options,
)  # to configure ChromeDriver options
from selenium.common.exceptions import (
    TimeoutException,
)  # exception for page load timeouts
import pandas as pd  # pandas for reading/writing CSVs and DataFrame operations
import os  # for checking file existence
from features_extraction import FeaturesExtraction  # custom module to extract phishing features
from urllib3.exceptions import (
    InsecureRequestWarning,
)  # warning for insecure HTTPS requests
from urllib3 import disable_warnings  # function to suppress urllib3 warnings
from tqdm import tqdm  # progress bar utility for iterables

disable_warnings(InsecureRequestWarning)  # suppress insecure request warnings globally

# Thread-local storage for counters to avoid race conditions
phishing_count = threading.local()  # Each thread gets its own 'phishing_count.value'
legitimate_count = threading.local()  # Each thread gets its own 'legitimate_count.value'

# What this means:
# Instead of using a single global variable (which could be changed by multiple threads at once and cause errors),
# we use threading.local() to give each thread its own separate variable.
# This is important because when running many threads at the same time (to process URLs faster),
# we want each thread to keep track of its own progress without interfering with others.
# Threads allow us to do many tasks (like loading web pages) in parallel, making the whole process much faster.
# threading.local() gives each thread its own independent variable.
# So, when a thread sets phishing_count.value = 1, it doesn't affect the value in other threads.
# This helps keep the count accurate and prevents bugs when multiple threads are running at the same time.

ImportError: cannot import name 'FeaturesExtraction' from 'features_extraction' (c:\Users\USER\Desktop\MLproject\ayman\dataset\features_extraction.py)

#  Browser Initialization

**Purpose:** Encapsulate headless Chrome setup so each worker can spin up its own browser.

- Runs Chrome in headless mode (no visible window).  
- Disables sandbox and shared‑memory usage for container compatibility.  
- Turns off images & notifications for faster page loads.

## 3. Setting up the browser (`initialize_driver`)

* We ask Chrome to run in **headless** mode (no visible window).
* We disable images and pop‑up notifications so pages load faster and use less memory.
* We return a ready‑to‑go Chrome instance.

In [None]:
def initialize_driver():
    """
    Creates and returns a Chrome browser that runs invisibly in the background.
    This function sets up Chrome with specific options to make it work better for web scraping:
    - Opens Chrome in 'headless' mode (no window pops up).
    - Turns off images and pop-up notifications so pages load faster.
    - Tweaks settings so it works smoothly in virtual machines or Docker.
    - Gives you back a browser you can control with code (Selenium).

    This is used so each worker thread can load web pages quickly and quietly.
    :return: A Selenium WebDriver object for Chrome. 
    """
    chrome_options = Options()
    # Set Chrome options for headless operation and performance
    chrome_options.add_argument("--disable-gpu")  # Disable GPU hardware acceleration
    chrome_options.add_argument("--window-size=1920x1080")  # Set a fixed window size 
    chrome_options.add_argument("--headless")  # Run Chrome without opening a window
    chrome_options.add_argument("--no-sandbox")  # Needed for some server setups
    chrome_options.add_argument("--disable-dev-shm-usage")  # Fixes shared memory issues

    # Tell Chrome not to load images or show notifications
    # This speeds up page loading and avoids pop-ups that can slow down scraping
    # 'prefs' is a dictionary that sets Chrome's default content settings
    prefs = {
        "profile.managed_default_content_settings.images": 2,  # Block images
        "profile.managed_default_content_settings.stylesheets": 2,  # Block stylesheets
        "profile.managed_default_content_settings.scripts": 2,  # Block scripts
        "profile.managed_default_content_settings.popups": 2,  # Block pop-ups
        "profile.default_content_setting_values.notifications": 2,  # Block pop-ups
    }
    chrome_options.add_experimental_option("prefs", prefs)
    

    # Start Chrome with these settings and return the driver object
    return webdriver.Chrome(options=chrome_options)


# The “worker” function

Think of a worker as a little helper that does the following, over and over:

1. **Opens its own invisible browser** (so it can visit websites without you seeing anything).
2. **Takes a website address** from a shared list (the queue).
3. **Tries to load the page** (waits up to 2 seconds—if it’s too slow, skips it).
4. **Reads the page’s HTML** using BeautifulSoup (so it can look for clues).
5. **Uses the FeaturesExtraction tool** to turn the page into a list of numbers (features) that describe it.
6. **Adds the website address and its label** (1 = phishing, 0 = legit) to the end of that list.
7. **Writes that row into a CSV file** right away (using a lock so two helpers don’t write at the same time).
8. **Keeps track of how many pages it’s finished** and prints the count as it goes.
9. **Repeats this process** for each website, until it gets a special “None” signal (which means “you’re done!”).
10. **Closes its browser** before stopping.

**Why use multiple workers?**  
By running several workers at once (each in its own thread), the script can process many websites in parallel, making the whole job much faster. Each worker is independent and safe from interfering with others, thanks to the queue (for sharing work) and the lock (for writing to the file).

In [None]:
def worker(queue, output_file, label, columns_lock):
    """
    Worker function to be run in each thread.
    - queue: shared Queue of URLs to process
    - output_file: CSV path to append features
    - label: 1 for phishing, 0 for legitimate
    - columns_lock: threading.Lock to synchronize file writes
    """
    # initialize per-thread counters
    phishing_count.value = 0
    legitimate_count.value = 0

    driver = initialize_driver()  # get a fresh WebDriver
    driver.set_page_load_timeout(2)  # set max page load time to 2 seconds

    try:
        while True:
            url = queue.get()  # get next URL (blocks if empty)
            if url is None:  # termination signal
                break

            try:
                driver.get(url)  # navigate to the URL
                soup = BeautifulSoup(
                    driver.page_source, "html.parser"
                )  # parse page HTML

                # extract features from URL and DOM
                extractor = FeaturesExtraction(driver, soup, url)
                vector = extractor.create_vector()  # list of feature values

                # append metadata: URL string and label
                vector.append(str(url))
                vector.append(label)

                # thread-safe write: convert to DataFrame and append as a row
                with columns_lock:
                    df_row = pd.DataFrame([vector], columns=get_feature_columns())
                    df_row.to_csv(output_file, mode="a", header=False, index=False)

                # update and display counters
                if label == 1:
                    phishing_count.value += 1
                    print(f"\rProcessed phishing sites: {phishing_count.value}", end="")
                elif label == 0:
                    legitimate_count.value += 1
                    print(
                        f"\rProcessed legitimate sites: {legitimate_count.value}",
                        end="",
                    )

            except TimeoutException:
                print(f"Timeout while loading URL: {url}")  # page took too long
            except Exception as e:
                print(f"Error processing URL {url}: {e}")  # catch-all for other errors
            finally:
                queue.task_done()  # signal that this URL is processed
    finally:
        driver.quit()  # ensure browser is closed when thread finishes
        
print("cest bon")  # indicate thread completion


cest bon


# Orchestration Function Explained

This function is the "conductor" that coordinates the entire data collection process. Here’s what it does, step by step:

---

### **1. Read a slice from the input CSV**

- **Phishing URLs:**  
    If you’re processing phishing sites, it reads a chunk of rows from a CSV file (like `verified_online.csv`). It grabs the `url` column, which contains the full web addresses to visit.

- **Legitimate URLs:**  
    If you’re processing legitimate sites, it reads a chunk of rows from a CSV file (like `tranco_list.csv`). These rows usually just have domain names (like `google.com`). The function adds `http://` in front to make them full URLs.

- **Why a slice?**  
    You can specify which rows to process (`start_row` and `end_row`). This lets you break up a huge list into smaller, manageable pieces.

---

### **2. Build the list of URLs**

- After reading the CSV, it creates a Python list of URLs to visit.
- For legitimate domains, it automatically adds `http://` to each domain name.

---

### **3. Initialize a Queue, start worker threads, and ensure the output CSV has headers**

- **Queue:**  
    A thread-safe queue is created to hold all the URLs. This lets multiple worker threads safely pull URLs to process, one at a time.

- **Worker Threads:**  
    The function starts several worker threads (mini-programs running in parallel). Each worker:
    - Opens its own invisible browser.
    - Visits URLs from the queue.
    - Extracts features from the page.
    - Writes results to the output CSV.

- **Output CSV:**  
    If the output file (where results are saved) doesn’t exist yet, the function writes a header row with all the feature names. This ensures the CSV is ready for new data.

---

### **4. Enqueue URLs with a progress bar**

- All URLs are added to the queue, one by one.
- A progress bar (using `tqdm`) shows how many URLs have been queued, so you can track progress.

---

### **5. Signal threads to stop and wait for them**

- After all URLs are queued, the function adds a special `None` value to the queue for each worker.  
    This tells each worker, “You’re done, you can stop now.”
- The function waits for all worker threads to finish before exiting.

---

## **Why is this important?**

- **Parallel Processing:**  
    By using multiple threads, the function can process many URLs at the same time, making the whole process much faster.

- **Thread Safety:**  
    The queue and file-writing lock prevent data corruption when multiple threads are running.

- **Scalability:**  
    You can adjust the number of threads and the size of the data slice to fit your computer’s resources.

---

## **Key Terms**

- **Queue:**  
    A special list that lets threads safely share work.

- **Worker Thread:**  
    A mini-program that does the actual work (visiting URLs, extracting features).

- **CSV:**  
    A file format for storing tabular data (like a spreadsheet).

- **Progress Bar:**  
    A visual indicator of how much work has been done.

---

**In summary:**  
This orchestration function reads a batch of URLs, sets up parallel workers, manages safe data sharing, and writes the extracted features to a CSV file for later machine learning or analysis.

 ### Orchestrating everything (`process_urls_threaded`)

1. **Read** the input CSV (either phishing or legitimate list) for the slice of rows you asked for.
2. **Build** a Python list of URLs.
3. **Prepare**:

   * A `Queue` loaded with all those URLs.
   * A `Lock` to guard file writes.
4. **Create** the output CSV with headers if it doesn’t already exist.
5. **Launch** N worker threads, each running the `worker` function.
6. **Feed** each URL into the queue (with a nice progress bar).
7. **After** all URLs are queued, put N `None` entries to tell each worker “you’re done.”
8. **Wait** for all the threads to finish.

# Cell 5: Column Names

**Purpose:** Returns the ordered list of all feature names + `URL` and `label`.  
Must exactly match the order produced by `FeaturesExtraction.create_vector()`.

## 6. Defining the column names (`get_feature_columns`)

* This function just returns a long list of all the names of the clues (features) in the exact order that `FeaturesExtraction` 
produces them, plus two extras at the end:

  * **URL** (so you can see which row is which page)
  * **label** (1 or 0, telling you phishing or legitimate)

In [None]:
def get_feature_columns():
    """
    Generate the ordered list of feature column names for the output CSV.
    Returns a list of strings: all HTML, dynamic, password, JS, URL-based features, plus metadata.
    """
    # List of basic HTML and page structure features
    basic_columns = [
        "has_title",                  # Whether the page has a <title> tag
        "has_submit",                 # Whether the page has a submit button
        "has_link",                   # Whether the page has any <a> links
        "has_email_input",            # Whether the page has an email input field
        "number_of_inputs",           # Number of input fields on the page
        "number_of_buttons",          # Number of button elements
        "number_of_images",           # Number of image elements
        "number_of_option",           # Number of <option> elements (dropdowns)
        "number_of_list",             # Number of list elements (<ul>, <ol>)
        "number_of_href",             # Number of href attributes (links)
        "number_of_paragraph",        # Number of <p> (paragraph) elements
        "number_of_script",           # Number of <script> tags
        "length_of_title",            # Length of the page title
        "has_h1",                     # Whether the page has an <h1> header
        "has_h2",                     # Whether the page has an <h2> header
        "has_h3",                     # Whether the page has an <h3> header
        "length_of_text",             # Total length of visible text
        "number_of_clickable_button", # Number of clickable buttons
        "number_of_a",                # Number of <a> tags
        "number_of_div",              # Number of <div> elements
        "has_footer",                 # Whether the page has a <footer>
        "number_of_forms",            # Number of <form> elements
        "has_text_area",              # Whether the page has a <textarea>
        "has_iframe",                 # Whether the page has an <iframe>
        "has_text_input",             # Whether the page has a text input
        "number_of_meta",             # Number of <meta> tags
        "has_nav",                    # Whether the page has a <nav> element
        "number_of_sources",          # Number of <source> tags (media)
        "number_of_span",             # Number of <span> elements
        "number_of_table",            # Number of <table> elements
        "RequestURL",                 # Feature related to request URLs
        "AnchorURL",                  # Feature related to anchor URLs
        "Favicon",                    # Whether the page has a favicon
        "LinksInScriptTags",          # Links found inside <script> tags
        "ServerFormHandler",          # Whether the form is handled by the server
        "InfoEmail",                  # Presence of info email addresses
    ]

    # Features related to dynamic or interactive content
    dynamic_columns = [
        "has_mouse_tracking",         # Whether mouse tracking is present
        "has_keyboard_monitoring",    # Whether keyboard monitoring is present
        "has_popups",                 # Whether popups are present
        "number_of_hidden_element",   # Number of hidden elements
        "page_redirect",              # Whether the page redirects
        "form_redirect_behavior",     # Form submission causes redirect
        "external_form_action",       # Form action points to external site
    ]

    # Features related to password fields and forms
    password_columns = [
        "password_type_count",        # Number of password input types
        "password_name_id_count",     # Number of password fields with name/id
        "hidden_password_count",      # Number of hidden password fields
        "form_with_password",         # Whether a form contains a password field
    ]

    # Features related to JavaScript activity
    js_columns = [
        "clipboard_monitoring",       # JS monitors clipboard
        "form_data_collection",       # JS collects form data
        "cookie_manipulation",        # JS manipulates cookies
    ]

    # Features extracted from the URL or domain
    url_features_columns = [
        "UsingIp",                    # URL uses an IP address
        "longUrl",                    # URL is long
        "shortUrl",                   # URL is short
        "symbol",                     # URL contains special symbols
        "redirecting",                # URL redirects
        "prefixSuffix",               # URL has prefix/suffix
        "SubDomains",                 # Number of subdomains
        "Hppts",                      # Misspelled HTTPS
        "DomainRegLen",               # Domain registration length
        "NonStdPort",                 # Uses non-standard port
        "HTTPSDomainURL",             # HTTPS in domain part
        "AbnormalURL",                # Abnormal URL structure
        "WebsiteForwarding",          # Website forwarding detected
        "StatusBarCust",              # Custom status bar
        "DisableRightClick",          # Right-click disabled
        "UsingPopupWindow",           # Uses popup windows
        "IframeRedirection",          # Uses iframe for redirection
        "AgeofDomain",                # Age of the domain
        "DNSRecording",               # DNS record presence
        "WebsiteTraffic",             # Website traffic rank
        "PageRank",                   # PageRank value
        "GoogleIndex",                # Indexed by Google
        "LinksPointingToPage",        # Number of links pointing to page
        "StatsReport",                # Statistical report presence
    ]

    # Metadata columns (URL and label/class)
    metadata_columns = [
        "URL",                        # The URL of the page
        "label",                      # The label/class (e.g., phishing/legit)
    ]

    # Combine all feature lists into a single list and return it
    return (
        basic_columns
        + dynamic_columns
        + password_columns
        + js_columns
        + url_features_columns
        + metadata_columns
    )


## The “main” section

When you run the script directly:

1. It sets how many threads to use for phishing and legitimate sites.
2. It picks which rows of each CSV to process (so you can do them in chunks).
3. It **only** processes the legitimate list (the phishing part is commented out, but you could turn it on).
4. You see “Processing legitimate websites…” then the progress bar.
5. At the end, you see “Processing completed!”

In [None]:
if __name__ == "__main__":    # This block only runs if the script is executed directly (not imported as a module).

    # Set how many threads to use for each type of URL processing:
    PHISH_THREADS = 32              # Number of parallel threads for phishing URLs
    LEGITIMATE_THRESHOLD = 16       # Number of parallel threads for legitimate URLs

    # Define which rows of the phishing CSV to process (for batching):
    start_row_phishing = 21000      # Start at row 21,000 in the phishing CSV
    end_row_phishing = start_row_phishing + 1000  # Process 1,000 rows (21,000–21,999)

    # Define which rows of the legitimate CSV to process:
    start_row_legitimate = 24100    # Start at row 24,100 in the legitimate CSV
    end_row_legitimate = start_row_legitimate + 200  # Process 200 rows (24,100–24,299)
''' 
    # --- PHISHING PROCESSING (currently commented out) ---
    # To process phishing URLs, uncomment the following lines:
    # print("Processing phishing websites...")
    # process_urls_threaded(
    #     "verified_online.csv",         # Input CSV with phishing URLs
    #     "phishing_websites.csv",       # Output CSV for extracted features
    #     label=1,                       # Label 1 = phishing
    #     start_row=start_row_phishing,  # Start row for batch
    #     end_row=end_row_phishing,      # End row for batch
    #     num_threads=PHISH_THREADS,     # Number of threads to use
    # )

    # # --- LEGITIMATE PROCESSING (active) ---
    # print("Processing legitimate websites...")
    # process_urls_threaded(
    #     "tranco_list.csv",               # Input CSV with legitimate domains
    #     "legitimate_websites.csv",       # Output CSV for extracted features
    #     label=0,                         # Label 0 = legitimate
    #     start_row=start_row_legitimate,  # Start row for batch
    #     end_row=end_row_legitimate,      # End row for batch
    #     num_threads=LEGITIMATE_THRESHOLD,# Number of threads to use
    # )
'''
print("Processing completed!")       # Print when all URLs are processed

# In summary:
# - This code sets up how many threads to use and which rows to process from each input file.
# - It can process phishing or legitimate URLs (one at a time, depending on which block is active).
# - It calls process_urls_threaded() to extract features from each URL and save them to a CSV.
# - The process is parallelized for speed, and you can control the batch size by changing the row ranges.


Processing completed!
