## 🚀 Phishing‑Check API with Flask

**Overview:**  
This lightweight REST API receives a URL, renders and analyzes it via Selenium+BeautifulSoup, then returns a phishing prediction and confidence score from a pre‐trained model. An in‐memory cache ensures repeat requests are served instantly.

---

### 📦 1) Setup & Model Loading

- **Flask & CORS**  
  - `Flask` provides HTTP routing and JSON handling.  
  - `CORS` enables browser clients on other origins to call our API.
- **Joblib**  
  - Loads a saved scikit‑learn (or compatible) model from disk into memory on startup.
- **Cache**  
  - A simple Python `dict` that tracks URLs already in‐flight or processed.

---

### 🔧 2) Browser Initialization

- **Headless Chrome**  
  - Runs without a GUI (`--headless`) for efficiency on servers.  
  - Disables sandboxing and shared‑memory usage to avoid Docker or CI issues.
- **`initialize_driver()`**  
  - Encapsulates WebDriver setup to keep endpoint code clean.

---

### 🛠️ 3) `/check_url` Endpoint Logic

1. **Input Validation**  
   - Rejects requests missing the `url` field with a 400 error.
2. **Cache Handling**  
   - **In‑Flight:** If a URL is already being processed, replies with a “please wait” message.
   - **Done:** If processed before, returns the stored result instantly.
3. **Processing**  
   - Marks the URL as “in progress” in the cache.  
   - Calls `check_if_phishing()`, updates the cache, and returns the prediction.
4. **Error Resilience**  
   - On any exception, caches a conservative “phishing” verdict (to err on the side of caution) and returns an error message.

---

### 🔍 4) Core Detection Function

- **Page Rendering:**  
  Uses Selenium to execute JavaScript and get the final DOM.
- **Feature Extraction:**  
  Passes the WebDriver, BeautifulSoup object, and URL to your custom `FeaturesExtraction` class to compute a numeric feature vector.
- **Prediction & Confidence:**  
  - `model.predict(...)` yields `0` for **legitimate** or `1` for **phishing**.  
  - `model.predict_proba(...)` returns probabilities; we choose the probability of the predicted class as our confidence score.
- **Resource Cleanup:**  
  Ensures the browser is always closed, even on errors, to prevent resource leaks.

---

### 🚀 5) Launching the Service

- Runs the Flask development server in **debug** mode for rapid iteration.  
- For production, consider a WSGI server (e.g. Gunicorn) and environment variables to disable debug mode.

---

**Next Steps & Best Practices:**

- **Persistence:**  
  Replace the in‐memory cache with Redis or Memcached for multi‑process deployments.
- **Rate Limiting & Authentication:**  
  Protect your endpoint from abuse by limiting requests per client.
- **Metrics & Logging:**  
  Instrument with Prometheus/Grafana or ELK to track usage, latencies, and errors.
- **Containerization:**  
  Dockerize the service along with a headless Chrome binary for easy deployment.


In [10]:
%pip install nbimporter

# --------------------------------------------------------
# Purpose:
#   Flask-based REST API to check if a given URL is phishing using a pre-trained model.
#   Supports caching to avoid duplicate work and returns prediction with confidence.
# --------------------------------------------------------

from flask import Flask, request, jsonify         # Flask web framework for HTTP endpoints
from flask_cors import CORS                        # Enable Cross-Origin requests
from selenium import webdriver                     # Automate Chrome to fetch dynamic pages
from bs4 import BeautifulSoup                      # Parse HTML for feature extraction
import joblib                                      # Load serialized ML model

import sys
import os

# Add the path to the dataset directory containing features_extraction.ipynb
# __file__ is not defined in Jupyter, so use the current working directory instead
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), 'dataset')))
print(f"Current working directory: {os.getcwd()}")  
%cd dataset  # Change to the dataset directory
import nbimporter
from features_extraction import FeaturesExtraction  # Now import from the notebook

# --------------------------------------------------------
# 1) App & Model Initialization
# --------------------------------------------------------
app = Flask(__name__)  # Create Flask app instance
CORS(app)               # Allow cross-origin requests to this API

MODEL_PATH = 'Model/model.pkl'          # Path to serialized ML model
model = joblib.load(MODEL_PATH)         # Load model into memory

# Initialize a simple in-memory cache to store results per URL
cache = {}

# --------------------------------------------------------
# 2) Helper: Initialize Selenium WebDriver
# --------------------------------------------------------
def initialize_driver():
    # Configure Chrome to run headless (no UI)
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    return webdriver.Chrome(options=options)  # Return configured WebDriver

# --------------------------------------------------------
# 3) Main Endpoint: /check_url
# --------------------------------------------------------
@app.route('/check_url', methods=['POST'])  # Define POST endpoint
def check_url():
    data = request.get_json()               # Parse JSON payload
    url = data.get('url')                   # Extract 'url' field

    if not url:                             # Validate input
        return jsonify({'error': 'URL is required'}), 400

    # If URL is cached, return stored result
    if url in cache:
        # If still processing, ask client to wait
        if cache[url]['is_phishing'] is None:
            return jsonify({
                'url': url,
                'error': 'Please wait for a moment!',
                'is_phishing': True,     # Mark suspicious by default
                'confidence': 1.0
            })
        # Return cached prediction
        return jsonify({
            'url': url,
            'is_phishing': cache[url]['is_phishing'],
            'confidence': cache[url]['confidence']
        })

    # Mark as processing to prevent duplicate work
    cache[url] = {'is_phishing': None, 'confidence': None}

    try:
        # Perform actual phishing check
        result = check_if_phishing(url)
        cache[url] = result  # Store final result
        return jsonify({
            'url': url,
            'is_phishing': result['is_phishing'],
            'confidence': result['confidence']
        })
    except Exception as e:
        print(f"Error processing URL {url}: {e}")
        # On error, cache as suspicious and return error
        cache[url] = {'is_phishing': True, 'confidence': 1.0}
        return jsonify({
            'url': url,
            'error': str(e),
            'is_phishing': True,
            'confidence': 1.0
        })

# --------------------------------------------------------
# 4) Core Logic: check_if_phishing
# --------------------------------------------------------
def check_if_phishing(url):
    driver = None
    try:
        driver = initialize_driver()           # Start browser
        driver.get(url)                        # Load page URL

        html_content = driver.page_source      # Get rendered HTML
        soup = BeautifulSoup(html_content, 'html.parser')  # Parse with BeautifulSoup

        # Extract features from page and URL
        feature_extractor = FeaturesExtraction(driver, soup, url)
        features = feature_extractor.create_vector()  # List of feature values

        prediction = model.predict([features])[0]      # 0=legit,1=phishing
        probability = model.predict_proba([features])[0]  # [prob_legit, prob_phish]
        confidence = float(probability[1] if prediction == 1 else probability[0])

        return {'is_phishing': bool(prediction), 'confidence': confidence}
    finally:
        if driver:
            driver.quit()  # Always close browser to free resources

# --------------------------------------------------------
# 5) Run the App
# --------------------------------------------------------
if __name__ == '__main__':
    app.run(debug=True)  # Launch Flask dev server with debug mode


Note: you may need to restart the kernel to use updated packages.
Current working directory: c:\Users\USER\Desktop\MLproject\ayman
[WinError 2] Le fichier spécifié est introuvable: 'dataset # Change to the dataset directory'
c:\Users\USER\Desktop\MLproject\ayman


ModuleNotFoundError: No module named 'features_extraction'