# Domain Ranking Model - DuckDuckGo TrackerRadar Integration

This notebook demonstrates the complete end-to-end ML pipeline for domain risk scoring using DuckDuckGo TrackerRadar data.

## Overview

Our domain ranking model uses machine learning to predict domain tracking intensity based on:
- Domain characteristics and reputation
- TrackerRadar data for known tracking behaviors
- Privacy policies and data collection practices

The model outputs safety scores (0-100) that integrate with our privacy analysis heuristics.
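
The relationship between the model's raw output and the reported score can be sketched in a few lines. This is a minimal illustration, assuming the model predicts a tracking intensity in [0, 1] as the Step 4 cell does. `to_safety_score` is a hypothetical helper name, and the clamping is an added safeguard rather than something the pipeline is known to perform.

```python
def to_safety_score(tracking_intensity: float) -> float:
    """Map a tracking intensity in [0, 1] to a safety score in [0, 100]."""
    # Clamp first so an out-of-range regressor output cannot leave the 0-100 scale
    clamped = min(max(tracking_intensity, 0.0), 1.0)
    # Higher intensity means lower safety, hence the inversion
    return (1.0 - clamped) * 100.0


print(to_safety_score(0.25))  # 75.0
```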

In [None]:
# Import required libraries
import sys
import os
sys.path.append(os.path.join(os.getcwd(), '..', 'backend'))

import pandas as pd
import numpy as np
import requests
import json
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Working directory: {os.getcwd()}")
print(f"Python path includes: {sys.path[-1]}")

## Step 1: Load and Explore TrackerRadar Data

First, let's load the DuckDuckGo TrackerRadar data and explore its structure.
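
To make the parsing step concrete, here is a rough sketch of flattening one domain record into numeric features. The record is hand-written for illustration and only loosely mirrors a TrackerRadar per-domain entry. Its field names and the `flatten_record` helper are assumptions, not the actual `TrackerRadarParser` logic. The output keys reuse feature names that appear later in this notebook.

```python
from typing import Any, Dict

# Hand-written example record (illustrative values, not real TrackerRadar data)
sample_record: Dict[str, Any] = {
    "domain": "tracker.example",
    "owner": {"name": "Example Ads Inc."},
    "fingerprinting": 2,
    "prevalence": 0.12,
    "sites": 340,
    "subdomains": ["cdn", "pixel"],
    "cnames": [],
    "resources": [
        {"type": "Script", "fingerprinting": 3},
        {"type": "Image", "fingerprinting": 0},
    ],
}


def flatten_record(record: Dict[str, Any]) -> Dict[str, float]:
    """Turn one nested domain record into a flat dict of numeric features."""
    resources = record.get("resources", [])
    script_count = sum(1 for r in resources if r.get("type", "").lower() == "script")
    # Average per-resource fingerprinting, or 0.0 when the domain has no resources
    avg_fp = (
        sum(r.get("fingerprinting", 0) for r in resources) / len(resources)
        if resources
        else 0.0
    )
    return {
        "fingerprinting": float(record.get("fingerprinting", 0)),
        "global_prevalence": float(record.get("prevalence", 0.0)),
        "num_sites": float(record.get("sites", 0)),
        "num_subdomains": float(len(record.get("subdomains", []))),
        "num_cnames": float(len(record.get("cnames", []))),
        "num_resources": float(len(resources)),
        "owner_present": 1.0 if record.get("owner") else 0.0,
        "resource_type_script_count": float(script_count),
        "avg_resource_fingerprinting": avg_fp,
    }


features = flatten_record(sample_record)
for name, value in features.items():
    print(f"{name}: {value}")
```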

In [None]:
# Load the training script components
from scripts.train_domain_model import TrackerRadarParser, TargetConstructor, DomainRiskModel

# Initialize the parser and load TrackerRadar data
parser = TrackerRadarParser()

print("Downloading DuckDuckGo TrackerRadar data...")
tracker_data = parser.download_tracker_radar()
print(f"✅ Downloaded data with {len(tracker_data.get('domains', {}))} domains")

# Parse domains and display sample data
parsed_domains = parser.parse_domains(tracker_data)
print(f"✅ Parsed {len(parsed_domains)} domains with features")

# Show sample parsed domains
print("\nSample domain data:")
for i, (domain, features) in enumerate(list(parsed_domains.items())[:5]):
    print(f"{i+1}. {domain}:")
    for key, value in features.items():
        print(f"   {key}: {value}")
    print()

## Step 2: Generate Training Targets

Now let's construct tracking intensity targets for our ML model.
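
The actual target formula lives in `TargetConstructor` and is not shown in this notebook. As a purely illustrative sketch of what a tracking intensity target in [0, 1] could look like, the function below blends three signals with invented weights. The weights, the 0-3 fingerprinting scale, and the name `tracking_intensity` are all assumptions.

```python
def tracking_intensity(fingerprinting: int, cookie_prevalence: float, prevalence: float) -> float:
    """Blend three signals into a single target in [0, 1] (illustrative weights)."""
    # Normalize the fingerprinting score, assumed here to lie on a 0-3 scale
    fp_component = min(max(fingerprinting, 0), 3) / 3.0
    # Invented weights: fingerprinting dominates, prevalence signals contribute less
    raw = 0.5 * fp_component + 0.3 * cookie_prevalence + 0.2 * prevalence
    # Clip so the target stays a valid intensity
    return min(max(raw, 0.0), 1.0)


print(f"{tracking_intensity(3, 0.4, 0.1):.3f}")
print(f"{tracking_intensity(0, 0.0, 0.001):.3f}")
```

Keeping the target bounded makes the later conversion to a 0-100 safety score straightforward.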

In [None]:
# Generate tracking intensity targets
target_constructor = TargetConstructor()
targets = target_constructor.construct_targets(tracker_data)

print(f"✅ Generated {len(targets)} tracking intensity targets")
print(f"Target range: {min(targets.values()):.3f} to {max(targets.values()):.3f}")
print(f"Mean target: {np.mean(list(targets.values())):.3f}")

# Show distribution of targets
plt.figure(figsize=(10, 6))
plt.hist(list(targets.values()), bins=30, alpha=0.7, edgecolor='black')
plt.xlabel('Tracking Intensity')
plt.ylabel('Number of Domains')
plt.title('Distribution of Tracking Intensity Targets')
plt.grid(True, alpha=0.3)
plt.show()

# Show sample targets
print("\nSample tracking intensity scores:")
sorted_targets = sorted(targets.items(), key=lambda x: x[1], reverse=True)
for i, (domain, score) in enumerate(sorted_targets[:10]):
    print(f"{i+1}. {domain}: {score:.3f}")
print("...")
for i, (domain, score) in enumerate(sorted_targets[-5:]):
    print(f"{len(sorted_targets)-4+i}. {domain}: {score:.3f}")

## Step 3: Train Domain Risk Model

Let's train our machine learning model using the parsed features and targets.
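
The training cell reports R² and RMSE. As a reminder of what those two numbers measure, here is a small dependency-free sketch. The function names are hypothetical, and in practice scikit-learn's `r2_score` and `mean_squared_error` (imported at the top of this notebook) compute the same quantities.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: typical size of a prediction error, in target units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1.0 is perfect, 0.0 matches predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Undefined when all targets are identical (ss_tot == 0); callers should guard for that
    return 1.0 - ss_res / ss_tot


targets_demo = [0.0, 1.0]
constant_guess = [0.5, 0.5]
print(rmse(targets_demo, constant_guess))  # 0.5
print(r2(targets_demo, constant_guess))    # 0.0
```

Because the cell evaluates on the training data itself, both metrics are likely to look better than they would on held-out domains.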

In [None]:
# Train the domain risk model
model = DomainRiskModel(model_type='lightgbm')

print("Training domain risk model...")
model.train(parsed_domains, targets)
print("✅ Model training completed!")

# Display model performance
print(f"Training R² score: {model.r2_score:.4f}")
print(f"Training RMSE: {model.rmse:.4f}")
print(f"Number of features: {len(model.feature_names_)}")

# Show feature importance (top 15)
feature_importance = model.get_feature_importance()
print("\nTop 15 most important features:")
for i, (feature, importance) in enumerate(feature_importance[:15]):
    print(f"{i+1:2d}. {feature}: {importance:.4f}")

# Plot feature importance
plt.figure(figsize=(12, 8))
features, importances = zip(*feature_importance[:20])
plt.barh(range(len(features)), importances)
plt.yticks(range(len(features)), features)
plt.xlabel('Feature Importance')
plt.title('Top 20 Most Important Features')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## Step 4: Test Model Predictions

Let's test our model with some example domains and see how it performs.
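
The chart in the next cell colors each bar by safety band. The same thresholds (80 and above is green, 60 to 79 is orange, anything lower is red) can be expressed as a small helper. `safety_band` is a hypothetical name used only for this sketch.

```python
def safety_band(safety_score: float) -> str:
    """Return the display color for a 0-100 safety score."""
    if safety_score >= 80:
        return "green"   # low tracking risk
    if safety_score >= 60:
        return "orange"  # moderate tracking risk
    return "red"         # high tracking risk


for score in (92.5, 60.0, 12.0):
    print(f"{score:5.1f} -> {safety_band(score)}")
```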

In [None]:
# Test with example domains
test_domains = [
    'google-analytics.com',
    'facebook.com', 
    'doubleclick.net',
    'github.com',
    'stackoverflow.com',
    'wikipedia.org',
    'amazon.com',
    'cloudflare.com'
]

print("Testing model predictions on example domains:")
print("=" * 60)

test_results = []
for domain in test_domains:
    # Get tracking intensity prediction (0-1)
    tracking_intensity = model.predict_single_domain(domain)
    
    # Convert to safety score (0-100, higher is safer)
    safety_score = (1 - tracking_intensity) * 100
    
    test_results.append({
        'domain': domain,
        'tracking_intensity': tracking_intensity,
        'safety_score': safety_score
    })
    
    print(f"{domain:25} | Intensity: {tracking_intensity:.3f} | Safety: {safety_score:.1f}/100")

# Create visualization
df_results = pd.DataFrame(test_results)
df_results = df_results.sort_values('safety_score')

plt.figure(figsize=(12, 6))
bars = plt.barh(df_results['domain'], df_results['safety_score'])

# Color code bars (red for low safety, green for high safety)
for i, bar in enumerate(bars):
    score = df_results.iloc[i]['safety_score']
    if score >= 80:
        bar.set_color('green')
    elif score >= 60:
        bar.set_color('orange') 
    else:
        bar.set_color('red')

plt.xlabel('Safety Score (0-100)')
plt.title('Domain Safety Scores - Model Predictions')
plt.xlim(0, 100)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Step 5: Save Model and Test Integration

Finally, let's save the trained model and test the FastAPI integration.
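
The integration test below passes each domain with a `frequency` and reads back a weighted score. One plausible way to combine per-domain scores into a page-level score is a frequency-weighted average, sketched here. This is an assumption about the general idea, not the actual `DomainScoringService` implementation. The function name and the `unknown_default` of 50.0 are invented for illustration.

```python
from typing import Dict, Optional


def aggregate_page_score(
    domain_counts: Dict[str, int],
    domain_scores: Dict[str, float],
    unknown_default: float = 50.0,
) -> Optional[float]:
    """Frequency-weighted mean of per-domain safety scores (0-100)."""
    total_requests = sum(domain_counts.values())
    if total_requests == 0:
        return None  # no third-party traffic to score
    # Domains missing from the model fall back to a neutral default score
    weighted_sum = sum(
        count * domain_scores.get(domain, unknown_default)
        for domain, count in domain_counts.items()
    )
    return weighted_sum / total_requests


counts = {"tracker.example": 3, "unknown.example": 1}
scores = {"tracker.example": 20.0}
print(aggregate_page_score(counts, scores))  # 27.5
```

Weighting by request count lets a domain contacted many times influence the page score more than one contacted once.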

In [None]:
# Save the trained model
models_dir = os.path.join('..', 'backend', 'models')
os.makedirs(models_dir, exist_ok=True)

model_path = model.save_model(models_dir)
print(f"✅ Model saved to: {model_path}")

# Test the ML scoring integration
print("\nTesting ML scoring integration...")
try:
    from app.ml_scoring import DomainScoringService, get_ml_score_for_page
    from app.models import PrivacyFeatures
    
    # Create sample privacy features
    sample_features = PrivacyFeatures(
        page_url="https://example.com",
        num_third_party_domains=5,
        num_third_party_scripts=8,
        num_third_party_cookies=3,
        fraction_third_party_requests=0.4,
        num_known_tracker_domains=2,
        num_persistent_cookies=4,
        has_analytics_global=1,
        num_inline_scripts=3,
        fingerprinting_flag=0,
        tracker_script_ratio=0.25,
        third_party_domains=["google-analytics.com", "facebook.com", "doubleclick.net", "amazon.com", "cloudflare.com"],
        additional_data={}
    )
    
    # Get ML score
    ml_score = get_ml_score_for_page(sample_features)
    print(f"✅ ML scoring successful! Score: {ml_score:.1f}/100")
    
    print("\nDomain-level predictions for this page:")
    for domain in sample_features.third_party_domains:
        # Test individual domain scoring
        service = DomainScoringService()
        domains_data = [{
            'domain': domain,
            'frequency': 1,
            'is_known_tracker': domain in ['google-analytics.com', 'facebook.com', 'doubleclick.net']
        }]
        result = service.score_domains(domains_data)
        print(f"  {domain}: {result['weighted_score']:.1f}/100")
        
except Exception as e:
    print(f"❌ Integration test failed: {e}")
else:
    # Only report success when the integration test did not raise
    print("\n🎉 Domain ranking model pipeline completed successfully!")
print("\nNext steps:")
print("1. The model is now integrated into the FastAPI backend")
print("2. Privacy scores will use real ML predictions instead of fixed values")
print("3. The ML scoring API is available at /api/v1/model/ endpoints")
print("4. Heuristic penalties are now reduced by half for better balance")

## Step 6: Retrain with Latest Aggressive Improvements

Let's retrain the model with the latest aggressive improvements and verify the entire pipeline works correctly.
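
A retraining subprocess only sees the packages installed for this notebook if it runs under the same interpreter as the kernel. A bare `python3` is resolved through `PATH` and may point at a different environment, which is consistent with the `No module named 'pandas'` failure in the recorded output for this step. The sketch below shows the general pattern with a harmless one-line command. `run_with_current_interpreter` is a hypothetical helper name.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_current_interpreter(
    args: List[str], cwd: Optional[str] = None
) -> subprocess.CompletedProcess:
    """Run a Python command using the interpreter that is executing this code."""
    # sys.executable is the running interpreter, so the child process
    # uses the same environment and installed packages
    return subprocess.run(
        [sys.executable, *args], cwd=cwd, capture_output=True, text=True
    )


result = run_with_current_interpreter(["-c", "import sys; print(sys.executable)"])
print(result.returncode)
print(result.stdout.strip())
```

Passing `cwd` also avoids calling `os.chdir`, which would otherwise leave the notebook in the wrong directory if the cell were interrupted.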

In [1]:
# Retrain the model using the latest training script with aggressive improvements
import subprocess
import os

print("🔄 Retraining model with latest aggressive improvements...")
print("This includes:")
print("- Enhanced category-based fingerprinting detection")
print("- 70% reduction in num_resources importance")
print("- Legitimate domain allowlisting")
print("- Very aggressive multi-category tracking scoring")
print()

# Run the training script from the project root using the kernel's own interpreter,
# so the script sees the same installed packages as this notebook
import sys

project_root = os.path.abspath('..')
result = subprocess.run([sys.executable, 'scripts/train_domain_model.py'],
                        cwd=project_root, capture_output=True, text=True)

if result.returncode == 0:
    print("✅ Model retraining completed successfully!")
    print('\n'.join(result.stdout.splitlines()[-20:]))  # Show last 20 lines
else:
    print("❌ Model retraining failed!")
    print(result.stderr)

🔄 Retraining model with latest aggressive improvements...
This includes:
- Enhanced category-based fingerprinting detection
- 70% reduction in num_resources importance
- Legitimate domain allowlisting
- Very aggressive multi-category tracking scoring

❌ Model retraining failed!
Traceback (most recent call last):
  File "/Users/zananvirani/Desktop/PrivInspect/scripts/train_domain_model.py", line 23, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'



In [2]:
# Test the retrained model on key domains
import joblib
import json
import numpy as np

print("🧪 Testing retrained model on key domains...")

# Load the retrained model
try:
    model_data = joblib.load('../models/domain_risk_model.pkl')
    model = model_data['model']
    scaler = model_data['scaler']
    
    # Load features
    with open('../models/domain_features.json', 'r') as f:
        all_features = json.load(f)
    
    print(f"✅ Loaded retrained model with {len(all_features)} domains")
    
    # Test key domains
    test_domains = [
        'addthis.com',           # Previously problematic
        'doubleclick.net',       # Should be very low safety
        'googletagmanager.com',  # Should be very low safety
        'facebook.com',          # Should be very low safety
        'google-analytics.com',  # Should be very low safety
        'wikipedia.org',         # Should be high safety
        'archive.org',           # Should be high safety
        'github.com',            # Should be high safety
        'mozilla.org',           # Should be high safety
        'firebaseremoteconfig.googleapis.com'  # From ChatGPT example
    ]
    
    print("\n" + "="*70)
    print("RETRAINED MODEL TEST RESULTS")
    print("="*70)
    
    tracking_scores = []
    legitimate_scores = []
    
    for domain in test_domains:
        if domain in all_features:
            features = all_features[domain]
            feature_vector = [
                features['fingerprinting'],
                features['cookies_prevalence'], 
                features['global_prevalence'],
                features['num_sites'],
                features['num_subdomains'],
                features['num_cnames'],
                features['num_resources'],
                features['num_top_initiators'],
                features['owner_present'],
                features['resource_type_script_count'],
                features['resource_type_xhr_count'],
                features['resource_type_image_count'],
                features['resource_type_css_count'],
                features['resource_type_font_count'],
                features['resource_type_media_count'],
                features['avg_resource_fingerprinting'],
                features['has_example_sites']
            ]
            
            # Scale features and predict
            scaled_features = scaler.transform([feature_vector])
            prediction = model.predict(scaled_features)[0]
            
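            # Convert to safety score (0-100, higher = safer)
            # NOTE: this assumes the saved model predicts safety in [0, 1] directly;
            # Step 4 instead inverts a tracking intensity, so verify which convention applies
            safety_score = prediction * 100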
            
            # Categorize for analysis
            if domain in ['wikipedia.org', 'archive.org', 'github.com', 'mozilla.org']:
                legitimate_scores.append(safety_score)
                category = "🟢 LEGITIMATE"
            else:
                tracking_scores.append(safety_score)
                category = "🔴 TRACKING"
            
            # Get enhanced fingerprinting for analysis
            enhanced_fp = features['fingerprinting']
            
            print(f"{domain:35} | {safety_score:5.1f}/100 | Enhanced FP: {enhanced_fp:3.1f} | {category}")
        else:
            print(f"{domain:35} | NOT FOUND in dataset")
    
    # Performance summary
    print("\n" + "="*70)
    print("PERFORMANCE SUMMARY")
    print("="*70)
    
    if tracking_scores:
        print(f"🔴 TRACKING DOMAINS:")
        print(f"   Average: {np.mean(tracking_scores):.1f}/100 (should be low)")
        print(f"   Range: {min(tracking_scores):.1f} - {max(tracking_scores):.1f}/100")
        
    if legitimate_scores:
        print(f"🟢 LEGITIMATE DOMAINS:")
        print(f"   Average: {np.mean(legitimate_scores):.1f}/100 (should be high)")
        print(f"   Range: {min(legitimate_scores):.1f} - {max(legitimate_scores):.1f}/100")
    
    # Check AddThis specifically
    # tracking_scores was filled in test_domains order, skipping legitimate and missing
    # domains, so rebuild that same order to pair each tracking domain with its score
    legitimate_domains = ['wikipedia.org', 'archive.org', 'github.com', 'mozilla.org']
    scored_tracking_domains = [d for d in test_domains
                               if d in all_features and d not in legitimate_domains]
    tracking_score_by_domain = dict(zip(scored_tracking_domains, tracking_scores))
    if 'addthis.com' in tracking_score_by_domain:
        addthis_score = tracking_score_by_domain['addthis.com']
        print(f"\n🎯 ADDTHIS IMPROVEMENT: {addthis_score:.1f}/100 (target: <30/100)")
        if addthis_score < 30:
            print("   ✅ SUCCESS: AddThis properly classified as risky!")
        elif addthis_score < 60:
            print("   🟡 GOOD: Major improvement from 100/100")
        else:
            print("   ❌ NEEDS MORE WORK: Still too high")
    
    print("\n✅ Retrained model testing completed!")
    
except Exception as e:
    print(f"❌ Error testing retrained model: {e}")
    import traceback
    traceback.print_exc()

ModuleNotFoundError: No module named 'joblib'

In [None]:
# Test full integration with gentler heuristic penalties
import sys
import os
sys.path.append('../backend')

print("🔗 Testing full integration with gentler heuristic penalties...")

try:
    from app.ml_scoring import get_ml_score_for_page, domain_scoring_service, initialize_domain_scoring
    from app.models import AnalyzeRequest, NetworkRequest, Script, Cookie
    
    # Initialize the ML scoring service
    initialize_domain_scoring()
    
    # Create a realistic test scenario (simulating ChatGPT page)
    chatgpt_request = AnalyzeRequest(
        page_url="https://chatgpt.com",
        network_requests=[
            NetworkRequest(url="https://ab.chatgpt.com/v1/rgstr", domain="ab.chatgpt.com", method="POST", type="xmlhttprequest"),
            NetworkRequest(url="https://ab.chatgpt.com/v1/rgstr", domain="ab.chatgpt.com", method="POST", type="xmlhttprequest"),
            NetworkRequest(url="https://ab.chatgpt.com/v1/rgstr", domain="ab.chatgpt.com", method="POST", type="xmlhttprequest"),
            NetworkRequest(url="https://ab.chatgpt.com/v1/rgstr", domain="ab.chatgpt.com", method="POST", type="xmlhttprequest"),
            NetworkRequest(url="https://ab.chatgpt.com/v1/rgstr", domain="ab.chatgpt.com", method="POST", type="xmlhttprequest"),
            NetworkRequest(url="https://firebaseremoteconfig.googleapis.com/v1/projects/castify-storage/namespaces/firebase:fetch", domain="firebaseremoteconfig.googleapis.com", method="POST", type="xmlhttprequest"),
            NetworkRequest(url="https://firebaseremoteconfig.googleapis.com/v1/projects/castify-storage/namespaces/firebase:fetch", domain="firebaseremoteconfig.googleapis.com", method="OPTIONS", type="xmlhttprequest"),
        ],
        scripts=[],
        raw_cookies=[],
        additional_data={}
    )
    
    # Test ML scoring
    ml_score = get_ml_score_for_page(chatgpt_request)
    print(f"✅ ML Score: {ml_score:.2f}/100")
    
    # Test individual domain scoring
    domain_counts = domain_scoring_service.extract_domains_from_analyze_request(chatgpt_request)
    result = domain_scoring_service.score_domains(domain_counts)
    
    print(f"\n📊 DETAILED SCORING RESULTS:")
    print(f"Total domains found: {result.total_domains}")
    print(f"Known domains: {result.known_domains}")
    print(f"Unknown domains: {result.unknown_domains}")
    print(f"Aggregated ML score: {result.aggregated_ml_score:.2f}/100")
    
    print(f"\nDomain breakdown:")
    for domain_score in result.domains:
        known_status = "✅ Known" if domain_score.domain_known else "❌ Unknown"
        print(f"  {domain_score.domain:35} | Count: {domain_score.count} | Score: {domain_score.domain_safe_score:5.1f}/100 | {known_status}")
    
    # Now test complete privacy scoring pipeline
    print(f"\n🧮 Testing complete privacy scoring with gentler penalties...")
    
    # This would normally be done in the FastAPI endpoint, but let's test the logic
    from app.routers.analyze import extract_privacy_features, compute_privacy_score
    
    # Extract features (this simulates what the analyze endpoint does)
    features = extract_privacy_features(chatgpt_request)
    
    # Compute privacy score with gentler penalties
    privacy_result = compute_privacy_score(features, chatgpt_request)
    
    print(f"\n🎯 FINAL PRIVACY ANALYSIS:")
    print(f"ML Base Score: {privacy_result['breakdown']['ml_base_score']:.2f}/100")
    print(f"Total Penalty: {privacy_result['breakdown']['total_penalty_applied']:.2f}")
    print(f"Final Score: {privacy_result['score']:.2f}/100")
    print(f"Grade: {privacy_result['grade_letter']}")
    
    print(f"\nPenalty breakdown (gentler system):")
    for penalty_type, value in privacy_result['breakdown']['individual_penalties'].items():
        if value != 0:
            print(f"  {penalty_type}: {value:.1f}")
    
    if privacy_result['breakdown']['penalty_cap_applied']:
        print(f"\n⚠️ Penalty cap applied (limited to -10.0)")
    
    print(f"\n✅ Full integration test completed successfully!")
    print(f"🎉 The system now provides much fairer scoring with gentler penalties!")
    
except Exception as e:
    print(f"❌ Integration test failed: {e}")
    import traceback
    traceback.print_exc()

# Domain Ranking Model using DuckDuckGo TrackerRadar

This notebook implements an end-to-end machine learning pipeline that:
1. Downloads and parses DuckDuckGo TrackerRadar data
2. Engineers domain-level features for privacy risk assessment
3. Trains a gradient boosted regressor to predict tracking intensity
4. Integrates with FastAPI for real-time domain scoring
5. Provides ML-based privacy scores for web page analysis

The trained model will be integrated into the PrivInspect backend to replace the placeholder ML score (currently fixed at 100) with actual domain-based risk predictions.
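
To illustrate how a predicted base score might replace the fixed 100 and then combine with heuristic penalties, here is a minimal sketch. It assumes penalties are non-negative magnitudes and that their total is capped at 10 points, matching the cap mentioned in the integration test output. The function name, the penalty keys, and the sign convention are assumptions, not the backend's actual `compute_privacy_score` logic.

```python
from typing import Dict


def final_privacy_score(
    ml_base_score: float,
    penalties: Dict[str, float],
    penalty_cap: float = 10.0,
) -> float:
    """Subtract capped heuristic penalties from an ML base score (0-100)."""
    # Cap the combined penalty so heuristics cannot overwhelm the ML prediction
    total_penalty = min(sum(penalties.values()), penalty_cap)
    # Keep the result on the 0-100 scale
    return max(0.0, min(100.0, ml_base_score - total_penalty))


# Hypothetical penalty names, used only for this example
example_penalties = {"known_trackers": 4.0, "persistent_cookies": 9.0}
print(final_privacy_score(90.0, example_penalties))  # 80.0
```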

## 1. Environment Setup and Dependencies

Install and import all required libraries for data processing, machine learning, and API integration.

In [None]:
# Install required packages (run this if packages are not installed)
import subprocess
import sys

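import importlib

def install_package(pip_name, import_name=None):
    """Install a package using pip if it cannot be imported"""
    try:
        # The import name can differ from the pip name (scikit-learn imports as sklearn)
        importlib.import_module(import_name or pip_name)
    except ImportError:
        print(f"Installing {pip_name}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])

# Required packages, mapping pip name to import name
packages = {
    'pandas': 'pandas', 'numpy': 'numpy', 'scikit-learn': 'sklearn',
    'lightgbm': 'lightgbm', 'xgboost': 'xgboost', 'fastapi': 'fastapi',
    'joblib': 'joblib', 'requests': 'requests', 'tqdm': 'tqdm',
    'matplotlib': 'matplotlib', 'seaborn': 'seaborn'
}

for pip_name, import_name in packages.items():
    install_package(pip_name, import_name)

print("All packages installed successfully!")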