# Tier 2: Support Vector Machines (SVM)

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** bd672181-b720-4a15-82ce-893083f211ae

---

## Citation
Brandon Deloatch, "Tier 2: Support Vector Machines (SVM)," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** bd672181-b720-4a15-82ce-893083f211ae
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
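
As a minimal sketch of what capturing execution metadata could look like, the snippet below records the interpreter, platform, and installed package versions using only the standard library. The helper name `capture_execution_metadata` and the package list are assumptions made for illustration; they are not part of this notebook.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_execution_metadata(packages=("numpy", "pandas", "scikit-learn", "plotly")):
    """Return a dict describing the interpreter, OS, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


run_info = capture_execution_metadata()
print(json.dumps(run_info, indent=2))
```

Saving this dictionary alongside the notebook outputs is one way to make a later rerun comparable to the original environment.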

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
# Import Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Scikit-learn imports
from sklearn.svm import SVC, SVR, LinearSVC
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, roc_curve, auc
from sklearn.metrics import precision_recall_curve, mean_absolute_error
from sklearn.datasets import make_classification, make_circles, make_moons

# Additional utilities
from sklearn.pipeline import Pipeline
from sklearn.model_selection import learning_curve, validation_curve
import warnings
warnings.filterwarnings('ignore')

print(" Tier 2: Support Vector Machines (SVM) - Libraries Loaded Successfully!")
print("=" * 80)
print("Available SVM Techniques:")
print("• Linear SVM - Maximum margin linear classification")
print("• Kernel SVM - Non-linear classification with RBF, polynomial kernels")
print("• SVM Regression (SVR) - Support vector regression for continuous targets")
print("• Hyperparameter Optimization - C, gamma, kernel parameter tuning")
print("• Support Vector Analysis - Understanding model decision boundaries")
print("• Kernel Trick Visualization - Non-linear transformation insights")

In [None]:
# Generate Comprehensive Datasets for SVM Analysis
np.random.seed(42)

def generate_svm_datasets():
 """Generate datasets optimized for SVM analysis with various complexity levels"""

 # 1. (APPROXIMATELY) LINEARLY SEPARABLE DATASET - Credit Risk Assessment
 n_samples = 1000

 # Generate linearly separable credit data
 credit_score = np.random.normal(650, 120, n_samples)
 credit_score = np.clip(credit_score, 300, 850)

 debt_to_income = np.random.beta(2, 5, n_samples) * 80 # 0-80% DTI
 annual_income = np.random.lognormal(10.8, 0.6, n_samples)
 annual_income = np.clip(annual_income, 30000, 200000)

 employment_years = np.random.exponential(5, n_samples) + 0.5
 employment_years = np.clip(employment_years, 0.5, 40)

 # Create clear linear separation for credit approval
 # Good credit: high score, low DTI, good income
 linear_separator = (credit_score - 600) / 100 + (50 - debt_to_income) / 25 + np.log(annual_income) - 10.5

 # Add noise to the score: samples near the boundary can change class, so the data is only approximately linearly separable
 noise = np.random.normal(0, 0.3, n_samples)
 credit_approved = (linear_separator + noise) > 0

 # Add realistic features
 previous_defaults = np.random.binomial(3, 0.1, n_samples)
 credit_utilization = np.random.beta(2, 3, n_samples) * 100

 # Make some correlation with approval
 previous_defaults[credit_approved] = np.random.binomial(3, 0.05, credit_approved.sum())
 previous_defaults[~credit_approved] = np.random.binomial(3, 0.2, (~credit_approved).sum())

 linear_df = pd.DataFrame({
 'credit_score': credit_score,
 'debt_to_income': debt_to_income,
 'annual_income': annual_income,
 'employment_years': employment_years,
 'previous_defaults': previous_defaults,
 'credit_utilization': credit_utilization,
 'approved': credit_approved.astype(int)
 })

 # 2. NON-LINEAR DATASET - Customer Segmentation (Circular patterns)
 # Generate concentric circles for non-linear classification
 X_circles, y_circles = make_circles(n_samples=800, noise=0.1, factor=0.3, random_state=42)

 # Transform to business context - Customer value segments
 # make_circles labels the outer circle 0 and the inner circle 1,
 # so the outer circle (0) is high value and the inner circle (1) is standard value
 customer_spending = X_circles[:, 0] * 5000 + 7000 # Scale to spending range
 customer_frequency = X_circles[:, 1] * 20 + 25 # Scale to frequency range

 # Add business-relevant features
 customer_tenure = np.random.exponential(3, len(X_circles)) + 0.5
 support_tickets = np.random.poisson(2, len(X_circles))

 # High value customers (label 0, the outer circle) have different patterns
 high_value_mask = y_circles == 0
 customer_tenure[high_value_mask] += np.random.exponential(2, high_value_mask.sum()) # Longer tenure
 support_tickets[high_value_mask] = np.random.poisson(1, high_value_mask.sum()) # Fewer tickets

 nonlinear_df = pd.DataFrame({
 'customer_spending': customer_spending,
 'customer_frequency': customer_frequency,
 'customer_tenure': customer_tenure,
 'support_tickets': support_tickets,
 'value_segment': y_circles # 0=High Value, 1=Standard
 })

 # 3. REGRESSION DATASET - House Price Prediction with Complex Relationships
 n_reg_samples = 1000

 # Generate house features
 lot_size = np.random.gamma(2, 2000, n_reg_samples) + 1000 # sq ft
 house_age = np.random.exponential(15, n_reg_samples) + 1
 bedrooms = np.random.poisson(3, n_reg_samples) + 1
 bathrooms = np.random.poisson(2, n_reg_samples) + 1

 # School district rating (affects price non-linearly)
 school_rating = np.random.beta(2, 2, n_reg_samples) * 10 + 1

 # Crime rate (non-linear negative effect)
 crime_rate = np.random.exponential(5, n_reg_samples)

 # Distance to city center
 distance_to_center = np.random.gamma(2, 5, n_reg_samples) + 1

 # Generate prices with non-linear relationships
 base_price = (
 lot_size * 50 + # Linear lot effect
 bedrooms * 25000 + # Linear bedroom effect
 bathrooms * 20000 + # Linear bathroom effect
 np.exp(school_rating / 3) * 15000 + # Exponential school effect
 -crime_rate ** 1.5 * 3000 + # Non-linear crime penalty
 -np.log(distance_to_center + 1) * 20000 + # Log distance effect
 -house_age * 2000 # Linear age depreciation
 )

 # Add noise and floor prices at $100,000
 price_noise = np.random.normal(0, 30000, n_reg_samples)
 house_prices = np.maximum(base_price + price_noise, 100000)

 regression_df = pd.DataFrame({
 'lot_size': lot_size,
 'house_age': house_age,
 'bedrooms': bedrooms,
 'bathrooms': bathrooms,
 'school_rating': school_rating,
 'crime_rate': crime_rate,
 'distance_to_center': distance_to_center,
 'price': house_prices
 })

 return linear_df, nonlinear_df, regression_df

# Generate datasets
print(" Generating SVM-optimized datasets...")
linear_df, nonlinear_df, regression_df = generate_svm_datasets()

print(f"Linear Dataset (Credit Approval): {linear_df.shape}")
print(f"Non-linear Dataset (Customer Segments): {nonlinear_df.shape}")
print(f"Regression Dataset (House Prices): {regression_df.shape}")

print("\nLinear Classification Dataset (Credit Approval):")
print(linear_df.head())
print(f"Approval Rate: {linear_df['approved'].mean():.1%}")

print("\nNon-linear Classification Dataset (Customer Segmentation):")
print(nonlinear_df.head())
print(f"High Value Customers: {(nonlinear_df['value_segment'] == 0).mean():.1%}")

print("\nRegression Dataset (House Prices):")
print(regression_df.head())
print(f"Price Range: ${regression_df['price'].min():,.0f} - ${regression_df['price'].max():,.0f}")
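
The credit labels are produced by thresholding a linear score after adding Gaussian noise, so some samples near the boundary end up on the other side of the noise-free separator. The sketch below estimates how often that happens. It uses a hypothetical standard-normal stand-in for the score (not the notebook's `linear_separator`) and its own seed, so the printed rate only illustrates the effect and is not a property of `linear_df`.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed, independent of the notebook's global seed
n = 1000

# Hypothetical stand-in for a linear score; the real score has a different scale
score = rng.normal(0.0, 1.0, n)
noise = rng.normal(0.0, 0.3, n)  # same noise standard deviation as the credit dataset

clean_label = score > 0            # label without noise
noisy_label = (score + noise) > 0  # label actually assigned

# Fraction of samples whose class changes because of the noise term
flip_rate = np.mean(clean_label != noisy_label)
print(f"Labels flipped by noise: {flip_rate:.1%}")
```

A non-zero flip rate means no hyperplane classifies every sample correctly, which is why a soft-margin SVM with a tuned `C` is the appropriate model here.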

In [None]:
# 1. LINEAR SVM ANALYSIS
print(" 1. LINEAR SVM ANALYSIS")
print("=" * 25)

# Prepare linear classification data
linear_features = ['credit_score', 'debt_to_income', 'annual_income',
 'employment_years', 'previous_defaults', 'credit_utilization']
X_linear = linear_df[linear_features]
y_linear = linear_df['approved']

# Split data
X_linear_train, X_linear_test, y_linear_train, y_linear_test = train_test_split(
 X_linear, y_linear, test_size=0.2, random_state=42, stratify=y_linear
)

# Scale features (important for SVM)
scaler_linear = StandardScaler()
X_linear_train_scaled = scaler_linear.fit_transform(X_linear_train)
X_linear_test_scaled = scaler_linear.transform(X_linear_test)

print(f"Training set: {X_linear_train_scaled.shape}")
print(f"Test set: {X_linear_test_scaled.shape}")
print(f"Class distribution: {y_linear_train.value_counts().to_dict()}")

# Train Linear SVM with different C values
C_values = [0.01, 0.1, 1, 10, 100]
linear_svm_results = {}

print(f"\n C Parameter Optimization:")

for C in C_values:
 # Linear SVM
 svm_linear = SVC(kernel='linear', C=C, random_state=42)
 svm_linear.fit(X_linear_train_scaled, y_linear_train)

 # Predictions
 train_score = svm_linear.score(X_linear_train_scaled, y_linear_train)
 test_score = svm_linear.score(X_linear_test_scaled, y_linear_test)

 # Cross-validation (the scaler was fit on the full training split, so fold scores can be slightly optimistic)
 cv_scores = cross_val_score(svm_linear, X_linear_train_scaled, y_linear_train, cv=5)

 # Support vector count
 n_support = len(svm_linear.support_)

 linear_svm_results[C] = {
 'train_accuracy': train_score,
 'test_accuracy': test_score,
 'cv_mean': cv_scores.mean(),
 'cv_std': cv_scores.std(),
 'n_support_vectors': n_support
 }

 print(f"• C={C}: Test Acc={test_score:.4f}, CV={cv_scores.mean():.4f}±{cv_scores.std():.4f}, SV={n_support}")

# Find optimal C
results_df = pd.DataFrame(linear_svm_results).T
optimal_C = results_df['cv_mean'].idxmax()
print(f"\n Optimal C: {optimal_C}")

# Visualize C parameter effect
fig_c_effect = make_subplots(
 rows=1, cols=2,
 subplot_titles=['Accuracy vs C Parameter', 'Support Vectors vs C Parameter']
)

# Accuracy plot
fig_c_effect.add_trace(
 go.Scatter(
 x=list(C_values),
 y=results_df['train_accuracy'],
 mode='lines+markers',
 name='Training',
 line=dict(color='blue')
 ),
 row=1, col=1
)

fig_c_effect.add_trace(
 go.Scatter(
 x=list(C_values),
 y=results_df['cv_mean'],
 mode='lines+markers',
 name='CV Mean',
 line=dict(color='green'),
 error_y=dict(type='data', array=results_df['cv_std'])
 ),
 row=1, col=1
)

fig_c_effect.add_trace(
 go.Scatter(
 x=list(C_values),
 y=results_df['test_accuracy'],
 mode='lines+markers',
 name='Test',
 line=dict(color='red')
 ),
 row=1, col=1
)

# Support vectors plot
fig_c_effect.add_trace(
 go.Scatter(
 x=list(C_values),
 y=results_df['n_support_vectors'],
 mode='lines+markers',
 name='Support Vectors',
 line=dict(color='purple'),
 showlegend=False
 ),
 row=1, col=2
)

fig_c_effect.update_xaxes(type="log", title_text="C Parameter", row=1, col=1)
fig_c_effect.update_xaxes(type="log", title_text="C Parameter", row=1, col=2)
fig_c_effect.update_yaxes(title_text="Accuracy", row=1, col=1)
fig_c_effect.update_yaxes(title_text="Number of Support Vectors", row=1, col=2)

fig_c_effect.update_layout(
 title="Linear SVM: C Parameter Analysis",
 height=500
)
fig_c_effect.show()

# Train final model with optimal C
svm_linear_final = SVC(kernel='linear', C=optimal_C, random_state=42)
svm_linear_final.fit(X_linear_train_scaled, y_linear_train)

# Final predictions and metrics
y_linear_pred = svm_linear_final.predict(X_linear_test_scaled)
linear_accuracy = accuracy_score(y_linear_test, y_linear_pred)

print(f"\n Final Linear SVM Performance:")
print(f"• C = {optimal_C}")
print(f"• Test Accuracy: {linear_accuracy:.4f}")
print(f"• Support Vectors: {len(svm_linear_final.support_)}/{len(X_linear_train_scaled)} ({len(svm_linear_final.support_)/len(X_linear_train_scaled):.1%})")

print(f"\nClassification Report:")
print(classification_report(y_linear_test, y_linear_pred, target_names=['Rejected', 'Approved']))

# Feature importance analysis (using coefficients)
feature_importance = np.abs(svm_linear_final.coef_[0])
feature_importance_normalized = feature_importance / feature_importance.sum()

importance_df = pd.DataFrame({
 'Feature': linear_features,
 'Importance': feature_importance_normalized
}).sort_values('Importance', ascending=False)

print(f"\nFeature Importance (Linear SVM coefficients):")
for _, row in importance_df.iterrows():
 print(f"• {row['Feature']}: {row['Importance']:.3f}")

# Visualize feature importance
fig_feat_imp = go.Figure()

fig_feat_imp.add_trace(
 go.Bar(
 x=importance_df['Feature'],
 y=importance_df['Importance'],
 marker_color='lightblue',
 hovertemplate="Feature: %{x}<br>Importance: %{y:.3f}<extra></extra>"
 )
)

fig_feat_imp.update_layout(
 title="Linear SVM Feature Importance",
 xaxis_title="Features",
 yaxis_title="Normalized Coefficient Magnitude",
 xaxis_tickangle=-45,
 height=500
)
fig_feat_imp.show()

# Confusion Matrix
cm_linear = confusion_matrix(y_linear_test, y_linear_pred)

fig_cm_linear = ff.create_annotated_heatmap(
 z=cm_linear,
 x=['Rejected', 'Approved'],
 y=['Rejected', 'Approved'],
 annotation_text=cm_linear,
 colorscale='Blues',
 showscale=True
)

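fig_cm_linear.update_layout(
 title=f"Linear SVM Confusion Matrix (C={optimal_C})",
 xaxis_title="Predicted",
 yaxis_title="Actual",
 yaxis_autorange="reversed", # show the first class in the top row, matching the confusion_matrix array
 height=400
)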
fig_cm_linear.show()
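
The C search above cross-validates on features that were standardized once using the whole training split, so each validation fold has already influenced the scaler. A common alternative is to put the scaler and the SVM in a `Pipeline`, which re-fits the scaler inside every fold. The sketch below shows the pattern on illustrative `make_classification` data; the names `X_demo`, `y_demo`, and `svm_pipeline` are assumptions and the data is not the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data only; not the notebook's credit dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)

# The scaler is fit on the training portion of each fold, never on the validation portion
svm_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="linear", C=1.0)),
])

pipeline_cv_scores = cross_val_score(svm_pipeline, X_demo, y_demo, cv=5)
print(f"CV accuracy: {pipeline_cv_scores.mean():.4f} ± {pipeline_cv_scores.std():.4f}")
```

The same pipeline object can be passed to `GridSearchCV`, with parameters addressed as `svm__C` or `svm__gamma`.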

In [None]:
# 2. KERNEL SVM ANALYSIS (NON-LINEAR)
print(" 2. KERNEL SVM ANALYSIS (NON-LINEAR)")
print("=" * 36)

# Prepare non-linear classification data
nonlinear_features = ['customer_spending', 'customer_frequency', 'customer_tenure', 'support_tickets']
X_nonlinear = nonlinear_df[nonlinear_features]
y_nonlinear = nonlinear_df['value_segment']

# Split data
X_nonlinear_train, X_nonlinear_test, y_nonlinear_train, y_nonlinear_test = train_test_split(
 X_nonlinear, y_nonlinear, test_size=0.2, random_state=42, stratify=y_nonlinear
)

# Scale features
scaler_nonlinear = StandardScaler()
X_nonlinear_train_scaled = scaler_nonlinear.fit_transform(X_nonlinear_train)
X_nonlinear_test_scaled = scaler_nonlinear.transform(X_nonlinear_test)

print(f"Training set: {X_nonlinear_train_scaled.shape}")
print(f"Test set: {X_nonlinear_test_scaled.shape}")
print(f"Class distribution: {y_nonlinear_train.value_counts().to_dict()}")

# Test different kernels
kernels = ['linear', 'rbf', 'poly', 'sigmoid']
kernel_results = {}

print(f"\n Kernel Comparison:")

for kernel in kernels:
 if kernel == 'poly':
 svm_kernel = SVC(kernel=kernel, degree=3, C=1.0, random_state=42)
 else:
 svm_kernel = SVC(kernel=kernel, C=1.0, random_state=42)

 # Cross-validation
 cv_scores = cross_val_score(svm_kernel, X_nonlinear_train_scaled, y_nonlinear_train, cv=5)

 # Fit and test
 svm_kernel.fit(X_nonlinear_train_scaled, y_nonlinear_train)
 test_score = svm_kernel.score(X_nonlinear_test_scaled, y_nonlinear_test)
 n_support = len(svm_kernel.support_)

 kernel_results[kernel] = {
 'cv_mean': cv_scores.mean(),
 'cv_std': cv_scores.std(),
 'test_accuracy': test_score,
 'n_support_vectors': n_support
 }

 print(f"• {kernel}: Test Acc={test_score:.4f}, CV={cv_scores.mean():.4f}±{cv_scores.std():.4f}, SV={n_support}")

# Find best kernel
kernel_df = pd.DataFrame(kernel_results).T
best_kernel = kernel_df['cv_mean'].idxmax()
print(f"\n Best kernel: {best_kernel}")

# Visualize kernel comparison
fig_kernels = go.Figure()

fig_kernels.add_trace(
 go.Bar(
 x=list(kernel_results.keys()),
 y=[result['test_accuracy'] for result in kernel_results.values()],
 name='Test Accuracy',
 marker_color='lightcoral',
 hovertemplate="Kernel: %{x}<br>Accuracy: %{y:.4f}<extra></extra>"
 )
)

fig_kernels.update_layout(
 title="Kernel Comparison (Non-linear Dataset)",
 xaxis_title="Kernel Type",
 yaxis_title="Test Accuracy",
 height=500
)
fig_kernels.show()

# Hyperparameter optimization for RBF kernel
print(f"\n RBF Kernel Hyperparameter Optimization:")

# Grid search for C and gamma
param_grid = {
 'C': [0.1, 1, 10, 100],
 'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]
}

svm_rbf = SVC(kernel='rbf', random_state=42)
grid_search = GridSearchCV(
 svm_rbf,
 param_grid,
 cv=5,
 scoring='accuracy',
 n_jobs=-1
)

grid_search.fit(X_nonlinear_train_scaled, y_nonlinear_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# Test best model
best_svm = grid_search.best_estimator_
y_nonlinear_pred = best_svm.predict(X_nonlinear_test_scaled)
nonlinear_accuracy = accuracy_score(y_nonlinear_test, y_nonlinear_pred)

print(f"Test accuracy: {nonlinear_accuracy:.4f}")
print(f"Support vectors: {len(best_svm.support_)}/{len(X_nonlinear_train_scaled)} ({len(best_svm.support_)/len(X_nonlinear_train_scaled):.1%})")

# Visualize hyperparameter grid search results
results = pd.DataFrame(grid_search.cv_results_)

# Create heatmap for C vs gamma
C_values = param_grid['C']
gamma_values = [g for g in param_grid['gamma'] if isinstance(g, (int, float))]

# Filter results for numeric gamma values only
numeric_results = results[results['param_gamma'].isin(gamma_values)]

if len(numeric_results) > 0:
 heatmap_data = numeric_results.pivot_table(
 values='mean_test_score',
 index='param_C',
 columns='param_gamma',
 aggfunc='mean'
 )

 fig_heatmap = go.Figure(data=go.Heatmap(
 z=heatmap_data.values,
 x=[str(g) for g in heatmap_data.columns],
 y=[str(c) for c in heatmap_data.index],
 colorscale='Viridis',
 hovertemplate="C: %{y}<br>Gamma: %{x}<br>CV Score: %{z:.4f}<extra></extra>"
 ))

 fig_heatmap.update_layout(
 title="RBF SVM Hyperparameter Grid Search",
 xaxis_title="Gamma",
 yaxis_title="C",
 height=500
 )
 fig_heatmap.show()

# Classification report for best model
print(f"\nClassification Report (Best RBF SVM):")
print(classification_report(y_nonlinear_test, y_nonlinear_pred, target_names=['High Value', 'Standard']))

# Confusion Matrix
cm_nonlinear = confusion_matrix(y_nonlinear_test, y_nonlinear_pred)

fig_cm_nonlinear = ff.create_annotated_heatmap(
 z=cm_nonlinear,
 x=['High Value', 'Standard'],
 y=['High Value', 'Standard'],
 annotation_text=cm_nonlinear,
 colorscale='Blues',
 showscale=True
)

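fig_cm_nonlinear.update_layout(
 title="RBF SVM Confusion Matrix",
 xaxis_title="Predicted",
 yaxis_title="Actual",
 yaxis_autorange="reversed", # show the first class in the top row, matching the confusion_matrix array
 height=400
)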
fig_cm_nonlinear.show()
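
To see why a non-linear kernel helps on concentric circles, it can be useful to apply an explicit feature map by hand. The sketch below builds two rings with NumPy (uniform radial jitter, which is a simplification of the Gaussian noise used by `make_circles`), adds the squared radius as a feature, and separates the classes with a single threshold. It also evaluates the RBF kernel formula K(x, z) = exp(-gamma * ||x - z||^2) directly. All names here are illustrative and the labels follow the `make_circles` convention (outer = 0, inner = 1).

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
n_per_ring = 200


def sample_ring(radius, n):
    """Sample n points near a circle of the given radius with small uniform radial jitter."""
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    radii = radius + rng.uniform(-0.1, 0.1, n)
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])


outer = sample_ring(1.0, n_per_ring)  # class 0
inner = sample_ring(0.3, n_per_ring)  # class 1
X_rings = np.vstack([outer, inner])
y_rings = np.concatenate([np.zeros(n_per_ring, dtype=int), np.ones(n_per_ring, dtype=int)])

# Explicit feature map: the squared radius x1^2 + x2^2
squared_radius = (X_rings ** 2).sum(axis=1)

# In the lifted space one threshold (a flat boundary) separates the rings
threshold = 0.5
lifted_pred = (squared_radius < threshold).astype(int)
lifted_accuracy = (lifted_pred == y_rings).mean()


def rbf_kernel_value(x, z, gamma):
    """RBF kernel between two points: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))


print(f"Accuracy with squared-radius threshold: {lifted_accuracy:.3f}")
print(f"K(outer[0], inner[0]) = {rbf_kernel_value(outer[0], inner[0], gamma=1.0):.4f}")
```

A kernel SVM never builds such features explicitly; it only needs kernel values between pairs of samples, which is what makes the RBF kernel practical even though its implicit feature space is infinite-dimensional.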

In [None]:
# 3. SUPPORT VECTOR REGRESSION (SVR)
print(" 3. SUPPORT VECTOR REGRESSION (SVR)")
print("=" * 33)

# Prepare regression data
regression_features = ['lot_size', 'house_age', 'bedrooms', 'bathrooms',
 'school_rating', 'crime_rate', 'distance_to_center']
X_reg = regression_df[regression_features]
y_reg = regression_df['price']

# Split data
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
 X_reg, y_reg, test_size=0.2, random_state=42
)

# Scale features and target
scaler_reg = StandardScaler()
X_reg_train_scaled = scaler_reg.fit_transform(X_reg_train)
X_reg_test_scaled = scaler_reg.transform(X_reg_test)

# Scale target for better SVR performance
y_scaler = StandardScaler()
y_reg_train_scaled = y_scaler.fit_transform(y_reg_train.values.reshape(-1, 1)).ravel()
y_reg_test_scaled = y_scaler.transform(y_reg_test.values.reshape(-1, 1)).ravel()

print(f"Training set: {X_reg_train_scaled.shape}")
print(f"Test set: {X_reg_test_scaled.shape}")

# Test different SVR kernels
svr_kernels = ['linear', 'rbf', 'poly']
svr_results = {}

print(f"\n SVR Kernel Comparison:")

for kernel in svr_kernels:
 if kernel == 'poly':
 svr_model = SVR(kernel=kernel, degree=3, C=1.0)
 else:
 svr_model = SVR(kernel=kernel, C=1.0)

 # Fit model
 svr_model.fit(X_reg_train_scaled, y_reg_train_scaled)

 # Predictions (transform back to original scale)
 y_pred_scaled = svr_model.predict(X_reg_test_scaled)
 y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()

 # Metrics
 mse = mean_squared_error(y_reg_test, y_pred)
 r2 = r2_score(y_reg_test, y_pred)
 mae = mean_absolute_error(y_reg_test, y_pred)
 n_support = len(svr_model.support_)

 svr_results[kernel] = {
 'MSE': mse,
 'R2': r2,
 'MAE': mae,
 'RMSE': np.sqrt(mse),
 'n_support_vectors': n_support
 }

 print(f"• {kernel}: R²={r2:.4f}, RMSE=${np.sqrt(mse):,.0f}, SV={n_support}")

# Find best SVR kernel
svr_df = pd.DataFrame(svr_results).T
best_svr_kernel = svr_df['R2'].idxmax()
print(f"\n Best SVR kernel: {best_svr_kernel}")

# Visualize SVR kernel comparison
fig_svr_kernels = make_subplots(
 rows=1, cols=2,
 subplot_titles=['R² Score by Kernel', 'RMSE by Kernel']
)

fig_svr_kernels.add_trace(
 go.Bar(
 x=list(svr_results.keys()),
 y=[result['R2'] for result in svr_results.values()],
 name='R² Score',
 marker_color='lightgreen'
 ),
 row=1, col=1
)

fig_svr_kernels.add_trace(
 go.Bar(
 x=list(svr_results.keys()),
 y=[result['RMSE'] for result in svr_results.values()],
 name='RMSE',
 marker_color='lightcoral',
 showlegend=False
 ),
 row=1, col=2
)

fig_svr_kernels.update_layout(
 title="SVR Kernel Comparison",
 height=500
)

fig_svr_kernels.update_yaxes(title_text="R² Score", row=1, col=1)
fig_svr_kernels.update_yaxes(title_text="RMSE ($)", row=1, col=2)

fig_svr_kernels.show()

# Hyperparameter optimization for best kernel
print(f"\n SVR Hyperparameter Optimization ({best_svr_kernel} kernel):")

# Grid search for SVR
svr_param_grid = {
 'C': [0.1, 1, 10, 100],
 'epsilon': [0.01, 0.1, 0.2, 0.5]
}

if best_svr_kernel == 'rbf':
 svr_param_grid['gamma'] = ['scale', 'auto', 0.001, 0.01, 0.1]

svr_best = SVR(kernel=best_svr_kernel)
svr_grid_search = GridSearchCV(
 svr_best,
 svr_param_grid,
 cv=5,
 scoring='r2',
 n_jobs=-1
)

svr_grid_search.fit(X_reg_train_scaled, y_reg_train_scaled)

print(f"Best parameters: {svr_grid_search.best_params_}")
print(f"Best CV R²: {svr_grid_search.best_score_:.4f}")

# Final SVR model evaluation
best_svr = svr_grid_search.best_estimator_
y_reg_pred_scaled = best_svr.predict(X_reg_test_scaled)
y_reg_pred = y_scaler.inverse_transform(y_reg_pred_scaled.reshape(-1, 1)).ravel()

# Final metrics
final_mse = mean_squared_error(y_reg_test, y_reg_pred)
final_r2 = r2_score(y_reg_test, y_reg_pred)
final_mae = mean_absolute_error(y_reg_test, y_reg_pred)

print(f"\n Final SVR Performance:")
print(f"• Kernel: {best_svr_kernel}")
print(f"• Test R²: {final_r2:.4f}")
print(f"• Test RMSE: ${np.sqrt(final_mse):,.0f}")
print(f"• Test MAE: ${final_mae:,.0f}")
print(f"• Support Vectors: {len(best_svr.support_)}/{len(X_reg_train_scaled)} ({len(best_svr.support_)/len(X_reg_train_scaled):.1%})")

# Actual vs Predicted plot
fig_svr_pred = go.Figure()

fig_svr_pred.add_trace(
 go.Scatter(
 x=y_reg_test,
 y=y_reg_pred,
 mode='markers',
 marker=dict(color='blue', opacity=0.6),
 name='Predictions',
 hovertemplate="Actual: $%{x:,.0f}<br>Predicted: $%{y:,.0f}<extra></extra>"
 )
)

# Perfect prediction line
min_price = min(y_reg_test.min(), y_reg_pred.min())
max_price = max(y_reg_test.max(), y_reg_pred.max())

fig_svr_pred.add_trace(
 go.Scatter(
 x=[min_price, max_price],
 y=[min_price, max_price],
 mode='lines',
 line=dict(color='red', dash='dash'),
 name='Perfect Prediction',
 hovertemplate="Perfect Line<extra></extra>"
 )
)

fig_svr_pred.update_layout(
 title=f"SVR: Actual vs Predicted Prices ({best_svr_kernel} kernel)",
 xaxis_title="Actual Price ($)",
 yaxis_title="Predicted Price ($)",
 height=500
)
fig_svr_pred.show()

# Residuals analysis
residuals = y_reg_test - y_reg_pred

fig_residuals = go.Figure()

fig_residuals.add_trace(
 go.Scatter(
 x=y_reg_pred,
 y=residuals,
 mode='markers',
 marker=dict(color='green', opacity=0.6),
 hovertemplate="Predicted: $%{x:,.0f}<br>Residual: $%{y:,.0f}<extra></extra>"
 )
)

fig_residuals.add_hline(y=0, line_dash="dash", line_color="red")

fig_residuals.update_layout(
 title="SVR Residuals Analysis",
 xaxis_title="Predicted Price ($)",
 yaxis_title="Residuals ($)",
 height=500
)
fig_residuals.show()
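
The `epsilon` values in the SVR grid define a tube around the prediction inside which errors cost nothing. Because this cell fits SVR on the standardized target, an `epsilon` of 0.1 corresponds to 0.1 standard deviations of price rather than 0.1 dollars. The sketch below writes out the epsilon-insensitive loss with NumPy; the function name and demo arrays are assumptions for illustration.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample SVR loss: zero inside the epsilon tube, linear in the excess outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)


# Four predictions for a true value of 0: two inside or on the tube, two outside
y_true_demo = np.array([0.0, 0.0, 0.0, 0.0])
y_pred_demo = np.array([0.05, 0.1, 0.3, -0.5])

losses = epsilon_insensitive_loss(y_true_demo, y_pred_demo, epsilon=0.1)
print(losses)
```

Samples with zero loss strictly inside the tube do not become support vectors, so a larger `epsilon` generally yields a sparser model at the cost of a coarser fit.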

In [None]:
# 4. LEARNING CURVES AND MODEL COMPLEXITY
print(" 4. LEARNING CURVES AND MODEL COMPLEXITY")
print("=" * 41)

# Learning curves for different models
def plot_learning_curves(estimator, X, y, title, cv=5):
 """Compute learning-curve statistics for an estimator (plotting happens in the caller; `title` is unused)"""

 train_sizes = np.linspace(0.1, 1.0, 10)
 train_sizes_abs, train_scores, val_scores = learning_curve(
 estimator, X, y, train_sizes=train_sizes, cv=cv, n_jobs=-1, random_state=42
 )

 train_mean = train_scores.mean(axis=1)
 train_std = train_scores.std(axis=1)
 val_mean = val_scores.mean(axis=1)
 val_std = val_scores.std(axis=1)

 return train_sizes_abs, train_mean, train_std, val_mean, val_std

# Generate learning curves for different SVM configurations
print(" Generating Learning Curves...")

# Linear SVM learning curve
linear_svm_lc = SVC(kernel='linear', C=optimal_C, random_state=42)
train_sizes, lin_train_mean, lin_train_std, lin_val_mean, lin_val_std = plot_learning_curves(
 linear_svm_lc, X_linear_train_scaled, y_linear_train, "Linear SVM"
)

# RBF SVM learning curve
rbf_svm_lc = SVC(kernel='rbf', C=grid_search.best_params_['C'],
 gamma=grid_search.best_params_['gamma'], random_state=42)
_, rbf_train_mean, rbf_train_std, rbf_val_mean, rbf_val_std = plot_learning_curves(
 rbf_svm_lc, X_nonlinear_train_scaled, y_nonlinear_train, "RBF SVM"
)

# Plot learning curves
fig_lc = make_subplots(
 rows=1, cols=2,
 subplot_titles=['Linear SVM Learning Curve', 'RBF SVM Learning Curve']
)

# Linear SVM
fig_lc.add_trace(
 go.Scatter(
 x=train_sizes,
 y=lin_train_mean,
 mode='lines+markers',
 name='Training',
 line=dict(color='blue'),
 error_y=dict(type='data', array=lin_train_std)
 ),
 row=1, col=1
)

fig_lc.add_trace(
 go.Scatter(
 x=train_sizes,
 y=lin_val_mean,
 mode='lines+markers',
 name='Validation',
 line=dict(color='red'),
 error_y=dict(type='data', array=lin_val_std),
 showlegend=False
 ),
 row=1, col=1
)

# RBF SVM
fig_lc.add_trace(
 go.Scatter(
 x=train_sizes,
 y=rbf_train_mean,
 mode='lines+markers',
 name='Training',
 line=dict(color='blue'),
 error_y=dict(type='data', array=rbf_train_std),
 showlegend=False
 ),
 row=1, col=2
)

fig_lc.add_trace(
 go.Scatter(
 x=train_sizes,
 y=rbf_val_mean,
 mode='lines+markers',
 name='Validation',
 line=dict(color='red'),
 error_y=dict(type='data', array=rbf_val_std),
 showlegend=False
 ),
 row=1, col=2
)

fig_lc.update_layout(
 title="SVM Learning Curves",
 height=500
)

fig_lc.update_xaxes(title_text="Training Set Size", row=1, col=1)
fig_lc.update_xaxes(title_text="Training Set Size", row=1, col=2)
fig_lc.update_yaxes(title_text="Accuracy", row=1, col=1)
fig_lc.update_yaxes(title_text="Accuracy", row=1, col=2)

fig_lc.show()

# Validation curves for C parameter
print(f"\n Validation Curves for C Parameter:")

# C parameter validation curve for Linear SVM
C_range = np.logspace(-3, 2, 10)
train_scores_c, val_scores_c = validation_curve(
 SVC(kernel='linear', random_state=42),
 X_linear_train_scaled, y_linear_train,
 param_name='C', param_range=C_range, cv=5, n_jobs=-1
)

train_mean_c = train_scores_c.mean(axis=1)
train_std_c = train_scores_c.std(axis=1)
val_mean_c = val_scores_c.mean(axis=1)
val_std_c = val_scores_c.std(axis=1)

# Gamma parameter validation curve for RBF SVM
gamma_range = np.logspace(-4, 1, 10)
train_scores_g, val_scores_g = validation_curve(
 SVC(kernel='rbf', C=1.0, random_state=42),
 X_nonlinear_train_scaled, y_nonlinear_train,
 param_name='gamma', param_range=gamma_range, cv=5, n_jobs=-1
)

train_mean_g = train_scores_g.mean(axis=1)
train_std_g = train_scores_g.std(axis=1)
val_mean_g = val_scores_g.mean(axis=1)
val_std_g = val_scores_g.std(axis=1)

# Plot validation curves
fig_vc = make_subplots(
 rows=1, cols=2,
 subplot_titles=['C Parameter Validation Curve', 'Gamma Parameter Validation Curve']
)

# C parameter
fig_vc.add_trace(
 go.Scatter(
 x=C_range,
 y=train_mean_c,
 mode='lines+markers',
 name='Training',
 line=dict(color='blue'),
 error_y=dict(type='data', array=train_std_c)
 ),
 row=1, col=1
)

fig_vc.add_trace(
 go.Scatter(
 x=C_range,
 y=val_mean_c,
 mode='lines+markers',
 name='Validation',
 line=dict(color='red'),
 error_y=dict(type='data', array=val_std_c),
 showlegend=False
 ),
 row=1, col=1
)

# Gamma parameter
fig_vc.add_trace(
 go.Scatter(
 x=gamma_range,
 y=train_mean_g,
 mode='lines+markers',
 name='Training',
 line=dict(color='blue'),
 error_y=dict(type='data', array=train_std_g),
 showlegend=False
 ),
 row=1, col=2
)

fig_vc.add_trace(
 go.Scatter(
 x=gamma_range,
 y=val_mean_g,
 mode='lines+markers',
 name='Validation',
 line=dict(color='red'),
 error_y=dict(type='data', array=val_std_g),
 showlegend=False
 ),
 row=1, col=2
)

fig_vc.update_xaxes(type="log", title_text="C Parameter", row=1, col=1)
fig_vc.update_xaxes(type="log", title_text="Gamma Parameter", row=1, col=2)
fig_vc.update_yaxes(title_text="Accuracy", row=1, col=1)
fig_vc.update_yaxes(title_text="Accuracy", row=1, col=2)

fig_vc.update_layout(
 title="SVM Validation Curves",
 height=500
)
fig_vc.show()

# Model complexity analysis
print(f"\n Model Complexity Analysis:")

# Support vector analysis
print(f"Support Vector Statistics:")
print(f"• Linear SVM: {len(svm_linear_final.support_)} support vectors ({len(svm_linear_final.support_)/len(X_linear_train_scaled):.1%} of training data)")
print(f"• RBF SVM: {len(best_svm.support_)} support vectors ({len(best_svm.support_)/len(X_nonlinear_train_scaled):.1%} of training data)")
print(f"• SVR: {len(best_svr.support_)} support vectors ({len(best_svr.support_)/len(X_reg_train_scaled):.1%} of training data)")

# Margin analysis for Linear SVM
margin = 2 / np.sqrt(np.sum(svm_linear_final.coef_ ** 2))
print(f"\nLinear SVM Margin Analysis:")
print(f"• Margin width (2/||w||, in scaled feature space): {margin:.4f}")
print(f"• This is the distance between the two margin hyperplanes; each lies {margin / 2:.4f} (1/||w||) from the decision boundary")

# Training time comparison (each model is timed on its own dataset, so the figures are indicative only)
import time

training_times = {}

# Linear SVM timing
start_time = time.perf_counter()
svm_linear_timing = SVC(kernel='linear', C=optimal_C, random_state=42)
svm_linear_timing.fit(X_linear_train_scaled, y_linear_train)
training_times['Linear SVM'] = time.perf_counter() - start_time

# RBF SVM timing
start_time = time.perf_counter()
svm_rbf_timing = SVC(kernel='rbf', C=1.0, random_state=42)
svm_rbf_timing.fit(X_nonlinear_train_scaled, y_nonlinear_train)
training_times['RBF SVM'] = time.perf_counter() - start_time

print(f"\nTraining Time Comparison:")
for model, time_taken in training_times.items():
 print(f"• {model}: {time_taken:.3f} seconds")
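
The margin formula used in the Linear SVM margin analysis can be checked on a toy problem. The sketch below is illustrative only: the four points, the large `C` (used to approximate a hard margin), and the variable names are assumptions, not part of the notebook's datasets. With the closest opposite-class points at x = -1 and x = +1, the geometric margin width is 2, and the fitted value should land close to that.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: two points per class along the x-axis.
X_toy = np.array([[-2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM (assumption for this demo).
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w_norm = np.linalg.norm(clf.coef_)
margin_width = 2.0 / w_norm   # distance between the two margin hyperplanes
half_margin = 1.0 / w_norm    # distance from each hyperplane to the boundary

# Support vectors should sit on the margin hyperplanes, i.e. |f(x)| is about 1.
sv_scores = clf.decision_function(clf.support_vectors_)

print(f"Margin width: {margin_width:.3f}")
print(f"Half margin:  {half_margin:.3f}")
print(f"Support-vector decision values: {np.round(sv_scores, 3)}")
```

On real, overlapping data with a moderate `C`, some support vectors fall inside the margin, so their decision values will have magnitude below 1.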

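Single-run timings like the ones in the training time comparison are sensitive to system noise. A minimal sketch of a steadier measurement follows, assuming a synthetic dataset from `make_classification` and a hypothetical helper `median_fit_time`; neither is part of the original notebook. Both kernels are fit on the same data so the comparison isolates the kernel choice.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC


def median_fit_time(model, X, y, repeats=5):
    """Return the median wall-clock fit time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        model.fit(X, y)
        timings.append(time.perf_counter() - start)
    return float(np.median(timings))


# Synthetic stand-in data (assumption): same samples for both kernels.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

fit_times = {
    "Linear SVM": median_fit_time(SVC(kernel="linear", C=1.0), X_demo, y_demo),
    "RBF SVM": median_fit_time(SVC(kernel="rbf", C=1.0), X_demo, y_demo),
}

for name, seconds in fit_times.items():
    print(f"• {name}: {seconds:.4f} seconds (median of 5 fits)")
```

The median is used rather than the mean so that one slow run (for example, a cold cache on the first fit) does not dominate the result.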
In [None]:
# 5. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS
print(" 5. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS")
print("=" * 54)

# Decision boundary analysis and interpretability
print(" SVM Decision Boundary Analysis:")

print(f"\n1. MODEL PERFORMANCE SUMMARY:")
print(f" • Linear SVM (Credit Approval): {linear_accuracy:.1%} accuracy")
print(f" • RBF SVM (Customer Segmentation): {nonlinear_accuracy:.1%} accuracy")
print(f" • SVR (House Price Prediction): R² = {final_r2:.3f}")

# Support vector insights
print(f"\n2. SUPPORT VECTOR INSIGHTS:")
linear_sv_pct = len(svm_linear_final.support_) / len(X_linear_train_scaled) * 100
rbf_sv_pct = len(best_svm.support_) / len(X_nonlinear_train_scaled) * 100
svr_sv_pct = len(best_svr.support_) / len(X_reg_train_scaled) * 100

print(f" • Linear SVM uses {linear_sv_pct:.1f}% of training data as support vectors")
print(f" • RBF SVM uses {rbf_sv_pct:.1f}% of training data as support vectors")
print(f" • SVR uses {svr_sv_pct:.1f}% of training data as support vectors")

if linear_sv_pct < 50:
    print(" • Linear model: Relatively few support vectors, suggesting good separation with a clear margin")
else:
    print(" • Linear model: Many support vectors, suggesting overlapping classes; may benefit from feature engineering")

if rbf_sv_pct < 30:
    print(" • RBF model: Efficient non-linear separation")
elif rbf_sv_pct > 70:
    print(" • RBF model: High complexity, consider stronger regularization (lower C or gamma)")
else:
    print(" • RBF model: Balanced complexity for non-linear patterns")

# Feature importance insights
print(f"\n3. FEATURE IMPORTANCE INSIGHTS (Linear SVM):")
top_3_features = importance_df.head(3)
for i, (_, row) in enumerate(top_3_features.iterrows(), 1):
 print(f" • #{i} {row['Feature']}: {row['Importance']:.3f} importance")

most_important = top_3_features.iloc[0]['Feature']
least_important = importance_df.tail(1).iloc[0]['Feature']

print(f" • Focus data quality efforts on: {most_important}")
print(f" • Consider removing: {least_important} (lowest impact)")

# Hyperparameter insights
print(f"\n4. HYPERPARAMETER INSIGHTS:")
print(f" • Optimal Linear SVM C: {optimal_C}")
if optimal_C < 1:
 print(" - Low C suggests high regularization needed")
 print(" - Data may have noise or overlapping classes")
elif optimal_C > 10:
 print(" - High C suggests low regularization needed")
 print(" - Data is well-separated")
else:
 print(" - Moderate C suggests balanced regularization")

print(f" • Optimal RBF SVM parameters: C={grid_search.best_params_['C']}, gamma={grid_search.best_params_['gamma']}")

rbf_c = grid_search.best_params_['C']
rbf_gamma = grid_search.best_params_['gamma']

if isinstance(rbf_gamma, str):
 print(f" - Using automatic gamma scaling")
elif rbf_gamma < 0.1:
 print(f" - Low gamma: smooth decision boundary")
else:
 print(f" - High gamma: complex decision boundary")

# Business application strategies
print(f"\n5. BUSINESS APPLICATION STRATEGIES:")

print(f"\n Credit Approval System (Linear SVM):")
print(f" • Deploy for automated credit decisions")
print(f" • {linear_accuracy:.1%} test accuracy; the share of manual reviews it can replace depends on the confidence threshold chosen")
print(f" • Most important factor: {most_important}")
print(f" • Support vectors represent edge cases for manual review")
print(f" • Recommended: A/B test against current decision rules")

print(f"\n Customer Segmentation (RBF SVM):")
print(f" • {nonlinear_accuracy:.1%} accuracy for customer targeting")
print(f" • Non-linear patterns suggest complex customer behaviors")
print(f" • Use for personalized marketing campaigns")
print(f" • Support vectors identify boundary customers for special attention")

print(f"\n House Price Prediction (SVR):")
print(f" • R² = {final_r2:.3f} explains {final_r2*100:.1f}% of price variance")
print(f" • Average prediction error: ${final_mae:,.0f}")
print(f" • Use for automated property valuation")
print(f" • Support vectors represent unique/complex properties")

# ROI and cost-benefit analysis
print(f"\n6. ROI AND COST-BENEFIT ANALYSIS:")

# Credit approval ROI (all figures below are illustrative assumptions)
credit_volume = 10000 # Annual applications
current_approval_rate = 0.7
baseline_accuracy = 0.7 # Assumed accuracy of the current decision process
manual_review_cost = 50 # per application
automated_cost = 5 # per application

manual_cost = credit_volume * manual_review_cost
automated_cost_total = credit_volume * automated_cost
accuracy_benefit = linear_accuracy - baseline_accuracy

print(f"\n Credit Approval System:")
print(f" • Manual review cost: ${manual_cost:,}/year")
print(f" • Automated system cost: ${automated_cost_total:,}/year")
print(f" • Cost savings: ${manual_cost - automated_cost_total:,}/year")
print(f" • Accuracy change vs. assumed baseline: {accuracy_benefit * 100:+.1f} percentage points")
print(f" • Per-application savings cover the annual automated cost after ~{automated_cost_total/(manual_review_cost-automated_cost):,.0f} applications")

# Customer segmentation ROI (all figures below are illustrative assumptions)
customer_base = 50000
campaign_cost_per_customer = 10 # not used in the revenue estimate below
conversion_rate_improvement = 0.15 # 15 percentage-point improvement
revenue_per_conversion = 100
implementation_cost_per_customer = 2

# Additional conversions multiplied by revenue per conversion
segmentation_revenue = customer_base * conversion_rate_improvement * revenue_per_conversion
implementation_cost = customer_base * implementation_cost_per_customer

print(f"\n Customer Segmentation:")
print(f" • Improved targeting on {customer_base:,} customers")
print(f" • Expected conversion improvement: +{conversion_rate_improvement:.1%}")
print(f" • Additional annual revenue: ${segmentation_revenue:,.0f}")
print(f" • Revenue-to-cost ratio: {segmentation_revenue / implementation_cost:.1f}x (assuming ${implementation_cost_per_customer}/customer implementation cost)")

# Implementation recommendations
print(f"\n7. IMPLEMENTATION RECOMMENDATIONS:")

print(f"\n Technical Implementation:")
print(f" • Use scikit-learn Pipeline for preprocessing consistency")
print(f" • Implement model versioning and A/B testing framework")
print(f" • Monitor support vector count for model drift detection")
print(f" • Set up automated retraining when performance degrades")
print(f" • Consider approximate methods for large-scale deployment")

print(f"\n Monitoring and Maintenance:")
print(f" • Track prediction confidence and flag low-confidence cases")
print(f" • Monitor support vector characteristics over time")
print(f" • Retrain when support vector percentage changes significantly")
print(f" • Validate model assumptions quarterly")

print(f"\n Risk Management:")
print(f" • Implement prediction explanation for regulatory compliance")
print(f" • Set confidence thresholds for automated decisions")
print(f" • Maintain human oversight for edge cases")
print(f" • Regular bias audits on decision boundaries")

print(f"\n8. NEXT STEPS AND ADVANCED TECHNIQUES:")
print(f" • Experiment with ensemble methods combining multiple kernels")
print(f" • Investigate feature engineering for better linear separability")
print(f" • Consider online/incremental SVM for streaming data")
print(f" • Explore kernel customization for domain-specific problems")
print(f" • Implement SHAP values for better model interpretability")

print(f"\n" + "="*80)
print(f" SVM LEARNING SUMMARY:")
print(f" Mastered linear and non-linear SVM classification")
print(f" Optimized hyperparameters using grid search and cross-validation")
print(f" Applied SVR for regression problems with kernel tricks")
print(f" Analyzed support vectors and decision boundary characteristics")
print(f" Understood model complexity trade-offs and performance curves")
print(f" Generated comprehensive business insights and ROI analysis")
print("=" * 80)
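
Two of the implementation recommendations above, using a scikit-learn Pipeline for preprocessing consistency and flagging low-confidence predictions, can be sketched together. This is a minimal illustration under stated assumptions: the data comes from `make_classification`, and `REVIEW_THRESHOLD` is a hypothetical cutoff on the absolute decision value that would need tuning against real review capacity and error costs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not one of the notebook's datasets).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on training data only and reuses it at
# prediction time, so preprocessing stays consistent between the two.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
pipe.fit(X_tr, y_tr)

# Signed decision values: small magnitude means the sample is near the boundary.
scores = pipe.decision_function(X_te)

REVIEW_THRESHOLD = 0.5  # hypothetical cutoff; tune on validation data
needs_review = np.abs(scores) < REVIEW_THRESHOLD
predictions = pipe.predict(X_te)

print(f"Test accuracy (all cases): {pipe.score(X_te, y_te):.1%}")
print(f"Flagged for manual review: {needs_review.mean():.1%}")
if (~needs_review).any():
    auto_accuracy = (predictions[~needs_review] == y_te[~needs_review]).mean()
    print(f"Accuracy on automatically decided cases: {auto_accuracy:.1%}")
```

Raising the threshold routes more cases to manual review and typically raises accuracy on the remaining automated decisions, which is the trade-off the risk management recommendations describe.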