# Tier 5: Support Vector Machine Classification

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** e874fd9d-091e-4438-9a00-597b6ec197a2

---

## Citation
Brandon Deloatch, "Tier 5: Support Vector Machine Classification," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Pin exact versions in a requirements.txt or environment.yml file to reproduce this environment.
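
One way to act on the note above is to compare the installed packages against the minimums listed in this section. The sketch below is not part of the original notebook: the helper names `parse_major_minor` and `check_minimum_versions` and the `MINIMUMS` mapping are illustrative assumptions, and the comparison only looks at major and minor version numbers.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above (PyPI distribution names).
MINIMUMS = {
    "pandas": (2, 0),
    "numpy": (1, 24),
    "scikit-learn": (1, 3),
    "plotly": (5, 0),
}

def parse_major_minor(version):
    """Return (major, minor) from a version string such as '1.24.3' or '2.0rc1'."""
    parts = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)

def check_minimum_versions(minimums=MINIMUMS):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        status = "ok" if parse_major_minor(installed) >= minimum else "too old"
        report[name] = f"{status} ({installed})"
    return report

for package, status in check_minimum_versions().items():
    print(f"{package}: {status}")
```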

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** e874fd9d-091e-4438-9a00-597b6ec197a2
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
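
As a rough illustration of what capturing execution metadata could look like, the sketch below uses only the Python standard library. The function name `capture_execution_metadata` and the choice of fields are assumptions made for this example; the notebook itself does not define a tracking mechanism.

```python
import json
import platform
import sys
from datetime import datetime, timezone

def capture_execution_metadata(notebook_id):
    """Collect basic run information as a JSON-serializable dict."""
    return {
        "notebook_id": notebook_id,
        "executed_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }

run_info = capture_execution_metadata("e874fd9d-091e-4438-9a00-597b6ec197a2")
print(json.dumps(run_info, indent=2))
```

Writing `run_info` to a file next to the notebook outputs would give each run a small, comparable record.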

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, validation_curve
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.datasets import make_classification, make_circles, make_moons
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

print(" Tier 5: Support Vector Machine Classification - Libraries Loaded!")
print("="*68)
print("Support Vector Machine Techniques:")
print("• Linear SVM (Support Vector Classifier)")
print("• RBF Kernel SVM (Radial Basis Function)")
print("• Polynomial Kernel SVM")
print("• Sigmoid Kernel SVM")
print("• Linear SVC (Large scale linear classification)")
print("• Hyperparameter optimization (C, gamma, kernel parameters)")
print("• Decision boundary visualization and analysis")
print("• High-dimensional sparse data classification")

In [None]:
# Generate comprehensive SVM datasets
np.random.seed(42)

# 1. Text classification dataset (high-dimensional sparse)
def generate_text_dataset(n_samples=3000):
    """Generate realistic text classification dataset."""

    categories = ['Technology', 'Healthcare', 'Finance', 'Education', 'Sports']

    # Technology keywords
    tech_words = ['algorithm', 'software', 'computer', 'data', 'programming', 'AI', 'machine learning',
                  'cloud', 'database', 'network', 'security', 'innovation', 'digital', 'technology']

    # Healthcare keywords
    health_words = ['patient', 'medical', 'treatment', 'diagnosis', 'hospital', 'doctor', 'surgery',
                    'medicine', 'health', 'disease', 'therapy', 'clinical', 'research', 'pharmaceutical']

    # Finance keywords
    finance_words = ['investment', 'bank', 'market', 'stock', 'profit', 'revenue', 'financial',
                     'economy', 'trading', 'currency', 'portfolio', 'asset', 'risk', 'capital']

    # Education keywords
    education_words = ['student', 'school', 'learning', 'education', 'teacher', 'university', 'course',
                       'study', 'academic', 'research', 'knowledge', 'curriculum', 'degree', 'classroom']

    # Sports keywords
    sports_words = ['game', 'player', 'team', 'sport', 'competition', 'championship', 'training',
                    'coach', 'athlete', 'performance', 'fitness', 'exercise', 'match', 'score']

    keyword_sets = [tech_words, health_words, finance_words, education_words, sports_words]

    data = []

    for i in range(n_samples):
        # Choose category
        category_idx = np.random.choice(5, p=[0.25, 0.20, 0.20, 0.20, 0.15])
        category = categories[category_idx]

        # Generate text document
        doc_length = np.random.randint(50, 200)  # Document length in words

        # Primary keywords (from chosen category)
        primary_keywords = np.random.choice(keyword_sets[category_idx],
                                            size=min(8, len(keyword_sets[category_idx])),
                                            replace=False).tolist()

        # Add some noise keywords from other categories
        noise_keywords = []
        for other_idx in range(5):
            if other_idx != category_idx and np.random.random() < 0.3:
                noise_keywords.extend(np.random.choice(keyword_sets[other_idx],
                                                       size=np.random.randint(0, 3)).tolist())

        # Common words
        common_words = ['the', 'and', 'to', 'of', 'in', 'is', 'for', 'with', 'on', 'as',
                        'this', 'that', 'can', 'will', 'from', 'by', 'at', 'an', 'are', 'be']

        # Build document
        all_keywords = primary_keywords + noise_keywords + common_words
        document = ' '.join(np.random.choice(all_keywords, size=doc_length, replace=True))

        # Add some numerical features (document stats)
        data.append({
            'document_id': f'DOC_{i:06d}',
            'text': document,
            'word_count': doc_length,
            'unique_words': len(set(document.split())),
            'avg_word_length': np.mean([len(word) for word in document.split()]),
            'category': category,
            'category_code': category_idx
        })

    return pd.DataFrame(data)

# 2. Image-like high-dimensional dataset
def generate_image_dataset(n_samples=2500):
    """Generate high-dimensional dataset simulating image features."""

    object_types = ['Circle', 'Square', 'Triangle']
    data = []

    for i in range(n_samples):
        # Choose object type
        object_idx = np.random.choice(3, p=[0.4, 0.35, 0.25])
        object_type = object_types[object_idx]

        # Generate features based on object type
        if object_type == 'Circle':
            # Circular patterns - higher values in center, lower at edges
            center_features = np.random.normal(0.8, 0.1, 25)   # Center pixels
            edge_features = np.random.normal(0.2, 0.05, 75)    # Edge pixels
            texture_features = np.random.beta(2, 5, 50)        # Smooth texture

        elif object_type == 'Square':
            # Square patterns - uniform intensity in regions
            center_features = np.random.normal(0.6, 0.15, 25)
            edge_features = np.random.normal(0.7, 0.1, 75)     # Sharp edges
            texture_features = np.random.beta(5, 2, 50)        # Angular texture

        else:  # Triangle
            # Triangular patterns - gradient features
            center_features = np.random.normal(0.5, 0.2, 25)
            edge_features = np.random.normal(0.4, 0.15, 75)
            texture_features = np.random.beta(3, 3, 50)        # Mixed texture

        # Combine all features
        pixel_features = np.concatenate([center_features, edge_features])
        all_features = np.concatenate([pixel_features, texture_features])

        # Add noise
        noise = np.random.normal(0, 0.05, len(all_features))
        final_features = np.clip(all_features + noise, 0, 1)

        # Create feature dictionary
        feature_dict = {f'feature_{j:03d}': final_features[j] for j in range(len(final_features))}
        feature_dict.update({
            'image_id': f'IMG_{i:06d}',
            'brightness': np.mean(final_features),
            'contrast': np.std(final_features),
            'symmetry': 1 - np.abs(np.mean(final_features[:50]) - np.mean(final_features[50:])),
            'object_type': object_type,
            'type_code': object_idx
        })

        data.append(feature_dict)

    return pd.DataFrame(data)

# 3. Non-linear separable dataset
def generate_nonlinear_dataset():
 """Generate non-linearly separable datasets for kernel demonstration."""

 # Circles dataset
 X_circles, y_circles = make_circles(n_samples=1000, factor=0.3, noise=0.1, random_state=42)

 # Moons dataset
 X_moons, y_moons = make_moons(n_samples=1000, noise=0.1, random_state=42)

 # Complex multi-class dataset
 X_complex, y_complex = make_classification(
 n_samples=1500, n_features=20, n_informative=15,
 n_redundant=5, n_classes=4, n_clusters_per_class=2,
 class_sep=0.8, random_state=42
 )

 return (X_circles, y_circles), (X_moons, y_moons), (X_complex, y_complex)

# Generate datasets
text_df = generate_text_dataset()
image_df = generate_image_dataset()
(X_circles, y_circles), (X_moons, y_moons), (X_complex, y_complex) = generate_nonlinear_dataset()

print(" Support Vector Machine Datasets Created:")
print(f"Text classification: {text_df.shape}")
print(f"Category distribution: {text_df['category'].value_counts().to_dict()}")
print(f"\nImage classification: {image_df.shape}")
print(f"Object type distribution: {image_df['object_type'].value_counts().to_dict()}")
print(f"\nNon-linear datasets:")
print(f"Circles: {X_circles.shape}, Moons: {X_moons.shape}, Complex: {X_complex.shape}")

# Show some text examples
print(f"\nSample text documents:")
for category in ['Technology', 'Healthcare', 'Finance']:
 sample = text_df[text_df['category'] == category].iloc[0]
 text_preview = ' '.join(sample['text'].split()[:15]) + '...'
 print(f"• {category}: {text_preview}")
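
The generators in this cell draw from NumPy's global random state seeded by `np.random.seed(42)`, so their output depends on the order in which they are called. A possible alternative, sketched here under the assumption that each generator should own its random stream, is to create a `numpy.random.Generator` from a seed. The function `sample_category_codes` is hypothetical and only mirrors the category-sampling step.

```python
import numpy as np

def sample_category_codes(n_samples, probabilities, seed=42):
    """Draw category indices from a dedicated, seeded random generator."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probabilities), size=n_samples, p=probabilities)

# Same class proportions as generate_text_dataset
probs = [0.25, 0.20, 0.20, 0.20, 0.15]
codes_a = sample_category_codes(1000, probs)
codes_b = sample_category_codes(1000, probs)

# Identical seeds give identical draws, regardless of other random calls in between
print(np.array_equal(codes_a, codes_b))
print(np.bincount(codes_a, minlength=len(probs)) / len(codes_a))
```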

In [None]:
# 1. KERNEL COMPARISON AND HYPERPARAMETER OPTIMIZATION
print(" 1. KERNEL COMPARISON AND HYPERPARAMETER OPTIMIZATION")
print("="*56)

# Text classification with TF-IDF
print("Text Classification with Different SVM Kernels:")

# Prepare text data
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X_text_tfidf = vectorizer.fit_transform(text_df['text']).toarray()
y_text = text_df['category_code'].values

# Add numerical features
numerical_features = text_df[['word_count', 'unique_words', 'avg_word_length']].values
scaler_text = StandardScaler()
numerical_features_scaled = scaler_text.fit_transform(numerical_features)

# Combine TF-IDF and numerical features
X_text_combined = np.hstack([X_text_tfidf, numerical_features_scaled])

# Split data
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(
 X_text_combined, y_text, test_size=0.2, random_state=42, stratify=y_text
)

# Define SVM kernels to compare
svm_kernels = {
 'Linear': SVC(kernel='linear', random_state=42),
 'RBF': SVC(kernel='rbf', random_state=42),
 'Polynomial': SVC(kernel='poly', degree=3, random_state=42),
 'Sigmoid': SVC(kernel='sigmoid', random_state=42)
}

# Compare kernels on text data
text_results = {}
print("\nKernel performance on text classification:")

for kernel_name, svm_model in svm_kernels.items():
 print(f"Training {kernel_name} SVM...")

 # Train model
 svm_model.fit(X_text_train, y_text_train)

 # Evaluate
 train_score = svm_model.score(X_text_train, y_text_train)
 test_score = svm_model.score(X_text_test, y_text_test)
 cv_scores = cross_val_score(svm_model, X_text_train, y_text_train, cv=3)

 text_results[kernel_name] = {
 'train_score': train_score,
 'test_score': test_score,
 'cv_mean': cv_scores.mean(),
 'cv_std': cv_scores.std(),
 'model': svm_model
 }

 print(f" Train: {train_score:.3f}, Test: {test_score:.3f}, CV: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Hyperparameter optimization for best kernel
# Select on cross-validation accuracy so the test split remains an unbiased check
best_text_kernel = max(text_results.keys(), key=lambda x: text_results[x]['cv_mean'])
print(f"\nBest kernel for text: {best_text_kernel}")
print(f"Optimizing hyperparameters for {best_text_kernel} kernel...")

if best_text_kernel == 'Linear':
 param_grid = {'C': [0.1, 1, 10, 100]}
 base_model = SVC(kernel='linear', random_state=42)
elif best_text_kernel == 'RBF':
 param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]}
 base_model = SVC(kernel='rbf', random_state=42)
elif best_text_kernel == 'Polynomial':
 param_grid = {'C': [0.1, 1, 10], 'degree': [2, 3, 4], 'gamma': ['scale', 'auto']}
 base_model = SVC(kernel='poly', random_state=42)
else: # Sigmoid
 param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto', 0.01, 0.1]}
 base_model = SVC(kernel='sigmoid', random_state=42)

# Grid search
grid_search_text = GridSearchCV(base_model, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search_text.fit(X_text_train, y_text_train)

optimized_text_score = grid_search_text.score(X_text_test, y_text_test)
print(f"Best parameters: {grid_search_text.best_params_}")
print(f"Optimized accuracy: {optimized_text_score:.3f}")
print(f"Improvement: {optimized_text_score - text_results[best_text_kernel]['test_score']:+.3f}")
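
In this cell the TF-IDF vocabulary and the scaler are fitted on all documents before the train/test split, and the sparse TF-IDF matrix is converted to a dense array. A minimal sketch of an alternative, assuming scikit-learn's `ColumnTransformer` (not imported in the original notebook), keeps preprocessing inside a pipeline so that it is refitted on each training fold only. The `toy_df` frame is a small hypothetical stand-in for `text_df`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data, only to make the sketch runnable on its own
tech = ["software data cloud network",
        "algorithm database security software innovation",
        "computer programming digital technology data cloud"]
health = ["patient doctor hospital surgery",
          "medical treatment clinical therapy diagnosis",
          "medicine health disease pharmaceutical patient doctor"]
toy_df = pd.DataFrame({
    "text": (tech + health) * 8,
    "category_code": ([0] * 3 + [1] * 3) * 8,
})
toy_df["word_count"] = toy_df["text"].str.split().str.len()

preprocess = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),          # scalar column name: 1-D text input
    ("stats", StandardScaler(), ["word_count"]),   # list of columns: 2-D numeric input
])
model = Pipeline([
    ("features", preprocess),
    ("svm", SVC(kernel="linear")),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy_df[["text", "word_count"]], toy_df["category_code"],
    test_size=0.25, random_state=42, stratify=toy_df["category_code"],
)

# Vectorizer and scaler are refitted inside every cross-validation fold
search = GridSearchCV(model, {"svm__C": [0.1, 1, 10]}, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Held-out accuracy: {search.score(X_test, y_test):.3f}")
```

The same pattern should carry over to `text_df` by passing its `text`, `word_count`, `unique_words`, and `avg_word_length` columns instead of the toy frame.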

In [None]:
# 2. HIGH-DIMENSIONAL IMAGE CLASSIFICATION
print(" 2. HIGH-DIMENSIONAL IMAGE CLASSIFICATION")
print("="*42)

# Prepare image data
image_feature_cols = [col for col in image_df.columns
 if col.startswith('feature_') or col in ['brightness', 'contrast', 'symmetry']]
X_image = image_df[image_feature_cols].values
y_image = image_df['type_code'].values

# Scale features
scaler_image = StandardScaler()
X_image_scaled = scaler_image.fit_transform(X_image)

# Split data
X_img_train, X_img_test, y_img_train, y_img_test = train_test_split(
 X_image_scaled, y_image, test_size=0.2, random_state=42, stratify=y_image
)

print(f"Image dataset: {X_image_scaled.shape[0]} samples, {X_image_scaled.shape[1]} features")

# Compare SVM performance on high-dimensional data
from sklearn.base import clone

image_results = {}
print("\nKernel performance on image classification:")

for kernel_name, svm_template in svm_kernels.items():
    print(f"Training {kernel_name} SVM on image data...")

    # Fit a fresh copy so the text models stored in text_results are not overwritten
    svm_model = clone(svm_template)

    # RBF and polynomial kernels are expensive, so train them on a random subset
    if kernel_name in ['RBF', 'Polynomial'] and X_img_train.shape[0] > 1000:
        subset_indices = np.random.choice(len(X_img_train), 1000, replace=False)
        X_train_subset = X_img_train[subset_indices]
        y_train_subset = y_img_train[subset_indices]
    else:
        X_train_subset = X_img_train
        y_train_subset = y_img_train

    # Train model
    svm_model.fit(X_train_subset, y_train_subset)

    # Evaluate
    train_score = svm_model.score(X_train_subset, y_train_subset)
    test_score = svm_model.score(X_img_test, y_img_test)

    image_results[kernel_name] = {
        'train_score': train_score,
        'test_score': test_score,
        'model': svm_model,
        'training_size': len(X_train_subset)
    }

    print(f" Train: {train_score:.3f}, Test: {test_score:.3f} (trained on {len(X_train_subset)} samples)")

# C parameter analysis for Linear SVM (most suitable for high-dimensional)
print(f"\nC Parameter Analysis for Linear SVM:")
C_values = [0.01, 0.1, 1, 10, 100, 1000]
C_train_scores = []
C_test_scores = []

for C in C_values:
 linear_svm_c = SVC(kernel='linear', C=C, random_state=42)
 linear_svm_c.fit(X_img_train, y_img_train)

 train_score = linear_svm_c.score(X_img_train, y_img_train)
 test_score = linear_svm_c.score(X_img_test, y_img_test)

 C_train_scores.append(train_score)
 C_test_scores.append(test_score)

 print(f"C={C:6.2f}: Train={train_score:.3f}, Test={test_score:.3f}")

best_C_idx = np.argmax(C_test_scores)
best_C = C_values[best_C_idx]
print(f"Best C value: {best_C} (Test accuracy: {C_test_scores[best_C_idx]:.3f})")

# Linear SVC for large-scale classification
print(f"\nLinear SVC (optimized for large datasets):")
linear_svc = LinearSVC(C=best_C, random_state=42, max_iter=2000)
linear_svc.fit(X_img_train, y_img_train)

svc_train_score = linear_svc.score(X_img_train, y_img_train)
svc_test_score = linear_svc.score(X_img_test, y_img_test)

print(f"LinearSVC - Train: {svc_train_score:.3f}, Test: {svc_test_score:.3f}")
print(f"Support vectors per class (Linear SVM): {image_results['Linear']['model'].n_support_}")
print(f"Total support vectors: {sum(image_results['Linear']['model'].n_support_)}")
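
The C sweep in this cell scores every candidate on the test split, so the reported best accuracy is optimistic. A sketch of one alternative, using the `validation_curve` helper that the notebook imports but does not call, picks C from cross-validation folds of the training data instead. The synthetic `X_demo`/`y_demo` arrays are a hypothetical stand-in for the image training split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for X_img_train / y_img_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=10, n_classes=3,
    n_clusters_per_class=1, random_state=42,
)

C_grid = [0.01, 0.1, 1, 10]
train_scores, val_scores = validation_curve(
    make_pipeline(StandardScaler(), SVC(kernel="linear")),
    X_demo, y_demo,
    param_name="svc__C",      # make_pipeline names the SVC step "svc"
    param_range=C_grid,
    cv=3, scoring="accuracy",
)

# Rows correspond to C values, columns to cross-validation folds
mean_val = val_scores.mean(axis=1)
best_C_cv = C_grid[int(np.argmax(mean_val))]

for C, tr, va in zip(C_grid, train_scores.mean(axis=1), mean_val):
    print(f"C={C:6.2f}: train={tr:.3f}, validation={va:.3f}")
print(f"C selected by cross-validation: {best_C_cv}")
```

The test split can then be scored once, with the selected C, as a final check.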

In [None]:
# 3. NON-LINEAR KERNEL ANALYSIS AND DECISION BOUNDARIES
print(" 3. NON-LINEAR KERNEL ANALYSIS AND DECISION BOUNDARIES")
print("="*58)

# Analyze non-linear datasets
nonlinear_datasets = {
 'Circles': (X_circles, y_circles),
 'Moons': (X_moons, y_moons)
}

nonlinear_results = {}

for dataset_name, (X_data, y_data) in nonlinear_datasets.items():
    print(f"\n{dataset_name} Dataset Analysis:")

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.2, random_state=42, stratify=y_data
    )

    # Test different kernels
    kernel_performance = {}

    for kernel_name in ['Linear', 'RBF', 'Polynomial']:
        if kernel_name == 'Linear':
            svm = SVC(kernel='linear', C=1.0, random_state=42)
        elif kernel_name == 'RBF':
            svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
        else:  # Polynomial
            svm = SVC(kernel='poly', degree=3, C=1.0, gamma='scale', random_state=42)

        svm.fit(X_train, y_train)
        test_score = svm.score(X_test, y_test)

        kernel_performance[kernel_name] = {
            'score': test_score,
            'model': svm,
            'n_support': sum(svm.n_support_) if hasattr(svm, 'n_support_') else 0
        }

        print(f" {kernel_name}: {test_score:.3f} (Support vectors: {sum(svm.n_support_)})")

    nonlinear_results[dataset_name] = kernel_performance

# RBF kernel gamma analysis
print(f"\nRBF Kernel Gamma Analysis on Circles Dataset:")
gamma_values = [0.001, 0.01, 0.1, 1, 10, 100]
gamma_scores = []
gamma_support_vectors = []

X_circles_train, X_circles_test, y_circles_train, y_circles_test = train_test_split(
 X_circles, y_circles, test_size=0.2, random_state=42, stratify=y_circles
)

for gamma in gamma_values:
 rbf_svm = SVC(kernel='rbf', C=1.0, gamma=gamma, random_state=42)
 rbf_svm.fit(X_circles_train, y_circles_train)

 test_score = rbf_svm.score(X_circles_test, y_circles_test)
 n_sv = sum(rbf_svm.n_support_)

 gamma_scores.append(test_score)
 gamma_support_vectors.append(n_sv)

 print(f"Gamma={gamma:6.3f}: Accuracy={test_score:.3f}, Support Vectors={n_sv}")

best_gamma_idx = np.argmax(gamma_scores)
best_gamma = gamma_values[best_gamma_idx]
print(f"Best gamma: {best_gamma} (Accuracy: {gamma_scores[best_gamma_idx]:.3f})")

# Complex multi-class analysis
print(f"\nComplex Multi-class Dataset (4 classes, 20 features):")

# Scale the complex dataset
scaler_complex = StandardScaler()
X_complex_scaled = scaler_complex.fit_transform(X_complex)

X_comp_train, X_comp_test, y_comp_train, y_comp_test = train_test_split(
 X_complex_scaled, y_complex, test_size=0.2, random_state=42, stratify=y_complex
)

# Multi-class SVM comparison
multiclass_kernels = ['Linear', 'RBF', 'Polynomial']
multiclass_results = {}

for kernel_name in multiclass_kernels:
    if kernel_name == 'Linear':
        svm = SVC(kernel='linear', C=1.0, random_state=42)
    elif kernel_name == 'RBF':
        # best_gamma was tuned on the 2-feature Circles data and does not transfer
        # to 20 standardized features, so use the dimension-aware 'scale' heuristic
        svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
    else:  # Polynomial
        svm = SVC(kernel='poly', degree=3, C=1.0, gamma='scale', random_state=42)

    svm.fit(X_comp_train, y_comp_train)

    train_score = svm.score(X_comp_train, y_comp_train)
    test_score = svm.score(X_comp_test, y_comp_test)

    # Keep predictions for the per-class report below
    y_pred = svm.predict(X_comp_test)

    multiclass_results[kernel_name] = {
        'train_score': train_score,
        'test_score': test_score,
        'model': svm,
        'y_pred': y_pred
    }

    print(f"{kernel_name}: Train={train_score:.3f}, Test={test_score:.3f}")

# Best performing model detailed analysis
best_multiclass_kernel = max(multiclass_results.keys(),
 key=lambda x: multiclass_results[x]['test_score'])
best_multiclass_model = multiclass_results[best_multiclass_kernel]['model']

print(f"\nBest multi-class kernel: {best_multiclass_kernel}")
print(f"Classification report:")
y_pred_best = multiclass_results[best_multiclass_kernel]['y_pred']
print(classification_report(y_comp_test, y_pred_best,
 target_names=[f'Class_{i}' for i in range(4)]))
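
A gamma value tuned on two features does not automatically suit a 20-feature problem, because squared distances grow with the number of features. The NumPy sketch below illustrates this with the RBF kernel $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$ and with the `gamma='scale'` heuristic, which scikit-learn documents as `1 / (n_features * X.var())`. The helper names and the random standard-normal data are assumptions made for this example.

```python
import numpy as np

def rbf_kernel_matrix(X, Y, gamma):
    """K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def gamma_scale(X):
    """The gamma='scale' heuristic: 1 / (n_features * variance of all entries)."""
    return 1.0 / (X.shape[1] * X.var())

def mean_off_diagonal(K):
    """Average similarity between distinct points."""
    return K[~np.eye(len(K), dtype=bool)].mean()

rng = np.random.default_rng(0)
X_2d = rng.normal(size=(200, 2))     # stands in for a 2-feature dataset
X_20d = rng.normal(size=(200, 20))   # stands in for 20 standardized features

for name, X in [("2 features", X_2d), ("20 features", X_20d)]:
    fixed = mean_off_diagonal(rbf_kernel_matrix(X, X, gamma=1.0))
    scaled = mean_off_diagonal(rbf_kernel_matrix(X, X, gamma=gamma_scale(X)))
    print(f"{name}: gamma_scale={gamma_scale(X):.3f}, "
          f"mean K (gamma=1)={fixed:.4f}, mean K (gamma='scale')={scaled:.4f}")
```

With a fixed gamma of 1, pairwise similarities in 20 dimensions collapse toward zero, so every training point looks isolated and the model tends to memorize the training set.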

In [None]:
# 4. COMPREHENSIVE SVM VISUALIZATION DASHBOARD
print(" 4. COMPREHENSIVE SVM VISUALIZATION DASHBOARD")
print("="*49)

# Create comprehensive dashboard
fig = make_subplots(
 rows=3, cols=2,
 subplot_titles=[
 'Kernel Performance: Text Classification',
 'C Parameter Analysis: High-Dimensional Data',
 'Gamma Parameter Analysis: RBF Kernel',
 'Decision Boundaries: Non-linear Data',
 'Multi-class Performance Comparison',
 'Support Vector Analysis'
 ],
 specs=[[{"secondary_y": False}, {"secondary_y": False}],
 [{"secondary_y": False}, {"secondary_y": False}],
 [{"secondary_y": False}, {"secondary_y": False}]]
)

# 1. Text classification kernel comparison
text_kernels = list(text_results.keys())
text_scores = [text_results[k]['test_score'] for k in text_kernels]
text_cv_scores = [text_results[k]['cv_mean'] for k in text_kernels]

fig.add_trace(
 go.Bar(
 x=text_kernels,
 y=text_scores,
 name='Test Accuracy',
 marker_color='lightblue',
 text=[f'{score:.3f}' for score in text_scores],
 textposition='auto'
 ),
 row=1, col=1
)

fig.add_trace(
 go.Bar(
 x=text_kernels,
 y=text_cv_scores,
 name='CV Mean',
 marker_color='lightcoral',
 text=[f'{score:.3f}' for score in text_cv_scores],
 textposition='auto'
 ),
 row=1, col=1
)

# 2. C parameter analysis
fig.add_trace(
 go.Scatter(
 x=C_values,
 y=C_train_scores,
 mode='lines+markers',
 name='Training Score',
 line=dict(color='blue')
 ),
 row=1, col=2
)

fig.add_trace(
 go.Scatter(
 x=C_values,
 y=C_test_scores,
 mode='lines+markers',
 name='Test Score',
 line=dict(color='red')
 ),
 row=1, col=2
)

# 3. Gamma analysis
fig.add_trace(
 go.Scatter(
 x=gamma_values,
 y=gamma_scores,
 mode='lines+markers',
 name='RBF Accuracy',
 line=dict(color='green')
 ),
 row=2, col=1
)

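# Support vectors as a fraction of the training set, so the trace shares the
# 0-1 accuracy axis (this subplot is not configured with a secondary y-axis)
fig.add_trace(
    go.Scatter(
        x=gamma_values,
        y=[n_sv / len(X_circles_train) for n_sv in gamma_support_vectors],
        mode='lines+markers',
        name='Support Vector Fraction',
        line=dict(color='purple', dash='dash')
    ),
    row=2, col=1
)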

# 4. Non-linear dataset visualization (circles)
# Plot the circles dataset by class (scatter only; no boundary is drawn here)
colors = ['red', 'blue']
for class_idx in [0, 1]:
 mask = y_circles == class_idx
 fig.add_trace(
 go.Scatter(
 x=X_circles[mask, 0],
 y=X_circles[mask, 1],
 mode='markers',
 name=f'Class {class_idx}',
 marker=dict(color=colors[class_idx], size=4)
 ),
 row=2, col=2
 )

# 5. Multi-class performance
multiclass_kernel_names = list(multiclass_results.keys())
multiclass_scores = [multiclass_results[k]['test_score'] for k in multiclass_kernel_names]

fig.add_trace(
 go.Bar(
 x=multiclass_kernel_names,
 y=multiclass_scores,
 name='Multi-class Accuracy',
 marker_color='orange',
 text=[f'{score:.3f}' for score in multiclass_scores],
 textposition='auto'
 ),
 row=3, col=1
)

# 6. Support vector analysis across datasets
datasets = ['Text', 'Image', 'Circles', 'Moons']
linear_sv = [
 sum(text_results['Linear']['model'].n_support_),
 sum(image_results['Linear']['model'].n_support_),
 nonlinear_results['Circles']['Linear']['n_support'],
 nonlinear_results['Moons']['Linear']['n_support']
]
rbf_sv = [
 sum(text_results['RBF']['model'].n_support_),
 sum(image_results['RBF']['model'].n_support_),
 nonlinear_results['Circles']['RBF']['n_support'],
 nonlinear_results['Moons']['RBF']['n_support']
]

fig.add_trace(
 go.Bar(
 x=datasets,
 y=linear_sv,
 name='Linear SV',
 marker_color='lightblue'
 ),
 row=3, col=2
)

fig.add_trace(
 go.Bar(
 x=datasets,
 y=rbf_sv,
 name='RBF SV',
 marker_color='lightgreen'
 ),
 row=3, col=2
)

# Update layout
fig.update_layout(
 height=1200,
 title="Support Vector Machine Classification - Comprehensive Analysis Dashboard",
 showlegend=True
)

# Update axis labels
fig.update_xaxes(title_text="Kernel Type", row=1, col=1)
fig.update_xaxes(title_text="C Parameter (log scale)", type="log", row=1, col=2)
fig.update_xaxes(title_text="Gamma Parameter (log scale)", type="log", row=2, col=1)
fig.update_xaxes(title_text="Feature 1", row=2, col=2)
fig.update_xaxes(title_text="Kernel Type", row=3, col=1)
fig.update_xaxes(title_text="Dataset", row=3, col=2)

fig.update_yaxes(title_text="Accuracy", row=1, col=1)
fig.update_yaxes(title_text="Accuracy", row=1, col=2)
fig.update_yaxes(title_text="Accuracy", row=2, col=1)
fig.update_yaxes(title_text="Feature 2", row=2, col=2)
fig.update_yaxes(title_text="Accuracy", row=3, col=1)
fig.update_yaxes(title_text="Support Vectors", row=3, col=2)

fig.show()
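
The "Decision Boundaries" panel shows the Circles points by class. One way to overlay an actual boundary, sketched here as an assumption rather than as the notebook's method, is to evaluate a fitted model's `decision_function` on a grid. The helper `decision_grid` and the demo model are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def decision_grid(model, X, resolution=100, padding=0.3):
    """Evaluate a binary model's decision function on a regular 2-D grid."""
    x_axis = np.linspace(X[:, 0].min() - padding, X[:, 0].max() + padding, resolution)
    y_axis = np.linspace(X[:, 1].min() - padding, X[:, 1].max() + padding, resolution)
    xx, yy = np.meshgrid(x_axis, y_axis)
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    # zz[i, j] is the signed score at (x_axis[j], y_axis[i])
    zz = model.decision_function(grid_points).reshape(xx.shape)
    return x_axis, y_axis, zz

# Same generator settings as the notebook's Circles data, with fewer samples
X_demo, y_demo = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=42)
rbf_demo = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_demo, y_demo)

x_axis, y_axis, zz = decision_grid(rbf_demo, X_demo)
print(zz.shape)
```

The zero level of `zz` is the decision boundary. Passing the three arrays to `go.Contour(x=x_axis, y=y_axis, z=zz)` and adding that trace with `row=2, col=2` before `fig.show()` would place it under the scatter points.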

In [None]:
# 5. BUSINESS INSIGHTS AND ROI ANALYSIS
print(" 5. BUSINESS INSIGHTS AND ROI ANALYSIS")
print("="*40)

# Text classification business impact
print("Document Classification System ROI:")
monthly_documents = 100_000 # Documents processed per month
classification_accuracy = optimized_text_score

# Manual processing costs
manual_cost_per_doc = 0.50 # $0.50 per document for manual classification
manual_monthly_cost = monthly_documents * manual_cost_per_doc

# Automated system costs
system_monthly_cost = 8_000 # System operational cost
manual_review_rate = 1 - classification_accuracy # Documents requiring manual review
manual_review_cost = monthly_documents * manual_review_rate * manual_cost_per_doc

total_automated_cost = system_monthly_cost + manual_review_cost
monthly_savings = manual_monthly_cost - total_automated_cost
annual_savings = monthly_savings * 12

# Productivity improvements
time_saved_per_doc = 2.5 # minutes saved per document
correctly_classified = monthly_documents * classification_accuracy
total_time_saved = correctly_classified * time_saved_per_doc / 60 # hours
hourly_rate = 25 # average hourly rate
productivity_value = total_time_saved * hourly_rate

print(f"\nDocument Classification Impact:")
print(f"• Monthly documents: {monthly_documents:,}")
print(f"• Classification accuracy: {classification_accuracy:.1%}")
print(f"• Manual processing cost: ${manual_monthly_cost:,.0f}/month")
print(f"• Automated system cost: ${total_automated_cost:,.0f}/month")
print(f"• Monthly cost savings: ${monthly_savings:,.0f}")
print(f"• Annual cost savings: ${annual_savings:,.0f}")
print(f"• Additional productivity value: ${productivity_value:,.0f}/month")

# Image classification ROI
print(f"\nImage Recognition System ROI:")
daily_images = 50_000 # Images processed per day
image_accuracy = max([image_results[k]['test_score'] for k in image_results.keys()])

# Quality control improvements
defect_detection_rate = image_accuracy
defect_rate = 0.02 # 2% defect rate
cost_per_defect = 150 # Cost if defect reaches customer

monthly_images = daily_images * 30
expected_defects = monthly_images * defect_rate
defects_caught = expected_defects * defect_detection_rate
defects_missed = expected_defects - defects_caught

cost_avoidance = defects_caught * cost_per_defect
remaining_defect_cost = defects_missed * cost_per_defect

# System costs
image_system_cost = 15_000 # Monthly system cost
net_monthly_benefit = cost_avoidance - image_system_cost
image_roi = net_monthly_benefit / image_system_cost

print(f"• Daily images processed: {daily_images:,}")
print(f"• Detection accuracy: {image_accuracy:.1%}")
print(f"• Expected monthly defects: {expected_defects:.0f}")
print(f"• Defects caught: {defects_caught:.0f}")
print(f"• Cost avoidance: ${cost_avoidance:,.0f}/month")
print(f"• System cost: ${image_system_cost:,.0f}/month")
print(f"• Net monthly benefit: ${net_monthly_benefit:,.0f}")
print(f"• ROI: {image_roi*100:.0f}%")

# High-dimensional analysis ROI
print(f"\nHigh-Dimensional Analysis ROI:")
research_projects = 50 # Number of research projects per year
analysis_time_saved = 40 # Hours saved per project
researcher_hourly_rate = 75 # PhD researcher hourly rate

annual_time_savings = research_projects * analysis_time_saved
annual_cost_savings_research = annual_time_savings * researcher_hourly_rate

# Accuracy improvements leading to better research outcomes
improved_discovery_rate = 0.15 # 15% improvement in discovery rate
avg_discovery_value = 500_000 # Average value of a research discovery
additional_discoveries = research_projects * improved_discovery_rate
discovery_value = additional_discoveries * avg_discovery_value

research_system_cost = 25_000 # Annual system cost
total_research_benefit = annual_cost_savings_research + discovery_value
research_roi = (total_research_benefit - research_system_cost) / research_system_cost

print(f"• Research projects per year: {research_projects}")
print(f"• Time savings: {annual_time_savings:,} hours/year")
print(f"• Cost savings: ${annual_cost_savings_research:,.0f}/year")
print(f"• Additional discoveries: {additional_discoveries:.1f}/year")
print(f"• Discovery value: ${discovery_value:,.0f}/year")
print(f"• Total benefit: ${total_research_benefit:,.0f}/year")
print(f"• System cost: ${research_system_cost:,.0f}/year")
print(f"• ROI: {research_roi*100:.0f}%")

# Combined systems summary
# Each term in total_annual_benefits is already net of its own system cost
total_annual_investment = (system_monthly_cost * 12) + (image_system_cost * 12) + research_system_cost
total_annual_benefits = annual_savings + (net_monthly_benefit * 12) + (total_research_benefit - research_system_cost)
combined_roi = total_annual_benefits / total_annual_investment

print(f"\nCombined SVM Systems Summary:")
print(f"• Total annual investment: ${total_annual_investment:,.0f}")
print(f"• Total annual net benefits: ${total_annual_benefits:,.0f}")
print(f"• Combined ROI: {combined_roi*100:.0f}%")
print(f"• Payback period: {total_annual_investment/total_annual_benefits*12:.1f} months")

# Implementation guidelines
print(f"\nSVM Implementation Guidelines:")
print(f"• Use Linear SVM for high-dimensional sparse data (text, genes)")
print(f"• Use RBF kernel for non-linear problems with moderate dimensions")
print(f"• Scale features when using RBF or polynomial kernels")
print(f"• Start with C=1.0, optimize via cross-validation")
print(f"• Use LinearSVC for large-scale linear problems")
print(f"• Consider kernel approximation for very large datasets")

print(f"\nCross-Reference Learning Path:")
print(f"• Foundation: Tier2_SVM.ipynb (basic SVM concepts)")
print(f"• Building On: Tier2_LinearRegression.ipynb (linear methods)")
print(f"• Comparison: Tier5_Classification.ipynb (algorithm comparison)")
print(f"• Advanced: Advanced_KernelMethods.ipynb, Advanced_HighDimensionalClassification.ipynb")
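
The document-classification arithmetic in this cell can be checked in isolation. The sketch below repeats the same formulas in a small function and adds the accuracy at which savings reach zero. The function name `document_roi` is hypothetical, and the 95% accuracy is an illustrative input, not a result produced by the notebook.

```python
def document_roi(monthly_documents, accuracy, manual_cost_per_doc, system_monthly_cost):
    """Monthly savings of automated classification versus fully manual processing."""
    manual_cost = monthly_documents * manual_cost_per_doc
    # Misclassified documents are assumed to be caught and reviewed by hand
    review_cost = monthly_documents * (1 - accuracy) * manual_cost_per_doc
    automated_cost = system_monthly_cost + review_cost
    return {
        "manual_cost": manual_cost,
        "automated_cost": automated_cost,
        "monthly_savings": manual_cost - automated_cost,
        # Savings are zero when accuracy * manual_cost equals the system cost
        "break_even_accuracy": system_monthly_cost / manual_cost,
    }

# Same cost assumptions as the cell above, with an assumed accuracy of 95%
result = document_roi(
    monthly_documents=100_000,
    accuracy=0.95,
    manual_cost_per_doc=0.50,
    system_monthly_cost=8_000,
)
for key, value in result.items():
    print(f"{key}: {value:,.2f}")
```

Under these inputs the monthly savings come to about $39,500, and the system pays for itself at any accuracy above 16%, which shows how strongly the result depends on the assumed per-document cost rather than on the classifier.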