# Tier 2: Naive Bayes Classification

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** dc5e0d95-8fab-4735-bcc0-fbc7b6c5ed53

---

## Citation
Brandon Deloatch, "Tier 2: Naive Bayes Classification," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+, seaborn (imported by the notebook; minimum version not specified)
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** dc5e0d95-8fab-4735-bcc0-fbc7b6c5ed53
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
# Import Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Scikit-learn imports
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import roc_curve, auc, precision_recall_curve
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Text processing
import re
from collections import Counter
import string

# Additional utilities
from sklearn.datasets import make_classification
import warnings
warnings.filterwarnings('ignore')

print(" Tier 2: Naive Bayes Classification - Libraries Loaded Successfully!")
print("=" * 75)
print("Available Naive Bayes Techniques:")
print("• Gaussian Naive Bayes - Continuous features with normal distribution")
print("• Multinomial Naive Bayes - Discrete count features (text classification)")
print("• Bernoulli Naive Bayes - Binary features and document classification")
print("• Probability Analysis - Conditional probabilities and feature independence")
print("• Text Classification - Email spam detection and sentiment analysis")
print("• Medical Diagnosis - Risk assessment with categorical features")
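
All three variants listed above share one decision rule: score each class by its log prior plus the sum of per-feature log likelihoods, then normalise. The following is a minimal standard-library sketch of that rule. The class names, priors, and likelihood values are made-up illustrative numbers, and `naive_bayes_posterior` is a hypothetical helper rather than part of scikit-learn.

```python
import math

def naive_bayes_posterior(log_priors, log_likelihoods):
    """Combine class log priors with per-feature log likelihoods.

    log_priors: dict mapping class -> log P(class)
    log_likelihoods: dict mapping class -> list of log P(x_j | class)
    Returns a dict mapping class -> posterior probability.
    """
    # Joint log score per class: log P(c) + sum_j log P(x_j | c)
    joint = {c: log_priors[c] + sum(log_likelihoods[c]) for c in log_priors}
    # Log-sum-exp normalisation keeps the computation numerically stable
    peak = max(joint.values())
    log_evidence = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
    return {c: math.exp(v - log_evidence) for c, v in joint.items()}

# Illustrative numbers only: two classes, two observed features
log_priors = {"class_a": math.log(0.7), "class_b": math.log(0.3)}
log_likelihoods = {
    "class_a": [math.log(0.2), math.log(0.5)],
    "class_b": [math.log(0.6), math.log(0.4)],
}
posterior = naive_bayes_posterior(log_priors, log_likelihoods)
print(posterior)
```

In this example the prior favours `class_a`, but the feature likelihoods tip the posterior slightly toward `class_b`. Working in log space avoids underflow when many small probabilities are multiplied.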

In [None]:
# Generate Comprehensive Datasets for Naive Bayes Analysis
np.random.seed(42)

def generate_naive_bayes_datasets():
 """Generate datasets optimized for different Naive Bayes variants"""

 # 1. GAUSSIAN NAIVE BAYES DATASET - Medical Diagnosis
 n_patients = 1000

 # Generate patient features with realistic distributions
 age = np.random.normal(45, 15, n_patients)
 age = np.clip(age, 18, 90)

 # BMI with different distributions by health status
 healthy_mask = np.random.random(n_patients) < 0.7 # 70% healthy

 bmi = np.zeros(n_patients)
 bmi[healthy_mask] = np.random.normal(24, 3, healthy_mask.sum()) # Healthy BMI
 bmi[~healthy_mask] = np.random.normal(29, 4, (~healthy_mask).sum()) # Higher BMI for at-risk
 bmi = np.clip(bmi, 16, 45)

 # Blood pressure (systolic)
 blood_pressure = np.zeros(n_patients)
 blood_pressure[healthy_mask] = np.random.normal(120, 10, healthy_mask.sum())
 blood_pressure[~healthy_mask] = np.random.normal(140, 15, (~healthy_mask).sum())
 blood_pressure = np.clip(blood_pressure, 90, 200)

 # Cholesterol levels
 cholesterol = np.zeros(n_patients)
 cholesterol[healthy_mask] = np.random.normal(180, 20, healthy_mask.sum())
 cholesterol[~healthy_mask] = np.random.normal(220, 25, (~healthy_mask).sum())
 cholesterol = np.clip(cholesterol, 120, 350)

 # Exercise hours per week
 exercise_hours = np.zeros(n_patients)
 exercise_hours[healthy_mask] = np.random.gamma(2, 2, healthy_mask.sum()) + 2
 exercise_hours[~healthy_mask] = np.random.gamma(1, 1, (~healthy_mask).sum()) + 0.5
 exercise_hours = np.clip(exercise_hours, 0, 20)

 # Smoking history (binary, but affects continuous features)
 smoking = np.random.binomial(1, 0.25, n_patients) # 25% smokers

 # Adjust features for smokers
 blood_pressure[smoking == 1] += np.random.normal(10, 5, (smoking == 1).sum())
 cholesterol[smoking == 1] += np.random.normal(15, 8, (smoking == 1).sum())
 exercise_hours[smoking == 1] *= 0.8 # Smokers exercise less

 # Create health risk target (influenced by all factors)
 risk_score = (
 (age - 45) / 15 * 0.3 +
 (bmi - 25) / 5 * 0.25 +
 (blood_pressure - 120) / 20 * 0.2 +
 (cholesterol - 200) / 30 * 0.15 +
 (5 - exercise_hours) / 5 * 0.1 +
 smoking * 0.3 +
 np.random.normal(0, 0.1, n_patients) # Add noise
 )

 health_risk = (risk_score > 0.5).astype(int) # Binary: 0=Low Risk, 1=High Risk

 medical_df = pd.DataFrame({
 'age': age,
 'bmi': bmi,
 'blood_pressure': blood_pressure,
 'cholesterol': cholesterol,
 'exercise_hours': exercise_hours,
 'smoking': smoking,
 'health_risk': health_risk
 })

 # 2. MULTINOMIAL NAIVE BAYES DATASET - Document Classification
 # Generate synthetic text documents
 topics = ['technology', 'sports', 'politics', 'entertainment']

 # Word vocabularies for each topic
 vocab = {
 'technology': ['computer', 'software', 'data', 'algorithm', 'artificial', 'intelligence',
 'programming', 'code', 'system', 'digital', 'innovation', 'tech', 'app',
 'database', 'machine', 'learning', 'cloud', 'server', 'network', 'security'],
 'sports': ['game', 'team', 'player', 'score', 'win', 'match', 'championship', 'league',
 'football', 'basketball', 'soccer', 'baseball', 'tennis', 'golf', 'olympic',
 'coach', 'training', 'competition', 'tournament', 'athletic'],
 'politics': ['government', 'election', 'vote', 'president', 'congress', 'policy', 'law',
 'democracy', 'republican', 'democrat', 'campaign', 'senator', 'governor',
 'bill', 'constitution', 'court', 'justice', 'reform', 'debate', 'citizen'],
 'entertainment': ['movie', 'film', 'actor', 'actress', 'director', 'music', 'song',
 'concert', 'album', 'artist', 'show', 'television', 'celebrity',
 'theater', 'performance', 'award', 'oscar', 'grammy', 'festival', 'star']
 }

 documents = []
 labels = []

 n_docs_per_topic = 200

 for topic in topics:
     for _ in range(n_docs_per_topic):
         # Target document length of 10-30 words
         doc_length = np.random.randint(10, 31)

         # About 70% of the words come from the topic vocabulary; int() truncation
         # can leave the document a word short of doc_length
         topic_words = np.random.choice(vocab[topic],
                                        size=int(doc_length * 0.7),
                                        replace=True)

         # The remaining ~30% is noise drawn from one randomly chosen other topic
         other_topics = [t for t in topics if t != topic]
         noise_topic = np.random.choice(other_topics)
         noise_words = np.random.choice(vocab[noise_topic],
                                        size=int(doc_length * 0.3),
                                        replace=True)

         # Combine and shuffle
         all_words = np.concatenate([topic_words, noise_words])
         np.random.shuffle(all_words)

         documents.append(' '.join(all_words))
         labels.append(topic)

 text_df = pd.DataFrame({
 'document': documents,
 'topic': labels
 })

 # 3. BERNOULLI NAIVE BAYES DATASET - Email Spam Detection
 n_emails = 1000

 # Define spam and ham indicators
 spam_indicators = [
 'free', 'money', 'offer', 'click', 'urgent', 'limited', 'win', 'prize',
 'discount', 'deal', 'sale', 'buy', 'cheap', 'save', 'cash', 'bonus'
 ]

 ham_indicators = [
 'meeting', 'schedule', 'report', 'project', 'team', 'work', 'office',
 'client', 'business', 'professional', 'conference', 'presentation'
 ]

 emails = []
 spam_labels = []

 # Generate spam emails (40% of dataset)
 n_spam = int(n_emails * 0.4)
 for _ in range(n_spam):
     # Spam emails have higher probability of spam indicators
     features = {}
     for indicator in spam_indicators:
         features[f'has_{indicator}'] = np.random.binomial(1, 0.6) # 60% chance
     for indicator in ham_indicators:
         features[f'has_{indicator}'] = np.random.binomial(1, 0.1) # 10% chance

     # Additional spam features
     features['has_exclamation'] = np.random.binomial(1, 0.8)
     features['has_all_caps'] = np.random.binomial(1, 0.7)
     features['has_numbers'] = np.random.binomial(1, 0.9)
     features['length_short'] = np.random.binomial(1, 0.6) # Spam tends to be shorter

     emails.append(features)
     spam_labels.append(1) # Spam

 # Generate ham emails (60% of dataset)
 n_ham = n_emails - n_spam
 for _ in range(n_ham):
     features = {}
     for indicator in spam_indicators:
         features[f'has_{indicator}'] = np.random.binomial(1, 0.05) # 5% chance
     for indicator in ham_indicators:
         features[f'has_{indicator}'] = np.random.binomial(1, 0.4) # 40% chance

     # Additional ham features
     features['has_exclamation'] = np.random.binomial(1, 0.2)
     features['has_all_caps'] = np.random.binomial(1, 0.1)
     features['has_numbers'] = np.random.binomial(1, 0.3)
     features['length_short'] = np.random.binomial(1, 0.2) # Ham tends to be longer

     emails.append(features)
     spam_labels.append(0) # Ham

 # Convert to DataFrame
 email_features = list(emails[0].keys())
 email_data = {feature: [email[feature] for email in emails] for feature in email_features}
 email_data['is_spam'] = spam_labels

 email_df = pd.DataFrame(email_data)

 return medical_df, text_df, email_df

# Generate datasets
print(" Generating Naive Bayes optimized datasets...")
medical_df, text_df, email_df = generate_naive_bayes_datasets()

print(f"Medical Dataset (Gaussian NB): {medical_df.shape}")
print(f"Text Dataset (Multinomial NB): {text_df.shape}")
print(f"Email Dataset (Bernoulli NB): {email_df.shape}")

print("\nMedical Dataset (Health Risk Assessment):")
print(medical_df.head())
print(f"High Risk Rate: {medical_df['health_risk'].mean():.1%}")

print("\nText Dataset (Document Classification):")
print(text_df.head())
print("Topic Distribution:")
print(text_df['topic'].value_counts())

print("\nEmail Dataset (Spam Detection):")
print(email_df.head())
print(f"Spam Rate: {email_df['is_spam'].mean():.1%}")
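
To make the labelling rule in `generate_naive_bayes_datasets` concrete, the sketch below recomputes the noise-free part of `risk_score` for one hypothetical patient and applies the same 0.5 threshold. The patient values and the helper name `risk_score_without_noise` are illustrative assumptions; the weights and reference points are copied from the generator above.

```python
def risk_score_without_noise(age, bmi, blood_pressure, cholesterol,
                             exercise_hours, smoking):
    """Deterministic part of the generator's risk_score (noise term omitted)."""
    return (
        (age - 45) / 15 * 0.3
        + (bmi - 25) / 5 * 0.25
        + (blood_pressure - 120) / 20 * 0.2
        + (cholesterol - 200) / 30 * 0.15
        + (5 - exercise_hours) / 5 * 0.1
        + smoking * 0.3
    )

# Hypothetical patient, not a row from medical_df
score = risk_score_without_noise(age=60, bmi=30, blood_pressure=140,
                                 cholesterol=230, exercise_hours=2, smoking=1)
health_risk = int(score > 0.5)  # same threshold as the generator
print(f"risk_score={score:.2f}, health_risk={health_risk}")
```

A 45-year-old non-smoker at the reference values (BMI 25, blood pressure 120, cholesterol 200, 5 exercise hours) scores 0 under this formula, so the label is driven by deviations from those reference points plus the added noise.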

In [None]:
# 1. GAUSSIAN NAIVE BAYES ANALYSIS
print("🩺 1. GAUSSIAN NAIVE BAYES ANALYSIS")
print("=" * 34)

# Prepare medical data
medical_features = ['age', 'bmi', 'blood_pressure', 'cholesterol', 'exercise_hours', 'smoking']
X_medical = medical_df[medical_features]
y_medical = medical_df['health_risk']

# Split data
X_med_train, X_med_test, y_med_train, y_med_test = train_test_split(
 X_medical, y_medical, test_size=0.2, random_state=42, stratify=y_medical
)

print(f"Training set: {X_med_train.shape}")
print(f"Test set: {X_med_test.shape}")
print(f"Class distribution: {y_med_train.value_counts().to_dict()}")

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_med_train, y_med_train)

# Predictions and probabilities
y_med_pred = gnb.predict(X_med_test)
y_med_proba = gnb.predict_proba(X_med_test)

# Performance metrics
med_accuracy = accuracy_score(y_med_test, y_med_pred)
print(f"\n Gaussian Naive Bayes Performance:")
print(f"• Test Accuracy: {med_accuracy:.4f}")

# Cross-validation
cv_scores = cross_val_score(gnb, X_med_train, y_med_train, cv=5)
print(f"• Cross-validation: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

print(f"\nClassification Report:")
print(classification_report(y_med_test, y_med_pred, target_names=['Low Risk', 'High Risk']))

# Feature distribution analysis by class
print(f"\n Feature Distribution Analysis:")

feature_stats = {}
for feature in medical_features:
 low_risk_values = X_med_train[y_med_train == 0][feature]
 high_risk_values = X_med_train[y_med_train == 1][feature]

 feature_stats[feature] = {
 'low_risk_mean': low_risk_values.mean(),
 'low_risk_std': low_risk_values.std(),
 'high_risk_mean': high_risk_values.mean(),
 'high_risk_std': high_risk_values.std()
 }

 print(f"• {feature}:")
 print(f" Low Risk: μ={low_risk_values.mean():.2f}, σ={low_risk_values.std():.2f}")
 print(f" High Risk: μ={high_risk_values.mean():.2f}, σ={high_risk_values.std():.2f}")

# Visualize feature distributions
fig_distributions = make_subplots(
 rows=2, cols=3,
 subplot_titles=medical_features,
 specs=[[{"secondary_y": False}, {"secondary_y": False}, {"secondary_y": False}],
 [{"secondary_y": False}, {"secondary_y": False}, {"secondary_y": False}]]
)

colors = ['blue', 'red']
risk_labels = ['Low Risk', 'High Risk']

for i, feature in enumerate(medical_features):
    row = i // 3 + 1
    col = i % 3 + 1

    for risk_level in [0, 1]:
        feature_data = X_med_train[y_med_train == risk_level][feature]

        fig_distributions.add_trace(
            go.Histogram(
                x=feature_data,
                name=f'{risk_labels[risk_level]}',
                opacity=0.7,
                marker_color=colors[risk_level],
                nbinsx=20,
                showlegend=(i == 0) # Only show legend for first subplot
            ),
            row=row, col=col
        )

fig_distributions.update_layout(
 title="Feature Distributions by Health Risk Class",
 height=600,
 barmode='overlay'
)
fig_distributions.show()

# Probability analysis
print(f"\n Probability Analysis (Sample Predictions):")

# Show probability predictions for first 10 test samples
sample_indices = range(min(10, len(X_med_test)))
for i in sample_indices:
 actual = y_med_test.iloc[i]
 predicted = y_med_pred[i]
 prob_low = y_med_proba[i][0]
 prob_high = y_med_proba[i][1]

 print(f"Patient {i+1}: Actual={risk_labels[actual]}, "
 f"Predicted={risk_labels[predicted]} "
 f"(P(Low)={prob_low:.3f}, P(High)={prob_high:.3f})")

# Confusion Matrix
cm_medical = confusion_matrix(y_med_test, y_med_pred)

fig_cm_medical = ff.create_annotated_heatmap(
 z=cm_medical,
 x=['Low Risk', 'High Risk'],
 y=['Low Risk', 'High Risk'],
 annotation_text=cm_medical,
 colorscale='Blues',
 showscale=True
)

fig_cm_medical.update_layout(
 title="Gaussian Naive Bayes Confusion Matrix (Medical Diagnosis)",
 xaxis_title="Predicted",
 yaxis_title="Actual",
 height=400
)
fig_cm_medical.show()

# ROC Curve
fpr, tpr, _ = roc_curve(y_med_test, y_med_proba[:, 1])
roc_auc = auc(fpr, tpr)

fig_roc = go.Figure()

fig_roc.add_trace(
 go.Scatter(
 x=fpr,
 y=tpr,
 mode='lines',
 name=f'ROC Curve (AUC = {roc_auc:.3f})',
 line=dict(color='blue', width=2)
 )
)

fig_roc.add_trace(
 go.Scatter(
 x=[0, 1],
 y=[0, 1],
 mode='lines',
 name='Random Classifier',
 line=dict(color='red', dash='dash')
 )
)

fig_roc.update_layout(
 title="ROC Curve - Gaussian Naive Bayes (Medical Diagnosis)",
 xaxis_title="False Positive Rate",
 yaxis_title="True Positive Rate",
 height=500
)
fig_roc.show()
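
For intuition about what `GaussianNB` estimates, the following standard-library sketch fits a per-class mean and variance for a single feature and turns the Gaussian likelihoods into a posterior. The toy BMI values, labels, and helper names are assumptions made for illustration. scikit-learn additionally applies variance smoothing (`var_smoothing`), so its probabilities would differ slightly.

```python
import math
from statistics import fmean, pvariance

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def fit_one_feature(values, labels):
    """Estimate the prior, mean and population variance per class."""
    params = {}
    for c in sorted(set(labels)):
        xs = [v for v, y in zip(values, labels) if y == c]
        params[c] = {
            "prior": len(xs) / len(values),
            "mean": fmean(xs),
            "var": pvariance(xs),
        }
    return params

def posterior(x, params):
    """P(class | x) from log prior + Gaussian log likelihood, normalised."""
    scores = {c: math.log(p["prior"]) + gaussian_log_pdf(x, p["mean"], p["var"])
              for c, p in params.items()}
    peak = max(scores.values())
    total = sum(math.exp(s - peak) for s in scores.values())
    return {c: math.exp(s - peak) / total for c, s in scores.items()}

# Toy BMI readings: label 0 = low risk, 1 = high risk (illustrative values)
bmi = [22.0, 23.5, 24.0, 25.5, 28.0, 29.5, 30.0, 32.5]
risk = [0, 0, 0, 0, 1, 1, 1, 1]
params = fit_one_feature(bmi, risk)
probs = posterior(27.5, params)
print(params)
print(probs)
```

With several features, the per-feature log likelihoods are simply summed before normalising, which is where the conditional independence assumption enters.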

In [None]:
# 2. MULTINOMIAL NAIVE BAYES ANALYSIS (TEXT CLASSIFICATION)
print(" 2. MULTINOMIAL NAIVE BAYES ANALYSIS")
print("=" * 37)

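# Prepare text data. The TF-IDF features are built for reference only;
# the Multinomial NB models below are trained on raw counts from CountVectorizer.
# Note: both vectorizers are fit on the full corpus before the train/test split,
# so vocabulary and document frequencies are also learned from the test documents.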
print("Processing text documents...")

# Create TF-IDF features
tfidf_vectorizer = TfidfVectorizer(
 max_features=1000, # Limit vocabulary size
 stop_words='english',
 ngram_range=(1, 2), # Include unigrams and bigrams
 min_df=2, # Ignore terms that appear in less than 2 documents
 max_df=0.95 # Ignore terms that appear in more than 95% of documents
)

# Transform documents to TF-IDF features
X_text_tfidf = tfidf_vectorizer.fit_transform(text_df['document'])
y_text = text_df['topic']

print(f"TF-IDF Matrix shape: {X_text_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")

# Split data
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(
 X_text_tfidf, y_text, test_size=0.2, random_state=42, stratify=y_text
)

print(f"Training set: {X_text_train.shape}")
print(f"Test set: {X_text_test.shape}")

# Also create count-based features for Multinomial NB
count_vectorizer = CountVectorizer(
 max_features=1000,
 stop_words='english',
 ngram_range=(1, 2),
 min_df=2,
 max_df=0.95
)

X_text_counts = count_vectorizer.fit_transform(text_df['document'])
X_counts_train, X_counts_test, _, _ = train_test_split(
 X_text_counts, y_text, test_size=0.2, random_state=42, stratify=y_text
)

# Train Multinomial Naive Bayes with different alpha values
alpha_values = [0.1, 0.5, 1.0, 2.0, 5.0]
mnb_results = {}

print(f"\n Alpha Parameter Optimization:")

for alpha in alpha_values:
 mnb = MultinomialNB(alpha=alpha)
 mnb.fit(X_counts_train, y_text_train)

 # Cross-validation
 cv_scores = cross_val_score(mnb, X_counts_train, y_text_train, cv=5)
 test_score = mnb.score(X_counts_test, y_text_test)

 mnb_results[alpha] = {
 'cv_mean': cv_scores.mean(),
 'cv_std': cv_scores.std(),
 'test_accuracy': test_score
 }

 print(f"• α={alpha}: Test Acc={test_score:.4f}, CV={cv_scores.mean():.4f}±{cv_scores.std():.4f}")

# Find optimal alpha
mnb_df = pd.DataFrame(mnb_results).T
optimal_alpha = mnb_df['cv_mean'].idxmax()
print(f"\n Optimal α: {optimal_alpha}")

# Train final model with optimal alpha
mnb_final = MultinomialNB(alpha=optimal_alpha)
mnb_final.fit(X_counts_train, y_text_train)

# Predictions
y_text_pred = mnb_final.predict(X_counts_test)
y_text_proba = mnb_final.predict_proba(X_counts_test)

text_accuracy = accuracy_score(y_text_test, y_text_pred)
print(f"\n Final Multinomial Naive Bayes Performance:")
print(f"• α = {optimal_alpha}")
print(f"• Test Accuracy: {text_accuracy:.4f}")

print(f"\nClassification Report:")
print(classification_report(y_text_test, y_text_pred))

# Visualize alpha parameter effect
fig_alpha = go.Figure()

fig_alpha.add_trace(
 go.Scatter(
 x=list(alpha_values),
 y=mnb_df['cv_mean'],
 mode='lines+markers',
 name='CV Mean',
 line=dict(color='blue'),
 error_y=dict(type='data', array=mnb_df['cv_std'])
 )
)

fig_alpha.add_trace(
 go.Scatter(
 x=list(alpha_values),
 y=mnb_df['test_accuracy'],
 mode='lines+markers',
 name='Test Accuracy',
 line=dict(color='red')
 )
)

fig_alpha.update_layout(
 title="Multinomial Naive Bayes: Alpha Parameter Effect",
 xaxis_title="Alpha (Smoothing Parameter)",
 yaxis_title="Accuracy",
 height=500
)
fig_alpha.show()

# Feature importance analysis (top words per topic)
print(f"\n Top Words per Topic:")

feature_names = count_vectorizer.get_feature_names_out()
topics = mnb_final.classes_

for i, topic in enumerate(topics):
    # Get log probabilities for this topic
    log_probs = mnb_final.feature_log_prob_[i]

    # Get top 10 features
    top_indices = log_probs.argsort()[-10:][::-1]
    top_words = [feature_names[idx] for idx in top_indices]
    top_probs = [np.exp(log_probs[idx]) for idx in top_indices]

    print(f"\n{topic.upper()}:")
    for word, prob in zip(top_words, top_probs):
        print(f" • {word}: {prob:.4f}")

# Visualize top words for each topic
fig_words = make_subplots(
 rows=2, cols=2,
 subplot_titles=[topic.title() for topic in topics]
)

positions = [(1, 1), (1, 2), (2, 1), (2, 2)]

for i, topic in enumerate(topics):
 row, col = positions[i]

 log_probs = mnb_final.feature_log_prob_[i]
 top_indices = log_probs.argsort()[-10:][::-1]
 top_words = [feature_names[idx] for idx in top_indices]
 top_probs = [np.exp(log_probs[idx]) for idx in top_indices]

 fig_words.add_trace(
 go.Bar(
 x=top_probs,
 y=top_words,
 orientation='h',
 name=topic,
 showlegend=False,
 marker_color=px.colors.qualitative.Set1[i]
 ),
 row=row, col=col
 )

fig_words.update_layout(
 title="Top Words by Topic (Multinomial Naive Bayes)",
 height=600
)
fig_words.show()

# Confusion Matrix
cm_text = confusion_matrix(y_text_test, y_text_pred)

fig_cm_text = ff.create_annotated_heatmap(
 z=cm_text,
 x=list(topics), # plain lists of class labels for the figure factory
 y=list(topics),
 annotation_text=cm_text,
 colorscale='Blues',
 showscale=True
)

fig_cm_text.update_layout(
 title=f"Multinomial Naive Bayes Confusion Matrix (α={optimal_alpha})",
 xaxis_title="Predicted Topic",
 yaxis_title="Actual Topic",
 height=500
)
fig_cm_text.show()
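
The `alpha` values swept above control additive (Lidstone) smoothing of the per-class word probabilities, which scikit-learn documents for `MultinomialNB` as (N_yi + α) / (N_y + α·n), with n the number of features. The standard-library sketch below uses made-up word counts to show how a word never seen in a class still receives a non-zero probability that grows with α. `smoothed_word_probs` is a hypothetical helper.

```python
def smoothed_word_probs(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary_size)."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab_size) for w, c in counts.items()}

# Made-up counts for one class; "golf" never occurred in this class's documents
sports_counts = {"game": 30, "team": 15, "score": 5, "golf": 0}

for alpha in [0.1, 1.0, 5.0]:
    probs = smoothed_word_probs(sports_counts, alpha)
    print(f"alpha={alpha}: P(golf|class)={probs['golf']:.4f}, "
          f"P(game|class)={probs['game']:.4f}")
```

Larger α pulls each class's word distribution toward uniform, which is one reason very large values can blur the differences between topics.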

In [None]:
# 3. BERNOULLI NAIVE BAYES ANALYSIS (SPAM DETECTION)
print(" 3. BERNOULLI NAIVE BAYES ANALYSIS")
print("=" * 34)

# Prepare email data
email_features = [col for col in email_df.columns if col != 'is_spam']
X_email = email_df[email_features]
y_email = email_df['is_spam']

print(f"Email features: {len(email_features)}")
print(f"Feature sample: {email_features[:5]}")

# Split data
X_email_train, X_email_test, y_email_train, y_email_test = train_test_split(
 X_email, y_email, test_size=0.2, random_state=42, stratify=y_email
)

print(f"Training set: {X_email_train.shape}")
print(f"Test set: {X_email_test.shape}")
print(f"Class distribution: {y_email_train.value_counts().to_dict()}")

# Train Bernoulli Naive Bayes with different alpha values
bnb_results = {}

print(f"\n Alpha Parameter Optimization (Bernoulli NB):")

for alpha in alpha_values:
 bnb = BernoulliNB(alpha=alpha)
 bnb.fit(X_email_train, y_email_train)

 # Cross-validation
 cv_scores = cross_val_score(bnb, X_email_train, y_email_train, cv=5)
 test_score = bnb.score(X_email_test, y_email_test)

 bnb_results[alpha] = {
 'cv_mean': cv_scores.mean(),
 'cv_std': cv_scores.std(),
 'test_accuracy': test_score
 }

 print(f"• α={alpha}: Test Acc={test_score:.4f}, CV={cv_scores.mean():.4f}±{cv_scores.std():.4f}")

# Find optimal alpha
bnb_df = pd.DataFrame(bnb_results).T
optimal_alpha_bnb = bnb_df['cv_mean'].idxmax()
print(f"\n Optimal α: {optimal_alpha_bnb}")

# Train final model
bnb_final = BernoulliNB(alpha=optimal_alpha_bnb)
bnb_final.fit(X_email_train, y_email_train)

# Predictions
y_email_pred = bnb_final.predict(X_email_test)
y_email_proba = bnb_final.predict_proba(X_email_test)

email_accuracy = accuracy_score(y_email_test, y_email_pred)
print(f"\n Final Bernoulli Naive Bayes Performance:")
print(f"• α = {optimal_alpha_bnb}")
print(f"• Test Accuracy: {email_accuracy:.4f}")

print(f"\nClassification Report:")
print(classification_report(y_email_test, y_email_pred, target_names=['Ham', 'Spam']))

# Feature importance analysis for spam detection
print(f"\n Feature Analysis for Spam Detection:")

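# Calculate feature importance from the difference of per-class log probabilities
spam_log_probs = bnb_final.feature_log_prob_[1] # log P(feature=1 | spam)
ham_log_probs = bnb_final.feature_log_prob_[0] # log P(feature=1 | ham)

# Log probability ratio log[P(x=1|spam) / P(x=1|ham)]; strictly a log-likelihood
# ratio rather than a log odds ratio, although the variable keeps that name
log_odds_ratio = spam_log_probs - ham_log_probs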

# Sort features by importance
feature_importance = pd.DataFrame({
 'feature': email_features,
 'log_odds_ratio': log_odds_ratio,
 'spam_indicator': log_odds_ratio > 0
}).sort_values('log_odds_ratio', key=abs, ascending=False)

print(f"Top 10 Spam Indicators:")
spam_indicators = feature_importance[feature_importance['spam_indicator'] == True].head(10)
for _, row in spam_indicators.iterrows():
 print(f"• {row['feature']}: {row['log_odds_ratio']:.3f}")

print(f"\nTop 10 Ham Indicators:")
ham_indicators = feature_importance[feature_importance['spam_indicator'] == False].head(10)
for _, row in ham_indicators.iterrows():
 print(f"• {row['feature']}: {row['log_odds_ratio']:.3f}")

# Visualize feature importance
fig_feat_importance = go.Figure()

# Top spam features
top_spam = feature_importance[feature_importance['spam_indicator'] == True].head(10)
fig_feat_importance.add_trace(
 go.Bar(
 x=top_spam['log_odds_ratio'],
 y=top_spam['feature'],
 orientation='h',
 name='Spam Indicators',
 marker_color='red',
 opacity=0.7
 )
)

# Top ham features
top_ham = feature_importance[feature_importance['spam_indicator'] == False].head(10)
fig_feat_importance.add_trace(
 go.Bar(
 x=top_ham['log_odds_ratio'],
 y=top_ham['feature'],
 orientation='h',
 name='Ham Indicators',
 marker_color='blue',
 opacity=0.7
 )
)

fig_feat_importance.update_layout(
 title="Feature Importance for Spam Detection (Bernoulli Naive Bayes)",
 xaxis_title="Log Odds Ratio (Spam vs Ham)",
 yaxis_title="Features",
 height=600
)
fig_feat_importance.show()

# ROC and Precision-Recall curves
fpr_email, tpr_email, _ = roc_curve(y_email_test, y_email_proba[:, 1])
roc_auc_email = auc(fpr_email, tpr_email)

precision, recall, _ = precision_recall_curve(y_email_test, y_email_proba[:, 1])

fig_curves = make_subplots(
 rows=1, cols=2,
 subplot_titles=['ROC Curve', 'Precision-Recall Curve']
)

# ROC Curve
fig_curves.add_trace(
 go.Scatter(
 x=fpr_email,
 y=tpr_email,
 mode='lines',
 name=f'ROC (AUC = {roc_auc_email:.3f})',
 line=dict(color='blue')
 ),
 row=1, col=1
)

fig_curves.add_trace(
 go.Scatter(
 x=[0, 1],
 y=[0, 1],
 mode='lines',
 name='Random',
 line=dict(color='red', dash='dash'),
 showlegend=False
 ),
 row=1, col=1
)

# Precision-Recall Curve
fig_curves.add_trace(
 go.Scatter(
 x=recall,
 y=precision,
 mode='lines',
 name='Precision-Recall',
 line=dict(color='green'),
 showlegend=False
 ),
 row=1, col=2
)

fig_curves.update_xaxes(title_text="False Positive Rate", row=1, col=1)
fig_curves.update_yaxes(title_text="True Positive Rate", row=1, col=1)
fig_curves.update_xaxes(title_text="Recall", row=1, col=2)
fig_curves.update_yaxes(title_text="Precision", row=1, col=2)

fig_curves.update_layout(
 title="Bernoulli Naive Bayes Performance Curves (Spam Detection)",
 height=500
)
fig_curves.show()

# Confusion Matrix
cm_email = confusion_matrix(y_email_test, y_email_pred)

fig_cm_email = ff.create_annotated_heatmap(
 z=cm_email,
 x=['Ham', 'Spam'],
 y=['Ham', 'Spam'],
 annotation_text=cm_email,
 colorscale='Blues',
 showscale=True
)

fig_cm_email.update_layout(
 title=f"Bernoulli Naive Bayes Confusion Matrix (α={optimal_alpha_bnb})",
 xaxis_title="Predicted",
 yaxis_title="Actual",
 height=400
)
fig_cm_email.show()

# Prediction confidence analysis
print(f"\n Prediction Confidence Analysis:")

# Probability assigned to the true class, grouped by the actual label
spam_confidences = y_email_proba[(y_email_test == 1).to_numpy()][:, 1] # P(spam) for emails that are spam
ham_confidences = y_email_proba[(y_email_test == 0).to_numpy()][:, 0] # P(ham) for emails that are ham

print(f"Spam detection confidence: {spam_confidences.mean():.3f} ± {spam_confidences.std():.3f}")
print(f"Ham detection confidence: {ham_confidences.mean():.3f} ± {ham_confidences.std():.3f}")

# Show some example predictions
print(f"\nSample Email Predictions:")
sample_size = min(5, len(X_email_test))
for i in range(sample_size):
 actual = y_email_test.iloc[i]
 predicted = y_email_pred[i]
 prob_ham = y_email_proba[i][0]
 prob_spam = y_email_proba[i][1]

 actual_label = 'Spam' if actual == 1 else 'Ham'
 pred_label = 'Spam' if predicted == 1 else 'Ham'

 print(f"Email {i+1}: Actual={actual_label}, Predicted={pred_label} "
 f"(P(Ham)={prob_ham:.3f}, P(Spam)={prob_spam:.3f})")
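
Unlike the multinomial model, Bernoulli Naive Bayes also scores features that are absent, through a log(1 − p) term. The sketch below applies that rule using the generator's own probabilities for three of the features defined earlier (`has_free`, `has_meeting`, `has_exclamation`) and the 40/60 spam/ham split. A fitted `BernoulliNB` would use smoothed estimates of these values instead, and `bernoulli_posterior` is a hypothetical helper.

```python
import math

# Per-class P(feature = 1), taken from the synthetic email generator
feature_probs = {
    "spam": {"has_free": 0.6, "has_meeting": 0.1, "has_exclamation": 0.8},
    "ham": {"has_free": 0.05, "has_meeting": 0.4, "has_exclamation": 0.2},
}
priors = {"spam": 0.4, "ham": 0.6}

def bernoulli_posterior(email, feature_probs, priors):
    """Posterior over classes; absent features contribute log(1 - p)."""
    scores = {}
    for label, probs in feature_probs.items():
        score = math.log(priors[label])
        for name, p in probs.items():
            score += math.log(p) if email[name] == 1 else math.log(1 - p)
        scores[label] = score
    peak = max(scores.values())
    total = sum(math.exp(s - peak) for s in scores.values())
    return {label: math.exp(s - peak) / total for label, s in scores.items()}

# Hypothetical email: contains "free" and an exclamation mark, no "meeting"
email = {"has_free": 1, "has_meeting": 0, "has_exclamation": 1}
result = bernoulli_posterior(email, feature_probs, priors)
print(result)
```

Here the missing `has_meeting` flag counts as evidence for spam, because ham emails carry that flag four times as often as spam emails in the generator.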

In [None]:
# 4. NAIVE BAYES VARIANTS COMPARISON
print(" 4. NAIVE BAYES VARIANTS COMPARISON")
print("=" * 37)

# Compare all three Naive Bayes variants on appropriate datasets
print("Comparing Naive Bayes variants across different problem types:")

# Create comparison results
comparison_results = {
 'Gaussian NB (Medical)': {
 'dataset': 'Medical Diagnosis',
 'accuracy': med_accuracy,
 'best_for': 'Continuous features with normal distributions',
 'features': 'Age, BMI, Blood Pressure, etc.',
 'sample_size': len(X_med_test)
 },
 'Multinomial NB (Text)': {
 'dataset': 'Document Classification',
 'accuracy': text_accuracy,
 'best_for': 'Discrete count features (text, documents)',
 'features': 'Word and bigram counts',
 'sample_size': X_text_test.shape[0] # len() is undefined for SciPy sparse matrices
 },
 'Bernoulli NB (Spam)': {
 'dataset': 'Email Spam Detection',
 'accuracy': email_accuracy,
 'best_for': 'Binary/boolean features',
 'features': 'Presence/absence of keywords',
 'sample_size': len(X_email_test)
 }
}

print(f"\n Performance Comparison:")
for variant, results in comparison_results.items():
 print(f"\n{variant}:")
 print(f" • Dataset: {results['dataset']}")
 print(f" • Test Accuracy: {results['accuracy']:.4f}")
 print(f" • Best for: {results['best_for']}")
 print(f" • Feature types: {results['features']}")
 print(f" • Test samples: {results['sample_size']}")

# Visualize comparison
variants = list(comparison_results.keys())
accuracies = [results['accuracy'] for results in comparison_results.values()]
datasets = [results['dataset'] for results in comparison_results.values()]

fig_comparison = go.Figure()

fig_comparison.add_trace(
 go.Bar(
 x=variants,
 y=accuracies,
 text=[f"{acc:.3f}" for acc in accuracies],
 textposition='outside',
 marker_color=['lightblue', 'lightgreen', 'lightcoral'],
 hovertemplate="Variant: %{x}<br>Accuracy: %{y:.4f}<extra></extra>"
 )
)

fig_comparison.update_layout(
 title="Naive Bayes Variants Performance Comparison",
 xaxis_title="Naive Bayes Variant",
 yaxis_title="Test Accuracy",
 height=500
)
fig_comparison.show()

# Feature independence analysis
print(f"\n Feature Independence Analysis:")

# Analyze feature correlations for each dataset
print(f"\n1. Medical Dataset Feature Correlations:")
medical_corr = X_medical.corr()
print("Strong correlations (|r| > 0.3):")
for i in range(len(medical_corr.columns)):
    for j in range(i+1, len(medical_corr.columns)):
        corr_val = medical_corr.iloc[i, j]
        if abs(corr_val) > 0.3:
            print(f" • {medical_corr.columns[i]} ↔ {medical_corr.columns[j]}: {corr_val:.3f}")

print(f"\n2. Email Dataset Feature Correlations:")
email_corr = X_email.corr()
print("Strong correlations (|r| > 0.3):")
strong_corr_count = 0
for i in range(len(email_corr.columns)):
    for j in range(i+1, len(email_corr.columns)):
        corr_val = email_corr.iloc[i, j]
        if abs(corr_val) > 0.3:
            print(f" • {email_corr.columns[i]} ↔ {email_corr.columns[j]}: {corr_val:.3f}")
            strong_corr_count += 1
        if strong_corr_count >= 5: # Limit output
            break
    if strong_corr_count >= 5:
        break

if strong_corr_count >= 5:
 print(" • ... (showing first 5 correlations)")

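# Independence assumption: what the correlations above do and do not show
print("\n3. Independence Assumption Impact:")
print(" • Naive Bayes assumes independence within each class; the correlations above")
print("   are computed over all rows, so part of them comes from the class label itself")
print(" • Medical data: the generator lets smoking shift both blood pressure and")
print("   cholesterol, so some within-class dependence is expected")
print(" • Recommendation: Monitor performance; consider feature selection")
print(" ")
print(" • Email data: each binary feature is sampled independently given the class,")
print("   so the assumption holds by construction for this synthetic dataset")
print(" • Naive Bayes is often robust to moderate violations, but re-check per-class")
print("   correlations before applying the same model to real email data")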

# Training speed comparison
import time

print(f"\n Training Speed Comparison:")

training_times = {}

# Gaussian NB timing (perf_counter is monotonic and higher resolution than time.time)
start_time = time.perf_counter()
gnb_timing = GaussianNB()
gnb_timing.fit(X_med_train, y_med_train)
training_times['Gaussian NB'] = time.perf_counter() - start_time

# Multinomial NB timing
start_time = time.perf_counter()
mnb_timing = MultinomialNB(alpha=optimal_alpha)
mnb_timing.fit(X_counts_train, y_text_train)
training_times['Multinomial NB'] = time.perf_counter() - start_time

# Bernoulli NB timing
start_time = time.perf_counter()
bnb_timing = BernoulliNB(alpha=optimal_alpha_bnb)
bnb_timing.fit(X_email_train, y_email_train)
training_times['Bernoulli NB'] = time.perf_counter() - start_time

for variant, time_taken in training_times.items():
 print(f" • {variant}: {time_taken:.4f} seconds")

print(f"\nAll three variants fit in a single pass over the training data, so training time grows linearly with sample count.")

# Memory usage estimation
print(f"\n Memory Usage Characteristics:")
print(f" • Gaussian NB: Stores mean and variance for each feature-class pair")
print(f" Memory: O(features × classes) = O({len(medical_features)} × 2) parameters")
print(f" ")
print(f" • Multinomial NB: Stores probability for each feature-class pair")
print(f" Memory: O(vocabulary × classes) = O({X_counts_train.shape[1]} × {len(mnb_final.classes_)}) parameters")
print(f" ")
print(f" • Bernoulli NB: Stores probability for each binary feature-class pair")
print(f" Memory: O(features × classes) = O({len(email_features)} × 2) parameters")
print(f" ")
print(f"All variants store only per-class, per-feature parameters, so model size does not grow with the number of training samples.")
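
The memory characteristics printed above can be checked with a small parameter count. The following is a minimal sketch and not part of the notebook's pipeline: `nb_parameter_count` is a hypothetical helper, and the feature and class counts in the loop are made-up illustrative sizes. It counts only class priors and per-class, per-feature parameters; a library implementation may keep extra bookkeeping arrays (for example raw counts) that are not included here.

```python
def nb_parameter_count(n_features: int, n_classes: int, variant: str) -> int:
    """Count fitted model parameters for a Naive Bayes variant.

    Includes class priors plus per-class, per-feature parameters only.
    """
    if n_features < 1 or n_classes < 2:
        raise ValueError("need at least 1 feature and 2 classes")
    priors = n_classes
    if variant == "gaussian":
        # One mean and one variance per (class, feature) pair.
        return priors + 2 * n_features * n_classes
    if variant in ("multinomial", "bernoulli"):
        # One probability per (class, feature) pair.
        return priors + n_features * n_classes
    raise ValueError(f"unknown variant: {variant!r}")


# Hypothetical sizes, chosen for illustration only.
example_sizes = [("gaussian", 8, 2), ("multinomial", 5000, 4), ("bernoulli", 20, 2)]
for name, n_feat, n_cls in example_sizes:
    print(f"{name}: {nb_parameter_count(n_feat, n_cls, name):,} parameters")
```

None of the counts depend on the number of training rows, which is why a fitted Naive Bayes model stays small even when the training set is large.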

In [None]:
# 5. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS
print(" 5. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS")
print("=" * 54)

# Probabilistic insights and business applications
print(" Probabilistic Decision Making Insights:")

print(f"\n1. MODEL PERFORMANCE SUMMARY:")
print(f" • Medical Diagnosis (Gaussian NB): {med_accuracy:.1%} accuracy")
print(f" • Document Classification (Multinomial NB): {text_accuracy:.1%} accuracy")
print(f" • Spam Detection (Bernoulli NB): {email_accuracy:.1%} accuracy")

# ROI Analysis for each application
print(f"\n2. ROI ANALYSIS BY APPLICATION:")

# Medical diagnosis ROI
medical_volume = 1000 # Daily patients
misdiagnosis_cost = 5000 # Cost of misdiagnosis
screening_cost = 50 # Cost per automated screening
manual_diagnosis_cost = 200 # Cost per manual diagnosis

automated_daily_cost = medical_volume * screening_cost
manual_daily_cost = medical_volume * manual_diagnosis_cost
daily_savings = manual_daily_cost - automated_daily_cost

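# Expected number of high-risk patients per day, from the dataset's base rate
high_risk_patients = medical_volume * (medical_df['health_risk'].mean())
# Simplification: the overall error rate stands in for the false-negative rate
false_negative_rate = 1 - med_accuracy
# Assume 5% of missed high-risk patients develop a serious (costly) condition
potential_misdiagnosis_cost = high_risk_patients * false_negative_rate * misdiagnosis_cost * 0.05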

print(f"\n Medical Diagnosis System:")
print(f" • Daily patient volume: {medical_volume:,}")
print(f" • Automated screening cost: ${automated_daily_cost:,}/day")
print(f" • Manual diagnosis cost: ${manual_daily_cost:,}/day")
print(f" • Daily cost savings: ${daily_savings:,}")
print(f" • Annual savings: ${daily_savings * 365:,}")
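print(f" • Risk mitigation: about {high_risk_patients:.0f} high-risk patients/day expected among those screened")
print(f" • Estimated cost of missed cases: ${potential_misdiagnosis_cost:,.0f}/day")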

# Document classification ROI
doc_volume = 10000 # Daily documents
manual_classification_cost = 2 # Cost per manual classification
automated_classification_cost = 0.1 # Cost per automated classification

doc_daily_savings = doc_volume * (manual_classification_cost - automated_classification_cost)

print(f"\n Document Classification System:")
print(f" • Daily document volume: {doc_volume:,}")
print(f" • Cost savings per document: ${manual_classification_cost - automated_classification_cost:.2f}")
print(f" • Daily cost savings: ${doc_daily_savings:,.0f}")
print(f" • Annual savings: ${doc_daily_savings * 365:,.0f}")
print(f" • Accuracy: {text_accuracy:.1%} automated classification")

# Spam detection ROI
email_volume = 100000 # Daily emails
spam_rate = 0.4 # 40% spam
false_positive_cost = 10 # Cost of blocking legitimate email
false_negative_cost = 1 # Cost of letting spam through
manual_review_cost = 0.5 # Cost per manual review

spam_emails = email_volume * spam_rate
ham_emails = email_volume * (1 - spam_rate)

# Calculate costs with current system
fp_rate = cm_email[0][1] / (cm_email[0][0] + cm_email[0][1]) # Ham classified as spam
fn_rate = cm_email[1][0] / (cm_email[1][0] + cm_email[1][1]) # Spam classified as ham

daily_fp_cost = ham_emails * fp_rate * false_positive_cost
daily_fn_cost = spam_emails * fn_rate * false_negative_cost
total_daily_cost = daily_fp_cost + daily_fn_cost

print(f"\n Spam Detection System:")
print(f" • Daily email volume: {email_volume:,}")
print(f" • Spam rate: {spam_rate:.1%}")
print(f" • False positive cost: ${daily_fp_cost:,.0f}/day")
print(f" • False negative cost: ${daily_fn_cost:,.0f}/day")
print(f" • Total error cost: ${total_daily_cost:,.0f}/day")
print(f" • Annual error cost: ${total_daily_cost * 365:,.0f}")

print(f"\n3. STRATEGIC IMPLEMENTATION RECOMMENDATIONS:")

print(f"\n Gaussian Naive Bayes (Medical Diagnosis):")
print(f" • Deploy as preliminary screening tool")
print(f" • Use probability scores to prioritize urgent cases")
print(f" • Integrate with electronic health records")
print(f" • Maintain human oversight for final diagnosis")
print(f" • Regular model updates with new patient data")

print(f"\n Multinomial Naive Bayes (Document Classification):")
print(f" • Implement for content management systems")
print(f" • Use for automated news categorization")
print(f" • Apply to customer support ticket routing")
print(f" • Enable real-time document processing")
print(f" • Continuously update vocabulary and categories")

print(f"\n Bernoulli Naive Bayes (Spam Detection):")
print(f" • Deploy as first-line email defense")
print(f" • Combine with other security measures")
print(f" • Implement user feedback loop")
print(f" • Regular feature engineering for new spam patterns")
print(f" • Monitor false positive rates closely")

print(f"\n4. NAIVE BAYES ADVANTAGES:")
print(f" • Fast training and prediction (real-time capable)")
print(f" • Minimal memory requirements")
print(f" • Excellent baseline performance")
print(f" • Handles multiple classes naturally")
print(f" • Probabilistic output enables confidence scoring")
print(f" • Robust to irrelevant features")
print(f" • Works well with small datasets")

print(f"\n5. LIMITATIONS AND MITIGATION:")
print(f" • Independence assumption rarely holds perfectly")
print(f" → Monitor feature correlations and performance")
print(f" • Can be outperformed by more complex models")
print(f" → Use as baseline; ensemble with other methods")
print(f" • Gaussian NB is sensitive to skewed (non-normal) features")
print(f" → Transform skewed features (e.g., log or power transform) before fitting")
print(f" • Zero probability problem")
print(f" → Use Laplace smoothing (alpha parameter)")

print(f"\n6. MONITORING AND MAINTENANCE:")
print(f" • Track prediction confidence distributions")
print(f" • Monitor feature importance changes over time")
print(f" • Set up automated retraining pipelines")
print(f" • Implement A/B testing for model updates")
print(f" • Regular validation against ground truth")

print(f"\n7. ADVANCED TECHNIQUES:")
print(f" • Complement Naive Bayes for imbalanced class distributions")
print(f" • Ensemble methods combining multiple NB variants")
print(f" • Online learning for streaming data")
print(f" • Feature selection to improve independence")
print(f" • Calibration for better probability estimates")

print(f"\n8. NEXT STEPS:")
print(f" • Pilot deployment in production environment")
print(f" • Collect user feedback and performance metrics")
print(f" • Experiment with feature engineering")
print(f" • Compare against ensemble methods")
print(f" • Develop domain-specific variants")

print(f"\n" + "="*80)
print(f" NAIVE BAYES LEARNING SUMMARY:")
print(f" Mastered Bayes' theorem and conditional probability")
print(f" Applied three NB variants to appropriate problem types")
print(f" Optimized hyperparameters and analyzed feature importance")
print(f" Understood independence assumptions and their violations")
print(f" Analyzed probabilistic outputs and prediction confidence")
print(f" Generated comprehensive business applications and ROI analysis")
print(f"="*80)
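
The spam ROI figures above price a blocked legitimate email at $10 and a missed spam message at $1. With costs that uneven, the default 0.5 probability cutoff is unlikely to be the cheapest decision rule. The following is a minimal sketch of the standard expected-cost threshold; `spam_threshold` and `expected_cost` are hypothetical helpers, and the result is only meaningful if the model's predicted probabilities are reasonably well calibrated, which is an assumption here and often does not hold for uncalibrated Naive Bayes.

```python
def spam_threshold(false_positive_cost: float, false_negative_cost: float) -> float:
    """Return the spam probability above which flagging minimizes expected cost."""
    if false_positive_cost <= 0 or false_negative_cost <= 0:
        raise ValueError("costs must be positive")
    # Flag as spam when (1 - p) * C_fp < p * C_fn, i.e. p > C_fp / (C_fp + C_fn).
    return false_positive_cost / (false_positive_cost + false_negative_cost)


def expected_cost(p_spam: float, flag_as_spam: bool,
                  false_positive_cost: float, false_negative_cost: float) -> float:
    """Expected cost of one decision, given the predicted spam probability."""
    if flag_as_spam:
        # Cost is incurred only if the message was actually legitimate.
        return (1 - p_spam) * false_positive_cost
    # Cost is incurred only if the message was actually spam.
    return p_spam * false_negative_cost


# Costs taken from the spam ROI section: $10 per false positive, $1 per false negative.
threshold = spam_threshold(10, 1)
print(f"Flag as spam only when P(spam) > {threshold:.3f}")

for p in (0.80, 0.95):
    flag = expected_cost(p, True, 10, 1)
    keep = expected_cost(p, False, 10, 1)
    print(f"P(spam)={p:.2f}: flag costs ${flag:.2f}, deliver costs ${keep:.2f}")
```

Under these costs a message with an 80% spam probability is still cheaper to deliver than to block, which is consistent with the recommendation to monitor false positive rates closely.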