# Tier 5: Naive Bayes Classification

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** b0b14187-df8b-4626-823d-a105acb62f35

---

## Citation
Brandon Deloatch, "Tier 5: Naive Bayes Classification," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** b0b14187-df8b-4626-823d-a105acb62f35
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [6]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.datasets import make_classification, fetch_20newsgroups
import warnings
warnings.filterwarnings('ignore')

print(" Tier 5: Naive Bayes Classification - Libraries Loaded!")
print("="*55)
print("Naive Bayes Classification Techniques:")
print("• Gaussian Naive Bayes for continuous features")
print("• Multinomial Naive Bayes for count data")
print("• Bernoulli Naive Bayes for binary features")
print("• Text classification and document analysis")
print("• Probability estimation and feature independence")
print("• Real-time classification applications")

 Tier 5: Naive Bayes Classification - Libraries Loaded!
Naive Bayes Classification Techniques:
• Gaussian Naive Bayes for continuous features
• Multinomial Naive Bayes for count data
• Bernoulli Naive Bayes for binary features
• Text classification and document analysis
• Probability estimation and feature independence
• Real-time classification applications
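
The variants listed above share one decision rule and differ only in how they model each feature's likelihood. As a reminder of the standard formulation (the symbols are notation introduced for this note, not names used in the notebook): for a feature vector $x = (x_1, \dots, x_d)$ and class $c$,

$$
P(c \mid x) \;\propto\; P(c)\prod_{j=1}^{d} P(x_j \mid c),
\qquad
\hat{y} \;=\; \arg\max_{c}\Big[\log P(c) + \sum_{j=1}^{d}\log P(x_j \mid c)\Big].
$$

Gaussian NB takes $P(x_j \mid c) = \mathcal{N}(x_j;\ \mu_{cj},\ \sigma_{cj}^{2})$, Multinomial NB uses smoothed per-class relative frequencies, and Bernoulli NB models the presence or absence of each feature. The product over $j$ is the "naive" conditional-independence assumption.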


In [7]:
# Generate comprehensive Naive Bayes datasets
np.random.seed(42)

# 1. Medical diagnosis dataset (Gaussian NB)
def generate_medical_dataset(n_samples=2000):
    """Generate realistic medical diagnosis dataset."""

    # Define disease classes
    diseases = ['Healthy', 'Flu', 'COVID-19', 'Pneumonia']
    n_classes = len(diseases)

    data = []

    for i in range(n_samples):
        # Random disease assignment
        disease_idx = np.random.choice(n_classes, p=[0.4, 0.3, 0.2, 0.1])
        disease = diseases[disease_idx]

        # Generate symptoms based on disease
        if disease == 'Healthy':
            temperature = np.random.normal(98.6, 0.5)
            heart_rate = np.random.normal(70, 8)
            blood_pressure = np.random.normal(120, 10)
            oxygen_saturation = np.random.normal(98, 1)
            white_cell_count = np.random.normal(7000, 1000)

        elif disease == 'Flu':
            temperature = np.random.normal(101.5, 1.2)
            heart_rate = np.random.normal(85, 12)
            blood_pressure = np.random.normal(125, 12)
            oxygen_saturation = np.random.normal(97, 1.5)
            white_cell_count = np.random.normal(9000, 1500)

        elif disease == 'COVID-19':
            temperature = np.random.normal(102.2, 1.5)
            heart_rate = np.random.normal(90, 15)
            blood_pressure = np.random.normal(118, 15)
            oxygen_saturation = np.random.normal(94, 3)
            white_cell_count = np.random.normal(6000, 2000)

        else: # Pneumonia
            temperature = np.random.normal(103.1, 1.8)
            heart_rate = np.random.normal(95, 18)
            blood_pressure = np.random.normal(115, 18)
            oxygen_saturation = np.random.normal(91, 4)
            white_cell_count = np.random.normal(12000, 2500)

        data.append({
            'patient_id': f'PAT_{i:06d}',
            'temperature': temperature,
            'heart_rate': heart_rate,
            'blood_pressure': blood_pressure,
            'oxygen_saturation': oxygen_saturation,
            'white_cell_count': white_cell_count,
            'age': np.random.normal(45, 15),
            'diagnosis': disease,
            'diagnosis_code': disease_idx
        })

    return pd.DataFrame(data)

# 2. Email spam detection dataset (Multinomial/Bernoulli NB)
def generate_email_dataset(n_samples=3000):
    """Generate email spam detection dataset."""

    # Common words in spam vs ham emails
    spam_words = ['free', 'money', 'win', 'prize', 'urgent', 'limited', 'offer', 'click',
                  'buy', 'discount', 'sale', 'deal', 'cash', 'credit', 'loan']
    ham_words = ['meeting', 'project', 'report', 'team', 'schedule', 'work', 'office',
                 'client', 'proposal', 'budget', 'deadline', 'presentation', 'conference']

    emails = []

    for i in range(n_samples):
        is_spam = np.random.choice([0, 1], p=[0.7, 0.3]) # 30% spam

        if is_spam:
            # Generate spam email
            email_words = np.random.choice(spam_words, size=np.random.randint(10, 30))
            # Add some random normal words
            normal_words = np.random.choice(ham_words, size=np.random.randint(2, 8))
            all_words = list(email_words) + list(normal_words)
        else:
            # Generate ham email
            email_words = np.random.choice(ham_words, size=np.random.randint(15, 40))
            # Add a few spam words so ham emails are not trivially separable (vocabulary noise)
            occasional_spam = np.random.choice(spam_words, size=np.random.randint(0, 3))
            all_words = list(email_words) + list(occasional_spam)

        # Create email text
        email_text = ' '.join(all_words)

        # Email features
        emails.append({
            'email_id': f'EMAIL_{i:06d}',
            'text': email_text,
            'word_count': len(all_words),
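            'exclamation_count': email_text.count('!'),  # always 0 here: generated text has no punctuation
            'capital_ratio': sum(1 for c in email_text if c.isupper()) / max(len(email_text), 1),  # always 0 here: all words are lowercase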
            'spam_word_count': sum(1 for word in all_words if word in spam_words),
            'is_spam': is_spam
        })

    return pd.DataFrame(emails)

# Generate datasets
medical_df = generate_medical_dataset()
email_df = generate_email_dataset()

print(" Naive Bayes Datasets Created:")
print(f"Medical diagnosis dataset: {medical_df.shape}")
print(f"Disease distribution: {medical_df['diagnosis'].value_counts().to_dict()}")
print(f"\nEmail spam dataset: {email_df.shape}")
print(f"Spam distribution: {email_df['is_spam'].value_counts().to_dict()}")
print(f"Sample email text: '{email_df['text'].iloc[0][:50]}...'")

 Naive Bayes Datasets Created:
Medical diagnosis dataset: (2000, 9)
Disease distribution: {'Healthy': 811, 'Flu': 610, 'COVID-19': 404, 'Pneumonia': 175}

Email spam dataset: (3000, 7)
Spam distribution: {0: 2121, 1: 879}
Sample email text: 'money deal free credit offer buy money deal offer ...'
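
The medical generator above draws one patient per loop iteration. A vectorized alternative can be sketched as follows, assuming a lookup table of per-class `(mean, std)` pairs. The names `CLASS_PARAMS`, `PRIORS`, and `generate_medical_vectorized` are hypothetical, only two of the six features are shown, and because it uses a different random stream it will not reproduce the counts printed above.

```python
import numpy as np
import pandas as pd

# Hypothetical parameter table: (mean, std) per class and feature.
# Values mirror generate_medical_dataset; the layout is an assumption.
CLASS_PARAMS = {
    "Healthy":   {"temperature": (98.6, 0.5),  "oxygen_saturation": (98, 1)},
    "Flu":       {"temperature": (101.5, 1.2), "oxygen_saturation": (97, 1.5)},
    "COVID-19":  {"temperature": (102.2, 1.5), "oxygen_saturation": (94, 3)},
    "Pneumonia": {"temperature": (103.1, 1.8), "oxygen_saturation": (91, 4)},
}
PRIORS = [0.4, 0.3, 0.2, 0.1]  # class probabilities, same order as CLASS_PARAMS


def generate_medical_vectorized(n_samples=2000, seed=42):
    """Draw all class labels at once, then sample each feature in one call."""
    rng = np.random.default_rng(seed)
    names = list(CLASS_PARAMS)
    codes = rng.choice(len(names), size=n_samples, p=PRIORS)

    out = {"diagnosis_code": codes, "diagnosis": np.array(names)[codes]}
    for feature in ("temperature", "oxygen_saturation"):
        means = np.array([CLASS_PARAMS[n][feature][0] for n in names], dtype=float)
        stds = np.array([CLASS_PARAMS[n][feature][1] for n in names], dtype=float)
        # Index the per-class parameters by each row's class code.
        out[feature] = rng.normal(means[codes], stds[codes])
    return pd.DataFrame(out)


demo_df = generate_medical_vectorized()
print(demo_df.groupby("diagnosis")["temperature"].mean().round(1))
```

Grouping by class and inspecting the means is a quick check that the generated data matches the intended class-conditional distributions.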


In [8]:
# 1. GAUSSIAN NAIVE BAYES FOR MEDICAL DIAGNOSIS
print(" 1. GAUSSIAN NAIVE BAYES FOR MEDICAL DIAGNOSIS")
print("="*48)

# Prepare medical data
medical_features = ['temperature', 'heart_rate', 'blood_pressure', 'oxygen_saturation', 'white_cell_count', 'age']
X_medical = medical_df[medical_features].values
y_medical = medical_df['diagnosis_code'].values

# Split the data
X_med_train, X_med_test, y_med_train, y_med_test = train_test_split(
 X_medical, y_medical, test_size=0.2, random_state=42, stratify=y_medical
)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_med_train, y_med_train)

# Predictions and evaluation
y_med_pred = gnb.predict(X_med_test)
y_med_proba = gnb.predict_proba(X_med_test)

# Performance metrics
med_accuracy = accuracy_score(y_med_test, y_med_pred)
med_cv_scores = cross_val_score(gnb, X_med_train, y_med_train, cv=5)

print(f"Medical Diagnosis Performance:")
print(f"Accuracy: {med_accuracy:.3f}")
print(f"Cross-validation: {med_cv_scores.mean():.3f} ± {med_cv_scores.std():.3f}")

# Feature analysis
print(f"\nFeature Analysis (class means):")
diseases = ['Healthy', 'Flu', 'COVID-19', 'Pneumonia']
for i, disease in enumerate(diseases):
    class_means = gnb.theta_[i]
    print(f"{disease}:")
    for j, feature in enumerate(medical_features):
        print(f" {feature}: {class_means[j]:.1f}")

# Sample predictions with probabilities
print(f"\nSample Predictions (with probabilities):")
for i in range(5):
    actual = diseases[y_med_test[i]]
    predicted = diseases[y_med_pred[i]]
    proba = y_med_proba[i]
    max_proba = np.max(proba)
    print(f"Patient {i+1}: Actual={actual}, Predicted={predicted} (confidence: {max_proba:.2f})")

 1. GAUSSIAN NAIVE BAYES FOR MEDICAL DIAGNOSIS
Medical Diagnosis Performance:
Accuracy: 0.925
Cross-validation: 0.922 ± 0.008

Feature Analysis (class means):
Healthy:
 temperature: 98.6
 heart_rate: 69.6
 blood_pressure: 120.5
 oxygen_saturation: 98.0
 white_cell_count: 6997.8
 age: 45.7
Flu:
 temperature: 101.5
 heart_rate: 84.7
 blood_pressure: 125.6
 oxygen_saturation: 97.1
 white_cell_count: 8990.2
 age: 44.8
COVID-19:
 temperature: 102.0
 heart_rate: 88.5
 blood_pressure: 115.7
 oxygen_saturation: 94.1
 white_cell_count: 6019.8
 age: 44.5
Pneumonia:
 temperature: 102.9
 heart_rate: 94.5
 blood_pressure: 116.4
 oxygen_saturation: 90.9
 white_cell_count: 12150.4
 age: 41.8

Sample Predictions (with probabilities):
Patient 1: Actual=Healthy, Predicted=Healthy (confidence: 0.99)
Patient 2: Actual=Healthy, Predicted=Healthy (confidence: 1.00)
Patient 3: Actual=Healthy, Predicted=Healthy (confidence: 0.99)
Patient 4: Actual=Healthy, Predicted=Healthy (confidence: 1.00)
Patient 5: Actua

In [9]:
# 2. MULTINOMIAL AND BERNOULLI NAIVE BAYES FOR TEXT CLASSIFICATION
print(" 2. MULTINOMIAL AND BERNOULLI NAIVE BAYES FOR TEXT CLASSIFICATION")
print("="*66)

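# Prepare email text data
# Create TF-IDF features for Multinomial NB. The model is defined for counts,
# but fractional TF-IDF weights are commonly used with it in practice.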
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X_email_tfidf = tfidf_vectorizer.fit_transform(email_df['text'])

# Create binary features for Bernoulli NB
binary_vectorizer = CountVectorizer(max_features=500, binary=True, stop_words='english')
X_email_binary = binary_vectorizer.fit_transform(email_df['text'])

y_email = email_df['is_spam'].values

# Split data
X_email_tfidf_train, X_email_tfidf_test, y_email_train, y_email_test = train_test_split(
 X_email_tfidf, y_email, test_size=0.2, random_state=42, stratify=y_email
)

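# Same random_state and stratify as above, so both feature matrices share one row split
# and y_email_train / y_email_test also apply to the binary features.
X_email_bin_train, X_email_bin_test, _, _ = train_test_split(
 X_email_binary, y_email, test_size=0.2, random_state=42, stratify=y_email
)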

# Train Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_email_tfidf_train, y_email_train)
y_email_pred_mnb = mnb.predict(X_email_tfidf_test)
mnb_accuracy = accuracy_score(y_email_test, y_email_pred_mnb)

# Train Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X_email_bin_train, y_email_train)
y_email_pred_bnb = bnb.predict(X_email_bin_test)
bnb_accuracy = accuracy_score(y_email_test, y_email_pred_bnb)

# Hyperparameter tuning
print("Hyperparameter Optimization:")

# Multinomial NB alpha tuning
alpha_values = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]
mnb_scores = []

for alpha in alpha_values:
    mnb_temp = MultinomialNB(alpha=alpha)
    scores = cross_val_score(mnb_temp, X_email_tfidf_train, y_email_train, cv=5)
    mnb_scores.append(scores.mean())

best_alpha_mnb = alpha_values[np.argmax(mnb_scores)]
print(f"Best alpha for Multinomial NB: {best_alpha_mnb} (CV score: {max(mnb_scores):.3f})")

# Bernoulli NB alpha tuning
bnb_scores = []
for alpha in alpha_values:
    bnb_temp = BernoulliNB(alpha=alpha)
    scores = cross_val_score(bnb_temp, X_email_bin_train, y_email_train, cv=5)
    bnb_scores.append(scores.mean())

best_alpha_bnb = alpha_values[np.argmax(bnb_scores)]
print(f"Best alpha for Bernoulli NB: {best_alpha_bnb} (CV score: {max(bnb_scores):.3f})")

# Retrain with optimal parameters
mnb_best = MultinomialNB(alpha=best_alpha_mnb)
mnb_best.fit(X_email_tfidf_train, y_email_train)
y_email_pred_mnb_best = mnb_best.predict(X_email_tfidf_test)

bnb_best = BernoulliNB(alpha=best_alpha_bnb)
bnb_best.fit(X_email_bin_train, y_email_train)
y_email_pred_bnb_best = bnb_best.predict(X_email_bin_test)

print(f"\nEmail Spam Detection Performance:")
print(f"Multinomial NB: {accuracy_score(y_email_test, y_email_pred_mnb_best):.3f}")
print(f"Bernoulli NB: {accuracy_score(y_email_test, y_email_pred_bnb_best):.3f}")

# Feature importance analysis
print(f"\nTop spam-indicating words (Multinomial NB):")
feature_names = tfidf_vectorizer.get_feature_names_out()
spam_log_prob = mnb_best.feature_log_prob_[1] # Spam class
ham_log_prob = mnb_best.feature_log_prob_[0] # Ham class

# Score each word by its log-likelihood ratio: log P(word | spam) - log P(word | ham)
feature_importance = spam_log_prob - ham_log_prob
top_spam_indices = np.argsort(feature_importance)[-10:]

for idx in reversed(top_spam_indices):
    word = feature_names[idx]
    importance = feature_importance[idx]
    print(f" {word}: {importance:.3f}")

 2. MULTINOMIAL AND BERNOULLI NAIVE BAYES FOR TEXT CLASSIFICATION
Hyperparameter Optimization:
Best alpha for Multinomial NB: 0.01 (CV score: 1.000)
Best alpha for Bernoulli NB: 0.01 (CV score: 1.000)

Email Spam Detection Performance:
Multinomial NB: 1.000
Bernoulli NB: 1.000

Top spam-indicating words (Multinomial NB):
 prize: 2.784
 cash: 2.729
 free: 2.721
 urgent: 2.700
 sale: 2.697
 win: 2.678
 limited: 2.655
 deal: 2.623
 discount: 2.615
 loan: 2.605


In [10]:
# 3. COMPREHENSIVE NAIVE BAYES VISUALIZATION DASHBOARD
print(" 3. COMPREHENSIVE NAIVE BAYES VISUALIZATION DASHBOARD")
print("="*58)

# Create comprehensive dashboard
fig = make_subplots(
 rows=3, cols=2,
 subplot_titles=[
 'Medical Diagnosis: Feature Distributions by Class',
 'Email Classification: Algorithm Comparison',
 'Hyperparameter Tuning: Alpha vs Performance',
 'Confusion Matrix: Medical Diagnosis',
 'ROC Curves: Binary Classification (Spam Detection)',
 'Feature Importance: Top Spam Indicators'
 ],
 specs=[[{"secondary_y": False}, {"secondary_y": False}],
 [{"secondary_y": False}, {"type": "heatmap"}],
 [{"secondary_y": False}, {"secondary_y": False}]]
)

# 1. Medical feature distributions
colors = ['blue', 'green', 'red', 'purple']
diseases = ['Healthy', 'Flu', 'COVID-19', 'Pneumonia']

for i, disease in enumerate(diseases):
    disease_data = medical_df[medical_df['diagnosis'] == disease]
    fig.add_trace(
        go.Violin(
            y=disease_data['temperature'],
            name=disease,
            line_color=colors[i],
            box_visible=True,
            meanline_visible=True
        ),
        row=1, col=1
    )

# 2. Algorithm comparison
algorithms = ['Multinomial NB', 'Bernoulli NB']
accuracies = [
 accuracy_score(y_email_test, y_email_pred_mnb_best),
 accuracy_score(y_email_test, y_email_pred_bnb_best)
]

fig.add_trace(
 go.Bar(
 x=algorithms,
 y=accuracies,
 marker_color=['lightblue', 'lightcoral'],
 text=[f'{acc:.3f}' for acc in accuracies],
 textposition='auto'
 ),
 row=1, col=2
)

# 3. Hyperparameter tuning curves
fig.add_trace(
 go.Scatter(
 x=alpha_values,
 y=mnb_scores,
 mode='lines+markers',
 name='Multinomial NB',
 line=dict(color='blue')
 ),
 row=2, col=1
)

fig.add_trace(
 go.Scatter(
 x=alpha_values,
 y=bnb_scores,
 mode='lines+markers',
 name='Bernoulli NB',
 line=dict(color='red')
 ),
 row=2, col=1
)

# 4. Confusion matrix for medical diagnosis
cm_medical = confusion_matrix(y_med_test, y_med_pred)
fig.add_trace(
 go.Heatmap(
 z=cm_medical,
 x=diseases,
 y=diseases,
 colorscale='Blues',
 text=cm_medical,
 texttemplate='%{text}',
 hovertemplate='Predicted: %{x}<br>Actual: %{y}<br>Count: %{z}<extra></extra>'
 ),
 row=2, col=2
)

# 5. ROC curves for spam detection
# Multinomial NB ROC
y_score_mnb = mnb_best.predict_proba(X_email_tfidf_test)[:, 1]
fpr_mnb, tpr_mnb, _ = roc_curve(y_email_test, y_score_mnb)
auc_mnb = auc(fpr_mnb, tpr_mnb)

fig.add_trace(
 go.Scatter(
 x=fpr_mnb,
 y=tpr_mnb,
 mode='lines',
 name=f'Multinomial NB (AUC = {auc_mnb:.3f})',
 line=dict(color='blue')
 ),
 row=3, col=1
)

# Bernoulli NB ROC
y_score_bnb = bnb_best.predict_proba(X_email_bin_test)[:, 1]
fpr_bnb, tpr_bnb, _ = roc_curve(y_email_test, y_score_bnb)
auc_bnb = auc(fpr_bnb, tpr_bnb)

fig.add_trace(
 go.Scatter(
 x=fpr_bnb,
 y=tpr_bnb,
 mode='lines',
 name=f'Bernoulli NB (AUC = {auc_bnb:.3f})',
 line=dict(color='red')
 ),
 row=3, col=1
)

# Diagonal line
fig.add_trace(
 go.Scatter(
 x=[0, 1],
 y=[0, 1],
 mode='lines',
 line=dict(dash='dash', color='black'),
 name='Random Classifier',
 showlegend=False
 ),
 row=3, col=1
)

# 6. Feature importance for spam detection
# top_spam_indices already holds the 10 highest-scoring words in ascending order
top_words = [feature_names[idx] for idx in reversed(top_spam_indices)]
top_importance = [feature_importance[idx] for idx in reversed(top_spam_indices)]

fig.add_trace(
 go.Bar(
 x=top_importance,
 y=top_words,
 orientation='h',
 marker_color='orange'
 ),
 row=3, col=2
)

# Update layout
fig.update_layout(
 height=1200,
 title="Naive Bayes Classification - Comprehensive Analysis Dashboard",
 showlegend=True
)

# Update axis labels
fig.update_yaxes(title_text="Temperature (°F)", row=1, col=1)
fig.update_yaxes(title_text="Accuracy", row=1, col=2)
fig.update_yaxes(title_text="CV Score", row=2, col=1)
fig.update_yaxes(title_text="Actual", row=2, col=2)
fig.update_yaxes(title_text="True Positive Rate", row=3, col=1)
fig.update_yaxes(title_text="Words", row=3, col=2)

fig.update_xaxes(title_text="Disease Class", row=1, col=1)
fig.update_xaxes(title_text="Algorithm", row=1, col=2)
fig.update_xaxes(title_text="Alpha (Smoothing Parameter)", row=2, col=1)
fig.update_xaxes(title_text="Predicted", row=2, col=2)
fig.update_xaxes(title_text="False Positive Rate", row=3, col=1)
fig.update_xaxes(title_text="Log Probability Difference", row=3, col=2)

fig.show()

 3. COMPREHENSIVE NAIVE BAYES VISUALIZATION DASHBOARD
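
The confusion-matrix panel in the dashboard carries more than overall accuracy: row sums give per-class recall and column sums give per-class precision. A minimal sketch on hand-made labels follows; the arrays below are illustrative and are not the notebook's `y_med_test` / `y_med_pred`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_names = ["Healthy", "Flu", "COVID-19", "Pneumonia"]

# Illustrative labels only (an assumption for this sketch).
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 1, 1, 2, 3, 2, 3, 2])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])

recall = np.diag(cm) / cm.sum(axis=1)      # correct / all actual members of the class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all predictions of the class

for name, p, r in zip(class_names, precision, recall):
    print(f"{name:<10} precision={p:.2f} recall={r:.2f}")
```

With imbalanced classes such as the 40/30/20/10 split used for the medical data, per-class recall can be noticeably lower for the rarest class even when overall accuracy looks high.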


In [11]:
# 4. BUSINESS INSIGHTS AND ROI ANALYSIS
print(" 4. BUSINESS INSIGHTS AND ROI ANALYSIS")
print("="*40)

# Medical diagnosis business impact
# All inputs below are illustrative assumptions, not measured values.
print("Medical Diagnosis System ROI:")
total_patients = 50000 # Assumed annual patient volume
diagnosis_accuracy = med_accuracy # Test accuracy on the synthetic dataset
misdiagnosis_cost = 25000 # Assumed average cost of a misdiagnosis (USD)
system_cost = 150000 # Assumed annual system cost (USD)

# Calculate prevented misdiagnoses
baseline_accuracy = 0.75 # Assumed human-diagnosis baseline
improvement = diagnosis_accuracy - baseline_accuracy # Difference in percentage points
prevented_misdiagnoses = total_patients * improvement
cost_savings = prevented_misdiagnoses * misdiagnosis_cost
net_benefit = cost_savings - system_cost

print(f"• Diagnostic accuracy improvement: {improvement*100:.1f}%")
print(f"• Prevented misdiagnoses: {prevented_misdiagnoses:.0f} cases/year")
print(f"• Cost savings: ${cost_savings:,.0f}/year")
print(f"• Net benefit: ${net_benefit:,.0f}/year")
print(f"• ROI: {(net_benefit/system_cost)*100:.0f}%")

# Email spam detection business impact
print(f"\nEmail Spam Detection System ROI:")
daily_emails = 100000 # Emails processed per day
spam_accuracy = max(accuracy_score(y_email_test, y_email_pred_mnb_best),
 accuracy_score(y_email_test, y_email_pred_bnb_best))
time_saved_per_spam = 0.5 # Minutes saved per correctly filtered spam
hourly_wage = 30 # Average employee hourly wage
system_cost_email = 75000 # Annual system cost

# Calculate time and cost savings
annual_emails = daily_emails * 365
spam_emails = annual_emails * 0.3 # 30% spam rate, matching the synthetic dataset
correctly_filtered = spam_emails * spam_accuracy # Overall accuracy used as a proxy for spam recall
time_saved_hours = (correctly_filtered * time_saved_per_spam) / 60
labor_cost_savings = time_saved_hours * hourly_wage
net_benefit_email = labor_cost_savings - system_cost_email

print(f"• Spam detection accuracy: {spam_accuracy*100:.1f}%")
print(f"• Emails correctly filtered: {correctly_filtered:,.0f}/year")
print(f"• Time saved: {time_saved_hours:,.0f} hours/year")
print(f"• Labor cost savings: ${labor_cost_savings:,.0f}/year")
print(f"• Net benefit: ${net_benefit_email:,.0f}/year")
print(f"• ROI: {(net_benefit_email/system_cost_email)*100:.0f}%")

# Combined system benefits (both net_benefit values are already net of system costs)
total_investment = system_cost + system_cost_email
total_benefits = net_benefit + net_benefit_email
combined_roi = (total_benefits / total_investment) * 100

print(f"\nCombined Naive Bayes Systems ROI:")
print(f"• Total investment: ${total_investment:,.0f}")
print(f"• Total annual benefits: ${total_benefits:,.0f}")
print(f"• Combined ROI: {combined_roi:.0f}%")
print(f"• Payback period: {total_investment/total_benefits*12:.1f} months")

# Algorithm selection recommendations
print(f"\nNaive Bayes Algorithm Selection Guide:")
print(f"• Gaussian NB: Continuous features (medical data, sensor readings)")
print(f"• Multinomial NB: Count data (text classification, word frequencies)")
print(f"• Bernoulli NB: Binary features (presence/absence, yes/no data)")
print(f"• Complement NB: Imbalanced datasets with many classes")

print(f"\nImplementation Considerations:")
print(f"• Feature independence assumption: Monitor correlation matrices")
print(f"• Smoothing parameter (alpha): Use cross-validation for optimization")
print(f"• Real-time performance: Naive Bayes excels in streaming applications")
print(f"• Interpretability: Clear probability outputs aid decision-making")
print(f"• Scalability: Linear time complexity enables big data processing")

print(f"\nCross-Reference Learning Path:")
print(f"• Foundation: Tier1_Distribution.ipynb (probability distributions)")
print(f"• Building On: Tier2_LogisticRegression.ipynb (probabilistic classification)")
print(f"• Comparison: Tier5_Classification.ipynb (algorithm comparison)")
print(f"• Advanced: Advanced_TextClassification.ipynb, Advanced_BayesianMethods.ipynb")

 4. BUSINESS INSIGHTS AND ROI ANALYSIS
Medical Diagnosis System ROI:
• Diagnostic accuracy improvement: 17.5%
• Prevented misdiagnoses: 8750 cases/year
• Cost savings: $218,750,000/year
• Net benefit: $218,600,000/year
• ROI: 145733%

Email Spam Detection System ROI:
• Spam detection accuracy: 100.0%
• Emails correctly filtered: 10,950,000/year
• Time saved: 91,250 hours/year
• Labor cost savings: $2,737,500/year
• Net benefit: $2,662,500/year
• ROI: 3550%

Combined Naive Bayes Systems ROI:
• Total investment: $225,000
• Total annual benefits: $221,262,500
• Combined ROI: 98339%
• Payback period: 0.0 months

Naive Bayes Algorithm Selection Guide:
• Gaussian NB: Continuous features (medical data, sensor readings)
• Multinomial NB: Count data (text classification, word frequencies)
• Bernoulli NB: Binary features (presence/absence, yes/no data)
• Complement NB: Imbalanced datasets with many classes

Implementation Considerations:
• Feature independence assumption: Monitor correlation mat
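
One way to act on the feature-independence consideration is to inspect correlations within each class, since Naive Bayes assumes independence conditional on the label. The sketch below is an assumption-laden illustration: `max_within_class_correlation` is a hypothetical helper, and the toy frame deliberately contains a near-duplicate feature so the check has something to flag.

```python
import numpy as np
import pandas as pd


def max_within_class_correlation(df, feature_cols, label_col):
    """Return the largest absolute off-diagonal correlation for each class."""
    rows = []
    off_diagonal = ~np.eye(len(feature_cols), dtype=bool)
    for label, group in df.groupby(label_col):
        corr = group[feature_cols].corr().to_numpy()
        rows.append({
            "class": label,
            "max_abs_corr": float(np.abs(corr[off_diagonal]).max()),
        })
    return pd.DataFrame(rows)


# Toy data (hypothetical): f2 is almost a copy of f1, f3 is independent noise.
rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
toy = pd.DataFrame({
    "f1": base,
    "f2": base + 0.1 * rng.normal(size=n),
    "f3": rng.normal(size=n),
    "label": rng.integers(0, 2, size=n),
})

report = max_within_class_correlation(toy, ["f1", "f2", "f3"], "label")
print(report)
```

Strongly correlated pairs are effectively counted twice in the likelihood product. Dropping or combining one feature of the pair is a common remedy, though whether it helps should be confirmed with cross-validation.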