# 📊 NLP Sentiment Analysis for Technician Feedback Classification

## A Comprehensive Machine Learning Approach

---

**Author:** Sentiment Analysis Team  
**Date:** 2024  
**Presentation Duration:** 40-50 minutes

---

## 📋 Table of Contents

1. **Introduction & Problem Definition** (5 min)
2. **Data Loading & Exploration** (8 min)
3. **Data Preprocessing** (10 min)
4. **Model Building & Training** (12 min)
5. **Model Evaluation & Results** (10 min)
6. **Conclusion & Future Work** (5 min)

---

# Section 1: Introduction & Problem Definition

## 🎯 What is Sentiment Analysis?

**Sentiment Analysis** (also known as opinion mining) is a Natural Language Processing (NLP) technique that identifies and extracts subjective information from text.

### Key Concepts:
- **Sentiment**: The emotional tone behind words (positive, negative, neutral)
- **Classification**: Categorizing text into predefined sentiment classes
- **Feature Extraction**: Converting text to numerical representations
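
As a minimal illustration of the feature-extraction idea above, the sketch below turns two made-up feedback sentences into word-count vectors over a shared vocabulary. The sentences and the `bag_of_words` helper are hypothetical and are not part of the dataset or the pipeline used later in this notebook.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, vectors): one word-count vector per text."""
    tokenized = [text.lower().split() for text in texts]
    # Shared, alphabetically sorted vocabulary across all texts
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical example sentences
texts = ["pump works great", "pump failed again pump leaking"]
vocabulary, vectors = bag_of_words(texts)
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, which is the same idea `CountVectorizer` applies at scale later on.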

## 🔧 Why Technician Feedback Matters

1. **Quality Improvement**: Identify equipment issues early
2. **Safety Monitoring**: Detect safety concerns in feedback
3. **Resource Allocation**: Optimize training and tool investments
4. **Employee Satisfaction**: Track morale and workload issues
5. **Predictive Maintenance**: Anticipate equipment failures

## 💼 Business Use Cases

- **Manufacturing**: Monitor production line feedback
- **Field Service**: Analyze technician reports
- **IT Support**: Classify support ticket sentiment
- **Quality Assurance**: Track quality-related feedback

---

# Section 2: Data Loading & Exploration

## 📦 Setup and Imports

In [None]:
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import plotly.express as px
import plotly.graph_objects as go
from collections import Counter

# NLP libraries
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# ML libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc
)
from sklearn.preprocessing import label_binarize, LabelEncoder

# Set random seed for reproducibility
np.random.seed(42)

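# Visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

# Create the output directory up front: the plotting cells below save figures there
import os
os.makedirs('../outputs', exist_ok=True)

print("✅ All libraries imported successfully!")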

## 📂 Load the Dataset

In [None]:
# Load the technician feedback dataset
df = pd.read_csv('../data/technician_feedback.csv')

print(f"📊 Dataset Shape: {df.shape}")
print(f"📝 Total Samples: {len(df)}")
print(f"📋 Columns: {list(df.columns)}")

In [None]:
# Display first few rows
print("\n🔍 First 10 Records:")
df.head(10)

In [None]:
# Dataset information
print("\n📋 Dataset Info:")
df.info()

## 📊 Statistical Summary

In [None]:
# Basic statistics
print("📈 Statistical Summary:")
print("\n--- Sentiment Distribution ---")
print(df['sentiment'].value_counts())
print("\n--- Category Distribution ---")
print(df['category'].value_counts())

# Text length statistics (missing feedback is treated as an empty string)
feedback = df['feedback_text'].fillna('').astype(str)
df['text_length'] = feedback.str.len()
df['word_count'] = feedback.str.split().str.len()

print("\n--- Text Statistics ---")
print(df[['text_length', 'word_count']].describe())

## 📈 Class Distribution Visualization

In [None]:
# Define colors for sentiments
colors = {'positive': '#2ecc71', 'negative': '#e74c3c', 'neutral': '#3498db'}

# Create figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar Chart
sentiment_counts = df['sentiment'].value_counts()
ax1 = axes[0]
bars = ax1.bar(sentiment_counts.index, sentiment_counts.values, 
               color=[colors[s] for s in sentiment_counts.index])
ax1.set_xlabel('Sentiment')
ax1.set_ylabel('Count')
ax1.set_title('Sentiment Distribution (Bar Chart)', fontweight='bold')
for bar in bars:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 5,
             f'{int(height)}', ha='center', va='bottom', fontsize=12)

# Pie Chart
ax2 = axes[1]
ax2.pie(sentiment_counts.values, labels=sentiment_counts.index, autopct='%1.1f%%',
        colors=[colors[s] for s in sentiment_counts.index],
        explode=[0.02] * len(sentiment_counts),  # one offset per class actually present
        shadow=True, startangle=90)
ax2.set_title('Sentiment Distribution (Pie Chart)', fontweight='bold')

plt.tight_layout()
plt.savefig('../outputs/sentiment_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
print("📊 Sentiment distribution visualized!")

In [None]:
# Category distribution
fig, ax = plt.subplots(figsize=(12, 6))

category_counts = df['category'].value_counts()
bars = ax.barh(category_counts.index, category_counts.values, color='#3498db')
ax.set_xlabel('Count')
ax.set_ylabel('Category')
ax.set_title('Feedback by Category', fontweight='bold')

for bar in bars:
    width = bar.get_width()
    ax.text(width + 1, bar.get_y() + bar.get_height()/2.,
            f'{int(width)}', ha='left', va='center')

plt.tight_layout()
plt.savefig('../outputs/category_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

## 📝 Word Frequency Analysis

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

stop_words = set(stopwords.words('english'))

def get_word_freq(texts):
    """Get word frequencies from texts."""
    all_words = []
    for text in texts:
        # Tokenize and clean
        words = word_tokenize(str(text).lower())
        words = [w for w in words if w.isalpha() and w not in stop_words and len(w) > 2]
        all_words.extend(words)
    return Counter(all_words)

# Word frequency by sentiment
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, sentiment in enumerate(['positive', 'negative', 'neutral']):
    texts = df[df['sentiment'] == sentiment]['feedback_text']
    word_freq = get_word_freq(texts)
    top_words = word_freq.most_common(15)
    
    words, counts = zip(*top_words)
    axes[i].barh(words, counts, color=colors[sentiment])
    axes[i].set_xlabel('Frequency')
    axes[i].set_title(f'Top Words - {sentiment.title()}', fontweight='bold')
    axes[i].invert_yaxis()

plt.tight_layout()
plt.savefig('../outputs/word_frequency.png', dpi=300, bbox_inches='tight')
plt.show()

## ☁️ Word Clouds

In [None]:
# Create word clouds for each sentiment
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

colormaps = {'positive': 'Greens', 'negative': 'Reds', 'neutral': 'Blues'}

for i, sentiment in enumerate(['positive', 'negative', 'neutral']):
    texts = df[df['sentiment'] == sentiment]['feedback_text']
    # Skip missing entries so the join does not fail on NaN values
    combined_text = ' '.join(texts.dropna().astype(str))
    
    wordcloud = WordCloud(
        width=800, height=400,
        background_color='white',
        colormap=colormaps[sentiment],
        max_words=100,
        stopwords=stop_words
    ).generate(combined_text)
    
    axes[i].imshow(wordcloud, interpolation='bilinear')
    axes[i].axis('off')
    axes[i].set_title(f'{sentiment.title()} Sentiment', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../outputs/wordclouds.png', dpi=300, bbox_inches='tight')
plt.show()
print("☁️ Word clouds generated!")

---

# Section 3: Data Preprocessing

## 🔧 Text Preprocessing Steps

1. **Lowercasing** - Convert all text to lowercase
2. **Punctuation Removal** - Remove punctuation, along with URLs and digits
3. **Tokenization** - Split text into words
4. **Stopword Removal** - Remove common English words and tokens shorter than three characters
5. **Lemmatization** - Reduce words to their base form

In [None]:
import re
from nltk.stem import WordNetLemmatizer

class TextPreprocessor:
    """Text preprocessing class for NLP tasks."""
    
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
    
    def clean_text(self, text):
        """Clean and preprocess text."""
        if not isinstance(text, str):
            return ""
        
        # Lowercase
        text = text.lower()
        
        # Remove URLs
        text = re.sub(r'http\S+|www\S+', '', text)
        
        # Remove numbers
        text = re.sub(r'\d+', '', text)
        
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
    def tokenize_and_lemmatize(self, text):
        """Tokenize, remove stopwords, and lemmatize."""
        tokens = word_tokenize(text)
        tokens = [self.lemmatizer.lemmatize(t) for t in tokens 
                  if t not in self.stop_words and len(t) > 2]
        return ' '.join(tokens)
    
    def preprocess(self, text):
        """Full preprocessing pipeline."""
        cleaned = self.clean_text(text)
        processed = self.tokenize_and_lemmatize(cleaned)
        return processed

# Initialize preprocessor
preprocessor = TextPreprocessor()
print("✅ TextPreprocessor initialized!")
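
One thing to watch: NLTK's English stopword list includes negations such as "not" and "no", so the stopword filter above can remove the word that carries the sentiment. The sketch below shows the effect with a small hand-written stopword set (a stand-in, not the NLTK list) and one possible workaround of exempting negation words. The `NEGATIONS` set and `remove_stopwords` helper are hypothetical, and whether this helps on the technician feedback data is untested here.

```python
# Stand-in stopword list for illustration; the notebook itself uses NLTK's list
STOP_WORDS = {"the", "is", "was", "not", "no", "and", "a"}
# Hypothetical set of negation words to keep
NEGATIONS = {"not", "no", "never"}


def remove_stopwords(text, keep_negations=False):
    """Drop stopwords, optionally keeping negation words."""
    blocked = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return " ".join(w for w in text.lower().split() if w not in blocked)


sentence = "The repair was not successful"
without_negation = remove_stopwords(sentence)
with_negation = remove_stopwords(sentence, keep_negations=True)
print(without_negation)
print(with_negation)
```

In `TextPreprocessor`, the `len(t) > 2` filter would still drop "no" even if it were exempted from the stopword list, so that check would need adjusting as well.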

## 📝 Before/After Examples

In [None]:
# Demonstrate preprocessing
print("\n📋 Preprocessing Examples:\n")
print("="*80)

sample_texts = df['feedback_text'].head(5).values

for i, text in enumerate(sample_texts, 1):
    processed = preprocessor.preprocess(text)
    print(f"Example {i}:")
    print(f"  BEFORE: {text}")
    print(f"  AFTER:  {processed}")
    print("-"*80)

In [None]:
# Apply preprocessing to all texts
print("\n🔄 Preprocessing all texts...")
df['processed_text'] = df['feedback_text'].apply(preprocessor.preprocess)
print(f"✅ Processed {len(df)} texts")

# Show sample
df[['feedback_text', 'processed_text', 'sentiment']].head()

## 🔢 Feature Extraction

In [None]:
# Prepare data
X = df['processed_text'].values
y = df['sentiment'].values

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"📊 Training set: {len(X_train)} samples")
print(f"📊 Test set: {len(X_test)} samples")

In [None]:
# TF-IDF Vectorization
print("\n📊 TF-IDF Vectorization:")

tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print(f"  Training features shape: {X_train_tfidf.shape}")
print(f"  Test features shape: {X_test_tfidf.shape}")
print(f"  Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
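
To make the TF-IDF weights less of a black box, the sketch below recomputes one document's weights by hand and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The three-document corpus is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
docs = ["pump failed", "pump repaired", "pump failed again"]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
tfidf = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_

# Recompute the weights of the first document ("pump failed") by hand
n_docs = len(docs)
doc_freq = {"pump": 3, "failed": 2}  # number of documents containing each term
term_count = 1                       # each term occurs once in the first document
raw = {
    term: term_count * (np.log((1 + n_docs) / (1 + df)) + 1)
    for term, df in doc_freq.items()
}
# L2-normalise so the document vector has unit length
norm = np.sqrt(sum(value ** 2 for value in raw.values()))
manual = {term: value / norm for term, value in raw.items()}

for term, weight in manual.items():
    print(term, round(weight, 4), round(tfidf[0, vocab[term]], 4))
```

"failed" appears in fewer documents than "pump", so it receives the larger weight.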

In [None]:
# Count Vectorization
print("\n📊 Count Vectorization:")

count_vectorizer = CountVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

print(f"  Training features shape: {X_train_count.shape}")
print(f"  Test features shape: {X_test_count.shape}")

---

# Section 4: Model Building & Training

## 🤖 Models to Train

1. **Naive Bayes** - Probabilistic classifier
2. **Support Vector Machine (SVM)** - Hyperplane-based classifier
3. **Logistic Regression** - Linear classifier
4. **Random Forest** - Ensemble method

In [None]:
# Store results
results = {}
trained_models = {}

### 1️⃣ Naive Bayes Classifier

In [None]:
print("\n" + "="*60)
print("🔵 NAIVE BAYES CLASSIFIER")
print("="*60)

# Train model
nb_model = MultinomialNB(alpha=1.0)
nb_model.fit(X_train_tfidf, y_train)

# Predict
y_pred_nb = nb_model.predict(X_test_tfidf)
y_proba_nb = nb_model.predict_proba(X_test_tfidf)

# Calculate metrics
results['Naive Bayes'] = {
    'accuracy': accuracy_score(y_test, y_pred_nb),
    'precision': precision_score(y_test, y_pred_nb, average='weighted'),
    'recall': recall_score(y_test, y_pred_nb, average='weighted'),
    'f1_score': f1_score(y_test, y_pred_nb, average='weighted')
}
trained_models['Naive Bayes'] = (nb_model, y_pred_nb, y_proba_nb)

print(f"\nResults:")
for metric, value in results['Naive Bayes'].items():
    print(f"  {metric.title()}: {value:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_nb))

### 2️⃣ Support Vector Machine

In [None]:
print("\n" + "="*60)
print("🟢 SUPPORT VECTOR MACHINE")
print("="*60)

# Train model
svm_model = SVC(kernel='linear', C=1.0, probability=True, random_state=42)
svm_model.fit(X_train_tfidf, y_train)

# Predict
y_pred_svm = svm_model.predict(X_test_tfidf)
y_proba_svm = svm_model.predict_proba(X_test_tfidf)

# Calculate metrics
results['SVM'] = {
    'accuracy': accuracy_score(y_test, y_pred_svm),
    'precision': precision_score(y_test, y_pred_svm, average='weighted'),
    'recall': recall_score(y_test, y_pred_svm, average='weighted'),
    'f1_score': f1_score(y_test, y_pred_svm, average='weighted')
}
trained_models['SVM'] = (svm_model, y_pred_svm, y_proba_svm)

print(f"\nResults:")
for metric, value in results['SVM'].items():
    print(f"  {metric.title()}: {value:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_svm))

### 3️⃣ Logistic Regression

In [None]:
print("\n" + "="*60)
print("🟡 LOGISTIC REGRESSION")
print("="*60)

# Train model
lr_model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
lr_model.fit(X_train_tfidf, y_train)

# Predict
y_pred_lr = lr_model.predict(X_test_tfidf)
y_proba_lr = lr_model.predict_proba(X_test_tfidf)

# Calculate metrics
results['Logistic Regression'] = {
    'accuracy': accuracy_score(y_test, y_pred_lr),
    'precision': precision_score(y_test, y_pred_lr, average='weighted'),
    'recall': recall_score(y_test, y_pred_lr, average='weighted'),
    'f1_score': f1_score(y_test, y_pred_lr, average='weighted')
}
trained_models['Logistic Regression'] = (lr_model, y_pred_lr, y_proba_lr)

print(f"\nResults:")
for metric, value in results['Logistic Regression'].items():
    print(f"  {metric.title()}: {value:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))

### 4️⃣ Random Forest

In [None]:
print("\n" + "="*60)
print("🟠 RANDOM FOREST")
print("="*60)

# Train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train_tfidf, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test_tfidf)
y_proba_rf = rf_model.predict_proba(X_test_tfidf)

# Calculate metrics
results['Random Forest'] = {
    'accuracy': accuracy_score(y_test, y_pred_rf),
    'precision': precision_score(y_test, y_pred_rf, average='weighted'),
    'recall': recall_score(y_test, y_pred_rf, average='weighted'),
    'f1_score': f1_score(y_test, y_pred_rf, average='weighted')
}
trained_models['Random Forest'] = (rf_model, y_pred_rf, y_proba_rf)

print(f"\nResults:")
for metric, value in results['Random Forest'].items():
    print(f"  {metric.title()}: {value:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))
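
The four training cells above repeat the same fit, predict, and score steps. One possible refactoring is a small helper like the hypothetical `evaluate_model` below. It is a sketch rather than the code this notebook ran, and it is demonstrated on a made-up corpus so that it can run on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB


def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit a classifier and return (metrics, predictions, probabilities)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='weighted', zero_division=0),
        'recall': recall_score(y_test, y_pred, average='weighted', zero_division=0),
        'f1_score': f1_score(y_test, y_pred, average='weighted', zero_division=0),
    }
    return metrics, y_pred, y_proba


# Made-up data, only to show the call pattern
train_texts = ["great tool works well", "excellent service great",
               "broken pump failed", "terrible failed repair"]
train_labels = ["positive", "positive", "negative", "negative"]
test_texts = ["great service", "pump failed"]
test_labels = ["positive", "negative"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

metrics, y_pred, y_proba = evaluate_model(MultinomialNB(), X_tr, train_labels, X_te, test_labels)
print(metrics)
```

The helper calls `predict_proba`, so an `SVC` passed to it would need `probability=True`, as in the SVM cell above.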

### 🔧 Hyperparameter Tuning (GridSearchCV)

In [None]:
print("\n" + "="*60)
print("🔧 HYPERPARAMETER TUNING - Logistic Regression")
print("="*60)

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1.0, 10.0]
}

# Grid search
grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)

grid_search.fit(X_train_tfidf, y_train)

print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
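
Because `X_train_tfidf` was fitted on the whole training split before the grid search, the held-out part of each cross-validation fold has already influenced the vocabulary and IDF weights. One common way to avoid this is to wrap the vectorizer and the classifier in a `Pipeline`, so the vectorizer is refit inside every fold. The sketch below uses a made-up corpus and a shortened grid; the step names `tfidf` and `clf` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus: six positive and six negative examples
texts = [
    "great tool works well", "excellent fast service", "very happy with new equipment",
    "repair went smoothly", "training was helpful", "good support from the team",
    "pump failed again", "broken valve leaking", "tool keeps breaking",
    "terrible delay on parts", "unsafe wiring found", "system crashed during test",
]
labels = ["positive"] * 6 + ["negative"] * 6

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'clf__C': [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1_weighted')
search.fit(texts, labels)  # raw text goes in; vectorization happens per fold

print(search.best_params_)
print(round(search.best_score_, 4))
```

With this setup the grid could also cover vectorizer settings, for example `'tfidf__min_df'`, in the same search.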

### 📊 Cross-Validation Results

In [None]:
print("\n" + "="*60)
print("📊 CROSS-VALIDATION RESULTS (5-Fold)")
print("="*60)

cv_results = {}
models_for_cv = {
    'Naive Bayes': MultinomialNB(alpha=1.0),
    'SVM': SVC(kernel='linear', C=1.0, random_state=42),
    'Logistic Regression': LogisticRegression(C=1.0, max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
}

for name, model in models_for_cv.items():
    scores = cross_val_score(model, X_train_tfidf, y_train, cv=5, scoring='f1_weighted')
    cv_results[name] = {'mean': scores.mean(), 'std': scores.std()}
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

---

# Section 5: Model Evaluation & Results

## 📈 Results Summary

In [None]:
# Create results DataFrame
results_df = pd.DataFrame(results).T
results_df = results_df.round(4)

print("\n📊 Model Performance Comparison:")
print("="*70)
print(results_df.to_string())
print("="*70)

# Highlight best model
best_model = results_df['f1_score'].idxmax()
best_score = results_df.loc[best_model, 'f1_score']
print(f"\n🏆 Best Model: {best_model} (F1-Score: {best_score:.4f})")
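
The best model above is picked by weighted F1, which weights each class by its support. If the sentiment classes turn out to be imbalanced, a weighted score can look healthy even when a minority class is never predicted. The sketch below contrasts weighted and macro averaging on made-up labels; it says nothing about the actual class balance of the technician feedback dataset.

```python
from sklearn.metrics import f1_score

# Made-up, deliberately imbalanced labels: the classifier always predicts "positive"
y_true = ["positive"] * 8 + ["negative"] * 2
y_hat = ["positive"] * 10

# Weighted: per-class F1 averaged by class support
weighted_f1 = f1_score(y_true, y_hat, average='weighted', zero_division=0)
# Macro: per-class F1 averaged with equal weight per class
macro_f1 = f1_score(y_true, y_hat, average='macro', zero_division=0)

print(f"weighted F1: {weighted_f1:.4f}")
print(f"macro F1:    {macro_f1:.4f}")
```

Reporting both averages alongside the per-class rows of `classification_report` makes this kind of failure easier to spot.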

## 📊 Model Comparison Bar Chart

In [None]:
# Model comparison bar chart
metrics = ['accuracy', 'precision', 'recall', 'f1_score']
x = np.arange(len(results))
width = 0.2

fig, ax = plt.subplots(figsize=(14, 6))

colors_bar = plt.cm.Set2(np.linspace(0, 1, 4))

for i, metric in enumerate(metrics):
    values = [results[model][metric] for model in results]
    offset = (i - len(metrics)/2 + 0.5) * width
    bars = ax.bar(x + offset, values, width, label=metric.replace('_', ' ').title(), color=colors_bar[i])
    
    # Add value labels
    for bar, value in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                f'{value:.2f}', ha='center', va='bottom', fontsize=8)

ax.set_xlabel('Model', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(results.keys())
ax.legend(loc='lower right')
ax.set_ylim([0, 1.15])
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('../outputs/model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

## 🔥 Confusion Matrices

In [None]:
# Plot confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

classes = sorted(np.unique(y_test))

for idx, (name, (model, y_pred, y_proba)) in enumerate(trained_models.items()):
    cm = confusion_matrix(y_test, y_pred, labels=classes)
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=classes, yticklabels=classes, ax=axes[idx])
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('Actual')
    axes[idx].set_title(f'{name}\nAccuracy: {results[name]["accuracy"]:.4f}', fontweight='bold')

plt.tight_layout()
plt.savefig('../outputs/confusion_matrices.png', dpi=300, bbox_inches='tight')
plt.show()

## 📈 ROC Curves

In [None]:
# ROC Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

classes = sorted(np.unique(y_test))
y_test_bin = label_binarize(y_test, classes=classes)

for idx, (name, (model, y_pred, y_proba)) in enumerate(trained_models.items()):
    ax = axes[idx]
    
    for i, class_name in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_proba[:, i])
        roc_auc = auc(fpr, tpr)
        ax.plot(fpr, tpr, lw=2, label=f'{class_name} (AUC = {roc_auc:.3f})')
    
    ax.plot([0, 1], [0, 1], 'k--', lw=2)
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(f'{name} - ROC Curves', fontweight='bold')
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/roc_curves.png', dpi=300, bbox_inches='tight')
plt.show()
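
The per-class curves above can also be summarised as a single number per model. A minimal sketch, assuming the one-vs-rest scheme used in the plots: `roc_auc_score` with `multi_class='ovr'` averages the per-class AUCs. The labels and probabilities below are made up, with probability columns in lexicographic class order (negative, neutral, positive), matching the sorted `classes` list used in the plotting cell.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities (each row sums to 1)
y_true = np.array(["negative", "negative", "neutral", "neutral", "positive", "positive"])
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],  # a neutral sample the model ranks as negative
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
])

# One-vs-rest AUC per class, averaged with equal weight per class
macro_auc = roc_auc_score(y_true, y_proba, multi_class='ovr', average='macro')
print(f"macro one-vs-rest AUC: {macro_auc:.4f}")
```

In the notebook, the same call could be applied to `y_test` and each model's `y_proba` from `trained_models`.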

## 🔍 Feature Importance (Random Forest)

In [None]:
# Feature importance from Random Forest
feature_names = tfidf_vectorizer.get_feature_names_out()
importances = rf_model.feature_importances_

# Get top 20 features
top_idx = np.argsort(importances)[-20:][::-1]
top_features = [(feature_names[i], importances[i]) for i in top_idx]

# Plot
fig, ax = plt.subplots(figsize=(10, 8))

features, scores = zip(*top_features)
y_pos = np.arange(len(features))

ax.barh(y_pos, scores, color='#3498db', alpha=0.8)
ax.set_yticks(y_pos)
ax.set_yticklabels(features)
ax.invert_yaxis()
ax.set_xlabel('Importance Score')
ax.set_title('Top 20 Feature Importance (Random Forest)', fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('../outputs/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

---

# Section 6: Conclusion & Future Work

## 🎯 Key Findings

In [None]:
print("\n" + "="*60)
print("📋 KEY FINDINGS")
print("="*60)

print("\n1️⃣  Dataset Analysis:")
print(f"    - Total samples: {len(df)}")
print(f"    - Sentiment distribution: Positive ({df['sentiment'].value_counts()['positive']}), "
      f"Negative ({df['sentiment'].value_counts()['negative']}), "
      f"Neutral ({df['sentiment'].value_counts()['neutral']})")

print("\n2️⃣  Model Performance:")
for model, metrics in sorted(results.items(), key=lambda x: x[1]['f1_score'], reverse=True):
    print(f"    - {model}: F1={metrics['f1_score']:.4f}, Accuracy={metrics['accuracy']:.4f}")

print(f"\n3️⃣  Best Model: {best_model}")
print(f"    - F1-Score: {best_score:.4f}")
print(f"    - Accuracy: {results[best_model]['accuracy']:.4f}")

print("\n4️⃣  Feature Engineering:")
print(f"    - TF-IDF features: {X_train_tfidf.shape[1]}")
print(f"    - N-gram range: (1, 2)")

print("\n" + "="*60)

## 🔮 Future Improvements

1. **Deep Learning Models**
   - LSTM Neural Networks
   - BERT-based models (DistilBERT)

2. **Enhanced Features**
   - Word embeddings (Word2Vec, GloVe)
   - Sentiment lexicons

3. **Model Enhancements**
   - Ensemble methods
   - Model explainability (LIME, SHAP)

4. **Production Deployment**
   - API endpoint
   - Real-time predictions

## 🙏 Thank You!

**Questions?**

In [None]:
# Save models for later use
import joblib
import os

os.makedirs('../models', exist_ok=True)
os.makedirs('../outputs', exist_ok=True)

# Save vectorizer
joblib.dump(tfidf_vectorizer, '../models/tfidf_vectorizer.joblib')

# Save models
for name, (model, _, _) in trained_models.items():
    filename = name.lower().replace(' ', '_')
    model_data = {
        'model': model,
        'label_encoder': LabelEncoder().fit(y),
        'classes_': sorted(np.unique(y)),
        'is_fitted': True
    }
    joblib.dump(model_data, f'../models/{filename}_model.joblib')

# Save results
results_df.to_csv('../models/model_results.csv')

print("\n✅ All models and results saved!")
print("\nSaved files:")
for f in os.listdir('../models'):
    print(f"  - models/{f}")
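
A saved model is only useful together with the vectorizer it was trained with. The sketch below shows the round trip of saving, loading, and predicting, using the same dictionary layout as the cell above (`model_data['model']`). To stay self-contained it trains a tiny stand-in model on made-up text and writes to a temporary directory instead of `../models`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-in artifacts trained on made-up text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["great tool works well", "pump failed again"])
model = MultinomialNB().fit(X, ["positive", "negative"])

with tempfile.TemporaryDirectory() as model_dir:
    vec_path = os.path.join(model_dir, 'tfidf_vectorizer.joblib')
    model_path = os.path.join(model_dir, 'naive_bayes_model.joblib')
    joblib.dump(vectorizer, vec_path)
    joblib.dump({'model': model}, model_path)

    # Later, or in another process: load both artifacts and predict
    loaded_vectorizer = joblib.load(vec_path)
    loaded_model = joblib.load(model_path)['model']
    new_features = loaded_vectorizer.transform(["pump failed"])
    prediction = loaded_model.predict(new_features)[0]

print(prediction)
```

With the real artifacts, new feedback would first need to go through `preprocessor.preprocess`, since the vectorizer was fitted on preprocessed text.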