# 📝 Text Classification

**Author**: Data Science Master System  
**Difficulty**: ⭐⭐ Intermediate  
**Time**: 45 minutes  
**Prerequisites**: 10_model_evaluation

## Learning Objectives
- Text preprocessing techniques
- TF-IDF vectorization
- Classical ML for text
- Multi-class classification

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
import re

## 1. Sample Data

In [None]:
# Sample dataset
texts = [
    "Great product, highly recommend!",
    "Terrible quality, waste of money",
    "Average performance, nothing special",
    "Excellent service and fast delivery",
    "Worst purchase ever, avoid!",
    "Good value for the price",
    "Disappointed with the quality",
    "Amazing! Exceeded my expectations",
] * 20  # Repeat for more data

labels = [1, 0, 1, 1, 0, 1, 0, 1] * 20

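# Fix the seed for reproducibility and stratify to keep the class ratio.
# Caveat: the 8 texts are repeated 20x, so identical sentences land in both
# train and test; the accuracy reported later is optimistic on this toy data.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)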
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
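
Because the sample dataset is built by repeating eight sentences, a random split puts copies of the same sentence on both sides. The sketch below counts that overlap and then shows one possible remedy, a group-aware split with scikit-learn's `GroupShuffleSplit`. The `groups` list, the `random_state` values, and the 25% test fraction are assumptions made for illustration; they are not part of the original notebook.

```python
from sklearn.model_selection import GroupShuffleSplit, train_test_split

unique_texts = [
    "Great product, highly recommend!",
    "Terrible quality, waste of money",
    "Average performance, nothing special",
    "Excellent service and fast delivery",
    "Worst purchase ever, avoid!",
    "Good value for the price",
    "Disappointed with the quality",
    "Amazing! Exceeded my expectations",
]
texts = unique_texts * 20
labels = [1, 0, 1, 1, 0, 1, 0, 1] * 20

# Naive random split: count test sentences that also occur in the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
train_sentences = set(X_tr)
leaked = sum(t in train_sentences for t in X_te)
print(f"Test samples also seen in training: {leaked} of {len(X_te)}")

# Group-aware split: every copy of a sentence shares one group id (assumed scheme),
# so all copies end up on the same side of the split
groups = [unique_texts.index(t) for t in texts]
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(texts, labels, groups=groups))

train_set = {texts[i] for i in train_idx}
test_set = {texts[i] for i in test_idx}
shared = train_set & test_set
print(f"Sentences shared between the grouped splits: {len(shared)}")
```

With only eight distinct sentences the grouped test set is tiny and may contain a single class, so this is a way to see the leakage rather than a realistic evaluation setup.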

## 2. Text Preprocessing

In [None]:
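def preprocess(text):
    """Lowercase the text and keep only letters and whitespace."""
    text = text.lower()
    # Drops punctuation and digits; apostrophes go too ("don't" -> "dont")
    text = re.sub(r'[^a-z\s]', '', text)
    return text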

X_train_clean = [preprocess(t) for t in X_train]
X_test_clean = [preprocess(t) for t in X_test]

print(f"Before: {texts[0]}")
print(f"After: {preprocess(texts[0])}")
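
The regex above is deliberately blunt, so it helps to check what it does to input that is messier than the sample reviews. The sketch below repeats the same two steps and adds a hypothetical `preprocess_tidy` variant (not in the original notebook) that also collapses repeated whitespace. The `sample` string is made up for illustration.

```python
import re

def preprocess(text):
    # Same two steps as the notebook's preprocess()
    text = text.lower()
    return re.sub(r'[^a-z\s]', '', text)

def preprocess_tidy(text):
    # Hypothetical variant: also collapse runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', preprocess(text)).strip()

sample = "Don't buy:  5/10 stars!"
print(repr(preprocess(sample)))
print(repr(preprocess_tidy(sample)))
```

Note that the rating `5/10` disappears entirely, which may matter for sentiment tasks. The extra whitespace is mostly cosmetic, since `TfidfVectorizer`'s default tokenizer ignores it anyway.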

## 3. TF-IDF Vectorization

In [None]:
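# Keep at most 1000 features, built from unigrams and bigrams
tfidf = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
# Learn the vocabulary and IDF weights from the training set only
X_train_vec = tfidf.fit_transform(X_train_clean)
# Reuse the fitted vocabulary on the test set to avoid leaking test statistics
X_test_vec = tfidf.transform(X_test_clean)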

print(f"Vocabulary size: {len(tfidf.vocabulary_)}")
print(f"Feature matrix: {X_train_vec.shape}")
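
To make the weighting less of a black box, the sketch below recomputes TF-IDF by hand for a two-document toy corpus and compares it with `TfidfVectorizer` under its default settings (raw term counts, smoothed IDF $\ln\frac{1+n}{1+\mathrm{df}} + 1$, then L2 normalization of each row). The corpus `docs` is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good product", "bad product"]

# Term frequencies: raw counts per document (columns sorted alphabetically)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Smoothed inverse document frequency, as used by scikit-learn's defaults
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfVectorizer().fit_transform(docs).toarray()
print("Manual computation matches TfidfVectorizer:", np.allclose(manual, sk))
```

"product" appears in both documents, so it receives the lowest IDF and ends up with a smaller weight than "good" or "bad" in each row.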

## 4. Model Comparison

In [None]:
models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM': LinearSVC()
}

results = []
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    acc = model.score(X_test_vec, y_test)
    results.append({'Model': name, 'Accuracy': f'{acc:.2%}'})

print("üìä Results:")
display(pd.DataFrame(results))
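
Accuracy alone hides how each class is handled, and `classification_report` is imported above but never used. The sketch below shows one way to get per-class precision, recall, and F1, with the vectorizer and classifier wrapped in a pipeline so that preprocessing is always fitted on training data only. The class names `negative` and `positive` are an assumed reading of labels 0 and 1, and the `random_state` value is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [
    "Great product, highly recommend!",
    "Terrible quality, waste of money",
    "Average performance, nothing special",
    "Excellent service and fast delivery",
    "Worst purchase ever, avoid!",
    "Good value for the price",
    "Disappointed with the quality",
    "Amazing! Exceeded my expectations",
] * 20
labels = [1, 0, 1, 1, 0, 1, 0, 1] * 20

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Vectorizer and classifier in one estimator; fit() only ever sees training text
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Per-class precision, recall, and F1 (class names are assumed, see above)
report = classification_report(
    y_test, y_pred, target_names=["negative", "positive"], zero_division=0
)
print(report)
```

On this repeated toy dataset the scores will look better than they would on unseen reviews, so treat the report as a demonstration of the API rather than a benchmark.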

## 🎯 Key Takeaways
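1. TF-IDF weights a term by how often it appears in a document, discounted by how many documents contain it
2. N-grams capture short phrases, such as "not good", that single words miss
3. Start with simple, fast baselines (Naive Bayes, logistic regression)
4. Consider transformer-based models when classical baselines plateau and you have enough data and compute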

**Next**: 16_nlp_sentiment_analysis.ipynb