dannyblaker/text-classification-guide

Text Classification: A Comprehensive Educational Guide

Welcome to this educational repository on Text Classification! This project covers the fundamental concepts, techniques, and practical implementations of text classification in Natural Language Processing (NLP).

📚 Table of Contents

  1. Introduction
  2. Text Labeling Methods
  3. Common Use Cases
  4. Project Structure
  5. Getting Started
  6. Examples
  7. Additional Resources

Introduction

⬆ Back to top

Text Classification is the task of assigning predefined categories or labels to text documents. It's one of the most fundamental tasks in NLP and has numerous real-world applications.

What You'll Learn

  • Different approaches to labeling text data
  • Manual vs. automatic labeling techniques
  • Implementation of various text classification use cases
  • Traditional ML and modern deep learning approaches
  • Best practices and evaluation metrics

Text Labeling Methods

⬆ Back to top

Manual Labeling

Manual labeling has human annotators assign labels to text data. Its trade-offs are:

  • Pros: High accuracy, domain-specific expertise, handles nuances
  • Cons: Time-consuming, expensive, potential for bias
  • Best for: Small datasets, complex tasks, establishing ground truth

Approaches covered:

  • Simple annotation workflows
  • Inter-annotator agreement
  • Label quality validation
  • Annotation guidelines
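
Inter-annotator agreement is often summarized with Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below is a minimal standard-library illustration; the function name `cohen_kappa` and the sample labels are assumptions made for this example and are not taken from `labeling/manual_labeling.py`.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, derived from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical annotations of six reviews by two annotators.
annotator_1 = ["positive", "positive", "negative", "neutral", "negative", "positive"]
annotator_2 = ["positive", "negative", "negative", "neutral", "negative", "positive"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. If scikit-learn is installed, `sklearn.metrics.cohen_kappa_score` computes the same statistic.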

Automatic Labeling

Automatic labeling uses algorithms to assign labels with little or no human intervention:

  1. Rule-Based Methods

    • Keyword matching
    • Regular expressions
    • Pattern-based classification
  2. Weak Supervision

    • Labeling functions
    • Snorkel framework concepts
    • Programmatic labeling
  3. Transfer Learning

    • Pre-trained models
    • Zero-shot classification
    • Few-shot learning
  4. Active Learning

    • Uncertainty sampling
    • Query strategies
    • Human-in-the-loop
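
To make the rule-based and weak supervision ideas concrete, here is a minimal sketch of programmatic labeling: three small labeling functions (keyword matching, a regular expression, and a pattern rule) vote on each text, and a majority vote combines them. Frameworks such as Snorkel fit a learned label model instead of a plain majority vote, so treat this only as a simple baseline. All names (`lf_spam_keywords`, `majority_vote`, the `SPAM`/`HAM`/`ABSTAIN` constants) are hypothetical and are not the API of `labeling/automatic_labeling.py`.

```python
import re
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1


def lf_spam_keywords(text):
    """Keyword matching: vote SPAM if a known spam phrase appears."""
    phrases = ("free money", "click here", "winner")
    lowered = text.lower()
    return SPAM if any(p in lowered for p in phrases) else ABSTAIN


def lf_many_exclamations(text):
    """Regular expression: vote SPAM on three or more consecutive '!'."""
    return SPAM if re.search(r"!{3,}", text) else ABSTAIN


def lf_greeting(text):
    """Pattern rule: a short message that opens with a greeting looks like HAM."""
    opens_with_greeting = re.match(r"(hi|hello|hey)\b", text.lower())
    return HAM if opens_with_greeting and len(text.split()) < 15 else ABSTAIN


LABELING_FUNCTIONS = [lf_spam_keywords, lf_many_exclamations, lf_greeting]


def majority_vote(text, labeling_functions=LABELING_FUNCTIONS):
    """Combine non-abstaining votes; abstain when there are none or on a tie."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


for message in [
    "Congratulations! Click here to claim your free money!",
    "Hi, are we still meeting for lunch tomorrow?",
    "The quarterly report is attached.",
]:
    print(majority_vote(message), message)
```

Texts on which every function abstains stay unlabeled. Those are natural candidates for manual annotation or for an active learning query.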

Common Use Cases

⬆ Back to top

1. Sentiment Analysis

Determine the emotional tone of text (positive, negative, neutral).

Applications:

  • Product reviews
  • Social media monitoring
  • Customer feedback analysis
  • Brand reputation management

2. Spam Detection

Identify unwanted or malicious messages.

Applications:

  • Email filtering
  • SMS filtering
  • Comment moderation
  • Fraud detection

3. Topic Classification

Categorize documents into predefined topics.

Applications:

  • News categorization
  • Document organization
  • Content recommendation
  • Academic paper classification

4. Intent Classification

Understand user intent in conversational AI.

Applications:

  • Chatbots
  • Virtual assistants
  • Customer service automation

5. Language Detection

Identify the language of a text.

Applications:

  • Multilingual content routing
  • Translation services
  • Content filtering

Project Structure

⬆ Back to top

text_classification/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
├── labeling/
│   ├── manual_labeling.py            # Manual annotation examples
│   ├── automatic_labeling.py         # Rule-based and weak supervision
│   ├── active_learning.py            # Active learning implementation
│   └── annotation_guidelines.md      # Best practices for labeling
├── use_cases/
│   ├── sentiment_analysis.py         # Sentiment classification
│   ├── spam_detection.py             # Spam filtering
│   ├── topic_classification.py       # Topic categorization
│   └── intent_classification.py      # Intent detection
├── utils/
│   ├── data_loader.py                # Data loading utilities
│   ├── preprocessing.py              # Text preprocessing
│   ├── evaluation.py                 # Metrics and evaluation
│   └── visualization.py              # Plotting and visualization
├── data/
│   ├── sample_reviews.csv            # Sample sentiment data
│   ├── sample_spam.csv               # Sample spam data
│   └── sample_news.csv               # Sample topic data
└── notebooks/
    ├── 01_introduction.ipynb         # Introduction to text classification
    ├── 02_manual_labeling.ipynb      # Manual labeling tutorial
    ├── 03_automatic_labeling.ipynb   # Automatic labeling tutorial
    └── 04_complete_pipeline.ipynb    # End-to-end pipeline

Getting Started

⬆ Back to top

Prerequisites

  • Python 3.8 or higher
  • Virtual environment (already created in .venv)

Installation

  1. Activate the virtual environment:
source .venv/bin/activate  # On Linux/Mac
# or
.venv\Scripts\activate  # On Windows
  2. Install dependencies:
pip install -r requirements.txt
  3. Download required NLP models:
python -m spacy download en_core_web_sm

Quick Start

Run the sentiment analysis example:

python use_cases/sentiment_analysis.py

Run the spam detection example:

python use_cases/spam_detection.py

Examples

⬆ Back to top

Example 1: Simple Sentiment Classification

from use_cases.sentiment_analysis import SentimentClassifier

# Create classifier
classifier = SentimentClassifier()

# Train on sample data
classifier.train()

# Predict sentiment
text = "This product is amazing! I love it!"
sentiment = classifier.predict(text)
print(f"Sentiment: {sentiment}")  # Output: positive
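
The internals of `SentimentClassifier` are not shown in this README. As a rough illustration of what a traditional ML approach to the same task can look like, the sketch below trains a tiny multinomial Naive Bayes model with add-one smoothing using only the standard library. `TinyNaiveBayes`, `tokenize`, and the four training sentences are hypothetical and are not part of this repository.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior, then add per-word log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # skip words never seen during training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data; a real model needs far more examples.
train_texts = [
    "I love this product, it is amazing",
    "Great quality and fast delivery",
    "Terrible experience, it broke after a day",
    "Awful quality, I hate it",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
print(model.predict("This product is amazing! I love it!"))
```

In practice, a scikit-learn pipeline (for example, TF-IDF features feeding a linear classifier) is a more common starting point than a hand-written model.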

Example 2: Rule-Based Spam Detection

from labeling.automatic_labeling import RuleBasedLabeler

# Create labeler
labeler = RuleBasedLabeler()

# Add spam rules
labeler.add_rule("contains", ["free money", "click here", "winner"])

# Label text
text = "Congratulations! Click here to claim your free money!"
is_spam = labeler.label(text)
print(f"Is Spam: {is_spam}")  # Output: True

Example 3: Manual Labeling Interface

from labeling.manual_labeling import ManualLabeler

# Create labeling interface
labeler = ManualLabeler(labels=["positive", "negative", "neutral"])

# Start labeling session
labeler.label_dataset("data/sample_reviews.csv", output="labeled_data.csv")

📊 Evaluation Metrics

⬆ Back to top

All examples include comprehensive evaluation:

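  • Accuracy: Fraction of all predictions that are correct
  • Precision: Fraction of predicted positives that are actually positive
  • Recall: Fraction of actual positives that the model identifies
  • F1-Score: Harmonic mean of precision and recall
  • Confusion Matrix: Counts for each pair of true and predicted class, used for detailed error analysis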
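
For a binary task, all of these metrics follow from the four confusion-matrix counts. The sketch below computes them from scratch; `binary_metrics` and the sample labels are made up for illustration and are not the interface of `utils/evaluation.py`.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute confusion counts and derived metrics for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # true positive
        elif pred == positive:
            fp += 1  # false positive
        elif truth == positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical gold labels and predictions for eight messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "spam", "spam"]

metrics = binary_metrics(y_true, y_pred, positive="spam")
for name, value in metrics.items():
    print(name, value)
```

Here accuracy is 5/8, precision is 3/5, and recall is 3/4, which shows how the three can diverge on the same predictions. For multi-class tasks, the per-class values are usually averaged (macro or weighted).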

🛠️ Technologies Used

⬆ Back to top

  • scikit-learn: Traditional ML algorithms
  • transformers: Pre-trained language models
  • spaCy: NLP preprocessing
  • pandas: Data manipulation
  • matplotlib/seaborn: Visualization
  • nltk: Text processing utilities

📚 Learning Path

⬆ Back to top

  1. Start with basics: Read through the manual labeling examples
  2. Understand automation: Explore automatic labeling techniques
  3. Practice with use cases: Implement sentiment analysis, spam detection
  4. Advanced topics: Dive into transfer learning and active learning
  5. Build your own: Create a custom classifier for your domain

Additional Resources

⬆ Back to top
