Welcome to this educational repository on Text Classification! This project covers the fundamental concepts, techniques, and practical implementations of text classification in Natural Language Processing (NLP).
- Introduction
- Text Labeling Methods
- Common Use Cases
- Project Structure
- Getting Started
- Examples
- Additional Resources
Text Classification is the task of assigning predefined categories or labels to text documents. It's one of the most fundamental tasks in NLP and has numerous real-world applications.
- Different approaches to labeling text data
- Manual vs. automatic labeling techniques
- Implementation of various text classification use cases
- Traditional ML and modern deep learning approaches
- Best practices and evaluation metrics
Manual labeling involves human annotators assigning labels to text data. This approach:
- Pros: High accuracy, domain-specific expertise, handles nuances
- Cons: Time-consuming, expensive, potential for bias
- Best for: Small datasets, complex tasks, establishing ground truth
Approaches covered:
- Simple annotation workflows
- Inter-annotator agreement
- Label quality validation
- Annotation guidelines
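Inter-annotator agreement is often summarized with Cohen's kappa, which corrects the raw agreement rate for the agreement two annotators would reach by chance. The following is a minimal pure-Python sketch, not necessarily what `labeling/manual_labeling.py` does; the function name `cohen_kappa` and the sample annotations are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for six reviews
annotator_1 = ["positive", "negative", "neutral", "positive", "negative", "positive"]
annotator_2 = ["positive", "negative", "positive", "positive", "neutral", "positive"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance; low values usually suggest that the annotation guidelines need tightening.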
Automatic labeling uses algorithms to assign labels without human intervention:
- Rule-Based Methods
  - Keyword matching
  - Regular expressions
  - Pattern-based classification
- Weak Supervision
  - Labeling functions
  - Snorkel framework concepts
  - Programmatic labeling
- Transfer Learning
  - Pre-trained models
  - Zero-shot classification
  - Few-shot learning
- Active Learning
  - Uncertainty sampling
  - Query strategies
  - Human-in-the-loop
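To make the rule-based and weak supervision ideas concrete, the sketch below writes three small labeling functions (keyword matching, a regular expression, and a pattern-based rule) and combines their votes. It assumes a plain majority vote as the combiner, which is a simpler stand-in for the learned label model that Snorkel uses. All function names, label constants, and sample messages are hypothetical and are not taken from `labeling/automatic_labeling.py`.

```python
import re
from collections import Counter

# Hypothetical label constants; ABSTAIN means "this rule has no opinion".
SPAM, HAM, ABSTAIN = 1, 0, -1


def lf_spam_keywords(text):
    """Keyword matching: flag phrases that commonly appear in spam."""
    keywords = ("free money", "click here", "winner")
    return SPAM if any(k in text.lower() for k in keywords) else ABSTAIN


def lf_many_exclamations(text):
    """Regular expression: three or more exclamation marks in a row."""
    return SPAM if re.search(r"!{3,}", text) else ABSTAIN


def lf_personal_greeting(text):
    """Pattern-based rule: a greeting followed by a capitalized name."""
    pattern = r"(?:[Hh]i|[Hh]ello|[Dd]ear)\s+[A-Z][a-z]+\b"
    return HAM if re.match(pattern, text) else ABSTAIN


LABELING_FUNCTIONS = [lf_spam_keywords, lf_many_exclamations, lf_personal_greeting]


def majority_vote(text, labeling_functions=LABELING_FUNCTIONS):
    """Combine votes; abstain when no rule fires or the top vote is tied."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


messages = [
    "Congratulations! Click here to claim your free money!!!",
    "Hi Anna, are we still meeting at noon?",
    "Quarterly report attached.",
]
labels = [majority_vote(m) for m in messages]
print(labels)  # [1, 0, -1]
```

Items where every rule abstains are natural candidates for manual review, which is one way the human-in-the-loop step can be connected to programmatic labeling.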
Determine the emotional tone of text (positive, negative, neutral).
Applications:
- Product reviews
- Social media monitoring
- Customer feedback analysis
- Brand reputation management
Identify unwanted or malicious messages.
Applications:
- Email filtering
- SMS filtering
- Comment moderation
- Fraud detection
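One traditional ML approach to spam detection is a bag-of-words Naive Bayes classifier. The sketch below is a small pure-Python version with add-one smoothing, shown only to illustrate the idea; it is not the implementation in `use_cases/spam_detection.py`, and the class name, tokenizer, and toy training data are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word in self.vocabulary:  # skip words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy training data
train_texts = [
    "Win free money now",
    "Click here to claim your prize",
    "Free prize winner click now",
    "Are we still meeting for lunch tomorrow",
    "Please review the attached project report",
    "Thanks for the lunch see you tomorrow",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = NaiveBayesClassifier().fit(train_texts, train_labels)
print(model.predict("Claim your free prize now"))
print(model.predict("See you at the project meeting tomorrow"))
```

Working in log space avoids numeric underflow on long messages, and the smoothing keeps a single unseen word from driving a class probability to zero.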
Categorize documents into predefined topics.
Applications:
- News categorization
- Document organization
- Content recommendation
- Academic paper classification
Understand user intent in conversational AI.
Applications:
- Chatbots
- Virtual assistants
- Customer service automation
Identify the language of a text.
Applications:
- Multilingual content routing
- Translation services
- Content filtering
text_classification/
├── README.md                        # This file
├── requirements.txt                 # Python dependencies
├── labeling/
│   ├── manual_labeling.py           # Manual annotation examples
│   ├── automatic_labeling.py        # Rule-based and weak supervision
│   ├── active_learning.py           # Active learning implementation
│   └── annotation_guidelines.md     # Best practices for labeling
├── use_cases/
│   ├── sentiment_analysis.py        # Sentiment classification
│   ├── spam_detection.py            # Spam filtering
│   ├── topic_classification.py      # Topic categorization
│   └── intent_classification.py     # Intent detection
├── utils/
│   ├── data_loader.py               # Data loading utilities
│   ├── preprocessing.py             # Text preprocessing
│   ├── evaluation.py                # Metrics and evaluation
│   └── visualization.py             # Plotting and visualization
├── data/
│   ├── sample_reviews.csv           # Sample sentiment data
│   ├── sample_spam.csv              # Sample spam data
│   └── sample_news.csv              # Sample topic data
└── notebooks/
    ├── 01_introduction.ipynb        # Introduction to text classification
    ├── 02_manual_labeling.ipynb     # Manual labeling tutorial
    ├── 03_automatic_labeling.ipynb  # Automatic labeling tutorial
    └── 04_complete_pipeline.ipynb   # End-to-end pipeline
- Python 3.8 or higher
- Virtual environment (already created in .venv)
- Activate the virtual environment:

source .venv/bin/activate # On Linux/Mac
# or
.venv\Scripts\activate # On Windows

- Install dependencies:

pip install -r requirements.txt

- Download required NLP models:

python -m spacy download en_core_web_sm

Run the sentiment analysis example:

python use_cases/sentiment_analysis.py

Run the spam detection example:

python use_cases/spam_detection.py

from use_cases.sentiment_analysis import SentimentClassifier
# Create classifier
classifier = SentimentClassifier()
# Train on sample data
classifier.train()
# Predict sentiment
text = "This product is amazing! I love it!"
sentiment = classifier.predict(text)
print(f"Sentiment: {sentiment}") # Output: positive

from labeling.automatic_labeling import RuleBasedLabeler
# Create labeler
labeler = RuleBasedLabeler()
# Add spam rules
labeler.add_rule("contains", ["free money", "click here", "winner"])
# Label text
text = "Congratulations! Click here to claim your free money!"
is_spam = labeler.label(text)
print(f"Is Spam: {is_spam}") # Output: True

from labeling.manual_labeling import ManualLabeler
# Create labeling interface
labeler = ManualLabeler(labels=["positive", "negative", "neutral"])
# Start labeling session
labeler.label_dataset("data/sample_reviews.csv", output="labeled_data.csv")

All examples report the following evaluation metrics:
- Accuracy: Fraction of all predictions that are correct
- Precision: Fraction of predicted positives that are actually positive
- Recall: Fraction of actual positives that are correctly predicted
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Counts for each actual/predicted label pair, used for detailed error analysis
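The metrics above all derive from the four cells of a binary confusion matrix. The following sketch computes them by hand for one positive class. The function names and sample labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the repository specifies.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    pairs = Counter((t == positive, p == positive) for t, p in zip(y_true, y_pred))
    return {
        "tp": pairs[(True, True)],    # actual positive, predicted positive
        "fn": pairs[(True, False)],   # actual positive, predicted negative
        "fp": pairs[(False, True)],   # actual negative, predicted positive
        "tn": pairs[(False, False)],  # actual negative, predicted negative
    }


def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    c = confusion_counts(y_true, y_pred, positive)
    total = sum(c.values())
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    accuracy = (c["tp"] + c["tn"]) / total if total else 0.0
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold labels and predictions
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "ham", "spam", "ham", "ham", "spam"]

metrics = classification_metrics(y_true, y_pred, positive="spam")
print(metrics)
```

On imbalanced data, such as a mailbox where most messages are legitimate, accuracy alone can look high while recall on the spam class stays poor, which is why precision and recall are worth reading alongside it.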
- scikit-learn: Traditional ML algorithms
- transformers: Pre-trained language models
- spaCy: NLP preprocessing
- pandas: Data manipulation
- matplotlib/seaborn: Visualization
- nltk: Text processing utilities
- Start with basics: Read through the manual labeling examples
- Understand automation: Explore automatic labeling techniques
- Practice with use cases: Implement sentiment analysis, spam detection
- Advanced topics: Dive into transfer learning and active learning
- Build your own: Create a custom classifier for your domain