# Text Classification with NeuroLite

This tutorial demonstrates how to build text classification models using NeuroLite. We'll cover:

1. Data preparation for text classification
2. Training transformer-based models
3. Model evaluation and analysis
4. Making predictions on new text
5. Advanced NLP features

## Dataset Format

For text classification, NeuroLite expects your data in CSV format with columns for text and labels:

```csv
text,label
"This movie is amazing!",positive
"I didn't like this film",negative
"Great acting and story",positive
```
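
Before training, it can help to confirm that a file really has the two columns shown above. The following is a minimal sketch using only Python's standard `csv` module. The helper name `check_text_label_csv` is hypothetical and is not part of NeuroLite; it only illustrates the expected layout.

```python
import csv
import io

def check_text_label_csv(handle, text_col="text", label_col="label"):
    """Return (row_count, label_set) or raise ValueError if the layout is wrong.

    Hypothetical helper for illustration; not a NeuroLite function.
    """
    reader = csv.DictReader(handle)
    missing = {text_col, label_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    labels = set()
    count = 0
    for row in reader:
        # Short rows are padded with None by DictReader, hence the `or ""`.
        text = (row.get(text_col) or "").strip()
        label = (row.get(label_col) or "").strip()
        if not text or not label:
            raise ValueError(f"empty text or label near line {reader.line_num}")
        labels.add(label)
        count += 1
    return count, labels

sample = io.StringIO(
    'text,label\n'
    '"This movie is amazing!",positive\n'
    '"I didn\'t like this film",negative\n'
    '"Great acting and story",positive\n'
)
row_count, label_set = check_text_label_csv(sample)
print(row_count, sorted(label_set))
```

To check a file on disk, pass an open file handle instead of the in-memory sample, for example `open('movie_reviews.csv', newline='', encoding='utf-8')`.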

## Setup and Data Preparation

In [None]:
import neurolite
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
import seaborn as sns

# Create a sample text classification dataset
def create_sample_text_dataset():
    """Create a sample movie review dataset for demonstration"""
    
    # Sample movie reviews
    positive_reviews = [
        "This movie is absolutely fantastic! Great acting and storyline.",
        "I loved every minute of this film. Highly recommended!",
        "Outstanding performance by the lead actor. Must watch!",
        "Brilliant cinematography and excellent direction.",
        "One of the best movies I've seen this year.",
        "Amazing special effects and compelling characters.",
        "Wonderful story that kept me engaged throughout.",
        "Superb acting and beautiful soundtrack.",
        "This film exceeded all my expectations.",
        "Perfect blend of action and emotion.",
        "Incredible movie with outstanding performances.",
        "Loved the plot twists and character development.",
        "Excellent movie with great visual effects.",
        "This is a masterpiece of modern cinema.",
        "Fantastic storytelling and amazing cast."
    ]
    
    negative_reviews = [
        "This movie was terrible. Waste of time and money.",
        "Poor acting and confusing plot. Very disappointing.",
        "I couldn't even finish watching this boring film.",
        "Worst movie I've ever seen. Completely pointless.",
        "Bad direction and terrible screenplay.",
        "This film lacks any coherent storyline.",
        "Awful acting and poor production quality.",
        "Complete waste of time. Very boring and predictable.",
        "I regret watching this movie. Total disappointment.",
        "Poor character development and weak plot.",
        "This movie is painfully slow and uninteresting.",
        "Bad script and unconvincing performances.",
        "I fell asleep halfway through this boring film.",
        "Terrible movie with no redeeming qualities.",
        "Poorly executed and completely forgettable."
    ]
    
    # Create DataFrame
    data = []
    
    for review in positive_reviews:
        data.append({'text': review, 'label': 'positive'})
    
    for review in negative_reviews:
        data.append({'text': review, 'label': 'negative'})
    
    df = pd.DataFrame(data)
    
    # Shuffle the data
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    # Save to CSV
    df.to_csv('movie_reviews.csv', index=False)
    
    return df

# Create the sample dataset
df = create_sample_text_dataset()

print("Sample Text Classification Dataset:")
print(f"Total samples: {len(df)}")
print("Label distribution:")
print(df['label'].value_counts())
print("\nFirst few samples:")
print(df.head())

## Training a Text Classification Model

Let's train a transformer-based model for sentiment analysis:

In [None]:
# Train a text classification model
model = neurolite.train(
    data='movie_reviews.csv',
    model='bert',  # Use BERT transformer
    task='text_classification',
    target='label',
    max_length=128,  # Maximum sequence length
    remove_stopwords=False,  # Keep stopwords for better context
    validation_split=0.2,
    test_split=0.1,
    optimize=True  # Enable hyperparameter optimization
)

print("Text classification model training completed!")
print(f"Model type: {type(model.model).__name__}")
print(f"Framework: {model.framework}")
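
The `validation_split` and `test_split` arguments above reserve part of the data for evaluation. How NeuroLite partitions the rows internally is not shown in this tutorial, so the following is only a sketch of one common approach, a stratified split that keeps the class proportions in every subset. The function name `stratified_split` is an assumption made for illustration. With only 30 reviews, the held-out sets are tiny, so the reported metrics should be read as rough indications.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.2, test_frac=0.1, seed=42):
    """Split row indices into train/val/test while keeping class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Truncate so that small classes never lose more than requested.
        n_val = int(len(indices) * val_frac)
        n_test = int(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test

# Same class balance as the sample movie review dataset: 15 + 15.
labels = ["positive"] * 15 + ["negative"] * 15
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```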

## Model Evaluation

Let's analyze the model's performance:

In [None]:
# Print evaluation metrics
print("Text Classification Model Evaluation:")
print("====================================")

metrics = model.evaluation_results.metrics
for metric_name, value in metrics.items():
    if isinstance(value, float):
        print(f"{metric_name.capitalize()}: {value:.4f}")
    else:
        print(f"{metric_name.capitalize()}: {value}")

# Plot training history if available
if hasattr(model, 'training_history') and model.training_history:
    plt.figure(figsize=(15, 5))
    
    # Plot loss
    plt.subplot(1, 3, 1)
    if 'loss' in model.training_history:
        plt.plot(model.training_history['loss'], label='Training Loss')
    if 'val_loss' in model.training_history:
        plt.plot(model.training_history['val_loss'], label='Validation Loss')
    plt.title('Training Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    
    # Plot accuracy
    plt.subplot(1, 3, 2)
    if 'accuracy' in model.training_history:
        plt.plot(model.training_history['accuracy'], label='Training Accuracy')
    if 'val_accuracy' in model.training_history:
        plt.plot(model.training_history['val_accuracy'], label='Validation Accuracy')
    plt.title('Training Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    
    # Plot F1 score if available
    plt.subplot(1, 3, 3)
    if 'f1' in model.training_history:
        plt.plot(model.training_history['f1'], label='Training F1')
    if 'val_f1' in model.training_history:
        plt.plot(model.training_history['val_f1'], label='Validation F1')
    plt.title('F1 Score')
    plt.xlabel('Epoch')
    plt.ylabel('F1 Score')
    plt.legend()
    
    plt.tight_layout()
    plt.show()

# Show confusion matrix
if hasattr(model.evaluation_results, 'confusion_matrix') and model.evaluation_results.confusion_matrix is not None:
    plt.figure(figsize=(8, 6))
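    # Tick labels assume the matrix rows and columns are ordered negative, positive;
    # verify this against your model's label order before reading the plot
    sns.heatmap(
        model.evaluation_results.confusion_matrix,
        annot=True,
        fmt='d',
        cmap='Blues',
        xticklabels=['negative', 'positive'],
        yticklabels=['negative', 'positive']
    )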
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()

# Show classification report if available
if hasattr(model.evaluation_results, 'classification_report') and model.evaluation_results.classification_report:
    print("\nDetailed Classification Report:")
    print("==============================")
    # Use a separate name so the `metrics` dict defined earlier is not overwritten
    for class_name, class_metrics in model.evaluation_results.classification_report.items():
        if isinstance(class_metrics, dict):
            print(f"\n{class_name.capitalize()}:")
            for metric, value in class_metrics.items():
                if isinstance(value, float):
                    print(f"  {metric}: {value:.4f}")
                else:
                    print(f"  {metric}: {value}")
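
To make the confusion matrix plotted above less of a black box, here is a minimal sketch of how such a matrix can be built from true and predicted labels. The labels are made up for illustration, and the function is a plain-Python stand-in rather than the routine NeuroLite uses internally.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Made-up labels purely for illustration.
y_true = ["positive", "positive", "negative", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive", "positive"]
labels = ["negative", "positive"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>8}: {row}")

# Accuracy is the sum of the diagonal divided by the number of samples.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(y_true)
print(f"accuracy: {accuracy:.4f}")
```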

## Making Predictions

Let's test our model on new text samples:

In [None]:
# Test the model with new text samples
test_texts = [
    "This movie is absolutely incredible! I loved every second of it.",
    "Boring and predictable. I want my money back.",
    "The acting was decent but the plot was confusing.",
    "Amazing cinematography and outstanding performances by all actors.",
    "I fell asleep during the movie. Very disappointing.",
    "This film is a masterpiece of storytelling and visual effects.",
    "The movie was okay, nothing special but not terrible either.",
    "Worst film I've ever seen. Complete waste of time."
]

# Make predictions
predictions = model.predict(test_texts)

print("Predictions on New Text:")
print("========================")

for i, (text, prediction) in enumerate(zip(test_texts, predictions), start=1):
    # Truncate long text for display
    display_text = text if len(text) <= 60 else text[:57] + "..."
    print(f"{i}. \"{display_text}\"")
    print(f"   Predicted: {prediction}")
    print()

## Advanced NLP Features

### Different Model Architectures

In [None]:
# Compare different NLP models
nlp_models = ['bert', 'roberta', 'distilbert']
model_results = {}

for model_name in nlp_models:
    print(f"Training {model_name}...")
    try:
        trained_model = neurolite.train(
            data='movie_reviews.csv',
            model=model_name,
            task='text_classification',
            target='label',
            max_length=64,  # Shorter for faster training
            validation_split=0.2,
            optimize=False  # Skip optimization for speed
        )
        
        accuracy = trained_model.evaluation_results.metrics.get('accuracy', 0.0)
        f1_score = trained_model.evaluation_results.metrics.get('f1', 0.0)
        
        model_results[model_name] = {
            'accuracy': accuracy,
            'f1': f1_score
        }
        
        print(f"{model_name} - Accuracy: {accuracy:.4f}, F1: {f1_score:.4f}")
        
    except Exception as e:
        print(f"Failed to train {model_name}: {e}")
        model_results[model_name] = {'accuracy': 0.0, 'f1': 0.0}

# Plot model comparison
if model_results:
    models = list(model_results.keys())
    accuracies = [model_results[m]['accuracy'] for m in models]
    f1_scores = [model_results[m]['f1'] for m in models]
    
    x = np.arange(len(models))
    width = 0.35
    
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    plt.bar(x - width/2, accuracies, width, label='Accuracy', alpha=0.8)
    plt.bar(x + width/2, f1_scores, width, label='F1 Score', alpha=0.8)
    plt.xlabel('Model')
    plt.ylabel('Score')
    plt.title('NLP Model Comparison')
    plt.xticks(x, models)
    plt.legend()
    plt.ylim(0, 1)
    
    # Add value labels
    for i, (acc, f1) in enumerate(zip(accuracies, f1_scores)):
        plt.text(i - width/2, acc + 0.01, f'{acc:.3f}', ha='center', fontsize=9)
        plt.text(i + width/2, f1 + 0.01, f'{f1:.3f}', ha='center', fontsize=9)
    
    # Show training time comparison (placeholder values, not measurements)
    plt.subplot(1, 2, 2)
    training_times = [100, 120, 80]  # Simulated times in seconds, one per entry in nlp_models
    plt.bar(models, training_times, alpha=0.7, color='orange')
    plt.xlabel('Model')
    plt.ylabel('Training Time (seconds)')
    plt.title('Training Time Comparison (simulated)')
    
    for i, seconds in enumerate(training_times):
        plt.text(i, seconds + 2, f'{seconds}s', ha='center')
    
    plt.tight_layout()
    plt.show()
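
The training times in the comparison above are simulated. To plot real numbers, each training call can be wrapped in a timer. The sketch below uses `time.perf_counter` from the standard library; `fake_training_run` is a hypothetical stand-in workload that you would replace with the actual training call.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_training_run(n_steps):
    """Stand-in workload; replace with the real training call."""
    return sum(i * i for i in range(n_steps))

measured_times = {}
for name, steps in [("small", 10_000), ("large", 100_000)]:
    _, elapsed = timed(fake_training_run, steps)
    measured_times[name] = elapsed
    print(f"{name}: {elapsed:.4f}s")
```

The resulting `measured_times` dictionary can then replace the hard-coded list when drawing the bar chart.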

### Custom Text Preprocessing

In [None]:
# Train with custom preprocessing options
custom_model = neurolite.train(
    data='movie_reviews.csv',
    model='distilbert',  # Faster variant of BERT
    task='text_classification',
    target='label',
    max_length=256,
    remove_stopwords=True,  # Remove common words
    validation_split=0.2,
    # Custom preprocessing parameters
    lowercase=True,
    remove_punctuation=False,  # Keep punctuation for sentiment
    min_word_length=2,
    optimize=False
)

print("Custom preprocessing model training completed!")
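custom_accuracy = custom_model.evaluation_results.metrics.get('accuracy')
# Format only when a value is present; ':.4f' would raise on an 'N/A' fallback string
print(f"Accuracy: {custom_accuracy:.4f}" if custom_accuracy is not None else "Accuracy: N/A")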

# Test the custom model
test_custom = [
    "This movie is AMAZING!!!",
    "terrible, boring, waste of time",
    "Good movie, but could be better."
]

custom_predictions = custom_model.predict(test_custom)

print("\nCustom Model Predictions:")
for text, pred in zip(test_custom, custom_predictions):
    print(f"\"{text}\" -> {pred}")
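
The preprocessing options passed above (`lowercase`, `remove_stopwords`, `remove_punctuation`, `min_word_length`) are not explained further in this tutorial. The sketch below shows one plausible interpretation in plain Python so you can see what each switch might do to a sentence. It is an assumption for illustration, not NeuroLite's implementation, and the stopword list is deliberately tiny. Some real stopword lists include negations such as "not", so removing stopwords can change the sentiment of a sentence.

```python
import string

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "it", "of", "and", "but", "be", "this"}

def preprocess(text, lowercase=True, remove_stopwords=True,
               remove_punctuation=False, min_word_length=2):
    """Apply the four options in a fixed order and return the cleaned text."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    kept = []
    for token in text.split():
        # Compare on the bare word so "movie," is still treated as "movie".
        word = token.strip(string.punctuation)
        if remove_stopwords and word.lower() in STOPWORDS:
            continue
        if len(word) < min_word_length:
            continue
        kept.append(token)
    return " ".join(kept)

for sample in ["This movie is AMAZING!!!", "Good movie, but could be better."]:
    print(repr(preprocess(sample)))
```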

## Working with Real-World Datasets

Let's try a more realistic dataset:

In [None]:
# Load a subset of 20 newsgroups dataset
try:
    categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
    newsgroups_train = fetch_20newsgroups(
        subset='train',
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )
    
    # Create DataFrame
    newsgroups_df = pd.DataFrame({
        'text': newsgroups_train.data[:200],  # Use subset for demo
        'label': [newsgroups_train.target_names[i] for i in newsgroups_train.target[:200]]
    })
    
    # Clean the text data
    newsgroups_df['text'] = newsgroups_df['text'].str.replace('\n', ' ').str.replace('\t', ' ')
    newsgroups_df['text'] = newsgroups_df['text'].str[:500]  # Truncate long texts
    
    # Remove empty or very short texts (10 characters or fewer)
    newsgroups_df = newsgroups_df[newsgroups_df['text'].str.len() > 10]
    
    newsgroups_df.to_csv('newsgroups.csv', index=False)
    
    print("20 Newsgroups Dataset:")
    print(f"Total samples: {len(newsgroups_df)}")
    print(f"Categories: {newsgroups_df['label'].unique()}")
    print("Label distribution:")
    print(newsgroups_df['label'].value_counts())
    
    # Train model on newsgroups data
    newsgroups_model = neurolite.train(
        data='newsgroups.csv',
        model='distilbert',
        task='text_classification',
        target='label',
        max_length=256,
        validation_split=0.2,
        optimize=False
    )
    
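    newsgroups_accuracy = newsgroups_model.evaluation_results.metrics.get('accuracy')
    # Format only when a value is present; ':.4f' would raise on a string fallback
    if newsgroups_accuracy is not None:
        print(f"\nNewsgroups model accuracy: {newsgroups_accuracy:.4f}")
    else:
        print("\nNewsgroups model accuracy: N/A")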
    
    # Test with sample texts
    test_newsgroups = [
        "I believe in God and follow Christian teachings.",
        "Computer graphics and 3D rendering are fascinating topics.",
        "Medical research shows that exercise improves health.",
        "I don't believe in any religious doctrine or deity."
    ]
    
    newsgroups_predictions = newsgroups_model.predict(test_newsgroups)
    
    print("\nNewsgroups Predictions:")
    for text, pred in zip(test_newsgroups, newsgroups_predictions):
        print(f"\"{text}\" -> {pred}")
        
except Exception as e:
    print(f"Newsgroups example failed: {e}")
    print("This can happen if the dataset cannot be downloaded (no internet connection) or if training fails.")
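
The cleaning step in the cell above replaces `\n` and `\t` one at a time, which can leave runs of spaces and other whitespace such as `\r` in the text. As an alternative sketch, a single regular expression can collapse every whitespace run before truncating. The helper name `clean_text` is hypothetical.

```python
import re

def clean_text(text, max_chars=500):
    """Collapse all whitespace runs to single spaces, trim, then truncate."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed[:max_chars]

raw = "Subject line\n\n\tSome   body text\r\nwith  mixed whitespace.  "
print(repr(clean_text(raw)))
print(len(clean_text("word " * 200)))
```

With pandas, the same function could be applied to the column with `newsgroups_df['text'].map(clean_text)`.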

## Model Deployment for NLP

Deploy your text classification model:

In [None]:
# Deploy the text classification model
print("Deploying NLP model...")

# Export to ONNX for cross-platform inference
try:
    onnx_model = neurolite.deploy(model, format='onnx')
    print("✓ NLP model exported to ONNX format")
except Exception as e:
    print(f"✗ ONNX export failed: {e}")

# Show how the model could be served as a REST API (these lines only print instructions)
print("\nTo deploy as REST API for text classification:")
print("api_server = neurolite.deploy(model, format='api', port=8080)")
print("")
print("API Usage:")
print("POST /predict")
print("Content-Type: application/json")
print("Body: {\"text\": \"Your text to classify\"}")
print("")
print("Response: {\"prediction\": \"positive\", \"confidence\": 0.95}")

# Save model for later use
try:
    model.save('text_classification_model')
    print("\n✓ Model saved to 'text_classification_model' directory")
except Exception as e:
    print(f"✗ Model save failed: {e}")
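
The API usage printed above describes a `POST /predict` request with a JSON body. The sketch below builds such a request with the standard library, without sending it. The host `localhost` is an assumption; the port, path, and body shape are taken from the printed instructions, and the server's actual behavior is not verified here.

```python
import json
import urllib.request

def build_predict_request(text, host="localhost", port=8080):
    """Build (but do not send) a POST /predict request with a JSON body."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_predict_request("Great acting and story")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it once a server is running (not executed here):
# with urllib.request.urlopen(request, timeout=10) as response:
#     result = json.loads(response.read().decode("utf-8"))
```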

## Best Practices for Text Classification

### 1. Data Preparation
- Clean your text data (remove HTML, special characters if needed)
- Use stratified splits so every split keeps the same class proportions, and handle class imbalance with class weights or resampling
- Use appropriate train/validation/test splits
- Consider data augmentation for small datasets
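
One common way to compensate for class imbalance is to weight each class inversely to its frequency. The sketch below computes such weights with the formula `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn uses for `class_weight='balanced'`. Whether NeuroLite accepts class weights is not covered in this tutorial, so treat this as a standalone illustration on made-up labels.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Made-up imbalanced label column: 90 negatives, 10 positives.
labels = ["negative"] * 90 + ["positive"] * 10
weights = balanced_class_weights(labels)
for label in sorted(weights):
    print(f"{label}: {weights[label]:.3f}")
```

The rare class receives the larger weight, so its errors count more heavily in a weighted loss.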

### 2. Model Selection
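- **BERT**: Strong general-purpose baseline, slower to train than distilled variants
- **DistilBERT**: About 40% smaller and 60% faster than BERT while retaining roughly 97% of its language-understanding performance, as reported by its authors
- **RoBERTa**: Often outperforms BERT on downstream tasks thanks to a more heavily tuned pre-training recipe
- **ELECTRA**: More sample-efficient pre-training, good performance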

### 3. Hyperparameter Tuning
- **max_length**: Balance between context and efficiency
- **learning_rate**: Start with 2e-5 for BERT-based models
- **batch_size**: 16 or 32 are common starting points for fine-tuning; the upper limit is set by GPU memory
- **epochs**: Usually 3-5 epochs for fine-tuning
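
To pick `max_length` from the data rather than by guesswork, one option is to look at the distribution of text lengths and choose a value that covers most examples. The sketch below counts whitespace-separated tokens as a rough proxy. Subword tokenizers used by BERT-style models usually produce more tokens than this and add special tokens, so treat the result as a lower bound. The helper name `suggest_max_length` is hypothetical.

```python
def suggest_max_length(texts, percentile=95):
    """Return the token count that covers `percentile` percent of the texts.

    Uses whitespace tokens as a rough proxy and the nearest-rank method;
    `percentile` is expected to be an integer between 1 and 100.
    """
    lengths = sorted(len(text.split()) for text in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Integer ceiling of percentile * n / 100, at least 1.
    rank = max(1, (percentile * len(lengths) + 99) // 100)
    return lengths[rank - 1]

# Made-up corpus: 18 short texts and 2 long ones.
texts = ["short review"] * 18 + ["a much longer review " * 20] * 2
print(suggest_max_length(texts, percentile=90))
print(suggest_max_length(texts, percentile=100))
```

Covering 90% of this made-up corpus needs far fewer tokens than covering all of it, which is the trade-off between context and efficiency mentioned above.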

### 4. Evaluation Metrics
- **Accuracy**: Good for balanced datasets
- **F1-score**: Better for imbalanced datasets
- **Precision/Recall**: Important for specific use cases
- **Confusion Matrix**: Understand per-class performance
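
The difference between accuracy and F1 on imbalanced data is easiest to see with a degenerate classifier. The sketch below uses made-up labels and a model that always predicts the majority class: accuracy looks high while macro-averaged F1 exposes that the minority class is never found.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up imbalanced data: 95 negatives, 5 positives.
y_true = ["negative"] * 95 + ["positive"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["negative"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class_f1 = [precision_recall_f1(y_true, y_pred, c)[2]
                for c in ("negative", "positive")]
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro_f1:.3f}")
```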

### 5. Common Issues and Solutions
- **Overfitting**: Use dropout, early stopping, smaller learning rate
- **Slow training**: Use DistilBERT, reduce max_length, increase batch_size
- **Poor performance**: Check data quality, try different models, tune hyperparameters
- **Memory issues**: Reduce batch_size, use gradient accumulation
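
Early stopping, listed above as a remedy for overfitting, can be sketched in a few lines. The function below scans a list of per-epoch validation losses and reports the epoch at which training would halt once the loss has failed to improve for `patience` consecutive epochs. The loss values are made up, and how NeuroLite exposes early stopping (if at all) is not covered in this tutorial.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 0-based epoch at which training would stop.

    Stops once `patience` consecutive epochs fail to improve the best
    validation loss by more than `min_delta`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered; ran all epochs

# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [0.70, 0.55, 0.48, 0.49, 0.51, 0.60]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"stop after epoch {stop_epoch}")
```

In practice the weights from the best epoch (epoch 2 in this made-up run) are usually restored after stopping.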

## Next Steps

- Try [Sentiment Analysis Tutorial](02_sentiment_analysis.ipynb) for specialized sentiment tasks
- Explore [Custom NLP Models](03_custom_models.ipynb) to create domain-specific models
- Learn about [Hyperparameter Optimization](../advanced/01_hyperparameter_optimization.ipynb)
- Check out [Model Deployment](../advanced/02_deployment.ipynb) for production deployment

## Summary

In this tutorial, you learned how to:

- ✓ Prepare text data for classification
- ✓ Train transformer-based models with NeuroLite
- ✓ Evaluate model performance with appropriate metrics
- ✓ Make predictions on new text
- ✓ Compare different NLP model architectures
- ✓ Apply custom preprocessing options
- ✓ Work with real-world datasets
- ✓ Deploy models for production use

NeuroLite makes NLP accessible while providing the flexibility to customize for your specific needs!