# AutoNLP-Agent: No-Code NLP Platform

AutoNLP-Agent detects the NLP task from an uploaded dataset, preprocesses the text, trains a model, and evaluates it. This notebook walks through that pipeline for text classification and sentiment analysis.

## Features
- 🔍 Automatic NLP task detection
- 🧹 Data preprocessing pipeline
- 🤖 ML model training (Scikit-learn & Transformers)
- 📊 Model evaluation with visualizations
- 🎯 GPU acceleration support

## Supported Tasks
- Text Classification
- Sentiment Analysis
- Named Entity Recognition
- Question Answering
- Text Summarization

In [None]:
# Install required packages
!pip install pandas numpy scikit-learn transformers torch nltk spacy
!pip install matplotlib seaborn plotly
!pip install openpyxl xlrd  # For Excel file support

# Download NLTK data
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer tables required by newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')

# Download SpaCy model
!python -m spacy download en_core_web_sm

In [None]:
import pandas as pd
import numpy as np
import re
from typing import Dict, Any, List, Tuple
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch
from torch.utils.data import Dataset
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from datetime import datetime
import plotly.graph_objects as go
import plotly.express as px
from io import BytesIO
import base64

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("AutoNLP-Agent initialized successfully!")
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")

## Step 1: Upload Your Dataset

Upload your CSV, TXT, or Excel file containing text data.

In [None]:
from google.colab import files

# Upload file
uploaded = files.upload()

if uploaded:
    filename = list(uploaded.keys())[0]
    print(f"Uploaded file: {filename}")
    
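    # Determine file type and load data (extension match is case-insensitive)
    name_lower = filename.lower()
    if name_lower.endswith('.csv'):
        df = pd.read_csv(filename)
    elif name_lower.endswith(('.xlsx', '.xls')):
        df = pd.read_excel(filename)
    elif name_lower.endswith('.txt'):
        # .txt files are assumed to be tab-separated
        df = pd.read_csv(filename, sep='\t')
    else:
        raise ValueError(f"Unsupported file type: {filename}")
    
    print(f"Dataset shape: {df.shape}")
    print("\nFirst 5 rows:")
    print(df.head())
    print("\nColumn info:")
    df.info()  # info() prints directly and returns None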
else:
    print("No file uploaded. Using sample data instead.")
    
    # Sample sentiment analysis data
    sample_data = {
        'text': [
            'I love this product! It works perfectly.',
            'This is amazing quality and great value.',
            'Excellent customer service and fast delivery.',
            'Terrible product, complete waste of money.',
            'Poor quality and bad customer support.',
            'Awful experience, never buying again.',
            'Good product but arrived late.',
            'Decent quality for the price.',
            'Not bad, does what it says.',
            'Fantastic! Exceeded my expectations.'
        ],
        'sentiment': [
            'positive', 'positive', 'positive',
            'negative', 'negative', 'negative',
            'neutral', 'neutral', 'neutral', 'positive'
        ]
    }
    df = pd.DataFrame(sample_data)
    print("Using sample sentiment analysis data")
    print(df.head())
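
The upload cell picks a pandas reader from the file extension. As a minimal sketch of that dispatch, assuming a hypothetical helper `pick_reader` and lookup table `READERS` (neither is part of the notebook), the mapping can be written as a table and made case-insensitive, so `Reviews.CSV` is handled like `reviews.csv`:

```python
from pathlib import Path

# Hypothetical dispatch table: file suffix -> (pandas reader name, keyword arguments).
# The reader names mirror the pandas calls used in the upload cell.
READERS = {
    ".csv": ("read_csv", {}),
    ".txt": ("read_csv", {"sep": "\t"}),  # .txt is assumed to be tab-separated
    ".xlsx": ("read_excel", {}),
    ".xls": ("read_excel", {}),
}


def pick_reader(filename: str):
    """Return (reader_name, kwargs) for a filename, or raise for unknown types."""
    suffix = Path(filename).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix or filename}")
    return READERS[suffix]


print(pick_reader("Reviews.CSV"))
print(pick_reader("notes.txt"))
```

With pandas imported as `pd`, the result could be applied as `getattr(pd, name)(filename, **kwargs)`.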

## Step 2: Automatic Task Detection

The system automatically detects what type of NLP task your data represents.

In [None]:
def detect_nlp_task(df: pd.DataFrame) -> str:
    """Automatically detect the NLP task type from the dataset structure and content."""
    columns = df.columns.tolist()
    
    # Check for sentiment-related keywords
    sentiment_keywords = ['sentiment', 'polarity', 'emotion', 'feeling']
    if any(keyword in ' '.join(columns).lower() for keyword in sentiment_keywords):
        return 'sentiment_analysis'
    
    # Check for classification patterns
    text_columns = []
    label_columns = []
    
    for col in columns:
        col_lower = col.lower()
        if any(keyword in col_lower for keyword in ['text', 'content', 'review', 'comment']):
            text_columns.append(col)
        elif any(keyword in col_lower for keyword in ['label', 'target', 'class', 'category']):
            label_columns.append(col)
    
    if text_columns and label_columns:
        return 'classification'
    
    # Check content for sentiment indicators
    if len(columns) >= 2:
        sample_df = df.head(min(50, len(df)))
        
        for col in columns:
            if df[col].dtype == 'object':
                sample_texts = sample_df[col].dropna().astype(str).tolist()[:10]
                
                positive_words = ['good', 'great', 'excellent', 'amazing', 'love', 'best']
                negative_words = ['bad', 'terrible', 'awful', 'hate', 'worst', 'poor']
                
                has_sentiment = False
                for text in sample_texts:
                    text_lower = text.lower()
                    if any(word in text_lower for word in positive_words + negative_words):
                        has_sentiment = True
                        break
                
                if has_sentiment:
                    return 'sentiment_analysis'
    
    # Default to classification
    return 'classification'

# Detect task
detected_task = detect_nlp_task(df)
print(f"🔍 Detected NLP Task: {detected_task.upper()}")

# Identify columns: the first string column is treated as the text,
# and the first remaining column as the label
text_col = None
label_col = None

for col in df.columns:
    if df[col].dtype == 'object' and text_col is None:
        text_col = col
    elif label_col is None and col != text_col:
        label_col = col

if text_col is None or label_col is None:
    raise ValueError("Could not identify both a text column and a label column.")

print(f"📝 Text Column: {text_col}")
print(f"🏷️  Label Column: {label_col}")
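
The first two checks in `detect_nlp_task` look only at column names, and they match keywords as substrings. The sketch below re-implements just those two checks with the standard library (the function name `detect_from_columns` is hypothetical) to show the order of precedence: a sentiment keyword anywhere in the column names wins before text/label pairing is considered.

```python
SENTIMENT_KEYWORDS = ["sentiment", "polarity", "emotion", "feeling"]
TEXT_KEYWORDS = ["text", "content", "review", "comment"]
LABEL_KEYWORDS = ["label", "target", "class", "category"]


def detect_from_columns(columns):
    """Name-based part of the heuristic; returns None when names are inconclusive."""
    joined = " ".join(columns).lower()
    if any(keyword in joined for keyword in SENTIMENT_KEYWORDS):
        return "sentiment_analysis"

    text_columns, label_columns = [], []
    for col in columns:
        col_lower = col.lower()
        if any(keyword in col_lower for keyword in TEXT_KEYWORDS):
            text_columns.append(col)
        elif any(keyword in col_lower for keyword in LABEL_KEYWORDS):
            label_columns.append(col)

    if text_columns and label_columns:
        return "classification"
    return None  # the full function continues with content-based checks here


print(detect_from_columns(["review_text", "category"]))  # classification
print(detect_from_columns(["text", "sentiment"]))        # sentiment_analysis
print(detect_from_columns(["feelings_score", "label"]))  # sentiment_analysis
print(detect_from_columns(["id", "body"]))               # None
```

The third call shows the substring effect: `feelings_score` contains `feeling`, so the dataset is labelled as sentiment analysis even though it also has a `label` column.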

## Step 3: Data Preprocessing

Clean and prepare your text data for model training.

In [None]:
class TextPreprocessor:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        self.label_encoder = LabelEncoder()
    
    def preprocess_text(self, text: str) -> str:
        """Preprocess individual text"""
        if not isinstance(text, str):
            text = str(text)
        
        # Lowercase
        text = text.lower()
        
        # Remove special characters and digits
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        
        # Tokenize
        import nltk
        tokens = nltk.word_tokenize(text)
        
        # Remove stop words
        tokens = [token for token in tokens if token not in self.stop_words]
        
        # Lemmatize
        tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        
        return ' '.join(tokens)
    
    def preprocess_dataset(self, df: pd.DataFrame, text_col: str, label_col: str) -> Tuple[pd.DataFrame, Dict[str, Any]]:
        """Preprocess the entire dataset"""
        print("🧹 Starting data preprocessing...")
        original_shape = df.shape  # recorded before any rows or columns change
        
        # Handle missing values; copy so the caller's DataFrame is not modified
        df = df.dropna(subset=[text_col, label_col]).copy()
        print(f"✅ Dropped rows with missing values. Remaining rows: {len(df)}")
        
        # Preprocess text
        print("📝 Preprocessing text data...")
        df[f'{text_col}_processed'] = df[text_col].apply(self.preprocess_text)
        
        # Encode labels
        df[f'{label_col}_encoded'] = self.label_encoder.fit_transform(df[label_col])
        print(f"🏷️  Encoded {len(self.label_encoder.classes_)} classes: {list(self.label_encoder.classes_)}")
        
        # Add text features (cast to str so non-string cells do not break len)
        df[f'{text_col}_length'] = df[text_col].astype(str).str.len()
        df[f'{text_col}_word_count'] = df[text_col].apply(lambda x: len(str(x).split()))
        
        metadata = {
            'original_shape': original_shape,
            'text_column': text_col,
            'label_column': label_col,
            'processed_text_column': f'{text_col}_processed',
            'encoded_label_column': f'{label_col}_encoded',
            'classes': list(self.label_encoder.classes_),
            'num_classes': len(self.label_encoder.classes_)
        }
        
        print("✅ Preprocessing completed!")
        return df, metadata

# Preprocess data
preprocessor = TextPreprocessor()
processed_df, metadata = preprocessor.preprocess_dataset(df, text_col, label_col)

print("\n📊 Processed Dataset Info:")
print(f"Shape: {processed_df.shape}")
print(f"Classes: {metadata['classes']}")
print("\nSample processed data:")
print(processed_df[[text_col, f'{text_col}_processed', label_col, f'{label_col}_encoded']].head())
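
The cleaning regex in `preprocess_text` keeps only ASCII letters and whitespace, so digits, punctuation, and apostrophes are deleted rather than replaced by spaces. A minimal standard-library sketch of the first two steps (lowercasing and the regex; tokenization, stop-word removal, and lemmatization are left out, and the helper name `strip_non_letters` is hypothetical) makes the effect visible:

```python
import re


def strip_non_letters(text: str) -> str:
    """Lowercase, then delete every character that is not an ASCII letter or whitespace."""
    return re.sub(r"[^a-zA-Z\s]", "", text.lower())


cleaned = strip_non_letters("I don't LOVE it 100%!")
print(repr(cleaned))     # 'i dont love it '
print(cleaned.split())   # ['i', 'dont', 'love', 'it']

print(strip_non_letters("well-known café"))  # wellknown caf
```

Contractions collapse into a single token (`dont`), hyphenated words are joined, and accented letters are dropped, which matters when the data contains non-English text.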

## Step 4: Model Training

Train a machine learning model on your processed data.

In [None]:
class ModelTrainer:
    def __init__(self):
        self.models = {}
        self.vectorizers = {}
    
    def train_sklearn_model(self, X_train, y_train, model_type='logistic'):
        """Train a scikit-learn model"""
        print(f"🤖 Training {model_type} model...")
        
        # Vectorize text
        vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
        X_train_vec = vectorizer.fit_transform(X_train)
        
        # Choose model
        if model_type == 'logistic':
            model = LogisticRegression(random_state=42, max_iter=1000)
        elif model_type == 'random_forest':
            model = RandomForestClassifier(n_estimators=100, random_state=42)
        else:
            model = LogisticRegression(random_state=42, max_iter=1000)
        
        # Train
        model.fit(X_train_vec, y_train)
        
        model_id = f"sklearn_{model_type}_{datetime.now().strftime('%H%M%S')}"
        self.models[model_id] = model
        self.vectorizers[model_id] = vectorizer
        
        print("✅ Model trained successfully!")
        return model_id, model, vectorizer
    
    def train_transformer_model(self, X_train, y_train, num_labels):
        """Train a transformer model"""
        print("🚀 Training transformer model (this may take a while)...")
        
        # Load model and tokenizer
        model_name = "distilbert-base-uncased"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )
        
        # Prepare dataset
        class TextDataset(Dataset):
            def __init__(self, texts, labels, tokenizer):
                self.texts = texts
                self.labels = labels
                self.tokenizer = tokenizer
            
            def __len__(self):
                return len(self.texts)
            
            def __getitem__(self, idx):
                text = self.texts[idx]
                label = self.labels[idx]
                
                encoding = self.tokenizer(
                    text,
                    truncation=True,
                    padding='max_length',
                    max_length=256,
                    return_tensors='pt'
                )
                
                return {
                    'input_ids': encoding['input_ids'].flatten(),
                    'attention_mask': encoding['attention_mask'].flatten(),
                    'labels': torch.tensor(label, dtype=torch.long)
                }
        
        train_dataset = TextDataset(X_train.tolist(), y_train.tolist(), tokenizer)
        
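        # Training arguments. Evaluation is left at its default (disabled), which
        # avoids the evaluation_strategy/eval_strategy rename across transformers versions.
        training_args = TrainingArguments(
            output_dir='./results',
            num_train_epochs=2,
            per_device_train_batch_size=8,
            per_device_eval_batch_size=8,
            warmup_steps=100,
            weight_decay=0.01,
            logging_dir='./logs',
            logging_steps=50,
            save_steps=500,
            save_strategy="no",  # no checkpoints are written during training
        )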
        
        # Trainer
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
        )
        
        # Train
        trainer.train()
        
        model_id = f"transformer_{datetime.now().strftime('%H%M%S')}"
        self.models[model_id] = {
            'model': model,
            'tokenizer': tokenizer,
            'trainer': trainer
        }
        
        print("✅ Transformer model trained successfully!")
        return model_id

# Prepare training data
X = processed_df[metadata['processed_text_column']]
y = processed_df[metadata['encoded_label_column']]

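# Split data. Stratify only when it is feasible: every class needs at least two
# samples, and both splits need at least one slot per class.
test_size = 0.2
n_test = int(np.ceil(test_size * len(y)))
can_stratify = (
    y.value_counts().min() >= 2
    and n_test >= y.nunique()
    and len(y) - n_test >= y.nunique()
)
if not can_stratify:
    print("Dataset too small to stratify - using a plain random split")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=42, stratify=y if can_stratify else None
)

print(f"📊 Training set: {len(X_train)} samples")
print(f"📊 Test set: {len(X_test)} samples")

# Choose model type based on data size
trainer = ModelTrainer()

if len(X_train) < 1000:
    print("📏 Small dataset detected - using scikit-learn model")
    model_id, model, vectorizer = trainer.train_sklearn_model(X_train, y_train, 'logistic')
    model_type = 'sklearn'
else:
    print("📏 Large dataset detected - using transformer model")
    model_id = trainer.train_transformer_model(X_train, y_train, metadata['num_classes'])
    model_type = 'transformer'

print(f"🎯 Model trained with ID: {model_id}")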
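
A stratified split needs at least two samples per class and at least one slot per class in both the training and the test set; scikit-learn raises a `ValueError` otherwise. The arithmetic can be checked ahead of time with the standard library. The sketch below rounds the test-set size up, as scikit-learn does for a fractional `test_size`, and uses the label counts of the built-in sample data from Step 1 (the helper name `can_stratify` is hypothetical):

```python
import math
from collections import Counter


def can_stratify(labels, test_size: float) -> bool:
    """Check whether a stratified train/test split is feasible for these labels."""
    counts = Counter(labels)
    n_classes = len(counts)
    n_test = math.ceil(test_size * len(labels))
    n_train = len(labels) - n_test
    return (
        min(counts.values()) >= 2
        and n_test >= n_classes
        and n_train >= n_classes
    )


# Label counts of the sample sentiment data: 4 positive, 3 negative, 3 neutral
sample_labels = ["positive"] * 4 + ["negative"] * 3 + ["neutral"] * 3

print(can_stratify(sample_labels, test_size=0.2))  # False: 2 test rows, 3 classes
print(can_stratify(sample_labels, test_size=0.3))  # True: 3 test rows, 3 classes
```

With ten rows and `test_size=0.2`, the test set holds two rows, which cannot cover three classes.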

## Step 5: Model Evaluation

Evaluate your trained model with comprehensive metrics and visualizations.

In [None]:
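def evaluate_model(model_id: str, X_test, y_test, model_type: str):
    """Evaluate the trained model"""
    # Imported under an alias because this cell later rebinds the name
    # `confusion_matrix` to the resulting array; the alias keeps the cell re-runnable.
    from sklearn.metrics import confusion_matrix as compute_confusion_matrix
    
    print("📊 Evaluating model performance...")
    
    if model_type == 'sklearn':
        model = trainer.models[model_id]
        vectorizer = trainer.vectorizers[model_id]
        X_test_vec = vectorizer.transform(X_test)
        y_pred = model.predict(X_test_vec)
    else:
        # Transformer model
        model_info = trainer.models[model_id]
        model = model_info['model']
        tokenizer = model_info['tokenizer']
        model.eval()  # disable dropout for inference
        
        y_pred = []
        for text in X_test:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
            # Move inputs to the model's device (the Trainer may have placed it on the GPU)
            inputs = {k: v.to(model.device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = model(**inputs)
            y_pred.append(torch.argmax(outputs.logits, dim=1).item())
    
    # Calculate metrics; zero_division=0 scores classes that were never predicted as 0
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
    
    metrics = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1
    }
    
    # Confusion matrix over all known classes, so its shape matches the class list
    all_labels = list(range(len(preprocessor.label_encoder.classes_)))
    cm = compute_confusion_matrix(y_test, y_pred, labels=all_labels)
    
    print("✅ Evaluation completed!")
    return metrics, cm, y_pred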

# Evaluate model
metrics, confusion_matrix, y_pred = evaluate_model(model_id, X_test, y_test, model_type)

print("\n📈 Model Performance Metrics:")
for metric, value in metrics.items():
    print(f"{metric.upper()}: {value:.4f}")

# Decode predictions for display
y_test_decoded = preprocessor.label_encoder.inverse_transform(y_test)
y_pred_decoded = preprocessor.label_encoder.inverse_transform(y_pred)

print("\n📋 Classification Report:")
print(classification_report(y_test_decoded, y_pred_decoded))
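
`average='weighted'` computes each metric per class and then averages the results with weights proportional to each class's number of true samples (its support). The standard-library sketch below reproduces that calculation from a confusion matrix. The counts are made up, the helper name `weighted_scores` is hypothetical, and rows are assumed to be true labels with columns as predictions, as in scikit-learn:

```python
def weighted_scores(cm):
    """cm[i][j] = number of samples with true class i that were predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    precision = recall = f1 = 0.0
    for i in range(n_classes):
        tp = cm[i][i]
        support = sum(cm[i])                                 # true samples of class i
        predicted = sum(cm[r][i] for r in range(n_classes))  # samples predicted as class i
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(cm[i][i] for i in range(n_classes)) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1_score": f1}


scores = weighted_scores([[3, 1], [2, 4]])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

One consequence of the weighting is that weighted recall always equals accuracy, so the two values printed by this step carry the same information.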

## Step 6: Visualizations

Explore your results with interactive charts and visualizations.

In [None]:
# Create visualizations
def create_visualizations(metrics: Dict, confusion_matrix, y_test_decoded, y_pred_decoded):
    """Create comprehensive visualizations"""
    
    # 1. Metrics Bar Chart
    fig_metrics = go.Figure(data=[
        go.Bar(
            x=list(metrics.keys()),
            y=list(metrics.values()),
            marker_color='lightblue'
        )
    ])
    fig_metrics.update_layout(
        title='Model Performance Metrics',
        xaxis_title='Metric',
        yaxis_title='Score',
        yaxis_range=[0, 1]
    )
    fig_metrics.show()
    
    # 2. Confusion Matrix Heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(
        confusion_matrix, 
        annot=True, 
        fmt='d', 
        cmap='Blues',
        xticklabels=metadata['classes'],
        yticklabels=metadata['classes']
    )
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()
    
    # 3. Class Distribution
    unique_true, counts_true = np.unique(y_test_decoded, return_counts=True)
    unique_pred, counts_pred = np.unique(y_pred_decoded, return_counts=True)
    
    fig_dist = go.Figure()
    fig_dist.add_trace(go.Bar(
        name='True Labels',
        x=unique_true,
        y=counts_true,
        marker_color='lightgreen'
    ))
    fig_dist.add_trace(go.Bar(
        name='Predicted Labels',
        x=unique_pred,
        y=counts_pred,
        marker_color='lightcoral'
    ))
    fig_dist.update_layout(
        title='Class Distribution: True vs Predicted',
        xaxis_title='Class',
        yaxis_title='Count',
        barmode='group'
    )
    fig_dist.show()
    
    # 4. Text Length Distribution
    plt.figure(figsize=(10, 6))
    plt.hist(processed_df[f'{text_col}_length'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
    plt.title('Text Length Distribution')
    plt.xlabel('Text Length (characters)')
    plt.ylabel('Frequency')
    plt.show()

# Generate visualizations
create_visualizations(metrics, confusion_matrix, y_test_decoded, y_pred_decoded)

## Step 7: Test Your Model

Try your trained model on new text examples.

In [None]:
def predict_text(text: str, model_id: str, model_type: str) -> str:
    """Make prediction on new text"""
    # Preprocess text
    processed_text = preprocessor.preprocess_text(text)
    
    if model_type == 'sklearn':
        model = trainer.models[model_id]
        vectorizer = trainer.vectorizers[model_id]
        text_vec = vectorizer.transform([processed_text])
        prediction = model.predict(text_vec)[0]
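    else:
        # Transformer model
        model_info = trainer.models[model_id]
        model = model_info['model']
        tokenizer = model_info['tokenizer']
        model.eval()  # disable dropout for inference
        
        inputs = tokenizer(processed_text, return_tensors="pt", truncation=True, padding=True)
        # Move inputs to the model's device (the Trainer may have placed it on the GPU)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model(**inputs)
            prediction = torch.argmax(outputs.logits, dim=1).item()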
    
    # Decode prediction
    predicted_class = preprocessor.label_encoder.inverse_transform([prediction])[0]
    return predicted_class

# Test examples
test_texts = [
    "This product is amazing! I love it!",
    "Terrible quality, waste of money.",
    "It's okay, nothing special.",
    "Best purchase I've ever made!",
    "Poor customer service and defective item."
]

print("🧪 Testing your trained model:\n")

for text in test_texts:
    prediction = predict_text(text, model_id, model_type)
    print(f"Text: {text}")
    print(f"Prediction: {str(prediction).upper()}")  # str() also handles numeric labels
    print("-" * 50)
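
Both prediction paths return only the top class. If a confidence score is also wanted, the transformer's logits can be turned into probabilities with a softmax. The sketch below uses plain Python with made-up logits and an illustrative class order; for the scikit-learn path, `predict_proba` on the fitted `LogisticRegression` plays the same role.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["negative", "neutral", "positive"]  # illustrative class order
logits = [-1.0, 0.5, 2.0]                      # made-up model outputs
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{classes[best]} ({probs[best]:.2f})")  # positive (0.79)
```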

## Summary

Congratulations! You've successfully used AutoNLP-Agent to:

✅ **Upload and analyze** your dataset  
✅ **Automatically detect** the NLP task type  
✅ **Preprocess and clean** your text data  
✅ **Train a machine learning model** (the transformer path uses a GPU when one is available)  
✅ **Evaluate performance** with comprehensive metrics  
✅ **Visualize results** with interactive charts  
✅ **Test predictions** on new text examples  

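### Key Features Demonstrated:
- **No-code NLP**: Upload a dataset and run the cells; no code changes are needed
- **Automatic task detection**: Column names and sample content decide between sentiment analysis and general classification
- **GPU acceleration**: Transformer training uses Colab's GPU when one is available; the scikit-learn path runs on the CPU
- **Comprehensive evaluation**: Accuracy, weighted precision, recall, and F1, plus a confusion matrix and distribution charts
- **Reusable models**: Trained models and vectorizers stay available in `trainer.models` and `trainer.vectorizers`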

### Model Performance:
The accuracy, F1-score, precision, and recall for your run are printed in Step 5. The last three are weighted averages over the classes.

### Next Steps:
1. **Improve performance**: Try different models or hyperparameters
2. **Deploy model**: Export for production use
3. **Scale up**: Process larger datasets
4. **Customize**: Add domain-specific preprocessing

**AutoNLP-Agent** - Democratizing NLP through automation and simplicity! 🚀