# Sentiment Analysis Classification with Hugging Face

## NoLimit Indonesia - Data Scientist Hiring Test

**Objective:** Build NLP solutions using Hugging Face models and embeddings with clean delivery and clear workflow.

**Task:** Classification - Sentiment Analysis on Movie Reviews

**Author:** Ferdiansyah Muhammad Agung

---

### Project Overview

This notebook demonstrates a complete sentiment analysis pipeline using:
- **Hugging Face Transformers** for sentiment classification
- **Sentence-Transformers** for text embeddings
- **FAISS** for similarity search
- **Comprehensive evaluation** and visualization

### Pipeline Steps
1. Data Loading and Exploration
2. Model Setup and Configuration
3. Sentiment Classification
4. Embeddings Creation
5. FAISS Similarity Search
6. Model Evaluation
7. Results Visualization

## 1. Setup and Import Libraries

In [None]:
# Install required packages if not already installed
!pip install transformers sentence-transformers torch datasets pandas numpy scikit-learn matplotlib seaborn plotly faiss-cpu tqdm
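
The install cell above runs `pip` on every execution. As a minimal sketch (not part of the original notebook), the standard library can report which packages are not yet importable, so the install can be skipped when nothing is missing. The helper name `missing_packages` is hypothetical. The mapping is needed because some distributions import under a different name, for example `faiss-cpu` imports as `faiss` and `scikit-learn` as `sklearn`.

```python
import importlib.util

# Import name -> pip distribution name (they differ for some packages)
REQUIRED = {
    "transformers": "transformers",
    "sentence_transformers": "sentence-transformers",
    "torch": "torch",
    "sklearn": "scikit-learn",
    "faiss": "faiss-cpu",
    "plotly": "plotly",
}

def missing_packages(required):
    """Return pip names of packages whose import name cannot be found."""
    return [pip_name for import_name, pip_name in required.items()
            if importlib.util.find_spec(import_name) is None]

print("Missing packages:", missing_packages(REQUIRED) or "none")
```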

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# ML and NLP libraries
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from sentence_transformers import SentenceTransformer
import faiss
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# Utilities
import warnings
import logging
import os
import sys
from tqdm import tqdm
import time

# Add src to path for imports
sys.path.append('../src')

# Configure warnings and logging
warnings.filterwarnings('ignore')
logging.basicConfig(level=logging.INFO)

# Set plot style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device available: {'GPU' if torch.cuda.is_available() else 'CPU'}")

## 2. Data Loading and Exploration

In [None]:
# Load the sample dataset
data_path = '../data/sample_reviews.csv'
df = pd.read_csv(data_path)

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst 5 rows:")
df.head()
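
The later cells assume the CSV provides a `text` column and a `label` column. The sketch below is a hypothetical pre-check for that assumption, written with the standard `csv` module so it runs without the notebook's data file. The helper name and the two demo rows are made up for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "label"}  # columns the later cells access

def check_reviews_csv(handle):
    """Validate the header and count rows with an empty text or label."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    total = empty = 0
    for row in reader:
        total += 1
        if not (row["text"] or "").strip() or not (row["label"] or "").strip():
            empty += 1
    return {"rows": total, "rows_with_empty_fields": empty}

# Tiny in-memory stand-in for ../data/sample_reviews.csv (made-up rows)
demo = io.StringIO("text,label\nGreat film,positive\n,negative\n")
print(check_reviews_csv(demo))
```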

In [None]:
# Dataset statistics
print("📊 Dataset Statistics")
print("=" * 50)
print(f"Total samples: {len(df)}")
print(f"\nLabel distribution:")
label_counts = df['label'].value_counts()
print(label_counts)

# Text length statistics
df['text_length'] = df['text'].str.len()
print(f"\nText length statistics:")
print(df['text_length'].describe())

# Visualize label distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Bar plot
label_counts.plot(kind='bar', ax=axes[0], color=['#ff6b6b', '#4ecdc4', '#45b7d1'])
axes[0].set_title('Label Distribution')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

# Text length distribution
df.boxplot(column='text_length', by='label', ax=axes[1])
axes[1].set_title('Text Length by Sentiment')
axes[1].set_xlabel('Sentiment')
axes[1].set_ylabel('Text Length')

plt.tight_layout()
plt.show()

## 3. Model Setup and Configuration

In [None]:
# Import our custom sentiment classifier
from models.sentiment_classifier import SentimentClassifier

# Initialize the sentiment classifier
print("🤖 Initializing Sentiment Classifier...")
print("Loading models from Hugging Face...")

classifier = SentimentClassifier(
    classification_model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    embedding_model="all-MiniLM-L6-v2"
)

print("✅ Models loaded successfully!")
print(f"Classification model: {classifier.classification_model_name}")
print(f"Embedding model: {classifier.embedding_model_name}")

## 4. Sentiment Classification

In [None]:
# Test single prediction
test_text = "This movie is absolutely fantastic! I loved every minute of it."
result = classifier.predict_sentiment(test_text)

print("🧪 Single Prediction Test")
print("=" * 50)
print(f"Text: {result['text']}")
print(f"Predicted Sentiment: {result['sentiment']}")
print(f"Confidence: {result['confidence']:.4f}")
print(f"All Scores: {result['all_scores']}")
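
The `confidence` and `all_scores` fields printed above come from the custom `SentimentClassifier`, whose code is not shown here. A common way to obtain such values from a sequence-classification model is to apply a softmax to its output logits and take the highest-probability class. The sketch below illustrates that idea in plain Python. The logits and the label order (`negative`, `neutral`, `positive`) are assumptions for illustration, not values read from the model.

```python
import math

LABELS = ["negative", "neutral", "positive"]  # assumed label order

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(text, logits, labels=LABELS):
    """Build a result dict with the same keys the notebook prints."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "text": text,
        "sentiment": labels[best],
        "confidence": probs[best],
        "all_scores": dict(zip(labels, probs)),
    }

example = to_prediction("Loved it!", [-1.2, 0.3, 2.9])  # hypothetical logits
print(example["sentiment"], round(example["confidence"], 4))
```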

In [None]:
# Batch prediction on sample data
print("🚀 Running batch predictions...")
sample_texts = df['text'].tolist()[:10]  # Test on first 10 samples

batch_results = classifier.predict_batch(sample_texts)

# Create results dataframe
results_df = pd.DataFrame(batch_results)
print("\n📋 Batch Prediction Results (First 10 samples):")
for i, result in enumerate(batch_results):
    print(f"\n{i+1}. Text: {result['text'][:60]}...")
    print(f"   Sentiment: {result['sentiment']} (Confidence: {result['confidence']:.3f})")
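
How `predict_batch` groups its inputs is internal to the custom class. Batched inference generally works by splitting the texts into fixed-size chunks so that each forward pass handles several inputs at once. A minimal sketch of such chunking, with a hypothetical helper name and `batch_size` parameter:

```python
def iter_batches(items, batch_size=32):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"review {i}" for i in range(10)]
for batch in iter_batches(texts, batch_size=4):
    print(len(batch), batch[0])
```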

## 5. Embeddings Creation and FAISS Index Building

In [None]:
# Create embeddings for all texts
print("🔮 Creating embeddings for all texts...")
all_texts = df['text'].tolist()

start_time = time.time()
embeddings = classifier.create_embeddings(all_texts)
end_time = time.time()

print("✅ Embeddings created!")
print(f"Embeddings shape: {embeddings.shape}")
print(f"Time taken: {end_time - start_time:.2f} seconds")

# Build FAISS index
print("\n🔍 Building FAISS index for similarity search...")
classifier.build_faiss_index(all_texts, embeddings)
print("✅ FAISS index built successfully!")
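
The index construction happens inside `build_faiss_index`, which is not shown. One common setup, assumed here rather than confirmed by the source, is to L2-normalize the embeddings and search by inner product, which then equals cosine similarity. The brute-force sketch below computes that ranking in plain Python on toy 3-dimensional vectors. A flat inner-product FAISS index over normalized vectors yields the same ordering, only faster. The result keys mirror the ones read in the similarity-search cell.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k_similar(query, vectors, texts, k=5):
    """Rank texts by cosine similarity to the query (exhaustive search)."""
    q = l2_normalize(query)
    scored = []
    for vec, text in zip(vectors, texts):
        score = sum(a * b for a, b in zip(q, l2_normalize(vec)))
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"rank": i + 1, "similarity_score": s, "text": t}
            for i, (s, t) in enumerate(scored[:k])]

# Toy "embeddings" (made up); real ones from all-MiniLM-L6-v2 have 384 dimensions
texts = ["great movie", "awful movie", "wonderful film"]
vectors = [[0.9, 0.1, 0.0], [-0.8, 0.2, 0.1], [0.8, 0.3, 0.1]]
for hit in top_k_similar([1.0, 0.2, 0.0], vectors, texts, k=2):
    print(hit["rank"], hit["text"], round(hit["similarity_score"], 3))
```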

## 6. Similarity Search Demonstration

In [None]:
# Test similarity search
query_text = "This movie is amazing and beautiful!"

print(f"🔍 Finding similar texts for: '{query_text}'")
print("=" * 80)

similar_texts = classifier.find_similar_texts(query_text, k=5)

for i, result in enumerate(similar_texts):
    print(f"\n{result['rank']}. Similarity Score: {result['similarity_score']:.4f}")
    print(f"   Text: {result['text']}")
    
# Combined analysis
print("\n" + "=" * 80)
print("🎯 Combined Sentiment Analysis + Similarity Search")
combined_result = classifier.analyze_sentiment_with_similarity(query_text, k=3)

sentiment_analysis = combined_result['sentiment_analysis']
print(f"\nQuery: {query_text}")
print(f"Sentiment: {sentiment_analysis['sentiment']} (Confidence: {sentiment_analysis['confidence']:.4f})")
print(f"\nTop 3 Similar Texts:")
for result in combined_result['similar_texts']:
    print(f"  • {result['text']} (Score: {result['similarity_score']:.4f})")

## 7. Model Evaluation

In [None]:
# Full dataset prediction for evaluation
print("📊 Evaluating model on full dataset...")

# Get predictions for all samples
all_predictions = []
all_confidences = []

for text in tqdm(df['text'], desc="Predicting"):
    result = classifier.predict_sentiment(text)
    all_predictions.append(result['sentiment'])
    all_confidences.append(result['confidence'])

# Add predictions to dataframe
df['predicted_sentiment'] = all_predictions
df['confidence'] = all_confidences

# Calculate accuracy
accuracy = accuracy_score(df['label'], df['predicted_sentiment'])
print(f"\n🎯 Overall Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")

# Detailed classification report
report = classification_report(df['label'], df['predicted_sentiment'], output_dict=True)
print("\n📋 Classification Report:")
print(classification_report(df['label'], df['predicted_sentiment']))
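
`accuracy_score` compares labels by exact string match, so a dataset that stores `Positive` while the model returns `positive` would count every such row as wrong. Whether `sample_reviews.csv` needs this is not known from the notebook. The sketch below only shows a defensive normalization step together with the accuracy definition, using made-up labels and hypothetical helper names.

```python
def normalize_label(label):
    """Lower-case and trim a label so that 'Positive ' matches 'positive'."""
    return str(label).strip().lower()

def accuracy(y_true, y_pred):
    """Fraction of positions where the normalized labels agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(normalize_label(t) == normalize_label(p)
                  for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

demo_true = ["Positive", "negative", "Neutral", "positive"]   # made-up labels
demo_pred = ["positive", "negative", "positive", "positive"]
print(f"Accuracy: {accuracy(demo_true, demo_pred):.2f}")
```

In the notebook, the equivalent step would be to normalize both `df['label']` and `df['predicted_sentiment']` before calling `accuracy_score`.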

## 8. Results Visualization

In [None]:
# Confusion Matrix
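# Use one explicit label order for both the matrix and the tick labels,
# including any predicted class that never appears in the ground truth
labels = sorted(set(df['label']) | set(df['predicted_sentiment']))
cm = confusion_matrix(df['label'], df['predicted_sentiment'], labels=labels)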

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=labels, yticklabels=labels)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Confidence distribution by sentiment
plt.figure(figsize=(12, 6))
for sentiment in df['label'].unique():
    subset = df[df['label'] == sentiment]
    plt.hist(subset['confidence'], alpha=0.7, label=f'{sentiment}', bins=20)

plt.xlabel('Confidence Score')
plt.ylabel('Frequency')
plt.title('Confidence Score Distribution by Sentiment')
plt.legend()
plt.show()
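
A confusion matrix is only readable if its rows and columns follow one known label order. The sketch below builds the matrix by hand from made-up labels with an explicit order, then derives per-class recall from it. Recall is also what a per-sentiment accuracy amounts to when rows are grouped by their true label. The function names are hypothetical; in the notebook, the `labels` argument of scikit-learn's `confusion_matrix` serves the same purpose.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def recall_per_class(matrix, labels):
    """Diagonal count divided by the row total for each true label."""
    return {label: (row[i] / sum(row) if sum(row) else 0.0)
            for i, (label, row) in enumerate(zip(labels, matrix))}

demo_labels = ["negative", "neutral", "positive"]
demo_true = ["positive", "negative", "neutral", "positive", "negative"]  # made up
demo_pred = ["positive", "negative", "positive", "positive", "neutral"]
demo_cm = confusion_counts(demo_true, demo_pred, demo_labels)
print(demo_cm)
print(recall_per_class(demo_cm, demo_labels))
```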

In [None]:
# Interactive visualization with Plotly
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Accuracy by Sentiment', 'Confidence Distribution', 
                   'Prediction vs Actual', 'Sample Predictions'),
    specs=[[{"type": "bar"}, {"type": "histogram"}],
           [{"type": "scatter"}, {"type": "table"}]]
)

# Accuracy by sentiment
accuracy_by_sentiment = df.groupby('label').apply(
    lambda x: accuracy_score(x['label'], x['predicted_sentiment'])
)

fig.add_trace(
    go.Bar(x=accuracy_by_sentiment.index, y=accuracy_by_sentiment.values,
           name='Accuracy', marker_color='lightblue'),
    row=1, col=1
)

# Confidence distribution
fig.add_trace(
    go.Histogram(x=df['confidence'], nbinsx=20, name='Confidence',
                marker_color='lightgreen'),
    row=1, col=2
)

# Sample predictions table
# Copy the slice so the truncation below does not trigger SettingWithCopyWarning
sample_df = df.head(10)[['text', 'label', 'predicted_sentiment', 'confidence']].copy()
sample_df['text'] = sample_df['text'].str[:50] + '...'

fig.add_trace(
    go.Table(
        header=dict(values=list(sample_df.columns),
                   fill_color='paleturquoise',
                   align='left'),
        cells=dict(values=[sample_df[col] for col in sample_df.columns],
                  fill_color='lavender',
                  align='left')
    ),
    row=2, col=2
)

fig.update_layout(height=800, title_text="Sentiment Analysis Results Dashboard")
fig.show()

## 9. Embeddings Visualization with t-SNE

In [None]:
# Reduce dimensionality for visualization
print("🎨 Creating t-SNE visualization of embeddings...")

# Use a subset for faster computation
n_samples = min(len(embeddings), 100)
subset_embeddings = embeddings[:n_samples]
subset_labels = df['label'][:n_samples]
subset_predictions = df['predicted_sentiment'][:n_samples]

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, n_samples-1))
embeddings_2d = tsne.fit_transform(subset_embeddings)

# Create visualization dataframe
viz_df = pd.DataFrame({
    'x': embeddings_2d[:, 0],
    'y': embeddings_2d[:, 1],
    'actual_label': subset_labels,
    'predicted_label': subset_predictions,
    'text': df['text'][:n_samples].str[:100] + '...'
})

# Interactive t-SNE plot
fig = px.scatter(
    viz_df, x='x', y='y', 
    color='actual_label',
    symbol='predicted_label',
    hover_data=['text'],
    title='t-SNE Visualization of Text Embeddings',
    # Plotly Express expects column names as keys in `labels`
    labels={'actual_label': 'Actual Sentiment', 'predicted_label': 'Predicted Sentiment'}
)

fig.update_layout(width=800, height=600)
fig.show()

print(f"✅ Visualized {n_samples} samples in 2D space")
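
Recent versions of scikit-learn require the `perplexity` of `TSNE` to be smaller than the number of samples, which is why the cell caps it at `n_samples - 1`. A small hypothetical helper makes that rule explicit and fails early for inputs too small to embed:

```python
def safe_perplexity(n_samples, preferred=30.0):
    """Return a perplexity strictly below n_samples, capped at `preferred`."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return float(min(preferred, n_samples - 1))

for n in (2, 20, 500):
    print(n, safe_perplexity(n))
```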

## 10. Example Predictions and Analysis

In [None]:
# Show some example predictions with explanations
example_texts = [
    "This movie is absolutely incredible! Best film ever!",
    "Terrible movie, complete waste of time and money.",
    "The movie was okay, nothing special but watchable.",
    "I'm not sure what to think about this film.",
    "Mixed feelings about this one - some good, some bad parts."
]

print("🎬 Example Predictions with Similarity Search")
print("=" * 80)

for i, text in enumerate(example_texts):
    print(f"\n{i+1}. Text: '{text}'")
    
    # Get sentiment and similar texts
    result = classifier.analyze_sentiment_with_similarity(text, k=2)
    
    sentiment = result['sentiment_analysis']
    print(f"   Sentiment: {sentiment['sentiment']} (Confidence: {sentiment['confidence']:.4f})")
    
    if result['similar_texts']:
        print("   Similar texts from dataset:")
        for sim_result in result['similar_texts']:
            print(f"     • {sim_result['text']} (Score: {sim_result['similarity_score']:.4f})")
    
    print("-" * 80)

## 11. Save Model and Results

In [None]:
# Save the model and FAISS index
model_save_path = '../models/trained_sentiment_model'
classifier.save_model(model_save_path)

# Save results to CSV
results_path = '../data/prediction_results.csv'
df.to_csv(results_path, index=False)

print(f"✅ Model saved to: {model_save_path}")
print(f"✅ Results saved to: {results_path}")

# Summary statistics
print("\n📊 Final Summary")
print("=" * 50)
print(f"Dataset size: {len(df)} samples")
print(f"Overall accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Average confidence: {df['confidence'].mean():.4f}")
print(f"Classification model: {classifier.classification_model_name}")
print(f"Embedding model: {classifier.embedding_model_name}")
print(f"Embedding dimensions: {embeddings.shape[1]}")
print(f"FAISS index size: {len(classifier.texts_database)} texts")
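
The summary above is only printed. As an optional sketch, the same figures could be written to a small JSON file so they survive the session. The file name, the helper name, and the values in the usage example are placeholders, not results from this notebook.

```python
import json
import os
import tempfile

def save_summary(path, summary):
    """Write a run summary as pretty-printed JSON and return the path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(summary, handle, indent=2, sort_keys=True)
    return path

# Placeholder values for illustration only
summary = {
    "dataset_size": 0,
    "accuracy": None,
    "classification_model": "cardiffnlp/twitter-roberta-base-sentiment-latest",
    "embedding_model": "all-MiniLM-L6-v2",
}
out_path = save_summary(os.path.join(tempfile.gettempdir(), "run_summary.json"), summary)
with open(out_path, encoding="utf-8") as handle:
    print(json.load(handle)["embedding_model"])
```

In the notebook, `len(df)` and `accuracy` would supply the real values.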

## 12. Conclusion

### Key Achievements

✅ **Hugging Face Models Integration**
- Successfully implemented sentiment classification using `cardiffnlp/twitter-roberta-base-sentiment-latest`
- Integrated sentence embeddings using `all-MiniLM-L6-v2`

✅ **Embeddings and Similarity Search**
- Created sentence embeddings for all samples
- Built FAISS index for fast similarity search
- Demonstrated combined sentiment analysis + similarity search

✅ **Comprehensive Evaluation**
- Measured overall accuracy on the sample dataset (reported in the Model Evaluation section)
- Provided detailed classification metrics
- Visualized results with confusion matrix and confidence distributions

✅ **Professional Workflow**
- Clean, modular code structure
- Comprehensive error handling
- Detailed documentation and visualizations

### Next Steps
- Deploy as a Streamlit application
- Deploy to Hugging Face Spaces
- Fine-tune the classification model on domain-specific data

### Technologies Used
- **Models:** Hugging Face Transformers, Sentence-Transformers
- **Search:** FAISS vector similarity search
- **Visualization:** Matplotlib, Seaborn, Plotly
- **Evaluation:** Scikit-learn metrics
- **Embeddings Visualization:** t-SNE dimensionality reduction