# Sentiment Analysis Prototype

- Author: galafis
- Date: 2025-09-30
- Objective: Prototype sentiment classification on financial news/text with synthetic data and lightweight models.

## Context & motivation

This prototype explores sentiment analysis for market intelligence. Key questions:
- Can we classify headlines as positive/neutral/negative?
- What accuracy do rule-based methods achieve compared with simple pretrained models?
- How would the approach scale and integrate into production?

## Dependencies

We use lightweight libraries for rapid prototyping:

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# NLP tools
try:
    from transformers import pipeline
    sentiment_pipeline_available = True
except Exception:
    sentiment_pipeline_available = False

# Configure display
sns.set_style('whitegrid')
%matplotlib inline

print('Environment ready.')

## Synthetic dataset

Generate sample financial headlines with labeled sentiment:

In [None]:
# Synthetic headlines
data = [
    {'text': 'Company reports record quarterly earnings', 'label': 'positive'},
    {'text': 'Stock plummets on weak guidance', 'label': 'negative'},
    {'text': 'Analysts maintain neutral outlook', 'label': 'neutral'},
    {'text': 'Revenue exceeds expectations significantly', 'label': 'positive'},
    {'text': 'CEO resigns amid investigation', 'label': 'negative'},
    {'text': 'Market closes flat after mixed data', 'label': 'neutral'},
    {'text': 'New partnership boosts investor confidence', 'label': 'positive'},
    {'text': 'Profit margin shrinks in recent quarter', 'label': 'negative'},
    {'text': 'Shares trade sideways pending announcement', 'label': 'neutral'},
    {'text': 'Breakthrough product launch drives demand', 'label': 'positive'},
]

df = pd.DataFrame(data)
print(f'Sample size: {len(df)}')
display(df)
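
As a reference point for the accuracies reported later, a majority-class baseline (always predicting the most frequent label) can be computed from the labels alone. The sketch below uses only the standard library and copies the ten labels from the synthetic dataset; `majority_baseline` is a hypothetical helper name, not part of the notebook.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    if not labels:
        raise ValueError('labels must be non-empty')
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Labels copied from the synthetic dataset, in row order.
labels = ['positive', 'negative', 'neutral', 'positive', 'negative',
          'neutral', 'positive', 'negative', 'neutral', 'positive']

majority_label, majority_accuracy = majority_baseline(labels)
print(f'Majority baseline: always "{majority_label}" -> {majority_accuracy:.2%}')
```

Any classifier worth keeping should beat this number on the same data.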

## Rule-based baseline

Simple keyword matching as a quick baseline:

In [None]:
# Basic sentiment lexicons
positive_words = {'record', 'exceeds', 'boosts', 'breakthrough', 'demand', 'confidence'}
negative_words = {'plummets', 'weak', 'resigns', 'investigation', 'shrinks'}

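def rule_based_sentiment(text: str) -> str:
    """Classify text by counting lexicon words; ties (including no matches) are neutral."""
    # Note: `w in text_lower` is a substring test, so 'weak' also matches 'tweak'.
    text_lower = text.lower()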
    pos_count = sum(1 for w in positive_words if w in text_lower)
    neg_count = sum(1 for w in negative_words if w in text_lower)
    if pos_count > neg_count:
        return 'positive'
    elif neg_count > pos_count:
        return 'negative'
    return 'neutral'

df['rule_pred'] = df['text'].apply(rule_based_sentiment)
rule_accuracy = (df['rule_pred'] == df['label']).mean()
print(f'Rule-based accuracy: {rule_accuracy:.2%}')
display(df[['text', 'label', 'rule_pred']])
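
To match whole words rather than substrings, the baseline can compare token sets instead. The following is a minimal sketch under that assumption: the lexicons are copied from the cell above, while `tokenize` and `token_sentiment` are illustrative names, and the simple regular expression is not a full tokenizer.

```python
import re

POSITIVE_WORDS = {'record', 'exceeds', 'boosts', 'breakthrough', 'demand', 'confidence'}
NEGATIVE_WORDS = {'plummets', 'weak', 'resigns', 'investigation', 'shrinks'}


def tokenize(text):
    """Lowercase the text and return the set of alphabetic word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def token_sentiment(text):
    """Classify by the number of distinct lexicon words present as whole tokens."""
    tokens = tokenize(text)
    pos_count = len(tokens & POSITIVE_WORDS)
    neg_count = len(tokens & NEGATIVE_WORDS)
    if pos_count > neg_count:
        return 'positive'
    if neg_count > pos_count:
        return 'negative'
    return 'neutral'


print(token_sentiment('Stock plummets on weak guidance'))
# 'tweak' contains 'weak' as a substring but is a different token.
print(token_sentiment('Engineers tweak the release schedule'))
```

Whole-token matching trades recall for precision: it no longer fires inside unrelated words, but it also misses inflected forms such as "weakness" unless they are added to the lexicon or the text is stemmed.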

## Pretrained transformer (optional)

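Use the Hugging Face `sentiment-analysis` pipeline for comparison. The SST-2 checkpoint loaded below is binary (`POSITIVE`/`NEGATIVE`), so it never predicts `neutral`: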

In [None]:
if sentiment_pipeline_available:
    # Initialize pretrained model
    classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

    def transform_label(hf_label: str) -> str:
        # Map HuggingFace labels to our schema
        if hf_label == 'POSITIVE':
            return 'positive'
        elif hf_label == 'NEGATIVE':
            return 'negative'
        return 'neutral'

    predictions = classifier(df['text'].tolist())
    df['transformer_pred'] = [transform_label(p['label']) for p in predictions]
    transformer_accuracy = (df['transformer_pred'] == df['label']).mean()
    print(f'Transformer accuracy: {transformer_accuracy:.2%}')
    display(df[['text', 'label', 'transformer_pred']])
else:
    print('Transformers not available; skipping pretrained model.')
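
One possible workaround for a binary model on three-class data is to treat low-confidence predictions as neutral. The sketch below assumes predictions arrive as dicts with `label` and `score` keys, the shape the pipeline returns; `to_three_class` and the 0.75 cutoff are illustrative and untuned, and the example predictions are hand-written, not real model output.

```python
def to_three_class(prediction, neutral_below=0.75):
    """Map a binary prediction dict to 'positive', 'negative', or 'neutral'.

    Predictions whose confidence score is below `neutral_below` become 'neutral';
    otherwise the model's label is lowercased to match the notebook's schema.
    """
    if prediction['score'] < neutral_below:
        return 'neutral'
    return prediction['label'].lower()


# Hand-written examples in the pipeline's output format (not real model scores).
example_predictions = [
    {'label': 'POSITIVE', 'score': 0.98},
    {'label': 'NEGATIVE', 'score': 0.55},
    {'label': 'NEGATIVE', 'score': 0.91},
]

mapped = [to_three_class(p) for p in example_predictions]
print(mapped)
```

Whether this helps depends on how well calibrated the model's scores are, so the cutoff would need tuning on labeled data before being relied on.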

## Results visualization

Compare the ground-truth label distribution with the rule-based predictions:

In [None]:
# Distribution of labels, with a shared category order so the panels are comparable
label_order = ['positive', 'neutral', 'negative']
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

df['label'].value_counts().reindex(label_order, fill_value=0).plot(kind='bar', ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Ground Truth Distribution')
axes[0].set_xlabel('Sentiment')
axes[0].set_ylabel('Count')

df['rule_pred'].value_counts().reindex(label_order, fill_value=0).plot(kind='bar', ax=axes[1], color='coral', edgecolor='black')
axes[1].set_title('Rule-Based Predictions')
axes[1].set_xlabel('Sentiment')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()
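
Bar charts of marginal distributions can look identical even when individual predictions are wrong, so a confusion table is a useful complement. The sketch below uses only the standard library; `confusion_counts` and `format_confusion` are hypothetical helpers, and the inputs are illustrative rather than results from this notebook. With pandas, `pd.crosstab(df['label'], df['rule_pred'])` should give an equivalent table.

```python
from collections import Counter

LABELS = ['positive', 'neutral', 'negative']


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    return Counter(zip(y_true, y_pred))


def format_confusion(counts, labels=LABELS):
    """Render counts as text: rows are true labels, columns are predictions."""
    header = f"{'true/pred':<10}" + ''.join(f'{p:>10}' for p in labels)
    rows = [header]
    for t in labels:
        rows.append(f'{t:<10}' + ''.join(f'{counts[(t, p)]:>10}' for p in labels))
    return '\n'.join(rows)


# Illustrative inputs, not results from the notebook.
y_true = ['positive', 'negative', 'neutral', 'neutral']
y_pred = ['positive', 'negative', 'positive', 'neutral']

counts = confusion_counts(y_true, y_pred)
print(format_confusion(counts))
```

Off-diagonal cells show which classes are confused with each other, which a single accuracy figure hides.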

## Findings and next steps

### Key takeaways
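- The rule-based approach gives a baseline with minimal code, but its lexicon words all come from these same headlines, so its accuracy here is optimistic
- Pretrained transformers (if available) typically outperform keyword rules on nuanced text, though the SST-2 checkpoint used here is binary and cannot predict `neutral`
- Synthetic data validates the workflow only; real headlines will be more complex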

### Recommendations
1. Collect labeled dataset of real financial headlines
2. Fine-tune domain-specific model on financial text
3. Integrate sentiment scores into market intelligence pipeline
4. Extract reusable preprocessing/classification helpers to `src/` module
5. Monitor model drift and retrain periodically
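
For recommendation 5, one simple drift signal is a shift in the mix of predicted labels between a reference window and a recent window. The sketch below measures that shift with total variation distance; `label_distribution`, `total_variation`, the 0.2 alert threshold, and both sample windows are illustrative assumptions, not part of the prototype.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label."""
    if not labels:
        raise ValueError('labels must be non-empty')
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up prediction windows for illustration.
reference = ['positive'] * 4 + ['neutral'] * 3 + ['negative'] * 3
recent = ['positive'] * 2 + ['neutral'] * 2 + ['negative'] * 6

distance = total_variation(label_distribution(reference), label_distribution(recent))
ALERT_THRESHOLD = 0.2  # arbitrary illustrative value, not a tuned setting
print(f'TV distance: {distance:.2f}, drift alert: {distance > ALERT_THRESHOLD}')
```

This only tracks the prediction mix, which can shift because the news itself changed; confirming a drop in model quality still requires freshly labeled headlines.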