# Text Preprocessing for Financial Sentiment Analysis

This notebook demonstrates text preprocessing techniques for financial text data collected from Reddit, Twitter, and News APIs. We'll perform:

1. **Tokenization** - Split text into individual words/tokens
2. **Stopword Removal** - Remove common words while preserving financial terms
3. **Lemmatization** - Reduce words to their base form

**Why Preprocessing?**
- Reduces noise in text data
- Standardizes text for sentiment analysis
- Reduces vocabulary size for better model performance
- Preserves important financial terminology

## 1. Import Required Libraries

Import NLTK for natural language processing, pandas for data handling, and our custom preprocessing module.

In [None]:
import sys
import os
from pathlib import Path
import json
from datetime import datetime

# Add backend to path for importing our modules
backend_path = str(Path.cwd().parent / "backend")
if backend_path not in sys.path:
    sys.path.insert(0, backend_path)

# Import our preprocessing module
from app.preprocessing import TextProcessor, preprocess_text, tokenize, remove_stopwords, lemmatize_tokens, normalize_text  # type: ignore

# Import NLTK for demos
import nltk
from nltk.corpus import stopwords

# Display settings
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully!")

## 2. Sample Financial Text Data

Let's create sample texts from each data source to demonstrate preprocessing.

In [None]:
# Sample texts from different sources
sample_texts = {
    "reddit": "The stock market is so bullish right now! 🚀 $TSLA to the moon! Check out https://reddit.com/r/wallstreetbets #stocks #investing",
    "twitter": "@elonmusk Tesla earnings are AMAZING! The stock is up 15% today 📈 $TSLA #bullish #stocks https://t.co/abc123",
    "news": "Market Analysis: The S&P 500 gained 2.3% today as investors remained optimistic about earnings. Technology stocks led the rally with strong performance."
}

# Display samples
for source, text in sample_texts.items():
    print(f"\n{'='*60}")
    print(f"{source.upper()} Sample:")
    print(f"{'='*60}")
    print(text)
    print(f"\nLength: {len(text)} characters")

## 3. Text Normalization

First step: normalize text by removing URLs, mentions, hashtags, and punctuation.
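
As a rough illustration of what this step involves, here is a minimal regex-based sketch. `normalize_sketch` and its patterns are hypothetical and are not the project's `normalize_text`; in particular, treating hashtag expansion as "drop the `#`, keep the word" and keeping `$` so cashtags survive are assumptions.

```python
import re

# Hypothetical patterns for illustration only
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Anything that is not a lowercase letter, digit, "$", or whitespace
NON_WORD_RE = re.compile(r"[^a-z0-9$\s]")


def normalize_sketch(text):
    """Lowercase, drop URLs and mentions, expand hashtags, strip punctuation."""
    text = text.lower()
    # URLs go first, before punctuation stripping breaks them into fragments
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Assumed meaning of "expand": keep the hashtag word, drop the "#"
    text = HASHTAG_RE.sub(r"\1", text)
    # Removes punctuation and emojis in one pass
    text = NON_WORD_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions
    return " ".join(text.split())


print(normalize_sketch("@elonmusk Tesla is UP 15% today! $TSLA #bullish https://t.co/abc123"))
# tesla is up 15 today $tsla bullish
```

The order matters in this sketch: stripping punctuation before removing URLs would leave fragments such as `https` and `t co` in the text.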

In [None]:
# Demonstrate normalization
reddit_text = sample_texts["reddit"]
print("ORIGINAL:")
print(reddit_text)
print("\n" + "="*60)
print("NORMALIZED:")
normalized = normalize_text(reddit_text, lowercase=True, remove_urls=True, expand_hashtags=True)
print(normalized)
print("\n✓ URLs, hashtags, and emojis removed")

## 4. Tokenization

Split normalized text into individual words (tokens).
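
To make the idea concrete, the sketch below tokenizes with a single regular expression. `tokenize_sketch` is a hypothetical stand-in, not the project's `tokenize`, and the choice to keep cashtags such as `$tsla` as one token is an assumption.

```python
import re

# A cashtag ("$" followed by letters) or a run of letters and digits
TOKEN_RE = re.compile(r"\$[a-z]+|[a-z0-9]+")


def tokenize_sketch(text):
    """Return lowercase word tokens, keeping cashtags intact."""
    return TOKEN_RE.findall(text.lower())


print(tokenize_sketch("$TSLA to the moon"))
# ['$tsla', 'to', 'the', 'moon']
print(tokenize_sketch("S&P 500 gained 2.3%"))
# ['s', 'p', '500', 'gained', '2', '3']
```

The second call shows a limitation of such a simple pattern: "S&P" and "2.3" are split apart, so a tokenizer for financial text may need extra rules for index names and decimals.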

In [None]:
# Tokenize the normalized text
tokens = tokenize(normalized)
print(f"Tokens ({len(tokens)} total):")
print(tokens)
print(f"\n✓ Text split into {len(tokens)} tokens")

## 5. Stopword Removal

Remove common words while **preserving financial terminology**.
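
Preservation matters because general-purpose stopword lists contain words that carry signal in financial text; NLTK's English list, for example, includes "up", "down", and "not". The sketch below shows the filtering logic with tiny hand-written sets. Both sets and the function `remove_stopwords_sketch` are hypothetical; the contents of the project's actual preserved-term list are not assumed here.

```python
# Tiny illustrative sets, not the real stopword or financial-term lists
STOPWORDS_SKETCH = {"the", "is", "so", "to", "a", "of", "up", "down", "not"}
PRESERVE_SKETCH = {"up", "down", "not"}


def remove_stopwords_sketch(tokens, preserve=True):
    """Drop stopwords, optionally keeping those in the preserve set."""
    keep = PRESERVE_SKETCH if preserve else set()
    return [t for t in tokens if t not in STOPWORDS_SKETCH or t in keep]


tokens_demo = ["the", "stock", "is", "not", "up"]
print(remove_stopwords_sketch(tokens_demo))
# ['stock', 'not', 'up']
print(remove_stopwords_sketch(tokens_demo, preserve=False))
# ['stock']
```

Without preservation, "the stock is not up" and "the stock is up" reduce to the same token list, which is exactly the distinction a sentiment model needs.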

In [None]:
# Remove stopwords while preserving financial terms
filtered_tokens = remove_stopwords(tokens, preserve_financial=True)

print(f"Before: {len(tokens)} tokens")
print(tokens)
print(f"\nAfter: {len(filtered_tokens)} tokens")
print(filtered_tokens)
print(f"\n✓ Removed {len(tokens) - len(filtered_tokens)} stopwords")
print("✓ Financial terms like 'stock', 'market', 'bullish' preserved!")

## 6. Lemmatization

Reduce words to their base/dictionary form (e.g., "stocks" → "stock", "running" → "run").
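
Lemmatization is essentially a dictionary lookup keyed by the word and its part of speech. The toy table below illustrates that; `LEMMA_TABLE` and `lemmatize_sketch` are hypothetical and say nothing about how the project's `lemmatize_tokens` is implemented.

```python
# Toy lookup table keyed by (word, part of speech): "n" = noun, "v" = verb
LEMMA_TABLE = {
    ("stocks", "n"): "stock",
    ("gains", "n"): "gain",
    ("running", "v"): "run",
    ("surged", "v"): "surge",
}


def lemmatize_sketch(token, pos="n"):
    """Return the dictionary form if known, otherwise the token unchanged."""
    return LEMMA_TABLE.get((token, pos), token)


print(lemmatize_sketch("stocks"))            # stock
print(lemmatize_sketch("running"))           # running (no noun entry)
print(lemmatize_sketch("running", pos="v"))  # run
```

The part-of-speech argument is worth noting: dictionary-based lemmatizers that default to nouns, such as NLTK's `WordNetLemmatizer`, leave "running" unchanged unless told it is a verb, so whether "running" becomes "run" depends on how the lemmatizer is called.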

In [None]:
# Lemmatize tokens
lemmatized_tokens = lemmatize_tokens(filtered_tokens)

print("Before lemmatization:")
print(filtered_tokens)
print("\nAfter lemmatization:")
print(lemmatized_tokens)
print("\n✓ Words reduced to base form")

## 7. Complete Pipeline

Use the `preprocess_text()` function to apply all steps at once.

In [None]:
# Process all sample texts at once
print("="*60)
print("COMPLETE PREPROCESSING PIPELINE")
print("="*60)

for source, text in sample_texts.items():
    print(f"\n{source.upper()}:")
    print(f"Original ({len(text)} chars): {text[:80]}...")
    
    # Full preprocessing
    processed = preprocess_text(
        text,
        remove_stopwords_flag=True,
        lemmatize=True,
        preserve_financial=True,
        return_string=True
    )
    
    print(f"Processed ({len(processed)} chars): {processed}")
    print(f"Reduction: {100 - (len(processed)/len(text)*100):.1f}%")

## 8. Using TextProcessor Class

Create a reusable processor with custom configuration.
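
The benefit of a processor object is that configuration is set once and then applied uniformly to every text. A minimal sketch of that pattern follows; `ProcessorSketch` is hypothetical, uses whitespace splitting and a three-word stopword set purely for illustration, and only mirrors the `process_batch(..., return_strings=...)` call shape used in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessorSketch:
    """Holds configuration once so every text is processed the same way."""
    lowercase: bool = True
    stopwords: frozenset = frozenset({"the", "a", "is"})

    def process(self, text):
        if self.lowercase:
            text = text.lower()
        # Whitespace split keeps the sketch short; a real tokenizer does more
        return [t for t in text.split() if t not in self.stopwords]

    def process_batch(self, texts, return_strings=False):
        results = [self.process(t) for t in texts]
        if return_strings:
            return [" ".join(r) for r in results]
        return results


sketch = ProcessorSketch()
print(sketch.process_batch(["The Fed is dovish"], return_strings=True))
# ['fed dovish']
```

Making the dataclass frozen is a design choice in this sketch: the configuration cannot change between calls, so results from one batch stay comparable with results from the next.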

In [None]:
# Create processor with custom configuration
processor = TextProcessor(
    lowercase=True,
    remove_urls=True,
    remove_stopwords=True,
    lemmatize=True,
    preserve_financial=True
)

# Process multiple texts
texts = [
    "Tesla stock surged 20% after earnings beat expectations!",
    "The Federal Reserve announced interest rate cuts today.",
    "Bitcoin reached new all-time highs as investors remain bullish."
]

print("Batch Processing:")
print("="*60)
results = processor.process_batch(texts, return_strings=True)
for i, (original, processed) in enumerate(zip(texts, results), 1):
    print(f"\n{i}. Original: {original}")
    print(f"   Processed: {processed}")

## 9. Comparison of Preprocessing Configurations

Compare different preprocessing levels: minimal, standard, and full.

In [None]:
test_text = "The stock markets are experiencing significant gains today! 📈 https://example.com"

configs = {
    "Minimal (normalize only)": {
        "remove_stopwords_flag": False,
        "lemmatize": False
    },
    "Standard (+ stopwords)": {
        "remove_stopwords_flag": True,
        "lemmatize": False,
        "preserve_financial": True
    },
    "Full (+ lemmatization)": {
        "remove_stopwords_flag": True,
        "lemmatize": True,
        "preserve_financial": True
    }
}

print("ORIGINAL TEXT:")
print(test_text)
print("\n" + "="*60)

for config_name, config_params in configs.items():
    result = preprocess_text(test_text, **config_params, return_string=True)
    # Use a distinct name so the `tokens` list from Section 4 is not overwritten
    config_tokens = preprocess_text(test_text, **config_params, return_string=False)
    print(f"\n{config_name}:")
    print(f"  Result: {result}")
    print(f"  Tokens: {len(config_tokens)}")

## 10. Financial Terms Preservation

Demonstrate how financial terminology is preserved even with stopword removal.

In [None]:
from app.preprocessing.text_processor import FINANCIAL_TERMS  # type: ignore

print("Financial terms preserved during stopword removal:")
print("="*60)
print(f"Total financial terms: {len(FINANCIAL_TERMS)}")
print(f"\nSample terms: {sorted(FINANCIAL_TERMS)[:20]}")

# Demonstrate preservation
financial_text = "The stock market shows bullish gains with strong returns on investment"
print(f"\n\nOriginal: {financial_text}")

# Without preserving financial terms
result_no_preserve = preprocess_text(
    financial_text, 
    remove_stopwords_flag=True, 
    preserve_financial=False,
    return_string=True
)
print(f"Without preservation: {result_no_preserve}")

# With preserving financial terms
result_preserve = preprocess_text(
    financial_text,
    remove_stopwords_flag=True,
    preserve_financial=True,
    return_string=True
)
print(f"With preservation: {result_preserve}")

## 11. Save Preprocessed Sample Data

Export preprocessed samples for reference.
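
When the exported file is read back later, a quick consistency check can catch a truncated or hand-edited file. The helper below is a hypothetical sketch, not part of the project; it relies only on the `samples`, `tokens`, and `token_count` keys written by this notebook.

```python
import json
import tempfile
from pathlib import Path


def load_and_check_samples(path):
    """Load an exported samples file and verify each stored token count."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for sample in data["samples"]:
        if sample["token_count"] != len(sample["tokens"]):
            raise ValueError(f"token_count mismatch for source {sample.get('source')!r}")
    return data


# Round trip through a temporary file to show the check in use
demo = {"samples": [{"source": "news", "tokens": ["market", "gain"], "token_count": 2}]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json"
    # ensure_ascii=False keeps emojis and other non-ASCII text human-readable
    demo_path.write_text(json.dumps(demo, ensure_ascii=False), encoding="utf-8")
    loaded = load_and_check_samples(demo_path)

print(len(loaded["samples"]))
# 1
```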

In [None]:
# Create output directory
output_dir = Path("data/preprocessed/samples")
output_dir.mkdir(parents=True, exist_ok=True)

# Process and save samples
output_data = {
    "metadata": {
        "processed_at": datetime.now().isoformat(),
        "configuration": {
            "lowercase": True,
            "remove_stopwords": True,
            "lemmatize": True,
            "preserve_financial": True
        }
    },
    "samples": []
}

for source, text in sample_texts.items():
    tokens = preprocess_text(
        text,
        remove_stopwords_flag=True,
        lemmatize=True,
        preserve_financial=True,
        return_string=False
    )
    
    output_data["samples"].append({
        "source": source,
        "original": text,
        "processed": " ".join(tokens),
        "tokens": tokens,
        "token_count": len(tokens)
    })

# Save to JSON
output_file = output_dir / f"preprocessed_samples_{datetime.now().strftime('%Y%m%d')}.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(output_data, f, indent=2, ensure_ascii=False)

print(f"✓ Saved preprocessed samples to: {output_file}")
print(f"✓ Total samples: {len(output_data['samples'])}")