# Financial Sentiment Analysis with VADER and FinBERT

## Overview

This Python script adds VADER (Valence Aware Dictionary and sEntiment Reasoner) and FinBERT (Financial BERT) sentiment scores to existing financial text analysis results. It is designed for Management Discussion and Analysis (MDA) sections from financial filings and complements the sentiment scores already present in the input files (presumably produced with the Gemini API).

## Key Features

- Dual sentiment analysis approach using:
  - VADER: Lexicon and rule-based sentiment analysis
  - FinBERT: Financial domain-specific BERT model
- Parallel processing with ThreadPoolExecutor
- Robust text matching between datasets
- Automatic handling of long texts through chunking
- Progress tracking and intermediate result saving
- Comprehensive error handling and logging

## Installation Requirements

- Python 3.6+
- Required packages:
  ```bash
  pip install pandas numpy torch tqdm nltk transformers
  ```
- NLTK VADER lexicon (automatically downloaded if missing)
- ProsusAI/finbert model (automatically downloaded during first run)

## Configuration

Key parameters at the top of the script:

```python
MAX_TEXT_LENGTH = 60000  # Maximum text length in characters; longer texts are truncated
BATCH_SIZE = 5           # Rows to process per batch
SAVE_FREQUENCY = 20      # Save after processing this many batches
MAX_WORKERS = 4          # Number of parallel processing threads
```
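
To make the interaction between `BATCH_SIZE` and `SAVE_FREQUENCY` concrete, here is a minimal sketch that computes the number of batches and the batch numbers at which the script attempts an intermediate save. The helper name `plan_batches` is hypothetical and not part of the script; the 1959-row input is the size reported in the sample run logged at the end of this notebook.

```python
import math

BATCH_SIZE = 5
SAVE_FREQUENCY = 20

def plan_batches(n_rows, batch_size=BATCH_SIZE, save_frequency=SAVE_FREQUENCY):
    """Return (number of batches, 1-based batch numbers where a checkpoint is attempted)."""
    n_batches = math.ceil(n_rows / batch_size)
    # Mirrors the script's condition: every save_frequency-th batch, plus the last one
    checkpoints = [b for b in range(1, n_batches + 1)
                   if b % save_frequency == 0 or b == n_batches]
    return n_batches, checkpoints

n_batches, checkpoints = plan_batches(1959)
print(n_batches)         # 392, matching "Processing 392 batches" in the log
print(checkpoints[:3])   # [20, 40, 60]
print(checkpoints[-1])   # 392
```

With the defaults, a checkpoint is therefore attempted every 100 rows.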

## Usage

### Basic Operation

1. Run the script:
   ```bash
   python sentiment_enhancement.py
   ```

2. Follow the interactive prompts:
   - Enter years to process (comma-separated or "all")
   - Confirm file paths or provide alternatives
   - Review progress in the terminal

### Expected File Structure

The script expects:
```
project/
├── data/
│   └── mda_data/
│       └── mda_sections_YEAR.csv    # Original MDA text data
├── sentiment_results/
│   └── mda/
│       ├── mda_sentiment_YEAR_results_v2.csv  # Input with existing scores
│       └── mda_sentiment_YEAR_results_v3.csv  # Output with enhanced scores
└── sentiment_enhancement.py
```

## Sentiment Analysis Details

### VADER Analysis
VADER provides:
- `vader_compound`: Overall sentiment score (-1 to 1)
- `vader_pos`: Positive sentiment score (0 to 1)
- `vader_neg`: Negative sentiment score (0 to 1)
- `vader_neu`: Neutral sentiment score (0 to 1)
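
The bounded range of `vader_compound` comes from a normalization step applied to the summed word valences. The sketch below reproduces that step as it appears in VADER's reference implementation (the constant `alpha = 15` is taken from there); the function name `vader_normalize` and the input sums are made up for illustration and are not part of the script.

```python
import math

def vader_normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into [-1, 1]."""
    norm = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp defensively; the formula itself already stays within (-1, 1)
    return max(-1.0, min(1.0, norm))

for s in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(s, round(vader_normalize(s), 4))
```

Because the function saturates, long documents with many sentiment-bearing words tend to produce compound scores close to -1 or 1.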

### FinBERT Analysis
FinBERT provides:
- `finbert_positive`: Probability of positive sentiment (0 to 1)
- `finbert_negative`: Probability of negative sentiment (0 to 1)
- `finbert_neutral`: Probability of neutral sentiment (0 to 1)
- `finbert_sentiment`: Overall sentiment score (positive - negative)
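
As a rough illustration of how these four columns relate, the following sketch turns one set of class logits into probabilities and the derived score. It uses plain Python instead of PyTorch, the logits are invented, and the helper names are hypothetical; the class order (positive, negative, neutral) is the one the script assumes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def finbert_columns(logits):
    """Map logits ordered (positive, negative, neutral) to the output columns."""
    pos, neg, neu = softmax(logits)
    return {
        'finbert_positive': pos,
        'finbert_negative': neg,
        'finbert_neutral': neu,
        'finbert_sentiment': pos - neg,  # ranges from -1 to 1
    }

scores = finbert_columns([2.0, 0.5, 1.0])  # made-up logits
print(scores)
```

A text that is mostly neutral yields a `finbert_sentiment` near zero even when the positive and negative probabilities are both small.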

## Advanced Features

### Chunking for Long Texts
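Before FinBERT analysis, texts longer than `MAX_TEXT_LENGTH` characters are truncated. The remaining text is split into chunks of at most 512 tokens, the chunks are scored in groups of 8, and the class probabilities are averaged across all chunks with equal weight.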
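
The chunk-and-average idea can be sketched without any model. In the example below, token ids are plain integers, the special-token ids 101 and 102 are the `[CLS]` and `[SEP]` ids of the standard uncased BERT vocabulary (an assumption about the tokenizer in use), and the per-chunk probabilities are invented. The helper names are hypothetical.

```python
def split_into_chunks(token_ids, max_len=512, cls_id=101, sep_id=102):
    """Split token ids into chunks of at most max_len, each wrapped in [CLS] ... [SEP]."""
    body = max_len - 2  # leave room for the two special tokens
    return [[cls_id] + token_ids[i:i + body] + [sep_id]
            for i in range(0, len(token_ids), body)]

def average_probs(chunk_probs):
    """Unweighted mean of per-chunk probability vectors."""
    n = len(chunk_probs)
    return [sum(column) / n for column in zip(*chunk_probs)]

chunks = split_into_chunks(list(range(1000, 2200)))  # 1200 fake token ids
print([len(c) for c in chunks])  # [512, 512, 182]

# Invented (positive, negative, neutral) probabilities for three chunks
avg = average_probs([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
print(avg)
```

Note that an unweighted mean gives a short final chunk the same influence as a full 512-token chunk.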

### Parallel Processing
The script submits batches to a `ThreadPoolExecutor` with `MAX_WORKERS` threads so that several batches can be in flight at once. All threads share a single VADER analyzer and a single FinBERT model.
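
A minimal sketch of this pattern, assuming a stand-in function `score_batch` in place of the script's `process_batch` and toy data in place of the MDA rows. Unlike the script, it collects results with `as_completed`, which is one simple way to make sure every future is consumed exactly once.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 4
BATCH_SIZE = 5

def score_batch(batch):
    """Stand-in for process_batch: returns (row_id, score) pairs."""
    return [(row_id, len(text)) for row_id, text in batch]

rows = [(i, "x" * i) for i in range(12)]  # toy (row_id, text) pairs
batches = [rows[i:i + BATCH_SIZE] for i in range(0, len(rows), BATCH_SIZE)]

results = []
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(score_batch, batch) for batch in batches]
    # as_completed yields each future once, in completion order
    for future in as_completed(futures):
        results.extend(future.result())

results.sort()  # completion order is not deterministic, so sort by row_id
print(len(results))  # 12
```

Keying each result by its row id, as the script does with the DataFrame index, is what makes the out-of-order completion harmless.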

### Multiple Matching Strategies
Three matching strategies are attempted when finding MDA text:
1. Match on all fields (Ticker, CIK, FormType, FiledAt)
2. Match on Ticker, CIK, and FormType
3. Match on just Ticker and CIK
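
The fallback order can be sketched with plain dictionaries. This is a simplified illustration, not the script's pandas implementation: the names `STRATEGIES` and `find_match` are hypothetical, and unlike the script it compares every key in a strategy rather than skipping fields that are missing. The sample values are taken from the first lookup in the run log.

```python
STRATEGIES = [
    ('Ticker', 'CIK', 'FormType', 'FiledAt'),  # 1: all fields
    ('Ticker', 'CIK', 'FormType'),             # 2: ignore the filing date
    ('Ticker', 'CIK'),                         # 3: company identifiers only
]

def find_match(row, candidates):
    """Return (first matching candidate, keys used), trying strict to loose."""
    for keys in STRATEGIES:
        hits = [c for c in candidates
                if all(c.get(k) == row.get(k) for k in keys)]
        if hits:
            return hits[0], keys
    return None, None

mda_rows = [{'Ticker': 'MMM', 'CIK': 66740, 'FormType': '10-Q',
             'FiledAt': '2024-10-22', 'MDA_Text': 'placeholder text'}]

# Same filing, but with a date that differs by one day
row = {'Ticker': 'MMM', 'CIK': 66740, 'FormType': '10-Q', 'FiledAt': '2024-10-23'}
match, keys = find_match(row, mda_rows)
print(keys)  # ('Ticker', 'CIK', 'FormType')
```

The looser strategies trade precision for coverage: a company with several filings of the same form type in one year can match more than one row, in which case the first row is used.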

## Troubleshooting

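- **Missing files**: The script searches several alternative paths and, for a missing MDA file, prompts for a location. The path entered at the prompt is only checked for existence; `enhance_sentiment_analysis` locates the MDA file again through `find_mda_file_path`, so place the file where that search can find it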
- **Model initialization errors**: Check CUDA compatibility and available memory
- **Text matching failures**: Ensure consistent date formats and identifiers between datasets
- **Memory issues**: Reduce MAX_WORKERS or BATCH_SIZE for large datasets
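
For text matching failures, it can help to normalize the join keys in both files before comparing them. The sketch below is one possible approach using only the standard library; the script itself standardizes dates with `pd.to_datetime`. The helper names, the list of date formats, and the idea that CIK values may differ by type or zero-padding are all assumptions for illustration.

```python
from datetime import datetime

# Assumed formats; extend this tuple to cover the formats found in your files
DATE_FORMATS = ('%Y-%m-%d', '%m/%d/%Y', '%Y-%m-%dT%H:%M:%S')

def normalize_date(value):
    """Return the date as YYYY-MM-DD, or the stripped input if no format fits."""
    if value is None or str(value).strip() == '':
        return None
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue
    return text

def normalize_cik(value):
    """Compare CIKs as strings without leading zeros (e.g. 66740 vs '0000066740')."""
    return str(value).strip().lstrip('0') or '0'

print(normalize_date('10/22/2024'))           # 2024-10-22
print(normalize_date('2024-10-22T16:30:00'))  # 2024-10-22
print(normalize_cik('0000066740') == normalize_cik(66740))  # True
```

If Strategy 1 never finds a match in the log while Strategy 2 does, a date format mismatch of this kind is a likely cause.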

## Error Handling

The script implements robust error handling:
- Detailed logging to both console and file
- Graceful handling of missing data
- Thread-safe parallel processing
- Automatic NLTK resource downloading
- GPU/CPU detection with fallback options

## Output

The script generates a CSV file with all original columns plus:
- VADER sentiment scores (vader_compound, vader_pos, vader_neg, vader_neu)
- FinBERT sentiment scores (finbert_positive, finbert_negative, finbert_neutral, finbert_sentiment)

In [None]:
import pandas as pd
import numpy as np
import os
import glob
import time
import logging
import re
import torch
from datetime import datetime
from tqdm import tqdm
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import BertTokenizer, BertForSequenceClassification
from concurrent.futures import ThreadPoolExecutor

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(f'sentiment_enhancement_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Constants
MAX_TEXT_LENGTH = 60000  # Match the same limit as in the Gemini code
BATCH_SIZE = 5  # Number of rows to process at once for efficiency
SAVE_FREQUENCY = 20  # Save after processing this many batches
MAX_WORKERS = 4  # Thread pool size for parallel processing

# Initialize VADER and FinBERT only once
def initialize_models():
    logger.info("Initializing sentiment models...")
    try:
        # Initialize VADER
        vader_analyzer = SentimentIntensityAnalyzer()
        
        # Initialize FinBERT (with device management)
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        logger.info(f"Using device: {device}")
        
        finbert_tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
        finbert_model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert').to(device)
        finbert_model.eval()  # Set to evaluation mode
        
        logger.info("Models initialized successfully!")
        return vader_analyzer, finbert_tokenizer, finbert_model, device
    except Exception as e:
        logger.error(f"Error initializing models: {e}")
        raise

# Helper function to standardize date format
def standardize_date_format(date_str):
    """
    Standardize date string to YYYY-MM-DD format for consistent matching
    """
    if pd.isna(date_str) or date_str is None or date_str == '':
        return None
    
    try:
        # Try to parse using pandas (handles many formats)
        dt = pd.to_datetime(date_str)
        return dt.strftime('%Y-%m-%d')
    except Exception:
        logger.warning(f"Could not standardize date: {date_str}")
        return date_str

# VADER sentiment analysis function
def analyze_with_vader(text, vader_analyzer):
    """
    Analyze sentiment using VADER
    """
    try:
        # Validate first so that non-string input (e.g. NaN) never reaches len()
        if not text or not isinstance(text, str):
            logger.warning("Empty or invalid text for VADER analysis")
            return {
                'vader_compound': np.nan,
                'vader_pos': np.nan,
                'vader_neg': np.nan,
                'vader_neu': np.nan
            }
        
        # Truncate text if too long (VADER can handle long texts, but we'll be consistent)
        if len(text) > MAX_TEXT_LENGTH:
            logger.warning(f"Text truncated for VADER from {len(text)} to {MAX_TEXT_LENGTH} chars")
            text = text[:MAX_TEXT_LENGTH]
        
        # Get sentiment scores
        scores = vader_analyzer.polarity_scores(text)
        
        return {
            'vader_compound': scores['compound'],
            'vader_pos': scores['pos'],
            'vader_neg': scores['neg'],
            'vader_neu': scores['neu']
        }
    except Exception as e:
        logger.error(f"Error in VADER analysis: {e}")
        return {
            'vader_compound': np.nan,
            'vader_pos': np.nan,
            'vader_neg': np.nan,
            'vader_neu': np.nan
        }

# FinBERT chunking and sentiment analysis function
def analyze_with_finbert(text, tokenizer, model, device):
    """
    Analyze sentiment using FinBERT with chunking for long texts
    """
    try:
        if not text or not isinstance(text, str):
            logger.warning("Empty or invalid text for FinBERT analysis")
            return {
                'finbert_positive': np.nan,
                'finbert_negative': np.nan,
                'finbert_neutral': np.nan,
                'finbert_sentiment': np.nan  # Overall sentiment score (pos - neg)
            }
        
        # Truncate text if too long (for memory considerations)
        if len(text) > MAX_TEXT_LENGTH:
            logger.warning(f"Text truncated for FinBERT from {len(text)} to {MAX_TEXT_LENGTH} chars")
            text = text[:MAX_TEXT_LENGTH]
        
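        # Tokenize once, then split into chunks of at most 512 tokens
        # (510 content tokens plus [CLS] and [SEP]). Chunks are built explicitly because
        # the slow BertTokenizer reports overflow in a separate 'overflowing_tokens'
        # field instead of extra rows, which would leave only the first chunk.
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        chunk_ids = [
            [tokenizer.cls_token_id] + token_ids[i:i+510] + [tokenizer.sep_token_id]
            for i in range(0, len(token_ids), 510)
        ]
        
        # Get number of chunks
        num_chunks = len(chunk_ids)
        logger.info(f"Processing {num_chunks} chunks for FinBERT")
        
        if num_chunks == 0:
            logger.warning("No chunks were created by the tokenizer")
            return {
                'finbert_positive': np.nan,
                'finbert_negative': np.nan,
                'finbert_neutral': np.nan,
                'finbert_sentiment': np.nan
            }
        
        # Pad the chunks to a common length and build the attention mask
        encoded_input = tokenizer.pad({'input_ids': chunk_ids}, return_tensors='pt')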
        
        # Process all chunks in one batch (or in smaller batches if there are too many)
        all_probs = []
        batch_size = 8  # Process chunks in batches of 8 if there are many
        
        for i in range(0, num_chunks, batch_size):
            batch_input_ids = encoded_input['input_ids'][i:i+batch_size].to(device)
            batch_attention_mask = encoded_input['attention_mask'][i:i+batch_size].to(device)
            
            with torch.no_grad():
                outputs = model(batch_input_ids, attention_mask=batch_attention_mask)
                logits = outputs.logits
                probs = torch.softmax(logits, dim=1)
                all_probs.append(probs)
        
        # Combine all batches
        combined_probs = torch.cat(all_probs, dim=0)
        
        # Average probabilities across all chunks
        avg_probs = combined_probs.mean(dim=0).cpu().numpy()
        
        # Map to sentiment categories (FinBERT order: positive, negative, neutral)
        return {
            'finbert_positive': float(avg_probs[0]),
            'finbert_negative': float(avg_probs[1]),
            'finbert_neutral': float(avg_probs[2]),
            'finbert_sentiment': float(avg_probs[0] - avg_probs[1])  # pos - neg as overall score
        }
    except Exception as e:
        logger.error(f"Error in FinBERT analysis: {e}")
        import traceback
        logger.error(f"Traceback: {traceback.format_exc()}")
        return {
            'finbert_positive': np.nan,
            'finbert_negative': np.nan,
            'finbert_neutral': np.nan,
            'finbert_sentiment': np.nan
        }

# Find matching MDA text for a row
def find_matching_mda_text(row, mda_df):
    """
    Find matching MDA text for a row using multiple matching strategies
    Similar to the approach in the reference code
    """
    # Extract row data for matching
    ticker = row.get('Ticker', None)
    cik = row.get('CIK', None)
    form_type = row.get('FormType', None)
    filed_at = row.get('FiledAt', None)
    
    # Log what we're trying to match
    logger.info(f"Looking for match: Ticker={ticker}, CIK={cik}, FormType={form_type}, FiledAt={filed_at}")
    
    # Try different matching strategies
    matching_rows = None
    
    # Strategy 1: Try matching on all fields
    if matching_rows is None or len(matching_rows) == 0:
        mask = pd.Series([True] * len(mda_df), index=mda_df.index)
        
        if ticker is not None and 'Ticker' in mda_df.columns:
            mask = mask & (mda_df['Ticker'] == ticker)
        
        if cik is not None and 'CIK' in mda_df.columns:
            mask = mask & (mda_df['CIK'] == cik)
        
        if form_type is not None and 'FormType' in mda_df.columns:
            mask = mask & (mda_df['FormType'] == form_type)
        
        if filed_at is not None and 'FiledAt' in mda_df.columns:
            mask = mask & (mda_df['FiledAt'] == filed_at)
        
        matching_rows = mda_df[mask]
        logger.info(f"Strategy 1 (all fields): Found {len(matching_rows)} matches")
    
    # Strategy 2: Match on Ticker, CIK, and FormType but not date
    if matching_rows is None or len(matching_rows) == 0:
        mask = pd.Series([True] * len(mda_df), index=mda_df.index)
        
        if ticker is not None and 'Ticker' in mda_df.columns:
            mask = mask & (mda_df['Ticker'] == ticker)
        
        if cik is not None and 'CIK' in mda_df.columns:
            mask = mask & (mda_df['CIK'] == cik)
        
        if form_type is not None and 'FormType' in mda_df.columns:
            mask = mask & (mda_df['FormType'] == form_type)
        
        matching_rows = mda_df[mask]
        logger.info(f"Strategy 2 (no date): Found {len(matching_rows)} matches")
    
    # Strategy 3: Match on just Ticker and CIK
    if matching_rows is None or len(matching_rows) == 0:
        mask = pd.Series([True] * len(mda_df), index=mda_df.index)
        
        if ticker is not None and 'Ticker' in mda_df.columns:
            mask = mask & (mda_df['Ticker'] == ticker)
        
        if cik is not None and 'CIK' in mda_df.columns:
            mask = mask & (mda_df['CIK'] == cik)
        
        matching_rows = mda_df[mask]
        logger.info(f"Strategy 3 (Ticker+CIK): Found {len(matching_rows)} matches")
    
    # If we still have no matches, return None
    if matching_rows is None or len(matching_rows) == 0:
        logger.warning("Could not find matching row in MDA file")
        return None
    
    if len(matching_rows) > 1:
        logger.warning(f"Found multiple matching rows ({len(matching_rows)}), using the first one")
    
    # Get the MDA text from the first matching row
    try:
        mda_text = matching_rows.iloc[0]['MDA_Text']
        
        if not isinstance(mda_text, str) or not mda_text.strip():
            logger.warning("Empty or invalid MDA_Text in matched row")
            return None
        
        logger.info(f"Found matching MDA text (length: {len(mda_text)} chars)")
        
        # Truncate text if too long
        if len(mda_text) > MAX_TEXT_LENGTH:
            logger.warning(f"Text too long ({len(mda_text)} chars), truncating to {MAX_TEXT_LENGTH} chars")
            mda_text = mda_text[:MAX_TEXT_LENGTH]
        
        return mda_text
    
    except Exception as e:
        logger.error(f"Error getting MDA text: {e}")
        return None

# Process a batch of rows
def process_batch(batch_df, mda_df, vader_analyzer, finbert_tokenizer, finbert_model, device):
    """
    Process a batch of rows with both sentiment models
    """
    results = []
    
    for idx, row in batch_df.iterrows():
        try:
            # Find matching MDA text
            mda_text = find_matching_mda_text(row, mda_df)
            
            # Skip if no text
            if not mda_text:
                logger.warning(f"No matching MDA text found for row {idx}, skipping")
                results.append((idx, None, None))
                continue
            
            # Process with VADER
            vader_results = analyze_with_vader(mda_text, vader_analyzer)
            
            # Process with FinBERT
            finbert_results = analyze_with_finbert(mda_text, finbert_tokenizer, finbert_model, device)
            
            # Store results
            results.append((idx, vader_results, finbert_results))
            logger.info(f"Successfully analyzed sentiment for row {idx}")
            
        except Exception as e:
            logger.error(f"Error processing row {idx}: {e}")
            results.append((idx, None, None))
    
    return results

# Find MDA file path
def find_mda_file_path(year):
    """
    Attempt to find the correct MDA file path
    """
    # Default path
    default_path = f"data/mda_data/mda_sections_{year}.csv"
    
    # Check if default path exists
    if os.path.exists(default_path):
        return default_path
    
    # Try alternative paths
    alternatives = [
        f"mda_data/mda_sections_{year}.csv",
        f"data/mda_sections_{year}.csv",
        f"mda_sections_{year}.csv"
    ]
    
    # Check parent directories
    current_dir = os.getcwd()
    parent_dir = os.path.dirname(current_dir)
    
    for alt_path in alternatives:
        # Check in current directory
        if os.path.exists(alt_path):
            return alt_path
        
        # Check in parent directory
        parent_path = os.path.join(parent_dir, alt_path)
        if os.path.exists(parent_path):
            return parent_path
    
    # If we've made it here, try to find any file that might match
    possible_paths = []
    
    # Search patterns
    patterns = [
        f"**/mda_*{year}*.csv",
        f"**/mda*{year}*.csv",
        f"**/*mda*{year}*.csv"
    ]
    
    for pattern in patterns:
        matches = glob.glob(pattern, recursive=True)
        possible_paths.extend(matches)
    
    if possible_paths:
        logger.info(f"Found possible MDA files: {possible_paths}")
        return possible_paths[0]  # Return the first match
    
    return None

# Main function to process the CSV files
def enhance_sentiment_analysis(input_file, output_file, year):
    """
    Add VADER and FinBERT sentiment scores to the input file
    """
    try:
        logger.info(f"Loading input file: {input_file}")
        results_df = pd.read_csv(input_file)
        logger.info(f"Loaded {len(results_df)} rows from {input_file}")
        
        # Find and load MDA file with original text
        mda_file_path = find_mda_file_path(year)
        if not mda_file_path:
            logger.error(f"Could not find MDA file for year {year}")
            return None
        
        logger.info(f"Loading MDA file: {mda_file_path}")
        mda_df = pd.read_csv(mda_file_path)
        logger.info(f"Loaded {len(mda_df)} rows from MDA file")
        
        # Check if MDA_Text column exists in the MDA file
        if 'MDA_Text' not in mda_df.columns:
            logger.error(f"MDA_Text column not found in {mda_file_path}")
            # Try to find an alternative column
            text_columns = [col for col in mda_df.columns if 'text' in col.lower() or 'mda' in col.lower()]
            if not text_columns:
                raise ValueError(f"Could not find text column in {mda_file_path}")
            logger.info(f"Using {text_columns[0]} as the text column")
            mda_df['MDA_Text'] = mda_df[text_columns[0]]
        
        # Standardize date formats in both dataframes for accurate matching
        if 'FiledAt' in results_df.columns:
            logger.info("Standardizing FiledAt dates in results dataframe")
            results_df['FiledAt'] = results_df['FiledAt'].apply(standardize_date_format)
        
        if 'FiledAt' in mda_df.columns:
            logger.info("Standardizing FiledAt dates in MDA dataframe")
            mda_df['FiledAt'] = mda_df['FiledAt'].apply(standardize_date_format)
        
        # Check if VADER and FinBERT columns already exist
        vader_cols = [col for col in results_df.columns if 'vader' in col.lower()]
        finbert_cols = [col for col in results_df.columns if 'finbert' in col.lower()]
        
        if vader_cols and finbert_cols:
            logger.warning(f"VADER and FinBERT columns already exist in {input_file}")
            logger.info(f"VADER columns: {vader_cols}")
            logger.info(f"FinBERT columns: {finbert_cols}")
            overwrite = input("Overwrite existing sentiment columns? (y/n): ").lower() == 'y'
            if not overwrite:
                logger.info("Skipping file")
                return
        
        # Initialize models
        vader_analyzer, finbert_tokenizer, finbert_model, device = initialize_models()
        
        # Process in batches
        batch_indices = [results_df.index[i:i+BATCH_SIZE] for i in range(0, len(results_df), BATCH_SIZE)]
        all_results = []
        
        logger.info(f"Processing {len(batch_indices)} batches")
        
        # Use ThreadPoolExecutor for parallel processing
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
            futures = []
            
            for i, batch_idx in enumerate(batch_indices):
                batch_df = results_df.loc[batch_idx].copy()
                futures.append(executor.submit(
                    process_batch, batch_df, mda_df, vader_analyzer, finbert_tokenizer, finbert_model, device
                ))
                
                # At each checkpoint, collect the batches that have already finished
                # (this does not block on batches that are still running)
                if (i + 1) % SAVE_FREQUENCY == 0 or i == len(batch_indices) - 1:
                    logger.info(f"Checkpoint after submitting batch {i+1}/{len(batch_indices)}")
                    
                    # Collect completed results
                    completed_futures = [f for f in futures if f.done()]
                    for future in completed_futures:
                        try:
                            batch_results = future.result()
                            all_results.extend(batch_results)
                        except Exception as e:
                            logger.error(f"Error getting future result: {e}")
                    
                    # Drop only the futures collected above; re-checking done() here could
                    # discard a batch that finished in the meantime without keeping its results
                    futures = [f for f in futures if f not in completed_futures]
                    
                    # Save intermediate results
                    if all_results:
                        temp_df = results_df.copy()
                        for idx, vader_result, finbert_result in all_results:
                            if vader_result:
                                for key, value in vader_result.items():
                                    temp_df.at[idx, key] = value
                            if finbert_result:
                                for key, value in finbert_result.items():
                                    temp_df.at[idx, key] = value
                        
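                        temp_output = output_file.replace('.csv', f'_temp_{i+1}.csv')
                        # The output directory may not exist yet (it is otherwise created only before the final save)
                        os.makedirs(os.path.dirname(temp_output) or '.', exist_ok=True)
                        temp_df.to_csv(temp_output, index=False)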
                        logger.info(f"Saved intermediate results to {temp_output}")
            
            # Collect remaining results
            for future in futures:
                try:
                    batch_results = future.result()
                    all_results.extend(batch_results)
                except Exception as e:
                    logger.error(f"Error getting future result: {e}")
        
        # Update dataframe with all results (excluding MDA_Text)
        output_df = results_df.copy()
        
        # Remove any existing VADER or FinBERT columns
        for col in vader_cols + finbert_cols:
            if col in output_df.columns:
                output_df = output_df.drop(col, axis=1)
                
        # Add new sentiment scores
        for idx, vader_result, finbert_result in all_results:
            if vader_result:
                for key, value in vader_result.items():
                    output_df.at[idx, key] = value
            if finbert_result:
                for key, value in finbert_result.items():
                    output_df.at[idx, key] = value
        
        # Save final results
        os.makedirs(os.path.dirname(output_file), exist_ok=True)
        output_df.to_csv(output_file, index=False)
        logger.info(f"Saved final results to {output_file}")
        
        # Clean up temporary files
        temp_files = [f for f in os.listdir(os.path.dirname(output_file)) 
                     if os.path.basename(f).startswith(os.path.basename(output_file).split('.')[0] + '_temp_')]
        if temp_files and input("Remove temporary files? (y/n): ").lower() == 'y':
            for temp_file in temp_files:
                temp_path = os.path.join(os.path.dirname(output_file), temp_file)
                try:
                    os.remove(temp_path)
                    logger.info(f"Removed {temp_file}")
                except Exception as e:
                    logger.error(f"Could not remove {temp_file}: {e}")
        
        return output_df
        
    except Exception as e:
        logger.error(f"Error processing file {input_file}: {e}")
        import traceback
        logger.error(f"Traceback: {traceback.format_exc()}")
        return None

def main():
    print("\n=== ENHANCING MDA SENTIMENT ANALYSIS WITH VADER AND FINBERT ===")
    
    # Ask user for years to process
    years_input = input("Which years would you like to process? (Enter comma-separated values, e.g., '2023,2024' or 'all' for all years): ")
    
    if years_input.lower() == 'all':
        years = ['2023', '2024']  # Default years
    else:
        years = [year.strip() for year in years_input.split(',')]
    
    # Process each year
    for year in years:
        # Define file paths
        input_file = f"sentiment_results/mda/mda_sentiment_{year}_results_v2.csv"
        output_file = f"sentiment_results/mda/mda_sentiment_{year}_results_v3.csv"
        
        print(f"\n=== PROCESSING YEAR: {year} ===")
        print(f"Input file: {input_file}")
        print(f"Output file: {output_file}")
        
        # Check if input file exists
        if not os.path.exists(input_file):
            print(f"Error: Input file not found: {input_file}")
            # Try alternative paths
            alternatives = [
                f"mda_sentiment_{year}_results_v2.csv",
                f"*sentiment*{year}*v2*.csv"
            ]
            
            found = False
            for alt_pattern in alternatives:
                matches = glob.glob(alt_pattern)
                if matches:
                    input_file = matches[0]
                    print(f"Found alternative input file: {input_file}")
                    found = True
                    break
            
            if not found:
                print("Could not find input file. Skipping year.")
                continue
        
        # Check if MDA file can be found
        mda_file_path = find_mda_file_path(year)
        if not mda_file_path:
            print(f"Error: Could not find MDA data file for year {year}")
            user_path = input(f"Please provide the path to the MDA file for {year} (or press Enter to skip): ")
            if user_path.strip():
                if os.path.exists(user_path):
                    print(f"Using provided MDA file: {user_path}")
                else:
                    print(f"Provided path does not exist: {user_path}")
                    if input("Continue anyway? (y/n): ").lower() != 'y':
                        print(f"Skipping year {year}")
                        continue
            else:
                print(f"Skipping year {year}")
                continue
        else:
            print(f"Found MDA file: {mda_file_path}")
        
        # Ask user to confirm
        confirm = input(f"Process {year}? (y/n): ")
        if confirm.lower() != 'y':
            print(f"Skipping year {year}")
            continue
        
        # Enhance sentiment analysis
        enhance_sentiment_analysis(input_file, output_file, year)
        
        print(f"Completed enhancement for year {year}")

if __name__ == "__main__":
    # Check if required libraries are installed
    try:
        import nltk
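        # Resources are looked up by their path inside nltk_data; the bare name
        # 'vader_lexicon' is never found, which triggers a download on every run
        nltk.data.find('sentiment/vader_lexicon.zip')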
    except (ImportError, LookupError):
        print("NLTK VADER lexicon not found. Installing...")
        import nltk
        nltk.download('vader_lexicon')
    
    try:
        import transformers
    except ImportError:
        print("Transformers library not found. Please install it with:")
        print("pip install transformers")
        if input("Continue anyway? (y/n): ").lower() != 'y':
            import sys
            sys.exit(1)
    
    try:
        import torch
    except ImportError:
        print("PyTorch not found. Please install it with:")
        print("pip install torch")
        if input("Continue anyway? (y/n): ").lower() != 'y':
            import sys
            sys.exit(1)
    
    # Run main function
    main()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/nawa/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


NLTK VADER lexicon not found. Installing...

=== ENHANCING MDA SENTIMENT ANALYSIS WITH VADER AND FINBERT ===


Which years would you like to process? (Enter comma-separated values, e.g., '2023,2024' or 'all' for all years):  2024



=== PROCESSING YEAR: 2024 ===
Input file: sentiment_results/mda/mda_sentiment_2024_results_v2.csv
Output file: sentiment_results/mda/mda_sentiment_2024_results_v3.csv
Found MDA file: data/mda_data/mda_sections_2024.csv


Process 2024? (y/n):  y


2025-04-13 16:50:02,668 - INFO - Loading input file: sentiment_results/mda/mda_sentiment_2024_results_v2.csv
2025-04-13 16:50:02,688 - INFO - Loaded 1959 rows from sentiment_results/mda/mda_sentiment_2024_results_v2.csv
2025-04-13 16:50:02,688 - INFO - Loading MDA file: data/mda_data/mda_sections_2024.csv
2025-04-13 16:50:03,534 - INFO - Loaded 1959 rows from MDA file
2025-04-13 16:50:03,535 - INFO - Standardizing FiledAt dates in results dataframe
2025-04-13 16:50:03,839 - INFO - Standardizing FiledAt dates in MDA dataframe
2025-04-13 16:50:04,289 - INFO - Initializing sentiment models...
2025-04-13 16:50:04,293 - INFO - Using device: cuda
2025-04-13 16:50:04,763 - INFO - Models initialized successfully!
2025-04-13 16:50:04,764 - INFO - Processing 392 batches
2025-04-13 16:50:04,766 - INFO - Looking for match: Ticker=MMM, CIK=66740, FormType=10-Q, FiledAt=2024-10-22
2025-04-13 16:50:04,769 - INFO - Strategy 1 (all fields): Found 1 matches
2025-04-13 16:50:04,769 - INFO - Looking for m