# MD&A Sentiment Analysis Tool

## Overview

This tool performs comprehensive sentiment analysis on Management Discussion & Analysis (MD&A) sections from SEC filings using Google's Gemini API. It processes CSV files containing MD&A text and generates multi-dimensional sentiment scores across various aspects and topics of financial reporting.

### Key Features

- Multi-dimensional sentiment analysis via Google Gemini AI
- Standardized aspect and topic categorization
- Robust error handling with retries
- Progress tracking to resume interrupted processing
- Comprehensive output in CSV format

## Setup Requirements

### Prerequisites

- Python 3.9+ (current releases of the Gemini SDK no longer support 3.7 or 3.8)
- Google API key for Gemini LLM
- Input CSV files with MD&A sections

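### Required Python Packages

```
pandas
google-genai
tqdm
```

The script imports `from google import genai` and creates a `genai.Client`, which is the interface of the `google-genai` package rather than the older `google-generativeai` package.

To install the required packages:

```
pip install pandas google-genai tqdm
```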

### API Configuration

The tool uses the Google Generative AI API. Set your API key at the top of the script:

```python
API_KEY = "YOUR_GEMINI_API_KEY"  # Replace with your actual API key
```
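
As an alternative to editing the script, the key can be read from the environment so that it never ends up in version control. The following is a minimal sketch; the variable name `GEMINI_API_KEY` and the helper `load_api_key` are assumptions for illustration and are not part of the tool itself.

```python
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice; use whatever your
    deployment defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With a helper like this, the script-level assignment could become `API_KEY = load_api_key()`.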

## Directory Structure

The tool uses the following directory structure:

```
├── data/
│   └── mda_data/                 # Directory containing input CSV files
│       └── mda_sections_*.csv    # Input files (named by year)
├── sentiment_results/            # Main results directory
│   └── mda/                      # MD&A sentiment results
│       ├── mda_sentiment_YYYY_results.csv        # Full results for each year
│       ├── sample_mda_sentiment_YYYY.csv         # Sample of first 5 rows
│       ├── debug_sentiment_YYYY.csv              # Simplified results for quick review
│       └── mda_sentiment_YYYY_full_results.json  # Raw JSON responses
├── progress/                     # Progress tracking
│   └── mda/                      # MD&A progress tracking
│       └── mda_YYYY_progress.json               # Tracks processed rows
└── log/                          # Logging directory
    └── sentiment/                # Sentiment analysis logs
        └── mda/                  # MD&A sentiment logs
            └── gemini_sentiment_YYYYMMDD_HHMMSS.log  # Timestamped log files
```

## Input Data Format

The tool expects CSV files with at least the following columns:

- `Ticker`: Stock symbol of the company
- `CIK`: Central Index Key
- `FormType`: Type of SEC form (10-K, 10-Q, etc.)
- `FiledAt`: Filing date
- `MDA_Text`: Full text of the MD&A section

Files should be named in the format `mda_sections_YYYY.csv` where YYYY is the year.
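
A quick pre-flight check can catch a misnamed file or a missing column before any API calls are made. This is a sketch using only the standard library; the helper names `year_from_filename` and `missing_columns` are hypothetical and do not exist in the script.

```python
import csv
import io
import re

REQUIRED_COLUMNS = ["Ticker", "CIK", "FormType", "FiledAt", "MDA_Text"]
FILENAME_PATTERN = re.compile(r"^mda_sections_(\d{4})\.csv$")


def year_from_filename(filename):
    """Return the four-digit year from an input file name, or None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    return match.group(1) if match else None


def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Example with an in-memory file that lacks three required columns
sample = io.StringIO("Ticker,CIK\nAAPL,320193\n")
print(year_from_filename("mda_sections_2021.csv"), missing_columns(sample))
```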

## Usage

Run the script with:

```
python mda_sentiment_analysis.py
```

The tool will:

1. Test the Gemini API connection
2. Prompt for processing parameters:
   - How many rows to process per file (enter 'all' for all rows)
   - Whether to force reprocessing of previously processed rows

After setting these parameters, the tool will process all available MD&A files in the data directory.

## Sentiment Analysis Approach

The tool analyzes sentiment across multiple dimensions:

1. **Basic Sentiment Metrics**:
   - Polarity Detection: Overall positive/negative sentiment (-1 to 1)
   - Emotion Detection: Emotional tone (-1 to 1)
   - Intent Analysis: Purpose/intent of text (-1 to 1)
   - Subjectivity Detection: Objectivity vs. subjectivity (-1 to 1)

2. **Standardized Aspect-Based Analysis**:
   - Financial Performance (Revenue Growth, Earnings, Profit Margins, etc.)
   - Management and Leadership
   - Product/Service Performance
   - Future Outlook
   - Legal and Risk

3. **Standardized Topic-Based Analysis**:
   - Financial Performance
   - Management and Leadership
   - Product/Service Performance
   - Industry and Market Factors
   - Future Outlook
   - Legal and Risk

4. **Gemini's Advanced Analysis**:
   - Overall Sentiment
   - Emotional Sentiment
   - Contextual Sentiment
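
Each aspect and topic becomes one output column. The sketch below mirrors the name normalization performed in the script's flattening step (lowercase, with spaces, slashes, and the ` - ` separator replaced by underscores); the function names `aspect_column` and `topic_column` are illustrative only.

```python
def aspect_column(name):
    """Map an aspect label to its output column name."""
    clean = name.strip().replace(" - ", "_").replace("/", "_").replace(" ", "_").lower()
    return f"aspect_{clean}"


def topic_column(name):
    """Map a topic label to its output column name."""
    clean = name.strip().replace("/", "_").replace(" ", "_").lower()
    return f"topic_{clean}"


print(aspect_column("Financial Performance - Revenue Growth"))
print(topic_column("Product/Service Performance"))
```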

## Output Files

For each input file (`mda_sections_YYYY.csv`), the tool generates:

1. **`mda_sentiment_YYYY_results.csv`**: 
   - Complete results with all sentiment scores
   - Retains all original columns (except MDA_Text to save space)
   - Adds numerous sentiment columns for each metric

2. **`sample_mda_sentiment_YYYY.csv`**:
   - First 5 rows of the full results
   - Useful for quick review

3. **`debug_sentiment_YYYY.csv`**:
   - Simplified version with only key sentiment columns
   - For quick assessment of core sentiment scores

4. **`mda_sentiment_YYYY_full_results.json`**:
   - Raw JSON responses from the Gemini API
   - Useful for debugging or advanced analysis

## Error Handling

The tool implements several error handling mechanisms:

### API Failures

- Each API call is attempted up to 3 times before failing
- 3-second delay between retry attempts
- Detailed error logging
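
The retry policy above can be sketched as a small wrapper. This is an illustration under assumptions, not the script's actual code: `call_with_retries` is a hypothetical helper, the `sleep` parameter exists only to make the function testable, and unlike the script this sketch does not pause after the final failed attempt.

```python
import time


def call_with_retries(fn, attempts=3, delay=3.0, sleep=time.sleep):
    """Call fn() up to `attempts` times and return its first non-None result.

    Returns None if every attempt raises or returns None.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = fn()
            if result is not None:
                return result
        except Exception as exc:
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
        if attempt < attempts:
            sleep(delay)  # fixed delay between attempts
    return None
```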

### Partial Responses

- If the API returns a response but some scores are missing, default values (0.0) are used for consistency
- Standardized aspects/topics always appear in the results, even if not mentioned in the API response
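
The default-filling behavior can be illustrated with a short sketch. The helper `with_defaults` is hypothetical; it covers only the four top-level metrics and assumes that a missing, empty, or non-numeric value should fall back to 0.0.

```python
DEFAULT_SCORE = 0.0
CORE_METRICS = [
    "polarity_detection",
    "emotion_detection",
    "intent_analysis",
    "subjectivity_objectivity",
]


def with_defaults(response):
    """Return the core metrics as floats, using 0.0 for anything missing or invalid."""
    scores = {}
    for key in CORE_METRICS:
        value = response.get(key, DEFAULT_SCORE)
        if isinstance(value, list):
            # Some responses wrap a score in a list; take the first element
            value = value[0] if value else DEFAULT_SCORE
        try:
            scores[key] = float(value)
        except (TypeError, ValueError):
            scores[key] = DEFAULT_SCORE
    return scores
```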

### Complete Failures

- If a row completely fails after all retries, that row will be included in the output without sentiment scores
- The code will continue processing other rows

### Progress Tracking

- Progress is saved after every 5 rows
- Processing can be resumed after interruption
- Option to force reprocessing if needed
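
Resuming works by storing the set of processed row indices as a JSON list and subtracting it from the full index set on the next run. A minimal sketch of that round trip follows; the function names are illustrative, and the demo writes to a temporary directory rather than the real `progress/mda/` folder.

```python
import json
import os
import tempfile


def load_progress(path):
    """Return the set of processed row indices, or an empty set if no file exists."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return set(json.load(f))
    return set()


def save_progress(path, indices):
    """Write the processed row indices to disk as a sorted JSON list."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(sorted(indices), f)


# Demo: rows 0-2 were processed earlier, so only rows 3 and 4 remain
with tempfile.TemporaryDirectory() as tmp:
    progress_file = os.path.join(tmp, "mda", "mda_2021_progress.json")
    save_progress(progress_file, {0, 1, 2})
    remaining = sorted(set(range(5)) - load_progress(progress_file))
    print(remaining)
```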

## Customization Options

Several parameters can be customized in the script:

### API Configuration

```python
API_KEY = "..."         # Your Gemini API key
MODEL = "gemini-2.0-flash"  # Gemini model to use
```

### Rate Limiting

```python
MAX_REQUESTS_PER_MINUTE = 1900  # Maximum API requests per minute
MAX_REQUESTS_PER_DAY = 100000   # Maximum API requests per day
REQUEST_DELAY = 0.5             # Delay between requests (seconds)
```
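
The per-minute limit is enforced with a sliding window of request timestamps. The sketch below isolates that calculation as a pure function so it can be checked without sleeping; `seconds_to_wait` is a hypothetical name, and the script's own limiter additionally sleeps for `REQUEST_DELAY` and counts daily requests.

```python
from collections import deque


def seconds_to_wait(timestamps, now, max_per_minute):
    """Return how long to wait before the next request is allowed.

    `timestamps` is a deque of earlier request times (oldest first);
    entries older than 60 seconds are discarded in place.
    """
    while timestamps and now - timestamps[0] > 60:
        timestamps.popleft()
    if len(timestamps) >= max_per_minute:
        # Wait until the oldest request falls out of the 60-second window
        return max(0.0, 60 - (now - timestamps[0]))
    return 0.0


window = deque([0.0, 10.0])
print(seconds_to_wait(window, 30.0, max_per_minute=2))
```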

### Text Processing

```python
MAX_TEXT_LENGTH = 60000  # Maximum text length to send to API
SAVE_MDA_TEXT = False    # Whether to include MDA_Text in output
```
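
Truncation is a plain character cut at `MAX_TEXT_LENGTH`, so anything past the limit is not analyzed. A sketch of the behavior, with a hypothetical helper that also reports whether a cut happened:

```python
MAX_TEXT_LENGTH = 60000


def truncate_text(text, limit=MAX_TEXT_LENGTH):
    """Return (text, was_truncated), cutting the text to at most `limit` characters."""
    if len(text) <= limit:
        return text, False
    return text[:limit], True


short, cut = truncate_text("Revenue increased.")
print(len(short), cut)
```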

### Progress and Output

```python
SAVE_FREQUENCY = 20  # Save progress after every N batches
```

## Troubleshooting

### API Connection Issues

- Check that your API key is valid
- Verify network connectivity
- Review the log files for detailed error messages

### Missing Sentiment Scores

- Check the log files for API response errors
- Verify that the input text is valid and not too short
- Try running with `force_reprocess=True` to reanalyze problem rows

### Performance Issues

- Adjust `MAX_REQUESTS_PER_MINUTE` to match your API rate limits
- For very large datasets, process one file at a time
- Consider reducing `MAX_TEXT_LENGTH` if texts are too large

## Best Practices

1. **Test with small samples** before processing entire datasets
2. **Monitor log files** during processing to catch issues early
3. **Make backups** of important results
4. **Review sample files** to verify quality before full processing
5. **Process one year at a time** for very large datasets

## Limitations and Considerations

- API rate limits may restrict processing speed
- Very long MD&A sections are truncated to 60,000 characters
- API costs may apply depending on your Google API agreement
- Some nuanced financial language may be challenging for sentiment analysis

In [None]:
import pandas as pd
import json
import time
import os
import logging
import glob
import re
from datetime import datetime
from collections import deque
from tqdm import tqdm
from google import genai

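# Set up logging (log files go to log/sentiment/mda/, matching the documented layout)
LOG_DIR = os.path.join("log", "sentiment", "mda")
os.makedirs(LOG_DIR, exist_ok=True)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(os.path.join(LOG_DIR, f'gemini_sentiment_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')),
        logging.StreamHandler()
    ]
)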
logger = logging.getLogger(__name__)

# Gemini API configuration
API_KEY = "YOUR_GEMINI_API_KEY"  # Replace with your actual API key; never commit a real key
MODEL = "gemini-2.0-flash"

# Set up Gemini client (the google-genai SDK is configured through the client itself,
# so no separate module-level configure call is needed)
client = genai.Client(api_key=API_KEY)

# Constants for API limits and processing
BATCH_SIZE = 5  # Default batch size
MAX_REQUESTS_PER_MINUTE = 1900  # Set below 2000 RPM to be safe
MAX_REQUESTS_PER_DAY = 100000  # Large value, adjust based on your needs
REQUEST_DELAY = 0.5  # Shorter delay between requests due to higher limits
SAVE_FREQUENCY = 20  # Save progress after every 20 batches
MAX_TEXT_LENGTH = 60000  # Increased maximum length for text inputs
SAVE_MDA_TEXT = False  # Whether to include the full MDA text in the final output

# Create directories for outputs and progress tracking
os.makedirs("sentiment_results/mda", exist_ok=True)
os.makedirs("progress", exist_ok=True)

# Class to track and enforce rate limits
class RateLimiter:
    def __init__(self, max_per_minute, max_per_day):
        self.max_per_minute = max_per_minute
        self.max_per_day = max_per_day
        self.minute_requests = deque()
        self.daily_requests = 0
        self.start_time = time.time()
    
    def check_and_wait(self):
        """Check if we can make a request, wait if needed, and track the request"""
        current_time = time.time()
        
        # Check daily limit
        if self.daily_requests >= self.max_per_day:
            logger.warning(f"Reached maximum daily request limit of {self.max_per_day}")
            return False
        
        # Clean up minute_requests older than 60 seconds
        while self.minute_requests and current_time - self.minute_requests[0] > 60:
            self.minute_requests.popleft()
        
        # Check if we're at the per-minute limit
        if len(self.minute_requests) >= self.max_per_minute:
            wait_time = 60 - (current_time - self.minute_requests[0])
            if wait_time > 0:
                logger.info(f"Rate limit approaching: Waiting {wait_time:.2f} seconds before next request")
                time.sleep(wait_time)
        
        # Always wait the minimum delay between requests
        time.sleep(REQUEST_DELAY)
        
        # Record this request
        self.minute_requests.append(time.time())
        self.daily_requests += 1
        
        return True

# Create rate limiter instance
rate_limiter = RateLimiter(MAX_REQUESTS_PER_MINUTE, MAX_REQUESTS_PER_DAY)

# Helper function to check if a row has already been processed
def get_processed_indices(progress_file):
    """Load the set of indices that have already been processed"""
    if os.path.exists(progress_file):
        with open(progress_file, 'r') as f:
            return set(json.load(f))
    return set()

# Helper function to save processed indices
def save_processed_indices(progress_file, processed_indices):
    """Save the set of indices that have been processed"""
    with open(progress_file, 'w') as f:
        json.dump(list(processed_indices), f)

# Function to analyze sentiment via Gemini API with rate limiting
def analyze_sentiment_with_gemini(text):
    """
    Analyze sentiment of a single text using Gemini API
    """
    try:
        # For very long texts, use a simpler prompt to reduce token count
        if len(text) > 30000:
            prompt = """
Analyze this financial text for sentiment and provide a JSON response. 
Use a scale from -1 (strongly negative) to 1 (strongly positive) for all scores.

Text: {text}

Even if some aspects or topics are not explicitly mentioned, provide a score of 0 for them.

Format your response in this JSON structure:
{{
  "polarity_detection": score,
  "emotion_detection": score,
  "intent_analysis": score,
  "subjectivity_objectivity": score,
  "fine_grained_sentiment": {{
    "average_score": score
  }},
  "aspect_based_sentiment": {{
    "aspect_scores": [
      {{"Financial Performance - Revenue Growth": score}},
      {{"Financial Performance - Earnings": score}},
      {{"Financial Performance - Profit Margins": score}},
      {{"Financial Performance - Debt Levels": score}},
      {{"Financial Performance - Cash Flow": score}},
      {{"Management and Leadership": score}},
      {{"Product/Service Performance": score}},
      {{"Future Outlook": score}},
      {{"Legal and Risk": score}}
    ],
    "average_score": score
  }},
  "topic_sentiment_analysis": {{
    "topic_scores": [
      {{"Financial Performance": score}},
      {{"Management and Leadership": score}},
      {{"Product/Service Performance": score}},
      {{"Industry and Market Factors": score}},
      {{"Future Outlook": score}},
      {{"Legal and Risk": score}}
    ],
    "average_score": score
  }},
  "contextual_sentiment": {{
    "average_score": score
  }},
  "gemini_analysis": {{
    "overall_sentiment": score,
    "emotional_sentiment": score,
    "contextual_sentiment": score
  }}
}}
""".format(text=text)
        else:
            # Full detailed analysis for shorter texts
            prompt = """
Analyze the following text for sentiment using multiple methods. Provide sentiment scores between -1 (strongly negative) and 1 (strongly positive) for each method.

Text: {text}

Methods:

1. Polarity Detection (Basic Sentiment): Provide a single score representing the overall positive, negative, or neutral sentiment.

2. Fine-Grained Sentiment Analysis: Provide individual scores for different sections or perspectives within the text, if applicable, and an average score.

3. Aspect-Based Sentiment Analysis (ABSA): Identify key aspects or attributes in the text and provide sentiment scores for each aspect, and an average score. Use the following consistent aspects:
   * Financial Performance:
       * Revenue Growth
       * Earnings (EPS)
       * Profit Margins
       * Debt Levels
       * Cash Flow
   * Management and Leadership:
       * CEO Performance
       * Management Team
       * Corporate Governance
   * Product/Service Performance:
       * Product Innovation
       * Market Share
       * Customer Satisfaction
       * Product Quality
   * Future Outlook:
       * Growth Potential
       * Expansion Plans
       * Analyst Ratings
       * Investor Confidence
   * Legal and Risk:
       * Litigation
       * Financial Risks
       * Reputation Risks

4. Topic-Sentiment Analysis: Identify key topics or themes in the text and provide sentiment scores for each topic, and an average score. Use the following consistent topics:
   * Financial Performance
   * Management and Leadership
   * Product/Service Performance
   * Industry and Market Factors
   * Future Outlook
   * Legal and Risk

5. Emotion Detection: Provide a score representing the dominant emotion(s) expressed in the text. If multiple emotions are present, provide a weighted average.

6. Intent Analysis: Provide a score representing the overall intent or purpose of the text. If there are multiple intents, provide an average score.

7. Subjectivity/Objectivity Detection: Provide a score representing the overall level of subjectivity or objectivity in the text.

8. Contextual Sentiment Analysis: Provide individual scores for different contextual elements or speakers within the text, and an average score.

9. Deep Learning-Based Approach (Gemini's Analysis): Provide overall, emotional, and contextual sentiment scores using Gemini's advanced analysis.

Instructions for handling missing aspects or topics:
- If an aspect or topic is not mentioned in the text, assign it a score of 0 (neutral)
- Do not skip any predefined aspects or topics in your response
- For aspects or topics with minimal information, base the score on whatever limited information is available

Format your response as a JSON object with the following structure:

{{
  "polarity_detection": score,
  "fine_grained_sentiment": {{
    "individual_scores": [{{"section/perspective": score}}],
    "average_score": score
  }},
  "aspect_based_sentiment": {{
    "aspect_scores": [{{"aspect": score}}],
    "average_score": score
  }},
  "topic_sentiment_analysis": {{
    "topic_scores": [{{"topic": score}}],
    "average_score": score
  }},
  "emotion_detection": score,
  "intent_analysis": score,
  "subjectivity_objectivity": score,
  "contextual_sentiment": {{
    "context_scores": [{{"context_element": score}}],
    "average_score": score
  }},
  "gemini_analysis": {{
    "overall_sentiment": score,
    "emotional_sentiment": score,
    "contextual_sentiment": score
  }}
}}
""".format(text=text)

        # Check rate limits
        if not rate_limiter.check_and_wait():
            logger.warning("Rate limit reached. Skipping request.")
            return None

        # Call the Gemini API (no safety settings are passed, so the defaults apply)
        try:
            response = client.models.generate_content(
                model=MODEL,
                contents=prompt
            )
            
            # The text can be missing or empty when the prompt or output is blocked
            response_text = getattr(response, 'text', None)
            if not response_text:
                prompt_feedback = getattr(response, 'prompt_feedback', None)
                if prompt_feedback:
                    logger.error(f"Content filtered: {prompt_feedback}")
                else:
                    logger.error("Empty response from Gemini API")
                return None
            
            return response_text
        except Exception as e:
            logger.error(f"Error in Gemini API call: {e}")
            # Print more detailed error info if available
            import traceback
            logger.error(f"Error details: {traceback.format_exc()}")
            return None
    
    except Exception as e:
        logger.error(f"Error preparing sentiment analysis: {e}")
        return None

def extract_json_from_response(response_text):
    """
    Extract the JSON object from the API response text
    """
    try:
        # Find the start and end of the JSON object
        start_idx = response_text.find('{')
        end_idx = response_text.rfind('}') + 1  # 0 when no closing brace is found
        
        if start_idx != -1 and end_idx > start_idx:
            json_str = response_text[start_idx:end_idx]
            # Clean up potential issues before parsing
            json_str = re.sub(r',\s*}', '}', json_str)  # Remove trailing commas
            json_str = re.sub(r',\s*]', ']', json_str)  # Remove trailing commas in arrays
            return json.loads(json_str)
        else:
            logger.error("JSON object not found in response")
            return None
    except json.JSONDecodeError as e:
        logger.error(f"Error parsing JSON: {e}")
        logger.error(f"Problematic JSON: {response_text}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error extracting JSON: {e}")
        return None

def flatten_sentiment_json(json_obj):
    """
    Flatten the nested JSON structure into a dictionary with simple key-value pairs
    """
    flat_dict = {}
    
    if not json_obj:
        logger.warning("Empty JSON object passed to flatten_sentiment_json")
        return flat_dict
    
    # Debug: Print the raw JSON
    logger.info(f"Flattening JSON with keys: {list(json_obj.keys())}")
    
    # Add default values for key metrics to ensure they exist even if missing in the JSON
    flat_dict["polarity_detection"] = 0.0
    flat_dict["emotion_detection"] = 0.0
    flat_dict["intent_analysis"] = 0.0
    flat_dict["subjectivity_objectivity"] = 0.0
    flat_dict["gemini_overall_sentiment"] = 0.0
    flat_dict["gemini_emotional_sentiment"] = 0.0
    flat_dict["gemini_contextual_sentiment"] = 0.0
    
    # Extract top-level simple scores
    for key in ['polarity_detection', 'emotion_detection', 'intent_analysis', 'subjectivity_objectivity']:
        if key in json_obj:
            try:
                # Extract the value and convert to float if needed
                value = json_obj[key]
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]  # Extract first element if it's a list
                flat_dict[key] = float(value)
                logger.info(f"Extracted {key}: {flat_dict[key]}")
            except Exception as e:
                logger.error(f"Error extracting {key}: {e}, value was: {json_obj[key]}")
    
    # Extract Gemini Analysis scores
    if 'gemini_analysis' in json_obj:
        for sub_key, value in json_obj['gemini_analysis'].items():
            try:
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]  # Extract first element if it's a list
                flat_dict[f"gemini_{sub_key}"] = float(value)
                logger.info(f"Extracted gemini_{sub_key}: {flat_dict[f'gemini_{sub_key}']}")
            except Exception as e:
                logger.error(f"Error extracting gemini_{sub_key}: {e}, value was: {value}")
    
    # Extract average scores from complex objects
    for key in ['fine_grained_sentiment', 'aspect_based_sentiment', 'topic_sentiment_analysis', 'contextual_sentiment']:
        if key in json_obj and 'average_score' in json_obj[key]:
            try:
                value = json_obj[key]['average_score']
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]  # Extract first element if it's a list
                flat_dict[f"{key}_avg"] = float(value)
                logger.info(f"Extracted {key}_avg: {flat_dict[f'{key}_avg']}")
            except Exception as e:
                logger.error(f"Error extracting {key}_avg: {e}, value was: {json_obj[key]['average_score']}")
    
    # Extract standardized aspect and topic scores
    if 'aspect_based_sentiment' in json_obj and 'aspect_scores' in json_obj['aspect_based_sentiment']:
        aspect_scores = json_obj['aspect_based_sentiment']['aspect_scores']
        if isinstance(aspect_scores, list):
            for score_dict in aspect_scores:
                try:
                    for aspect, score in score_dict.items():
                        # Clean up the aspect name for column naming
                        clean_aspect = aspect.strip().replace(' - ', '_').replace('/', '_').replace(' ', '_').lower()
                        if isinstance(score, list) and len(score) > 0:
                            score = score[0]  # Extract first element if it's a list
                        flat_dict[f"aspect_{clean_aspect}"] = float(score)
                        logger.info(f"Extracted aspect_{clean_aspect}: {flat_dict[f'aspect_{clean_aspect}']}")
                except Exception as e:
                    logger.error(f"Error extracting aspect score: {e}, value was: {score_dict}")
    
    if 'topic_sentiment_analysis' in json_obj and 'topic_scores' in json_obj['topic_sentiment_analysis']:
        topic_scores = json_obj['topic_sentiment_analysis']['topic_scores']
        if isinstance(topic_scores, list):
            for score_dict in topic_scores:
                try:
                    for topic, score in score_dict.items():
                        # Clean up the topic name for column naming
                        clean_topic = topic.strip().replace('/', '_').replace(' ', '_').lower()
                        if isinstance(score, list) and len(score) > 0:
                            score = score[0]  # Extract first element if it's a list
                        flat_dict[f"topic_{clean_topic}"] = float(score)
                        logger.info(f"Extracted topic_{clean_topic}: {flat_dict[f'topic_{clean_topic}']}")
                except Exception as e:
                    logger.error(f"Error extracting topic score: {e}, value was: {score_dict}")
    
    # Store the complete JSON as a string for reference
    try:
        flat_dict['full_sentiment_json'] = json.dumps(json_obj)
    except Exception as e:
        logger.error(f"Error converting JSON to string: {e}")
        flat_dict['full_sentiment_json'] = str(json_obj)
    
    logger.info(f"Flattened JSON into {len(flat_dict)} key-value pairs")
    return flat_dict

def convert_filed_date(date_str):
    """Convert filing date string to date only (YYYY-MM-DD)"""
    try:
        # Handle different date formats
        date_formats = [
            "%Y-%m-%dT%H:%M:%S%z",    # Format with timezone offset
            "%Y-%m-%dT%H:%M:%S-%z",   # Format with timezone offset separated by dash
            "%Y-%m-%dT%H:%M:%S",      # Format without timezone
            "%Y-%m-%d"                # Just date
        ]
        
        for fmt in date_formats:
            try:
                # Convert to datetime then extract only the date part
                dt = pd.to_datetime(date_str, format=fmt)
                return dt.date()  # Return only the date part (YYYY-MM-DD)
            except Exception:
                continue
                
        # If specific formats fail, try pandas default parser
        dt = pd.to_datetime(date_str)
        return dt.date()  # Return only the date part
    except Exception:
        # Return original if all parsing attempts fail
        logger.warning(f"Could not parse date: {date_str}")
        return date_str

def process_mda_file(file_path, output_dir="sentiment_results/mda", num_rows=None, retry_limit=3, force_reprocess=True):
    """
    Process a single MDA file and run sentiment analysis on the specified number of rows
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV file
    output_dir : str
        Directory to save results
    num_rows : int or None
        Number of rows to process (None for all)
    retry_limit : int
        Number of retries for API calls
    force_reprocess : bool
        If True, reprocess all rows even if they've been processed before
    """
    try:
        # Create output directory if it doesn't exist
        os.makedirs(output_dir, exist_ok=True)
        
        # Extract year from filename
        year = os.path.basename(file_path).split('_')[-1].split('.')[0]
        
        # Load the CSV file
        logger.info(f"Loading MDA file: {file_path}")
        df = pd.read_csv(file_path)
        
        # Take specified number of rows for testing
        if num_rows and num_rows < len(df):
            df = df.head(num_rows)
        logger.info(f"Processing {len(df)} rows from {file_path}")
        
        # Convert FiledAt to datetime
        if 'FiledAt' in df.columns:
            logger.info("Converting FiledAt to date only")
            df['FiledAt'] = df['FiledAt'].apply(convert_filed_date)
        
        # Set up progress tracking
        progress_file = f"progress/mda/mda_{year}_progress.json"
        os.makedirs(os.path.dirname(progress_file), exist_ok=True)
        processed_indices = get_processed_indices(progress_file)
        
        # Find unprocessed rows
        unprocessed_indices = set(df.index)
        if not force_reprocess:
            # Only filter for unprocessed rows if not forcing reprocessing
            unprocessed_indices = unprocessed_indices - processed_indices
            logger.info(f"Found {len(unprocessed_indices)} unprocessed rows")
        else:
            logger.info(f"Force reprocessing all {len(unprocessed_indices)} rows")
        
        # Convert to list and sort for deterministic order
        unprocessed_indices = sorted(list(unprocessed_indices))
        
        # Track results
        results = {}
        
        # Process each row
        for idx in tqdm(unprocessed_indices, desc="Processing rows"):
            mda_text = df.loc[idx, 'MDA_Text']
            
            if not isinstance(mda_text, str) or not mda_text.strip():
                logger.warning(f"Empty or invalid MDA_Text for row {idx}, skipping")
                processed_indices.add(idx)
                continue
            
            logger.info(f"Analyzing sentiment for row {idx} (length: {len(mda_text)} chars)")
            
            # Truncate text if too long (Gemini has a token limit)
            if len(mda_text) > MAX_TEXT_LENGTH:
                logger.warning(f"Text too long ({len(mda_text)} chars), truncating to {MAX_TEXT_LENGTH} chars")
                mda_text = mda_text[:MAX_TEXT_LENGTH]
            
            # Retry mechanism
            success = False
            for attempt in range(retry_limit):
                try:
                    # Call Gemini API
                    response_text = analyze_sentiment_with_gemini(mda_text)
                    
                    # Debug: Print first 500 chars of the API response
                    logger.info(f"API Response (first 500 chars): {response_text[:500] if response_text else 'None'}")
                    
                    if not response_text:
                        logger.error(f"No response from API for row {idx}, attempt {attempt+1}/{retry_limit}")
                        time.sleep(3)  # Wait before retry
                        continue
                    
                    # Extract JSON from response
                    sentiment_json = extract_json_from_response(response_text)
                    
                    # Debug: Print if JSON was extracted
                    if sentiment_json:
                        logger.info(f"Successfully extracted JSON with {len(sentiment_json)} keys")
                    else:
                        logger.error(f"Failed to extract JSON from response for row {idx}, attempt {attempt+1}/{retry_limit}")
                        time.sleep(3)  # Wait before retry
                        continue
                    
                    # Flatten JSON
                    flat_sentiment = flatten_sentiment_json(sentiment_json)
                    
                    # Debug: Print flattened data
                    logger.info(f"Flattened sentiment data contains {len(flat_sentiment)} keys")
                    for key in list(flat_sentiment.keys())[:5]:  # Show first 5 keys as example
                        if key != 'full_sentiment_json':
                            logger.info(f"Sample data - {key}: {flat_sentiment[key]}")
                    
                    # Store results
                    results[idx] = flat_sentiment
                    processed_indices.add(idx)
                    
                    logger.info(f"Successfully analyzed sentiment for row {idx}")
                    success = True
                    break
                    
                except Exception as e:
                    logger.error(f"Error on attempt {attempt+1}/{retry_limit} for row {idx}: {e}")
                    time.sleep(3)  # Wait before retry
            
            if not success:
                logger.error(f"Failed to analyze sentiment for row {idx} after {retry_limit} attempts")
                
            # Save progress periodically after every 5 processed rows
            if len(results) % 5 == 0:
                save_processed_indices(progress_file, processed_indices)
                logger.info(f"Saved progress after processing {len(results)} rows")
        
        # Create a copy of the dataframe for results
        result_df = df.copy()
        
        # Add sentiment results to dataframe
        if results:
            logger.info(f"Adding {len(results)} results to dataframe")
            for idx, result in results.items():
                for key, value in result.items():
                    if key != 'full_sentiment_json':  # Exclude the full JSON from the main dataframe
                        result_df.at[idx, key] = value
                        
            # Debug: Check if data was added correctly
            sentiment_columns = [col for col in result_df.columns if col not in df.columns]
            logger.info(f"Added {len(sentiment_columns)} new columns to dataframe")
            if sentiment_columns:
                logger.info(f"New columns: {sentiment_columns[:5]}...")
            else:
                logger.error("No new columns were added to the dataframe!")
        else:
            logger.warning("No results to add to dataframe!")
        
        # Remove MDA_Text column if not saving it
        if not SAVE_MDA_TEXT and 'MDA_Text' in result_df.columns:
            logger.info("Removing MDA_Text column from output to save space")
            result_df = result_df.drop('MDA_Text', axis=1)
        
        # Debug: Print dataframe info
        logger.info(f"Result dataframe has {len(result_df)} rows and {len(result_df.columns)} columns")
        
        # Save results; index=False means the dataframe index is not written to the CSV
        output_file = os.path.join(output_dir, f"mda_sentiment_{year}_results.csv")
        result_df.to_csv(output_file, index=False)
        logger.info(f"Saved results to {output_file}")
        
        # Save sample results for this file
        sample_size = min(5, len(result_df))
        if sample_size > 0:
            sample_df = result_df.head(sample_size)
            sample_file = os.path.join(output_dir, f"sample_mda_sentiment_{year}.csv")
            sample_df.to_csv(sample_file, index=False)
            logger.info(f"Saved sample of {sample_size} rows to {sample_file}")
        
        # Debug: Create a simplified CSV with just a few key columns for quick inspection
        debug_columns = ['Ticker', 'FormType', 'FiledAt']
        for col in result_df.columns:
            if any(term in col for term in ['polarity', 'emotion', 'intent', 'gemini']):
                debug_columns.append(col)
                
        if len(debug_columns) > 3:  # If we found any sentiment columns
            debug_df = result_df[debug_columns[:10]]  # First 10 columns for readability
            debug_file = os.path.join(output_dir, f"debug_sentiment_{year}.csv")
            debug_df.to_csv(debug_file, index=False)
            logger.info(f"Saved debug CSV with key columns to {debug_file}")
        
        # Testing fallback: if no polarity columns exist, write placeholder scores to a separate file.
        # A copy is used so the placeholder values never leak into the returned dataframe.
        if not any('polarity' in col for col in result_df.columns):
            logger.warning("No sentiment columns found - adding dummy data for testing")
            dummy_df = result_df.copy()
            dummy_df['polarity_detection'] = 0.1
            dummy_df['emotion_detection'] = 0.2
            dummy_df['intent_analysis'] = 0.3
            dummy_file = os.path.join(output_dir, f"dummy_sentiment_{year}.csv")
            dummy_df.to_csv(dummy_file, index=False)
            logger.info(f"Saved dummy data to {dummy_file}")
        
        # Save full results with JSON
        full_output_file = os.path.join(output_dir, f"mda_sentiment_{year}_full_results.json")
        with open(full_output_file, 'w') as f:
            json.dump({str(idx): results[idx]['full_sentiment_json'] for idx in results if 'full_sentiment_json' in results[idx]}, f)
        logger.info(f"Saved full JSON results to {full_output_file}")
        
        # Save progress
        save_processed_indices(progress_file, processed_indices)
        
        return result_df
        
    except Exception as e:
        # logger.exception logs at ERROR level and appends the current traceback
        logger.exception(f"Error processing file {file_path}: {e}")
        return None

def find_first_mda_file():
    """
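    Find the first MDA file (in sorted filename order) in the data directory
    """
    # glob returns paths in arbitrary order, so sort to make "first" deterministic
    files = sorted(glob.glob("data/mda_data/mda_sections_*.csv"))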
    if files:
        return files[0]
    return None

def find_all_mda_files():
    """
    Find all MDA files in the data directory
    """
    # Sort so files are processed in a stable filename order
    files = sorted(glob.glob("data/mda_data/mda_sections_*.csv"))
    return files

def process_multiple_mda_files(files=None, rows_per_file=None, force_reprocess=True):
    """
    Process multiple MDA files, running sentiment analysis on each
    
    Parameters:
    -----------
    files : list or None
        List of file paths to process. If None, processes all MDA files
    rows_per_file : int or None
        Number of rows to process per file. If None, processes all rows
    force_reprocess : bool
        If True, reprocess all rows even if they've been processed before
    """
    if files is None:
        # Find all MDA files
        files = find_all_mda_files()
    
    if not files:
        logger.error("No MDA files found")
        return
    
    logger.info(f"Found {len(files)} MDA files to process")
    
    for file_path in files:
        logger.info(f"Processing file: {file_path}")
        process_mda_file(file_path, num_rows=rows_per_file, force_reprocess=force_reprocess)
        
        # Ask if user wants to continue to the next file
        if file_path != files[-1]:  # If not the last file
            continue_choice = input("Continue to next file? (y/n): ")
            if continue_choice.strip().lower() != 'y':
                logger.info("Stopping process as requested by user")
                break

def test_gemini_api():
    """Test the Gemini API with a small sample text"""
    logger.info("Testing Gemini API with a small sample...")
    sample_text = "The company reported strong revenue growth and exceeded analyst expectations for the quarter. However, challenges in the supply chain have impacted margins."
    
    try:
        # Simple prompt for testing
        prompt = "Analyze this text for sentiment and respond with a single word: positive, negative, or neutral: " + sample_text
        
        response = client.models.generate_content(
            model=MODEL,
            contents=prompt
        )
        
        logger.info(f"API Test Response: {response.text}")
        logger.info("Gemini API test successful!")
        return True
    except Exception as e:
        logger.error(f"Gemini API test failed: {e}")
        return False

def main():
    # First test if the API is working
    if not test_gemini_api():
        logger.error("Gemini API test failed. Please check your API key and connection.")
        return
    
    # Process all files
    print("\n=== PROCESSING ALL MDA FILES ===")
    print("This will process all available MDA files in the data directory")
    
    # Ask for processing parameters
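    rows_input = input("\nHow many rows per file? (Enter a number or 'all' for all rows): ").strip()
    force_reprocess_input = input("Force reprocessing of all rows, even if previously processed? (y/n): ")
    force_reprocess = force_reprocess_input.strip().lower() == 'y'
    if rows_input.lower() == 'all':
        rows_per_file = None
    else:
        # Reject anything that is not an integer instead of crashing with a ValueError
        try:
            rows_per_file = int(rows_input)
        except ValueError:
            logger.error(f"Invalid row count {rows_input!r}. Enter a number or 'all'.")
            return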
    
    # Process files
    process_multiple_mda_files(rows_per_file=rows_per_file, force_reprocess=force_reprocess)

if __name__ == "__main__":
    main()

2025-03-31 23:46:02,307 - INFO - Testing Gemini API with a small sample...
2025-03-31 23:46:02,308 - INFO - AFC is enabled with max remote calls: 10.
2025-03-31 23:46:02,768 - INFO - AFC remote call 1 is done.
2025-03-31 23:46:02,769 - INFO - API Test Response: Positive

2025-03-31 23:46:02,769 - INFO - Gemini API test successful!



=== PROCESSING ALL MDA FILES ===
This will process all available MDA files in the data directory



How many rows per file? (Enter a number or 'all' for all rows):  all
Force reprocessing of all rows, even if previously processed? (y/n):  y


2025-03-31 23:46:07,154 - INFO - Found 1 MDA files to process
2025-03-31 23:46:07,155 - INFO - Processing file: data/mda_data/mda_sections_2023.csv
2025-03-31 23:46:07,155 - INFO - Loading MDA file: data/mda_data/mda_sections_2023.csv
2025-03-31 23:46:08,050 - INFO - Processing 1943 rows from data/mda_data/mda_sections_2023.csv
2025-03-31 23:46:08,051 - INFO - Converting FiledAt to date only
2025-03-31 23:46:08,112 - INFO - Force reprocessing all 1943 rows
Processing rows:   0%|          | 0/1943 [00:00<?, ?it/s]2025-03-31 23:46:08,117 - INFO - Analyzing sentiment for row 0 (length: 87024 chars)
2025-03-31 23:46:08,618 - INFO - AFC is enabled with max remote calls: 10.
2025-03-31 23:46:11,306 - INFO - AFC remote call 1 is done.
2025-03-31 23:46:11,307 - INFO - API Response (first 500 chars): ```json
{
  "polarity_detection": -0.4,
  "emotion_detection": -0.2,
  "intent_analysis": 0.1,
  "subjectivity_objectivity": -0.1,
  "fine_grained_sentiment": {
    "average_score": -0.25
  },
  "a