# Gemini Financial Sentiment Analysis Tool Reprocessing

## Overview

This Python script uses Google's Gemini API to analyze sentiment in Management Discussion and Analysis (MDA) sections from financial filings. It focuses on reprocessing documents with missing sentiment scores, extracting various sentiment metrics, and saving results in structured formats.

## Key Features

- Automated sentiment analysis of financial texts using Google's Gemini AI
- Reprocessing capability for documents with missing sentiment scores
- Multiple sentiment analysis methods on a scale from -1 (negative) to 1 (positive):
  - Basic polarity detection
  - Aspect-based analysis (revenue, earnings, management, etc.)
  - Topic-based analysis
  - Emotion and intent detection
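- Sliding-window rate limiting (per-minute and per-day request caps, plus a fixed delay between requests) to avoid API throttling
- Per-row error handling with retries, logged to both the console and a timestamped log file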

## Installation

### Requirements
- Python 3.9+ (the minimum supported by the `google-genai` SDK that the script imports)
- Google Gemini API key
- Required packages: pandas, tqdm, google-genai (the script uses `from google import genai` and `genai.Client`)

### Setup
```bash
pip install pandas tqdm google-genai
```

## Configuration

Edit these parameters at the top of the script:

```python
API_KEY = "your-api-key-here"  # Required
MODEL = "gemini-2.0-flash"     # Gemini model to use
BATCH_SIZE = 5                 # Number of texts per batch
MAX_REQUESTS_PER_MINUTE = 1900 # API rate limit
REQUEST_DELAY = 0.5            # Seconds between requests
SAVE_FREQUENCY = 20            # Save an intermediate CSV every N successful results
MAX_TEXT_LENGTH = 60000        # Maximum characters sent per text; longer texts are truncated
```
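
Hard-coding the key in the script makes it easy to leak through version control. As a minimal sketch of an alternative, the key could be read from the environment. The variable name `GEMINI_API_KEY` and the helper `load_api_key` are assumptions for illustration and do not appear in the script:

```python
import os

def load_api_key(env_var="GEMINI_API_KEY", environ=None):
    """Return the API key from the environment, or raise a clear error.

    `env_var` is a hypothetical variable name; `environ` can be injected for testing.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    # Reject both a missing key and the README placeholder value
    if not key or key == "your-api-key-here":
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With this in place, `API_KEY = load_api_key()` would replace the literal assignment.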

## Usage

### Basic Operation

1. Run the script:
   ```bash
   python gemini_reprocess.py
   ```

2. Follow the interactive prompts:
   - Enter years to process (comma-separated or "all")
   - Confirm file paths or provide alternatives
   - Review progress in the terminal
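
The year prompt accepts either a comma-separated list or "all". The script's own parsing is not shown in this section, so the following is only a sketch of how such input might be handled; `parse_years` and `available_years` are hypothetical names:

```python
def parse_years(raw, available_years):
    """Turn input such as '2019, 2021' or 'all' into a sorted list of years.

    Years not present in `available_years` are silently dropped.
    """
    raw = raw.strip().lower()
    if raw == "all":
        return sorted(available_years)
    years = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing commas and empty entries
        year = int(token)  # raises ValueError on non-numeric input
        if year in available_years and year not in years:
            years.append(year)
    return sorted(years)
```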

### Expected File Structure

```
project/
├── data/
│   └── mda_data/
│       └── mda_sections_YEAR.csv    # Original MDA text data
├── sentiment_results/
│   └── mda/
│       ├── mda_sentiment_YEAR_results.csv      # Input with missing scores
│       └── mda_sentiment_YEAR_results_v2.csv   # Output with updated scores
└── gemini_reprocess.py
```
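
The naming pattern in the tree above can be captured in a small helper. This is a sketch only: `paths_for_year` is a hypothetical function, and it assumes the directory layout is exactly as drawn:

```python
import os

def paths_for_year(year, root="."):
    """Build the three per-year file paths implied by the project layout."""
    results_dir = os.path.join(root, "sentiment_results", "mda")
    return {
        # Original MDA text data
        "mda": os.path.join(root, "data", "mda_data", f"mda_sections_{year}.csv"),
        # Existing results that still contain missing scores
        "input": os.path.join(results_dir, f"mda_sentiment_{year}_results.csv"),
        # Destination for the reprocessed results
        "output": os.path.join(results_dir, f"mda_sentiment_{year}_results_v2.csv"),
    }
```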

## Output Files

The script generates:

1. **Updated CSV file** (`mda_sentiment_YEAR_results_v2.csv`): Contains all sentiment scores
2. **Full JSON results** (`mda_sentiment_YEAR_results_v2_full_results.json`): Detailed sentiment data
3. **Log file** (`gemini_reprocess_DATETIME.log`): Processing details and errors
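
In the full-results file, each row index maps to a JSON-encoded string rather than a nested object, so reading it back takes two decoding steps. A minimal sketch, assuming the row indices are integers (as with a default pandas index); `load_full_results` is a hypothetical helper:

```python
import json

def load_full_results(path):
    """Load the full-results file and decode each row's stored JSON string."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    decoded = {}
    for row_index, payload in raw.items():
        # Each value is itself a JSON string, so it needs a second decode
        decoded[int(row_index)] = json.loads(payload) if isinstance(payload, str) else payload
    return decoded
```

If the script fell back to `str(json_obj)` for a row because serialization failed, the second decode would raise `json.JSONDecodeError`; callers may want to catch that case.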

## Sentiment Analysis Details

The script extracts multiple sentiment metrics:

- **Polarity Detection**: Overall positive/negative sentiment
- **Aspect-Based Analysis**: Sentiment for specific aspects:
  - Financial Performance (Revenue, Earnings, Margins, Debt, Cash Flow)
  - Management and Leadership
  - Product/Service Performance
  - Future Outlook
  - Legal and Risk
- **Topic-Based Analysis**: Sentiment across key financial topics
- **Emotion Detection**: Dominant emotions in the text
- **Intent Analysis**: Purpose behind the text
- **Subjectivity/Objectivity**: Factual vs. opinion-based content
- **Contextual Analysis**: Sentiment in broader context
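
Aspect scores arrive as a list of single-entry dictionaries and end up as flat CSV columns. The sketch below mirrors the name cleanup that `flatten_sentiment_json` applies; the helper names `aspect_column_name` and `flatten_aspect_scores` are illustrative and not part of the script:

```python
def aspect_column_name(aspect):
    """Convert an aspect label into a CSV-friendly column name."""
    cleaned = (
        aspect.strip()
        .replace(" - ", "_")
        .replace("/", "_")
        .replace(" ", "_")
        .lower()
    )
    return f"aspect_{cleaned}"

def flatten_aspect_scores(aspect_scores):
    """Flatten [{'Aspect': score}, ...] into {'aspect_<name>': float}."""
    flat = {}
    for entry in aspect_scores:
        for aspect, score in entry.items():
            flat[aspect_column_name(aspect)] = float(score)
    return flat
```

For example, `"Financial Performance - Revenue Growth"` becomes the column `aspect_financial_performance_revenue_growth`.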

## Troubleshooting

- **Missing files**: The script will search for alternative paths or prompt for the correct location
- **API failures**: Each row is attempted up to `retry_limit` times (3 by default) with a 3-second pause between attempts; every failure is logged
- **Rate limits**: Automatic waiting when approaching API limits
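
The rate-limit wait is based on a sliding 60-second window of request timestamps. The sketch below isolates that calculation as a pure function so it can be reasoned about without sleeping; `seconds_until_allowed` is a hypothetical name, and the script's `RateLimiter` class additionally enforces a daily cap and a fixed per-request delay:

```python
from collections import deque

def seconds_until_allowed(timestamps, now, max_per_minute, window=60.0):
    """Return how long to wait before the next request is allowed.

    `timestamps` is a deque of past request times (oldest first); entries
    older than `window` seconds are discarded in place.
    """
    while timestamps and now - timestamps[0] > window:
        timestamps.popleft()
    if len(timestamps) < max_per_minute:
        return 0.0
    # The window is full: wait until the oldest request ages out
    return max(0.0, window - (now - timestamps[0]))
```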

## Error Handling

The script implements robust error handling:
- Graceful handling of API failures
- Detailed logging of all operations
- Safe JSON parsing with cleanup for malformed responses
- Multiple matching strategies for finding corresponding MDA texts
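
Model responses often wrap the JSON in Markdown fences or leave trailing commas. The following is a compact sketch of the cleanup idea, not the script's exact code; `parse_model_json` is a hypothetical name:

```python
import json
import re

def parse_model_json(response_text):
    """Extract and parse the outermost {...} object, tolerating trailing commas."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        return None  # no JSON object present
    candidate = response_text[start:end + 1]
    # Drop commas that directly precede a closing brace or bracket
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

One caveat of this regex-based cleanup is that it would also alter a string value that happens to contain `,}` or `,]`; for numeric score payloads that is unlikely to matter.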

In [None]:
import pandas as pd
import json
import time
import os
import logging
import re
from datetime import datetime
from collections import deque
from tqdm import tqdm
from google import genai

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(f'gemini_reprocess_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Gemini API configuration
API_KEY = "your-api-key-here"  # Replace with your actual API key; never commit a real key
MODEL = "gemini-2.0-flash"

# Set up Gemini client
client = genai.Client(api_key=API_KEY)

# Note: the google-genai SDK is configured entirely through the Client above;
# it has no module-level genai.configure(), so no further setup is needed here.

# Constants for API limits and processing
BATCH_SIZE = 5
MAX_REQUESTS_PER_MINUTE = 1900
MAX_REQUESTS_PER_DAY = 1000000000
REQUEST_DELAY = 0.5
SAVE_FREQUENCY = 20
MAX_TEXT_LENGTH = 60000
SAVE_MDA_TEXT = False

# Create directories for outputs
os.makedirs("sentiment_results/mda", exist_ok=True)

# Class to track and enforce rate limits
class RateLimiter:
    def __init__(self, max_per_minute, max_per_day):
        self.max_per_minute = max_per_minute
        self.max_per_day = max_per_day
        self.minute_requests = deque()
        self.daily_requests = 0
        self.start_time = time.time()
    
    def check_and_wait(self):
        """Check if we can make a request, wait if needed, and track the request"""
        current_time = time.time()
        
        # Check daily limit
        if self.daily_requests >= self.max_per_day:
            logger.warning(f"Reached maximum daily request limit of {self.max_per_day}")
            return False
        
        # Clean up minute_requests older than 60 seconds
        while self.minute_requests and current_time - self.minute_requests[0] > 60:
            self.minute_requests.popleft()
        
        # Check if we're at the per-minute limit
        if len(self.minute_requests) >= self.max_per_minute:
            wait_time = 60 - (current_time - self.minute_requests[0])
            if wait_time > 0:
                logger.info(f"Rate limit approaching: Waiting {wait_time:.2f} seconds before next request")
                time.sleep(wait_time)
        
        # Always wait the minimum delay between requests
        time.sleep(REQUEST_DELAY)
        
        # Record this request
        self.minute_requests.append(time.time())
        self.daily_requests += 1
        
        return True

# Function to analyze sentiment via Gemini API with rate limiting
def analyze_sentiment_with_gemini(text):
    """
    Analyze sentiment of a single text using Gemini API
    """
    try:
        # For very long texts, use a simpler prompt to reduce token count
        if len(text) > 30000:
            prompt = f"""
Analyze this financial text for sentiment and provide a JSON response. 
Use a scale from -1 (strongly negative) to 1 (strongly positive) for all scores.

Text: {text}

Even if some aspects or topics are not explicitly mentioned, provide a score of 0 for them.

Format your response in this JSON structure:
{{
  "polarity_detection": score,
  "emotion_detection": score,
  "intent_analysis": score,
  "subjectivity_objectivity": score,
  "fine_grained_sentiment": {{
    "average_score": score
  }},
  "aspect_based_sentiment": {{
    "aspect_scores": [
      {{"Financial Performance - Revenue Growth": score}},
      {{"Financial Performance - Earnings": score}},
      {{"Financial Performance - Profit Margins": score}},
      {{"Financial Performance - Debt Levels": score}},
      {{"Financial Performance - Cash Flow": score}},
      {{"Management and Leadership": score}},
      {{"Product/Service Performance": score}},
      {{"Future Outlook": score}},
      {{"Legal and Risk": score}}
    ],
    "average_score": score
  }},
  "topic_sentiment_analysis": {{
    "topic_scores": [
      {{"Financial Performance": score}},
      {{"Management and Leadership": score}},
      {{"Product/Service Performance": score}},
      {{"Industry and Market Factors": score}},
      {{"Future Outlook": score}},
      {{"Legal and Risk": score}}
    ],
    "average_score": score
  }},
  "contextual_sentiment": {{
    "average_score": score
  }},
  "gemini_analysis": {{
    "overall_sentiment": score,
    "emotional_sentiment": score,
    "contextual_sentiment": score
  }}
}}
"""
        else:
            # Full detailed analysis for shorter texts
            prompt = f"""
Analyze the following text for sentiment using multiple methods. Provide sentiment scores between -1 (strongly negative) and 1 (strongly positive) for each method.

Text: {text}

Methods:

1. Polarity Detection (Basic Sentiment): Provide a single score representing the overall positive, negative, or neutral sentiment.

2. Fine-Grained Sentiment Analysis: Provide individual scores for different sections or perspectives within the text, if applicable, and an average score.

3. Aspect-Based Sentiment Analysis (ABSA): Identify key aspects or attributes in the text and provide sentiment scores for each aspect, and an average score. Use the following consistent aspects:
   * Financial Performance:
       * Revenue Growth
       * Earnings (EPS)
       * Profit Margins
       * Debt Levels
       * Cash Flow
   * Management and Leadership:
       * CEO Performance
       * Management Team
       * Corporate Governance
   * Product/Service Performance:
       * Product Innovation
       * Market Share
       * Customer Satisfaction
       * Product Quality
   * Future Outlook:
       * Growth Potential
       * Expansion Plans
       * Analyst Ratings
       * Investor Confidence
   * Legal and Risk:
       * Litigation
       * Financial Risks
       * Reputation Risks

4. Topic-Sentiment Analysis: Identify key topics or themes in the text and provide sentiment scores for each topic, and an average score. Use the following consistent topics:
   * Financial Performance
   * Management and Leadership
   * Product/Service Performance
   * Industry and Market Factors
   * Future Outlook
   * Legal and Risk

5. Emotion Detection: Provide a score representing the dominant emotion(s) expressed in the text. If multiple emotions are present, provide a weighted average.

6. Intent Analysis: Provide a score representing the overall intent or purpose of the text. If there are multiple intents, provide an average score.

7. Subjectivity/Objectivity Detection: Provide a score representing the overall level of subjectivity or objectivity in the text.

8. Contextual Sentiment Analysis: Provide individual scores for different contextual elements or speakers within the text, and an average score.

9. Deep Learning-Based Approach (Gemini's Analysis): Provide overall, emotional, and contextual sentiment scores using Gemini's advanced analysis.

Instructions for handling missing aspects or topics:
- If an aspect or topic is not mentioned in the text, assign it a score of 0 (neutral)
- Do not skip any predefined aspects or topics in your response
- For aspects or topics with minimal information, base the score on whatever limited information is available

Format your response as a JSON object with the following structure:

{{
  "polarity_detection": score,
  "fine_grained_sentiment": {{
    "individual_scores": [{{"section/perspective": score}}],
    "average_score": score
  }},
  "aspect_based_sentiment": {{
    "aspect_scores": [{{"aspect": score}}],
    "average_score": score
  }},
  "topic_sentiment_analysis": {{
    "topic_scores": [{{"topic": score}}],
    "average_score": score
  }},
  "emotion_detection": score,
  "intent_analysis": score,
  "subjectivity_objectivity": score,
  "contextual_sentiment": {{
    "context_scores": [{{"context_element": score}}],
    "average_score": score
  }},
  "gemini_analysis": {{
    "overall_sentiment": score,
    "emotional_sentiment": score,
    "contextual_sentiment": score
  }}
}}
"""

        # Check rate limits
        if not rate_limiter.check_and_wait():
            logger.warning("Rate limit reached. Skipping request.")
            return None

        # Call the Gemini API (default safety settings) and guard against filtered or empty responses
        try:
            logger.info("Sending request to Gemini API...")
            # Fixed approach: Create a simple prompt structure
            response = client.models.generate_content(
                model=MODEL,
                contents=[{"role": "user", "parts": [{"text": prompt}]}]
            )
            
            # Safety checks and handling
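            # response.text may exist but be None or empty, so test its value, not just its presence
            if not getattr(response, 'text', None):
                if getattr(response, 'prompt_feedback', None):
                    logger.error(f"Content filtered: {response.prompt_feedback}")
                    return None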
                elif hasattr(response, 'candidates') and len(response.candidates) > 0:
                    # Try to extract text from candidates
                    text_parts = []
                    for candidate in response.candidates:
                        if hasattr(candidate, 'content') and hasattr(candidate.content, 'parts'):
                            for part in candidate.content.parts:
                                if hasattr(part, 'text'):
                                    text_parts.append(part.text)
                    if text_parts:
                        return "".join(text_parts)
                    else:
                        logger.error("Could not extract text from response candidates")
                        return None
                else:
                    logger.error("Empty response from Gemini API")
                    logger.error(f"Response structure: {dir(response)}")
                    return None
            
            logger.info(f"Received response from Gemini API: {response.text[:100]}...")
            return response.text
        except Exception as e:
            logger.error(f"Error in Gemini API call: {e}")
            # Print more detailed error info if available
            import traceback
            logger.error(f"Error details: {traceback.format_exc()}")
            return None
    
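    except Exception as e:
        # Import here as well: the inner handler's import may never have run,
        # which would leave `traceback` unbound in this scope.
        import traceback
        logger.error(f"Error preparing sentiment analysis: {e}")
        logger.error(f"Error traceback: {traceback.format_exc()}")
        return None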

def extract_json_from_response(response_text):
    """
    Extract the JSON object from the API response text
    """
    try:
        # Find the start and end of the JSON object
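        start_idx = response_text.find('{')
        end_idx = response_text.rfind('}') + 1  # 0 when no '}' is present
        
        # Require an opening brace and a closing brace that comes after it
        if start_idx != -1 and end_idx > start_idx: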
            json_str = response_text[start_idx:end_idx]
            # Clean up potential issues before parsing
            json_str = re.sub(r',\s*}', '}', json_str)  # Remove trailing commas
            json_str = re.sub(r',\s*]', ']', json_str)  # Remove trailing commas in arrays
            return json.loads(json_str)
        else:
            logger.error("JSON object not found in response")
            return None
    except json.JSONDecodeError as e:
        logger.error(f"Error parsing JSON: {e}")
        logger.error(f"Problematic JSON: {response_text}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error extracting JSON: {e}")
        return None

def flatten_sentiment_json(json_obj):
    """
    Flatten the nested JSON structure into a dictionary with simple key-value pairs
    """
    flat_dict = {}
    
    if not json_obj:
        logger.warning("Empty JSON object passed to flatten_sentiment_json")
        return flat_dict
    
    # Debug: Print the raw JSON
    logger.info(f"Flattening JSON with keys: {list(json_obj.keys())}")
    
    # Add default values for key metrics to ensure they exist even if missing in the JSON
    flat_dict["polarity_detection"] = 0.0
    flat_dict["emotion_detection"] = 0.0
    flat_dict["intent_analysis"] = 0.0
    flat_dict["subjectivity_objectivity"] = 0.0
    flat_dict["gemini_overall_sentiment"] = 0.0
    flat_dict["gemini_emotional_sentiment"] = 0.0
    flat_dict["gemini_contextual_sentiment"] = 0.0
    
    # Extract top-level simple scores
    for key in ['polarity_detection', 'emotion_detection', 'intent_analysis', 'subjectivity_objectivity']:
        if key in json_obj:
            try:
                # Extract the value and convert to float if needed
                value = json_obj[key]
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]  # Extract first element if it's a list
                flat_dict[key] = float(value)
                logger.info(f"Extracted {key}: {flat_dict[key]}")
            except Exception as e:
                logger.error(f"Error extracting {key}: {e}, value was: {json_obj[key]}")
    
    # Extract Gemini Analysis scores
    if 'gemini_analysis' in json_obj:
        for sub_key, value in json_obj['gemini_analysis'].items():
            try:
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]  # Extract first element if it's a list
                flat_dict[f"gemini_{sub_key}"] = float(value)
                logger.info(f"Extracted gemini_{sub_key}: {flat_dict[f'gemini_{sub_key}']}")
            except Exception as e:
                logger.error(f"Error extracting gemini_{sub_key}: {e}, value was: {value}")
    
    # Extract average scores from complex objects
    for key in ['fine_grained_sentiment', 'aspect_based_sentiment', 'topic_sentiment_analysis', 'contextual_sentiment']:
        if key in json_obj and 'average_score' in json_obj[key]:
            try:
                value = json_obj[key]['average_score']
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]  # Extract first element if it's a list
                flat_dict[f"{key}_avg"] = float(value)
                logger.info(f"Extracted {key}_avg: {flat_dict[f'{key}_avg']}")
            except Exception as e:
                logger.error(f"Error extracting {key}_avg: {e}, value was: {json_obj[key]['average_score']}")
    
    # Extract standardized aspect and topic scores
    if 'aspect_based_sentiment' in json_obj and 'aspect_scores' in json_obj['aspect_based_sentiment']:
        aspect_scores = json_obj['aspect_based_sentiment']['aspect_scores']
        if isinstance(aspect_scores, list):
            for score_dict in aspect_scores:
                try:
                    for aspect, score in score_dict.items():
                        # Clean up the aspect name for column naming
                        clean_aspect = aspect.strip().replace(' - ', '_').replace('/', '_').replace(' ', '_').lower()
                        if isinstance(score, list) and len(score) > 0:
                            score = score[0]  # Extract first element if it's a list
                        flat_dict[f"aspect_{clean_aspect}"] = float(score)
                        logger.info(f"Extracted aspect_{clean_aspect}: {flat_dict[f'aspect_{clean_aspect}']}")
                except Exception as e:
                    logger.error(f"Error extracting aspect score: {e}, value was: {score_dict}")
    
    if 'topic_sentiment_analysis' in json_obj and 'topic_scores' in json_obj['topic_sentiment_analysis']:
        topic_scores = json_obj['topic_sentiment_analysis']['topic_scores']
        if isinstance(topic_scores, list):
            for score_dict in topic_scores:
                try:
                    for topic, score in score_dict.items():
                        # Clean up the topic name for column naming
                        clean_topic = topic.strip().replace('/', '_').replace(' ', '_').lower()
                        if isinstance(score, list) and len(score) > 0:
                            score = score[0]  # Extract first element if it's a list
                        flat_dict[f"topic_{clean_topic}"] = float(score)
                        logger.info(f"Extracted topic_{clean_topic}: {flat_dict[f'topic_{clean_topic}']}")
                except Exception as e:
                    logger.error(f"Error extracting topic score: {e}, value was: {score_dict}")
    
    # Store the complete JSON as a string for reference
    try:
        flat_dict['full_sentiment_json'] = json.dumps(json_obj)
    except Exception as e:
        logger.error(f"Error converting JSON to string: {e}")
        flat_dict['full_sentiment_json'] = str(json_obj)
    
    logger.info(f"Flattened JSON into {len(flat_dict)} key-value pairs")
    return flat_dict

def convert_filed_date(date_str):
    """Convert filing date string to date only (YYYY-MM-DD)"""
    try:
        # Handle different date formats
        date_formats = [
            "%Y-%m-%dT%H:%M:%S%z",    # Format with timezone offset
            "%Y-%m-%dT%H:%M:%S-%z",   # Format with timezone offset separated by dash
            "%Y-%m-%dT%H:%M:%S",      # Format without timezone
            "%Y-%m-%d"                # Just date
        ]
        
        for fmt in date_formats:
            try:
                # Convert to datetime then extract only the date part
                dt = pd.to_datetime(date_str, format=fmt)
                return dt.date()  # Return only the date part (YYYY-MM-DD)
            except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
                continue
                
        # If specific formats fail, try pandas default parser
        dt = pd.to_datetime(date_str)
        return dt.date()  # Return only the date part
    except Exception:
        # Return original if all parsing attempts fail
        logger.warning(f"Could not parse date: {date_str}")
        return date_str

# Helper used by the reprocessing function below to normalize dates before matching
def standardize_date_format(date_str):
    """
    Standardize date string to YYYY-MM-DD format for consistent matching
    
    Parameters:
    -----------
    date_str : str
        Date string to standardize
    
    Returns:
    --------
    str
        Standardized date string in YYYY-MM-DD format
    """
    if pd.isna(date_str) or date_str is None or date_str == '':
        return None
    
    try:
        # Try to parse using pandas (handles many formats)
        dt = pd.to_datetime(date_str)
        return dt.strftime('%Y-%m-%d')
    except Exception:
        logger.warning(f"Could not standardize date: {date_str}")
        return date_str

def reprocess_nan_sentiment_rows(input_file, output_file, mda_file_path, retry_limit=3):
    """
    Reprocess rows with NaN sentiment scores from an existing results file
    
    Parameters:
    -----------
    input_file : str
        Path to the input CSV file with sentiment results
    output_file : str
        Path to save the updated results
    mda_file_path : str
        Path to the original MDA file with the text data
    retry_limit : int
        Number of retries for API calls
    """
    try:
        logger.info(f"Loading existing sentiment results file: {input_file}")
        # Load the existing results
        results_df = pd.read_csv(input_file)
        
        # Load the original MDA file to get the text data
        logger.info(f"Loading original MDA file: {mda_file_path}")
        mda_df = pd.read_csv(mda_file_path)
        
        # Standardize date formats in both dataframes for accurate matching
        if 'FiledAt' in results_df.columns:
            logger.info("Standardizing FiledAt dates in results dataframe")
            results_df['FiledAt'] = results_df['FiledAt'].apply(standardize_date_format)
        
        if 'FiledAt' in mda_df.columns:
            logger.info("Standardizing FiledAt dates in MDA dataframe")
            mda_df['FiledAt'] = mda_df['FiledAt'].apply(standardize_date_format)
        
        # Identify sentiment columns (they all have NaN for failed rows)
        sentiment_columns = [col for col in results_df.columns if any(term in col.lower() for term in 
                                                                   ['polarity', 'emotion', 'intent', 'sentiment', 'aspect', 'topic'])]
        
        # If no sentiment columns found, use a default set
        if not sentiment_columns:
            sentiment_columns = [
                'polarity_detection', 'emotion_detection', 'intent_analysis', 
                'subjectivity_objectivity', 'gemini_overall_sentiment'
            ]
        
        logger.info(f"Found {len(sentiment_columns)} sentiment columns")
        
        # Find rows with NaN in any sentiment column
        nan_rows = results_df[results_df[sentiment_columns].isnull().any(axis=1)]
        logger.info(f"Found {len(nan_rows)} rows with NaN sentiment scores out of {len(results_df)} total rows")
        
        if len(nan_rows) == 0:
            logger.info("No rows with NaN sentiment scores found. No reprocessing needed.")
            return results_df
        
        # Create rate limiter for API calls
        global rate_limiter
        rate_limiter = RateLimiter(MAX_REQUESTS_PER_MINUTE, MAX_REQUESTS_PER_DAY)
        
        # Track results
        results = {}
        processed_indices = set()
        
        # Process each row with NaN scores
        for idx in tqdm(nan_rows.index, desc="Reprocessing rows with NaN scores"):
            # Extract row data for matching
            ticker = results_df.loc[idx, 'Ticker'] if 'Ticker' in results_df.columns else None
            cik = results_df.loc[idx, 'CIK'] if 'CIK' in results_df.columns else None
            form_type = results_df.loc[idx, 'FormType'] if 'FormType' in results_df.columns else None
            filed_at = results_df.loc[idx, 'FiledAt'] if 'FiledAt' in results_df.columns else None
            
            # Log what we're trying to match
            logger.info(f"Looking for match: Ticker={ticker}, CIK={cik}, FormType={form_type}, FiledAt={filed_at}")
            
            # Try different matching strategies
            matching_rows = None
            
            # Strategy 1: Try matching on all fields
            if matching_rows is None or len(matching_rows) == 0:
                mask = pd.Series([True] * len(mda_df), index=mda_df.index)
                
                if ticker is not None and 'Ticker' in mda_df.columns:
                    mask = mask & (mda_df['Ticker'] == ticker)
                
                if cik is not None and 'CIK' in mda_df.columns:
                    mask = mask & (mda_df['CIK'] == cik)
                
                if form_type is not None and 'FormType' in mda_df.columns:
                    mask = mask & (mda_df['FormType'] == form_type)
                
                if filed_at is not None and 'FiledAt' in mda_df.columns:
                    mask = mask & (mda_df['FiledAt'] == filed_at)
                
                matching_rows = mda_df[mask]
                logger.info(f"Strategy 1 (all fields): Found {len(matching_rows)} matches")
            
            # Strategy 2: Match on Ticker, CIK, and FormType but not date
            if matching_rows is None or len(matching_rows) == 0:
                mask = pd.Series([True] * len(mda_df), index=mda_df.index)
                
                if ticker is not None and 'Ticker' in mda_df.columns:
                    mask = mask & (mda_df['Ticker'] == ticker)
                
                if cik is not None and 'CIK' in mda_df.columns:
                    mask = mask & (mda_df['CIK'] == cik)
                
                if form_type is not None and 'FormType' in mda_df.columns:
                    mask = mask & (mda_df['FormType'] == form_type)
                
                matching_rows = mda_df[mask]
                logger.info(f"Strategy 2 (no date): Found {len(matching_rows)} matches")
            
            # Strategy 3: Match on just Ticker and CIK
            if matching_rows is None or len(matching_rows) == 0:
                mask = pd.Series([True] * len(mda_df), index=mda_df.index)
                
                if ticker is not None and 'Ticker' in mda_df.columns:
                    mask = mask & (mda_df['Ticker'] == ticker)
                
                if cik is not None and 'CIK' in mda_df.columns:
                    mask = mask & (mda_df['CIK'] == cik)
                
                matching_rows = mda_df[mask]
                logger.info(f"Strategy 3 (Ticker+CIK): Found {len(matching_rows)} matches")
            
            # If we still have no matches, log and continue
            if len(matching_rows) == 0:
                logger.warning(f"Could not find matching row in MDA file for row {idx}")
                processed_indices.add(idx)  # Mark as processed so we don't try again
                continue
            elif len(matching_rows) > 1:
                logger.warning(f"Found multiple matching rows ({len(matching_rows)}) in MDA file for row {idx}, using the first one")
            
            # Get the MDA text from the first matching row
            try:
                mda_text = matching_rows.iloc[0]['MDA_Text']
                
                if not isinstance(mda_text, str) or not mda_text.strip():
                    logger.warning(f"Empty or invalid MDA_Text for row {idx}, skipping")
                    processed_indices.add(idx)
                    continue
                
                logger.info(f"Found matching MDA text for row {idx} (length: {len(mda_text)} chars)")
                
                # Truncate text if too long
                if len(mda_text) > MAX_TEXT_LENGTH:
                    logger.warning(f"Text too long ({len(mda_text)} chars), truncating to {MAX_TEXT_LENGTH} chars")
                    mda_text = mda_text[:MAX_TEXT_LENGTH]
                
                # Retry mechanism for API calls
                success = False
                for attempt in range(retry_limit):
                    try:
                        # Call Gemini API
                        response_text = analyze_sentiment_with_gemini(mda_text)
                        
                        if not response_text:
                            logger.error(f"No response from API for row {idx}, attempt {attempt+1}/{retry_limit}")
                            time.sleep(3)  # Wait before retry
                            continue
                        
                        # Extract JSON from response
                        sentiment_json = extract_json_from_response(response_text)
                        
                        if not sentiment_json:
                            logger.error(f"Failed to extract JSON from response for row {idx}, attempt {attempt+1}/{retry_limit}")
                            time.sleep(3)  # Wait before retry
                            continue
                        
                        # Flatten JSON
                        flat_sentiment = flatten_sentiment_json(sentiment_json)
                        
                        # Store results
                        results[idx] = flat_sentiment
                        processed_indices.add(idx)
                        
                        logger.info(f"Successfully analyzed sentiment for row {idx}")
                        success = True
                        break
                        
                    except Exception as e:
                        logger.error(f"Error on attempt {attempt+1}/{retry_limit} for row {idx}: {e}")
                        time.sleep(3)  # Wait before retry
                
                if not success:
                    logger.error(f"Failed to analyze sentiment for row {idx} after {retry_limit} attempts")
                    processed_indices.add(idx)  # Mark as processed so we don't try again
            
            except Exception as e:
                logger.error(f"Error processing row {idx}: {e}")
                processed_indices.add(idx)  # Mark as processed so we don't try again
                
            # Report progress only when this row just produced a new result, so a
            # run of failed rows does not re-log or re-save the same checkpoint
            if idx in results and len(results) % 5 == 0:
                logger.info(f"Processed {len(results)} rows with NaN scores so far")
                
                # Periodically save intermediate results
                if len(results) % SAVE_FREQUENCY == 0:
                    # Create a temporary copy with updated results
                    temp_df = results_df.copy()
                    for tmp_idx, result in results.items():
                        for key, value in result.items():
                            if key != 'full_sentiment_json':
                                temp_df.at[tmp_idx, key] = value
                    
                    # Save intermediate results (create the output directory if needed)
                    temp_output = output_file.replace('.csv', f'_temp_{len(results)}.csv')
                    os.makedirs(os.path.dirname(temp_output) or '.', exist_ok=True)
                    temp_df.to_csv(temp_output, index=False)
                    logger.info(f"Saved intermediate results to {temp_output}")
        
        # Update the results dataframe with new sentiment scores
        if results:
            logger.info(f"Updating dataframe with {len(results)} new results")
            for idx, result in results.items():
                for key, value in result.items():
                    if key != 'full_sentiment_json':  # Exclude the full JSON from the main dataframe
                        results_df.at[idx, key] = value
        else:
            logger.warning("No new results to add to dataframe!")
        
        # Save the updated results
        output_dir = os.path.dirname(output_file) or '.'  # '' means the current directory
        os.makedirs(output_dir, exist_ok=True)
        results_df.to_csv(output_file, index=False)
        logger.info(f"Saved updated results to {output_file}")
        
        # Save full results with JSON
        if results:
            base_name = os.path.splitext(os.path.basename(output_file))[0]
            full_output_file = os.path.join(output_dir, f"{base_name}_full_results.json")
            full_json = {
                str(idx): result['full_sentiment_json']
                for idx, result in results.items()
                if 'full_sentiment_json' in result
            }
            with open(full_output_file, 'w', encoding='utf-8') as f:
                json.dump(full_json, f)
            logger.info(f"Saved full JSON results to {full_output_file}")
        
        return results_df
        
    except Exception as e:
        # logger.exception logs at ERROR level and appends the full traceback
        logger.exception(f"Error reprocessing NaN rows: {e}")
        return None

def test_gemini_api():
    """Test the Gemini API with a small sample text"""
    logger.info("Testing Gemini API with a small sample...")
    sample_text = "The company reported strong revenue growth and exceeded analyst expectations for the quarter. However, challenges in the supply chain have impacted margins."
    
    try:
        # Simple prompt for testing
        prompt = "Analyze this text for sentiment and respond with a single word: positive, negative, or neutral: " + sample_text
        
        response = client.models.generate_content(
            model=MODEL,
            contents=prompt
        )
        
        logger.info(f"API Test Response: {response.text}")
        logger.info("Gemini API test successful!")
        return True
    except Exception as e:
        logger.error(f"Gemini API test failed: {e}")
        return False

def find_and_fix_mda_file_paths(year):
    """
    Attempt to find the correct MDA file path even if the default path doesn't exist
    
    Parameters:
    -----------
    year : str
        Year to process
    
    Returns:
    --------
    str
        Path to MDA file if found, None otherwise
    """
    # Default path
    default_path = f"data/mda_data/mda_sections_{year}.csv"
    
    # Check if default path exists
    if os.path.exists(default_path):
        return default_path
    
    # Try alternative paths
    alternatives = [
        f"mda_data/mda_sections_{year}.csv",
        f"data/mda_sections_{year}.csv",
        f"mda_sections_{year}.csv"
    ]
    
    # Check parent directories
    current_dir = os.getcwd()
    parent_dir = os.path.dirname(current_dir)
    
    for alt_path in alternatives:
        # Check in current directory
        if os.path.exists(alt_path):
            return alt_path
        
        # Check in parent directory
        parent_path = os.path.join(parent_dir, alt_path)
        if os.path.exists(parent_path):
            return parent_path
    
    # If we've made it here, try to find any file that might match
    possible_paths = []
    
    # Search patterns
    patterns = [
        f"**/mda_*{year}*.csv",
        f"**/mda*{year}*.csv",
        f"**/*mda*{year}*.csv"
    ]
    
    for pattern in patterns:
        for match in glob.glob(pattern, recursive=True):
            # The patterns overlap, so skip paths that were already found
            if match not in possible_paths:
                possible_paths.append(match)
    
    if possible_paths:
        logger.info(f"Found possible MDA files: {possible_paths}")
        return possible_paths[0]  # Return the first match
    
    return None

def check_mda_text_column(mda_df):
    """
    Check if MDA_Text column exists in the dataframe, try to find alternatives if not
    
    Parameters:
    -----------
    mda_df : pandas.DataFrame
        MDA dataframe to check
    
    Returns:
    --------
    pandas.DataFrame
        Modified dataframe with MDA_Text column
    """
    if 'MDA_Text' in mda_df.columns:
        return mda_df
    
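    # Try to find an alternative text column; require string (object) dtype so a
    # numeric column that merely has "mda" or "text" in its name is skipped
    text_columns = [
        col for col in mda_df.columns
        if mda_df[col].dtype == 'object'
        and ('text' in col.lower() or 'mda' in col.lower())
    ]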
    
    if text_columns:
        logger.info(f"Found potential text columns: {text_columns}")
        # Use the first as MDA_Text
        mda_df['MDA_Text'] = mda_df[text_columns[0]]
        return mda_df
    
    # If no obvious text column, look for the column with the longest strings
    max_len = 0
    max_col = None
    
    for col in mda_df.columns:
        if mda_df[col].dtype == 'object':  # Only check string columns
            sample = mda_df[col].dropna().head(10)
            if len(sample) > 0:
                avg_len = sample.astype(str).apply(len).mean()
                if avg_len > max_len:
                    max_len = avg_len
                    max_col = col
    
    if max_col and max_len > 100:  # Only use if strings are reasonably long
        logger.info(f"Using column '{max_col}' as MDA_Text (avg length: {max_len:.0f} chars)")
        mda_df['MDA_Text'] = mda_df[max_col]
    
    return mda_df

def main():
    # First test if the API is working
    if not test_gemini_api():
        logger.error("Gemini API test failed. Please check your API key and connection.")
        return
    
    print("\n=== REPROCESSING ROWS WITH MISSING SENTIMENT SCORES ===")
    
    # Ask user for years to process
    years_input = input("Which years would you like to process? (Enter comma-separated values, e.g., '2023,2024' or 'all' for all years): ")
    
    if years_input.strip().lower() == 'all':
        years = ['2023', '2024']  # Default years
    else:
        # Drop empty entries left by stray or trailing commas
        years = [year.strip() for year in years_input.split(',') if year.strip()]
    
    # Process each year
    for year in years:
        # Define file paths
        sentiment_file = f"sentiment_results/mda/mda_sentiment_{year}_results.csv"
        output_file = f"sentiment_results/mda/mda_sentiment_{year}_results_v2.csv"
        
        print(f"\n=== PROCESSING YEAR: {year} ===")
        print(f"Input file: {sentiment_file}")
        
        # Check if sentiment file exists
        if not os.path.exists(sentiment_file):
            print(f"Error: Sentiment file not found: {sentiment_file}")
            # Try alternative paths
            alternatives = [
                f"mda_sentiment_{year}_results.csv",
                f"sentiment_results_mda_{year}.csv",
                f"*sentiment*{year}*.csv"
            ]
            
            found = False
            for alt_pattern in alternatives:
                matches = glob.glob(alt_pattern)
                if matches:
                    sentiment_file = matches[0]
                    print(f"Found alternative sentiment file: {sentiment_file}")
                    found = True
                    break
            
            if not found:
                print("Could not find sentiment file. Skipping year.")
                continue
        
        # Try to find MDA file
        mda_file = find_and_fix_mda_file_paths(year)
        
        if not mda_file:
            print(f"Error: Could not find MDA file for year {year}")
            # Ask user to provide path
            user_path = input(f"Please provide the path to the MDA file for {year} (or press Enter to skip): ")
            if user_path.strip():
                mda_file = user_path
            else:
                print(f"Skipping year {year}")
                continue
        
        print(f"MDA file: {mda_file}")
        print(f"Output file: {output_file}")
        
        # Ask user to confirm
        confirm = input(f"Process {year}? (y/n): ")
        if confirm.lower() != 'y':
            print(f"Skipping year {year}")
            continue
        
        # Load MDA file to check if it has the right column
        try:
            test_mda_df = pd.read_csv(mda_file)
            test_mda_df = check_mda_text_column(test_mda_df)
            if 'MDA_Text' not in test_mda_df.columns:
                print(f"Warning: Could not find or create MDA_Text column in {mda_file}")
                if input("Continue anyway? (y/n): ").lower() != 'y':
                    print(f"Skipping year {year}")
                    continue
        except Exception as e:
            print(f"Error loading MDA file: {e}")
            if input("Continue anyway? (y/n): ").lower() != 'y':
                print(f"Skipping year {year}")
                continue
        
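        # Reprocess missing rows
        updated_df = reprocess_nan_sentiment_rows(sentiment_file, output_file, mda_file)
        
        # A None return means reprocessing did not finish, so do not report success
        if updated_df is None:
            print(f"Reprocessing did not complete for year {year}; see the log for details")
            continue
        
        print(f"Completed reprocessing for year {year}")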
        
        # Clean up temporary files
        temp_files = [f for f in os.listdir(os.path.dirname(output_file)) if f.startswith(f"mda_sentiment_{year}_results_v2_temp_")]
        if temp_files and input("Remove temporary files? (y/n): ").lower() == 'y':
            for temp_file in temp_files:
                temp_path = os.path.join(os.path.dirname(output_file), temp_file)
                try:
                    os.remove(temp_path)
                    print(f"Removed {temp_file}")
                except Exception as e:
                    print(f"Could not remove {temp_file}: {e}")

if __name__ == "__main__":
    main()

2025-04-13 13:57:02,127 - INFO - Testing Gemini API with a small sample...
2025-04-13 13:57:02,127 - INFO - AFC is enabled with max remote calls: 10.
2025-04-13 13:57:02,508 - INFO - AFC remote call 1 is done.
2025-04-13 13:57:02,508 - INFO - API Test Response: Positive

2025-04-13 13:57:02,509 - INFO - Gemini API test successful!



=== REPROCESSING ROWS WITH MISSING SENTIMENT SCORES ===


Which years would you like to process? (Enter comma-separated values, e.g., '2023,2024' or 'all' for all years):  all



=== PROCESSING YEAR: 2023 ===
Input file: sentiment_results/mda/mda_sentiment_2023_results.csv
MDA file: data/mda_data/mda_sections_2023.csv
Output file: sentiment_results/mda/mda_sentiment_2023_results_v2.csv


Process 2023? (y/n):  y


2025-04-13 13:57:09,972 - INFO - Loading existing sentiment results file: sentiment_results/mda/mda_sentiment_2023_results.csv
2025-04-13 13:57:09,977 - INFO - Loading original MDA file: data/mda_data/mda_sections_2023.csv
2025-04-13 13:57:10,840 - INFO - Standardizing FiledAt dates in results dataframe
2025-04-13 13:57:11,142 - INFO - Standardizing FiledAt dates in MDA dataframe
2025-04-13 13:57:11,583 - INFO - Found 25 sentiment columns
2025-04-13 13:57:11,586 - INFO - Found 176 rows with NaN sentiment scores out of 1943 total rows
Reprocessing rows with NaN scores:   0%|          | 0/176 [00:00<?, ?it/s]
2025-04-13 13:57:11,590 - INFO - Looking for match: Ticker=AOS, CIK=91142, FormType=10-Q, FiledAt=2023-07-28
2025-04-13 13:57:11,591 - INFO - Strategy 1 (all fields): Found 1 matches
2025-04-13 13:57:11,592 - INFO - Found matching MDA text for row 5 (length: 29493 chars)
2025-04-13 13:57:12,093 - INFO - Sending request to Gemini API...
2025-04-13 13:57:12,094 - INFO - AFC is enabled 

Completed reprocessing for year 2023


Remove temporary files? (y/n):  y


Removed mda_sentiment_2023_results_v2_temp_80.csv
Removed mda_sentiment_2023_results_v2_temp_100.csv
Removed mda_sentiment_2023_results_v2_temp_140.csv
Removed mda_sentiment_2023_results_v2_temp_40.csv
Removed mda_sentiment_2023_results_v2_temp_60.csv
Removed mda_sentiment_2023_results_v2_temp_120.csv
Removed mda_sentiment_2023_results_v2_temp_20.csv
Removed mda_sentiment_2023_results_v2_temp_160.csv

=== PROCESSING YEAR: 2024 ===
Input file: sentiment_results/mda/mda_sentiment_2024_results.csv
MDA file: data/mda_data/mda_sections_2024.csv
Output file: sentiment_results/mda/mda_sentiment_2024_results_v2.csv


Process 2024? (y/n):  y


2025-04-13 14:10:44,318 - INFO - Loading existing sentiment results file: sentiment_results/mda/mda_sentiment_2024_results.csv
2025-04-13 14:10:44,325 - INFO - Loading original MDA file: data/mda_data/mda_sections_2024.csv
2025-04-13 14:10:45,139 - INFO - Standardizing FiledAt dates in results dataframe
2025-04-13 14:10:45,440 - INFO - Standardizing FiledAt dates in MDA dataframe
2025-04-13 14:10:45,883 - INFO - Found 26 sentiment columns
2025-04-13 14:10:45,885 - INFO - Found 1958 rows with NaN sentiment scores out of 1959 total rows
Reprocessing rows with NaN scores:   0%|          | 0/1958 [00:00<?, ?it/s]
2025-04-13 14:10:45,887 - INFO - Looking for match: Ticker=MMM, CIK=66740, FormType=10-Q, FiledAt=2024-10-22
2025-04-13 14:10:45,889 - INFO - Strategy 1 (all fields): Found 1 matches
2025-04-13 14:10:45,889 - INFO - Found matching MDA text for row 0 (length: 71554 chars)
2025-04-13 14:10:46,390 - INFO - Sending request to Gemini API...
2025-04-13 14:10:46,391 - INFO - AFC is enable