# Reddit Sentiment Analysis Tool Documentation

## Overview

This tool analyzes sentiment in Reddit posts from r/stocks and other financial subreddits using Google's Gemini AI. It processes CSV files containing Reddit data, extracts sentiment scores for different aspects of financial discussions, and outputs the results in a standardized CSV format for further analysis.

## Features

- **Batch Processing**: Efficiently processes large volumes of Reddit posts in configurable batches
- **Multi-dimensional Sentiment Analysis**: Extracts sentiment across multiple dimensions:
  - Basic polarity (positive/negative/neutral)
  - Fine-grained sentiment
  - Aspect-based sentiment analysis for financial topics
  - Emotion detection
  - Intent analysis
  - Subjectivity/objectivity detection
- **Financial Focus**: Specifically designed for financial discussions, tracking sentiment related to:
  - Company performance
  - Stock analysis
  - Investment strategies
  - Market trends
  - Company news
  - Risk assessment
- **Progress Tracking**: Maintains progress state to resume processing if interrupted
- **Rate Limiting**: Includes built-in rate limiting to respect API constraints
- **Robust Error Handling**: Implements retry mechanisms and comprehensive logging

## Setup

### Prerequisites

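- Python 3.9 or later (check the current minimum supported by the `google-genai` package)
- Required Python packages:
  - pandas
  - google-genai (Gemini API client; provides `from google import genai`, which the script imports)
  - tqdm
  - requests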

### Installation

1. Clone or download the script to your local machine
2. Install required dependencies:
   ```
   pip install pandas google-genai tqdm requests
   ```
3. Set up your directory structure:
   ```
   data/reddit_data/        # Location for Reddit CSV files
   sentiment_results/reddit/ # Output directory for results
   progress/reddit/          # Directory for progress tracking
   log/sentiment/reddit/     # Log files
   ```
4. Place your Reddit CSV files in the `data/reddit_data/` directory
   - Files should be named `reddit_2023.csv` and `reddit_2024.csv`
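
The directory layout from step 3 can also be created from Python. The following is a minimal sketch, assuming the four paths listed above; the `create_project_dirs` helper and its `base` parameter are hypothetical names used only for illustration.

```python
import os

# Directories listed in the setup steps above
PROJECT_DIRS = [
    "data/reddit_data",
    "sentiment_results/reddit",
    "progress/reddit",
    "log/sentiment/reddit",
]

def create_project_dirs(base="."):
    """Create each expected directory under `base`; existing ones are left untouched."""
    created = []
    for rel in PROJECT_DIRS:
        path = os.path.join(base, rel)
        os.makedirs(path, exist_ok=True)  # No error if the directory already exists
        created.append(path)
    return created
```

Calling `create_project_dirs()` from the directory that contains the script would create all four folders relative to it.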

### API Setup

1. Obtain a Google Gemini API key from Google AI Studio
2. Replace the placeholder API key in the script with your actual key, and keep the real key out of version control:
   ```python
   API_KEY = "YOUR_API_KEY_HERE"
   ```

## Input Data Format

The script expects Reddit data in CSV files with the following columns:

- `subreddit`: The subreddit the post was made in (e.g., "/r/stocks")
- `username`: Reddit username (e.g., "/u/username")
- `date`: Post date in format "Day Month DD YYYY HH:MM:SS GMT-XXXX"
- `title`: Post title
- `content`: Post content (may be "[removed]" or "[deleted]")
- `reddit_link`: URL to the original Reddit post
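
Before a long run, it may help to confirm that a file has the expected header. The sketch below uses only the standard library and assumes the six column names listed above; `missing_columns` is a hypothetical helper, not part of the script.

```python
import csv
import io

REQUIRED_COLUMNS = ["subreddit", "username", "date", "title", "content", "reddit_link"]

def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # Empty list if the file has no rows at all
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]

# Example with an in-memory file that lacks the `content` column
sample = io.StringIO("subreddit,username,date,title,reddit_link\n")
print(missing_columns(sample))  # → ['content']
```

For a file on disk, pass an open handle, for example `open("data/reddit_data/reddit_2023.csv", newline="", encoding="utf-8")`.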

## Usage

1. Navigate to the directory containing the script
2. Run the script:
   ```
   python reddit_sentiment_analysis.py
   ```
3. When prompted, specify:
   - Batch size (default is 10)
   - Whether to force reprocessing of previously processed posts
   - Number of rows to process per file (or 'all' for complete processing)

## Configuration Options

The script includes various configuration parameters:

- `MAX_REQUESTS_PER_MINUTE`: Maximum API requests per minute (default: 1000)
- `MAX_REQUESTS_PER_DAY`: Maximum API requests per day (default: 1000000000)
- `REQUEST_DELAY`: Delay between requests in seconds (default: 1)
- `MAX_TEXT_LENGTH`: Maximum length of text to analyze (default: 60000)
- `SAVE_FREQUENCY`: How often to save progress (default: after every 5 posts)
- `YEARS_TO_PROCESS`: Years of Reddit data to process (default: [2023, 2024])
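
The per-minute cap and the fixed delay interact: because the rate limiter sleeps for `REQUEST_DELAY` seconds before every request, throughput can never exceed `60 / REQUEST_DELAY` requests per minute, whatever the cap says. The sketch below works through that arithmetic; `effective_requests_per_minute` is a hypothetical helper, and it ignores API latency, which lowers real throughput further.

```python
def effective_requests_per_minute(max_per_minute, request_delay):
    """Upper bound on requests per minute given a cap and a fixed per-request sleep."""
    if request_delay <= 0:
        return max_per_minute  # No delay: only the cap applies
    return min(max_per_minute, 60 / request_delay)

print(effective_requests_per_minute(1000, 1))    # → 60.0
print(effective_requests_per_minute(1000, 0.5))  # → 120.0
```

With a one-second delay, a cap of 1000 requests per minute is never reached, so the delay is the setting to tune when runs are too slow.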

## How It Works

### Processing Flow

1. **Loading Data**: The script loads Reddit posts from CSV files
2. **Batching**: Posts are processed in batches for efficiency
3. **Content Preparation**: For each post:
   - If content is available, combines title and content
   - If content is missing, "[removed]", or "[deleted]", only uses the title
4. **Sentiment Analysis**: Sends the prepared text to Gemini API with a specialized prompt
5. **Result Processing**: Extracts and standardizes sentiment scores from API response
6. **Output Generation**: Saves results to CSV files with standardized format
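
The content preparation step can be sketched as a small function. This is an illustrative version assuming the `Title:` / `Content:` layout and the `MAX_TEXT_LENGTH` default described in this document; `prepare_text` is a hypothetical name.

```python
MAX_TEXT_LENGTH = 60000  # Same default as in the configuration section

def prepare_text(title, content):
    """Return the text to analyze, or None when the post has no usable title."""
    if not isinstance(title, str) or not title:
        return None  # Missing CSV cells load as NaN (a float), not as a string
    if not isinstance(content, str) or content in ("", "[removed]", "[deleted]"):
        text = f"Title: {title}"
    else:
        text = f"Title: {title}\n\nContent: {content}"
    return text[:MAX_TEXT_LENGTH]  # Truncate overly long posts

print(prepare_text("Is AAPL a good buy now?", "[removed]"))  # → Title: Is AAPL a good buy now?
```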

### Key Functions

- `clean_reddit_date()`: Converts Reddit date format to YYYY-MM-DD
- `analyze_sentiment_with_gemini()`: Sends Reddit post to Gemini API
- `flatten_sentiment_json()`: Processes API response into standardized scores
- `analyze_reddit_batch()`: Processes a batch of Reddit posts
- `process_reddit_file()`: Processes a complete Reddit CSV file
- `process_all_reddit_files()`: Coordinates processing of all available files
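
As an illustration of the date conversion performed by `clean_reddit_date()`, here is a standalone sketch of the same idea. The name `to_iso_date` is hypothetical, and the sketch assumes English day and month abbreviations (the default C locale). The timezone offset is ignored, so the result is the date exactly as written in the input.

```python
import re
from datetime import datetime

def to_iso_date(date_str):
    """Extract YYYY-MM-DD from a string like 'Sun Jan 01 2023 22:03:05 GMT-0600'."""
    match = re.search(r"\w{3} \w{3} \d{2} \d{4}", date_str)
    if not match:
        return None
    try:
        parsed = datetime.strptime(match.group(0), "%a %b %d %Y")
    except ValueError:
        return None  # Matched the shape but not a real day or month name
    return parsed.strftime("%Y-%m-%d")

print(to_iso_date("Sun Jan 01 2023 22:03:05 GMT-0600"))  # → 2023-01-01
print(to_iso_date("not a date"))                          # → None
```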

## Output Format

Results are saved in CSV files (`reddit_sentiment_YYYY.csv`) with the following columns:

- **Source Information**:
  - `subreddit`: Original subreddit
  - `username`: Reddit username
  - `date`: Standardized date (YYYY-MM-DD)
  - `title`: Post title
  - `reddit_link`: Link to original post

- **General Sentiment Metrics**:
  - `polarity_detection`: Overall sentiment (-1 to 1)
  - `emotion_detection`: Emotional charge of post (-1 to 1)
  - `intent_analysis`: Purpose or intent behind post (-1 to 1)
  - `subjectivity_objectivity`: How subjective vs. objective the post is (-1 to 1)

- **Aggregated Sentiment Scores**:
  - `fine_grained_sentiment_avg`: Average detailed sentiment
  - `aspect_based_sentiment_avg`: Average sentiment across aspects
  - `topic_sentiment_analysis_avg`: Average sentiment across topics
  - `contextual_sentiment_avg`: Average contextual sentiment

- **Gemini Analysis Scores**:
  - `gemini_overall_sentiment`: Overall sentiment from Gemini
  - `gemini_emotional_sentiment`: Emotional sentiment from Gemini
  - `gemini_contextual_sentiment`: Contextual sentiment from Gemini

- **Aspect-Specific Sentiment**:
  - `aspect_company_performance`: Sentiment toward company performance
  - `aspect_stock_analysis`: Sentiment toward stock analysis
  - `aspect_investment_strategy`: Sentiment toward investment strategies
  - `aspect_market_trends`: Sentiment toward market trends
  - `aspect_company_news`: Sentiment toward company news
  - `aspect_risk_assessment`: Sentiment toward risk assessment

- **Topic-Specific Sentiment**:
  - `topic_company_performance`: Similar to above but for topics
  - `topic_stock_analysis`
  - `topic_investment_strategy`
  - `topic_market_trends`
  - `topic_company_news`
  - `topic_risk_assessment`
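
Once a results file exists, the scores can be aggregated for further analysis. The sketch below averages `polarity_detection` per `date` using only the standard library; `mean_polarity_by_date` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def mean_polarity_by_date(csv_file):
    """Average the `polarity_detection` column for each `date` value."""
    totals = defaultdict(lambda: [0.0, 0])  # date -> [sum of scores, row count]
    for row in csv.DictReader(csv_file):
        try:
            score = float(row["polarity_detection"])
        except (KeyError, TypeError, ValueError):
            continue  # Skip rows with a missing or non-numeric score
        bucket = totals[row.get("date", "")]
        bucket[0] += score
        bucket[1] += 1
    return {date: total / count for date, (total, count) in totals.items()}

# Illustrative data only, not real output
sample = io.StringIO(
    "date,title,polarity_detection\n"
    "2023-01-15,Post A,0.5\n"
    "2023-01-15,Post B,-0.25\n"
    "2023-01-16,Post C,1.0\n"
)
print(mean_polarity_by_date(sample))  # → {'2023-01-15': 0.125, '2023-01-16': 1.0}
```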

## Logging

The script creates detailed log files in the `log/sentiment/reddit/` directory, with timestamps and information about:
- API requests
- Processing progress
- Errors and exceptions
- Completion status

## Troubleshooting

### Common Issues

1. **API Key Invalid**: 
   - Error: "Error in Gemini API call"
   - Solution: Verify your API key is correct and active

2. **Rate Limiting**: 
   - Error: "Rate limit approaching" or "Reached maximum daily request limit"
   - Solution: Adjust rate limiting parameters or wait until limits reset

3. **JSON Parsing Errors**: 
   - Error: "Failed to extract JSON from response"
   - Solution: Check the API response format or adjust the regex pattern

4. **File Not Found**: 
   - Error: "No Reddit files found"
   - Solution: Ensure CSV files are in the correct location with proper naming

5. **Memory Issues**: 
   - Error: Process terminates with memory error
   - Solution: Process data in smaller batches by reducing batch size
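
For the JSON parsing errors in item 3, a common cause is a reply that wraps the JSON in a Markdown code fence or adds a trailing comma. The sketch below shows one tolerant extraction approach in the spirit of the script's regex; `extract_json` is a hypothetical name, and the trailing-comma cleanup is a heuristic that could alter string values containing `,}` or `,]`.

```python
import json
import re

def extract_json(response_text):
    """Parse the span from the first '{' to the last '}' as JSON, or return None."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if not match:
        return None
    # Heuristic cleanup: drop commas that directly precede a closing brace or bracket
    candidate = re.sub(r",\s*([}\]])", r"\1", match.group(0))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

fence = "`" * 3  # Markdown code fence, built here to keep this example readable
reply = f'Here is the analysis:\n{fence}json\n{{"polarity_detection": 0.3,}}\n{fence}'
print(extract_json(reply))  # → {'polarity_detection': 0.3}
```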

## Best Practices

1. Start with a small test set (e.g., 100 posts) to verify setup before full processing
2. Monitor logs regularly for errors or rate limit warnings
3. For large datasets, consider processing overnight or on a dedicated machine
4. Back up progress files regularly to avoid losing progress
5. Prioritize post titles for sentiment analysis as they often contain the core sentiment
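
For item 4, backing up progress files can be as simple as copying the progress directory to a timestamped folder. This is a minimal sketch; `backup_progress` and the `progress_backups` destination are hypothetical, and only the `progress/reddit` path comes from this document.

```python
import os
import shutil
from datetime import datetime

def backup_progress(progress_dir="progress/reddit", backup_root="progress_backups"):
    """Copy the progress directory to a new timestamped folder and return its path."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = os.path.join(backup_root, f"reddit_{stamp}")
    shutil.copytree(progress_dir, target)  # Raises if the target folder already exists
    return target
```

Running `backup_progress()` before a long job leaves a copy that can be restored if a progress file becomes corrupted.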

## Examples

### Example Command Line Session

```
=== REDDIT SENTIMENT ANALYSIS ===
Enter batch size (default 10): 20
Force reprocessing of all rows, even if previously processed? (y/n): n
How many rows per file? (Enter a number or 'all' for all rows): 1000
```

### Example Output CSV Structure

```
subreddit,username,date,title,reddit_link,polarity_detection,...
/r/stocks,/u/user123,2023-01-15,Is AAPL a good buy now?,https://www.reddit.com/...,0.3,...
```

## Limitations

- Posts with removed or deleted content are analyzed from their titles only
- Posts longer than `MAX_TEXT_LENGTH` characters (title and content combined) are truncated before being sent to the API
- Reddit-specific language, memes, and sarcasm may affect sentiment accuracy
- API costs may be significant for large datasets

## Support

For issues or questions about this tool, please:
1. Check the logs for specific error messages
2. Review the troubleshooting section
3. Consult the Google Gemini API documentation for API-specific issues

In [None]:
import pandas as pd
import json
import time
import os
import logging
import re
import glob
from datetime import datetime
from collections import deque
from tqdm import tqdm
from google import genai

# Set up logging
log_dir = "log/sentiment/reddit"
os.makedirs(log_dir, exist_ok=True)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(f'{log_dir}/gemini_reddit_sentiment_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Gemini API configuration
API_KEY = os.environ.get("GEMINI_API_KEY", "YOUR_API_KEY_HERE")  # Read the key from the environment; never hard-code a real key
MODEL = "gemini-2.0-flash"

# Set up Gemini client
client = genai.Client(api_key=API_KEY)

# Constants for API limits and processing
MAX_REQUESTS_PER_MINUTE = 1000  # Per-minute request cap enforced by RateLimiter
MAX_REQUESTS_PER_DAY = 1000000000  # Daily request cap enforced by RateLimiter
REQUEST_DELAY = 1  # Minimum delay between requests, in seconds
SAVE_FREQUENCY = 5  # How often to save progress, in posts
MAX_TEXT_LENGTH = 60000  # Maximum length for text inputs
YEARS_TO_PROCESS = [2023, 2024]  # Years to process

# Create directories for outputs and progress tracking
os.makedirs("sentiment_results/reddit", exist_ok=True)
os.makedirs("progress/reddit", exist_ok=True)

# Class to track and enforce rate limits
class RateLimiter:
    def __init__(self, max_per_minute, max_per_day):
        self.max_per_minute = max_per_minute
        self.max_per_day = max_per_day
        self.minute_requests = deque()
        self.daily_requests = 0
        self.start_time = time.time()
    
    def check_and_wait(self):
        """Check if we can make a request, wait if needed, and track the request"""
        current_time = time.time()
        
        # Check daily limit
        if self.daily_requests >= self.max_per_day:
            logger.warning(f"Reached maximum daily request limit of {self.max_per_day}")
            return False
        
        # Clean up minute_requests older than 60 seconds
        while self.minute_requests and current_time - self.minute_requests[0] > 60:
            self.minute_requests.popleft()
        
        # Check if we're at the per-minute limit
        if len(self.minute_requests) >= self.max_per_minute:
            wait_time = 60 - (current_time - self.minute_requests[0])
            if wait_time > 0:
                logger.info(f"Rate limit approaching: Waiting {wait_time:.2f} seconds before next request")
                time.sleep(wait_time)
        
        # Always wait the minimum delay between requests
        time.sleep(REQUEST_DELAY)
        
        # Record this request
        self.minute_requests.append(time.time())
        self.daily_requests += 1
        
        return True

# Create rate limiter instance
rate_limiter = RateLimiter(MAX_REQUESTS_PER_MINUTE, MAX_REQUESTS_PER_DAY)

# Helper function to check if a row has already been processed
def get_processed_indices(progress_file):
    """Load the set of indices that have already been processed"""
    if os.path.exists(progress_file):
        with open(progress_file, 'r') as f:
            return set(json.load(f))
    return set()

# Helper function to save processed indices
def save_processed_indices(progress_file, processed_indices):
    """Save the set of indices that have been processed"""
    with open(progress_file, 'w') as f:
        json.dump(list(processed_indices), f)

# Helper function to clean and standardize Reddit dates
def clean_reddit_date(date_str):
    """Extract YYYY-MM-DD from Reddit date string"""
    try:
        # Parse the date string to datetime
        # Format example: "Sun Jan 01 2023 22:03:05 GMT-0600"
        match = re.search(r'(\w{3} \w{3} \d{2} \d{4})', date_str)
        if match:
            date_part = match.group(1)
            dt = datetime.strptime(date_part, '%a %b %d %Y')
            return dt.strftime('%Y-%m-%d')
        else:
            logger.warning(f"Could not parse date format: {date_str}")
            return None
    except Exception as e:
        logger.warning(f"Error parsing date: {date_str}, Error: {e}")
        return None

# Function to analyze sentiment via Gemini API with rate limiting
def analyze_sentiment_with_gemini(text):
    """
    Analyze sentiment of a single Reddit post using Gemini API
    """
    # Using the Reddit-specific prompt
    prompt = f"""
Analyze the following Reddit post from the r/stocks subreddit for sentiment using multiple methods. Provide sentiment scores between -1 (strongly negative) and 1 (strongly positive) for each method.

Reddit Post: {text}

Instructions:

1.  Focus on the financial sentiment expressed in the post, particularly as it relates to stock trading, investments, and company performance.
2.  Consider the post's potential impact on stock prices or investor sentiment.
3.  Pay attention to any specific tickers, companies, or financial terms mentioned.
4.  Be aware of the informal language, slang, and potential sarcasm or irony used in the post.
5.  If the user is providing their own "DD" or "Due Diligence" then take that into account.
6.  If the post is about a "Meme" stock, then take that into account.

Methods:

1.  Polarity Detection (Basic Sentiment): Provide a single score representing the overall positive, negative, or neutral sentiment of the Reddit post in relation to the company or stock.
2.  Fine-Grained Sentiment Analysis: Provide individual scores for different sections or opinions within the post, if applicable, and an average score.
3.  Aspect-Based Sentiment Analysis (ABSA): Identify the key financial aspects or attributes discussed in the post and provide sentiment scores for each identified aspect, and an average score. If possible, prioritize the following aspects:
    * Company Performance: Earnings, Revenue, Growth
    * Stock Analysis: Price Targets, Valuation, Technicals
    * Investment Strategy: Long/Short Positions, Options, Trading Plans
    * Market Trends: Industry News, Economic Factors
    * Company News: Announcements, Events, Rumors
    * Risk Assessment: Potential Losses, Volatility, Legal Issues
4.  Topic-Sentiment Analysis: Identify the key financial topics or themes discussed in the post and provide sentiment scores for each identified topic, and an average score. If possible, prioritize the following topics:
    * Company Performance
    * Stock Analysis
    * Investment Strategy
    * Market Trends
    * Company News
    * Risk Assessment
5.  Emotion Detection: Provide a score representing the dominant emotion(s) expressed in the Reddit post. If multiple emotions are present, provide a weighted average.
6.  Intent Analysis: Provide a score representing the overall intent or purpose of the Reddit post (e.g., sharing information, expressing an opinion, asking a question, promoting a stock). If there are multiple intents, provide an average score.
7.  Subjectivity/Objectivity Detection: Provide a score representing the overall level of subjectivity or objectivity in the Reddit post.
8.  Contextual Sentiment Analysis: Provide individual scores for different contextual elements or speakers within the post, and an average score.
9.  Deep Learning-Based Approach (Gemini's Analysis): Provide overall, emotional, and contextual sentiment scores using Gemini's advanced analysis.

Format your response as a JSON object with the following structure:

{{
  "polarity_detection": score,
  "fine_grained_sentiment": {{
    "individual_scores": [{{"section/opinion": score}}],
    "average_score": score
  }},
  "aspect_based_sentiment": {{
    "aspect_scores": [{{"aspect": score}}],
    "average_score": score
  }},
  "topic_sentiment_analysis": {{
    "topic_scores": [{{"topic": score}}],
    "average_score": score
  }},
  "emotion_detection": score,
  "intent_analysis": score,
  "subjectivity_objectivity": score,
  "contextual_sentiment": {{
    "context_scores": [{{"context_element": score}}],
    "average_score": score
  }},
  "gemini_analysis": {{
    "overall_sentiment": score,
    "emotional_sentiment": score,
    "contextual_sentiment": score
  }}
}}
"""

    try:
        # Check rate limits
        if not rate_limiter.check_and_wait():
            logger.warning("Rate limit reached. Skipping request.")
            return None

        # Use Gemini API
        try:
            response = client.models.generate_content(
                model=MODEL,
                contents=prompt
            )
            
            # Safety checks and handling
            if not hasattr(response, 'text'):
                logger.error("No text attribute in response")
                return None
            
            return response.text
        except Exception as e:
            logger.error(f"Error in Gemini API call: {e}")
            return None
    
    except Exception as e:
        logger.error(f"Error preparing sentiment analysis: {e}")
        return None

def extract_json_from_response(response_text):
    """
    Extract the JSON object from the API response text using a more robust method
    """
    try:
        # Look for JSON pattern using regex - this is more reliable
        match = re.search(r'({.*})', response_text, re.DOTALL)
        if match:
            json_str = match.group(1)
            # Clean up potential issues before parsing
            json_str = re.sub(r',\s*}', '}', json_str)  # Remove trailing commas
            json_str = re.sub(r',\s*]', ']', json_str)  # Remove trailing commas in arrays
            
            return json.loads(json_str)
        else:
            logger.error("No JSON pattern found in response")
            return None
    except json.JSONDecodeError as e:
        logger.error(f"Error parsing JSON: {e}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error extracting JSON: {e}")
        return None

def flatten_sentiment_json(json_obj):
    """
    Flatten the nested JSON structure into a dictionary with consistent key-value pairs
    Only keeps predefined scores and categories for consistent CSV structure
    """
    flat_dict = {}
    
    if not json_obj:
        return flat_dict
    
    # Add default values for key metrics
    # Basic scores
    flat_dict["polarity_detection"] = 0.0
    flat_dict["emotion_detection"] = 0.0
    flat_dict["intent_analysis"] = 0.0
    flat_dict["subjectivity_objectivity"] = 0.0
    
    # Average scores
    flat_dict["fine_grained_sentiment_avg"] = 0.0
    flat_dict["aspect_based_sentiment_avg"] = 0.0
    flat_dict["topic_sentiment_analysis_avg"] = 0.0
    flat_dict["contextual_sentiment_avg"] = 0.0
    
    # Gemini analysis scores
    flat_dict["gemini_overall_sentiment"] = 0.0
    flat_dict["gemini_emotional_sentiment"] = 0.0
    flat_dict["gemini_contextual_sentiment"] = 0.0
    
    # Predefined aspects for Reddit posts (financial aspects)
    predefined_aspects = [
        "company_performance",
        "stock_analysis",
        "investment_strategy",
        "market_trends",
        "company_news",
        "risk_assessment"
    ]
    
    for aspect in predefined_aspects:
        flat_dict[f"aspect_{aspect}"] = 0.0
    
    # Predefined topics (same as aspects for this case)
    for topic in predefined_aspects:
        flat_dict[f"topic_{topic}"] = 0.0
    
    # Extract top-level simple scores
    for key in ['polarity_detection', 'emotion_detection', 'intent_analysis', 'subjectivity_objectivity']:
        if key in json_obj:
            try:
                value = json_obj[key]
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]
                flat_dict[key] = float(value)
            except Exception:
                pass  # Keep default if error
    
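    # Extract Gemini Analysis scores (predefined keys only, to keep CSV columns stable)
    gemini_analysis = json_obj.get('gemini_analysis')
    if isinstance(gemini_analysis, dict):
        for sub_key, value in gemini_analysis.items():
            flat_key = f"gemini_{sub_key}"
            if flat_key not in flat_dict:
                continue  # Ignore unexpected keys returned by the model
            try:
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]
                flat_dict[flat_key] = float(value)
            except Exception:
                pass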
    
    # Extract average scores only (no individual sections)
    if 'fine_grained_sentiment' in json_obj and 'average_score' in json_obj['fine_grained_sentiment']:
        try:
            value = json_obj['fine_grained_sentiment']['average_score']
            if isinstance(value, list) and len(value) > 0:
                value = value[0]
            flat_dict["fine_grained_sentiment_avg"] = float(value)
        except Exception:
            pass
    
    # Extract other average scores
    for key in ['aspect_based_sentiment', 'topic_sentiment_analysis', 'contextual_sentiment']:
        if key in json_obj and 'average_score' in json_obj[key]:
            try:
                value = json_obj[key]['average_score']
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]
                flat_dict[f"{key}_avg"] = float(value)
            except Exception:
                pass
    
    # Process predefined aspects only
    if 'aspect_based_sentiment' in json_obj and 'aspect_scores' in json_obj['aspect_based_sentiment']:
        aspect_scores = json_obj['aspect_based_sentiment']['aspect_scores']
        if isinstance(aspect_scores, list):
            for score_dict in aspect_scores:
                for aspect, score in score_dict.items():
                    # Normalize aspect name for matching
                    normalized_aspect = aspect.lower().replace('/', '_').replace(' ', '_').replace(':', '_')
                    
                    # Find matching predefined aspect
                    matched_aspect = None
                    for predefined in predefined_aspects:
                        if predefined in normalized_aspect:
                            matched_aspect = predefined
                            break
                    
                    # Store only if it matches a predefined aspect
                    if matched_aspect:
                        try:
                            if isinstance(score, list) and len(score) > 0:
                                score = score[0]
                            flat_dict[f"aspect_{matched_aspect}"] = float(score)
                        except Exception:
                            pass
    
    # Process predefined topics only (similar to aspects)
    if 'topic_sentiment_analysis' in json_obj and 'topic_scores' in json_obj['topic_sentiment_analysis']:
        topic_scores = json_obj['topic_sentiment_analysis']['topic_scores']
        if isinstance(topic_scores, list):
            for score_dict in topic_scores:
                for topic, score in score_dict.items():
                    # Normalize topic name for matching
                    normalized_topic = topic.lower().replace('/', '_').replace(' ', '_')
                    
                    # Find matching predefined topic
                    matched_topic = None
                    for predefined in predefined_aspects:  # Using same list as aspects
                        if predefined in normalized_topic:
                            matched_topic = predefined
                            break
                    
                    # Store only if it matches a predefined topic
                    if matched_topic:
                        try:
                            if isinstance(score, list) and len(score) > 0:
                                score = score[0]
                            flat_dict[f"topic_{matched_topic}"] = float(score)
                        except Exception:
                            pass
    
    return flat_dict

# Function to analyze a batch of Reddit posts
def analyze_reddit_batch(batch_items, reddit_data, retry_limit=3):
    """
    Process a batch of Reddit posts
    
    Parameters:
    -----------
    batch_items : list
        List of indices to process
    reddit_data : DataFrame
        DataFrame containing Reddit posts
    retry_limit : int
        Number of retries for API calls
    
    Returns:
    --------
    dict
        Dictionary mapping indices to sentiment results
    """
    batch_results = {}
    
    for idx in batch_items:
        post = reddit_data.iloc[idx]
        
        # Get title and content
        title = post.get('title', '')
        content = post.get('content', '')
        
        # Skip posts without a usable title (missing CSV cells load as NaN, not str)
        if not title or not isinstance(title, str):
            logger.warning(f"Post {idx} has no valid title, skipping")
            continue
        
        # Use the title alone when the content is missing, removed, or deleted
        if not content or not isinstance(content, str) or content in ("[removed]", "[deleted]"):
            combined_text = f"Title: {title}"
            logger.info(f"Post {idx} has no valid content, using title only")
        else:
            combined_text = f"Title: {title}\n\nContent: {content}"
        
        # Limit text length if too long
        if len(combined_text) > MAX_TEXT_LENGTH:
            combined_text = combined_text[:MAX_TEXT_LENGTH]
        
        # Retry mechanism
        success = False
        for attempt in range(retry_limit):
            try:
                # Call Gemini API
                response_text = analyze_sentiment_with_gemini(combined_text)
                
                if not response_text:
                    logger.error(f"No response from API for post {idx}, attempt {attempt+1}/{retry_limit}")
                    time.sleep(3)  # Wait before retry
                    continue
                
                # Extract JSON from response
                sentiment_json = extract_json_from_response(response_text)
                
                if not sentiment_json:
                    logger.error(f"Failed to extract JSON from response for post {idx}, attempt {attempt+1}/{retry_limit}")
                    time.sleep(3)  # Wait before retry
                    continue
                
                # Flatten JSON
                flat_sentiment = flatten_sentiment_json(sentiment_json)
                
                # Store results
                batch_results[idx] = flat_sentiment
                
                success = True
                break
                
            except Exception as e:
                logger.error(f"Error on attempt {attempt+1}/{retry_limit} for post {idx}: {e}")
                time.sleep(3)  # Wait before retry
        
        if not success:
            logger.error(f"Failed to analyze sentiment for post {idx} after {retry_limit} attempts")
    
    return batch_results

def process_reddit_file(file_path, output_dir="sentiment_results/reddit", num_rows=None, retry_limit=3, force_reprocess=True, batch_size=10):
    """
    Process a Reddit CSV file and run sentiment analysis
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV file
    output_dir : str
        Directory to save results
    num_rows : int or None
        Number of rows to process (None for all)
    retry_limit : int
        Number of retries for API calls
    force_reprocess : bool
        If True, reprocess all rows even if they've been processed before
    batch_size : int
        Number of posts to process in a batch
    """
    try:
        # Create output directory if it doesn't exist
        os.makedirs(output_dir, exist_ok=True)
        
        # Extract year from filename
        year = os.path.basename(file_path).replace('reddit_', '').replace('.csv', '')
        
        # Load the CSV file
        logger.info(f"Loading Reddit file: {file_path}")
        reddit_data = pd.read_csv(file_path)
        
        logger.info(f"Loaded {len(reddit_data)} Reddit posts for {year}")
        
        # Limit rows if specified
        if num_rows and num_rows < len(reddit_data):
            reddit_data = reddit_data.iloc[:num_rows]
            logger.info(f"Limited to first {num_rows} posts for testing")
            
        # Set up progress tracking
        progress_file = f"progress/reddit/reddit_{year}_progress.json"
        os.makedirs(os.path.dirname(progress_file), exist_ok=True)
        processed_indices = get_processed_indices(progress_file)
        
        # Create a list of indices to process
        all_indices = list(range(len(reddit_data)))
        if not force_reprocess:
            # Only process unprocessed items
            indices_to_process = [idx for idx in all_indices if idx not in processed_indices]
            logger.info(f"Found {len(indices_to_process)} unprocessed posts")
        else:
            indices_to_process = all_indices
            logger.info(f"Force reprocessing all {len(indices_to_process)} posts")
        
        # Track results
        results = {}
        
        # Process in batches
        total_batches = (len(indices_to_process) + batch_size - 1) // batch_size
        logger.info(f"Processing {len(indices_to_process)} posts in {total_batches} batches of size {batch_size}")
        
        for batch_idx in tqdm(range(0, len(indices_to_process), batch_size), desc=f"Processing Reddit {year} batches"):
            # Get indices for this batch
            batch_indices = indices_to_process[batch_idx:batch_idx + batch_size]
            
            # Process the batch
            batch_results = analyze_reddit_batch(batch_indices, reddit_data, retry_limit)
            
            # Update results; mark only successfully analyzed posts as processed
            # so that failed posts are retried on the next run
            results.update(batch_results)
            processed_indices.update(batch_results.keys())
            
            # Save progress after each batch
            save_processed_indices(progress_file, processed_indices)
            
            # Log progress
            items_processed = min(batch_idx + batch_size, len(indices_to_process))
            logger.info(f"Processed {items_processed}/{len(indices_to_process)} items ({items_processed/len(indices_to_process)*100:.1f}%)")
        
        # Create the result dataframe
        rows = []
        for idx, sentiment in results.items():
            # Get original post data
            post = reddit_data.iloc[idx]
            
            # Clean the date
            clean_date = clean_reddit_date(post.get('date', ''))
            
            # Create a row with required fields
            row = {
                'subreddit': post.get('subreddit', ''),
                'username': post.get('username', ''),
                'date': clean_date,
                'title': post.get('title', ''),
                'reddit_link': post.get('reddit_link', '')
            }
            
            # Add sentiment scores
            for key, value in sentiment.items():
                row[key] = value
            
            rows.append(row)
        
        # Create dataframe from rows
        if rows:
            result_df = pd.DataFrame(rows)
            
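            # Save results; when resuming, keep rows from earlier runs instead of overwriting them
            output_file = os.path.join(output_dir, f"reddit_sentiment_{year}.csv")
            if not force_reprocess and os.path.exists(output_file):
                result_df = pd.concat([pd.read_csv(output_file), result_df], ignore_index=True)
            result_df.to_csv(output_file, index=False)
            logger.info(f"Saved results to {output_file}")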
        else:
            result_df = None  # Nothing to return; avoids a NameError at the return below
            logger.warning(f"No results generated for {year}")
        
        # Save progress
        save_processed_indices(progress_file, processed_indices)
        
        return result_df
        
    except Exception as e:
        # logger.exception logs at ERROR level and appends the full traceback
        logger.exception(f"Error processing file {file_path}: {e}")
        return None

def find_reddit_files():
    """
    Find all Reddit CSV files in the data directory
    """
    files = []
    for year in YEARS_TO_PROCESS:
        file_path = f"data/reddit_data/reddit_{year}.csv"
        if os.path.exists(file_path):
            files.append(file_path)
    return files
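
# --- Illustrative sketch, not part of the original script ---
# find_reddit_files() looks in a fixed data directory. If the location or the
# list of years needs to vary (for example in tests), a parameterized variant
# might look like this. The name `find_reddit_files_in` and both of its
# arguments are hypothetical and shown only for illustration.
import os  # repeated here so the sketch stands alone


def find_reddit_files_in(data_dir, years):
    """Return the existing reddit_<year>.csv paths under data_dir, in year order."""
    candidates = (os.path.join(data_dir, f"reddit_{year}.csv") for year in years)
    return [path for path in candidates if os.path.isfile(path)]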

def process_all_reddit_files(num_rows=None, force_reprocess=True, batch_size=10):
    """
    Process all Reddit files, running sentiment analysis on each
    
    Parameters:
    -----------
    num_rows : int or None
        Number of rows to process per file. If None, processes all rows
    force_reprocess : bool
        If True, reprocess all rows even if they've been processed before
    batch_size : int
        Number of posts to process in a batch
    """
    files = find_reddit_files()
    
    if not files:
        logger.error("No Reddit files found")
        return
    
    total_files = len(files)
    logger.info(f"Found {total_files} Reddit files to process")
    
    for i, file_path in enumerate(files):
        logger.info(f"Processing file {i+1}/{total_files}: {file_path}")
        process_reddit_file(file_path, num_rows=num_rows, force_reprocess=force_reprocess, batch_size=batch_size)
    
    logger.info("Completed processing all files")
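
# --- Illustrative sketch, not part of the original script ---
# batch_size determines how many batches a file needs: the count is the
# ceiling of (number of posts / batch_size). For example, 28082 posts at a
# batch size of 10 need 2809 batches. `count_batches` is a hypothetical
# helper name used only for this illustration.
def count_batches(num_posts, batch_size):
    """Return how many batches of at most batch_size items cover num_posts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (num_posts + batch_size - 1) // batch_size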

def test_gemini_api():
    """Test the Gemini API with a small sample text"""
    logger.info("Testing Gemini API with a small sample...")
    sample_text = "Is there a difference between buying a few shares of Apple stock vs buying a lot of shares? Is there a benefit of buying one or the other?"
    
    try:
        # Simple prompt for testing
        prompt = "Analyze this text for sentiment and respond with a single word: positive, negative, or neutral: " + sample_text
        
        response = client.models.generate_content(
            model=MODEL,
            contents=prompt
        )
        
        # response.text may be None or end with a newline, so normalize it before logging
        logger.info(f"API Test Response: {(response.text or '').strip()}")
        logger.info("Gemini API test successful!")
        return True
    except Exception as e:
        logger.error(f"Gemini API test failed: {e}")
        return False
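
# --- Illustrative sketch, not part of the original script ---
# The "rows per file" prompt in main() accepts either a number or 'all'.
# A small helper could centralize that parsing and reject values that make
# no sense. `parse_row_limit` is a hypothetical name, and treating empty
# input as 'all' is an assumption of this sketch.
def parse_row_limit(text):
    """Map 'all' or empty input to None, and a positive integer string to int."""
    cleaned = text.strip().lower()
    if cleaned in ("", "all"):
        return None
    value = int(cleaned)  # raises ValueError for non-numeric input
    if value < 1:
        raise ValueError(f"row limit must be positive, got {value}")
    return value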

def main():
    # First test if the API is working
    if not test_gemini_api():
        logger.error("Gemini API test failed. Please check your API key and connection.")
        return
    
    # Process all files
    print("\n=== REDDIT SENTIMENT ANALYSIS ===")
    
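    # Get batch size (empty input falls back to the default of 10)
    batch_size_input = input("Enter batch size (default 10): ").strip()
    try:
        batch_size = int(batch_size_input) if batch_size_input else 10
    except ValueError:
        logger.warning(f"Invalid batch size {batch_size_input!r}; using default 10")
        batch_size = 10
    
    # Force reprocessing? Only an explicit 'y' enables it
    force_reprocess_input = input("Force reprocessing of all rows, even if previously processed? (y/n): ")
    force_reprocess = force_reprocess_input.strip().lower() == 'y'
    
    # Number of rows? 'all' (or empty input) means every row
    rows_input = input("How many rows per file? (Enter a number or 'all' for all rows): ").strip().lower()
    try:
        rows_per_file = None if rows_input in ('', 'all') else int(rows_input)
    except ValueError:
        logger.error(f"Invalid row count {rows_input!r}; expected a number or 'all'")
        return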
    
    # Process files
    process_all_reddit_files(num_rows=rows_per_file, force_reprocess=force_reprocess, batch_size=batch_size)

if __name__ == "__main__":
    main()

2025-04-07 03:08:08,830 - INFO - Testing Gemini API with a small sample...
2025-04-07 03:08:08,831 - INFO - AFC is enabled with max remote calls: 10.
2025-04-07 03:08:09,152 - INFO - AFC remote call 1 is done.
2025-04-07 03:08:09,153 - INFO - API Test Response: Neutral
2025-04-07 03:08:09,154 - INFO - Gemini API test successful!

=== REDDIT SENTIMENT ANALYSIS ===

Enter batch size (default 10):  10
Force reprocessing of all rows, even if previously processed? (y/n):  n
How many rows per file? (Enter a number or 'all' for all rows):  all

2025-04-07 03:08:22,651 - INFO - Found 2 Reddit files to process
2025-04-07 03:08:22,652 - INFO - Processing file 1/2: data/reddit_data/reddit_2023.csv
2025-04-07 03:08:22,653 - INFO - Loading Reddit file: data/reddit_data/reddit_2023.csv
2025-04-07 03:08:22,832 - INFO - Loaded 28082 Reddit posts for 2023
2025-04-07 03:08:22,835 - INFO - Found 28082 unprocessed posts
2025-04-07 03:08:22,835 - INFO - Processing 28082 posts in 2809 batches of size 10
Processing Reddit 2023 batches:   0%|          | 0/2809 [00:00<?, ?it/s]
2025-04-07 03:08:22,841 - INFO - Post 0 has no valid content, using title only
2025-04-07 03:08:23,842 - INFO - AFC is enabled with max remote calls: 10.
2025-04-07 03:08:25,355 - INFO - AFC remote call 1 is done.
2025-04-07 03:08:25,356 - INFO - Post 1 has no valid content, using title only
2025-04-07 03:08:26,357 - INFO - AFC is enabled with max remote calls: 10.
2025-04-07 03:08:28,358 - INFO - AFC remote call 1 is done.
2025-04-07 03:08:28,359 - INFO - Post 2 has no 