# Summer Recap: Data Analysis with Pandas

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Review fundamental pandas operations for data manipulation and analysis.
* Apply data cleaning techniques to real-world social science datasets.
* Practice exploratory data analysis using descriptive statistics and basic visualizations.
* Demonstrate ability to filter, group, and aggregate data using pandas methods.
* Evaluate LLM-generated code for accuracy and best practices.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
🤖 **AI Generated**: Code generated by an LLM that we'll test and debug.<br>

### Sections
1. [Data Loading and Initial Exploration](#section1)
2. [Data Cleaning and Basic Operations](#section2)
3. [Exploratory Data Analysis](#section3)
4. [Text Analysis Fundamentals](#section4)
5. [Working with LLM-Generated Code](#section5)

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

<a id='section1'></a>

# Data Loading and Initial Exploration

Today we'll work with data from Reddit's "Am I the Asshole?" (AITA) subreddit. This dataset contains posts where people describe situations and ask for community judgment about their behavior.

In [None]:
# Load the dataset
df = pd.read_csv('../../../data/aita_top_submissions.csv')

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print("\nColumn names:")
print(df.columns.tolist())

In [None]:
# Display the first few rows
df.head()

## 🥊 Challenge 1: Data Overview

Explore the dataset structure and provide a summary of what you find. Use pandas methods to:
1. Check the data types of each column
2. Look for missing values
3. Get basic descriptive statistics for numerical columns

In [None]:
# YOUR CODE HERE



🔔 **Question**: What do you notice about the `selftext` column? What might this tell us about the data?

<a id='section2'></a>

# Data Cleaning and Basic Operations

Real-world data often requires cleaning before analysis. Let's examine our dataset for common issues.

In [None]:
# Check for duplicate posts
print(f"Number of duplicate rows: {df.duplicated().sum()}")

# Look at the distribution of some key variables
print("\nScore statistics:")
print(df['score'].describe())

## 🥊 Challenge 2: Data Cleaning

Clean the dataset by:
1. Removing any posts where `selftext` is missing or empty
2. Creating a new column called `text_length` that contains the character count of `selftext`
3. Filtering out posts that are shorter than 100 characters (likely low-quality posts)

In [None]:
# YOUR CODE HERE



💡 **Tip**: Use the `.str.len()` method to get string lengths in pandas. Remember that missing values might cause issues, so handle them first!
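
Here is a minimal sketch of the tip above, using a made-up three-row Series rather than the AITA data. It shows that `.str.len()` passes missing entries through as missing, so filling (or dropping) them first keeps the lengths clean. The variable names are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical, not from the AITA dataset)
toy = pd.Series(["hello", None, ""])

# .str.len() leaves missing values as missing instead of raising an error
raw_lengths = toy.str.len()

# Filling missing text first gives a length for every row
clean_lengths = toy.fillna("").str.len()

print(raw_lengths.tolist())
print(clean_lengths.tolist())
```

Note that an empty string and a missing value both end up with length 0 after `fillna("")`, so a single length filter can remove both.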

## Working with Dates

The `created` column contains Unix timestamps. Let's convert these to readable dates.

In [None]:
# Convert Unix timestamp to datetime
df['created_date'] = pd.to_datetime(df['created'], unit='s')

# Extract useful date components
df['year'] = df['created_date'].dt.year
df['month'] = df['created_date'].dt.month
df['day_of_week'] = df['created_date'].dt.day_name()

print("Date range in dataset:")
print(f"From: {df['created_date'].min()}")
print(f"To: {df['created_date'].max()}")

<a id='section3'></a>

# Exploratory Data Analysis

Now let's explore patterns in the data using pandas grouping and aggregation functions.
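
As a quick refresher, the sketch below shows the split-apply-combine pattern on a tiny made-up DataFrame (the values are hypothetical and not taken from the AITA data). `.groupby()` splits the rows by a key column, and `.agg()` computes one or more summaries per group.

```python
import pandas as pd

# Toy data (hypothetical values, not from the AITA dataset)
toy = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2020],
    "score": [10, 30, 100, 200, 300],
})

# Split rows by year, then compute several summaries of score per group
summary = toy.groupby("year")["score"].agg(["mean", "max", "count"])
print(summary)
```

The result has one row per group and one column per summary, which makes it easy to sort or plot afterwards.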

## 🥊 Challenge 3: Score Analysis

Analyze post popularity by:
1. Finding the top 10 posts by score
2. Calculating the average score by year
3. Determining which day of the week gets the highest average scores

In [None]:
# YOUR CODE HERE



## Comment Engagement Analysis

In [None]:
# Explore the relationship between text length and engagement
# (requires the `text_length` column created in Challenge 2)
correlation = df[['text_length', 'score', 'num_comments']].corr()
print("Correlation matrix:")
print(correlation)

🔔 **Question**: What does the correlation tell us about the relationship between post length and engagement?

## 🥊 Challenge 4: Engagement Categories

Create engagement categories and analyze them:
1. Create a new column `engagement_level` with categories:
   - 'Low': score < 100
   - 'Medium': 100 ≤ score < 500
   - 'High': 500 ≤ score ≤ 2000
   - 'Viral': score > 2000
2. Calculate the percentage of posts in each category
3. Find the average text length for each engagement level
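
One possible tool for turning a numeric column into categories is `pd.cut`. The sketch below uses made-up scores with hypothetical bin edges and labels (not the ones from this challenge) to show the mechanics. With `right=False`, each bin includes its lower edge and excludes its upper edge, so think about where boundary values should land.

```python
import pandas as pd

# Toy scores (hypothetical, not from the AITA dataset)
toy_scores = pd.Series([5, 50, 75, 400])

# Hypothetical bin edges and labels, chosen only for illustration
bins = [0, 50, 100, float("inf")]
labels = ["small", "mid", "large"]

# right=False -> intervals are [0, 50), [50, 100), [100, inf)
levels = pd.cut(toy_scores, bins=bins, labels=labels, right=False)

# Share of rows per category, as percentages
shares = levels.value_counts(normalize=True) * 100
print(shares)
```

Here the value 50 falls into "mid" rather than "small" because the lower edge of each bin is inclusive.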

In [None]:
# YOUR CODE HERE



<a id='section4'></a>

# Text Analysis Fundamentals

Let's do some basic text analysis to understand the content patterns.

## 🥊 Challenge 5: Text Pattern Analysis

Analyze text patterns by:
1. Finding posts that contain the word "family" (case-insensitive)
2. Counting how many posts mention "wedding" or "marriage"
3. Creating a column indicating whether the post is about relationships (contains words like "boyfriend", "girlfriend", "husband", "wife")

In [None]:
# YOUR CODE HERE



💡 **Tip**: Use the `.str.contains()` method with pandas to search for text patterns. The `case=False` parameter makes the search case-insensitive.
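
A minimal sketch of the tip above on made-up text (the search words here are placeholders, not the ones from the challenge). Besides `case=False`, the `na=False` argument is worth knowing: it makes missing text count as "no match" instead of returning a missing value.

```python
import pandas as pd

# Toy data (hypothetical, not from the AITA dataset)
toy = pd.Series(["My CAT is loud", "Dog park drama", None, "Work stuff"])

# case=False ignores capitalization; na=False treats missing text as "no match"
has_cat = toy.str.contains("cat", case=False, na=False)

# A regex alternation (word1|word2) matches any of several words
has_dog = toy.str.contains("dog|puppy", case=False, na=False)

print(has_cat.tolist())
print(has_dog.tolist())
```

Keep in mind that this is substring matching, so a short pattern such as "cat" would also match inside a longer word like "education".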

## Author Analysis

In [None]:
# Analyze posting patterns by author
author_stats = df['author'].value_counts().head(10)
print("Top 10 most active authors:")
print(author_stats)

## 🥊 Challenge 6: Final Analysis

Combine multiple pandas operations to answer this question:
**"What are the characteristics of the most engaging posts about relationships?"**

Create an analysis that:
1. Filters for relationship-related posts
2. Groups them by engagement level
3. Calculates average text length, comment count, and any other relevant metrics
4. Presents a clear summary of your findings

In [None]:
# YOUR CODE HERE



⚠️ **Warning**: When working with text data, always be mindful of missing values and different text encodings that might cause unexpected results.

<a id='section5'></a>

# Working with LLM-Generated Code

Now let's explore how Large Language Models can assist with coding tasks. We'll generate some code using an LLM, test it, and discover common pitfalls and best practices.

## Generating Code with ChatGPT

Let's ask ChatGPT to help us create a function that analyzes sentiment patterns in our AITA dataset. Here's the prompt we'll use:

**Prompt to ChatGPT:**
*"Write a Python function that takes a pandas DataFrame with a 'selftext' column and creates a simple sentiment analysis. The function should count positive and negative words using predefined word lists, calculate a sentiment score for each post, and return a new DataFrame with sentiment columns added."*

## 🤖 AI Generated Code

Below is the code generated by ChatGPT. Let's run it and see what happens:

In [ ]:
# Generated by ChatGPT - Let's test this code!
def analyze_sentiment(df):
    """
    Analyze sentiment of posts in a DataFrame.
    
    Args:
        df: pandas DataFrame with 'selftext' column
    
    Returns:
        DataFrame with added sentiment columns
    """
    # Define positive and negative words
    positive_words = ['good', 'great', 'awesome', 'excellent', 'fantastic', 'wonderful', 
                     'amazing', 'perfect', 'best', 'love', 'happy', 'joy', 'pleased']
    
    negative_words = ['bad', 'terrible', 'awful', 'horrible', 'worst', 'hate', 'angry', 
                     'sad', 'upset', 'mad', 'furious', 'disgusting', 'annoying']
    
    # Create a copy to avoid modifying original
    result_df = df.copy()
    
    # Initialize sentiment columns
    result_df['positive_count'] = 0
    result_df['negative_count'] = 0
    result_df['sentiment_score'] = 0.0
    
    # Process each row
    for idx, row in result_df.iterrows():
        text = row['selftext'].lower()
        
        # Count positive words
        pos_count = sum(1 for word in positive_words if word in text)
        neg_count = sum(1 for word in negative_words if word in text)
        
        # Calculate sentiment score
        total_words = len(text.split())
        sentiment_score = (pos_count - neg_count) / total_words
        
        # Update DataFrame
        result_df.loc[idx, 'positive_count'] = pos_count
        result_df.loc[idx, 'negative_count'] = neg_count
        result_df.loc[idx, 'sentiment_score'] = sentiment_score
    
    return result_df

# Test the function
print("Testing AI-generated sentiment analysis function...")
sentiment_df = analyze_sentiment(df)
print("Function completed successfully!")

🔔 **Question**: Did the code run successfully? If you got an error, what do you think went wrong?

## Debugging the AI Code

Let's investigate what went wrong and fix the issues step by step:

In [ ]:
# Let's check what's in our dataset first
print("Checking for missing selftext values:")
print(f"Missing values: {df['selftext'].isna().sum()}")
print(f"Total rows: {len(df)}")

# Look at a few selftext examples
print("\nFirst few selftext values:")
for i in range(3):
    print(f"Row {i}: {repr(df.iloc[i]['selftext'])}")
    print(f"Type: {type(df.iloc[i]['selftext'])}")
    print("---")

## 🥊 Challenge 7: Fix the AI Code

The AI-generated code has several issues. Can you identify and fix them?

**Issues to look for:**
1. What happens if `selftext` contains missing values (NaN)?
2. What happens if `selftext` is not a string?
3. Are there performance issues with this approach?
4. Are there edge cases in the sentiment calculation?
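
To see the first and fourth issues in isolation, here is a small sketch on made-up values (not the AITA data). By default pandas represents an empty cell in a text column as a float `NaN`, which has no `.lower()` method, and a post with no words gives a word count of zero.

```python
import numpy as np

# Issue 1: a missing value is a float, so string methods are unavailable
missing = np.nan
try:
    missing.lower()
    nan_error = None
except AttributeError as err:
    nan_error = type(err).__name__

# Issue 4: an empty post has zero words, so dividing by the count fails
empty_text = ""
total_words = len(empty_text.split())
try:
    score = 1 / total_words
    zero_error = None
except ZeroDivisionError as err:
    zero_error = type(err).__name__

print(nan_error, zero_error)
```

Reproducing a failure on a tiny input like this is often the fastest way to understand an error before changing the full function.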

Write an improved version of the function below:

In [ ]:
# YOUR CODE HERE - Fix the AI-generated function


## A Better Implementation

Here's an improved version that handles the edge cases:

In [ ]:
def analyze_sentiment_improved(df):
    """
    Improved sentiment analysis function that handles edge cases.
    
    Args:
        df: pandas DataFrame with 'selftext' column
    
    Returns:
        DataFrame with added sentiment columns
    """
    # Define positive and negative words
    positive_words = set(['good', 'great', 'awesome', 'excellent', 'fantastic', 'wonderful', 
                         'amazing', 'perfect', 'best', 'love', 'happy', 'joy', 'pleased'])
    
    negative_words = set(['bad', 'terrible', 'awful', 'horrible', 'worst', 'hate', 'angry', 
                         'sad', 'upset', 'mad', 'furious', 'disgusting', 'annoying'])
    
    # Create a copy to avoid modifying original
    result_df = df.copy()
    
    # Handle missing values first
    result_df['selftext'] = result_df['selftext'].fillna('')
    
    # Convert to string type to handle any non-string values
    result_df['selftext'] = result_df['selftext'].astype(str)
    
    # Vectorized approach using pandas string methods
    # Convert to lowercase
    text_lower = result_df['selftext'].str.lower()
    
    # Count positive words (\b word boundaries stop 'mad' matching inside 'made')
    pos_pattern = r'\b(?:' + '|'.join(positive_words) + r')\b'
    result_df['positive_count'] = text_lower.str.count(pos_pattern)
    
    # Count negative words (same whole-word matching)
    neg_pattern = r'\b(?:' + '|'.join(negative_words) + r')\b'
    result_df['negative_count'] = text_lower.str.count(neg_pattern)
    
    # Calculate word count (handle empty strings)
    word_counts = text_lower.str.split().str.len().fillna(0)
    
    # Calculate sentiment score (avoid division by zero)
    sentiment_numerator = result_df['positive_count'] - result_df['negative_count']
    result_df['sentiment_score'] = np.where(
        word_counts > 0, 
        sentiment_numerator / word_counts, 
        0
    )
    
    return result_df

# Test the improved function
print("Testing improved sentiment analysis function...")
try:
    sentiment_df_improved = analyze_sentiment_improved(df)
    print("Function completed successfully!")
    print(f"Added columns: {['positive_count', 'negative_count', 'sentiment_score']}")
    
    # Show some results
    print("\nSample results:")
    print(sentiment_df_improved[['positive_count', 'negative_count', 'sentiment_score']].head())
    
except Exception as e:
    print(f"Error occurred: {e}")

## Performance Comparison

Let's compare the performance of the original AI code vs our improved version:

In [ ]:
import time

# Create a small test dataset for timing
test_df = df.head(50).copy()

# Test original function (if it works)
print("Testing original AI function performance...")
try:
    start_time = time.time()
    result1 = analyze_sentiment(test_df)
    original_time = time.time() - start_time
    print(f"Original function took: {original_time:.4f} seconds")
except Exception as e:
    print(f"Original function failed: {e}")
    original_time = None

# Test improved function
print("\nTesting improved function performance...")
start_time = time.time()
result2 = analyze_sentiment_improved(test_df)
improved_time = time.time() - start_time
print(f"Improved function took: {improved_time:.4f} seconds")

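# Compare only if the original ran and the improved time is measurable
if original_time is not None and improved_time > 0:
    speedup = original_time / improved_time
    print(f"\nSpeedup: {speedup:.2f}x faster")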

## LLM Coding Guidelines

Based on our experience with the AI-generated code, let's establish some guidelines for working with LLMs:

### ✅ **DO:**
- **Provide clear context** and specify desired output format
- **Test the code immediately** after generation  
- **Check for edge cases** like missing values, empty strings, wrong data types
- **Verify performance** - AI often uses inefficient approaches
- **Document AI assistance** in comments (e.g., `# Generated with ChatGPT assistance`)
- **Understand the code** before using it in your projects
- **Ask for explanations** if you don't understand parts of the generated code

### ❌ **DON'T:**
- **Ask for too much at once** - break complex tasks into smaller parts
- **Blindly copy-paste** without understanding the code
- **Skip testing** - always run and verify the output
- **Ignore error handling** - AI often misses edge cases
- **Forget to document** AI usage (academic integrity requirement)
- **Use AI output** that leads to plagiarism or incorrect work
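
One lightweight way to act on the testing and edge-case advice above is to write a few assertions before trusting a generated function. The sketch below uses a hypothetical helper, `count_words`, standing in for any AI-generated function. Its name and behavior are assumptions made for illustration, not part of the workshop code.

```python
import pandas as pd

def count_words(text):
    """Hypothetical stand-in for an AI-generated helper: count words in a post."""
    # Treat missing or non-string input as empty text
    if not isinstance(text, str):
        return 0
    return len(text.split())

# Edge cases worth checking: normal text, empty string, missing value, wrong type
assert count_words("two words") == 2
assert count_words("") == 0
assert count_words(None) == 0
assert count_words(float("nan")) == 0
assert count_words(42) == 0

# The helper can then be applied to a column that contains missing values
toy = pd.Series(["a b c", None, ""])
word_counts = toy.apply(count_words)
print(word_counts.tolist())
```

If any assertion fails, you have found a concrete case to bring back to the LLM or to fix yourself.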

## Course Policy on AI Use

### 📋 **Academic Integrity Guidelines**

**You MAY use LLMs as coding assistants IF:**
- You **document their use** clearly (in code comments or assignment submissions)
- You **personally verify and understand** the solution
- You can **explain how the code works** when asked
- You **test the code thoroughly** and fix any bugs

**Examples of acceptable documentation:**
```python
# Used ChatGPT to help write this sentiment analysis function
# Modified the original output to handle edge cases
def my_function():
    pass
```

**You MAY NOT:**
- Use AI assistance **without acknowledgment** 
- Submit AI-generated code that you **don't understand**
- Use AI output that leads to **plagiarism or incorrect work**
- Claim AI-generated work as **entirely your own**

⚠️ **Remember**: Understanding the code is more important than having perfect code. Using LLMs can speed up development, but only if you comprehend what they produce!

## 🥊 Challenge 8: Evaluate AI Output

Now it's your turn! Ask ChatGPT (or another LLM) to generate code for one of these tasks:

1. **Create a function** that finds the most common words in AITA post titles
2. **Generate code** to create a simple visualization of post scores over time  
3. **Write a function** that categorizes posts by topic based on keywords

**Instructions:**
1. Copy your prompt and the AI's response into the cells below
2. Test the generated code
3. Document any bugs or improvements needed
4. Fix the issues and explain what you learned

**Your prompt to the LLM:**

In [ ]:
# YOUR AI-GENERATED CODE HERE
# Remember to add a comment acknowledging AI assistance!


**Issues found and fixes made:**

In [ ]:
# YOUR IMPROVED VERSION HERE

<div class="alert alert-success">

## ❗ Key Points

* Pandas provides powerful tools for loading, cleaning, and exploring real-world datasets.
* Always start data analysis by understanding your dataset structure and checking for data quality issues.
* The `.groupby()` method is essential for aggregating data and finding patterns across categories.
* Text data requires special handling, including case-insensitive searches and pattern matching.
* Correlation analysis helps identify relationships between numerical variables.
* Creating categorical variables from continuous data enables different types of analysis.
* **LLMs can accelerate coding but require careful testing and understanding of generated code.**
* **Always document AI assistance and verify that generated code handles edge cases properly.**

</div>
