# Summer Recap: Data Analysis with Pandas

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Review fundamental pandas operations for data manipulation and analysis.
* Apply data cleaning techniques to real-world social science datasets.
* Practice exploratory data analysis using descriptive statistics and basic visualizations.
* Demonstrate ability to filter, group, and aggregate data using pandas methods.
* Evaluate LLM-generated code for accuracy and best practices.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
🤖 **AI Generated**: Code generated by an LLM that we'll test and debug.<br>

### Sections
1. [Data Loading and Initial Exploration](#section1)
2. [Data Cleaning and Basic Operations](#section2)
3. [Exploratory Data Analysis](#section3)
4. [Text Analysis Fundamentals](#section4)
5. [Working with LLM-Generated Code](#section5)

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

<a id='section1'></a>

# Data Loading and Initial Exploration

Today we'll work with data from Reddit's "Am I the Asshole?" (AITA) subreddit. This dataset contains posts where people describe situations and ask for community judgment about their behavior.

In [3]:
# Load the dataset
df = pd.read_csv('../../data/aita_top_submissions.csv')

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print("\nColumn names:")
print(df.columns.tolist())

Dataset shape: (20000, 18)

Column names:
['idint', 'idstr', 'created', 'self', 'nsfw', 'author', 'title', 'url', 'selftext', 'score', 'subreddit', 'distinguish', 'textlen', 'num_comments', 'flair_text', 'flair_css_class', 'augmented_at', 'augmented_count']


In [4]:
# Display the first few rows
df.head()

| | idint | idstr | created | self | nsfw | author | title | url | selftext | score | subreddit | distinguish | textlen | num_comments | flair_text | flair_css_class | augmented_at | augmented_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 427576402 | t3_72kg2a | 1506433689 | 1.0 | 0.0 | Ritsku | AITA for breaking up with my girlfriend becaus... | | My girlfriend recently went to the beach with ... | 679.0 | AmItheAsshole | | 4917.0 | 434.0 | no a--holes here | | | |
| 1 | 551887974 | t3_94kvhi | 1533404095 | 1.0 | 0.0 | hhhhhhffff678 | AITA for banning smoking in my house and telli... | | My parents smoke like chimneys. I used to as w... | 832.0 | AmItheAsshole | | 2076.0 | 357.0 | asshole | ass | | |
| 2 | 552654542 | t3_951az2 | 1533562299 | 1.0 | 0.0 | creepatthepool | AITA? Creep wears skimpy bathing suit to pool | | Hi guys. Throwaway for obv reasons.\n\nI'm a f... | 23.0 | AmItheAsshole | | 1741.0 | 335.0 | Shitpost | | | |
| 3 | 556350346 | t3_978ioa | 1534254641 | 1.0 | 0.0 | Pauly104 | AITA for eating steak in front of my vegan GF? | | Yesterday night, me and my GF decided to go ou... | 1011.0 | AmItheAsshole | | 416.0 | 380.0 | not the a-hole | not | | |
| 4 | 560929656 | t3_99yo3c | 1535126620 | 1.0 | 0.0 | ThatSpencerGuy | AITA for not wanting to cook my mother-in-law ... | | My wife and I are vegetarians, much to my in-l... | 349.0 | AmItheAsshole | | 1158.0 | 360.0 | not the a-hole | not | | |


## 🥊 Challenge 1: Data Overview

Explore the dataset structure and provide a summary of what you find. Use pandas methods to:
1. Check the data types of each column
2. Look for missing values
3. Get basic descriptive statistics for numerical columns

In [None]:
# YOUR CODE HERE



🔔 **Question**: What do you notice about the `selftext` column? What might this tell us about the data?

<a id='section2'></a>

# Data Cleaning and Basic Operations

Real-world data often requires cleaning before analysis. Let's examine our dataset for common issues.

In [None]:
# Check for duplicate posts
print(f"Number of duplicate rows: {df.duplicated().sum()}")

# Look at the distribution of some key variables
print("\nScore statistics:")
print(df['score'].describe())

## 🥊 Challenge 2: Data Cleaning

Clean the dataset by:
1. Removing any posts where `selftext` is missing or empty
2. Creating a new column called `text_length` that contains the character count of `selftext`
3. Filtering out posts shorter than 100 characters (likely low-quality posts)

In [None]:
# YOUR CODE HERE



💡 **Tip**: Use the `.str.len()` method to get string lengths in pandas. Remember that missing values might cause issues, so handle them first!
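To see why missing values matter here, the following is a minimal sketch on a toy Series (the values are made up and are not from the AITA dataset). It shows that `.str.len()` returns a missing value for a missing string, and that dropping missing values first gives clean integer lengths.

```python
import pandas as pd

# Toy values for illustration only (not taken from the dataset).
s = pd.Series(["hello", None, "pandas"])

# With a missing value present, .str.len() yields NaN for that row.
lengths_with_nan = s.str.len()

# Dropping missing values first gives a length for every remaining row.
lengths_clean = s.dropna().str.len()

print(lengths_with_nan)
print(lengths_clean)
```

The same idea should carry over to a DataFrame column: handle the missing rows first, then compute lengths.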

## Working with Dates

The `created` column contains Unix timestamps (seconds since January 1, 1970, UTC). Let's convert these to readable dates.

In [None]:
# Convert Unix timestamp to datetime
df['created_date'] = pd.to_datetime(df['created'], unit='s')

# Extract useful date components
df['year'] = df['created_date'].dt.year
df['month'] = df['created_date'].dt.month
df['day_of_week'] = df['created_date'].dt.day_name()

print("Date range in dataset:")
print(f"From: {df['created_date'].min()}")
print(f"To: {df['created_date'].max()}")

<a id='section3'></a>

# Exploratory Data Analysis

Now let's explore patterns in the data using pandas grouping and aggregation functions.
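As a quick refresher on grouping and aggregation, here is a small sketch on a toy DataFrame. The column names mirror the notebook's, but the values are invented for illustration. It groups rows by one column and summarizes another, first with a single statistic and then with several at once via `.agg()`.

```python
import pandas as pd

# Toy data: values are invented, only the column names echo the notebook.
toy = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019, 2019],
    "score": [100, 300, 50, 150, 700],
})

# One statistic per group: the mean score for each year.
mean_by_year = toy.groupby("year")["score"].mean()

# Several statistics per group in a single call.
summary = toy.groupby("year")["score"].agg(["mean", "max", "count"])

print(mean_by_year)
print(summary)
```

`.idxmax()` on a grouped result such as `mean_by_year` returns the group label with the largest value, which can be handy when a question asks "which group is highest?".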

## 🥊 Challenge 3: Score Analysis

Analyze post popularity by:
1. Finding the top 10 posts by score
2. Calculating the average score by year
3. Determining which day of the week gets the highest average scores

In [None]:
# YOUR CODE HERE



## Comment Engagement Analysis

In [None]:
# Explore the relationship between text length and engagement
# Note: `text_length` is the column created in Challenge 2, so complete that step first
correlation = df[['text_length', 'score', 'num_comments']].corr()
print("Correlation matrix:")
print(correlation)

🔔 **Question**: What does the correlation tell us about the relationship between post length and engagement?

## 🥊 Challenge 4: Engagement Categories

Create engagement categories and analyze them:
1. Create a new column `engagement_level` with categories:
   - 'Low': score < 100
   - 'Medium': 100 ≤ score < 500
   - 'High': 500 ≤ score ≤ 2000
   - 'Viral': score > 2000
2. Calculate the percentage of posts in each category
3. Find the average text length for each engagement level
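One possible way to turn a numeric column into labeled categories is `np.select`, sketched below on a toy DataFrame with invented scores. This is only one option (`pd.cut` is another), and the boundary handling is an assumption of this sketch: 100 falls in 'Medium', while 500 and 2000 both fall in 'High'.

```python
import numpy as np
import pandas as pd

# Toy scores chosen to sit on and around the category boundaries.
toy = pd.DataFrame({"score": [50, 100, 499, 500, 2000, 2001]})

# np.select uses the first condition that is True for each row,
# so the conditions are listed from the lowest threshold upward.
conditions = [
    toy["score"] < 100,
    toy["score"] < 500,
    toy["score"] <= 2000,
]
labels = ["Low", "Medium", "High"]

# Rows matching none of the conditions get the default label.
toy["engagement_level"] = np.select(conditions, labels, default="Viral")

# Percentage of rows in each category.
share = toy["engagement_level"].value_counts(normalize=True) * 100

print(toy)
print(share.round(1))
```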

In [None]:
# YOUR CODE HERE



<a id='section4'></a>

# Text Analysis Fundamentals

Let's do some basic text analysis to understand the content patterns.

## 🥊 Challenge 5: Text Pattern Analysis

Analyze text patterns by:
1. Finding posts that contain the word "family" (case-insensitive)
2. Counting how many posts mention "wedding" or "marriage"
3. Creating a column indicating whether the post is about relationships (contains words like "boyfriend", "girlfriend", "husband", "wife")

In [None]:
# YOUR CODE HERE



💡 **Tip**: Use the pandas `.str.contains()` method to search for text patterns. The `case=False` parameter makes the search case-insensitive.
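A short sketch of the tip in action, using a toy Series of invented titles. It shows a case-insensitive search for one word, and a search for either of two words using the regular-expression `|` (alternation), which `.str.contains()` supports because it treats the pattern as a regex by default.

```python
import pandas as pd

# Invented titles for illustration only.
titles = pd.Series([
    "My Family dinner",
    "Wedding plans",
    "A quiet day",
    "marriage advice",
])

# Case-insensitive match on a single word.
has_family = titles.str.contains("family", case=False)

# Match either word; "|" means "or" in a regular expression.
has_either = titles.str.contains("wedding|marriage", case=False)

print(has_family)
print(has_either.sum())
```

Because the result is a boolean Series, `.sum()` counts the matching rows and `titles[has_family]` selects them.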

## Author Analysis

In [None]:
# Analyze posting patterns by author
author_stats = df['author'].value_counts().head(10)
print("Top 10 most active authors:")
print(author_stats)

## 🥊 Challenge 6: Final Analysis

Combine multiple pandas operations to answer this question:
**"What are the characteristics of the most engaging posts about relationships?"**

Create an analysis that:
1. Filters for relationship-related posts
2. Groups them by engagement level
3. Calculates average text length, comment count, and any other relevant metrics
4. Presents a clear summary of your findings

In [None]:
# YOUR CODE HERE



⚠️ **Warning**: When working with text data, always be mindful of missing values and different text encodings that might cause unexpected results.
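As a small illustration of the missing-value part of this warning, the sketch below uses a toy Series with one missing entry (values invented). Passing `na=False` to `.str.contains()` tells pandas to treat missing text as "no match", so the result is a clean boolean mask that can be used for filtering.

```python
import pandas as pd

# Toy text with one missing entry, for illustration only.
text = pd.Series(["my wife said", None, "nothing relevant"])

# na=False treats missing text as a non-match instead of leaving a gap.
mask = text.str.contains("wife|husband", case=False, na=False)

# The mask can now be used directly to select rows.
matches = text[mask]

print(mask)
print(matches)
```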

<div class="alert alert-success">

## ❗ Key Points

* Pandas provides powerful tools for loading, cleaning, and exploring real-world datasets.
* Always start data analysis by understanding your dataset structure and checking for data quality issues.
* The `.groupby()` method is essential for aggregating data and finding patterns across categories.
* Text data requires special handling, including case-insensitive searches and pattern matching.
* Correlation analysis helps identify linear relationships between numerical variables; pandas' `.corr()` computes the Pearson correlation by default.
* Creating categorical variables from continuous data enables different types of analysis.

</div>