# Lab 8: NLP Analysis of Reddit AITA Posts

**COMPSS 211: Advanced Computing I**  
**Time:** 90 minutes  
**Due:** End of lab session

---

## Lab Overview

In this lab, you'll apply the NLP techniques learned in today's lesson to analyze Reddit posts from r/AmItheAsshole (AITA). You'll explore how people describe interpersonal conflicts and uncover linguistic patterns in moral judgment discussions.

### Learning Objectives
- Apply text preprocessing pipelines to social media data
- Use NLTK and spaCy for tokenization and text analysis
- Create and analyze Bag-of-Words and TF-IDF representations
- Perform basic topic analysis and visualization
- Build a simple text classifier for Reddit posts

### Deliverables
- Completed Jupyter notebook with all exercises
- Push your completed notebook to GitHub Classroom
- Brief reflection (last cell) on what you learned

### Grading Rubric
- Data preprocessing pipeline (25%)
- Tokenization and analysis (25%)
- TF-IDF analysis (25%)
- Visualization and interpretation (25%)

---

## Setup and Data Loading (10 minutes)

First, let's import the necessary libraries and load our dataset.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
# Install required packages if needed: uncomment and run this cell,
# then re-run the import cell above
# %pip install gdown wordcloud

In [None]:
# Import NLP libraries
import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# Download required NLTK data (run once)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # used by word_tokenize in newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

In [None]:
# Load spaCy model
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # spaCy raises OSError when the model is not installed, so download it
    !python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')

### Load the Reddit AITA Dataset

We'll download and load real Reddit data from r/AmItheAsshole, a subreddit where people post about interpersonal conflicts and ask for moral judgments.

In [None]:
import gdown
import os

# Create data directory if it doesn't exist
os.makedirs('../../data', exist_ok=True)

# Download the dataset
file_id = "1ct0UQ4Y4rvLYp-402Vb1C-0SXIZWEJ6h"
output_path = "../../data/aita_pp.csv"

# Download if file doesn't exist
if not os.path.exists(output_path):
    print("Downloading dataset...")
    gdown.download(f"https://drive.google.com/uc?id={file_id}", output_path, quiet=False)
else:
    print("Dataset already exists, loading...")

In [None]:
# Load the data
reddit_data = pd.read_csv(output_path)

print(f"Dataset shape: {reddit_data.shape}")
print(f"\nColumns: {list(reddit_data.columns)}")
print(f"\nData types:")
print(reddit_data.dtypes)

In [None]:
# Display first few rows to understand the data structure
reddit_data.head()

In [None]:
# Basic data exploration
print("Dataset Statistics:")
print(f"Total posts: {len(reddit_data)}")
print(f"\nMissing values:")
print(reddit_data.isnull().sum())

# Check for any categorical columns
for col in reddit_data.columns:
    if reddit_data[col].dtype == 'object':
        unique_count = reddit_data[col].nunique()
        if unique_count < 20:  # Only show if reasonable number of categories
            print(f"\n{col} distribution:")
            print(reddit_data[col].value_counts())

In [None]:
# Identify the main text column
# Look for columns that might contain the main text
text_columns = [col for col in reddit_data.columns if 'text' in col.lower() or 'body' in col.lower() or 'content' in col.lower()]
if text_columns:
    text_col = text_columns[0]
    print(f"Using '{text_col}' as the main text column")
else:
    # If no obvious text column, use the first string column with long text
    for col in reddit_data.columns:
        if reddit_data[col].dtype == 'object':
            avg_len = reddit_data[col].astype(str).str.len().mean()
            if avg_len > 100:  # Assume posts are longer than 100 chars on average
                text_col = col
                print(f"Using '{text_col}' as the main text column (avg length: {avg_len:.0f} chars)")
                break
    else:
        # The loop finished without finding a candidate column
        raise ValueError("No text column found; set text_col manually")

# Display sample posts (truncated to 300 characters)
print("\nSample posts:")
for i in range(min(3, len(reddit_data))):
    post = str(reddit_data[text_col].iloc[i])  # str() guards against missing values
    print(f"\nPost {i+1}:")
    print(post[:300] + "..." if len(post) > 300 else post)

---

## Part 1: Text Preprocessing Pipeline (20 minutes)

### Exercise 1.1: Create a Comprehensive Preprocessing Function

Build a preprocessing pipeline that handles the noise typical of Reddit text: URLs, subreddit and user mentions, numbers, and stray punctuation.

In [None]:
def preprocess_reddit_text(text):
    """
    Preprocess Reddit post text.
    
    Steps to implement:
    1. Convert to lowercase
    2. Remove URLs (http/https links)
    3. Remove subreddit mentions (r/subreddit)
    4. Remove user mentions (u/username or /u/username)
    5. Replace numbers with 'NUM' token
    6. Remove special characters but keep apostrophes
    7. Remove extra whitespace
    
    Args:
        text (str): Raw Reddit post text
    
    Returns:
        str: Preprocessed text
    """
    
    # Handle None or non-string inputs
    if pd.isna(text) or not isinstance(text, str):
        return ""
    
    # Step 1: Lowercase
    text = text.lower()
    
    # Step 2: Remove URLs
    text = re.sub(r'http[s]?://\S+', '', text)
    text = re.sub(r'www\.\S+', '', text)
    
    # Step 3: Remove subreddit mentions (r/name or /r/name).
    # The lookbehind keeps slashed word pairs such as "her/his" intact.
    text = re.sub(r'(?<!\w)/?r/\w+', '', text)
    
    # Step 4: Remove user mentions (u/name or /u/name)
    text = re.sub(r'(?<!\w)/?u/\w+', '', text)
    
    # Step 5: Replace numbers with NUM token
    text = re.sub(r'\d+', ' NUM ', text)
    
    # Step 6: Remove special characters except apostrophes
    text = re.sub(r"[^a-zA-Z0-9\s']", ' ', text)
    
    # Step 7: Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

In [None]:
# Test your preprocessing function
test_text = "Check out r/science! User u/john_doe shared this: https://example.com. It got 1500 upvotes!"
print(f"Original: {test_text}")
print(f"Processed: {preprocess_reddit_text(test_text)}")
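
Mention patterns such as `r/\w+` and `u/\w+` can also match inside slashed word pairs like "her/his" or "you/they". The sketch below compares unanchored patterns with variants that require a non-word character (or the start of the string) before the mention. The pattern names and the sample sentence are made up for illustration.

```python
import re

# Unanchored patterns versus variants with a negative lookbehind.
# All four names are hypothetical and exist only for this comparison.
LOOSE_SUBREDDIT = re.compile(r'r/\w+')
LOOSE_USER = re.compile(r'u/\w+')
STRICT_SUBREDDIT = re.compile(r'(?<!\w)/?r/\w+')
STRICT_USER = re.compile(r'(?<!\w)/?u/\w+')

sample = "her/his choice, you/they decide, see r/science and /u/john_doe"

# Apply the subreddit pattern first, then the user pattern
loose = LOOSE_USER.sub('', LOOSE_SUBREDDIT.sub('', sample))
strict = STRICT_USER.sub('', STRICT_SUBREDDIT.sub('', sample))

print("loose: ", loose)
print("strict:", strict)
```

The unanchored version damages ordinary words, while the anchored version removes only the two mentions.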

In [None]:
# Apply preprocessing to all posts
reddit_data['processed_text'] = reddit_data[text_col].apply(preprocess_reddit_text)

# Remove empty processed texts
reddit_data = reddit_data[reddit_data['processed_text'].str.len() > 0].reset_index(drop=True)

print(f"Dataset after preprocessing: {reddit_data.shape}")
reddit_data[[text_col, 'processed_text']].head()

### Exercise 1.2: Compare Tokenization Methods

Compare how NLTK and spaCy tokenize Reddit posts differently.

In [None]:
from nltk.tokenize import word_tokenize

def compare_tokenizers(text):
    """
    Compare NLTK and spaCy tokenization.
    
    Args:
        text (str): Input text
    
    Returns:
        dict: Dictionary with both tokenization results
    """
    
    # NLTK tokenization
    nltk_tokens = word_tokenize(text)
    
    # spaCy tokenization
    doc = nlp(text)
    spacy_tokens = [token.text for token in doc]
    
    return {
        'nltk': nltk_tokens,
        'spacy': spacy_tokens,
        'nltk_count': len(nltk_tokens),
        'spacy_count': len(spacy_tokens)
    }

In [None]:
# Compare tokenization on a sample post
sample_post = reddit_data['processed_text'].iloc[0]
comparison = compare_tokenizers(sample_post[:200])  # Use first 200 chars for readability

print(f"Sample text: {sample_post[:200]}...\n")
print(f"NLTK tokens ({comparison['nltk_count']}): {comparison['nltk'][:20]}...\n")
print(f"spaCy tokens ({comparison['spacy_count']}): {comparison['spacy'][:20]}...")
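
When two token lists are long, it can be easier to look only at where they disagree. The helper below is a minimal sketch using `collections.Counter`; `token_diff` is a hypothetical name, and the two token lists are hand-written for illustration rather than real tokenizer output.

```python
from collections import Counter

def token_diff(tokens_a, tokens_b):
    """Return tokens (with counts) that occur more often in one list than the other."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    return {
        'only_in_a': dict(count_a - count_b),  # Counter subtraction drops non-positive counts
        'only_in_b': dict(count_b - count_a),
    }

# Hand-written token lists that split a contraction differently
a = ["i", "do", "n't", "know"]
b = ["i", "don't", "know"]
diff = token_diff(a, b)
print(diff)
```

In the lab, the same helper could be called as `token_diff(comparison['nltk'], comparison['spacy'])`.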

---

## Part 2: Word Frequency and N-gram Analysis (20 minutes)

### Exercise 2.1: Analyze Word Frequencies

In [None]:
from nltk.corpus import stopwords

def get_word_frequencies(texts, remove_stopwords=True, top_n=10):
    """
    Get word frequencies from a list of texts.
    
    Args:
        texts (list): List of text strings
        remove_stopwords (bool): Whether to remove stopwords
        top_n (int): Number of top words to return
    
    Returns:
        list: List of (word, frequency) tuples
    """
    
    # Get English stopwords
    stop_words = set(stopwords.words('english')) if remove_stopwords else set()
    
    # Tokenize all texts and count frequencies
    all_words = []
    for text in texts:
        if pd.notna(text) and text:  # Check for valid text
            tokens = word_tokenize(text.lower())
            # Filter out stopwords and short words
            words = [token for token in tokens 
                    if token not in stop_words 
                    and len(token) > 2 
                    and token.isalpha()]
            all_words.extend(words)
    
    word_freq = Counter(all_words)
    return word_freq.most_common(top_n)

In [None]:
# Get overall word frequencies
word_freq = get_word_frequencies(reddit_data['processed_text'].tolist(), top_n=15)

print("Top 15 most frequent words in AITA posts:")
for word, freq in word_freq:
    print(f"  {word}: {freq}")
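
Raw counts depend on corpus size, so they are hard to compare across samples of different sizes. One common option is to scale counts to occurrences per 1,000 tokens. The sketch below assumes the text is already tokenized; `relative_frequencies` and the toy sentence are illustrative, not part of the lab code.

```python
from collections import Counter

def relative_frequencies(tokens, top_n=5, per=1000):
    """Return the top_n tokens with counts scaled to occurrences per `per` tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(word, count * per / total) for word, count in counts.most_common(top_n)]

# Toy example: 10 tokens, split on whitespace
toy_tokens = "my friend told my sister that my friend was wrong".split()
rel = relative_frequencies(toy_tokens, top_n=2)
print(rel)
```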

### Exercise 2.2: Visualize Word Frequencies

In [None]:
def plot_word_frequencies(word_freq_list, title, color='steelblue'):
    """
    Create a bar plot of word frequencies.
    
    Args:
        word_freq_list (list): List of (word, frequency) tuples
        title (str): Plot title
        color (str): Bar color
    """
    
    words, frequencies = zip(*word_freq_list)
    
    plt.figure(figsize=(10, 6))
    plt.barh(range(len(words)), frequencies, color=color)
    plt.yticks(range(len(words)), words)
    plt.gca().invert_yaxis()
    plt.title(title)
    plt.xlabel('Frequency')
    plt.tight_layout()
    plt.show()

In [None]:
# Plot word frequencies
plot_word_frequencies(word_freq, 'Top 15 Words in AITA Posts', color='coral')

### Exercise 2.3: Create Word Clouds

In [None]:
def create_wordcloud(texts, title, max_words=50):
    """
    Create a word cloud from texts.
    
    Args:
        texts (list): List of text strings
        title (str): Plot title
        max_words (int): Maximum words in cloud
    """
    
    # Combine all texts
    combined_text = ' '.join([text for text in texts if pd.notna(text)])
    
    # Create WordCloud
    wordcloud = WordCloud(width=800, height=400, 
                          background_color='white',
                          stopwords=set(stopwords.words('english')),
                          max_words=max_words).generate(combined_text)
    
    # Plot
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=16)
    plt.tight_layout()
    plt.show()

In [None]:
# Create word cloud for the dataset
# Sample if dataset is large to avoid memory issues
sample_size = min(1000, len(reddit_data))
sampled_texts = reddit_data['processed_text'].sample(n=sample_size, random_state=42).tolist()
create_wordcloud(sampled_texts, 'AITA Posts Word Cloud')

### Exercise 2.4: N-gram Analysis

In [None]:
from nltk import bigrams, trigrams
from nltk.corpus import stopwords

def get_ngrams(texts, n=2, top_k=10):
    """
    Get top n-grams from texts.
    
    Args:
        texts (list): List of texts (only the first 100 are used, for efficiency)
        n (int): N-gram size (2 for bigrams, 3 for trigrams)
        top_k (int): Number of top n-grams to return
    
    Returns:
        list: List of (ngram, frequency) tuples
    """
    
    stop_words = set(stopwords.words('english'))
    all_ngrams = []
    
    for text in texts[:100]:  # Limit to first 100 texts for efficiency
        if pd.notna(text) and text:
            tokens = word_tokenize(text.lower())
            # Filter tokens
            tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
            
            if n == 2:
                text_ngrams = list(bigrams(tokens))
            elif n == 3:
                text_ngrams = list(trigrams(tokens))
            else:
                continue
                
            all_ngrams.extend(text_ngrams)
    
    ngram_freq = Counter(all_ngrams)
    return ngram_freq.most_common(top_k)

In [None]:
# Get bigrams and trigrams
bigrams_freq = get_ngrams(reddit_data['processed_text'].tolist(), n=2, top_k=10)
trigrams_freq = get_ngrams(reddit_data['processed_text'].tolist(), n=3, top_k=10)

print("Top 10 Bigrams:")
for ngram, freq in bigrams_freq:
    print(f"  {' '.join(ngram)}: {freq}")

print("\nTop 10 Trigrams:")
for ngram, freq in trigrams_freq:
    print(f"  {' '.join(ngram)}: {freq}")
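
`get_ngrams` handles only `n=2` and `n=3`. For other sizes, contiguous n-grams can be built from shifted copies of the token list, as in the minimal sketch below (NLTK also provides `nltk.ngrams(sequence, n)` for the same purpose). The function name and toy tokens are illustrative.

```python
from collections import Counter

def ngrams_from_tokens(tokens, n):
    """Return all contiguous n-grams in tokens as tuples (empty list if too short)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip stops at the shortest shifted copy, which yields exactly len(tokens) - n + 1 tuples
    return list(zip(*(tokens[i:] for i in range(n))))

toy_tokens = ["told", "my", "mother", "in", "law", "told", "my", "mother"]
four_grams = ngrams_from_tokens(toy_tokens, 4)
bigram_counts = Counter(ngrams_from_tokens(toy_tokens, 2))

print(four_grams[:2])
print(bigram_counts.most_common(2))
```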

---

## Part 3: TF-IDF Analysis (20 minutes)

### Exercise 3.1: Create TF-IDF Representations

In [None]:
def create_tfidf_features(texts, max_features=100, ngram_range=(1, 2), min_df=2, max_df=0.95):
    """
    Create TF-IDF features from texts.
    
    Args:
        texts (list): List of text strings
        max_features (int): Maximum number of features
        ngram_range (tuple): Range of n-grams to consider
        min_df (int): Minimum document frequency
        max_df (float): Maximum document frequency
    
    Returns:
        tuple: (TF-IDF matrix, vectorizer)
    """
    
    # Create and fit TF-IDF vectorizer
    vectorizer = TfidfVectorizer(
        max_features=max_features,
        ngram_range=ngram_range,
        stop_words='english',
        min_df=min_df,
        max_df=max_df,
        lowercase=True,
        token_pattern=r'\b[a-zA-Z]{2,}\b'  # Only words with 2+ letters
    )
    
    tfidf_matrix = vectorizer.fit_transform(texts)
    
    return tfidf_matrix, vectorizer

In [None]:
# Create TF-IDF features
# Use a sample if dataset is large
sample_size = min(2000, len(reddit_data))
sample_data = reddit_data.sample(n=sample_size, random_state=42).reset_index(drop=True)

tfidf_matrix, vectorizer = create_tfidf_features(sample_data['processed_text'].tolist())

# Convert to DataFrame for easier analysis
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),  # dense ndarray; todense() would return a legacy np.matrix
    columns=vectorizer.get_feature_names_out(),
    index=sample_data.index
)

print(f"TF-IDF matrix shape: {tfidf_df.shape}")
print(f"\nSample features: {list(tfidf_df.columns[:10])}")
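
To make the numbers in the matrix less opaque, here is a hand computation on a three-document toy corpus. It is a sketch that assumes whitespace tokenization, raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which is the formula scikit-learn documents for `TfidfVectorizer`'s default settings. It leaves out the stop-word removal, bigrams, `min_df`/`max_df` filtering, and `max_features` cap used in the lab.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute smoothed, L2-normalized TF-IDF rows for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

docs = ["wedding family wedding", "family roommate", "family rent roommate"]
vocab, idf, rows = tfidf_rows(docs)

print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

A term that appears in every document ("family") gets the minimum IDF of 1, so rarer terms dominate each row after normalization.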

### Exercise 3.2: Find Most Important Terms

In [None]:
def get_top_tfidf_terms(tfidf_df, top_n=15):
    """
    Get terms with highest mean TF-IDF scores.
    
    Args:
        tfidf_df (DataFrame): TF-IDF DataFrame
        top_n (int): Number of top terms to return
    
    Returns:
        Series: Top terms with their mean TF-IDF scores
    """
    
    # Calculate mean TF-IDF scores across all documents
    mean_tfidf = tfidf_df.mean(axis=0)
    
    return mean_tfidf.nlargest(top_n)

In [None]:
# Get top TF-IDF terms
top_terms = get_top_tfidf_terms(tfidf_df, top_n=15)

# Visualize top terms
plt.figure(figsize=(10, 6))
top_terms.sort_values().plot(kind='barh', color='darkgreen')
plt.title('Top 15 Terms by Mean TF-IDF Score')
plt.xlabel('Mean TF-IDF Score')
plt.tight_layout()
plt.show()

print("Top terms by TF-IDF:")
for term, score in top_terms.items():
    print(f"  {term}: {score:.4f}")

---

## Part 4: Text Classification (20 minutes)

### Exercise 4.1: Create Labels for Classification

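This exercise derives a binary label from the text itself: posts longer than the median length are labeled `long_post`, and the rest `short_post`. The task is deliberately simple; its purpose is to practice the classification workflow.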

In [None]:
# Create a binary classification task based on post length
# Long posts vs short posts (this is just an example - you could use other criteria)
sample_data['text_length'] = sample_data['processed_text'].str.len()
median_length = sample_data['text_length'].median()
sample_data['post_type'] = (sample_data['text_length'] > median_length).map({True: 'long_post', False: 'short_post'})

print(f"Median text length: {median_length:.0f} characters")
print(f"\nPost type distribution:")
print(sample_data['post_type'].value_counts())

### Exercise 4.2: Build a Text Classifier

In [None]:
def build_classifier(X, y, test_size=0.2, random_state=42):
    """
    Build and evaluate a text classifier.
    
    Args:
        X: Feature matrix
        y: Labels
        test_size (float): Proportion of test set
        random_state (int): Random seed
    
    Returns:
        tuple: (trained model, X_test, y_test, predictions)
    """
    
    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    
    # Train a logistic regression classifier
    model = LogisticRegression(max_iter=1000, random_state=random_state)
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate accuracy
    train_accuracy = model.score(X_train, y_train)
    test_accuracy = model.score(X_test, y_test)
    
    print(f"Training Accuracy: {train_accuracy:.3f}")
    print(f"Test Accuracy: {test_accuracy:.3f}")
    
    return model, X_test, y_test, y_pred

In [None]:
# Build and evaluate the classifier
model, X_test, y_test, y_pred = build_classifier(
    tfidf_matrix, 
    sample_data['post_type']
)

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
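
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most common training label; the function name and the toy labels are illustrative and not taken from the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Toy labels for illustration only
toy_train = ["long_post"] * 6 + ["short_post"] * 4
toy_test = ["long_post", "short_post", "short_post", "long_post", "long_post"]
label, acc = majority_baseline_accuracy(toy_train, toy_test)

print(f"Always predicting '{label}' gives accuracy {acc:.2f}")
```

Because the labels in this lab come from a median split, the two classes are nearly balanced and this baseline should sit close to 50%.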

### Exercise 4.3: Analyze Feature Importance

In [None]:
def get_feature_importance(model, vectorizer, top_n=10):
    """
    Get the most important features for classification.
    
    Args:
        model: Trained classifier
        vectorizer: TF-IDF vectorizer
        top_n (int): Number of top features to return
    
    Returns:
        dict: Dictionary with positive and negative features
    """
    
    # Get feature names and coefficients
    feature_names = vectorizer.get_feature_names_out()
    coef = model.coef_[0]
    
    # Create feature importance dataframe
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'coefficient': coef
    }).sort_values('coefficient', ascending=False)
    
    # Get top positive and negative features
    positive_features = feature_importance.head(top_n)[['feature', 'coefficient']].values.tolist()
    negative_features = feature_importance.tail(top_n)[['feature', 'coefficient']].values.tolist()
    
    return {
        'positive': positive_features,
        'negative': negative_features
    }

In [None]:
# Get and print feature importance
important_features = get_feature_importance(model, vectorizer)

# For a binary model, positive coefficients push toward model.classes_[1]
# and negative coefficients toward model.classes_[0]
neg_class, pos_class = model.classes_

print(f"Features most indicative of {pos_class}:")
for feature, score in important_features['positive']:
    print(f"  {feature}: {score:.3f}")

print(f"\nFeatures most indicative of {neg_class}:")
for feature, score in important_features['negative']:
    print(f"  {feature}: {score:.3f}")
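
Reading coefficient signs requires knowing which class counts as "positive". scikit-learn stores class labels in sorted order in `model.classes_`, and for a binary logistic regression the single row of `coef_` scores the second entry. The standard-library sketch below illustrates that convention; the weight, feature value, and intercept are made up.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

labels = ["short_post", "long_post", "long_post", "short_post"]
classes = sorted(set(labels))   # mirrors the sorted order of model.classes_
positive_class = classes[1]     # the class a positive coefficient pushes toward

# Hypothetical weight, TF-IDF value, and intercept, chosen only for illustration
weight, feature_value, intercept = 2.0, 0.5, 0.0
p_positive = sigmoid(weight * feature_value + intercept)

print(f"Positive class: {positive_class}")
print(f"P({positive_class}) with a positive weight: {p_positive:.3f}")
```

With the labels used in this lab, the sorted order puts `long_post` first, so positive coefficients point toward `short_post`.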

### Exercise 4.4: Test the Classifier on New Text

In [None]:
def predict_post_type(text, model, vectorizer, preprocess_func):
    """
    Predict post type for new text.
    
    Args:
        text (str): Input text
        model: Trained classifier
        vectorizer: TF-IDF vectorizer
        preprocess_func: Preprocessing function
    
    Returns:
        tuple: (predicted label, prediction probabilities)
    """
    
    # Preprocess text
    processed_text = preprocess_func(text)
    
    # Transform to TF-IDF features
    features = vectorizer.transform([processed_text])
    
    # Predict
    prediction = model.predict(features)
    probabilities = model.predict_proba(features)
    
    return prediction[0], probabilities[0]

In [None]:
# Test on new posts
test_posts = [
    "AITA for not going?",
    "So this is a long story but bear with me. It all started when I was in college and my roommate asked me to help with a project. I said yes initially but then realized it would take the entire weekend. The project was for a class I wasn't even in, and I had my own assignments due. But here's where it gets complicated...",
    "My friend is mad at me.",
    "I need some perspective on this situation. Last month, my sister planned a surprise party for our mother's 60th birthday. She asked everyone to contribute $100 for the venue and catering. I thought this was reasonable at first, but then I found out she chose the most expensive restaurant in town without consulting anyone."
]

for post in test_posts:
    pred, probs = predict_post_type(post, model, vectorizer, preprocess_reddit_text)
    print(f"\nPost: '{post[:50]}...'")
    print(f"Predicted: {pred}")
    print(f"Confidence: {max(probs):.2%}")

---

## Part 5: Advanced Analysis with spaCy (10 minutes)

### Exercise 5.1: Linguistic Feature Analysis

In [None]:
def analyze_linguistic_features(texts, sample_size=20):
    """
    Analyze linguistic features using spaCy.
    
    Args:
        texts (list): List of texts
        sample_size (int): Number of texts to sample
    
    Returns:
        dict: Dictionary of linguistic statistics
    """
    
    # Take the first sample_size texts for efficiency
    sampled_texts = texts[:sample_size]
    
    stats = {
        'avg_sentence_length': [],
        'noun_ratio': [],
        'verb_ratio': [],
        'adj_ratio': [],
        'pronoun_ratio': []
    }
    
    for text in sampled_texts:
        if pd.notna(text) and len(text) > 0:
            # Process with spaCy (limit length for efficiency)
            doc = nlp(text[:1000])  # Process first 1000 chars
            
            # Count sentences
            sentences = list(doc.sents)
            if sentences:
                avg_sent_len = sum(len(sent.text.split()) for sent in sentences) / len(sentences)
                stats['avg_sentence_length'].append(avg_sent_len)
            
            # Count POS tags
            pos_counts = Counter(token.pos_ for token in doc)
            total_tokens = len(doc)
            
            if total_tokens > 0:
                stats['noun_ratio'].append(pos_counts.get('NOUN', 0) / total_tokens)
                stats['verb_ratio'].append(pos_counts.get('VERB', 0) / total_tokens)
                stats['adj_ratio'].append(pos_counts.get('ADJ', 0) / total_tokens)
                stats['pronoun_ratio'].append(pos_counts.get('PRON', 0) / total_tokens)
    
    # Calculate means
    return {k: np.mean(v) if v else 0 for k, v in stats.items()}

In [None]:
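# Analyze linguistic features on the raw text: the preprocessed text has no
# punctuation or capitalization, which spaCy uses for sentence splitting and POS tagging
linguistic_features = analyze_linguistic_features(sample_data[text_col].tolist(), sample_size=50)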

# Create visualization
feature_df = pd.DataFrame([linguistic_features]).T
feature_df.columns = ['Value']

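# Pass ax explicitly; DataFrame.plot would otherwise open a second, separate figure
fig, ax = plt.subplots(figsize=(10, 6))
feature_df.plot(kind='bar', ax=ax, color='purple', alpha=0.7, legend=False)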
plt.title('Linguistic Features in AITA Posts')
plt.ylabel('Average Value')
plt.xlabel('Feature')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("Linguistic Feature Analysis:")
for feature, value in linguistic_features.items():
    print(f"  {feature}: {value:.3f}")

---

## Reflection and Submission (10 minutes)

Complete the reflection below.

### Lab Reflection

**1. What was the most interesting finding from your analysis of the AITA posts?**

_Your answer here_

**2. Which preprocessing step had the biggest impact on your results?**

_Your answer here_

**3. How might you extend this analysis for a real research project about online discourse or moral judgment?**

_Your answer here_

**4. What challenges did you encounter with the real Reddit data and how did you solve them?**

_Your answer here_

**5. Did you use any AI assistance for this lab? If so, describe how and include your prompts:**

_Your answer here_

---

## Bonus Challenges (Optional)

If you finish early, try these additional challenges:

1. **Sentiment Analysis**: Analyze the sentiment of AITA posts using TextBlob or VADER
2. **Topic Modeling**: Apply LDA to discover main topics discussed in the posts
3. **Judgment Prediction**: If the data has labels (YTA, NTA, etc.), build a classifier to predict judgments
4. **Temporal Analysis**: If timestamps are available, analyze how language changes over time
5. **Named Entity Recognition**: Use spaCy to extract and analyze people, places, and organizations mentioned

In [None]:
# Space for bonus challenges
