# HW5: Multi-Document NLP Analysis - r/ChangeMyView Conversations

**Name:** [Your Name Here]

**Date:** [Date]

---

## Assignment Overview

Building on the NLP fundamentals and word embeddings labs, you will analyze conversations in r/ChangeMyView by examining both posts and comments together. This multi-document analysis will help you understand how online discourse unfolds across linked texts.

### What You'll Do:
1. **Link datasets** - Connect posts with their comments
2. **Multi-document preprocessing** - Clean both posts and comments
3. **Conversation analysis** - Compare language patterns between posts and responses
4. **Social dynamics** - Use word embeddings to analyze how perspectives shift in conversations
5. **Discourse patterns** - Identify how different topics generate different types of discussions

### Skills Applied:
- Multi-document NLP techniques
- Dataset linking and relational analysis
- Word embeddings for social analysis (from lab)
- Conversation and discourse analysis

## 1. Setup and Import Libraries

*Building on the tools from your NLP labs*

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
import warnings
warnings.filterwarnings('ignore')

# NLP libraries (from labs)
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA

# Word embeddings (from word embeddings lab)
import gensim.downloader as api
from scipy.spatial.distance import cosine

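# Download required data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK releases load word_tokenize data from 'punkt_tab'
nltk.download('stopwords', quiet=True)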

# Set style
plt.style.use('seaborn-v0_8')
%matplotlib inline

## 2. Load and Link the Datasets

*Multi-document analysis requires linking related texts*

In [None]:
# Load both datasets
posts_df = pd.read_csv('../data/changemyview_posts.csv')
comments_df = pd.read_csv('../data/cmv_comments.csv')

print(f"Posts dataset: {posts_df.shape}")
print(f"Comments dataset: {comments_df.shape}")

# TODO: Explore the structure of both datasets
# What columns do they share for linking?
print(f"\nPosts columns: {posts_df.columns.tolist()}")
print(f"Comments columns: {comments_df.columns.tolist()}")

# Show sample data
print(f"\nSample post:")
print(posts_df[['title', 'score', 'num_comments']].head(1))
print(f"\nSample comment:")
print(comments_df[['body', 'score']].head(1))

In [None]:
# TODO: Link the datasets
# How do posts connect to their comments?
# Hint: Look for shared ID columns

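# Find candidate linking columns by shared name
# (the link itself may use different names, e.g. posts 'id' <-> comments 'link_id')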
linking_columns = set(posts_df.columns) & set(comments_df.columns)
print(f"\nShared columns for linking: {linking_columns}")

# TODO: Analyze the post-comment relationships
# How many posts have comments in our dataset?

print(f"\nUnique posts in posts_df: {posts_df['id'].nunique()}")
print(f"Unique link_ids in comments_df: {comments_df['link_id'].nunique()}")

# Check overlap
posts_with_comments = posts_df['id'].isin(comments_df['link_id'])
print(f"Posts with comments in our dataset: {posts_with_comments.sum()} ({posts_with_comments.mean()*100:.1f}%)")

# Sample conversation preview
sample_post_id = posts_df['id'].iloc[0]
print(f"\nSample conversation:")
print(f"Post: {posts_df[posts_df['id'] == sample_post_id]['title'].iloc[0][:100]}...")
related_comments = comments_df[comments_df['link_id'] == sample_post_id]
print(f"Number of comments: {len(related_comments)}")
if len(related_comments) > 0:
    print(f"Sample comment: {related_comments['body'].iloc[0][:100]}...")
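
One caveat when linking: in Reddit's API, a comment's `link_id` is a "fullname" that carries a `t3_` type prefix (for example `t3_abc123`), while a post's `id` is the bare ID. Whether these CSV files keep that prefix is not stated here, so the following is only a minimal sketch under that assumption. The toy frames and the `post_id` column name are hypothetical.

```python
import pandas as pd

# Toy stand-ins for posts_df and comments_df (hypothetical values)
toy_posts = pd.DataFrame({"id": ["abc123", "def456", "ghi789"]})
toy_comments = pd.DataFrame({"link_id": ["t3_abc123", "t3_abc123", "t3_def456"]})

# Direct comparison finds nothing when link_id carries the 't3_' prefix
raw_overlap = toy_posts["id"].isin(toy_comments["link_id"]).sum()

# Strip a leading 't3_' if present; IDs without the prefix pass through unchanged
toy_comments["post_id"] = toy_comments["link_id"].str.replace(r"^t3_", "", regex=True)
clean_overlap = toy_posts["id"].isin(toy_comments["post_id"]).sum()

print(f"Overlap before stripping: {raw_overlap}, after: {clean_overlap}")
```

If the overlap check above reports 0% of posts with comments, comparing a few raw `id` and `link_id` values side by side is a quick way to see whether a prefix is the cause.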

## 3. Multi-Document Text Preprocessing

*Apply consistent preprocessing to both posts and comments*

In [None]:
# Preprocessing function (from labs)
def clean_text(text):
    """
    Clean and preprocess Reddit text
    """
    if pd.isna(text):
        return ''
    
    text = str(text).lower()
    
    # Remove Reddit-specific content
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'/u/\w+|/r/\w+', '', text)
    text = re.sub(r'\[removed\]|\[deleted\]', '', text)
    
    # Clean text
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

# TODO: Preprocess both posts and comments
# Apply consistent cleaning to both datasets

# Posts preprocessing
posts_df['combined_text'] = posts_df['title'].fillna('') + ' ' + posts_df['selftext'].fillna('')
posts_df['clean_text'] = posts_df['combined_text'].apply(clean_text)

# Comments preprocessing  
comments_df['clean_text'] = comments_df['body'].apply(clean_text)

# Remove empty texts
posts_clean = posts_df[posts_df['clean_text'].str.len() > 10].copy()
comments_clean = comments_df[comments_df['clean_text'].str.len() > 10].copy()

print(f"Clean posts: {len(posts_clean)} (removed {len(posts_df) - len(posts_clean)})")
print(f"Clean comments: {len(comments_clean)} (removed {len(comments_df) - len(comments_clean)})")

# Show preprocessing examples
print(f"\nSample clean post: {posts_clean['clean_text'].iloc[0][:100]}...")
print(f"Sample clean comment: {comments_clean['clean_text'].iloc[0][:100]}...")

In [None]:
# Tokenization (using approach from labs)
stop_words = set(stopwords.words('english'))
stop_words.update(['cmv', 'change', 'view', 'mind', 'think', 'would', 'could', 'should'])

def tokenize_text(text):
    if not text:
        return []
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and len(t) > 2 and t not in stop_words]

# Apply tokenization to both datasets
posts_clean['tokens'] = posts_clean['clean_text'].apply(tokenize_text)
comments_clean['tokens'] = comments_clean['clean_text'].apply(tokenize_text)

print(f"Sample tokens from post: {posts_clean['tokens'].iloc[0][:10]}")
print(f"Sample tokens from comment: {comments_clean['tokens'].iloc[0][:10]}")

# Filter to posts/comments with sufficient tokens
posts_analysis = posts_clean[posts_clean['tokens'].str.len() > 5].copy()
comments_analysis = comments_clean[comments_clean['tokens'].str.len() > 3].copy()

print(f"\nDatasets for analysis:")
print(f"Posts: {len(posts_analysis)}")
print(f"Comments: {len(comments_analysis)}")

## 4. Comparative Word Analysis

*Compare language patterns between posts and comments*

In [None]:
# TODO: Compare word frequencies between posts and comments
# What words are more common in posts vs comments?

# Get word frequencies for each document type
post_tokens = [token for tokens in posts_analysis['tokens'] for token in tokens]
comment_tokens = [token for tokens in comments_analysis['tokens'] for token in tokens]

post_freq = Counter(post_tokens)
comment_freq = Counter(comment_tokens)

print(f"Total post tokens: {len(post_tokens):,}")
print(f"Total comment tokens: {len(comment_tokens):,}")

# Find words that appear in one document type's top-100 list but not the other's
post_top = {word for word, _ in post_freq.most_common(100)}
comment_top = {word for word, _ in comment_freq.most_common(100)}

post_distinctive = post_top - comment_top
comment_distinctive = comment_top - post_top

# Sort by frequency so the printed sample is reproducible (set order is arbitrary)
print(f"\nWords more distinctive to posts: {sorted(post_distinctive, key=post_freq.get, reverse=True)[:10]}")
print(f"Words more distinctive to comments: {sorted(comment_distinctive, key=comment_freq.get, reverse=True)[:10]}")
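
The top-100 set difference above only flags words that fall off one list entirely. As a possible complement (not the method from the labs), a smoothed log-ratio of relative frequencies ranks every word by how much more often it occurs in one document type. The sketch below uses hypothetical toy counts; `distinctive_words`, `min_count`, and `top_n` are illustrative names, and both inputs are assumed to be `Counter` objects.

```python
import math
from collections import Counter

def distinctive_words(freq_a, freq_b, min_count=5, top_n=10):
    """Rank words by log2 ratio of smoothed relative frequency in A versus B."""
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    vocab = set(freq_a) | set(freq_b)
    scores = {}
    for word in vocab:
        # Skip rare words, whose ratios are unstable
        if freq_a[word] + freq_b[word] < min_count:
            continue
        # Add-one smoothing keeps the ratio finite when a word is missing on one side
        rate_a = (freq_a[word] + 1) / (total_a + len(vocab))
        rate_b = (freq_b[word] + 1) / (total_b + len(vocab))
        scores[word] = math.log2(rate_a / rate_b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical counts standing in for post_freq and comment_freq
toy_post_freq = Counter({"believe": 40, "government": 30, "argument": 10, "point": 5})
toy_comment_freq = Counter({"believe": 10, "government": 30, "argument": 40, "point": 45})

ranked = distinctive_words(toy_post_freq, toy_comment_freq, top_n=2)
print(ranked)
```

Swapping the two arguments gives the comment-distinctive ranking. Positive scores mean the word is relatively more frequent in the first counter.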

In [None]:
# TODO: Visualize the comparison
# Create a comparative visualization of post vs comment language

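# Select shared high-frequency words for comparison, minus generic terms
# (sorted by post frequency so the plot is reproducible across runs)
generic_words = {'people', 'think', 'like', 'really', 'know', 'right', 'way'}
common_words = sorted((post_top & comment_top) - generic_words, key=post_freq.get, reverse=True)[:10]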

# Get normalized frequencies
post_total = len(post_tokens)
comment_total = len(comment_tokens)

post_rates = [post_freq[word] / post_total * 1000 for word in common_words]
comment_rates = [comment_freq[word] / comment_total * 1000 for word in common_words]

# Create comparison plot
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(common_words))
width = 0.35

ax.bar(x - width/2, post_rates, width, label='Posts', alpha=0.8)
ax.bar(x + width/2, comment_rates, width, label='Comments', alpha=0.8)

ax.set_xlabel('Words')
ax.set_ylabel('Rate per 1000 tokens')
ax.set_title('Word Usage: Posts vs Comments')
ax.set_xticks(x)
ax.set_xticklabels(common_words, rotation=45)
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# TODO: What does this comparison tell us about how people communicate differently 
# in posts (starting discussions) vs comments (responding to discussions)?

# Your analysis here:
print("\nAnalysis Questions:")
print("1. What words are more common in posts vs comments?")
print("2. What does this suggest about different communication purposes?")
print("3. How might this relate to persuasion vs response dynamics?")

## 5. Conversation Analysis

*Analyze how conversations unfold between posts and their comments*

In [None]:
# TODO: Create linked post-comment pairs for conversation analysis
# Focus on posts that have multiple comments

# Find posts with substantial comment discussions
post_comment_counts = comments_analysis.groupby('link_id').size()
posts_with_discussion = post_comment_counts[post_comment_counts >= 5].index

conversation_posts = posts_analysis[posts_analysis['id'].isin(posts_with_discussion)].copy()
conversation_comments = comments_analysis[comments_analysis['link_id'].isin(posts_with_discussion)].copy()

print(f"Posts with 5+ comments: {len(conversation_posts)}")
print(f"Comments in these conversations: {len(conversation_comments)}")

# Create conversation pairs
conversations = []
for post_id in conversation_posts['id'].head(10):  # Sample 10 conversations
    post_text = conversation_posts[conversation_posts['id'] == post_id]['clean_text'].iloc[0]
    related_comments = conversation_comments[conversation_comments['link_id'] == post_id]['clean_text'].tolist()
    
    conversations.append({
        'post_id': post_id,
        'post_text': post_text,
        'comments': related_comments,
        'num_comments': len(related_comments)
    })

print(f"\nSample conversation structure:")
print(f"Post: {conversations[0]['post_text'][:100]}...")
print(f"First comment: {conversations[0]['comments'][0][:100]}...")
print(f"Comments in conversation: {conversations[0]['num_comments']}")

In [None]:
# TODO: Analyze conversation dynamics
# How do comment responses relate to the original post?

def calculate_text_similarity(text1, text2):
    """
    Calculate cosine similarity between two texts using TF-IDF
    """
    vectorizer = TfidfVectorizer(stop_words='english')
    try:
        tfidf_matrix = vectorizer.fit_transform([text1, text2])
        similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
        return similarity
    except ValueError:
        # Raised when no terms remain after stop-word removal (empty vocabulary)
        return 0.0

# Calculate post-comment similarities
similarities = []
for conv in conversations[:5]:  # Sample 5 conversations
    post_text = conv['post_text']
    for comment_text in conv['comments'][:3]:  # First 3 comments per post (dataset order, not ranked by score)
        sim = calculate_text_similarity(post_text, comment_text)
        similarities.append(sim)

print(f"\nPost-Comment Similarities:")
print(f"Mean similarity: {np.mean(similarities):.3f}")
print(f"Range: {np.min(similarities):.3f} - {np.max(similarities):.3f}")
print(f"\nInterpretation: Higher similarity = comments that closely mirror post language")
print(f"Lower similarity = comments that introduce new concepts/perspectives")
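
Fitting a new TF-IDF model for every post-comment pair means the IDF weights are estimated from only two texts at a time. One alternative, sketched here as an assumption rather than the lab's approach, is to fit once per conversation on the post plus all of its comments, so every similarity in that conversation shares the same vocabulary and weights. The function name and toy texts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def post_comment_similarities(post_text, comment_texts):
    """Cosine similarity of each comment to the post, using one shared TF-IDF fit."""
    if not comment_texts:
        return []
    vectorizer = TfidfVectorizer(stop_words="english")
    try:
        # Row 0 is the post; rows 1..n are the comments
        matrix = vectorizer.fit_transform([post_text] + list(comment_texts))
    except ValueError:
        # Empty vocabulary (for example, every text contains only stop words)
        return [0.0] * len(comment_texts)
    return cosine_similarity(matrix[0:1], matrix[1:])[0].tolist()

toy_post = "schools should teach personal finance to every student"
toy_comments = [
    "personal finance classes would help every student budget",
    "parents rather than schools are responsible for money habits",
    "completely unrelated remark about weekend football",
]
sims = post_comment_similarities(toy_post, toy_comments)
for comment, sim in zip(toy_comments, sims):
    print(f"{sim:.3f}  {comment}")
```

In the notebook this could be called as `post_comment_similarities(conv['post_text'], conv['comments'])` for each entry in `conversations`.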

## 6. Word Embeddings Analysis

*Apply word embeddings from the lab to analyze discourse patterns*

In [None]:
# TODO: Load word embeddings (from your word embeddings lab)
# Use embeddings to analyze semantic patterns in posts vs comments

print("Loading word embeddings...")
model = api.load('glove-wiki-gigaword-50')  # Same as lab
print(f"Model loaded with {len(model.key_to_index):,} words")

# TODO: Define semantic axes for social discourse analysis
# Use the approach from your word embeddings lab

def create_semantic_axis(positive_words, negative_words):
    pos_vectors = [model[word] for word in positive_words if word in model]
    neg_vectors = [model[word] for word in negative_words if word in model]
    
    if not pos_vectors or not neg_vectors:
        return None
    
    pos_mean = np.mean(pos_vectors, axis=0)
    neg_mean = np.mean(neg_vectors, axis=0)
    axis = pos_mean - neg_mean
    return axis / np.linalg.norm(axis)

# Create discourse axes
axes = {
    'agreement_disagreement': create_semantic_axis(
        ['agree', 'support', 'correct', 'right'],
        ['disagree', 'oppose', 'wrong', 'incorrect']
    ),
    'emotional_rational': create_semantic_axis(
        ['feel', 'emotion', 'heart', 'passionate'],
        ['logic', 'rational', 'evidence', 'analysis']
    )
}

print(f"Created {len([a for a in axes.values() if a is not None])} semantic axes")

In [None]:
# TODO: Project key debate words onto semantic axes
# Analyze how posts vs comments differ in their semantic positioning

def project_word_onto_axis(word, axis):
    if word not in model or axis is None:
        return None
    word_vector = model[word] / np.linalg.norm(model[word])
    return np.dot(word_vector, axis)

# Get the most frequent in-vocabulary words from posts and comments
# (these are frequency-ranked, despite the 'distinctive' variable names)
post_distinctive_words = [word for word, count in Counter(post_tokens).most_common(30) 
                        if word in model and count > 20]
comment_distinctive_words = [word for word, count in Counter(comment_tokens).most_common(30) 
                           if word in model and count > 50]

# Project onto agreement-disagreement axis
if axes['agreement_disagreement'] is not None:
    print("\nPost words on Agreement-Disagreement axis:")
    for word in post_distinctive_words[:10]:
        proj = project_word_onto_axis(word, axes['agreement_disagreement'])
        if proj is not None:
            direction = "→ agreement" if proj > 0 else "→ disagreement"
            print(f"  {word}: {proj:.3f} {direction}")
    
    print("\nComment words on Agreement-Disagreement axis:")
    for word in comment_distinctive_words[:10]:
        proj = project_word_onto_axis(word, axes['agreement_disagreement'])
        if proj is not None:
            direction = "→ agreement" if proj > 0 else "→ disagreement"
            print(f"  {word}: {proj:.3f} {direction}")
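
The projections above score individual words. To compare whole posts against whole comments, one option is to average the projections of each text's in-vocabulary tokens. The sketch below shows the idea with hypothetical 3-dimensional toy vectors in place of the GloVe model; `build_axis` and `text_axis_score` are illustrative names, not functions from the lab.

```python
import numpy as np

# Hypothetical 3-d vectors standing in for GloVe; real vectors would come from `model`
toy_vectors = {
    "agree":    np.array([0.9, 0.1, 0.0]),
    "support":  np.array([0.8, 0.2, 0.1]),
    "disagree": np.array([-0.9, 0.1, 0.0]),
    "oppose":   np.array([-0.8, 0.2, 0.1]),
    "fair":     np.array([0.5, 0.5, 0.2]),
    "flawed":   np.array([-0.6, 0.4, 0.1]),
    "policy":   np.array([0.0, 0.9, 0.3]),
}

def build_axis(vectors, positive_words, negative_words):
    """Unit vector pointing from the negative pole toward the positive pole."""
    pos = [vectors[w] for w in positive_words if w in vectors]
    neg = [vectors[w] for w in negative_words if w in vectors]
    if not pos or not neg:
        return None
    axis = np.mean(pos, axis=0) - np.mean(neg, axis=0)
    return axis / np.linalg.norm(axis)

def text_axis_score(tokens, vectors, axis):
    """Mean projection of a text's in-vocabulary tokens onto the axis."""
    projections = [
        float(np.dot(vectors[t] / np.linalg.norm(vectors[t]), axis))
        for t in tokens if t in vectors
    ]
    # None signals that no token was found in the vocabulary
    return float(np.mean(projections)) if projections else None

axis = build_axis(toy_vectors, ["agree", "support"], ["disagree", "oppose"])
post_score = text_axis_score(["policy", "fair", "unknownword"], toy_vectors, axis)
comment_score = text_axis_score(["policy", "flawed"], toy_vectors, axis)
print(f"post: {post_score:.3f}, comment: {comment_score:.3f}")
```

Because both the word vector and the axis are unit length, each projection is a cosine similarity in the range -1 to 1. Averaging treats every token equally, so frequent neutral words pull a text's score toward zero.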

## 7. Social Dynamics Analysis

*Analyze patterns that reveal social and persuasive dynamics*

In [None]:
# TODO: Analyze social dynamics patterns
# Question: How do different types of posts generate different conversation patterns?

# Categorize posts by engagement type
posts_analysis['engagement_type'] = 'low'
posts_analysis.loc[posts_analysis['num_comments'] > posts_analysis['num_comments'].quantile(0.8), 'engagement_type'] = 'high_discussion'
posts_analysis.loc[(posts_analysis['score'] > posts_analysis['score'].quantile(0.8)) & 
                   (posts_analysis['num_comments'] < posts_analysis['num_comments'].quantile(0.5)), 'engagement_type'] = 'high_agreement'

print("Engagement type distribution:")
print(posts_analysis['engagement_type'].value_counts())

# TODO: Analyze language patterns by engagement type
for eng_type in ['high_discussion', 'high_agreement']:
    posts_subset = posts_analysis[posts_analysis['engagement_type'] == eng_type]
    if len(posts_subset) > 5:
        subset_tokens = [token for tokens in posts_subset['tokens'] for token in tokens]
        top_words = Counter(subset_tokens).most_common(10)
        print(f"\nTop words in {eng_type} posts:")
        print([word for word, count in top_words])

# Social dynamics questions for analysis:
print("\nSocial Dynamics Questions:")
print("1. What language patterns distinguish posts that generate debate vs agreement?")
print("2. How do comment responses differ from original post language?")
print("3. What semantic patterns suggest successful persuasion or perspective change?")

In [None]:
# TODO: Create a visualization of conversation dynamics
# Compare semantic positioning of posts vs their comments

# Sample a few conversations for detailed analysis
sample_conversations = conversations[:3]

for i, conv in enumerate(sample_conversations):
    print(f"\nConversation {i+1} Analysis:")
    print(f"Post preview: {conv['post_text'][:80]}...")
    
    # Get key words from post and comments
    post_words = tokenize_text(conv['post_text'])[:10]
    comment_words = []
    for comment in conv['comments'][:3]:
        comment_words.extend(tokenize_text(comment)[:5])
    
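    # Find words that appear in both the post and its comments, and that the embedding model knows
    # (membership tests on `model` avoid rebuilding a set of the full vocabulary on every iteration)
    shared_concepts = {w for w in set(post_words) & set(comment_words) if w in model}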
    
    if shared_concepts:
        print(f"Shared concepts: {list(shared_concepts)[:5]}")
        
        # Project shared concepts onto semantic axes
        if axes['agreement_disagreement'] is not None:
            print("Semantic analysis of shared concepts:")
            for word in list(shared_concepts)[:3]:
                proj = project_word_onto_axis(word, axes['agreement_disagreement'])
                if proj is not None:
                    tendency = "agreement" if proj > 0 else "disagreement"
                    print(f"  {word}: tends toward {tendency} ({proj:.2f})")
    else:
        print("No shared concepts found in embeddings vocabulary")

## 8. Discussion Quality Analysis

*Analyze what makes productive vs unproductive online discussions*

In [None]:
# TODO: Analyze discussion quality indicators
# What linguistic patterns are associated with productive discussions?

# Define quality indicators based on CMV context
quality_indicators = {
    'reasoning_words': ['because', 'therefore', 'evidence', 'research', 'study', 'data', 'analysis'],
    'perspective_words': ['understand', 'perspective', 'viewpoint', 'consider', 'acknowledge'],
    'civility_words': ['respect', 'appreciate', 'thank', 'interesting', 'valid'],
    'change_words': ['delta', 'changed', 'convinced', 'reconsidered', 'modified']
}

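def count_quality_indicators(text, indicator_list):
    """Count how many distinct indicator words appear in the text."""
    if pd.isna(text):
        return 0
    text_lower = str(text).lower()
    # Substring match: 'thank' also matches 'thanks', but 'valid' also matches 'invalid'
    return sum(1 for word in indicator_list if word in text_lower)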

# Apply quality analysis to posts and comments
for category, words in quality_indicators.items():
    posts_analysis[f'{category}_count'] = posts_analysis['clean_text'].apply(
        lambda x: count_quality_indicators(x, words)
    )
    comments_analysis[f'{category}_count'] = comments_analysis['clean_text'].apply(
        lambda x: count_quality_indicators(x, words)
    )

# Compare quality indicators between posts and comments
print("Quality Indicators Comparison:")
for category in quality_indicators.keys():
    post_avg = posts_analysis[f'{category}_count'].mean()
    comment_avg = comments_analysis[f'{category}_count'].mean()
    print(f"{category}: Posts {post_avg:.2f}, Comments {comment_avg:.2f}")
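
Two things can distort the averages above: posts are usually longer than comments, so raw counts favor posts, and substring matching counts "valid" inside "invalid". A possible refinement, sketched here with hypothetical toy texts, is to count whole-token occurrences and report a rate per 100 tokens. `indicator_rate` is an illustrative helper, not part of the lab code.

```python
from collections import Counter

def indicator_rate(tokens, indicator_words, per=100):
    """Occurrences of indicator words per `per` tokens, matching whole tokens only."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in indicator_words)
    return hits / len(tokens) * per

civility_words = ["respect", "appreciate", "thank", "interesting", "valid"]

# Hypothetical cleaned texts, split on whitespace
short_comment = "that is a valid point thank you".split()
long_post = ("this argument is invalid because the data are old and "
             "i respect that people disagree but the claim stays invalid").split()

short_rate = indicator_rate(short_comment, civility_words)
long_rate = indicator_rate(long_post, civility_words)
print(f"comment: {short_rate:.1f} per 100 tokens, post: {long_rate:.1f} per 100 tokens")
```

Whole-token matching is stricter: "thank" no longer matches "thanks", so inflected forms would need to be listed or stemmed. Splitting `clean_text` on whitespace, rather than reusing the stop-word-filtered `tokens` column, keeps indicator words such as "because" that a stop-word list may drop.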

In [None]:
# TODO: Analyze which posts generate the highest quality discussions
# Look at posts with high comment engagement and quality indicators

posts_analysis['total_quality'] = (posts_analysis['reasoning_words_count'] + 
                                  posts_analysis['perspective_words_count'] + 
                                  posts_analysis['civility_words_count'])

# High quality discussion posts
high_quality_posts = posts_analysis[
    (posts_analysis['num_comments'] > posts_analysis['num_comments'].quantile(0.8)) &
    (posts_analysis['total_quality'] > 0)
]

print(f"\nHigh-quality discussion posts: {len(high_quality_posts)}")
if len(high_quality_posts) > 0:
    print("Sample high-quality post titles:")
    for title in high_quality_posts['title'].head(3):
        print(f"- {title[:100]}...")

# Compare quality indicators
print(f"\nQuality comparison:")
print(f"Average reasoning words - All posts: {posts_analysis['reasoning_words_count'].mean():.2f}")
print(f"Average reasoning words - High quality: {high_quality_posts['reasoning_words_count'].mean():.2f}")

## 9. Your Social Dynamics Interpretation

*Connect your findings to broader questions about online discourse*

In [None]:
# TODO: Synthesize your findings to address social science questions
# Use your multi-document NLP analysis to answer questions about online discourse

print("Social Dynamics Questions to Address:")
print("="*60)
print("1. How does language differ between posts (opinion statements) and comments (responses)?")
print("2. What linguistic patterns are associated with productive vs unproductive discussions?")
print("3. How do semantic patterns reveal different types of engagement (agreement, debate, etc.)?")
print("4. What does this analysis reveal about how people engage in persuasion online?")
print("5. How might these patterns inform design of better online discussion platforms?")

print("\n" + "="*60)
print("YOUR ANALYSIS AND INTERPRETATION:")
print("\nBased on your multi-document NLP analysis above, write 3-4 paragraphs addressing these questions.")
print("Connect your specific findings (word patterns, semantic analysis, conversation dynamics)")
print("to broader insights about online discourse and social persuasion.")
print("\nConsider:")
print("- What did the post vs comment language comparison reveal?")
print("- How did the word embeddings analysis show different types of discourse?")
print("- What patterns emerged in quality discussions vs lower-quality ones?")
print("- How do these findings relate to theories about online deliberation and opinion change?")

## 10. Your Multi-Document Analysis and Social Science Interpretation

### TODO: Complete this section with your analysis

Based on your multi-document NLP analysis above, write a 3-4 paragraph interpretation that addresses:

1. **Language Differences**: What did you discover about how language differs between posts (initial arguments) and comments (responses)? What does this suggest about different communicative purposes?

2. **Conversation Dynamics**: How do conversations unfold between posts and comments? What patterns did you find in terms of semantic similarity, quality indicators, and engagement types?

3. **Social Dynamics**: Using your word embeddings analysis, what did you learn about how different types of discourse (agreement vs disagreement, emotional vs rational) manifest in CMV discussions?

4. **Broader Implications**: What do these patterns tell us about online deliberation and opinion change? How might these insights inform the design of better platforms for constructive discourse?

**Your interpretation here:**

[Write your 3-4 paragraph interpretation here, drawing specifically on your analysis results]

### Key Evidence from Your Analysis

TODO: Reference specific findings from your analysis:
- Quote distinctive words/patterns you found
- Cite specific semantic axis results
- Reference conversation similarity scores
- Mention quality indicator patterns

### Additional Multi-Document Analysis (Optional)

Consider extending your analysis with:
- Temporal patterns in conversations
- Comment thread depth analysis
- Cross-referencing high-quality posts with comment patterns

In [None]:
# TODO: Optional Extensions
# 1. Analyze comment thread structures (parent-child relationships)
# 2. Look for evidence of opinion change in comment sequences
# 3. Compare early vs late comments in discussions
# 4. Analyze posts that received "delta" awards (successful view changes)

# Your additional analysis here:
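
For extension 4, a starting point is to flag comments that award a delta. On r/ChangeMyView, deltas are typically awarded by including the `Δ` symbol or `!delta` in a reply. The sketch below assumes those two markers and must run on the raw `body` column, because `clean_text` strips both the symbol and the exclamation mark. The pattern and helper name are hypothetical.

```python
import re

# Matches the delta symbol or the '!delta' command (assumed CMV award markers)
DELTA_PATTERN = re.compile(r"Δ|!delta\b", flags=re.IGNORECASE)

def awards_delta(raw_body):
    """True if a raw (uncleaned) comment body contains a delta marker."""
    if not isinstance(raw_body, str):
        return False  # Handles NaN and None from missing bodies
    return bool(DELTA_PATTERN.search(raw_body))

# Hypothetical comment bodies
toy_bodies = [
    "Δ You changed my view on the second point.",
    "!delta I had not considered the cost side.",
    "The river delta floods every spring.",
    None,
]
flags = [awards_delta(body) for body in toy_bodies]
print(flags)
```

Applied as `comments_df['body'].apply(awards_delta)`, this gives a per-comment flag that can be grouped by `link_id`. It will over-count where a marker is quoted or appears in bot confirmation messages, so spot-checking matches is advisable.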

## 11. Conclusions and Implications

### Key Takeaways from Multi-Document Analysis

TODO: List 4-5 key takeaways from your multi-document analysis:
1. **Language Patterns**: [Your finding about post vs comment language]
2. **Conversation Dynamics**: [Your finding about how discussions unfold]
3. **Semantic Patterns**: [Your finding from word embeddings analysis]
4. **Quality Indicators**: [Your finding about productive discussions]
5. **Social Implications**: [Your broader insight about online discourse]

### Methodological Insights

TODO: Reflect on the multi-document NLP approach:
- What did linking datasets reveal that single-document analysis would miss?
- How did word embeddings enhance your understanding of social dynamics?
- What challenges did you encounter in conversation analysis?

### Implications for Online Platform Design

TODO: Based on your findings, what recommendations would you make for designing better discussion platforms?

### Limitations and Future Work

TODO: Discuss limitations and extensions:
- Dataset limitations (time period, platform-specific effects)
- Methodological limitations (preprocessing choices, semantic axes)
- Future work (longitudinal analysis, cross-platform comparison, intervention studies)