# Lab: Word Embeddings for Social Text Analysis

**Time:** 90 minutes  
**Work in groups of 2-3**

---

## Lab Overview

In this lab, you'll apply word embeddings to analyze social media discourse. Building on the word embeddings lesson, you'll work with real social text data to explore semantic relationships, analyze bias, and compare different discussion topics.

### Learning Objectives
- Load and work with pre-trained word embeddings
- Analyze semantic similarity between words in social contexts
- Create and interpret semantic axes for social concepts
- Compare discourse patterns across different discussion topics
- Visualize word relationships in social text

### Dataset
We'll use a sample from our r/ChangeMyView dataset to explore how people discuss different topics and frame arguments.

---

## Setup (5 minutes)

Import libraries and load pre-trained embeddings.

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Word embeddings
import gensim
import gensim.downloader as api
from scipy.spatial.distance import cosine
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Text processing
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

%matplotlib inline
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

In [None]:
# Load pre-trained GloVe embeddings (50-dimensional for speed)
print("Loading word embeddings...")
model = api.load('glove-wiki-gigaword-50')
print(f"Model loaded! Vocabulary size: {len(model.key_to_index):,} words")

## Part 1: Exploring Social Discourse with Embeddings (20 minutes)

### Exercise 1.1: Load and Explore Our Data

In [None]:
# Load our ChangeMyView data
posts_df = pd.read_csv('../data/changemyview_posts.csv')
comments_df = pd.read_csv('../data/cmv_comments.csv')

print(f"Posts: {len(posts_df):,}")
print(f"Comments: {len(comments_df):,}")

# Sample for this lab (for manageable processing time); cap at the dataset size
posts_sample = posts_df.sample(min(500, len(posts_df)), random_state=42).copy()
print(f"\nWorking with {len(posts_sample)} sampled posts for this lab")

In [None]:
# Quick look at our data
print("Sample CMV post titles:")
for i in range(3):
    title = posts_sample.iloc[i]['title']
    print(f"{i+1}. {title[:80]}...")

### Exercise 1.2: Word Similarity in Debate Contexts

**Task:** Explore which debate-related words are most similar to key terms. Work with your group to:
1. Define lists of words related to different aspects of debate/argumentation
2. Find the most similar words to each
3. Discuss what this reveals about how these concepts are used

In [None]:
# Define debate-related terms to explore
debate_terms = {
    'argument': ['argument', 'debate', 'discussion', 'reasoning'],
    'opinion': ['opinion', 'belief', 'view', 'perspective'], 
    'evidence': ['evidence', 'proof', 'data', 'facts'],
    'morality': ['right', 'wrong', 'moral', 'ethical']
}

def explore_word_similarity(word, topn=8):
    """
    Find most similar words and display with similarity scores.
    """
    if word in model:
        similar = model.most_similar(word, topn=topn)
        print(f"\nWords most similar to '{word}':")
        for sim_word, score in similar:
            print(f"  {sim_word}: {score:.3f}")
    else:
        print(f"'{word}' not found in vocabulary")

# TODO: Choose 2-3 words from the lists above and explore their similarities
# Discuss with your group: What do you notice? Are there any surprising similarities?

# YOUR CODE HERE - explore word similarities


**Discussion:** What patterns do you notice? Are there unexpected word associations?
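When reading the scores, it helps to know that `most_similar` ranks neighbors by cosine similarity between word vectors. The sketch below computes that ranking by hand on a tiny made-up vocabulary. The three-dimensional vectors and the names `toy_vectors`, `cosine_sim`, and `rank_by_similarity` are hypothetical, chosen for illustration, and are not GloVe values.

```python
import numpy as np

# Hypothetical 3-d vectors for illustration only (not real GloVe values)
toy_vectors = {
    'argument': np.array([0.9, 0.1, 0.0]),
    'debate':   np.array([0.8, 0.2, 0.1]),
    'banana':   np.array([0.0, 0.1, 0.9]),
}

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_similarity(word, vectors):
    """Return the other words sorted from most to least similar to `word`."""
    scores = [(other, cosine_sim(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = rank_by_similarity('argument', toy_vectors)
for other, score in ranked:
    print(f"{other}: {score:.3f}")
```

A score near 1 means two vectors point in almost the same direction. A score near 0 means they are close to orthogonal, which is usually read as unrelated.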

### Exercise 1.3: Word Analogies in Social Context

**Task:** Test social and political analogies using the `most_similar` function.
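As a guide to reading the results: `most_similar(positive=[b, c], negative=[a])` answers "a is to b as c is to ?" by looking for words whose vectors lie near b - a + c. The sketch below reproduces that arithmetic on a made-up two-dimensional vocabulary. The vectors and the `solve_analogy` helper are hypothetical, and gensim's own implementation differs in detail (for example, it normalizes the input vectors before combining them).

```python
import numpy as np

# Hypothetical 2-d vectors: first dimension ~ "royalty", second ~ "gender"
toy_vectors = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via the nearest vector to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):  # skip the input words, as gensim does
            continue
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, float(score)
    return best_word, best_score

# "man is to king as woman is to ?"
answer, score = solve_analogy('man', 'king', 'woman', toy_vectors)
print(answer, round(score, 3))
```

When you write your own analogies, check that the word you subtract (the `negative` entry) is the "a" term of "a is to b".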

In [None]:
def test_analogy(pos_words, neg_words, description):
    """
    Test word analogies using most_similar function.
    """
    try:
        result = model.most_similar(positive=pos_words, negative=neg_words, topn=5)
        print(f"\n{description}")
        print(f"Formula: {' + '.join(pos_words)} - {' - '.join(neg_words)}")
        print("Results:")
        for word, score in result:
            print(f"  {word}: {score:.3f}")
    except Exception as e:
        print(f"Error with analogy {description}: {e}")

# Test some political/social analogies
analogies_to_test = [
    # Format: ([positive_words], [negative_words], "description")
    (['liberal', 'republican'], ['conservative'], "Conservative is to Republican as Liberal is to..."),
    (['government', 'freedom'], ['control'], "Control is to Government as Freedom is to..."),
    (['individual', 'society'], ['personal'], "Personal is to Individual as Society is to..."),
]

# TODO: Run these analogies and add 2-3 of your own
# Discuss: Do the results make sense? What do they reveal about political discourse?

for pos, neg, desc in analogies_to_test:
    test_analogy(pos, neg, desc)

# YOUR ANALOGIES HERE - try creating your own based on the CMV topics you've seen


---

## Part 2: Semantic Axes for Social Analysis (25 minutes)

### Exercise 2.1: Construct Semantic Axes

**Task:** Build semantic axes relevant to political and social discourse.
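One check worth running before interpreting an axis is whether its own seed words land on the expected sides. The following is a minimal sketch on made-up three-dimensional vectors. `toy_vectors`, `build_axis`, and `unit` are hypothetical names for illustration, and `build_axis` assumes the mean-difference recipe used in this exercise.

```python
import numpy as np

# Hypothetical vectors for illustration only (not real GloVe values)
toy_vectors = {
    'hot':    np.array([0.9, 0.2, 0.1]),
    'warm':   np.array([0.7, 0.3, 0.2]),
    'cold':   np.array([-0.8, 0.2, 0.1]),
    'frozen': np.array([-0.9, 0.1, 0.3]),
}

def build_axis(positive_words, negative_words, vectors):
    """Unit-length difference between the mean positive and mean negative vectors."""
    pos_mean = np.mean([vectors[w] for w in positive_words], axis=0)
    neg_mean = np.mean([vectors[w] for w in negative_words], axis=0)
    axis = pos_mean - neg_mean
    return axis / np.linalg.norm(axis)

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

axis = build_axis(['hot', 'warm'], ['cold', 'frozen'], toy_vectors)

# Seed words should project onto the side they were used to define
seed_scores = {w: float(np.dot(unit(vec), axis)) for w, vec in toy_vectors.items()}
for word, score in seed_scores.items():
    print(f"{word:8s} {score:+.3f}")
```

If a seed word lands on the wrong side with real embeddings, that often points to a polysemous seed (for example, 'left' and 'right' also carry spatial meanings), and swapping it for a less ambiguous word may help.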

In [None]:
def create_semantic_axis(positive_words, negative_words, model):
    """
    Create a semantic axis from two sets of opposing words.
    
    Returns the normalized difference vector.
    """
    # Get embeddings for words that exist in vocabulary
    pos_vectors = [model[word] for word in positive_words if word in model]
    neg_vectors = [model[word] for word in negative_words if word in model]
    
    if not pos_vectors or not neg_vectors:
        raise ValueError("At least one end of the axis has no words in the vocabulary")
    
    # Calculate mean vectors
    pos_mean = np.mean(pos_vectors, axis=0)
    neg_mean = np.mean(neg_vectors, axis=0)
    
    # Create semantic axis (difference vector)
    axis = pos_mean - neg_mean
    
    # Normalize
    axis = axis / np.linalg.norm(axis)
    
    return axis

# TODO: Define word lists for creating semantic axes
# Think about important dimensions in political/social discourse

# Example axes to create:
axis_definitions = {
    'liberal_conservative': {
        'positive': ['liberal', 'progressive', 'left'],  # Liberal end
        'negative': ['conservative', 'traditional', 'right']  # Conservative end
    },
    # TODO: Add your own axis definitions
    # Ideas: individual_collective, freedom_security, change_stability
}

# Create the semantic axes
semantic_axes = {}
for axis_name, words in axis_definitions.items():
    try:
        axis = create_semantic_axis(words['positive'], words['negative'], model)
        semantic_axes[axis_name] = axis
        print(f"✓ Created axis: {axis_name}")
    except Exception as e:
        print(f"✗ Failed to create {axis_name}: {e}")

print(f"\nCreated {len(semantic_axes)} semantic axes")

### Exercise 2.2: Project Words onto Semantic Axes

**Task:** Project social and political terms onto your axes to see where they fall.
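Because both the word vector and the axis are scaled to unit length, the dot product used in this exercise equals the cosine of the angle between them. Scores therefore lie between -1 and 1 and depend only on direction, not on vector length. A small sketch with made-up vectors illustrates this; `project` and all values below are hypothetical.

```python
import numpy as np

def project(vector, axis):
    """Dot product of the unit-normalized vector with a unit-length axis."""
    return float(np.dot(vector / np.linalg.norm(vector), axis))

# Hypothetical unit-length axis and two vectors pointing the same way
axis = np.array([1.0, 0.0])
short_vec = np.array([0.3, 0.4])
long_vec = 10 * short_vec  # same direction, ten times the length

print(project(short_vec, axis))   # direction only: length does not matter
print(project(long_vec, axis))
print(project(-short_vec, axis))  # opposite direction flips the sign
```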

In [None]:
def project_word_onto_axis(word, axis, model):
    """
    Project a word onto a semantic axis using cosine similarity.
    
    Positive values = closer to positive end of axis
    Negative values = closer to negative end of axis
    """
    if word not in model:
        return None
    
    word_vector = model[word]
    # Normalize word vector
    word_vector = word_vector / np.linalg.norm(word_vector)
    
    # Calculate projection (dot product with normalized axis)
    projection = np.dot(word_vector, axis)
    
    return projection

# TODO: Define lists of words to test on your axes
# Think about political terms, social issues, institutions, etc.

test_words = [
    'democracy', 'capitalism', 'socialism', 'freedom', 'equality',
    'government', 'individual', 'community', 'rights', 'responsibility',
    'progress', 'tradition', 'change', 'stability', 'reform'
    # TODO: Add more words relevant to CMV discussions
]

# Project words onto each axis
projections = {}

for axis_name, axis in semantic_axes.items():
    projections[axis_name] = {}
    print(f"\nProjections onto {axis_name} axis:")
    print("-" * 40)
    
    word_projections = []
    for word in test_words:
        proj = project_word_onto_axis(word, axis, model)
        if proj is not None:
            projections[axis_name][word] = proj
            word_projections.append((word, proj))
    
    # Sort by projection value
    word_projections.sort(key=lambda x: x[1], reverse=True)
    
    for word, proj in word_projections:
        direction = "→" if proj > 0 else "←"
        print(f"  {word:15s} {direction} {proj:6.3f}")

**Discussion:** What do these projections tell you about how these concepts are positioned in semantic space?

### Exercise 2.3: Visualize Semantic Relationships

**Task:** Create visualizations of how words relate on your semantic axes.

In [None]:
def plot_semantic_axis(word_projections, axis_name, positive_label, negative_label):
    """
    Plot words positioned along a semantic axis.
    """
    # Prepare data
    words = list(word_projections.keys())
    values = list(word_projections.values())
    
    # Sort by values for better visualization
    sorted_data = sorted(zip(words, values), key=lambda x: x[1])
    words_sorted = [x[0] for x in sorted_data]
    values_sorted = [x[1] for x in sorted_data]
    
    # Create color map (red for negative, blue for positive)
    colors = ['red' if v < 0 else 'blue' for v in values_sorted]
    
    # Create plot
    plt.figure(figsize=(12, 8))
    bars = plt.barh(range(len(words_sorted)), values_sorted, color=colors, alpha=0.7)
    
    # Customize plot
    plt.yticks(range(len(words_sorted)), words_sorted)
    plt.xlabel(f'{negative_label} ← Projection Score → {positive_label}')
    plt.title(f'Word Projections on {axis_name.title()} Axis')
    plt.axvline(x=0, color='black', linestyle='--', alpha=0.5)
    
    # Add legend
    plt.text(min(values_sorted) * 0.8, len(words_sorted) * 0.9, 
             negative_label, fontsize=12, ha='left', color='red')
    plt.text(max(values_sorted) * 0.8, len(words_sorted) * 0.9, 
             positive_label, fontsize=12, ha='right', color='blue')
    
    plt.tight_layout()
    plt.show()

# TODO: Create visualizations for your semantic axes
# Choose appropriate labels for the positive and negative ends

axis_labels = {
    'liberal_conservative': ('Liberal', 'Conservative'),
    # TODO: Add labels for your other axes
}

for axis_name in semantic_axes.keys():
    if axis_name in projections and axis_name in axis_labels:
        pos_label, neg_label = axis_labels[axis_name]
        plot_semantic_axis(projections[axis_name], axis_name, pos_label, neg_label)

---

## Part 3: Analyzing CMV Discourse (25 minutes)

### Exercise 3.1: Extract Key Terms from Posts

**Task:** Extract and analyze the most common meaningful words from CMV posts.
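Splitting on whitespace leaves punctuation attached to tokens, so it has to be stripped before filtering. As an alternative sketch (not the approach this lab uses), a regular expression can pull out alphabetic tokens directly. The sample sentence, the `tokenize` helper, and the small stopword set are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical mini stopword list; the lab uses NLTK's English list instead
toy_stopwords = {'that', 'this', 'with', 'should', 'have'}

def tokenize(text, min_len=4):
    """Lowercase, keep runs of letters, drop short tokens and stopwords."""
    # Note: this pattern splits contractions such as "don't" into "don" and "t"
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in toy_stopwords]

sample = "CMV: Voting should be mandatory. Mandatory voting improves democracy!"
counts = Counter(tokenize(sample))
print(counts.most_common(3))
```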

In [None]:
def extract_meaningful_words(texts, min_freq=3, max_words=50):
    """
    Extract meaningful words from texts, filtering by frequency and relevance.
    """
    stop_words = set(stopwords.words('english'))
    # Add CMV-specific stopwords
    stop_words.update(['cmv', 'think', 'people', 'would', 'could', 'one', 'get', 'like', 'also', 'really', 'much', 'make', 'even'])
    
    all_words = []
    
    for text in texts:
        if pd.notna(text):
            # Simple tokenization: lowercase, split on whitespace, strip punctuation once
            tokens = (word.strip('.,!?;:"()[]') for word in text.lower().split())
            # Keep alphabetic tokens longer than 3 characters (measured after
            # stripping) that are not stopwords and exist in the embedding model
            meaningful_words = [
                token for token in tokens
                if len(token) > 3
                and token not in stop_words
                and token.isalpha()
                and token in model
            ]
            all_words.extend(meaningful_words)
    
    # Count frequencies
    word_freq = Counter(all_words)
    
    # Filter by minimum frequency and return top words
    filtered_words = {word: freq for word, freq in word_freq.items() if freq >= min_freq}
    top_words = dict(Counter(filtered_words).most_common(max_words))
    
    return top_words

# Extract meaningful words from CMV posts
meaningful_words = extract_meaningful_words(
    posts_sample['title'].fillna('') + ' ' + posts_sample['selftext'].fillna(''),
    min_freq=2,
    max_words=30
)

print("Most common meaningful words in CMV posts:")
for word, freq in list(meaningful_words.items())[:15]:
    print(f"  {word}: {freq}")

### Exercise 3.2: Semantic Analysis of CMV Discourse

**Task:** Analyze where CMV discourse falls on your semantic axes.
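The overall tendency reported in this exercise is a frequency-weighted mean of the projection scores, so frequent words pull the result toward their side of the axis. A minimal sketch with made-up scores and counts (none of these numbers come from the CMV data) shows how weighting can change the picture:

```python
import numpy as np

# Hypothetical projection scores and word frequencies
scores = np.array([0.30, -0.10, -0.20])
freqs = np.array([1, 5, 4])

unweighted = float(np.mean(scores))
weighted = float(np.average(scores, weights=freqs))

print(f"Unweighted mean: {unweighted:.3f}")
print(f"Frequency-weighted mean: {weighted:.3f}")
```

Here the rare word with the positive score balances the others in the plain mean, while the weighted mean follows the two frequent words toward the negative end.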

In [None]:
# Project CMV words onto semantic axes
cmv_projections = {}

for axis_name, axis in semantic_axes.items():
    cmv_projections[axis_name] = {}
    
    print(f"\nCMV discourse on {axis_name} axis:")
    print("-" * 50)
    
    word_projections = []
    
    # Project the most common CMV words
    for word in list(meaningful_words.keys())[:20]:  # Top 20 words
        if word in model:
            proj = project_word_onto_axis(word, axis, model)
            if proj is not None:
                cmv_projections[axis_name][word] = proj
                word_projections.append((word, proj, meaningful_words[word]))
    
    # Sort by projection value
    word_projections.sort(key=lambda x: x[1], reverse=True)
    
    for word, proj, freq in word_projections:
        direction = "→" if proj > 0 else "←"
        print(f"  {word:15s} {direction} {proj:6.3f} (freq: {freq})")
    
    # Calculate overall discourse tendency
    if word_projections:
        # Weight by frequency
        weighted_avg = sum(proj * freq for _, proj, freq in word_projections) / sum(freq for _, _, freq in word_projections)
        print(f"\n  → Overall tendency: {weighted_avg:.3f}")
        if weighted_avg > 0.05:
            print(f"    CMV discourse leans toward positive end")
        elif weighted_avg < -0.05:
            print(f"    CMV discourse leans toward negative end")
        else:
            print(f"    CMV discourse is relatively balanced")

### Exercise 3.3: Topic-Specific Analysis

**Task:** Compare how different topics in CMV are positioned on your semantic axes.
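Keyword matching with Python's `in` operator matches substrings, so a keyword such as 'tax' also fires on 'syntax'. If that matters for your analysis, one possible refinement is to match whole words with a regular expression. The sketch below is an illustration, not the classifier this lab uses; `count_keyword_hits` and the example sentence are hypothetical.

```python
import re

def count_keyword_hits(text, keywords):
    """Count how many keywords occur in the text as whole words."""
    text = text.lower()
    return sum(
        1 for keyword in keywords
        if re.search(rf'\b{re.escape(keyword)}\b', text)
    )

keywords = ['tax', 'income', 'right']
sentence = "Python syntax is easier to learn than copyright law."

substring_hits = sum(1 for k in keywords if k in sentence.lower())
whole_word_hits = count_keyword_hits(sentence, keywords)

print(f"Substring matches: {substring_hits}")
print(f"Whole-word matches: {whole_word_hits}")
```

The trade-off is that whole-word matching misses inflected forms such as 'taxes' or 'rights' unless you list them explicitly.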

In [None]:
def classify_posts_by_topic(posts_df):
    """
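    Simple topic classification based on keywords in post titles and bodies.
    
    Keywords are matched as substrings (so 'right' also matches 'rights' and
    'copyright'), and ties go to the topic listed first.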
    """
    topics = {
        'politics': ['government', 'political', 'democrat', 'republican', 'conservative', 'liberal', 'policy', 'election'],
        'social': ['social', 'society', 'culture', 'community', 'people', 'human', 'relationship'],
        'economics': ['economic', 'money', 'wealth', 'capitalism', 'tax', 'income', 'business'],
        'ethics': ['moral', 'ethical', 'right', 'wrong', 'should', 'ought', 'justice']
    }
    
    topic_assignments = []
    
    for _, row in posts_df.iterrows():
        title_lower = str(row['title']).lower()
        text_lower = str(row['selftext']).lower() if pd.notna(row['selftext']) else ''
        combined_text = title_lower + ' ' + text_lower
        
        topic_scores = {}
        for topic, keywords in topics.items():
            score = sum(1 for keyword in keywords if keyword in combined_text)
            topic_scores[topic] = score
        
        # Assign to topic with highest score, or 'other' if no matches
        best_topic = max(topic_scores, key=topic_scores.get)
        if topic_scores[best_topic] > 0:
            topic_assignments.append(best_topic)
        else:
            topic_assignments.append('other')
    
    return topic_assignments

# Classify posts by topic
posts_sample['topic'] = classify_posts_by_topic(posts_sample)

print("Topic distribution in our sample:")
print(posts_sample['topic'].value_counts())

# TODO: For each topic, extract key words and analyze on semantic axes
# Compare how different topics are positioned

print("\nTopic-specific semantic analysis:")
print("=" * 50)

for topic in ['politics', 'social', 'ethics']:
    if topic in posts_sample['topic'].values:
        topic_posts = posts_sample[posts_sample['topic'] == topic]
        print(f"\n{topic.upper()} posts ({len(topic_posts)} posts):")
        
        # Extract words for this topic
        topic_texts = topic_posts['title'].fillna('') + ' ' + topic_posts['selftext'].fillna('')
        topic_words = extract_meaningful_words(topic_texts, min_freq=1, max_words=10)
        
        print(f"Key words: {list(topic_words.keys())[:8]}")
        
        # Analyze on first semantic axis
        if semantic_axes:
            first_axis_name = list(semantic_axes.keys())[0]
            first_axis = semantic_axes[first_axis_name]
            
            # Use a separate name so the `projections` dict from Exercise 2.2 is not overwritten
            topic_projections = []
            for word in list(topic_words.keys())[:8]:
                if word in model:
                    proj = project_word_onto_axis(word, first_axis, model)
                    if proj is not None:
                        topic_projections.append(proj)
            
            if topic_projections:
                avg_projection = np.mean(topic_projections)
                print(f"Average projection on {first_axis_name}: {avg_projection:.3f}")

---

## Part 4: Visualization and Interpretation (15 minutes)

### Exercise 4.1: Create Comprehensive Visualizations

**Task:** Create a visualization comparing multiple topics on multiple semantic dimensions.
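Each topic's position in the plot is the mean projection of its key words on two axes. As a sketch of the underlying arithmetic, stacking the two axes into a matrix gives both coordinates in a single matrix product. The vectors, axes, and variable names below are hypothetical values for illustration.

```python
import numpy as np

# Two hypothetical unit-length axes in a 3-d embedding space
axes = np.array([
    [1.0, 0.0, 0.0],   # axis 1 (x coordinate)
    [0.0, 1.0, 0.0],   # axis 2 (y coordinate)
])

# Hypothetical key-word vectors for one topic, one row per word
word_vectors = np.array([
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 0.0],
])

# Normalize each row to unit length, then project onto both axes at once
unit_vectors = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
coords = unit_vectors @ axes.T          # shape: (n_words, 2)

# The topic's plot position is the mean over its words
topic_x, topic_y = coords.mean(axis=0)
print(f"x={topic_x:.3f}, y={topic_y:.3f}")
```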

In [None]:
def create_topic_comparison_plot(posts_df, semantic_axes_dict, model):
    """
    Create a scatter plot comparing topics across two semantic dimensions.
    """
    if len(semantic_axes_dict) < 2:
        print("Need at least 2 semantic axes for comparison plot")
        return
    
    # Get first two axes
    axis_names = list(semantic_axes_dict.keys())[:2]
    axis1_name, axis2_name = axis_names
    axis1, axis2 = semantic_axes_dict[axis1_name], semantic_axes_dict[axis2_name]
    
    # Analyze each topic
    topic_positions = {}
    
    for topic in posts_df['topic'].unique():
        if topic == 'other':
            continue
            
        topic_posts = posts_df[posts_df['topic'] == topic]
        if len(topic_posts) < 5:  # Skip topics with too few posts
            continue
            
        # Extract key words for this topic
        topic_texts = topic_posts['title'].fillna('') + ' ' + topic_posts['selftext'].fillna('')
        topic_words = extract_meaningful_words(topic_texts, min_freq=1, max_words=15)
        
        # Project onto both axes
        axis1_projections = []
        axis2_projections = []
        
        for word in list(topic_words.keys())[:10]:
            if word in model:
                proj1 = project_word_onto_axis(word, axis1, model)
                proj2 = project_word_onto_axis(word, axis2, model)
                
                if proj1 is not None and proj2 is not None:
                    axis1_projections.append(proj1)
                    axis2_projections.append(proj2)
        
        if axis1_projections and axis2_projections:
            topic_positions[topic] = {
                'x': np.mean(axis1_projections),
                'y': np.mean(axis2_projections),
                'count': len(topic_posts)
            }
    
    # Create plot
    if topic_positions:
        plt.figure(figsize=(10, 8))
        
        for topic, pos in topic_positions.items():
            plt.scatter(pos['x'], pos['y'], s=pos['count']*3, alpha=0.7, label=f"{topic} ({pos['count']})")
            plt.annotate(topic.title(), (pos['x'], pos['y']), 
                        xytext=(5, 5), textcoords='offset points', fontsize=12)
        
        plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
        plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
        
        plt.xlabel(f'{axis1_name.replace("_", " ").title()}')
        plt.ylabel(f'{axis2_name.replace("_", " ").title()}')
        plt.title('CMV Topics Positioned on Semantic Axes')
        plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
        
        plt.tight_layout()
        plt.show()
    
    return topic_positions

# Create the comparison plot
if len(semantic_axes) >= 2:
    topic_positions = create_topic_comparison_plot(posts_sample, semantic_axes, model)
else:
    print("Create at least 2 semantic axes to see topic comparison plot")

---

## Part 5: Group Discussion and Reflection (10 minutes)

### Discussion Questions

Work with your group to discuss these questions. Be prepared to share your insights with the class:

1. **Semantic Patterns**: What were the most interesting patterns you discovered in how words cluster in semantic space?

2. **Bias and Assumptions**: What biases did you notice in the word embeddings? How might these affect analysis of social discourse?

3. **Topic Differences**: How do different CMV topics (politics, social, ethics) differ in their semantic positioning?

4. **Methodological Insights**: What are the strengths and limitations of using word embeddings for analyzing social discourse?

5. **Applications**: How could these techniques be useful for social science research?

### Your Reflections

**TODO: Write your group's key insights here:**

1. Most interesting semantic pattern we found:
   - *Your answer*

2. Most surprising bias or assumption in the embeddings:
   - *Your answer*

3. Biggest difference between topic areas:
   - *Your answer*

4. One potential application for social science research:
   - *Your answer*

5. Biggest limitation of this approach:
   - *Your answer*

---

## Optional Extensions (If Time Permits)

If your group finishes early, try these additional analyses:

In [None]:
# Extension 1: Temporal analysis - do word usage patterns change over time?
# (if your data has timestamps)

# Extension 2: Compare high vs low engagement posts
# Do posts that get more comments use different semantic patterns?

# Extension 3: Create your own custom semantic axes
# Based on what you learned about CMV discourse

# Extension 4: Analyze comment language vs post language
# Do commenters use different semantic patterns than original posters?

# YOUR EXTENSION CODE HERE


---

## Summary

In this lab, you've learned to:
- Apply word embeddings to real social text data
- Create semantic axes for analyzing social and political concepts
- Visualize and interpret word relationships in social discourse
- Compare different discussion topics using embedding techniques
- Identify biases and limitations in embedding-based analysis

These techniques are powerful tools for computational social science, but remember to always consider their limitations and potential biases when interpreting results.

**Great work!** 🎉