# Lab 8: NLP Analysis

**COMPSS 211: Advanced Computing I**  
**Time:** 90 minutes  
**Due:** End of lab session

---

## Lab Overview

In this lab, you'll apply the NLP techniques learned in today's lesson to analyze your own dataset.

### Learning Objectives
- Apply text preprocessing pipelines 
- Use NLTK and spaCy for tokenization and text analysis
- Create and analyze Bag-of-Words and TF-IDF representations
- Perform basic topic analysis and visualization
- Build a simple text classifier 

### Deliverables
- Completed Jupyter notebook with all exercises
- Brief reflection (last cell) on what you learned

---

## Setup and Data Loading

First, let's import the necessary libraries and load our dataset.

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
# Import NLP libraries
import nltk
from nltk.tokenize import word_tokenize
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [5]:
# Download required NLTK data (run once)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK releases look for 'punkt_tab' in word_tokenize
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

True

In [6]:
# Load spaCy model
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # spaCy raises OSError when the model is not installed; download it, then retry
    !python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')

### Load the Dataset

In [None]:
import os

df = ...

Downloading dataset...


Downloading...
From: https://drive.google.com/uc?id=1e8XhmSwr81BuMs1hH1bKm4LaHh7DGCyr
To: /Users/tomvannuenen/Library/CloudStorage/Dropbox/GitHub/DEV/COMPSS-211/data/changemyview_lg.csv
100%|██████████| 21.8M/21.8M [00:00<00:00, 41.5MB/s]


In [None]:
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nData types:")
print(df.dtypes)

Dataset shape: (10003, 33)

Columns: ['secure_media', 'thumbnail', 'over_18', 'stickied', 'media', 'ups', 'report_reasons', 'permalink', 'id', 'is_self', 'edited', 'domain', 'author', 'user_reports', 'selftext_html', 'score', 'gilded', 'num_comments', 'mod_reports', 'subreddit_id', 'url', 'retrieved_on', 'selftext', 'author_flair_css_class', 'downs', 'created_utc', 'author_flair_text', 'link_flair_css_class', 'link_flair_text', 'distinguished', 'banned_by', 'title', 'subreddit']

Data types:
secure_media               object
thumbnail                  object
over_18                    object
stickied                   object
media                     float64
ups                       float64
report_reasons             object
permalink                  object
id                         object
is_self                    object
edited                     object
domain                     object
author                     object
user_reports               object
selftext_html              

In [None]:
text_col = 'PICK_YOUR_TEXT_COLUMN_NAME_HERE'  # Replace with the name of your text column (a string, used later as df[text_col])

---

## Part 1: Text Preprocessing Pipeline

### Create a Comprehensive Preprocessing Function

Build a preprocessing pipeline that handles the specific challenges of your text data.

In [11]:
def preprocess_text(text):
    """
    Preprocess text.
    
    Steps to implement:
    1. Convert to lowercase
    2. Remove URLs (http/https links)
    3. Remove subreddit mentions (r/subreddit)
    4. Remove user mentions (u/username or /u/username)
    5. Replace numbers with 'NUM' token
    6. Remove extra whitespace
    7. Remove special characters but keep apostrophes
    
    Args:
        text (str): Raw Reddit post text
    
    Returns:
        str: Preprocessed text
    """

    # TODO: implement the steps above, storing the result in `cleaned_text`
    return cleaned_text
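
One possible way to implement the steps listed in the docstring is sketched below. The name `preprocess_text_sketch` and the exact regular expressions are assumptions, not the official lab solution. Whitespace is collapsed last (rather than at step 6) so that gaps left by removed characters are cleaned up as well.

```python
import re

def preprocess_text_sketch(text):
    """Illustrative implementation of the seven preprocessing steps."""
    if not isinstance(text, str):          # guard against NaN / missing posts
        return ""
    cleaned = text.lower()                                   # 1. lowercase
    cleaned = re.sub(r"https?://\S+", " ", cleaned)          # 2. URLs
    cleaned = re.sub(r"(?<!\w)/?r/\w+", " ", cleaned)        # 3. r/subreddit
    cleaned = re.sub(r"(?<!\w)/?u/[\w-]+", " ", cleaned)     # 4. u/username
    cleaned = re.sub(r"\d+", " NUM ", cleaned)               # 5. numbers -> NUM
    cleaned = re.sub(r"[^a-zA-Z'\s]", " ", cleaned)          # 7. keep letters and apostrophes
    cleaned = re.sub(r"\s+", " ", cleaned).strip()           # 6. collapse whitespace
    return cleaned

sample = "Check out r/science! User u/john_doe shared this: https://example.com. It got 1500 upvotes!"
print(preprocess_text_sketch(sample))
```

Because lowercasing happens before the number replacement, the `NUM` token stays uppercase and is easy to spot in later frequency counts.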

In [None]:
# Test your preprocessing function
test_text = "Check out r/science! User u/john_doe shared this: https://example.com. It got 1500 upvotes!"
print(f"Original: {test_text}")
print(f"Processed: {preprocess_text(test_text)}")

In [None]:
# Apply preprocessing to your data
df['processed_text'] = df[text_col].apply(preprocess_text)

# Remove empty processed texts
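
A minimal sketch of the filtering step, shown on a toy DataFrame so it can run on its own. The name `demo_df` and its contents are hypothetical; in the lab you would apply the same mask to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df after preprocessing
demo_df = pd.DataFrame({"processed_text": ["some text", "", "   ", None, "more text"]})

# Treat missing values as empty strings, then keep rows with non-blank text
mask = demo_df["processed_text"].fillna("").str.strip().ne("")
demo_df = demo_df[mask].reset_index(drop=True)

print(len(demo_df))
```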


### Compare Tokenization Methods

Compare how NLTK and spaCy tokenize your posts.

In [None]:

def tokenizer(text):
    """
    Compare NLTK and spaCy tokenization.
    
    Args:
        text (str): Input text
    
    Returns:
        dict: Dictionary with both tokenization results
    """
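
The comparison could look roughly like the sketch below. To run without downloading any models or data, it assumes a blank spaCy pipeline (`spacy.blank("en")`, tokenizer only) and NLTK's `TreebankWordTokenizer`; in the lab you would more likely use `word_tokenize` and the `nlp` object loaded during setup. The function name `compare_tokenizers` is hypothetical.

```python
import spacy
from nltk.tokenize import TreebankWordTokenizer

# Assumption: tokenizer-only components, so no model or corpus download is needed
_spacy_tokenizer = spacy.blank("en")
_nltk_tokenizer = TreebankWordTokenizer()

def compare_tokenizers(text):
    """Return the tokens produced by each library for the same text."""
    nltk_tokens = _nltk_tokenizer.tokenize(text)
    spacy_tokens = [token.text for token in _spacy_tokenizer(text)]
    return {"nltk": nltk_tokens, "spacy": spacy_tokens}

result = compare_tokenizers("I don't think that's right.")
for name, tokens in result.items():
    print(f"{name:>5} ({len(tokens)} tokens): {tokens}")
```

Contractions, URLs, and punctuation are the places where the two libraries most often disagree, so those are good things to look for in your own posts.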
    

In [None]:
# Compare tokenization on a sample post
sample_post = df['processed_text'].iloc[0]



---

## Part 2: Word Frequency and N-gram Analysis (20 minutes)

### Analyze Word Frequencies

In [None]:
from nltk.corpus import stopwords

def get_word_frequencies(texts, remove_stopwords=True, top_n=10):
    """
    Get word frequencies from a list of texts.
    
    Args:
        texts (list): List of text strings
        remove_stopwords (bool): Whether to remove stopwords
        top_n (int): Number of top words to return
    
    Returns:
        list: List of (word, frequency) tuples
    """
    
    # Get English stopwords
    stop_words = set(stopwords.words('english')) if remove_stopwords else set()
    
    # Tokenize all texts and count frequencies

    return None  # TODO: return the top_n (word, frequency) tuples
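
As a rough illustration of the counting step, the sketch below uses only the standard library. The regex tokenizer and the tiny `DEMO_STOPWORDS` set are assumptions that stand in for `word_tokenize` and `stopwords.words('english')`.

```python
import re
from collections import Counter

# Hypothetical mini stopword list, standing in for NLTK's English stopwords
DEMO_STOPWORDS = {"the", "a", "is", "and", "of", "to", "in", "it"}

def word_frequencies_sketch(texts, stop_words=DEMO_STOPWORDS, top_n=10):
    """Count non-stopword tokens across all texts and return the most common."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(top_n)

demo_texts = ["The cat sat in the hat", "A cat is a cat"]
demo_freq = word_frequencies_sketch(demo_texts, top_n=2)
print(demo_freq)
```

`Counter.most_common` already returns a list of `(word, frequency)` tuples, which is the shape the plotting function in the next exercise expects.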

In [None]:
# Print your word frequencies (store the result in `word_freq`; the plotting cell below uses it)


### Visualize Word Frequencies

In [None]:
def plot_word_frequencies(word_freq_list, title, color='steelblue'):
    """
    Create a bar plot of word frequencies.
    
    Args:
        word_freq_list (list): List of (word, frequency) tuples
        title (str): Plot title
        color (str): Bar color
    """
    
    words, frequencies = zip(*word_freq_list)
    
    plt.figure(figsize=(10, 6))
    plt.barh(range(len(words)), frequencies, color=color)
    plt.yticks(range(len(words)), words)
    plt.gca().invert_yaxis()
    plt.title(title)
    plt.xlabel('Frequency')
    plt.tight_layout()
    plt.show()

In [None]:
# Plot word frequencies
plot_word_frequencies(word_freq, 'Top 15 Words', color='coral')

### Create Word Clouds

In [None]:
def create_wordcloud(texts, title, max_words=50):
    """
    Create a word cloud from texts.
    
    Args:
        texts (list): List of text strings
        title (str): Plot title
        max_words (int): Maximum words in cloud
    """
    
    # Combine all texts
    combined_text = ' '.join([text for text in texts if pd.notna(text)])
    
    # Create WordCloud
    wordcloud = WordCloud(width=800, height=400, 
                          background_color='white',
                          stopwords=set(stopwords.words('english')),
                          max_words=max_words).generate(combined_text)
    
    # Plot
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=16)
    plt.tight_layout()
    plt.show()

In [None]:
# Create word cloud for the dataset
# Sample if dataset is large to avoid memory issues


---

## Part 3: TF-IDF Analysis

### Create TF-IDF Representations

In [None]:
def create_tfidf_features(texts, max_features=100, ngram_range=(1, 2), min_df=2, max_df=0.95):
    """
    Create TF-IDF features from texts.
    
    Args:
        texts (list): List of text strings
        max_features (int): Maximum number of features
        ngram_range (tuple): Range of n-grams to consider
        min_df (int): Minimum document frequency
        max_df (float): Maximum document frequency
    
    Returns:
        tuple: (TF-IDF matrix, vectorizer)
    """
    
    # Create and fit TF-IDF vectorizer
    vectorizer = TfidfVectorizer(
        max_features=max_features,
        ngram_range=ngram_range,
        stop_words='english',
        min_df=min_df,
        max_df=max_df,
        lowercase=True,
        token_pattern=r'\b[a-zA-Z]{2,}\b'  # Only words with 2+ letters
    )
    
    tfidf_matrix = vectorizer.fit_transform(texts)
    
    return tfidf_matrix, vectorizer
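
To make the scores produced by `TfidfVectorizer` less opaque, the sketch below computes TF-IDF by hand using scikit-learn's default weighting: `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The whitespace tokenization and the function name `tfidf_by_hand` are simplifying assumptions; the vectorizer above uses its own `token_pattern`, stopword list, and document-frequency filters.

```python
import math
from collections import Counter

def tfidf_by_hand(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {t: term_counts[t] * idf[t] for t in term_counts}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_by_hand(["cats chase mice", "dogs chase cats", "mice fear cats"])
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In this toy corpus "cats" appears in every document, so it receives the lowest weight in each row, which is the behavior TF-IDF is designed to produce.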

In [None]:
# Create TF-IDF features
# Use a sample if dataset is large
sample_size = min(2000, len(df))
sample_data = df.sample(n=sample_size, random_state=42).reset_index(drop=True)

tfidf_matrix, vectorizer = create_tfidf_features(sample_data['processed_text'].tolist())

# Convert to DataFrame for easier analysis
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),  # dense ndarray (todense() returns the legacy np.matrix type)
    columns=vectorizer.get_feature_names_out(),
    index=sample_data.index
)

print(f"TF-IDF matrix shape: {tfidf_df.shape}")
print(f"\nSample features: {list(tfidf_df.columns[:10])}")

### Find Most Important Terms

In [None]:
def get_top_tfidf_terms(tfidf_df, top_n=15):
    """
    Get terms with highest mean TF-IDF scores.
    
    Args:
        tfidf_df (DataFrame): TF-IDF DataFrame
        top_n (int): Number of top terms to return
    
    Returns:
        Series: Top terms with their mean TF-IDF scores
    """
    
    # Calculate mean TF-IDF scores across all documents
    mean_tfidf = tfidf_df.mean(axis=0)
    
    return mean_tfidf.nlargest(top_n)

In [None]:
# Get top TF-IDF terms
top_terms = get_top_tfidf_terms(tfidf_df, top_n=15)

# Visualize top terms
plt.figure(figsize=(10, 6))
top_terms.sort_values().plot(kind='barh', color='darkgreen')
plt.title('Top 15 Terms by Mean TF-IDF Score')
plt.xlabel('Mean TF-IDF Score')
plt.tight_layout()
plt.show()

print("Top terms by TF-IDF:")
for term, score in top_terms.items():
    print(f"  {term}: {score:.4f}")

---

## Part 4: Text Classification

### Create Labels for Classification

Since we have Reddit posts, let's create a classification task based on post characteristics.

In [None]:
# Create a binary classification task based on post length
# Long posts vs short posts (this is just an example - you could use other criteria)
sample_data['text_length'] = sample_data['processed_text'].str.len()
median_length = sample_data['text_length'].median()
sample_data['post_type'] = (sample_data['text_length'] > median_length).map({True: 'long_post', False: 'short_post'})

print(f"Median text length: {median_length:.0f} characters")
print(f"\nPost type distribution:")
print(sample_data['post_type'].value_counts())

### Build a Text Classifier

In [None]:
def build_classifier(X, y, test_size=0.2, random_state=42):
    """
    Build and evaluate a text classifier.
    
    Args:
        X: Feature matrix
        y: Labels
        test_size (float): Proportion of test set
        random_state (int): Random seed
    
    Returns:
        tuple: (trained model, X_test, y_test, predictions)
    """
    
    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    
    # Train a logistic regression classifier
    model = LogisticRegression(max_iter=1000, random_state=random_state)
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate accuracy
    train_accuracy = model.score(X_train, y_train)
    test_accuracy = model.score(X_test, y_test)
    
    print(f"Training Accuracy: {train_accuracy:.3f}")
    print(f"Test Accuracy: {test_accuracy:.3f}")
    
    return model, X_test, y_test, y_pred

In [None]:
# Build and evaluate the classifier
model, X_test, y_test, y_pred = build_classifier(
    tfidf_matrix, 
    sample_data['post_type']
)

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
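
The `confusion_matrix` function imported during setup can complement the classification report by showing which class is mistaken for which. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`) purely for illustration; in the lab you would pass `y_test` and `y_pred` instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true_demo = ["long_post", "long_post", "short_post", "short_post", "short_post"]
y_pred_demo = ["long_post", "short_post", "short_post", "short_post", "long_post"]

label_order = ["long_post", "short_post"]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=label_order)

# Rows are true labels, columns are predicted labels, both in label_order
print(cm)
```

Passing `labels=` explicitly fixes the row and column order, which makes the matrix easier to read and to plot (for example with `sns.heatmap(cm, annot=True)`).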

### Analyze Feature Importance

In [None]:
def get_feature_importance(model, vectorizer, top_n=10):
    """
    Get the most important features for classification.
    
    Args:
        model: Trained classifier
        vectorizer: TF-IDF vectorizer
        top_n (int): Number of top features to return
    
    Returns:
        dict: Dictionary with positive and negative features
    """
    
    # Get feature names and coefficients
    feature_names = vectorizer.get_feature_names_out()
    coef = model.coef_[0]
    
    # Create feature importance dataframe
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'coefficient': coef
    }).sort_values('coefficient', ascending=False)
    
    # Get top positive and negative features
    positive_features = feature_importance.head(top_n)[['feature', 'coefficient']].values.tolist()
    negative_features = feature_importance.tail(top_n)[['feature', 'coefficient']].values.tolist()
    
    return {
        'positive': positive_features,
        'negative': negative_features
    }
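
When reading these coefficients, keep in mind that a binary `LogisticRegression` orients `coef_[0]` toward `model.classes_[1]`, and that string labels are stored in sorted order. The toy example below (hypothetical data, one feature) is a small check of that behavior.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): a single feature that is large for short posts
X_demo = np.array([[0.0], [0.1], [0.9], [1.0]])
y_demo = np.array(["long_post", "long_post", "short_post", "short_post"])

clf = LogisticRegression().fit(X_demo, y_demo)

print(clf.classes_)   # labels in sorted order
print(clf.coef_[0])   # positive: the feature pushes toward clf.classes_[1]
```

With the labels used in this lab, `classes_[1]` is `'short_post'`, so positive coefficients point toward short posts and negative ones toward long posts.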

In [None]:
# Get and visualize feature importance
important_features = get_feature_importance(model, vectorizer)

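# In a binary model, positive coefficients push toward model.classes_[1]
# and negative ones toward model.classes_[0] (labels are sorted alphabetically,
# so classes_[1] is 'short_post' here).
print(f"Features most indicative of {model.classes_[1]}:")
for feature, score in important_features['positive']:
    print(f"  {feature}: {score:.3f}")

print(f"\nFeatures most indicative of {model.classes_[0]}:")
for feature, score in important_features['negative']:
    print(f"  {feature}: {score:.3f}")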

### Test the Classifier on New Text

In [None]:
def predict_post_type(text, model, vectorizer, preprocess_func):
    """
    Predict post type for new text.
    
    Args:
        text (str): Input text
        model: Trained classifier
        vectorizer: TF-IDF vectorizer
        preprocess_func: Preprocessing function
    
    Returns:
        tuple: (predicted label, prediction probabilities)
    """
    
    # Preprocess text
    processed_text = preprocess_func(text)
    
    # Transform to TF-IDF features
    features = vectorizer.transform([processed_text])
    
    # Predict
    prediction = model.predict(features)
    probabilities = model.predict_proba(features)
    
    return prediction[0], probabilities[0]

In [None]:
# Test on new posts
test_posts = [
    "AITA for not going?",
    "So this is a long story but bear with me. It all started when I was in college and my roommate asked me to help with a project. I said yes initially but then realized it would take the entire weekend. The project was for a class I wasn't even in, and I had my own assignments due. But here's where it gets complicated...",
    "My friend is mad at me.",
    "I need some perspective on this situation. Last month, my sister planned a surprise party for our mother's 60th birthday. She asked everyone to contribute $100 for the venue and catering. I thought this was reasonable at first, but then I found out she chose the most expensive restaurant in town without consulting anyone."
]

for post in test_posts:
    pred, probs = predict_post_type(post, model, vectorizer, preprocess_text)
    print(f"\nPost: '{post[:50]}...'")
    print(f"Predicted: {pred}")
    print(f"Confidence: {max(probs):.2%}")

---

## Reflection and Submission

Complete the reflection below.

### Lab Reflection

**1. What was the most interesting finding from your analysis of the posts in your dataset?**

_Your answer here_

**2. Which preprocessing step had the biggest impact on your results?**

_Your answer here_

**3. How might you extend this analysis for a real research project about online discourse or moral judgment?**

_Your answer here_

**4. What challenges did you encounter with the real Reddit data and how did you solve them?**

_Your answer here_

**5. Did you use any AI assistance for this lab? If so, describe how and include your prompts:**

_Your answer here_

---

## 🌟 Stretch Goals

If you finish early, try these additional challenges:

So far we’ve seen how tokenizers break text into tokens. But on Hugging Face, you can also find fine-tuned transformer models that build on tokenization to provide linguistic annotations. 

These annotations (like POS tags and NER labels) are not produced by the tokenizer itself — they come from models that were trained on labeled data for those tasks.

1. POS tagging → Find a POS tagging model on Hugging Face (hint: try searching for “POS tagging” or use vblagoje/bert-english-uncased-finetuned-pos) and run it on your text.
2. NER → Find a NER model (hint: try dslim/bert-base-NER) and run it on your text.
3. Compare the outputs:
    - How does POS tagging label the words?
    - What entities does the NER model detect?
    - How do these results differ from spaCy’s POS and NER?

**💡 Reflection:** What do these annotations add on top of tokenization? How do they help structure the text for further analysis?

In [None]:
# Space for stretch goals
