# Framing Poverty in Everyday Conversations: Insights from Reddit

**Author:** Darius Nader \
**Date:** 22.08.2025  

---

#### Purpose of this Notebook

This notebook contains the code for the term paper *“Framing Poverty in Everyday Conversations: Insights from Reddit.”* It offers a step-by-step workflow for replicating the main results, covering data preparation, cleaning, topic modeling (LDA), framing classification, sentiment analysis, and visualization.  

---

#### Important information

- Place the dataset (`reddit_posts.jsonl`) inside a dedicated folder
- Update the working directory at the beginning of the notebook
- Make sure all required Python packages are installed before running the notebook
- The notebook has been designed so that it works when executed from top to bottom

In [None]:
# Install required packages if not already installed
!pip install pandas regex nltk gensim scikit-learn numpy seaborn matplotlib

---
## 1. Importing Data and Pre-Processing

We begin by loading the Reddit dataset that underpins the paper’s analysis. The raw data comes in JSON Lines (`.jsonl`) format and contains posts from the `r/poor` subreddit between January 1, 2023 and August 3, 2025.  
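
To make the JSON Lines layout concrete, here is a minimal sketch using only the standard library. The two records are hypothetical stand-ins, not rows from the actual dataset; each line of the file is one self-contained JSON object.

```python
import io
import json

# Two hypothetical posts in JSON Lines form: one JSON object per line.
raw = io.StringIO(
    '{"id": "a1", "title": "Rent went up again", "selftext": "Not sure what to do."}\n'
    '{"id": "b2", "title": "Food bank tips", "selftext": "[removed]"}\n'
)

# Parse each non-empty line independently of the others.
posts = [json.loads(line) for line in raw if line.strip()]

print(len(posts))
print(posts[0]["title"])
```

`pd.read_json(..., lines=True)` performs the same line-by-line parsing and returns the records as a DataFrame.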

In [None]:
# Adjust the working directory to where the dataset was saved
%cd "/Users/dn/poverty-frames-reddit/01_data/raw/"

In [None]:
# Import library
import pandas as pd  

# Load dataset (json lines format, each line = one reddit post)
df = pd.read_json("reddit_posts.jsonl", lines=True)

# Keep only relevant columns
columns_to_keep = [
    "id", # needed for merge
    "title",
    "selftext",
    "created_utc",
    "removed_by_category",
]
df = df[columns_to_keep]

# Quick inspection of structure and first rows
df.info()
df.head(5)

We now have 7,671 posts in our dataset. Many entries are spam-like, deleted, or removed by moderators; these are filtered out in the next step.

Now we prepare the dataset for analysis by:
- converting timestamps to datetime  
- excluding posts after July 2025 (outside the study period)  
- removing posts deleted or flagged by moderators/Reddit  
- dropping helper columns that are no longer needed


In [None]:
# Convert Unix timestamp to datetime
df["datetime"] = pd.to_datetime(df["created_utc"], unit="s")

# Drop posts from August 2025 (the raw data ends on August 3, 2025, so this removes everything after July)
df = df[~((df["datetime"].dt.year == 2025) & (df["datetime"].dt.month == 8))]

# Filter out removed/deleted posts
df = df[~df["selftext"].str.contains("[removed]", regex=False, na=False)]
df = df[~df["removed_by_category"].str.contains("moderator|reddit", na=False)]

# Drop columns that are no longer needed
df = df.drop(columns=["removed_by_category", "created_utc"])

# Quick check
df.info()
df.head(5)

The sample now contains 5,255 posts published between January 1, 2023 and July 30, 2025. 
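
As a side note, the date restriction can also be written against an explicit cutoff instead of a year/month check. The sketch below uses a hypothetical toy frame and a hypothetical `cutoff` variable; it is meant as an illustration under these assumptions, not as the filter used for the paper.

```python
import pandas as pd

# Hypothetical Unix timestamps (seconds, UTC) around the end of the study period.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "created_utc": [1672531200, 1753919999, 1754006400],
})
toy["datetime"] = pd.to_datetime(toy["created_utc"], unit="s")

# Keep everything strictly before 1 August 2025.
cutoff = pd.Timestamp("2025-08-01")
kept = toy[toy["datetime"] < cutoff]

print(kept[["id", "datetime"]])
```

A cutoff of this kind keeps working if the raw export is later extended beyond August 2025.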

In [None]:
# Save processed dataset
output_path = "/Users/dn/poverty-frames-reddit/01_data/processed/processed_posts.jsonl"
df.to_json(output_path, orient="records", lines=True, force_ascii=False)

---
## 2. Descriptive Statistics of Sample

Before processing the text further, we take a quick look at the descriptive statistics of the dataset.

In [None]:
# Load processed dataset
processed_path = "/Users/dn/poverty-frames-reddit/01_data/processed/processed_posts.jsonl"
df_processed = pd.read_json(processed_path, lines=True)

# Create simple summary table

print("Begin:", df_processed["datetime"].min().date())
print("End:", df_processed["datetime"].max().date())

print("---")

# Count threads per year
print(df_processed["datetime"].dt.year.value_counts().sort_index())

The cleaned dataset contains 5,255 posts spanning January 2023 to July 2025: 1,486 posts in 2023, 2,551 in 2024, and 1,218 in 2025 (through July). 
This confirms that the cleaned sample covers the intended study period and provides enough observations for further analysis.

---
## 3. Processing Text for the Data Analysis

To prepare the texts for topic modeling and sentiment analysis, we first combine the `title` and `selftext` fields into a single string. A quick sample of posts shows that the raw data often contains promotional spam, referral links, HTML artifacts, or deleted content, all of which must be cleaned before the analysis.


In [None]:
# Combine title and selftext into a single field (missing values are treated as empty strings)
df_processed["combined_text"] = df_processed["title"].fillna("") + " " + df_processed["selftext"].fillna("")

# Preview three random examples to illustrate raw content
for r, row in df_processed.sample(3, random_state=123)[["combined_text"]].iterrows():
    print(row["combined_text"])
    print("---")

Now we need to clean the combined text more thoroughly. The raw posts still contain a number of artifacts that would interfere with the data analysis, such as:  

- Referral codes and promotional links  
- URLs and email addresses  
- Artifacts from deleted or removed posts  
- HTML tags and escape sequences  
- Spam posts identified during manual inspection  

In [None]:
# Remove links
df_processed["combined_text"] = df_processed["combined_text"].str.replace(r"https?://\S+", "", regex=True)

# Clean out unwanted patterns (deleted markers, HTML escapes/tags, email addresses)
df_processed["combined_text"] = df_processed["combined_text"].str.replace("[deleted]", "", regex=False)   # Remove [deleted] markers
df_processed["combined_text"] = df_processed["combined_text"].str.replace(r"&[^;]+;", "", regex=True)     # Remove HTML escape sequences
df_processed["combined_text"] = df_processed["combined_text"].str.replace(r"</?\w[^>]*>", "", regex=True) # Remove HTML tags
df_processed["combined_text"] = df_processed["combined_text"].str.replace(r"[\w\.-]+@[\w\.-]+\.\w+", "", regex=True)  # Remove email addresses

# Filter out spam identified by manual inspection (na=False keeps the mask boolean if a body is missing)
df_processed = df_processed[~df_processed["selftext"].str.contains("TEMU code glitch", regex=False, na=False)]

# Check updated dataset
df_processed.info()

After cleaning, the dataset contains 5,244 posts with a `combined_text` column that will be the base for the data analysis.
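
The cleaning rules above can be sketched as a single standard-library function, which makes them easy to try on individual strings. The helper name `clean_text` and the sample sentence are hypothetical. One deliberate difference is labeled in the code: the entity pattern here only matches well-formed HTML entities, whereas a broad pattern such as `&[^;]+;` also swallows ordinary text between an ampersand and the next semicolon.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"</?\w[^>]*>")
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")
# Assumption: only well-formed entities (named, decimal, hex) should be removed.
ENTITY_RE = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def clean_text(text: str) -> str:
    """Apply the cleaning steps in the same order as the notebook cell."""
    text = URL_RE.sub("", text)
    text = text.replace("[deleted]", "")
    text = ENTITY_RE.sub("", text)
    text = TAG_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "AT&T raised my bill; see https://example.com &amp; <b>help</b> me@mail.com"
print(clean_text(sample))

# For comparison: the broad entity pattern removes part of the sentence.
print(re.sub(r"&[^;]+;", "", "AT&T raised my bill; thanks"))
```

Whether the stricter pattern changes the reported counts has not been checked here; it is shown only to illustrate the trade-off.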

---
## 4. LDA Topic Modeling

We create an LDA-specific copy of the dataset. For topic modeling, we replace punctuation and special characters with spaces and lowercase the text, but keep hyphens so that hyphenated terms stay intact for tokenization.
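
To see what this normalization does to a single post, the following sketch applies the same three steps with the standard library. The function name `normalize_for_lda` and the example sentence are hypothetical.

```python
import re


def normalize_for_lda(text: str) -> str:
    # Replace everything except letters, digits, whitespace, and hyphens with a space.
    text = re.sub(r"[^A-Za-z0-9\s-]", " ", text)
    # Lowercase, then collapse the whitespace introduced by the replacement.
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


print(normalize_for_lda("I'm on a low-income plan, $40/month!"))
# i m on a low-income plan 40 month
```

Note that the apostrophe in "I'm" becomes a space, so the contraction is split into "i" and "m", while "low-income" survives as one token.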

In [None]:
# Separate copy for LDA-only preprocessing
df_lda = df_processed.copy()

# Remove punctuation/special chars but keep hyphens; also strip apostrophes so "I'm" -> "I m"
df_lda["combined_text"] = df_lda["combined_text"].str.replace(r"[^A-Za-z0-9\s-]", " ", regex=True)  # Normalize punctuation

# Lowercase text
df_lda["combined_text"] = df_lda["combined_text"].str.lower()

# Collapse multiple spaces introduced by replacements
df_lda["combined_text"] = df_lda["combined_text"].str.replace(r"\s+", " ", regex=True).str.strip()

df_lda.head()

### 4.1 Preparing DTM: Tokenization & Stopwords

Next, we prepare the text for the Document-Term Matrix (DTM) and define the stopword list. The steps are: 
- Sentence and word tokenization  
- Filtering out non-alphabetic tokens  
- Defining an extended stopword list  

In [None]:
import regex
import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.corpus import stopwords

# Download required NLTK resources (skipped if already available)
nltk.download("stopwords", quiet=True)  # English stopword list
nltk.download("punkt", quiet=True)      # Sentence tokenizer data
nltk.download("punkt_tab", quiet=True)  # Sentence tokenizer data expected by newer NLTK releases

# Custom tokenizer: whitespace + keep only tokens containing letters
class MyTokenizer:
    def __init__(self):
        self.ws = WhitespaceTokenizer()
        self.letter_pat = r"\p{letter}" # Matches letter

    def tokenize(self, text: str):
        result = []
        for sent in nltk.sent_tokenize(text): # Sentence split
            tokens = self.ws.tokenize(sent) # Whitespace split
            tokens = [t for t in tokens if regex.search(self.letter_pat, t)] # Keeps only tokens with letters
            result.extend(tokens)
        return result

# Use the custom tokenizer
mytokenizer = MyTokenizer()
print(mytokenizer.tokenize(df_lda["combined_text"].iloc[5]))  # Quick test of tokenizing a post

We use NLTK’s English stopword list and extend it with a few additional words that are not included in the default list.  
These additions cover contractions typed without an apostrophe (`"im"`, `"ive"`) as well as common filler words such as `"like"`, `"get"` and `"got"`.

In [None]:
# Print the full NLTK stopword list
print(stopwords.words("english"))

# Extend the list with additional custom stopwords
stopwords_list = ["like", "im", "ive", "get", "got"] + stopwords.words("english")

### 4.2 Generating the Model

We now move on to topic modeling using **Latent Dirichlet Allocation (LDA)**. Steps in this section include:  
- Preparing the text data for LDA by creating a document-term matrix (DTM) with unigrams and bigrams
- Converting the DTM into a format compatible with `gensim`
- Fitting multiple LDA models with different numbers of topics, passes, and iterations to compare performance
- Evaluating the models using perplexity and coherence
- Selecting the most promising models and inspecting the top words per topic

In [None]:
import gensim
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

from sklearn.feature_extraction.text import CountVectorizer

# Prepare text data (use cleaned text from df_lda)
p_reddit = df_lda['combined_text'].astype(str)

# Create DTM with unigrams and bigrams, filter by min_df and apply custom tokenizer and stopwords
cv = CountVectorizer(ngram_range=(1, 2), min_df=0.01, stop_words=stopwords_list, tokenizer=mytokenizer.tokenize)
dtm = cv.fit_transform(p_reddit)
dtm

The DTM contains 5,244 posts and 1,209 unique terms (unigrams and bigrams). Because gensim's LDA implementation expects a corpus of sparse bag-of-words vectors rather than a scipy sparse matrix, we first convert the DTM. The DTM itself does not carry the feature names, so we store them separately and pass them to the LDA model later.
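
The target format is easiest to understand on a tiny example. The sketch below builds a gensim-style bag-of-words corpus by hand from a hypothetical two-document matrix; `toy_dtm`, `toy_vocab`, and the counts are made up for illustration.

```python
# Toy document-term matrix: rows are documents, columns are terms (hypothetical counts).
toy_vocab = ["rent", "food", "job"]
toy_dtm = [
    [2, 0, 1],
    [0, 3, 0],
]

# Bag-of-words corpus: one list of (term_id, count) pairs per document, zero counts omitted.
toy_corpus = [
    [(term_id, count) for term_id, count in enumerate(row) if count > 0]
    for row in toy_dtm
]

# Mapping from term id to term, analogous to the separately stored vocabulary.
toy_id2word = dict(enumerate(toy_vocab))

print(toy_corpus)
print(toy_id2word)
```

Each document is reduced to the ids and counts of the terms it actually contains, and the id-to-term mapping lets the model print readable words.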

In [None]:
# Convert DTM into gensim corpus format
corpus = matutils.Sparse2Corpus(dtm, documents_columns=False)

# Store vocab separately since DTM has no feature names
vocab = dict(enumerate(cv.get_feature_names_out()))

To determine a suitable number of topics and training parameters, we run several LDA models in a grid search. We vary the number of topics (`k`), the number of passes, and the number of iterations, and record both perplexity and coherence scores for comparison. This step will take a while.
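
A note on reading the perplexity column: according to the gensim documentation, `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, and the perplexity estimate is `2 ** (-bound)`. The helper below is a hypothetical convenience function for that conversion; the input values are made up.

```python
def bound_to_perplexity(per_word_bound: float) -> float:
    """Convert a per-word likelihood bound (log base 2) into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)


# A bound closer to zero corresponds to a lower (better) perplexity.
print(bound_to_perplexity(-7.0))  # 128.0
print(bound_to_perplexity(-6.5))
```

When comparing rows of the grid search, a higher (less negative) bound therefore indicates the better fit.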

In [None]:
# Grid search over different topic numbers, passes, and iterations
from gensim.models import LdaModel, CoherenceModel
import pandas as pd

results = []

for k in [5, 10, 15, 20]:  # Number of topics
    for passes in [10, 20, 50]:  # Training passes
        for iterations in [100, 200, 500]:  # Iterations
            m = LdaModel(
                corpus=corpus,
                num_topics=k,
                id2word=vocab,
                random_state=123,
                alpha="auto",
                passes=passes,
                iterations=iterations,
                eval_every=None
            )
            perplexity = m.log_perplexity(corpus)  # Per-word likelihood bound (higher is better); perplexity = 2 ** (-bound)
            coherence = CoherenceModel(
                model=m, corpus=corpus, coherence="u_mass"
            ).get_coherence()
            results.append(dict(
                topics=k,
                passes=passes,
                iterations=iterations,
                perplexity=perplexity,
                coherence=coherence
            ))

results_df = pd.DataFrame(results)
print(results_df)

The model comparison shows that the best-performing setting is 5 topics with 10 passes and 100 iterations (based on our perplexity/coherence results). We now fit the LDA model with these parameters and inspect the top 10 words for each of the 5 topics.


In [None]:
# Fit the model with 5 topics (baseline)
lda = LdaModel(
    corpus, id2word=vocab, num_topics=5, passes=10, iterations=100, random_state=123, alpha="auto"
)

# Show the top 10 words for each of the 5 topics
pd.DataFrame(
    {f"Topic {n}": [w for (w, tw) in words] for (n, words) in lda.show_topics(formatted=False)}
)

The resulting 5-topic solution looks acceptable, but the topics are not very distinct and appear somewhat mixed. To improve interpretability, we also test a 10-topic model. According to the grid search, the highest coherence score was achieved with 10 topics, 20 passes, and 200 iterations. However, 100 iterations already produce meaningful topics with a coherence score that is only 0.004 lower, so the model below uses 100 iterations.

In [None]:
# Fit the model with 10 topics
lda = LdaModel(
    corpus, id2word=vocab, num_topics=10, passes=20, iterations=100, random_state=123, alpha="auto"
)

# Show the top 10 words for each of the 10 topics
pd.DataFrame(
    {f"Topic {n}": [w for (w, tw) in words] for (n, words) in lda.show_topics(formatted=False)}
)


The 10-topic solution looks more distinct and interpretable than the 5-topic model. Next, we extract the topic distributions for each document and combine them with the original metadata. This gives us a new dataset where every post is linked to its topic proportions, which will be useful for labeling topics and connecting them to the different frames.

In [None]:
# Analyze topic distributions across all documents
topics = pd.DataFrame(
    [dict(lda.get_document_topics(doc, minimum_probability=0.0)) for doc in corpus]
)

# Combine post metadata and cleaned text with the topic proportions.
# reset_index aligns the rows positionally with `topics`, which is indexed 0..n-1.
tpd = pd.concat([df_lda.reset_index(drop=True), topics], axis=1)
tpd.head()

### 4.3 Analyzing Topic Modeling Results

To simplify interpretation, we assign each post to its most dominant topic. We also store the corresponding topic score and apply a threshold of `0.4`. If a post does not reach this minimum probability for any topic, it is not assigned to a specific topic. This helps reduce noise from posts that are too mixed across topics.
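
The thresholding rule can be checked on a small, hypothetical topic matrix. The sketch uses pandas' nullable `Int64` dtype so that topic ids stay integers even when some posts receive no topic; this dtype choice is an assumption of the sketch, not something the notebook relies on.

```python
import pandas as pd

# Hypothetical topic proportions for three posts and three topics (rows sum to 1).
toy_topics = pd.DataFrame({
    0: [0.70, 0.34, 0.10],
    1: [0.20, 0.36, 0.30],
    2: [0.10, 0.30, 0.60],
})

top_score = toy_topics.max(axis=1)      # Highest topic probability per post
top_topic = toy_topics.idxmax(axis=1)   # Column label of that probability

# Posts whose best topic stays below the threshold get no topic at all.
top_topic = top_topic.astype("Int64").mask(top_score < 0.4)

print(pd.DataFrame({"top_topic": top_topic, "top_topic_score": top_score}))
```

The second post is spread almost evenly across topics (maximum 0.36), so it ends up without a dominant topic.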

In [None]:
import numpy as np

# Define topic columns (0–9 for the 10-topic model)
topic_cols = list(range(10))

# Assign each post to the topic with the highest probability
tpd["top_topic"] = tpd[topic_cols].idxmax(axis=1)

# Store the corresponding maximum topic score
tpd["top_topic_score"] = tpd[topic_cols].max(axis=1)

# Apply threshold: if the top topic score is below 0.4, set topic to None
tpd.loc[tpd["top_topic_score"] < 0.4, "top_topic"] = None

# Show the distribution of assigned top topics (including None)
print(tpd["top_topic"].value_counts(dropna=False).sort_index())

Now that each post has been assigned a top topic, we manually inspect a random sample of documents for a selected topic. This helps us qualitatively judge whether the model has grouped posts into coherent and meaningful themes.

In [None]:
# Draw a random sample of 20 posts; change the topic number in the filter to inspect another topic
for r, row in tpd.loc[tpd["top_topic"] == 0].sample(20, random_state=123)[["title", "selftext"]].iterrows(): # Topic 0
    print(row["title"])      # Print the post title (unformatted)
    print(row["selftext"])   # Print the post body (unformatted)
    print("---")             # Separator for readability

Based on the manual inspection of topics, we now map them into frames:  
- Topics 1, 9: Individual
- Topics 2, 5, 6: Structural
- Topics 0, 4, 8: Political

Posts that do not clearly belong to any of these categories (Topics 3, 7) or have no dominant topic are labeled *Unassigned*.
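
The mapping can also be expressed as a single lookup table, which keeps the topic-to-frame assignment in one place. This is a sketch on a hypothetical toy Series; `FRAME_BY_TOPIC` is an illustrative name, and the topic numbers are taken from the list above.

```python
import pandas as pd

# Topic-to-frame lookup; topics 3 and 7 are intentionally absent.
FRAME_BY_TOPIC = {
    1: "Individual", 9: "Individual",
    2: "Structural", 5: "Structural", 6: "Structural",
    0: "Political", 4: "Political", 8: "Political",
}

# Hypothetical dominant topics, including a post without a dominant topic (None -> NaN).
toy_top_topic = pd.Series([1, 5, 0, 3, None])

# Anything not in the lookup, including missing values, falls back to "Unassigned".
toy_frame = toy_top_topic.map(lambda t: FRAME_BY_TOPIC.get(t, "Unassigned"))

print(toy_frame.tolist())
```

Applied to the real data, the same expression on `tpd["top_topic"]` should give the same frames as the step-by-step assignment.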

In [None]:
# Assign topics 1 and 9 to the "Individual" frame
tpd.loc[tpd["top_topic"].isin([1, 9]), "frame"] = "Individual"

# Assign topics 2, 5, and 6 to the "Structural" frame
tpd.loc[tpd["top_topic"].isin([2, 5, 6]), "frame"] = "Structural"

# Assign topics 0, 4, and 8 to the "Political" frame
tpd.loc[tpd["top_topic"].isin([0, 4, 8]), "frame"] = "Political"

# Assign all other cases (including None/NaN) to "Unassigned"
tpd.loc[~tpd["top_topic"].isin([1, 9, 2, 4, 5, 6, 0, 8]) | tpd["top_topic"].isna(), "frame"] = "Unassigned"

# Show the distribution of assigned frames
print(tpd["frame"].value_counts(dropna=False).sort_index())

---

## 5. Sentiment Analysis

We now analyze the emotional tone of the posts using VADER, a lexicon-based sentiment analysis tool. Each text is assigned a compound sentiment score between -1 (most negative) and +1 (most positive).  

To better fit the context of this dataset, we adjust some words in the VADER lexicon that would otherwise bias the scores. Words such as *like*, *free*, *care*, *please*, *credit*, and *rich* mostly appear in descriptive contexts or as filler in these posts, so we set their valence to 0.0 (neutral). This way the sentiment scores reflect the tone of the posts more accurately.
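
For intuition on why the score is bounded, VADER's reference implementation, as far as we understand it, squashes the summed word valences with the function `x / sqrt(x² + α)` using `α = 15`. The sketch below reimplements only that normalization step; the input sums are hypothetical.

```python
import math


def normalize(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the interval [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


for raw_sum in (-50.0, -2.0, 0.0, 2.0, 50.0):
    print(raw_sum, round(normalize(raw_sum), 4))
```

Setting a word's valence to 0.0 removes its contribution to the summed valence, so it no longer pushes the compound score in either direction.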

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

df_sent = df_processed.copy()

# Load the VADER lexicon required for sentiment analysis
nltk.download("vader_lexicon", quiet=True)

# Create analyzer
analyzer = SentimentIntensityAnalyzer()

# Neutralize or adjust specific words to better fit the context
custom_words = {
    "like": 0.0,
    "free": 0.0,
    "care": 0.0,
    "please": 0.0,
    "credit": 0.0,
    "rich": 0.0
}
analyzer.lexicon.update(custom_words)

# Calculate sentiment scores for each text
sentiments = []
for text in df_sent["combined_text"]:
    score = analyzer.polarity_scores(str(text))["compound"]
    sentiments.append(score)

# Add scores as new column
df_sent["sentiment"] = sentiments

---
## 6. Results

Now I merge the sentiment scores with the topic and frame assignments, so that it's possible to compare average sentiment across topics and framing.
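
A small sketch of the merge-and-aggregate pattern on hypothetical data follows. The `validate="one_to_one"` argument is an optional safeguard added here for illustration: pandas raises an error if an `id` appears more than once on either side, which would otherwise silently duplicate rows.

```python
import pandas as pd

# Hypothetical sentiment scores and frame assignments keyed by post id.
toy_sent = pd.DataFrame({"id": ["a", "b", "c"], "sentiment": [0.5, -0.2, 0.1]})
toy_frames = pd.DataFrame({
    "id": ["a", "b", "c"],
    "frame": ["Political", "Political", "Structural"],
})

# Left merge keeps every scored post; validate guards against duplicate ids.
merged = toy_sent.merge(toy_frames, on="id", how="left", validate="one_to_one")

# Average sentiment per frame.
frame_means = merged.groupby("frame")["sentiment"].mean()
print(frame_means)
```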

In [None]:
merged_df = df_sent.merge(
    tpd[["id", "top_topic", "top_topic_score", "frame"]],
    on="id",
    how="left"
)

merged_df.info()

Mean compound sentiment scores per topic and per frame:

In [None]:
print(merged_df.groupby("top_topic")["sentiment"].mean())
print("--------------------")
print(merged_df.groupby("frame")["sentiment"].mean())

---
## 7. Figures

The next step is to visualize the results to better understand differences in sentiment across topics and frames.  
Two bar plots are created:  

1. **Average sentiment per topic**, with colors indicating the assigned frame. Horizontal lines mark neutral sentiment (0) and small thresholds (+0.05 and -0.05) to highlight deviations.  

2. **Average sentiment per frame**, ordered by Political, Individual, and Structural. The same horizontal lines help interpret whether the average sentiment leans neutral, positive, or negative.  
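
The ±0.05 lines follow the cut-offs commonly used with VADER compound scores. As a sketch of how they translate into labels, the hypothetical helper below classifies a mean score; the function name and the example values are illustrative only.

```python
def sentiment_label(compound: float, threshold: float = 0.05) -> str:
    """Label a compound score relative to the +/- threshold lines in the figures."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for value in (0.12, 0.03, -0.05):
    print(value, sentiment_label(value))
```

A bar that ends between the two dashed lines would therefore be read as neutral on average.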


In [None]:
# Choose path for saving figures
%cd "/Users/dn/poverty-frames-reddit/03_manuscript/figures"

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Define consistent color palette for frames
palette = {
    "Structural": "skyblue",
    "Individual": "lightcoral",
    "Political": "thistle",
    "Unassigned": "silver"
}

# Set the theme for a darker grid
sns.set_style("darkgrid")

# Figure a: Sentiment per Topic
temp_df = merged_df[merged_df["top_topic"].notna()].copy()
temp_df["topic"] = temp_df["top_topic"].astype(int) + 1  # Shift topic labels to 1–10 and show them as integers

sns.barplot(
    data=temp_df,
    x="topic", y="sentiment", hue="frame", palette=palette,
    errorbar=None, edgecolor="none"  # Plot means only, without error bars
)

plt.legend(title="Framing", fontsize=9, facecolor="white") # Legend, white background
plt.axhline(0, color='black', linestyle='-', linewidth=1)      # Neutral baseline
plt.axhline(0.05, color='green', linestyle='--', linewidth=1)  # Positive threshold
plt.axhline(-0.05, color='red', linestyle='--', linewidth=1)   # Negative threshold
plt.xlabel("Topic", fontsize=13, labelpad=10)  # Rename label and add a bit of space
plt.ylabel("Sentiment", fontsize=13)  # Rename label


plt.text(-0.15, 1, "a", transform=plt.gca().transAxes, fontsize=20, fontweight="bold") # Letter (a)

plt.savefig("average_sentiment_per_topic.pdf", bbox_inches="tight") # Save as PDF
plt.show()

In [None]:
# Figure b: Sentiment per Frame
sns.barplot(
    data=merged_df,
    x="frame", y="sentiment", hue="frame", palette=palette,
    errorbar=None,  # Plot means only, without error bars
    order=["Political", "Individual", "Structural"], edgecolor="none"
)

plt.axhline(0, color='black', linestyle='-', linewidth=1)      # Neutral baseline
plt.axhline(0.05, color='green', linestyle='--', linewidth=1)  # Positive threshold
plt.axhline(-0.05, color='red', linestyle='--', linewidth=1)   # Negative threshold
plt.xlabel("Framing", fontsize = 13, labelpad = 10) # Rename label + add space
plt.ylabel("Sentiment", fontsize = 13) # Rename label

plt.text(-0.15, 1, "b", transform=plt.gca().transAxes, fontsize=20, fontweight="bold") # Letter (b)

plt.savefig("average_sentiment_per_frame.pdf", bbox_inches="tight") # Save as PDF
plt.show()