# Notebook 02 – LDA and topic assignment
*This notebook analyzes Reddit "Change My View" (CMV) posts using topic modeling. We're working with a pre-processed dataset from GCS that already includes tokenized text.*

> **Goal in one line:** build a fully distributed Spark pipeline that analyses content and assigns a topic to each post.




In [None]:
# Initialize Spark session for distributed computing
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("GCS Loader") \
    .getOrCreate()

25/04/23 14:46:33 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### Import statements

In [None]:
# Import required libraries for PySpark data processing and schema definition
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, ArrayType, LongType
from pyspark.sql.functions import from_unixtime, year, month, date_format

# Import NLP libraries for text preprocessing
import re
from nltk.corpus import stopwords, words
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Import PySpark functions for user-defined functions and column operations
from pyspark.sql.functions import udf, concat_ws, col
from pyspark.sql.types import StringType

# Download required NLTK resources for text processing
import nltk
nltk.download('stopwords')  # Common words to exclude
nltk.download('punkt')      # For tokenization
nltk.download('wordnet')    # For lemmatization
nltk.download('words')      # English dictionary
from pyspark.ml.feature import CountVectorizer

# Initialize NLP resources for text preprocessing
stop_words = set(stopwords.words('english'))
english_words = set(words.words())
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/local/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to /usr/local/share/nltk_data...
[nltk_data]   Package words is already up-to-date!


## 1 – Data preprocessing

In [None]:
# Define schema following Notebook 01
schema_two = StructType([
    StructField("num_comments", IntegerType(), True),    # Number of comments on the post
    StructField("selftext", StringType(), True),         # Body text of the post
    StructField("score", IntegerType(), True),           # Reddit score (upvotes - downvotes)
    StructField("title", StringType(), True),            # Title of the post
    StructField("delta", BooleanType(), True),           # Whether the post received a delta (changed view)
    StructField("urls", ArrayType(StringType()), True),  # URLs mentioned in the post
    StructField("name", StringType(), True),             # Unique identifier of the post
    StructField("processed", ArrayType(StringType()), True),  # Preprocessed tokens
    StructField("merged", StringType(), True),           # Merged text field (likely title + selftext)
    StructField("year_month", StringType(), True)        # Time period for temporal analysis
])

# Read preprocessed data from the GCS bucket
n1_df = spark.read.schema(schema_two).json("gs://st446-cmv/n1_preprocessing_df")
n1_df.show(5, truncate=False)  # Display first 5 rows with full content


In [None]:
# Print dataset dimensions
shape = (n1_df.count(), len(n1_df.columns))
print(f"Shape: {shape}")


Shape: (65169, 10)


                                                                                

### Converting Text to Sparse Feature Vectors

To perform topic modeling, we need to convert our tokenized text into numerical feature vectors. We'll use the `CountVectorizer` from PySpark ML to create a document-term matrix.

#### CountVectorizer Parameters:
- **inputCol**: "processed" - our pre-tokenized text array
- **outputCol**: "features" - sparse vector output
- **minDF**: 8 - terms must appear in at least 8 documents

The `minDF` parameter filters out rare terms that are likely to be noise or typos. The value of 8 was chosen by trial and error to balance three concerns:
- Including enough meaningful terms
- Excluding rare terms that don't contribute to topic cohesion
- Managing computational complexity

This step transforms our text data into the feature vectors required for LDA modeling.
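
To make the effect of `minDF` concrete, here is a minimal plain-Python sketch (not Spark code) of document-frequency filtering followed by count vectorisation. The function names `build_vocabulary` and `to_sparse_counts` and the toy documents are illustrative assumptions. Spark's `CountVectorizer` performs the equivalent steps in a distributed way and also orders its vocabulary by corpus-wide term frequency; the alphabetical tie-break below is a choice made only for this sketch.

```python
from collections import Counter

def build_vocabulary(docs, min_df):
    """Keep terms that occur in at least `min_df` documents, most frequent first."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Sort by descending corpus frequency; ties broken alphabetically (sketch-only choice).
    return sorted(kept, key=lambda t: (-term_freq[t], t))

def to_sparse_counts(tokens, vocabulary):
    """Return {term_index: count} for the in-vocabulary terms of one document."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    return dict(sorted(counts.items()))

# Toy corpus standing in for the `processed` column (hypothetical data).
toy_docs = [
    ["tax", "government", "tax"],
    ["government", "vote"],
    ["tax", "vote", "typoo"],
]
toy_vocab = build_vocabulary(toy_docs, min_df=2)
toy_vector = to_sparse_counts(toy_docs[0], toy_vocab)
```

With `min_df=2`, the misspelled token `typoo` appears in only one document and is dropped, which is the behaviour `minDF=8` relies on at corpus scale.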

In [None]:

# Initialize CountVectorizer with minimum document frequency of 8 (filters rare words)
cv = CountVectorizer(inputCol="processed", outputCol="features", minDF=8)

# Fit the CountVectorizer and transform the dataset to add feature vectors
cv_model = cv.fit(n1_df)

corpus_df_w_features = cv_model.transform(n1_df)
corpus_df_w_features.show(20)

# Extract and print vocabulary information
vocabulary = cv_model.vocabulary
print("Vocabulary size:", len(vocabulary))  # Vocabulary size: 14179
print("Sample Terms:", vocabulary[:20])

                                                                                

+------------+--------------------+-----+--------------------+-----+--------------------+---------+--------------------+--------------------+----------+--------------------+
|num_comments|            selftext|score|               title|delta|                urls|     name|           processed|              merged|year_month|            features|
+------------+--------------------+-----+--------------------+-----+--------------------+---------+--------------------+--------------------+----------+--------------------+
|           1|                    |    1|I believe that Ap...|false|                  []|     NULL|[apple, product, ...| I believe that A...|   2013-07|(14179,[83,346,71...|
|           2|Every single year...|    2|CMV I Believe We ...|false|         [imgur.com]|     NULL|[every, single, y...|Every single year...|   2013-10|(14179,[5,22,27,3...|
|          44|For instance, gir...|   22|I believe many fe...|false|                  []|     NULL|[instance, girl, ...|For instan

## 2 – Topic Modeling using Latent Dirichlet Allocation (LDA)

Now let's apply Latent Dirichlet Allocation (LDA) to discover latent topics in our corpus.

### LDA Parameters:
- **k**: 45 - number of topics to discover
- **seed**: 434 - for reproducible results
- **featuresCol**: "features" - our document-term vectors

We settled on 45 topics after multiple iterations. With fewer topics, distinct themes merged together; with more, themes fragmented excessively.

After fitting the model, we'll extract the most representative terms for each topic to understand what they represent.

### What LDA Does:
1. Discovers latent topics in the corpus
2. Models each document as a mixture of topics
3. Models each topic as a distribution over words
4. Provides topic distributions for each document
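
Points 2 and 3 combine into a single formula: the probability of seeing word $w$ in document $d$ is $p(w \mid d) = \sum_k \theta_{d,k}\,\phi_{k,w}$, where $\theta_d$ is the document's topic mixture and $\phi_k$ is topic $k$'s word distribution. The following plain-Python sketch evaluates this with made-up numbers for two topics and a three-word vocabulary. All values and names are illustrative assumptions, not output of the model fitted in this notebook.

```python
def word_probabilities(theta, phi):
    """Mix per-topic word distributions `phi` using document weights `theta`."""
    vocab_size = len(phi[0])
    return [
        sum(theta[k] * phi[k][w] for k in range(len(theta)))
        for w in range(vocab_size)
    ]

# Hypothetical two-topic model over the vocabulary ["tax", "vote", "dog"].
phi = [
    [0.5, 0.5, 0.0],  # topic 0: politics-like
    [0.0, 0.0, 1.0],  # topic 1: pets-like
]
theta = [0.5, 0.5]    # one document, evenly split between the two topics

p_words = word_probabilities(theta, phi)
```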

In [None]:
from pyspark.ml.clustering import LDA

num_topics = 45  # Sweet spot for nice topic representation after iteration
# Initialize the LDA model with the number of topics, a fixed seed, and the features column
lda = LDA(k=num_topics, seed=434, featuresCol="features")

# Fit the LDA model to the data
lda_model = lda.fit(corpus_df_w_features)

                                                                                

In [None]:
# Get the top 8 most representative words for each topic
topics = lda_model.describeTopics(maxTermsPerTopic=8)
topics.show(truncate=False)

# Print each topic with its most representative terms and their weights
vocab = cv_model.vocabulary
print("Topics, 8 words, and weights:")
topics_local = topics.collect()
for topic in topics_local:
    topic_id = topic['topic']
    term_indices = topic['termIndices']
    term_weights = topic['termWeights']
    terms = [vocab[idx] for idx in term_indices]
    print(f"\nTopic {topic_id}:")
    for term, weight in zip(terms, term_weights):
        print(f"  {term}: {weight:.4f}")

+-----+-----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|termIndices                                    |termWeights                                                                                                                                                                            |
+-----+-----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0    |[3419, 4158, 96, 5, 4776, 115, 4145, 3]        |[0.0026874734882793203, 0.0024505357754823487, 0.0019802792168118312, 0.001919252408444004, 0.0018686288379947394, 0.0018013539676389053, 0.001795176850463054, 0.0016209544471248488] |
|1    |[19, 15, 14, 149, 44, 107, 243, 3

In [None]:
# Add a topicDistribution column: each document's (= row = original post) probability distribution over the topics
doc_topics = lda_model.transform(corpus_df_w_features)


def get_top_topic(dist):
    return int(dist.argmax()) # Find index of highest probability topic

get_top_topic_udf = udf(get_top_topic, IntegerType())
doc_topics = doc_topics.withColumn("dominant_topic", get_top_topic_udf(doc_topics["topicDistribution"]))
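
Reducing each post to one dominant topic discards how confident that assignment is. As a minimal plain-Python sketch (the helper name `top_topic_with_weight` and the example distribution are assumptions for illustration), the same argmax can also return the winning probability, which would make it possible to flag posts whose topic mixture is nearly flat.

```python
def top_topic_with_weight(dist):
    """Return (index, probability) of the largest entry in a topic distribution."""
    # max() keeps the first index on ties, the same convention as an argmax.
    best = max(range(len(dist)), key=lambda k: dist[k])
    return best, float(dist[best])

# Hypothetical distribution over four topics for a single post.
example_dist = [0.05, 0.70, 0.20, 0.05]
topic_id, weight = top_topic_with_weight(example_dist)
```

Inside a Spark UDF the input would be a vector rather than a list, so it would first need converting, for example with `list(dist.toArray())`.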



### Manual mapping of topic IDs to human-readable categories

After running the LDA model, we need to carefully interpret the generated topics. Each topic is a probability distribution over words, and we can examine the most probable words to understand what the topic represents.

#### Topic Interpretation Process:

1. **Examine Top Terms**: Look at the most probable words for each topic
2. **Find Coherent Themes**: Identify patterns and relationships between these terms
3. **Label Topics**: Assign descriptive labels based on the term patterns
4. **Group Similar Topics**: Consolidate topics with similar themes into broader categories

For example, topics with terms like "government," "tax," "vote," and "law" clearly relate to politics, while terms like "eat," "food," "diet," and "meat" indicate discussions about food and nutrition.

Topics whose top terms do not form a clear theme are assigned to "Other".


In [None]:
# Manual mapping of topic IDs to human-readable categories
topic_to_category = {
    0: "Other",
    1: "Politics",
    2: "Philosophy",
    3: "Politics",
    4: "Other",
    5: "Politics",
    6: "Other",
    7: "Society",
    8: "Culture",
    9: "Culture",
    10: "Education",
    11: "Animals",
    12: "Education",
    13: "Health",
    14: "Other",
    15: "Culture",
    16: "Society",
    17: "Gender",
    18: "Culture",
    19: "Economy",
    20: "Food",
    21: "Society",
    22: "Society",
    23: "Sports",
    24: "Gender",
    25: "Other",
    26: "Other",
    27: "Society",
    28: "Economy",
    29: "Other",
    30: "Other",
    31: "Health",
    32: "Other",
    33: "Culture",
    34: "Gender",
    35: "Other",
    36: "Culture",
    37: "Technology",
    38: "Environment",
    39: "Other",
    40: "Other",
    41: "Technology",
    42: "Politics",
    43: "Culture",
    44: "Health"
}
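
Because the mapping is maintained by hand, it is easy to miss a topic ID or mistype a label when `k` changes. The sketch below is one possible sanity check, written in plain Python with a hypothetical five-topic mapping (`toy_mapping`) and a hypothetical helper `summarize_mapping`. It reports unmapped topic IDs and how many topics fall into each category.

```python
from collections import Counter

def summarize_mapping(mapping, num_topics):
    """Return (missing topic IDs, number of topics per category)."""
    missing = [t for t in range(num_topics) if t not in mapping]
    per_category = Counter(mapping[t] for t in range(num_topics) if t in mapping)
    return missing, per_category

# Hypothetical mapping for a five-topic model; topic 3 is left out on purpose.
toy_mapping = {0: "Other", 1: "Politics", 2: "Culture", 4: "Politics"}
missing_ids, topics_per_category = summarize_mapping(toy_mapping, num_topics=5)
```

Applied to `topic_to_category` with `num_topics = 45`, the first return value should be an empty list.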

In [None]:
# Create function to map topic IDs to category names
def get_category_title(topic_num):
    return topic_to_category.get(topic_num, "Other")  # Default to "Other" if not found

# Register UDF and add category title column
get_category_udf = udf(get_category_title, StringType())
doc_topics = doc_topics.withColumn("category_title", get_category_udf(doc_topics["dominant_topic"]))


## 3 – Display results

We've now completed our topic modeling pipeline and are saving the categorized dataset back to Google Cloud Storage for future analysis.

The final DataFrame contains the following fields:

| Column | Description |
|--------|-------------|
| `num_comments` | Number of comments the original post (OP) received. Serves as a proxy for engagement. |
| `selftext` | Main body of the OP. May be empty; some users only post in the `title`. |
| `score` | Reddit score = upvotes − downvotes. |
| `title` | Post title |
| `delta` | Boolean indicating the post received a delta (changed view). |
| `urls` | List of external links mentioned in the post. Useful for detecting evidence use. |
| `name` | Unique Reddit ID of the post. |
| `processed` | Tokenised and lemmatised version of the merged post content. |
| `merged` | Concatenated `title + selftext` |
| `year_month` | Temporal stamp for time-based analysis. |
| `features` | Sparse vector of term frequencies from `CountVectorizer` (input to LDA). |
| `topicDistribution` | Vector of length *k* = 45 representing the post's probabilistic distribution over topics. |
| `dominant_topic` | The topic ID with the highest weight for the post. |
| `category_title` | Manually assigned category label for the post's dominant topic. |


This output will be used as input for topic modelling in Notebook 03.

In [None]:
# Display results
categorized_df = doc_topics
categorized_df.show(5)

25/04/23 15:12:20 WARN DAGScheduler: Broadcasting large task binary with size 5.1 MiB

+------------+--------------------+-----+--------------------+-----+-----------+---------+--------------------+--------------------+----------+--------------------+--------------------+--------------+--------------+
|num_comments|            selftext|score|               title|delta|       urls|     name|           processed|              merged|year_month|            features|   topicDistribution|dominant_topic|category_title|
+------------+--------------------+-----+--------------------+-----+-----------+---------+--------------------+--------------------+----------+--------------------+--------------------+--------------+--------------+
|           1|                    |    1|I believe that Ap...|false|         []|     NULL|[apple, product, ...| I believe that A...|   2013-07|(14179,[83,346,71...|[0.00353616343831...|            33|       Culture|
|           2|Every single year...|    2|CMV I Believe We ...|false|[imgur.com]|     NULL|[every, single, y...|Every single year...|   2

                                                                                

In [None]:
# Write the categorized DataFrame to GCS as JSON
categorized_df.write \
    .mode("overwrite") \
    .json("gs://st446-cmv/n2_categorized_df/")

25/04/23 15:14:22 WARN DAGScheduler: Broadcasting large task binary with size 5.3 MiB
                                                                                

## Appendix

Top terms per topic from our LDA model, which we used to assign the manual category labels.

Topics, 8 words, and weights:

Topic 0:
  poetry: 0.0027
  trek: 0.0025
  black: 0.0020
  year: 0.0019
  juvenile: 0.0019
  culture: 0.0018
  captain: 0.0018
  good: 0.0016

Topic 1:
  government: 0.0110
  country: 0.0098
  state: 0.0091
  gun: 0.0089
  law: 0.0077
  war: 0.0066
  police: 0.0061
  military: 0.0048

Topic 2:
  trolley: 0.0030
  lever: 0.0024
  abortion: 0.0023
  gun: 0.0016
  game: 0.0013
  life: 0.0012
  review: 0.0009
  time: 0.0009

Topic 3:
  nuclear: 0.0043
  atheism: 0.0036
  country: 0.0035
  military: 0.0023
  cream: 0.0021
  war: 0.0020
  rule: 0.0019
  state: 0.0018

Topic 4:
  pizza: 0.0032
  server: 0.0017
  blizzard: 0.0017
  topping: 0.0014
  light: 0.0013
  pi: 0.0009
  fake: 0.0008
  legacy: 0.0007

Topic 5:
  woman: 0.0054
  men: 0.0044
  law: 0.0019
  police: 0.0017
  officer: 0.0015
  airport: 0.0013
  country: 0.0012
  electron: 0.0011

Topic 6:
  seat: 0.0032
  vote: 0.0019
  crush: 0.0019
  recline: 0.0018
  woman: 0.0016
  men: 0.0016
  witch: 0.0009
  music: 0.0009

Topic 7:
  drug: 0.0304
  car: 0.0218
  alcohol: 0.0087
  wage: 0.0082
  tip: 0.0077
  driver: 0.0070
  minimum: 0.0067
  marijuana: 0.0067

Topic 8:
  accent: 0.0059
  vampire: 0.0033
  soda: 0.0025
  font: 0.0025
  dinosaur: 0.0020
  woman: 0.0020
  banana: 0.0020
  theft: 0.0019

Topic 9:
  gaming: 0.0015
  console: 0.0014
  party: 0.0011
  time: 0.0010
  argument: 0.0009
  idea: 0.0009
  crusader: 0.0008
  point: 0.0008

Topic 10:
  frank: 0.0026
  loan: 0.0020
  college: 0.0020
  portrayal: 0.0015
  square: 0.0015
  triangle: 0.0015
  student: 0.0014
  money: 0.0014

Topic 11:
  dog: 0.0630
  cat: 0.0249
  pet: 0.0143
  animal: 0.0081
  breed: 0.0072
  owner: 0.0057
  puppy: 0.0043
  bull: 0.0032

Topic 12:
  system: 0.0038
  school: 0.0028
  game: 0.0026
  vote: 0.0023
  child: 0.0022
  state: 0.0021
  teacher: 0.0018
  sock: 0.0016

Topic 13:
  removed: 0.0041
  flu: 0.0035
  time: 0.0015
  whiskey: 0.0013
  person: 0.0012
  shot: 0.0011
  good: 0.0011
  better: 0.0010

Topic 14:
  purpose: 0.0050
  jury: 0.0040
  rubber: 0.0027
  chess: 0.0024
  human: 0.0019
  disabled: 0.0018
  involuntary: 0.0018
  animal: 0.0017

Topic 15:
  batman: 0.0187
  superman: 0.0073
  joker: 0.0069
  movie: 0.0029
  pitcher: 0.0024
  villain: 0.0016
  superhero: 0.0016
  juror: 0.0014

Topic 16:
  parking: 0.0027
  spot: 0.0018
  handicapped: 0.0015
  religion: 0.0013
  driver: 0.0011
  world: 0.0011
  country: 0.0011
  fluoride: 0.0009

Topic 17:
  woman: 0.0029
  gender: 0.0028
  courage: 0.0018
  men: 0.0016
  person: 0.0015
  sex: 0.0010
  doe: 0.0009
  wrong: 0.0009

Topic 18:
  harry: 0.0016
  potter: 0.0009
  child: 0.0008
  book: 0.0008
  time: 0.0006
  school: 0.0005
  year: 0.0005
  two: 0.0005

Topic 19:
  bob: 0.0095
  insider: 0.0022
  brother: 0.0021
  trading: 0.0017
  distinction: 0.0011
  apple: 0.0011
  inaction: 0.0010
  game: 0.0010

Topic 20:
  breakfast: 0.0048
  fry: 0.0045
  orange: 0.0040
  signature: 0.0033
  pie: 0.0029
  cook: 0.0019
  potato: 0.0017
  cake: 0.0016

Topic 21:
  password: 0.0031
  resident: 0.0021
  street: 0.0019
  sweeping: 0.0013
  guest: 0.0011
  city: 0.0010
  fingerprint: 0.0009
  sweep: 0.0008

Topic 22:
  time: 0.0076
  work: 0.0052
  good: 0.0052
  year: 0.0052
  money: 0.0044
  game: 0.0042
  better: 0.0041
  school: 0.0041

Topic 23:
  player: 0.0210
  circumcision: 0.0109
  soccer: 0.0108
  team: 0.0107
  sport: 0.0097
  football: 0.0066
  basketball: 0.0059
  play: 0.0038

Topic 24:
  woman: 0.0665
  men: 0.0389
  sex: 0.0130
  male: 0.0119
  man: 0.0115
  female: 0.0114
  gender: 0.0100
  girl: 0.0097

Topic 25:
  removed: 0.1144
  duck: 0.0034
  thou: 0.0021
  astrology: 0.0021
  religion: 0.0015
  bad: 0.0014
  world: 0.0014
  thy: 0.0013

Topic 26:
  helmet: 0.0031
  year: 0.0010
  point: 0.0010
  life: 0.0008
  time: 0.0007
  past: 0.0006
  term: 0.0006
  country: 0.0006

Topic 27:
  marriage: 0.0382
  gay: 0.0102
  married: 0.0093
  legal: 0.0068
  relationship: 0.0067
  couple: 0.0055
  divorce: 0.0053
  government: 0.0052

Topic 28:
  tax: 0.0631
  smoking: 0.0170
  cigarette: 0.0133
  smoke: 0.0122
  smoker: 0.0073
  income: 0.0071
  pay: 0.0070
  coffee: 0.0068

Topic 29:
  bottle: 0.0038
  pirate: 0.0014
  water: 0.0009
  reason: 0.0007
  city: 0.0007
  bathroom: 0.0006
  free: 0.0006
  school: 0.0006

Topic 30:
  person: 0.0081
  life: 0.0079
  child: 0.0077
  human: 0.0051
  wrong: 0.0045
  argument: 0.0045
  time: 0.0040
  reason: 0.0040

Topic 31:
  vaccine: 0.0096
  neutrality: 0.0082
  net: 0.0055
  power: 0.0047
  war: 0.0040
  rewind: 0.0039
  vaccination: 0.0035
  life: 0.0033

Topic 32:
  causation: 0.0007
  correlation: 0.0006
  woman: 0.0006
  pet: 0.0006
  free: 0.0006
  mean: 0.0005
  need: 0.0005
  society: 0.0005

Topic 33:
  music: 0.0275
  meat: 0.0190
  food: 0.0165
  animal: 0.0165
  eating: 0.0136
  eat: 0.0136
  song: 0.0119
  art: 0.0099

Topic 34:
  feminism: 0.0295
  feminist: 0.0294
  men: 0.0081
  character: 0.0072
  movement: 0.0062
  star: 0.0049
  woman: 0.0048
  issue: 0.0045

Topic 35:
  life: 0.0010
  human: 0.0007
  person: 0.0005
  million: 0.0005
  argument: 0.0004
  genitalia: 0.0004
  breast: 0.0004
  rape: 0.0004

Topic 36:
  pant: 0.0048
  statue: 0.0046
  lee: 0.0045
  monument: 0.0016
  love: 0.0011
  movie: 0.0008
  live: 0.0008
  white: 0.0007

Topic 37:
  language: 0.0076
  android: 0.0026
  corn: 0.0022
  mar: 0.0021
  country: 0.0020
  pyramid: 0.0016
  phone: 0.0016
  io: 0.0016

Topic 38:
  animal: 0.0115
  human: 0.0103
  water: 0.0068
  world: 0.0067
  specie: 0.0065
  life: 0.0063
  climate: 0.0062
  earth: 0.0055

Topic 39:
  parent: 0.0114
  child: 0.0082
  alien: 0.0067
  vehicle: 0.0046
  bus: 0.0036
  space: 0.0030
  car: 0.0026
  birth: 0.0021

Topic 40:
  drone: 0.0060
  sex: 0.0010
  water: 0.0010
  country: 0.0008
  citizen: 0.0007
  blood: 0.0007
  military: 0.0007
  bottled: 0.0007

Topic 41:
  simulation: 0.0022
  ai: 0.0008
  problem: 0.0008
  group: 0.0007
  white: 0.0006
  two: 0.0006
  ghetto: 0.0006
  universe: 0.0006

Topic 42:
  vote: 0.0196
  trump: 0.0148
  state: 0.0139
  party: 0.0129
  political: 0.0111
  government: 0.0111
  election: 0.0098
  candidate: 0.0090

Topic 43:
  toilet: 0.0092
  bidet: 0.0034
  wipe: 0.0031
  paper: 0.0025
  butt: 0.0015
  sitting: 0.0014
  clown: 0.0014
  joke: 0.0014

Topic 44:
  : 0.0112
  cancer: 0.0071
  cell: 0.0046
  tumor: 0.0036
  brain: 0.0031
  radiation: 0.0026
  phone: 0.0020
  pronunciation: 0.0019