This notebook performs topic modeling on a set of Reddit comments.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('/content/gdrive/MyDrive/reddit/reddit_comments.csv', encoding='utf-8', index_col=0, sep=';')

In [5]:
df.head()

Unnamed: 0,comments
0,Idk what ya’ll are mad about. I’m pretty excit...
1,Can't wait to see a crippled Levi fighting din...
2,Can't wait for when Bellen Kristein would figh...
3,Honestly speaking I wouldn't mind reading it (...
4,"Beren is stunning, the ending is bad.\nReasons..."


**Clean the data**

In [6]:
import re
import nltk
from nltk.corpus import stopwords
import string

nltk.download('stopwords')

def preprocess_text(text):
    # Convert text to lowercase
    processed_text = text.lower()
    # Remove URLs and user mentions
    processed_text = re.sub(r"http\S+|www\S+|https\S+|\/\/t|co\/|\@\w+", '', processed_text, flags=re.MULTILINE)
    # Remove punctuation (string.punctuation is ASCII-only, so curly apostrophes such as ’ are kept)
    processed_text = processed_text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    processed_text = re.sub(r'\d+', '', processed_text)
    # Tokenize the text
    words = processed_text.split()
    # Remove stopwords (contractions already stripped of apostrophes, e.g. "dont", no longer match the list)
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    # Join the filtered words back into a string
    processed_text = ' '.join(filtered_words)

    return processed_text

training_data = []
original_texts = []

for index, row in df.iterrows():
    # Preprocess the comment text
    processed_text = preprocess_text(row['comments'])
    # Add the processed text to the 'training_data' list
    training_data.append(processed_text)
    # Add the original text to the 'original_texts' list
    original_texts.append(row['comments'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
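
The cleaning step above strips punctuation before removing stopwords, and `string.punctuation` only covers ASCII characters. That is probably why tokens such as "didn’t", "dont" and "thats" can still show up among the topic words. A minimal sketch of one possible alternative follows: it maps curly apostrophes to straight ones and filters stopwords while contractions still carry their apostrophes. The function name `preprocess_text_v2` and the tiny `DEMO_STOP_WORDS` set are hypothetical and only keep the example runnable without downloading NLTK data; in this notebook you would presumably pass `set(stopwords.words('english'))` instead.

```python
import re
import string

# Hypothetical stand-in for set(stopwords.words('english')); kept tiny so the
# sketch runs without downloading NLTK data.
DEMO_STOP_WORDS = {"i", "it's", "don't", "didn't", "the", "is", "a", "what", "are"}

def preprocess_text_v2(text, stop_words=DEMO_STOP_WORDS):
    # Lowercase and map curly apostrophes to straight ones
    text = text.lower().replace("\u2019", "'").replace("\u2018", "'")
    # Remove URLs and user mentions
    text = re.sub(r"http\S+|www\S+|@\w+", " ", text)
    # Remove numbers
    text = re.sub(r"\d+", " ", text)
    # Trim punctuation from token edges, keeping apostrophes inside contractions
    words = [w.strip(string.punctuation) for w in text.split()]
    # Remove stopwords while contractions such as "don't" still match the list
    words = [w for w in words if w and w not in stop_words]
    # Strip any punctuation that remains inside the surviving tokens
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in words]
    return ' '.join(w for w in words if w)

print(preprocess_text_v2("I don’t think it’s bad, the ending is fine!"))
```

With the demo stopword set, the example sentence reduces to `think bad ending fine`. Swapping this function in would change the training data, so the topics listed further down would no longer match.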


In [8]:
!pip install tomotopy

Collecting tomotopy
  Downloading tomotopy-0.12.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.2 MB)
Installing collected packages: tomotopy
Successfully installed tomotopy-0.12.7


In [9]:
!pip install little_mallet_wrapper

Collecting little_mallet_wrapper
  Downloading little_mallet_wrapper-0.5.0-py3-none-any.whl (19 kB)
Installing collected packages: little_mallet_wrapper
Successfully installed little_mallet_wrapper-0.5.0


**Train an LDA topic model with tomotopy**

In [10]:
import tomotopy as tp
import little_mallet_wrapper
import seaborn
import glob
from pathlib import Path

In [11]:
# Number of topics to return
num_topics = 15
# Number of topic words to print out
num_topic_words = 10

# Initialize the model
model = tp.LDAModel(k=num_topics)

# Add each document to the model, after splitting it up into words
for text in training_data:
    model.add_doc(text.strip().split())

print("Topic Model Training...\n\n")
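# Train for 100 iterations in total, reporting the per-word log-likelihood after each batch of 10
# (the printed iteration number is the count at the start of each batch)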
iterations = 10
for i in range(0, 100, iterations):
    model.train(iterations)
    print(f'Iteration: {i}\tLog-likelihood: {model.ll_per_word}')

Topic Model Training...


Iteration: 0	Log-likelihood: -8.725279041711211
Iteration: 10	Log-likelihood: -8.596976075767682
Iteration: 20	Log-likelihood: -8.552565295448925
Iteration: 30	Log-likelihood: -8.515113571037027
Iteration: 40	Log-likelihood: -8.504540745146736
Iteration: 50	Log-likelihood: -8.495437891732678
Iteration: 60	Log-likelihood: -8.43721286680187
Iteration: 70	Log-likelihood: -8.444449159186256
Iteration: 80	Log-likelihood: -8.434859519875749
Iteration: 90	Log-likelihood: -8.435435407055172
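
The per-word log-likelihood rises quickly at first and then flattens out. One rough way to check whether training has levelled off is to test whether the last few changes all stay below a small tolerance. The sketch below reuses the values printed above; the helper name `has_plateaued`, the tolerance `tol=0.01` and the `window=3` setting are illustrative assumptions, not part of tomotopy.

```python
# Per-word log-likelihood values copied from the training output above
ll_per_word = [
    -8.725279041711211, -8.596976075767682, -8.552565295448925,
    -8.515113571037027, -8.504540745146736, -8.495437891732678,
    -8.43721286680187, -8.444449159186256, -8.434859519875749,
    -8.435435407055172,
]

def has_plateaued(values, tol=0.01, window=3):
    """Return True if the last `window` changes are all smaller than `tol`."""
    if len(values) < window + 1:
        return False
    recent = values[-(window + 1):]
    changes = [abs(b - a) for a, b in zip(recent, recent[1:])]
    return all(change < tol for change in changes)

print(has_plateaued(ll_per_word))            # last three changes
print(has_plateaued(ll_per_word, window=4))  # last four changes
```

With these values the last three changes are all below 0.01, so the first call prints `True`. Widening the window to four pulls in the jump of about 0.058 between the sixth and seventh reports, so the second call prints `False`. A stricter tolerance or more iterations may be warranted before treating the model as converged.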


**Print out the top words for each topic**

In [12]:
topics = []
topic_individual_words = []
for topic_number in range(0, num_topics):
    topic_words = ' '.join(word for word, prob in model.get_topic_words(topic_id=topic_number, top_n=num_topic_words))
    topics.append(topic_words)
    topic_individual_words.append(topic_words.split())
    print(f"✨Topic {topic_number}✨\n\n{topic_words}\n")

✨Topic 0✨

aot people lot got see part time themes character wrong

✨Topic 1✨

ending like didn’t one don’t people it’s bad everyone characters

✨Topic 2✨

could know see that’s isayama life — sure marley kid

✨Topic 3✨

world paradis time dont would even always final minority point

✨Topic 4✨

rumbling series thing think died best though attack opinion plot

✨Topic 5✨

good got hate want thats historia island lol generations stockholm

✨Topic 6✨

think first many well pretty since actually liked right manga

✨Topic 7✨

armin peace back titans conflict eren lives genocide reiner hope

✨Topic 8✨

war us mean it’s take yeah die points maybe desire

✨Topic 9✨

ending like way people end something feel years anything understand

✨Topic 10✨

eren ymir mikasa titan even love titans going also free

✨Topic 11✨

sense extra makes read pages get make long big show

✨Topic 12✨

kind plot didnt im dina past point agree issue need

✨Topic 13✨

chapter cycle hatred end saw also arc every killed cam