## Topic Modelling

<b> Author:</b> Miraya Gupta \
<b> Date: </b> 04/05

## Table of Contents
1. [Loading Data](#ld)
2. [Preprocessing Data](#pd)
3. [Creating Vectors](#cm)
4. [Fitting Model](#fm)
5. [Grid Search](#gs)

In [1]:
#importing packages
import pandas as pd
import csv
import json
import numpy as np
import os
from sklearn.feature_extraction.text import CountVectorizer

## 1. Loading Data<a id="ld"></a>

In [2]:
def get_df_for_year(year):
    '''
    Take in the year and return a df with data from that year
    '''
    directory = 'data'
    filename = f'final_result_{year}.csv'
    path = os.path.join(directory, filename)
    df = pd.read_csv(path)
    return df

In [3]:
#calling function to get csv for each year
all_dfs = []
for yr in ['2020', '2021', '2022', '2023', '2024']:
    df = get_df_for_year(yr)
    all_dfs.append(df)

In [4]:
#testing
all_dfs[0].columns

Index(['Unnamed: 0', 'video_id', 'video_timestamp', 'video_duration',
       'video_locationcreated', 'suggested_words', 'video_diggcount',
       'video_sharecount', 'video_commentcount', 'video_playcount',
       'video_description', 'video_is_ad', 'video_stickers', 'author_username',
       'author_name', 'author_followercount', 'author_followingcount',
       'author_heartcount', 'author_videocount', 'author_diggcount',
       'author_verified', 'search_term', 'year', 'File Name', 'Content',
       'Subjectivity/Objectivity'],
      dtype='object')

<b> The column 'Content' holds the transcript text for each video; this is the document used for topic modelling. </b>

## 2. Preprocessing Data<a id="pd"></a>

In [5]:
#concatenate all
allDocs = pd.concat(all_dfs)
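
One detail worth checking after concatenation: each yearly frame brings its own row labels, so labels may repeat in the combined frame. The sketch below uses two small made-up frames (not the real CSVs) to show the default behaviour next to `ignore_index=True`.

```python
import pandas as pd

# Two toy per-year frames standing in for the yearly CSVs (hypothetical data)
df_2020 = pd.DataFrame({"Content": ["a", "b"], "year": [2020, 2020]})
df_2021 = pd.DataFrame({"Content": ["c", "d"], "year": [2021, 2021]})

# Default concat keeps each frame's own row labels, so labels repeat
stacked = pd.concat([df_2020, df_2021])
print(stacked.index.tolist())       # [0, 1, 0, 1]

# ignore_index=True assigns a fresh 0..n-1 index instead
renumbered = pd.concat([df_2020, df_2021], ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2, 3]
```

Whether a fresh index is preferable depends on whether the original row labels are needed later, for example as document names.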

In [6]:
#testing
allDocs.shape

(17062, 31)

In [7]:
pd.set_option("display.max_colwidth",1000)

In [8]:
allDocs.head(2)

Unnamed: 0.1,Unnamed: 0,video_id,video_timestamp,video_duration,video_locationcreated,suggested_words,video_diggcount,video_sharecount,video_commentcount,video_playcount,...,search_term,year,File Name,Content,Subjectivity/Objectivity,transcript_File Name,transcript_Content,transcript_Unnamed: 0,transcript_Subjectivity/Objectivity,transcript_video_id
0,0.0,7104707494194269482,2022-06-02T14:13:40,15.0,US,,106.0,1.0,4.0,4690.0,...,Public Safety,2020.0,@chayse.me_video_7104707494194269482.txt,electronica esosledge och,not sure,,,,,
1,1.0,7167954365922151726,2022-11-19T23:43:58,23.0,US,,35800.0,974.0,228.0,323000.0,...,Public Safety,2020.0,@elevendouble9_video_7167954365922151726.txt,"What's your schedule like coming here? We work on 6 a.m. every day. 6 a.m. We're going to get here. We're going to get here at 5.30 to stretch and warm up. What the hell time do you get up? About like 4.45. What time do you get up? 9.30. Guess there's no lonely nights if you're going to bed at 9. Yeah, go to bed early. Where are you? Oh, wait a second. I'm not just thinking about life. What is there to think about it that time?",subjective,,,,,


In [9]:
nan_indices = allDocs['Content'].isnull()
allDocs = allDocs.dropna(subset=['Content'])

In [10]:
# Initialize the vectorizer
vectorizer = CountVectorizer(
    strip_accents='unicode',
    stop_words='english',
    lowercase=True,
    token_pattern=r'\b[a-zA-Z]{3,}\b', # we want only words that contain letters and are 3 or more characters long
)

# Transform our data into the document-term matrix
dtm = vectorizer.fit_transform(allDocs['Content'])
dtm

<11167x25340 sparse matrix of type '<class 'numpy.int64'>'
	with 644396 stored elements in Compressed Sparse Row format>
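
To see what the `token_pattern` keeps and drops, the same regular expression can be applied directly with Python's `re` module. This is only a sketch of the tokenization step on a made-up sentence; the vectorizer additionally removes stop words afterwards.

```python
import re

# Same pattern as the vectorizer's token_pattern
TOKEN_PATTERN = r'\b[a-zA-Z]{3,}\b'

sample = "Don't keep using it at 6 a.m."   # hypothetical sentence
tokens = re.findall(TOKEN_PATTERN, sample.lower())
print(tokens)   # ['don', 'keep', 'using']
```

Contractions are split at the apostrophe, which is why fragments such as 'don' show up in the vocabulary, while digits and one- or two-letter pieces are discarded.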

## 3. Creating Vectors <a id="cm"></a>

In [11]:
feature_names = vectorizer.get_feature_names_out()
feature_names

array(['aaaaaahhh', 'aachenklaas', 'aachenklaus', ..., 'zygote', 'zypact',
       'zyprophloxicin'], dtype=object)

In [12]:
feature_names.shape

(25340,)

In [13]:
feature_names[300:350]

array(['adjacent', 'adjouish', 'adjourned', 'adju', 'adjudicate',
       'adjudicated', 'adjudication', 'adjust', 'adjusted', 'adjuster',
       'adjustment', 'adl', 'admin', 'administer', 'administered',
       'administers', 'administracion', 'administration',
       'administrations', 'administrative', 'administrator', 'admiral',
       'admiration', 'admire', 'admired', 'admissibility', 'admission',
       'admissions', 'admit', 'admits', 'admitted', 'admittedly',
       'admitting', 'admusy', 'ado', 'adoe', 'adolescent', 'adolescents',
       'adolf', 'adopt', 'adopted', 'adopting', 'adoption', 'adoration',
       'adorned', 'adou', 'adquienes', 'adreisi', 'adrenal', 'ads'],
      dtype=object)

<b>Observation: several terms share a root, such as 'administer', 'administered' and 'administration', and some share that root in another language, such as 'administracion'. Since no stemming or lemmatization is applied, each variant is counted as a separate feature.</b>
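
A rough way to gauge how many such families exist is to group terms by a shared prefix. The sketch below is a crude heuristic, not a stemmer, and the prefix length is an arbitrary assumption; a proper stemmer or lemmatizer would be the usual tool for actually merging variants.

```python
from collections import defaultdict

# A few terms copied from the vocabulary slice above
terms = ['administer', 'administered', 'administers', 'administracion',
         'administration', 'admit', 'admits', 'admitted', 'adopt', 'adopted']

PREFIX_LEN = 5  # hypothetical cut-off, chosen only for illustration

families = defaultdict(list)
for term in terms:
    families[term[:PREFIX_LEN]].append(term)

# Keep only prefixes shared by more than one term
shared = {prefix: members for prefix, members in families.items()
          if len(members) > 1}
for prefix, members in shared.items():
    print(prefix, members)
```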

In [14]:
doc1 = dtm[0]
doc1

<1x25340 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [15]:
row_index = 0
doc_vec = dtm.getrow(row_index).toarray()

non_zero_indices = doc_vec.nonzero()[1]
dtm_scores = doc_vec[0, non_zero_indices] # goes and retrieves the values corresponding to the non_zero_indices
words = [feature_names[i] for i in non_zero_indices]

for word, score in zip(words, dtm_scores):
    print(f"{word}: {score}")

electronica: 1
esosledge: 1
och: 1


In [16]:
non_zero_indices

array([ 7181,  7664, 15590])

In [17]:
dtm.getcol(599).toarray().T.sum()

46

In [18]:
np.count_nonzero(dtm.getcol(7181).toarray().T)

5

In [19]:
dct = {}
for i in range(4504):
    value = np.count_nonzero(dtm.getcol(i).toarray().T)
    key =  feature_names[i]
    dct[key] = value
sorted(dct.items(), reverse = True, key=lambda index : index[1])[:5]

[('come', 1346),
 ('actually', 1325),
 ('called', 892),
 ('big', 886),
 ('care', 807)]

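<b> It is hard to tell whether these words belong to a topic, since they are very general. What does come across is that most of them are conversational, such as 'actually' or 'called'. Note that the loop only scans the first 4,504 of the 25,340 features in alphabetical order, so words later in the alphabet are not ranked here. </b>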

In [20]:
def matrix2Doc(dtMatrix, features, index):
    """Turns each row of the document-term matrix into a list of terms"""
    row = dtMatrix.getrow(index).toarray()
    non_zero_indices = row.nonzero()[1]
    words = [features[idx] for idx in non_zero_indices]
    return words

In [21]:
allDocsAsTerms = [matrix2Doc(dtm, feature_names, i) for i in range(dtm.shape[0])]

In [22]:
len(allDocsAsTerms)

11167

In [23]:
allDocs['terms'] = allDocsAsTerms
allDocs.head()

Unnamed: 0.1,Unnamed: 0,video_id,video_timestamp,video_duration,video_locationcreated,suggested_words,video_diggcount,video_sharecount,video_commentcount,video_playcount,...,year,File Name,Content,Subjectivity/Objectivity,transcript_File Name,transcript_Content,transcript_Unnamed: 0,transcript_Subjectivity/Objectivity,transcript_video_id,terms
0,0.0,7104707494194269482,2022-06-02T14:13:40,15.0,US,,106.0,1.0,4.0,4690.0,...,2020.0,@chayse.me_video_7104707494194269482.txt,electronica esosledge och,not sure,,,,,,"[electronica, esosledge, och]"
1,1.0,7167954365922151726,2022-11-19T23:43:58,23.0,US,,35800.0,974.0,228.0,323000.0,...,2020.0,@elevendouble9_video_7167954365922151726.txt,"What's your schedule like coming here? We work on 6 a.m. every day. 6 a.m. We're going to get here. We're going to get here at 5.30 to stretch and warm up. What the hell time do you get up? About like 4.45. What time do you get up? 9.30. Guess there's no lonely nights if you're going to bed at 9. Yeah, go to bed early. Where are you? Oh, wait a second. I'm not just thinking about life. What is there to think about it that time?",subjective,,,,,,"[bed, coming, day, early, going, guess, hell, just, life, like, lonely, nights, schedule, second, stretch, think, thinking, time, wait, warm, work, yeah]"
2,2.0,7277689387939417386,2023-09-11T17:52:00,46.0,US,,157600.0,2252.0,163.0,839100.0,...,2020.0,@collegelifeshorts_video_7277689387939417386.txt,"Hey, I got my bike stolen. Oh nice. Okay. Great. Yes. Do we have a list of suspects? Yeah, no I think it was just someone random. I left it unlocked. Do you have any close friends? Do you have any friends? Yes? It was him. Let's ride Something's not adding up. Okay. I have a couple leads. I'm hoping the university is willing to fund. I found it You know supposed to have that candle also my car got broken in Maybe that's where your bike went. I found my bike. So your car is stolen We'll have to lure the culprit back onto campus before I can do anything about it. Is that a vapor? Yeah, you can't have that. Yeah, I can I'm just gonna call the actual no, no, no, no, no once we get him back I'll revoke their parking pass can't guarantee we'll get the car back But I will get the pass bottom line you will no longer have a parking pass. Oh, there's your bike That must mean your car is somewhere in this room",subjective,,,,,,"[actual, adding, bike, broken, campus, candle, car, close, couple, culprit, friends, fund, gonna, got, great, guarantee, hey, hoping, just, know, leads, left, let, line, list, longer, lure, maybe, mean, nice, okay, parking, pass, random, revoke, ride, room, stolen, supposed, suspects, think, university, unlocked, vapor, went, willing, yeah, yes]"
3,3.0,7176634327122365738,2022-12-13T09:06:40,20.0,US,"disposable straws, disposable pot",1100000.0,249900.0,13800.0,15800000.0,...,2020.0,@adequatevapor_video_7176634327122365738.txt,"I am reposting this here because I just assumed that I have some followers on all platforms that happen to use these things. This is an Elf Bar, a legitimate Elf Bar, not that that really matters. If it tastes like it's almost out of juice, get a new one. Don't keep using it because you're going to end up inhaling plastic,",subjective,,,,,,"[assumed, bar, don, elf, end, followers, going, happen, inhaling, juice, just, legitimate, like, matters, new, plastic, platforms, really, reposting, tastes, things, use, using]"
4,4.0,7212690611852578094,2023-03-20T14:03:32,50.0,US,,15700.0,2171.0,1110.0,299900.0,...,2020.0,@illinoispolicy_video_7212690611852578094.txt,"So this is the entrance to Horner Park, supposedly being protected by the city's most profitable speed camera. Except this is not the street that the speed camera is on. You can actually see a sign over there that says it's up and around the corner. That's not it, that's a red light camera. And here we are. This is the most profitable speed camera in the entire city of Chicago. You've got the other one there across the street. This camera is pulling in $4 million a year for the city. So there you have it. There's a traffic camera on the busy street, the state highway, that gets all the cars going by. But none of the actual entrance to the park, which by the way also happens to have a preschool across the street.",subjective,,,,,,"[actual, actually, busy, camera, cars, chicago, city, corner, entire, entrance, gets, going, got, happens, highway, horner, light, million, park, preschool, profitable, protected, pulling, red, says, sign, speed, state, street, supposedly, traffic, way, year]"


<b> Limitations of the terms in capturing the meaning of the transcripts </b>
1. Proper nouns like 'Elf' from 'Elf Bars' and 'Adolf' from 'Adolf Hitler' are interpreted as common nouns.
2. Some words are incomprehensible in English, for example 'och'. This could be a limitation of the transcription rather than the term extraction.
3. Some words that should technically be stopwords within the given context of the transcript are left in.
4. Some platform specific words like 'reposting' are captured as words that add meaning to the transcript. 

## 4. Fitting Model <a id="fm"></a>

In [24]:
from sklearn.decomposition import LatentDirichletAllocation

# Step 1: Initialize the model

lda = LatentDirichletAllocation(n_components=15, # the number of topics is picked arbitrarily for now
                                random_state=0)

# Step 2: Fit the model
lda.fit(dtm)

In [25]:
#accessing representations of topics
lda.components_

array([[0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 0.06666667,
        0.06666667],
       [0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 0.06666671,
        0.06666671],
       [0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 0.06666667,
        0.06666667],
       ...,
       [0.06666667, 0.06666667, 0.06666684, ..., 0.06666667, 0.06666667,
        0.06666667],
       [0.06666667, 2.06666664, 0.06666667, ..., 0.06666667, 0.06666667,
        0.06666667],
       [0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 0.06666667,
        0.06666667]])

In [26]:
lda.components_.shape

(15, 25340)

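Each row is one of the 15 topics and each column is a term in the vocabulary. The values can be read as pseudo-counts: roughly how many times a term was assigned to a topic across the corpus, plus the prior. With scikit-learn's default topic-word prior of 1/n_components, a term that was never assigned to a topic keeps a value of about 1/15 ≈ 0.0667, which is why that number dominates the output above.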
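
Because the rows are pseudo-counts rather than probabilities, it can help to normalize each row so that it sums to 1, giving a word distribution per topic. The sketch below does this on a tiny made-up count matrix; `toy_counts` and `toy_lda` are illustrative names only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny hypothetical document-term matrix: 4 documents, 5 terms
toy_counts = np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 1, 3, 2],
    [0, 0, 0, 2, 3],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(toy_counts)

# Divide each row by its total so that every topic sums to 1
row_totals = toy_lda.components_.sum(axis=1, keepdims=True)
topic_word_probs = toy_lda.components_ / row_totals

print(topic_word_probs.round(3))
print(topic_word_probs.sum(axis=1))
```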

In [27]:
def display_topics(model, features, no_top_words):
    """Helper function to show the top words of a model"""
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([features[i]
                        for i in topic.argsort()[:-no_top_words-1:-1]])) # last no_top_words indices in reverse order, i.e. highest weight first

display_topics(lda, feature_names, 15)

Topic 0:
know people gaza just israel god president health unrwa going jesus really like jew make
Topic 1:
gonna just like money time people make know don emergency need going place want school
Topic 2:
que think end conversation los para fala una esta mas por eso porque las pero
Topic 3:
trump president donald just people like said biden ukraine question political know did georgia years
Topic 4:
israel like right going people don just know war want lot united states let money
Topic 5:
piste going just security committee law know federal department fbi house time government people vote
Topic 6:
like know don just right want going got people yeah think okay come say really
Topic 7:
people like health want don know just think work public way right community lot need
Topic 8:
people know president don think just right years debate police person want democracy said say
Topic 9:
like state sorry abortion medical want people social know just going republican court website government
Topic 10

Three topics pertain clearly to the conflict in Gaza: 0, 4 and 13. Topic 0 seems to connect more with the religious element of the conflict, topic 4 with funding and topic 13 with the armed conflict.
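
The slice used in `display_topics` to pick the top words is compact but easy to misread. The sketch below unpacks it on a made-up weight vector and compares it with an equivalent form that sorts in descending order.

```python
import numpy as np

weights = np.array([0.1, 2.0, 0.5, 3.0])   # hypothetical topic row
n = 2

order = weights.argsort()            # indices from smallest to largest weight
top_slice = order[:-n - 1:-1]        # walk backwards from the end, n steps
top_alt = np.argsort(-weights)[:n]   # sort descending, take the first n

print(top_slice, top_alt)   # [3 1] [3 1]
```

The two forms agree here; they can order tied weights differently.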

In [28]:
doc_topic_dist = lda.transform(dtm)
doc_topic_dist

array([[0.01666667, 0.01666667, 0.01666667, ..., 0.01666667, 0.01666667,
        0.01666667],
       [0.00229885, 0.00229886, 0.00229885, ..., 0.00229885, 0.00229885,
        0.13961789],
       [0.0010101 , 0.0809858 , 0.0010101 , ..., 0.0010101 , 0.0010101 ,
        0.0010101 ],
       ...,
       [0.00095238, 0.42112601, 0.00095238, ..., 0.00095238, 0.00095238,
        0.00095238],
       [0.00082305, 0.00082305, 0.00082305, ..., 0.05098999, 0.00082305,
        0.00082305],
       [0.02222225, 0.02222224, 0.02222222, ..., 0.02222226, 0.02222224,
        0.02222228]])

In [29]:
doc_topic_dist.shape

(11167, 15)
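
The first row of `doc_topic_dist` can be approximately reproduced by hand, which helps in reading very short documents. The sketch assumes scikit-learn's default document-topic prior of 1/n_components and that all three tokens of that document end up on a single topic; both are assumptions, not something the notebook states.

```python
n_topics = 15
doc_topic_prior = 1 / n_topics   # assumed scikit-learn default
n_tokens = 3                     # "electronica", "esosledge", "och"

# Unnormalized weights: the prior on every topic, plus the token counts
# on one topic (assumption: all three tokens land on the same topic)
total = n_topics * doc_topic_prior + n_tokens

background = doc_topic_prior / total
dominant = (doc_topic_prior + n_tokens) / total

print(round(background, 8))   # 0.01666667
print(round(dominant, 3))     # 0.767
```

In other words, a three-word document can reach a weight of about 0.77 on one topic with very little evidence behind it.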

In [30]:
def displayHeader(model, features, no_top_words):
    """Helper function to show the top words of a model"""
    topicNames = []
    for topic_idx, topic in enumerate(model.components_):
        topicNames.append(f"Topic {topic_idx}: " + (", ".join([features[i]
                             for i in topic.argsort()[:-no_top_words-1:-1]])))
    return topicNames

In [31]:
# column names
topicnames = displayHeader(lda, feature_names, 5)

# index names
docnames = allDocs.index.tolist() # use the row labels of allDocs as document labels

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(doc_topic_dist, 3), 
                                 columns=topicnames, 
                                 index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1) # column index of the highest-weight topic in each row
df_document_topic['dominant_topic'] = dominant_topic

df_document_topic.head()

Unnamed: 0,"Topic 0: know, people, gaza, just, israel","Topic 1: gonna, just, like, money, time","Topic 2: que, think, end, conversation, los","Topic 3: trump, president, donald, just, people","Topic 4: israel, like, right, going, people","Topic 5: piste, going, just, security, committee","Topic 6: like, know, don, just, right","Topic 7: people, like, health, want, don","Topic 8: people, know, president, don, think","Topic 9: like, state, sorry, abortion, medical","Topic 10: going, like, people, just, let","Topic 11: going, trump, people, just, think","Topic 12: vote, states, new, election, voting","Topic 13: israel, people, palestinian, israeli, gaza","Topic 14: like, food, right, people, going",dominant_topic
0,0.017,0.017,0.017,0.017,0.017,0.017,0.017,0.017,0.017,0.017,0.767,0.017,0.017,0.017,0.017,10
1,0.002,0.002,0.002,0.002,0.002,0.002,0.83,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.14,6
2,0.001,0.081,0.001,0.328,0.001,0.001,0.579,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,6
3,0.003,0.003,0.003,0.003,0.003,0.003,0.811,0.003,0.003,0.003,0.003,0.003,0.156,0.003,0.003,6
4,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.981,0.001,0.001,0.001,0.001,0.001,0.001,8
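
The dominant topic alone hides how decisive the assignment is. One option, sketched below on made-up weights, is to store the winning weight next to the topic number and flag documents that fall under a cut-off; the 0.5 threshold is a hypothetical value, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights for three documents and three topics
toy_dist = np.array([
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.10, 0.30, 0.60],
])
toy_df = pd.DataFrame(toy_dist, columns=['Topic 0', 'Topic 1', 'Topic 2'])

THRESHOLD = 0.5  # hypothetical cut-off

toy_df['dominant_topic'] = toy_dist.argmax(axis=1)
toy_df['dominant_weight'] = toy_dist.max(axis=1)
toy_df['is_mixed'] = toy_df['dominant_weight'] < THRESHOLD
print(toy_df)
```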


In [32]:
df_document_topic.shape

(11167, 16)

In [33]:
df_document_topic[76:86]

Unnamed: 0,"Topic 0: know, people, gaza, just, israel","Topic 1: gonna, just, like, money, time","Topic 2: que, think, end, conversation, los","Topic 3: trump, president, donald, just, people","Topic 4: israel, like, right, going, people","Topic 5: piste, going, just, security, committee","Topic 6: like, know, don, just, right","Topic 7: people, like, health, want, don","Topic 8: people, know, president, don, think","Topic 9: like, state, sorry, abortion, medical","Topic 10: going, like, people, just, let","Topic 11: going, trump, people, just, think","Topic 12: vote, states, new, election, voting","Topic 13: israel, people, palestinian, israeli, gaza","Topic 14: like, food, right, people, going",dominant_topic
76,0.001,0.001,0.001,0.001,0.001,0.001,0.156,0.001,0.001,0.001,0.34,0.494,0.001,0.001,0.001,11
77,0.004,0.004,0.004,0.004,0.004,0.004,0.004,0.004,0.004,0.942,0.004,0.004,0.004,0.004,0.004,9
78,0.01,0.01,0.01,0.01,0.01,0.01,0.224,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.652,14
79,0.022,0.022,0.022,0.022,0.022,0.022,0.022,0.022,0.689,0.022,0.022,0.022,0.022,0.022,0.022,8
80,0.033,0.033,0.033,0.033,0.533,0.033,0.033,0.033,0.033,0.033,0.033,0.033,0.033,0.033,0.033,4
81,0.004,0.004,0.004,0.004,0.004,0.004,0.004,0.004,0.004,0.004,0.004,0.945,0.004,0.004,0.004,11
82,0.0,0.0,0.0,0.0,0.0,0.0,0.052,0.0,0.0,0.0,0.0,0.945,0.0,0.0,0.0,11
83,0.001,0.988,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,1
84,0.222,0.003,0.003,0.003,0.003,0.003,0.743,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,6
85,0.011,0.011,0.011,0.282,0.011,0.011,0.574,0.011,0.011,0.011,0.011,0.011,0.011,0.011,0.011,6


In [34]:
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

Unnamed: 0,Topic Num,Num Documents
0,6,1588
1,11,1217
2,0,1042
3,13,931
4,1,915
5,7,702
6,12,701
7,4,650
8,10,628
9,14,592


## 5. Grid Search <a id="gs"></a>

In [None]:
from sklearn.model_selection import GridSearchCV

# We are going to test multiple values for the number of topics
search_params = {'n_components': [5, 10, 15, 20, 25, 30, 35]}

# Initialize the LDA model
lda = LatentDirichletAllocation()

# Initialize a Grid Search with cross-validation instance
grid = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
grid.fit(dtm)
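
After fitting, the per-candidate scores are available in `cv_results_`, which is often more informative than the single best value. The sketch below runs a very small grid on a made-up corpus; the corpus, the candidate values and `cv=3` are all assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus with two rough themes
toy_docs = (["vote election ballot voting state"] * 6 +
            ["health doctor medical hospital care"] * 6)
toy_dtm = CountVectorizer().fit_transform(toy_docs)

toy_grid = GridSearchCV(
    LatentDirichletAllocation(random_state=0),
    param_grid={'n_components': [2, 3]},
    cv=3,
)
toy_grid.fit(toy_dtm)

# One row per candidate, with the mean and spread of the fold scores
results = pd.DataFrame(toy_grid.cv_results_)[
    ['param_n_components', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```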

In [None]:
# Best Model
best_lda_model = grid.best_estimator_

# Model Parameters
print("Best Model's Params: ", grid.best_params_)

# Mean cross-validated score of the best model (approximate log-likelihood; higher is better)
print("Best Log Likelihood Score: ", grid.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(dtm))
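
The perplexity above is computed on the same matrix the model was fitted on. A possible complement, sketched here on a made-up corpus, is to hold out part of the documents and compare perplexity across candidate topic counts on that unseen part. The split size and candidate values are assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus with two rough themes, repeated to get enough rows
toy_docs = (["vote election ballot voting state"] * 10 +
            ["health doctor medical hospital care"] * 10)
toy_dtm = CountVectorizer().fit_transform(toy_docs)

train, test = train_test_split(toy_dtm, test_size=0.25, random_state=0)

perplexities = {}
for k in [2, 3, 4]:
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(train)
    # Lower perplexity on unseen documents suggests a better fit
    perplexities[k] = model.perplexity(test)

best_k = min(perplexities, key=perplexities.get)
print(perplexities, best_k)
```

Perplexity does not always track how interpretable the topics are, so reading the top words for each candidate remains useful.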