# LDA by individual post 

Each document vector represents a single post. The fitted model can then be applied to an individual new post to estimate its probability of belonging to each topic (cluster).

Code adapted from:
- General Assembly lesson by Caroline Schmitt: Intro to Latent Dirichlet Allocation.
- Machine Learning Plus tutorial by Selva Prabhakaran: https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.decomposition import LatentDirichletAllocation 
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(42)

## Import preprocessed data

In [2]:
df = pd.read_csv('data/posts_01_cleaned_sample.csv')

In [3]:
df.head()

Unnamed: 0,body
0,"Ok, great, now be like Pa!"
1,Glad to see you here on Parler where free spee...
2,Now we’re getting somewhere👍🇺🇸
3,Imagine if they gave him truth serum and start...
4,I don’t think DC police will do much against t...


In [5]:
# TODO: move this dropna into the import/cleaning step for the full dataset, then remove it here
df = df.dropna().reset_index(drop=True)

In [5]:
df.head()

Unnamed: 0,body
0,"Ok, great, now be like Pa!"
1,Glad to see you here on Parler where free spee...
2,Now we’re getting somewhere👍🇺🇸
3,Imagine if they gave him truth serum and start...
4,I don’t think DC police will do much against t...


## Train test split

In [6]:
X = df['body']

In [8]:
X_train, X_test = train_test_split(X, test_size=0.1, random_state=42)

## Count Vectorize

In [9]:
default_words = list(CountVectorizer(stop_words='english').get_stop_words())
custom_stopwords = default_words + ['ve', 're']
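
The extra stop words `'ve'` and `'re'` are presumably there because of how `CountVectorizer` tokenizes contractions. The sketch below is an illustration, not part of the original pipeline: it uses two made-up sentences and no stop-word list so the raw tokens are visible. It also applies `strip_accents='ascii'`, the setting used in this notebook's vectorizer, to show what happens to typographic apostrophes.

```python
from sklearn.feature_extraction.text import CountVectorizer

# No stop words here, so every raw token stays visible.
# The example sentences are invented for illustration.
analyzer = CountVectorizer(strip_accents='ascii').build_analyzer()

# ASCII apostrophe: the default token pattern splits on it,
# leaving fragments such as 've' and 're' as separate tokens.
ascii_tokens = analyzer("we've seen it, you're right")
print(ascii_tokens)

# Typographic apostrophe (U+2019): strip_accents='ascii' drops it,
# so the contraction collapses into a single token.
curly_tokens = analyzer("I don’t know")
print(curly_tokens)
```

This is one plausible reason a token like `dont` shows up among the top topic words: posts typed with a curly apostrophe lose the apostrophe before tokenization, so the result is not caught by the English stop-word list.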

In [11]:
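# Keep terms that appear in at least 3 posts and in at most 50% of posts.
cv = CountVectorizer(min_df=3,
                     max_df=0.5,
                     stop_words=custom_stopwords,
                     strip_accents='ascii')
cv.fit(X_train)

cv_train = cv.transform(X_train)
cv_test = cv.transform(X_test)
# toarray() returns a plain ndarray; todense() returns the discouraged np.matrix type.
cv_df = pd.DataFrame(cv_train.toarray(), columns=cv.get_feature_names_out())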
cv_df.head(2)

Unnamed: 0,00,000,00am,01,02,03,04,05,06,07,...,zionist,zip,zombie,zombies,zone,zones,zoom,zuck,zucker,zuckerberg
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Fit LDA model

In [12]:
lda = LatentDirichletAllocation(n_components=5, 
                                random_state=42
                               )
lda.fit(cv_train)

LatentDirichletAllocation(n_components=5, random_state=42)
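
As noted at the top, the goal is to score individual new posts. A minimal sketch of that step follows, assuming a hypothetical toy corpus and toy model (`toy_posts`, `toy_cv`, `toy_lda` are illustration names, not objects from this notebook). The key point is that a new post goes through the already fitted vectorizer before `transform` is called on the LDA model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for the training posts.
toy_posts = [
    "trump election vote fraud",
    "biden election vote media",
    "welcome to parler follow share",
    "follow share welcome new friends",
    "election fraud media president",
    "welcome follow parler share",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_posts)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_counts)

# A single unseen post must be vectorized with the same fitted vocabulary.
new_post = ["please follow and share this election video"]
new_counts = toy_cv.transform(new_post)

# transform() returns one row per post; each row is a topic distribution.
topic_probs = toy_lda.transform(new_counts)
print(topic_probs.round(3))
print("most likely topic:", topic_probs[0].argmax())
```

With the real objects, the equivalent call would be `lda.transform(cv.transform([some_post]))`.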

In [14]:
len(lda.components_)

5

In [15]:
for each in lda.components_:
    print(len(each))

12541
12541
12541
12541
12541
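
Each row of `components_` holds one weight per vocabulary term, which is why every row has the same length as the feature list. These weights are unnormalized; dividing each row by its sum gives a per-topic word distribution. A small sketch on a hypothetical toy corpus (all `toy_*` names are made up for the example):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, for illustration only.
toy_posts = [
    "trump election vote fraud",
    "biden election vote media",
    "welcome to parler follow share",
    "follow share welcome new friends",
]

toy_counts = CountVectorizer().fit_transform(toy_posts)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_counts)

# Shape is (n_topics, n_vocabulary_terms).
print(toy_lda.components_.shape)

# Normalize each row so it sums to 1: a word distribution per topic.
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1)[:, np.newaxis]
print(topic_word.sum(axis=1))
```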


In [17]:
features = cv.get_feature_names_out()

In [22]:
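def display_topics(model, feature_names, num_words):
    """Print the num_words highest-weighted words for each topic in model."""
    for ix, topic in enumerate(model.components_):
        print(f'Topic {ix}:')
        # argsort is ascending, so the reversed slice takes the top num_words indices
        words = [feature_names[i] for i in topic.argsort()[:-num_words - 1:-1]]
        print('\n'.join(words))
        print()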


display_topics(lda, features, 10)

Topic 0:
right
great
share
follow
welcome
new
parler
life
work
god

Topic 1:
parler
looking
forward
face
just
alive
truth
free
speech
joined

Topic 2:
good
parler
people
check
youtube
page
videos
looking
data
channel

Topic 3:
like
just
dont
time
need
thank
know
people
yes
god

Topic 4:
trump
president
biden
people
election
just
vote
know
fraud
media
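
The slice `topic.argsort()[:-num_words - 1:-1]` used in `display_topics` can be hard to read. The sketch below works through it on a made-up weight vector and compares it with a more explicit equivalent.

```python
import numpy as np

# Made-up topic weights for a four-word vocabulary.
weights = np.array([0.1, 5.0, 2.0, 7.0])
num_words = 2

# argsort returns indices in ascending order of weight.
order = weights.argsort()

# Walk backwards from the end and stop after num_words items,
# which yields the indices of the largest weights first.
top = order[:-num_words - 1:-1]

# Equivalent and arguably clearer: reverse, then take the first num_words.
alt = np.argsort(weights)[::-1][:num_words]

print(order, top, alt)
```

Both forms pick index 3 (weight 7.0) and then index 1 (weight 5.0).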



In [1]:
lda_8 = LatentDirichletAllocation(n_components=8,
                                  random_state=42
                                 )
lda_8.fit(cv_train)

In [24]:
# Note: the variable keeps the name lda_8, but this model is fit with 25 topics.
lda_8 = LatentDirichletAllocation(n_components=25,
                                  random_state=42
                                 )
lda_8.fit(cv_train)

LatentDirichletAllocation(n_components=25, random_state=42)
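
Trying topic counts by hand works, but `GridSearchCV` (already imported above) can also compare candidates. The sketch below is one possible approach under stated assumptions, not the method used in this notebook: it runs on a hypothetical toy corpus and relies on the default scoring, which for `LatentDirichletAllocation` is its `score` method (an approximate log-likelihood). The grid values and `max_iter=10` are arbitrary choices for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus, for illustration only.
toy_posts = [
    "trump election vote fraud",
    "biden election vote media",
    "election fraud media president",
    "president trump vote election",
    "welcome to parler follow share",
    "follow share welcome new friends",
    "welcome follow parler share",
    "new parler friends welcome",
]

toy_counts = CountVectorizer().fit_transform(toy_posts)

# With no scoring argument, GridSearchCV calls the estimator's own score(),
# so higher (less negative) held-out log-likelihood wins.
search = GridSearchCV(
    LatentDirichletAllocation(random_state=42, max_iter=10),
    param_grid={"n_components": [2, 3, 4]},
    cv=2,
)
search.fit(toy_counts)

print(search.best_params_)
print(search.best_score_)
```

Log-likelihood is only one criterion, and it does not guarantee topics that are easy to interpret, so reading the top words per topic remains a useful check.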

In [25]:
len(lda_8.components_)

25

In [26]:
display_topics(lda_8, features, 10)


## Examine words in topics