# LDA by user

Each vector is a distinct post. The fitted model would then be applied to a single new post to estimate its probability of belonging to each topic cluster.

Code adapted from:
- General Assembly lesson by Caroline Schmitt: Intro to Latent Dirichlet Allocation
- Machine Learning Plus tutorial by Selva Prabhakaran: https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.decomposition import LatentDirichletAllocation 
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(42)

## Import preprocessed data

In [2]:
df = pd.read_csv('data/posts_by_use_sample.csv')

In [3]:
df.head()

Unnamed: 0,username,body
0,-generaldebellis,Florida loves our Great Governor DeSantis
1,0,Free speech is a beautiful thing thank you f...
2,0000000000000111,"If I could wrap my hands around your throat, I..."
3,00004kaptain,These Black female soldiers brought order to c...
4,0019,لا عاد ما نبي نجيب كلمة هوامير معنا نبي نخترع ...


In [4]:
# TODO: move this dropna into the import step for the full DataFrame, then remove it from here
df = df.dropna().reset_index(drop=True)

In [5]:
df.head()

Unnamed: 0,username,body
0,-generaldebellis,Florida loves our Great Governor DeSantis
1,0,Free speech is a beautiful thing thank you f...
2,0000000000000111,"If I could wrap my hands around your throat, I..."
3,00004kaptain,These Black female soldiers brought order to c...
4,0019,لا عاد ما نبي نجيب كلمة هوامير معنا نبي نخترع ...


## Train test split

In [6]:
X = df['body']

In [7]:
X_train, X_test = train_test_split(X, test_size=0.1, random_state=42)

## Count Vectorize

In [26]:
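# Start from sklearn's built-in English stop words and add contraction fragments
# ('ve', 're') that the default tokenizer leaves behind, e.g. "we've" -> "we", "ve".
default_words = list(CountVectorizer(stop_words='english').get_stop_words())
my_additions = ['ve', 're']
custom_stopwords = default_words + my_additions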

In [27]:
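cv = CountVectorizer(
    min_df=3,  # ignore terms that appear in fewer than 3 posts
    stop_words=custom_stopwords,
)
cv.fit(X_train)

cv_train = cv.transform(X_train)
cv_test = cv.transform(X_test)

# Dense preview of the document-term matrix (toarray() returns an ndarray rather than np.matrix)
cv_df = pd.DataFrame(cv_train.toarray(), columns=cv.get_feature_names_out())
cv_df.head(2)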

Unnamed: 0,00,000,00pm,01,02,03,038,04,05,06,...,گفتم,گفتن,گوش,گیر,یا,یادت,یعنی,یه,یک,یکی
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Fit LDA model

In [28]:
lda = LatentDirichletAllocation(n_components=5, 
                                random_state=42
                               )
lda.fit(cv_train)

LatentDirichletAllocation(n_components=5, random_state=42)

In [29]:
len(lda.components_)

5

In [30]:
for each in lda.components_:
    print(len(each))

16929
16929
16929
16929
16929


In [31]:
features = cv.get_feature_names_out()

In [32]:
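def display_topics(model, feature_names, num_words):
    """Print the num_words highest-weighted terms for each topic of a fitted model."""
    for ix, topic in enumerate(model.components_):
        print(f'Topic {ix}:')
        # argsort is ascending, so the reversed slice yields the top num_words indices
        words = [feature_names[i] for i in topic.argsort()[:-num_words - 1:-1]]
        print('\n'.join(words))
        print()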


display_topics(lda, features, 10)

Topic 0:
trump
news
biden
election
president
fox
vote
yes
votes
watch

Topic 1:
like
just
lol
don
know
right
good
got
think
ll

Topic 2:
twitter
thanks
man
covid
know
good
yep
facebook
new
vaccine

Topic 3:
people
trump
just
don
time
president
going
biden
need
country

Topic 4:
god
thank
love
great
president
good
amen
people
bless
truth
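
The stated goal is to score a single new post against the fitted topics. A minimal sketch of that step, assuming a small stand-in model fitted on a hypothetical toy corpus (the posts, `new_post`, and all `toy_*` names are illustrative): the new text goes through the already-fitted vectorizer, and `transform` on the LDA model returns one probability per topic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for the training posts.
toy_posts = [
    "trump biden election vote",
    "election vote fraud president",
    "biden president election",
    "god bless america amen",
    "thank god love truth",
    "god bless president amen",
]

toy_cv = CountVectorizer(stop_words='english')
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_cv.fit_transform(toy_posts))

# Score one unseen post: reuse the fitted vectorizer (transform, not fit_transform).
new_post = ["god bless the president"]
doc_topic = toy_lda.transform(toy_cv.transform(new_post))

print(doc_topic.round(3))        # one row per post, one probability per topic
print(doc_topic.argmax(axis=1))  # index of the most probable topic
```

With the notebook's own objects the equivalent call would be `lda.transform(cv.transform([...]))`; words outside the training vocabulary are simply ignored by the vectorizer.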



In [33]:
lda_8 = LatentDirichletAllocation(n_components=8,
                                  random_state=42)
lda_8.fit(cv_train)

LatentDirichletAllocation(n_components=8, random_state=42)

In [34]:
len(lda_8.components_)

8

In [35]:
display_topics(lda_8, features, 10)

Topic 0:
news
trump
fox
yes
watch
new
maga
trump2020
video
biden

Topic 1:
just
right
like
ll
know
got
better
said
trump
true

Topic 2:
twitter
thanks
facebook
من
fb
share
hi
bye
بس
انا

Topic 3:
trump
president
election
biden
vote
fraud
people
going
time
need

Topic 4:
god
thank
love
president
amen
bless
great
truth
trump
america

Topic 5:
like
que
black
white
não
sure
just
em
com
da

Topic 6:
people
just
don
like
country
want
know
think
time
right

Topic 7:
people
good
like
lol
don
just
agree
biden
covid
joe
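
The topic counts in this notebook are compared by reading the top words. One rough quantitative complement, sketched below under stated assumptions, is to compare held-out perplexity across candidate values of `n_components` (lower is nominally better, though it does not always agree with human judgments of topic quality). The toy corpus, the candidate values, and all `toy_*` names are hypothetical; on the real data the held-out matrix would be `cv_test`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus, repeated so the held-out split shares vocabulary with training.
toy_posts = [
    "trump biden election vote",
    "election fraud vote count",
    "god bless america amen",
    "thank god love truth",
    "covid vaccine mask virus",
    "mask covid virus news",
    "twitter facebook share video",
    "fox news watch video",
] * 3

toy_train, toy_test = train_test_split(toy_posts, test_size=0.25, random_state=42)

toy_cv = CountVectorizer(stop_words='english').fit(toy_train)
train_counts = toy_cv.transform(toy_train)
test_counts = toy_cv.transform(toy_test)

# Fit one model per candidate topic count and record perplexity on held-out posts.
perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=42).fit(train_counts)
    perplexities[k] = model.perplexity(test_counts)
    print(f'n_components={k}: held-out perplexity = {perplexities[k]:.1f}')
```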



In [36]:
lda_25 = LatentDirichletAllocation(n_components=25,
                                   random_state=42)
lda_25.fit(cv_train)

LatentDirichletAllocation(n_components=25, random_state=42)

In [37]:
len(lda_25.components_)

25

In [38]:
display_topics(lda_25, features, 10)

Topic 0:
states
follow
united
red
patriots
lock
funny
hope
blue
traitors

Topic 1:
que
em
com
não
da
ha
se
para
um
glad

Topic 2:
thanks
من
bye
بس
انا
followed
الله
ما
ya
sorry

Topic 3:
china
court
supreme
chinese
war
win
going
america
trump
case

Topic 4:
god
thank
bless
president
pray
jesus
evil
lord
america
trump

Topic 5:
matter
black
christmas
lives
la
said
hello
merry
like
el

Topic 6:
just
like
hey
democrats
country
bitch
people
stupid
dumb
time

Topic 7:
good
lol
like
exactly
day
people
idea
oh
just
great

Topic 8:
people
just
trump
know
way
country
corrupt
state
need
right

Topic 9:
like
mask
echo
virus
looks
masks
going
let
just
covid

Topic 10:
soros
just
new
patriot
usa
biden
george
kamala
interesting
harris

Topic 11:
term
truth
makes
sick
limits
just
listen
haha
years
knows

Topic 12:
time
law
police
yeah
hell
better
people
just
share
long

Topic 13:
great
awesome
did
people
make
job
big
work
good
america

Topic 14:
twitter
maga
vote
trump
trump2020
facebook
republican
s

## Examine words in topics