Topic modeling: assigning documents to specific topics (a form of dimensionality reduction)

- NMF (Non-negative Matrix Factorization)

- SVD (Singular Value Decomposition)

- LDA (Latent Dirichlet Allocation): a Bayesian probabilistic generative model

    Classifies documents by probabilistically analyzing which topics their words are likely to relate to.
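The generative story behind LDA can be written out as follows. The symbols are introduced here for illustration and do not appear elsewhere in these notes: $K$ is the number of topics, $\alpha$ and $\beta$ are the Dirichlet priors, $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word distribution of topic $k$, and $z_{d,n}$ is the topic assigned to the $n$-th word $w_{d,n}$ of document $d$.

```latex
\begin{aligned}
\phi_k   &\sim \mathrm{Dirichlet}(\beta)            && k = 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)           && \text{for each document } d \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d)       && \text{for each word position } n \text{ in } d \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
```

Fitting the model means inferring $\theta_d$ and $\phi_k$ from the observed words alone.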

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("../datasets/imdb_sentiment.csv")
df
# 0: negative / 1: positive

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
...,...,...
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# max_features: number of terms to keep, taken from the most frequent down
# max_df: exclude terms that appear in more than this fraction of documents
cv = CountVectorizer(max_features=1000, max_df=.15)
x = cv.fit_transform(df["review"])
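To make the two parameters concrete, here is a rough pure-Python sketch of the kind of filtering `max_df` and `max_features` perform. It is not scikit-learn's implementation: the toy corpus, the `build_vocabulary` helper, the whitespace tokenization, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus (not taken from the IMDB data)
docs = [
    "the movie was great",
    "the movie was awful",
    "the plot was great",
    "the acting was awful awful",
]

def build_vocabulary(docs, max_features, max_df):
    """Illustrative helper: filter terms by document frequency, then by count."""
    tokenized = [doc.split() for doc in docs]
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency: total occurrences of each term across the corpus
    term_freq = Counter(term for tokens in tokenized for term in tokens)
    n_docs = len(docs)
    # max_df: drop terms found in more than this fraction of documents
    kept = [t for t in term_freq if doc_freq[t] / n_docs <= max_df]
    # max_features: keep only the most frequent of the remaining terms
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])

vocab = build_vocabulary(docs, max_features=3, max_df=0.5)
print(vocab)
```

In this toy corpus "the" and "was" occur in every document, so `max_df=0.5` removes them before the frequency cut-off is applied. The same idea explains why `max_df=.15` in the cell above discards words that show up in more than 15% of the reviews.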

In [6]:
from sklearn.decomposition import LatentDirichletAllocation

In [8]:
# Fit LDA with 10 topics
lda = LatentDirichletAllocation(n_components=10, learning_method="batch", max_iter=25, random_state=0, n_jobs=-1)
topics = lda.fit_transform(x)

In [None]:
# Probability of each topic for each document (higher value = stronger association)
topics

array([[0.3878013 , 0.38698177, 0.00204136, ..., 0.00204111, 0.00204129,
        0.00204113],
       [0.07488885, 0.0019236 , 0.00192368, ..., 0.18561466, 0.00192364,
        0.00192357],
       [0.13752401, 0.17370466, 0.19819481, ..., 0.00137024, 0.00137027,
        0.08804367],
       ...,
       [0.29507083, 0.103182  , 0.00344904, ..., 0.00344962, 0.00344932,
        0.00344922],
       [0.12626781, 0.21144047, 0.00526428, ..., 0.62543944, 0.00526661,
        0.00526371],
       [0.00909195, 0.00909281, 0.00909392, ..., 0.00909345, 0.79503063,
        0.13222446]])

In [10]:
topics.shape

(50000, 10)
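Since each row is a distribution over the topics, a document can be assigned to its most probable topic by taking the position of the row's largest value. A minimal sketch on hypothetical rows (the values and the `dominant_topic` helper are made up for illustration and are not part of the notebook):

```python
# Hypothetical document-topic rows (3 documents, 4 topics); each row sums to 1,
# mimicking the shape of what lda.fit_transform(x) returns.
doc_topics = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10],
    [0.25, 0.40, 0.20, 0.15],
]

def dominant_topic(row):
    """Return the index of the topic with the highest probability."""
    return max(range(len(row)), key=row.__getitem__)

assignments = [dominant_topic(row) for row in doc_topics]
print(assignments)
```

On the NumPy array above, `topics.argmax(axis=1)` should give the same kind of assignment for all 50,000 reviews in one call.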

In [None]:
# Importance (weight) of each word within each topic
lda.components_

array([[5.98838153e+01, 1.16855610e+01, 1.02667502e+01, ...,
        6.12098071e+00, 1.00022418e-01, 1.00006700e-01],
       [2.18891627e+02, 1.95424852e+02, 1.81741443e+02, ...,
        6.92030901e+02, 2.61037853e-01, 1.00006246e-01],
       [1.50507956e+02, 1.18038474e-01, 4.77710417e+00, ...,
        5.97564864e+00, 1.56310673e+02, 1.00005190e-01],
       ...,
       [7.03612278e+02, 2.09670105e+02, 3.35255864e+02, ...,
        2.31117679e+00, 7.08006373e+02, 1.00011532e-01],
       [2.05039825e+03, 3.60290033e+00, 7.70470048e+01, ...,
        5.37425804e+01, 1.21895222e+02, 1.00003452e-01],
       [8.59103849e+02, 2.70817323e+00, 7.65850911e+00, ...,
        1.01514984e+02, 1.75465526e+02, 1.00004619e-01]])

In [12]:
lda.components_.shape

(10, 1000)
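The rows of `components_` are unnormalized weights, so dividing each row by its sum turns it into a word distribution for that topic. A small sketch on hypothetical weights (the vocabulary, the numbers, and the `normalize` and `top_words` helpers are assumptions for illustration):

```python
# Hypothetical topic-word weights (2 topics, 4 words), standing in for lda.components_
vocab = ["horror", "family", "blood", "mother"]
weights = [
    [30.0, 2.0, 16.0, 2.0],
    [1.0, 24.0, 1.0, 14.0],
]

def normalize(row):
    """Scale a row of weights so that it sums to 1."""
    total = sum(row)
    return [w / total for w in row]

topic_word = [normalize(row) for row in weights]

def top_words(dist, vocab, k):
    """Return the k highest-weighted words with their share in percent."""
    order = sorted(range(len(dist)), key=lambda j: dist[j], reverse=True)
    return [(vocab[j], round(dist[j] * 100, 2)) for j in order[:k]]

for idx, dist in enumerate(topic_word, start=1):
    print(f"topic {idx}:", top_words(dist, vocab, k=2))
```

With the real model, the NumPy equivalent of the normalization step is `lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]`.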

In [13]:
import numpy as np

In [14]:
feature_names = np.array(cv.get_feature_names_out())
feature_names

array(['10', '15', '20', '30', '80', '90', 'able', 'above', 'absolutely',
       'across', 'act', 'acted', 'action', 'actor', 'actors', 'actress',
       'actual', 'actually', 'add', 'admit', 'adventure', 'again',
       'against', 'age', 'ago', 'agree', 'air', 'alien', 'alive',
       'almost', 'alone', 'along', 'already', 'although', 'always', 'am',
       'amazing', 'america', 'american', 'among', 'amount', 'amusing',
       'animation', 'annoying', 'another', 'anti', 'anyone', 'anything',
       'anyway', 'apart', 'apparently', 'appear', 'appears', 'appreciate',
       'aren', 'around', 'art', 'ask', 'atmosphere', 'attempt',
       'attempts', 'attention', 'audience', 'average', 'avoid', 'away',
       'awful', 'baby', 'background', 'badly', 'band', 'based', 'basic',
       'basically', 'battle', 'beautiful', 'beauty', 'became', 'become',
       'becomes', 'before', 'begin', 'beginning', 'begins', 'behind',
       'believable', 'believe', 'ben', 'between', 'beyond', 'big',
       '

In [None]:
# Print the top 10 words of each topic with their share (%) of the topic's total weight
for idx, words in enumerate(lda.components_):
    total = words.sum()
    largest = words.argsort()[::-1]
    print(f"topic {idx+1 : 2}", end=" : ")

    for i in range(0, 10):
        print(f"{feature_names[largest[i]]}({words[largest[i]] * 100 / total:.2f})", end=" ")

    print()

topic  1 : action(1.30) john(0.95) director(0.76) cast(0.76) role(0.70) black(0.70) star(0.68) new(0.64) plays(0.62) michael(0.60) 
topic  2 : family(1.97) young(1.88) father(1.67) woman(1.35) mother(1.33) wife(1.29) old(1.18) girl(1.11) son(1.10) home(0.93) 
topic  3 : us(0.84) world(0.80) seems(0.73) director(0.72) between(0.67) work(0.66) real(0.62) point(0.62) may(0.62) own(0.60) 
topic  4 : horror(3.41) effects(1.20) pretty(1.12) killer(1.06) blood(0.98) gore(0.97) budget(0.88) scary(0.80) special(0.79) dead(0.77) 
topic  5 : book(2.45) music(2.39) version(1.57) original(1.13) read(1.09) song(1.05) dvd(1.01) songs(0.87) saw(0.83) years(0.83) 
topic  6 : worst(1.48) actors(1.06) minutes(1.05) script(0.99) awful(0.96) nothing(0.94) terrible(0.88) waste(0.84) money(0.79) boring(0.78) 
topic  7 : show(4.48) series(3.90) tv(2.28) war(2.26) years(1.92) episode(1.89) new(1.12) shows(1.09) episodes(1.04) television(1.01) 
topic  8 : guy(1.24) going(1.05) didn(1.02) re(0.99) now(0.86) want