# Latent Dirichlet Allocation
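- One of several topic modeling methods.
    - Topic modeling: <br> Under the assumption that each text (a sentence or a paragraph) is a mixture of topics (categories of information), topic modeling uses statistical evidence, such as how often topic-related words appear, to identify the topics a text contains. <br> There are several approaches. One of them uses NMF (non-negative matrix factorization) to compress the features (words) into a small number of latent dimensions and reads the topics off those dimensions.
- Based on the word distribution of each topic, LDA analyzes the word counts of a given text to predict which topics the text covers.
- It follows a line of methods that starts with TF-IDF and runs through latent semantic analysis (LSA, also called latent semantic indexing, LSI) and probabilistic latent semantic analysis (pLSA). The "Dirichlet" in the name refers to the Dirichlet prior placed on each document's topic distribution (and, in the smoothed variant, on each topic's word distribution).
- (A personal opinion) The name "latent" seems to come from the same idea as in clustering: after clustering, we cannot say exactly what each centroid is, but we assume that it "carries some latent meaning that reflects the features of each dimension." In the same way, LDA uses the words to uncover latent meanings (topics). (Thinking of NMF makes this easier to grasp.)
- However, as with bag of words, it does not capture word order. In other words, it assumes that words are exchangeable.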
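
The notes above describe LDA in terms of per-topic word distributions and Dirichlet priors. As a rough illustration, the generative story that LDA assumes can be sketched with the standard library alone. Everything here is hypothetical and chosen for illustration: the toy vocabulary, the `alpha` and `beta` values, and the helper names `sample_dirichlet` and `generate_document`. Drawing the topic-word distributions from a Dirichlet corresponds to the smoothed variant of the model. This is a sketch of the assumed data-generating process, not of how gensim fits the model.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document under the LDA generative assumptions."""
    # Per-document topic mixture, drawn from a Dirichlet prior
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to theta
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Pick a word from that topic's word distribution
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["empir", "war", "histor", "set", "axiom", "algebra"]  # toy vocabulary
beta = [0.5] * len(vocab)  # assumed prior over words within a topic
num_topics = 2

# One word distribution per topic
topic_word = [sample_dirichlet(beta, rng) for _ in range(num_topics)]

theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic mixture:", theta)
print("generated words:", words)
```

Each word is drawn independently given the document's topic mixture, so shuffling `words` does not change its probability under the model. That is the exchangeability assumption mentioned in the last bullet.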

In [1]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora

In [2]:
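# Load the input data: one document per line
def load_data(input_file):
    data = []
    with open(input_file, "r") as fp:
        for line in fp:
            # Strip only the trailing newline; line[:-1] would also cut the
            # last character of a final line that has no newline
            data.append(line.rstrip("\n"))

    return data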

In [6]:
# Tokenize, remove stop words, and reduce each word to its stem
# (requires the NLTK stop-word corpus: nltk.download("stopwords"))
def process(input_text):
    # Create the tokenizer, the stemmer, and the stop-word set
    tokenizer = RegexpTokenizer(r"\w+")
    stemmer = SnowballStemmer("english")
    stop_words = set(stopwords.words("english"))

    tokens = tokenizer.tokenize(input_text.lower())
    tokens = [x for x in tokens if x not in stop_words]
    tokens_stemmed = [stemmer.stem(x) for x in tokens]

    return tokens_stemmed

In [7]:
data = load_data("data.txt")

In [14]:
data

['The Roman empire expanded very rapidly and it was the biggest empire in the world for a long time.',
 'An algebraic structure is a set with one or more finitary operations defined on it that satisfies a list of axioms.',
 'Renaissance started as a cultural movement in Italy in the Late Medieval period and later spread to the rest of Europe.',
 'The line of demarcation between prehistoric and historical times is crossed when people cease to live only in the present.',
 'Mathematicians seek out patterns and use them to formulate new conjectures.  ',
 'A notational symbol that represents a number is called a numeral in mathematics. ',
 'The process of extracting the underlying essence of a mathematical concept is called abstraction.',
 'Historically, people have frequently waged wars against each other in order to expand their empires.',
 'Ancient history indicates that various outside influences have helped formulate the culture and traditions of Eastern Europe.',
 'Mappings between se

In [8]:
tokens = [process(x) for x in data]

In [15]:
tokens

[['roman',
  'empir',
  'expand',
  'rapid',
  'biggest',
  'empir',
  'world',
  'long',
  'time'],
 ['algebra',
  'structur',
  'set',
  'one',
  'finitari',
  'oper',
  'defin',
  'satisfi',
  'list',
  'axiom'],
 ['renaiss',
  'start',
  'cultur',
  'movement',
  'itali',
  'late',
  'mediev',
  'period',
  'later',
  'spread',
  'rest',
  'europ'],
 ['line',
  'demarc',
  'prehistor',
  'histor',
  'time',
  'cross',
  'peopl',
  'ceas',
  'live',
  'present'],
 ['mathematician', 'seek', 'pattern', 'use', 'formul', 'new', 'conjectur'],
 ['notat', 'symbol', 'repres', 'number', 'call', 'numer', 'mathemat'],
 ['process',
  'extract',
  'under',
  'essenc',
  'mathemat',
  'concept',
  'call',
  'abstract'],
 ['histor', 'peopl', 'frequent', 'wage', 'war', 'order', 'expand', 'empir'],
 ['ancient',
  'histori',
  'indic',
  'various',
  'outsid',
  'influenc',
  'help',
  'formul',
  'cultur',
  'tradit',
  'eastern',
  'europ'],
 ['map',
  'set',
  'preserv',
  'structur',
  'special',

In [16]:
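# Map every unique token to an integer id
dict_tokens = corpora.Dictionary(tokens)
# Convert each document into a bag of words: a list of (token id, count) pairs
doc_term_mat = [dict_tokens.doc2bow(doc_tokens) for doc_tokens in tokens]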

In [21]:
doc_term_mat

[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1)],
 [(18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1)],
 [(6, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1)],
 [(39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1)],
 [(46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1)],
 [(46, 1), (47, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1)],
 [(1, 1), (2, 1), (33, 1), (36, 1), (59, 1), (60, 1), (61, 1), (62, 1)],
 [(18, 1),
  (19, 1),
  (40, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1)],
 [(16, 1),
  (17, 1),
  (47, 1),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 1),
  (76, 1),
  (77, 1)]]
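
To make the `(token id, count)` pairs above easier to read, here is a minimal pure-Python sketch of the same idea. It is not gensim's implementation; the helpers `build_vocab` and `to_bow` are hypothetical names, and the id assignment (new tokens of each document get ids in alphabetical order) is an assumption chosen to mimic the pattern visible in the output above. Only two of the documents are used.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to every distinct token (hypothetical helper)."""
    token2id = {}
    for doc in docs:
        # Assumption: unseen tokens of a document get ids in alphabetical order
        for tok in sorted(set(doc)):
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert a token list into sorted (token id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [
    ["roman", "empir", "expand", "rapid", "biggest", "empir", "world", "long", "time"],
    ["histor", "peopl", "frequent", "wage", "war", "order", "expand", "empir"],
]

token2id = build_vocab(docs)
bow = [to_bow(doc, token2id) for doc in docs]

print(bow[0])
# [(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]

# Word order is discarded: a reordered document yields the same bag of words
print(to_bow(list(reversed(docs[0])), token2id) == bow[0])
# True
```

The pair `(1, 2)` in the first document is the stem `empir`, which occurs twice. The second document reuses ids 1 and 2 for `empir` and `expand` because both already appear in the first document.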

In [9]:
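num_topics = 2  # number of topics to extract
# passes is the number of full passes over the corpus during training;
# results vary from run to run unless random_state is fixed
ldamodel = models.ldamodel.LdaModel(doc_term_mat, num_topics=num_topics, id2word=dict_tokens, passes=25)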

In [10]:
num_words = 5  # number of top words to show for each topic

In [23]:
ldamodel.print_topics(num_topics=num_topics, num_words=num_words)

[(0,
  '0.022*"cultur" + 0.022*"europ" + 0.022*"formul" + 0.022*"call" + 0.022*"time"'),
 (1,
  '0.027*"expand" + 0.027*"structur" + 0.027*"set" + 0.027*"empir" + 0.027*"histor"')]
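
Each topic in the output above is a formatted string of `weight*"word"` terms joined by ` + `. As a small sketch, such a string can be turned into `(word, weight)` tuples with a regular expression, which also removes the quotes around each word. The helper name `parse_topic_string` is hypothetical, and the sample string is copied from topic 0 above.

```python
import re


def parse_topic_string(topic_string):
    """Parse '0.022*"cultur" + 0.022*"europ"' into [("cultur", 0.022), ...]."""
    # Capture the numeric weight before '*' and the quoted word after it
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


topic_0 = '0.022*"cultur" + 0.022*"europ" + 0.022*"formul" + 0.022*"call" + 0.022*"time"'
parsed = parse_topic_string(topic_0)

for word, weight in parsed:
    print(word, "==>", str(round(weight * 100, 2)) + "%")
```

Working with tuples rather than raw strings makes it easier to sort or filter the words afterwards.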

In [13]:
print("\nTop " + str(num_words) + " contributing words to each topic:")
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print("\nTopic", item[0])

    # Each topic string looks like '0.022*"cultur" + 0.022*"europ" + ...'
    list_of_strings = item[1].split(" + ")
    for text in list_of_strings:
        # Split each term into its weight and its (still quoted) word
        weight, word = text.split("*")
        print(word, "==>", str(round(float(weight) * 100, 2)) + "%")


Top 5 contributing words to each topic:

Topic 0
"cultur" ==> 2.2%
"europ" ==> 2.2%
"formul" ==> 2.2%
"call" ==> 2.2%
"time" ==> 2.2%

Topic 1
"expand" ==> 2.7%
"structur" ==> 2.7%
"set" ==> 2.7%
"empir" ==> 2.7%
"histor" ==> 2.7%
