Topic modeling works by identifying the key words in a document that signal its topics. These words determine what the document is about.

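1. We use the regular-expression tokenizer RegexpTokenizer, because we only want words, not punctuation or other tokens.

2. Removing stop words is another important step; it filters out noise such as "is" and "the".

3. After that, we stem the tokens to reduce each word to its stem.

These three steps are bundled into a Preprocessor class.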
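To make the three steps concrete, here is a dependency-free sketch that uses only the standard library. It is not the Preprocessor class used in this post: `STOP_WORDS` is a tiny hypothetical list, and `naive_stem` is a crude stand-in for a real stemmer, so its stems can differ from what NLTK produces.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "of", "the", "is", "he", "to", "in"}

def naive_stem(word):
    # Crude stand-in for a real stemmer: strip one common suffix
    # if at least three characters of the word remain.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"\w+", text.lower())            # step 1: keep word characters only
    tokens = [t for t in tokens if t not in STOP_WORDS]  # step 2: drop stop words
    return [naive_stem(t) for t in tokens]               # step 3: stem what is left

result = preprocess("He spent a lot of time studying cryptography.")
print(result)  # ['spent', 'lot', 'time', 'study', 'cryptography']
```

A real stemmer applies many more rules, which is why the notebook relies on NLTK rather than hand-written suffix stripping.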

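The modeling technique used here is Latent Dirichlet Allocation (LDA). LDA treats a document as a mixture of topics, so a single document can have several topics, and it reports each document's topics as a probability distribution.

Each word in a document is generated by one of the document's topics, and together the words make up the document. Each topic generates its words with certain probabilities. We observe the words, and the goal is to recover the topics.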
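The generative story above can be sketched in a few lines. The topic names, words, and probabilities below are hypothetical toy values, not learned from any data, and the sketch keeps the distributions fixed, whereas full LDA also draws them from Dirichlet priors.

```python
import random

# Hypothetical toy parameters: each topic is a distribution over words.
topic_word = {
    "crypto": {"encrypt": 0.5, "code": 0.3, "messag": 0.2},
    "sport": {"team": 0.4, "match": 0.4, "ball": 0.2},
}
# This document's topic mixture (also a toy assumption).
doc_topic = {"crypto": 0.7, "sport": 0.3}

def generate_document(n_words, rng):
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic for this word position from the document's mixture.
        topic = rng.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        # Step 2: pick a word from that topic's word distribution.
        dist = topic_word[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append((topic, word))
    return words

rng = random.Random(0)  # seeded so repeated runs give the same sample
doc = generate_document(6, rng)
print(doc)
```

Fitting LDA runs this story in reverse: only the words are observed, and the model estimates the topic mixture of each document and the word distribution of each topic.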

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora
from nltk.corpus import stopwords

In [3]:
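# Load the input file, one document per line
def load_data(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            # Strip only the trailing newline; line[:-1] would drop a real
            # character if the last line has no newline
            data.append(line.rstrip('\n'))
    return data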

## Define a class to preprocess text

In [8]:
class Preprocessor(object):
    # Initialize various operators
    def __init__(self):
        # Create a regular expression tokenizer
        self.tokenizer = RegexpTokenizer(r'\w+')

        # Stop-word list from the NLTK corpus; these common words
        # (is, in, the, ...) are excluded from the analysis
        self.stop_words_english = stopwords.words('english')

        # Create a stemmer
        self.stemmer = SnowballStemmer('english')

    # Define a processor method that takes care of
    # tokenization, stop-word removal, and stemming
    def process(self, input_text):
        tokens = self.tokenizer.tokenize(input_text.lower())  # tokenize the string

        tokens_stopwords = [x for x in tokens if x not in self.stop_words_english]  # remove the stop words

        tokens_stemmed = [self.stemmer.stem(x) for x in tokens_stopwords]  # perform stemming on the tokens

        return tokens_stemmed

In [9]:
input_file = 'data_topic_modeling.txt'
data = load_data(input_file)

In [13]:
data

['He spent a lot of time studying cryptography. ',
 'You need to have a very good understanding of modern encryption systems in order to work there.',
 "If their team doesn't win this match, they will be out of the competition.",
 'Those codes are generated by a specialized machine. ',
 'The club needs to develop a policy of training and promoting younger talent. ',
 'His movement off the ball is really great. ',
 'In order to evade the defenders, he needs to move swiftly.',
 'We need to make sure only the authorized parties can read the message.']

In [10]:
preprocessor = Preprocessor()  # create a preprocessor object

In [11]:
processed_tokens = [preprocessor.process(x) for x in data] # Create a list for processed documents

In [12]:
processed_tokens

[['spent', 'lot', 'time', 'studi', 'cryptographi'],
 ['need',
  'good',
  'understand',
  'modern',
  'encrypt',
  'system',
  'order',
  'work'],
 ['team', 'win', 'match', 'competit'],
 ['code', 'generat', 'special', 'machin'],
 ['club', 'need', 'develop', 'polici', 'train', 'promot', 'younger', 'talent'],
 ['movement', 'ball', 'realli', 'great'],
 ['order', 'evad', 'defend', 'need', 'move', 'swift'],
 ['need', 'make', 'sure', 'author', 'parti', 'read', 'messag']]

### Convert the tokens into a dictionary

In [14]:
dict_tokens = corpora.Dictionary(processed_tokens) # Create a dictionary from the tokenized documents, used for topic modeling
dict_tokens

<gensim.corpora.dictionary.Dictionary at 0x11441f828>

In [15]:
# Build the document-term matrix
corpus = [dict_tokens.doc2bow(text) for text in processed_tokens]
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
 [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(13, 1), (14, 1), (15, 1), (16, 1)],
 [(17, 1), (18, 1), (19, 1), (20, 1)],
 [(5, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1)],
 [(28, 1), (29, 1), (30, 1), (31, 1)],
 [(5, 1), (11, 1), (32, 1), (33, 1), (34, 1), (35, 1)],
 [(5, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1)]]
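Each pair in the output above is (token id, count within that document). Id 5, for instance, appears in exactly the four documents that contain the stem 'need'. The sketch below reproduces the idea with the standard library only. `build_vocab` and `to_bow` are hypothetical helpers, and assigning ids in first-seen order is an assumption of this sketch, so the ids need not match the ones gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    # Give every distinct token an integer id, in first-seen order
    # (an assumption of this sketch).
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    # Count how often each known token id occurs in one document.
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [["team", "win", "match"], ["team", "need", "team"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)  # [[(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1)]]
```

Word order is discarded in this representation; only the counts remain.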

### Topic modeling with Latent Dirichlet Allocation (LDA)

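Latent Dirichlet Allocation (LDA) is a topic model that reports the topics of each document as a probability distribution.

It is also an unsupervised learning algorithm: training needs no hand-labeled data, only the document collection and a chosen number of topics k. Another advantage of LDA is that every topic can be described by a handful of words.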

In [20]:
# Generate the LDA model based on the corpus we just created
# Define the required parameters first, then initialize the LDA model object

num_topics = 2  # assume the text can be split into 2 topics
num_words = 4

ldamodel = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dict_tokens, passes=25)

In [23]:
ldamodel.print_topics(num_topics=num_topics,num_words=num_words)

[(0, '0.050*"studi" + 0.050*"lot" + 0.050*"time" + 0.050*"cryptographi"'),
 (1, '0.078*"need" + 0.043*"order" + 0.026*"work" + 0.026*"good"')]

### With the two topics identified, the most-contributed words show how the model separated them

In [26]:
print('Most contributed words to the topics:')
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic',item[0],'==>',item[1])

Most contributed words to the topics:

Topic 0 ==> 0.050*"studi" + 0.050*"lot" + 0.050*"time" + 0.050*"cryptographi"

Topic 1 ==> 0.078*"need" + 0.043*"order" + 0.026*"work" + 0.026*"good"
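The topic strings printed above are convenient to read but awkward to work with in code. As a small sketch, the hypothetical helper `parse_topic` below turns one printed string into (word, weight) pairs using a regular expression; it assumes the `weight*"word"` layout shown in the output.

```python
import re

def parse_topic(topic_string):
    # Each term looks like 0.050*"studi"; capture the weight and the quoted word.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_1 = '0.078*"need" + 0.043*"order" + 0.026*"work" + 0.026*"good"'
parsed = parse_topic(topic_1)
print(parsed)  # [('need', 0.078), ('order', 0.043), ('work', 0.026), ('good', 0.026)]
```

With only four words per topic shown, the weights cover a small part of each topic's distribution, so they do not sum to 1.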
