# Topic modeling
In NLP, a topic does not quite match the dictionary definition of the word; it corresponds more to a fuzzy statistical concept.

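When we read an article, we expect certain words to appear in its title or body, and those words let us grasp what the article is about. An article about Python programming, for example, is likely to contain words such as class and function.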

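Documents usually cover more than one topic. This section discusses topic modeling with non-negative matrix factorization (NMF), which defines an additive model over topics by assigning each topic a different weight.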

## Non-negative matrix factorization (NMF) as a topic modeling algorithm

NMF factorizes a matrix into the product of two smaller matrices, such that none of the three matrices contains a negative value.

In general, the solution of the factorization can only be approximated numerically, and this approximation runs in polynomial time.
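To make the idea of a numerical approximation concrete, here is a minimal NumPy sketch of the multiplicative update rule often attributed to Lee and Seung. It is an illustration only and not necessarily what scikit-learn runs by default; the function name `nmf_multiplicative`, the toy matrix `V`, and the `eps` guard against division by zero are all assumptions made for this example.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=200, eps=1e-10, seed=0):
    """Approximate V ~= W @ H with non-negative W and H (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    # Start from strictly positive random factors.
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Track the Frobenius norm of the residual after each sweep.
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors


# Toy non-negative matrix: 4 "documents" by 3 "terms".
V = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 6.0]])

W, H, errors = nmf_multiplicative(V, n_components=2)
print("W shape:", W.shape, "H shape:", H.shape)
print("first error:", errors[0], "last error:", errors[-1])
```

The two factors are much smaller than `V` when `n_components` is small, and the residual shrinks as the iterations proceed, which is the sense in which the factorization is "approximated".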

We implement it with the NMF class from sklearn. NMF is commonly used for text clustering and signal processing.

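Relevant parameters:

    n_components    number of components; in this example, the number of topics
    max_iter        maximum number of iterations; default 200
    alpha           regularization multiplier; default 0; example values: 10, 2.85 (recent scikit-learn releases use alpha_W and alpha_H instead)
    tol             tolerance that governs the stopping condition; default 1e-4; example values: 1e-3, 1e-2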
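The effect of `tol` and `max_iter` can be seen on a small random matrix. The sketch below is an illustration under stated assumptions: the matrix `V`, the specific parameter values, and the variable names `loose` and `strict` are invented for this example. Regularization is left out because, as far as I know, the single `alpha` argument is no longer accepted by recent scikit-learn releases.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative data: 6 samples by 5 features.
rng = np.random.default_rng(0)
V = rng.random((6, 5))

# A loose tolerance stops early; a strict one keeps iterating (up to max_iter).
loose = NMF(n_components=2, init="nndsvda", max_iter=200,
            tol=1e-2, random_state=0).fit(V)
strict = NMF(n_components=2, init="nndsvda", max_iter=2000,
             tol=1e-6, random_state=0).fit(V)

print("loose:  iterations =", loose.n_iter_, "error =", loose.reconstruction_err_)
print("strict: iterations =", strict.n_iter_, "error =", strict.reconstruction_err_)
```

A smaller `tol` generally costs more iterations in exchange for a slightly lower reconstruction error, so `max_iter` may need to be raised alongside it.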

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [9]:
def letter_only(word):
    """Return True if every character in `word` is a letter."""
    for c in word:
        if not c.isalpha():
            return False
    return True

In [10]:
a = ['abc','123']

In [11]:
letter_only('abc')

True

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import NMF


cv = CountVectorizer(stop_words="english", max_features=500)
groups = fetch_20newsgroups()
cleaned = []
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

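# Data cleaning: keep purely alphabetic tokens that are not person names, then lowercase and lemmatize them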
for post in groups.data:  
    cleaned.append(' '.join([lemmatizer.lemmatize(word.lower())
                             for word in post.split()
                             if letter_only(word)
                             and word not in all_names]))

transformed = cv.fit_transform(cleaned)
nmf = NMF(n_components=100, random_state=43).fit(transformed)  # factorize into 100 topics

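# Print the top 8 terms of each of the 100 topics.
# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed.
feature_names = cv.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    label = '{}: '.format(topic_idx)
    print(label, " ".join([feature_names[i]
                           for i in topic.argsort()[:-9:-1]]))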


0:  wa thought later took left order seen taken
1:  db bit data place stuff add time line
2:  server using display screen support code mouse application
3:  file section information write source change entry number
4:  disk drive hard controller support card board head
5:  entry rule program source number info email build
6:  new york sale change service result study early
7:  image software user package using display include support
8:  window manager application using offer user information course
9:  gun united control house american second national issue
10:  hockey league team game division player list san
11:  turkish government sent war study came american world
12:  program change technology display information version application rate
13:  space nasa technology service national international small communication
14:  government political federal sure free private local country
15:  output line open write read return build section
16:  people country doing tell live killed lot s