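## Step 0: Latent Dirichlet Allocation ##

LDA is used to classify the text in a document into particular topics. It builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

* Each document is modeled as a multinomial distribution over topics, and each topic is modeled as a multinomial distribution over words.
* LDA assumes that the pieces of text we feed into it are related to one another. Choosing the right corpus is therefore crucial.
* It also assumes that documents are produced from a mixture of topics. Those topics then generate words according to their probability distributions.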
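The generative assumptions above can be sketched in a few lines. This is only a toy illustration using the standard library, not how gensim trains a model; the vocabulary, the number of topics, and the `alpha` and `eta` values are made up for the example.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, num_words, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet([alpha] * len(topic_word), rng)
    words = []
    for _ in range(num_words):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choices(range(len(topic_word)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(0)
vocab = ["rain", "bushfir", "drought", "court", "charg", "jail"]  # made-up toy vocabulary
num_topics, eta = 2, 0.5  # hypothetical values, chosen only for illustration
# Each topic is a distribution over the vocabulary, itself drawn from a Dirichlet.
topic_word = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
theta, doc = generate_document(topic_word, vocab, alpha=0.5, num_words=8, rng=rng)
print("topic mixture:", [round(p, 3) for p in theta])
print("document:", doc)
```

Training LDA runs this story in reverse: given only the documents, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.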

## Step 1: Load the dataset

The dataset we will use is a list of over one million news headlines published over a period of 15 years. We start by loading it from the `abcnews-date-text.csv` file.

In [2]:
import pandas as pd

In [6]:
'''
Load the dataset from the CSV and save it to 'data_text'
'''
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('./abcnews-date-text.csv', on_bad_lines='skip')
# We only need the Headlines text column from the data
data_text = data[:300000][['headline_text']].copy()
data_text['index'] = data_text.index

documents = data_text

print(len(data_text))

300000


In [5]:
documents.head(3)

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |


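## Step 2: Preprocess the data ##

We will perform the following steps:

* **Tokenization**: split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Remove short words (the code below keeps only tokens longer than 3 characters).
* Remove all **stopwords**.
* **Lemmatize** the words: words in the third person are changed to the first person, and verbs in past and future tenses are changed to the present tense.
* **Stem** the words: reduce each word to its root form.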

In [13]:
'''
Loading Gensim and nltk libraries
'''
import nltk
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

from nltk.stem import WordNetLemmatizer, SnowballStemmer  # lemmatization, stemming
from nltk.stem.porter import *                                     

import numpy as np
np.random.seed(400)

In [14]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True


### Lemmatizer example
Before preprocessing the dataset, let's look at a lemmatization example. What is the output if we lemmatize the word "went"?

In [16]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v')) # past tense to present tense

go


### Stemmer example
Let's also look at a stemming example. We pass a number of words to the stemmer and see how it handles each one:

In [18]:
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [27]:
'''
Write a function to perform the pre processing steps on the entire dataset
'''
def lemmatize_stemming(text):
    """
    Lemmatize `text` as a verb, then stem the result.
    """
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result = []
    l_token = gensim.utils.simple_preprocess(text)  # tokenize the text
    for token in l_token:
        # Keep only non-stopwords longer than 3 characters
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            # TODO: Apply lemmatize_stemming() on the token, then add to the results list
            result.append(lemmatize_stemming(token))

    return result



In [30]:
'''
Preview a document after preprocessing
'''
document_num = 4310
# Note: document_num + 1 selects the headline at index 4311, so the preview below
# shows that headline rather than document 4310.
doc_sample = documents[documents['index'] == document_num + 1].values[0][0]

print("Original document: ")
words = doc_sample.split(' ')
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['ratepayers', 'group', 'wants', 'compulsory', 'local', 'govt', 'voting']


Tokenized and lemmatized document: 
['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']


In [29]:
documents.head(3)

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |


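Now preprocess all the news headlines. One way to do this is with the [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) function from pandas, applying `preprocess()` to the `headline_text` column; the cell below uses an equivalent list comprehension.

**Note**: this may take a few minutes (it takes 6 minutes on my laptop).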

In [31]:
%%time
# TODO: preprocess all the headlines, saving the list of results as 'processed_docs'
processed_docs = [preprocess(text) for text in documents['headline_text']]

Wall time: 45.3 s


In [32]:
'''
Preview 'processed_docs'
'''
processed_docs[:10]

[['decid', 'communiti', 'broadcast', 'licenc'],
 ['wit', 'awar', 'defam'],
 ['call', 'infrastructur', 'protect', 'summit'],
 ['staff', 'aust', 'strike', 'rise'],
 ['strike', 'affect', 'australian', 'travel'],
 ['ambiti', 'olsson', 'win', 'tripl', 'jump'],
 ['antic', 'delight', 'record', 'break', 'barca'],
 ['aussi', 'qualifi', 'stosur', 'wast', 'memphi', 'match'],
 ['aust', 'address', 'secur', 'council', 'iraq'],
 ['australia', 'lock', 'timet']]

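## Step 3.1: Bag of words on the dataset

Now create a dictionary from `processed_docs` that contains the number of times each word appears in the training set. To do that, pass `processed_docs` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call the result `dictionary`.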

In [46]:
%%time
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

print(type(dictionary), len(dictionary), dictionary.keys()[:5])

<class 'gensim.corpora.dictionary.Dictionary'> 24903 [0, 1, 2, 3, 4]
Wall time: 4.21 s


In [47]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


### filter_extremes
**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in:

* fewer than no_below documents (an absolute number), or
* more than no_above documents (a fraction of the total corpus size, not an absolute number).
* After these two filters, keep only the first keep_n most frequent tokens (or keep all of them if keep_n is None).
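The three rules can be sketched in plain Python on a toy corpus. This is an illustration of the filtering logic as described above, not gensim's implementation; `filter_extremes_sketch` and the toy documents are hypothetical names and data introduced for the example.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # no_above is a fraction of the corpus size
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n most frequent survivors (all of them if keep_n is None).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

toy_docs = [
    ["rain", "help", "bushfir"],
    ["rain", "flood", "warn"],
    ["rain", "court", "charg"],
    ["court", "jail", "charg"],
]
# "rain" is in 3 of 4 documents (above 50%), and most other tokens appear only once.
print(sorted(filter_extremes_sketch(toy_docs, no_below=2, no_above=0.5, keep_n=None)))
# prints ['charg', 'court']
```

Note that both thresholds count documents, not total occurrences: a token repeated many times inside a single headline still has a document frequency of 1.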

In [48]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- tokens appearing in fewer than 15 documents
- tokens appearing in more than 10% of all documents
'''
# TODO: apply dictionary.filter_extremes() with the parameters mentioned above
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)

In [49]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


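### doc2bow
**Gensim doc2bow**

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

* Converts a document (a list of words) into the bag-of-words format: a list of 2-tuples (token_id, token_count). Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in the document; apply tokenization, stemming, etc. before calling this method.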
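To make the bag-of-words format concrete, here is a minimal stand-in written with the standard library. It is a sketch of the conversion described above, not gensim's code; `doc2bow_sketch` and the token IDs are hypothetical, and skipping tokens that are missing from the mapping is an assumption about the default behaviour.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Convert a token list into sorted (token_id, token_count) pairs."""
    # Tokens missing from the mapping are skipped (assumed default behaviour).
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"rain": 0, "help": 1, "dampen": 2, "bushfir": 3}  # hypothetical IDs
print(doc2bow_sketch(["rain", "help", "rain", "unknown"], token2id))
# prints [(0, 2), (1, 1)]
```

Word order is discarded: only which tokens occur, and how often, survives the conversion.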

In [53]:
'''
Create the Bag-of-words model for each document i.e for each document we create a dictionary reporting how many
words and how many times those words appear. Save this to 'bow_corpus'
'''
# TODO
bow_corpus = [dictionary.doc2bow(words) for words in processed_docs]

In [54]:
'''
Checking Bag of Words corpus for our sample document --> (token_id, token_count)
'''
bow_corpus[document_num]

[(71, 1), (107, 1), (462, 1), (3530, 1)]

In [55]:
'''
Preview BOW for our sample preprocessed document
'''
# document_num is still 4310, as set in Step 2
bow_doc_4310 = bow_corpus[document_num]

for token_id, token_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 71 ("bushfir") appears 1 time.
Word 107 ("help") appears 1 time.
Word 462 ("rain") appears 1 time.
Word 3530 ("dampen") appears 1 time.


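## Step 3.2: TF-IDF on our document set ##

Although running TF-IDF on the corpus is not required for an LDA implementation using the gensim model, it is recommended. On initialization, TF-IDF expects a bag-of-words (integer-valued) training corpus. During transformation, it takes a vector and returns another vector of the same dimensionality.

*Please note: the author of Gensim dictates that the standard procedure for LDA is to use the bag-of-words model.*

**TF-IDF stands for "term frequency, inverse document frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it is important, so the word gets a high score. But if a word appears in many documents, it is not a unique identifier, so the word gets a low score.
* Therefore, common words like "the" and "for", which appear in many documents, are scaled down. Words that appear frequently in a single document are scaled up.

In other words:

* TF(w) = `(number of times term w appears in a document) / (total number of terms in the document)`.
* IDF(w) = `log_e(total number of documents / number of documents with term w in it)`.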

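**For example**

* Consider a document containing `100` words in which the word "tiger" appears 3 times.
* The term frequency (i.e., tf) for "tiger" is then:
    - `TF = (3 / 100) = 0.03`.

* Now, assume we have `10 million` documents and the word "tiger" appears in `1000` of them. Using a base-10 logarithm to keep the numbers round (the natural logarithm in the definition above would give a different value), the inverse document frequency (i.e., idf) is:
    - `IDF = log10(10,000,000 / 1,000) = 4`.

* Thus, the tf-idf weight is the product of these quantities:
    - `TF-IDF = 0.03 * 4 = 0.12`.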
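The arithmetic in this example can be reproduced with a short script. The value 4 corresponds to a base-10 logarithm, so the sketch prints the natural-log variant alongside it for comparison; the helper names `tf` and `idf` are introduced here only for illustration.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's terms that are this term."""
    return term_count / total_terms

def idf(total_docs, docs_with_term, log=math.log10):
    """Inverse document frequency, using the supplied logarithm function."""
    return log(total_docs / docs_with_term)

tf_tiger = tf(3, 100)
idf_base10 = idf(10_000_000, 1_000)              # base-10 log, as in the worked example
idf_natural = idf(10_000_000, 1_000, math.log)   # natural log, as in the definition

print("TF             =", round(tf_tiger, 4))
print("IDF (log10)    =", round(idf_base10, 4))
print("IDF (ln)       =", round(idf_natural, 4))
print("TF-IDF (log10) =", round(tf_tiger * idf_base10, 4))
```

The choice of base only rescales every IDF value by the same constant, so it changes the absolute scores but not the ranking of terms.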

In [57]:
from gensim.models import TfidfModel  # TF-IDF model

In [58]:
'''
Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
'''
# TODO
tfidf = TfidfModel(bow_corpus)

In [59]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
# TODO
corpus_tfidf = tfidf[bow_corpus]

In [60]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5959813347777092),
 (1, 0.39204529549491984),
 (2, 0.48531419274988147),
 (3, 0.5055461098578569)]
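One thing worth checking in this output: the four scores have a Euclidean length of about 1, which suggests the TF-IDF vector was normalized to unit length. That is a reading of the numbers shown above rather than a statement from the original text, and the snippet below only verifies it for this one document.

```python
import math

# TF-IDF scores copied from the output above: (token_id, score)
first_doc = [
    (0, 0.5959813347777092),
    (1, 0.39204529549491984),
    (2, 0.48531419274988147),
    (3, 0.5055461098578569),
]

# Euclidean (L2) length of the score vector
norm = math.sqrt(sum(score ** 2 for _, score in first_doc))
print(round(norm, 6))
```

If the vectors are indeed unit length, the scores are comparable within a document but are not the raw `TF * IDF` products from the worked example.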


## Step 4.1: Running LDA using bag of words ##

We are going for 10 topics in the document corpus.

**We will run LDA using all CPU cores to parallelize and speed up model training.**

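We will be tuning the following parameters:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word IDs (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of extra processes to use for parallelization. Uses all available cores by default.
* **alpha** and **eta** are hyperparameters that affect the document-topic (theta) and topic-word (lambda) distributions. We will leave them at their default values for now (the default is `1/num_topics`).
    - Alpha controls the per-document topic distribution.
        * High alpha: every document contains a mixture of all topics (documents appear similar to each other).
        * Low alpha: every document contains very few topics.
    - Eta controls the per-topic word distribution.
        * High eta: each topic contains most of the words (topics appear similar to each other).
        * Low eta: each topic contains few words.
* **passes** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, the chunk size is 10,000, and passes is 2, then online training takes 10 updates:
    * `#1 documents 0-9,999`
    * `#2 documents 10,000-19,999`
    * `#3 documents 20,000-29,999`
    * `#4 documents 30,000-39,999`
    * `#5 documents 40,000-49,999`
    * `#6 documents 0-9,999`
    * `#7 documents 10,000-19,999`
    * `#8 documents 20,000-29,999`
    * `#9 documents 30,000-39,999`
    * `#10 documents 40,000-49,999`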
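Two of these parameters are easy to illustrate without training anything. The sketch below first enumerates the update schedule from the passes example, then draws topic mixtures from a symmetric Dirichlet to show how a low or high alpha changes them. It uses only the standard library; `update_schedule`, `sample_dirichlet`, and the alpha values 0.1 and 100 are assumptions chosen for illustration, not gensim internals or defaults.

```python
import random

def update_schedule(num_docs, chunksize, passes):
    """List (update_number, first_doc, last_doc) for each online update."""
    schedule = []
    update = 0
    for _ in range(passes):
        for start in range(0, num_docs, chunksize):
            update += 1
            schedule.append((update, start, min(start + chunksize, num_docs) - 1))
    return schedule

for update, first, last in update_schedule(50_000, 10_000, 2):
    print("#{} documents {:,}-{:,}".format(update, first, last))

def sample_dirichlet(concentration, size, rng):
    """Draw one symmetric Dirichlet sample by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
for alpha in (0.1, 100.0):
    # One document's mixture over 10 topics under this alpha.
    mixture = sample_dirichlet(alpha, 10, rng)
    print("alpha={}: largest topic share {:.2f}".format(alpha, max(mixture)))
```

With a low alpha, one topic tends to take most of the probability mass; with a high alpha, the shares sit close to `1/num_topics` each, which is why documents start to look alike.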

In [64]:
%%time
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
# TODO
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=10, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=4)

Wall time: 1min 44s


In [65]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0
Words: 0.043*"charg" + 0.035*"face" + 0.032*"court" + 0.022*"accus" + 0.022*"closer" + 0.021*"drug" + 0.020*"concern" + 0.018*"jail" + 0.017*"murder" + 0.016*"arrest"


Topic: 1
Words: 0.036*"govt" + 0.027*"plan" + 0.023*"water" + 0.022*"urg" + 0.019*"council" + 0.014*"fund" + 0.012*"boost" + 0.012*"health" + 0.012*"rise" + 0.011*"servic"


Topic: 2
Words: 0.026*"chang" + 0.018*"govt" + 0.016*"case" + 0.016*"fight" + 0.015*"help" + 0.015*"rule" + 0.013*"urg" + 0.013*"park" + 0.012*"law" + 0.010*"guilti"


Topic: 3
Words: 0.020*"miss" + 0.019*"search" + 0.016*"communiti" + 0.016*"look" + 0.015*"land" + 0.014*"begin" + 0.013*"hold" + 0.013*"releas" + 0.013*"elect" + 0.011*"crew"


Topic: 4
Words: 0.021*"drought" + 0.018*"year" + 0.016*"break" + 0.015*"expect" + 0.014*"busi" + 0.012*"farm" + 0.011*"power" + 0.011*"highway" + 0.011*"award" + 0.010*"confid"


Topic: 5
Words: 0.024*"school" + 0.024*"test" + 0.016*"south" + 0.015*"time" + 0.015*"centr" + 0.013*"head" + 0.011*"england" + 

## ***Classification of the topics***

Using the words in each topic and their corresponding weights, what categories can you infer?

* 0: 
* 1: 
* 2: 
* 3: 
* 4: 
* 5: 
* 6: 
* 7:  
* 8: 
* 9: 

## Step 4.2: Running LDA using TF-IDF ##

In [69]:
%%time
'''
Define lda model using corpus_tfidf, again using gensim.models.LdaMulticore()
'''
# TODO
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                             num_topics=10, 
                                             id2word = dictionary, 
                                             passes = 2, 
                                             workers=4)

Wall time: 1min 15s


In [70]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.009*"govt" + 0.009*"plan" + 0.008*"council" + 0.006*"rudd" + 0.006*"urg" + 0.006*"nation" + 0.006*"action" + 0.006*"chang" + 0.006*"reject" + 0.005*"call"


Topic: 1 Word: 0.009*"open" + 0.008*"world" + 0.008*"hick" + 0.007*"market" + 0.006*"final" + 0.006*"australia" + 0.005*"aussi" + 0.005*"news" + 0.005*"kangaroo" + 0.005*"win"


Topic: 2 Word: 0.018*"crash" + 0.011*"die" + 0.009*"toll" + 0.008*"road" + 0.008*"rise" + 0.008*"kill" + 0.007*"death" + 0.007*"plane" + 0.007*"rate" + 0.006*"victim"


Topic: 3 Word: 0.010*"guilti" + 0.007*"nuclear" + 0.007*"plead" + 0.007*"grower" + 0.006*"iran" + 0.006*"timor" + 0.005*"season" + 0.005*"revamp" + 0.005*"coal" + 0.005*"confid"


Topic: 4 Word: 0.018*"kill" + 0.015*"polic" + 0.011*"blaze" + 0.010*"iraq" + 0.009*"arrest" + 0.009*"attack" + 0.009*"firefight" + 0.009*"bomb" + 0.008*"charg" + 0.008*"troop"


Topic: 5 Word: 0.010*"liber" + 0.006*"climat" + 0.006*"costello" + 0.006*"labor" + 0.006*"solomon" + 0.006*"grant" + 0.00

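## ***Classification of the topics***

As we can see, when using tf-idf, less frequent words receive heavier weights, which results in nouns being factored in. This makes the categories harder to identify, because nouns can be hard to categorize. It also shows that the model we apply depends on the type of text corpus we are dealing with.

Using the words in each topic and their corresponding weights, what categories can you infer?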

* 0: 
* 1:  
* 2: 
* 3: 
* 4:  
* 5: 
* 6: 
* 7: 
* 8: 
* 9: 

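## Step 5.1: Performance evaluation of the LDA bag-of-words model ##
- Classifying a sample document

We will check where our test document would be classified.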

In [71]:
'''
Text of sample document 4310
'''
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [72]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''
document_num = 4310
# Our test document is document number 4310

# TODO
# Sort the (topic_id, score) pairs by descending score
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.8199663758277893	 
Topic: 0.021*"australia" + 0.017*"forc" + 0.014*"water" + 0.014*"rain" + 0.012*"melbourn" + 0.012*"record" + 0.012*"threat" + 0.011*"terror" + 0.011*"lead" + 0.010*"return"

Score: 0.02001069486141205	 
Topic: 0.026*"chang" + 0.018*"govt" + 0.016*"case" + 0.016*"fight" + 0.015*"help" + 0.015*"rule" + 0.013*"urg" + 0.013*"park" + 0.012*"law" + 0.010*"guilti"

Score: 0.020006701350212097	 
Topic: 0.020*"miss" + 0.019*"search" + 0.016*"communiti" + 0.016*"look" + 0.015*"land" + 0.014*"begin" + 0.013*"hold" + 0.013*"releas" + 0.013*"elect" + 0.011*"crew"

Score: 0.020004503428936005	 
Topic: 0.091*"polic" + 0.031*"crash" + 0.029*"death" + 0.025*"investig" + 0.015*"die" + 0.015*"driver" + 0.015*"probe" + 0.014*"blaze" + 0.014*"road" + 0.013*"victim"

Score: 0.020002929493784904	 
Topic: 0.021*"drought" + 0.018*"year" + 0.016*"break" + 0.015*"expect" + 0.014*"busi" + 0.012*"farm" + 0.011*"power" + 0.011*"highway" + 0.011*"award" + 0.010*"confid"

Score: 0.0200023

<div class="alert alert-block alert-info">
It has the highest probability of being part of the topic that we assigned (X), which is the correct classification.
</div>
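To fill in placeholders like X, one option is to take the highest-scoring pair directly instead of reading it off the sorted printout. The sketch below assumes the model returns a list of `(topic_id, score)` pairs, as the loop above suggests; the IDs and scores in `doc_topics` are made up for illustration and do not come from the trained model.

```python
# Hypothetical (topic_id, score) pairs, shaped like the result of lda_model[bow];
# these IDs and scores are invented for the example.
doc_topics = [(0, 0.02), (3, 0.82), (5, 0.02), (7, 0.14)]

# Pick the pair with the highest score.
top_topic, top_score = max(doc_topics, key=lambda pair: pair[1])
print("Topic {} with probability {:.0%}".format(top_topic, top_score))
# prints: Topic 3 with probability 82%
```

Using `max` avoids sorting the full list when only the dominant topic is needed.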

## Step 5.2: Performance evaluation of the LDA TF-IDF model ##
- Classifying a sample document

In [73]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
# Our test document is document number 4310
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.4492030441761017	 
Topic: 0.016*"water" + 0.014*"govt" + 0.012*"fund" + 0.010*"council" + 0.010*"plan" + 0.009*"health" + 0.009*"urg" + 0.007*"boost" + 0.006*"drought" + 0.006*"servic"

Score: 0.3907369077205658	 
Topic: 0.009*"govt" + 0.009*"plan" + 0.008*"council" + 0.006*"rudd" + 0.006*"urg" + 0.006*"nation" + 0.006*"action" + 0.006*"chang" + 0.006*"reject" + 0.005*"call"

Score: 0.02001015469431877	 
Topic: 0.018*"crash" + 0.011*"die" + 0.009*"toll" + 0.008*"road" + 0.008*"rise" + 0.008*"kill" + 0.007*"death" + 0.007*"plane" + 0.007*"rate" + 0.006*"victim"

Score: 0.020008662715554237	 
Topic: 0.010*"guilti" + 0.007*"nuclear" + 0.007*"plead" + 0.007*"grower" + 0.006*"iran" + 0.006*"timor" + 0.005*"season" + 0.005*"revamp" + 0.005*"coal" + 0.005*"confid"

Score: 0.020007897168397903	 
Topic: 0.017*"miss" + 0.015*"search" + 0.014*"polic" + 0.011*"murder" + 0.010*"court" + 0.010*"woman" + 0.010*"charg" + 0.009*"teen" + 0.008*"coast" + 0.007*"fatal"

Score: 0.0200075358152389

<div class="alert alert-block alert-info">
It has the highest probability (`x%`) of being part of the topic that we assigned (X).
</div>

## Step 6: Testing the LDA bag-of-words model on an unseen document

In [75]:
# lda_model_tfidf -- lda_model
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.6200637817382812	 Topic: 0.021*"drought" + 0.018*"year" + 0.016*"break" + 0.015*"expect" + 0.014*"busi"
Score: 0.21990899741649628	 Topic: 0.026*"chang" + 0.018*"govt" + 0.016*"case" + 0.016*"fight" + 0.015*"help"
Score: 0.020003952085971832	 Topic: 0.025*"open" + 0.016*"coast" + 0.014*"dead" + 0.013*"hospit" + 0.012*"gold"
Score: 0.020003316923975945	 Topic: 0.043*"charg" + 0.035*"face" + 0.032*"court" + 0.022*"accus" + 0.022*"closer"
Score: 0.020003316923975945	 Topic: 0.036*"govt" + 0.027*"plan" + 0.023*"water" + 0.022*"urg" + 0.019*"council"
Score: 0.020003316923975945	 Topic: 0.020*"miss" + 0.019*"search" + 0.016*"communiti" + 0.016*"look" + 0.015*"land"
Score: 0.020003316923975945	 Topic: 0.024*"school" + 0.024*"test" + 0.016*"south" + 0.015*"time" + 0.015*"centr"
Score: 0.020003316923975945	 Topic: 0.091*"polic" + 0.031*"crash" + 0.029*"death" + 0.025*"investig" + 0.015*"die"
Score: 0.020003316923975945	 Topic: 0.021*"australia" + 0.017*"forc" + 0.014*"water" + 0.014*"r

<div class="alert alert-block alert-info">
The model correctly classifies the unseen document into category X with a probability of x%.
</div>

__END__