## 主题建模

主题建模专门设计用于从包含各种类型文档的大型语料库中提取各种不同概念或主题，其中每个文档涉及一个或多个概念。这些概念可以是从思想到一件、事实、展望、陈述等。主题建模的主要目的是使用数学和统计技术来发现语料库中的隐藏和潜在语义结构。
主题建模涉及从文档词项中提取特征，并使用矩阵分解和SVD等数学结构和框架来生成彼此不同的词簇或词组，并且这些词簇形成主题或概念。
构建主题模型有各种框架和算法。我们将介绍以下三种方法：
* 隐含语义索引
* 隐含Dirichlet分布。
* 非负矩阵分解
我们将使用gensim和scikit-learn来进行实际的实现，并且还会介绍如何基于隐含语义索引来构建自己的主题模型。

In [1]:
from gensim import corpora, models
from normalization3 import normalize_corpus
import numpy as np

toy_corpus = ["The fox jumps over the dog",
"The fox is very clever and quick",
"The dog is slow and lazy",
"The cat is smarter than the fox and the dog",
"Python is an excellent programming language",
"Java and Ruby are other programming languages",
"Python and Java are very popular programming languages",
"Python programs are smaller than Java programs"]

语料库中共八个文档，前四个是关于动物的，后四个是关于编程语言的。一旦构建了一些主题建模框架，我们将使用相同的方式来生成原子亚马逊实际产品评论的主题。

### 隐含语义索引

In [2]:
norm_tokenized_corpus = normalize_corpus(toy_corpus,tokenize=True)
print(normalize_corpus)

<function normalize_corpus at 0x0000022A2A51DC80>


In [3]:
dictionary = corpora.Dictionary(norm_tokenized_corpus)

print (dictionary.token2id)

{'dog': 0, 'fox': 1, 'jump': 2, 'clever': 3, 'quick': 4, 'lazy': 5, 'slow': 6, 'cat': 7, 'smarter': 8, 'excellent': 9, 'language': 10, 'programming': 11, 'python': 12, 'java': 13, 'ruby': 14, 'popular': 15, 'program': 16, 'small': 17}


In [4]:
corpus = [dictionary.doc2bow(text) for text in norm_tokenized_corpus]
corpus

[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (3, 1), (4, 1)],
 [(0, 1), (5, 1), (6, 1)],
 [(0, 1), (1, 1), (7, 1), (8, 1)],
 [(9, 1), (10, 1), (11, 1), (12, 1)],
 [(10, 1), (11, 1), (13, 1), (14, 1)],
 [(10, 1), (11, 1), (12, 1), (13, 1), (15, 1)],
 [(12, 1), (13, 1), (16, 2), (17, 1)]]

我们现在将对这个语料库建立一个TF-IDF加权模型，其中每个文档中的每个词将包含其TF-IDF权重。这类似于特征提取或向量空间转换，其中每个文档由其词的TF-IDF向量表示。完成之后我们将在这些特征上构建一个LSI模型，并输入我们想要生成的主题数量。这个数字是基于直觉和试错，所以在语料库上建立主题模型时，以可随意尝试这个参数。根据我们期望小语料库所包含的主题数量将此参数设置为2：

In [5]:
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

total_topics = 2
lsi = models.LsiModel(corpus_tfidf,
                     id2word = dictionary,
                     num_topics = total_topics)

In [6]:
for index,topic in lsi.print_topics(total_topics):
    print('Topic #'+ str(index + 1))
    print(topic)
    print

Topic #1
-0.459*"language" + -0.459*"programming" + -0.344*"python" + -0.344*"java" + -0.336*"popular" + -0.318*"excellent" + -0.318*"ruby" + -0.148*"program" + -0.074*"small" + 0.000*"fox"
Topic #2
-0.459*"fox" + -0.459*"dog" + -0.444*"jump" + -0.322*"smarter" + -0.322*"cat" + -0.208*"clever" + -0.208*"quick" + -0.208*"lazy" + -0.208*"slow" + -0.000*"programming"


下面的函数有助于在有阙值或无阙值的情况下以更好的方式显示主题：

In [7]:
def print_topics_gensim(topic_model,total_topics = 1,
                       weight_threshold = 0.0001,
                       display_weights = False,
                       num_terms = None):
    for index in range(total_topics):
        topic = topic_model.show_topic(index)
        topic = [(word,round(wt,2))for word,wt in topic 
                if abs(wt) >= weight_threshold]
        if display_weights:
            print('Topic #' + str (index+1)+'with weights')
            print(topic[:num_terms]) if num_terms else topic
        else:
            print('Topic #' + str(index+1)+'without weights')
            tw = [term for term,wt in topic]
            print(tw[:num_terms])if num_terms else tw
        print

可以使用以下代码对小语料库的主题模型测试这个函数，以了解如何获取主题并调整参数

In [8]:
print_topics_gensim(topic_model=lsi,
                   total_topics = total_topics,
                   num_terms = 5,
                   display_weights = False)

Topic #1without weights
['language', 'programming', 'python', 'java', 'popular']
Topic #2without weights
['fox', 'dog', 'jump', 'smarter', 'cat']


In [9]:
print_topics_gensim(topic_model=lsi,
                   total_topics = total_topics,
                   num_terms = 5,
                   display_weights = True)

Topic #1with weights
[('language', -0.46), ('programming', -0.46), ('python', -0.34), ('java', -0.34), ('popular', -0.34)]
Topic #2with weights
[('fox', -0.46), ('dog', -0.46), ('jump', -0.44), ('smarter', -0.32), ('cat', -0.32)]


现在已经成功的使用LSI构建了一个主题建模框架，它可以从文档语料库中区分和显示主题。
现在我们使用SVD从头开始构建自己的LSI主题模型框架。我们首先建立一个TF-IDF特征矩阵，实际上是一个文档-词项矩阵

In [10]:
from utils_in_NLP import build_feature_matrix,low_rank_svd
norm_corpus = normalize_corpus(toy_corpus)
vectorizer,tfidf_matrix = build_feature_matrix(norm_corpus,feature_type = 'tfidf')
td_matrix = tfidf_matrix.transpose()
td_matrix = td_matrix.multiply(td_matrix > 0)
total_topics = 2
feature_names = vectorizer.get_feature_names()

完成后使用low_rank_svd()函数计算我们的词项-文档矩阵的SVD，以便我们构建一个只取前k个奇异向量的低秩矩阵逼近，这将等于我们在此情况下的主题数量。通过是用S和U分量，我们将他们一起相乘以生成每个主题的每个词频及其权重

In [11]:
u , s , vt = low_rank_svd(td_matrix,singular_count=total_topics)
weights = u.transpose() * s[:,None]

现在我们有了词项的权重，需要将他们连接回到我们的词项。我们定义两个效用函数，用于通过连接词项与权重来生成这些主题，然后使用具有可配置参数的函数来打印这些主题

In [12]:
def get_topics_terms_weights(weights, feature_names):
    feature_names = np.array(feature_names)
    sorted_indices = np.array([list(row[::-1]) 
                           for row 
                           in np.argsort(np.abs(weights))])
    sorted_weights = np.array([list(wt[index]) 
                               for wt, index 
                               in zip(weights,sorted_indices)])
    sorted_terms = np.array([list(feature_names[row]) 
                             for row 
                             in sorted_indices])
    
    topics = [np.vstack((terms.T, 
                     term_weights.T)).T 
              for terms, term_weights 
              in zip(sorted_terms, sorted_weights)]     
    
    return topics            

In [13]:
def print_topics_udf(topics, total_topics=1,
                     weight_threshold=0.0001,
                     display_weights=False,
                     num_terms=None):
    
    for index in range(total_topics):
        topic = topics[index]
        topic = [(term, float(wt))
                 for term, wt in topic]
        topic = [(word, round(wt,2)) 
                 for word, wt in topic 
                 if abs(wt) >= weight_threshold]
                     
        if display_weights:
            print ('Topic #'+str(index+1)+' with weights')
            print (topic[:num_terms]) if num_terms else topic
        else:
            print ('Topic #'+str(index+1)+' without weights')
            tw = [term for term, wt in topic]
            print( tw[:num_terms]) if num_terms else tw
        print

In [14]:
topics = get_topics_terms_weights(weights,feature_names)
print_topics_udf(topics = topics,
                total_topics = total_topics,
                weight_threshold = 0,
                display_weights = True)

Topic #1 with weights
Topic #2 with weights


In [15]:
topics = get_topics_terms_weights(weights,feature_names)
print_topics_udf(
                topics = topics,
                total_topics = total_topics,
                weight_threshold = 0.15,
                display_weights = True
)

Topic #1 with weights
Topic #2 with weights


我们使用LSI来定义下面的函数作为通用可重用的主题建模框架：

In [16]:
def train_lsi_model_gensim(corpus,total_topics = 2):
    norm_tokenized_corpus = normalize_corpus(corpus,tokenize = True)
    dictionary = corpora.Dictionary(norm_tokenized_corpus)
    mapped_corpus = [dictionary.doc2bow(text)
                    for text in norm_tokenized_corpus]
    tfidf = models.TfidfModel(mapped_corpus)
    corpus_tfidf = tfidf[mapped_corpus]
    lsi = models.LsiModel(corpus_tfidf,
                         id2word = dictionary,
                         num_topics = total_topics)
    return lsi

### 隐含Dirichlet(狄利克利)分布

隐含Dirichlet分布技术是一种概率生成模型，其中假定每个文档具有类似于概率隐含语义索引模型的主题组合--但是在此情况下，隐含主题包含他们的Dirichlet先验分布。这项技术背后的数学知识比较复杂，因为他的具体细节将超出当前范围。

假设我们有M个文档，N个文档中的单词，以及K个想要生成的主题数量。

算法思想：
1 初始化必要参数。
2 对于每个文档，随机将每个单词初始化为K个主题之一。
3 开始如下的一个迭代过程，重复几次。
4 对于每个文档D：
a 对于文档中的每个单词W：
* 对于每个主题T
* 计算P(T|D),其是D中分配给主题T的词的比例
* 计算P(W|D),其是对于含有词W的所有文档分配给主题T的比例。
* 考虑所有其他单词及其主题分配，用主题T和概率P(T|D)*P(W|D)重新分配词W

运行了几次迭代之后，我们应该为每个文档提供主题混合，然后从指向该主题的词中生成每个主题的组成部分。我们在以下实现中使用gensim来构建基于LDA的主题模型：

In [17]:
def train_lda_model_gensim(corpus,total_topics = 2):
    norm_tokenized_corpus = normalize_corpus(corpus,tokenize = True)
    dictionary = corpora.Dictionary(norm_tokenized_corpus)
    mapped_corpus = [dictionary.doc2bow(text) for text in norm_tokenized_corpus]
    tfidf = models.TfidfModel(mapped_corpus)
    corpus_tfidf = tfidf[mapped_corpus]
    lda = models.LdaModel(corpus_tfidf,
                         id2word = dictionary,
                         iterations = 1000,
                         num_topics = total_topics)
    return lda

In [18]:
lda_gensim = train_lda_model_gensim(toy_corpus,
                                   total_topics = 2)
print_topics_gensim(topic_model = lda_gensim,
                   total_topics = 2,
                   num_terms = 5,
                   display_weights  = True)

Topic #1with weights
[('fox', 0.07), ('quick', 0.07), ('cat', 0.07), ('clever', 0.07), ('dog', 0.06)]
Topic #2with weights
[('programming', 0.07), ('language', 0.07), ('java', 0.07), ('popular', 0.06), ('jump', 0.06)]


In [19]:
from sklearn.decomposition import LatentDirichletAllocation

norm_corpus = normalize_corpus(toy_corpus)
vectorizer,tfidf_matrix = build_feature_matrix(norm_corpus,feature_type='tfidf')
total_topics = 2
lda = LatentDirichletAllocation(n_topics = total_topics,
                               max_iter = 100,
                               learning_method = 'online',
                               learning_offset = 50,
                               random_state = 42)
lda.fit(tfidf_matrix)
feature_names = vectorizer.get_feature_names()
weights = lda.components_
topics = get_topics_terms_weights(weights,feature_names)



在该段代码中，将LDA模型应用于文档-词项的TF-IDF特征矩阵，其被分解成两个矩阵，即一个文档-主题矩阵和一个主题-词项矩阵。我们使用存储在lda.components_ 中的主题-词项来检索每个主题每个词的权重。得到这些权重后，我们使用LSI建模中的get_topics_terms_weights()函数根据每个主题的词项和权重来构建主题。我们现在可以使用之前实现的print_topics_udf()函数查看主题：

In [20]:
topics = get_topics_terms_weights(weights,feature_names)
print_topics_udf(topics = topics,
                 total_topics = total_topics,
                num_terms = 8,
                display_weights = True)


Topic #1 with weights
[('fox', 1.85), ('dog', 1.54), ('jump', 1.17), ('clever', 1.11), ('quick', 1.11), ('cat', 1.06), ('smarter', 1.05), ('excellent', 0.6)]
Topic #2 with weights
[('programming', 1.73), ('language', 1.73), ('java', 1.61), ('python', 1.58), ('program', 1.29), ('ruby', 1.09), ('slow', 1.08), ('lazy', 1.08)]


### 非负矩阵分解

非负矩阵分解(NNMF)是一种类似于SVD的矩阵分解技术，虽然NNMF是对非负矩阵操作运算，并也可适用于多变量数据。NNMF可定义为：给定非负矩阵V，目标是找到两个非负矩阵因子W和H，使得他们相乘时，他们可以近似重构V，数学上这表示为V≈W * H

使得所有三个矩阵都为非负。为了实现这个近似，我们通常使用一个成本函数，如两个矩阵之间的欧几里得距离是L2范数，或是L2范数略微修改的Frobenius范数。

可以从scikit-learn decomposition模块的NMF类中可获得该实现。
可以在我们的小语料库上使用以下代码构建一个基于NNMF的主题模型，他给出了与LDA一样的特征名称和权重：

In [21]:
from sklearn.decomposition import NMF
norm_corpus = normalize_corpus(toy_corpus)
vectorizer,tfidf_matrix = build_feature_matrix(norm_corpus,feature_type='tfidf')
total_topics = 2
nmf = NMF(n_components = total_topics,
         random_state = 42,alpha = .1,l1_ratio = .5)
nmf.fit(tfidf_matrix)
feature_names = vectorizer.get_feature_names()
weights = nmf.components_

现在我们有了词项及其权重，可以使用我们以前定义的函数来打印主题，如下所示：

In [22]:
topics = get_topics_terms_weights(weights,feature_names)
print_topics_udf(topics=topics,
                total_topics = total_topics,
                num_terms = None,
                display_weights = True)

Topic #1 with weights
Topic #2 with weights


### 从产品评论中提取主题

我们将获取上古卷轴的在亚马逊上的评论

In [23]:
import pandas as pd
import numpy as np

CORPUS = pd.read_csv('amazon_skyrim_reviews.csv')
CORPUS = np.array(CORPUS['Reviews'])
print(CORPUS[12])

I base the value of a game on the amount of enjoyable gameplay I can get out of it and this one was definitely worth the price!


In [24]:
import pandas as pd
import numpy as np 
                 
CORPUS = pd.read_csv('amazon_skyrim_reviews.csv')                     
CORPUS = np.array(CORPUS['Reviews'])

# view sample review
print(CORPUS[12])

        
total_topics = 5
        
lsi_gensim = train_lda_model_gensim(CORPUS,
                                    total_topics=total_topics)
print_topics_gensim(topic_model=lsi_gensim,
                    total_topics=total_topics,
                    num_terms=10,
                    display_weights=False) 

lda_gensim = train_lda_model_gensim(CORPUS,
                                    total_topics=total_topics)
print_topics_gensim(topic_model=lda_gensim,
                    total_topics=total_topics,
                    num_terms=10,
                    display_weights=False) 


norm_corpus = normalize_corpus(CORPUS)
vectorizer, tfidf_matrix = build_feature_matrix(norm_corpus, 
                                    feature_type='tfidf') 
feature_names = vectorizer.get_feature_names()


lda = LatentDirichletAllocation(n_topics=total_topics, 
                                max_iter=1000,
                                learning_method='online', 
                                learning_offset=10.,
                                random_state=42)
lda.fit(tfidf_matrix)
weights = lda.components_
topics = get_topics_terms_weights(weights, feature_names)
print_topics_udf(topics=topics,
                 total_topics=total_topics,
                 num_terms=10,
                 display_weights=False)


nmf = NMF(n_components=total_topics, 
          random_state=42, alpha=.1, l1_ratio=.5)
nmf.fit(tfidf_matrix)      

feature_names = vectorizer.get_feature_names()
weights = nmf.components_

topics = get_topics_terms_weights(weights, feature_names)
print_topics_udf(topics=topics,
                 total_topics=total_topics,
                 num_terms=10,
                 display_weights=False)  

I base the value of a game on the amount of enjoyable gameplay I can get out of it and this one was definitely worth the price!
Topic #1without weights
['one', 'love', 'fun', 'buy', 'recommend', 'ever', 'quest', 'play', 'great', 'make']
Topic #2without weights
['much', 'fun', 'say', 'love', 'play', 'oblivion', 'great', 'thing', 'like', 'enjoy']
Topic #3without weights
['great', 'would', 'play', 'one', 'best', 'everyone', 'love', 'rpgs', 'like', 'skyrim']
Topic #4without weights
['level', 'skyrim', 'good', 'quest', 'get', 'much', 'armor', 'rpg', 'want', 'make']
Topic #5without weights
['play', 'skyrim', 'get', 'oblivion', 'hour', '5', 'even', 'like', 'lose', 'save']
Topic #1without weights
['play', 'best', 'definitely', 'say', 'oblivion', 'one', 'good', 'really', 'like', 'love']
Topic #2without weights
['love', 'play', 'skyrim', 'best', 'much', 'oblivion', 'elder', 'scroll', 'recommend', 'ever']
Topic #3without weights
['quest', 'dragon', 'skyrim', 'try', 'like', 'oblivion', 'one', 'gre



Topic #1 without weights
['estatic', 'booklet', 'wonder4ful', 'electricity', 'heat', 'trhats', 'amazingly', 'interfere', 'chirstmas', '12yr']
Topic #2 without weights
['game', 'play', 'get', 'one', 'skyrim', 'great', 'like', 'time', 'quest', 'much']
Topic #3 without weights
['de', 'crédito', 'pagar', 'momento', 'compras', 'responsabilidad', 'para', 'recomiendo', 'futuras', 'skyrimseguridad']
Topic #4 without weights
['booklet', 'estatic', 'wonder4ful', 'electricity', 'heat', 'trhats', 'amazingly', 'interfere', 'chirstmas', '12yr']
Topic #5 without weights
['estatic', 'booklet', 'wonder4ful', 'electricity', 'trhats', 'heat', 'amazingly', 'interfere', 'chirstmas', '12yr']
Topic #1 without weights
['game', 'get', 'skyrim', 'play', 'time', 'quest', 'like', 'one', 'go', 'much']
Topic #2 without weights
['game', 'recommend', 'love', 'great', 'highly', 'play', 'wonderful', 'like', 'would', 'graphic']
Topic #3 without weights
['scroll', 'elder', 'series', 'always', 'love', 'pass', 'franchise',

### 自动文档摘要

自动化文档摘要的主要目标是不包括人工输入的执行此摘要，除了运行任何计算机程序。数学和统计模型有助于通过观察其内容和上下文来构建和自动化概况文档的任务。

应用自动化技术进行文档摘要主要有两大类做法：
* 基于提取的技术：
这些方法使用数学和统计学概念（如SVD）从原始文档中提取内容的一些关键子集，使得该内容子集包含核心信息，并作为整个文档的重点。这个内容可以是单词、短语或句子。这种方法的最终结果是从原始文档中采集或提取了几行简短的执行摘要。在这种技术中不产生新的内容--因此这个名称是基于提取的。

* 基于概括的技术：这些方法更加复杂和精准，并利用语言语义来产生表示。他们还利用NLG技术，其中及其使用知识库和语义表达来自己生成文本，并像人类编写一样来创建摘要。

我们通过利用gensim的摘要模块来看看文档摘要的实现。我们将使用维基百科关于大象的描述来作为我们将测试所有摘要技术的文档。

In [25]:
toy_text = """
Elephants are large mammals of the family Elephantidae 
and the order Proboscidea. Two species are traditionally recognised, 
the African elephant and the Asian elephant. Elephants are scattered 
throughout sub-Saharan Africa, South Asia, and Southeast Asia. Male 
African elephants are the largest extant terrestrial animals. All 
elephants have a long trunk used for many purposes, 
particularly breathing, lifting water and grasping objects. Their 
incisors grow into tusks, which can serve as weapons and as tools 
for moving objects and digging. Elephants' large ear flaps help 
to control their body temperature. Their pillar-like legs can 
carry their great weight. African elephants have larger ears 
and concave backs while Asian elephants have smaller ears 
and convex or level backs.  
"""

In [26]:
from normalization3 import normalize_corpus,parse_document
from utils_in_NLP import build_feature_matrix,low_rank_svd
import numpy as np

现在定义一个函数将输入文档总结道其原始大小的一小部分，这将作为下面函数中的用户输入参数summary_ratio.输出将是摘要后的文件：

In [27]:
from gensim.summarization import summarize,keywords
def text_summarization_gensim(text,summary_ratio = 0.5):
    summary = summarize(text,split = True,ratio = summary_ratio)
    for sentence in summary:
        print(sentence)

In [28]:
docs = parse_document(toy_text)
text = ' '.join(docs)
text_summarization_gensim(text,summary_ratio=0.4)


Two species are traditionally recognised,  the African elephant and the Asian elephant.
All  elephants have a long trunk used for many purposes,  particularly breathing, lifting water and grasping objects.
Elephants' large ear flaps help  to control their body temperature.


原文档共有9个句子，观察后发现，总结后共有三个句子，但是文档的核心意义与主题已被保留。
这个原子gensim的摘要实现是基于一种流行的称为TextRank的算法

我们将主要关注以下技术：
* 隐含语义分析
* TextRank

In [29]:
#解析和规范化文档

In [30]:
sentences = parse_document(toy_text)
norm_sentences = normalize_corpus(sentences,lemmatize=True)
total_sentences = len(norm_sentences)
print('Total Sentences in Documents:',total_sentences)

Total Sentences in Documents: 9


一旦有了一个可运用得摘要算法，我们将为每种技术构建一个通用函数

### 隐含语义分析

隐含语义分析（LSA）的核心原则是，在任何文件中，在词语的相关语境中存在隐含的结构，因此也应该在相同的奇异空间中相关。

实现的主要思想是使用SVD，其中:$$ M = USV^T $$
使得U和V是正交矩阵，S是对角矩阵，其也可以表示为奇异值向量。原始矩阵可以表示为词项-文档矩阵，其中行将是词，每一列将是一个文档，也就是说，在这种情况下是我们文档中的一个句子。这些值可以是任何类型的加权，例如基于词袋模型的频率、TF-IDF或出现次数二值特征。
我们将使用low_rank_svd()函数根据概念数量k创建M的低秩矩阵近似，k将是奇异值的数量。来自矩阵U的相同的k列将指向k个概念中的每一个词向量，并且对于矩阵V，基于前k个奇异值的k行指向句子向量。从基于概念数量k的前k个奇异值的SVD中得到U、S和$V^T$之后，我们执行以下计算。
需要的输入参数是我们预期最终摘要包含的概念数量k和句子数n。
* 从矩阵V（k行）获取句子数量。
* 从S获得前k个奇异值。
* 应用基于阈值的方法，删除小于最大奇异值一半的奇异值（如果有的话）。这是启发式的，你可以按照需要调整这个值。在数学上，$S_i = 0 iff S_i <1/2 max(S)$
* 将来自V平方的每个词句子列乘以S平方相对应的奇异值，以获得每个主题句子的凸显度分数。
* 计算主题之间的句子权重之和，并取最终分数的平方根来获得文档中每个句子的凸显度分数。

每个句子前面的凸显度分数的计算可以在数学上表示为$$SS = \sqrt{\sum_{i=1}^k\ S_i*V_i^T}$$

其中SS表示每个句子的凸显度分数，其通过采用奇异值和$V^T$句子向量之间的点积获得。得到这些分数侯，按降序对他们进行排序，选择与最高分数相对应的前n个句子，并根据他们在原始文档中出现的顺序将他们组合起来形成最终的摘要：

In [32]:
num_sentences = 3
num_topics = 3
vec,dt_martix = build_feature_matrix(sentences,
                                    feature_type = 'frequency')

In [41]:
td_matrix = dt_martix.transpose()
td_matrix = td_matrix.multiply(td_matrix > 0)
u,s,vt = low_rank_svd(td_matrix,singular_count=num_topics)

sv_threshold = 0.5
min_sigma_value = max(s)*sv_threshold
s[s<min_sigma_value] = 0

In [42]:
salience_scores = np.sqrt(np.dot(np.square(s),np.square(vt)))

In [43]:
print(np.round(salience_scores,2))

[2.93 3.28 1.67 1.8  2.24 4.51 0.71 1.22 5.24]


In [44]:
top_sentence_indices = salience_scores.argsort()[-num_sentences:][::-1]
top_sentence_indices.sort()
print(top_sentence_indices)

[1 5 8]


In [45]:
for index in top_sentence_indices:
    print(sentences[index])

Two species are traditionally recognised,  the African elephant and the Asian elephant.
Their  incisors grow into tusks, which can serve as weapons and as tools  for moving objects and digging.
African elephants have larger ears  and concave backs while Asian elephants have smaller ears  and convex or level backs.


可以看到一些矩阵操作为我们提供了一个简明，优秀的总结文档，涵盖了大象文档中的主要主题。
我们现在使用之前的算法为LSA构建一个通用的可重用函数，以便我们可以在后续产品描述文档中使用它。

In [47]:
def lsa_text_summarizer(documents,num_sentences = 2,
                       num_topics = 2,feature_type = 'frequency',
                       sv_threshold = 0.5):
    vec,dt_martix = build_feature_matrix(documents,feature_type=feature_type)
    
    td_matrix = dt_martix.transpose()
    td_matrix = td_matrix.multiply(td_matrix>0)
    u,s,vt = low_rank_svd(td_matrix,singular_count=num_topics)
    min_sigma_value = max(s)*sv_threshold
    s[s < min_sigma_value] = 0
    salience_scores = np.sqrt(np.dot(np.square(s),np.square(vt)))
    top_sentence_indices = salience_scores.argsort()[-num_sentences:][::-1]
    top_sentence_indices.sort()
    for index in top_sentence_indices:
        print(sentences[index])

### TextRank算法

* 从待总结的文档中标记和提取句子
* 确定在最终摘要中我们想要的句子数量k
* 使用诸如TF-IDF或词袋的权重来构建文档-词项的特征矩阵
* 通过将矩阵与其转置矩阵相乘，计算文档相似性矩阵
* 使用这些文档作为顶点，每对文档之间的相似性作为前面提到的权重或得分系数，并将它们提供给PageRank算法
* 获得每个句子的分数
* 根据分数排序句子，并返回前k个句子

In [49]:
import networkx
num_sentences = 3
vec,dt_martix = build_feature_matrix(norm_sentences,feature_type='tfidf')
similarity_matrix = (dt_martix*dt_martix.T)
print(np.round(similarity_matrix.todense(),2))

[[1.   0.07 0.03 0.12 0.03 0.   0.11 0.   0.1 ]
 [0.07 1.   0.05 0.17 0.05 0.   0.07 0.   0.24]
 [0.03 0.05 1.   0.03 0.02 0.   0.03 0.   0.04]
 [0.12 0.17 0.03 1.   0.03 0.   0.11 0.   0.17]
 [0.03 0.05 0.02 0.03 1.   0.07 0.03 0.   0.04]
 [0.   0.   0.   0.   0.07 1.   0.   0.   0.  ]
 [0.11 0.07 0.03 0.11 0.03 0.   1.   0.   0.25]
 [0.   0.   0.   0.   0.   0.   0.   1.   0.  ]
 [0.1  0.24 0.04 0.17 0.04 0.   0.25 0.   1.  ]]


In [51]:
similarity_graph = networkx.from_scipy_sparse_matrix(similarity_matrix)
networkx.draw_networkx(similarity_graph)

In [52]:
scores = networkx.pagerank(similarity_graph)
ranked_sentences = sorted(((score,index) for index ,score in scores.items()),
                         reverse = True)

In [56]:
ranked_sentences

[(0.1260163898978227, 8),
 (0.11765066352817342, 1),
 (0.11552208007151823, 3),
 (0.1130880998369617, 6),
 (0.1111111111111111, 7),
 (0.1071100297193984, 0),
 (0.10529192044492774, 4),
 (0.10502675195545394, 5),
 (0.09918295343463272, 2)]

In [57]:
top_sentence_indices = [ranked_sentences[index][1] for index in range(num_sentences)]
top_sentence_indices.sort()
print(top_sentence_indices)

[1, 3, 8]


In [55]:
for index in top_sentence_indices:
    print(sentences[index])

Two species are traditionally recognised,  the African elephant and the Asian elephant.
Male  African elephants are the largest extant terrestrial animals.
African elephants have larger ears  and concave backs while Asian elephants have smaller ears  and convex or level backs.


通过使用TextRank算法，我们最终得到了我们想要的摘要。他的内容也是非常有意义的

我们定义一个通用函数以便在任何文档上计算基于TextRank的摘要

In [63]:
def textrank_text_summarizer(documents,num_sentences = 2,feature_type = 'frequency'):
    vec,dt_martix = build_feature_matrix(norm_sentences,feature_type='tfidf')
    similarity_matrix = (dt_martix*dt_martix.T)
    similarity_graph = networkx.from_scipy_sparse_matrix(similarity_matrix)
    scores = networkx.pagerank(similarity_graph)
    ranked_sentences = sorted(((score,index)for index,score in scores.items()),
                             reverse = True)
    top_sentence_indices = [ranked_sentences[index][1] for index in range(num_sentences)]
    top_sentence_indices.sort()
    
    for index in top_sentence_indices:
        print(sentences[index])

### 生成产品说明摘要

In [64]:
DOCUMENT = """
The Elder Scrolls V: Skyrim is an open world action role-playing video game 
developed by Bethesda Game Studios and published by Bethesda Softworks. 
It is the fifth installment in The Elder Scrolls series, following 
The Elder Scrolls IV: Oblivion. Skyrim's main story revolves around 
the player character and their effort to defeat Alduin the World-Eater, 
a dragon who is prophesied to destroy the world. 
The game is set two hundred years after the events of Oblivion 
and takes place in the fictional province of Skyrim. The player completes quests 
and develops the character by improving skills. 
Skyrim continues the open world tradition of its predecessors by allowing the 
player to travel anywhere in the game world at any time, and to 
ignore or postpone the main storyline indefinitely. The player may freely roam 
over the land of Skyrim, which is an open world environment consisting 
of wilderness expanses, dungeons, cities, towns, fortresses and villages. 
Players may navigate the game world more quickly by riding horses, 
or by utilizing a fast-travel system which allows them to warp to previously 
Players have the option to develop their character. At the beginning of the game, 
players create their character by selecting one of several races, 
including humans, orcs, elves and anthropomorphic cat or lizard-like creatures, 
and then customizing their character's appearance.discovered locations. Over the 
course of the game, players improve their character's skills, which are numerical 
representations of their ability in certain areas. There are eighteen skills 
divided evenly among the three schools of combat, magic, and stealth. 
Skyrim is the first entry in The Elder Scrolls to include Dragons in the game's 
wilderness. Like other creatures, Dragons are generated randomly in the world 
and will engage in combat. 
"""

In [65]:
sentences = parse_document(DOCUMENT)
norm_sentences = normalize_corpus(sentences,lemmatize=True)
print("Total Sentences:",len(norm_sentences))

Total Sentences: 13


In [66]:
lsa_text_summarizer(norm_sentences,num_sentences=3,
                   num_topics = 5,feature_type = 'frequency',
                   sv_threshold = 0.5)


The Elder Scrolls V: Skyrim is an open world action role-playing video game  developed by Bethesda Game Studios and published by Bethesda Softworks.
Players may navigate the game world more quickly by riding horses,  or by utilizing a fast-travel system which allows them to warp to previously  Players have the option to develop their character.
At the beginning of the game,  players create their character by selecting one of several races,  including humans, orcs, elves and anthropomorphic cat or lizard-like creatures,  and then customizing their character's appearance.discovered locations.


In [67]:
textrank_text_summarizer(norm_sentences,num_sentences=3,
                        feature_type = 'tfidf')

The Elder Scrolls V: Skyrim is an open world action role-playing video game  developed by Bethesda Game Studios and published by Bethesda Softworks.
Players may navigate the game world more quickly by riding horses,  or by utilizing a fast-travel system which allows them to warp to previously  Players have the option to develop their character.
Skyrim is the first entry in The Elder Scrolls to include Dragons in the game's  wilderness.
