## Keyword Extraction
This section introduces three methods for extracting keywords from text:
- TF-IDF
- TextRank
- LDA

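### Keyword extraction with TF-IDF
Use `analyse.extract_tags` from the jieba library (`import jieba.analyse`).
* jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
    * sentence: the text to extract keywords from
    * topK: how many keywords with the highest TF-IDF weight to return; defaults to 20
    * withWeight: whether to return each keyword's weight along with it; defaults to False
    * allowPOS: keep only words with the listed part-of-speech tags; defaults to empty, meaning no filtering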
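
To make the weighting behind `extract_tags` concrete, here is a minimal standard-library sketch of TF-IDF scoring. It is an illustration under assumptions, not jieba's implementation: jieba looks up IDF values in a prebuilt table, whereas this sketch computes a smoothed IDF from a toy corpus. The function name `tfidf_keywords`, its parameters, and the sample documents are all hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(target_tokens, corpus_tokens, top_k=3):
    """Rank the words of one tokenized document by TF-IDF (hypothetical helper)."""
    n_docs = len(corpus_tokens)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))

    term_freq = Counter(target_tokens)
    total = len(target_tokens)
    scores = {}
    for word, count in term_freq.items():
        tf = count / total
        # Smoothed IDF (an assumption of this sketch, not jieba's IDF table).
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        scores[word] = tf * idf
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy, already-tokenized documents (made up for illustration).
docs = [
    ["chip", "phone", "chip", "release"],
    ["phone", "camera", "release"],
    ["phone", "battery", "release"],
]
keywords = tfidf_keywords(docs[0], docs, top_k=2)
print(keywords)
```

In the toy corpus, "chip" is frequent in the first document and absent from the others, so it outranks "phone" and "release", which appear in every document.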

In [None]:
import jieba.analyse as analyse
import pandas as pd

# Set to True to print sample data in the cells below
glance_data = True
# glance_data = False

df = pd.read_csv("./data/technology_news.csv", encoding='utf-8')
df = df.dropna()
lines=df.content.values.tolist()
content = "".join(lines)
print("  ".join(analyse.extract_tags(content, topK=30, withWeight=False, allowPOS=())))

In [None]:
import jieba.analyse as analyse
import pandas as pd
df = pd.read_csv("./data/military_news.csv", encoding='utf-8')
df = df.dropna()
lines=df.content.values.tolist()
content = "".join(lines)
print("  ".join(analyse.extract_tags(content, topK=30, withWeight=False, allowPOS=())))

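### Keyword extraction with TextRank
* jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) can be called directly and has the same interface; note that it filters by part of speech by default.
* jieba.analyse.TextRank() creates a custom TextRank instance

Paper: [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)

Basic idea:
* Segment the text from which keywords are to be extracted
* Build a graph from the co-occurrence of words within a fixed-size window (5 by default, adjustable through the span attribute)
* Compute PageRank for the nodes of the graph; note that the graph is undirected and weighted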
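
The three steps above can be sketched in plain Python. This is a simplified illustration, not jieba's code: it assumes the text is already segmented, skips self-pairs, uses synchronous PageRank updates, and does not normalize the final scores. The function name `textrank`, its default arguments other than `span=5`, and the sample tokens are assumptions.

```python
from collections import defaultdict

def textrank(tokens, span=5, damping=0.85, iterations=10, top_k=3):
    """Rank words by PageRank on an undirected, weighted co-occurrence graph."""
    # Edge weight = number of times two different words co-occur within the window.
    weights = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + span, len(tokens))):
            if tokens[j] == word:
                continue  # assumption of this sketch: ignore self-pairs
            pair = tuple(sorted((word, tokens[j])))
            weights[pair] += 1.0

    # Store each undirected edge in both directions.
    neighbors = defaultdict(dict)
    for (a, b), w in weights.items():
        neighbors[a][b] = w
        neighbors[b][a] = w
    if not neighbors:
        return []

    rank = {node: 1.0 / len(neighbors) for node in neighbors}
    out_sum = {node: sum(nbrs.values()) for node, nbrs in neighbors.items()}
    for _ in range(iterations):
        new_rank = {}
        for node, nbrs in neighbors.items():
            # Each neighbor passes on its rank in proportion to the edge weight.
            incoming = sum(w / out_sum[other] * rank[other] for other, w in nbrs.items())
            new_rank[node] = (1 - damping) + damping * incoming
        rank = new_rank
    # Highest rank first; ties are broken alphabetically for stable output.
    return sorted(rank.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy, already-segmented text (made up for illustration).
tokens = ["army", "drill", "army", "navy", "army", "missile", "army", "drill"]
ranked = textrank(tokens, span=2, top_k=2)
print(ranked)
```

With `span=2` only adjacent words are linked, so "army" becomes the hub of the graph and ranks first, followed by "drill", which shares the heaviest edge with it.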

In [None]:
import jieba.analyse as analyse
import pandas as pd
df = pd.read_csv("./data/military_news.csv", encoding='utf-8')
df = df.dropna()
lines=df.content.values.tolist()
content = "".join(lines)

print("  ".join(analyse.textrank(content, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))))
print("--------------------- separator ----------------")
print("  ".join(analyse.textrank(content, topK=20, withWeight=False, allowPOS=('ns', 'n'))))

## LDA Topic Model
How it works: TODO

In [None]:
# Import the required packages
from gensim import corpora, models, similarities
import gensim
import jieba
import pandas as pd
# Load the stopword list (a set makes the membership test below fast)
stopwords=pd.read_csv("./data/stopwords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
stopwords=set(stopwords['stopword'].values)

# Load the data
df = pd.read_csv("./data/technology_news.csv", encoding='utf-8')
df = df.dropna()
lines=df.content.values.tolist()

# Preprocess: segment each document, then drop single-character tokens and stopwords
sentences=[]
for line in lines:
    try:
        segs=jieba.lcut(line)
        segs = list(filter(lambda x:len(x)>1, segs))
        segs = list(filter(lambda x:x not in stopwords, segs))
        sentences.append(list(segs))
    except Exception:
        # Print and skip any row that cannot be segmented
        print(line)
        continue

In [None]:
if glance_data:
    for word in sentences[5]:
        print(word)

In [None]:
# Bag-of-words model
dictionary = corpora.Dictionary(sentences)
corpus = [dictionary.doc2bow(sentence) for sentence in sentences]
if glance_data:
    print(corpus[5])
# Train the LDA model
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
# Show the top 10 keywords of topic 3
lda.print_topic(3, topn=10)

In [None]:
if glance_data:
    # Print all topics
    for topic in lda.print_topics(num_topics=20, num_words=8):
        print(topic[1])

To assign topics to a new piece of text, convert it to a bag-of-words vector with the same dictionary and call:
`lda.get_document_topics(bow)`
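
A small helper can wrap this call to pick the single most likely topic. This is a sketch under assumptions: `dominant_topic` is a hypothetical name, `dictionary` and `lda` are expected to be the `corpora.Dictionary` and `LdaModel` objects built above, and the new text is assumed to be segmented and filtered the same way as the training data.

```python
def dominant_topic(tokens, dictionary, lda):
    """Return (topic_id, probability) for the most likely topic, or None."""
    # Words that are not in the dictionary are ignored by doc2bow.
    bow = dictionary.doc2bow(tokens)
    # get_document_topics yields (topic_id, probability) pairs.
    topics = lda.get_document_topics(bow)
    if not topics:
        return None
    return max(topics, key=lambda pair: pair[1])

# Example use with the objects from the cells above (not executed here):
# tokens = [w for w in jieba.lcut(new_text) if len(w) > 1 and w not in stopwords]
# print(dominant_topic(tokens, dictionary, lda))
```

Returning `None` covers the case where no topic is reported for the document, for example when every topic falls below the model's minimum probability.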