离线计算 Word2Vec 的词向量。

一、数据读入

首先进行数据清洗。转换原始数据中的结构字符，并转为简体。

In [120]:
import re
import json
from zhconv import convert

with open('chinese-poetry/json/表面结构字.json', 'r', encoding='utf-8') as f:
    stru = json.load(f)


def destru(c):
    st = stru.get(c, None)
    if st:
        return st[0]['font']
    return '\uFFFF'


def process(s):
    if isinstance(s, list):
        return [l for l in process('$'.join(s)).split('$') if re.match(r'[\u4E00-\u9FFF]', l)]

    s = re.sub(r'（.+）', '', s)
    s = re.sub(r'[\{\[].*?[\}\]]', lambda m: destru(m.group(0)), s)
    return convert(s, 'zh-cn')


def process2(s):
    if isinstance(s, list):
        return process2('$'.join(s)).split('$')

    s = re.sub(r'[\{\[].*?[\}\]]', lambda m: destru(m.group(0)), s)
    return convert(s, 'zh-cn')

In [121]:
import pandas as pd
import tqdm

for dn in tqdm.trange(0, 58):
    df = pd.read_json(f'chinese-poetry/json/poet.tang.{dn*1000}.json')
    df['paragraphs'] = df['paragraphs'].map(process)
    df['author'] = df['author'].map(process)
    df['title'] = df['title'].map(process)
    del df['tags']
    with open(f'data/poet.tang.{dn}.json', 'w', encoding='utf-8') as f:
        df.to_json(f, orient='records', force_ascii=False)

100%|██████████| 58/58 [00:04<00:00, 11.79it/s]


In [108]:
import pandas as pd
from zhconv import convert

df = pd.read_json(f'chinese-poetry/json/authors.tang.json')
df['desc'] = df['desc'].map(process2)
df['name'] = df['name'].map(process2)
with open('data/authors.tang.json', 'w', encoding='utf-8') as f:
    df.to_json(f, orient='records', force_ascii=False)

然后将清洗后的数据读入 DataFrame 中。

In [2]:
import pandas as pd
import functools

@functools.lru_cache()
def read_poem(n):
    df = pd.read_json(f'./data/poet.tang.{n}.json')

    def split(s):
        return s.replace('，', '。') \
            .replace('？', '。') \
            .replace('！', '。') \
            .strip('。').split('。')

    sens = []
    for poem in df['paragraphs']:
        sens.append(split(''.join(poem)))
    return sens


print(read_poem(0)[:1])

[['秦川雄帝宅', '函谷壮皇居', '绮殿千寻起', '离宫百雉余', '连甍遥接汉', '飞观迥凌虚', '云日隐层阙', '风烟出绮疏']]


二、训练模型

采用 gensim 包提供的 FastText 模型，逐字训练 n-gram 向量。

In [62]:
import logging
from gensim.models.fasttext import FastText

model = FastText(min_count=1, iter=20, sg=1)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

2021-10-30 16:15:21,238 : INFO : resetting layer weights


In [63]:
def train(n, update=False, save=False):
    sens = read_poem(n)
    model.build_vocab(sens, update=update)
    model.train(sens, total_examples=model.corpus_count, epochs=model.iter)
    if save:
        model.save("fasttext.model")

In [64]:
for dn in range(0, 58):
    train(dn, update=dn != 0, save=False)
    
model.save("fasttext.model")

2021-10-30 16:15:39,978 : INFO : collecting all words and their counts
2021-10-30 16:15:40,702 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-10-30 16:15:40,705 : INFO : collected 8283 word types from a corpus of 8651 raw words and 1000 sentences
2021-10-30 16:15:40,706 : INFO : Loading a fresh vocabulary
2021-10-30 16:15:40,719 : INFO : effective_min_count=1 retains 8283 unique words (100% of original 8283, drops 0)
2021-10-30 16:15:40,719 : INFO : effective_min_count=1 leaves 8651 word corpus (100% of original 8651, drops 0)
2021-10-30 16:15:40,743 : INFO : deleting the raw counts dictionary of 8283 items
2021-10-30 16:15:40,743 : INFO : sample=0.001 downsamples 0 most-common words
2021-10-30 16:15:40,744 : INFO : downsampling leaves estimated 8651 word corpus (100.0% of prior 8651)
2021-10-30 16:15:40,810 : INFO : estimated required memory for 8283 words, 104374 buckets and 100 dimensions: 53799836 bytes
2021-10-30 16:15:40,814 : INFO : resetting lay

训练完成后，可以直接从本地加载训练过的模型。

In [54]:
from gensim.models.fasttext import FastText
model = FastText.load('fasttext.model')

三、生成词表

利用 FastText 的训练结果，生成古诗的词表。

首先，生成所有的2到4字的相邻字符组合，并统计词频。

In [65]:
from collections import Counter

words_counter = Counter()


def update_counter(words_counter, dn):
    for n in range(2, 5):
        words_counter.update(
            ''.join(wd_t) for sen in read_poem(dn) for s in sen
            for wd_t in zip(*(s[i:] for i in range(n)))
        )
    return words_counter

In [32]:
words_counter = update_counter(Counter(), 0)

words, counts = zip(*words_counter.most_common())
print(words[:10])

('万国', '肃肃', '神其', '天地', '四海', '将军', '万里', '明德', '无疆', '万方')


In [71]:
import tqdm

words_counter = Counter()
for dn in tqdm.trange(0, 58):
    update_counter(words_counter, dn)

words, counts = zip(*words_counter.most_common())
print(len(words))
print(words[:10])

100%|██████████| 58/58 [00:12<00:00,  4.50it/s]


3442366
('何处', '不知', '万里', '千里', '不见', '白云', '今日', '不可', '春风', '不得')


计算词评分:

$$\mathrm{Score}_{w} = || \mathrm{Vec}_w || \times \ln (D_w+1)$$

其中$\mathrm{Vec}_w$是 FastText 计算出的平均词向量，$D_w$是词频。

In [84]:
import numpy as np

MIN_SCORE = 15.0

counts = np.array(counts) + 1
scores = np.linalg.norm(model.wv[words], axis=1) * np.log(counts)
freq_words_index = np.nonzero(scores > MIN_SCORE)[0]
freq_words = [words[i] for i in freq_words_index]
print(len(freq_words))
print(freq_words[:10])
print(scores[freq_words_index][:10])

34644
['何处', '不知', '万里', '千里', '不见', '白云', '今日', '不可', '春风', '不得']
[139.97656432 150.45272063 181.19890067 188.48421925 180.35110888
 191.70653118 162.29948331  69.08332987 173.11758386 169.24598314]


在利用词评分分词的时候，遇到了词的一部分的评分低于整体评分，导致分词结果带有冗余成分的问题。

因此，对初步分词的结果进行细化，比如将“人何处”分成“人”和“何处”，对于成为其他词一部分的词，降低选取标准。

In [85]:
freq_words_sub = set()
for n in range(2, 5):
    freq_words_sub.update(
        ''.join(wd_t) for wd in freq_words
        for wd_t in zip(*(wd[i:] for i in range(n)))
    )
print(list(freq_words_sub)[:10])

['平生官', '今春', '有时帘', '日骑', '首暮', '万里从', '道国', '至今听', '回首陇', '日夕白']


In [86]:
import tqdm

true_words = set()
for wd in tqdm.tqdm(
    sorted(freq_words_sub, key=lambda w: words_counter[w], reverse=True)
):
    if all(wd_p not in wd for wd_p in true_words) and \
        np.linalg.norm(model.wv[wd]) * np.log(words_counter[wd] + 1) > MIN_SCORE / 3:
        true_words.add(wd)
true_words = list(true_words)

100%|██████████| 50134/50134 [00:53<00:00, 932.42it/s]


In [87]:
true_counts = np.array([words_counter[wd] for wd in true_words])
true_scores = np.linalg.norm(model.wv[true_words], axis=1) * np.log(true_counts)
freq_t_words_index = np.nonzero(true_scores > MIN_SCORE)[0]
print([words[i] for i in freq_t_words_index[:10]])

['万里', '今日', '不可', '春风', '不得', '明月', '风吹', '惆怅', '故人', '秋风']


In [88]:
with open('words.txt', 'w', encoding='utf-8') as f:
    for i in sorted(freq_t_words_index, key=lambda i: -true_scores[i]):
        f.write(f'{true_words[i]} {true_scores[i]}\n')

最后统计一下单字的结果，按照单字在词表中的词频排序。

In [None]:
character_counter = Counter()
for i in freq_t_words_index:
    character_counter.update(true_words[i])

with open('characters.txt', 'w', encoding='utf-8') as f:
    for c, cnt in character_counter.most_common():
        f.write(f'{c} {cnt}\n')

三、挖掘近义词关系

首先，将词表中的词和所有的单字读入。

In [5]:
wds = []

with open('words.txt', 'r', encoding='utf-8') as f:
    wds += [line.split()[0] for line in f]

with open('characters.txt', 'r', encoding='utf-8') as f:
    wds += [line.split()[0] for line in f]

print(wds[:10])

['白云', '千里', '万里', '不见', '春风', '不得', '故人', '悠悠', '今日', '十年']


建立共现网络，判断连接情况相似的词为近义词。

此处认为在同一首诗中出现即为共现。

In [6]:
import tqdm
import itertools
import numpy as np

wds_inv = {wd: index for index, wd in enumerate(wds)}

coo = np.zeros((len(wds), len(wds)), dtype=int)
for n in tqdm.trange(0, 58):
    for lines in read_poem(n):
        line = '。'.join(lines)
        near_wds = {
            wds_inv[wd]
            for wd in itertools.
            chain((''.join(wd_t) for wd_t in zip(line, line[1:])), line)
            if wd in wds_inv
        }
        for i, j in itertools.product(near_wds, near_wds):
            coo[i, j] += 1

100%|██████████| 58/58 [02:21<00:00,  2.44s/it]


构建共现网络，有共现关系的设置边权为1，否则为0。

In [7]:
weights = np.zeros(coo.shape)
np.place(weights, coo > 0, 1)

对共现矩阵进行 PCA 降维，得到基于共现关系的词向量$\mathrm{v}_w$。

In [8]:
from sklearn.decomposition import PCA

vectors_coo = PCA(n_components=200).fit_transform(weights)
vectors_coo /= np.linalg.norm(vectors_coo, axis=1)[:, np.newaxis]
connection_coo = vectors_coo @ vectors_coo.T

In [9]:
import pandas as pd

i, j = 15, 10
pd.DataFrame(connection_coo[:i, :j], index=wds[:i], columns=wds[:j])

Unnamed: 0,白云,千里,万里,不见,春风,不得,故人,悠悠,今日,十年
白云,1.0,0.936267,0.925546,0.920675,0.885184,0.897835,0.887579,0.903412,0.910064,0.879519
千里,0.936267,1.0,0.978596,0.956124,0.915435,0.936532,0.915425,0.916342,0.950373,0.923391
万里,0.925546,0.978596,1.0,0.961704,0.911911,0.938451,0.905343,0.916629,0.953293,0.91856
不见,0.920675,0.956124,0.961704,1.0,0.926425,0.956222,0.876557,0.919297,0.960588,0.914188
春风,0.885184,0.915435,0.911911,0.926425,1.0,0.900608,0.88194,0.885647,0.923731,0.868277
不得,0.897835,0.936532,0.938451,0.956222,0.900608,1.0,0.863441,0.908748,0.940294,0.920743
故人,0.887579,0.915425,0.905343,0.876557,0.88194,0.863441,1.0,0.898188,0.892803,0.876716
悠悠,0.903412,0.916342,0.916629,0.919297,0.885647,0.908748,0.898188,1.0,0.91365,0.883448
今日,0.910064,0.950373,0.953293,0.960588,0.923731,0.940294,0.892803,0.91365,1.0,0.9253
十年,0.879519,0.923391,0.91856,0.914188,0.868277,0.920743,0.876716,0.883448,0.9253,1.0


In [8]:
def most_similar(wd, connection, n=10):
    k = wds.index(wd)
    topn_index = np.argpartition(-connection[k], n)[:n]
    return sorted(
        [(wds[i], connection[k, i]) for i in topn_index], key=lambda x: -x[1]
    )

In [10]:
most_similar('桃李', connection_coo, n=10)

[('桃李', 0.9999999999999993),
 ('李', 0.7682370575452178),
 ('春风', 0.7661175648578242),
 ('可怜', 0.7475218440438887),
 ('美人', 0.7457272028695293),
 ('扇', 0.7450352150092245),
 ('长安', 0.7441540048791975),
 ('黄金', 0.7429098172530839),
 ('年年', 0.7424770348500368),
 ('帐', 0.741735314433345)]

四、建立近义词数据库

按照共现矩阵，将与每个词最接近的10个词认定为近义词，写入到数据库中。

In [3]:
import yaml
import py2neo

with open('../cipher.yml', 'r', encoding='utf-8') as f:
    cipher = yaml.safe_load(f)

graph = py2neo.Graph(cipher['url'], auth=("neo4j", cipher['passwd']))

In [4]:
graph.run("CREATE INDEX IF NOT EXISTS FOR (w:Word) ON (w.name)")

In [4]:
import tqdm
from py2neo.bulk import merge_nodes
from itertools import islice

batch_size = 1000
stream = ((wd, ) for wd in wds)
for _ in tqdm.trange(len(wds) // batch_size + 1):
    batch = islice(stream, batch_size)
    if batch:
        merge_nodes(
            graph.auto(),
            batch, ('Word', 'name'),
            labels=('Word', ),
            keys=('name', )
        )


100%|██████████| 12/12 [00:00<00:00, 15.80it/s]


In [15]:
import tqdm
import numpy as np
from py2neo.bulk import merge_relationships
from itertools import islice

K = 10
rels = []
for i, wd in enumerate(tqdm.tqdm(wds)):
    topk = np.argpartition(-connection_coo[i], K)[:K]
    for j in topk:
        if i != j:
            rels.append((wd, (float(connection_coo[i, j]), ), wds[j]))

stream = iter(rels)
batch_size = 1000

for _ in tqdm.trange(len(rels) // batch_size + 1):
    batch = islice(stream, batch_size)
    if batch:
        merge_relationships(
            graph.auto(),
            batch,
            'SIMILAR',
            start_node_key=('Word', 'name'),
            end_node_key=('Word', 'name'),
            keys=['weight']
        )


100%|██████████| 11745/11745 [00:01<00:00, 8492.92it/s] 
100%|██████████| 106/106 [00:05<00:00, 18.39it/s]


近义词读取示例：

In [14]:
node = graph.nodes.match('Word', name='千里').first()
if node:
    sims = graph.relationships.match((node, None), r_type='SIMILAR').all()
    for sim in sims:
        print(sim.end_node['name'], sim['weight'])

沙 0.9791039912825853
州 0.9775936852298319
亭 0.9776948234877295
征 0.9809392490464931
吴 0.9810284176635078
川 0.9791498953812668
楚 0.9801299473151225
陵 0.9796371665953145
秦 0.9782973471070116


利用束搜索拓宽近义词范围。

In [17]:
def similar_words(wd, depth=3, width=10):
    beam = [(graph.nodes.match('Word', name=wd).first(), 1.0)]
    if not beam[0][0]:
        return []
    for _ in range(depth):
        possibility = beam.copy()
        selected = {node['name'] for node, _ in beam}
        for node, score in beam:
            sims = graph.relationships.match((node, None),
                                             r_type='SIMILAR').all()
            for sim in sims:
                if sim.end_node['name'] not in selected:
                    selected.add(sim.end_node['name'])
                    possibility.append((sim.end_node, score * sim['weight']))
        possibility.sort(key=lambda x: -x[1])
        beam = possibility[:width]

    return [(node['name'], score) for node, score in beam]


对于未登录词，考虑将词语拆分，获取每一部分的近义词，构建并集，奖励交集。

In [19]:
def similar_words_plus(wd, depth=3, width=10):
    beam = [
        (
            wd_t, graph.nodes.match('Word',
                                    name=wd_t).first(), len(wd_t) / len(wd)
        )
        for wd_t in [''.join(wd_t) for wd_t in zip(*(wd, wd[1:]))] + list(wd)
    ]
    beam = [(w, n, s) for w, n, s in beam if n]
    if not beam:
        return []
    for _ in range(depth):
        possibility = {wd: (wd, node, score) for wd, node, score in beam}
        for _, node, score in beam:
            sims = graph.relationships.match((node, None),
                                             r_type='SIMILAR').all()
            for sim in sims:
                name = sim.end_node['name']
                s = score * sim['weight']
                if name not in possibility:
                    possibility[name] = (name, sim.end_node, s)
                else:
                    possibility[name] = (
                        name, sim.end_node,
                        ((s + possibility[name][-1]) / 2)**0.5
                    )
        beam = sorted(possibility.values(), key=lambda x: -x[-1])[:width]

    return [(wd, score) for wd, _, score in beam]


In [20]:
similar_words('桃李')

[('桃李', 1.0),
 ('李', 0.7705247653801758),
 ('春风', 0.7644110987657782),
 ('美人', 0.748435975125649),
 ('可怜', 0.7471985199658665),
 ('燕', 0.7457832058205238),
 ('年年', 0.7447035633415184),
 ('颜', 0.7444414559524974),
 ('长安', 0.7436674595971658),
 ('黄金', 0.7431579091933598)]

五、基于近义词关系的主题词挖掘

使用 neo4j 数据库的 gds 拓展，利用图算法进行聚类与主题词挖掘。

调用 neo4j 的 Louvain 算法，对近义词进行聚类，并将聚类结果写入数据库中。

In [6]:
# 在 neo4j 中建立子图
if not graph.run("CALL gds.graph.exists('words')").data()[0]['exists']:
    graph.run("CALL gds.graph.create('words', 'Word', 'SIMILAR') YIELD nodeCount, relationshipCount")

In [4]:
graph.run("""CALL gds.louvain.write('words', { 
    writeProperty: 'topic'
}) YIELD communityCount, modularity, modularities""")

communityCount,modularity,modularities
125,0.7658609435171729,[0.7658609435171729]


In [11]:
topics = graph.run("MATCH (w:Word) RETURN DISTINCT w.topic").to_ndarray(dtype=int).squeeze()
print(topics)

[ 6697   228  5783     8  7014  7205  5616  6652  6933  8081  7477  8749
   707  6582  6704   215  5308    36   532  7262   238  7062  7930 10006
  6217   180   621  7293  6603  7726  7143  5804  7185  7576   372  7789
  7848  7502   643  7586  2414  7836   268  5072  8772  7371   135  6749
  2448   284  8368  7271 10604  9125  9459  7846  8747  8250  2182  9406
  7497   235  7285  4572  7867   275   745  8583   552  6769  6881  6512
  8446  7795  7770  2076  8679 10142  7755  5916 10657  7144  6742  5618
  2610   472 10208  7686  7359   676  5514   653  8329   686   692  2333
  7131  2013  9039  2458  2681  7052  5827  4114  5426   581  7228  7598
  5560  6959  2871  2950  7079  2601   193  5894  3058  6216  5729  7662
  6800  5058   108  6853  7459]


In [6]:
from py2neo.bulk import merge_nodes

topic_nodes = ((int(topic), ) for topic in topics)
graph.run("CREATE INDEX IF NOT EXISTS FOR (t:Topic) ON (t.id)")
merge_nodes(graph.auto(), topic_nodes, ('Topic', 'id'), labels=('Topic', ), keys=('id', ))

In [7]:
graph.run(
    """MATCH (w:Word)
        OPTIONAL MATCH (t:Topic) WHERE w.topic=t.id
        MERGE (w)-[r:BELONG]->(t)"""
)


分别对每一类调用 neo4j 的 PageRank 算法，计算中心度 $\mathrm{PageRank}_w$，并将结果写入数据库中。

In [26]:
import tqdm
for topic in tqdm.tqdm(topics):
    graph.run(
        """CALL gds.pageRank.write({
        nodeQuery: 'MATCH (w:Word) WHERE w.topic = $topic RETURN id(w) as id',
        relationshipQuery: 'MATCH (w1:Word)-[r:SIMILAR]-(w2:Word) 
            WHERE w1.topic = $topic AND w2.topic = $topic 
            RETURN id(w1) AS source, id(w2) AS target, r.weight AS weight',
        writeProperty: 'pagerank', 
        relationshipWeightProperty: 'weight'
    }) YIELD nodePropertiesWritten, ranIterations""".replace(
            '$topic', str(topic)
        )
    )


100%|██████████| 125/125 [00:12<00:00, 10.16it/s]


In [27]:
def topic_words(wd: str = '', topic: int = 0):
    topic = int(topic) or graph.nodes.match('Word', name=wd).first()['topic']
    nodes = graph.nodes.match('Word', topic=topic).order_by('-_.pagerank'
                                                            ).limit(10).all()
    return [node['name'] for node in nodes]

In [28]:
import tqdm
with open('topics.txt', 'w', encoding='utf-8') as f:
    for topic in tqdm.tqdm(topics):
        f.write(f"{topic} {' '.join(topic_words(topic=topic))}\n")

100%|██████████| 125/125 [00:04<00:00, 26.72it/s]


In [29]:
import tqdm
with open('word_topics.csv', 'w', encoding='utf-8') as f:
    f.write("name,topic,pagerank\n")
    for wd in tqdm.tqdm(wds):
        node = graph.nodes.match('Word', name=wd).first()
        topic, pagerank = node['topic'], node['pagerank']
        f.write(f"{wd},{topic},{pagerank}\n")


100%|██████████| 11745/11745 [01:05<00:00, 178.17it/s]


六、将诗歌数据存入数据库中

In [20]:
graph.run("CREATE INDEX IF NOT EXISTS FOR (p:Author) ON (p.name)")
graph.run("CREATE INDEX IF NOT EXISTS FOR (p:Author) ON (p.id)")
graph.run("CREATE INDEX IF NOT EXISTS FOR (p:Poem) ON (p.id)")
graph.run("CREATE INDEX IF NOT EXISTS FOR (p:Poem) ON (p.title)")
graph.run("CREATE FULLTEXT INDEX paragraphs IF NOT EXISTS FOR (p:Poem) ON EACH [p.paragraphs, p.title]")

In [1]:
import tqdm as tq
from itertools import islice

def batch_cut(it, batch_size=1000, tqdm=True):    
    l = len(it) // batch_size + 1
    rr = tq.trange(l) if tqdm else range(l)
    it = iter(it)
    for _ in rr:
        batch = islice(it, batch_size)
        if batch:
            yield batch

In [7]:
import json
from py2neo.bulk import merge_nodes

with open('data/authors.tang.json', 'r', encoding='utf-8') as f:
    authors = json.load(f)

for batch in batch_cut(authors):
    merge_nodes(graph.auto(), batch, ('Author', 'name'), labels=('Author', ))


100%|██████████| 4/4 [00:00<00:00,  9.32it/s]


In [8]:
import json
import tqdm as tq

def get_poems_info(max_n, start_n=0, tqdm=False):
    rr = tq.trange(start_n, max_n) if tqdm else range(start_n, max_n)
    for n in rr:
        with open(f'./data/poet.tang.{n}.json', 'r', encoding='utf-8') as f:
            poems = json.load(f)
        yield from poems


In [36]:
from py2neo.bulk import merge_nodes

N = 58

poems = get_poems_info(N)
batch_size = 1000

data = [
    {
        'id': poem['id'],
        'title': poem['title'],
        'author': poem['author'],
        'paragraphs': '\n'.join(poem['paragraphs']),
    } for poem in poems
]

for batch in batch_cut(data):
    merge_nodes(graph.auto(), batch, ('Poem', 'id'), labels=('Poem', ))

del data

100%|██████████| 58/58 [00:05<00:00, 10.39it/s]


In [11]:
from py2neo.bulk import merge_relationships

N = 58

poems = get_poems_info(N)
batch_size = 1000

data = [(poem['author'], tuple(), poem['id']) for poem in poems]

for batch in batch_cut(data):
    merge_relationships(
        graph.auto(),
        batch,
        'WRITE',
        start_node_key=('Author', 'name'),
        end_node_key=('Poem', 'id'),
        keys=[]
    )

del data

100%|██████████| 58/58 [00:02<00:00, 22.16it/s]


In [15]:
from py2neo.bulk import merge_relationships
from collections import Counter

N = 58
wds_set = set(wds)


def rels_poem_word():
    for poem in get_poems_info(N, tqdm=True):
        s = ''.join(poem['paragraphs'])
        c = Counter(''.join(wd) for wd in chain(zip(s, s[1:]), s))
        for wd, count in c.items():
            if wd in wds_set:
                yield poem['id'], (count, ), wd


data = list(rels_poem_word())

for batch in batch_cut(data):
    merge_relationships(
        graph.auto(),
        batch,
        'CONTAIN',
        start_node_key=('Poem', 'id'),
        end_node_key=('Word', 'name'),
        keys=['count']
    )

del data

100%|██████████| 58/58 [00:05<00:00,  9.70it/s]
100%|██████████| 3100/3100 [02:19<00:00, 22.19it/s]


七、计算诗和词语的匹配度

$$\mathrm{Match}[P,w]=\frac{\alpha}{|P|}\sum_{w'\in \mathcal{S}\cap P}\mathrm{v}_w\cdot\mathrm{v}_{w'} + 
\frac{1-\alpha}{|\mathcal{T}_w\cap P|}\sum_{w'\in\mathcal{T}_w\cap P}\ln(1+\mathrm{PageRank}_{w'})$$

其中$P$为一首诗包含的所有词语，$w$为一个词语，$\mathcal{S}$为$w$的近义词集合，$\mathcal{T}$为$w$所在主题的词语集合。

$\alpha$为比例系数，经实验取0.85。

第一项：基于近义词的文本匹配：

$$\frac{1}{|P|}\sum_{w'\in \mathcal{S}\cap P}\mathrm{v}_w\cdot\mathrm{v}_{w'}$$

一首诗包含某个词语的近义词越多，就认为这首诗和这个词的匹配度越高。

In [45]:
import numpy as np
from py2neo.bulk import merge_relationships
from itertools import chain

N = 58
K = 10

wds_inv = {wd: index for index, wd in enumerate(wds)}
num = connection_coo.shape[0]
mmap = np.zeros(connection_coo.shape)
for i, wd in enumerate(tqdm.tqdm(wds)):
    topk = np.argpartition(-connection_coo[i], K)[:K]
    mmap[i, topk] = connection_coo[i, topk]
    mmap[i, i] = 0


def rels_poem_word():
    for poem in get_poems_info(N, tqdm=True):
        s = ''.join(poem['paragraphs'])
        c = set(''.join(wd) for wd in chain(zip(s, s[1:]), s))
        c = [wd for wd in c if wd in wds_inv]
        for wd in c:            
            matching = sum(float(mmap[wds_inv[wd], wds_inv[wd2]]) for wd2 in c)
            yield poem['id'], (float(matching) / len(c), ), wd


data = list(rels_poem_word())

for batch in batch_cut(data):
    merge_relationships(
        graph.auto(),
        batch,
        'CONTAIN',
        start_node_key=('Poem', 'id'),
        end_node_key=('Word', 'name'),
        keys=['matching']
    )

del data

100%|██████████| 11745/11745 [00:01<00:00, 8666.58it/s] 
100%|██████████| 58/58 [01:55<00:00,  1.99s/it]
100%|██████████| 3100/3100 [02:39<00:00, 19.42it/s]


In [36]:
graph.run("MATCH ()-[r:CONTAIN]-() WHERE r.matching IS NULL SET r.matching=0.0")

第二项：基于主题词模型的文本匹配：

$$\frac{1}{|\mathcal{T}_w\cap P|}\sum_{w'\in\mathcal{T}_w\cap P}\ln(1+\mathrm{PageRank}_{w'})$$

基于诗中包含词语所属主题与词语的 PageRank，计算每首诗与主题的关联度。

In [41]:
import pandas as pd
from collections import defaultdict
from py2neo.bulk import merge_relationships
from itertools import chain

df = pd.read_csv('word_topics.csv')
topics = df['topic']
pr = df['pagerank']
wds_inv = {wd: index for index, wd in enumerate(wds)}

N = 58


def rels_poem_topic():
    for poem in get_poems_info(N, tqdm=True):
        s = ''.join(poem['paragraphs'])
        c = set(''.join(wd) for wd in chain(zip(s, s[1:]), s))
        c = [wd for wd in c if wd in wds_inv]
        relevance = defaultdict(float)
        counts = defaultdict(int)
        for wd in c:
            index = wds_inv[wd]            
            relevance[topics[index]] += np.log(pr[index] + 1)
            counts[topics[index]] += 1
        for topic, rele in relevance.items():
            yield poem['id'], (float(rele) / counts[topic], ), int(topic)


data = list(rels_poem_topic())

for batch in batch_cut(data):
    merge_relationships(
        graph.auto(),
        batch,
        'RELEVANT',
        start_node_key=('Poem', 'id'),
        end_node_key=('Topic', 'id'),
        keys=['relevance']
    )

del data

100%|██████████| 58/58 [00:39<00:00,  1.46it/s]
100%|██████████| 498/498 [00:19<00:00, 24.92it/s]


In [43]:
import pandas as pd
from collections import defaultdict
from py2neo.bulk import merge_relationships
from itertools import chain
import numpy as np

df = pd.read_csv('word_topics.csv')
topics = df['topic']
pr = df['pagerank']
wds_inv = {wd: index for index, wd in enumerate(wds)}

N = 58


def rels_poem_topic_word():
    for poem in get_poems_info(N, tqdm=True):
        s = ''.join(poem['paragraphs'])
        c = set(''.join(wd) for wd in chain(zip(s, s[1:]), s))
        c = [wd for wd in c if wd in wds_inv]
        relevance = defaultdict(float)
        counts = defaultdict(int)
        for wd in c:
            index = wds_inv[wd]
            relevance[topics[index]] += float(np.log(pr[index] + 1))
            counts[topics[index]] += 1
        for wd in c:
            index = wds_inv[wd]
            rele = relevance[topics[index]]
            yield poem['id'], (float(rele) / counts[topics[index]], ), wd


data = list(rels_poem_topic_word())

for batch in batch_cut(data):
    merge_relationships(
        graph.auto(),
        batch,
        'CONTAIN',
        start_node_key=('Poem', 'id'),
        end_node_key=('Word', 'name'),
        keys=['relevance']
    )

del data

100%|██████████| 58/58 [01:01<00:00,  1.06s/it]
100%|██████████| 3100/3100 [02:38<00:00, 19.50it/s]


In [21]:
graph.run("MATCH ()-[r:CONTAIN]-() WHERE r.relevance IS NULL SET r.relevance=0.0")

八、基于中心性与影响力的诗歌离线排名

按照诗歌所包含词语的相似性，来判断诗歌的相似性。在相似网络中的 PageRank 越高，影响力越高。

以此为依据，计算搜索结果中的诗歌排名：

$$\mathrm{Score}[P, w]=\sqrt{\ln(1+\mathrm{PageRank}[P])}\times \mathrm{Match}[P, w]$$

In [10]:
import tqdm
import pandas as pd
import numpy as np
from itertools import chain

df = pd.concat([pd.read_json(f'data/poet.tang.{n}.json') for n in range(58)])
df.index = range(len(df))
wds_inv = {wd: index for index, wd in enumerate(wds)}

pwds = np.zeros((len(df), 200))
for i, row in tqdm.tqdm(df.iterrows(), total=len(df)):
    s = ''.join(row['paragraphs'])
    c = set(''.join(wd) for wd in chain(zip(s, s[1:]), s))
    c = [wds_inv[wd] for wd in c if wd in wds_inv]
    for wdi in c:
        pwds[i] += vectors_coo[wdi]

100%|██████████| 57612/57612 [00:12<00:00, 4484.16it/s]


In [11]:
from sklearn.decomposition import PCA

pwds = PCA(n_components=200).fit_transform(pwds)
norms = np.linalg.norm(pwds, axis=1)[:, np.newaxis]
ppwds = np.divide(pwds, norms, out=pwds, where=norms != 0)

In [17]:
import numpy as np
import tqdm

K = 16
kmost = np.zeros((len(pwds), K), dtype=int)
for i, pvec in enumerate(tqdm.tqdm(pwds)):
    sims = pwds.dot(pvec)
    kmost[i] = (-sims).argpartition(K)[:K]

print(kmost[:10])

100%|██████████| 57612/57612 [04:15<00:00, 225.17it/s]

[[    0  4035 52162 10305   144  3268  3411  2885 29669 23100  4932  2765
  41206  4882 17367 31517]
 [  437   221 27301   570   761   519     1   384 50366   242   748  2470
  17550   756 25246  2448]
 [  734    57     2   791   810  2476   844 11051  8477   489 18134  2554
  27301   730 13381   615]
 [ 2604 52597   576     3 32178   396  2553  2891  2554  2890 52164 52162
   4053  2886   716  5294]
 [52164   305  3827  3264     4    74  3991 50509  3997  3587    76  5209
   3486  4224  5299  3460]
 [23787  8982  4154 13031 41277     5 17459 17323 28696 43171 39555 25844
   6714 25407 32420   199]
 [    6 41154  2587 52164  2884  5831 50509 17454    46  5319 44900 17421
   2592  4932 28295 17094]
 [28057   575   430   844     7   678   620   402  2462   395   474   437
    791  4507   418  3891]
 [    8  8878  6977 29428  1975 29388 49182 13329  1417 18741  2213 39863
  12047 20844 43917 16413]
 [ 2962 47288 43617  2960 17437 18580 47301   120 47297     9  7681 17176
   2586   117 499




In [22]:
import tqdm
import numpy as np
from py2neo.bulk import merge_relationships
from itertools import islice

ids = df['id']
rels = []
for i, topk in enumerate(tqdm.tqdm(kmost)):
    for j in topk:
        if i != j:
            rels.append((ids[i], (float(pwds[i].dot(pwds[j])), ), ids[j]))

stream = iter(rels)
batch_size = 1000

for _ in tqdm.trange(len(rels) // batch_size + 1):
    batch = islice(stream, batch_size)
    if batch:
        merge_relationships(
            graph.auto(),
            batch,
            'INFLUENCE',
            start_node_key=('Poem', 'id'),
            end_node_key=('Poem', 'id'),
            keys=['weight']
        )


100%|██████████| 57612/57612 [00:09<00:00, 5911.99it/s]
100%|██████████| 865/865 [01:19<00:00, 10.87it/s]


In [61]:
# 在 neo4j 中建立子图
if not graph.run("CALL gds.graph.exists('poems')").data()[0]['exists']:
    graph.run("""CALL gds.graph.create(
        'poems', 
        'Poem', 
        { INFLUENCE: { orientation: 'UNDIRECTED', properties: "weight" } }
    ) YIELD nodeCount, relationshipCount""")

In [65]:
graph.run(
    """CALL gds.pageRank.write('poems', {
        writeProperty: 'pagerank', 
        relationshipWeightProperty: 'weight'
    }) YIELD nodePropertiesWritten, ranIterations"""
)
graph.run("MATCH (p:Poem) SET p.pagerank=log(1+p.pagerank)^0.5")

九、根据匹配度，查询诗歌

In [76]:
def match_poems(wd, start=0, count=10):
    return graph.run(
        """MATCH (p:Poem)-[r:CONTAIN]->(:Word {name:'$wd'})
            WITH p.title AS title, p.author AS author, p.paragraphs AS paragraphs, 
                p.pagerank * (r.matching*0.85 + r.relevance*0.15) AS score
            RETURN title, author, paragraphs, score
            ORDER BY score DESC SKIP $s LIMIT $e
        """.replace("$wd", str(wd)).replace("$s", str(start)).replace("$e", str(count))
    ).to_data_frame()

In [77]:
match_poems('东风', start=30, count=10)

Unnamed: 0,title,author,paragraphs,score
0,折杨柳,李白,垂杨拂绿水，摇艳东风年。\n花明玉关雪，叶暖金窗烟。\n美人结长想，对此心凄然。\n攀条折春...,0.12515
1,早春浐水送友人,温庭筠,青门烟野外，渡浐送行人。\n鸭卧溪沙暖，鸠鸣社树春。\n残波青有石，幽草绿无尘。\n杨柳东风...,0.124896
2,杂曲歌辞 独不见,杨巨源,东风艳阳色，柳绿花如霰。\n竞理同心鬟，争持合欢扇。\n香传贾娘手，粉离何郎面。\n最恨卷帘...,0.124884
3,醉后赠从甥高镇,李白,马上相逢揖马鞭，客中相见客中怜。\n欲邀击筑悲歌饮，正值倾家无酒钱。\n江东风光不借人，枉杀...,0.124327
4,春游值雨,张旭,欲寻轩槛列清尊，江上烟云向晚昏。\n须倩东风吹散雨，明朝却待入华园。,0.123736
5,东风二章 一,欧阳詹,东风叶时，匪沃匪飘。\n莫雪凝川，莫阴沍郊。\n朝不徯夕乃销，东风之行地上兮。\n上德临慝，...,0.122826
6,小重山 一,薛昭蕴,春到长门春草青，玉阶华露滴，月胧明。\n东风吹断紫箫声，宫漏促，帘外晓啼莺。\n愁极梦难成，...,0.12273
7,南阳道中作,窦巩,东风雨洗顺阳川，蜀锦花开绿草田。\n彩雉鬬时频驻马，酒旗翻处亦留钱。\n新晴日照山头雪，薄暮...,0.122572
8,柳枝词十首 四,徐铉,绿水成文柳带摇，东风初到不鸣条。\n龙舟欲过偏留恋，万缕轻丝拂御桥。,0.12217
9,杂曲歌辞 杨柳枝 三,温庭筠,苏小门前柳万条，毵毵金线拂平桥。\n黄莺不语东风起，深闭朱门伴细腰。,0.121657


In [80]:
wds_set = set(wds)
def pull_words(s):    
    c = set(''.join(wd) for wd in chain(zip(s, s[1:]), s))
    return [wd for wd in c if wd in wds_set]

In [81]:
def match_poems_ex(word, start=0, count=10):
    return graph.run(
        """MATCH (p:Poem)-[r:CONTAIN]->(w:Word)
            WHERE w.name IN ['$wds']
            WITH p.title AS title, p.author AS author, p.paragraphs AS paragraphs, 
                p.pagerank * (r.matching*0.85 + r.relevance*0.15) AS score
            RETURN title, author, paragraphs, score
            ORDER BY score DESC SKIP $s LIMIT $e
        """.replace("$wds",
                    "','".join(pull_words(word)
                             )).replace("$s",
                                        str(start)).replace("$e", str(count))
    ).to_data_frame()

In [82]:
match_poems_ex('刻晴')

Unnamed: 0,title,author,paragraphs,score
0,句,李洞,鱼弄晴波影上帘。,0.22097
1,桃花,薛能,香色自天种，千年岂易逢。\n开齐全未落，繁极欲相重。\n泠湿朝如淡，晴干午更浓。\n风光新社...,0.207983
2,泊凫矶江馆,赵嘏,风雪晴来岁欲除，孤舟晚下意何如。\n月当轩色湖平后，雁断云声夜起初。\n傍晓管弦何处静，犯寒...,0.205527
3,仙掌,齐己,峭形寒倚夕阳天，毛女莲花翠影连。\n云外自为高出手，人间谁合鬬挥拳。\n鹤抛青汉来岩桧，僧隔...,0.198443
4,自和次前韵,陆龟蒙,命既时相背，才非世所容。\n著书粮易绝，多病药难供。\n梦为怀山数，愁因戒酒浓。\n鸟媒呈不...,0.195114
5,游崔监丞城南别业,刘得仁,门与青山近，青山复几重。\n雪融皇子岸，春浥翠微峰。\n地有经冬草，林无未老松。\n竹寒溪隔...,0.186493
6,四明山诗 石窗,陆龟蒙,石窗何处见，万仞倚晴虚。\n积霭迷青璅，残霞动绮疏。\n山应列圆峤，宫便接方诸。\n祗有三奔...,0.180621
7,酬张少尹秋日凤翔西郊见寄,耿𣲗,鼎气孕河汾，英英济旧勋。\n刘生曾任侠，张率自能文。\n官佐征西府，名齐将上军。\n秋山遥出...,0.179874
8,送王昌涉侍御,张祜,十里指东平，军前首出征。\n诸侯青服旧，御史紫衣荣。\n入陈枭心死，分围虎力生。\n画时安楚...,0.178809
9,五月水边柳,张又新,结根挺涯涘，垂影覆清浅。\n睡脸寒未开，懒腰晴更软。\n摇空条已重，拂水带方展。\n似醉烟景...,0.176507
