digoal
2020-03-26
PostgreSQL , pg-knn_hamming
https://github.com/jrodatus/pg-knn_hamming
ghtree实现的海明距离排序索引, 性能不错
- ghtree: This is a simple Generalized Hyperplane tree (gh-tree) using SP-GiST indexing facilities, with good performance. It's constructed as described here (backup link). "At the top node, pick two points. Then, divide the remaining points based on which of these two they are closer to. Now, recursively build both branches. This method is an improvement [against VP-trees] in that it is symmetric and the tree structure still tends to be well balanced."
- old: This was my initial implementation using a VP-tree. Although fast for Euclidean space, for some reason it performs extremely poorly when the Hamming metric is used (~10 seconds on 10M records, as bad as linear search for me). I also tried Alexander Korotkov's higher splitting-degree version as a base, with the same results. I gave up on this when I found gh-trees.
My test system is an Ideapad 120S with 2 GB RAM, Celeron N3350 @ 1.1Ghz (sorry), OS Debian Stretch.
Nearest-neighbor search on 10 million BIGINT records takes approx 150 milliseconds. (Probably faster with a decent computer.) Test as follows:
CREATE EXTENSION ghtree;
CREATE TABLE gh_test AS (SELECT floor(random()*9223372036854775807)::BIGINT AS value FROM generate_series(0,10000000) AS i);
CREATE INDEX gh_idx ON gh_test USING spgist(value ghtree_ops);
\timing
select floor(random()*9223372036854775807)::BIGINT;
SELECT * FROM gh_test ORDER BY value <=> 12345 LIMIT 10;
postgres=# SELECT value <=> 3385948339929088,* FROM gh_test ORDER BY value <=> 3385948339929088 LIMIT 10;
?column? | value
----------+---------------------
0 | 3385948339929088
8 | 581535551044648960
8 | 22032652001574912
8 | 3387872490487808
10 | 41962318095089664
10 | 4615139042357706752
10 | 745351925708849152
10 | 2884704552572780544
10 | 183948984824823808
10 | 580336533242347520
(10 rows)
Time: 0.292 ms
postgres=# select 3385948339929088::int8::bit(64);
bit
------------------------------------------------------------------
0000000000001100000001111000000010001000000001111000000000000000
(1 row)
postgres=# select 581535551044648960::int8::bit(64);
bit
------------------------------------------------------------------
0000100000010010000001111000000010101000010001110000000000000000
(1 row)
postgres=# select int8xor(3385948339929088::int8, 581535551044648960::int8);
int8xor
--------------------
584905002145841152
(1 row)
postgres=# select int8xor(3385948339929088::int8, 581535551044648960::int8)::bit(64);
int8xor
------------------------------------------------------------------
0000100000011110000000000000000000100000010000001000000000000000
(1 row)
文本相似搜索结合simhash
https://www.cnblogs.com/jiyuqi/p/4845969.html
simhash是由 Charikar 在2002年提出来的,参考 《Similarity estimation techniques from rounding algorithms》 。 介绍下这个算法主要原理,为了便于理解尽量不使用数学公式,分为这几步:
1、分词,把需要判断文本分词形成这个文章的特征单词。最后形成去掉噪音词的单词序列并为每个词加上权重,我们假设权重分为5个级别(1~5)。比如:“ 美国“51区”雇员称内部有9架飞碟,曾看见灰色外星人 ” ==> 分词后为 “ 美国(4) 51区(5) 雇员(3) 称(1) 内部(2) 有(1) 9架(3) 飞碟(5) 曾(1) 看见(3) 灰色(4) 外星人(5)”,括号里是代表单词在整个句子里重要程度,数字越大越重要。
2、hash,通过hash算法把每个词变成hash值,比如“美国”通过hash算法计算为 100101,“51区”通过hash算法计算为 101011。这样我们的字符串就变成了一串串数字,还记得文章开头说过的吗,要把文章变为数字计算才能提高相似度计算性能,现在是降维过程进行时。
3、加权,通过 2步骤的hash生成结果,需要按照单词的权重形成加权数字串,比如“美国”的hash值为“100101”,通过加权计算为“4 -4 -4 4 -4 4”;“51区”的hash值为“101011”,通过加权计算为 “ 5 -5 5 -5 5 5”。
4、合并,把上面各个单词算出来的序列值累加,变成只有一个序列串。比如 “美国”的 “4 -4 -4 4 -4 4”,“51区”的 “ 5 -5 5 -5 5 5”, 把每一位进行累加, “4+5 -4+-5 -4+5 4+-5 -4+5 4+5” ==》 “9 -9 1 -1 1 9”。这里作为示例只算了两个单词的,真实计算需要把所有单词的序列串累加。
5、降维,把4步算出来的 “9 -9 1 -1 1 9” 变成 0 1 串,形成我们最终的simhash签名。 如果每一位大于0 记为 1,小于0 记为 0。最后算出结果为:“1 0 1 0 1 1”。
原理图:
对比这个:
https://github.com/fake-name/pg-spgist_hamming
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.