-
Notifications
You must be signed in to change notification settings - Fork 2
Development Note
Drawed by drawerd.com
Databasse design is inspired by hackershare
FullText search is done with PostgreSQL's full text search function, zhparser is used to segment Chinese word and RUM to be a better gin index.
Since PostgreSQL(as well as other DBMS) cannot use index with function, bookmarks entry and link entry both have a generated tsvector column called tsv. Both of them are rum indexed.
The tsv column in bookmarks table is generated from the title, description(user defined), and cached tag name(current tagging joined by ,
updated using trigger)
The one in link table is generated from title, description, url and a cleaned archive html content. Because PostgreSQL has some limit for tsvector and fulltext search, I use a function to cleanup html before generate tsvector and index.
-- Remove
-- data uri(base64),
-- src/srcset attr style tag and its content,
-- div|span|path|defs|svg|li open and close tag(only tag)
-- and inline style with REGEXP
CREATE OR REPLACE FUNCTION clean_html_for_search(text) RETURNS TEXT AS
$$
DECLARE
clean_html TEXT;
BEGIN
SELECT REGEXP_REPLACE(
$1,
'(data:[^,]+,)[^"]+|(<style[^>]*>[^<]+</style>)|style="[^"]+"|src(?:set)?="[^"]+"|class="[^"]+"|</?(?:div|span|path|defs|svg|li|tr|th|td|math|mtd|mtr|mtable|mn|mfenced|)[^>]*>',
'',
'g'
)
INTO clean_html;
RETURN clean_html;
END;
$$ IMMUTABLE LANGUAGE plpgsql;
Tag entry contains a alias column as text[]
, every word in alias will use the some id in tagging relationship.
A generated tsq
column is a generated websearch tsquery with all tag and column joined by or
.
-- Add tsq of tags
ALTER TABLE tags
ADD COLUMN tsq tsquery
GENERATED ALWAYS AS (
websearch_to_tsquery('chinese_zh', text_array_to_text(array_prepend(tag, alias), ' or '))
) STORED;
The tagging suggestion is a full text search with this tsquery.
select tag,
tag_id
from (select distinct tags.tag,
tags.id as tag_id,
tags.tsq as tsq,
bookmarks.tsv as bookmarks_tsv,
links.tsv as links_tsv,
ts_rank_cd(links.tsv, tsq) as link_score,
ts_rank_cd(bookmarks.tsv, tsq) as bookmark_score
from tags,
bookmarks
join links ON links.id = bookmarks.link_id
where bookmarks.id = ${bookmarkId}
and (
bookmarks.tsv @@ tags.tsq
or
links.tsv @@ tags.tsq
)
and (ts_rank_cd(links.tsv, tsq) + ts_rank_cd(bookmarks.tsv, tsq) > 1)
) as t1
order by ts_rank_cd(bookmarks_tsv, tsq) desc,
ts_rank_cd(links_tsv, tsq) desc
limit 5;
Archiving and Auto tagging is currently done in API route. I want to move it to Supabase workflow once it is public and offer a self host docker image.