Skip to content

Development Note

fengkx edited this page May 3, 2021 · 2 revisions

Database design

ER diagram Drawed by drawerd.com
Databasse design is inspired by hackershare

FullText search index

FullText search is done with PostgreSQL's full text search function, zhparser is used to segment Chinese word and RUM to be a better gin index.

Since PostgreSQL(as well as other DBMS) cannot use index with function, bookmarks entry and link entry both have a generated tsvector column called tsv. Both of them are rum indexed.

The tsv column in bookmarks table is generated from the title, description(user defined), and cached tag name(current tagging joined by , updated using trigger)

The one in link table is generated from title, description, url and a cleaned archive html content. Because PostgreSQL has some limit for tsvector and fulltext search, I use a function to cleanup html before generate tsvector and index.

-- Remove
-- data uri(base64),
-- src/srcset attr style tag and its content,
-- div|span|path|defs|svg|li open and close tag(only tag)
-- and inline style with REGEXP

CREATE OR REPLACE FUNCTION clean_html_for_search(text) RETURNS TEXT AS
$$
DECLARE
    clean_html TEXT;
BEGIN
    SELECT REGEXP_REPLACE(
                   $1,
                   '(data:[^,]+,)[^"]+|(<style[^>]*>[^<]+</style>)|style="[^"]+"|src(?:set)?="[^"]+"|class="[^"]+"|</?(?:div|span|path|defs|svg|li|tr|th|td|math|mtd|mtr|mtable|mn|mfenced|)[^>]*>',
                   '',
                   'g'
               )
    INTO clean_html;
    RETURN clean_html;
END;
$$ IMMUTABLE LANGUAGE plpgsql;

Tag alias and Auto tagging

Tag entry contains a alias column as text[], every word in alias will use the some id in tagging relationship. A generated tsq column is a generated websearch tsquery with all tag and column joined by or.

-- Add tsq of tags
ALTER TABLE tags
    ADD COLUMN tsq tsquery
        GENERATED ALWAYS AS (
            websearch_to_tsquery('chinese_zh', text_array_to_text(array_prepend(tag, alias), ' or '))
            ) STORED;

The tagging suggestion is a full text search with this tsquery.

select tag,
       tag_id
from (select distinct tags.tag,
                      tags.id                        as tag_id,
                      tags.tsq                       as tsq,
                      bookmarks.tsv                  as bookmarks_tsv,
                      links.tsv                      as links_tsv,
                      ts_rank_cd(links.tsv, tsq)     as link_score,
                      ts_rank_cd(bookmarks.tsv, tsq) as bookmark_score

      from tags,
           bookmarks
               join links ON links.id = bookmarks.link_id
      where bookmarks.id = ${bookmarkId}
        and (
              bookmarks.tsv @@ tags.tsq
              or
              links.tsv @@ tags.tsq
          )
        and (ts_rank_cd(links.tsv, tsq) + ts_rank_cd(bookmarks.tsv, tsq) > 1)
     ) as t1
order by ts_rank_cd(bookmarks_tsv, tsq) desc,
         ts_rank_cd(links_tsv, tsq) desc
limit 5;

Archiving and Auto tagging is currently done in API route. I want to move it to Supabase workflow once it is public and offer a self host docker image.

Logo and favicon

Logo font: Work Sans
favicon font: Sansita

Clone this wiki locally