# NBC News Headlines: Exploring Hybrid FTS5 + Vector Search

This notebook explores a few different ways to combine full-text and vector search results when querying
[FTS5](https://www.sqlite.org/fts5.html) and
[`sqlite-vec`](https://github.com/asg017/sqlite-vec) virtual tables.

This dataset is a small list of headlines scraped from NBC News, found in the [`./1_scrape.ipynb`](./1_scrape.ipynb) notebook.
To see how the `fts_articles` and `vec_articles` tables were created, see the [`./3_search.ipynb`](./3_search.ipynb) notebook.

In [1]:
.open tmp-artic2.db

.load ../../dist/vec0
.load ./lembed0

insert into lembed_models(name, model)
  values (
    'default',
    lembed_model_from_file('snowflake-arctic-embed-m-v1.5.d70deb40.f16.gguf')
  );

select vec_version(), lembed_version();

vec_version(),lembed_version()
v0.1.3-alpha.2,v0.0.1-alpha.8


## Full-text Search Only

A simple FTS query on the `fts_articles` virtual table can be made like so:

In [19]:
.param set query planned parenthood

select
  rowid,
  headline,
  rank
from fts_articles
where headline match :query
order by rank
limit 10;

rowid,headline,rank
4666,Kamala Harris visits Planned Parenthood clinic,-18.9139950477264
6521,Former Marine sentenced to 9 years in prison for firebombing Planned Parenthood clinic,-14.807022703838651


The `rank` column is the BM25 score of the document for the query, multiplied by -1, so better matches have more negative values and `order by rank` lists them first.

## Vector Search Only

A KNN vector search can be made on the `vec_articles` virtual table like so:

In [6]:
.param set query planned parenthood

select
  article_id,
  articles.headline,
  distance
from vec_articles
left join articles on articles.rowid = vec_articles.article_id
where headline_embedding match lembed(:query)
  and k = 10;

article_id,headline,distance
4666,Kamala Harris visits Planned Parenthood clinic,0.492593914270401
13928,"After Dobbs decision, more women are managing their own abortions",0.5789032578468323
12636,Transforming Healthcare,0.5822411179542542
6979,"A timeline of Trump's many, many positions on abortion",0.6101462841033936
7038,How a network of abortion pill providers works together in the wake of new threats,0.6196886897087097
6914,'Major hurdles': The reality check behind Biden's big abortion promise,0.6198344826698303
6794,Trump's conflicting abortion stances are coming back to haunt him — and his party,0.6198986768722534
7381,Where abortion rights could be on the ballot this fall: From the Politics Desk,0.6201764345169067
6871,How the Biden campaign quickly mobilized on Trump's abortion stance,0.633980393409729
5496,Battle over abortion heats up in Arizona — and could be on the 2024 ballot,0.6341449022293091


The `distance` column is the L2 distance between the query vector and the headline embedding. 
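
For reference, L2 (Euclidean) distance is the square root of the summed squared differences between two vectors, so a smaller value means a closer match. A minimal Python sketch with made-up 2-dimensional vectors (real embeddings from the model have many more dimensions, and `l2_distance` is an illustrative helper, not a `sqlite-vec` function):

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors, invented for illustration only.
query = [1.0, 0.0]
near = [0.8, 0.6]
far = [0.0, 1.0]

# The nearer vector has the smaller distance, so it sorts first in a KNN query.
print(l2_distance(query, near) < l2_distance(query, far))  # True
```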

The rest of this notebook explores different ways of combining these FTS5 and vector search results.
The core queries are similar and differ mainly in their `JOIN` or `ORDER BY` techniques.

## Combination Technique #1: Keyword-first

In many search-engine cases, you may want to display keyword matches first, and supplement the rest with vector search results.
This makes some intuitive sense: keyword matches are what users expect, but you'll want to display more results if there are only a few matching documents.


In [11]:
.param set query abortion bans
.param set k 10


with fts_matches as (
  select
    rowid as article_id,
    row_number() over (order by rank) as rank_number,
    rank as score
  from fts_articles
  where headline match :query
  -- sort explicitly so that LIMIT keeps the best-ranked matches
  order by rank
  limit :k
),
vec_matches as (
  select
    article_id,
    row_number() over (order by distance) as rank_number,
    distance as score
  from vec_articles
  where
    headline_embedding match lembed(:query)
    and k = :k
  order by distance
),
combined as (
  select 'fts' as match_type, * from fts_matches
  union all
  select 'vec' as match_type, * from vec_matches
),
final as (
  select
    articles.id,
    articles.headline,
    combined.*
  from combined
  left join articles on articles.rowid = combined.article_id
)
select * from final;



id,headline,match_type,article_id,rank_number,score
10098,Kamala Harris says abortion bans are creating 'a health care crisis',fts,10098,1,-10.678829270936069
9776,"States with abortion bans saw birth control prescriptions fall post-Dobbs, study finds",fts,9776,2,-10.016316725971112
2292,Ohio GOP Senate candidates pitch federal abortion bans even after voters protected reproductive rights,fts,2292,3,-9.7149595994016
452,"64K women and girls became pregnant due to rape in states with abortion bans, study estimates",fts,452,4,-9.163558569425538
9187,"Abortion bans drive away up to half of young talent, CNBC/Generation Lab youth survey finds",fts,9187,5,-9.163558569425538
6989,"Trump says abortion restrictions should be left to states, dodging a national ban",vec,6989,1,0.4930749833583832
13928,"After Dobbs decision, more women are managing their own abortions",vec,13928,2,0.5120846629142761
11822,Iowa now bans most abortions after about 6 weeks,vec,11822,3,0.512569785118103
7381,Where abortion rights could be on the ballot this fall: From the Politics Desk,vec,7381,4,0.5168291926383972
14009,Trump signals openness to banning abortion pill,vec,14009,5,0.5288293957710266


We do this with a verbose CTE: one step for the FTS5 query, another for the vector search, one to "combine" the results with a `UNION ALL`, and one last one to `LEFT JOIN` back to the base `articles` table to get the headline.

Here we have 5 FTS results, followed by additional vector results (up to `k = 10`). This seems pretty natural: vector search acts as a fallback when keyword matches are sparse.

One note: this example doesn't do any de-duplication, so the same article can appear twice, once as an FTS match and once as a vector match. You may want to add a `DISTINCT` or `GROUP BY` somewhere to handle that.
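
One way to handle that de-duplication is to keep a vector row only when no FTS row exists for the same article. The sketch below runs that idea through Python's built-in `sqlite3` module on a stand-in `combined` table; the article IDs and scores in it are invented for illustration and are not from the dataset.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "create table combined(match_type text, article_id integer, rank_number integer, score real)"
)
db.executemany(
    "insert into combined values (?, ?, ?, ?)",
    [
        ("fts", 101, 1, -10.5),
        ("fts", 102, 2, -9.0),
        ("vec", 102, 1, 0.49),  # duplicate of an FTS match
        ("vec", 103, 2, 0.51),
        ("vec", 101, 3, 0.55),  # duplicate of an FTS match
    ],
)

deduplicated = db.execute(
    """
    select match_type, article_id, rank_number
    from combined as c
    where c.match_type = 'fts'
       or not exists (
         select 1 from combined as f
         where f.match_type = 'fts' and f.article_id = c.article_id
       )
    -- 'fts' sorts before 'vec', so keyword matches stay first
    order by c.match_type, c.rank_number
    """
).fetchall()

print(deduplicated)  # [('fts', 101, 1), ('fts', 102, 2), ('vec', 103, 2)]
```

In the notebook's query, `combined` is already a CTE, so a similar `where` clause should be applicable in the `final` step; that placement is a suggestion rather than something tested against the real tables.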

## Combination Technique #2: Reciprocal Rank Fusion (RRF)

[Reciprocal Rank Fusion](https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking) 
is another combination technique, where matches that are both FTS matches and vector matches
are ranked higher than others. The CTE logic is a bit more involved, but can still be represented in a few steps:


In [14]:
.param set query abortion ban


.param set k 10
.param set rrf_k 60
.param set weight_fts 1.0
.param set weight_vec 1.0

with vec_matches as (
  select
    article_id,
    row_number() over (order by distance) as rank_number,
    distance
  from vec_articles
  where
    headline_embedding match lembed(:query)
    and k = :k
),
fts_matches as (
  select
    rowid,
    row_number() over (order by rank) as rank_number,
    rank as score
  from fts_articles
  where headline match :query
  -- sort explicitly so that LIMIT keeps the best-ranked matches
  order by rank
  limit :k
),
final as (
  select
    articles.id,
    articles.headline,
    vec_matches.rank_number as vec_rank,
    fts_matches.rank_number as fts_rank,
    coalesce(1.0 / (:rrf_k + fts_matches.rank_number), 0.0) * :weight_fts
    + coalesce(1.0 / (:rrf_k + vec_matches.rank_number), 0.0) * :weight_vec
      as combined_rank,
    vec_matches.distance as vec_distance,
    fts_matches.score as fts_score
  from fts_matches
  full outer join vec_matches on vec_matches.article_id = fts_matches.rowid
  join articles on articles.rowid = coalesce(fts_matches.rowid, vec_matches.article_id)
  order by combined_rank desc
)
select * from final;



id,headline,vec_rank,fts_rank,combined_rank,vec_distance,fts_score
4328,Trump signals support for a national 15-week abortion ban,2.0,3.0,0.0320020481310803,0.5334203839302063,-9.841645168493953
5769,Mitch McConnell shies away from supporting national abortion ban,8.0,2.0,0.0308349146110056,0.5501425266265869,-10.19017787567105
9507,Arizona Senate passes repeal of 1864 abortion ban,,1.0,0.0163934426229508,,-10.564302831642667
6989,"Trump says abortion restrictions should be left to states, dodging a national ban",1.0,,0.0163934426229508,0.5142395496368408,
10717,Supreme Court rejects bid to restrict access to abortion pill,3.0,,0.0158730158730158,0.5351248383522034,
5981,Arizona state House passes bill to repeal 1864 abortion ban,,4.0,0.015625,,-9.841645168493953
14009,Trump signals openness to banning abortion pill,4.0,,0.015625,0.5364335179328918,
6375,Arizona Republicans again quash effort to repeal 1864 abortion ban,,5.0,0.0153846153846153,,-9.841645168493953
7381,Where abortion rights could be on the ballot this fall: From the Politics Desk,5.0,,0.0153846153846153,0.5462378859519958,
9443,Arizona Gov. Katie Hobbs signs repeal of 1864 abortion ban,,6.0,0.0151515151515151,,-9.841645168493953


The first two CTE steps are the same as in the "keyword-first" approach: just normal FTS5 and vector KNN queries.

The combination CTE step is more involved, and is described in detail in [this "Hybrid Search" Supabase docs page](https://supabase.com/docs/guides/ai/hybrid-search). 
What's nice about this approach is that you can configure the "weights" of FTS or vector results with a normal SQL parameter. 

In this query, we can see the top result `"Trump signals support for a national 15-week abortion ban"` was neither the top FTS result nor the top vector result: it only ranked `3` and `2` respectively.
But since it appeared in both the FTS and vector results, it's ranked higher than others, same with `"Mitch McConnell shies away from supporting national abortion ban"`. The rest of the results are
FTS + vector results interwoven together, pretty nice!
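
The scores in the first rows of the output can be reproduced by hand. The sketch below is a plain-Python restatement of the `combined_rank` expression, using the same `rrf_k` and weight values as the parameters above; the function name `rrf_score` is made up for illustration, and `None` stands in for a SQL `NULL` rank.

```python
def rrf_score(fts_rank, vec_rank, rrf_k=60, weight_fts=1.0, weight_vec=1.0):
    """Weighted Reciprocal Rank Fusion score; a rank of None means 'no match'."""
    fts_part = 1.0 / (rrf_k + fts_rank) if fts_rank is not None else 0.0
    vec_part = 1.0 / (rrf_k + vec_rank) if vec_rank is not None else 0.0
    return fts_part * weight_fts + vec_part * weight_vec

# Ranks taken from the first three rows of the output above.
both = rrf_score(fts_rank=3, vec_rank=2)         # matched by FTS and by vector search
mcconnell = rrf_score(fts_rank=2, vec_rank=8)    # also matched by both
fts_only = rrf_score(fts_rank=1, vec_rank=None)  # best FTS match, absent from vector results

print(round(both, 6), round(mcconnell, 6), round(fts_only, 6))
# 0.032002 0.030835 0.016393
```

A document found by both searches collects two reciprocal terms, which is why it can outscore the rank-1 match of either search alone.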

## Combination Technique #3: Re-rank by semantics

Here we use the FTS5 results as the "source of truth", but we re-order them based on semantic similarity between the query and each headline.

In [18]:
.param set query abortion ban
.param set k 10


with fts_matches as (
  select
    rowid,
    row_number() over (order by rank) as fts_rank_number,
    rank as score
  from fts_articles
  where headline match :query
  -- sort explicitly so that LIMIT keeps the best-ranked matches
  order by rank
  limit :k
),
final as (
  select
    articles.id,
    articles.headline,
    fts_matches.*
  from fts_matches
  left join articles on articles.rowid = fts_matches.rowid
  order by vec_distance_cosine(lembed(:query), lembed(articles.headline))
)
select * from final;



id,headline,rowid,fts_rank_number,score
4328,Trump signals support for a national 15-week abortion ban,4328,3,-9.841645168493953
5769,Mitch McConnell shies away from supporting national abortion ban,5769,2,-10.19017787567105
2646,Trump campaign scrambles over abortion ban report as Democrats seize the moment,2646,10,-9.211525101866211
7150,Tennessee court weighs challenge to abortion ban’s narrow medical exception,7150,8,-9.51616557526609
1821,"Dominican women fight child marriage, teen pregancy amid total abortion ban",1821,7,-9.51616557526609
6375,Arizona Republicans again quash effort to repeal 1864 abortion ban,6375,5,-9.841645168493953
9507,Arizona Senate passes repeal of 1864 abortion ban,9507,1,-10.564302831642667
8690,Arizona Supreme Court pushes back enforcement date for 1864 abortion ban,8690,9,-9.51616557526609
5981,Arizona state House passes bill to repeal 1864 abortion ban,5981,4,-9.841645168493953
9443,Arizona Gov. Katie Hobbs signs repeal of 1864 abortion ban,9443,6,-9.841645168493953
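
The `ORDER BY` in the `final` step keeps the set of FTS matches fixed and only changes their order. The same idea as a Python sketch, assuming `vec_distance_cosine` returns cosine distance (1 minus cosine similarity); the embeddings below are toy 2-dimensional values and `rerank` is a hypothetical helper, not part of `sqlite-vec`:

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity (assumed to match vec_distance_cosine)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rerank(query_embedding, fts_candidates):
    """Re-order (article_id, embedding) pairs by distance to the query, closest first."""
    return sorted(fts_candidates, key=lambda c: cosine_distance(query_embedding, c[1]))

# Toy data, invented for illustration only.
query_embedding = [1.0, 0.0]
fts_candidates = [  # listed in FTS rank order
    (1, [0.0, 1.0]),
    (2, [1.0, 1.0]),
    (3, [1.0, 0.1]),
]

reranked_ids = [article_id for article_id, _ in rerank(query_embedding, fts_candidates)]
print(reranked_ids)  # [3, 2, 1]
```

No candidate is added or dropped by the re-ranking; the best keyword match can still end up last if its embedding is far from the query, as happens with FTS rank `1` in the output above.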
