# Constructing SQLite Tables for Notebooks and Search

SQLite full-text search setup via APSW for all the notebooks on this website, inspired by the [APSW FTS5 Tour](https://rogerbinns.github.io/apsw/example-fts.html).

## Setup

In [40]:
from typing import Optional, Iterator, Any

from pprint import pprint
import re
import functools

import apsw
import apsw.ext
import apsw.fts5
import apsw.fts5aux
import apsw.fts5query

from execnb.nbio import read_nb
from fastcore.all import *
from pathlib import Path, PosixPath

In [41]:
print("FTS5 available:", "ENABLE_FTS5" in apsw.compile_options)

FTS5 available: True


## Create a Notebooks Database

Until now I haven't had a SQLite database for anything on this site. Let's create one.

In [42]:
connection = apsw.Connection("notebooks.db")
connection

<apsw.Connection at 0x10b282110>

## Create and Populate Table `notebooks`

In [43]:
nbs = L(Path(".").glob("*.ipynb")).sorted(reverse=True)
nbs

(#38) [Path('2025-01-16-Cosine-Similarity-Breakdown-in-LaTeX.ipynb'),Path('2025-01-14-Constructing-SQLite-Tables-for-Notebooks-and-Search.ipynb'),Path('2025-01-13-SQLite-FTS5-Tokenizers-unicode61-and-ascii.ipynb'),Path('2025-01-12-A-Better-Notebook-Index-Page.ipynb'),Path('2025-01-11-NBClassic-Keyboard-Shortcuts-in-Command-and-Dual-Mode.ipynb'),Path('2025-01-10-Understanding-FastHTML-Routes-Requests-and-Redirects.ipynb'),Path('2025-01-09-Reading-and-Writing-Jupyter-Notebooks-With-Python.ipynb'),Path('2025-01-08-HTML-Title-Tag-in-FastHTML.ipynb'),Path('2025-01-07-Verifying-Bluesky-Domain-in-FastHTML.ipynb'),Path('2025-01-06-Understanding-FastHTML-Headers.ipynb'),Path('2025-01-05-SSH-Agent-to-Save-Passphrase-Typing.ipynb'),Path('2025-01-04-Claude-Artifacts-in-Notebooks.ipynb'),Path('2025-01-03-Using-zip.ipynb'),Path('2025-01-02-FastHTML-Piano-Part-3.ipynb'),Path('2025-01-02-FastHTML-Piano-Part-2.ipynb'),Path('2025-01-01-FastHTML-Piano-Part-1.ipynb'),Path('2025-01-01-Command-Substitution-

All notebooks, sorted from newest to oldest.

In [44]:
nb = read_nb(nbs[0])
nb.cells[0]

```json
{ 'cell_type': 'markdown',
  'id': 'e865206a',
  'idx_': 0,
  'metadata': {},
  'source': '# Cosine Similarity Breakdown in LaTeX'}
```

A Markdown cell looks like this.

In [45]:
connection.execute("""CREATE TABLE IF NOT EXISTS notebooks (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    markdown_content TEXT)""")

<apsw.Cursor at 0x10a42e560>

We create a table to hold each notebook's path and Markdown content. (For now we skip code cells to keep things simple.)

In [46]:
def is_md_cell(c): return c.cell_type == 'markdown'
md_cells = L(nb.cells).filter(is_md_cell)
md_cells

(#18) [{'cell_type': 'markdown', 'id': 'e865206a', 'metadata': {}, 'source': '# Cosine Similarity Breakdown in LaTeX', 'idx_': 0},{'cell_type': 'markdown', 'id': '5dc1aea0', 'metadata': {}, 'source': 'A mathematical breakdown of cosine similarity, with copy-pastable LaTeX.', 'idx_': 1},{'cell_type': 'markdown', 'id': '1bbf957e', 'metadata': {}, 'source': '## Overview', 'idx_': 2},{'cell_type': 'markdown', 'id': 'faad9583', 'metadata': {}, 'source': "After hearing so much about cosine similarity, I thought it was something extremely difficult to understand. It turns out it's just the similarity of 2 word vectors, calculated as the cosine of the angle between them.\n\nThe result ranges from -1 to 1, where:\n\n* 1 means the vectors point in the same direction (most similar)\n* 0 means they're perpendicular (totally unrelated)\n* -1 means they point in opposite directions (complete opposites)\n\nSimilarity is just about the angle. The vectors are typically normalized to make comparisons ma

A list of only Markdown cells from a notebook looks like this.

In [47]:
def cell_source(c): return c.source
md = md_cells.map(cell_source)
md

(#18) ['# Cosine Similarity Breakdown in LaTeX','A mathematical breakdown of cosine similarity, with copy-pastable LaTeX.','## Overview',"After hearing so much about cosine similarity, I thought it was something extremely difficult to understand. It turns out it's just the similarity of 2 word vectors, calculated as the cosine of the angle between them.\n\nThe result ranges from -1 to 1, where:\n\n* 1 means the vectors point in the same direction (most similar)\n* 0 means they're perpendicular (totally unrelated)\n* -1 means they point in opposite directions (complete opposites)\n\nSimilarity is just about the angle. The vectors are typically normalized to make comparisons make more sense intuitively, and to avoid having to deal with magnitudes as well.\n\nNote: I'm reading that the range used with word vectors is usually [0,1] because word vectors typically are non-negative.",'## The Cosine Similarity Formula','The cosine similarity formula in LaTeX:','## Dot Product',"If you're wonde

Map the cells to just their Markdown source. For now we don't care about the rest.

In [48]:
def extract_markdown_content(nbpath):
    """Extract all markdown cell content from a notebook"""
    nb = read_nb(nbpath)
    md_cells = L(nb.cells).filter(is_md_cell)
    return "\n".join(md_cells.map(cell_source))
extract_markdown_content(nbs[0])

"# Cosine Similarity Breakdown in LaTeX\nA mathematical breakdown of cosine similarity, with copy-pastable LaTeX.\n## Overview\nAfter hearing so much about cosine similarity, I thought it was something extremely difficult to understand. It turns out it's just the similarity of 2 word vectors, calculated as the cosine of the angle between them.\n\nThe result ranges from -1 to 1, where:\n\n* 1 means the vectors point in the same direction (most similar)\n* 0 means they're perpendicular (totally unrelated)\n* -1 means they point in opposite directions (complete opposites)\n\nSimilarity is just about the angle. The vectors are typically normalized to make comparisons make more sense intuitively, and to avoid having to deal with magnitudes as well.\n\nNote: I'm reading that the range used with word vectors is usually [0,1] because word vectors typically are non-negative.\n## The Cosine Similarity Formula\nThe cosine similarity formula in LaTeX:\n## Dot Product\nIf you're wondering why it lo

Join the content of all of a notebook's Markdown cells into one string, with a single newline between consecutive cells.
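To see the filter, map, and join steps in isolation, here is a minimal sketch that uses plain dictionaries in place of the cell objects returned by `read_nb`. The `join_markdown` helper and the sample cells are made up for illustration and are not part of the notebook's code.

```python
# Hypothetical stand-ins for notebook cells; real cells come from execnb's read_nb.
sample_cells = [
    {"cell_type": "markdown", "source": "# Title"},
    {"cell_type": "code", "source": "print('hi')"},
    {"cell_type": "markdown", "source": "Some prose."},
]

def join_markdown(cells):
    """Keep only Markdown cells and join their sources with single newlines."""
    md_sources = [c["source"] for c in cells if c["cell_type"] == "markdown"]
    return "\n".join(md_sources)

joined = join_markdown(sample_cells)
print(repr(joined))  # '# Title\nSome prose.'
```

The code cell is dropped, and the two Markdown sources end up separated by exactly one newline.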

In [49]:
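def populate_notebooks_table():
    """Replace the contents of `notebooks` with the current set of notebooks"""
    # Clear existing rows first so re-running this cell doesn't create duplicates
    connection.execute("DELETE FROM notebooks")

    for nb_path in nbs:
        markdown_text = extract_markdown_content(nb_path)
        # Bind values with ? placeholders rather than formatting them into the SQL
        connection.execute(
            "INSERT INTO notebooks (path, markdown_content) VALUES (?, ?)",
            (str(nb_path), markdown_text)
        )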
populate_notebooks_table()

In [50]:
connection.execute("SELECT count(*) FROM notebooks").get

38

Now we have notebook paths and their contents in a SQLite table.
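One possible variation is to run the delete and the inserts inside a single transaction, so that a failure partway through doesn't leave the table empty or half-filled. The sketch below is an assumption-laden illustration rather than what this notebook does: it uses the standard library `sqlite3` module instead of APSW so it runs without extra dependencies, and the `repopulate` helper and the sample rows are hypothetical. APSW connections can also be used in a `with` block to get similar transaction behavior.

```python
import sqlite3

def repopulate(conn, rows):
    """Replace all rows of `notebooks` in one transaction."""
    # `with conn:` commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute("DELETE FROM notebooks")
        conn.executemany(
            "INSERT INTO notebooks (path, markdown_content) VALUES (?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS notebooks (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    markdown_content TEXT)""")

repopulate(conn, [("a.ipynb", "# A"), ("b.ipynb", "# B")])
repopulate(conn, [("c.ipynb", "# C")])  # re-running replaces rows, it doesn't append
count = conn.execute("SELECT count(*) FROM notebooks").fetchone()[0]
print(count)  # 1
```

If any insert fails (for example, a `NULL` path violating `NOT NULL`), the rollback restores the rows that were there before the call.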

## Create Search Table

In [59]:
if not connection.table_exists("main", "search"):
    search_table = apsw.fts5.Table.create(
        connection,
        "search",
        content="notebooks",
        columns=None,
        generate_triggers=True,
        tokenize=["unicode61"])
else:
    search_table = apsw.fts5.Table(connection, "search")

Here we check if a table named `search` already exists in the `main` database of `connection`. If it doesn't exist, we create it.

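`search` is a virtual FTS5 table that points at the `notebooks` table (an external content table). It doesn't store its own copy of the notebooks' text. Instead, it holds the full-text index for that table and provides a table-like interface for searching it. Passing `generate_triggers=True` asks APSW to create triggers on `notebooks` that keep the index in sync when rows change.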
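To make the external content idea more concrete, here is a rough sketch of a similar setup written as raw SQL. It makes several assumptions: it uses the standard library `sqlite3` module instead of APSW, it indexes only `path` and `markdown_content` (the table above indexes `id` as well), it includes only an insert trigger, and the trigger name, helper function, and sample rows are hypothetical. It is not the SQL that APSW generates.

```python
import sqlite3

def demo_external_content_search(query):
    """Build a tiny external-content FTS5 index and return matching paths."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE notebooks (
        id INTEGER PRIMARY KEY,
        path TEXT NOT NULL,
        markdown_content TEXT)""")
    try:
        # The FTS5 table keeps only the index; row text is read from `notebooks`.
        conn.execute("""CREATE VIRTUAL TABLE search USING fts5(
            path, markdown_content,
            content='notebooks', content_rowid='id',
            tokenize='unicode61')""")
    except sqlite3.OperationalError:
        return None  # this SQLite build was compiled without FTS5

    # Hypothetical trigger: index each new notebooks row as it is inserted.
    conn.execute("""CREATE TRIGGER notebooks_ai AFTER INSERT ON notebooks BEGIN
        INSERT INTO search(rowid, path, markdown_content)
        VALUES (new.id, new.path, new.markdown_content);
    END""")

    conn.executemany(
        "INSERT INTO notebooks (path, markdown_content) VALUES (?, ?)",
        [("cosine.ipynb", "# Cosine Similarity Breakdown"),
         ("zip.ipynb", "# Using zip in Python")],
    )
    rows = conn.execute("SELECT path FROM search WHERE search MATCH ?", (query,))
    return [row[0] for row in rows]

matches = demo_external_content_search("cosine")
print(matches)  # ['cosine.ipynb'] when FTS5 is available, otherwise None
```

A complete setup would also need delete and update triggers, which is the bookkeeping that `generate_triggers=True` is meant to handle.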

## Check the Tables

In [60]:
print("quoted name", search_table.quoted_table_name)

quoted name "main"."search"


This verifies that the database schema `main` and table name `search` have been set up.

The `notebooks` table's schema:

In [61]:
print(connection.execute(
        "SELECT sql FROM sqlite_schema WHERE name='notebooks'"
    ).get)

CREATE TABLE notebooks (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    markdown_content TEXT)


In [62]:
pprint(search_table.structure)

FTS5TableStructure(name='search',
                   columns=('id', 'path', 'markdown_content'),
                   unindexed=set(),
                   tokenize=('unicode61',),
                   prefix=set(),
                   content='notebooks',
                   content_rowid='_ROWID_',
                   contentless_delete=None,
                   contentless_unindexed=None,
                   columnsize=True,
                   tokendata=False,
                   locale=False,
                   detail='full')


In [63]:
print(f"{search_table.config_rank()=}")

search_table.config_rank()='bm25()'


In [66]:
print(f"{search_table.row_count=}")

search_table.row_count=38


In [65]:
print(f"{search_table.tokens_per_column=}")

search_table.tokens_per_column=[38, 302, 10334]


## Optional Cleanup

At one point I had to run this to drop the search table, so I could recreate it with a different tokenizer:

In [20]:
connection.execute("DROP TABLE IF EXISTS search")

<apsw.Cursor at 0x1093dfa00>

## What Next?

To be continued...