## Sentence Embedding Exploration and PgVectors (2022-12-12)

- Initial exploration of sentence embeddings
- No harder than word embeddings
- Off the shelf works somewhat better than word embeddings anecdotally
- Used model `all-MiniLM-L6-v2` - did not explore other models. Reason: time. Will consider other models in the future.
- No problem setting up pgvector
- Took more than an hour to load 40k embedding vectors into Postgres (killed the upload after 89 minutes). Caveat: using
  batch updates / inserts or other loading tricks could help.

In [14]:
import sqlalchemy as sa
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://jeffreyarnold@localhost:5432/osm", future=True)

from sqlalchemy import text
with engine.connect() as conn:
    features = conn.execute(text("SELECT osm_id, tags FROM  osm TABLESAMPLE SYSTEM (1) LIMIT 10")).fetchall()

from typing import Any, Dict, List
import itertools
import re

import spacy

class Documentizer:

    def __init__(self) -> None:
        pass
    
    def _key_cleaner(self, key: str) -> str:
        key = re.sub("[_:]", " ", key)
        return key
    
    def _value_cleaner(self, value: str) -> str:
        # Drop bare boolean values. A substring replace would also mangle
        # words that merely contain "no" or "yes" (e.g. "north" -> "rth").
        if value.strip() in ("yes", "no"):
            return ""
        # _ splits "coffee_shop" into "coffee shop"
        # ; is needed to split lists of values like "american;burger;fast_food" into "american burger fast food"
        value = re.sub("[_;]", " ", value)
        return value.strip()
    
    def _tag_filter(self, key: str, value: str) -> List[str]:
        if value == "no":
            return []
        return [self._key_cleaner(key), self._value_cleaner(value)]
    
    def __call__(self, tags: Dict[str, Any]) -> str:
        return " ".join(itertools.chain.from_iterable(self._tag_filter(k, v) for k, v in tags.items()))
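
To make the cleaning rules above concrete, here is a condensed, standalone sketch of the same idea. `tags_to_text` and the example tags are hypothetical (not from the notebook), and unlike the class it skips the value of a `yes` tag outright instead of emitting an empty string.

```python
import re
from typing import Dict


def tags_to_text(tags: Dict[str, str]) -> str:
    """Flatten OSM tags into one 'key value key value ...' string."""
    parts = []
    for key, value in tags.items():
        if value == "no":
            continue  # negated tags are dropped entirely
        parts.append(re.sub("[_:]", " ", key))
        if value != "yes":
            # "_" and ";" become spaces: "coffee_shop" -> "coffee shop"
            parts.append(re.sub("[_;]", " ", value).strip())
    return " ".join(parts)


example = {"name": "Cafe 88", "amenity": "cafe", "cuisine": "coffee_shop", "smoking": "no"}
print(tags_to_text(example))  # name Cafe 88 amenity cafe cuisine coffee shop
```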


In [15]:
import sentence_transformers
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-MiniLM-L6-v2')


In [16]:
import jinja2 as j2
ENV = j2.Environment(undefined=j2.StrictUndefined)

from spacy.tokens import Doc

if not Doc.has_extension("osm_id"):
    Doc.set_extension("osm_id", default=None)
    
# Oakland bbox
bbox = (-122.350244,37.698079,-122.112321,37.850861)

query ="""
SELECT osm_id, tags
FROM osm
WHERE geom && ST_MakeEnvelope({{bbox | join(",")}}, 4326)
    AND osm_type = 'N'
LIMIT 500000
"""

query = ENV.from_string(query).render(bbox=bbox)

docs = []

import itertools

documentizer = Documentizer()

with engine.connect() as conn:
    features = conn.execute(text(query))
    documents = [{"osm_id": row.osm_id, "tags": row.tags, "text": documentizer(row.tags)} for row in features]


In [None]:
corpus = [doc["text"] for doc in documents]

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

In [17]:
# Query sentences:
queries = ["cafe", "coffee shop", "wind mill", "wind mills", "ethiopian restraunt", "train", "arport"]

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

    """
    # Alternatively, we can also use util.semantic_search to perform cosine similarity + topk
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """





Query: cafe

Top 5 most similar sentences in corpus:
name Cafe 88 amenity cafe (Score: 0.7526)
name Caffe 817 amenity cafe (Score: 0.7185)
amenity cafe (Score: 0.7131)
amenity cafe (Score: 0.7131)
name Comeback Cafe amenity cafe (Score: 0.7073)




Query: coffee shop

Top 5 most similar sentences in corpus:
name Beanery Coffee Roasters shop coffee amenity cafe (Score: 0.6642)
amenity cafe cuisine coffee shop (Score: 0.6619)
name Starbucks amenity cafe cuisine coffee shop (Score: 0.6433)
name Coffee Cultures amenity cafe cuisine coffee shop (Score: 0.6372)
name Peerless Coffee and Tea Espresso and Shop amenity cafe (Score: 0.6253)




Query: wind mill

Top 5 most similar sentences in corpus:
name Wind River Systems office company (Score: 0.5386)
leisure garden (Score: 0.4158)
name Wing Lee Cleaners shop dry cleaning (Score: 0.4135)
name Kitesurf Launch sport windsurfing;kitesurfing;wingsurfing leisure slipway website https://www.sfba.org/alameda.html (Score: 0.4112)
name Wah Hang Ma
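
For reference, the ranking used above (cosine similarity followed by top-k) can be sketched without torch. This is a pure-Python illustration on toy 2-d vectors, not what `util.cos_sim` does internally; `cosine_similarity`, `top_k`, and `toy_corpus` are names made up for this sketch.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], corpus: Sequence[Sequence[float]], k: int) -> List[Tuple[float, int]]:
    """Return the k highest (score, corpus_index) pairs, best first."""
    scored = ((cosine_similarity(query, vec), idx) for idx, vec in enumerate(corpus))
    return heapq.nlargest(k, scored)


toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for score, idx in top_k([1.0, 0.0], toy_corpus, k=2):
    print(idx, round(score, 4))
```

The vector pointing the same way as the query ranks first, the diagonal one second, and the orthogonal one is cut off by `k=2`.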

# pgvector extension

Proof of concept to load vectors into Postgres using [pgvector](https://github.com/pgvector/pgvector).

In [37]:
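# engine.begin() commits on exit; with future=True a plain connect() block rolls the DDL back.
with engine.begin() as conn:
    conn.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
    # all-MiniLM-L6-v2 produces 384-dimensional embeddings
    conn.execute(text("ALTER TABLE osm ADD COLUMN IF NOT EXISTS embedding vector(384)"))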
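
Once the `embedding` column is populated, the search could be pushed into Postgres. A minimal sketch, not run in this notebook: it assumes pgvector's `<=>` cosine distance operator and its `[x,y,...]` text input format. `to_vector_literal`, `NEAREST_SQL`, and `nearest_params` are hypothetical names.

```python
from typing import Dict, Sequence


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding as pgvector text input, e.g. '[0.1,-0.2,0.3]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


# <=> is pgvector's cosine distance, so smaller means more similar.
# CAST(...) is used instead of :: to stay clear of the :name bind syntax.
NEAREST_SQL = """
SELECT osm_id, tags, embedding <=> CAST(:query_vec AS vector) AS cosine_distance
FROM osm
WHERE embedding IS NOT NULL
ORDER BY cosine_distance
LIMIT :k
"""


def nearest_params(query_embedding: Sequence[float], k: int = 5) -> Dict[str, object]:
    """Bind parameters for NEAREST_SQL."""
    return {"query_vec": to_vector_literal(query_embedding), "k": k}


print(nearest_params([0.1, -0.2, 0.3], k=5))
# Intended use (needs the database):
#   conn.execute(text(NEAREST_SQL), nearest_params(embedder.encode("cafe")))
```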

Killed the upload below after 89 minutes.


In [40]:
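# engine.begin() commits once the loop finishes. This still issues one UPDATE
# round trip per row, which is likely part of why the load was so slow.
with engine.begin() as conn:
    for i, doc in enumerate(documents):
        # .cpu() is a no-op on CPU tensors; tolist() yields plain Python floats
        embed = str(corpus_embeddings[i].cpu().tolist())
        conn.execute(text("""
        UPDATE osm
        SET embedding = :embed
        WHERE osm_id = :osm_id
        """), {"embed": embed, "osm_id": doc["osm_id"]})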


KeyboardInterrupt: 
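
One untested idea for speeding this up is to send the updates in chunks and commit per chunk. The sketch below relies on SQLAlchemy running a statement in executemany style when given a list of parameter dicts. Whether it helps will depend on the driver and on whether `osm_id` is indexed. `to_params`, `batched`, `load_embeddings`, and the batch size of 1000 are assumptions for illustration, not measured choices.

```python
from typing import Dict, Iterator, List, Sequence

UPDATE_SQL = """
UPDATE osm
SET embedding = CAST(:embed AS vector)
WHERE osm_id = :osm_id
"""


def to_params(osm_ids: Sequence[int], embeddings: Sequence[Sequence[float]]) -> List[Dict[str, object]]:
    """Pair each id with its embedding formatted as pgvector text input."""
    return [
        {"osm_id": osm_id, "embed": "[" + ",".join(str(float(x)) for x in vec) + "]"}
        for osm_id, vec in zip(osm_ids, embeddings)
    ]


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_embeddings(engine, params: List[Dict[str, object]], batch_size: int = 1000) -> None:
    """Run the UPDATE once per chunk, committing after each chunk."""
    # Imported here so the helpers above work without SQLAlchemy installed.
    from sqlalchemy import text

    statement = text(UPDATE_SQL)
    for chunk in batched(params, batch_size):
        with engine.begin() as conn:  # commits when the block exits
            conn.execute(statement, chunk)  # list of dicts -> executemany


print(list(batched(list(range(5)), 2)))  # [[0, 1], [2, 3], [4]]
```

Committing per chunk also means an interrupted load keeps the chunks already written instead of rolling everything back.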

In [31]:
str(list(embed.numpy()))

'[-0.023183504, -0.079287805, -0.017884871, 0.055417657, 0.0008431207, 0.0319564, -0.038964204, 0.12262586, -0.12085, -0.04256106, -0.0040748753, -0.026869537, 0.001456012, 0.026645683, -0.003220808, 0.09341368, -0.03641462, 0.05689699, -0.04943897, -0.008999851, 0.013227197, -0.114441134, -0.027501876, -0.00086226093, -0.048031323, -0.024000254, 0.026468333, 0.0044135866, -0.020003095, -0.05423768, -0.06735641, -0.017488835, -0.071719594, 0.024877336, 0.005633726, -0.02154252, 0.075405374, -0.027228838, 0.054416828, 0.018888496, -0.06510048, -0.036264226, -0.042696398, 0.0689855, 0.09092336, 0.048798345, 0.0129739195, -0.02074254, -0.060705245, -0.024103433, 0.036865745, -0.053874683, 0.007822419, 0.06297742, -0.05410355, 0.01664775, 0.0155653125, 0.015502567, 0.060299642, 0.108421445, -0.102812275, -0.018160364, -0.036404625, -0.060512584, 0.03309949, -0.070785254, -0.097935915, 0.07578102, 0.053189542, 0.02991527, 0.028597124, -0.034161814, -0.012848741, -0.051193368, 0.019141361, -