# Word Vectors Lookup

## Overview

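An initial exploration of using word vectors and FAISS indexes to search OSM features. The tags of each feature are flattened into a synthetic document. For example,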

```
{'power': 'generator', 'generator:type': 'solar_photovoltaic_panel', 'generator:method': 'photovoltaic', 'generator:source': 'solar', 'generator:output:electricity': 'yes'}
```
becomes
```
power generator generator type solar photovoltaic panel generator method photovoltaic generator source solar generator output electricity 
```

Each token in the document, like "power" or "generator", is mapped to a word vector. The vector of the document is the mean of the individual word embedding vectors.
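As a rough illustration of the averaging step, the sketch below uses made-up 3-dimensional vectors (`toy_vectors` and `document_vector` are hypothetical names, not part of the notebook). The real vectors have 300 dimensions and come from the spaCy model.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only.
toy_vectors = {
    "power": np.array([1.0, 0.0, 0.0]),
    "generator": np.array([0.0, 1.0, 0.0]),
    "solar": np.array([0.0, 0.0, 1.0]),
}


def document_vector(document: str) -> np.ndarray:
    """Average the vectors of the whitespace-separated tokens."""
    tokens = document.split()
    return np.mean([toy_vectors[token] for token in tokens], axis=0)


vec = document_vector("power generator generator solar")
print(vec)
```

Because "generator" appears twice, it carries twice the weight of the other tokens in the mean. Repeated key words in the synthetic documents therefore pull the document vector toward themselves.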

spaCy was used for preprocessing, with FastText word embeddings.



## Takeaways

No problems using FAISS, but did not try to optimize speed of retrieval.

Word embeddings without training don't look particularly effective (opinion).

The semantic search could be improved in a few ways.

Better preprocessing of the documents:

- remove names
- remove addresses
- remove contact info
- remove "amenity" word
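A minimal sketch of what such preprocessing might look like. The key pattern and the helper names (`DROP_KEY_PATTERN`, `strip_noise`, `to_document`) are assumptions for illustration; which keys to drop has not been decided in this notebook.

```python
import re
from typing import Dict

# Hypothetical rule set: names, addresses, and contact details.
DROP_KEY_PATTERN = re.compile(
    r"name(:.*)?|addr:.*|contact:.*|phone|website|email"
)


def strip_noise(tags: Dict[str, str]) -> Dict[str, str]:
    """Drop name, address, and contact tags."""
    return {k: v for k, v in tags.items() if not DROP_KEY_PATTERN.fullmatch(k)}


def to_document(tags: Dict[str, str]) -> str:
    """Flatten the remaining tags, leaving out the word "amenity"."""
    words = []
    for key, value in strip_noise(tags).items():
        if key != "amenity":  # keep the value, skip the uninformative key
            words.append(re.sub("[_:]", " ", key))
        words.append(value.replace("_", " "))
    return " ".join(words)


example_tags = {
    "name": "Cafe Colucci",
    "amenity": "restaurant",
    "cuisine": "ethiopian",
    "addr:city": "Berkeley",
    "addr:street": "Telegraph Avenue",
    "phone": "5102509533",
}
cleaned = to_document(example_tags)
print(cleaned)
```

With these rules, the example reduces to the amenity value and the cuisine tag, so a token like "Cafe" in the name no longer competes with the actual category.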

Word embeddings have inherent problems because they are context free; see the "coffee shop" query
in the Analysis section. These problems could be reduced by training the word vectors on the OSM tags.
However, that would move the meaning of words further from their usage in natural language. If all tags
were instead treated as distinct tokens, e.g. `key:amenity`, `value:shop`, `key:cuisine`, `value:coffee_shop`,
then training with documents like "coffee shop key:amenity value:cafe key:cuisine value:coffee_shop" would
provide a bridge between natural language and the keys.
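One way such training documents could be assembled is sketched below. The function name `training_document` is hypothetical, and where the natural-language phrase for each feature would come from is left open.

```python
from typing import Dict


def training_document(phrase: str, tags: Dict[str, str]) -> str:
    """Join a natural-language phrase with prefixed key/value tokens."""
    tag_tokens = []
    for key, value in tags.items():
        # Prefixes keep tag tokens distinct from ordinary words.
        tag_tokens.append(f"key:{key}")
        tag_tokens.append(f"value:{value}")
    return " ".join([phrase] + tag_tokens)


training_doc = training_document(
    "coffee shop", {"amenity": "cafe", "cuisine": "coffee_shop"}
)
print(training_doc)
```

The prefixed tokens would need to stay intact through tokenization, so a tokenizer that splits on `:` or `_` would defeat the purpose.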

However, moving to sentence embeddings is likely a better option to handle context.



## Analysis

In [305]:
import pandas as pd
import sqlalchemy as sa
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://jeffreyarnold@localhost:5432/osm", future=True)

In [255]:
from sqlalchemy import text
with engine.connect() as conn:
    features = conn.execute(text("SELECT osm_id, tags FROM  osm TABLESAMPLE SYSTEM (1) LIMIT 10")).fetchall()

In [None]:
import itertools
import re
from typing import Any, Dict, List

import spacy

nlp = spacy.load("en_core_web_lg")

class Documentizer:

    def __init__(self) -> None:
        pass
    
    def _key_cleaner(self, key: str) -> str:
        key = re.sub("[_:]", " ", key)
        return key
    
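    def _value_cleaner(self, value: str) -> str:
        # Drop bare yes/no flags. Only whole values are matched: a substring
        # replace would also strip "yes"/"no" from inside other words.
        value = value.strip()
        if value in ("yes", "no"):
            return ""
        value = re.sub("[_]", " ", value)
        return value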
    
    def _tag_filter(self, key: str, value: str) -> List[str]:
        # If NO then remove the word
        if value == "no":
            return []
        return [self._key_cleaner(key), self._value_cleaner(value)]
    
    def __call__(self, tags: Dict[str, Any]) -> str:
        return " ".join(itertools.chain.from_iterable(self._tag_filter(k, v) for k, v in tags.items()))


In [301]:
with engine.connect() as conn:
    res = conn.execute(text("SELECT tags from osm WHERE length(tags::TEXT) > 100 LIMIT 1")).fetchone()

In [304]:
print("TAG: ", res['tags'])
print("Document: ", Documentizer()(res['tags']))

TAG:  {'power': 'generator', 'generator:type': 'solar_photovoltaic_panel', 'generator:method': 'photovoltaic', 'generator:source': 'solar', 'generator:output:electricity': 'yes'}
Document:  power generator generator type solar photovoltaic panel generator method photovoltaic generator source solar generator output electricity 


Example: embed the synthetic document of every node in an Oakland bounding box.

In [280]:
import jinja2 as j2
ENV = j2.Environment(undefined=j2.StrictUndefined)

from spacy.tokens import Doc

if not Doc.has_extension("osm_id"):
    Doc.set_extension("osm_id", default=None)
    
# Oakland bbox
bbox = (-122.350244,37.698079,-122.112321,37.850861)

query ="""
SELECT osm_id, tags
FROM osm
WHERE geom && ST_MakeEnvelope({{bbox | join(",")}}, 4326)
    AND osm_type = 'N'
LIMIT 500000
"""

query = ENV.from_string(query).render(bbox=bbox)

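docs = []
documentizer = Documentizer()

with engine.connect() as conn:
    features = conn.execute(text(query))
    # With as_tuples=True, nlp.pipe takes (text, context) tuples, so pass the raw string rather than a parsed Doc
    documents = ((documentizer(row.tags), {"osm_id": row.osm_id, "tags": row.tags}) for row in features)
    for doc, ctx in nlp.pipe(documents, as_tuples=True):
        # Normalize to unit length so that inner product equals cosine similarity
        vec = doc.vector / doc.vector_norm if doc.vector_norm > 0 else doc.vector
        docs.append((ctx, vec))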


Only about 39 thousand nodes fall inside that bbox, so the `LIMIT 500000` never applies:

```
SELECT count(*)
 FROM osm
 WHERE geom && ST_MakeEnvelope(-122.350244,37.698079,-122.112321,37.850861, 4326)
     AND osm_type = 'N'
 
+-------+
| count |
|-------|
| 39431 |
+-------+
SELECT 1
```

In [239]:
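import numpy as np
embeddings = np.vstack([d[1] for d in docs])
# Context dicts (osm_id, tags) in the same row order as the embeddings; used by the lookups below
doc_values = [d[0] for d in docs]
embeddings.shape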

(39431, 300)

## Load word embeddings into a FAISS index

None of the FAISS index choices have been optimized.

- Using the inner product (IP) on unit vectors is equivalent to cosine similarity.
- A Flat index (exact, exhaustive search) is used to see how the lookup works. The other index types exist for computational reasons.
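The two points above can be checked with a small NumPy sketch. The random vectors are made up for illustration, and the top-k step only mimics conceptually what an exhaustive inner-product search returns; it is not how FAISS implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 4)).astype("float32")  # toy "documents"
query = rng.normal(size=4).astype("float32")

# Cosine similarity computed directly from the raw vectors.
cosine = vectors @ query / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
)

# Inner product after normalizing both sides to unit length.
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)
inner = unit_vectors @ unit_query

print(np.allclose(cosine, inner, atol=1e-5))

# Exhaustive search: score every vector, then keep the k best.
k = 3
top_k = np.argsort(-inner)[:k]
print(top_k, inner[top_k])
```

An exhaustive search scores every stored vector for each query, which is fine at this scale (about 39 thousand vectors) but is the part that approximate indexes are designed to avoid.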

In [267]:
import faiss
embedding_size = embeddings.shape[1]

# Define the FAISS index.
# For an IVF index, the FAISS guidelines suggest between 4*sqrt(N) and 16*sqrt(N) clusters:
# https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
# n_clusters = int(2 * np.sqrt(embeddings.shape[0]))

# Use inner product (dot product) as the metric. The vectors are normalized to unit length, so inner product equals cosine similarity.
# quantizer = faiss.IndexFlatIP(embedding_size)
# index = faiss.IndexIVFFlat(quantizer, embedding_size, n_clusters, faiss.METRIC_INNER_PRODUCT)
index = faiss.IndexFlatIP(embedding_size)

# Training finds the clustering for an IVF index. A flat index needs no training, so this call does nothing here.
index.train(embeddings)

# Finally, add all embeddings to the index
index.add(embeddings)

In [268]:
index.search(embeddings[:5, :], k=5)

(array([[0.9999999, 0.9999999, 0.9999999, 0.9999999, 0.9999999],
        [0.9999999, 0.9999999, 0.9999999, 0.9999999, 0.9999999],
        [0.9999999, 0.9999999, 0.9999999, 0.9999999, 0.9999999],
        [0.9999999, 0.9999999, 0.9999999, 0.9999999, 0.9999999],
        [0.9999999, 0.9999999, 0.9999999, 0.9999999, 0.9999999]],
       dtype=float32),
 array([[60, 40,  2,  1,  0],
        [60, 40,  2,  1,  0],
        [60, 40,  2,  1,  0],
        [48,  8,  7,  4,  3],
        [48,  8,  7,  4,  3]]))

In [269]:
query = "cafe"
qv = nlp(query).vector 
qv /= np.linalg.norm(qv)
dist, idx = index.search(qv.reshape(1, -1), 5)
[Documentizer()(doc_values[i]['tags']) for i in idx[0]]

['amenity cafe',
 'amenity cafe',
 'amenity cafe cuisine coffee shop',
 'name Paddington Cafe amenity cafe cuisine coffee shop',
 'name Aroma Cafe amenity cafe cuisine coffee shop']

In [270]:
query = "coffee shop"
qv = nlp(query).vector 
qv /= np.linalg.norm(qv)
dist, idx = index.search(qv.reshape(1, -1), 5)
[Documentizer()(doc_values[i]['tags']) for i in idx[0]]

['amenity cafe cuisine coffee shop',
 'shop laundry',
 'shop laundry',
 'shop hairdresser',
 'shop convenience']

Here we see a problem with averaging word vectors over these synthetic documents.

- "coffee" and "shop" are given equal weight in the query vector.
- So any short entry that contains "shop" scores highly.
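The effect can be reproduced with a toy model in which every word gets its own one-hot vector. This is a strong simplifying assumption (real embeddings are dense and correlated), and `embed` and `scores` are hypothetical names, but it shows how averaging dilutes a long document.

```python
import numpy as np

documents = [
    "amenity cafe cuisine coffee shop",
    "shop laundry",
    "name aroma cafe amenity cafe cuisine coffee shop",
]
query = "coffee shop"

# Assumption: one orthogonal (one-hot) vector per distinct word.
words = sorted({w for text in documents + [query] for w in text.split()})
index_of = {w: i for i, w in enumerate(words)}


def embed(text: str) -> np.ndarray:
    """Mean of the one-hot token vectors, scaled to unit length."""
    one_hot = np.eye(len(words))[[index_of[w] for w in text.split()]]
    mean = one_hot.mean(axis=0)
    return mean / np.linalg.norm(mean)


query_vec = embed(query)
scores = {text: float(embed(text) @ query_vec) for text in documents}
for text, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {text}")
```

In this toy setting the two-word entry "shop laundry" outranks the longer entry that contains both "coffee" and "shop", because the extra tokens in the long entry shrink the share of the matching words.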

In [283]:
query = "ethiopian food"
qv = nlp(query).vector 
qv /= np.linalg.norm(qv)
dist, idx = index.search(qv.reshape(1, -1), 5)
[Documentizer()(doc_values[i]['tags'])  for i in idx[0]]

['name Uarhi Taqueria amenity fast food cuisine mexican',
 'name Pieology amenity fast food cuisine pizza',
 'name Guadalajara Food Truck amenity fast food cuisine mexican',
 'name Sushi GoGo amenity fast food cuisine sushi delivery ',
 'name Dimond Kitchen amenity fast food cuisine mexican']

The results pick up "food" but not Ethiopian cuisine, even though the bbox contains Ethiopian restaurants:

In [287]:
with engine.connect() as conn:
    res = conn.execute(text(ENV.from_string("""
        SELECT tags from osm 
        WHERE tags ->> 'cuisine' ILIKE '%ethiopian%' 
        AND geom && ST_MakeEnvelope({{bbox | join(",")}}, 4326)
        AND osm_type = 'N'
        LIMIT 5""").render(bbox=bbox))).fetchall()
for row in res:
    print(Documentizer()(row.tags))

name Lemat phone +15104302717 amenity restaurant cuisine ethiopian website https://www.lematethiorestaurant.com/ addr street Adeline Street addr postcode 94703 opening hours We-Mo 12:00-22:00 addr housenumber 3212
name Shewhat cafe phone 5102509533 amenity restaurant cuisine ethiopian
name Cafe Colucci amenity restaurant cuisine ethiopian addr city Berkeley addr state CA addr street Telegraph Avenue addr housenumber 6427
name Asmara phone (510) 547-5100 amenity restaurant cuisine ethiopian website http://asmararestaurant.com/ takeaway  addr city Oakland wheelchair  addr street Telegraph Avenue addr postcode 94609 opening hours Tu-Su 11:30-22:30
name Enssaro amenity restaurant cuisine ethiopian website http://www.enssaro.com addr city Oakland addr street Grand Avenue addr postcode 94610 opening hours Mo, We, Th, Su 11:30-22:00; Fr, Sa 11:30-23:00 addr housenumber 357


In [285]:
query = "attorney"
qv = nlp(query).vector 
qv /= np.linalg.norm(qv)
dist, idx = index.search(qv.reshape(1, -1), 5)
[Documentizer()(doc_values[i]['tags']) for i in idx[0]]

['name Jackson Hewitt office tax advisor addr housenumber 2137',
 'name Dentist Bryan D. Haynes amenity dentist healthcare dentist',
 'office tax advisor',
 'name Alameda County Sheriff amenity police',
 'name Dr. Huey P. Newton Foundation office ngo website https://hueypnewtonfoundation.org/']

In [286]:
query = "hotel"
qv = nlp(query).vector 
qv /= np.linalg.norm(qv)
dist, idx = index.search(qv.reshape(1, -1), 10)
[Documentizer()(doc_values[i]['tags']) for i in idx[0] if i >= 0]

['name Neptune Palace Hotel tourism hotel',
 'name Pullman Hotel demolished tourism hotel',
 'name Arcadia Hotel demolished tourism hotel',
 'amenity restaurant',
 'amenity restaurant',
 'amenity restaurant',
 'amenity restaurant',
 'name Hotel Pine demolished tourism hotel',
 'name Trending Inn tourism hotel',
 'amenity cafe']

The "hotel" query picks up some hotels (several of them demolished) and an inn, but also restaurants and a cafe.