# !poetry install --no-root
* On macOS, start the server with:
```
poetry run jupyter notebook --notebook-dir  `pwd`
```

In [37]:
# %load_ext autoreload
# %autoreload 2

In [38]:
from typing import List
from IPython.core.display import HTML
import helpers
import os
import warnings
import pandas as pd
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

warnings.filterwarnings("ignore")
os.environ[
    "TOKENIZERS_PARALLELISM"
] = "true"  # setting this explicitly silences the tokenizers fork warning

# Vector Embedding Introduction


* [Thomas Fuchs](thomas.fuchs@nosto.com), Data Scientist
* [Georg M. Sorst](georg.sorst@nosto.com), Team Lead Search

<p style="text-align: center;"><img src="Nosto_hr_magenta.svg" style="height: 50%; width: 50%; margin-left: auto; margin-right: auto;"/></p>

Formerly known as:

<p style="text-align: center;"><img src="Findologic_Logo_Dark.svg" style="height: 15%; width: 15%; margin-left: auto; margin-right: auto;"/></p>

# Vector Embeddings

Specialized neural networks can transform text into vectors.

These vectors can be _embedded_ into a common vector space.

This makes it possible to discover semantic relationships between texts.

Let's define some words.

In [39]:
words = [
    "queen",
    "king",
    "prince",
    "princess",
    "man",
    "woman",
    "boy",
    "girl",
    "red",
    "green",
    "blue",
    "palace",
]

Transforming words into vectors is easy with Python.

Many free models exist to perform vector embedding.

In [40]:
def embed(texts):
    model_name = "all-MiniLM-L6-v2"
    model = SentenceTransformer(
        model_name,
        device=helpers.get_torch_device_name(),  # Optional: if you want to run this on GPU
    )
    return model.encode(texts)

The resulting vector is represented as a one-dimensional NumPy array.

All vectors produced by a model share the same dimensionality; for this model, it is 384.

In [41]:
pd.DataFrame(embed(words[0]))

Unnamed: 0,0
0,0.035487
1,-0.065605
2,-0.009935
3,0.031590
4,-0.013387
...,...
379,0.026038
380,0.091385
381,-0.053889
382,-0.031242


Each word is transformed into a vector so that we can discover semantic relationships.

In [42]:
pd.DataFrame(
    {"Sentence": words, "Encoding": list(embed(words))}
).head(3)

Unnamed: 0,Sentence,Encoding
0,queen,"[0.03548696, -0.06560465, -0.009934981, 0.0315..."
1,king,"[-0.059599336, 0.050512414, -0.06951014, 0.079..."
2,prince,"[-0.036828797, 0.041281953, 0.041856598, 0.041..."


Let's visualize the vectors to show their relationships.

But 384-dimensional vectors cannot be plotted on a 2-dimensional screen.

Principal Component Analysis (PCA) can reduce the number of dimensions from 384 to 2.

In [43]:
word_samples = words[0:3]
embeddings = embed(word_samples)
reduced_embeddings = PCA(n_components=2).fit_transform(embeddings)
pd.DataFrame(
    {"Words": word_samples, "Encoding": list(reduced_embeddings)}
)

Unnamed: 0,Words,Encoding
0,queen,"[-0.2934045, -0.38926488]"
1,king,"[-0.25320098, 0.4088318]"
2,prince,"[0.54660565, -0.019567026]"
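
To make the reduction step less of a black box, here is a minimal sketch of what PCA does, written with plain NumPy instead of scikit-learn. It uses random stand-in vectors rather than real embeddings (an assumption for illustration), and `pca_2d` is a hypothetical helper name. The data is centered and then projected onto the two directions of largest variance; the signs of the components may differ from scikit-learn's output.

```python
import numpy as np


def pca_2d(vectors: np.ndarray) -> np.ndarray:
    """Project row vectors onto their first two principal components."""
    # Center the data: PCA measures variance around the mean
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Random stand-ins for three 384-dimensional embeddings (not real model output)
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 384))
reduced = pca_2d(fake_embeddings)
print(reduced.shape)  # (3, 2)
```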


In [44]:
import plotly.graph_objects as go
from sklearn.decomposition import PCA


def plot(sentences, embeddings, color="blue", existing_figure=None):
    # Perform PCA
    pca = PCA(n_components=2)
    reduced_embeddings = pca.fit_transform(embeddings)

    # Create annotations for each sentence
    annotations = []
    for i, sentence in enumerate(sentences):
        words = sentence.split()
        annotation = " ".join(words[:3]) + ("..." if len(words) > 3 else "")
        annotations.append(annotation)

    my_figure = existing_figure or go.Figure()

    # Add the reduced embeddings as a scatter plot
    my_figure.add_trace(go.Scatter(
        x=reduced_embeddings[:, 0],
        y=reduced_embeddings[:, 1],
        mode="markers+text",
        text=annotations,  # Add the annotations
        textposition="top center",
        marker=dict(color=color, size=10)
    ))

    # Update axes and title
    my_figure.update_layout(
        title="2D PCA of Sentence Embeddings",
        xaxis_title="Principal Component 1",
        yaxis_title="Principal Component 2",
        showlegend=False,
        height=600,
        width=800
    )

    # Only show the figure if none was provided
    if existing_figure is None:
        my_figure.show()

When visualizing the reduced vectors, clear semantic clusters appear.

In [45]:
plot(words, embed(words))

# Sentence Embeddings

We can embed not only words but also entire sentences and documents.

Let's define some documents and embed them.

In [46]:
import pandas as pd

documents = [
    "Vector embeddings are mathematical representations of objects, often words or phrases, in a high-dimensional space. By mapping similar objects to proximate points, embeddings capture relationships and semantic meaning. Commonly used in machine learning and natural language processing tasks, methods like Word2Vec, GloVe, and FastText have popularized their application, enabling advancements in text analysis, recommendation systems, and more.",
    "Keyword search refers to the process of locating information in a database, search engine, or other data repository by specifying particular words, phrases, or symbols. In the digital realm, it's foundational to search engines like Google and Bing. The search results are typically ranked based on relevance, which is determined using various algorithms that consider factors like frequency, location, and link structures. Keyword search is integral for navigating the vast expanse of online information, aiding users in retrieving relevant data efficiently.",
    "Sandwiches are a popular type of food consisting of one or more types of food, such as vegetables, sliced meat, or cheese, placed between slices of bread. They can range from simple combinations like peanut butter and jelly to more complex gourmet creations. Originating from England in the 18th century, sandwiches have become a staple in many cultures worldwide, prized for their convenience and versatility. Variations exist based on regional preferences, ingredients, and preparation methods.",
    "Data science is an interdisciplinary field that leverages statistical, computational, and domain-specific expertise to extract insights and knowledge from structured and unstructured data. It encompasses various techniques from statistics, machine learning, data mining, and big data technologies to analyze and interpret complex data. Data science has applications across numerous sectors, including healthcare, finance, marketing, and social sciences, driving decision-making, predictive analytics, and artificial intelligence advancements. Its growing significance in today's data-driven world has led to the rise of specialized tools, methodologies, and educational programs.",
    "Neural networks are a class of machine learning models inspired by the biological neural networks of animal brains. They consist of interconnected layers of nodes, or neurons, which process input data through a series of transformations and connections to produce output. Neural networks are particularly adept at recognizing patterns, making them useful for a wide range of applications such as image and speech recognition, natural language processing, and predictive analytics. The development of deep neural networks, which contain multiple hidden layers, has been central to the field of deep learning and has significantly advanced the capabilities of artificial intelligence systems.",
    "Pasta is a staple food of traditional Italian cuisine, with the first reference dating to 1154 in Sicily. It is typically made from an unleavened dough of durum wheat flour mixed with water or eggs and formed into sheets or various shapes, then cooked by boiling or baking. Pasta is versatile and can be served with a variety of sauces, meats, and vegetables. It is categorized in two basic styles: dried and fresh. Popular around the world, pasta dishes are central to many diets and come in numerous shapes like spaghetti, penne, and ravioli.",
    "Soup is a liquid food, generally served warm or hot (but also cold), that is made by combining ingredients such as meat and vegetables with stock, juice, water, or another liquid. Soups are inherently diverse, ranging from rich, cream-based varieties to brothy and vegetable-laden concoctions. They are often regarded as comfort food and can be served as a main dish or as an appetizer, with regional and cultural variations like the Spanish gazpacho, Japanese miso soup, and Russian borscht.",
    "A casserole is a comprehensive one-dish meal baked in a deep, ovenproof dish with a glass or ceramic base. It typically includes a combination of meats, vegetables, starches like rice or potatoes, and a binding agent like a soup or sauce. Topped with cheese or breadcrumbs for a crispy crust, casseroles are appreciated for their convenience and the ability to meld flavors during the baking process. They are a fixture in many cultures and are particularly beloved as home-cooked comfort foods, often featuring in communal gatherings and family dinners.",
]

pd.DataFrame(
    {"Sentence": documents, "Encoding": list(embed(documents))}
).head(3)

Unnamed: 0,Sentence,Encoding
0,Vector embeddings are mathematical representat...,"[-0.0016682168, -0.06941, -0.026505126, 0.0056..."
1,Keyword search refers to the process of locati...,"[0.019650575, -0.062715, -0.045780774, -0.0006..."
2,Sandwiches are a popular type of food consisti...,"[-0.04432277, -0.0237825, 0.03651129, -0.01122..."


Again, semantic clusters appear when visualizing the vectors in 2D space.

In [47]:
import plotly.graph_objects as go


def plots(sentences_embeddings_color):
    figure = go.Figure()
    for sentences, embeddings, color in sentences_embeddings_color:
        plot(sentences, embeddings, color, figure)
    figure.show()

In [48]:
plots([(documents, embed(documents), "green")])

# Information Retrieval

Similar documents have similar vectors.

This characteristic can be used to retrieve related documents for an input text.

Let's start by defining some search queries.

In [49]:
queries = [
    "information retrieval",
    "machine learning",
    "cooking",
]
plots([(queries, embed(queries), "red")])

Visualizing documents and queries in one space uncovers semantic relations.

Each query is closest to its most relevant documents.

In [50]:
plots(
    [
        (documents, embed(documents), "green"),
        (queries, embed(queries), "red"),
    ]
)
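
One caveat: `plot` fits a fresh PCA on every call, so in the cell above the documents and the queries are each projected with their own PCA, and their 2D coordinates are not strictly comparable. A possible way around this is to fit the projection once on all vectors and then split the result back into groups. The sketch below does that with random stand-in vectors and a hypothetical helper name, `project_jointly`; it is not part of the notebook.

```python
import numpy as np


def project_jointly(groups):
    """Project several groups of vectors into one shared 2D PCA space."""
    stacked = np.vstack(groups)
    # Fit a single PCA (via SVD) on all vectors together
    mean = stacked.mean(axis=0)
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    reduced = (stacked - mean) @ vt[:2].T
    # Split the projected points back into the original groups
    split_points = np.cumsum([len(g) for g in groups])[:-1]
    return np.split(reduced, split_points)


rng = np.random.default_rng(0)
fake_documents = rng.normal(size=(8, 384))  # stand-in for embed(documents)
fake_queries = rng.normal(size=(3, 384))  # stand-in for embed(queries)
docs_2d, queries_2d = project_jointly([fake_documents, fake_queries])
print(docs_2d.shape, queries_2d.shape)  # (8, 2) (3, 2)
```

The resulting 2D points could then be drawn as two scatter traces on one figure without running PCA again.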

# Simple Search

## Load Data
Firstly, we need to load data. To do this, we use the product data from a [customer](https://gympluscoffee.com/).

In [51]:
merchant_id = "shopify-20345599"
path_name = f"data/{merchant_id}_products.pkl"
df_products = pd.read_pickle(path_name)

In [52]:
df_products.head(3)

Unnamed: 0,productId,name,description,brand,category
0,1303906844777,Black KeepCup Small,Gym+Coffee branded Black Keepcup. The easy cho...,Gym+Coffee,"[Versatile Collection, Autumn Fits, All-In Col..."
1,1303907074153,Black KeepCup Medium,Gym+Coffee branded Black Keepcup. The easy cho...,Gym+Coffee,"[Versatile Collection, Autumn Fits, Gifts Unde..."
2,1316361076841,U-Move Tank,The essential U-Move tanks were designed with ...,Gym+Coffee,"[T-Shirts & Tanks, Versatile Collection, Tanks..."


## Embed Data
The next step is to embed the data.  
In a real-world scenario, we would use a dedicated search engine such as Elasticsearch to apply embeddings and assign weights to different fields.  
An advantage of embeddings is that several fields can be combined into a single text before encoding, so that one vector represents the whole product.

In [53]:
def combine_fields(x):
    # Concatenate the searchable fields of one product row into a single text
    return (
        f"Product name = {x['name']}\n"
        f"Description = {x['description']}\n"
        f"Categories = {x['category']}\n"
        f"Brand = {x['brand']}"
    )
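
To see what the model actually receives, the sketch below applies the same field template to a hypothetical product row (all values are made up). Note that a list-valued field such as the categories is inserted as its Python representation, brackets and quotes included. Joining the entries with commas might give the model cleaner text, though that is an untested assumption.

```python
def combine_fields(row):
    # Same template as in the notebook cell, repeated so this example runs on its own
    return (
        f"Product name = {row['name']}\n"
        f"Description = {row['description']}\n"
        f"Categories = {row['category']}\n"
        f"Brand = {row['brand']}"
    )


# Hypothetical product row, invented for illustration only
sample_row = {
    "name": "Example Hoodie",
    "description": "A warm hoodie.",
    "category": ["Hoodies", "Gifts"],
    "brand": "ExampleBrand",
}
text = combine_fields(sample_row)
print(text)
```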

In [54]:
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    device=helpers.get_torch_device_name(),  # Optional: if you want to run this on GPU
)
df_for_search = pd.DataFrame()
df_for_search["base_string"] = df_products.apply(
    combine_fields, axis=1
).values
df_for_search["embeddings"] = list(
    model.encode(df_for_search["base_string"].values)
)

In [55]:
df_for_search.head(5)

Unnamed: 0,base_string,embeddings
0,Product name = Black KeepCup Small\nDescriptio...,"[-0.07552906, 0.01928508, -0.0069891512, 0.082..."
1,Product name = Black KeepCup Medium\nDescripti...,"[-0.064423524, 0.025618477, 0.0058485963, 0.07..."
2,Product name = U-Move Tank\nDescription = The ...,"[-0.08300712, 0.06053903, -0.02068592, 0.05100..."
3,Product name = U-Stretch Tank\nDescription = B...,"[-0.087295555, 0.06142012, 0.026349524, 0.0778..."
4,Product name = U-Live Tank\nDescription = Made...,"[-0.0651835, 0.04733745, 0.011253881, 0.062099..."


In [56]:
df_for_search = pd.concat(
    [df_for_search, df_products.loc[:, ["name", "productId"]]],
    axis=1,
)

## Calculate Similarity
The most common way to measure the similarity of two embeddings is [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) or the derived cosine distance.  
$\text{cosine similarity}=S_{C}(A,B):=\cos(\theta)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum \limits_{i=1}^{n}{A_{i}B_{i}}}{{\sqrt{\sum \limits_{i=1}^{n}{A_{i}^{2}}}}{\sqrt{\sum \limits_{i=1}^{n}{B_{i}^{2}}}}}\in [-1,1]$  
$\text{cosine distance}=D_{C}(A,B):=1-S_{C}(A,B)$  
**ATTENTION: cosine distance is not a true distance metric, because it does not satisfy the triangle inequality.**  
We can use the ready-made implementation in scikit-learn for this.

In [57]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

np.random.seed(1)
dimension = 2
num_vectors = 4
df_show_cos = pd.DataFrame(
    {
        "embeddings": np.random.uniform(
            0, 1, size=(num_vectors, dimension)
        ).tolist(),
        "color_col": ["products"] * (num_vectors - 1) + ["query"],
    }
)
df_show_cos["similarity"] = cosine_similarity(
    [df_show_cos["embeddings"].iloc[-1]],
    df_show_cos["embeddings"].tolist(),
)[0]
fig = helpers.color_embedings_df(
    df_show_cos,
    color_col="color_col",
    dimensions=dimension,
    add_vectors=True,
    hover_data=["similarity"],
)

In [58]:
fig.show()
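
The formula can also be written out by hand, which makes it easy to check the remark that cosine distance is not really a distance metric. The sketch below uses only the standard library and three small hand-picked vectors (chosen for illustration). It shows a case where the direct distance between two vectors is larger than the detour through a third one, which a true metric would not allow.

```python
import math


def cosine_similarity_manual(a, b):
    """Cosine similarity of two equal-length vectors, following the formula."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    return 1.0 - cosine_similarity_manual(a, b)


a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
direct = cosine_distance(a, c)  # 1.0, since a and c are orthogonal
detour = cosine_distance(a, b) + cosine_distance(b, c)  # about 0.586
print(direct > detour)  # True: the triangle inequality is violated
```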

## Approximate calculation of similarity
With [Annoy](https://github.com/spotify/annoy) (Approximate Nearest Neighbors Oh Yeah) we can make the search considerably faster: instead of comparing the query with every stored vector, Annoy looks up approximate nearest neighbors in a prebuilt index.  
The index is a forest of random-projection trees, which makes it fast to query and compact to store.

In [59]:
from annoy import AnnoyIndex


def get_annoy_index(
    df: pd.DataFrame, n_trees: int = 20
) -> AnnoyIndex:
    embeddings = df["embeddings"]
    index_ann = AnnoyIndex(
        len(embeddings[0]), "angular"
    )  # Length of item vector that will be indexed
    for i, v in embeddings.items():  # ATTENTION index must be int
        index_ann.add_item(i, v)
    index_ann.build(
        n_trees=n_trees
    )  # More trees gives higher precision when querying
    return index_ann

In [60]:
ann_index: AnnoyIndex = get_annoy_index(df_for_search, n_trees=20)
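
For context, the exact search that Annoy approximates can be sketched in a few lines of NumPy: compare the query against every stored vector and sort. The vectors below are random stand-ins, and `exact_top_n` is a hypothetical helper, not part of the notebook. Annoy's `angular` metric is the Euclidean distance between normalized vectors, $\sqrt{2(1-\cos\theta)}$, so ranking by it should give the same order as ranking by descending cosine similarity.

```python
import numpy as np


def exact_top_n(vectors: np.ndarray, query: np.ndarray, top_n: int = 5):
    """Return indices of the top_n vectors most similar to query (cosine)."""
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    similarities = unit_vectors @ unit_query
    # Highest similarity first; the cost grows linearly with the number of vectors
    return np.argsort(-similarities)[:top_n].tolist()


rng = np.random.default_rng(1)
fake_vectors = rng.normal(size=(1000, 384))  # stand-in for product embeddings
# A slightly perturbed copy of vector 42 should retrieve vector 42 first
fake_query = fake_vectors[42] + 0.01 * rng.normal(size=384)
top_hits = exact_top_n(fake_vectors, fake_query, top_n=3)
print(top_hits[0])  # 42
```

This brute-force version is exact but touches every vector on every query, which is the cost an approximate index tries to avoid.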

### Show entries of the Annoy index

In [61]:
n_items = ann_index.get_n_items()
pd.DataFrame(
    [ann_index.get_item_vector(x) for x in range(n_items)],
)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
0,-0.075529,0.019285,-0.006989,0.082984,0.115914,0.066811,0.031621,-0.034079,0.004284,-0.010243,...,-0.002267,0.022710,-0.030541,0.028125,0.018602,0.015280,0.072853,-0.122227,0.014324,0.010698
1,-0.064424,0.025618,0.005849,0.079927,0.061402,0.056608,0.057538,-0.017084,-0.009730,-0.068837,...,-0.007495,-0.018865,-0.002062,0.021568,0.007308,-0.010578,0.100386,-0.089338,0.022126,0.031776
2,-0.083007,0.060539,-0.020686,0.051009,0.080964,0.008052,0.079219,0.026133,-0.099782,-0.060111,...,0.030982,0.100541,-0.045774,-0.029939,0.030733,0.030493,0.015915,-0.060644,0.019287,0.062732
3,-0.087296,0.061420,0.026350,0.077841,0.076514,0.022325,0.088448,0.027987,-0.060551,-0.079241,...,-0.022492,0.043960,-0.059321,0.009342,0.024988,0.012512,0.059332,-0.079130,-0.002070,0.001584
4,-0.065183,0.047337,0.011254,0.062100,0.053767,0.029995,0.076041,0.016295,-0.047152,-0.080945,...,-0.025368,0.016461,-0.019377,0.019938,0.032378,-0.014602,0.072847,-0.085598,-0.001103,-0.002153
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1516,-0.059440,0.052507,0.025046,0.060279,0.059984,0.014972,0.041650,0.039143,-0.064719,-0.053004,...,-0.022284,0.028261,-0.021379,0.012574,0.026155,0.025952,0.039348,-0.130215,-0.013892,0.043890
1517,-0.040223,0.025612,0.018766,0.054562,0.051350,0.006034,0.067634,0.028067,-0.050209,-0.027354,...,-0.026328,0.023861,-0.022710,0.015125,0.033535,0.023414,0.031018,-0.125512,-0.017788,0.040470
1518,-0.040751,0.035574,0.006139,0.052165,0.077379,0.012374,0.047038,0.050236,-0.040346,-0.051100,...,-0.011843,-0.006011,-0.000757,0.018591,0.029172,0.036213,0.047751,-0.149898,0.026703,0.053867
1519,-0.000236,0.049645,0.002055,0.030432,0.062886,-0.021721,-0.011820,0.036769,-0.051494,-0.023602,...,-0.027179,0.017882,-0.000320,0.014453,0.027729,0.039504,0.062714,-0.160603,0.024804,0.051788


In [62]:
df_for_search["annoy_cluster"] = helpers.calc_cluster(ann_index)
fig_annoy = helpers.color_embedings_df(
    df_for_search, "annoy_cluster", hover_data=["name"], dimensions=3
)

In [63]:
fig_annoy.show()

### Query with Annoy

In [64]:
def get_similar_products_annoy(
    ann_index: AnnoyIndex, query: str, top_n: int = 5
) -> List[int]:
    query_embedding = model.encode(query)
    nns = ann_index.get_nns_by_vector(query_embedding, top_n)
    return nns

In [65]:
query_easy = "hoodie"
sim_prod = df_products.loc[
    get_similar_products_annoy(ann_index, query_easy, 5), :
]
html_easy = helpers.display_images_and_names(
    sim_prod,
    merchant_id,
    f"ANNOY VectorSearch for:<br>'{query_easy}'",
)

In [66]:
display(HTML(html_easy))

ANNOY VectorSearch for: 'hoodie'

1. Women's 2Tone Hoodie in Blush-Pink
2. Women's Midnight Navy Hoodie
3. FREE GIFT | Men's Chill Pullover Hoodie in Black
4. FREE GIFT | Men's Chill Pullover Hoodie in Black
5. FREE GIFT | Men's Chill Pullover Hoodie in Black


### Complex Query with Annoy

In [67]:
query_complex = "I Need a new hody for my Frau. It soll be green."
# I need a new hoodie for my wife. It should be green.
sim_prod = df_products.loc[
    get_similar_products_annoy(ann_index, query_complex, 5)[0:5]
]
html_complex = helpers.display_images_and_names(
    sim_prod,
    merchant_id,
    f"ANNOY VectorSearch for:<br>'{query_complex}'",
)

In [68]:
display(HTML(html_complex))

ANNOY VectorSearch for: 'I Need a new hody for my Frau. It soll be green.'

1. Women's Green Chill Hoodie Gift Box
2. Kinney Crew in Fern Green
3. UniCrew in Hunter Green
4. Iconic Blend Beanie in Beige Melange
5. Pine Green Beanie


### Same query, same index, more results requested

In [69]:
sim_prod = df_products.loc[
    get_similar_products_annoy(ann_index, query_complex, 500)[0:5]
]
html_complex_more_result = helpers.display_images_and_names(
    sim_prod,
    merchant_id,
    f"ANNOY VectorSearch for:<br>'{query_complex}'",
)

In [70]:
display(HTML(html_complex_more_result))

ANNOY VectorSearch for: 'I Need a new hody for my Frau. It soll be green.'

1. Women's Green Chill Hoodie Gift Box
2. Women's Green Chill Hoodie Gift Box
3. Kinney Crew in Fern Green
4. Retro UniCrew in Hunter Green
5. Kin Crew in Pacific Green
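
A likely explanation for why the results change: according to the Annoy documentation, a query inspects up to `search_k` nodes, and `search_k` defaults to `n * n_trees` when it is not given, so asking for 500 neighbors also widens the search. The same effect can probably be had by passing `search_k` directly instead of over-fetching. The sketch below assumes an index object that offers Annoy's `get_n_trees` and `get_nns_by_vector` methods; the helper name and the `budget_factor` parameter are made up for illustration.

```python
from typing import List, Sequence


def get_nns_with_budget(
    index,
    query_vector: Sequence[float],
    top_n: int = 5,
    budget_factor: int = 100,
) -> List[int]:
    """Query an Annoy-style index with a larger search budget than the default."""
    # Annoy's default budget is top_n * n_trees; scale it up explicitly
    search_k = budget_factor * top_n * index.get_n_trees()
    return index.get_nns_by_vector(query_vector, top_n, search_k=search_k)
```

In the notebook this could be called as `get_nns_with_budget(ann_index, model.encode(query_complex), top_n=5)`. With `budget_factor=100` the budget matches that of requesting 500 neighbors.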


### Same query, index with more trees, few results requested

In [71]:
ann_index_more_tree: AnnoyIndex = get_annoy_index(
    df_for_search, n_trees=200
)
sim_prod = df_products.loc[
    get_similar_products_annoy(ann_index_more_tree, query_complex, 5)[
        0:5
    ]
]
html_complex_more_tree = helpers.display_images_and_names(
    sim_prod,
    merchant_id,
    f"ANNOY VectorSearch for:<br>'{query_complex}'",
)

In [72]:
display(HTML(html_complex_more_tree))

ANNOY VectorSearch for: 'I Need a new hody for my Frau. It soll be green.'

1. Women's Green Chill Hoodie Gift Box
2. Women's Green Chill Hoodie Gift Box
3. Kinney Crew in Fern Green
4. Kin Crew in Pacific Green
5. UniCrew in Hunter Green


## Vector Search: Advantages and Disadvantages

### Advantages
1. **Efficiency:**
    * Vector search allows for fast and efficient similarity searches in high-dimensional spaces.  
2. **Scalability:**
    * Well-suited for large datasets and can scale effectively with the growing volume of data.  
3. **Flexibility:**
    * Adaptable to various data types, making it versatile for different domains such as **image, text, and audio**.  
4. **Semantic Understanding:**
    * Captures semantic relationships, enabling more meaningful and context-aware search results.  

### Disadvantages
1. **Complexity:**
    * Implementation and optimization of vector search algorithms can be complex, requiring specialized knowledge.  
2. **Resource Intensive:**
    * Computationally intensive, demanding significant computing resources for large-scale applications.  
3. **Quality of Embeddings:**
    * The effectiveness of vector search heavily depends on the quality of the embeddings, which may require fine-tuning.  
4. **Interpretability:**
    * Results may lack interpretability, making it challenging to understand the reasoning behind specific search outcomes.

### Time Comparison
![](data/time_to_calculate_similarity_100_queries_all-MiniLM-L6-v2.png)