# Finding similar sentences within a text corpus

One great use case for embeddings is similarity search (often referred to as "neural search"): texts are mapped to vectors, and texts with similar meaning tend to end up close together in vector space. In this very short example, we'll show you how to build a simple sentence comparison by computing the cosine similarity of two embeddings.

![A huge pile of embedded data points in vector space](https://miro.medium.com/max/2028/1*1LHBbqmPI0X4I3rio5ujWQ.png)

This notebook is only meant as a tutorial; for anything beyond a small experiment, be aware that there are many dedicated neural search engines, such as [qdrant](https://qdrant.tech).

---

To get started, we import two things: a function that loads some sample data from the library, and a pre-trained sentence embedder based on a transformer architecture.

In [None]:
from embedders.samples.clickbait import get_sample_data
from embedders.classification.contextual import TransformerSentenceEmbedder

The `clickbait` dataset is straightforward and consists of short headlines. Here's an example.

In [None]:
texts = get_sample_data()
print(texts[0])

Next, we load the embedder via a Hugging Face configuration string. We use `distilbert-base-uncased` here, but you can pass in any other model from the Hub.

In [None]:
embedder = TransformerSentenceEmbedder("distilbert-base-uncased")

And now the magic happens: we encode the data. This is as easy as with your favorite sklearn objects - just call `fit_transform`.

In [None]:
embeddings = embedder.fit_transform(texts)

Now, to run a vanilla similarity search, we'll use cosine similarity: the dot product of two vectors divided by the product of their norms. It ranges from -1 (opposite directions) to 1 (same direction) and ignores the vectors' lengths.

In [None]:
import numpy as np

def cosine_similarity(vector_1, vector_2):
    return np.dot(vector_1, vector_2)/(np.linalg.norm(vector_1)*np.linalg.norm(vector_2))
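
As a quick sanity check, here is a minimal sketch that applies the function above to a few toy vectors. The 2-D vectors are made up for illustration and are not real sentence embeddings; the function is repeated so the snippet runs on its own.

```python
import numpy as np

def cosine_similarity(vector_1, vector_2):
    return np.dot(vector_1, vector_2) / (np.linalg.norm(vector_1) * np.linalg.norm(vector_2))

# Toy 2-D vectors, invented for illustration only (not real embeddings)
a = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # longer, but points the same way as a
orthogonal = np.array([0.0, 2.0])       # at a right angle to a
opposite = np.array([-1.0, 0.0])        # points the other way

sim_same = cosine_similarity(a, same_direction)
sim_orthogonal = cosine_similarity(a, orthogonal)
sim_opposite = cosine_similarity(a, opposite)

print(sim_same, sim_orthogonal, sim_opposite)
```

The three values should come out as 1, 0 and -1 (up to floating-point rounding), which shows that only the direction of the vectors matters, not their length.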

And finally, a simplistic nested loop to calculate pairwise similarities (skipping the comparison of a sentence with itself). Note that the runtime grows quadratically with the number of sentences.

In [None]:
from tqdm import tqdm

highest_similarity = float("-inf")
vector_pair = None, None
for vector_1_idx, vector_1 in tqdm(enumerate(embeddings), total=len(embeddings)):
    for vector_2_idx, vector_2 in enumerate(embeddings):
        if vector_1_idx != vector_2_idx:
            similarity = cosine_similarity(vector_1, vector_2)
            if similarity > highest_similarity:
                highest_similarity = similarity
                vector_pair = vector_1_idx, vector_2_idx
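
The nested loop is easy to read but slow for larger corpora. As an alternative, here is a rough sketch of the same search in vectorized NumPy. The helper name `most_similar_pair` is hypothetical (it is not part of `embedders`), and the sketch assumes the embeddings can be converted to a 2-D array of equal-length, non-zero vectors. It also builds a full n × n similarity matrix, so it only fits corpora that are small enough for memory.

```python
import numpy as np

def most_similar_pair(embeddings):
    """Return (i, j, similarity) for the most similar pair of distinct rows.

    Hypothetical helper for illustration; assumes equal-length, non-zero vectors.
    """
    matrix = np.asarray(embeddings, dtype=float)
    # Scale every row to unit length so that dot products equal cosine similarities
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    similarities = unit @ unit.T
    # Exclude the comparison of each vector with itself
    np.fill_diagonal(similarities, -np.inf)
    i, j = np.unravel_index(np.argmax(similarities), similarities.shape)
    return int(i), int(j), float(similarities[i, j])

# Toy embeddings, invented for illustration only
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
i, j, similarity = most_similar_pair(toy_embeddings)
print(i, j, similarity)
```

With real data you would pass `embeddings` instead of `toy_embeddings` and look up `texts[i]` and `texts[j]`.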

We can now take a look at the most similar pair in our text corpus.

In [None]:
print(texts[vector_pair[0]], texts[vector_pair[1]])
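
In a search setting you usually want the nearest neighbours of one particular sentence rather than the single best pair in the corpus. The following is a minimal sketch of such a lookup. The function `top_k_similar` and its parameters are assumptions made for this example, not part of the `embedders` API, and the toy vectors stand in for real embeddings.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=3):
    """Return up to k (index, similarity) tuples closest to the row at query_idx.

    Hypothetical helper for illustration; assumes equal-length, non-zero vectors.
    """
    matrix = np.asarray(embeddings, dtype=float)
    # Unit-length rows turn dot products into cosine similarities
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[query_idx]
    # Never return the query itself
    scores[query_idx] = -np.inf
    k = min(k, len(scores) - 1)
    # Sort ascending, reverse to descending, keep the k best indices
    best = np.argsort(scores)[::-1][:k]
    return [(int(idx), float(scores[idx])) for idx in best]

# Toy embeddings, invented for illustration only
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
neighbours = top_k_similar(toy_embeddings, query_idx=0, k=2)
print(neighbours)
```

Applied to the notebook's data, `top_k_similar(embeddings, query_idx=0)` would give the indices of the headlines closest to `texts[0]`.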

Wow - isn't that amazing?! Embedding data is one of the most sophisticated ways to enrich your records with valuable semantic metadata, and the use cases are nearly endless. And `embedders` helps you to quickly generate embeddings for your dataset! 😋

---

If you have further questions, don't hesitate to contact us. If there is anything you want to have added to the library, open an [issue](https://github.com/code-kern-ai/embedders/issues). And please, don't forget to give us a ⭐