# Wikipedia Semantic Search with Cohere Embeddings Archives

---
## Introduction
In this notebook, we demonstrate how to use the [Amazon Bedrock InvokeModel API](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModel.html) to do simple [semantic search](https://txt.cohere.ai/what-is-semantic-search/) on the [Wikipedia embeddings archives](https://cohere.com/blog/embedding-archives-wikipedia) published by Cohere. These archives embed Wikipedia sites in multiple languages. In this example, we'll use the 2023 version of [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3-int8-binary) and binary embeddings. We also use the [Amazon Bedrock Converse API](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-cohere-command-r-plus.html) to demonstrate how we can use the results of semantic search.

### Semantic Search and Text Embeddings
Semantic search leverages text embeddings and similarity to find responses based on meaning, not just keywords. Text embeddings represent pieces of text as numeric vectors that encode semantic meaning. These embeddings allow for mathematical comparisons of word and sentence meaning. Multilingual embeddings map text in different languages to the same vector space, enabling semantic search across languages. See [What is Semantic Search](https://cohere.com/blog/what-is-semantic-search) to read about improvement algorithms such as hierarchical navigable small world (HNSW) and multiple negative ranking loss.
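To make "mathematical comparisons of meaning" concrete, here is a minimal sketch of cosine similarity on toy vectors. The three-dimensional vectors and their values are hypothetical and chosen only for illustration; real embeddings from a model have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; not produced by any real model.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "airplane": [0.0, 0.2, 0.9],
}

cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
cat_airplane = cosine_similarity(toy_embeddings["cat"], toy_embeddings["airplane"])

print(f"cat vs kitten:   {cat_kitten:.3f}")
print(f"cat vs airplane: {cat_airplane:.3f}")
```

Vectors that point in similar directions score close to 1, which is the property semantic search relies on when it ranks documents against a query.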

### Int8/byte and Binary Encoded Embeddings
Semantic search over large datasets can require a lot of memory because most vector databases store embeddings and vector indices in memory. Dimensionality reduction to conserve memory and reduce costs can perform poorly ([Cohere research](https://arxiv.org/abs/2205.11498?ref=cohere-ai.ghost.io)). 

A better approach is to use a model that uses fewer bits per dimension. Cohere's Embed is a text embedding model that offers leading performance in 100+ languages. It translates text into vector representations which encode semantic meaning. Cohere's Embed is the first embedding model that natively supports int8/byte and binary embeddings.

Binary embeddings give you a 32x reduction in memory and, according to Cohere, can be searched 40x faster. Embeddings are typically stored as float32, so an embedding with 1024 dimensions requires 1024 x 4 bytes = 4096 bytes (32,768 bits). Using 1 bit per dimension requires only 1024 bits = 128 bytes, which is a 32x reduction in required memory (4096 * 8 / 1024 = 32). See [Cohere int8 & binary embeddings](https://cohere.com/blog/int8-binary-embeddings).
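The arithmetic above can be checked with a short sketch. The corpus size `num_vectors` is a hypothetical figure used only to show how the per-vector savings add up; it is not the size of the Wikipedia archive.

```python
# Storage needed per embedding for a 1024-dimensional model.
DIMENSIONS = 1024
BITS_PER_DIMENSION = {"float32": 32, "int8": 8, "binary": 1}

def bytes_per_embedding(encoding, dimensions=DIMENSIONS):
    """Return the storage needed for one embedding, in bytes."""
    return dimensions * BITS_PER_DIMENSION[encoding] // 8

float32_bytes = bytes_per_embedding("float32")
int8_bytes = bytes_per_embedding("int8")
binary_bytes = bytes_per_embedding("binary")

# Hypothetical corpus size, for illustration only.
num_vectors = 1_000_000
for name in BITS_PER_DIMENSION:
    per_vector = bytes_per_embedding(name)
    total_mb = per_vector * num_vectors / 1e6
    print(f"{name:>8}: {per_vector:>5} bytes each, {total_mb:,.0f} MB in total")

print("float32 / binary ratio:", float32_bytes // binary_bytes)
```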

---

## Getting Started

### Step 0: Install dependencies

In [1]:
# Let's install HF datasets and boto3, the AWS SDK for Python
%pip install datasets --quiet
%pip install boto3==1.34.120 --quiet

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


### Step 1: Load the Wikipedia embeddings archives published by Cohere

Let's now open a stream to the Simple English Wikipedia embeddings archive. We will read 1,000 records from it in the next step so we can search them afterwards.

In [2]:
from datasets import load_dataset
# Import torch, the open-source machine learning library
import torch

# Load at most 1000 documents and embeddings
max_docs = 1000
# Use the Simple English Wikipedia subset
lang = "simple"
docs_stream = load_dataset("Cohere/wikipedia-2023-11-embed-multilingual-v3-int8-binary", lang, split="train", streaming=True)

# To verify we have loaded the data, print docs_stream
print(docs_stream)

IterableDataset({
    features: ['_id', 'url', 'title', 'text', 'emb_int8', 'emb_ubinary'],
    n_shards: 7
})


The `IterableDataset` object is a lazy stream over the dataset: no records have been read yet, and we will take the first 1,000 examples from it in the next step. Its `features` are the names of the columns for each example.

`emb_int8` is an integer encoded embedding while `emb_ubinary` is a binary encoded embedding for each passage of a Wikipedia article.
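Because a streaming dataset is consumed lazily, capping the number of records is a matter of stopping the iteration early. The sketch below shows one way to do that with `itertools.islice`, which works on any Python iterable. The `fake_stream` generator and the names `sample_limit` and `sample_docs` are hypothetical stand-ins so the example runs without downloading anything.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields records lazily, without end."""
    i = 0
    while True:
        # Each record mimics two of the columns shown above.
        yield {"_id": str(i), "emb_ubinary": [0] * 128}
        i += 1

sample_limit = 1000

# islice stops after sample_limit records, so the endless stream is never exhausted.
sample_docs = list(islice(fake_stream(), sample_limit))

print(len(sample_docs), "records read; last _id:", sample_docs[-1]["_id"])
```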

### Step 2: Create tensor of binary embeddings for semantic search

In [3]:
# Import sys so we can measure object sizes with sys.getsizeof
import sys
# Let's create lists of documents and binary embeddings
docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc["emb_ubinary"])
    if len(docs) >= max_docs:
        break

# Convert doc_embeddings into a PyTorch tensor
doc_embeddings = torch.tensor(doc_embeddings)

first_doc = next(iter(docs_stream))

# Size of a tensor with the integer embeddings of the first doc
first_integer_tensor_size = sys.getsizeof(torch.tensor(first_doc["emb_int8"]).untyped_storage())
# Size of a tensor with the binary embeddings of the first doc
first_binary_tensor_size = sys.getsizeof(torch.tensor(first_doc["emb_ubinary"]).untyped_storage())

print(f"The memory consumed by the tensor for the integer embeddings of the first doc in bytes is {first_integer_tensor_size}")
print(f"The memory consumed by the tensor for the binary embeddings of the first doc in bytes is {first_binary_tensor_size}")
print(f"The tensor for binary embeddings consumes {first_integer_tensor_size / first_binary_tensor_size} times less memory.")

The memory consumed by the tensor for the integer embeddings of the first doc in bytes is 8256
The memory consumed by the tensor for the binary embeddings of the first doc in bytes is 1088
The tensor for binary embeddings consumes 7.588235294117647 times less memory.


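Now, `doc_embeddings` holds the embeddings of the first 1,000 documents in the dataset. Each document is represented as an [embeddings vector](https://cohere.com/blog/sentence-word-embeddings) of 128 values. Each value is one byte that packs 8 of the 1,024 binary dimensions (1024 / 8 = 128).

Note that the tensor for binary embeddings is approximately 7.59 times smaller than the tensor for integer embeddings. This is expected: integer embeddings use 1 byte (8 bits) per dimension while binary embeddings use 1 bit per dimension, so the integer embedding has 1,024 values and the packed binary embedding has 128. Because `torch.tensor` stores Python integers as 64-bit values by default, the two tensors hold 8,192 and 1,024 bytes of data respectively. The memory reduction is smaller than 8 because the reported sizes also appear to include a fixed overhead of 64 bytes for the storage object itself.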
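To see why a 1,024-dimensional binary embedding ends up as 128 values, here is a minimal sketch of bit packing in plain Python. The helper names and the toy bit pattern are hypothetical, and the bit order (most significant bit first) is an assumption made for illustration; check the dataset documentation before relying on a particular order.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into byte values (0-255), most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed

def unpack_bits(packed):
    """Inverse of pack_bits: expand each byte value back into 8 bits."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]

# Hypothetical 1024-dimensional binary embedding (every third bit set).
toy_bits = [int(i % 3 == 0) for i in range(1024)]
toy_packed = pack_bits(toy_bits)

print("dimensions:", len(toy_bits))
print("packed values:", len(toy_packed))
print("round trip ok:", unpack_bits(toy_packed) == toy_bits)
```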

In [4]:
# Return the tensor shape
doc_embeddings.shape

torch.Size([1000, 128])

### Step 3: Embed query and compute dot product with document embeddings
We can now search these vectors for any query we want. For this example, we'll ask a question about Alan Turing since we know the Wikipedia page for Alan Turing is included in this subset of the archive.

To search, we embed the query, then get the nearest neighbors to its embedding (using dot product).

The search returns the top `k` passages that are most relevant to the query. We can retrieve more results by increasing the `k` value.
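The search procedure described above (score every document against the query, then keep the `k` best) can be sketched without any libraries. The vectors below are hypothetical toy values, and `top_k_by_dot` is an illustrative helper, not part of the notebook.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot(query_vec, doc_vecs, k):
    """Return (index, score) pairs for the k highest dot products, best first."""
    scored = ((dot(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs))
    best = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in best]

# Hypothetical toy document vectors and query.
toy_docs = [
    [1, 0, 2],
    [3, 1, 0],
    [0, 0, 1],
    [2, 2, 2],
]
toy_query = [1, 1, 1]

toy_top = top_k_by_dot(toy_query, toy_docs, k=2)
for idx, score in toy_top:
    print(f"doc {idx} scored {score}")
```

The notebook performs the same two steps on tensors with `torch.mm` and `torch.topk`, which is much faster for large collections.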

In [5]:
# Import boto3 (the AWS SDK for Python) plus json and logging
import boto3, json, logging
from botocore.exceptions import ClientError

# Set up the Bedrock runtime client
bedrock_rt = boto3.client(service_name="bedrock-runtime", region_name="us-east-1")

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Create request parameters for Bedrock
model_id = "cohere.embed-multilingual-v3"
accept = "*/*"
content_type = "application/json"
embedding_types = ["ubinary"]
input_type = "search_query"

# Create the text used for semantic search
query = "Tell me about Alan Turing"

# Set the number of nearest neighbors
k = 7

body = json.dumps({
    "texts": [query],
    "input_type": input_type,
    "embedding_types": embedding_types}
)

# Call the Bedrock InvokeModel API
response = bedrock_rt.invoke_model(
    body=body,
    modelId=model_id,
    accept=accept,
    contentType=content_type
)

# Load the response into response_body
response_body = json.loads(response.get("body").read())

# Extract the binary embeddings
query_emb_ubinary = response_body["embeddings"]["ubinary"]
print("Query embeddings:", query_emb_ubinary, "\n")

# Convert query into a PyTorch tensor
query_emb_ubinary = torch.tensor(query_emb_ubinary)

Query embeddings: [[18, 63, 75, 232, 59, 67, 51, 160, 255, 68, 251, 186, 114, 165, 136, 58, 82, 15, 211, 232, 128, 37, 107, 204, 75, 163, 74, 251, 32, 233, 200, 154, 106, 241, 127, 125, 74, 31, 123, 209, 82, 220, 228, 15, 254, 151, 220, 43, 199, 230, 143, 73, 67, 229, 149, 61, 34, 86, 69, 56, 215, 178, 131, 49, 108, 251, 76, 187, 134, 2, 155, 169, 129, 130, 229, 103, 12, 113, 145, 9, 32, 139, 212, 3, 224, 64, 27, 151, 175, 217, 139, 30, 132, 192, 111, 60, 221, 162, 108, 120, 153, 219, 214, 165, 164, 133, 78, 232, 203, 63, 149, 53, 135, 117, 100, 213, 75, 46, 114, 159, 22, 216, 255, 233, 98, 26, 252, 22]] 



Let's imagine that we didn't know that our documents contain text with information about Alan Turing. The way to search for relevant documents is to compare the `doc_embeddings` with the binary query embeddings that we just created. The semantic meaning of our query is captured by the embeddings, and a similar query will return an embedding with similar elements. A high score for the dot product indicates similarity. See [What is similarity between sentences](https://cohere.com/blog/what-is-similarity-between-sentences).
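This notebook scores documents with a dot product over the packed byte values. A commonly used alternative for bit-packed binary vectors is the Hamming distance, which counts how many bits differ between two embeddings (lower means more similar). The sketch below illustrates that alternative technique and is not the method used in this notebook; the byte values and helper names are hypothetical.

```python
def hamming_distance(a, b):
    """Count differing bits between two equal-length sequences of byte values (0-255)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    # XOR leaves a 1 wherever the two bytes disagree; count those bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def rank_by_hamming(query_bytes, docs_bytes, k):
    """Return the indices of the k documents closest to the query, nearest first."""
    distances = [(hamming_distance(query_bytes, d), idx) for idx, d in enumerate(docs_bytes)]
    return [idx for _, idx in sorted(distances)[:k]]

# Hypothetical 16-bit embeddings packed into two bytes each.
toy_query_bytes = [0b11110000, 0b00001111]
toy_doc_bytes = [
    [0b11110000, 0b00001111],  # identical to the query
    [0b11110001, 0b00001111],  # differs in one bit
    [0b00001111, 0b11110000],  # differs in every bit
]

for idx, doc in enumerate(toy_doc_bytes):
    print(f"doc {idx}: distance {hamming_distance(toy_query_bytes, doc)}")
print("nearest two:", rank_by_hamming(toy_query_bytes, toy_doc_bytes, k=2))
```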

In [14]:
# Compute dot score between query embeddings and document embeddings
dot_scores = torch.mm(query_emb_ubinary, doc_embeddings.transpose(0, 1))

print("The largest element from the dot_scores tensor is", dot_scores.max().item())
print("As expected, this value is still lower than the dot product of the query embeddings and itself", torch.mm(query_emb_ubinary, query_emb_ubinary.transpose(0, 1)).item(), '\n')

# Use topk to return the largest elements of the dot_scores tensor
top_k = torch.topk(dot_scores, k)

# Print results
print("The query is:", query, "\n\nThe below is a list of top k relevant documents:")
# This loop iterates over the indices of the top k relevant documents
for doc_id in top_k.indices[0].tolist():
    print("Title:", docs[doc_id]["title"])
    print("Text:", docs[doc_id]["text"])
    print(docs[doc_id]["url"], "\n")

The largest element from the dot_scores tensor is 2703919
As expected, this value is still lower than the dot product of the query embeddings and itself 2873374 

The query is: Tell me about Alan Turing 

The below is a list of top k relevant documents:
Title: Alan Turing
Text: Turing was one of the people who worked on the first computers. He created the theoretical  Turing machine in 1936. The machine was imaginary, but it included the idea of a computer program.
https://simple.wikipedia.org/wiki/Alan%20Turing 

Title: Alan Turing
Text: In 2013, almost 60 years later, Turing received a posthumous Royal Pardon from Queen Elizabeth II. Today, the “Turing law” grants an automatic pardon to men who died before the law came into force, making it possible for living convicted gay men to seek pardons for offences now no longer on the statute book.
https://simple.wikipedia.org/wiki/Alan%20Turing 

Title: Botany
Text: Gregor Mendel (1822–1884), Augustinian priest and scientist, and is often c

### Step 4: Add the results of semantic search as context to a prompt

Let's start by sending the same query to the Command R+ model using Amazon Bedrock to compare responses.

In [15]:
# Create the variables to make a call to the converse API
user_message = "Tell me about Alan Turing."
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

# ID of the chat model (model_id above refers to the embedding model)
chat_model_id = "cohere.command-r-plus-v1:0"

try:
    # Make the API call three times to visualize the different responses
    for i in range(3):
        print("Result of API call", str(i+1), ':')
        # Send the message to the model, using a basic inference configuration.
        response = bedrock_rt.converse(
            modelId=chat_model_id,
            messages=conversation,
            inferenceConfig={"maxTokens": 200, "temperature": 0.5, "topP": 0.9},
        )
        # Extract and print the response text.
        response_text = response["output"]["message"]["content"][0]["text"]
        print(response_text, "\n\n")
except Exception as e:  # also covers botocore's ClientError
    # Report the chat model that failed, then re-raise so the cell stops with the error
    print(f"ERROR: Can't invoke '{chat_model_id}'. Reason: {e}")
    raise

Result of API call 1 :
Alan Turing was a British mathematician, computer scientist, and cryptanalyst who made groundbreaking contributions to many fields, particularly in the areas of computing, cryptography, and artificial intelligence. He is widely regarded as one of the most influential figures in the development of modern computing and one of the key people who helped shape the field of artificial intelligence.

Early Life and Education:
Alan Mathison Turing was born on June 23, 1912, in Maida Vale, London, to Ethel Sara Turing and Julius Mathison Turing. From an early age, he displayed a talent for science and mathematics. He attended Sherborne School, where he excelled in mathematics and science, and later earned a scholarship to study at King's College, University of Cambridge, in 1931. At Cambridge, he studied mathematics, logic, and cryptology, and his work on probability theory and computability laid the foundations for his later achievements.

Contributions to Computing and 

The response from the large language model (LLM) is clearly non-deterministic. [Read more about LLM parameters here](https://cohere.com/blog/llm-parameters-best-outputs-language-ai) to learn about the parameters used to control model output. Note that we can also add the results of the semantic search as context for our prompt to augment the response.
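As a rough intuition for how `temperature` affects randomness, the sketch below applies the common softmax-with-temperature formulation to a few hypothetical token scores. It assumes this standard formulation and makes no claim about how Command R+ implements sampling internally; the logits are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical scores for three candidate next tokens.
toy_logits = [2.0, 1.0, 0.5]

sharp = softmax_with_temperature(toy_logits, 0.3)
flat = softmax_with_temperature(toy_logits, 1.0)

for temperature in (0.3, 0.5, 1.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At lower temperatures most of the probability mass moves to the highest-scoring token, so repeated calls are more likely to produce the same text.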

In [16]:
# Initialize an empty string
context = ""

# Append the text from the relevant documents to the context
for doc_id in top_k.indices[0].tolist():
    context += docs[doc_id]["text"]

# Create a new prompt
prompt = f"""{context}
Given the information above, answer this question: {user_message}"""

print("The prompt is now:", prompt)

The prompt is now: Turing was one of the people who worked on the first computers. He created the theoretical  Turing machine in 1936. The machine was imaginary, but it included the idea of a computer program.In 2013, almost 60 years later, Turing received a posthumous Royal Pardon from Queen Elizabeth II. Today, the “Turing law” grants an automatic pardon to men who died before the law came into force, making it possible for living convicted gay men to seek pardons for offences now no longer on the statute book.Gregor Mendel (1822–1884), Augustinian priest and scientist, and is often called the father of genetics for his study of the inheritance of traits in pea plants.Creativity is the ability of a person or group to make something new and useful or valuable, or the process of making something new and useful or valuable. It happens in all areas of life - science, art, literature and music.In 1837, Charles Babbage proposed the first general mechanical computer, the Analytical Engine. 
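In the prompt printed above, the passages run together without any separator (for example, "program.In 2013"). One possible variation, sketched below, joins the passages with blank lines so the model can tell where each one ends. The `build_prompt` helper and the toy passages are hypothetical and are not part of the original notebook.

```python
def build_prompt(passages, question):
    """Join retrieved passages with blank lines and append the question."""
    context = "\n\n".join(passage.strip() for passage in passages)
    return f"{context}\n\nGiven the information above, answer this question: {question}"

# Two passages taken from the output above, used here as toy input.
toy_passages = [
    "Turing was one of the people who worked on the first computers.",
    "In 1837, Charles Babbage proposed the first general mechanical computer, the Analytical Engine. ",
]

toy_prompt = build_prompt(toy_passages, "Tell me about Alan Turing.")
print(toy_prompt)
```

With the notebook's variables, the call could look like `build_prompt([docs[i]["text"] for i in top_k.indices[0].tolist()], user_message)`.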

We use a `temperature` of 0.5 to illustrate variance in the responses from the LLM. The default value for `temperature` is 0.3, and lower values decrease randomness in the response. [Read more about inference requests to Amazon Bedrock here](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-cohere-command-r-plus.html). If you want to test chat history, you can also change the `modelId` to `cohere.command-r-v1:0` and use the Command R model, which supports a conversation history with multiple turns.
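A multi-turn conversation can be represented with the same message structure used in this notebook, with alternating `user` and `assistant` entries. The sketch below only builds that list and makes no API call; `make_message` and the example turns are hypothetical, and it assumes the Converse API carries the history in its `messages` list.

```python
def make_message(role, text):
    """Build one message in the structure used for the conversation list in this notebook."""
    if role not in ("user", "assistant"):
        raise ValueError("role must be 'user' or 'assistant'")
    return {"role": role, "content": [{"text": text}]}

# Hypothetical earlier turns followed by a new user question.
toy_conversation = [
    make_message("user", "Who was Alan Turing?"),
    make_message("assistant", "Alan Turing was a British mathematician and computer scientist."),
    make_message("user", "What did he propose in 1936?"),
]

for message in toy_conversation:
    print(message["role"], "->", message["content"][0]["text"])
```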

In [18]:
# Send the same query to the Command R+ model using the Bedrock converse API but add the results of the semantic search as context
conversation = [
    {
        "role": "user",
        "content": [{"text": prompt}],
    }
]

# Note: chat_history is a field of the Cohere Command R request body used with InvokeModel.
# It is not a parameter of converse(). With the Converse API, earlier turns are passed as
# additional "user" and "assistant" entries in the messages list, for example:
# conversation = [
#     {"role": "user", "content": [{"text": "Example question from user"}]},
#     {"role": "assistant", "content": [{"text": "Example response from chatbot"}]},
#     {"role": "user", "content": [{"text": prompt}]},
# ]

# ID of the chat model (model_id above refers to the embedding model)
chat_model_id = "cohere.command-r-plus-v1:0"

try:
    for i in range(3):
        print('Result of API call', str(i+1), ':')
        # Send the message to the model, using a basic inference configuration.
        response = bedrock_rt.converse(
            modelId=chat_model_id,
            messages=conversation,
            inferenceConfig={"maxTokens": 200, "temperature": 0.5, "topP": 0.9},
        )
        # Extract and print the response text.
        response_text = response["output"]["message"]["content"][0]["text"]
        print(response_text, '\n\n')
except Exception as e:  # also covers botocore's ClientError
    print(f"ERROR: Can't invoke '{chat_model_id}'. Reason: {e}")

Result of API call 1 :
Alan Turing was a pioneering British mathematician, computer scientist, and cryptanalyst. He is widely regarded as one of the most influential figures in the development of theoretical computer science, and his work laid the foundations for many aspects of modern computing. 

One of his most significant contributions was the conception of the Turing machine in 1936, which was a theoretical device that could perform complex calculations and has since become a central concept in computer science and theory. The Turing machine demonstrated the idea of a programmable computer, even though it was never intended to be built as a physical machine. 

During World War II, Turing played a crucial role in breaking German military codes, most notably those generated by the Enigma machine. His work at Bletchley Park, Britain's code-breaking center, is credited with shortening the war and saving countless lives. 

Turing's life and career were also marked by significant challe

---
## Conclusion
In this notebook, we discussed the benefits of using different types of embeddings. Dimensionality reduction can hurt search quality, and Cohere's Embed addresses this problem by natively supporting int8/byte and binary embeddings. Binary embeddings provide a 32x memory reduction and enable 40x faster search compared to float32 embeddings, offering a highly efficient solution for large-scale semantic search. In the example code of this notebook, we used Amazon Bedrock to call Cohere's Embed model and the Command R+ LLM. In addition, we can use the results of semantic search as context when sending prompts to an LLM. As seen above, the augmented prompt containing the top `k` results from the semantic search steers the responses toward the retrieved passages, which tends to make them more consistent. This is just one of the techniques we can use to return responses with up-to-date and domain-specific information. Semantic search also forms the basis for more advanced improvement algorithms such as HNSW and multiple negative ranking loss.