# Wikipedia Semantic Search with Cohere Embedding Archives

---
## Introduction
In this notebook, we demonstrate how to use the Bedrock InvokeModel API to run a simple [semantic search](https://txt.cohere.ai/what-is-semantic-search/) over the [Wikipedia embeddings archives](https://txt.cohere.ai/embedding-archives-wikipedia/) published by Cohere. These archives contain embeddings of Wikipedia in multiple languages. In this example, we'll use [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings).

### Semantic Search and Text Embeddings
Semantic search leverages text embeddings and similarity to find responses based on meaning, not just keywords. Text embeddings represent pieces of text as numeric vectors that encode semantic meaning. These embeddings allow for mathematical comparisons of word and sentence meaning. Multilingual embeddings map text in different languages to the same vector space, enabling semantic search across languages. 
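
To make the idea of "mathematical comparison" concrete, here is a minimal sketch using cosine similarity on made-up vectors. The three-dimensional `toy_embeddings` values are purely illustrative assumptions and do not come from any Cohere model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, for illustration only).
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}

sim_cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# Vectors that point in similar directions score closer to 1.
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

In this toy setup the "cat" and "kitten" vectors score higher than "cat" and "car", which is the behavior a semantic search relies on when it ranks documents against a query.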

---

## Getting Started

### Step 0: Install Dependencies

In [33]:
# Let's install Hugging Face datasets
%pip install datasets --quiet
# Let's also install boto3
%pip install boto3==1.34.120 --quiet

Note: you may need to restart the kernel to use updated packages.


Let's now download 1,000 records from the English Wikipedia embeddings archive so we can search it afterwards.

<!-- Now, `doc_embeddings` holds the embeddings of the first 1,000 documents in the dataset. Each document is represented as an [embeddings vector](https://txt.cohere.ai/sentence-word-embeddings/) of 768 values. -->

In [34]:
# doc_embeddings.shape

We can now search these vectors for any query we want. For this toy example, we'll ask a question about Wikipedia, since we know the Wikipedia page is among the first 1,000 documents used here.

To search, we embed the query, then get the nearest neighbors to its embedding (using dot product).

In [35]:
# # Get the query, then embed it
# query = 'Who founded Wikipedia'
# response = co.embed(texts=[query], model='multilingual-22-12')
# query_embedding = response.embeddings 

# # print(type(query_embedding))
# # print(query_embedding)

# query_embedding = torch.tensor(query_embedding)


# # Compute dot score between query embedding and document embeddings
# dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
# top_k = torch.topk(dot_scores, k=3)

# # Print results
# print("Query:", query)
# for doc_id in top_k.indices[0].tolist():
#     print(docs[doc_id]['title'])
#     print(docs[doc_id]['text'], "\n")


This shows the top three passages that are relevant to the query. We can retrieve more results by changing the `k` value. The question in this simple demo is about Wikipedia because we know that the Wikipedia page is part of the documents in this subset of the archive.
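
The dot-product ranking described above can be sketched without any external library. This is only an illustration under stated assumptions: `top_k_by_dot`, `toy_docs`, and `toy_query` are hypothetical names and values, not part of the archive or of any SDK.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot(query_vec, doc_vecs, k=3):
    """Return (doc_index, score) pairs for the k highest dot products."""
    scored = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 2-dimensional document embeddings (assumed values).
toy_docs = [
    [0.1, 0.9],   # doc 0
    [0.8, 0.2],   # doc 1
    [0.7, 0.7],   # doc 2
    [-0.5, 0.1],  # doc 3
]
toy_query = [1.0, 0.0]

# Changing k changes how many passages are returned.
for doc_index, score in top_k_by_dot(toy_query, toy_docs, k=2):
    print(doc_index, score)
```

Asking for more results than there are documents simply returns every document, sorted by score.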

In [36]:
DEFAULT_MODEL= "cohere.command-r-plus-v1:0"
COMMAND_R_PLUS = "cohere.command-r-plus-v1:0"
COMMAND_R = "cohere.command-r-v1:0"
model_id = DEFAULT_MODEL

Now let's import the required modules to run the notebook and set up the Bedrock client.

In [37]:
import boto3, json
bedrock_rt = boto3.client(service_name="bedrock-runtime", region_name = "us-east-1")

### Step 3: Configure the request to the model

The developer provides a few things to the model:
- A preamble containing instructions about the task and the desired style for the output.
- The user request.
- A list of tools available to the model.
- (Optionally) a chat history for the model to work with.
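
As a rough sketch of how these pieces could be assembled for the Bedrock Converse API, the helper below builds the keyword arguments without calling AWS. The function name `build_converse_request` and the preamble text are assumptions made for illustration; tools are left out to keep the sketch short.

```python
def build_converse_request(model_id, user_message, preamble=None, history=None):
    """Assemble keyword arguments for a Bedrock Converse call (hypothetical helper)."""
    # Start from any earlier turns, then append the new user request.
    messages = list(history or [])
    messages.append({"role": "user", "content": [{"text": user_message}]})

    request = {
        "modelId": model_id,
        "messages": messages,
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
    }
    if preamble:
        # Converse takes system instructions as a list of text blocks.
        request["system"] = [{"text": preamble}]
    return request

example_request = build_converse_request(
    model_id="cohere.command-r-plus-v1:0",
    user_message="Tell me who founded Wikipedia.",
    preamble="Answer concisely in one or two sentences.",  # assumed example preamble
)
print(sorted(example_request))
```

With the client created earlier, the request could then be sent as `bedrock_rt.converse(**example_request)`.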

In [38]:
user_message = "Tell me who founded Wikipedia."
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

try:
    # Send the message to the model, using a basic inference configuration.
    response = bedrock_rt.converse(
        modelId=DEFAULT_MODEL,
        messages=conversation,
        inferenceConfig={"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
    )

    # Extract and print the response text.
    response_text = response["output"]["message"]["content"][0]["text"]
    print(response_text)

except Exception as e:  # also covers botocore's ClientError, a subclass of Exception
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    exit(1)


Wikipedia was founded by Jimmy Wales and Larry Sanger. It was launched on January 15, 2001, as a free, open-content encyclopedia that anyone can edit. Jimmy Wales is often referred to as the "father of Wikipedia" and is known for his strong belief in the power of collaborative knowledge sharing. Larry Sanger played a key role in the early development and promotion of Wikipedia, and is considered the co-founder.


Cohere Embed is a text embedding model that offers leading performance in 100+ languages. It translates text into vector representations which encode semantic meaning. Enterprises use this model to power search and retrieval systems.

In [39]:
%pip install datasets numpy faiss-cpu --quiet

Note: you may need to restart the kernel to use updated packages.


In [42]:
from datasets import load_dataset
import numpy as np


# Load at most 1,000 documents and their embeddings
max_docs = 1000

# docs_stream = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)
# docs_stream = load_dataset("Cohere/movies", split="train", streaming=True)

# "simple" selects the Simple English Wikipedia subset
lang = "simple"
docs_stream = load_dataset("Cohere/wikipedia-2023-11-embed-multilingual-v3-int8-binary", lang, split="train", streaming=True)

print(docs_stream)

docs = []
doc_embeddings = []

# Printing docs stream
print(docs_stream)

for doc in docs_stream:
    docs.append(doc)
    # doc_embeddings.append(doc['emb'])
    # doc_embeddings.append(doc['emb_int8'])
    doc_embeddings.append(doc['emb_ubinary'])
    if len(docs) >= max_docs:
        break

# doc_embeddings = torch.tensor(doc_embeddings)

IterableDataset({
    features: ['_id', 'url', 'title', 'text', 'emb_int8', 'emb_ubinary'],
    n_shards: 7
})
IterableDataset({
    features: ['_id', 'url', 'title', 'text', 'emb_int8', 'emb_ubinary'],
    n_shards: 7
})
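
The next cell relies on faiss and NumPy to search the packed `emb_ubinary` vectors and then re-score the candidates with the float query embedding. The pure-Python sketch below walks through the same two phases on tiny made-up vectors. It assumes that binary quantization keeps one bit per dimension (1 when the value is positive) and packs eight bits per byte, most significant bit first; all function names and values here are hypothetical and chosen only for illustration.

```python
def pack_to_ubinary(vec):
    """Quantize floats to bits (1 if value > 0) and pack 8 bits per byte, MSB first.

    Assumes len(vec) is a multiple of 8.
    """
    bits = [1 if v > 0 else 0 for v in vec]
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed

def hamming_distance(a, b):
    """Number of differing bits between two packed byte sequences."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def unpack_to_signs(packed):
    """Expand packed bytes into -1/+1 values (bit 0 -> -1, bit 1 -> +1)."""
    signs = []
    for byte in packed:
        for shift in range(7, -1, -1):
            signs.append(2 * ((byte >> shift) & 1) - 1)
    return signs

def two_phase_search(query_float, packed_docs, top_k=2, oversample=2):
    """Phase I: Hamming-distance shortlist. Phase II: float re-scoring."""
    query_bin = pack_to_ubinary(query_float)
    # Phase I: keep the top_k * oversample documents with the fewest differing bits.
    shortlist = sorted(
        range(len(packed_docs)),
        key=lambda i: hamming_distance(query_bin, packed_docs[i]),
    )[: top_k * oversample]
    # Phase II: dot product of the float query with the +/-1 document vector.
    rescored = [
        (i, sum(q * s for q, s in zip(query_float, unpack_to_signs(packed_docs[i]))))
        for i in shortlist
    ]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]

# Hypothetical 8-dimensional float embeddings (assumed values).
toy_docs_float = [
    [0.5, -0.2, 0.3, 0.1, -0.4, 0.2, -0.1, 0.6],
    [-0.5, 0.2, -0.3, -0.1, 0.4, -0.2, 0.1, -0.6],
    [0.4, -0.1, 0.2, -0.3, -0.5, 0.1, 0.2, 0.5],
]
toy_docs_packed = [pack_to_ubinary(d) for d in toy_docs_float]
toy_query_float = [0.6, -0.3, 0.2, 0.2, -0.3, 0.1, -0.2, 0.5]

toy_hits = two_phase_search(toy_query_float, toy_docs_packed)
print(toy_hits)
```

The oversampling factor trades speed for recall: the cheap Hamming pass retrieves more candidates than needed, and the more precise float re-scoring picks the final `top_k` from that shortlist.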


In [45]:
import faiss

doc_embeddings = np.asarray(doc_embeddings, dtype='uint8')
#Create the faiss IndexBinaryFlat index
num_dim = 1024 
index = faiss.IndexBinaryFlat(num_dim)
index.add(doc_embeddings)

def search(index, query, top_k=7):
    """
    Run a two-phase semantic search using the Cohere Embed model on Bedrock.
    Args:
        index (faiss.IndexBinaryFlat): Binary index holding the document embeddings.
        query (str): The search query text.
        top_k (int): The number of hits to return.
    Returns:
        list: Hits (dicts with doc_id, score_bin, score_cont), best first.
    """
    accept = '*/*'
    content_type = 'application/json'
    bedrock = boto3.client(service_name='bedrock-runtime')
    model_id = 'cohere.embed-multilingual-v3'
    # Make sure to set input_type="search_query" when embedding queries
    input_type = "search_query"
    # Request both the packed binary and the float embedding of the query
    embedding_types = ["ubinary", "float"]
    # Request body
    body = json.dumps({
        "texts": [query],
        "input_type": input_type,
        "embedding_types": embedding_types,
    })
    response = bedrock.invoke_model(
        body=body,
        modelId=model_id,
        accept=accept,
        contentType=content_type
    )
    response_body = json.loads(response.get('body').read())
    query_emb_bin = np.asarray(response_body['embeddings']['ubinary'], dtype='uint8')
    query_emb_float = np.asarray(response_body['embeddings']['float'], dtype='float32')

    # Phase I: Search the index with the binary query embedding (oversample 10x)
    hits_scores, hits_doc_ids = index.search(query_emb_bin, k=min(10*top_k, index.ntotal))

    #Get the results in a list of hits
    hits = [{'doc_id': doc_id.item(), 'score_bin': score_bin} for doc_id, score_bin in zip(hits_doc_ids[0], hits_scores[0])]

    # Phase II: Do a re-scoring with the float query embedding
    binary_doc_emb = np.asarray([index.reconstruct(hit['doc_id']) for hit in hits])
    binary_doc_emb_unpacked = np.unpackbits(binary_doc_emb, axis=-1).astype("int")
    binary_doc_emb_unpacked = 2*binary_doc_emb_unpacked-1

    scores_cont = (query_emb_float[0] @ binary_doc_emb_unpacked.T)
    for idx in range(len(scores_cont)):
        hits[idx]['score_cont'] = scores_cont[idx]

    #Sort by largest score_cont
    hits.sort(key=lambda x: x['score_cont'], reverse=True)

    return hits[0:top_k]

# query2 = 'who was the founder of wikipedia'
hits = search(index, "What are the national partks in the united states?")
print(hits)
for hit in hits:
    doc_id = hit['doc_id']
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'])
    print(docs[doc_id]['url'], "\n")

[{'doc_id': 207, 'score_bin': 322, 'score_cont': 13.732296824455261}, {'doc_id': 285, 'score_bin': 322, 'score_cont': 13.679420351982117}, {'doc_id': 291, 'score_bin': 323, 'score_cont': 13.309430718421936}, {'doc_id': 225, 'score_bin': 327, 'score_cont': 13.071157336235046}, {'doc_id': 888, 'score_bin': 339, 'score_cont': 12.715680480003357}, {'doc_id': 772, 'score_bin': 335, 'score_cont': 12.682390093803406}, {'doc_id': 886, 'score_bin': 333, 'score_cont': 12.474794745445251}]
Australia
Australia is part of the Commonwealth of Nations. Australia is made up of six states, and two mainland territories. Each state and territory has its own Parliament and makes its own local laws. The Parliament of Australia sits in Canberra and makes laws for the whole country, also known as the Commonwealth or Federation.
https://simple.wikipedia.org/wiki/Australia 

Native American
Native Americans are divided into many small nations, called First Nations in Canada and tribes elsewhere.
https://simple