<a href="https://colab.research.google.com/github/dzivkovi/LLM_RBAC_experiments/blob/main/07_Pinecone__VectorDB_namespaces_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multitenancy using Namespaces in Pinecone

## https://docs.pinecone.io/docs/multitenancy
### This (too) old code is from: https://docs.pinecone.io/docs/namespaces

Namespaces are a concept specific to Pinecone: they are logical separators of vectors within a single index. Namespaces are invaluable when tackling multitenancy and scaling in a financially sustainable manner, because many tenants can share one index while each query is confined to a single namespace. Stay tuned for an in-depth writeup about namespaces that will touch on these issues and much more in the future.

In this notebook, we will showcase namespaces' utility by building a search application that deals with a simple multitenant situation: we will have 3 users ("tenants") who'd like results back in their native languages -- one user speaks English, another user speaks Italian, and another user speaks French. The language our Pinecone engine receives at query-time will determine where (i.e. which namespace) the query is routed.
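
The routing idea described above can be sketched in a few lines of plain Python. This is only an illustration: `SUPPORTED_NAMESPACES` and `route_to_namespace` are hypothetical names that do not appear elsewhere in this notebook, and the sketch assumes that namespace names are simply the lowercase language codes.

```python
# Hypothetical set of namespaces, one per supported language.
# In this notebook the namespace names happen to equal the language codes.
SUPPORTED_NAMESPACES = {"en", "it", "fr"}


def route_to_namespace(language: str) -> str:
    """Return the namespace for a language code, or raise if unsupported."""
    code = language.strip().lower()
    if code not in SUPPORTED_NAMESPACES:
        raise ValueError(f"No namespace configured for language {language!r}")
    return code


print(route_to_namespace("EN"))  # en
print(route_to_namespace("fr"))  # fr
```

Failing loudly on an unknown language is one possible design choice; falling back to a default namespace would be another.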

Let's get started!

## Step 1. Setup

#### Our environment:

In [1]:
!pip install -qU \
     langchain \
     tiktoken \
     datasets \
     pinecone-client openai

#!pip install protobuf==3.20.3
#!pip install apache-beam==2.50.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.50.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.

In [2]:
# Import everything we need

from tqdm.auto import tqdm
from uuid import uuid4
import pinecone
from getpass import getpass
from langchain.embeddings.openai import OpenAIEmbeddings
from datasets import load_dataset
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
import pandas as pd

#### Demo context:

For this demo, we will be creating a mini search application that searches an index for content, based on a user's (or "tenant's") preferred language.

We will be working with three Wikipedia datasets [from HuggingFace](https://huggingface.co/datasets/wikipedia); one in English, one in French, and one in Italian. Each dataset contains `id`, `url`, `title`, and `text` columns. We will direct users' (i.e. tenants') queries to the appropriate namespace based on the language passed in at query-time.

## Step 2. Initialize Our Index

We will configure our Pinecone index to store `1536`-dimension vectors and to use cosine similarity as its similarity metric (see [here](https://docs.pinecone.io/docs/choosing-index-type-and-size) for more about dimensions & [here](https://www.pinecone.io/learn/vector-similarity/) for more about similarity metrics).

We will be using OpenAI's [text-embedding-ada-002](https://openai.com/blog/new-and-improved-embedding-model) model to produce vector representations of our textual data, and since that model outputs `1536`-dimension vectors, that's what our index will have to be configured to intake.

To keep things straightforward, we will name our index `"namespaces-demo"`.
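
For intuition about the cosine metric chosen above, here is a minimal pure-Python sketch of how cosine similarity between two vectors is computed. It is not Pinecone's implementation; the function name `cosine_similarity` and the toy 3-dimension vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError(f"Dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimension vectors; the real index uses 1536 dimensions.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

The dimension check mirrors why the index dimension has to match the embedding model: two vectors of different lengths cannot be compared.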


In [47]:
# Note: This step takes a little while to complete, so don't worry if it takes 30s or so!
from google.colab import userdata

# Find API key in console at app.pinecone.io
YOUR_API_KEY = userdata.get('PINECONE_API_KEY')

# Find ENV (cloud region) next to API key in console
YOUR_ENV = userdata.get('PINECONE_ENVIRONMENT')

INDEX_NAME = 'namespaces-demo'

# Initialize Pinecone client
pinecone.init(
    api_key=YOUR_API_KEY,
    environment=YOUR_ENV
)


In [48]:
# RUN ONCE

## Create index
#pinecone.create_index(
#    name=INDEX_NAME,
#    metric='cosine',
#    dimension=1536)

In [49]:
# Confirm we indeed created our "namespaces-demo" index

pinecone.list_indexes()

['namespaces-demo']


Now, to connect to the index we just created, we will use the [REST Index class](https://github.com/pinecone-io/pinecone-python-client/blob/main/pinecone/index.py#L49), which is what `pinecone.Index` in the cell below gives us. When you've got a lot of data to process, the [GRPCIndex class](https://github.com/pinecone-io/pinecone-python-client/blob/main/pinecone/core/grpc/index_grpc.py#L293) is generally quicker.

You can see below with the `.describe_index_stats()` method that our index currently has `0` vectors in it (`'total_vector_count': 0`).

In [5]:
demo_index = pinecone.Index(INDEX_NAME)

demo_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

## Step 3. Prepare Our Data

In order for our data to be indexed into our Pinecone index, we need to do a few things.

Besides loading up our data into memory, we also need to chunk up our text into small pieces, extract whatever metadata we are interested in, and vectorize those small chunks of text data.

### Step 3a. Load & Preview Our Data

The data we'll be using is Wikipedia data in 3 different languages: English, French, and Italian. The datasets are from [HuggingFace](https://huggingface.co/datasets/wikipedia). Since they take a pretty long time to load in their entirety, we'll be using snippets.

If you want to load the datasets in their entirety, you can execute this code:

```
from datasets import load_dataset

wiki_en = load_dataset("wikipedia", "20220301.en")
wiki_it = load_dataset("wikipedia", "20220301.it")
wiki_fr = load_dataset("wikipedia", "20220301.fr")
```

To preview the datasets, you'll want to drill down to the `train` section, like so (you can read more about the various methods available to run on HuggingFace's `DatasetDict` objects [here](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.DatasetDict)):

```
wiki_en['train'][:1000]
```

In [10]:
# Load up our English, Italian, and French datasets (just the first 10 rows from each HuggingFace dataset)

wiki_en = load_dataset("wikipedia", "20220301.en", split="train[:10]")
wiki_it = load_dataset("wikipedia", "20220301.it", split="train[:10]")
wiki_fr = load_dataset("wikipedia", "20220301.fr", split="train[:10]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.

Downloading:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.33k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/7.38G [00:00<?, ?B/s]

In [11]:
# Preview data in a Pandas dataframe

# English
wiki_en.to_pandas().head(2)

| | id | url | title | text |
|---|---|---|---|---|
| 0 | 12 | https://en.wikipedia.org/wiki/Anarchism | Anarchism | Anarchism is a political philosophy and moveme... |
| 1 | 25 | https://en.wikipedia.org/wiki/Autism | Autism | Autism is a neurodevelopmental disorder charac... |


In [12]:
# Italian
wiki_it.to_pandas().head(2)

| | id | url | title | text |
|---|---|---|---|---|
| 0 | 2 | https://it.wikipedia.org/wiki/Organo%20a%20pompa | Organo a pompa | Lorgano a pompa è un tipo di organo a serbatoi... |
| 1 | 3 | https://it.wikipedia.org/wiki/Antropologia | Antropologia | Lantropologia (dal greco ἄνθρωπος ànthropos «u... |


In [13]:
# French
wiki_fr.to_pandas().head(2)

| | id | url | title | text |
|---|---|---|---|---|
| 0 | 3 | https://fr.wikipedia.org/wiki/Antoine%20Meillet | Antoine Meillet | Paul Jules Antoine Meillet, né le à Moulins (... |
| 1 | 7 | https://fr.wikipedia.org/wiki/Alg%C3%A8bre%20l... | Algèbre linéaire | L’algèbre linéaire est la branche des mathémat... |


### Step 3b. Write Helper Functions to Chunk, Extract Metadata, and Create Embeddings

As noted before, in order to get our data into our Pinecone index, we'll need to chunk it all up, figure out what (if any) metadata we want to include, and then transform our chunks of text into embeddings.

You can read more about chunking [here](https://www.pinecone.io/learn/chunking-strategies/) and metadata [here](https://docs.pinecone.io/docs/metadata-filtering).

The steps we need to take are:
- Initialize our embedding model
- Write a function (`tiktoken_len`) that counts the number of tokens in a piece of text
- Write a function (`chunk_by_size`) that splits our text data into chunks of at most a given number of tokens
- Write a function (`create_chunks_metadata_embeddings`) that ties everything together: chunking our data, grabbing the metadata we want in our index, and creating embeddings of our chunked-up text data


**Note:** You will need an [OpenAI](https://openai.com/) API Key for the next part, since we will be making API calls to an OpenAI-hosted model: `text-embedding-ada-002`

In [6]:
import openai
from google.colab import userdata


OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)



Chunking can be a complex process, but we are going to keep it simple by using Langchain's [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter). We will pass that class a custom length function (`tiktoken_len` below), so that chunk sizes are measured in tokens rather than characters. This will essentially let us recursively split our text data into chunks of at most `n` tokens, using the `tiktoken` [library](https://github.com/openai/tiktoken).

(If you don't know what "recursively" means, check out [this article](https://www.geeksforgeeks.org/introduction-to-recursion-data-structure-and-algorithm-tutorials/)!)
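
To make the role of chunk size and overlap concrete, here is a deliberately simplified sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses: it swaps in a plain sliding window over whitespace-separated words instead of recursive splitting over tiktoken tokens. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, size: int = 50, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `size` words, sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in sliding_window_chunks(sample, size=4, overlap=2):
    print(chunk)
```

Each printed chunk repeats the last two words of the previous one, which is the point of overlap: a sentence cut at a chunk boundary still appears intact in at least one chunk.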

In [7]:
# Get the tokenizer that matches our embedding model
# (for 'text-embedding-ada-002' this is the 'cl100k_base' encoding)
tokenizer = tiktoken.encoding_for_model('text-embedding-ada-002')

# Create our custom length function
def tiktoken_len(text: str) -> int:
    """
    Count the tokens in a body of text using the tiktoken tokenizer.

    :param text: Text we'd like to tokenize.
    :return: Number of tokens in the text.
    """
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

Now that we have our custom tiktoken function, we need an additional function that splits our text into chunks of at most a given number of tokens (`50` by default, with an overlap of `20`):

In [8]:
def chunk_by_size(text: str, size: int = 50) -> list[Document]:
    """
    Chunk up text recursively.

    :param text: Text to be chunked up.
    :param size: Maximum chunk size, in tokens (as measured by `tiktoken_len`).
    :return: List of Document items (i.e. chunks).
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=20,
        length_function=tiktoken_len,
        add_start_index=True,
    )
    return text_splitter.create_documents([text])

Now, we can write a final function to put everything together nicely:

In [9]:
def create_chunks_metadata_embeddings(dataset: pd.DataFrame) -> list[dict]:
    """
    Given a dataset, split text data into chunks, extract metadata, create embeddings for each chunk.

    :param dataset: Data we want to process.
    :return: List of data objects to upsert into our Pinecone index.
    """
    data_objs = []

    # For each row in our dataset:
    for index, row in tqdm(dataset.iterrows()):  # (tqdm library prints status of for-loop to console)
        # Create chunks
        chunked_text = chunk_by_size(row["text"])

        # Extract just the string content from the chunk
        chunked_text = [c.page_content for c in chunked_text]

        # Extract some metadata, create an ID, and generate an embedding for the chunk.
        # Wrap that all in a dictionary, and append that dictionary to a list (`data_objs`).
        for idx, text in enumerate(chunked_text):
            payload = {
                "metadata": {
                    "url": row["url"],
                    "title": row["title"],
                    "chunk_num": idx,
                    "text_content": text  # the raw text of this chunk
                },
                "id": str(uuid4()),
                "values": embed.embed_documents([text])[0]  # --> one embedding: a list of 1536 floats
            }
            data_objs.append(payload)

    # Return list of dictionaries, each containing our metadata, ID, and embedding, per chunk.
    return data_objs




### Step 3c. Create Our Data Objects For Upsert

Now that we have written our helper functions, let's use them!

We will take the first 3 rows from each of our datasets to upsert into our index.

(Don't worry if this part takes a few minutes to run -- Wikipedia articles (even just 3) are *long* and creating embeddings is a lot of work!)

#### English

In [14]:
# English
data_objs_en = create_chunks_metadata_embeddings(wiki_en.to_pandas().head(3))  # Grab the first 3 rows (Wikipedia articles) from dataset

0it [00:00, ?it/s]

In [15]:
# Our 3 rows of Wikipedia data have resulted in 737 data objects, each with metadata, an ID, and an embedding!
len(data_objs_en)

737

In [16]:
# Inspect one of our data objects
data_objs_en[0]

{'metadata': {'url': 'https://en.wikipedia.org/wiki/Anarchism',
  'title': 'Anarchism',
  'chunk_num': 0,
  'text_content': 'Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As'},
 'id': '62af271b-e57b-4daa-a6a5-48c6ce444bf6',
 'values': [-0.010786627791773553,
  -0.004103393435614342,
  -0.012747832082832065,
  -0.01575545275973415,
  -0.02661447318905498,
  0.014860405168492428,
  -0.03759195412866724,
  -0.016045025172568474,
  -0.00043107023554228227,
  -0.005478869847158219,
  0.009503288522701898,
  -0.01717699750968349,
  -0.0023445613560564437,
  0.007456527814710638,
  0.0013590229678806245,
  -0.006262035325341469,
  0.047411141691651795,
  0.011102525900733599,
  -0.02146795466168941,
  0.019388288929016857,
  -0.007680289246859765,
  0.01880914037805779,
  -0.007364390672238414,
 

#### Italian

In [17]:
# Italian:
data_objs_it = create_chunks_metadata_embeddings(wiki_it.to_pandas().head(3))

0it [00:00, ?it/s]

In [18]:
# See how many data objects we have created
len(data_objs_it)

490

In [19]:
# Preview
data_objs_it[0]

{'metadata': {'url': 'https://it.wikipedia.org/wiki/Organo%20a%20pompa',
  'title': 'Organo a pompa',
  'chunk_num': 0,
  'text_content': "Lorgano a pompa è un tipo di organo a serbatoio d'aria costituito da una (o più) tastiera, manuale, e da due pedali per azionare i mantici per"},
 'id': '5beb1496-1537-4d92-8bf0-a39bd08e1c78',
 'values': [-0.01043073738999078,
  0.0005003722748602045,
  0.0169037863281612,
  -0.02366251471506252,
  -0.019203147051889233,
  0.03603725555658655,
  -0.030686016223423592,
  -0.016137333995348512,
  -0.023341998217244666,
  -0.0055846599486347025,
  0.014325715241900152,
  0.007344019403162663,
  -0.015161846329499209,
  0.009650347816104831,
  0.027048846132066068,
  -0.008549441744401034,
  0.005400014088984423,
  0.009211379204662888,
  -0.005312917275315822,
  -0.020805733266268448,
  0.00927408887325138,
  0.009092927091038793,
  -0.003644138711894154,
  -0.008849055640237713,
  -0.007455503827572618,
  -0.015175782639249978,
  -0.009657315970980216

#### French

In [20]:
# French:
data_objs_fr = create_chunks_metadata_embeddings(wiki_fr.to_pandas().head(3))

0it [00:00, ?it/s]

In [21]:
# See how many data objects we have created
len(data_objs_fr)

199

In [22]:
# Preview
data_objs_fr[0]

{'metadata': {'url': 'https://fr.wikipedia.org/wiki/Antoine%20Meillet',
  'title': 'Antoine Meillet',
  'chunk_num': 0,
  'text_content': 'Paul Jules Antoine Meillet, né le  à Moulins (Allier) et mort le  à Châteaumeillant (Cher), est le principal linguiste français des premières décennies du . Il est aussi'},
 'id': 'c7f5c96c-7cf9-483a-abe8-229e91fdb3f2',
 'values': [-0.032051658010188874,
  0.003933188703652953,
  -0.003032318702020616,
  -0.027682190293151863,
  -0.004279549449249733,
  0.04990254221471698,
  -0.006757358429573481,
  0.017704346177964174,
  0.007080405841646505,
  -0.042549042312449104,
  0.008838851450203548,
  0.03295752572183906,
  -0.0033703531905501533,
  0.004232923713524845,
  0.006197852860504004,
  -0.021607560263042794,
  0.02564399259855205,
  0.037060564388595764,
  0.01765105962285002,
  0.0053819080732542,
  -0.021048055159634628,
  0.008059540238272106,
  0.00849915152399604,
  -0.008372596886922537,
  0.007886359865473716,
  0.0009766364729392964,
  

## Step 4. Upsert Our Data Objects into Our Index, into Different Namespaces

Now that we have created our data objects, it's time to index ("upsert") them into our index!

Since we want to sub-divide our index into language-based `namespaces`, you'll see below that each time we call the `.upsert()` method, we also specify a `namespace` parameter. For our example, each namespace corresponds to the language of the dataset.

We will upsert our data into our index in small batches of `100` vectors. Read more about batch-indexing [here](https://docs.pinecone.io/docs/insert-data#batching-upserts).

In [25]:
BATCH_SIZE = 100

def batch_upsert(data: list[dict], index: pinecone.Index, namespace: str):
    """
    Upsert data objects to a Pinecone index in batches.

    :param data: Data objects we want to upsert.
    :param index: Index into which we want to upsert our data objects.
    :param namespace: Namespace within our index into which we want to upsert our data objects.
    """
    for i in range(0, len(data), BATCH_SIZE):
        batch = data[i:i+BATCH_SIZE]
        # print(batch)
        index.upsert(vectors=batch, namespace=namespace)


# NOTE:
# - In production, you'll want to wrap the upsert call in a try/except block to catch upsert errors.
# - You'll also likely want to optimize your batching as you scale your data. Parallelization and using generator objects will
#   significantly improve your batch performance.
# - Last, you'll want to confirm that the # of vectors you upsert matches the # of vectors you intend to upsert.
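
The production notes above can be sketched roughly as follows. This is an illustration under stated assumptions, not a hardened implementation: `batched`, `safe_batch_upsert`, and `FakeIndex` are hypothetical names, `FakeIndex` is an in-memory stand-in used so the example runs without a Pinecone account, and retrying on any exception is an assumption rather than a recommendation from the Pinecone docs.

```python
import time
from typing import Iterable, Iterator


def batched(items: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `batch_size` items without copying the whole input."""
    batch: list[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def safe_batch_upsert(data, index, namespace, batch_size=100, max_retries=3):
    """Upsert in batches, retrying failed batches; return the number of vectors sent."""
    sent = 0
    for batch in batched(data, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                index.upsert(vectors=batch, namespace=namespace)
                sent += len(batch)
                break
            except Exception:  # assumption: any client error is worth a retry
                if attempt == max_retries:
                    raise
                time.sleep(0.1 * attempt)  # short linear back-off
    return sent


class FakeIndex:
    """Hypothetical in-memory stand-in for a Pinecone index, for illustration only."""

    def __init__(self):
        self.namespaces: dict[str, list[dict]] = {}

    def upsert(self, vectors, namespace):
        self.namespaces.setdefault(namespace, []).extend(vectors)


fake = FakeIndex()
data = [{"id": str(i), "values": [0.0]} for i in range(250)]
sent = safe_batch_upsert(data, fake, "en")
assert sent == len(data) == len(fake.namespaces["en"])
print(sent)  # 250
```

Because `batched` is a generator, it also accepts a lazily produced stream of data objects, so the full dataset never has to sit in memory at once.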

#### English

In [26]:
# English
batch_upsert(data_objs_en, demo_index, 'en')

In [27]:
# Confirm there are indeed 737 vectors added to our index in the 'en' namespace. Perfect!
demo_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'en': {'vector_count': 737}},
 'total_vector_count': 737}

#### Italian

In [28]:
# Italian
batch_upsert(data_objs_it, demo_index, 'it')

In [29]:
# Confirm there are 490 vectors added to our index in the 'it' namespace
demo_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'en': {'vector_count': 737}, 'it': {'vector_count': 490}},
 'total_vector_count': 1227}

#### French

In [30]:
# French
batch_upsert(data_objs_fr, demo_index, 'fr')

In [31]:
# Confirm there are 199 vectors added to our index in the 'fr' namespace. Perfect!
demo_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'en': {'vector_count': 737},
                'fr': {'vector_count': 199},
                'it': {'vector_count': 490}},
 'total_vector_count': 1426}
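
Rather than comparing these counts by eye, the check could be automated. The sketch below copies the stats shown above into a plain dictionary; it assumes the response of `.describe_index_stats()` can be read like a dict, and `check_counts` is a hypothetical helper name.

```python
# Stats copied from the output above; in the notebook this would come from
# demo_index.describe_index_stats().
stats = {
    "dimension": 1536,
    "namespaces": {
        "en": {"vector_count": 737},
        "fr": {"vector_count": 199},
        "it": {"vector_count": 490},
    },
    "total_vector_count": 1426,
}

expected = {"en": 737, "it": 490, "fr": 199}  # len() of each data_objs_* list


def check_counts(stats: dict, expected: dict[str, int]) -> list[str]:
    """Return a list of human-readable mismatches (empty if all counts agree)."""
    problems = []
    for namespace, want in expected.items():
        got = stats["namespaces"].get(namespace, {}).get("vector_count", 0)
        if got != want:
            problems.append(f"{namespace}: expected {want}, found {got}")
    total = sum(ns["vector_count"] for ns in stats["namespaces"].values())
    if total != stats["total_vector_count"]:
        problems.append(f"total: namespaces sum to {total}")
    return problems


print(check_counts(stats, expected))  # []
```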

## Step 5. Query Our Data

Now that we have all of our data objects in our index, it's time for the fun part: querying.

First let's come up with some fun queries in different languages, then let's create tenants to ask those queries.

In [32]:
# English
query_en = "Who is Wilhelm Weitling?"  # This should get us back some chunks of text from our Anarchism Wikipedia article

# Italian
query_it = "Chi è un famoso antropologo?"  # Here we are asking "Who is a famous Anthropologist?" This should get us results from the Antropologia article

# French
query_fr = "Qu’est-ce que l’espace vectoriel?" # This one is very apropos: "What is vector space?" This should get us results from the Algèbre linéaire article

In [33]:
# Tenants: a list of dictionaries, one per tenant
tenants = [
    {
        'name': 'Audrey',
        'native_language': 'en',
        'query': query_en
    },
    {
        'name': 'Michele',
        'native_language': 'it',
        'query': query_it
    },
    {
        'name': 'Pierre',
        'native_language': 'fr',
        'query': query_fr
    }
]


### Step 5a. Create Vectorized Queries to Send to our Index

Since we are dealing with vector space, we can't simply execute a natural language query.

Instead, we need to turn our query into a form that our vector database understands -- we need to *vectorize* our query. Once vectorized, *then* we can send it on through to our index & get results back.


To vectorize our queries, we need to send them through the same model (`text-embedding-ada-002`) that we used to vectorize our Wikipedia articles.

In [34]:
def vectorize_query(model: OpenAIEmbeddings, query: str) -> list[float]:
    """
    Given a vectorization model & query, create an embedding.

    :param model: Model for creating embeddings.
    :param query: Query we want to vectorize/embed.
    :return: Vector/embedding.
    """
    return model.embed_query(query)


In [35]:
# Let's create our vectors for each of our 3 queries:
query_vector_en = vectorize_query(embed, query_en)
query_vector_it = vectorize_query(embed, query_it)
query_vector_fr = vectorize_query(embed, query_fr)

In [36]:
# Preview
query_vector_en

[0.0073842687953785836,
 -0.010534530831395411,
 -0.00022739273093404954,
 0.0004011451798243106,
 -0.023559654860286376,
 0.019817034395588105,
 -0.026346425409477448,
 0.015589759236003892,
 0.021015210727215673,
 -0.022024911578298025,
 0.021984521979632757,
 0.016343668608765582,
 0.006522658482789196,
 -0.006152435396185896,
 0.008986325281174573,
 -0.0023273572579950667,
 0.03115259393554281,
 -0.019332379700702163,
 -0.008838235674004212,
 -0.01289049668995309,
 -0.030129431747550575,
 0.0030812661650954556,
 -0.004829729125286768,
 0.014095404621358203,
 -0.005698070571992399,
 0.003093045999044857,
 0.01747453266790812,
 -0.0190765882223815,
 -0.002418230129701506,
 -0.02314231243788547,
 -0.008333386179785638,
 0.008939205945376966,
 -0.001989107873351198,
 0.006058196724590684,
 -0.006822202565696088,
 -0.027436899869954717,
 0.012217363720554125,
 -0.007081358912582919,
 0.006876053501271239,
 0.006421688677077742,
 0.0065529492848042424,
 0.023209624710370504,
 0.018753484

In [37]:
# Confirm that our query vectors are of the same dimension as our indexed vectors: 1536. Perfect!
len(query_vector_en)

1536

To save us some time, let's add these vectorized queries to our list of tenants:

In [38]:
# Define a list of new key-value pairs
new_key_value_pairs = [
    {'vector_query': query_vector_en},
    {'vector_query': query_vector_it},
    {'vector_query': query_vector_fr}
]

# Loop through the list of dictionaries and the list of new key-value pairs
for tenant, new_pair in zip(tenants, new_key_value_pairs):
    tenant.update(new_pair)

In [39]:
tenants

[{'name': 'Audrey',
  'native_language': 'en',
  'query': 'Who is Wilhelm Weitling?',
  'vector_query': [0.0073842687953785836,
   -0.010534530831395411,
   -0.00022739273093404954,
   0.0004011451798243106,
   -0.023559654860286376,
   0.019817034395588105,
   -0.026346425409477448,
   0.015589759236003892,
   0.021015210727215673,
   -0.022024911578298025,
   0.021984521979632757,
   0.016343668608765582,
   0.006522658482789196,
   -0.006152435396185896,
   0.008986325281174573,
   -0.0023273572579950667,
   0.03115259393554281,
   -0.019332379700702163,
   -0.008838235674004212,
   -0.01289049668995309,
   -0.030129431747550575,
   0.0030812661650954556,
   -0.004829729125286768,
   0.014095404621358203,
   -0.005698070571992399,
   0.003093045999044857,
   0.01747453266790812,
   -0.0190765882223815,
   -0.002418230129701506,
   -0.02314231243788547,
   -0.008333386179785638,
   0.008939205945376966,
   -0.001989107873351198,
   0.006058196724590684,
   -0.006822202565696088,
   -

### Step 5b. Send Queries!

Now that we have our vectorized queries, we can use our `tenants` list to look up each tenant's native language, then send the vectorized query to the matching namespace in our index & get results back.

In [40]:
# Let's send Audrey's query through to our index
audrey = [t for t in tenants if t.get('name') == 'Audrey'][0]

# Grab Audrey's vectorized query & her native language (which we'll map onto our namespaces)
audrey_query_vector = audrey['vector_query']
audrey_namespace = audrey['native_language']

# Send the query on through!
demo_index.query(vector=audrey_query_vector, top_k=1, include_metadata=True, namespace=audrey_namespace)

# Amazing! We get our chunk of text back that specifically mentions who Wilhelm Weitling is

{'matches': [{'id': '7c0644f4-2d1f-4958-a8d9-c71dce89708f',
              'metadata': {'chunk_num': 18.0,
                           'text_content': 'of the 19th century such as '
                                           'William Godwin (1756–1836) and '
                                           'Wilhelm Weitling (1808–1871) would '
                                           'contribute to the anarchist '
                                           'doctrines of the next generation '
                                           'but did not use anarchist or '
                                           'anarchism in describing',
                           'title': 'Anarchism',
                           'url': 'https://en.wikipedia.org/wiki/Anarchism'},
              'score': 0.772866428,
              'values': []}],
 'namespace': 'en'}

In [41]:
# Let's check out Pierre
pierre = [t for t in tenants if t.get('name') == 'Pierre'][0]

# Grab Pierre's vectorized query & his native language (which we'll map onto our namespaces)
pierre_query_vector = pierre['vector_query']
pierre_namespace = pierre['native_language']

# Send the query on through!
demo_index.query(vector=pierre_query_vector, top_k=1, include_metadata=True, namespace=pierre_namespace)

# Amazing! We get back a chunk of text from the Algèbre linéaire article that mentions vector space ("espace vectoriel")

{'matches': [{'id': 'c06ca84b-c84a-4d55-885f-bd424cad5383',
              'metadata': {'chunk_num': 37.0,
                           'text_content': 'représenter certaines entités '
                                           'physiques comme des déplacements, '
                                           'additionnés entre eux ou encore '
                                           'multipliés par des scalaires '
                                           '(nombres), formant ainsi le '
                                           "premier exemple concret d'espace "
                                           'vectoriel.',
                           'title': 'Algèbre linéaire',
                           'url': 'https://fr.wikipedia.org/wiki/Alg%C3%A8bre%20lin%C3%A9aire'},
              'score': 0.884463072,
              'values': []}],
 'namespace': 'fr'}
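
The two cells above repeat the same steps for Audrey and Pierre. As a sketch, those steps could be wrapped in one helper and applied to every tenant, Michele included. `query_for_tenant` and `EchoIndex` are hypothetical names: `EchoIndex` is a stand-in that only echoes the namespace so the example runs offline, and in the notebook you would pass `demo_index` and the real `tenants` list instead.

```python
def query_for_tenant(index, tenant: dict, top_k: int = 1) -> dict:
    """Query `index` inside the namespace matching the tenant's language."""
    return index.query(
        vector=tenant["vector_query"],
        top_k=top_k,
        include_metadata=True,
        namespace=tenant["native_language"],
    )


class EchoIndex:
    """Hypothetical stand-in that only reports which namespace was queried."""

    def query(self, vector, top_k, include_metadata, namespace):
        return {"matches": [], "namespace": namespace}


# Toy tenants with 2-dimension vectors; the notebook's real ones have 1536.
toy_tenants = [
    {"name": "Audrey", "native_language": "en", "vector_query": [0.1, 0.2]},
    {"name": "Michele", "native_language": "it", "vector_query": [0.3, 0.4]},
    {"name": "Pierre", "native_language": "fr", "vector_query": [0.5, 0.6]},
]

for tenant in toy_tenants:
    result = query_for_tenant(EchoIndex(), tenant)
    print(tenant["name"], "->", result["namespace"])
```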

## BONUS: Filter by Metadata Within a Single Namespace

What if we wanted to filter by metadata...let's say an article's `title`...*within* a specific namespace?

Easy!

Below we will send a query to our index, to the `'en'` namespace, and filter our results to only be from the Wikipedia article titled `Anarchism`:
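
To illustrate what an `$eq` metadata filter means, here is a small pure-Python sketch that applies the same kind of condition to a list of toy metadata records. It only mimics the semantics of equality filtering and says nothing about how Pinecone evaluates filters internally; `matches_filter` and the sample records are made up for this example.

```python
def matches_filter(metadata: dict, flt: dict) -> bool:
    """Return True if every `$eq` condition in `flt` holds for `metadata`."""
    for field, condition in flt.items():
        if "$eq" not in condition:
            raise NotImplementedError("this sketch only handles $eq")
        if metadata.get(field) != condition["$eq"]:
            return False
    return True


records = [
    {"title": "Anarchism", "chunk_num": 0},
    {"title": "Autism", "chunk_num": 0},
    {"title": "Anarchism", "chunk_num": 1},
]
flt = {"title": {"$eq": "Anarchism"}}
kept = [r for r in records if matches_filter(r, flt)]
print(len(kept))  # 2
```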

In [42]:
# Reminder what our English data looks like
wiki_en.to_pandas().head(3)

| | id | url | title | text |
|---|---|---|---|---|
| 0 | 12 | https://en.wikipedia.org/wiki/Anarchism | Anarchism | Anarchism is a political philosophy and moveme... |
| 1 | 25 | https://en.wikipedia.org/wiki/Autism | Autism | Autism is a neurodevelopmental disorder charac... |
| 2 | 39 | https://en.wikipedia.org/wiki/Albedo | Albedo | Albedo (; ) is the measure of the diffuse refl... |


In [43]:
sample_anarchy_query = "What is anarchy?"

In [44]:
vectorized_sample_anarchy_query = vectorize_query(embed, sample_anarchy_query)

In [45]:
targeted_namespace = 'en'

In [46]:
# Send our query through!

demo_index.query(
    vector=vectorized_sample_anarchy_query,
    filter={
        "title": {"$eq": "Anarchism"},
    },
    top_k=3,
    include_metadata=True,
    namespace=targeted_namespace
)

{'matches': [{'id': '04012448-37f2-465f-a168-5d8c2629acb9',
              'metadata': {'chunk_num': 224.0,
                           'text_content': 'Anarchism and the state',
                           'title': 'Anarchism',
                           'url': 'https://en.wikipedia.org/wiki/Anarchism'},
              'score': 0.879203916,
              'values': []},
             {'id': '9ce6c53e-faa0-4463-93ff-e36cbb7c5e05',
              'metadata': {'chunk_num': 31.0,
                           'text_content': 'society, the rejection of the '
                                           'state apparatus, the belief that '
                                           'human nature allows humans to '
                                           'exist in or progress toward such a '
                                           'non-coercive society, and a '
                                           'suggestion on how to act to pursue '
                                           'the ideal of a

## Awesome!!