# Alignment Lit Semantic Search using Pinecone

This notebook shows how to use Pinecone to perform a semantic search over alignment literature while also applying a traditional keyword filter.

https://github.com/pinecone-io/examples/blob/master/metadata_filtered_search/metadata_filtered_search.ipynb

We will use the `sentence-transformers` library to build our sentence embeddings. It can be installed using `pip` like so:

In [10]:
!pip install sentence-transformers
!pip install pinecone-client



*(The notebook may need to be restarted for the install to take effect.)*

### Processing and Storing the Paper Data
#### 1. Generating Embeddings
In this example we use the `sentence-transformers` library to encode each paper's title and abstract into a vector. More info can be found [here](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models).

In [11]:
import json
import pandas as pd
from google.colab import drive

drive.mount('/content/drive/', force_remount=True)
PATH = "/content/drive/My Drive/Colab Notebooks/data/"
data = pd.read_json(PATH + 'arxiv_pos_list.json')
data.head()

Mounted at /content/drive/


Unnamed: 0,source,source_type,converted_with,paper_version,title,authors,date_published,data_last_modified,url,abstract,...,citation_level,alignment_text,confidence_score,main_tex_filename,text,bibliography_bbl,bibliography_bib,arxiv_citations,alignment_newsletter,source_filetype
0,arxiv,latex,pandoc,1806.09055v2,DARTS: Differentiable Architecture Search,"[Hanxiao Liu, Karen Simonyan, Yiming Yang]",2018-06-24 00:06:13+00:00,2019-04-23 06:29:32+00:00,http://arxiv.org/abs/1806.09055v2,This paper addresses the scalability challenge...,...,0,pos,1.0,main.tex,---\nabstract: |\n This paper addresses the s...,\begin{thebibliography}{46}\n\providecommand{\...,,"{'1709.09582': True, '1708.04552': True, '1711...",,
1,arxiv,latex,pandoc,1906.02530v2,Can You Trust Your Model's Uncertainty? Evalua...,"[Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary ...",2019-06-06 11:42:53+00:00,2019-12-17 21:30:28+00:00,http://arxiv.org/abs/1906.02530v2,Modern machine learning methods including deep...,...,0,pos,1.0,,,\begin{thebibliography}{57}\n\providecommand{\...,"@incollection{lang1995newsweeder,\n title={Ne...","{'1807.00906': True, '1606.06565': True, '1811...",,
2,arxiv,latex,pandoc,1902.08265v1,Quantifying Perceptual Distortion of Adversari...,"[Matt Jordan, Naren Manoj, Surbhi Goel, Alexan...",2019-02-21 21:02:58+00:00,2019-02-21 21:02:58+00:00,http://arxiv.org/abs/1902.08265v1,Recent work has shown that additive threat mod...,...,0,pos,1.0,,,\begin{thebibliography}{27}\n\providecommand{\...,,"{'1707.07397': True, '1712.03141': True, '1712...","{'source': 'alignment-newsletter', 'source_typ...",
3,arxiv,latex,pandoc,2006.13258v6,Adversarial Soft Advantage Fitting: Imitation ...,"[Paul Barde, Julien Roy, Wonseok Jeon, Joelle ...",2020-06-23 18:29:13+00:00,2021-04-16 10:09:13+00:00,http://arxiv.org/abs/2006.13258v6,Adversarial Imitation Learning alternates betw...,...,0,pos,0.994039,main.tex,---\nabstract: |\n Adversarial Imitation Lear...,\begin{thebibliography}{30}\n\providecommand{\...,"@article{peng2018continual,\n title={Continua...","{'1611.03852': True, '1812.05905': True, '1812...","{'source': 'alignment-newsletter', 'source_typ...",
4,arxiv,latex,pandoc,2011.05623v3,"Fooling the primate brain with minimal, target...","[Li Yuan, Will Xiao, Giorgia Dellaferrera, Gab...",2020-11-11 08:30:54+00:00,2022-03-30 05:36:53+00:00,http://arxiv.org/abs/2011.05623v3,Artificial neural networks (ANNs) are consider...,...,0,pos,1.0,,,\begin{thebibliography}{10}\n\expandafter\ifx\...,"@inproceedings{he2015delving,\n title={Delvin...","{'1312.6199': True, '1412.6572': True, '1802.0...","{'source': 'alignment-newsletter', 'source_typ...",



In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize

model = SentenceTransformer('allenai-specter')

In [64]:
# Get titles, abstracts, and paper IDs.
title_data = data['title'].tolist()
text_data = data['abstract'].tolist()
ids = data['paper_version'].tolist()

# Join title and abstract with '[SEP]', the input format used for the SPECTER model.
data['text_to_encode'] = data['title'].map(str) + '[SEP]' + data['abstract'].map(str)
title_text_data = data['text_to_encode'].tolist()
data.set_index("paper_version", inplace = True)
data.head()

Unnamed: 0_level_0,source,source_type,converted_with,title,authors,date_published,data_last_modified,url,abstract,author_comment,...,alignment_text,confidence_score,main_tex_filename,text,bibliography_bbl,bibliography_bib,arxiv_citations,alignment_newsletter,source_filetype,text_to_encode
paper_version,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1806.09055v2,arxiv,latex,pandoc,DARTS: Differentiable Architecture Search,"[Hanxiao Liu, Karen Simonyan, Yiming Yang]",2018-06-24 00:06:13+00:00,2019-04-23 06:29:32+00:00,http://arxiv.org/abs/1806.09055v2,This paper addresses the scalability challenge...,Published at ICLR 2019; Code and pretrained mo...,...,pos,1.0,main.tex,---\nabstract: |\n This paper addresses the s...,\begin{thebibliography}{46}\n\providecommand{\...,,"{'1709.09582': True, '1708.04552': True, '1711...",,,DARTS: Differentiable Architecture Search[SEP]...
1906.02530v2,arxiv,latex,pandoc,Can You Trust Your Model's Uncertainty? Evalua...,"[Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary ...",2019-06-06 11:42:53+00:00,2019-12-17 21:30:28+00:00,http://arxiv.org/abs/1906.02530v2,Modern machine learning methods including deep...,Advances in Neural Information Processing Syst...,...,pos,1.0,,,\begin{thebibliography}{57}\n\providecommand{\...,"@incollection{lang1995newsweeder,\n title={Ne...","{'1807.00906': True, '1606.06565': True, '1811...",,,Can You Trust Your Model's Uncertainty? Evalua...
1902.08265v1,arxiv,latex,pandoc,Quantifying Perceptual Distortion of Adversari...,"[Matt Jordan, Naren Manoj, Surbhi Goel, Alexan...",2019-02-21 21:02:58+00:00,2019-02-21 21:02:58+00:00,http://arxiv.org/abs/1902.08265v1,Recent work has shown that additive threat mod...,"18 pages, codebase/framework available at\n h...",...,pos,1.0,,,\begin{thebibliography}{27}\n\providecommand{\...,,"{'1707.07397': True, '1712.03141': True, '1712...","{'source': 'alignment-newsletter', 'source_typ...",,Quantifying Perceptual Distortion of Adversari...
2006.13258v6,arxiv,latex,pandoc,Adversarial Soft Advantage Fitting: Imitation ...,"[Paul Barde, Julien Roy, Wonseok Jeon, Joelle ...",2020-06-23 18:29:13+00:00,2021-04-16 10:09:13+00:00,http://arxiv.org/abs/2006.13258v6,Adversarial Imitation Learning alternates betw...,,...,pos,0.994039,main.tex,---\nabstract: |\n Adversarial Imitation Lear...,\begin{thebibliography}{30}\n\providecommand{\...,"@article{peng2018continual,\n title={Continua...","{'1611.03852': True, '1812.05905': True, '1812...","{'source': 'alignment-newsletter', 'source_typ...",,Adversarial Soft Advantage Fitting: Imitation ...
2011.05623v3,arxiv,latex,pandoc,"Fooling the primate brain with minimal, target...","[Li Yuan, Will Xiao, Giorgia Dellaferrera, Gab...",2020-11-11 08:30:54+00:00,2022-03-30 05:36:53+00:00,http://arxiv.org/abs/2011.05623v3,Artificial neural networks (ANNs) are consider...,,...,pos,1.0,,,\begin{thebibliography}{10}\n\expandafter\ifx\...,"@inproceedings{he2015delving,\n title={Delvin...","{'1312.6199': True, '1412.6572': True, '1802.0...","{'source': 'alignment-newsletter', 'source_typ...",,"Fooling the primate brain with minimal, target..."
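
The `text_to_encode` column above is built with `.map(str)`, which turns a missing title or abstract into the literal string `None` or `nan`. The sketch below shows one way to build the same `title[SEP]abstract` string while dropping missing parts. The helper names `_clean` and `build_text_to_encode` are hypothetical and not part of the notebook.

```python
def _clean(value):
    """Return a stripped string, or '' for None / NaN (hypothetical helper)."""
    if value is None:
        return ""
    if isinstance(value, float) and value != value:  # NaN is the only float not equal to itself
        return ""
    return str(value).strip()


def build_text_to_encode(title, abstract, sep="[SEP]"):
    """Join title and abstract with `sep`, skipping whichever part is missing."""
    title, abstract = _clean(title), _clean(abstract)
    if not abstract:
        return title
    if not title:
        return abstract
    return title + sep + abstract


example = build_text_to_encode(
    "DARTS: Differentiable Architecture Search",
    "This paper addresses the scalability challenge...",
)
print(example)
```

With pandas this could be applied row by row, for example with `data.apply(lambda r: build_text_to_encode(r['title'], r['abstract']), axis=1)`.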


We use this pretrained sentence transformer model to encode the sentences.

In [76]:
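all_embeddings = model.encode(title_text_data, show_progress_bar=True)
# L2-normalize each row so that a dot product between two rows equals their cosine similarity.
all_embeddings = normalize(all_embeddings)
print(all_embeddings.shape)

# Save (id, embedding) pairs; the context manager makes sure the file is closed.
with open(PATH + "arxiv_pos_list_embeddings.json", "w") as f:
    json.dump(list(zip(data.index.tolist(), all_embeddings.tolist())), f)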

Batches:   0%|          | 0/30 [00:00<?, ?it/s]
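
The `normalize` call above rescales every embedding to unit length (L2 norm). A minimal pure-Python sketch of what that does to a single vector, and why it matters for similarity search, is given below. The toy 2-dimensional vectors are illustrative only; real embeddings have as many dimensions as `all_embeddings.shape[1]`.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])   # [0.6, 0.8]
b = l2_normalize([4.0, 3.0])   # [0.8, 0.6]

# For unit-length vectors the dot product is exactly the cosine similarity.
cosine = dot(a, b)
print(round(cosine, 2))  # 0.96
```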

In [79]:
with open(PATH + "test.json", "w") as f:  # the metadata helper is defined in the "Prepare metadata" cell
    json.dump(list(zip(data.index.tolist(), data.apply(get_vector_metadata_from_dataframe_row, axis=1).tolist())), f)

We now have everything we need: a dense vector representation of each paper. Let's establish a connection to Pinecone, ready for upserting our data.

To connect to a Pinecone instance you need an API key; you can get a [free API key here](https://app.pinecone.io).

In [17]:
import pinecone
pinecone.init(api_key='API_KEY', environment='us-west1-gcp')

We can check for existing indexes with:

In [20]:
pinecone.list_indexes()

['alignment-lit']

If the index does not exist yet, we create it with `create_index`; either way, we then connect to it with `Index`. (In the run shown here, `alignment-lit` already exists.)

In [22]:
index_name = 'alignment-lit'
if index_name not in pinecone.list_indexes():
  pinecone.create_index(name=index_name, dimension=all_embeddings.shape[1])
index = pinecone.Index(index_name)

### Components of a Pinecone vector embedding

There are three components to every Pinecone vector embedding:
 - a vector ID
 - a sequence of floats of a user-defined, fixed dimension
 - vector metadata (a key-value store)
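
The three components listed above can be sketched as a plain Python tuple, which is the shape the upload code in this notebook builds. The helper name `make_vector` and the 3-dimensional toy values are assumptions for illustration; the ID, title, and URL are taken from the first row of the dataframe.

```python
def make_vector(vector_id, values, metadata):
    """Bundle the three components into an (id, values, metadata) tuple.

    Hypothetical helper: it only checks types and copies the inputs.
    """
    if not isinstance(vector_id, str):
        raise TypeError("vector_id must be a string")
    return (vector_id, [float(v) for v in values], dict(metadata))


vector = make_vector(
    "1806.09055v2",                      # vector ID (the paper version)
    [0.1, 0.2, 0.3],                     # toy values; real vectors have the index dimension
    {
        "title": "DARTS: Differentiable Architecture Search",
        "url": "http://arxiv.org/abs/1806.09055v2",
    },
)

vector_id, values, metadata = vector
print(vector_id, len(values), sorted(metadata))
```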

### Prepare vector embeddings for upload

We will encode the papers for upload to Pinecone. This may take a while depending on your machine and the size of the dataset. We will use the index of the pandas dataframe (the `paper_version`) for the vector ID, the pretrained model to generate the fixed-length sequence of floats (its length is `all_embeddings.shape[1]`, the dimension the index was created with), and the title, authors, abstract and URL for the metadata.

#### Prepare metadata

The function below creates metadata from a single row of the dataframe. This is going to be important further down this notebook for additional filter requirements we may want to employ in our queries.

In [63]:
# TODO reorder to above encode for embeddings
def get_vector_metadata_from_dataframe_row(df_row):
    """Return vector metadata."""
    vector_metadata = {
        'title': df_row['title'],
        # TODO convert list by join authors separated by commas
        'authors': df_row['authors'], 
        'abstract': df_row['abstract'],
        'url': df_row['url']
    }
    return vector_metadata
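
The TODO in the function above mentions joining the authors into a comma-separated string. A minimal sketch of that conversion follows; the helper name `authors_to_string` is hypothetical. Whether the conversion is needed depends on how the field will be filtered or displayed, so treat this as one option rather than a required step.

```python
def authors_to_string(authors):
    """Turn a list of author names into 'A, B, C'; pass strings through unchanged."""
    if authors is None:
        return ""
    if isinstance(authors, str):
        return authors
    return ", ".join(str(name) for name in authors)


joined = authors_to_string(["Hanxiao Liu", "Karen Simonyan", "Yiming Yang"])
print(joined)  # Hanxiao Liu, Karen Simonyan, Yiming Yang
```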

#### Prepare all vector data for upload

The function below will take a portion of the dataframe and create the full vector data as Pinecone expects it for [upsert](https://www.pinecone.io/docs/insert-data/).

In [66]:
def get_vectors_to_upload_to_pinecone(df_chunk, model, is_multiprocess=False):
    """Return list of tuples like (vector_id, vector_values, vector_metadata)."""
    # create embeddings (pass a plain list so encoding does not depend on the dataframe index)
    texts = df_chunk['text_to_encode'].tolist()
    if is_multiprocess:
        pool = model.start_multi_process_pool()
        vector_values = model.encode_multi_process(texts, pool).tolist()
        model.stop_multi_process_pool(pool)
    else:
        vector_values = model.encode(texts, show_progress_bar=True).tolist()
    # create vector ids and metadata
    vector_ids = df_chunk.index.tolist()
    vector_metadata = df_chunk.apply(get_vector_metadata_from_dataframe_row,axis=1).tolist()
    return list(zip(vector_ids, vector_values, vector_metadata))

### Upload data to Pinecone in asynchronous batches

The function below iterates through the dataframe in chunks, and for each of those chunks, will upload asynchronously in sub-chunks to your Pinecone Index.

In [67]:
def chunks(lst, n):
    """A generator function that iterates through lst in batches.
    Each batch is of size n except possibly the last batch, which may be of 
    size less than n.
    """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
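
A quick usage sketch of the `chunks` generator above on a small list (the function is repeated so the example runs on its own). Because it relies only on `len` and slicing, the same function works on a pandas dataframe, where `lst[i:i + n]` selects rows by position.

```python
def chunks(lst, n):
    """Yield successive batches of size n from lst; the last batch may be smaller."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


batches = list(chunks(list(range(7)), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```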

In [72]:
import numpy as np

def get_tqdm_kwargs(dataframe, chunksize):
    return dict(
        smoothing=0, 
        unit='chunk of vectors', 
        total=int(np.ceil(len(dataframe)/chunksize))
    )
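
The `total` passed to the progress bar is a ceiling division. The sketch below works through the arithmetic, assuming 959 rows and a chunk size of 100, which is consistent with the upload log in this notebook (nine chunks of 100 and one of 59). The helper names are hypothetical.

```python
import math


def number_of_chunks(n_rows, chunk_size):
    """Number of chunks needed to cover n_rows, rounding up."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(n_rows / chunk_size)


def last_chunk_size(n_rows, chunk_size):
    """Size of the final chunk; equals chunk_size when n_rows divides evenly."""
    remainder = n_rows % chunk_size
    return remainder if remainder else min(n_rows, chunk_size)


print(number_of_chunks(959, 100))  # 10
print(last_chunk_size(959, 100))   # 59
```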

In [74]:
import collections
import tqdm.notebook  # explicit submodule import so that `tqdm.notebook.tqdm` is available

def upload_dataframe_to_pinecone_in_chunks(
    dataframe, 
    pinecone_index, 
    model, 
    is_multiprocess=False,
    chunk_size=100, 
    upsert_size=100):
    """Encode dataframe column `text_to_encode` to dense vector and upsert to Pinecone."""
    tqdm_kwargs = get_tqdm_kwargs(dataframe, chunk_size)
    async_results = collections.defaultdict(list)
    for df_chunk in tqdm.notebook.tqdm(chunks(dataframe, chunk_size), **tqdm_kwargs):
        vectors = get_vectors_to_upload_to_pinecone(df_chunk, model, is_multiprocess=is_multiprocess)
        # upload to Pinecone in batches of `upsert_size`
        for vectors_chunk in chunks(vectors, upsert_size):
            start_index_chunk = df_chunk.index[0]
            async_result = pinecone_index.upsert(vectors_chunk, async_req=True)
            async_results[start_index_chunk].append(async_result)
        # wait for results
        _ = [async_result.get() for async_result in async_results[start_index_chunk]]
        is_all_successful = all(map(lambda x: x.successful(), async_results[start_index_chunk]))
        # report chunk upload status
        print(
        f'All upserts in chunk successful with index starting with {start_index_chunk:>7}: '
        f'{is_all_successful}. Vectors uploaded: {len(vectors):>3}.'
        )
    return async_results
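
The function above calls `.get()` and `.successful()` on each value returned by `upsert(..., async_req=True)`. The sketch below reproduces that submit-then-wait pattern with the standard library only, assuming the returned objects behave like `multiprocessing.pool.ApplyResult`. `fake_upsert` is a hypothetical stand-in for the network call and does not talk to Pinecone.

```python
from multiprocessing.pool import ThreadPool


def fake_upsert(batch):
    """Hypothetical stand-in for an upsert request; reports how many vectors it received."""
    return {"upserted_count": len(batch)}


def chunks(lst, n):
    """Yield successive batches of size n from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


# 250 toy vectors in (id, values, metadata) form.
vectors = [(f"id-{i}", [0.0, 1.0], {}) for i in range(250)]

with ThreadPool(processes=4) as pool:
    # Submit every batch without waiting for the previous one to finish.
    async_results = [pool.apply_async(fake_upsert, (batch,)) for batch in chunks(vectors, 100)]
    # Block until each request has completed, then check that none raised.
    responses = [result.get() for result in async_results]
    all_ok = all(result.successful() for result in async_results)

total = sum(response["upserted_count"] for response in responses)
print(all_ok, total)  # True 250
```

Waiting once per chunk, as the notebook does, bounds the number of requests in flight while still letting the requests within a chunk overlap.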

#### Asynchronous Upload
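With `async_req=True`, the Pinecone client returns right away with a handle to the pending request instead of blocking until the upsert completes; see [sending upserts in parallel](https://www.pinecone.io/docs/insert-data/#sending-upserts-in-parallel).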

In [75]:
# Setting the `is_multiprocess` flag to `False` gives visibility
# into per-batch progress, but the embeddings are created at roughly a 2x
# slower rate, based on a few runs on a 2021 MacBook Pro.
async_results = upload_dataframe_to_pinecone_in_chunks(data, index, model, is_multiprocess=False)

All upserts in chunk successful with index starting with 1806.09055v2: True. Vectors uploaded: 100.
All upserts in chunk successful with index starting with 2007.08124v1: True. Vectors uploaded: 100.
All upserts in chunk successful with index starting with 1806.09795v3: True. Vectors uploaded: 100.
All upserts in chunk successful with index starting with 1906.08663v1: True. Vectors uploaded: 100.
All upserts in chunk successful with index starting with 1805.08263v4: True. Vectors uploaded: 100.
All upserts in chunk successful with index starting with 1803.10664v2: True. Vectors uploaded: 100.
All upserts in chunk successful with index starting with 2107.04303v2: True. Vectors uploaded: 100.
All upserts in chunk successful with index starting with 1604.04728v1: True. Vectors uploaded: 100.
All upserts in chunk successful with index starting with 2111.14726v1: True. Vectors uploaded: 100.
All upserts in chunk successful with index starting with 2102.09692v1: True. Vectors uploaded:  59.

In [57]:
# upserts = []
# for (id, embedding, row) in zip(ids, all_embeddings, data):
#   # (id, vectors, metadata)
#   upserts.append((id, embedding.tolist(), 
#                   {'title': row['title'], 'authors': row['authors'], 'url': row['url'], 'abstract': row['abstract']}))

In [58]:
# import itertools

# def chunks(iterable, batch_size=100):
#     """A helper function to break an iterable into chunks of size batch_size."""
#     it = iter(iterable)
#     chunk = tuple(itertools.islice(it, batch_size))
#     while chunk:
#         yield chunk
#         chunk = tuple(itertools.islice(it, batch_size))

# # Upsert data with 32 vectors per upsert request
# for ids_vectors_chunk in chunks(upserts, batch_size=32):
#     index.upsert(vectors=ids_vectors_chunk)  # Assuming `index` defined elsewhere

## Querying

Now that the data is in our index, let's perform a semantic search using a query sentence. The search returns the papers whose embeddings are most *semantically* similar to the query.

We define the query and encode it with the same model we used for the papers. When querying with `index.query` we pass the query vector as the first argument; to restrict results by metadata, a `filter` parameter can be added to the same call.

In [98]:
query_sentence = "What is AI Safety?"
xq = model.encode(query_sentence).tolist()
result = index.query(xq, top_k=5, includeMetadata=True)
for item in result["matches"]:
  print('{0:.2f}'.format(item["score"]), item["metadata"]["title"])

0.86 Safe AI -- How is this Possible?
0.86 Unpredictability of AI
0.85 The Concept of Criticality in AI Safety
0.85 Understanding and Avoiding AI Failures: A Practical Guide
0.84 System Safety and Artificial Intelligence
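
Conceptually, the query above ranks stored vectors by their similarity to the query vector and returns the best `top_k`. The sketch below does this by brute force over a toy in-memory list, with an optional metadata predicate to mimic filtering. It illustrates the idea only and is not how Pinecone is implemented; the function names, toy vectors, and author values are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def brute_force_query(query, vectors, top_k, metadata_filter=None):
    """Return the top_k (score, id) pairs, best first, among vectors passing the filter."""
    scored = []
    for vector_id, values, metadata in vectors:
        if metadata_filter is not None and not metadata_filter(metadata):
            continue
        scored.append((cosine_similarity(query, values), vector_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


toy_index = [
    ("a", [1.0, 0.0], {"authors": ["Author A"]}),
    ("b", [0.0, 1.0], {"authors": ["Author B"]}),
    ("c", [1.0, 1.0], {"authors": ["Author A", "Author B"]}),
]
query = [1.0, 0.1]

unfiltered = [vid for _, vid in brute_force_query(query, toy_index, top_k=2)]
only_b = [
    vid
    for _, vid in brute_force_query(
        query, toy_index, top_k=2,
        metadata_filter=lambda meta: "Author B" in meta["authors"],
    )
]
print(unfiltered)  # ['a', 'c']
print(only_b)      # ['c', 'b']
```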


Let's extract just the paper IDs to see the order of what we have returned.
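
A minimal sketch of that extraction is shown here on a hand-written stand-in for the query response. The IDs, scores, and titles are placeholders rather than real results; it assumes each match carries `id`, `score`, and `metadata` keys, as the query loop in this notebook does for `score` and `metadata`.

```python
# Placeholder response with the same shape as result["matches"] (values are made up).
result = {
    "matches": [
        {"id": "paper-1", "score": 0.9, "metadata": {"title": "Toy title 1"}},
        {"id": "paper-2", "score": 0.8, "metadata": {"title": "Toy title 2"}},
        {"id": "paper-3", "score": 0.7, "metadata": {"title": "Toy title 3"}},
    ]
}

# Matches arrive ordered from most to least similar, so this list keeps the ranking.
matched_ids = [match["id"] for match in result["matches"]]
print(matched_ids)  # ['paper-1', 'paper-2', 'paper-3']
```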

In [1]:
# pinecone.delete_index(name='alignment-lit')

In [95]:
pip install streamlit

In [96]:
import streamlit as st
st.write("""
# Apps with widgets!
""")

x = st.slider("Select a number")
st.write("You selected:", x)

  command:

    streamlit run /usr/local/lib/python3.8/dist-packages/ipykernel_launcher.py [ARGUMENTS]
2022-12-06 04:36:13.490 Session state does not function when running a script without `streamlit run`
