# Alignment Lit Semantic Search using Pinecone

This notebook shows how to use Pinecone to run a semantic search over alignment literature, and how to combine it with a traditional keyword search.

https://github.com/pinecone-io/examples/blob/master/metadata_filtered_search/metadata_filtered_search.ipynb

We will use the `sentence-transformers` library to build our sentence embeddings. It can be installed using `pip` like so:

In [None]:
!pip install sentence-transformers
!pip install pinecone-client


### Download Data
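In this example we use the `sentence-transformers` library to encode each paper's title and abstract into a vector. More info on the pretrained models can be found [here](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models). The cell below mounts Google Drive and loads the dataset from a JSON file.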

In [None]:
import json
import pandas as pd
from google.colab import drive

drive.mount('/content/drive/', force_remount=True)
PATH = "/content/drive/My Drive/Colab Notebooks/data/"
data = pd.read_json(PATH + 'arxiv_pos_list.json')
data.head()

Mounted at /content/drive/


Unnamed: 0,source,source_type,converted_with,paper_version,title,authors,date_published,data_last_modified,url,abstract,...,citation_level,alignment_text,confidence_score,main_tex_filename,text,bibliography_bbl,bibliography_bib,arxiv_citations,alignment_newsletter,source_filetype
0,arxiv,latex,pandoc,1806.09055v2,DARTS: Differentiable Architecture Search,"[Hanxiao Liu, Karen Simonyan, Yiming Yang]",2018-06-24 00:06:13+00:00,2019-04-23 06:29:32+00:00,http://arxiv.org/abs/1806.09055v2,This paper addresses the scalability challenge...,...,0,pos,1.0,main.tex,---\nabstract: |\n This paper addresses the s...,\begin{thebibliography}{46}\n\providecommand{\...,,"{'1709.09582': True, '1708.04552': True, '1711...",,
1,arxiv,latex,pandoc,1906.02530v2,Can You Trust Your Model's Uncertainty? Evalua...,"[Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary ...",2019-06-06 11:42:53+00:00,2019-12-17 21:30:28+00:00,http://arxiv.org/abs/1906.02530v2,Modern machine learning methods including deep...,...,0,pos,1.0,,,\begin{thebibliography}{57}\n\providecommand{\...,"@incollection{lang1995newsweeder,\n title={Ne...","{'1807.00906': True, '1606.06565': True, '1811...",,
2,arxiv,latex,pandoc,1902.08265v1,Quantifying Perceptual Distortion of Adversari...,"[Matt Jordan, Naren Manoj, Surbhi Goel, Alexan...",2019-02-21 21:02:58+00:00,2019-02-21 21:02:58+00:00,http://arxiv.org/abs/1902.08265v1,Recent work has shown that additive threat mod...,...,0,pos,1.0,,,\begin{thebibliography}{27}\n\providecommand{\...,,"{'1707.07397': True, '1712.03141': True, '1712...","{'source': 'alignment-newsletter', 'source_typ...",
3,arxiv,latex,pandoc,2006.13258v6,Adversarial Soft Advantage Fitting: Imitation ...,"[Paul Barde, Julien Roy, Wonseok Jeon, Joelle ...",2020-06-23 18:29:13+00:00,2021-04-16 10:09:13+00:00,http://arxiv.org/abs/2006.13258v6,Adversarial Imitation Learning alternates betw...,...,0,pos,0.994039,main.tex,---\nabstract: |\n Adversarial Imitation Lear...,\begin{thebibliography}{30}\n\providecommand{\...,"@article{peng2018continual,\n title={Continua...","{'1611.03852': True, '1812.05905': True, '1812...","{'source': 'alignment-newsletter', 'source_typ...",
4,arxiv,latex,pandoc,2011.05623v3,"Fooling the primate brain with minimal, target...","[Li Yuan, Will Xiao, Giorgia Dellaferrera, Gab...",2020-11-11 08:30:54+00:00,2022-03-30 05:36:53+00:00,http://arxiv.org/abs/2011.05623v3,Artificial neural networks (ANNs) are consider...,...,0,pos,1.0,,,\begin{thebibliography}{10}\n\expandafter\ifx\...,"@inproceedings{he2015delving,\n title={Delvin...","{'1312.6199': True, '1412.6572': True, '1802.0...","{'source': 'alignment-newsletter', 'source_typ...",


In [None]:
saved_data = data[['paper_version', 'title', 'authors', 'url', 'abstract']]
saved_data.to_json(PATH + "arxiv_pos_list_metadata.json", orient="records")
# duplicate index "1808.03644v1"
saved_data.head()

Unnamed: 0,paper_version,title,authors,url,abstract
0,1806.09055v2,DARTS: Differentiable Architecture Search,"[Hanxiao Liu, Karen Simonyan, Yiming Yang]",http://arxiv.org/abs/1806.09055v2,This paper addresses the scalability challenge...
1,1906.02530v2,Can You Trust Your Model's Uncertainty? Evalua...,"[Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary ...",http://arxiv.org/abs/1906.02530v2,Modern machine learning methods including deep...
2,1902.08265v1,Quantifying Perceptual Distortion of Adversari...,"[Matt Jordan, Naren Manoj, Surbhi Goel, Alexan...",http://arxiv.org/abs/1902.08265v1,Recent work has shown that additive threat mod...
3,2006.13258v6,Adversarial Soft Advantage Fitting: Imitation ...,"[Paul Barde, Julien Roy, Wonseok Jeon, Joelle ...",http://arxiv.org/abs/2006.13258v6,Adversarial Imitation Learning alternates betw...
4,2011.05623v3,"Fooling the primate brain with minimal, target...","[Li Yuan, Will Xiao, Giorgia Dellaferrera, Gab...",http://arxiv.org/abs/2011.05623v3,Artificial neural networks (ANNs) are consider...


In [None]:
# Get titles and abstracts.
title_data = data['title'].tolist()
text_data = data['abstract'].tolist()
# Join title and abstract with a [SEP] token to form the text to encode.
title_text_data = data['title'].map(str) + "[SEP]" + data['abstract'].map(str)
data['text_to_encode'] = title_text_data
# Guard against re-running this cell: once paper_version is the index it is
# no longer a column, and calling set_index again would raise a KeyError.
if "paper_version" in data.columns:
    data.set_index("paper_version", inplace=True)
ids = data.index.tolist()
data.head()

paper_version,source,source_type,converted_with,title,authors,date_published,data_last_modified,url,abstract,author_comment,...,alignment_text,confidence_score,main_tex_filename,text,bibliography_bbl,bibliography_bib,arxiv_citations,alignment_newsletter,source_filetype,text_to_encode
1806.09055v2,arxiv,latex,pandoc,DARTS: Differentiable Architecture Search,"[Hanxiao Liu, Karen Simonyan, Yiming Yang]",2018-06-24 00:06:13+00:00,2019-04-23 06:29:32+00:00,http://arxiv.org/abs/1806.09055v2,This paper addresses the scalability challenge...,Published at ICLR 2019; Code and pretrained mo...,...,pos,1.0,main.tex,---\nabstract: |\n This paper addresses the s...,\begin{thebibliography}{46}\n\providecommand{\...,,"{'1709.09582': True, '1708.04552': True, '1711...",,,DARTS: Differentiable Architecture Search[SEP]...
1906.02530v2,arxiv,latex,pandoc,Can You Trust Your Model's Uncertainty? Evalua...,"[Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary ...",2019-06-06 11:42:53+00:00,2019-12-17 21:30:28+00:00,http://arxiv.org/abs/1906.02530v2,Modern machine learning methods including deep...,Advances in Neural Information Processing Syst...,...,pos,1.0,,,\begin{thebibliography}{57}\n\providecommand{\...,"@incollection{lang1995newsweeder,\n title={Ne...","{'1807.00906': True, '1606.06565': True, '1811...",,,Can You Trust Your Model's Uncertainty? Evalua...
1902.08265v1,arxiv,latex,pandoc,Quantifying Perceptual Distortion of Adversari...,"[Matt Jordan, Naren Manoj, Surbhi Goel, Alexan...",2019-02-21 21:02:58+00:00,2019-02-21 21:02:58+00:00,http://arxiv.org/abs/1902.08265v1,Recent work has shown that additive threat mod...,"18 pages, codebase/framework available at\n h...",...,pos,1.0,,,\begin{thebibliography}{27}\n\providecommand{\...,,"{'1707.07397': True, '1712.03141': True, '1712...","{'source': 'alignment-newsletter', 'source_typ...",,Quantifying Perceptual Distortion of Adversari...
2006.13258v6,arxiv,latex,pandoc,Adversarial Soft Advantage Fitting: Imitation ...,"[Paul Barde, Julien Roy, Wonseok Jeon, Joelle ...",2020-06-23 18:29:13+00:00,2021-04-16 10:09:13+00:00,http://arxiv.org/abs/2006.13258v6,Adversarial Imitation Learning alternates betw...,,...,pos,0.994039,main.tex,---\nabstract: |\n Adversarial Imitation Lear...,\begin{thebibliography}{30}\n\providecommand{\...,"@article{peng2018continual,\n title={Continua...","{'1611.03852': True, '1812.05905': True, '1812...","{'source': 'alignment-newsletter', 'source_typ...",,Adversarial Soft Advantage Fitting: Imitation ...
2011.05623v3,arxiv,latex,pandoc,"Fooling the primate brain with minimal, target...","[Li Yuan, Will Xiao, Giorgia Dellaferrera, Gab...",2020-11-11 08:30:54+00:00,2022-03-30 05:36:53+00:00,http://arxiv.org/abs/2011.05623v3,Artificial neural networks (ANNs) are consider...,,...,pos,1.0,,,\begin{thebibliography}{10}\n\expandafter\ifx\...,"@inproceedings{he2015delving,\n title={Delvin...","{'1312.6199': True, '1412.6572': True, '1802.0...","{'source': 'alignment-newsletter', 'source_typ...",,"Fooling the primate brain with minimal, target..."


In [None]:
from sentence_transformers import SentenceTransformer
# from sklearn.preprocessing import normalize

model = SentenceTransformer('allenai-specter')


## Components of a Pinecone vector embedding

There are three components to every Pinecone vector embedding:
 - a vector ID
 - a sequence of floats of a user-defined, fixed dimension
 - vector metadata (a key-value store)
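
As a minimal sketch, the three components can be pictured as a plain Python tuple, which is also the shape the upsert cell later in this notebook passes to Pinecone. The ID and metadata values are copied from the first row of the data shown earlier. The vector is a toy 4-dimensional placeholder (the real embeddings here have 768 dimensions), and `is_valid_record` is a hypothetical helper written for illustration, not part of the Pinecone client.

```python
# A toy record: (vector ID, vector values, metadata).
record = (
    "1806.09055v2",                 # vector ID (a string)
    [0.12, -0.03, 0.58, 0.91],      # placeholder values; real vectors have 768 floats
    {                               # metadata: a key-value store
        "title": "DARTS: Differentiable Architecture Search",
        "authors": "Hanxiao Liu, Karen Simonyan, Yiming Yang",
        "url": "http://arxiv.org/abs/1806.09055v2",
    },
)


def is_valid_record(rec, dimension):
    """Hypothetical sanity check: string ID, fixed-length float vector, dict metadata."""
    vector_id, values, meta = rec
    return (
        isinstance(vector_id, str)
        and len(values) == dimension
        and all(isinstance(v, float) for v in values)
        and isinstance(meta, dict)
    )


print(is_valid_record(record, dimension=4))
```

A check like this could be run over all records before uploading, since every vector in an index must share the same dimension.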

### Prepare vector embeddings for upload

We will encode the articles for upload to Pinecone. This may take a while depending on your machine. We will use the index of the pandas dataframe for the vector ID, the pretrained model to generate the sequence of 768 floats, and the title, authors, url and abstract for details in the metadata.

#### Prepare metadata

We first encode the text of every paper. The function that follows then creates metadata from a single row of the dataframe. The metadata matters further down this notebook, where we may want to apply additional filters in our queries.

In [None]:
all_embeddings = model.encode(title_text_data, show_progress_bar=True)
# all_embeddings = normalize(all_embeddings)
all_embeddings.shape

(959, 768)

In [None]:
def get_vector_metadata_from_dataframe_row(df_row):
    """Return vector metadata."""
    vector_metadata = {
        'title': df_row['title'],
        'authors': ", ".join(df_row['authors']), 
        'abstract': df_row['abstract'],
        'url': df_row['url']
    }
    return vector_metadata

metadata = data.apply(get_vector_metadata_from_dataframe_row,axis=1).tolist()

In [None]:
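# Bundle each record as (id, vector, metadata) and save a local copy.
all_data = list(zip(ids, all_embeddings.tolist(), metadata))
# Use a context manager so the file is flushed and closed after writing.
with open(PATH + "arxiv_pos_list_embeddings.json", "w") as f:
    json.dump(all_data, f)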

We now have everything we need: a dense vector representation of each paper, along with its ID and metadata. Let's connect to Pinecone, ready for upserting our data.

To connect to a Pinecone instance you need an API key; you can get a [free API key here](https://app.pinecone.io).

If no index with our name exists yet, we create one with `create_index` and connect to it with `Index`. If it already exists, we connect to it and delete its old contents.

In [None]:
import pinecone
# Replace YOUR_API_KEY with your own key; avoid committing a real key to a shared notebook.
pinecone.init(api_key='YOUR_API_KEY', environment='us-west1-gcp')
index_name = 'alignment-lit'
# If the index doesn't exist, create it; otherwise delete its old contents.
if index_name not in pinecone.list_indexes():
  pinecone.create_index(name=index_name, dimension=all_embeddings.shape[1])
  index = pinecone.Index(index_name)
else:
  index = pinecone.Index(index_name)
  index.delete(deleteAll=True)

In [None]:
def chunks(lst, n):
    """A generator function that iterates through lst in batches.
    Each batch is of size n except possibly the last batch, which may be of 
    size less than n.
    """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
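
To make the batching behaviour concrete, here is a small self-contained sketch that repeats the generator above and runs it on stand-in data. The list of 959 integers is a placeholder for the real records (959 is the number of embeddings computed earlier), and the batch size of 100 matches the upsert cell in this notebook.

```python
def chunks(lst, n):
    """Yield successive batches of size n from lst (same logic as the cell above)."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


# 959 stand-in records, matching the number of embeddings computed earlier
toy_records = list(range(959))
batch_sizes = [len(batch) for batch in chunks(toy_records, 100)]

# Nine full batches of 100, then a final batch with the remaining 59 records
print(len(batch_sizes), batch_sizes[-1])
```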

In [None]:
chunk_size = 100
# Each chunk is already a list of (id, vector, metadata) tuples,
# so it can be passed to upsert directly.
for chunk in chunks(all_data, chunk_size):
  index.upsert(chunk)

## PostgreSQL

In [None]:
! docker run --name postgres0 -d  -p 5438:5432 -e POSTGRES_HOST_AUTH_METHOD=trust postgres

/bin/bash: docker: command not found


In [None]:
! docker logs postgres0 --tail 6

/bin/bash: docker: command not found


In [None]:
import psycopg2
conn = psycopg2.connect(host='localhost', port='5438', user='postgres', password='postgres')
cursor = conn.cursor()

In [None]:
# The PostgreSQL table is named after the Pinecone index. Hyphens are not
# valid in an unquoted SQL identifier, so swap them for underscores.
table_name = index_name.replace("-", "_")

# Delete any previously stored table for a clean run
drop_table = "DROP TABLE IF EXISTS " + table_name
cursor.execute(drop_table)
conn.commit()

try:
    # ids holds arXiv paper versions such as "1806.09055v2", so the column is text
    sql = "CREATE TABLE IF NOT EXISTS " + table_name + " (ids text, title text, authors text, url text, text text, vectors tsvector);"
    cursor.execute(sql)
    conn.commit()
    print("Created the postgres table successfully!")
except Exception as e:
    print("Can't create a postgres table: ", e)

In [None]:
# Template INSERT statement; the vendors table is not created anywhere in this notebook.
sql = """INSERT INTO vendors(vendor_name)
         VALUES(%s) RETURNING vendor_id;"""

In [None]:
# import os 
# import re

# # conn.rollback()
# # extra delimiters | and quotes in title/text causes parsing issues, must strip
# def clean_string(old_string):
#     return old_string.replace("|","").replace("\"","").replace("'","")


# def record_temp_csv(fname, ids, title, authors, urls, text):
#     with open(fname,'w') as f:
#         for i in range(len(ids)):
#             line = str(ids[i]) + "|" + clean_string(title[i]) + "|" + ", ".join(authors[i]) + \
#             "|" + clean_string(urls[i]) + "|" + clean_string(text[i]) + "\n"
#             f.write(line)

# def copy_data_to_pg(table_name, fname, conn, cur):
#     fname = os.path.join(os.getcwd(),fname)
#     try:
#         sql = "COPY " + table_name + " FROM STDIN DELIMITER '|' CSV HEADER"
#         cur.copy_expert(sql, open(fname, "r"))
#         conn.commit()
#         print("Inserted into Postgres successfully!")
#     except Exception as e:
#         print("Copy data into Postgres failed: ", e)
        
# DATA_WITH_IDS = 'arxiv_pos_list.csv'   

# record_temp_csv(DATA_WITH_IDS, ids, title_data, data['authors'].tolist(), data['url'].tolist(), text_data)
# copy_data_to_pg(table_name, DATA_WITH_IDS, conn, cursor)

## Querying

We now have the data in our index. Let's first perform a semantic search using a query sentence, which returns the most *semantically* similar papers.

We define the query and encode it with the same model we used for `title_text_data` before. When querying with `index.query` we pass the query vector as the first argument; *later*, when filtering for specific keywords, we will add the `filter` parameter.
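
To illustrate what a top-k similarity query computes, here is a minimal local sketch in plain Python. It is not how Pinecone works internally; it only shows the ranking idea on toy data. Cosine similarity is assumed as the metric, the 3-dimensional vectors are placeholders for the 768-dimensional embeddings, and `cosine` and `top_k` are hypothetical helper names. The IDs are taken from the data shown earlier.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k=5):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(vec_id, cosine(query_vec, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" keyed by paper ID
toy_vectors = {
    "1806.09055v2": [1.0, 0.0, 0.0],
    "1906.02530v2": [0.0, 1.0, 0.0],
    "1902.08265v1": [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.1, 0.0]

matches = top_k(toy_query, toy_vectors, k=2)
for vec_id, score in matches:
    print(f"{score:.2f}", vec_id)
```

The vector pointing in nearly the same direction as the query ranks first, and the orthogonal one is dropped from the top two. The scores printed by the real query cell can be read the same way: higher means closer.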

In [None]:
# Uncomment to reconnect in a fresh session; replace YOUR_API_KEY with your own key.
# from sentence_transformers import SentenceTransformer
# import pinecone
# pinecone.init(api_key='YOUR_API_KEY', environment='us-west1-gcp')
# index_name = 'alignment-lit'
# index = pinecone.Index(index_name)
# model = SentenceTransformer('allenai-specter')

In [None]:
import timeit
def run_search(query):
  xq = model.encode(query).tolist()
  return index.query(xq, top_k=5, includeMetadata=True)

t0 = timeit.default_timer()
result = run_search("What is AI Safety?")
t1 = timeit.default_timer()

for item in result["matches"]:
  print('{0:.2f}'.format(item["score"]), item["id"], item["metadata"]["title"])

print(f"Elapsed time: {t1-t0}s")

0.86 2201.10436v2 Safe AI -- How is this Possible?
0.86 1905.13053v1 Unpredictability of AI
0.85 2201.04632v1 The Concept of Criticality in AI Safety
0.85 2104.12582v3 Understanding and Avoiding AI Failures: A Practical Guide
0.84 2202.09292v1 System Safety and Artificial Intelligence
Elapsed time: 0.3725436909999189s


In [None]:
# pinecone.delete_index(name='alignment-lit')