# Stampy Extractive FAQ using Pinecone

This notebook shows how to use Pinecone to run a semantic search over the Stampy FAQ, and then how to extract a short answer span from the best match with a question-answering model.

https://github.com/pinecone-io/examples/blob/master/metadata_filtered_search/metadata_filtered_search.ipynb

We will use the `sentence-transformers` library to build our sentence embeddings and `pinecone-client` to talk to Pinecone. Both can be installed using `pip` like so:

In [57]:
!pip install sentence-transformers
!pip install pinecone-client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Download Data
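In this example we use the `sentence-transformers` library to encode each question-answer pair into a vector. More info on the pretrained models can be found [here](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models). First, we download the canonical answers from the Stampy wiki API.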

In [59]:
import requests, json
import pandas as pd

wikiURL = "https://stampy.ai/w/api.php?action=ask&query=[[Category:Answers]][[Canonical::true]][[OutOfScope::false]]%7Climit%3D1000%7Cformat%3Dplainlist%7C%3FAnswer%7C%3FAnswerTo&format=json"
wikiJSON = requests.get(wikiURL).json()["query"]["results"]

faq_list = []
for entry in wikiJSON.values():
  faq_list.append({
    "question": entry["printouts"]["AnswerTo"][0]["fulltext"],
    "answer": entry["printouts"]["Answer"][0],
    "url": entry["printouts"]["AnswerTo"][0]["fullurl"],  
  })

data = pd.DataFrame(faq_list)
data.head()

| | question | answer | url |
|---|---|---|---|
| 0 | A lot of concern appears to focus on human-lev... | AI is already superhuman at some tasks, for ex... | https://stampy.ai/wiki/A_lot_of_concern_appear... |
| 1 | Any AI will be a computer program. Why wouldn'... | While it is true that a computer program alway... | https://stampy.ai/wiki/Any_AI_will_be_a_comput... |
| 2 | Are Google, OpenAI, etc. aware of the risk? | The major AI companies are thinking about this... | https://stampy.ai/wiki/Are_Google,_OpenAI,_etc... |
| 3 | Are there types of advanced AI that would be s... | We don’t yet know which AI architectures are s... | https://stampy.ai/wiki/Are_there_types_of_adva... |
| 4 | Aren't robots the real problem? How can AI cau... | What’s new and potentially risky is not the ab... | https://stampy.ai/wiki/Aren%27t_robots_the_rea... |


In [61]:
from google.colab import drive

drive.mount('/content/drive/', force_remount=True)
PATH = "/content/drive/My Drive/Colab Notebooks/data/"

file_prefix = "stampyQA"
data.to_json(PATH + file_prefix + ".json", orient="records")

Mounted at /content/drive/


In [60]:
# Get questions and answers.
title_data = data['question'].tolist()
text_data = data['answer'].tolist()
title_text_data = data['question'].map(str) + "[SEP]" + data['answer'].map(str)
data['text_to_encode'] = title_text_data
ids = data.index.tolist()

In [62]:
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize

model = SentenceTransformer('allenai-specter')

### Components of a Pinecone vector embedding

There are three components to every Pinecone vector embedding:
 - a vector ID
 - a sequence of floats of a user-defined, fixed dimension
 - vector metadata (a key-value store)
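
As a minimal sketch of how these three components fit together, the record below is written as a plain Python tuple in the `(id, values, metadata)` order that this notebook later passes to `index.upsert`. The 4-dimensional vector and all metadata values are made-up placeholders; the real vectors in this notebook have 768 dimensions.

```python
# Illustrative only: a toy record with made-up values (not real FAQ data).
vector_id = "0"                       # the vector ID, stored as a string
values = [0.12, -0.03, 0.48, 0.91]    # fixed-dimension sequence of floats
metadata = {                          # key-value store attached to the vector
    "title": "Example question?",
    "text": "Example answer.",
    "url": "https://example.com/faq/0",
}

record = (vector_id, values, metadata)

# Every vector in one index must have the same number of dimensions.
assert all(isinstance(x, float) for x in record[1])
print(len(record[1]))
```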

### Prepare vector embeddings for upload

We will encode the question-answer pairs for upload to Pinecone. This may take a while depending on your machine. We will use the index of the pandas dataframe for the vector ID, the pretrained model to generate the sequence of 768 floats, and the question (as `title`), answer (as `text`) and `url` for the metadata.

#### Prepare metadata

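After the encoding step, the function `get_vector_metadata_from_dataframe_row` creates metadata from a single row of the dataframe. The metadata matters further down this notebook: it is returned with each query match (we read the title and answer text from it), and it can also support additional filter requirements in our queries.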

In [63]:
all_embeddings = model.encode(title_text_data, show_progress_bar=True)
# all_embeddings = normalize(all_embeddings)
all_embeddings.shape

Batches:   0%|          | 0/7 [00:00<?, ?it/s]

(201, 768)

In [64]:
def get_vector_metadata_from_dataframe_row(df_row):
    """Return vector metadata."""
    vector_metadata = {
        'title': df_row['question'],
        'text': df_row['answer'],
        'url': df_row['url']
    }
    return vector_metadata

metadata = data.apply(get_vector_metadata_from_dataframe_row,axis=1).tolist()

In [65]:
all_data = list(zip(ids, all_embeddings.tolist(), metadata))
# Use a context manager so the file is flushed and closed after writing.
with open(PATH + file_prefix + "_embeddings.json", "w") as f:
    json.dump(all_data, f)

We now have everything we need: a dense vector representation of each question-answer pair. Next we connect to a Pinecone instance, ready for upserting our data; you can get a [free API key here](https://app.pinecone.io).

If the index does not exist yet, we create it with `create_index` and connect with `Index`. Otherwise we connect to the existing index and delete the old contents of our namespace.

In [69]:
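import os
import pinecone

# Assumption: the API key is stored in the PINECONE_API_KEY environment
# variable rather than hard-coded in the notebook.
pinecone.init(api_key=os.environ['PINECONE_API_KEY'], environment='us-west1-gcp')
index_name = 'alignment-lit'
namespace = 'faq'

# If the index doesn't exist, create it; otherwise delete the old contents of the namespace.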
if index_name not in pinecone.list_indexes():
  pinecone.create_index(name=index_name, dimension=all_embeddings.shape[1])
  index = pinecone.Index(index_name)
else:
  index = pinecone.Index(index_name)
  index.delete(deleteAll=True, namespace=namespace)

In [67]:
def chunks(lst, n):
    """A generator function that iterates through lst in batches.
    Each batch is of size n except possibly the last batch, which may be of 
    size less than n.
    """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
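
To make the batching behaviour concrete, here is a small self-contained sketch. `batched` is a hypothetical variant of the `chunks` helper above (it is not part of the original notebook) that uses `itertools.islice`, so it also works on generators and other iterables that do not support slicing.

```python
from itertools import islice

def batched(iterable, n):
    """Yield lists of up to n items from any iterable (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# Five items in batches of two: the last batch is shorter than the others.
batches = list(batched(range(5), 2))
print(batches)  # [[0, 1], [2, 3], [4]]
```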

In [75]:
chunk_size = 100
for chunk in chunks(all_data, chunk_size):
  # Convert each dataframe index to a string ID; avoid shadowing the built-in `id`.
  upserts = [(str(vec_id), vector, meta) for (vec_id, vector, meta) in chunk]
  index.upsert(upserts, namespace=namespace)

## Querying & Answer Extraction

Now that the data is in our index, we can perform a semantic search: given a query sentence, we return the most *semantically* similar FAQ entries.

We define the query and encode it with the same model we used for `title_text_data`. When querying with `index.query` we pass the query vector as our first argument; a `filter` parameter can be added to restrict matches by metadata, though this notebook does not use one. To extract a short answer from the best match, we first load a question-answering reader model.
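
To illustrate what a top-k semantic search computes, the sketch below ranks a few toy vectors by cosine similarity in plain Python. It is a conceptual stand-in and not Pinecone's implementation: the 3-dimensional vectors, the IDs, and the `top_k_cosine` helper are made up for illustration, and it assumes the index uses the cosine metric.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_cosine(query, vectors, k):
    """Return the k (score, id) pairs most similar to the query (hypothetical helper)."""
    scored = [(cosine(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

# Toy "index": IDs mapped to made-up 3-dimensional vectors.
toy_index = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}

matches = top_k_cosine([1.0, 0.0, 0.0], toy_index, k=2)
for score, vec_id in matches:
    print(f"{score:.2f} {vec_id}")
```

A real vector database avoids scoring every stored vector like this, but the ranking idea is the same: vectors pointing in nearly the same direction as the query come first.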

In [97]:
from transformers import pipeline

model_name = "deepset/electra-base-squad2"
# load the reader model into a question-answering pipeline
reader = pipeline(model=model_name, task="question-answering")

In [102]:
query = "What is AI Safety"
xq = model.encode(query).tolist()
result = index.query(xq, namespace=namespace, top_k=5, includeMetadata=True)
for item in result["matches"]:
  print('{0:.2f}'.format(item["score"]), item["id"], item["metadata"]["title"])

0.85 90 What are the differences between “AI safety”, “AGI safety”, “AI alignment” and “AI existential safety”?
0.84 96 What can I do to contribute to AI safety?
0.83 13 Can you give an AI a goal which involves “minimally impacting the world”?
0.82 54 I'm interested in working on AI safety. What should I do?
0.82 178 Why is safety important for smarter-than-human AI?


In [103]:
context = result["matches"][0]["metadata"]["text"]
answer = reader(question=query, context=context)
print(answer)
print(context)

{'score': 0.008653157390654087, 'start': 561, 'end': 625, 'answer': 'AI existential safety is a slightly broader term than AGI safety'}
AI alignment is the research field focused on trying to give us the tools to align AIs to specific goals, such as human values. This is crucial when they are highly competent, as a misaligned superintelligence could be the end of human civilization.

AGI safety is the field trying to make sure that when we build Artificial General Intelligences they are safe and do not harm humanity. It overlaps with AI alignment strongly, in that misalignment of AI would be the main cause of unsafe behavior in AGIs, but also includes misuse and other governance issues.

AI existential safety is a slightly broader term than AGI safety, including AI risks which pose an existential threat without necessarily being as general as humans.

AI safety was originally used by the existential risk reduction movement for the work done to reduce the risks of misaligned superintell

{'score': 0.04587697610259056, 'start': 265, 'end': 346, 'answer': 'the field trying to make sure that when we build Artificial General Intelligences'}
AI alignment is the research field focused on trying to give us the tools to align AIs to specific goals, such as human values. This is crucial when they are highly competent, as a misaligned superintelligence could be the end of human civilization.

AGI safety is the field trying to make sure that when we build Artificial General Intelligences they are safe and do not harm humanity. It overlaps with AI alignment strongly, in that misalignment of AI would be the main cause of unsafe behavior in AGIs, but also includes misuse and other governance issues.

AI existential safety is a slightly broader term than AGI safety, including AI risks which pose an existential threat without necessarily being as general as humans.

AI safety was originally used by the existential risk reduction movement for the work done to reduce the risks of misaligned superintelligence, but has also been adopted by researchers and others studying nearer term and less catastrophic risks from AI in recent years.


In [None]:
# from pprint import pprint

# # extracts answer from the context passage
# def extract_answer(question, context):
#     results = []
#     for c in context:
#         # feed the reader the question and contexts to extract answers
#         answer = reader(question=question, context=c)
#         # add the context to answer dict for printing both together
#         answer["context"] = c
#         results.append(answer)
#     # sort the results by the reader model's score, highest first
#     # (pprint returns None, so sort first and print separately)
#     sorted_result = sorted(results, key=lambda x: x["score"], reverse=True)
#     pprint(sorted_result)
#     return sorted_result

In [None]:
# pinecone.delete_index(name='alignment-lit')