### Query Pinecone and Feast Stores
In this notebook, we load the Pinecone embedding store with values from Feast. Pinecone supports similarity search: we can query the store with a new vector and get back the stored vectors that are closest to it.
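
To make the idea of similarity search concrete, here is a minimal NumPy sketch of a top-k cosine lookup. It only illustrates the concept and is not how Pinecone works internally; the function name `top_k_cosine` and the toy vectors are assumptions made for this example.

```python
import numpy as np

def top_k_cosine(query, stored, k=2):
    """Return (index, score) pairs for the k stored vectors most similar to query."""
    stored = np.asarray(stored, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors
    stored_norm = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = stored_norm @ query_norm
    # Sort by descending score and keep the first k positions
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-dimensional "embeddings" (hypothetical data)
stored_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matches = top_k_cosine([2.0, 0.1], stored_vectors, k=2)
print(matches)
```

The query points almost along the first axis, so the first stored vector ranks highest, followed by the diagonal one.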

Feast can store both questions and answers, so we can use the best-matching vector to predict the answer to a question a user asks. First, we follow the [tutorial from Pinecone](https://www.pinecone.io/docs/examples/question-answering/) with a few modifications to the API, using a Quora duplicate-question dataset.

We then change the underlying data and the model so that we can return answers, using Google's Natural Questions dataset and a model from Hugging Face.
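
As a rough sketch of what "using the best-matching vector to predict the answer" could look like, the snippet below picks the highest-scoring match and looks up its stored answer. The helper `predict_answer`, the `min_score` cutoff, and the toy ids and answers are all hypothetical; the notebook itself does not define them.

```python
def predict_answer(matches, answers, min_score=0.5):
    """Return the stored answer for the best match, or None if nothing is close enough."""
    if not matches:
        return None
    # Each match is an (id, score) pair; pick the highest score
    best_id, best_score = max(matches, key=lambda m: m[1])
    if best_score < min_score:
        return None
    return answers.get(best_id)

# Hypothetical stored answers keyed by question id
answers = {"q1": "answer to q1", "q2": "answer to q2"}
matches = [("q1", 0.42), ("q2", 0.81)]
print(predict_answer(matches, answers))
```

The cutoff is one possible design choice: without it, the nearest stored question is always returned, even when it is not actually similar to the query.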

In [1]:
import os
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from feast import FeatureStore
import pinecone
import datetime

#### Pinecone Tutorial

In [2]:
# Initiate pinecone and load model object
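# Read the API key from the environment instead of hard-coding it
# (the variable name PINECONE_API_KEY is an assumption)
api_key = os.environ['PINECONE_API_KEY']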
pinecone.init(api_key=api_key)
pinecone.list_indexes()

os.chdir('feature_repo')
model = SentenceTransformer('average_word_embeddings_komninos')

In [21]:
# Create a new vector index
index_name = 'feast-questions'

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
    
pinecone.create_index(name=index_name, dimension=300, metric='cosine', shards=1)

In [22]:
# Connect to the created index and load it with feast vectors
# Set up
question_ids = pd.read_parquet('./feature_repo/data/questions.parquet', columns=['qid1'])
BATCH_SIZE = 450
dimensions = 300
store = FeatureStore(repo_path="./feature_repo/")
index = pinecone.Index('feast-questions')

# Chunk upserts into the pinecone database
for start in range(0, len(question_ids), BATCH_SIZE):
    batch = question_ids[start: start+BATCH_SIZE]

    feature_vectors = store.get_online_features(
        features=[f'questions:e_{i}' for i in range(dimensions)],
        entity_rows=[{"qid1":_id} for _id in batch.qid1.to_list()]
    ).to_dict()

    # Prepare list of (id, vector) items to upload into Pinecone's index
    items_to_insert = []

    for e in range(len(feature_vectors['qid1'])):
        vector = [feature_vectors[f'e_{i}'][e] for i in range(dimensions)]
        items_to_insert.append((str(feature_vectors['qid1'][e]), vector))

    # Upsert batch data (build the tuple once per batch, not once per item)
    index.upsert(vectors=tuple(items_to_insert))
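
The chunking pattern used in the loop above can be isolated into a small helper, which makes the batch boundaries easy to check. This is a sketch only: `batched` is a hypothetical name, and the batch size of 450 simply mirrors `BATCH_SIZE` from the cell above.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

ids = list(range(1000))
batches = list(batched(ids, 450))
print([len(b) for b in batches])
```

The last batch is shorter than the others whenever the number of items is not a multiple of the batch size.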

In [49]:
# Create dataframe for new questions
df_new_questions = pd.DataFrame([[1000001, 'How can I make money using Youtube?'], 
                                 [1000002, 'What is the best book for learning Python?']], 
                                    columns=['qid1', 'question1'])

# Create embedding for each question
model = SentenceTransformer('average_word_embeddings_komninos')
df_new_questions['question_vector'] = df_new_questions.question1.apply(lambda x: model.encode(str(x), show_progress_bar=False))

# Create timestamps 
df_new_questions['created'] = datetime.datetime.utcnow()
df_new_questions['datetime'] = df_new_questions['created'].dt.floor('h')

# Generate columns for vector elements
df_new_questions2 = df_new_questions.question_vector.apply(pd.Series)
df_new_questions2.columns = [f'e_{i}' for i in range(300)]
result = pd.concat([df_new_questions, df_new_questions2], axis=1)

# Exclude some columns
result = result.drop(['question_vector'], axis=1)

# Change directory if needed
if os.path.basename(os.getcwd()) != 'feature_repo':
    os.chdir('feature_repo')

# Save to parquet file
result.to_parquet('./data/test_questions.parquet')

04/01/2022 04:03:20 PM INFO:Load pretrained SentenceTransformer: average_word_embeddings_komninos
04/01/2022 04:04:23 PM INFO:Use pytorch device: cpu
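
The cell above spreads each embedding over one column per element (`e_0`, `e_1`, ...). The sketch below shows the same reshaping on toy data, using `np.vstack` as an alternative to `apply(pd.Series)`. The 4-element vectors and the variable names `dim` and `wide` are assumptions for illustration; the notebook uses 300 dimensions at this point.

```python
import numpy as np
import pandas as pd

dim = 4  # toy dimension for the example
df = pd.DataFrame({
    "qid1": [1, 2],
    "question_vector": [np.arange(dim, dtype=float), np.ones(dim)],
})

# Stack the per-row arrays into a 2-D matrix, then name one column per element
wide = pd.DataFrame(
    np.vstack(df["question_vector"].tolist()),
    columns=[f"e_{i}" for i in range(dim)],
    index=df.index,
)
result = pd.concat([df.drop(columns="question_vector"), wide], axis=1)
print(result.columns.tolist())
```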


In [51]:
# Fetch the feature store and get feature vectors for the query questions
store = FeatureStore(repo_path=".")

feature_vectors = store.get_online_features(
    features=['test_questions:question1',
                  *[f'test_questions:e_{i}'
                    for i in range(300)
                  ]],
    entity_rows=[{"qid1":_id} for _id in df_new_questions.qid1.tolist()]
).to_dict()

# Prepare list of vectors to query Pinecone
query_vectors = []

for e in range(len(feature_vectors['qid1'])):
    l = [feature_vectors[f'e_{i}'][e] for i in range(300)]
    query_vectors.append(l)
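
The loop above turns the column-oriented dictionary returned by `to_dict()` back into one vector per entity. Here is a compact sketch of that transposition on toy data; `columns_to_vectors` and the 3-element vectors are hypothetical and stand in for the real 300 feature columns.

```python
def columns_to_vectors(feature_dict, dim, prefix="e_"):
    """Rebuild one vector per entity from a column-oriented feature dict."""
    columns = [feature_dict[f"{prefix}{i}"] for i in range(dim)]
    # zip(*columns) transposes: one tuple per entity, in entity order
    return [list(row) for row in zip(*columns)]

toy = {"qid1": [10, 11], "e_0": [0.1, 0.4], "e_1": [0.2, 0.5], "e_2": [0.3, 0.6]}
vectors = columns_to_vectors(toy, dim=3)
print(vectors)
```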

In [125]:
# Query Pinecone's index
query_results = index.query(queries=query_vectors, top_k=5)

# Show results
for e, res in enumerate(query_results['results']):
    print(e)
    print('Original question: ' + feature_vectors['question1'][e])
    print('Most similar questions based on Pinecone vector search: ')

    # Fetch from Feast to get question text
    ids = [d['id'] for d in res.matches]
    scores = [d['score'] for d in res.matches]
    
    result_feature_vectors = store.get_online_features(
        features=['questions:question1'],
        entity_rows=[{"qid1":int(_id)} for _id in ids]
    ).to_dict()

    # Prepare and display table
    df_result = pd.DataFrame({'id':ids,
                              'question': result_feature_vectors['question1'],
                              'score':scores})
    display(df_result)


0
Original question: How can I make money using Youtube?
Most similar questions based on Pinecone vector search: 


EntityNotFoundException: Entity qid1 does not exist in project feature_repo

In [81]:
result_feature_vectors

{'qid1': [1292, 14375, 1126, 3759, 157],
 'question1': ['How do I make money with YouTube?',
  'How do I make money using Instagram?',
  'How can I earn money from YouTube?',
  'How do you make money giving through a app?',
  'How can I make money through the Internet?']}

#### Recreating with a New Dataset

In [97]:
# Delete the existing index, as Pinecone only allows one
old_index = 'feast-questions'
if old_index in pinecone.list_indexes():
    pinecone.delete_index(old_index)

# Create a new vector index
index_name = 'nq-questions'
dimensions = 384 

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
    
pinecone.create_index(name=index_name, dimension=dimensions, metric='cosine', shards=1)
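
The index is created with a fixed dimension (300 for the first model, 384 here), so every vector sent to it needs that exact length. A small pre-flight check along these lines can surface a mismatch before any upsert is attempted. `check_dimensions` is a hypothetical helper, not part of the Pinecone client.

```python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError if any (id, values) pair has the wrong vector length."""
    for item_id, values in vectors:
        if len(values) != expected_dim:
            raise ValueError(
                f"vector {item_id!r} has {len(values)} elements, expected {expected_dim}"
            )
    return len(vectors)

# Two well-formed toy items for a 384-dimensional index
items = [("1", [0.0] * 384), ("2", [0.5] * 384)]
checked = check_dimensions(items, 384)
print(checked)
```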

In [102]:
# Connect to the created index and load it with feast vectors
# Set up
question_ids = pd.read_parquet('./data/NQ_questions.parquet', columns=['qid'])
BATCH_SIZE = 450
dimensions = 384
store = FeatureStore(repo_path=".")
index = pinecone.Index('nq-questions')

# Chunk upserts into the pinecone database
for start in range(0, len(question_ids), BATCH_SIZE):
    batch = question_ids[start: start+BATCH_SIZE]

    feature_vectors = store.get_online_features(
        features=[f'questions:e_{i}' for i in range(dimensions)],
        entity_rows=[{"qid":_id} for _id in batch.qid.to_list()]
    ).to_dict()

    # Prepare list of (id, vector) items to upload into Pinecone's index
    items_to_insert = []

    for e in range(len(feature_vectors['qid'])):
        vector = [feature_vectors[f'e_{i}'][e] for i in range(dimensions)]
        items_to_insert.append((str(feature_vectors['qid'][e]), vector))

    # Upsert batch data (build the tuple once per batch, not once per item)
    index.upsert(vectors=tuple(items_to_insert))

In [135]:
# Create dataframe for new questions
df_new_questions = pd.DataFrame([[1000001, 'What is the capital of Pennsylvania?'], 
                                 [1000002, 'How nutritious are apples?']], 
                                    columns=['qid', 'question1'])

# Create embedding for each question
model = SentenceTransformer('all-MiniLM-L6-v2')
df_new_questions['question_vector'] = df_new_questions.question1.apply(lambda x: model.encode(str(x), show_progress_bar=False))

# Create timestamps 
df_new_questions['created'] = datetime.datetime.utcnow()
df_new_questions['datetime'] = df_new_questions['created'].dt.floor('h')

# Generate columns for vector elements
df_new_questions2 = df_new_questions.question_vector.apply(pd.Series)
df_new_questions2.columns = [f'e_{i}' for i in range(dimensions)]
result = pd.concat([df_new_questions, df_new_questions2], axis=1)

# Exclude some columns
result = result.drop(['question_vector'], axis=1)

# Change directory if needed
if os.path.basename(os.getcwd()) != 'feature_repo':
    os.chdir('feature_repo')

# Save to parquet file
result.to_parquet('./data/test_questions.parquet')


04/01/2022 10:40:21 PM INFO:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
04/01/2022 10:40:22 PM INFO:Use pytorch device: cpu


In [136]:
# Fetch the feature store and get feature vectors for the query questions
store = FeatureStore(repo_path=".")

feature_vectors = store.get_online_features(
    features=['test_questions:question1',
                  *[f'test_questions:e_{i}'
                    for i in range(dimensions)
                  ]],
    entity_rows=[{"qid":_id} for _id in df_new_questions.qid.tolist()]
).to_dict()

# Prepare list of vectors to query Pinecone
query_vectors = []

for e in range(len(feature_vectors['qid'])):
    l = [feature_vectors[f'e_{i}'][e] for i in range(dimensions)]
    query_vectors.append(l)

In [139]:
# Query Pinecone's index
query_results = index.query(queries=query_vectors, top_k=5)

# Show results
for e, res in enumerate(query_results['results']):
    print(e)
    print('Original question: ' + feature_vectors['question1'][e])
    print('Most similar questions based on Pinecone vector search: ')

    # Fetch from Feast to get question text
    ids = [d['id'] for d in res.matches]
    scores = [d['score'] for d in res.matches]
    
    result_feature_vectors = store.get_online_features(
        features=['questions:document_url'],
        entity_rows=[{"qid":int(_id)} for _id in ids]
    ).to_dict()

    # Prepare and display table
    df_result = pd.DataFrame({'id':ids,
                              'question': result_feature_vectors['document_url'],
                              'score':scores})
    display(df_result)


0
Original question: What is the capital of Pennsylvania?
Most similar questions based on Pinecone vector search: 


|   | id | question | score |
|---|---|---|---|
| 0 | 6408 | https://en.wikipedia.org//w/index.php?title=Province_of_Pennsylvania&oldid=865431192 | 0.633106 |
| 1 | 2179 | https://en.wikipedia.org//w/index.php?title=Philadelphia&oldid=807750048 | 0.631774 |
| 2 | 9229 | https://en.wikipedia.org//w/index.php?title=Pennsylvania&oldid=866292516 | 0.589458 |
| 3 | 7379 | https://en.wikipedia.org//w/index.php?title=Pennsylvania_State_University&oldid=831460048 | 0.559006 |
| 4 | 6384 | https://en.wikipedia.org//w/index.php?title=West_Virginia&oldid=835287674 | 0.54013 |


1
Original question: How nutritious are apples?
Most similar questions based on Pinecone vector search: 


|   | id | question | score |
|---|---|---|---|
| 0 | 6550 | https://en.wikipedia.org//w/index.php?title=An_apple_a_day_keeps_the_doctor_away&oldid=829882694 | 0.535801 |
| 1 | 531 | https://en.wikipedia.org//w/index.php?title=Apples_and_Bananas&oldid=795113537 | 0.512015 |
| 2 | 999 | https://en.wikipedia.org//w/index.php?title=An_apple_a_day_keeps_the_doctor_away&oldid=805434402 | 0.467765 |
| 3 | 4380 | https://en.wikipedia.org//w/index.php?title=Golden_Apples_of_the_Sun&oldid=695291390 | 0.45866 |
| 4 | 2384 | https://en.wikipedia.org//w/index.php?title=Waterlogging_(agriculture)&oldid=791681328 | 0.441215 |
