The PDF must be parseable, meaning it contains selectable text (you should be able to highlight the text in a PDF editor).

This program computes an embedding for each page of a PDF, takes a question as input, finds the most relevant reference text (page), and generates a response from it. No model is trained; the PDF is only indexed. This should be run before the criteria evaluation and embedding distance.

Imports

In [None]:
import google.generativeai as palm
import numpy as np
import pandas as pd
import textwrap
import os
from langchain.document_loaders import PyPDFLoader

Storing your API key in an environment variable and configuring the client

In [None]:
os.environ["GOOGLE_API_KEY"] = ""
# Insert your own Google API key between the quotation marks above
palm.configure(api_key=os.environ["GOOGLE_API_KEY"])

Finding an embedding model to use (the first model that supports embedText)

In [None]:
models = [m for m in palm.list_models() if 'embedText' in m.supported_generation_methods]
model1 = models[0].name
#print(model1)

Loading and splitting the PDF

In [None]:
pdf_loader = PyPDFLoader("")
# Put your PDF file name between the quotation marks above. Ensure that this file and the PDF are in the same folder.
pages = pdf_loader.load_and_split()

# Loading may take a while for long PDFs.

# Collect the extracted text of each page
content = [page.page_content for page in pages]

# Optional sample code to test that the embeddings work
# (index 0 is used so the test also works on short PDFs)
# sample_text = content[0]
# print(sample_text)
# embedding = palm.generate_embeddings(model=model1, text=sample_text)
# print(embedding)
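
Because the loader can only return text that is actually embedded in the PDF, it can help to check the extracted pages before computing embeddings. The sketch below is only an illustration: `sample_content` stands in for the `content` list built above, and `keep_non_empty` is a hypothetical helper, not part of LangChain.

```python
def keep_non_empty(texts):
    """Return only the entries that contain non-whitespace text."""
    return [t for t in texts if t.strip()]

# Stand-in for the `content` list (assumption: a page without a text layer
# may come back as an empty or whitespace-only string).
sample_content = ["Chapter 1: Introduction", "", "   \n", "Chapter 2: Methods"]

usable = keep_non_empty(sample_content)
print(f"{len(usable)} of {len(sample_content)} pages contain text")
```

If very few pages survive this check, the PDF is probably a scan and needs OCR before it can be used here.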

Converting the content into a pandas dataframe so that you can add extra columns to it.

In [None]:
df = pd.DataFrame({"Text": content})
df

Adding the embeddings to the pandas dataframe

In [None]:
def embed_fn(text):
  return palm.generate_embeddings(model=model1, text=text)['embedding']

df['Embeddings'] = df['Text'].apply(embed_fn)
df

User input to enter the question

In [None]:
question = input("Enter a question: ")
# Depending on your editor, the input box may appear at the top of the window

Function that finds the best passage for a given question

In [None]:
def find_best_passage(query, dataframe):
  """
  Score each passage in the dataframe against the query using the dot product
  of their embeddings, and return the text of the highest-scoring passage.
  """
  query_embedding = palm.generate_embeddings(model=model1, text=query)
  dot_products = np.dot(np.stack(dataframe['Embeddings']), query_embedding['embedding'])
  idx = np.argmax(dot_products)
  return dataframe.iloc[idx]['Text'] # Return text from index with max value
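
To make the ranking step concrete, here is a minimal sketch that uses hand-made 3-dimensional vectors in place of real embeddings (the vectors and passage names are invented for illustration). It also computes cosine similarity for comparison, because the raw dot product favours longer vectors. Whether the embeddings returned by the API are already normalized is not established here, so treat that as something to verify.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by both vector lengths, so only direction matters."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Invented toy "embeddings"; real ones have hundreds of dimensions.
passages = {
    "page about cats": [0.9, 0.1, 0.0],
    "page about dogs": [0.1, 0.9, 0.0],
    "long page, mixed topics": [3.0, 3.0, 3.0],
}
query = [1.0, 0.0, 0.0]

best_by_dot = max(passages, key=lambda name: dot(passages[name], query))
best_by_cosine = max(passages, key=lambda name: cosine(passages[name], query))

print("dot product picks:", best_by_dot)        # long page, mixed topics
print("cosine similarity picks:", best_by_cosine)  # page about cats
```

If the stored embeddings all have unit length, the two scores rank passages identically and the plain dot product is sufficient.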

Finding the best passage for the question and printing it out (the print line is optional and can be commented out)

In [None]:
best_passage = find_best_passage(question, df)
print(best_passage)

Function that builds a prompt for the model from the question and the best passage

In [None]:
def make_prompt(query, relevant_passage):
  # Strip quote characters from the passage before inserting it into the prompt
  escaped = relevant_passage.replace("'", "").replace('"', "")
  # The backslash after the opening quotes keeps every template line at the same
  # indentation, which lets textwrap.dedent remove it
  prompt = textwrap.dedent("""\
  You are a helpful and informative bot that answers questions using text from the reference passage included below.
  Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.
  However, you are talking to a non-technical audience, so be sure to break down complicated concepts and strike a friendly and conversational tone.
  If the passage is irrelevant to the answer, you may ignore it.
  QUESTION: '{query}'

  PASSAGE:
  {relevant_passage}

  ANSWER:
  """).format(query=query, relevant_passage=escaped)

  return prompt
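
`textwrap.dedent` removes only the whitespace that is common to every non-blank line, so a template whose first line starts directly after the opening quotes is left unchanged. A small sketch of that behaviour, using made-up two-line templates:

```python
import textwrap

# The first line has no indentation, so the common prefix is empty
# and dedent removes nothing.
unchanged = textwrap.dedent("""QUESTION: {query}
  PASSAGE: {passage}
""")

# A backslash after the opening quotes puts every line at the same
# indentation, so dedent can strip it.
dedented = textwrap.dedent("""\
  QUESTION: {query}
  PASSAGE: {passage}
""")

print(repr(unchanged))
print(repr(dedented))
```

Dedenting the template before calling `.format` also matters: a multi-line passage inserted first could change the common indentation.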

Function that generates the answer from the prompt and prints each candidate

In [None]:
def generate_answer(prompt):
    text_models = [m for m in palm.list_models() if 'generateText' in m.supported_generation_methods]
    text_model = text_models[0]
    temperature = 0
    # Feel free to experiment with the temperature:
    # 0 gives the most repeatable responses, values closer to 1 give more varied ones

    answer = palm.generate_text(prompt=prompt,
                            model=text_model,
                            candidate_count=1,
                            temperature=temperature,
                            max_output_tokens=1000)
    # feel free to experiment with the candidate count (the number of answers it gives)
    
    for i, candidate in enumerate(answer.candidates):
        print(f"Candidate {i}: {candidate['output']}\n")
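
As a rough illustration of what the temperature setting controls, the sketch below applies a temperature-scaled softmax to three made-up token scores. This is the general technique, not a description of how the PaLM API works internally, and `softmax_with_temperature` is a hypothetical helper. A temperature of exactly 0 would divide by zero in this formula; it is conventionally treated as always picking the highest-scoring token.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens

for t in (0.1, 1.0):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature almost all probability lands on the top-scoring token, which is why low settings give repeatable answers.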

Building the prompt, generating the answer, and printing both

In [None]:
prompt = make_prompt(question, best_passage)
print(prompt)
generate_answer(prompt)