<a href="https://colab.research.google.com/github/finardi/WatSpeed_LLM_foundation/blob/main/Module_3_Integrating_LLMs_and_a_search_engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Module 3 - Integrating LLMs and a Search Engine

In this notebook, we will present an example of how to leverage a search engine to augment the capabilities of GPT-3.5-turbo. Our objective is to enhance the reliability and accuracy of the answers generated by the language model by providing it with additional evidence and information sourced from the web.

The methodology we employ in this notebook involves instructing GPT-3.5-turbo to output a search query when it lacks sufficient knowledge to answer a given user question. We then send the generated query to the Google Search API to obtain results from the search engine. The retrieved information is fed back to the model, enabling it to generate a more informed and reliable answer based on the search results.

By combining the natural language processing capabilities of GPT-3.5-turbo with the vast knowledge and resources available through the Google Search API, we create a symbiotic relationship between the language model and the search engine. This integration empowers the model to tap into a wealth of real-world information and leverage it to provide accurate and up-to-date answers to user queries.
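
The query-then-answer loop described above can be sketched in a few lines. This is a minimal illustration rather than the notebook's implementation: `answer_with_search`, `fake_chat`, and `fake_search` are hypothetical names, and the model and the search engine are replaced by stand-in functions so the flow can be followed without any API key.

```python
import re

def answer_with_search(question, chat_fn, search_fn):
    """Two-pass loop: ask the model, search if it requests a query, then ask again."""
    messages = [{"role": "user", "content": f"User: {question}"}]
    first = chat_fn(messages)

    # Assumed convention: the model writes `(search query): "..."` when it needs evidence.
    match = re.search(r'\(search query\):\s"([^"]*)"', first)
    if match is None:
        return first  # the model answered from its own knowledge

    evidence = search_fn(match.group(1))
    messages.append({"role": "assistant", "content": first})
    messages.append({"role": "user", "content": f"Search engine:\n{evidence}"})
    return chat_fn(messages)  # second pass, now with the evidence in context

# Stand-ins for the language model and the search engine (hypothetical)
def fake_chat(messages):
    if len(messages) == 1:
        return 'Wally (search query): "capital of France"'
    return "Wally: Paris is the capital of France."

def fake_search(query):
    return f"Text: result for {query}"

print(answer_with_search("What is the capital of France?", fake_chat, fake_search))
```

Passing the chat and search functions as arguments keeps the control flow separate from the two external services, which are introduced in the following sections.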

# Installing required packages

In this example, we need to install the `openai` library.

**`openai`**:

OpenAI is an artificial intelligence research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. The OpenAI library is a powerful machine learning library that provides an easy-to-use interface to the OpenAI API. With this library, users can easily integrate OpenAI's state-of-the-art language models, including GPT-3, into their applications, and leverage the full power of these models to perform various natural language processing (NLP) tasks, such as language generation, classification, question-answering, and more.

In [None]:
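# The code in this notebook uses the pre-1.0 interface (openai.ChatCompletion), so we pin a compatible version
!pip install openai==0.27.6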


# Using OpenAI API

To use the OpenAI API, we need to import the `openai` module and set our API key. In the code below, we set the `OPENAI_KEY` variable to our OpenAI API key and then assign it to the `openai.api_key` attribute, which configures the key for the session.

The function `generate_chat` takes in a list of messages and generates a response using the OpenAI Chat API. The `model` parameter specifies which model to use for generating the response. In the given code, we have used the `gpt-3.5-turbo` model. However, we can also use `gpt-4`.

**IMPORTANT:** Using the OpenAI API incurs costs, so choose the model and set the parameters carefully to avoid unnecessary expenses.

In [None]:
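import os
import openai

OPENAI_KEY = "" # @param set your OpenAI API key here

# Fall back to the OPENAI_API_KEY environment variable when the form field is left empty
openai.api_key = OPENAI_KEY or os.environ.get("OPENAI_API_KEY")

def generate_chat(messages, model="gpt-3.5-turbo"):
  """Send the conversation to the Chat API and return the text of the first choice."""
  # Note: openai.ChatCompletion is the pre-1.0 interface of the openai package
  response = openai.ChatCompletion.create(
    model=model,
    messages=messages,
    temperature=0  # reduce sampling randomness
  )
  return response["choices"][0]["message"]["content"]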
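
The cell above relies on `openai.ChatCompletion`, which belongs to the pre-1.0 releases of the `openai` package. For readers on `openai>=1.0`, the same request goes through a client object. The sketch below is an assumption-labeled adaptation, not part of the original notebook: `generate_chat_v1` is a hypothetical name, and a small stand-in client is used so the example runs offline.

```python
from types import SimpleNamespace

def generate_chat_v1(client, messages, model="gpt-3.5-turbo"):
    """Same behavior as generate_chat, written for the openai>=1.0 client interface."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # same setting as the original cell
    )
    return response.choices[0].message.content

# Real usage (assumes openai>=1.0 is installed and a valid key is available):
#   from openai import OpenAI
#   client = OpenAI(api_key=OPENAI_KEY)
#   generate_chat_v1(client, [{"role": "user", "content": "Hello"}])

# Offline stand-in client (hypothetical) that mimics the shape of the response object
def _fake_create(model, messages, temperature):
    message = SimpleNamespace(content=f"echo: {messages[-1]['content']}")
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
print(generate_chat_v1(fake_client, [{"role": "user", "content": "Hello"}]))
```

The rest of the notebook can keep calling `generate_chat`; only the body of that function would need to change when moving to the newer interface.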

# Using Google Search API

To utilize the Google Search API for integrating a search engine with GPT-3.5-turbo, we need to obtain an API key and create a custom programmable search engine. Follow the steps below to generate your API key and configure your custom search engine:

1. Visit the following link: [Google Custom Search API Overview](https://developers.google.com/custom-search/v1/overview).
2. Click on "Get a Key" to generate your API key. Make sure you have a valid Google account and are logged in.
3. Next, create a custom programmable search engine by visiting the following link: [Google Programmable Search Engine Control Panel](https://programmablesearchengine.google.com/controlpanel/all).
4. Follow the instructions provided to create your custom search engine. Note down the custom search engine ID as it will be required in the code.

Once you have obtained your API key and custom search engine ID, you can proceed with the implementation. The code snippet provided below demonstrates how to use the Google Search API to perform searches.

**NOTE**: The Custom Search JSON API allows 100 free search queries per day.

In [None]:
from googleapiclient.discovery import build

api_key = "" # @param API KEY
cx = "" # @param custom search engine ID

resource = build("customsearch", "v1", developerKey=api_key).cse()

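def search(query):
  """Run the query against the custom search engine and return the list of result items."""
  result = resource.list(q=query, cx=cx).execute()
  # "items" is absent from the response when the search returns no results
  return result.get("items", [])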
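
Each element returned by `search` is a dictionary; the fields this notebook reads later are `title`, `link`, and `snippet`. As a rough sketch of how such items can be turned into the text block that is fed back to the model, the helper below formats them in the "Title / URL / Text" layout used in the prompt. The name `format_search_results` and the `max_items` parameter are hypothetical, and the sample item is copied from the few-shot example in this notebook rather than from a live search.

```python
def format_search_results(items, max_items=5):
    """Render search items as the 'Title / URL / Text' blocks used in the prompt."""
    blocks = []
    for item in items[:max_items]:
        blocks.append(
            f"Title: {item.get('title', '')}\n"
            f"URL: {item.get('link', '')}\n"
            f"Text: {item.get('snippet', '')}"
        )
    return "Search engine:\n" + "\n\n".join(blocks)

# Sample item with the same shape as an entry of result["items"]
sample_items = [
    {
        "title": "Operation Market Garden",
        "link": "https://en.wikipedia.org/wiki/Operation_Market_Garden",
        "snippet": "Operation Market Garden was an Allied military operation ...",
    },
]
print(format_search_results(sample_items))
```

Capping the number of items is one way to keep the prompt short, since every snippet added to the conversation counts toward the tokens sent to the model.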

# Combining Google Search API and OpenAI API

Next, we proceed to implement the integration between the search engine and the Large Language Model (LLM). This integration aims to enhance the capabilities of the LLM by allowing it to answer user questions and search for missing information when needed.

The code below defines a Python function called `assistant` that implements an AI assistant. This assistant is designed to answer user questions and search for missing information when necessary. Here's an explanation of the code:

1. The **`assistant`** function takes three parameters: **`question`**, which represents the user's question, **`assistant_name`**, which is an optional parameter specifying the name of the assistant (default is "Wally"), and **`do_search`**, which is a boolean parameter indicating whether to perform a search for missing information (default is **`True`**).

2. The **`messages`** list contains a series of conversation messages between the user and the assistant. It includes the instructions for the assistant and few-shot examples that demonstrate what the input and output should look like.

3. The **`generate_chat`** function is called with the **`messages`** list as an argument to simulate a conversation between the user and the assistant. It returns the assistant's response.

4. The code uses a regular expression to look for a search query in the assistant's response. If a match is found, the search string is extracted and stored in the response. If the **`do_search`** parameter is also **`True`**, the code runs the query through the **`search`** function and appends the assistant's first response and the search results to the **`messages`** list.

5. After the search (if performed), the **`generate_chat`** function is called again with the updated **`messages`** list to obtain the final response from the assistant.

6. The code extracts the answer provided by the assistant by using a regular expression to match the assistant's name followed by a colon and a space, capturing the rest of that line as the answer.

7. Similarly, the code extracts the assistant's internal thoughts by using a regular expression to match the assistant's name followed by **"(internal thoughts):"** and capturing the rest of that line.

8. The extracted answer and thoughts are stored in a dictionary called **`response`**, and the dictionary is returned as the output of the **`assistant`** function.

In summary, the **`assistant`** function implements an AI assistant that can answer user questions and search for information if needed.
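
The three regular expressions described in steps 4, 6, and 7 can be tried in isolation on a hand-written response. The sketch below is illustrative only: `parse_response` is a hypothetical helper, the sample text is modeled on the few-shot example, and the assistant name is passed through `re.escape` as a precaution that the notebook code does not take.

```python
import re

def parse_response(text, assistant_name="Wally"):
    """Extract the internal thoughts, search query, and answer from a model response."""
    name = re.escape(assistant_name)  # guard against regex metacharacters in the name
    patterns = {
        "thoughts": rf"{name} \(internal thoughts\):\s(.*)",
        "search": rf'{name} \(search query\):\s"([^"]*)"',
        "answer": rf"{name}:\s(.*)",
    }
    parsed = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            parsed[key] = match.group(1)
    return parsed

sample = (
    "Wally (internal thoughts): The user asks about a specific event. "
    "I should follow rule 7 (searching).\n\n"
    'Wally (search query): "704th dropped supplies near Nijmegen operation date"\n\n'
    "Wally: I'm not certain about the specific operation."
)
print(parse_response(sample))
```

Because `.` does not match a newline by default, each pattern captures only up to the end of its line, so a multi-paragraph answer would be truncated to its first line.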


In [None]:
import re

instruction = """You are an AI assistant whose codename is {assistant_name}. {assistant_name} is trained before Sept-2021. During user conversations, {assistant_name} must strictly adhere to the following rules:

1 (ethical). {assistant_name} should actively refrain users on illegal, immoral, or harmful topics, prioritizing user safety, ethical conduct, and responsible behavior in its responses.
2 (informative). {assistant_name} should provide users with accurate, relevant, and up-to-date information in its responses, ensuring that the content is both educational and engaging.
3 (helpful). {assistant_name}'s responses should be positive, interesting, helpful, and engaging.
4 (question assessment). {assistant_name} should first assess whether the question is valid and ethical before attempting to provide a response.
5 (reasoning). {assistant_name}'s logics and reasoning should be rigorous, intelligent, and defensible.
6 (multi-aspect). {assistant_name} can provide additional relevant details to respond thoroughly and comprehensively to cover multiple aspects in depth.
7 (searching). If {assistant_name} does not have enough information to answer a user's question, {assistant_name} should output a query (search query) that can be used to search for the necessary information.
8 (knowledge recitation). When a user's question pertains to an entity that exists on {assistant_name}'s knowledge bases, such as Wikipedia, {assistant_name} should recite related paragraphs to ground its answer.
9 (static). {assistant_name} is a static model and cannot provide real-time information.
10 (numerical sensitivity). {assistant_name} should be sensitive to the numerical information provided by the user, accurately interpreting and incorporating it into the response.
11 (dated knowledge). {assistant_name}'s internal knowledge and information were only current until some point in the year 2021, and could be inaccurate / lossy.
12 (step-by-step). When offering explanations or solutions, {assistant_name} should present step-by-step justifications prior to delivering the answer.
13 (balanced & informative perspectives). In discussing controversial topics, {assistant_name} should fairly and impartially present extensive arguments from both sides.
14 (creative). {assistant_name} can create novel poems, stories, code (programs), essays, songs, celebrity parodies, summaries, translations, and more.
15 (operational). {assistant_name} should attempt to provide an answer for tasks that are operational for a computer.

Expected output format:
{assistant_name} (internal thoughts): <>
{assistant_name} (auto reply): <optional>
{assistant_name}: <>
{assistant_name} (search query): <optional: must be output when {assistant_name} has not have enough knowledge to answer the user question>"""


def assistant(question, assistant_name = "Wally", do_search=True):

  ## Define the instructions and few-shot examples
  messages = [
      # Instructions: we tell the model what rules to follow
      {"role":"system", "content": instruction.format(assistant_name=assistant_name)},
      
      # first example: demonstrates what the input and output look like
      {"role":"user", "content": "User: Tell me about alpaca's family "},
      {"role":"assistant", "content": f"{assistant_name} (internal thoughts): I am a helpful, ethical, and reliable assistant. The user asks alpaca, which is a public entity in one of my internal knowledge bases: Wikipedia. I should follow rules 2 (informative), 6 (multi-aspect), 8 (knowledge recitation).\n\n{assistant_name}: Alpacas belong to the Camelidae family, which includes other similar species such as camels, llamas, guanacos, and vicuñas. These animals are native to South America, particularly the Andean region, and are well-adapted to high-altitude environments. Alpacas are domesticated animals primarily bred for their soft and luxurious fleece, which is used to make various textiles. They are also known for their gentle and curious nature, making them popular as pets and companions in some parts of the world."},

      # second example: an example in which searching is necessary
      {"role":"user", "content":"User: When did the operation during which the 704th dropped supplies to allied troops near Nijmegen begin?"},
      {"role":"assistant", "content": f"{assistant_name} (internal thoughts): The user is asking about a specific historical event involving the 704th and the dropping of supplies near Nijmegen. I should follow rules 2 (informative), 7 (searching), and 9 (static).\n\n{assistant_name} (search query): \"704th dropped supplies near Nijmegen operation date\"\n\n{assistant_name}: I'm not certain about the specific operation you're referring to involving the 704th dropping supplies near Nijmegen. However, I can help you find the information you're looking for. You can try searching for \"704th dropped supplies near Nijmegen operation date\" to find more details about this event."},

      # the second example requires searching. Below, we provide the model with the google API search results.
      {"role": "user", "content": "Search engine:\nTitle: Battle of Nijmegen\nURL: https://en.wikipedia.org/wiki/Battle_of_Nijmegen\nText: The Battle of Nijmegen, also known as the Liberation of Nijmegen, occurred from 17 to 20 September 1944, as part of Operation Market Garden during World War ...\n\nTitle: Operation Market Garden\nURL: https://en.wikipedia.org/wiki/Operation_Market_Garden\nText: Operation Market Garden was an Allied military operation during the Second World War fought in the, then German-occupied, Netherlands from 17 to 27 ...\n\nTitle: The Story Of Operation 'Market Garden' In Photos\nURL: https://www.iwm.org.uk/history/the-story-of-operation-market-garden-in-photos\nText: Paratroops drop from Dakota aircraft over the outskirts of Arnhem, 17 September 1944. ... Paratroopers and 'parapack' supply containers of 1st Parachute Brigade ...\n\nTitle: Operation Market Garden: the 82nd Airborne Division's ...\nURL: https://warfarehistorynetwork.com/article/operation-market-garden-the-82nd-airborne-divisions-pivotal-role/\nText: According to the Montgomery plan, these divisions would be dropped in broad daylight up to 64 miles behind enemy lines in Holland where they would have to ...\n\nTitle: Air Force Combat Units of WWII\nURL: https://www.afhra.af.mil/Portals/16/documents/Studies/101-150/AFD-090529-056.pdf\nText: part in the first airborne operation in the. Guinea, 19 Dec 1943; ... Dropped food supplies to the Dutch during ... Nijmegen on 17 Sep during the air attack."},
      {"role": "assistant", "content": f"{assistant_name} (internal thoughts): The search provided results related to the operation in which supplies were dropped near Nijmegen. I should follow rules 2 (informative), 6 (multi-aspect), and 10 (numerical sensitivity).\n\n{assistant_name}: Based on the information found, it appears that the operation you are referring to is part of Operation Market Garden, which took place during World War II. The operation was conducted from September 17 to 27, 1944. The 704th may have been involved in dropping supplies to Allied troops near Nijmegen during this time frame. The Battle of Nijmegen, also known as the Liberation of Nijmegen, occurred from September 17 to 20, 1944, as part of Operation Market Garden."},

      # The user question
      {"role": "user", "content": f"User: {question}"}
  ]

  res = generate_chat(messages) # perform API call

  response = {"model_response": res}

  # First, check whether the model asked for a search, i.e. whether it generated a search string
  search_regex = r'{0} \(search query\):\s"([^"]*)"'.format(assistant_name)
  search_match = re.search(search_regex, res)

  if search_match:
    search_string = search_match.group(1) # get generated search string
    response["search"] = search_string

    if do_search: # only search when the model produced a query and search is enabled
      search_results = search(search_string) # perform search

      # construct the message containing the search results
      search_engine_prompt = "Search engine:\n"
      for item in search_results:
        search_engine_prompt += f"Title: {item['title']}\nURL: {item['link']}\nText: {item['snippet']}\n\n"

      # append the model's last response to the messages
      messages.append({"role": "assistant", "content": res})
      # append the search results to the messages
      messages.append({"role": "user", "content": search_engine_prompt})

      # perform the second OpenAI API call, now with the search results in context
      res = generate_chat(messages)
      response["model_response_search"] = res
  
  # Extract answer
  answer_regex = r'{0}:\s(.*)'.format(assistant_name)
  answer_match = re.search(answer_regex, res)
  if answer_match:
    response['answer'] = answer_match.group(1) 

  # Extract model internal thoughts
  thougths_regex = r"{0} \(internal thoughts\):\s(.*)".format(assistant_name)
  thougths_match = re.search(thougths_regex, res)
  if thougths_match:
    response['thougths'] = thougths_match.group(1) 

  return response

## Testing

Let's test our assistant. First, we call the `assistant` function with the **`do_search`** parameter set to `False`, indicating that we want to skip the search step. The question we provide is "Who is Jayr Alencar Pereira?", a specific inquiry that goes beyond the expected knowledge of the GPT-3.5-turbo model.

The response from the assistant reveals that it lacks information about Jayr Alencar Pereira in its internal knowledge bases. This suggests that the person might not be widely recognized or documented in public sources. However, the assistant expresses willingness to offer further assistance if the user can provide additional information or context about the person.

Moreover, the assistant generates a search string, "Jayr Alencar Pereira," which serves as a suggestion for conducting an online search to gather more information about the subject if desired.

In [None]:
from IPython.display import display_markdown

assistant_name = "Wally" # @param
question = "Who is Jayr Alencar Pereira?" # @param
results = assistant(question,assistant_name=assistant_name,do_search=False)

display_markdown(f"**{assistant_name} (internal thoughts)**: {results['thougths']}", raw=True)
display_markdown(f"**{assistant_name}**: {results['answer']}", raw=True)
if "search" in results:
  display_markdown(f"**{assistant_name} (search string)**: {results['search']}", raw=True)

**Wally (internal thoughts)**: The user is asking about a specific person named Jayr Alencar Pereira. I should follow rules 2 (informative), 4 (question assessment), and 7 (searching).

**Wally**: I'm sorry, but I don't have any information about Jayr Alencar Pereira in my internal knowledge bases. It's possible that this person is not well-known or hasn't been documented in public sources. If you have any additional information or context about who this person is, I can try to help you find more information.

**Wally (search string)**: Jayr Alencar Pereira
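
The display cell above indexes `results` directly, which raises a `KeyError` whenever one of the regular expressions fails to match the model output. A more tolerant variant could look like the sketch below; `render_results` is a hypothetical helper, and the dictionary keys (including the `thougths` spelling) follow the ones produced by `assistant`.

```python
def render_results(assistant_name, results):
    """Build markdown lines for whichever fields the assistant actually returned."""
    labels = [
        ("thougths", f"{assistant_name} (internal thoughts)"),  # key spelled as in the notebook
        ("answer", assistant_name),
        ("search", f"{assistant_name} (search string)"),
    ]
    return [f"**{label}**: {results[key]}" for key, label in labels if key in results]

# In the notebook, each line could then be shown with:
#   for line in render_results(assistant_name, results):
#       display_markdown(line, raw=True)

sample_results = {
    "thougths": "The user is asking about a specific person.",
    "search": "Jayr Alencar Pereira",
}
for line in render_results("Wally", sample_results):
    print(line)
```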

When the **`do_search`** parameter is set to True, we obtain the following results from the assistant.

The assistant's internal thoughts indicate that the search returned results related to "Jayr Alencar Pereira".

In [None]:
results = assistant(question,assistant_name=assistant_name,do_search=True)

display_markdown(f"**{assistant_name} (internal thoughts)**: {results['thougths']}",raw=True)
display_markdown(f"**{assistant_name}**: {results['answer']}",raw=True)

**Wally (internal thoughts)**: The search provided results related to a person named Jayr Alencar Pereira, who appears to be a PhD student and researcher in the field of computer science. I should follow rules 2 (informative), 6 (multi-aspect), and 8 (knowledge recitation).

**Wally**: Based on my search, Jayr Alencar Pereira is a PhD student and researcher in the field of computer science. He is affiliated with the Federal University of Pernambuco and NeuralMind. His research interests include natural language processing, machine learning, and assistive technologies. Some of his recent publications include "Visconde: Multi-document QA with GPT-3 and Neural Retrieval" and "Using Assistive Robotics for Aphasia Rehabilitation."