<a href="https://colab.research.google.com/github/achildress83/wiki-chatbot/blob/main/Wiki_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install wikipedia --quiet
!pip install openai==0.28 --quiet
!pip install tiktoken --quiet


In [2]:
import wikipedia as wiki
import ipywidgets as widgets
from IPython.display import display, Markdown
import pandas as pd

import openai
from openai.embeddings_utils import distances_from_embeddings
import tiktoken

In [3]:
# keys and constants
openai.api_key = "YOUR API KEY"
EMBEDDING_MODEL="text-embedding-ada-002"
COMPLETION_MODEL = "gpt-3.5-turbo-instruct"
BATCH_SIZE=100
MAX_CONTEXT_LENGHT = 4096
MAX_TOKENS_OUTPUT=150

## Dataset
I used the Wikipedia API and built a basic UI that lets the user learn about any topic: enter a topic, then pick a page from the search results returned for it. The page content is cleaned with some basic preprocessing code, which works on most pages but still fails on some.
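
The cleaning rules that `make_df` applies can be illustrated without pandas. This is a minimal sketch, assuming the page content has already been split on newlines; `keep_line` and the sample lines are hypothetical and exist only for illustration.

```python
import re

def keep_line(line):
    """Mirror the three filters in make_df: non-empty, not a header, starts with a letter."""
    if len(line) == 0:
        return False
    # Wikipedia section headers look like "== History ==" in the plain-text content
    if "==" in line:
        return False
    # Keep only lines whose first character is an ASCII letter
    return re.match(r"^[A-Za-z]", line) is not None

# Hypothetical sample lines, not taken from a real page
sample = [
    "Machine learning is a field of study in artificial intelligence.",
    "",
    "== History ==",
    "=== Early work ===",
    "  indented fragment",
    "2016 was a notable year.",
    "Neural networks are one family of models.",
]

kept = [line for line in sample if keep_line(line)]
for line in kept:
    print(line)
```

Note that the letter-prefix rule also discards paragraphs that begin with a digit, whitespace, or a non-ASCII character, so some legitimate text is dropped along with the headers.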

In [4]:
#@title Functions for building chatbot

def make_df(data):
  """
  Takes wikipedia page data and returns a dataframe with the text column
  after applying some basic cleaning.
  """
  df = pd.DataFrame(data, columns=['text'])
  # drop rows with no text, header rows (containing '=='), and rows that don't start with a letter
  df = df[(df['text'].str.len() > 0) & (~df['text'].str.contains('==')) & (df['text'].str.match('^[A-Za-z]'))]
  df.reset_index(drop=True, inplace=True)
  return df


def make_embeddings_column(df, model=EMBEDDING_MODEL, batch_size=BATCH_SIZE):
  """
  Add an embeddings column to the dataframe, sending the text in batches so as not to overload the API.
  """
  embeddings = []

  for i in range(0, len(df), batch_size):
    response = openai.Embedding.create(input=df.iloc[i:i+batch_size]['text'].tolist(), model=model)
    # add embeddings list
    embeddings.extend([data.embedding for data in response.data])

  # add embeddings list to df
  df['embeddings'] = embeddings
  return df


def get_embedding(question, model):
  """
  helper function for get_rows_sorted_by_relevance. Creates an embedding from the question.
  """
  return openai.Embedding.create(input=question, model=model).data[0].embedding


def get_rows_sorted_by_relevance(df, question, model=EMBEDDING_MODEL):
    """
    Helper function for create_prompt. Takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, model)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

def create_prompt(df, question, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context:

    {}

    ---

    Question: {}
    Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))

    context = []

    for text in get_rows_sorted_by_relevance(df, question)['text'].values:
      # Add the length of the text to the current token count
      current_token_count += len(tokenizer.encode(text))

      # If the running total still fits the budget, add the text to the context
      if current_token_count <= max_token_count:
        context.append(text)

      # Otherwise stop adding context
      else:
        break

    return prompt_template.format("\n\n###\n\n".join(context), question)

def answer_question(
    df, question, max_token_count=MAX_CONTEXT_LENGHT, max_answer_tokens=MAX_TOKENS_OUTPUT
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

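    # Reserve room for the answer so that prompt + completion fit within the token limit
    prompt = create_prompt(df, question, max_token_count - max_answer_tokens)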

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

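def non_custom_answer(question):
  """
  Baseline: send the question to the Completion model with no retrieved context.
  """
  # use the function argument rather than the global QUESTION
  response = openai.Completion.create(
    model=COMPLETION_MODEL,
    prompt=question,
    max_tokens=MAX_TOKENS_OUTPUT
  )
  return response["choices"][0]["text"].strip()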
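
The retrieval step above (rank rows by cosine distance to the question, then add rows to the context until the token budget is spent) can be tried offline. The following is a rough sketch under stated assumptions: the 2-dimensional vectors are toy stand-ins for real embeddings, a whitespace word count stands in for the `tiktoken` tokenizer, and `cosine_distance` and `pack_context` are hypothetical helper names, not functions from the notebook or the OpenAI library.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means the vectors point in a more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pack_context(rows, question_vec, budget, count_tokens):
    """
    rows: list of (text, embedding) pairs.
    Sort rows from most to least relevant, then keep adding texts
    until the running token count would exceed the budget.
    """
    ranked = sorted(rows, key=lambda row: cosine_distance(question_vec, row[1]))
    used = 0
    context = []
    for text, _ in ranked:
        used += count_tokens(text)
        if used > budget:
            break
        context.append(text)
    return context

# Toy data (assumption): real embeddings have far more dimensions
rows = [
    ("cats are small domesticated mammals", [1.0, 0.0]),
    ("attention weights each input element", [0.0, 1.0]),
    ("transformers extend the attention mechanism", [0.2, 0.9]),
]
question_vec = [0.1, 1.0]

# Whitespace word count as a crude stand-in for a real tokenizer
word_count = lambda text: len(text.split())

context = pack_context(rows, question_vec, budget=10, count_tokens=word_count)
print("\n\n###\n\n".join(context))
```

With this toy data the two attention-related rows fit the budget and the unrelated row is left out, which is the behavior `create_prompt` relies on.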

## Chatbot UI

In [27]:
#@title What topic do you want to learn about?
topic = "machine learning" #@param {type:"string"}

res = wiki.search(topic)
dropdown = widgets.Dropdown(
    options=res,
    description='Select from available pages:',
)
display(dropdown)

Dropdown(description='Select from available pages:', options=('Machine learning', 'Neural network (machine lea…

In [28]:
#@title Run this cell every time you change your wiki page
# retrieve content from the selected wiki page
page_name = dropdown.value
data = wiki.page(page_name).content.split('\n')
# make a dataframe from the content
df = make_df(data)
# add an embeddings column to the df
df = make_embeddings_column(df)
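
The batching inside `make_embeddings_column` is plain slicing in steps of `batch_size`. Here is a small sketch of that slicing pattern with no API calls; `batched` and the placeholder paragraphs are hypothetical names and data used only to show how a page gets split into requests.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder paragraphs standing in for the rows of the dataframe
paragraphs = [f"paragraph {i}" for i in range(250)]

# With a batch size of 100, 250 rows would need three embedding requests
batch_sizes = [len(batch) for batch in batched(paragraphs, 100)]
print(batch_sizes)
```

The last batch is simply shorter than the others, so no padding or special casing is needed.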

In [29]:
#@title What's your question?
QUESTION = "What is the attention mechanism?" #@param {type: "string"}

display(Markdown(answer_question(df, question=QUESTION)))

The attention mechanism is a machine learning-based method that allows a model to focus on parts of an input sequence that are most relevant for the task at hand. It does this by calculating soft weights for each input element and using those weights to create a context vector that captures the important information. This method is used in natural language processing tasks like machine translation and has been extended within the Transformer architecture for more efficient processing.

## Answer Comparison (with vs. without prompt engineering)

In [11]:
#@title Question #1

print(f'wiki page: {dropdown.value}')
print(f'question: {QUESTION}')
print('-----------------------------------------------------')
print('non-custom response:')
display(Markdown(non_custom_answer(question=QUESTION)))
print('\ncustom response:')
display(Markdown(answer_question(df, question=QUESTION)))

wiki page: Acquisition of Twitter by Elon Musk
question: Who owns twitter?
-----------------------------------------------------
non-custom response:


Twitter, Inc. is a publicly traded company and its ownership is distributed among its shareholders. It is not owned by any one individual or entity.


custom response:


Elon Musk.

In [26]:
#@title Question #2

print(f'wiki page: {dropdown.value}')
print(f'question: {QUESTION}')
print('-----------------------------------------------------')
print('non-custom response:')
display(Markdown(non_custom_answer(question=QUESTION)))
print('\ncustom response:')
display(Markdown(answer_question(df, question=QUESTION)))

wiki page: Gemini (language model)
question: What is gemini?
-----------------------------------------------------
non-custom response:


Gemini is the third astrological sign of the zodiac, represented by the symbol of the twins. People born between May 21st and June 20th are considered to be Geminis. They are known for their duality, adaptability, and communication skills. Geminis are often sociable and intellectual individuals who love to learn and express themselves. Mercury is the ruling planet of Gemini, and the element associated with this sign is air.


custom response:


Gemini is a family of multimodal large language models developed by Google DeepMind, serving as the successor to LaMDA and PaLM 2. It was announced on December 6, 2023, positioned as a competitor to OpenAI's GPT-4. It powers the chatbot of the same name.