# Custom Chatbot Project

### TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

In this chatbot project I collect data from the Wikipedia page for the year 2024. This dataset is appropriate for the task because both of my questions are about events in 2024, while the training data of the model I use, `gpt-3.5-turbo-instruct`, ends in 2021.

By providing context to the model (retrieval-augmented generation, RAG), we can obtain correct answers about events from 2024.

Question 1: "Who was elected president of the United States of America?"
Question 2: "Which country won the Olympic Games of Paris 2024?"


# Step 0 - Initial OpenAI response

Before customizing the chatbot, I ask the OpenAI model two questions about 2024 to check its answers.

In [2]:
from dateutil.parser import parse
import pandas as pd
import requests
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import numpy as np
import tiktoken

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

In [3]:
def initial_answer(question):
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=question,
        max_tokens=150
    )
    return response["choices"][0]["text"].strip()


The model cannot answer these questions correctly because its training data ends in 2021. My task is to provide context from 2024 so that it can.

# Step 1 - Data Indexing

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [4]:
def load_dataset_wikipedia(title):
    #Getting the wikipedia page for 2024
    params = {
       "action": "query", 
       "prop": "extracts",
       "exlimit": 1,
       "titles": title,
       "explaintext": 1,
       "formatversion": 2,
       "format": "json"
    }

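    # Fail early on network or HTTP errors instead of parsing a bad response
    response = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=30)
    response.raise_for_status()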
    
    df = pd.DataFrame()
    df["text"] = response.json()["query"]["pages"][0]["extract"].split("\n")
    
    # Cleaning text to remove empty lines and section headings
    df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

    # adjusting so dataset starts with dates (if possible)
    prefix = ""
    for i, text in df["text"].items():
        # If the row already has " – ", it already has the needed date prefix
        if " – " not in text:
            try:
                # If the row's text is a date, set it as the new prefix
                parse(text)
                prefix = text
            except (ValueError, OverflowError):
                # If the row's text isn't a date, add the prefix.
                # Write through df.at: rows yielded while iterating may be copies.
                df.at[i, "text"] = prefix + " – " + text
    # Keep only rows that carry a date prefix (drops the bare date lines)
    df = df[df["text"].str.contains(" – ")].copy()
    return df
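
The date-prefix step can be tried offline on a few hand-written lines. This is a minimal sketch, not the notebook's implementation: `is_date_line` and `prefix_dates` are hypothetical helpers, the sample lines are invented, and `datetime.strptime` stands in for `dateutil.parser.parse`, which is more lenient about what it accepts as a date.

```python
from datetime import datetime


def is_date_line(text, year=2024):
    """Return True if the line looks like 'January 3' (hypothetical helper)."""
    try:
        # The year is appended so that February 29 parses in a leap year
        datetime.strptime(f"{text.strip()} {year}", "%B %d %Y")
        return True
    except ValueError:
        return False


def prefix_dates(lines):
    """Attach the most recent bare date to the event lines that follow it."""
    prefix = ""
    prefixed = []
    for line in lines:
        if " – " in line:
            # Already in 'date – event' form
            prefixed.append(line)
        elif is_date_line(line):
            # A bare date: remember it, but do not keep it as a row
            prefix = line
        else:
            prefixed.append(f"{prefix} – {line}")
    return prefixed


# Invented sample lines in the same layout as the Wikipedia extract
sample = [
    "January 1 – Event A happens.",
    "January 3",
    "Event B happens.",
    "Event C happens.",
]
result = prefix_dates(sample)
for row in result:
    print(row)
```

Each event row ends up carrying its own date, so a row retrieved on its own still says when the event happened.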


## Creating an Embeddings Index

I use OpenAI's `text-embedding-ada-002` model to embed each row of the dataset, sending the rows to the API in batches.

In [5]:
def generate_embeddings(df, embedding_model_name="text-embedding-ada-002", batch_size=100):
    embeddings = []
    for i in range(0, len(df), batch_size):
        # Send text data to OpenAI model to get embeddings
        response = openai.Embedding.create(
            input=df.iloc[i:i+batch_size]["text"].tolist(),
            engine=embedding_model_name
        )

        # Add embeddings to list
        embeddings.extend([data["embedding"] for data in response["data"]])

    # Add embeddings list to dataframe
    df["embeddings"] = embeddings
    return df
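
The batching pattern in `generate_embeddings` can be checked without calling the API. In this sketch `batched` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in for `openai.Embedding.create` that returns one tiny vector per input text.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    """Stand-in for the embedding API: one small vector per input text."""
    return [[float(len(text))] for text in texts]


texts = [f"row {i}" for i in range(250)]
embeddings = []
batch_sizes = []
for batch in batched(texts, 100):
    batch_sizes.append(len(batch))
    # extend keeps the embeddings in the same order as the input rows
    embeddings.extend(fake_embed(batch))

print(batch_sizes)
print(len(embeddings))
```

Because the results are appended in input order, the final list lines up row by row with the dataframe, which is what makes the `df["embeddings"] = embeddings` assignment valid.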


## Saving the embeddings

In [6]:
def save_embeddings(df, output_file):
    df.to_csv(output_file, index=False)


#save_embeddings(df, "embeddings.csv")

## Load embeddings for cosine similarity

In [7]:
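def load_embeddings(file_path):
    import ast  # local import keeps this cell self-contained
    # save_embeddings writes with index=False, so there is no index column to read back
    df = pd.read_csv(file_path)
    # Parse each stored list with literal_eval instead of eval
    df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
    return df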

#load_embeddings("embeddings.csv")
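
An embedding saved to CSV comes back as a string, which is why it has to be parsed on load. The round trip can be sketched with the standard library alone; the row below is invented, and `csv` with an in-memory buffer stands in for `DataFrame.to_csv` and `pd.read_csv`.

```python
import ast
import csv
import io

# One invented row with a very short "embedding"
rows = [{"text": "January 1 – Event A happens.", "embeddings": [0.1, -0.2, 0.3]}]

# Write to an in-memory CSV; the list is stored as its string form
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the embedding arrives as a string and must be parsed
buffer.seek(0)
raw = list(csv.DictReader(buffer))
loaded = [
    {"text": row["text"], "embeddings": ast.literal_eval(row["embeddings"])}
    for row in raw
]

print(type(raw[0]["embeddings"]).__name__)
print(loaded[0]["embeddings"])
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for a file that may have been edited or downloaded.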

## Cosine Similarity

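To find the rows most relevant to a question, I embed the question and compute the cosine distance between it and the embedding of each row in the Wikipedia 2024 dataset. The rows with the smallest distance are the most likely to contain the answer.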

In [8]:
def cosine_similarity(question, df, embedding_model_name="text-embedding-ada-002"):
    question_embeddings = get_embedding(question, engine=embedding_model_name)
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
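
What the ranking does can be shown on two-dimensional vectors. This sketch assumes cosine distance is defined as 1 minus cosine similarity; `cosine_distance` is a hypothetical pure-Python helper, and the vectors are invented rather than real embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for the opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


question = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smallest distance first, as with sort_values("distances", ascending=True)
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
for name in ranked:
    print(name, round(cosine_distance(question, rows[name]), 3))
```

Vector length does not matter here, only direction, so a long row and a short row about the same topic can rank equally close to the question.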



# Step 2 - Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [9]:
def create_prompt(question, df, max_token_count=1000):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""   
    
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    context = []
    
    for text in df["text"].values:
        text_token_count = len(tokenizer.encode(text))

        # Stop before adding a row that would push the prompt over the limit
        if current_token_count + text_token_count > max_token_count:
            break

        context.append(text)
        current_token_count += text_token_count
    
    return prompt_template.format("\n\n###\n\n".join(context), question)
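
The token budget logic is easier to follow with small numbers. In this sketch `count_tokens` is a crude whitespace counter standing in for `tiktoken`, and `select_context` is a hypothetical helper; the rows are assumed to be sorted from most to least relevant.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def select_context(rows, question, max_token_count):
    """Take rows in order until the next one would exceed the budget."""
    used = count_tokens(question)
    selected = []
    for text in rows:
        cost = count_tokens(text)
        if used + cost > max_token_count:
            break
        selected.append(text)
        used += cost
    return selected, used


rows = ["a b c", "d e f g", "h i"]
selected, used = select_context(rows, "who won", max_token_count=8)
print(selected, used)
```

The loop stops at the first row that does not fit rather than skipping ahead to a shorter one, which keeps the selected context in strict relevance order.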



# Step 3 - Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [10]:
def openai_answer(question, max_answer_count):
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=question,
        max_tokens=max_answer_count
    )
    return response["choices"][0]["text"].strip()


In [11]:
def context_answer(question, df, max_prompt_count=1000, max_answer_count=200):
    
    #Cosine similarity with most relevant
    relevant_rows = cosine_similarity(question, df)
    
    #prompt for the question and with relevant rows to answer
    prompt=create_prompt(question, relevant_rows, max_token_count=max_prompt_count)
    
    #ask the model using the prompt that now includes the context
    return openai_answer(prompt, max_answer_count=max_answer_count)

#context_answer("Who was elected president of the United States of America?", df, max_prompt_count=1000, max_answer_count=200)
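
The retrieve, build prompt, complete sequence can be exercised end to end without the API by injecting stand-ins. Everything in this sketch is hypothetical: `answer_with_context` mirrors the shape of `context_answer`, `rank_by_overlap` replaces embedding search with simple word overlap, and `echo_completion` returns the prompt instead of calling a model.

```python
def answer_with_context(question, rows, rank_fn, complete_fn, top_k=2):
    """Rank the rows, build a context prompt, and pass it to the completion function."""
    ranked = rank_fn(question, rows)
    context = "\n\n###\n\n".join(ranked[:top_k])
    prompt = f"Context:\n\n{context}\n\n---\n\nQuestion: {question}\nAnswer:"
    return complete_fn(prompt)


def rank_by_overlap(question, rows):
    """Stand-in for embedding search: rows sharing more words with the question come first."""
    question_words = set(question.lower().split())
    return sorted(rows, key=lambda row: -len(question_words & set(row.lower().split())))


def echo_completion(prompt):
    """Stand-in for the completion call: return the prompt so it can be inspected."""
    return prompt


rows = ["The sky is blue.", "Paris hosted the Olympics.", "Cats sleep a lot."]
prompt = answer_with_context(
    "Who hosted the Olympics?", rows, rank_by_overlap, echo_completion, top_k=1
)
print(prompt)
```

Passing the ranking and completion steps in as functions makes it possible to inspect the exact prompt the model would receive before spending any API calls.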

## Custom Chatbot for Wikipedia 2024

In [12]:
def main_demostration(user_input, df):
        
    #Embeddings: generate them only once, since generate_embeddings stores them on df
    df_context = df if "embeddings" in df.columns else generate_embeddings(df)
    #save_embeddings(df, "embeddings.csv")
    #load_embeddings("embeddings.csv")
    
    #get question from user
    question = user_input
    
    #answer before providing context
    before_answer = initial_answer(question)
    
    #answer after providing context
    answer_context = context_answer(question, df_context)
    
    #Show the results
    print(f"\n\nThe answer for your question: {question} before providing context to openai is: {before_answer}\n\n")
    print(f"After we provided context, the answer is: {answer_context}\n\n")
    

### Chatbot that receives input from the user

In [13]:
chatbot_on = True
user_questions = 0

#loading dataset from wikipedia 2024
df = load_dataset_wikipedia("2024")
    
while chatbot_on:
    user_input = input("What do you want to know from 2024 events?:  \n\n").lower()
    main_demostration(user_input, df)
    user_questions += 1
    
    if user_questions == 2:
        chatbot_on = False
        print("You run out of questions for today, try again tomorrow!")

#Question 1: Who was elected president of the United States of America? 
#Question 2: When was X (former twitter) banned in Brasil?

What do you want to know from 2024 events?:  

Who was elected president of the United States of America? 


The answer for your question: who was elected president of the united states of america?  before providing context to openai is: As of May 2021, Joe Biden is the elected President of the United States of America.


After we provided context, the answer is: Donald Trump


What do you want to know from 2024 events?:  

When was X (former twitter) banned in Brasil?


The answer for your question: when was x (former twitter) banned in brasil? before providing context to openai is: Former Twitter was not banned in Brazil. Twitter itself has not been banned in Brazil. However, there have been instances where certain features or accounts on Twitter have been temporarily blocked in Brazil due to court orders.


After we provided context, the answer is: September 2


You run out of questions for today, try again tomorrow!
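
The question-limited loop can also be written so that it runs without `input()`, which makes it easy to test. This is a sketch under assumptions: `run_chatbot` is a hypothetical function, and `str.upper` stands in for the real answering function.

```python
def run_chatbot(questions, answer_fn, max_questions=2):
    """Answer questions in order and stop once max_questions have been handled."""
    transcript = []
    for count, question in enumerate(questions, start=1):
        transcript.append((question, answer_fn(question)))
        if count >= max_questions:
            break
    return transcript


# str.upper stands in for the real answering function
transcript = run_chatbot(["first", "second", "third"], str.upper)
for question, answer in transcript:
    print(question, "->", answer)
```

In the notebook, the answering function would be the context-based one and the questions would come from the user; the limit of two matches the behaviour shown in the output above.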
