# Chatbot with RAG - Retrieval-Augmented Generation 

#### Description
A custom chatbot is built to answer questions about cars and is augmented with more recent data using the RAG technique.

The same questions are put to the original model and to the model augmented with the custom dataset, so that the responses can be compared.

#### Main Components

- LLM: OpenAI `gpt-3.5-turbo-instruct`
- Embedding Model: `text-embedding-ada-002`
- Technique: `RAG` (retrieved text is added to the prompt; no fine-tuning is performed)
- Custom Dataset: `recent_car_data_embeddings.csv` produced by `data_collection.ipynb`


## Imports

In [37]:
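# This notebook targets the legacy openai SDK (< 1.0): openai.embeddings_utils,
# openai.Embedding and openai.Completion were removed in 1.0.
import openai
import tiktoken
from openai.embeddings_utils import get_embedding, distances_from_embeddings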

import pandas as pd
import json


## 1) The Dataset


When the OpenAI model was asked the questions below, it did not give the correct answers:
1. Which car was the best-selling car in the world in 2023?
2. Which car has the lowest total cost of ownership or requires the least maintenance in 2023?

This is because the LLM was trained on data up to 2021.

To supply the missing information, the notebook `data_collection.ipynb` collects more recent data from the internet through a Google API and exports it to the file `recent_car_data.csv`.

Once the chatbot is augmented with this data, the same questions are asked with and without the augmentation.


## 2) Data Wrangling

Load the dataset from the file `recent_car_data.csv`, which was produced by `data_collection.ipynb`.


### Load and Wrangle Data

In [None]:
# Load the OpenAI API key from a local credentials file
try:
    with open('credentials.json') as file:
        credentials = json.load(file)
except FileNotFoundError:
    print("Error: file credentials.json was not found.")
    raise

# Do not print the credentials or the key: notebook output is saved with the file
openai.api_key = credentials['OpenAIAPIKey']
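
The cell above reads the key from `credentials.json`. As a hedged alternative, the sketch below first checks the `OPENAI_API_KEY` environment variable (the conventional name, also read by the `openai` package) and falls back to the file. The helper name `load_api_key` and its parameters are assumptions made for illustration, not part of the notebook.

```python
import json
import os


def load_api_key(path="credentials.json", env_var="OPENAI_API_KEY", field="OpenAIAPIKey"):
    """Return the API key from the environment if set, else from the JSON file."""
    key = os.environ.get(env_var)
    if key:
        return key
    # Fall back to the credentials file used elsewhere in this notebook
    with open(path) as file:
        return json.load(file)[field]
```

With this helper, the assignment would become `openai.api_key = load_api_key()`, and the key never needs to be stored next to the notebook.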

In [39]:
df = pd.read_csv('recent_car_data.csv')    
df.head()

Unnamed: 0,text
0,Tesla Model Y secures position as world's best...
1,Best-selling car models worldwide 2023. The Te...
2,The world's top 500 best-selling cars in 2023....
3,World Best Selling Cars Ranking 2024. World Be...
4,Tesla Model Y to be crowned world's best-selli...


### Generate Embeddings

In [40]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,Tesla Model Y secures position as world's best...,"[0.0007928981212899089, -0.025256941094994545,..."
1,Best-selling car models worldwide 2023. The Te...,"[-0.010839383117854595, -0.022148113697767258,..."
2,The world's top 500 best-selling cars in 2023....,"[-0.007705730386078358, -0.026900004595518112,..."
3,World Best Selling Cars Ranking 2024. World Be...,"[0.009730380959808826, -0.00798781681805849, 0..."
4,Tesla Model Y to be crowned world's best-selli...,"[-0.015623337589204311, -0.0179719440639019, 0..."
...,...,...
89,The True Cost to Own EVs vs. Gas Cars. ... car...,"[0.02793605998158455, -0.012697041966021061, 0..."
90,Discounts on year-old vehicles have more than ...,"[-0.014496373012661934, 0.003646989120170474, ..."
91,Cheaper and Cleaner: Electric Vehicle Owners S...,"[0.018361523747444153, -0.0010400335304439068,..."
92,Safe vehicles for teens. Best Choices — used ;...,"[0.009535635821521282, -0.026397429406642914, ..."
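
The loop above sends the text column to the embedding endpoint in slices of `batch_size` rows. The sketch below isolates only that slicing logic, with no API calls. `iter_batches` is a hypothetical helper name, and the 93 placeholder strings merely mirror the row count suggested by the index in the output above.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Illustrative data only: 93 placeholder strings
texts = [f"row {i}" for i in range(93)]

# With batch_size=100 the whole dataset fits into a single request
batch_sizes = [len(batch) for batch in iter_batches(texts, 100)]

# A smaller batch size splits it into several requests
small_batches = [len(batch) for batch in iter_batches(texts, 40)]
```

For a dataset of this size the batching has no visible effect, but the same loop keeps working if the dataset grows beyond one batch.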


In [41]:
# Save the embeddings locally
df.to_csv("recent_car_data_embeddings.csv")
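
One caveat when reloading the saved file: a CSV cell can only hold text, so each embedding list is written as a string and has to be parsed again before any distance can be computed. The sketch below shows the round trip with the standard library only; the rows are invented for illustration (real `text-embedding-ada-002` vectors have 1536 values).

```python
import csv
import io
import json

# Illustrative rows only
rows = [
    {"text": "first snippet", "embeddings": [0.25, -0.5, 1.0]},
    {"text": "second snippet", "embeddings": [0.0, 0.125, -0.75]},
]

# Write to an in-memory CSV; each list is serialized with str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the embeddings column now holds strings such as "[0.25, -0.5, 1.0]"
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_type = type(loaded[0]["embeddings"])

# Parse each cell back into a list of floats
for row in loaded:
    row["embeddings"] = json.loads(row["embeddings"])
```

With pandas, the equivalent step after `pd.read_csv` would likely be `df["embeddings"] = df["embeddings"].apply(json.loads)`.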

### Function to Find Related Pieces of Text for a Given Question

In [42]:

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy


In [43]:
# Check using a question
get_rows_sorted_by_relevance("what was the best selling car in the world in 2023?", df)

Unnamed: 0,text,embeddings,distances
23,The best-selling car in the world in 2023 is t...,"[-0.010958744212985039, -0.02671232633292675, ...",0.095250
1,Best-selling car models worldwide 2023. The Te...,"[-0.010839383117854595, -0.022148113697767258,...",0.101876
22,Tesla Model Y crowned world's best-selling car...,"[-0.01762448064982891, -0.017356475815176964, ...",0.102505
25,Tesla Model Y crowned world's best-selling car...,"[-0.016832543537020683, -0.025813322514295578,...",0.102738
37,Tesla Model Y is world's best-selling car of 2...,"[-0.015965625643730164, -0.022575119510293007,...",0.102806
...,...,...,...
79,Cost of Owning a Car for a Year in Every State...,"[0.018761899322271347, 0.0036847013980150223, ...",0.262905
76,Electric Vehicle Cost Goes Beyond the Purchase...,"[0.022899383679032326, -0.013003667816519737, ...",0.263573
68,It Sucks to Own an EV in These Cities—Here's W...,"[0.019500192254781723, 0.0075468276627361774, ...",0.266218
84,The True Cost of Owning a Car in California. L...,"[0.0092276930809021, -0.01021749246865511, 0.0...",0.275896
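
To make the `distances` column easier to interpret, here is a minimal sketch of cosine distance using its standard definition (1 minus cosine similarity), applied to toy 2-D vectors. The function and the sample vectors are illustrative assumptions; the notebook itself relies on `distances_from_embeddings`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors, purely illustrative
question = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Rank candidate labels from most to least relevant (smallest distance first)
ranked = sorted(candidates, key=lambda name: cosine_distance(question, candidates[name]))
```

Vectors pointing the same way have distance 0 regardless of length, which is why rows about the best-selling car sit at the top of the table above (around 0.10) and rows about ownership cost sit at the bottom (around 0.27).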


## 3) Custom Query Completion

Compose a custom query using the chosen dataset and retrieve results from an OpenAI `Completion` model. 

### Create a Function that Composes a Text Prompt

In [44]:

def create_prompt(question, df, max_token_count, augmented):
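    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model.

    If `augmented` is True, the most relevant rows are added as context
    until `max_token_count` tokens are used; otherwise the context is empty.
    """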
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []

    # Only retrieve context when augmentation is requested
    if augmented:
        for text in get_rows_sorted_by_relevance(question, df)["text"].values:
            
            # Increase the counter based on the number of tokens in this row
            text_token_count = len(tokenizer.encode(text))
            current_token_count += text_token_count
            
            # Add the row of text to the list if we haven't exceeded the max
            if current_token_count <= max_token_count:
                context.append(text)
            else:
                break

    return prompt_template.format("\n\n###\n\n".join(context), question)
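
The context-building loop in `create_prompt` is a greedy packing step: rows are taken in relevance order and the loop stops at the first row that would exceed the budget. The sketch below reproduces that logic in isolation. `pack_context` and `count_words` are hypothetical names, and the whitespace word count is only a stand-in for the `tiktoken` tokenizer.

```python
def pack_context(snippets, budget, count_tokens):
    """Greedily keep snippets, in order, until the next one would exceed `budget`."""
    used = 0
    kept = []
    for snippet in snippets:
        used += count_tokens(snippet)
        if used > budget:
            break
        kept.append(snippet)
    return kept


def count_words(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word
    return len(text.split())


snippets = ["a b c", "d e", "f g h i", "j"]
kept = pack_context(snippets, budget=6, count_tokens=count_words)
```

Note that the loop stops at the first snippet that does not fit, so the short final snippet is dropped even though it would fit on its own. In `create_prompt` the counter also starts at the token count of the template plus the question rather than at zero.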

### Create a Function that Answers a Question

In [45]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, augmented, max_prompt_tokens=1800, max_answer_tokens=150):
    """
    Given a question, a dataframe containing rows of text, a flag that
    turns retrieval on or off, and a maximum number of desired tokens in
    the prompt and response, return the answer to the question according
    to an OpenAI Completion model

    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens, augmented)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
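
`openai.Completion.create` belongs to the pre-1.0 `openai` package. In version 1.0 and later the same request goes through a client object, and the response is read through attributes instead of dictionary keys. The sketch below shows that shape under stated assumptions: `complete_with_client` is a hypothetical helper, and a stub client stands in for the real one so that the example runs without network access.

```python
from types import SimpleNamespace


def complete_with_client(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=150):
    """Send `prompt` to a completions endpoint and return the stripped text, or "" on error."""
    try:
        response = client.completions.create(
            model=model,
            prompt=prompt,
            max_tokens=max_tokens,
        )
        return response.choices[0].text.strip()
    except Exception as error:  # broad on purpose, mirroring answer_question
        print(error)
        return ""


class StubCompletions:
    """Stand-in for the real endpoint (assumption: same attribute layout as the SDK response)."""

    def create(self, model, prompt, max_tokens):
        return SimpleNamespace(choices=[SimpleNamespace(text="  stub answer  ")])


stub_client = SimpleNamespace(completions=StubCompletions())
result = complete_with_client(stub_client, "Question: ...\nAnswer:")
```

With the real package, the client would be created with `from openai import OpenAI` followed by `client = OpenAI(api_key=...)` and passed in place of `stub_client`.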


## 4) Custom Performance Demonstration

Demonstrate the performance of the custom query using the two questions. For each question, compare the answer from a basic `Completion` query (the same prompt template with an empty context) with the answer from the custom, augmented query.


### Question 1

In [46]:
non_custom_best_selling_car_answer = answer_question("which car was the best selling car in the world in 2023?", df, False)
print(non_custom_best_selling_car_answer)

I don't know


In [47]:
custom_best_selling_car_answer = answer_question("which car was the best selling car in the world in 2023?", df, True)
print(custom_best_selling_car_answer)

The Tesla Model Y was the best-selling car in the world in 2023.


### Question 2

In [48]:
non_custom_lowest_tco_car_answer = answer_question("which car had the lowest total cost of ownership or required the least maintenance in 2023?", df, False)
print(non_custom_lowest_tco_car_answer)

I don't know


In [49]:
custom_lowest_tco_car_answer = answer_question("which car had the lowest total cost of ownership or required the least maintenance in 2023?", df, True)
print(custom_lowest_tco_car_answer)

Tesla Model 3
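
The four cells above repeat the same call with `augmented` set to `False` and then `True`. A small helper can collect both answers in one step. This is a sketch only: `compare_answers` is a hypothetical name, and `fake_answer` is a stub that never calls a model.

```python
def compare_answers(question, answer_fn):
    """Return the baseline and augmented answers for one question.

    `answer_fn(question, augmented)` is expected to return a string.
    """
    return {
        "question": question,
        "baseline": answer_fn(question, False),
        "augmented": answer_fn(question, True),
    }


def fake_answer(question, augmented):
    # Stub used only for illustration
    return "answer from context" if augmented else "I don't know"


report = compare_answers("example question?", fake_answer)
```

In the notebook, the stub could be swapped for `lambda q, augmented: answer_question(q, df, augmented)`.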


## 5) Car Chatbot

In [51]:
print('Hello, ask me about cars. Hit Enter to exit.\n')
while True:
    question = input('You: ')
    if len(question) > 0:
        print(f'\nCar Bot: {answer_question(question, df, True)}', end='\n\n')
    else:
        print('\nCar Bot: Good bye!')
        break

Hello, ask me about cars. Hit Enter to exit.

You:  which car was the best selling car in the world in 2023?

Car Bot: The Tesla Model Y was the best-selling car in the world in 2023.

You:  which car had the lowest total cost of ownership or required the least maintenance in 2023?

Car Bot: Tesla Model 3 was found to have the lowest total cost of ownership or required the least maintenance in 2023.

You:  

Car Bot: Good bye!
