# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

* I decided to use the Wikipedia article on the [2024 Oscars](https://en.wikipedia.org/wiki/96th_Academy_Awards) (the 96th Academy Awards), specifically the winners of each category.
* I chose this data because the model I use (`gpt-3.5-turbo-instruct`) only has information up to 2021, so it cannot know the winners of the 2024 Oscars. The dataset gives the custom RAG application knowledge of this edition of the Oscars.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
def scrape_wikipedia(url: str):
    """Scrape the winner of each category from the Wikipedia page at `url`."""

    # Get response
    response = requests.get(url)

    if response.status_code != 200:
        print("Failed response!")
        return None

    print("Success!")

    # Parse content
    parsed_content = BeautifulSoup(response.content, 'html.parser')

    winners_section = parsed_content.find('h2', id='Winners_and_nominees')
    print(winners_section)

    # Without this guard, a missing section would raise an error further down
    if winners_section is None:
        print("Section 'Winners_and_nominees' not found!")
        return None

    section_header = winners_section.find_parent()

    # Get the next sibling elements until the next header
    nominees_content = []
    for sibling in section_header.find_next_siblings():
        if sibling.name and sibling.name.startswith('h'):  # Stop at the next header
            break
        nominees_content.append(sibling)

    # Join the content into a single string
    nominees_text = ''.join(str(content) for content in nominees_content)

    # Clean up the extracted content
    cleaned_nominees_text = BeautifulSoup(nominees_text, 'html.parser').get_text()

    # Extract winners: a winner line is marked with '‡', and the line
    # right before it holds the category name
    lines = []
    winners = []

    for idx, line in enumerate(cleaned_nominees_text.splitlines()):
        lines.append(line)
        if '‡' in line:
            winner = lines[idx-1] + " - " + line.strip()
            winners.append(winner)

    # Remove the first element since it is non-informative
    df = pd.DataFrame({"text": winners[1:]})

    return df

In [3]:
df = scrape_wikipedia("https://en.wikipedia.org/wiki/96th_Academy_Awards")

Success!
<h2 id="Winners_and_nominees">Winners and nominees</h2>


In [4]:
print(df.shape)
print(df.head(10))

(23, 1)
                                                text
0  Best Picture - Oppenheimer – Emma Thomas, Char...
1  Best Director - Christopher Nolan – Oppenheimer ‡
2  Best Actor - Cillian Murphy – Oppenheimer as J...
3  Best Actress - Emma Stone – Poor Things as Bel...
4  Best Supporting Actor - Robert Downey Jr. – Op...
5  Best Supporting Actress - Da'Vine Joy Randolph...
6  Best Original Screenplay - Anatomy of a Fall –...
7  Best Adapted Screenplay[22] - American Fiction...
8  Best Animated Feature - The Boy and the Heron ...
9  Best International Feature Film - The Zone of ...
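
The winner-extraction step inside `scrape_wikipedia` can be tried offline on a small hard-coded sample. The sketch below is an illustration, not the notebook's exact code: `SAMPLE_TEXT` and `extract_winners` are hypothetical names, and the sample is assumed to mimic the cleaned page text, where each category line is directly followed by the winner line marked with ‡. Unlike the rows above, it also strips footnote markers such as `[22]` and the trailing ‡.

```python
import re

# Hypothetical sample imitating the cleaned page text: a category line,
# then the winner line (marked with ‡), then another nominee.
SAMPLE_TEXT = """Best Director
Christopher Nolan – Oppenheimer ‡
Justine Triet – Anatomy of a Fall
Best Adapted Screenplay[22]
American Fiction – Cord Jefferson ‡
Barbie – Greta Gerwig and Noah Baumbach
"""


def extract_winners(text: str) -> list[str]:
    """Pair each ‡-marked line with the line right before it (the category)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    winners = []
    for previous, current in zip(lines, lines[1:]):
        if "‡" in current:
            category = re.sub(r"\[\d+\]", "", previous)  # drop footnotes like [22]
            winner = current.replace("‡", "").strip()    # drop the winner marker
            winners.append(f"{category} - {winner}")
    return winners


winners = extract_winners(SAMPLE_TEXT)
for row in winners:
    print(row)
```

Whether to keep or strip the ‡ marker is a judgment call; removing it keeps symbols with no meaning for the model out of the context.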


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [5]:
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

### First, I verify that the LLM does not know the 2024 Oscars have already taken place

In [7]:
question1 = """
"Who was the winner of the Oscar for Best Director in the 2024 Oscars?"

"""
answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question1,
    max_tokens=200
)["choices"][0]["text"].strip()

print(answer)

It is not possible to accurately answer this question as the 2024 Oscars have not yet taken place. The winner for Best Director in the 2024 Oscars will be announced during the ceremony itself.


In [8]:
question2 = """
"Do you know who was the best supporting actress winner in the 2024 Oscars?"

"""
answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question2,
    max_tokens=200
)["choices"][0]["text"].strip()

print(answer)

I am an AI and do not have access to future events or information. I also do not track Oscars or other award shows. Is there something else I can assist you with?


### I then create the embeddings

In [9]:
def create_embeddings(df, embedding_name = "text-embedding-ada-002"):
        
    # Embed each row's text, one API call per row
    embeddings = []
    for index, row in df.iterrows():
        
        text = row["text"]
        # Send text data to OpenAI model to get embeddings
        response = openai.Embedding.create(
            input=text,
            engine=embedding_name
        )

        # Add embeddings to list
        embeddings.extend([data["embedding"] for data in response["data"]])

    # Add embeddings list to dataframe
    df["embeddings"] = embeddings
    
    return df

In [10]:
df_emb = create_embeddings(df)

In [11]:
df_emb.head()

| | text | embeddings |
|---|---|---|
| 0 | Best Picture - Oppenheimer – Emma Thomas, Char... | [-0.014236527495086193, -0.017690204083919525,... |
| 1 | Best Director - Christopher Nolan – Oppenheimer ‡ | [-0.008370069786906242, -0.011518937535583973,... |
| 2 | Best Actor - Cillian Murphy – Oppenheimer as J... | [-0.02766180783510208, -0.017674298956990242, ... |
| 3 | Best Actress - Emma Stone – Poor Things as Bel... | [-0.02568988874554634, -0.0269196517765522, 0.... |
| 4 | Best Supporting Actor - Robert Downey Jr. – Op... | [-0.023602420464158058, 0.008683276362717152, ... |


### Next I calculate the distances

In [17]:
def calculate_distances(df_emb, question, embedding_name = "text-embedding-ada-002"):
    
    #First calculate embedding based on user question:
    question_emb = get_embedding(question, engine=embedding_name)
    
    #Let's calculate distances and sort
    df_emb_dist = df_emb.copy()
    
    # Distances
    df_emb_dist["distances"] = distances_from_embeddings(
        question_emb,
        df_emb_dist["embeddings"].to_list(),
        distance_metric="cosine"
    )
    
    #Sorting
    df_emb_dist.sort_values("distances", ascending = True, inplace = True)
    
    return df_emb_dist

In [18]:
calculate_distances(df_emb, question=question1)

| | text | embeddings | distances |
|---|---|---|---|
| 1 | Best Director - Christopher Nolan – Oppenheimer ‡ | [-0.008370069786906242, -0.011518937535583973,... | 0.208632 |
| 18 | Best Cinematography - Oppenheimer – Hoyte van ... | [-0.0018361049005761743, -0.008508363738656044... | 0.227347 |
| 21 | Best Film Editing - Oppenheimer – Jennifer Lame ‡ | [-0.017987269908189774, 0.015496301464736462, ... | 0.228611 |
| 0 | Best Picture - Oppenheimer – Emma Thomas, Char... | [-0.014236527495086193, -0.017690204083919525,... | 0.230984 |
| 13 | Best Animated Short Film - War Is Over! Inspir... | [-0.022053616121411324, -0.01141801755875349, ... | 0.232826 |
| 4 | Best Supporting Actor - Robert Downey Jr. – Op... | [-0.023602420464158058, 0.008683276362717152, ... | 0.233569 |
| 12 | Best Live Action Short Film - The Wonderful St... | [-0.004012526012957096, 0.0025663787964731455,... | 0.233854 |
| 2 | Best Actor - Cillian Murphy – Oppenheimer as J... | [-0.02766180783510208, -0.017674298956990242, ... | 0.233855 |
| 9 | Best International Feature Film - The Zone of ... | [-0.0027441319543868303, -0.022614292800426483... | 0.23885 |
| 8 | Best Animated Feature - The Boy and the Heron ... | [-0.005875669419765472, -0.00873585231602192, ... | 0.243896 |
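
For intuition about the `distances` column: with `distance_metric="cosine"`, the distance is one minus the cosine similarity, so a smaller value means the row is closer in meaning to the question. Here is a minimal pure-Python sketch, using made-up 3-dimensional vectors in place of real embeddings (`cosine_distance` and the toy vectors are illustrative and not part of the notebook):

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (real ones have many more dimensions).
question_vec = [1.0, 0.0, 0.0]
rows = {
    "same direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 3.0, 0.0],
    "partly aligned": [1.0, 1.0, 0.0],
}

# Sort ascending, as calculate_distances does: the closest row comes first.
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)
```

Cosine distance ignores vector length, which is why the "same direction" vector has distance 0 even though it is twice as long as the question vector.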


### Now creating a custom prompt
* Note: I decided not to use the technique from the course that packs as much context as possible into the prompt. My dataset is small and I do not want to confuse the model, so only the single closest row is used as context.

In [40]:
def create_prompt(question, df):
    
    prompt_template = """
                    Answer the question based on the context below, and if the question
                    can't be answered based on the context, say "I don't know"

                    Context: 

                    {}

                    ---

                    Question: {}
                    Answer:
                    """
    
    # Use the dataframe passed in as `df` rather than the global `df_emb`
    df_dist = calculate_distances(df, question=question)
    context = df_dist["text"].values[0]
    
    print(f"Context used: {context}")
    return prompt_template.format(context, question)

In [41]:
create_prompt(question1, df_emb)

Context used: Best Director - Christopher Nolan – Oppenheimer ‡


'\n                    Answer the question based on the context below, and if the question\n                    can\'t be answered based on the context, say "I don\'t know"\n\n                    Context: \n\n                    Best Director - Christopher Nolan – Oppenheimer ‡\n\n                    ---\n\n                    Question: \n"Who was the winner of the Oscar for Best Director in the 2024 Oscars?"\n\n\n                    Answer:\n                    '
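
If a question needed more than one row, for example one spanning several categories, the context could be built from the closest rows up to a size limit. The following is a rough sketch of that alternative, not the approach used in this notebook. `build_context`, `max_words`, the `###` separator, and the word-count budget are assumptions for illustration (a real implementation would more likely count tokens), and the rows are assumed to be already sorted by distance.

```python
import textwrap

# Same wording as the notebook's template, dedented to avoid stray indentation.
PROMPT_TEMPLATE = textwrap.dedent("""\
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context:

    {context}

    ---

    Question: {question}
    Answer:""")


def build_context(sorted_rows: list[str], max_words: int = 40) -> str:
    """Join the closest rows, in order, until the word budget is used up."""
    selected = []
    used = 0
    for row in sorted_rows:
        words = len(row.split())
        if used + words > max_words:
            break
        selected.append(row)
        used += words
    return "\n\n###\n\n".join(selected)


# Example rows, assumed to be sorted closest-first.
rows = [
    "Best Director - Christopher Nolan – Oppenheimer ‡",
    "Best Film Editing - Oppenheimer – Jennifer Lame ‡",
    "Best Supporting Actress - Da'Vine Joy Randolph – The Holdovers as Mary Lamb ‡",
]

context = build_context(rows, max_words=20)
prompt = PROMPT_TEMPLATE.format(
    context=context,
    question="Who won Best Director at the 2024 Oscars?",
)
print(prompt)
```

With a budget of 20 words, only the first two rows fit; the third is left out because adding it would exceed the limit.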

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [42]:
question1 = """
"Who was the winner of the Oscar for Best Director in the 2024 Oscars?"
"""

In [43]:
initial_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question1,
    max_tokens=200
)["choices"][0]["text"].strip()

print(initial_answer)

As of now, the 2024 Oscars have not yet taken place, so there is no information available on who the winner of the Oscar for Best Director will be.


In [44]:
improved_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=create_prompt(question1, df_emb),
    max_tokens=200
)["choices"][0]["text"].strip()

print(improved_answer)

Context used: Best Director - Christopher Nolan – Oppenheimer ‡
Christopher Nolan


### Question 2

In [45]:
question2 = """
"Do you know who was the best supporting actress winner in the 2024 Oscars?"
"""

In [46]:
initial_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question2,
    max_tokens=200
)["choices"][0]["text"].strip()

print(initial_answer)

I cannot answer this question as the 2024 Oscars have not yet happened and the winner for Best Supporting Actress has not yet been announced.


In [48]:
improved_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=create_prompt(question2, df_emb),
    max_tokens=200
)["choices"][0]["text"].strip()

print(improved_answer)

Context used: Best Supporting Actress - Da'Vine Joy Randolph – The Holdovers as Mary Lamb ‡
Da'Vine Joy Randolph
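
The basic-versus-custom comparison repeated for each question could be wrapped in a small helper. This is a sketch under assumptions: `compare_answers` is a hypothetical name, and `complete` and `make_prompt` stand for any callables that return a model answer and a context-augmented prompt. Stubs are used here so the example runs without an API key.

```python
from typing import Callable


def compare_answers(
    question: str,
    complete: Callable[[str], str],
    make_prompt: Callable[[str], str],
) -> dict[str, str]:
    """Return the answer to the raw question and to the context-augmented prompt."""
    return {
        "question": question.strip(),
        "basic": complete(question).strip(),
        "custom": complete(make_prompt(question)).strip(),
    }


# Stubs for illustration only; they do not call any model.
def fake_complete(prompt: str) -> str:
    return "Christopher Nolan" if "Context:" in prompt else "I cannot answer this."


def fake_make_prompt(question: str) -> str:
    context = "Best Director - Christopher Nolan – Oppenheimer ‡"
    return f"Context: {context}\n\nQuestion: {question}\nAnswer:"


result = compare_answers("Who won Best Director?", fake_complete, fake_make_prompt)
print(result)
```

In this notebook, `complete` could be a thin wrapper around `openai.Completion.create(model="gpt-3.5-turbo-instruct", prompt=..., max_tokens=200)["choices"][0]["text"]`, and `make_prompt` could be `lambda q: create_prompt(q, df_emb)`.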
