# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

### The chosen Dataset and why it is appropriate for the task

For this project, we use the text of the Wikipedia page on the **2024 ICC Men's T20 World Cup**. It suits a question-answering chatbot because the tournament took place in 2024, after the training data of the `gpt-3.5-turbo-instruct` model used here was collected, so the model cannot answer questions about it on its own. Supplying this page as context lets the chatbot answer from current information instead of guessing.

To improve the answers, we use Retrieval-Augmented Generation (RAG). For each question, the paragraphs of the dataset most similar to the question are retrieved and added to the prompt, so the model answers from that context rather than from its training data alone.
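
The idea can be sketched without any API calls. In the toy example below, simple word overlap stands in for embedding similarity (an assumption made only to keep the sketch self-contained), and the helper names `tokenize`, `retrieve`, and `build_prompt` are hypothetical, not part of this notebook. The two passages are sentences from the retrieved context shown later in this notebook.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, n=1):
    """Return the n passages sharing the most words with the question.

    Word overlap is a stand-in for embedding similarity in this sketch.
    """
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:n]

def build_prompt(question, context_passages):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(context_passages)
    return f"Context:\n{context}\nQuestion: {question}\nAnswer:"

passages = [
    "The two hosts West Indies and the United States along with the top eight "
    "teams from the 2022 tournament qualified automatically for the tournament.",
    "On 23 June 2024, England became the first team to qualify for the "
    "semi-finals after defeating United States at Kensington Oval.",
]
question = "Which team was the first to qualify for the semi-finals?"
prompt = build_prompt(question, retrieve(question, passages))
print(prompt)
```

The rest of the notebook follows the same retrieve-then-prompt shape, with OpenAI embeddings and cosine distance doing the retrieval.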

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
# Importing libraries
import requests
import pandas as pd
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings

In [2]:
# Environment (global) variables
openai.api_key = "YOUR API KEY"

# File paths
DATA_FILEPATH = './wikipedia_paragraphs.csv'
CSV_FILEPATH_WITH_EMBEDDINGS = './wikipedia_with_embeddings.csv'

# OpenAI Models
EMBEDDING_MODEL = 'text-embedding-3-small'
COMPLETION_MODEL = 'gpt-3.5-turbo-instruct'

# Batch size for processing
BATCH_SIZE = 25

In [3]:
def fetch_wikipedia_page(title: str, output_file: str = None):
    """
    Fetches and cleans text content from a specified Wikipedia page and separates it into paragraphs.

    Parameters:
    title (str): The title of the Wikipedia page to fetch.
    output_file (str, optional): The file path to save the DataFrame as a CSV. Defaults to None.

    Returns:
    pd.DataFrame: A DataFrame containing the paragraphs of the Wikipedia page.
    """
    # Wikipedia API endpoint
    url = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": True,
        "titles": title
    }

    # Send the request to the Wikipedia API and fail early on HTTP errors
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    data = response.json()

    # Extract the page content
    page = next(iter(data['query']['pages'].values()))
    text = page['extract']

    # Split the text into paragraphs
    paragraphs = text.split('\n')

    # Clean the paragraphs (remove extra whitespace)
    cleaned_paragraphs = [' '.join(para.split()) for para in paragraphs if para.strip()]

    # Create a DataFrame with one column named 'text'
    df = pd.DataFrame({'text': cleaned_paragraphs})

    # Clean up dataframe to remove empty lines and headings
    df = df[((df["text"].str.len() > 0) & (~df["text"].str.startswith("==")))].reset_index(drop=True)

    # Save the DataFrame to a CSV file if output_file is provided
    if output_file:
        df.to_csv(output_file, index=False)

    return df
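
To see what the cleaning steps in `fetch_wikipedia_page` do, here is a minimal sketch that applies the same three rules (split on newlines, collapse whitespace, drop blank lines and `==` headings) to a made-up extract. The helper name `clean_extract` and the sample string are illustrative assumptions, not part of the notebook or of the real page.

```python
def clean_extract(text):
    """Split a plain-text extract into paragraphs, dropping blanks and '==' headings."""
    paragraphs = []
    for line in text.split("\n"):
        normalized = " ".join(line.split())  # collapse runs of whitespace
        if normalized and not normalized.startswith("=="):
            paragraphs.append(normalized)
    return paragraphs

# Made-up sample in the shape of a plain-text extract with section headings
sample_extract = (
    "Intro   paragraph with  extra spaces.\n"
    "\n"
    "== Background ==\n"
    "Second paragraph.\n"
    "=== Details ===\n"
    "   \n"
    "Third paragraph."
)
print(clean_extract(sample_extract))
# ['Intro paragraph with extra spaces.', 'Second paragraph.', 'Third paragraph.']
```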

In [4]:
# Extracting the data
df = fetch_wikipedia_page("2024_ICC_Men's_T20_World_Cup", DATA_FILEPATH)
print(df)

                                                 text
0   The 2024 ICC Men's T20 World Cup was the ninth...
1   The tournament field expanded from 16 to 20 te...
2   England were the defending champions and were ...
3   The ICC Men's T20 World Cup is a professional ...
4   In November 2021, the ICC announced that the 2...
..                                                ...
57  On 30 June 2024, the ICC announced its team of...
58  In India, Disney Star handled host broadcastin...
59  In an effort to help promote the sport to U.S....
60                                   Official website
61                    Tournament home at ESPNcricinfo

[62 rows x 1 columns]


## Creating embeddings index

In [5]:

def generate_embeddings_from_df(df, embedding_model_name, batch_size=25):
    """
    Generates embeddings for texts in a DataFrame using the specified OpenAI embedding model.

    Parameters:
    df (pd.DataFrame): DataFrame containing the texts.
    embedding_model_name (str): The name of the OpenAI embedding model to use.
    batch_size (int): The number of texts to send in each API request. Defaults to 25.

    Returns:
    pd.DataFrame: DataFrame with the original texts and their corresponding embeddings.
    """
    embeddings = []
    for i in range(0, len(df), batch_size):
        # Send text data to OpenAI model to get embeddings
        response = openai.Embedding.create(
            input=df.iloc[i:i+batch_size]["text"].tolist(),
            engine=embedding_model_name
        )

        # Add embeddings to list
        embeddings.extend([data["embedding"] for data in response["data"]])

    # Add embeddings list to dataframe
    df["embeddings"] = embeddings
    return df

# Creating embeddings
df = pd.read_csv(DATA_FILEPATH)

df_with_embeddings = generate_embeddings_from_df(df, EMBEDDING_MODEL, BATCH_SIZE)
display(df_with_embeddings[['text', 'embeddings']].head())


                                                text                                         embeddings
0  The 2024 ICC Men's T20 World Cup was the ninth...  [-0.02775840274989605, -0.005271230358630419, ...
1  The tournament field expanded from 16 to 20 te...  [-0.01747475564479828, 0.015204494819045067, 0...
2  England were the defending champions and were ...  [0.02288796566426754, -0.010841187089681625, 0...
3  The ICC Men's T20 World Cup is a professional ...  [-0.013975653797388077, 0.011683300137519836, ...
4  In November 2021, the ICC announced that the 2...  [-0.02671620063483715, -0.02153180167078972, 0...
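
The batching loop in `generate_embeddings_from_df` steps through the rows in strides of `batch_size`, and the final slice is simply shorter. As a small sketch (the helper name `batch_bounds` is hypothetical), the 62 rows of this dataset with a batch size of 25 split into three requests:

```python
def batch_bounds(n_rows, batch_size):
    """Return the (start, stop) slice bounds for each batch."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]

print(batch_bounds(62, 25))  # [(0, 25), (25, 50), (50, 62)]
```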


In [6]:
# Saving embeddings
df_with_embeddings.to_csv(CSV_FILEPATH_WITH_EMBEDDINGS)

In [42]:
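# Loading the embeddings (if required)
import ast

# The index was saved as the first CSV column and each embedding was stored as text,
# so restore the index and parse every embedding back into a list of floats
df_with_embeddings = pd.read_csv(CSV_FILEPATH_WITH_EMBEDDINGS, index_col=0)
df_with_embeddings["embeddings"] = df_with_embeddings["embeddings"].apply(ast.literal_eval)
# df_with_embeddings.head()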
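
CSV has no list type, so a column of embedding lists is written as text and comes back as strings when the file is read again. The sketch below shows the effect with the standard-library `csv` module, used here only to keep the example self-contained; the same round trip happens with a DataFrame written by `to_csv` and read by `read_csv`. The sample row is made up.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["sample row", [0.1, -0.2, 0.3]])  # the list is written as its str() form

# Read it back: the field is now a string, not a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["embeddings"]).__name__)  # str

# Parse the string back into a list of floats before computing distances
parsed = ast.literal_eval(row["embeddings"])
print(parsed)  # [0.1, -0.2, 0.3]
```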

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [43]:
def create_custom_query(query, context):
    """
    Combines the question and the retrieved context into a custom prompt.
    Parameters:
    query (str): The question
    context (str): The top n pieces of text most relevant to the question, separated by newlines
    Returns:
    A string containing the instruction prompt built from the query and context
    """
    
    return f"""
            Answer the question based on the context below. If the question cannot be answered from the context, simply respond with "I don't know." and nothing else. The context has some facts about the 2024 ICC T20 Cricket World Cup.
            Context:
            {context}
            Question: {query}
            Answer:"""


def find_context(query, data_df, n=5):
    """
    Function that takes in a question string and a dataframe containing rows of text
    and associated embeddings, and returns the top n rows of text from that dataframe,
    sorted from most to least relevant for that question.
    Parameters:
    query (str): The question
    data_df (pd.DataFrame): The dataframe containing text and respective embeddings
    n (int): The number of top results required
    Returns:
    A string of text blocks separated by newline
    """

    # Get embeddings for the question text
    query_embeddings = get_embedding(query, engine=EMBEDDING_MODEL)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = data_df.copy()
    df_copy["distances"] = distances_from_embeddings(
        query_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by distance (shorter distance = more relevant,
    # so ascending order) and return the text of the top n rows joined by newlines
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return '\n'.join(df_copy.iloc[:n]["text"].tolist())
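
For intuition about the ranking in `find_context`, cosine distance is one minus the cosine of the angle between two vectors, so vectors pointing the same way have distance 0 regardless of length. The following is a pure-Python sketch with made-up two-dimensional vectors; the function `cosine_distance` is an illustrative stand-in, not the library implementation.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],   # distance 0.0
    "orthogonal": [0.0, 3.0],       # distance 1.0
    "opposite": [-1.0, 0.0],        # distance 2.0
}

# Ascending distance puts the most relevant row first
ranked = sorted(rows, key=lambda name: cosine_distance(query_vec, rows[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```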



In [44]:
def answer_question(prompt):
    """
    Function that sends a prompt/question to OpenAI model and returns the response/answer
    Parameters:
    prompt (str): The text of the prompt
    Returns:
    A string, which is the response from the model
    """
    
    resp = openai.Completion.create(model=COMPLETION_MODEL,
                                    prompt=prompt,
                                    max_tokens=100)
    
    return resp.choices[0].text

In [34]:
# Testing with a sample question
question = "Which 4 teams qualified for the semifinals in the T20 World Cup 2024?"
cont = find_context(question, df_with_embeddings)
print(cont)

The two hosts West Indies and the United States along with the top eight teams from the 2022 tournament qualified automatically for the tournament. The remaining two automatic qualification places were taken by the best-ranked teams in the ICC Men's T20I Team Rankings which had not already qualified, as of 14 November 2022. The eight remaining places were filled via the ICC's regional qualifiers, consisting of two teams from Africa, Asia, and Europe and one team each from the Americas and the East Asia-Pacific groups. In May 2022, the ICC confirmed the sub-regional qualification pathways for Europe, East Asia-Pacific, and Africa.
On 23 June 2024, England became the first team to qualify for the semi-finals after defeating United States at Kensington Oval. Later on the same day, South Africa became the second team to qualify for the semi-final after defeating West Indies at Sir Vivian Richards Stadium. On 24 June 2024, India became the third team to qualify for the semi-finals after defe

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
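
The side-by-side comparison can be wrapped in a small helper. This is only a sketch: `compare_answers` is a hypothetical function, and the stubs below stand in for the model call and the retrieval step so the example runs without API access.

```python
def compare_answers(question, answer_fn, context_fn, prompt_fn):
    """Return the answers produced without and with retrieved context."""
    baseline = answer_fn(question)
    augmented = answer_fn(prompt_fn(question, context_fn(question)))
    return {
        "question": question,
        "baseline": baseline.strip(),
        "augmented": augmented.strip(),
    }

# Stub standing in for the model call (hypothetical, for illustration only)
def fake_answer(prompt):
    return " from context" if "Context:" in prompt else " no context"

result = compare_answers(
    "Example question?",
    answer_fn=fake_answer,
    context_fn=lambda q: "Example context.",
    prompt_fn=lambda q, c: f"Context:\n{c}\nQuestion: {q}\nAnswer:",
)
print(result)
```

In this notebook the stubs would be replaced by the real functions, for example `compare_answers(question1, answer_question, lambda q: find_context(q, df_with_embeddings), create_custom_query)`.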

### Question 1

In [16]:
question1 = "Who was the player of the tournament in ICC T20 World Cup 2024?"

In [20]:
# Answer without context
display(answer_question(question1))

' \n\nAs of 2021, the ICC T20 World Cup 2024 has not taken place and therefore there is no player who has been awarded the Player of the Tournament title. The next ICC T20 World Cup is scheduled to take place in 2022 in Australia.'

In [19]:
# Answer with context (custom prompt)
context1 = find_context(question1, df_with_embeddings)
display(answer_question(create_custom_query(question1, context1)))

' Jasprit Bumrah'

### Question 2

In [21]:
question2 = "Which team did India defeat in the semi-finals in the ICC T20 World Cup 2024?"

In [22]:
# Answer without context
display(answer_question(question2))

' \n\nIt is not possible to accurately predict which team India will defeat in the semi-finals of the ICC T20 World Cup in 2024 as the tournament schedule and participating teams have not been announced yet. \n'

In [23]:
# Answer with context (custom prompt)
context2 = find_context(question2, df_with_embeddings)
display(answer_question(create_custom_query(question2, context2)))

' England'

### Question 3

In [38]:
question3 = "Which was one of the biggest upsets in the ICC T20 World Cup 2024?"

In [39]:
# Answer without context (wrong answer)
display(answer_question(question3))

"\n\nOne of the biggest upsets in the ICC T20 World Cup 2024 was when underdog team Papua New Guinea defeated heavy favorites India by 7 wickets in the group stage match. This victory was considered a major upset as India was the top-ranked team in the world and had been dominant in T20 cricket for years. Papua New Guinea's win shocked the cricket world and showed that any team can be a threat in the T20 format. "

In [40]:
# Answer with context (custom prompt)
context3 = find_context(question3, df_with_embeddings)
display(answer_question(create_custom_query(question3, context3)))

' USA beating Pakistan in the Super Over thanks to brilliant bowling from Saurabh Netravalkar.'