# Custom Chatbot Project - Chat on fun facts about Indian Lok Sabha Election 2024

I chose a dataset of facts about the 2024 Lok Sabha elections held in India. The facts were rewritten from various news sources and saved to a CSV file for easy loading. Because the information is recent and includes many unpredictable outcomes, the model is unlikely to have seen anything similar in its training data.

#### Imports and variables for frequent use

In [153]:
import openai
import tiktoken
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os
from typing import Dict, List

In [None]:
# Environment variables
# Read the key from the OPENAI_API_KEY environment variable instead of hard-coding it
openai.api_key = os.getenv('OPENAI_API_KEY')

# OpenAI Models
EMBEDDING_MODEL = 'text-embedding-ada-002'
COMPLETION_MODEL = 'gpt-3.5-turbo-instruct'

# Batch size for processing
BATCH_SIZE = 25

# Embeddings file path
CSV_FILEPATH_WITH_EMBEDDINGS = "dataset_with_embeddings.csv"

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [155]:
# Read Election facts dataset
df = pd.read_csv('toi_2024_election_facts.csv', encoding='utf-8', index_col=0)
df.head()

| index | text |
|---|---|
| 1 | After 20 years, Congress reclaimed the Allahab... |
| 2 | Samajwadi Party achieved its best Lok Sabha re... |
| 3 | BJP secured only 33 seats in Uttar Pradesh in ... |
| 4 | BJP's Lallu Singh lost the Faizabad seat to SP... |
| 5 | Congress won its first Gujarat Lok Sabha seat ... |


### Generate embeddings for dataset

We'll use OpenAI's `Embedding` tooling ([documentation here](https://platform.openai.com/docs/guides/embeddings/embeddings)) to create vectors representing each row of our custom dataset.

In [177]:
embeddings = []
for i in range(0, len(df), BATCH_SIZE):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+BATCH_SIZE]["text"].tolist(),
        engine=EMBEDDING_MODEL
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.head()

| index | text | embeddings |
|---|---|---|
| 1 | After 20 years, Congress reclaimed the Allahab... | [0.007014318369328976, -0.022495316341519356, ... |
| 2 | Samajwadi Party achieved its best Lok Sabha re... | [0.001216356409713626, 0.012867189943790436, 0... |
| 3 | BJP secured only 33 seats in Uttar Pradesh in ... | [-0.0010518135968595743, -0.012326016090810299... |
| 4 | BJP's Lallu Singh lost the Faizabad seat to SP... | [0.008040057495236397, 0.009712133556604385, 0... |
| 5 | Congress won its first Gujarat Lok Sabha seat ... | [-0.00889828521758318, -0.003393794409930706, ... |


In [178]:
df.to_csv(CSV_FILEPATH_WITH_EMBEDDINGS, index=False)
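
The embedding loop above sends the `"text"` column to the API in slices of `BATCH_SIZE` rows. As a minimal sketch of just the slicing logic (no API call is made here), the helper below yields those batches; `iter_batches` and the 60 placeholder rows are hypothetical and only illustrate how an uneven final batch is handled.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical stand-in for the "text" column: 60 placeholder rows.
rows = [f"fact {n}" for n in range(60)]
batch_lengths = [len(batch) for batch in iter_batches(rows, 25)]
print(batch_lengths)  # [25, 25, 10]
```

The last batch is simply shorter than the others, so no padding or special casing is needed before the results are concatenated back into one list.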

### Function to find relevant context in the text dataset for answering a given question

In [179]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question:str, df:pd.DataFrame) -> pd.DataFrame:
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
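
To make the ranking step concrete, here is a minimal pure-Python sketch of what a cosine distance computes (one minus the cosine similarity) and how sorting by it orders rows. The names `cosine_distance` and `rank_by_distance` are illustrative, and the 2-D vectors are toy stand-ins for the 1536-dimensional vectors that `text-embedding-ada-002` returns.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query: Sequence[float], rows: List[Sequence[float]]) -> List[int]:
    """Return row positions ordered from closest to farthest from `query`."""
    distances = [cosine_distance(query, row) for row in rows]
    return sorted(range(len(rows)), key=lambda i: distances[i])


# Toy 2-D vectors standing in for real embeddings.
query = [1.0, 0.0]
rows = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
order = rank_by_distance(query, rows)
print(order)  # [1, 2, 0]
```

Row 1 points in the same direction as the query, so its distance is 0 regardless of its length; the orthogonal row 0 has distance 1 and lands last.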

Test the function

In [180]:
relevant_rows_df = get_rows_sorted_by_relevance("How many VVPAT systems were deployed for 2024 elections?", df)
relevant_rows_df.head()

| index | text | embeddings | distances |
|---|---|---|---|
| 46 | 1.8 million VVPAT systems and 1.7 million cont... | [0.0013148181606084108, -0.014351684600114822,... | 0.079102 |
| 18 | Seven Independent candidates won in the 2024 L... | [0.007494308520108461, -0.003459648694843054, ... | 0.175379 |
| 44 | With 400 seats, BJP's 2024 performance was see... | [-0.019077735021710396, -0.005691202823072672,... | 0.186646 |
| 47 | Mysore Paints & Varnish Ltd provided over 26 l... | [0.010591262020170689, -0.0018588538514450192,... | 0.187923 |
| 42 | BJP's best performance in Odisha was winning 2... | [-0.00121549260802567, -0.0015830229967832565,... | 0.190139 |


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Functions to compose a text prompt

In [181]:
# Create simple prompt without context
def create_simple_prompt(question: str) -> str:
    """
    Builds a simple prompt for asking a question.

    Args:
        question (str): The question to include in the prompt.

    Returns:
        str: A prompt string containing the question followed by an "Answer:" cue.
    """
    prompt_template = """Question: {}
    Answer:
    """
    return prompt_template.format(question)

In [182]:
# Create prompt with context for RAG
def create_custom_prompt(question: str, df: pd.DataFrame, max_token_count:int) -> str:
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        {}

        ---

        Question: {}
        Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
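
The context-filling loop above is a greedy token budget: rows are taken in relevance order until the next one would push the running total past `max_token_count`. The sketch below isolates that logic under stated assumptions: `select_context` and `word_count` are hypothetical names, and `word_count` is a crude whitespace counter standing in for `len(tokenizer.encode(text))`, not a real tokenizer.

```python
from typing import Callable, List


def select_context(
    ranked_texts: List[str],
    budget: int,
    count_tokens: Callable[[str], int],
    reserved: int = 0,
) -> List[str]:
    """Take rows in ranked order until the next one would exceed `budget`."""
    used = reserved  # tokens already spent on the template and question
    selected: List[str] = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > budget:
            break  # stop at the first row that does not fit, as the notebook does
        selected.append(text)
    return selected


def word_count(text: str) -> int:
    # Crude stand-in for len(tokenizer.encode(text)); not a real tokenizer.
    return len(text.split())


ranked = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota", "kappa"]
chosen = select_context(ranked, budget=10, count_tokens=word_count, reserved=4)
print(chosen)  # ['alpha beta gamma', 'delta epsilon']
```

Because the loop breaks at the first row that does not fit, the short final row is skipped even though it alone would still fit in the remaining budget. Note also that the notebook's version does not count the `###` separators added when the rows are joined, so the final prompt can run slightly over the nominal limit.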

Test the function

In [183]:
create_simple_prompt("How many VVPAT systems were deployed for 2024 elections?")

'Question: How many VVPAT systems were deployed for 2024 elections?\n    Answer:\n    '

In [184]:
create_custom_prompt("How many VVPAT systems were deployed for 2024 elections?", df, 150)

'\n        Answer the question based on the context below, and if the question\n        can\'t be answered based on the context, say "I don\'t know"\n\n        Context: \n\n        1.8 million VVPAT systems and 1.7 million control units were used across 1.05 million polling stations in 2024.\n\n###\n\nSeven Independent candidates won in the 2024 Lok Sabha election.\n\n###\n\nWith 400 seats, BJP\'s 2024 performance was seen as disappointing, but still impressive compared to pre-2014 contexts, similar to Congress\'s 244 seats in 1989 under PVN Rao.\n\n        ---\n\n        Question: How many VVPAT systems were deployed for 2024 elections?\n        Answer:'

### Function that answers the question with a completion model

In [185]:
def answer_question(prompt:str, max_answer_tokens:int=150) -> str:
    """
    Given a prompt and a maximum number of tokens for the response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
        
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [198]:
# Load the embeddings file to avoid calling the API repeatedly
# Read the DataFrame from the CSV file
df = pd.read_csv(CSV_FILEPATH_WITH_EMBEDDINGS)
df['embeddings'] = df['embeddings'].apply(lambda value: [float(dim) for dim in value.replace('[', '').replace(']', '').split(',')])
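
When a list-valued column is written to CSV, each cell comes back as a string such as `"[0.1, -0.2]"`, which is why the cell above strips brackets and splits on commas. As an alternative sketch, assuming each cell is the plain string form of a list of finite floats, the standard `json` module can parse it directly; `parse_embedding` is a hypothetical helper name.

```python
import json
from typing import List


def parse_embedding(cell: str) -> List[float]:
    """Parse a CSV cell such as "[0.1, -0.2]" back into a list of floats."""
    values = json.loads(cell)
    if not isinstance(values, list):
        raise ValueError("expected a JSON array")
    return [float(v) for v in values]


original = [0.007014318369328976, -0.022495316341519356, 1e-05]
cell = str(original)  # the string form a list-valued cell takes in the CSV
restored = parse_embedding(cell)
print(restored == original)  # True
```

In the notebook this could be applied as `df['embeddings'].apply(parse_embedding)`. Unlike manual splitting, it raises an error on a malformed cell instead of silently producing a wrong vector.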

### Question 1

In [199]:
question_string = "How many seats did BJP win during 2024 lok sabha elections in Uttar Pradesh?"
question_1_simple_prompt = create_simple_prompt(question=question_string)
question_1_custom_prompt = create_custom_prompt(question=question_string, df=df, max_token_count=1800)

In [200]:
print(f'Question: {question_string}')
print(f'Answer without context: {answer_question(question_1_simple_prompt)}')
print(f'Answer with context: {answer_question(question_1_custom_prompt)}')

Question: How many seats did BJP win during 2024 lok sabha elections in Uttar Pradesh?
Answer without context: As a language AI, I cannot predict future events and outcomes. The 2024 lok sabha elections have not taken place yet, so the results are not available. It is best to wait until the elections take place to find out the number of seats won by BJP in Uttar Pradesh.
Answer with context: 33


### Question 2

In [201]:
question_string = "How many seats did NDA win in Maharashtra during 2024 lok sabha elections?"
question_2_simple_prompt = create_simple_prompt(question=question_string)
question_2_custom_prompt = create_custom_prompt(question=question_string, df=df, max_token_count=1800)

In [202]:
print(f'Question: {question_string}')
print(f'Answer without context: {answer_question(question_2_simple_prompt)}')
print(f'Answer with context: {answer_question(question_2_custom_prompt)}')

Question: How many seats did NDA win in Maharashtra during 2024 lok sabha elections?
Answer without context: As it is currently 2021, the 2024 Lok Sabha elections have not yet taken place. Therefore, it is not possible to accurately state how many seats the NDA (National Democratic Alliance) will win in Maharashtra during the 2024 Lok Sabha elections. The results of the elections are determined by the voting decisions of the citizens of Maharashtra and can only be known after the elections have taken place.
Answer with context: NDA (including Shiv Sena and NCP) won only 17 seats.


### Question 3

In [203]:
question_string = "How many VVPAT systems were deployed for 2024 elections?"
question_3_simple_prompt = create_simple_prompt(question=question_string)
question_3_custom_prompt = create_custom_prompt(question=question_string, df=df, max_token_count=1800)

In [204]:
print(f'Question: {question_string}')
print(f'Answer without context: {answer_question(question_3_simple_prompt)}')
print(f'Answer with context: {answer_question(question_3_custom_prompt)}')

Question: How many VVPAT systems were deployed for 2024 elections?
Answer without context: It is currently unclear how many VVPAT (Voter Verified Paper Audit Trail) systems will be deployed for the 2024 elections. The number will depend on various factors, including the decisions of election officials and the availability of funding and resources. Some areas may have a high concentration of VVPAT systems, while others may not have any at all. Therefore, an exact number cannot be determined at this time.
Answer with context: 1.8 million


### Question 4

In [208]:
question_string = "Can you summarize results for Indian Lok Sabha Elections duting 2024, in 3 sentences?"
question_4_simple_prompt = create_simple_prompt(question=question_string)
question_4_custom_prompt = create_custom_prompt(question=question_string, df=df, max_token_count=1800)

In [209]:
print(f'Question: {question_string}')
print(f'Answer without context: {answer_question(question_4_simple_prompt)}')
print(f'Answer with context: {answer_question(question_4_custom_prompt)}')

Question: Can you summarize results for Indian Lok Sabha Elections duting 2024, in 3 sentences?
Answer without context: As per the results of the 2024 Indian Lok Sabha Elections, the Bharatiya Janata Party (BJP) emerged as the frontrunner with a majority of seats, securing a second consecutive term for Prime Minister Narendra Modi. The opposition alliance, led by the Indian National Congress (INC), saw a decline in its number of seats compared to the previous election. Overall, the BJP-led National Democratic Alliance (NDA) secured a comfortable majority, paving the way for the continued leadership of PM Modi.
Answer with context: BJP was the largest majority by far, indonesia was 2nd in number with 400 seats won, INDIA Bloc performed well in states and Trinamool congress had a good show as well (29 seats).


## Conclusion

The dataset covers very recent events, so without context the model either hallucinates an answer or refuses to answer.

With the relevant context supplied through RAG, the model pinpoints the information and gives a response consistent with that context.

An interesting observation: when asked to summarize the event from the given context, the model produced some wrong information, because the project design optimizes for factual questions rather than summarization-style questions.