# Using RAG approach on an open source LLM such as Llama or PALM or Gemini with a vector db using langchain, DSPy etc.



| **Tasks and Comments**                                               | **Status** | **Individual Responsible** |
|---------------------------------------------------------------------|------------|----------------------------|
| **Using RAG approach on an open source LLM such as Llama or PALM or Gemini with a vector db using langchain, DSPy etc.** |            |                            |
| **Preprocessing Steps** - 1. Clean up unwanted characters, 2. Extract pairs, 3. Handle abbreviations, 4. Replace slang, 5. Negation handling, 6. Stopword removal, 7. NER | Done | Kamalpreet Kaur             |
| Training - Model built            | Done | Abhijeet Singh              |
| **Evaluation - ROUGE-L Score (0.7647518193361473), BERT Score(0.9388486438989639)**                          | Done| Abhijeet Singh               |
| Interpretation using Lime                                           | Not Applicable |                            |
| 1st round of tuning - What was the issue faced/tuned? Ans: **Responses were not contextually aware enough** (Solution: decreased the `temperature` parameter) | Done | Abhijeet Singh |
| 2nd round of tuning - What was the issue faced/tuned? Ans: **Answers were not precise enough** (Solution: added the `num_beams` and `num_return_sequences` parameters) | Done | Abhijeet Singh |
| Final AUC value?                              | Not Applicable |                            |


## Preprocessing 

In [1]:
import pandas as pd
import re


In [None]:
df = pd.read_csv('C:/Users/singh/Downloads/NLP-1/chat_data.csv', encoding='utf-8')

In [3]:
print(df.head())

                                       conversations          id
0  [{'from': 'human', 'value': "I've been feeling...  identity_0
1  [{'from': 'human', 'value': "Hi, I'm feeling r...  identity_1
2  [{'from': 'human', 'value': "Hey, I hope you'r...  identity_2
3  [{'from': 'human', 'value': "I'm feeling reall...  identity_3
4  [{'from': 'human', 'value': "I'm feeling reall...  identity_4


In [None]:

# Function to clean up the unwanted characters
def clean_conversation(conversation):
    # Replace single quotes with double quotes (standardizing quote format).
    # Note: this also turns apostrophes inside messages into double quotes (I've -> I"ve).
    conversation = conversation.replace("'", '"')
    
    # Remove unnecessary escape characters like \"
    conversation = re.sub(r'\\(["\'])', r'\1', conversation)
    
    # Remove any stray backslashes at the end of the conversation
    conversation = re.sub(r'\\$', '', conversation)
    
    # Ensure that the conversation is properly enclosed in double quotes
    if not conversation.startswith('"'):
        conversation = '"' + conversation
    if not conversation.endswith('"'):
        conversation = conversation + '"'
    
    return conversation

# Function to extract human-GPT pairs
def extract_pairs(conversation):
    # Find every {"from": ..., "value": ...} entry and capture the speaker and the message
    pairs = []
    conversation_data = re.findall(r'\{"from": "(human|gpt)", "value": "(.*?)"\}', conversation)
    
    # Group the pairs
    human_msg = None
    for speaker, message in conversation_data:
        if speaker == 'human':
            human_msg = message
        elif speaker == 'gpt' and human_msg:
            pairs.append({'human': human_msg, 'gpt': message})
            human_msg = None  # Reset for the next pair
    
    return pairs

# Clean the conversations column
df['cleaned_conversations'] = df['conversations'].apply(clean_conversation)

# Extract human-GPT pairs
flattened_conversations = []

for idx, row in df.iterrows():
    pairs = extract_pairs(row['cleaned_conversations'])
    if pairs:
        for pair in pairs:
            flattened_conversations.append({
                'id': row['id'],
                'human': pair['human'],
                'gpt': pair['gpt']
            })

# Create a DataFrame from the valid pairs
flattened_df = pd.DataFrame(flattened_conversations)

# Save the result to a new CSV
flattened_df.to_csv('C:/Users/singh/Downloads/NLP-1/preprocessed_conversations.csv', index=False)
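
The `conversations` column appears to hold Python-literal lists of dicts (see the `df.head()` output above). One alternative to quote replacement plus a regular expression is therefore to parse each cell with `ast.literal_eval`, which keeps apostrophes such as `I've` intact. This is a minimal sketch under that assumption. `parse_pairs` and the sample string are illustrative and not part of the original notebook.

```python
import ast


def parse_pairs(raw):
    """Parse a stringified list of {'from': ..., 'value': ...} dicts into human/gpt pairs."""
    turns = ast.literal_eval(raw)  # safe evaluation of Python literals only
    pairs = []
    human_msg = None
    for turn in turns:
        if turn["from"] == "human":
            human_msg = turn["value"]
        elif turn["from"] == "gpt" and human_msg is not None:
            pairs.append({"human": human_msg, "gpt": turn["value"]})
            human_msg = None  # reset for the next pair
    return pairs


# Illustrative input shaped like one cell of the 'conversations' column
sample = (
    "[{'from': 'human', 'value': \"I've been feeling sad.\"}, "
    "{'from': 'gpt', 'value': \"I'm here to listen.\"}]"
)
print(parse_pairs(sample))
```

If some rows are not valid Python literals, `ast.literal_eval` raises an exception, so a real run would need a `try`/`except` around the call.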


In [4]:
New_df = pd.read_csv('C:/Users/singh/Downloads/NLP-1/preprocessed_conversations.csv', encoding='utf-8')

In [5]:
print(New_df.head())

           id                                              human  \
0  identity_0  I"ve been feeling so sad and overwhelmed latel...   
1  identity_0  I recently got a promotion at work, which I th...   
2  identity_0  Well, the workload has increased significantly...   
3  identity_0  I"ve been trying to prioritize my tasks and de...   
4  identity_0  You"re right. I haven"t really opened up about...   

                                                 gpt  
0  Hey there, I"m here to listen and support you....  
1  I can understand how it can be overwhelming wh...  
2  It sounds like you"re dealing with a lot of pr...  
3  It"s great to hear that you"re already impleme...  
4  It"s completely normal to feel that way, but r...  


In [None]:
import re

# 1. Define the function to expand abbreviations
def expand_abbreviations(text):
    # Replace all double quotes with apostrophes before applying the abbreviation expansion
    text = text.replace('"', "'")
    
    abbreviations = {
        "I'm": "I am", "you're": "you are","It's": "It is", "it's": "it is", "can't": "cannot",
        "don't": "do not", "I've": "I have", "he's": "he is", "she's": "she is",
        "they're": "they are", "we're": "we are", "isn't": "is not", "wasn't": "was not",
        "weren't": "were not", "hasn't": "has not", "haven't": "have not", "won't": "will not",
        "didn't": "did not", "couldn't": "could not", "shouldn't": "should not", "wouldn't": "would not",
        "there's": "there is","There's": "There is", "That's": "That is","that's": "that is", "What's": "What is", "what's": "what is", "let's": "let us", "Let's": "Let us",
        "who's": "who is","Who's": "Who is", "aren't": "are not"
    }
    for key, value in abbreviations.items():
        text = re.sub(r'\b' + re.escape(key) + r'\b', value, text)
    return text

# Apply cleaning to both human and GPT responses
New_df['human'] = New_df['human'].apply(expand_abbreviations)
New_df['gpt'] = New_df['gpt'].apply(expand_abbreviations)
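
The dictionary above lists upper- and lower-case variants separately and runs one `re.sub` per entry. A possible alternative, sketched here with a shortened and purely illustrative mapping, is a single case-insensitive pattern that restores the capital letter afterwards. `CONTRACTIONS` and `expand_contractions` are hypothetical names.

```python
import re

# Shortened, illustrative mapping; keys are lower-case
CONTRACTIONS = {
    "i'm": "I am", "i've": "I have", "you're": "you are",
    "it's": "it is", "can't": "cannot", "don't": "do not",
    "haven't": "have not",
}

# One alternation over all keys, matched on word boundaries, ignoring case
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    def replace(match):
        expanded = CONTRACTIONS[match.group(0).lower()]
        # Keep a leading capital if the original contraction had one
        if match.group(0)[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return PATTERN.sub(replace, text)


print(expand_contractions("I've been sad, but it's ok. You're right, I haven't slept."))
```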


In [6]:

# Check the output
print(New_df.head())


           id                                              human  \
0  identity_0  I"ve been feeling so sad and overwhelmed latel...   
1  identity_0  I recently got a promotion at work, which I th...   
2  identity_0  Well, the workload has increased significantly...   
3  identity_0  I"ve been trying to prioritize my tasks and de...   
4  identity_0  You"re right. I haven"t really opened up about...   

                                                 gpt  
0  Hey there, I"m here to listen and support you....  
1  I can understand how it can be overwhelming wh...  
2  It sounds like you"re dealing with a lot of pr...  
3  It"s great to hear that you"re already impleme...  
4  It"s completely normal to feel that way, but r...  


In [None]:
def replace_slang(text):
    slang_map = {
        "gonna": "going to", "wanna": "want to", "gotta": "got to",
        "ain't": "is not", "gimme": "give me", "kinda": "kind of",
        "sorta": "sort of", "lemme": "let me", "outta": "out of",
        "dunno": "do not know", "bro": "brother", "sis": "sister",
        "idk": "I do not know", "omg": "oh my god", "btw": "by the way"
    }
    words = text.split()
    processed_words = [slang_map.get(word.lower(), word.lower()) for word in words]
    return ' '.join(processed_words)
# Apply replace_slang to both 'human' and 'gpt' columns
New_df['human'] = New_df['human'].apply(replace_slang)
New_df['gpt'] = New_df['gpt'].apply(replace_slang)
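
Because `replace_slang` splits on whitespace, a token with attached punctuation such as `gonna,` is not found in the map. A possible variant, sketched with an illustrative subset of the map, substitutes on word matches instead. Unlike the function above it does not lower-case the rest of the text. `replace_slang_tokens` is a hypothetical name.

```python
import re

# Illustrative subset of the slang map used above
SLANG = {"gonna": "going to", "wanna": "want to", "idk": "I do not know"}

# A "word" here is a run of letters and apostrophes; punctuation is left untouched
WORD = re.compile(r"[A-Za-z']+")


def replace_slang_tokens(text):
    """Replace slang words while keeping surrounding punctuation in place."""
    return WORD.sub(lambda m: SLANG.get(m.group(0).lower(), m.group(0)), text)


print(replace_slang_tokens("I'm gonna rest, idk."))
```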


In [7]:

# Check the output
print(New_df.head())


           id                                              human  \
0  identity_0  I"ve been feeling so sad and overwhelmed latel...   
1  identity_0  I recently got a promotion at work, which I th...   
2  identity_0  Well, the workload has increased significantly...   
3  identity_0  I"ve been trying to prioritize my tasks and de...   
4  identity_0  You"re right. I haven"t really opened up about...   

                                                 gpt  
0  Hey there, I"m here to listen and support you....  
1  I can understand how it can be overwhelming wh...  
2  It sounds like you"re dealing with a lot of pr...  
3  It"s great to hear that you"re already impleme...  
4  It"s completely normal to feel that way, but r...  


In [None]:
def handle_negations(text):
    negation_words = ["not", "no", "never", "cannot", "n't"]
    words = text.split()
    processed_words = []
    negate = False

    for word in words:
        if any(neg in word.lower() for neg in negation_words):
            if not negate:  # Activate negation only if it's not already active
                processed_words.append(word.lower())
            negate = True
        elif negate:
            # Apply negation to the next word and reset
            processed_words.append(f"not {word.lower()}")
            negate = False
        else:
            processed_words.append(word.lower())
    
    # Remove consecutive "not not" cases
    final_text = ' '.join(processed_words).replace("not not", "not")
    return final_text

# Apply the function to your dataframe
New_df['human'] = New_df['human'].apply(handle_negations)
New_df['gpt'] = New_df['gpt'].apply(handle_negations)
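
The test `neg in word.lower()` in `handle_negations` is a substring check, so it also fires on words such as "normal" or "know", which contain "no". The sketch below contrasts that check with a whole-token comparison. `substring_match` and `token_match` are illustrative helpers, not part of the original notebook.

```python
NEGATION_WORDS = {"not", "no", "never", "cannot"}


def substring_match(word):
    """Same test as in handle_negations: true if a negation string occurs anywhere in the word."""
    return any(neg in word.lower() for neg in ["not", "no", "never", "cannot", "n't"])


def token_match(word):
    """Whole-token test: strip surrounding punctuation, then compare the full word."""
    token = word.lower().strip(".,!?;:\"'")
    return token in NEGATION_WORDS or token.endswith("n't")


for word in ["not", "don't", "normal", "know", "nothing"]:
    print(word, substring_match(word), token_match(word))
```

Swapping `token_match` into the loop would change the preprocessed text, so any saved CSV and downstream scores would need to be regenerated.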



In [8]:
# Check for any lingering issues
print(New_df.head())


           id                                              human  \
0  identity_0  I"ve been feeling so sad and overwhelmed latel...   
1  identity_0  I recently got a promotion at work, which I th...   
2  identity_0  Well, the workload has increased significantly...   
3  identity_0  I"ve been trying to prioritize my tasks and de...   
4  identity_0  You"re right. I haven"t really opened up about...   

                                                 gpt  
0  Hey there, I"m here to listen and support you....  
1  I can understand how it can be overwhelming wh...  
2  It sounds like you"re dealing with a lot of pr...  
3  It"s great to hear that you"re already impleme...  
4  It"s completely normal to feel that way, but r...  


In [None]:
# Function for thorough text cleaning
def clean_text(text):
    # Remove unwanted characters (e.g., quotes, extra spaces)
    text = re.sub(r'[^\w\s.,!?\'"-]', '', text)  # remove special characters
    text = text.replace('"', '').replace("'", '')  # Remove quotes
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# Apply cleaning to both human and GPT responses
New_df['human'] = New_df['human'].apply(clean_text)
New_df['gpt'] = New_df['gpt'].apply(clean_text)


In [9]:

# Check the cleaned data
print(New_df.head())


           id                                              human  \
0  identity_0  I"ve been feeling so sad and overwhelmed latel...   
1  identity_0  I recently got a promotion at work, which I th...   
2  identity_0  Well, the workload has increased significantly...   
3  identity_0  I"ve been trying to prioritize my tasks and de...   
4  identity_0  You"re right. I haven"t really opened up about...   

                                                 gpt  
0  Hey there, I"m here to listen and support you....  
1  I can understand how it can be overwhelming wh...  
2  It sounds like you"re dealing with a lot of pr...  
3  It"s great to hear that you"re already impleme...  
4  It"s completely normal to feel that way, but r...  


In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Download stopwords and punkt if you haven't already
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Set of stopwords in English
stop_words = set(stopwords.words('english'))

# Remove negations from the stopwords list (to preserve them)
negations = {"not", "no", "nor", "never", "isn't", "aren't", "don't", "didn't", "won't", "can't", "shouldn't"}
stop_words -= negations  # Remove negations from the stopwords list

# Function to remove punctuation
def remove_punctuation(tokens):
    # Remove punctuation from the token list
    return [word for word in tokens if word not in string.punctuation]

# Function to tokenize text and remove stopwords and punctuation
# (named differently from the earlier clean_text so it does not shadow it)
def remove_stopwords(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove punctuation and stopwords
    tokens = [word for word in remove_punctuation(tokens) if word.lower() not in stop_words]
    # Join tokens back into a readable sentence
    return " ".join(tokens)

# Apply the function to both 'human' and 'gpt' columns
New_df['human_clean'] = New_df['human'].apply(remove_stopwords)
New_df['gpt_clean'] = New_df['gpt'].apply(remove_stopwords)


In [None]:
import torch
from transformers import pipeline
import pandas as pd

# Check if CUDA (GPU) is available
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")




Using device: GPU


In [None]:
# Load pre-trained NER model from Hugging Face (using a model fine-tuned on NER tasks)
ner_model = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", device=device)


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Load the dataset
data_path = 'C:/Users/singh/Downloads/NLP-1/Possible_case_Preprocessing.csv'
df = pd.read_csv(data_path)
df = df.head(3000)

# Drop rows with missing data in the columns 'human_clean' and 'gpt_clean'
df = df.dropna(subset=['human_clean', 'gpt_clean'])


In [None]:
# Function to mask named entities
def mask_named_entities(text):
    # Run NER to detect entities
    entities = ner_model(text)
    
    # Sort entities by 'start' in reverse order so earlier character offsets stay valid after each replacement
    entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    
    # Replace entities with '<NAME>' in the text
    for entity in entities:
        # Check if the entity type is 'PER' (Person) or 'LOC' (Location)
        if entity['entity'] in ['I-PER', 'I-LOC']:
            text = text[:entity['start']] + '<NAME>' + text[entity['end']:]
    
    return text

# Test the function



In [None]:
test_text = "Barack Obama visited the White House yesterday."
print(mask_named_entities(test_text))  # Each token tagged I-PER or I-LOC is replaced with '<NAME>'


<NAME> <NAME> visited the <NAME> <NAME> yesterday.


In [None]:
# Apply NER-based name masking on the dataset columns
df['human_clean'] = df['human_clean'].apply(mask_named_entities)
df['gpt_clean'] = df['gpt_clean'].apply(mask_named_entities)


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
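
In the test output above, each tagged token is masked separately, so "Barack Obama" becomes two `<NAME>` placeholders. One possible refinement is to merge adjacent tagged spans before replacing them. The sketch below works on hand-written dictionaries shaped like the pipeline output (only the `entity`, `start` and `end` keys are used). The entity list is an assumption for illustration, not real model output, and `mask_spans` is a hypothetical helper.

```python
def mask_spans(text, entities, labels=("I-PER", "I-LOC"), placeholder="<NAME>"):
    """Mask tagged character spans, merging spans separated by at most one character."""
    spans = sorted((e["start"], e["end"]) for e in entities if e["entity"] in labels)

    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= 1:
            merged[-1][1] = max(merged[-1][1], end)  # extend the previous span
        else:
            merged.append([start, end])

    # Replace from right to left so earlier offsets stay valid
    for start, end in reversed(merged):
        text = text[:start] + placeholder + text[end:]
    return text


text = "Barack Obama visited the White House yesterday."
# Hand-written entities for illustration only
entities = [
    {"entity": "I-PER", "start": 0, "end": 6},
    {"entity": "I-PER", "start": 7, "end": 12},
    {"entity": "I-LOC", "start": 25, "end": 30},
    {"entity": "I-LOC", "start": 31, "end": 36},
]
print(mask_spans(text, entities))
```

This simple merge also joins a person span with a directly adjacent location span, which may or may not be desired.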



In [None]:
# Define the file path where the CSV will be saved
save_path = 'C:/Users/singh/Downloads/NLP-1/Possible_case_Preprocessing_NER.csv'

# Save the DataFrame to a CSV file
df.to_csv(save_path, index=False, encoding='utf-8')

print(f"File saved successfully at {save_path}")


File saved successfully at C:/Users/singh/Downloads/NLP-1/Possible_case_Preprocessing_NER.csv


# Data Pre-Processing Finished

# Using RAG approach on an open source LLM such as Llama or PALM or Gemini with a vector db using langchain, DSPy etc.

In [13]:
print(df.head())

           id                                              human  \
0  identity_0  i have been feeling so sad and overwhelmed lat...   
1  identity_0  i recently got a promotion at work, which i th...   
2  identity_0  well, the workload has increased significantly...   
3  identity_0  i have been trying to prioritize my tasks and ...   
4  identity_0  youre right. i have not really opened up about...   

                                                 gpt  \
0  hey there, i am here to listen and support you...   
1  i can understand how it can be overwhelming wh...   
2  it sounds like you are dealing with a lot of p...   
3  it is great to hear that you are already imple...   
4  it is completely normal not to feel that way, ...   

                                         human_clean  \
0  feeling sad overwhelmed lately work become mas...   
1  recently got promotion work thought would exci...   
2  well workload increased significantly find har...   
3  trying prioritize tasks del

# Initial Model

In [20]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pandas as pd

# Load the model to generate embeddings
embedding_model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")

# Load your data (adjust the file path as necessary)
data_path = 'C:/Users/singh/Downloads/NLP-1/Possible_case_Preprocessing_NER.csv'
df = pd.read_csv(data_path)
df = df.head(100)

# Preprocess and combine human and GPT text into one document
df['combined'] = df['human_clean'] + " " + df['gpt_clean']

# Generate embeddings for each combined text
embeddings = np.array([embedding_model.encode(text) for text in df['combined']])



# Create an IVF index; a flat L2 index serves as its coarse quantizer
dim = embeddings.shape[1]  # Embedding dimension
nlist = 20  # Number of clusters (adjust based on dataset size)
quantizer = faiss.IndexFlatL2(dim)  # Assigns vectors to clusters by L2 distance
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

# Train the index (learns the cluster centroids)
index.train(embeddings)

# Add embeddings to the index
# Note: with the default nprobe=1, a search scans only the nearest cluster
index.add(embeddings)



# Function to retrieve the top-k most similar documents for a query
def search_faiss(query, k=10):
    # Convert query into an embedding
    query_embedding = embedding_model.encode(query).reshape(1, -1)

    # Perform similarity search
    D, I = index.search(query_embedding, k)  # D is distances, I is indices of closest embeddings
    
    # Fetch the documents corresponding to the closest embeddings
    results = [df.iloc[i] for i in I[0]]
    
    return results

import torch
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer

# Model name
llm_model_name = "google/flan-t5-large"

# Load the model with 8-bit quantization
model = AutoModelForSeq2SeqLM.from_pretrained(
    llm_model_name,
    load_in_8bit=True,  # Enable 8-bit quantization
    device_map="auto",  # Automatically map the model to the available GPU
    torch_dtype=torch.float16  # Reduce memory usage
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)

# Define the pipeline 
llm = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
  
)


result = llm("Explain the concept of RAG approach on llms.")
print(result[0]['generated_text'])


# Function to generate a chatbot response using the query and retrieved context
def generate_response(query, k=5):
    # Retrieve top-k similar documents
    retrieved_results = search_faiss(query, k)
    
    # Combine results into a context string
    context = "\n".join([
        f"User said: {result['human_clean']}\nResponse: {result['gpt_clean']}"
        for result in retrieved_results[:3]  # Use only the most relevant results
    ])

    # print(context)

    prompt = (
        f"Context:\n{context}\n\n"
        f"Query: {query}\n\n"
        "As an empathetic assistant, consider the user's situation and the context provided above. "
        "Respond with detailed and actionable advice that addresses their concerns thoughtfully."
    )

    # Generate a response
    response = llm(prompt)[0]['generated_text']

    return response

# Example query
query = "what is best thing to do to deal with stress?"
response = generate_response(query)
print("Chatbot Response:")
print(response)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


RAG approach is a method of analyzing the relationship between a lm and
Chatbot Response:
Response: i would suggest you take a hot shower, take a hot bath,


# Model Run 2


In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pandas as pd

# Load the model to generate embeddings
embedding_model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")

# Load your data (adjust the file path as necessary)
data_path = 'C:/Users/singh/Downloads/NLP-1/Possible_case_Preprocessing_NER.csv'
df = pd.read_csv(data_path)
df = df.head(100)

# Preprocess and combine human and GPT text into one document
df['combined'] = df['human_clean'] + " " + df['gpt_clean']

# Generate embeddings for each combined text
embeddings = np.array([embedding_model.encode(text) for text in df['combined']])



# Create an IVF index; a flat L2 index serves as its coarse quantizer
dim = embeddings.shape[1]  # Embedding dimension
nlist = 20  # Number of clusters (adjust based on dataset size)
quantizer = faiss.IndexFlatL2(dim)  # Assigns vectors to clusters by L2 distance
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

# Train the index (learns the cluster centroids)
index.train(embeddings)

# Add embeddings to the index
# Note: with the default nprobe=1, a search scans only the nearest cluster
index.add(embeddings)



# Function to retrieve the top-k most similar documents for a query
def search_faiss(query, k=10):
    # Convert query into an embedding
    query_embedding = embedding_model.encode(query).reshape(1, -1)

    # Perform similarity search
    D, I = index.search(query_embedding, k)  # D is distances, I is indices of closest embeddings
    
    # Fetch the documents corresponding to the closest embeddings
    results = [df.iloc[i] for i in I[0]]
    
    return results

import torch
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer

# Model name
llm_model_name = "google/flan-t5-large"

# Load the model with 8-bit quantization
model = AutoModelForSeq2SeqLM.from_pretrained(
    llm_model_name,
    load_in_8bit=True,  # Enable 8-bit quantization
    device_map="auto",  # Automatically map the model to the available GPU
    torch_dtype=torch.float16  # Reduce memory usage
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)

# Define the pipeline 
llm = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=0.9  # Sampling temperature; only takes effect when do_sample=True
)


result = llm("Explain the concept of RAG approach on llms.")
print(result[0]['generated_text'])


# Function to generate a chatbot response using the query and retrieved context
def generate_response(query, k=5):
    # Retrieve top-k similar documents
    retrieved_results = search_faiss(query, k)
    
    # Combine results into a context string
    context = "\n".join([
        f"User said: {result['human_clean']}\nResponse: {result['gpt_clean']}"
        for result in retrieved_results[:3]  # Use only the most relevant results
    ])

    prompt = (
        f"Context:\n{context}\n\n"
        f"Query: {query}\n\n"
        "As an empathetic assistant, consider the user's situation and the context provided above. "
        "Respond with detailed and actionable advice that addresses their concerns thoughtfully."
    )

    # Generate a response
    response = llm(prompt)[0]['generated_text']

    return response

# Example query
query = "what is best thing to do to deal with stress?"
response = generate_response(query)
print("Chatbot Response:")
print(response)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


RAG approach is a method of analyzing the relationship between a lm and its underlying variables.
Chatbot Response:
Response: i would suggest you take a hot shower, take a hot bath, or take a hot shower.



# Model Run 3 (Best)


**Successful**

In [1]:
import torch
import gc

# Force garbage collection, then release cached GPU memory
gc.collect()
torch.cuda.empty_cache()

print("Unused GPU memory has been freed.")


Unused GPU memory has been freed.


In [2]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pandas as pd

# Load the model to generate embeddings
embedding_model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")

# Load your data (adjust the file path as necessary)
data_path = 'C:/Users/singh/Downloads/NLP-1/Possible_case_Preprocessing_NER.csv'
df = pd.read_csv(data_path)
df = df.head(100)

# Preprocess and combine human and GPT text into one document
df['combined'] = df['human_clean'] + " " + df['gpt_clean']


In [3]:

# Generate embeddings for each combined text
embeddings = np.array([embedding_model.encode(text) for text in df['combined']])


In [4]:



# Create an IVF index; a flat L2 index serves as its coarse quantizer
dim = embeddings.shape[1]  # Embedding dimension
nlist = 20  # Number of clusters (adjust based on dataset size)
quantizer = faiss.IndexFlatL2(dim)  # Assigns vectors to clusters by L2 distance
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

# Train the index (learns the cluster centroids)
index.train(embeddings)

# Add embeddings to the index
# Note: with the default nprobe=1, a search scans only the nearest cluster
index.add(embeddings)
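
For 100 rows, an exhaustive search is cheap, and it gives a reference against which the approximate IVF results can be compared. The pure-Python sketch below shows what an exact L2 search computes: squared L2 distances (the quantity `IndexFlatL2` reports), ranked in ascending order. `l2_search` and the toy vectors are illustrative only.

```python
def l2_search(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to query by squared L2 distance."""
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(vector, query)), i)
        for i, vector in enumerate(vectors)
    )
    top = scored[:k]
    return [dist for dist, _ in top], [i for _, i in top]


# Toy 2-dimensional "embeddings"
vectors = [[0, 0], [1, 0], [3, 4]]
distances, indices = l2_search(vectors, [1, 1], k=2)
print(distances, indices)
```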




In [5]:

# Function to retrieve the top-k most similar documents for a query
def search_faiss(query, k=10):
    # Convert query into an embedding
    query_embedding = embedding_model.encode(query).reshape(1, -1)

    # Perform similarity search
    D, I = index.search(query_embedding, k)  # D is distances, I is indices of closest embeddings
    
    # Fetch the documents corresponding to the closest embeddings
    results = [df.iloc[i] for i in I[0]]
    
    return results
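
FAISS pads the label array with `-1` when a search finds fewer than `k` neighbours, which can happen with an IVF index that probes a single small cluster. `df.iloc[-1]` would then silently return the last row. A small guard, sketched here as a hypothetical helper named `valid_labels`, filters those entries out:

```python
def valid_labels(labels, n_rows):
    """Keep only labels that point at real rows, preserving rank order."""
    return [int(i) for i in labels if 0 <= int(i) < n_rows]


print(valid_labels([12, 3, -1, -1], n_rows=100))
```

In `search_faiss`, this would be applied to `I[0]` (with `n_rows=len(df)`) before indexing into `df`.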


In [6]:

# Example query
query = "How can I manage stress at work?"
results = search_faiss(query)

# Display the results (human_clean and gpt_clean columns)
for result in results:
    print(f"Human: {result['human_clean']}")
    print(f"GPT: {result['gpt_clean']}")
    print("="*1)


Human: trying prioritize tasks delegate whenever not possible also started practicing meditation breaks help manage stress sometimes feels like no not matter not catch break constant struggle
GPT: great hear already implementing helpful strategies remember progress takes time okay setbacks addition already encourage also communicate supervisor team workload discuss possible solutions together
=
Human: recently got promotion work thought would exciting added responsibilities pressure taken toll mental health really moving experience
GPT: understand overwhelming faced higher expectations okay acknowledge not emotions allow feel sad situation important part healing process specific challenges facing work
=
Human: well recently go breakup thought moved not expect affect much additionally workload office increased adding stress
GPT: breakups often lead wide range emotions normal not resurface unexpectedly seems like breakup coupled increased work stress might factors contributing current em

In [7]:
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer

# Model name
llm_model_name = "google/flan-t5-large"

# Load the model with 8-bit quantization
model = AutoModelForSeq2SeqLM.from_pretrained(
    llm_model_name,
    load_in_8bit=True,  # Enable 8-bit quantization
    device_map="auto",  # Automatically map the model to the available GPU
    torch_dtype=torch.float16  # Reduce memory usage
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


In [None]:

# Define the pipeline 
llm = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    num_beams=2,
    num_return_sequences=1,
    temperature=0.4  # Sampling temperature; only takes effect when do_sample=True
)


In [9]:

result = llm("Explain the concept of RAG approach on llms.")
print(result[0]['generated_text'])




RAG approach is a method of evaluating the effectiveness of a llm.


In [10]:


# Function to generate a chatbot response using the query and retrieved context
def generate_response(query, k=5):
    # Retrieve top-k similar documents
    retrieved_results = search_faiss(query, k)
    
    # Combine results into a context string
    context = "\n".join([
        f"User said: {result['human_clean']}\nResponse: {result['gpt_clean']}"
        for result in retrieved_results[:3]  # Use only the most relevant results
    ])

    # print(context)

    prompt = (
        f"Context:\n{context}\n\n"
        f"Query: {query}\n\n"
        "As an empathetic assistant, consider the user's situation and the context provided above. "
        "Respond with detailed and actionable advice that addresses their concerns thoughtfully."
    )

    # Generate a response
    response = llm(prompt)[0]['generated_text']

    return response
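
Separating prompt assembly from generation makes it possible to inspect the exact text sent to the model without loading it. The sketch below mirrors the prompt layout used in `generate_response`. `build_prompt` and its `max_pairs` parameter are hypothetical and not part of the original notebook.

```python
def build_prompt(pairs, query, max_pairs=3):
    """Assemble the RAG prompt from (human, reply) pairs and the user query."""
    context = "\n".join(
        f"User said: {human}\nResponse: {reply}"
        for human, reply in pairs[:max_pairs]  # keep only the top-ranked pairs
    )
    return (
        f"Context:\n{context}\n\n"
        f"Query: {query}\n\n"
        "As an empathetic assistant, consider the user's situation and the context provided above. "
        "Respond with detailed and actionable advice that addresses their concerns thoughtfully."
    )


# Illustrative retrieved pairs
pairs = [("feeling stressed work", "try short breaks"), ("cannot sleep", "keep regular schedule")]
print(build_prompt(pairs, "how do I relax?"))
```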


# Testing

In [11]:

# Example query
query = "what is best thing to do to deal with stress?"
response = generate_response(query)
print("Chatbot Response:")
print(response)


Chatbot Response:
Response: I would recommend a stress reduction program that includes meditation, yoga, and relaxation techniques.


In [12]:
from rouge_score import rouge_scorer

In [13]:
def evaluate_model(df, k=5):
    # Instantiate the ROUGE scorer
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    
    # List to store ROUGE scores
    rouge_scores = []
    
    # Generate responses and compute ROUGE-L score
    for idx, row in df.iterrows():
        query = row['human_clean']  # Use the human_clean column as the query
        true_response = row['gpt_clean']  # Use the gpt_clean column as the true response
        
        # Generate response from the model
        generated_response = generate_response(query, k)
        
        # Compute ROUGE-L score
        score = scorer.score(true_response, generated_response)
        rouge_scores.append(score['rougeL'].fmeasure)  # Append the ROUGE-L F-measure
        
    # Compute average ROUGE-L score
    avg_rougeL = np.mean(rouge_scores)
    print(f"Average ROUGE-L score: {avg_rougeL}")
    return avg_rougeL

# Evaluate the model on the dataset
evaluate_model(df)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Average ROUGE-L score: 0.7647518193361473


0.7647518193361473
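
For reference, the ROUGE-L F-measure is based on the longest common subsequence (LCS) of reference and candidate tokens: precision is LCS length over candidate length, recall is LCS length over reference length, and the F-measure is their harmonic mean. The sketch below uses plain whitespace tokens and no stemming, so its numbers will not match `rouge_score` exactly. `lcs_length` and `rouge_l_f` are illustrative names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(reference, candidate):
    """ROUGE-L F-measure on whitespace tokens (no stemming)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f("the cat sat on the mat", "the cat is on the mat"))
```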

In [15]:
import numpy as np
from bert_score import score

def evaluate_bert_score(df, k=5, limit=None):
    # Limit the number of rows (if limit is specified)
    if limit:
        df = df.head(limit)
    
    # List to store BERT scores
    bert_scores = []
    
    # Generate responses and compute BERT score
    for idx, row in df.iterrows():
        query = row['human_clean']  # Use the human_clean column as the query
        true_response = row['gpt_clean']  # Use the gpt_clean column as the true response
        
        # Generate response from the model
        generated_response = generate_response(query, k)
        
        # Compute BERT score
        P, R, F1 = score([generated_response], [true_response], lang='en')
        bert_scores.append(F1.item())  # Append F1 (the harmonic mean of precision and recall)
    
    # Compute average BERT score
    avg_bert_score = np.mean(bert_scores)
    print(f"Average BERT score: {avg_bert_score}")
    return avg_bert_score

# Evaluate the model on the dataset with a limit of 10 samples for faster results
evaluate_bert_score(df, limit=10)




Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Average BERT score: 0.9388486438989639


0.9388486438989639
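
Both evaluation loops query the same 100 rows that were indexed, and each indexed document contains that row's reference `gpt_clean` text. The query's own row is therefore likely to be retrieved and placed in the prompt, which can inflate both ROUGE-L and BERTScore. One way to check this is to drop the query's own row before building the context. The sketch below assumes that retrieval returns row positions, as `search_faiss` does. `exclude_self` is a hypothetical helper.

```python
def exclude_self(labels, own_index, k):
    """Drop the query row's own position (and any -1 padding) and keep the top k labels."""
    return [i for i in labels if i != own_index and i >= 0][:k]


# Ranked labels for a query taken from row 7; row 7 itself comes back first
print(exclude_self([7, 2, 9, 4], own_index=7, k=3))
```

Comparing scores with and without this filter, or evaluating on rows that were never indexed, would show how much of the reported score depends on retrieving the reference answer itself.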