# Lab3 Intro
### Using the RAG technique to enhance the solution by retrieving few-shot examples and injecting them into the prompt
### What is RAG?
RAG is a technique for augmenting LLM knowledge with additional data.
Large Language Models (LLMs) can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific cutoff date. If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the model's knowledge with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as Retrieval-Augmented Generation (RAG).

A typical RAG application has two main components:

- **Indexing**: a pipeline for ingesting data from a source and indexing it. This usually happens offline.

- **Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.
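To make the two components concrete, here is a minimal sketch in plain Python. It is not the pipeline built later in this lab: the word-count `embed` function is a toy stand-in for a real embedding model, the three documents are made up for illustration, and only the prompt assembly part of the generation step is shown.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Indexing step: embed every document once, ahead of query time."""
    return [(doc, embed(doc)) for doc in docs]


def retrieve(index, query: str, k: int = 2) -> list[str]:
    """Retrieval step: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(context: list[str], question: str) -> str:
    """Prompt assembly: inject the retrieved text into the model prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Use the context to answer.\nContext:\n{joined}\nQuestion: {question}"


# Made-up documents, for illustration only.
index = build_index([
    "The warranty covers battery replacement for 12 months.",
    "Screen repairs are handled at authorized service centers.",
    "Software updates are released every quarter.",
])
question = "How long is the battery warranty?"
top_docs = retrieve(index, question, k=1)
rag_prompt = build_prompt(top_docs, question)
print(rag_prompt)
```

In a real application the index would live in a vector database and the assembled prompt would be sent to an LLM.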

### What is Few-Shot Learning?
Few-shot learning is a machine learning paradigm that enables models to generalize from a limited number of labeled examples. Unlike traditional supervised learning, which requires large datasets for training, few-shot learning allows models to adapt to new tasks or classes with minimal data. This is akin to how humans can learn new concepts quickly from just a few instances. With LLMs, the examples are typically supplied in the prompt itself rather than through additional training, which is how this lab uses them.
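As a rough illustration of few-shot prompting, the sketch below renders a handful of labeled comments as demonstrations and appends the unlabeled comment last. The helper names and the example comments are hypothetical; only the category names are taken from this lab's data.

```python
def format_few_shot_block(examples: list[tuple[str, str]]) -> str:
    """Render labeled (comment, category) pairs as demonstrations."""
    lines = [f"Comment: {comment}\nCategory: {category}" for comment, category in examples]
    return "<examples>\n" + "\n\n".join(lines) + "\n</examples>"


def build_few_shot_prompt(examples: list[tuple[str, str]], new_comment: str) -> str:
    """Place the demonstrations before the unlabeled comment."""
    return (
        f"{format_few_shot_block(examples)}\n\n"
        f"Comment: {new_comment}\nCategory:"
    )


# Hypothetical labeled examples; the category names come from this lab's data.
demo = [
    ("The app closes by itself every time I open it.", "App crashes"),
    ("My phone restarts on its own several times a day.", "Automatic restart, shutdown"),
]
few_shot_prompt = build_few_shot_prompt(demo, "The update took hours to arrive.")
print(few_shot_prompt)
```

The prompt ends with an open `Category:` slot, so the model is nudged to continue the pattern set by the demonstrations.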

### Using RAG and Few-Shot Learning to Classify VoC
With the introduction of RAG and Few-Shot Learning, we can now apply these concepts practically. In this lab, we will utilize the RAG technique together with Few-Shot Learning to classify Voice of Customer data. This session will build upon the skills developed in the last two labs, as we will continue to employ prompt engineering and embedding techniques. 

### Your objectives are:
- Set up a vector database using [Chroma integration in LangChain](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/)
- Use the Amazon Titan embedding model to embed the labeled examples and store them in the vector database
- Build a RAG chain that retrieves the top k most relevant examples from the vector store and generates the classification result

## 1. Install dependencies

In [28]:
!pip install -q langchain==0.2.16 langchain_aws==0.1.17 pandas==2.2.2 openpyxl==3.1.5 chromadb==0.5.5 langchain-chroma==0.1.2

## 2. Initialize Bedrock model using Langchain

We will use the same approach as in Lab 2 to set up the embedding object.
- We use the [LangChain](https://www.langchain.com/) SDK to build the application
- Initialize a BedrockEmbeddings object for "Titan Text Embeddings V2" with the model ID "amazon.titan-embed-text-v2:0"

In [1]:
import boto3
import json
import copy
import pandas as pd
from termcolor import colored
from langchain_aws.embeddings.bedrock import BedrockEmbeddings
bedrock_embedding = BedrockEmbeddings(model_id='amazon.titan-embed-text-v2:0')

- Run a test and preview the result

In [2]:
test_embedding = bedrock_embedding.embed_documents(['I love programing'])
print(f"The embedding dimension is {len(test_embedding[0])}, first 10 elements are: {test_embedding[0][:10]}")

The embedding dimension is 1024, first 10 elements are: [-0.08799089, 0.06109212, -0.03054606, 0.023365457, 0.082975864, 0.009916072, 0.014931097, -0.036700863, 0.020402033, -0.04376749]


## 3. Prepare the dataset to be classified
- The data (Voice of Customer) is for experimental use only

We will once again load the categories.csv and comments_9.csv data files. In addition, we will introduce a new dataset, examples_with_label_9.csv, in this session. This new dataset provides an augmented knowledge base for the LLM, equipping it with relevant references. By incorporating these labeled examples into our workflow, we expect to improve the model's understanding and the accuracy of its output.

- Import libraries

In [3]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
category_definition = "data/categories.csv"
categories = pd.read_csv(category_definition)
display(categories)

Unnamed: 0,mappings
0,Camera lens glare
1,CarPlay connection failure
2,Update delay
3,Camera dust and dirt
4,"Automatic restart, shutdown"
5,High storage usage
6,Poor manufacturing quality
7,Frame rate drop
8,Camera black screen
9,App crashes


### Load the customer review data

In [5]:
comments_filepath = "data/comments_9.csv"
comments = pd.read_csv(comments_filepath)
comments

Unnamed: 0,comment,groundtruth
0,"After switching to a new operating system, my ...",Charging failure
1,"After 10 months, my phone's battery health has...",Abnormal battery health
2,"Yesterday, the battery health was still 94%. B...",Abnormal battery health
3,The camera quality of my smartphone has been q...,Camera color deviation
4,"The update speed of this phone is really slow,...",Update delay
...,...,...
195,My phone's been acting up for a while now. The...,Unresponsive screen
196,"After the recent software update, my device ha...","Automatic restart, shutdown"
197,Every major software update seems to cause iss...,SIM card not detected
198,"My phone has been getting slower and slower, a...",High storage usage


### Load the labeled examples data to use as few-shot examples

In [7]:
examples_filepath = "data/examples_with_label_9.csv"
examples_df = pd.read_csv(examples_filepath) 
examples_df

Unnamed: 0,comment,groundtruth
0,This phone's camera really isn't that great. T...,Camera color deviation
1,My phone has been having some issues lately. T...,Screen flickering
2,My phone has been experiencing some subtle iss...,CarPlay connection failure
3,My phone has been a real headache lately. The ...,Weak signal
4,My phone has been having some issues with the ...,Screen color deviation
...,...,...
184,"This phone is so unreliable, it keeps disappea...","Automatic restart, shutdown"
185,"Since the latest software update, my phone has...",Wi-Fi connection failure
186,"I've been using this phone for a while now, an...",Device freezing
187,My phone has been draining its battery really ...,Fast battery drain


## 4. Create a vector database to store the embedding vectors of customer review text

### 4.1 We will set up a vector database using [Chroma integration in LangChain](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/) 
- Chroma is an AI-native open-source vector database. It is lightweight and well suited to quickly developing and experimenting with applications.
- We will use Chroma for this workshop, but when you develop your own application there are a variety of options, such as Amazon OpenSearch, Amazon Aurora, Pinecone, Milvus, and Faiss.

In [8]:
from langchain_chroma import Chroma
# vector_store.reset_collection()
vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=bedrock_embedding,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

#### Let's consider each customer comment as an individual document, and then use the Langchain-Chroma SDK to upload all these customer comments into the vector store.

In [9]:
from uuid import uuid4
import hashlib
from langchain_core.documents import Document

#### build langchain documents

In [10]:
documents = []
for comment,groundtruth in examples_df.values:
    documents.append(
        Document(
        page_content=comment,
        metadata={"groundtruth":groundtruth}
        )
    )

#### add documents to vector store

In [11]:
hash_ids = [hashlib.md5(doc.page_content.encode()).hexdigest() for doc in documents]
vector_store.add_documents(documents=documents, ids=hash_ids)

['3b5fabfbd3d9c083d5e7655fcd59e9e0',
 '024e991ccfa2538dad706911ded85fe7',
 '624aaa6f69c5c2926be806573cb2448c',
 '42ca012f3765fb4e9f391b6a9cd71e57',
 'ff871bc5ef7501e04c63f29caaf285e9',
 '30dd9e119138a66c4f0b34cb2fd7a4df',
 '10006bc5d8c0cbf831a0ffd83b8af854',
 '9f6e78cd3b035cf80dfa10c61e3fd7c9',
 'dd8fde4eeebb3b675872f85d42f0eb95',
 '1527eacb9589e6fa9461aef47beb2158',
 'abc164b7cc3f9c1e0d207b822fe2d2db',
 '722438ba64654367a3537ba2889c0a7d',
 '70bba6fa8c42bbd6199b56cba038308c',
 '2a827f3b9707fada0c77452350550c21',
 '7523b3229ae2ccb5fab3dbbcf662afd8',
 'cb113aafe6e94104f5ecb528c16432c8',
 '289fd4075e9a7f13493cd7cd05d17707',
 'fa90ece2842518950e3dd423aca958c1',
 'b45c9e2f99f07e23d54519a97d48440b',
 'e83910e172516897c5fabc278feb675e',
 '7d374351eaa37a7d0240463a32d4f04c',
 '6a676fe6b2f0ea478b7ccf1e62e2f0df',
 'bd04dc11853f529b1d2bada3140fc43e',
 'c6d85ae44f3542da57a63b07904c7ad7',
 '15e61456a4e187ee11e5da49c3f13372',
 '4a11f610128d9fef006029de92713e9c',
 'ba0f900f8d87a178f534f96a23fb766b',
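The IDs above are MD5 hashes of each comment's text rather than random UUIDs. One plausible benefit, sketched below with a plain dict standing in for the vector store, is that re-running the cell maps the same text to the same ID instead of creating a second entry. The `upsert` helper and the sample texts are hypothetical.

```python
import hashlib


def content_id(text: str) -> str:
    """Deterministic ID derived from the text itself."""
    return hashlib.md5(text.encode()).hexdigest()


def upsert(store: dict[str, str], texts: list[str]) -> int:
    """Insert texts keyed by content hash; return how many keys were new."""
    added = 0
    for text in texts:
        key = content_id(text)
        if key not in store:
            added += 1
        store[key] = text
    return added


store: dict[str, str] = {}  # stand-in for the vector store
first = upsert(store, ["screen flickers", "battery drains fast"])
second = upsert(store, ["screen flickers", "battery drains fast"])  # simulates re-running the cell
print(first, second, len(store))
```

One trade-off to keep in mind: two examples with identical text but different labels would share an ID, so only one of them would survive.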
 


### 4.2 We will use vector search to compare their similarity and choose the category that is most similar to the customer review text

Once the embedded customer review texts are stored in the vector database, we can apply the semantic similarity search technique we explored in Lab 2 to identify the top k semantically relevant labeled review texts for each customer comment.
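Before involving an LLM, the labels of the retrieved neighbors can already suggest a category. The sketch below is a simple nearest-neighbor vote, not the method used in the rest of this lab; the `(label, distance)` pairs are hypothetical and assume that a smaller distance means a closer match.

```python
from collections import Counter


def vote_category(neighbors: list[tuple[str, float]]) -> str:
    """Pick the most common label among neighbors; break ties by smallest distance."""
    counts = Counter(label for label, _ in neighbors)
    best_count = max(counts.values())
    tied = {label for label, c in counts.items() if c == best_count}
    # Among tied labels, prefer the one whose closest neighbor is nearest.
    return min((dist, label) for label, dist in neighbors if label in tied)[1]


# Hypothetical (label, distance) pairs, ordered as a vector store might return them.
neighbors = [
    ("Automatic restart, shutdown", 0.84),
    ("App crashes", 0.86),
    ("App crashes", 0.88),
    ("Wi-Fi connection failure", 0.90),
]
print(vote_category(neighbors))
```

A vote like this can serve as a baseline to compare against the LLM-based classification built in section 5.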

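- Test run a similarity search in the vector store
- A simple similarity search that also returns a score can be done as follows. Note that with Chroma, `similarity_search_with_score` returns a distance, so a lower score means a closer match; the value printed as `SIM` is therefore a distance rather than a similarity: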

In [12]:
query = comments['comment'].sample(1).values[0]
print(colored(f"******query*****:\n{query}","blue"))

results = vector_store.similarity_search_with_score(query, k=4)
print(colored("\n\n******results*****","green"))
for res, score in results:
    print(colored(f"* [SIM={score:3f}] \n{res.page_content}\n{res.metadata}","green"))

******query*****:
It feels like the app is not well-adapted to the latest software update. On Friday, it kept crashing, and I couldn't send messages, which really frustrated me and made me have to restart the app. Just now, I tried to edit a photo using the app's own tools, but I couldn't even rotate it. It's not working at all.


******results*****
* [SIM=0.839706] 
Since the software update, my phone has been automatically restarting. Several apps, including messaging and gaming apps, are unexpectedly crashing. This is really frustrating and I hope the developers can address these issues soon.
{'groundtruth': 'Automatic restart, shutdown'}
* [SIM=0.896166] 
Since the latest software update, my phone has been experiencing some issues. It keeps restarting on its own, and the Wi-Fi connection keeps turning off. I've tried various troubleshooting steps, like restarting the device and resetting the network settings, but the problems persist. After upgrading

## 5. Build a RAG chain to retrieve only the top K relevant labeled examples as references for classification

#### 5.1 Define a LangChain LLM with the Titan text model
We will initialize an LLM to carry out the classification task. As in Lab 1, we write a prompt to perform the classification using the Amazon Titan model.
- We use the [LangChain](https://www.langchain.com/) SDK to build the application
- Initialize a ChatBedrock object for the Amazon Titan Text model with the model ID "amazon.titan-text-premier-v1:0"

In [13]:
from langchain_aws import ChatBedrock
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_core.output_parsers import StrOutputParser,XMLOutputParser
from langchain_core.prompts import ChatPromptTemplate,MessagesPlaceholder,HumanMessagePromptTemplate

model_id = "amazon.titan-text-premier-v1:0" 
llm = ChatBedrock(
    model_id=model_id,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
    model_kwargs=dict(temperature=0.0, maxTokenCount=3072),
)

- Define system prompt and user prompt template

The system prompt template has been copied over from Lab 1, while the user prompt has been modified slightly. In this user prompt, we add a section that provides examples to the LLM. This is a demonstration of few-shot learning: the model will take these examples as references when making classifications.

In [14]:
system = """You are a professional customer feedback analyst. Your daily task is to categorize user feedback.
You will be given an input in the form of a JSON array. Each object in the array contains an 'id' field with the comment ID and a 'comment' field with the user's comment content.
Your role is to analyze these comments and categorize them appropriately.
Please note:
1. Only output valid XML format data.
2. Do not include any explanations or additional text outside the XML structure.
3. Ensure your categorization is accurate and consistent.
4. If you encounter any ambiguous cases, use your best judgment based on the context provided.
5. Maintain a professional and neutral tone in your categorizations.
"""

user = """
Please categorize user comments according to the following category tags library:
<categories>
{tags}
</categories>

Here are labeled examples for your reference:
<examples>
{examples}
</examples>

Please follow these instructions for categorization:
<instruction>
1. Categorize each comment using the tags above. If no tags apply, output "Invalid Data".
2. Summarize the comment content in no more than 50 words. Replace any double quotation marks with single quotation marks.
</instruction>

Below are the customer comments records to be categorized. The input is an array, where each element has an 'id' field representing the complaint ID and a 'comment' field containing the complaint content.
<comments>
{input}
</comments>

For each record, summarize the comment, categorize it according to the category tags, and return the ID, summary, reasons for tag matches, and category.

Output format example:
<output>
  <item>
    <id>xxx</id>
    <summary>xxx</summary>
    <reason>xxx</reason>
    <category>xxx</category>
  </item>
</output>

Skip the preamble and output only valid XML format data. Remember:
- Avoid double quotation marks within quotation marks. Use single quotation marks instead.
- Replace any double quotation marks in the content with single quotation marks.
"""

#### 5.2 Define a RAG pipeline with LangChain

We are now set to create a RAG pipeline using LangChain. The workflow of this pipeline involves:
- Retrieve the top k most relevant customer review texts, which are labeled with categories, from the vector database.
- Integrate the retrieved examples into the prompt.
- Invoke the LLM to generate outputs according to the provided instructions.
- Display the output.

- Define langchain retriever

In [15]:
from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel

retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})

In [16]:
def format_docs(docs):
    formatted = "\n".join([str(dict(comment=doc.page_content ,
                               category=doc.metadata['groundtruth'])) for doc in docs])
    print(colored(f"*****retrived examples******:\n{formatted}","yellow"))
    return formatted

- test run for retriever

In [17]:
retrieved_docs = (retriever|format_docs).invoke("my phone always freeze")

*****retrived examples******:
{'comment': "What's going on with this phone? It kept freezing up and not moving just a moment ago, and I have no idea what the issue is. It just suddenly shut off while I was using it, which is really annoying. I've been wanting to get a new phone, but now might not be the best time to buy one. Does anyone know what the problem could be and how I can fix it?", 'category': 'Device freezing'}
{'comment': "I've been experiencing an issue with my new device where it seems to freeze up after transferring data from my old phone. According to an internal memo, the company is aware of this problem and is investigating it. As a temporary workaround, they've advised users to force restart the device if it becomes unresponsive for more than 5 minutes.\n\nI also read that there was a similar issue with device activation earlier this week, but the company released a software update to try and address those setup and migration problems. Hopefully that will help re

- Define prompt template

In [18]:
prompt = ChatPromptTemplate([
    ('system',system),
    ('user',user)],
partial_variables={'tags':categories['mappings'].values}
)

- Define a prompt chain

In [19]:

chain = (
    {"examples":retriever|format_docs,"input":RunnablePassthrough()}
    | prompt
    | llm
    | XMLOutputParser()
)

# chain = ({"input":RunnablePassthrough()}| prompt| llm| XMLOutputParser())
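The dict at the head of the chain sends the same input down two branches: one retrieves and formats the examples, the other passes the input through unchanged. The plain-Python sketch below imitates that fan-out; `run_parallel`, `fake_retrieve_and_format`, and the short template are hypothetical stand-ins, not LangChain APIs.

```python
def run_parallel(branches: dict, value):
    """Call every branch with the same input and collect results by key."""
    return {name: fn(value) for name, fn in branches.items()}


def fake_retrieve_and_format(query: str) -> str:
    # Hypothetical stand-in for `retriever | format_docs`.
    return f"<examples retrieved for: {query}>"


def passthrough(value):
    # Plays the role of RunnablePassthrough: returns its input unchanged.
    return value


template = "Examples:\n{examples}\n\nComments:\n{input}"
inputs = run_parallel(
    {"examples": fake_retrieve_and_format, "input": passthrough},
    "my phone always freeze",
)
filled = template.format(**inputs)
print(filled)
```

One consequence worth noticing: whatever string is passed to `chain.invoke` is used both as the retrieval query and as the `{input}` value, so any wrapper text around the comment (such as an id field) also becomes part of the query.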

- convert the comments to data array

In [20]:
sample_data = [str({"id":'s_'+str(i),"comment":x[0]}) for i,x in enumerate(comments.values)]
print(sample_data[:3])

['{\'id\': \'s_0\', \'comment\': "After switching to a new operating system, my phone is no longer able to charge. I\'ve tried updating to the latest version, but it still won\'t charge. The issue seems to be getting worse. #tech"}', '{\'id\': \'s_1\', \'comment\': "After 10 months, my phone\'s battery health has dropped to 95%. The system is quite smooth, but the battery life is just okay. The messaging app does seem a bit more responsive now, which is a small improvement. [Sad emoji]"}', '{\'id\': \'s_2\', \'comment\': "Yesterday, the battery health was still 94%. But when I woke up today, it had dropped to 93% [shocked emoji][shocked emoji][shocked emoji] \\nHow can I save my battery? I still have about a month and a half left on the warranty, but it\'s unlikely it will reach 80% [crying laughing emoji]\\nAnd it seems like getting it repaired at the service center would require me to wipe all the data on my phone. That\'s such a hassle..."}']
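As the output above shows, calling `str()` on a dict produces a Python literal with single quotes and escaped apostrophes rather than strict JSON. If a JSON array is preferred, as the prompt describes, `json.dumps` is one option. The sketch below is an alternative, not what the notebook ran; the two records are made up, and switching formats changes what the model sees, so results may differ.

```python
import json

# Made-up comments, including characters that need escaping.
records = [
    "My phone won't charge. It says \"not supported\".",
    "Battery health dropped to 93%.",
]

# One single-element JSON array per comment, mirroring the one-by-one loop in this lab.
payloads = [
    json.dumps([{"id": f"s_{i}", "comment": text}], ensure_ascii=False)
    for i, text in enumerate(records)
]
print(payloads[0])
```

Because each payload is valid JSON, it can be parsed back with `json.loads` for debugging or logging.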


- We will iterate over the comments one by one to get the categorization

In [21]:
resps = []
for i, data in enumerate(sample_data):
    print(colored(f"*****input[{i}]*****:\n{data}","blue"))
    resp = chain.invoke(data)
    print(colored(f"*****response*****\n{resp}","green"))
    # each parsed <item> is a list of single-key dicts; flatten it into one row
    for item in resp['output']:
        row={}
        for it in item['item']:
            row[list(it.keys())[0]]=list(it.values())[0]
        resps.append(row)

*****input[0]*****:
{'id': 's_0', 'comment': "After switching to a new operating system, my phone is no longer able to charge. I've tried updating to the latest version, but it still won't charge. The issue seems to be getting worse. #tech"}
*****retrived examples******:
{'comment': "My phone's charging is giving me some trouble, and it's just not charging normally - it's really frustrating. Every time I plug it in, the battery indicator light doesn't respond at all, and it just won't charge. I'm not sure if it's an issue with the charger or if there's something wrong with the phone itself. I'm hoping the official support can help me fix this problem quickly, otherwise I might have to consider getting a new phone.", 'category': 'Charging failure'}
{'comment': "My phone has been having some issues lately. The charging speed has slowed down, and the battery drains really fast. I can barely use it for long before it runs out of juice. What's going on? Is there something wron
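The loop above relies on the parsed response being a nested structure of single-key dicts. The sketch below flattens that shape a little more defensively, so an empty or missing `output` yields no rows instead of an exception. The sample dict is hand-written in the shape the loop assumes, not captured from a real model call, and `flatten_items` is a hypothetical helper.

```python
def flatten_items(parsed: dict) -> list[dict]:
    """Turn {'output': [{'item': [{'id': ..}, {'summary': ..}, ..]}, ..]} into flat rows."""
    rows = []
    for entry in parsed.get("output") or []:
        fields = entry.get("item") or []
        row = {}
        for field in fields:
            row.update(field)  # each field is a single-key dict
        if row:
            rows.append(row)
    return rows


# Hand-written sample in the assumed parser output shape.
parsed = {
    "output": [
        {"item": [
            {"id": "s_0"},
            {"summary": "Phone no longer charges after an OS switch."},
            {"reason": "The user reports the phone will not charge."},
            {"category": "Charging failure"},
        ]}
    ]
}
rows = flatten_items(parsed)
print(rows)
```

Each resulting row can be appended to `resps` in the same way as in the loop above.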

- convert the data array to a pandas DataFrame

In [22]:
prediction_df = pd.DataFrame(resps).rename(columns={"category":"predict_label"}).drop_duplicates(['id']).reset_index(drop=True)
# normalize the label value: trim whitespace, lowercase, and remove single quotes
prediction_df['predict_label'] = prediction_df['predict_label'].apply(lambda x: x.strip().lower().replace("'",""))
prediction_df

Unnamed: 0,id,summary,reason,predict_label
0,s_0,"'After switching to a new operating system, my...",The user mentions that their phone is no longe...,charging failure
1,s_1,The phone's battery health has dropped to 95% ...,The user mentions a decrease in battery health...,abnormal battery health
2,s_2,The user expresses concern about the rapid dec...,The user mentions the decrease in battery heal...,abnormal battery health
3,s_3,The camera quality of the smartphone is disapp...,The user expresses dissatisfaction with the ca...,camera color deviation
4,s_4,"The update speed of this phone is slow, and it...",The user expresses concern about the slow upda...,slow update
...,...,...,...,...
195,s_195,"The phone's screen is less responsive, making ...",The user describes the phone's screen as less ...,unresponsive screen
196,s_196,"'After the recent software update, my device h...",The user mentions that their device has shut d...,"automatic restart, shutdown"
197,s_197,Users are reporting that their phones are no l...,Users are reporting that their phones are no l...,sim card not detected
198,s_198,The user's phone has been getting slower and s...,The user mentions that their phone is getting ...,high storage usage


### Merge the prediction result with the ground truth and calculate the accuracy

- copy comments to ground_truth dataframe

In [23]:
ground_truth = comments.copy()
# convert the label value to lowercase
ground_truth['groundtruth'] = ground_truth['groundtruth'].apply(lambda x: x.strip().lower())
ground_truth

Unnamed: 0,comment,groundtruth
0,"After switching to a new operating system, my ...",charging failure
1,"After 10 months, my phone's battery health has...",abnormal battery health
2,"Yesterday, the battery health was still 94%. B...",abnormal battery health
3,The camera quality of my smartphone has been q...,camera color deviation
4,"The update speed of this phone is really slow,...",update delay
...,...,...
195,My phone's been acting up for a while now. The...,unresponsive screen
196,"After the recent software update, my device ha...","automatic restart, shutdown"
197,Every major software update seems to cause iss...,sim card not detected
198,"My phone has been getting slower and slower, a...",high storage usage


- merge the prediction data with the ground truth data

In [24]:
merge_df=pd.concat([ground_truth,prediction_df],axis=1)
merge_df

Unnamed: 0,comment,groundtruth,id,summary,reason,predict_label
0,"After switching to a new operating system, my ...",charging failure,s_0,"'After switching to a new operating system, my...",The user mentions that their phone is no longe...,charging failure
1,"After 10 months, my phone's battery health has...",abnormal battery health,s_1,The phone's battery health has dropped to 95% ...,The user mentions a decrease in battery health...,abnormal battery health
2,"Yesterday, the battery health was still 94%. B...",abnormal battery health,s_2,The user expresses concern about the rapid dec...,The user mentions the decrease in battery heal...,abnormal battery health
3,The camera quality of my smartphone has been q...,camera color deviation,s_3,The camera quality of the smartphone is disapp...,The user expresses dissatisfaction with the ca...,camera color deviation
4,"The update speed of this phone is really slow,...",update delay,s_4,"The update speed of this phone is slow, and it...",The user expresses concern about the slow upda...,slow update
...,...,...,...,...,...,...
195,My phone's been acting up for a while now. The...,unresponsive screen,s_195,"The phone's screen is less responsive, making ...",The user describes the phone's screen as less ...,unresponsive screen
196,"After the recent software update, my device ha...","automatic restart, shutdown",s_196,"'After the recent software update, my device h...",The user mentions that their device has shut d...,"automatic restart, shutdown"
197,Every major software update seems to cause iss...,sim card not detected,s_197,Users are reporting that their phones are no l...,Users are reporting that their phones are no l...,sim card not detected
198,"My phone has been getting slower and slower, a...",high storage usage,s_198,The user's phone has been getting slower and s...,The user mentions that their phone is getting ...,high storage usage


## 6. Calculate the accuracy

In [25]:
accuracy = (merge_df['groundtruth'] == merge_df['predict_label']).mean()

In [26]:
print(colored(f"****accuracy:****\n{accuracy}","green"))

****accuracy:****
0.875
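A single accuracy number hides which categories go wrong. The sketch below joins predictions to the ground truth by comment id instead of by row position, then reports overall accuracy, per-category accuracy, and the mismatches. The `score` and `normalize` helpers are hypothetical; the `s_0` and `s_4` labels echo rows shown earlier, while the quoted `s_196` prediction is made up to exercise the quote stripping.

```python
from collections import defaultdict


def normalize(label: str) -> str:
    """Trim, lowercase, and drop single quotes, as the notebook does for predictions."""
    return label.strip().lower().replace("'", "")


def score(truth: dict[str, str], predicted: dict[str, str]):
    """Return overall accuracy, per-category accuracy, and the list of misses."""
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    misses = []
    for cid, gold in truth.items():
        gold_n = normalize(gold)
        pred_n = normalize(predicted.get(cid, ""))  # a missing prediction counts as wrong
        per_cat[gold_n][1] += 1
        if gold_n == pred_n:
            per_cat[gold_n][0] += 1
        else:
            misses.append((cid, gold_n, pred_n))
    total = sum(t for _, t in per_cat.values())
    correct = sum(c for c, _ in per_cat.values())
    overall = correct / total if total else 0.0
    by_cat = {cat: c / t for cat, (c, t) in per_cat.items()}
    return overall, by_cat, misses


# Illustrative rows keyed by comment id.
truth = {"s_0": "Charging failure", "s_4": "Update delay", "s_196": "Automatic restart, shutdown"}
predicted = {"s_0": "charging failure", "s_4": "slow update", "s_196": "'Automatic restart, shutdown'"}
overall, by_cat, misses = score(truth, predicted)
print(overall, misses)
```

Joining by id keeps the comparison correct even if a comment produced no prediction or a duplicate was dropped, cases where a positional `pd.concat(axis=1)` could silently misalign rows. The `s_4` miss is also a reminder that the model sometimes returns a label that is not in the tag list.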


In [27]:
### save the result 
merge_df.to_csv('result_lab_3.csv',index=False)