# Part 1: LLM Setup, Prompting, and Embedding Generation

In this part you will learn to set up a Hugging Face LLM, prompt it and parse its response, and finally generate embeddings using an embedding model.
In addition, you will vary the temperature parameter to see how that affects the LLM's response.

**You will need to write functions to**
1. Generate prompts, tokenize them, and pass the tokens to the LLM.
2. Decode the tokens output by the LLM and parse the response as a string.


# Task 0: Setting up Environment and the LLM

## Installing necessary packages
First, we install the Hugging Face and LlamaIndex libraries that are used throughout the project.

 **Make sure to run this cell again after a change of runtime type or when a runtime session is terminated.**


In [None]:
# install llama_index
! pip install llama_index==0.10.19 llama_index_core==0.10.19 torch llama-index-embeddings-huggingface peft optimum bitsandbytes

# install auto_gptq
!pip install auto_gptq

# install docx2txt
! pip install docx2txt

## LLM Setup
Run the cells below to import the necessary libraries, set up the Hugging Face LLM, and prompt the LLM to get a response. In this MP we use the Qwen2.5 (千问) LLM from Alibaba Cloud.

In [None]:
# import list of libraries
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings,SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# instantiate the LLM from the Hugging Face library
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             trust_remote_code=False,
                                             revision="main",
                                             device_map="cuda:0"
                                             )


# instantiate the tokenizer; notice that its model name is the same as the LLM's.
# Why do you think that is? Feel free to search online.
tokenizer = AutoTokenizer.from_pretrained(model_name)

Ensure that the above steps complete without any errors. Once successful, the tokenizer and LLM are loaded and ready to use.

**For all steps below, remove ```raise NotImplementedError``` after you have completed the TODO items.**

#  Task 1 - LLM Prompting and Output Response Processing



### Step 1: Prompt Generation

In this step we write a function that generates the prompts used to query the LLM. Complete the prompt-generating function below according to the following spec.

**Input:** User query and context

**Output:** Prompt as a string.

**Example:**
1. Context: A customer is having issues with their smartphone battery draining quickly and the phone overheating.
2. User query: My battery is constantly draining, what are some suggestions.
3. Example output prompt:
```
Context: A customer is having issues with their smartphone battery draining quickly and the phone overheating.
Please respond to the following user comment. Use the context above if it is helpful.
User comments: My battery is constantly draining, what are some suggestions.
```
4. Expected format:
```
Context: <context>
Please respond to the following user comment. Use the context above if it is helpful.
User comments: <user query>
```
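
As a self-check, the expected format above can be verified mechanically. The following is a minimal sketch, not part of the starter code: the helper name `follows_expected_format` is hypothetical, and it assumes the context fits on a single line (a multi-line context would need a looser check).

```python
INSTRUCTION = (
    "Please respond to the following user comment. "
    "Use the context above if it is helpful."
)

def follows_expected_format(prompt, context, user_query):
    """Return True if `prompt` matches the three-line layout described above."""
    lines = prompt.strip().split("\n")
    if len(lines) != 3:
        return False
    context_line, instruction_line, query_line = (line.strip() for line in lines)
    return (
        context_line.startswith("Context:")
        and context_line[len("Context:"):].strip() == context.strip()
        and instruction_line == INSTRUCTION
        and query_line.startswith("User comments:")
        and query_line[len("User comments:"):].strip() == user_query.strip()
    )

example_context = (
    "A customer is having issues with their smartphone battery "
    "draining quickly and the phone overheating."
)
example_query = "My battery is constantly draining, what are some suggestions."
example_prompt = (
    "Context: " + example_context + "\n"
    + INSTRUCTION + "\n"
    + "User comments: " + example_query
)
print(follows_expected_format(example_prompt, example_context, example_query))
```

Once `prompt_with_context` is implemented, its output can be passed to this helper in place of `example_prompt`.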


In [None]:
# TODO 1: Populate the user prompt generating function
def prompt_with_context(context, user_query):
  prompt = """
  TODO: Populate the prompt with context and user query
  """
  raise NotImplementedError
  return prompt

### Step 2: LLM Query Function

In this step we write a function that queries the LLM given the context and the user_query, using the prompt generation function you implemented above.

Recall that an LLM works with tokens rather than with strings and characters directly. We therefore need to tokenize the prompt with the tokenizer before calling the LLM, and then decode the tokens that the LLM generates.

**Input:** User query and context

**Output:** LLM response as a string

Refer to the [Hugging Face Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode_plus) for additional information.
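
To make the encode/decode round trip concrete, here is a toy sketch that does not use the Hugging Face tokenizer. The vocabulary and the functions `toy_encode` and `toy_decode` are made up for illustration; a real tokenizer splits text into subwords and has a far larger vocabulary.

```python
# Toy vocabulary: a real tokenizer has tens of thousands of subword entries.
vocab = {"<unk>": 0, "my": 1, "battery": 2, "is": 3, "draining": 4}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def toy_encode(text):
    """Map each whitespace-separated word to its id (0 if unknown)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def toy_decode(token_ids):
    """Map ids back to words and join them with spaces."""
    return " ".join(id_to_token[token_id] for token_id in token_ids)

token_ids = toy_encode("My battery is draining")
print(token_ids)              # [1, 2, 3, 4]
print(toy_decode(token_ids))  # my battery is draining
```

The model only ever sees the list of ids, which is why the prompt has to be tokenized before generation and the generated ids decoded afterwards.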

In [None]:
def get_llm_response(context, user_query, temperature=0.0001):
  # generate the prompt
  prompt = prompt_with_context(context, user_query)

  # Set up the chat messages. The system message sets the role of the LLM
  # (here, a helpful assistant that answers user queries); the user message
  # carries the prompt, which contains the context and the user query.
  # DO NOT change this part!
  messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
  ]
  input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
  )
  # TODO 2: call the tokenizer to tokenize the input_text into tokens
  # you can refer to: https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode_plus
  # for additional information of how to use a tokenizer
  # you can also take a look at what texts are tokenized
  model_inputs = "TODO tokenize the input_text"
  raise NotImplementedError

  # calling the LLM with input tokens to generate tokenized outputs
  generated_token_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=temperature
  )

  # TODO 3: post process the generated tokens
  # Print generated_token_ids and model_inputs.input_ids: you will see that
  # generated_token_ids, as returned by the LLM, begins with the input tokens.
  # The post-processing step should remove the input tokens and leave only
  # the tokens output by the LLM.

  def post_processing(generated_token_ids):
    # TODO 3: insert post processing logic here to remove the input tokens from the
    # generated tokens
    raise NotImplementedError
    return generated_token_ids

  generated_token_ids_post_processed = post_processing(generated_token_ids)

  # TODO 4: decode the generated (post_processed) tokens. Hint: Take a look at
  # the "batch_decode" function of the Hugging Face tokenizer
  response = "TODO: decode the generated tokens"
  raise NotImplementedError

  return response
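
As the comments in the cell above note, the generated sequence begins with the prompt tokens. The idea of removing that prefix can be illustrated on plain Python lists. This is only a sketch: the function name `strip_prompt_tokens` and the token ids are hypothetical, and the real objects in the notebook are tensors rather than lists.

```python
def strip_prompt_tokens(input_ids, output_ids):
    """Drop the leading prompt tokens from each generated sequence in a batch."""
    return [
        output[len(prompt):]
        for prompt, output in zip(input_ids, output_ids)
    ]

# Hypothetical ids: a batch of one prompt (3 tokens) and its generated output.
input_ids = [[101, 7, 42]]
output_ids = [[101, 7, 42, 58, 9, 2]]
print(strip_prompt_tokens(input_ids, output_ids))  # [[58, 9, 2]]
```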

### Step 3: Query the LLM with Different Temperature Settings

In this step, we query the LLM using the implemented functions. We are now ready to write the first query to interact with the Qwen 2.5-1.5B LLM.

Additionally, we explore the effect of temperature on LLM inference.

**Example prompt (empty context):** What is the functionality of an LLM?


In [None]:
# TODO 5: use the get_llm_response function you implemented
# to print the LLM's response to the example prompt.
# Keep the temperature parameter at 0.0001 in this step.
raise NotImplementedError

The temperature parameter plays a crucial role in LLM inference. **Find out what the temperature parameter does and try generating outputs with different temperature settings. What can you conclude?**
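
To build intuition before experimenting, the sketch below applies a temperature to a small set of scores. It assumes the common formulation in which the scores (logits) are divided by the temperature before a softmax; the logits are made up, and the function `softmax_with_temperature` is illustrative rather than part of any library used in this MP.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by the temperature."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for temperature in (0.1, 1.0, 5.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Compare how concentrated the probabilities are at each temperature, then check whether the LLM outputs you generate behave in a way that matches.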

In [None]:
# TODO 6: change the temperature parameter value and see how that affects the LLM output.
# You might want to generate several responses to the same prompt to see the difference.
raise NotImplementedError

### Step 4: Hallucination in LLM
This is an open-ended question.

**Construct a prompt that leads the LLM to hallucinate. Hallucinations come in different types: factual hallucinations, nonsensical hallucinations, and contradictory responses.**

**You can also vary the temperature parameter. Does increasing the temperature help?**

**Does providing some factual context help in reducing the hallucination?**


In [None]:
# TODO 7: Come up with a prompt that results in hallucination, then see
# whether changing the temperature or providing factual context helps.

# Task 2 - Understanding Tokenization and Embeddings
In this part you will learn to generate embeddings for text and to understand the relationship between tokenization and embeddings. Generating embeddings is essential for sentence and document understanding, as well as for building the RAG database later in this MP.

**You need to write functions to**
1. Instantiate an open-source Hugging Face embedding model.
2. Encode the given sentence examples.
3. Implement the cosine vector similarity score.
4. Compare the embeddings between those examples and a given reference and select the ones that have a similarity score greater than some threshold.
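
The last item, selecting examples whose score exceeds a threshold, can be sketched independently of the embedding model. The scores below are made up for illustration, and `select_above_threshold` is a hypothetical helper name.

```python
def select_above_threshold(scores, threshold):
    """Return (sentence, score) pairs with score > threshold, best first."""
    selected = [(s, score) for s, score in scores.items() if score > threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)

# Made-up similarity scores, for illustration only.
scores = {
    "The phone battery drains fast.": 0.82,
    "I like hiking on weekends.": 0.11,
    "My device overheats while charging.": 0.64,
}
for sentence, score in select_above_threshold(scores, threshold=0.5):
    print(f"{score:.2f}  {sentence}")
```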

## Embedding Model Setup
Run the cells below to set up the embedding model. In this project we use the all-mpnet-base-v2 embedding model, which is based on Microsoft's mpnet-base model. It is intended to encode sentences and short paragraphs.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from llama_index.core import Document, VectorStoreIndex

In [None]:
# Load a sentence-transformers model for text embeddings from Hugging Face
embed_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

Ensure that the above steps complete without any errors. Once successful, the embedding model is loaded and ready to go.


**For all steps below, remove ```raise NotImplementedError``` after you have completed the TODO items.**

### Step 1: Embedding Generation

In this step we write a function that generates the embeddings used for downstream tasks such as RAG. Complete the embedding-generating function below according to the following spec.

**Input:** User input text.

**Output:** Text embedding using the embedding model.

In [None]:
# TODO 8: Generate embeddings using the embedding model
def generate_embeddings(text):
  raise NotImplementedError
  return embedding

### Step 2: Cosine Similarity Score

In this step we write a function that computes the similarity of two embeddings. We use cosine similarity.

**Input:** Two embeddings as numpy arrays.

**Output:** The cosine similarity score between the two embeddings.
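
For reference, the standard definition of cosine similarity between two vectors $\mathbf{u}$ and $\mathbf{v}$ of the same dimension is:

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^2}\,\sqrt{\sum_{i} v_i^2}}
$$

As a small worked example, take $\mathbf{u} = (1, 2)$ and $\mathbf{v} = (2, 1)$. Then $\mathbf{u} \cdot \mathbf{v} = 1 \cdot 2 + 2 \cdot 1 = 4$ and $\lVert \mathbf{u} \rVert = \lVert \mathbf{v} \rVert = \sqrt{5}$, so $\cos(\mathbf{u}, \mathbf{v}) = 4 / 5 = 0.8$. The score always lies in $[-1, 1]$ and depends only on the angle between the vectors, not on their lengths.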

In [None]:
# TODO 9: Calculate the cosine similarity score for the two embeddings
def cosine_similarity_score(src_embedding, tar_embedding):
  raise NotImplementedError
  return score

### Step 3: Comparing Similarity between Sentences

In this step we use the embedding generation function and the cosine similarity score function implemented above to compute the similarity between pairs of the provided sentences.

The sentences.txt file provides a target sentence and a set of 10 other sentences to compare against it. **Write a program to calculate the cosine similarity score between the target sentence and each of the other sentences in sentences.txt.**

**What can you conclude from the similarity scores between the target sentence and each of the provided sentences? Can you explain what the embedding of a sentence represents?**

In [None]:
# TODO 10: Calculate the cosine similarity score between target sentence and
# each of the other sentences provided in the sentences.txt file.


# Part 2: Retrieval-Augmented Generation (RAG)

In this part you will learn to set up a vector database for Retrieval-Augmented Generation (RAG), query it to retrieve the top matching context, and use that context to prompt the LLM. You will also evaluate the relevance of the LLM response with and without the RAG context to understand whether RAG helps the LLM construct a better response.

**You will need to write functions to**
1. Construct a vector database from document embeddings.
2. Retrieve information from the vector database and construct the prompting context for the LLM.
3. Get the LLM responses with and without RAG given the original user prompt.

# Task 0: Setting Up the LLM and the Embedding Model

Import the following packages to get started on Part 2. We use the same LLM as in Part 1. Note that the embedding model configured below is all-MiniLM-L6-v2 rather than the all-mpnet-base-v2 model used in Part 1. We provide the starter code for setting up these components.

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings,SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# instantiate the Embedding model from the Hugging Face library
# Parameters that can be tuned later to test different models and different chunk sizes
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
Settings.llm = None
Settings.chunk_size = 512
Settings.chunk_overlap = 32

In [None]:
# instantiate the LLM from the Hugging Face library
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             trust_remote_code=False,
                                             revision="main",
                                             device_map="cuda:0"
                                             )


# instantiate the tokenizer, notice that the model name is the same as LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Task 1: Construct the Vector Database

### Step 1: Read the documents

In this step we write a helper function that reads the documents from a directory in the Google Colab runtime filesystem. To access the filesystem, click the folder icon in the left control panel. By default this opens the ```/content``` folder, and any subfolder or file under it can be accessed directly by this notebook.

1. Upload the 4 provided machine-learning documents to the ```sample_data``` folder.
2. Finish the following function to read these documents.

In [None]:
def read_documents(folder_path, file_extension=[".pdf", ".docx"],
                   chunk_size=512, chunk_overlap = 64):
  Settings.chunk_size = chunk_size
  Settings.chunk_overlap = chunk_overlap
  # TODO 1: invoke the SimpleDirectoryReader class to read the documents under
  # "folder_path" of certain extensions.
  documents = SimpleDirectoryReader(folder_path,
                                    required_exts=file_extension).load_data()


  # TODO 2: return the documents read and the number of documents read
  return documents, len(documents)

In [None]:
documents, num_documents = read_documents("sample_data")
print(f"Number of documents read: {num_documents}")
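
The `chunk_size` and `chunk_overlap` parameters of `read_documents` control how each document is split before it is embedded. The sketch below shows the basic idea as a sliding window over a list of stand-in tokens. It is a simplification under stated assumptions: `chunk_tokens` is a hypothetical helper, and the splitter the library actually applies is more involved than a fixed-size window.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

tokens = list(range(10))  # stand-in for ten tokens
print(chunk_tokens(tokens, chunk_size=4, chunk_overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap repeats a little text across neighbouring chunks, so a sentence that straddles a boundary is less likely to be cut off from its surroundings in both chunks.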

### Step 2: Constructing the vector database

In this step, we write a helper function that constructs the vector database from the embedding model and the documents we read.

1. Use the VectorStoreIndex class to set up an in-memory vector database.
2. Convert the vector database into a retriever for user interaction.
3. Return the vector database and the retriever for later queries.
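
Conceptually, an in-memory vector database stores one embedding per text chunk and answers a query by returning the chunks whose embeddings are most similar to the query embedding. The toy sketch below illustrates that idea only. `ToyVectorStore` is a hypothetical class with made-up three-dimensional embeddings, and it is not how VectorStoreIndex is implemented or used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

class ToyVectorStore:
    """Keeps (text, embedding) pairs in memory and retrieves the closest ones."""

    def __init__(self):
        self.entries = []

    def add(self, text, embedding):
        self.entries.append((text, embedding))

    def retrieve(self, query_embedding, top_k):
        scored = [(text, cosine(query_embedding, emb)) for text, emb in self.entries]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

store = ToyVectorStore()
store.add("chunk about neural networks", [0.9, 0.1, 0.0])
store.add("chunk about decision trees", [0.1, 0.9, 0.0])
store.add("chunk about cooking", [0.0, 0.1, 0.9])
for text, score in store.retrieve([1.0, 0.2, 0.0], top_k=2):
    print(f"{score:.3f}  {text}")
```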

In [None]:
def build_vector_database(documents, top_k):
  # TODO 3: Set up the vector database using the VectorStoreIndex class
  # refer to https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_guide/
  # for how to use this class.
  index = "TODO setup the vector database store index"
  raise NotImplementedError

  # TODO 4: Turn the vector database into a retriever for user interaction.
  # refer to https://legacy.ts.llamaindex.ai/api/classes/VectorIndexRetriever
  # for how to use this class. Return only the top_k documents in the
  # similarity search.
  raise NotImplementedError
  retriever = "TODO setup the vector database retriever using the index"
  return index, retriever

In [None]:
TOP_K = 2
index, retriever = build_vector_database(documents, top_k=TOP_K)

# Task 2: Query the Vector Database

### Step 1: Querying the vector database

In this step, we write a function that takes the user prompt as input, uses the document retriever to fetch the top-K relevant document chunks, and returns these chunks as context.

1. Assemble the query engine by completing the following TODOs.
2. The function accepts a user prompt and asks the query engine to retrieve the top-K most similar documents.
3. Return the text chunks of the top-K documents as additional context for later prompting.

Use the following example query: \
 Query: **"What are the documents about?"**

Do the query engine and the retriever return relevant context for this query?
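
The final step, turning retrieved chunks into a single context string, can be illustrated on plain data. This is a sketch under assumptions: retrieval results are represented as hypothetical `(text, score)` pairs rather than the library's response objects, `assemble_context` is a made-up helper, and how a similarity cutoff should be applied in the actual function is left open here.

```python
def assemble_context(scored_chunks, top_k, similarity_cutoff=0.1):
    """Join the texts of the best-scoring chunks with line breaks."""
    kept = [(text, score) for text, score in scored_chunks if score >= similarity_cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return "\n".join(text for text, _ in kept[:top_k])

# Hypothetical retrieval results as (text, similarity score) pairs.
scored_chunks = [
    ("Gradient descent updates parameters iteratively.", 0.71),
    ("Unrelated footer text.", 0.04),
    ("Learning rate controls the step size.", 0.58),
]
print(assemble_context(scored_chunks, top_k=2))
```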

In [None]:
def database_query(query, retriever, top_k, similarity_cutoff=0.1):
  # TODO 5: Using the RetrieverQueryEngine to construct a query engine from the
  # retriever, refer to https://docs.llamaindex.ai/en/stable/api_reference/query_engine/retriever/#llama_index.core.query_engine.RetrieverQueryEngine
  query_engine = "TODO setup the query engine using the retriever"
  raise NotImplementedError

  # TODO 6: get the response object from the query_engine
  raise NotImplementedError
  response = "TODO get the response from the query engine"

  # TODO 7: the response from the query engine is not the original document
  # chunks, because the query engine acts like a mini chatbot for user
  # interaction. However, the "source_nodes" attribute of the response object
  # contains the original document chunks. Iterate through source_nodes and
  # concatenate the texts of the top_k source nodes as the context. The
  # concatenated chunks should be separated by a linebreak character.
  context = "TODO get the original document context from the query engine response"
  for i in range(top_k):
      raise NotImplementedError

  return context

In [None]:
query = "What are the documents about?"
context = database_query(query, retriever, TOP_K)
print(context)

### Step 2: Using the contexts as part of the LLM prompting

Recall that in Part 1 you wrote a prompt generation function that builds the LLM prompt from the user query and the context. In Part 1, the context was left empty. In this part, we use the document chunks retrieved from the vector database as additional context for the prompt.

1. Invoke the prompt_with_context function and print an example prompt with the user query and the retrieved context.
2. Invoke the get_llm_response function to get a response from the LLM on the following sample query **WITH** and **WITHOUT** the retrieved context: \
Query: **"What are the documents about?"**



In [None]:
user_query = "What are the documents about?"
empty_context = ""
# TODO 8: invoke prompt_with_context to examine the prompt
# TODO 9: invoke the get_llm_response function to get a response from the LLM
raise NotImplementedError

In [None]:
user_query = "What are the documents about?"
context = database_query(user_query, retriever, TOP_K)
# TODO 10: invoke prompt_with_context to examine the prompt
# TODO 11: invoke the get_llm_response function to get a response from the LLM
raise NotImplementedError