Created by: [Bharath Kumar Hemachandran](mailto:bharathh@gmail.com) [Linkedin](https://www.linkedin.com/in/bharath-hemachandran/)



# **RAG - A Complete Tutorial**

Retrieval-Augmented Generation (RAG) is an advanced technique that enhances the capabilities of language models by combining retrieval mechanisms with generative responses. This approach enables models to access external knowledge, improving accuracy and relevance, especially when handling complex or domain-specific queries.

## **What is RAG?**
RAG consists of two key components:
1. **Retriever:** Finds relevant information from a knowledge base or dataset.
2. **Generator:** Produces a response based on the retrieved information.

RAG is particularly useful in scenarios where:
- The base language model has limited knowledge.
- Up-to-date or specific information is required.
- Responses must be grounded in factual data.

---

## **What You Will Learn**

By the end, you’ll have a RAG system that can:
1. Retrieve relevant information from a dataset.
2. Generate context-aware responses grounded in that dataset.

We will use a SentenceTransformers model, accessed through LangChain, to convert small chunks of the document into embeddings and store them in a vector store. FAISS serves as the vector store, and we query it with its similarity search feature.

Once retrieval is complete, we call the Llama 3 model hosted on [Groqcloud](https://console.groqcloud.com) to generate a context-aware response.

---

# **Introduction**

Retrieval-Augmented Generation (RAG) is an advanced technique that enhances language models by integrating retrieval mechanisms with generative capabilities. This approach allows models to access external knowledge bases, improving accuracy and relevance, especially for complex or domain-specific queries.

## **How RAG Works**

[![Retrieval-Augmented Generation Workflow](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/04/25/ML-13871image003.jpg)](https://nbkomputer.com/retrieval-augmented-generation-with-langchain-amazon-sagemaker/)

##### [Image Source](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/04/25/ML-13871image003.jpg)

In the diagram above, the RAG process is depicted as follows:
1. **Query Input:** A user submits a query.
2. **Retriever:** The system searches a knowledge base to find relevant information.
3. **Generator:** Using the retrieved information, the system generates a contextually appropriate response.

This combination ensures that responses are both informed by external data and coherently generated, providing more accurate and up-to-date information.
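
The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration: `retrieve` ranks passages by word overlap rather than by embeddings, and `generate` is a stub that formats a prompt instead of calling a model. Both function names and the sample passages are made up for this sketch and are not part of the pipeline built later in this tutorial.

```python
# Toy retrieve-then-generate loop (illustrative names, no real model involved).
KNOWLEDGE_BASE = [
    "Returns are accepted within 30 days of delivery.",
    "Refunds are issued to the original payment method.",
    "Gift cards cannot be returned or refunded.",
]

def tokenize(text):
    """Lowercase the text, drop basic punctuation, and return its set of words."""
    return set(text.lower().replace(".", "").replace("?", "").split())

def retrieve(query, passages, k=1):
    """Retriever: rank passages by how many words they share with the query."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def generate(query, context):
    """Generator stub: a real system would send this prompt to an LLM."""
    return f"Context: {' '.join(context)}\nQuestion: {query}"

query = "How are refunds issued?"
context = retrieve(query, KNOWLEDGE_BASE)
prompt = generate(query, context)
print(prompt)
```

The rest of the tutorial replaces the word-overlap retriever with an embedding-based FAISS search and the stub generator with a hosted LLM.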

---

By understanding and implementing RAG, we can develop systems that effectively leverage vast datasets to produce high-quality, context-aware responses.


### **Application Areas**

Some practical use cases of RAG include:
- Building intelligent chatbots.
- Creating personalized recommendation systems.
- Answering questions with evidence-based responses.
- Improving search engine capabilities.

---

With this foundation, let’s set up our environment and dive into building a RAG system!





# **Document Ingestion into the Document Library**

In this section, we’ll outline how documents are ingested into the document library for use in a Retrieval-Augmented Generation (RAG) system. This process ensures that documents are preprocessed, embedded, and stored efficiently for retrieval during query handling.

---

## **Step 1: Uploading Documents**
- Documents can be uploaded in various formats such as `.txt`, `.pdf`, `.csv`, or `.json`.
- Use file upload options in your application or APIs to add files to the document library.

---

## **Step 2: Preprocessing**
- Once uploaded, documents are preprocessed to ensure uniformity:
  - **Text extraction**: Convert the content of files (e.g., PDFs) into plain text.
  - **Cleaning**: Remove special characters, stopwords, or unnecessary whitespace.
  - **Segmentation**: Split the text into smaller chunks, such as paragraphs or sentences, to facilitate efficient embedding.
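
As a rough sketch of the cleaning and segmentation steps, the snippet below normalizes whitespace and splits text into paragraphs using only the standard library. The helper names `clean_text` and `split_paragraphs` are hypothetical, and the sketch deliberately leaves out stopword removal and special-character stripping.

```python
import re

def clean_text(raw):
    """Normalize line endings and collapse runs of spaces, tabs, and blank lines."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)        # collapse repeated spaces/tabs
    text = re.sub(r"\n\s*\n+", "\n\n", text)   # collapse repeated blank lines
    return text.strip()

def split_paragraphs(text):
    """Segment cleaned text into paragraphs on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

raw = "Chapter 1:   Overview\r\n\r\n\r\nItems may be   returned.\r\n\r\nContact support."
paragraphs = split_paragraphs(clean_text(raw))
print(paragraphs)
```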

---

## **Step 3: Embedding Generation**
- Use a pre-trained model from **SentenceTransformers** to encode text chunks into vector embeddings.
- Embeddings are dense numerical representations of the text, capturing its semantic meaning.
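
To make "text in, fixed-length vector out" concrete, here is a toy stand-in for an embedding function. It hashes each word into one of a few buckets, so it does not capture meaning the way a SentenceTransformers model does; the name `toy_embed` and the dimension of 8 are assumptions chosen for readability.

```python
import zlib

DIM = 8  # real sentence-embedding models use hundreds of dimensions

def toy_embed(text, dim=DIM):
    """Map text to a fixed-length vector by hashing each word into a bucket."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec

chunks = ["Refunds are issued within ten days.", "Contact customer support for help."]
vectors = [toy_embed(c) for c in chunks]
for chunk, vec in zip(chunks, vectors):
    print(len(vec), vec, "<-", chunk)
```

Every chunk, whatever its length, ends up as a vector of the same size, which is the property the index in the next step relies on.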

---

## **Step 4: Building the Index**
- Feed the embeddings into a **FAISS** index to enable fast similarity searches.
- Choose an appropriate FAISS index type (e.g., `IndexFlatL2` for small datasets or `IVF` for larger ones).
- Save the index to disk for future use.
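
Conceptually, a flat L2 index compares the query against every stored vector. The sketch below imitates that idea in pure Python; `FlatL2Index` is a hypothetical class written for this illustration and is not the FAISS API.

```python
class FlatL2Index:
    """Stores vectors in a list and scans all of them on every search."""

    def __init__(self):
        self.vectors = []

    def add(self, vector):
        self.vectors.append(list(vector))

    def search(self, query, k=1):
        """Return (squared_distance, position) pairs for the k closest vectors."""
        scored = [
            (sum((a - b) ** 2 for a, b in zip(query, v)), i)
            for i, v in enumerate(self.vectors)
        ]
        return sorted(scored)[:k]

index = FlatL2Index()
for v in ([0.0, 0.0], [1.0, 1.0], [5.0, 5.0]):
    index.add(v)

results = index.search([0.9, 1.2], k=2)
print(results)  # closest vector first
```

An exhaustive scan like this is exact but grows linearly with the number of vectors, which is why approximate structures such as IVF are preferred for larger collections.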

---

## **Step 5: Storing Metadata**
- Alongside embeddings, store metadata (e.g., document titles, sources, or tags) to provide context during retrieval.
- Metadata is linked to embeddings to enhance query results.
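
One simple way to picture the link between embeddings and metadata is a record that carries both, similar in spirit to how a LangChain `Document` pairs `page_content` with a `metadata` dict. The `ChunkRecord` class, the file names, and the tags below are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    """One stored chunk: its text, its vector, and descriptive metadata."""
    text: str
    vector: list
    metadata: dict = field(default_factory=dict)

# Hypothetical records; the sources and tags are invented for illustration.
records = [
    ChunkRecord("Return shipping is paid by the customer.", [0.1, 0.9],
                {"source": "policy_a.txt", "tag": "returns"}),
    ChunkRecord("Refunds go to the original payment method.", [0.8, 0.2],
                {"source": "policy_a.txt", "tag": "refunds"}),
    ChunkRecord("Orders ship within two business days.", [0.5, 0.5],
                {"source": "shipping_faq.txt", "tag": "shipping"}),
]

def filter_by_metadata(items, key, value):
    """Keep only the records whose metadata[key] equals value."""
    return [r for r in items if r.metadata.get(key) == value]

policy_records = filter_by_metadata(records, "source", "policy_a.txt")
for record in policy_records:
    print(record.metadata["tag"], "->", record.text)
```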

---

## **Step 6: Verification and Testing**
- Validate the ingestion pipeline:
  - Ensure all documents are correctly preprocessed.
  - Test the FAISS index by performing similarity searches on sample queries.

---

Once these steps are complete, the documents are ready for retrieval in the RAG workflow, ensuring seamless integration between retrieval and generation.

First, let us install and import the necessary libraries required to complete this process. Hit the run button to get started.

In [None]:
%%capture
!pip install langchain_community
!pip install faiss-cpu
!pip install sentence-transformers
!pip install huggingface-hub
!pip install unstructured
!pip install nltk
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng') # Download the missing resource

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

There should be a folder called sample_data with a file called refund_policy.txt within. This contains some sample policies for a return policy for a typical ecommerce retailer. We will be using this document for our trial purposes. Hit the run button below to ensure that the document exists.

If you get the following error
```ls: cannot access 'sample_data/refund_policy.txt': No such file or directory```

then click the folder icon on the left-hand side of this notebook and add `refund_policy.txt` to the `sample_data` folder. The cell below also downloads the file for you.

In [None]:
# Setting up the files required - first removing all the default files in the sample_data folder
!rm sample_data/*
!wget 'https://raw.githubusercontent.com/bharathh80/genaiworkshop/refs/heads/main/docs/refund_policy.txt' -P sample_data
# Check if the file exists
!ls sample_data/refund_policy.txt

--2025-01-23 07:58:26--  https://raw.githubusercontent.com/bharathh80/genaiworkshop/refs/heads/main/docs/refund_policy.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11458 (11K) [text/plain]
Saving to: ‘sample_data/refund_policy.txt’


2025-01-23 07:58:26 (24.0 MB/s) - ‘sample_data/refund_policy.txt’ saved [11458/11458]

sample_data/refund_policy.txt


Specify where the documents reside

In [None]:
directory = './sample_data'

Create a function to load documents using the ```DirectoryLoader``` class:



In [None]:
def load_docs(_directory):
    loader = DirectoryLoader(_directory)
    _documents = loader.load()
    return _documents

```DirectoryLoader``` reads all files in the specified directory.
The ```load()``` method returns a list of document objects.
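
For intuition, the sketch below mimics the shape of what a directory loader returns, using only the standard library. It is not how ```DirectoryLoader``` is implemented; the function `load_text_files` and the dict layout (`page_content` plus `metadata`) are simplifications assumed for this example.

```python
import tempfile
from pathlib import Path

def load_text_files(directory, pattern="*.txt"):
    """Return one dict per matching file, mirroring page_content + metadata."""
    docs = []
    for path in sorted(Path(directory).glob(pattern)):
        docs.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return docs

# Build a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    loaded = load_text_files(tmp)

print(len(loaded), [d["page_content"] for d in loaded])
```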

You can now load the documents and print their count:

In [None]:
documents = load_docs(directory)
print(f"Number of documents ingested: {len(documents)}")

Number of documents ingested: 1


# **Types of Document Loaders in LangChain**

LangChain offers a variety of [document loaders](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/) to import data from different sources into the standard `Document` format. Below is a summary of the different types of document loaders:

| **Document Loader** | **Description** | **Use Case** |
|---------------------|-----------------|--------------|
| `TextLoader` | Loads simple `.txt` files as documents. | Ideal for loading plain text files. |
| `CSVLoader` | Loads data from CSV files. | Useful for structured data stored in CSV format. |
| `DirectoryLoader` | Loads documents from a specified directory. | Suitable for bulk loading of files from a folder. |
| `UnstructuredHTMLLoader` | Loads and parses HTML files. | Appropriate for extracting text from web pages or HTML documents. |
| `JSONLoader` | Loads data from JSON files. | Best for loading structured data in JSON format. |
| `UnstructuredMarkdownLoader` | Loads and parses Markdown files. | Designed for documents written in Markdown. |
| `UnstructuredWordDocumentLoader` | Loads Microsoft Word documents. | Useful for `.docx` files. |
| `PyPDFLoader` | Loads and parses PDF files. | Ideal for extracting text from PDFs. |
| `UnstructuredEmailLoader` | Loads email files. | Suitable for processing email content. |
| `EverNoteLoader` | Loads Evernote note files. | Designed for importing notes from Evernote. |
| `NotionDBLoader` | Loads data from Notion databases. | Useful for integrating Notion data. |
| `ObsidianLoader` | Loads notes from Obsidian. | Best for importing markdown notes from Obsidian. |
| `RoamLoader` | Loads data from Roam Research. | Suitable for integrating Roam Research notes. |
| `SlackDirectoryLoader` | Loads messages from a Slack export. | Ideal for importing Slack conversation history. |
| `ConfluenceLoader` | Loads pages from Confluence. | Useful for integrating Confluence documentation. |
| `GoogleDriveLoader` | Loads files from Google Drive. | Suitable for accessing documents stored in Google Drive. |
| `S3FileLoader` | Loads files from AWS S3 buckets. | Best for loading data stored in S3. |
| `WebBaseLoader` | Loads content from web pages. | Ideal for scraping and loading web content. |
| `YoutubeLoader` | Loads transcripts from YouTube videos. | Useful for extracting text from video transcripts. |

Each loader is designed to handle specific data formats and sources, enabling efficient and contextually appropriate data ingestion into LangChain.


To process the text efficiently, split it into smaller chunks using the ```RecursiveCharacterTextSplitter``` class:

In [None]:
def split_docs(_documents, chunk_size=1000, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(_documents)
    return docs

- ```chunk_size```: Maximum size of each chunk in characters.
- ```chunk_overlap```: Number of overlapping characters between consecutive chunks.

Split the loaded documents and print the number of chunks created:
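
To see how the two parameters interact, here is a minimal fixed-width sliding window. It is a simplification: ```RecursiveCharacterTextSplitter``` prefers to break on separators such as paragraph and line breaks, whereas this hypothetical `sliding_window_chunks` helper cuts at exact character positions.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into chunk_size pieces that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each piece begins with the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.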

In [None]:
chunks = split_docs(documents)
print(f"Number of chunks created: {len(chunks)}")


Number of chunks created: 13


# **Types of Text Splitters in LangChain**

LangChain provides various [text splitters](https://api.python.langchain.com/en/latest/text_splitters/index.html) to divide text into manageable chunks, each tailored for specific formats and use cases. Below is a summary of the different types of text splitters:

| **Text Splitter**                          | **Description**                                                        | **Use Case**                                                                                  |
|--------------------------------------------|------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| `CharacterTextSplitter`                    | Splits text based on characters, using a specified separator.          | General-purpose splitting when simple character-based division is sufficient.               |
| `RecursiveCharacterTextSplitter`          | Tries a list of separators in order (paragraph breaks, line breaks, spaces) until chunks are small enough. | Useful for maintaining semantic coherence in chunks.                                         |
| `TokenTextSplitter`                        | Splits text into chunks measured in tokens, using a model tokenizer.   | Ideal for applications requiring token-level processing, such as language models.           |
| `HTMLHeaderTextSplitter`                   | Splits HTML files based on specified header tags.                      | Suitable for processing HTML documents by sections defined by headers.                      |
| `HTMLSectionSplitter`                     | Splits HTML files based on specified tags and font sizes.              | Useful for segmenting HTML content into meaningful sections beyond headers.                 |
| `MarkdownHeaderTextSplitter`              | Splits Markdown files based on specified headers.                      | Designed for dividing Markdown documents into sections according to header levels.          |
| `MarkdownTextSplitter`                    | Attempts to split text along Markdown-formatted headings.              | Effective for processing Markdown content by its structural elements.                       |
| `NLTKTextSplitter`                        | Splits text using the NLTK package, typically at sentence boundaries.  | Appropriate for sentence-level splitting, especially in natural language processing tasks.   |
| `PythonCodeTextSplitter`                  | Attempts to split text along Python syntax.                            | Tailored for dividing Python code into logical segments.                                    |
| `SentenceTransformersTokenTextSplitter`   | Splits text into tokens using a sentence model tokenizer.              | Suitable for applications involving sentence embeddings and similarity tasks.               |
| `SpacyTextSplitter`                       | Splits text using the SpaCy package, leveraging its linguistic features. | Ideal for splitting text based on linguistic components like sentences or entities.         |
| `KonlpyTextSplitter`                      | Splits text using the Konlpy package, which is designed for Korean language processing. | Best suited for processing Korean text, utilizing Konlpy's capabilities.                    |
| `LatexTextSplitter`                       | Attempts to split text along LaTeX-formatted layout elements.          | Useful for segmenting LaTeX documents into logical parts.                                   |
| `RecursiveJsonSplitter`                   | Splits JSON content recursively based on specified criteria.           | Designed for breaking down JSON data structures into manageable pieces.                     |
| `ExperimentalMarkdownSyntaxTextSplitter`  | An experimental splitter for handling Markdown syntax.                 | Useful for testing and development purposes with Markdown content.                          |

Each splitter is designed to handle specific text formats and structures, enabling efficient and contextually appropriate text processing.


Now that the document is split into small chunks, we need to convert each chunk into a vector embedding so that FAISS can run a similarity search over them.

### Why is this required?

Creating vectors with embeddings is essential for enabling similarity search in Retrieval-Augmented Generation (RAG). Vectors represent the semantic meaning of text in a high-dimensional space, allowing the system to retrieve contextually relevant chunks of information based on a query. This step ensures that retrieval is efficient and accurate, which is critical for generating meaningful and context-aware responses.

Without embeddings, the retrieval process would rely solely on keyword matching or other less effective methods, which can miss the deeper semantic connections between the query and the document content.
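
A common way to compare two embeddings is cosine similarity, which measures the angle between the vectors. The sketch below uses hand-made three-dimensional vectors as stand-ins for real embeddings; the numbers are invented purely to show the calculation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors standing in for real embeddings.
query = [0.9, 0.1, 0.0]
refund_chunk = [0.8, 0.2, 0.1]
shipping_chunk = [0.0, 0.3, 0.9]

refund_score = cosine_similarity(query, refund_chunk)
shipping_score = cosine_similarity(query, shipping_chunk)
print(f"refund chunk:   {refund_score:.3f}")
print(f"shipping chunk: {shipping_score:.3f}")
```

The chunk whose vector points in nearly the same direction as the query scores close to 1, while the unrelated chunk scores close to 0.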

### Alternatives to Creating Vectors for Retrieval

While creating vectors is the most common and effective approach, there are some alternatives for retrieval in RAG:

1. **TF-IDF (Term Frequency-Inverse Document Frequency):**
   - This is a traditional method that uses keyword frequency to measure the importance of terms in a document.
   - Works well for simple tasks but lacks the ability to capture semantic meaning.

2. **BM25 (Best Match 25):**
   - An advanced ranking function based on TF-IDF with better handling of term saturation and document length.
   - Commonly used in search engines like Elasticsearch but does not encode semantic information.

3. **Keyword-Based Search:**
   - Matches exact keywords between the query and the documents.
   - Limited to surface-level matches and not ideal for nuanced queries.

4. **Ontology or Knowledge Graphs:**
   - Uses a structured representation of knowledge to retrieve relevant information.
   - Requires significant effort to build and maintain the ontology.

5. **Rule-Based Retrieval:**
   - Uses predefined rules or patterns for matching queries with documents.
   - Effective in specific, controlled domains but lacks flexibility.

Among these methods, vector-based retrieval using embeddings is usually the most effective choice for RAG because it matches on meaning rather than exact wording, enabling the retrieval of contextually relevant information for diverse and complex queries.
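
To illustrate the keyword-based alternatives, here is a compact sketch of BM25 scoring in a commonly used formulation. The parameter values `k1=1.5` and `b=0.75` are typical defaults assumed for this example, and the three sample documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs  # average document length
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # documents containing term
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "refund requests are processed within ten days",
    "shipping costs are paid by the customer",
    "contact support for help with a refund or exchange",
]
scores = bm25_scores("refund processed", docs)
paraphrase_scores = bm25_scores("money back", docs)
print(scores)
print(paraphrase_scores)
```

The second query means roughly the same thing as "refund" but shares no words with any document, so every score is zero. That gap is what embedding-based retrieval is meant to close.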


In [None]:
def create_embeddings(_chunks):
    modelPath = "sentence-transformers/all-MiniLM-l6-v2"
    model_kwargs = {'device': 'cpu'}
    encode_kwargs = {'normalize_embeddings': False}

    embeddings = HuggingFaceEmbeddings(
        model_name=modelPath,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    return embeddings


`modelPath`: Specifies the embedding model (e.g., all-MiniLM-l6-v2).

`model_kwargs`: Configures the model to use a specific device (e.g., CPU).

`encode_kwargs`: Provides additional encoding options. Here `normalize_embeddings` is `False`, so the vectors are not scaled to unit length.
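
The effect of normalization can be shown with two small vectors. This is a sketch of the arithmetic only, with hypothetical helper names; it does not call the embedding model.

```python
import math

def l2_normalize(vec):
    """Scale a vector so that its Euclidean length becomes 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

raw_dot = dot(a, b)
normalized_dot = dot(l2_normalize(a), l2_normalize(b))
print(raw_dot)         # grows with vector length
print(normalized_dot)  # depends only on direction
```

After normalization the dot product equals the cosine similarity, so vector length no longer influences the comparison. Whether to normalize depends on the distance metric the index uses.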

Next, we create the embedding object:

In [None]:
embedding = create_embeddings(chunks)

Now that we have our embedding model, let us create a search index using [FAISS](https://faiss.ai/) to build the document library. The chunks are embedded when the index is created.

In [None]:
try:
    db = FAISS.from_documents(chunks, embedding)
    db.save_local(folder_path="../database/faiss_db", index_name="myFaissIndex")
    print("FAISS index created")
except Exception as e:
    print(e)
    print("FAISS index creation failed")


FAISS index created


- The `FAISS.from_documents()` method embeds the chunks and creates the index.
- The `save_local()` method saves the index to disk for future use.

If the previous code executed correctly, a `database` folder containing `faiss_db` now exists. Because the path starts with `../`, it is created one level above the notebook's working directory. The folder contains the index that can be used for later retrieval. This index can also be stored remotely for better persistence.

In [None]:
db = FAISS.load_local(folder_path="../database/faiss_db", embeddings=embedding, index_name="myFaissIndex", allow_dangerous_deserialization=True)
searchDocs = db.similarity_search("What is the return policy for?")
print(searchDocs[0].page_content)


Customer Responsibility: Unless otherwise stated, customers are responsible for the

cost of return shipping. We recommend using a trackable shipping method and insuring

the package to ensure its safe return to our facility.

2.5 Exclusions:

Exceptions to the Policy: AcmeCorp reserves the right to make exceptions to the return

policy on a case-by-case basis. Such exceptions may include extenuating

circumstances or errors on our part.

2.6 International Returns:

Additional Considerations: For returns originating from outside our domestic shipping

region, additional restrictions or requirements may apply. Please contact our customer

support team for assistance with international returns.

2.7 Multiple Returns:

Monitoring Returns: AcmeCorp monitors returns activity for abuse of the policy.

Excessive returns may result in denial of future return requests or account suspension.

Chapter 3: Return Process


### Loading and Using the FAISS Index for Retrieval

In this step, we load the previously saved FAISS index and use it to perform similarity searches. This is a crucial part of the Retrieval-Augmented Generation (RAG) workflow, as it allows us to efficiently find relevant chunks of information in response to a query.

#### Why Load the Index?
The FAISS index is a precomputed data structure that organizes embeddings in a way that makes similarity searches fast and scalable. By saving and reloading the index:
- **Persistence:** We avoid recomputing embeddings and re-indexing documents every time the program runs.
- **Scalability:** The index can be stored and distributed across systems, enabling large-scale applications.
- **Efficiency:** Loading the index allows immediate access to the vector database for retrieval tasks.
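
The save-then-reload idea can be sketched without FAISS by writing vectors and their texts to a file and reading them back. The helpers `save_index` and `load_index` and the JSON layout are assumptions made for this illustration; the LangChain FAISS wrapper uses its own file formats.

```python
import json
import tempfile
from pathlib import Path

def save_index(path, texts, vectors):
    """Persist texts and their vectors so they need not be recomputed."""
    payload = {"texts": texts, "vectors": vectors}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")

def load_index(path):
    """Read back the texts and vectors written by save_index."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data["texts"], data["vectors"]

texts = ["chunk one", "chunk two"]
vectors = [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "toy_index.json"
    save_index(index_path, texts, vectors)
    loaded_texts, loaded_vectors = load_index(index_path)

print(loaded_texts, loaded_vectors)
```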

#### The Code Breakdown
```python
# Load the FAISS index from the saved location
db = FAISS.load_local(
    folder_path="../database/faiss_db",   # Path to the saved FAISS index
    embeddings=embedding,                 # The embedding model used to create the index
    index_name="myFaissIndex",            # The specific name of the index to load
    allow_dangerous_deserialization=True  # Required because part of the index is pickled
)
```
`folder_path`: Specifies the directory where the FAISS index was saved.

`embeddings`: Ensures that the same embedding model used to create the index is loaded for accurate similarity searches.

`index_name`: Names the index for easy identification, especially when managing multiple indices.

`allow_dangerous_deserialization`: Confirms that you trust the saved files. LangChain stores part of the index with `pickle`, so only set this to `True` for files you created yourself.

##### Performing a Similarity Search
```python
searchDocs = db.similarity_search("What is the return policy for?")
print(searchDocs[0].page_content)
```

`similarity_search`: Takes a query string as input and searches the index for the most similar chunks of text.

**Query example**: In this case, we are searching for information about a return policy.

**Result**: The method returns the most relevant chunks, most similar first (four by default in LangChain). Here, we print the content of the top result.

##### Key Benefits of FAISS for RAG

*Speed*: FAISS is optimized for fast nearest-neighbor searches, even with large datasets.

*Scalability*: It supports billions of vectors, making it ideal for large-scale document retrieval.

*Flexibility*: The loaded index can be queried multiple times with different inputs, enabling dynamic interactions.

By completing this, you enable the core functionality of RAG: retrieving relevant context to enhance the output quality of downstream tasks, such as generating precise and accurate responses.

# **Augmenting the content above into an LLM query**

Now that we have retrieved the most relevant content from the file we imported, let us pass it to an LLM together with the question to generate a well-formed response.

We will be using the `llama3-8b-8192` LLM to generate our reply. The LLM is hosted on [Groqcloud](https://console.groqcloud.com). Once you have created an account with Groqcloud, create an API key and set it as shown below so that the code can send a request to the LLM and receive a response.

### Setting Up Your API Key in Google Colab

To securely set your API key or secret in this notebook:
1. Run the provided code cell that prompts for the API key.
2. Enter your API key when prompted. The input will be hidden for security.
3. The key will be stored as an environment variable for use in subsequent code.



In [None]:
import os
from getpass import getpass
os.environ["GROQ_API_KEY"] = getpass("Enter your API key: ")

Enter your API key: ··········


Once you have set your API key, let us install and import the Groq library that the following cells use.


In [None]:
%%capture
!pip install groq
from groq import Groq

First, let us retrieve the relevant context again by loading the index and specifying the question.

In [None]:
question = "What is the refund policy?"

db = FAISS.load_local(folder_path="../database/faiss_db", embeddings=embedding, index_name="myFaissIndex", allow_dangerous_deserialization=True)
searchDocs = db.similarity_search(question)

answer = searchDocs[0].page_content
print(answer)


Promotional Items: Items purchased during promotions or sales may be subject to

special return or refund conditions. Please refer to the terms of the promotion for

specific details.

4.7 Refund or Exchange Denial:

Non-Eligible Returns: Items that do not meet the eligibility criteria for returns (as

outlined in Chapter 2) will not be refunded or exchanged. You will be notified of the

reason for denial, and the item may be returned to you at your expense.

Policy Abuse: AcmeCorp reserves the right to deny refunds or exchanges in cases of

policy abuse or fraudulent activity. Excessive return requests may be flagged and

result in account suspension or denial of future returns.

4.8 Customer Support:

Assistance: For any questions or assistance with refunds or exchanges, please contact

our customer support team. We are here to help and ensure your experience with

AcmeCorp is positive.

By understanding and following these guidelines, you can ensure a smooth refund or


In [None]:


client = Groq()
completion = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {
            "role": "system",
            "content": "Play the role of a customer service professional "
            "for a leading E-Commerce company called AcmeCorp and "
            "analyse a provided question or ticket from a customer. "
            "Be as professional and polite as possible when replying "
            "to the provided question. "
            "A good response will be based only on the information "
            "provided here. Do not include information not in the following "
            "context: " + answer
        },
        {
            "role": "user",
            "content": question
        }
    ],
    temperature=0,
    max_tokens=256,
    top_p=1,
    stream=True,
    stop=None,
)

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")


Thank you for reaching out to AcmeCorp's customer support team! We're happy to help you understand our refund policy.

At AcmeCorp, we strive to provide a hassle-free shopping experience for our customers. Our refund policy is designed to ensure that you're satisfied with your purchase. Here are the key points to note:

* Eligible returns: You can return or exchange an item within [insert timeframe] of receiving your order, as long as the item is in its original condition with all original tags and packaging intact.
* Non-eligible returns: Items that do not meet our eligibility criteria for returns, such as items that are worn, damaged, or missing original tags and packaging, will not be refunded or exchanged.
* Refund or exchange denial: If your return is denied, you will be notified of the reason for denial, and the item may be returned to you at your expense.
* Policy abuse: We reserve the right to deny refunds or exchanges in cases of policy abuse or fraudulent activity. Excessive 

### Code Explanation

#### Overview
This code interacts with the **Groq API** to generate a response from a Large Language Model (LLM) in a conversational setup. It simulates a customer service assistant for the fictional company **AcmeCorp**, responding to user queries based on a predefined context.

---

#### Key Components and Explanation

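1. **Creating the Groq Client**

```python
client = Groq()
```

**Purpose**:
Creates the API client. When no key is passed explicitly, `Groq()` reads it from the `GROQ_API_KEY` environment variable set earlier.
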
2. **Calling the Chat Completion Endpoint**

```python
completion = client.chat.completions.create(
    model="llama3-8b-8192",
    ...
)
```

**Purpose**:
This sends a request to Groq's chat.completions.create endpoint to generate a conversational response.

**Parameters**:

`model`: Specifies the LLM to use. Here, the model is `llama3-8b-8192`: the 8-billion-parameter Llama 3 model with an 8192-token context window.

`messages`: A list of messages simulating a chat conversation. Each message includes:
- role: The role of the message sender (system, user).
- content: The message content.

**Other Arguments**:
- `temperature=0`: Makes the output as deterministic as possible by always favoring the most likely token. The lower the value, the more focused the responses.
- `max_tokens=256`: Limits the response length to 256 tokens, which is likely why the sample output above stops mid-sentence.
- `top_p=1`: Controls nucleus sampling. A value of 1 keeps the full token distribution, so no low-probability tokens are cut off.
- `stream=True`: Enables streaming, where the response is sent in chunks.
- `stop=None`: Sets no custom stop sequences.
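
To build intuition for `temperature` and `top_p`, the sketch below applies both to a made-up list of three token scores. It is a simplified picture of the sampling math, not Groq's implementation. Since dividing by a temperature of exactly 0 is undefined, the example uses a small positive value to show the trend.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Indices of the smallest set of tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens
warm = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.1)
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
print(nucleus(warm, 0.9), nucleus(warm, 1.0))
```

At low temperature almost all probability lands on the top token, and with `top_p=1` every token stays in the candidate set.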

3. **Message Structure**
- `system` role: Provides instructions to the LLM on how to behave and limits its responses to the given context. The variable `answer` contains the relevant context for the query.

```python
{
    "role": "system",
    "content": "Play the role of a customer service professional for a leading E-Commerce company called AcmeCorp ... context: " + answer
}
```


- `user` role: Represents the user's question, stored in the variable `question`.

```python
{
    "role": "user",
    "content": question
}
```
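
The notebook passes only the top retrieved chunk as context. One possible variation, sketched below, joins several retrieved chunks into the system prompt. The helper `build_messages`, its `max_chunks` parameter, the prompt wording, and the sample chunks are all hypothetical.

```python
def build_messages(question, chunks, max_chunks=3):
    """Assemble chat messages whose system prompt embeds the retrieved chunks."""
    context = "\n\n".join(chunks[:max_chunks])
    system_prompt = (
        "Answer using only the information in the context below. "
        "If the context does not contain the answer, say so.\n\n"
        "Context:\n" + context
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

# Placeholder strings standing in for retrieved page_content values.
retrieved = [
    "Chunk about refunds.",
    "Chunk about exchanges.",
    "Chunk about shipping.",
    "Unrelated chunk.",
]
messages = build_messages("What is the refund policy?", retrieved)
print(messages[0]["content"])
```

Including more chunks gives the model more evidence but also consumes more of the context window, so the cutoff is a trade-off to tune.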

4. **Streaming and Printing the Response**

```python
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
```

- **Streaming**: Iterates over chunks of the model's response as they arrive in real time.
- **Accessing content**:
  - `chunk.choices[0].delta.content`: Extracts the text from the response chunk.
  - `or ""`: Substitutes an empty string when a chunk carries no text, so `print` never receives `None`.
- **Output**: Prints the response continuously as it is generated, giving the user a real-time experience.
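
If you also want the complete reply as a single string, for logging or further processing, the streamed pieces can be collected while they are printed. The sketch below runs without an API key by using `fake_stream`, a hypothetical generator that yields objects shaped like the streamed chunks described above.

```python
from types import SimpleNamespace

def fake_stream(pieces):
    """Yield objects shaped like streamed chat chunks; the last has no content."""
    for piece in list(pieces) + [None]:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

def collect_stream(stream):
    """Print each piece as it arrives and return the full text at the end."""
    parts = []
    for chunk in stream:
        text = chunk.choices[0].delta.content or ""
        print(text, end="")
        parts.append(text)
    return "".join(parts)

full_text = collect_stream(fake_stream(["Thank ", "you ", "for reaching out."]))
```

With a real client, the `completion` object from the earlier cell would take the place of `fake_stream(...)`.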