# Quick Start: RAG (FAISS) + LLM (Gemini)
We use an embedding model from HuggingFace and run it locally, which keeps this quick-start test easy to customize.

## Architecture
```txt
┌──────────┐    ┌─────────────┐    ┌─────────────────┐    ┌───────┐
│  Corpus  │───▶│  Embedding  │───▶│      FAISS      │───▶│  LLM  │
└──────────┘    └──────▲──────┘    │ ┌─────────────┐ │    └───────┘
┌──────────┐           │           │ │  Vectorize  │ │
│  Query   │───────────┘           │ └─────────────┘ │
└──────────┘                       │        │        │
                                   │ ┌─────────────┐ │
                                   │ │    Search   │ │
                                   │ └─────────────┘ │
                                   └─────────────────┘
```

## Model
- **Embedding**: `BAAI/bge-m3`
- **Search Engine**: `FAISS`
- **LLM**: Gemini (`gemini-2.0-flash`), called through the `google.generativeai` package (imported as `genai`)

## API Key
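You can get a Gemini API key from the [Gemini API quickstart](https://ai.google.dev/gemini-api/docs/quickstart). The `main()` function in this notebook reads the key from the first line of a local file named `token`.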
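
As a minimal sketch of key loading, the helper below checks an environment variable first and falls back to the `token` file. The function name `load_api_key`, the variable name `GEMINI_API_KEY`, and the lookup order are assumptions made for illustration; the notebook itself only reads the file.

```python
import os
from pathlib import Path


def load_api_key(env_var="GEMINI_API_KEY", token_path="token"):
    """Return the API key from the environment, falling back to a local file.

    Both `env_var` and the fallback order are assumptions for this sketch;
    the notebook only reads the first line of a file named `token`.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    path = Path(token_path)
    if path.is_file():
        lines = path.read_text(encoding="utf-8").splitlines()
        if lines:
            # An empty first line counts as "no key".
            return lines[0].strip() or None
    return None
```

Returning `None` instead of raising lets the caller decide whether to stop, which matches how `main()` exits early when no key is found.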

## Packages
```bash
conda create -n ml python=3.11
conda activate ml
conda install -c conda-forge faiss numpy
pip install faiss-cpu   # for simple quick start
pip install langchain-huggingface
pip install sentence-transformers
pip install google-generativeai   # provides the google.generativeai (genai) module used in Main
```
Alternatively, install all packages with `pip` inside a `venv`.


## Function
- Load embedding model
- Add the corpus embeddings to a **FAISS** index
- Ask the LLM with a combined prompt (self-defined instructions + RAG results + user query)

In [1]:
import numpy as np
import faiss
from langchain_huggingface import HuggingFaceEmbeddings

def load_embedding_model(model_name):
    return HuggingFaceEmbeddings(
        model_name=model_name,
        encode_kwargs={'normalize_embeddings': True} # Normalize embeddings for better similarity scores
        # Optional: Specify device, e.g., model_kwargs={'device': 'cuda'} or {'device': 'cpu'}
    )

def add_to_faiss_index(embeddings):
    # cast embeddings' type as numpy array
    embeddings_np = np.array(embeddings, dtype=np.float32)
    print(f"Embeddings shape: {embeddings_np.shape}")
    print(f"Embeddings datatype: {embeddings_np.dtype}")
    
    # create faiss index object
    index = faiss.IndexFlatL2(embeddings_np.shape[1])
    print(f"FAISS index is trained: {index.is_trained}")
    print(f"Number of vectors currently in index: {index.ntotal}")

    # add embeddings to index
    index.add(embeddings_np)
    print(f"Number of vectors successfully added to index: {index.ntotal}")

    return index

def query_search(index, query_embeddings, text_array, k=1):
    # Ensure query_embeddings is a 2D numpy array of float32
    # If you embedded multiple queries, query_embeddings might already be 2D
    # If it's a single embedding (list or 1D array), reshape it
    query_embeddings_np = np.array(query_embeddings, dtype=np.float32)
    if query_embeddings_np.ndim == 1:
        query_embeddings_np = query_embeddings_np.reshape(1, -1)  # Reshape to (1, dimension)
    elif query_embeddings_np.shape[0] > 1:
        print("Warning: query_search is designed for a single query embedding. Using only the first.")
        query_embeddings_np = query_embeddings_np[:1, :]  # Keep the 2D shape (1, dimension) that index.search expects

    # search response in index
    distances, indices = index.search(query_embeddings_np, k)

    # get results with valid index
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        if 0 <= idx < len(text_array):
            results.append((float(dist), text_array[idx]))

    return results

def ask_llm_with_rag(client, original_query, rag_results):
    """
    Sends the original query and RAG results to the large language model for an answer.

    Args:
        client: Initialized LLM client (its generation config is set when the client is created).
        original_query: The user's original question (string).
        rag_results: List of tuples [(distance, text_snippet)] from query_search.

    Returns:
        The content of the assistant's reply (string), or None if an error occurs.
    """
    # 1. Prepare Context
    context_snippets = [snippet for distance, snippet in rag_results]
    if not context_snippets:
        print("Warning: No relevant context found from RAG.")
        # Decide how to handle this: maybe ask without context or return specific message
        # For now, let's proceed without specific context in the prompt if none found
        formatted_context = "No specific context was retrieved."
    else:
        # Combine snippets into a single block, clearly delineated so the model can tell them apart
        # (e.g., headings like "Context:", separators like ---, or explicit instructions).
        formatted_context = "\n\n---\n\n".join(context_snippets) # Use --- as separator

    # 2. Construct Prompt Message for User Role
    # This template explicitly tells the model to use the provided context.
    prompt = f"""
                Strictly follow these instructions:
                1.  Answer the "User's Query" based on the provided "Context".
                2.  If the "Context" is empty or does not contain the answer, you must state: "The provided context does not contain the necessary information." After that, you may make a reasonable inference to answer the query based on your general knowledge.
                3.  Respond in the same language as the User's Query.
                
                ---

                Context:
                {formatted_context}

                ---

                User's Query:
                {original_query}
            """

    print("=============Input Message=============")
    print(f"prompt: {prompt}")
    print("=======================================")

    # 3. Call the Gemini API
    try:
        print(f"\nSending query and context to LLM...")
        response = client.generate_content(
            contents=prompt
        )
        # 4. Extract the response content
        assistant_reply = response.text
        return assistant_reply

    except Exception as e:
        print(f"An error occurred while calling the LLM API: {e}")
        return None
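
Because the prompt in `ask_llm_with_rag` is an indented triple-quoted f-string, the source-code indentation is sent to the model as part of the message. One possible way to avoid that, sketched here under assumptions, is to dedent a template before filling it in. The helper name `build_prompt` is hypothetical and the instruction text is abbreviated from the one above.

```python
import textwrap


def build_prompt(formatted_context, original_query):
    """Build the RAG prompt without the leading indentation of the source code."""
    template = textwrap.dedent("""\
        Strictly follow these instructions:
        1. Answer the "User's Query" based on the provided "Context".
        2. If the "Context" is empty or does not contain the answer, state:
           "The provided context does not contain the necessary information."
        3. Respond in the same language as the User's Query.

        ---

        Context:
        {context}

        ---

        User's Query:
        {query}
        """)
    # Substitute after dedenting so multi-line context cannot affect the dedent.
    return template.format(context=formatted_context, query=original_query)
```

Filling the placeholders after `textwrap.dedent` matters: retrieved snippets start at column zero, so dedenting an already-filled string would find no common indentation to strip.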


## Database & Configs

In [3]:
# database
CORPUS = [
    "台北101是台灣著名的地標建築。",
    "今天天氣晴朗，適合去郊外踏青。",
    "人工智慧正在改變我們的生活方式。",
    "這家餐廳的牛肉麵非常有名。",
    "CI/CD現在似乎很流行。",
    "我昨天看了一部非常感人的電影。",
    "使用再生能源有助於減少碳排放。",
    "他從小就對音樂有濃厚的興趣。",
    "請問附近有沒有便利商店？",
    "這本書深入探討了量子物理的基本概念。",
    "她正在準備考研究所，希望能進入國立大學。",
    "日本的櫻花季節吸引了許多觀光客。",
    "我每天早上都會喝一杯黑咖啡。",
    "手機沒電了，你可以借我充電器嗎？",
    "這次的颱風造成了不少災情。",
    "他正在學習如何開發網頁。",
    "狗是人類最忠實的朋友之一。",
    "這張畫是畢卡索晚年創作的作品。",
    "我們正在討論公司的年度預算計畫。",
    "最近股市波動很大，投資人需要多加注意。",
    "她夢想成為一名太空人，探索宇宙的奧秘。",
    "在知名程式裡出現的狗頭人喜歡蠟燭。"
]

EMBEDDING_MODEL_NAME = "BAAI/bge-m3"
LLM_NAME = "gemini-2.0-flash"
LLM_TEMP = 1.5  # 0.0-2.0
LLM_INSTRUCT = "You are a universal translator and multilingual assistant. Your primary rule is to mirror the user's language. If the user writes in Japanese, you must respond in Japanese. If they write in German, you must respond in German. If the language is ambiguous, default to English."

# number of top-k search results to provide to LLM
TOP_K = 5

# user query to send to LLM
QUERY = "哪一段提到了程式設計？"
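
`TOP_K` fixes how many snippets are retrieved, not how relevant they are. Since the embeddings are normalized and `IndexFlatL2` reports squared L2 distances, each distance `d` maps to a cosine similarity of `1 - d / 2`. The sketch below uses that relation to drop weak matches before they reach the LLM. The helper names and the `min_similarity=0.5` threshold are illustrative assumptions, not tuned values.

```python
def squared_l2_to_cosine(squared_distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    For unit-length a and b: ||a - b||^2 = 2 - 2 * cos(a, b).
    """
    return 1.0 - squared_distance / 2.0


def filter_by_similarity(results, min_similarity=0.5):
    """Keep (distance, text) pairs whose cosine similarity reaches the threshold.

    `min_similarity=0.5` is an arbitrary illustrative value, not a tuned one.
    """
    return [
        (dist, text)
        for dist, text in results
        if squared_l2_to_cosine(dist) >= min_similarity
    ]
```

For example, a distance of about 0.952 corresponds to a cosine similarity of about 0.524, so it would pass this threshold, while a distance of 1.2 (similarity 0.4) would not. The input has the same `(distance, text)` shape that `query_search` returns.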


## Main
Let's DO THIS!

In [4]:
import google.generativeai as genai


def main():
    # Load Gemini API Key
    GEMINI_API_KEY = None
    with open('token', 'r') as reader:
        GEMINI_API_KEY = reader.readline().strip()
    if not GEMINI_API_KEY:
        print('Load Gemini API Key Failed. Process End.')
        return

    # Set up the local Hugging Face embedding model
    # This will download the model files the first time you run it if not cached
    print("Loading Embedding Model...")
    embedding_model = load_embedding_model(EMBEDDING_MODEL_NAME)
    print(f"Embedding Model: {EMBEDDING_MODEL_NAME}")

    # Get embeddings using the HuggingFaceEmbeddings object's method
    # This runs the model LOCALLY
    embeddings = embedding_model.embed_documents(CORPUS)

    # embeddings will be a list of lists (or numpy array depending on version/config)
    # Each inner list is the embedding vector for the corresponding text in CORPUS
    print(f"Generated {len(embeddings)} embeddings.")
    if embeddings:
        print(f"Embedding dimension: {len(embeddings[0])}")

    # Add embedding to FAISS index
    faiss_index = add_to_faiss_index(embeddings)

    # Create Client to Gemini
    genai.configure(api_key=GEMINI_API_KEY)
    llm_config=genai.types.GenerationConfig(
        temperature=LLM_TEMP,
        # max_output_tokens=300,
        # top_p=0.8,
        # stop_sequences=['Thank you']
    )
    client = genai.GenerativeModel(
        model_name=LLM_NAME,
        system_instruction=LLM_INSTRUCT,
        generation_config=llm_config
    )

    print(f"LLM: {LLM_NAME} has been used.")

    # Prepare message to send to LLM
    # First, embed the query to search from FAISS index, and select top-k results
    query_embedding = embedding_model.embed_query(QUERY)
    print(f"Query: {QUERY}")


    retrieved_results = query_search(faiss_index, query_embedding, CORPUS, TOP_K)
    print(f"Top {TOP_K} distances and results: ")
    for dist, rslt in retrieved_results:
        print(f"{dist} | {rslt}")

    # Ask LLM with RAG context
    if retrieved_results: # Proceed only if context was found
        final_answer = ask_llm_with_rag(client, QUERY, retrieved_results)

        if final_answer:
            print("\nLLM Response:")
            print(final_answer)
        else:
            print("\nFailed to get response from LLM.")
    else:
        print("\nSkipping LLM call as no relevant context was retrieved.")

if __name__ == '__main__':
    main()

Loading Embedding Model...
Embedding Model: BAAI/bge-m3
Generated 22 embeddings.
Embedding dimension: 1024
Embeddings shape: (22, 1024)
Embeddings datatype: float32
FAISS index is trained: True
Number of vectors currently in index: 0
Number of vectors successfully added to index: 22
LLM: gemini-2.0-flash has been used.
Query: 哪一段提到了程式設計？
Top 5 distances and results: 
0.9522208571434021 | 我們正在討論公司的年度預算計畫。
0.9524773955345154 | 他正在學習如何開發網頁。
1.0860790014266968 | 在知名程式裡出現的狗頭人喜歡蠟燭。
1.1489602327346802 | 請問附近有沒有便利商店？
1.204961895942688 | 她正在準備考研究所，希望能進入國立大學。
prompt: 
                Strictly follow these instructions:
                1.  Answer the "User's Query" based on the provided "Context".
                2.  If the "Context" is empty or does not contain the answer, you must state: "The provided context does not contain the necessary information." After that, you may make a reasonable inference to answer the query based on your general knowledge.
                3.  Respond in the same la