## Experiment with RAG offered by Google GenAI (Gemini API)

This notebook demonstrates how to ask questions about a single PDF using the Gemini API.
Here the PDF is read from a local file path; PDFs fetched from the web can be used as well.
The code draws on the [Google Gemini API documentation](https://developers.generativeai.google/api/rest/v1alpha/gemini.projects.locations.models/chat/completions).

Before running this notebook, ensure you have `google-genai` installed. You can install it using pip:

`pip install google-genai`

In [None]:
from google import genai
from google.genai import types
from pathlib import Path
import os

# The API key is read from the GEMINI_API_KEY environment variable
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
# Change this to the path of your own PDF
file_path = Path(r"../../RAG from Scratch/data/Papers/2412.19437v2 DeepSeek V3 Tech Report 53 pages.pdf")
# Read the PDF as raw bytes
file_data = file_path.read_bytes()
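
The cell above reads a local file. For a PDF hosted on the web, the bytes can be fetched first and then used in the same way. The following is a minimal sketch using only the standard library; `fetch_pdf_bytes` and `looks_like_pdf` are hypothetical helper names, not part of `google-genai`, and the `%PDF-` check is only a cheap sanity test, not full validation.

```python
from urllib.request import urlopen


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF magic header."""
    return data.startswith(b"%PDF-")


def fetch_pdf_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a document and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    # Fail early if the server returned something else, e.g. an HTML error page
    if not looks_like_pdf(data):
        raise ValueError(f"{url} does not look like a PDF")
    return data
```

With a real address in place of the placeholder, `file_data = fetch_pdf_bytes("https://example.com/paper.pdf")` yields bytes that can be used exactly like the `file_data` read from disk.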

In [4]:
len(file_data) / (1024 * 1024)  # File size in MiB (1 MiB = 1024 * 1024 bytes)

1.7999324798583984
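
The size matters because PDF bytes sent inline count toward the size of the request. The sketch below makes that check explicit. The 20 MiB default is an assumption about the inline request limit, so confirm the current figure in the Gemini API documentation; `size_mib` and `fits_inline` are hypothetical helper names.

```python
MIB = 1024 * 1024


def size_mib(data: bytes) -> float:
    """Size of a byte string in MiB."""
    return len(data) / MIB


def fits_inline(data: bytes, limit_mib: float = 20.0) -> bool:
    """True if the payload is under the assumed inline limit.

    The 20 MiB default is an assumption, not a value taken from this notebook.
    """
    return size_mib(data) < limit_mib
```

For the file used here, `size_mib(file_data)` gives about 1.8, well under that assumed limit.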

In [5]:
prompt = "According to the document, how does DeepSeek V3 compare to GPT models in the structure?"

In [None]:
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=file_data, mime_type="application/pdf"),
        prompt,
    ],
)

# The request can take anywhere from about 15 seconds to 1.5 minutes
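
Because the latency varies this much, it can help to measure it rather than estimate. The following is a generic sketch: `timed` is a hypothetical helper that wraps any callable and reports elapsed wall-clock time, and it assumes nothing about the Gemini client itself.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Passing `client.models.generate_content` as `fn`, with the same `model` and `contents` keyword arguments as in the cell above, returns the response together with the number of seconds the call took.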

In [8]:
from IPython.display import Markdown
Markdown(response.text)

According to the document, DeepSeek-V3's structure, while still based on the **Transformer framework**, incorporates several key architectural innovations that distinguish it, particularly from standard dense models and even other MoE models (though specific GPT structural details beyond being MoE are not extensively detailed in direct comparison).

Here's how DeepSeek-V3 compares to GPT models in structure, based on the provided text:

1.  **Core Architecture:**
    *   Both DeepSeek-V3 and GPT models (implicitly, as GPT is a Transformer) are built upon the **Transformer (Vaswani et al., 2017) framework**.

2.  **Mixture-of-Experts (MoE):**
    *   DeepSeek-V3 is a **strong Mixture-of-Experts (MoE) language model** with 671B total parameters and 37B activated for each token. It specifically employs the **DeepSeekMoE architecture** for its Feed-Forward Networks (FFNs). This architecture uses "finer-grained experts and isolates some experts as shared ones."
    *   While larger GPT models like GPT-4 are known to utilize MoE, the document highlights DeepSeekMoE as an innovation validated in DeepSeek-V2, suggesting it's a specific approach distinct from generic MoE implementations like GShard.

3.  **Attention Mechanism:**
    *   DeepSeek-V3 adopts **Multi-head Latent Attention (MLA)**. The core of MLA is the "low-rank joint compression for attention keys and values to reduce Key-Value (KV) cache during inference." This significantly reduces KV cache memory usage while maintaining performance.
    *   Standard GPT models typically use conventional Multi-Head Attention (MHA), which does not feature this specific latent compression for KV cache.

4.  **Load Balancing Strategy:**
    *   DeepSeek-V3 pioneers an **auxiliary-loss-free strategy for load balancing** within its DeepSeekMoE architecture. This aims to "minimize the adverse impact on model performance that arises from the effort to encourage load balancing."
    *   Traditional MoE models (which may include some GPT MoE variants, though not explicitly stated) often rely on an auxiliary loss for load balancing.

5.  **Multi-Token Prediction (MTP) Training Objective:**
    *   DeepSeek-V3 sets a **multi-token prediction training objective** for stronger performance. Its MTP implementation uses "D sequential modules to predict D additional tokens" and "sequentially predict additional tokens and keep the complete causal chain at each prediction depth."
    *   This is presented as an enhancement to the standard next-token prediction objective common in models like GPT, used for improved training signals and enabling speculative decoding during inference.

In summary, DeepSeek-V3 distinguishes itself structurally through its specific **DeepSeekMoE** implementation with an **auxiliary-loss-free load balancing strategy**, its novel **Multi-head Latent Attention (MLA)** mechanism for efficient KV cache, and the incorporation of **Multi-Token Prediction (MTP) modules** during training.