## Experiment with RAG offered by Google GenAI (Gemini API)

This notebook demonstrates how to ask questions about multiple PDFs using the Gemini API.\
Here the PDFs are read from local file paths; PDFs from the web can also be used.\
The code draws inspiration from the [Google Gemini API documentation](https://developers.generativeai.google/api/rest/v1alpha/gemini.projects.locations.models/chat/completions).
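
For the web case, a minimal sketch using only the standard library might look like the following. The helper name `fetch_pdf`, its `timeout` default, and the signature check are assumptions made for illustration and are not part of the Gemini API. The returned buffer could then be passed to `client.files.upload` in place of the buffers built from local files in this notebook.

```python
import io
import urllib.request


def fetch_pdf(url: str, timeout: float = 30.0) -> io.BytesIO:
    """Download a PDF and return it as an in-memory buffer.

    Hypothetical helper for illustration; not part of google-genai.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    # PDF files start with the "%PDF-" signature, so this fails early
    # when the server returns something else (e.g. an HTML error page).
    if not data.startswith(b"%PDF-"):
        raise ValueError(f"{url} does not look like a PDF")
    return io.BytesIO(data)


# Usage (placeholder URL, replace with a real one):
# doc_data = fetch_pdf("https://example.com/paper.pdf")
```

Returning an `io.BytesIO` keeps the download in memory, which matches how the local PDFs are wrapped before upload.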

Before running this notebook, ensure you have `google-genai` installed. You can install it using pip:

`pip install google-genai`

In [None]:
import io
import os
from pathlib import Path

from google import genai

# The API key is read from the GEMINI_API_KEY environment variable.
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

In [None]:
# Enter file paths here
path1 = Path(r"../../data/Papers/2501.12948v1DeepSeek R1.pdf")
path2 = Path(r"../../data/Papers/DeepSeek_V3.pdf")

# Read both PDFs from disk and upload them using the File API
doc_data_1 = io.BytesIO(path1.read_bytes())
doc_data_2 = io.BytesIO(path2.read_bytes())

sample_pdf_1 = client.files.upload(
  file=doc_data_1,
  config=dict(mime_type='application/pdf'))

sample_pdf_2 = client.files.upload(
  file=doc_data_2,
  config=dict(mime_type='application/pdf'))
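
With more than two PDFs, the repeated upload calls could be folded into a loop. The sketch below assumes a hypothetical helper named `upload_pdfs`; it reuses the same `client.files.upload` call shown above and adds an existence check that the original cell does not perform.

```python
import io
from pathlib import Path


def upload_pdfs(client, paths):
    """Upload each PDF in `paths` and return the file handles in order.

    Hypothetical helper for illustration; `client` is expected to be
    the `genai.Client` created earlier in this notebook.
    """
    uploaded = []
    for path in map(Path, paths):
        if not path.is_file():
            raise FileNotFoundError(f"PDF not found: {path}")
        # Same upload call as in the cell above, applied once per file.
        uploaded.append(
            client.files.upload(
                file=io.BytesIO(path.read_bytes()),
                config=dict(mime_type="application/pdf"),
            )
        )
    return uploaded


# Usage with the objects defined earlier in this notebook:
# sample_pdfs = upload_pdfs(client, [path1, path2])
```

The returned list could then be unpacked into `contents` alongside the prompt, for example `contents=[*sample_pdfs, prompt]`.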

In [13]:
prompt = "What are the key differences between DeepSeek-V3 and DeepSeek-R1 in the fine-tuning methods?"

response = client.models.generate_content(
  model="gemini-2.5-flash",
  contents=[sample_pdf_1, sample_pdf_2, prompt])

In [14]:
from IPython.display import Markdown
Markdown(response.text)

The key differences in fine-tuning methods between DeepSeek-V3 and DeepSeek-R1 lie primarily in their **objectives, initial training steps, and the source/role of reasoning data**:

1.  **Primary Objective:**
    *   **DeepSeek-R1:** The core goal is to **incentivize and enhance reasoning capabilities** in LLMs. DeepSeek-R1 (and its precursor DeepSeek-R1-Zero) is explicitly designed and trained to excel at complex reasoning tasks through reinforcement learning.
    *   **DeepSeek-V3:** Its post-training goal is broader: to **align the powerful pre-trained base model (DeepSeek-V3-Base) with human preferences and unlock its potential across diverse tasks**, including general capabilities (creative writing, role-playing), factual QA, and reasoning.

2.  **Initial State / Cold Start:**
    *   **DeepSeek-R1-Zero:** Starts with a base model and applies **pure RL without any supervised fine-tuning (SFT) as a preliminary step**. This is a key distinguishing feature, aiming for reasoning capabilities to emerge naturally.
    *   **DeepSeek-R1 (the final model):** Incorporates a **small amount of "cold-start" SFT data** (long Chain-of-Thought (CoT) examples) to fine-tune the DeepSeek-V3-Base model *before* its main RL stages. This helps stabilize the early RL phase and improve readability.
    *   **DeepSeek-V3:** Undergoes a comprehensive **Supervised Fine-Tuning (SFT) stage** on 1.5 million instances spanning multiple domains. Crucially, the **reasoning data for DeepSeek-V3's SFT is generated by an "internal DeepSeek-R1 model"** (via rejection sampling). This means DeepSeek-V3 *learns* reasoning patterns from DeepSeek-R1 rather than developing them from scratch itself during its fine-tuning.

3.  **Reinforcement Learning (RL) Stages & Reward Models:**
    *   **DeepSeek-R1:**
        *   Uses a **multi-stage RL pipeline**.
        *   Relies primarily on **rule-based reward systems** for reasoning tasks (math, code), which verify correctness and enforce structural formats (`<think>`, `<answer>` tags).
        *   Introduces a **language consistency reward** during RL to address language mixing issues observed in DeepSeek-R1-Zero.
        *   The RL process is focused on *deepening* the model's reasoning thought process.
    *   **DeepSeek-V3:**
        *   Also employs **RL stages** after SFT.
        *   Uses a combination of **rule-based Reward Models** (for deterministic tasks) and **model-based Reward Models** (for free-form answers, where DeepSeek-V3 itself can act as a generative reward model to judge outputs). This allows for alignment on a wider range of open-ended tasks beyond just reasoning.
        *   The RL helps to refine both reasoning and general helpfulness/harmlessness.

4.  **Role in Data Generation/Distillation:**
    *   **DeepSeek-R1:** Acts as a **teacher model** for reasoning. The high-quality reasoning data, particularly CoT trajectories, generated by DeepSeek-R1 (especially after its reasoning-oriented RL) is used to distill knowledge into smaller models and also into DeepSeek-V3 itself during its SFT stage.
    *   **DeepSeek-V3:** **Consumes reasoning data distilled from DeepSeek-R1**. Its fine-tuning process leverages the reasoning capabilities developed by R1, effectively incorporating them into a more general-purpose model.

In essence, DeepSeek-R1 is the specialist, meticulously trained via RL to achieve cutting-edge reasoning. DeepSeek-V3 is the generalist, which then incorporates the specialized reasoning knowledge from DeepSeek-R1 during its own post-training SFT and RL phases, alongside other general-purpose alignment objectives.