## Experiment with RAG offered by Google GenAI (Gemini API)

This notebook demonstrates how to ask questions about multiple PDFs using the Gemini API.\
Here the PDFs are passed in as local file paths; PDFs fetched from the web can also be used.\
This code draws inspiration from the [Google Gemini API documentation](https://developers.generativeai.google/api/rest/v1alpha/gemini.projects.locations.models/chat/completions).

Before running this notebook, ensure you have `google-genai` installed. You can install it using pip:

`pip install google-genai`

In [4]:
import io
import os
from pathlib import Path

from google import genai

# Read the API key from the GEMINI_API_KEY environment variable
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
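
`os.getenv` returns `None` when the variable is unset, so a missing key may only surface later as a less obvious error. A minimal sketch of a fail-fast check follows; `require_api_key` is a hypothetical helper name introduced here for illustration, not part of `google-genai`.

```python
import os


def require_api_key(var_name: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper: the name and behaviour are assumptions for this
    notebook, not something provided by the google-genai package.
    """
    key = os.getenv(var_name)
    if not key or not key.strip():
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "export it before creating the client."
        )
    return key.strip()
```

It could then be used as `client = genai.Client(api_key=require_api_key())`.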

In [5]:
# Enter file paths here
path1 = Path(r"../../RAG from Scratch/data/Papers/2501.12948v1DeepSeek R1.pdf")
path2 = Path(r"../../RAG from Scratch/data/Papers/DeepSeek_V3.pdf")

# Read both PDFs into memory and upload them through the File API
doc_data_1 = io.BytesIO(path1.read_bytes())
doc_data_2 = io.BytesIO(path2.read_bytes())

sample_pdf_1 = client.files.upload(
  file=doc_data_1,
  config=dict(mime_type='application/pdf'))

sample_pdf_2 = client.files.upload(
  file=doc_data_2,
  config=dict(mime_type='application/pdf'))
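
The introduction mentions that PDFs from the web can also be used. One possible way to handle both cases is sketched below: it reads a PDF from a local path or an http(s) URL into a `BytesIO` buffer and then reuses the same `client.files.upload` call as above. The helper names `load_pdf_bytes` and `upload_pdfs`, the `opener` parameter, and the `%PDF-` header check are assumptions made for this illustration; the header check is only a cheap sanity test, not a full validation.

```python
import io
import urllib.request
from pathlib import Path

# Well-formed PDF files begin with this marker; files with leading junk
# bytes would be rejected by this simple check.
PDF_MAGIC = b"%PDF-"


def load_pdf_bytes(source, opener=urllib.request.urlopen) -> io.BytesIO:
    """Read a PDF from a local path or an http(s) URL into memory.

    Hypothetical helper; `opener` is injectable so the download step can be
    swapped out (for example in tests).
    """
    if isinstance(source, str) and source.startswith(("http://", "https://")):
        with opener(source) as resp:
            data = resp.read()
    else:
        data = Path(source).read_bytes()
    if not data.startswith(PDF_MAGIC):
        raise ValueError(f"{source!r} does not look like a PDF")
    return io.BytesIO(data)


def upload_pdfs(client, sources, opener=urllib.request.urlopen):
    """Upload each source with the same files.upload call used above."""
    return [
        client.files.upload(
            file=load_pdf_bytes(src, opener=opener),
            config=dict(mime_type="application/pdf"),
        )
        for src in sources
    ]
```

With the client from the first cell, `sample_pdf_1, sample_pdf_2 = upload_pdfs(client, [path1, path2])` would be roughly equivalent to the cell above.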


In [6]:
prompt = "What are the key differences between DeepSeek-V3 and DeepSeek-R1 in the fine-tuning methods?"

response = client.models.generate_content(
  model="gemini-2.5-flash",
  contents=[sample_pdf_1, sample_pdf_2, prompt])
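
Before rendering, it can help to guard against a response that carries no text, since `Markdown(None)` would not display anything useful. The sketch below assumes only that the response object exposes a `text` attribute that may be `None` or empty; `response_text` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a real response.

```python
from types import SimpleNamespace


def response_text(response, fallback: str = "_No text returned._") -> str:
    """Return response.text, or a fallback when it is missing or empty."""
    text = getattr(response, "text", None)
    return text if text else fallback


# Stand-in objects for illustration; a real response comes from
# client.models.generate_content.
with_text = SimpleNamespace(text="hello")
without_text = SimpleNamespace(text=None)
print(response_text(with_text))
print(response_text(without_text))
```

In the next cell this could be used as `Markdown(response_text(response))`.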

In [7]:
from IPython.display import Markdown
Markdown(response.text)

The key differences in fine-tuning methods between DeepSeek-V3 and DeepSeek-R1 (including DeepSeek-R1-Zero) lie in their primary objectives, training stages, and the source/role of reasoning data:

1.  **DeepSeek-R1-Zero (Pure RL for Reasoning):**
    *   **Objective:** To explore the potential of LLMs to develop reasoning capabilities *without any supervised fine-tuning (SFT) as a preliminary step*, relying solely on a large-scale reinforcement learning (RL) process. It aims for reasoning capabilities to *emerge* naturally.
    *   **Method:** Applies pure RL (using GRPO) directly to a base model (DeepSeek-V3-Base). It uses a rule-based reward system (accuracy and format rewards) to incentivize correct reasoning and specific output formats.

2.  **DeepSeek-R1 (Multi-Stage RL with Cold Start for Enhanced Reasoning):**
    *   **Objective:** To address challenges of DeepSeek-R1-Zero (poor readability, language mixing) and further enhance reasoning performance by incorporating a small amount of high-quality "cold-start" data and a multi-stage training pipeline. Its focus is *refining and making reasoning user-friendly*.
    *   **Method:**
        *   **Cold Start SFT:** Fine-tunes DeepSeek-V3-Base with "thousands of cold-start data" (long Chain-of-Thought, CoT, answers collected via prompting, human refinement of DeepSeek-R1-Zero outputs) as an initial RL actor.
        *   **Reasoning-oriented RL:** Applies large-scale RL (similar to DeepSeek-R1-Zero) focusing on reasoning tasks (coding, math, science, logic), adding a "language consistency reward" to improve readability.
        *   **Rejection Sampling & SFT:** After reasoning RL converges, generates new SFT data by rejection sampling from the RL checkpoint, combining reasoning data with supervised data from other domains (writing, factual QA, self-cognition) from DeepSeek-V3.
        *   **RL for all Scenarios:** A secondary RL stage to align the model with human preferences (helpfulness, harmlessness) across diverse scenarios, using a combination of rule-based rewards (for reasoning) and reward models (for general data).

3.  **DeepSeek-V3 (General Purpose with Distilled Reasoning):**
    *   **Objective:** To be a strong, economical, and general-purpose MoE model, with a particular emphasis on efficiency, broad capabilities, and alignment with human preferences. It *leverages* DeepSeek-R1's reasoning prowess.
    *   **Method (Post-Training):**
        *   **Supervised Fine-Tuning (SFT):** Curates a large instruction-tuning dataset (1.5M instances).
            *   **Crucially, for reasoning data, DeepSeek-V3 *distills* from DeepSeek-R1.** DeepSeek-R1 acts as a "teacher model" to generate high-accuracy reasoning data, which is then balanced with clarity and conciseness for DeepSeek-V3's SFT.
            *   Non-reasoning data is generated from DeepSeek-V2.5 and human-verified.
        *   **Reinforcement Learning (RL):** Follows SFT, aiming to align the model with human preferences and further unlock its potential.
            *   Uses both rule-based reward models (for verifiable tasks like math/code) and model-based reward models (for free-form answers and human preferences), with the model-based RM trained from DeepSeek-V3's *own* SFT checkpoints.
            *   Incorporates prompts from diverse domains (coding, math, writing, role-playing, QA).

**In summary:**

*   **DeepSeek-R1** (and R1-Zero) is fundamentally about **developing and refining core reasoning capabilities through iterative RL, with R1 specifically addressing readability and human-friendliness for this reasoning.** It *discovers* powerful reasoning behaviors.
*   **DeepSeek-V3** builds upon its strong pre-training by **distilling DeepSeek-R1's advanced reasoning into its supervised fine-tuning data**, and then performing further RL for *general alignment, helpfulness, and harmlessness* across a wide range of tasks, including the distilled reasoning. DeepSeek-R1 acts as a *teacher* for DeepSeek-V3's reasoning component.