# 📝 AI-Driven CV Optimisation with RAG

This notebook is available on Kaggle: [https://www.kaggle.com/code/franciscomunozg/ai-driven-cv-optimisation-with-rag](https://www.kaggle.com/code/franciscomunozg/ai-driven-cv-optimisation-with-rag)

## 1. Overview

**Goal:** This Kaggle notebook implements a Retrieval-Augmented Generation (RAG) system to analyze a Curriculum Vitae (CV) against a specific job description. It identifies the most relevant experiences from the CV and rewrites them, incorporating keywords from the job description to improve visibility for Applicant Tracking Systems (ATS).

**Target Audience:** Users aiming to tailor their CV for job applications, suitable for both technical and non-technical users (with guidance). This notebook is designed to run within the Kaggle environment.

**Key Features:**
* Parses PDF CVs into structured JSON format using Google's Gemini model.
* Generates vector embeddings for CV items (work experience, projects, etc.) using `text-embedding-004`.
* Stores embeddings in a persistent ChromaDB vector database (using a single collection with metadata).
* Performs RAG to find CV items most relevant to a given job description.
* Utilizes Gemini via a chat interface to rewrite relevant CV items with ATS optimization and keyword integration.
* Supports prompts and responses in multiple languages (currently English and Spanish, selected in the configuration).

**Notebook Workflow:**
1.  **Setup:** Install dependencies and configure the Google AI API Key via Kaggle Secrets.
2.  **User Configuration:** Set paths, language, processing parameters, and job details.
3.  **Load & Process CV:** Read the PDF, extract text, and structure it into JSON.
4.  **Embed & Store:** Generate embeddings for CV items and store them in ChromaDB.
5.  **RAG Retrieval:** Find relevant CV items based on the job description query.
6.  **Rewrite & Refine:** Engage in an interactive chat with the LLM to rewrite and optimize CV items.
7.  **Conclusion:** Review results and summary.
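
The retrieval idea behind step 5 can be illustrated with a minimal, dependency-free sketch. It is not the notebook's implementation (which relies on ChromaDB and a real embedding model): the three-dimensional vectors, the item IDs, and the helper names `cosine_similarity` and `top_k` are assumptions made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], items: dict[str, list[float]], k: int) -> list[str]:
    """Return the IDs of the k items most similar to the query vector."""
    ranked = sorted(
        items,
        key=lambda item_id: cosine_similarity(query, items[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy "embeddings"; real ones come from the embedding model
toy_cv_items = {
    "work.ml_engineer": [0.9, 0.1, 0.0],
    "work.barista": [0.0, 0.2, 0.9],
    "projects.nlp_classifier": [0.8, 0.3, 0.1],
}
toy_job_vector = [1.0, 0.2, 0.0]

best_matches = top_k(toy_job_vector, toy_cv_items, k=2)
print(best_matches)
```

In the notebook itself, the vector database performs this ranking; the sketch only shows why items whose embeddings point in a similar direction to the job description are returned first.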

## 2. Setup

This section covers the initial setup required to run the notebook within the Kaggle environment.

### 2.1. Dependencies

In [None]:
# Install necessary libraries
# Uninstall jupyterlab first to potentially resolve conflicts in the Kaggle environment
!pip uninstall -qqy jupyterlab

# Install core AI, PDF processing, vector DB, and download libraries
# google-genai is pinned to a specific version; the remaining packages are unpinned
!pip install -U -q "google-genai==1.7.0"
!pip install -q pdfplumber chromadb gdown PyYAML

print("Dependencies installed.")  # noqa: T201

### 2.2. API Key Configuration (Kaggle Secrets)

To use Google Generative AI models (like Gemini), you need an API key. This notebook is configured to securely access your API key using Kaggle Secrets.

**Action Required:**
1.  Click on **"Add-ons"** in the top menu bar of the Kaggle editor.
2.  Select **"Secrets"**.
3.  Add a new secret with the **name** `GOOGLE_API_KEY` and paste your actual Google AI API key as the **value**.

The following code cell will then retrieve this secret. Ensure you have added the secret *before* running the next cell. You can get an API key from [Google AI Studio](https://aistudio.google.com/app/apikey).
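
For readers who want to try the code outside Kaggle, one possible pattern is to fall back to an environment variable when Kaggle Secrets is unavailable. This is only a sketch under that assumption: the function name `load_api_key` is hypothetical, and the notebook itself supports Kaggle Secrets only.

```python
import os
from typing import Optional


def load_api_key(secret_name: str = "GOOGLE_API_KEY") -> Optional[str]:
    """Return the key from Kaggle Secrets, else from the environment, else None."""
    try:
        # This import only succeeds inside the Kaggle environment
        from kaggle_secrets import UserSecretsClient

        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Covers both a missing module and a missing or unreadable secret
        return os.environ.get(secret_name)
```

Calling `load_api_key()` returns `None` when neither source provides a key, so the caller can decide whether to stop execution.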

In [None]:
# Import the Kaggle Secrets client and retrieve the API key

try:
    # This import works specifically in the Kaggle environment
    from kaggle_secrets import UserSecretsClient

    user_secrets = UserSecretsClient()
    # Retrieve the secret named "GOOGLE_API_KEY"
    GOOGLE_API_KEY = user_secrets.get_secret("GOOGLE_API_KEY")
    print("Successfully retrieved GOOGLE_API_KEY from Kaggle Secrets.")  # noqa: T201

except ImportError:
    GOOGLE_API_KEY = None
    print(  # noqa: T201
        "Warning: 'kaggle_secrets' library not found. This notebook is designed to run"
        " on Kaggle."
    )
    print("API key could not be loaded.")  # noqa: T201

except Exception as e:
    # Handles cases where the secret might not be set by the user
    GOOGLE_API_KEY = None
    print(  # noqa: T201
        "Warning: Could not retrieve GOOGLE_API_KEY from Kaggle Secrets."
    )  # noqa: T201
    print(f"Error details: {e}")  # noqa: T201
    print(  # noqa: T201
        "Please ensure you have added the secret named 'GOOGLE_API_KEY' in Kaggle"
        " Add-ons -> Secrets."
    )

# Verify if the API key was loaded successfully
if not GOOGLE_API_KEY:
    print(  # noqa: T201
        "\nERROR: GOOGLE_API_KEY was not loaded. \nThe notebook cannot proceed without"
        " a valid API key."
    )
    # Optionally, raise an error to halt execution:
    # raise ValueError("GOOGLE_API_KEY not found. Please configure it in Kaggle Secrets.")
else:
    # Optional: Print a confirmation snippet (masked)
    print(  # noqa: T201
        f"API Key loaded successfully (starting with: {GOOGLE_API_KEY[:4]}...)."
    )  # noqa: T201

### 2.3. Helper Functions (Retry Logic)

This notebook interacts with the Google Generative AI API, which might occasionally return errors due to rate limits (too many requests) or temporary server issues. We define a helper function (`is_retriable`) to identify these specific, temporary errors so that we can automatically retry the API calls a few times before failing. This makes the notebook more robust.
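
To make the retry idea concrete, here is a small self-contained sketch of a predicate-driven retry loop with exponential backoff. It does not use the Google libraries: `FakeAPIError`, `is_transient`, `call_with_retry`, and `flaky_call` are hypothetical names introduced for illustration, and the delays are kept tiny so the example runs quickly.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code: int) -> None:
        super().__init__(f"HTTP {code}")
        self.code = code


def is_transient(exc: Exception) -> bool:
    """True for rate-limit (429) and service-unavailable (503) errors."""
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def call_with_retry(func, predicate, max_attempts: int = 3, base_delay: float = 0.01):
    """Call func(), retrying with exponential backoff while predicate(exc) is true."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient
            if attempt == max_attempts or not predicate(exc):
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


attempts = {"count": 0}


def flaky_call() -> str:
    """Fails twice with a 429, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeAPIError(429)
    return "ok"


result = call_with_retry(flaky_call, is_transient)
print(result, attempts["count"])
```

The predicate keeps the policy (which errors are worth retrying) separate from the mechanism (how often and how long to wait), which is the same split the notebook's helper relies on.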

In [None]:
# Import the Google Gen AI SDK and the retry helper from google-api-core
from google import genai
from google.api_core import retry


def is_retriable(exception: Exception) -> bool:
    """Checks if an exception is a retriable Google APIError (429 or 503)."""
    return isinstance(exception, genai.errors.APIError) and exception.code in {429, 503}


# This function will be used with the @retry.Retry decorator for API calls.

## 3. User Configuration

Adjust the variables in the following code cell to control the notebook's behavior. These include file paths, language settings, model parameters, and the job description you want to analyze.

In [None]:
# Required import for path handling
from pathlib import Path

# Literal restricts LANGUAGE to the supported codes; list/dict hints use built-in generics
from typing import Literal

# ==============================================================================
# User Configuration Variables
# ==============================================================================
# Adjust these settings before running the rest of the notebook.

# --- 1. Input Files & Paths ---

# Specify the path where the CV PDF file is located or will be downloaded to.
CV_PDF_PATH = Path("./resume.pdf")  # Example: file in the root working directory

# Set to True to download the CV from Google Drive using the ID below.
# Set to False if you are providing the CV via CV_PDF_PATH directly.
DOWNLOAD_CV = True  # Default: download the example CV from Google Drive
CV_GDRIVE_FILE_ID = "1avK0u9HcyuEYgpyIs_pBuzPxWPrfgW_C"  # Example File ID

# --- 2. Job Description Input ---

# Paste the target job description text between the triple quotes.
JOB_DESCRIPTION = """
Job Title: AI/Machine Learning Engineer

Company: Innovate Solutions Inc.

Location: Remote (US Based)

About Us:
Innovate Solutions Inc. is at the forefront of applying artificial intelligence to solve real-world business challenges. We foster a collaborative environment where creative thinking and technical excellence drive our success. We are passionate about building intelligent systems that deliver significant value to our clients across various industries. Join our growing team and help shape the future of applied AI.

About the Role:
We are seeking a talented and motivated AI/Machine Learning Engineer to join our core development team. You will play a key role in the end-to-end lifecycle of machine learning projects, from conceptualization and data exploration to model deployment and monitoring. You'll work closely with data scientists, software engineers, and product managers to build innovative AI-powered features and products.

Responsibilities:
- Design, develop, train, and deploy machine learning models (including deep learning models) for tasks such as NLP, predictive analytics, anomaly detection, and personalization.
- Process, cleanse, and verify the integrity of large datasets used for analysis and model training.
- Collaborate with data engineering teams to build and maintain robust data pipelines for ML workflows.
- Implement and maintain MLOps best practices for model versioning, testing, deployment, and monitoring.
- Stay current with the latest advancements in AI/ML techniques, tools, and platforms.
- Analyze experimental results, iterate on models, and communicate findings to technical and non-technical stakeholders.
- Contribute to the development of internal AI platforms and tooling.

Required Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Science, Statistics, or a related quantitative field.
- 2+ years of hands-on experience building and deploying machine learning models in a production environment.
- Strong programming skills in Python and proficiency with relevant ML libraries (e.g., Scikit-learn, TensorFlow, PyTorch, Keras).
- Solid understanding of core machine learning algorithms, statistical modeling, and evaluation metrics.
- Experience working with SQL and/or NoSQL databases.
- Familiarity with data processing and analysis libraries (e.g., Pandas, NumPy).
- Excellent problem-solving skills and attention to detail.
- Strong communication and teamwork abilities.

Desired Qualifications (Bonus Points):
- PhD in a related field.
- Experience with cloud platforms (AWS, GCP, or Azure) and their AI/ML services (e.g., SageMaker, Vertex AI, Azure ML).
- Experience with MLOps tools and practices (e.g., Docker, Kubernetes, MLflow, Kubeflow).
- Experience with Natural Language Processing (NLP) or Computer Vision (CV).
- Experience with big data technologies (e.g., Spark, Hadoop).
- Publications in relevant AI/ML conferences or journals.

What We Offer:
- Competitive salary and benefits package.
- Opportunity to work on challenging and impactful AI projects.
- A dynamic, collaborative, and supportive work environment.
- Flexible remote work policy.
- Professional development opportunities.
"""

# --- 3. Language Settings ---

# Set the primary language for prompts (influences LLM responses).
LANGUAGE: Literal["en", "es"] = "en"  # Options: 'en' or 'es'

# --- 4. RAG & Embedding Settings ---

# Retrieval Parameters for RAG Search (Controls how many results are fetched/processed)
MAX_RELEVANT_ITEMS = 3  # Target number of CV items to refine
RETRIEVAL_WINDOW = 2  # Extra items retrieved for context during search

# Embedding Model Name
EMBEDDING_MODEL_NAME = "models/text-embedding-004"

# Vector Database Configuration (Using Single Collection Strategy)
CHROMA_DB_PATH = Path("./chroma_db_persistent")  # Path for persistent DB
COLLECTION_NAME = "cv_embeddings_v1"  # Name for the single collection

# --- 5. LLM Configuration ---

# Generative model name, used both for structuring the CV and for the rewriting chat.
GENERATIVE_MODEL_NAME = "gemini-2.0-flash"

# Parameters for the PDF -> JSON *Structuring* LLM call
STRUCTURING_LLM_TEMPERATURE = 0.1

# Parameters for the CV Item *Rewriting* LLM Chat
REWRITING_LLM_TEMPERATURE = 0.8
REWRITING_LLM_TOP_P = 0.95
REWRITING_LLM_TOP_K = 30

# --- 6. CV Processing Settings ---

# List the sections (JSON keys) from the structured CV to embed and use for RAG.
# Keeping this as a list allows processing/searching multiple sections with the single DB collection.
# Example sections based on JSON Resume schema and original recommendations:
# "work", "certificates", "publications", "projects", "skills", "education"
CV_SECTIONS_TO_FOCUS: list[str] = [
    "work",
    "certificates",
    "projects",
]  # Edit this list as needed

# Optional: Limit characters fed to the structuring LLM. None = no limit.
MAX_CV_TEXT_LENGTH_FOR_STRUCTURING = 12000


# ==============================================================================
# Directory Setup (Using pathlib)
# ==============================================================================

# Ensure the directory for ChromaDB exists
CHROMA_DB_PATH.mkdir(parents=True, exist_ok=True)

# Ensure the parent directory for the CV PDF exists
CV_PDF_PATH.parent.mkdir(parents=True, exist_ok=True)


print("User configuration loaded and directories created.")  # noqa: T201

## 4. Load and Process CV

### 4.1. Load PDF Document

This step ensures the CV PDF file specified in the configuration (`CV_PDF_PATH`) is available. If `DOWNLOAD_CV` was set to `True`, it downloads the file from Google Drive using the provided ID. Otherwise, it assumes the file exists at `CV_PDF_PATH`.

Finally, it checks if the file exists and displays the PDF inline for verification.

In [None]:
import gdown
from IPython.display import IFrame, display

# --- 1. Conditional Download ---
if DOWNLOAD_CV:
    print(  # noqa: T201
        f"Attempting to download CV from Google Drive (ID: {CV_GDRIVE_FILE_ID}) to"
        f" '{CV_PDF_PATH}'..."
    )
    try:
        # Download using gdown, specifying the output path
        gdown.download(id=CV_GDRIVE_FILE_ID, output=str(CV_PDF_PATH), quiet=False)
        print("Download attempt finished.")  # noqa: T201
    except Exception as e:
        print("\nERROR: Failed to download file from Google Drive.")  # noqa: T201
        print(  # noqa: T201
            f"Please check the file ID ('{CV_GDRIVE_FILE_ID}') and ensure the file is"
            " accessible."
        )
        print(f"Error details: {e}")  # noqa: T201
else:
    print(  # noqa: T201
        f"Skipping download. Assuming CV PDF is already present at: '{CV_PDF_PATH}'"
    )  # noqa: T201

# --- 2. Verify File Existence and Display ---
if CV_PDF_PATH.is_file():
    print(  # noqa: T201
        f"CV PDF found at '{CV_PDF_PATH}'. Displaying PDF inline below:"
    )  # noqa: T201
    # Display the PDF inline for verification
    display(IFrame(src=CV_PDF_PATH, width="90%", height="600px"))
else:
    print(  # noqa: T201
        f"\nERROR: CV PDF file was not found at the path: '{CV_PDF_PATH}'"
    )  # noqa: T201
    # Provide context based on whether download was attempted
    if DOWNLOAD_CV:
        print(  # noqa: T201
            "The download may have failed, or the configured path is incorrect."
        )  # noqa: T201
    else:
        print(  # noqa: T201
            "Please ensure the file exists at the specified path in the configuration."
        )
    # If the CV is critical for subsequent steps, consider stopping execution:
    # raise FileNotFoundError(f"CV PDF not found at {CV_PDF_PATH}")

### 4.2. Extract Text from PDF

Now that the PDF file is available, we will extract its text content. We use the `pdfplumber` library for this task, as it's generally effective for text-based PDFs. The extracted text will be stored in a variable for the next step (structuring with the LLM).

In [None]:
import pdfplumber

raw_cv_text = ""  # Initialize variable

print(f"Extracting text from PDF: '{CV_PDF_PATH}'...")  # noqa: T201
try:
    with pdfplumber.open(CV_PDF_PATH) as pdf:
        # Extract text page by page, handling potential None values from empty pages.
        # Join with newlines so the last word of one page does not fuse with the next.
        raw_cv_text = "\n".join(
            page.extract_text(x_tolerance=1, y_tolerance=3) or "" for page in pdf.pages
        )

    print(  # noqa: T201
        f"Text extraction successful. Total characters: {len(raw_cv_text)}"
    )  # noqa: T201
    # Print the first 500 characters as a quick sanity check (comment out to silence)
    print(f"--- Snippet ---\n{raw_cv_text[:500]}\n---------------")  # noqa: T201

except FileNotFoundError:
    print(  # noqa: T201
        f"\nERROR: PDF file not found at '{CV_PDF_PATH}'. Cannot extract text."
    )  # noqa: T201
    # Optional: Uncomment the 'raise' below to stop execution if the file is critical
    # raise
except Exception as e:
    print(  # noqa: T201
        f"\nERROR: Failed to open or extract text from PDF '{CV_PDF_PATH}'."
    )  # noqa: T201
    print(f"Error details: {e}")  # noqa: T201
    # Optional: Uncomment the 'raise' below to stop execution on errors
    # raise

# Ensure raw_cv_text exists as an empty string if extraction failed above
if "raw_cv_text" not in locals():
    raw_cv_text = ""

### 4.3. Structure Text into JSON using LLM

This step uses the configured generative model (Gemini) to parse the unstructured `raw_cv_text` extracted from the PDF. The model is instructed to return the information structured according to the [JSON Resume schema](https://jsonresume.org/schema/).

This involves:
1.  Defining the JSON Resume schema using Python's `TypedDict` for type safety and clarity (this helps the model adhere to the desired output structure).
2.  Constructing a prompt that tells the model the task, provides the raw CV text (potentially truncated based on `MAX_CV_TEXT_LENGTH_FOR_STRUCTURING`), and specifies the desired JSON output format.
3.  Calling the Gemini API, explicitly requesting JSON output and providing the schema definition.
4.  Parsing the LLM's JSON response into a Python dictionary.
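
The final parsing step can be sketched without calling the API. The example below uses a hypothetical two-field subset of the schema (`MiniResume`) and a made-up response string; the fence-stripping branch is a defensive assumption for models that wrap JSON in Markdown despite instructions, not something the notebook requires.

```python
import json
from typing import TypedDict

FENCE = "`" * 3  # Markdown code-fence marker, built programmatically


class MiniWorkItem(TypedDict, total=False):
    name: str
    position: str


class MiniResume(TypedDict, total=False):
    """Hypothetical two-field subset of the JSON Resume schema."""

    basics: dict
    work: list[MiniWorkItem]


def parse_resume_json(response_text: str) -> dict:
    """Parse model output into a dict, dropping keys outside the schema."""
    text = response_text.strip()
    # Defensive: remove a Markdown code fence if the model added one anyway
    if text.startswith(FENCE):
        text = text.strip("`").removeprefix("json").strip()
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level.")
    allowed = MiniResume.__annotations__.keys()
    return {key: value for key, value in data.items() if key in allowed}


# Made-up response text, fenced and carrying one key outside the schema
sample_response = (
    FENCE + 'json\n{"basics": {"name": "Ada"}, "work": [], "extra": 1}\n' + FENCE
)
parsed = parse_resume_json(sample_response)
print(parsed)
```

Filtering against the `TypedDict` annotations is a lightweight check only; it does not validate nested field types.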

In [None]:
from typing_extensions import TypedDict


# Define nested structures first (order matters for definition)
class Location(TypedDict, total=False):
    address: str
    postalCode: str
    city: str
    countryCode: str
    region: str


class Profile(TypedDict, total=False):
    network: str
    username: str
    url: str


class Basics(TypedDict, total=False):
    name: str
    label: str
    image: str  # URL to image
    email: str
    phone: str
    url: str  # Personal website/portfolio URL
    summary: str  # Professional summary
    location: Location
    profiles: list[Profile]  # Use modern list hint


class WorkItem(TypedDict, total=False):
    name: str  # Name of the company/organization
    position: str  # Job title
    url: str  # Company website
    startDate: str  # Format YYYY-MM-DD or YYYY-MM or YYYY
    endDate: str  # Format YYYY-MM-DD or YYYY-MM or YYYY, or Present
    summary: str  # High-level description of role/company
    highlights: list[str]  # Specific achievements or responsibilities (bullet points)


class VolunteerItem(TypedDict, total=False):
    organization: str
    position: str
    url: str
    startDate: str
    endDate: str
    summary: str
    highlights: list[str]


class EducationItem(TypedDict, total=False):
    institution: str
    url: str
    area: str  # e.g., Computer Science
    studyType: str  # e.g., Bachelor's Degree, Master's
    startDate: str
    endDate: str
    score: str  # e.g., GPA
    courses: list[str]  # Relevant coursework


class AwardItem(TypedDict, total=False):
    title: str
    date: str  # Date awarded
    awarder: str  # Organization that gave the award
    summary: str  # Description of the award


class CertificateItem(TypedDict, total=False):
    name: str  # Name of the certificate
    date: str  # Date issued
    issuer: str  # Issuing organization (e.g., Coursera, Google)
    url: str  # Link to certificate if available


class PublicationItem(TypedDict, total=False):
    name: str  # Title of the publication
    publisher: str  # e.g., Journal name, Conference
    releaseDate: str
    url: str  # Link to publication
    summary: str  # Abstract or brief description


class SkillItem(TypedDict, total=False):
    name: str  # Broad skill category (e.g., Web Development, Data Science)
    level: str  # Optional proficiency level (e.g., Intermediate, Advanced)
    keywords: list[str]  # Specific technologies or tools (e.g., Python, PyTorch, AWS)


class LanguageItem(TypedDict, total=False):
    language: str  # e.g., English, Spanish
    fluency: str  # e.g., Native, Fluent, Conversational


class InterestItem(TypedDict, total=False):
    name: str  # Category of interest (e.g., Open Source, AI Ethics)
    keywords: list[str]  # Specific interests


class ReferenceItem(TypedDict, total=False):
    name: str  # Name of reference (ensure consent)
    reference: str  # Testimonial or contact details (handle privacy appropriately)


class ProjectItem(TypedDict, total=False):
    name: str  # Project title
    startDate: str
    endDate: str
    description: str  # Overall description of the project
    highlights: list[str]  # Key contributions or features
    url: str  # Link to project demo or repository


# --- Top-Level Curriculum Schema ---
class Curriculum(TypedDict, total=False):
    """Represents the complete JSON Resume structure."""

    basics: Basics
    work: list[WorkItem]
    volunteer: list[VolunteerItem]
    education: list[EducationItem]
    awards: list[AwardItem]
    certificates: list[CertificateItem]
    publications: list[PublicationItem]
    skills: list[SkillItem]
    languages: list[LanguageItem]
    interests: list[InterestItem]
    references: list[ReferenceItem]
    projects: list[ProjectItem]

In [None]:
import json
from google import genai  # noqa: F811
from google.genai import types
from google.api_core import retry  # noqa: F811

# --- 1. Validate API Key & Initialize Client ---
if "GOOGLE_API_KEY" not in locals() or not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY not found or empty. Please check Kaggle Secrets.")

client = genai.Client(api_key=GOOGLE_API_KEY)

# --- 2. Prepare Prompt & Configuration for JSON Structuring ---
cv_text_for_prompt = raw_cv_text
if (
    MAX_CV_TEXT_LENGTH_FOR_STRUCTURING
    and len(raw_cv_text) > MAX_CV_TEXT_LENGTH_FOR_STRUCTURING
):
    cv_text_for_prompt = raw_cv_text[:MAX_CV_TEXT_LENGTH_FOR_STRUCTURING]
    print(  # noqa: T201
        f"Warning: CV text truncated to {MAX_CV_TEXT_LENGTH_FOR_STRUCTURING} chars for"
        " structuring prompt."
    )

# System instruction defines the LLM's role and desired output format
system_instruction = (
    "You are an expert CV parser. Extract information from the provided CV text and"
    " format it strictly according to the JSON Resume Schema provided. Return ONLY the"
    " valid JSON object conforming to the schema - no introductory text, no markdown"
    " formatting ('```json', '```'), no explanations."
)
# Prompt combines instructions with the actual CV text
prompt_message = f"""
Given the following CV text, populate the fields of the JSON Resume Schema as accurately as possible.
Use empty strings, arrays, or null values for fields where information is missing in the text.

CV Text:
---
{cv_text_for_prompt}
---
"""
full_structuring_prompt = system_instruction + "\n\n" + prompt_message

# Configuration forces JSON output matching our TypedDict schema
json_generation_config = {
    "temperature": STRUCTURING_LLM_TEMPERATURE,
    "response_mime_type": "application/json",
    "response_schema": Curriculum,
}


# --- 3. Define Function for API Call with Retry Logic ---
@retry.Retry(predicate=is_retriable)  # Use helper defined in Setup
def generate_structured_cv_json_with_retry(prompt, config):
    """Calls the Gemini API to generate structured JSON, with retries on specific errors."""
    print("Calling Gemini API to structure CV text into JSON...")  # noqa: T201
    response = client.models.generate_content(
        model=GENERATIVE_MODEL_NAME,
        contents=prompt,
        config=config,
    )
    print("Gemini API call finished.")  # noqa: T201
    return response.text


# --- 4. Execute API Call and Parse Response ---
structured_cv_data = None  # Initialize result variable

try:
    json_response_text = generate_structured_cv_json_with_retry(
        prompt=full_structuring_prompt,
        config=json_generation_config,
    )

    if json_response_text:
        try:
            # Parse the JSON text returned by the API
            structured_cv_data = json.loads(json_response_text)
            print(  # noqa: T201
                "Successfully parsed LLM response into structured CV data."
            )  # noqa: T201
        except json.JSONDecodeError as e:
            # This error *should* be rare given response_mime_type="application/json"
            print("\nERROR: Failed to parse the LLM's response as JSON.")  # noqa: T201
            print(f"JSONDecodeError: {e}")  # noqa: T201
            print(  # noqa: T201
                "\nLLM Response Text Received:\n---\n", json_response_text, "\n---"
            )  # noqa: T201
    else:
        print("\nERROR: Received an empty response from the LLM API.")  # noqa: T201

except Exception as e:
    print(  # noqa: T201
        "\nERROR: An unexpected error occurred during the LLM structuring call."
    )  # noqa: T201
    print(f"Error details: {e}")  # noqa: T201

# --- 5. Final Status Update ---
if structured_cv_data:
    print("\nCV structuring process completed successfully.")  # noqa: T201
    # The variable `structured_cv_data` now holds the Python dictionary.
else:
    print("\nCV structuring process failed or produced no data.")  # noqa: T201
    # Consider stopping execution if this data is critical
    # raise RuntimeError("Failed to obtain structured CV data.")

### 4.4. (Optional) Display Structured CV Data

You can run the next cell to print the `structured_cv_data` dictionary in a readable YAML format. This helps verify that the LLM correctly parsed and structured the information from your CV text according to the JSON Resume schema. Check if key sections like 'basics', 'work', 'education', 'skills', etc., were populated as expected.

In [None]:
# Import yaml library for pretty printing dictionaries
import yaml

# Display the structured data only if it was successfully created and is a dictionary
if isinstance(structured_cv_data, dict):
    print("#" + "=" * 79)  # noqa: T201
    print("# Structured CV Data (YAML Format):")  # noqa: T201
    print("#" + "=" * 79)  # noqa: T201
    # Use yaml.dump for a readable, multi-line representation of the dictionary
    # allow_unicode ensures correct display of special characters (like accents)
    # sort_keys=False maintains the order defined in the schema where possible
    print(  # noqa: T201
        yaml.dump(
            structured_cv_data, allow_unicode=True, sort_keys=False, width=float("inf")
        )
    )
    print("#" + "=" * 79)  # noqa: T201
else:
    # Message if the data isn't available (e.g., previous step failed)
    print(  # noqa: T201
        "Structured CV data not available or is not a dictionary. Cannot display YAML."
    )

## 5. Prepare and Store Embeddings in ChromaDB

This section focuses on converting the relevant parts of your structured CV data into vector embeddings and storing them in a searchable vector database (ChromaDB). This allows us to later find the CV items most semantically similar to a job description (the core of RAG).

The key steps are:
1.  **Prepare Documents:** Extract the specific CV items (like individual work experiences or projects) from the sections listed in `CV_SECTIONS_TO_FOCUS` and format them as text documents suitable for embedding. Each item is serialized to a YAML string.
2.  **Define Embedding Function:** Set up a function that uses the configured Google embedding model (`EMBEDDING_MODEL_NAME`) to convert text documents into numerical vectors (embeddings).
3.  **Initialize Vector Database:** Connect to or create a persistent ChromaDB database using the configured path (`CHROMA_DB_PATH`) and collection name (`COLLECTION_NAME`). We will use a single collection for all items.
4.  **Generate & Store Embeddings:** Iterate through the prepared documents, generate embeddings for them using the embedding function, and add the embeddings, the original documents, and relevant metadata (like the source section) to the ChromaDB collection.
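
The single-collection-with-metadata strategy described in these steps can be sketched in plain Python. Everything here is illustrative: `toy_embed` is a hypothetical stand-in for the real embedding model, `build_records` is a made-up helper, a list of dicts replaces ChromaDB, and JSON is used instead of YAML to keep the example within the standard library.

```python
import json
import re


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words counts."""
    vector = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Deterministic bucket per word (Python's built-in hash() is salted)
        vector[sum(map(ord, word)) % dims] += 1.0
    return vector


def build_records(cv_data: dict, sections: list[str]) -> list[dict]:
    """Turn each CV item into one record: id, document, embedding, metadata."""
    records = []
    for section in sections:
        for index, item in enumerate(cv_data.get(section) or []):
            document = json.dumps(item, ensure_ascii=False)
            records.append(
                {
                    "id": f"{section}.{index}",
                    "document": document,
                    "embedding": toy_embed(document),
                    "metadata": {"section": section, "item_index": index},
                }
            )
    return records


# Made-up structured CV; "skills" is present but not selected below
toy_cv = {
    "work": [{"name": "Acme", "position": "ML Engineer"}],
    "projects": [{"name": "Resume Parser"}, {"name": "Chatbot"}],
    "skills": [{"name": "Python"}],
}
records = build_records(toy_cv, ["work", "projects"])

# All sections share one store; metadata lets a query narrow to a single section
project_ids = [r["id"] for r in records if r["metadata"]["section"] == "projects"]
print(len(records), project_ids)
```

Keeping the section name in metadata is what allows one collection to serve every CV section while still supporting per-section filtering.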

### 5.1. Prepare Text Documents from Structured CV

In [None]:
import yaml
import re

# --- 1. Helper Functions for ID Generation ---

# Maps section keys to functions creating a base ID string from item content
BASE_ID_GENERATORS = {
    "work": lambda item: (
        f"{item.get('name', 'NoCompany')}.{item.get('position', 'NoPosition')}"
    ),
    "certificates": lambda item: (
        f"{item.get('issuer', 'NoIssuer')}.{item.get('name', 'NoCert')}"
    ),
    "publications": lambda item: (
        f"{item.get('publisher', 'NoPublisher')}.{item.get('name', 'NoPub')}"
    ),
    "projects": lambda item: item.get("name", "NoProject"),
    "volunteer": lambda item: (
        f"{item.get('organization', 'NoOrg')}.{item.get('position', 'NoVolunteerPos')}"
    ),
    "education": lambda item: (
        f"{item.get('institution', 'NoInstitution')}.{item.get('area', 'NoArea')}.{item.get('studyType', '')}"
    ),
    "basics": lambda item: item.get("name", "NoPerson"),
    "awards": lambda item: (
        f"{item.get('awarder', 'NoAwarder')}.{item.get('title', 'NoAward')}"
    ),
    "skills": lambda item: item.get("name", "NoSkill"),
    "languages": lambda item: item.get("language", "NoLang"),
    "interests": lambda item: item.get("name", "NoInterest"),
    "references": lambda item: item.get("name", "NoReference"),
}


def sanitize_id(text_id: str) -> str:
    """Cleans and formats a string into a valid ChromaDB ID."""
    text_id = text_id.lower()
    # Basic accent removal using a lambda for compactness
    text_id = re.sub(
        r"[áäâàãåéëêèíïîìóöôòõøúüûùñç]",
        lambda m: {
            "á": "a",
            "ä": "a",
            "â": "a",
            "à": "a",
            "ã": "a",
            "å": "a",
            "é": "e",
            "ë": "e",
            "ê": "e",
            "è": "e",
            "í": "i",
            "ï": "i",
            "î": "i",
            "ì": "i",
            "ó": "o",
            "ö": "o",
            "ô": "o",
            "ò": "o",
            "õ": "o",
            "ø": "o",
            "ú": "u",
            "ü": "u",
            "û": "u",
            "ù": "u",
            "ñ": "n",
            "ç": "c",
        }.get(m.group(0)),
        text_id,
    )
    text_id = re.sub(r"[\s_:-]+", ".", text_id)  # Replace separators with dot
    text_id = re.sub(r"[^a-z0-9.]", "", text_id)  # Keep only alphanumeric and dot
    text_id = re.sub(r"\.+", ".", text_id)  # Consolidate consecutive dots
    text_id = text_id.strip(".")  # Remove leading/trailing dots
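    if len(text_id) < 3:  # noqa: PLR2004
        text_id = f"{text_id}.id"  # Ensure min length
    # Cap at 63 characters. ChromaDB documents a 3-63 character rule for
    # collection names; it is applied to item IDs here as a conservative convention.
    return text_id[:63]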


def generate_unique_item_id(section_key: str, item: dict, item_index: int) -> str:
    """Generates a unique, sanitized ID for a CV item using dot separator."""
    id_generator = BASE_ID_GENERATORS.get(
        section_key, lambda i: f"item.{item_index}"
    )  # Fallback
    base_id = id_generator(item)
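    # Sanitize the descriptive part first and append the index afterwards, so that
    # truncation inside sanitize_id can never cut off the suffix that keeps IDs unique
    suffix = f".{item_index}"
    prefix = sanitize_id(f"{section_key}.{base_id}")[: 63 - len(suffix)].rstrip(".")
    return f"{prefix}{suffix}"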


# --- 2. Function to Prepare Data for Embedding ---


def prepare_embedding_data(
    cv_data: dict, sections_to_include: list[str]
) -> tuple[list, list, list]:
    """
    Extracts items, formats documents (YAML strings), generates IDs, creates metadata.
    Returns tuple: (documents, ids, metadatas).
    """
    all_documents = []
    all_ids = []
    all_metadatas = []

    if not isinstance(cv_data, dict):
        print("Warning: structured_cv_data is not a dictionary.")  # noqa: T201
        return [], [], []

    print(  # noqa: T201
        f"Preparing embedding data for sections: {sections_to_include}..."
    )  # noqa: T201
    processed_count = 0
    for section_key in sections_to_include:
        section_items = cv_data.get(section_key)
        if isinstance(section_items, list):
            for index, item in enumerate(section_items):
                if isinstance(item, dict):
                    item_id = generate_unique_item_id(section_key, item, index)
                    try:
                        item_doc = yaml.dump(
                            item,
                            allow_unicode=True,
                            sort_keys=False,
                            width=float("inf"),
                            default_flow_style=None,
                        )
                    except yaml.YAMLError:
                        item_doc = str(item)  # Fallback
                        print(  # noqa: T201
                            f"Warning: YAML dump failed for item {index} in"
                            f" {section_key}."
                        )

                    metadata = {"section": section_key, "item_index": index}
                    # Add potentially useful fields from item to metadata if they exist
                    for key in [
                        "name",
                        "position",
                        "issuer",
                        "institution",
                        "organization",
                    ]:
                        value = item.get(key)
                        # ChromaDB metadata values must be str, int, float, or bool
                        if value and isinstance(value, (str, int, float, bool)):
                            metadata[f"id_{key}"] = value

                    all_documents.append(item_doc)
                    all_ids.append(item_id)
                    all_metadatas.append(metadata)
                    processed_count += 1

    print(f"Prepared {processed_count} documents for embedding.")  # noqa: T201
    return all_documents, all_ids, all_metadatas


# --- 3. Execute Preparation ---
embedding_documents = []
embedding_ids = []
embedding_metadatas = []

if isinstance(structured_cv_data, dict):
    embedding_documents, embedding_ids, embedding_metadatas = prepare_embedding_data(
        cv_data=structured_cv_data, sections_to_include=CV_SECTIONS_TO_FOCUS
    )

    if embedding_documents:
        print("\n--- Sample Prepared Data (First Item) ---")  # noqa: T201
        print(f"ID        : {embedding_ids[0]}")  # noqa: T201
        print(f"Metadata  : {embedding_metadatas[0]}")  # noqa: T201
        print(f"Doc Snippet:\n---\n{embedding_documents[0][:200]}...")  # noqa: T201
        print("---")  # noqa: T201
else:
    print(  # noqa: T201
        "\nERROR: `structured_cv_data` not available. Cannot prepare embedding data."
    )  # noqa: T201
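
The sanitized IDs are cut to 63 characters, and the item index is the last component, so two long IDs that differ only near the end could become identical after truncation. A minimal sketch of one way to guard against that, assuming a hypothetical helper `shorten_with_hash` (not part of this notebook) that swaps the tail of an over-long ID for a short SHA-1 digest:

```python
import hashlib


def shorten_with_hash(text_id: str, max_len: int = 63, digest_len: int = 8) -> str:
    """Hypothetical helper: truncate an ID but keep it distinct via a hash suffix."""
    if len(text_id) <= max_len:
        return text_id  # Short IDs pass through unchanged
    # Hex digest only contains [0-9a-f], so it survives the sanitizing rules
    digest = hashlib.sha1(text_id.encode("utf-8")).hexdigest()[:digest_len]
    head = text_id[: max_len - digest_len - 1].rstrip(".")
    return f"{head}.{digest}"


# Two made-up IDs that differ only in the trailing item index
long_a = "work." + "a" * 70 + ".0"
long_b = "work." + "a" * 70 + ".1"

print(long_a[:63] == long_b[:63])  # Plain truncation makes them identical
print(shorten_with_hash(long_a) == shorten_with_hash(long_b))  # Hashed IDs differ
```

Whether this matters depends on how long the position and company names in the CV are; for short IDs the helper returns its input unchanged.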

### 5.2. Define Embedding Function

In [None]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry
from google import genai  # noqa: F811
from google.genai import types  # noqa: F811


# Custom embedding function so ChromaDB can call the Gemini embedding API
class GeminiEmbeddingFunction(EmbeddingFunction):
    """ChromaDB embedding function backed by the Gemini embedding API."""

    def __init__(self, document_mode: bool = True) -> None:
        """Initializes based on document_mode, uses model from config."""
        self.embedding_task: str = (
            "retrieval_document" if document_mode else "retrieval_query"
        )
        self.model: str = EMBEDDING_MODEL_NAME

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        """Generates embeddings using client.models.embed_content."""
        response = client.models.embed_content(
            model=self.model,
            contents=input,
            config=types.EmbedContentConfig(task_type=self.embedding_task),
        )
        return [e.values for e in response.embeddings]
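
`GeminiEmbeddingFunction.__call__` sends every input in a single `embed_content` request. If a CV produces more items than the API accepts per request, splitting the inputs into batches may help. The sketch below is illustrative only: `chunked`, `embed_in_batches`, `fake_embed`, and the batch size of 100 are assumptions, and `fake_embed` stands in for the real API call.

```python
from collections.abc import Callable, Iterator


def chunked(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,  # Assumed limit; check the API documentation
) -> list[list[float]]:
    """Call embed_batch once per chunk and concatenate the vectors in order."""
    vectors: list[list[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed(batch: list[str]) -> list[list[float]]:
    """Stand-in for the API call: one 1-D 'embedding' per text (its length)."""
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

In the notebook, the role of `fake_embed` would be played by a function wrapping `client.models.embed_content`.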

### 5.3. Initialize Vector Database

In [None]:
import chromadb

# Assumes GeminiEmbeddingFunction is defined in section 5.2
# Assumes CHROMA_DB_PATH and COLLECTION_NAME are defined in the User Configuration cell

cv_collection = None  # Initialize variable

try:
    # Initialize persistent client
    print(f"Initializing ChromaDB client at path: {CHROMA_DB_PATH}")  # noqa: T201
    chroma_client = chromadb.PersistentClient(path=str(CHROMA_DB_PATH))

    # Instantiate the embedder; document_mode=True selects the 'retrieval_document' task type
    gemini_embedder = GeminiEmbeddingFunction(document_mode=True)

    # Get or create the single collection using name from config and the embedder
    print(f"Accessing collection: '{COLLECTION_NAME}'...")  # noqa: T201
    cv_collection = chroma_client.get_or_create_collection(
        name=COLLECTION_NAME,
        embedding_function=gemini_embedder,
        # Optional: Add metadata={"hnsw:space": "cosine"} here if needed
    )

    # Confirm collection is ready and show current count
    print(  # noqa: T201
        f"Collection '{cv_collection.name}' ready. Item count: {cv_collection.count()}"
    )

except Exception as e:
    print("\nERROR: Failed to initialize ChromaDB client or collection.")  # noqa: T201
    print(f"Check path ('{CHROMA_DB_PATH}') and ChromaDB setup.")  # noqa: T201
    print(f"Error details: {e}")  # noqa: T201
    # Consider uncommenting 'raise' if ChromaDB is essential for subsequent steps
    # raise
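
The commented-out `hnsw:space` option controls which distance metric the collection uses. A minimal sketch in plain Python (no ChromaDB calls, toy 2-D vectors) of why the choice can matter: squared L2 distance is sensitive to vector length, while cosine distance only looks at direction. It is an assumption here that the collection falls back to squared L2 when the option is left out.

```python
import math


def squared_l2(a: list[float], b: list[float]) -> float:
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus the cosine similarity of the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


query = [1.0, 0.0]
same_direction_far = [10.0, 0.0]  # Same direction as the query, much longer
other_direction_near = [1.0, 1.0]  # Closer in space, different direction

# Squared L2 ranks the nearby vector first
print(squared_l2(query, other_direction_near), squared_l2(query, same_direction_far))
# Cosine distance ranks the same-direction vector first
print(cosine_distance(query, same_direction_far), cosine_distance(query, other_direction_near))
```

If the embedding model returns vectors of roughly unit length, the two metrics tend to produce similar rankings, so the option is mainly relevant when vector norms vary.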

### 5.4. Generate and Add Embeddings to Collection

In [None]:
# Add the prepared documents, IDs, and metadata to the ChromaDB collection.
# ChromaDB uses the 'GeminiEmbeddingFunction' provided during collection creation
# to automatically generate embeddings for the documents list.

# Ensure we have a valid collection object and data to add
if (
    "cv_collection" in locals()
    and cv_collection
    and "embedding_documents" in locals()
    and embedding_documents
):

    print(  # noqa: T201
        f"Adding/updating {len(embedding_documents)} documents in ChromaDB collection"
        f" '{cv_collection.name}'..."
    )
    # Note: This step involves API calls via the embedding function and may take time.
    try:
        # Use upsert() to add new items and update existing ones if IDs match.
        # This makes the process idempotent (safe to re-run).
        cv_collection.upsert(
            ids=embedding_ids,
            metadatas=embedding_metadatas,
            documents=embedding_documents,
        )

        print("\nDocuments successfully added/updated in the collection.")  # noqa: T201
        # Verify final count
        final_count = cv_collection.count()
        print(  # noqa: T201
            f"Collection '{cv_collection.name}' now contains {final_count} items."
        )  # noqa: T201
        if final_count < len(embedding_ids):
            print(  # noqa: T201
                "Warning: Final item count is less than prepared documents count."
            )  # noqa: T201

    except Exception as e:
        print("\nERROR adding/updating documents in ChromaDB:")  # noqa: T201
        print(f"Error details: {e}")  # noqa: T201
        # raise # Optional: Stop execution

else:
    print(  # noqa: T201
        "Skipping document addition to ChromaDB: Prerequisite data or collection"
        " missing."
    )


## 6. Retrieve Relevant CV Items (RAG)

In [None]:
# Perform RAG query using the job description

retrieved_ids = []
retrieved_documents = []
retrieved_metadatas = []
retrieved_distances = []

try:
    # Basic validation of inputs from previous steps
    if "cv_collection" not in locals() or not cv_collection:
        raise ValueError("ChromaDB collection object 'cv_collection' not found.")
    if not JOB_DESCRIPTION:
        raise ValueError("JOB_DESCRIPTION variable is empty.")
    if "GeminiEmbeddingFunction" not in locals():
        raise NameError("GeminiEmbeddingFunction class definition not found.")

    print(f"Performing RAG query on collection '{cv_collection.name}'...")  # noqa: T201

    # Instantiate embedder for the 'retrieval_query' task type
    query_embedder = GeminiEmbeddingFunction(document_mode=False)

    print("Embedding job description (query)...")  # noqa: T201
    # Embed query (API call happens here, raises exception on failure via embedder)
    query_embedding = query_embedder([JOB_DESCRIPTION])[0]

    # No truthiness check on query_embedding is needed here: if the call above
    # succeeded, it holds the vector; if it failed, the exception is handled
    # by the except block below.

    num_results_to_fetch = MAX_RELEVANT_ITEMS + RETRIEVAL_WINDOW
    print(f"Querying for {num_results_to_fetch} items...")  # noqa: T201

    results = cv_collection.query(
        query_embeddings=[query_embedding],
        n_results=num_results_to_fetch,
        include=["documents", "metadatas", "distances"],
    )

    # Safely extract results
    retrieved_ids = results.get("ids", [[]])[0]
    retrieved_documents = results.get("documents", [[]])[0]
    retrieved_metadatas = results.get("metadatas", [[]])[0]
    retrieved_distances = results.get("distances", [[]])[0]

    print(f"RAG retrieval complete. Found {len(retrieved_ids)} items.")  # noqa: T201

except Exception as e:
    print(f"\nERROR during RAG retrieval: {type(e).__name__} - {e}")  # noqa: T201
    # Ensure lists are reset on error
    retrieved_ids, retrieved_documents, retrieved_metadatas, retrieved_distances = (
        [],
        [],
        [],
        [],
    )

if not retrieved_ids:
    print("\nWarning: No relevant items were retrieved.")  # noqa: T201
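
The query fetches `MAX_RELEVANT_ITEMS + RETRIEVAL_WINDOW` items, and the notebook later asks the LLM to separate them into highlighted and non-selected experiences. A minimal sketch of making the same split deterministically from the returned distances (smaller means more similar), assuming a hypothetical helper `split_by_relevance` and made-up sample values:

```python
def split_by_relevance(
    ids: list[str], distances: list[float], max_relevant: int
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Hypothetical helper: rank by ascending distance and split off the top items."""
    ranked = sorted(zip(ids, distances), key=lambda pair: pair[1])
    return ranked[:max_relevant], ranked[max_relevant:]


# Made-up IDs and distances, for illustration only
sample_ids = ["work.analyst.0", "projects.dashboard.1", "work.intern.2"]
sample_distances = [0.42, 0.31, 0.77]

highlighted, remaining = split_by_relevance(
    sample_ids, sample_distances, max_relevant=2
)
print("Highlighted:", highlighted)
print("Remaining  :", remaining)
```

Query results are typically already ordered by distance, so the explicit sort is only a safeguard. In the notebook, `retrieved_ids`, `retrieved_distances`, and `MAX_RELEVANT_ITEMS` would take the place of the sample values.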

## 7. Rewrite CV Items using LLM Chat

### 7.1. Define Prompts for Rewriting (EN & ES)

In [None]:
# Structure to hold multi-language prompt components
content = {"en": {}, "es": {}}

In [None]:
content["en"]["quit_msg"] = "To exit, enter 'q' or 'quit'"
content["es"]["quit_msg"] = "Para salir, ingresa 'q' o 'salir'"

In [None]:
# Example format for the LLM to follow when presenting a modified experience
content["en"]["example"] = """
## Position Name / Title: [JOB TITLE]
- Company Name / Business Name: [COMPANY/ORG NAME]
- Industry type: [INDUSTRY]
- Job Field: [FIELD]
- Sub-Area of Work: [SUB-AREA]

### Original Description

[ORIGINAL DESCRIPTION TEXT]

### Modified description

[SHORT SUMMARY/LEAD-IN]
- [ATS-Optimized achievement/responsibility 1 incorporating keywords]
- [ATS-Optimized achievement/responsibility 2 incorporating keywords]
- ...

### Changes made

- Keywords used: [KEYWORD 1], [KEYWORD 2], ...
- Explanation of changes: [Brief summary of additions/removals/focus shifts].


Shall we continue [Y/n]?
"""

content["es"]["example"] = """
## Nombre del puesto / Título: [TÍTULO PUESTO]
- Nombre de empresa / Negocio: [NOMBRE EMPRESA/ORG]
- Tipo de industria: [INDUSTRIA]
- Área de trabajo: [ÁREA]
- Subárea de trabajo: [SUB-ÁREA]

### Descripción original

[TEXTO DESCRIPCIÓN ORIGINAL]

### Descripción modificada

[RESUMEN CORTO/INTRODUCCIÓN]
- [Logro/responsabilidad optimizado para ATS 1 incorporando palabras clave]
- [Logro/responsabilidad optimizado para ATS 2 incorporando palabras clave]
- ...

### Cambios realizados

- Palabras clave utilizadas: [PALABRA CLAVE 1], [PALABRA CLAVE 2], ...
- Explicación de los cambios: [Resumen breve de adiciones/eliminaciones/reenfoques].


¿Continuamos? [Y/n]
"""

In [None]:
# Main prompt defining the LLM's role and the interactive workflow.
# Placeholders {n_max_exp}, {example}, {description}, {experiences} will be formatted later.

content["en"]["prompt"] = r"""
You are an expert CV writer specialized in optimizing resumes for Applicant Tracking Systems (ATS) and identifying keywords from job descriptions.
Analyze the provided job description and the list of retrieved CV experiences below. Your goal is to refine the experiences to align strongly with the job description keywords, making them ATS-compatible.

**Output Requirements:**
- ATS-friendly language.
- Concise and impactful wording.
- Focus on achievements and quantifiable results where possible.
- Use active voice (e.g., Developed, Managed, Led).
- Adhere strictly to the requested formats. Avoid extra conversational text.

**Initial Analysis Task:**
1. Identify and list the most critical keywords from the **JOB DESCRIPTION**.
2. List the top {n_max_exp} most relevant **HIGHLIGHTED EXPERIENCES** retrieved, ordered by relevance (most relevant first). Include start/end dates (if available), title/position, and company/organization name.
3. List any remaining **NON-SELECTED EXPERIENCES** retrieved similarly.

**Initial Analysis Output Format:**
**KEYWORDS:** [KEYWORD 1], [KEYWORD 2], ...
**HIGHLIGHTED EXPERIENCES:**
- ([START DATE] - [END DATE]) [JOB TITLE/PROJECT NAME 1], [COMPANY/ORG 1]
- ... (up to {n_max_exp} items) ...
**NON-SELECTED EXPERIENCES:**
- ([START DATE] - [END DATE]) [JOB TITLE/PROJECT NAME X], [COMPANY/ORG X]
- ... (remaining items) ...

**Interaction Flow:**
After presenting the initial analysis, ASK THE FOLLOWING QUESTION EXACTLY:
"Do you possess skills relevant to the keywords? Should any specific HIGHLIGHTED experience be modified?"

WAIT for my response. I will tell you which keywords are most relevant and which specific HIGHLIGHTED experience (by its title/position) I want to modify first.

**Modification Task (Perform ONLY AFTER I select an experience):**
When I ask you to modify a specific experience:
1. Focus ONLY on the single experience I selected.
2. Rewrite its description/highlights to incorporate the relevant **KEYWORDS** identified earlier.
3. Make the description achievement-oriented and ATS-friendly.
4. Present the modified experience using the **EXACT** format shown in the example below. Include the original description text for comparison. Mention the keywords used and explain the changes made.

**Modification Output Format Example:** {example}

**IMPORTANT:** Modify and present ONLY ONE experience at a time, based on my selection. After presenting a modified experience and asking "Shall we continue [Y/n]?", WAIT for my confirmation ('Y' or 'y') before proceeding to ask which *next* highlighted experience I want to modify, OR wait for my feedback/request for further changes on the *current* one. If I enter 'n' or 'q', stop the process.

**Context:**
JOB DESCRIPTION: "{description}"
RETRIEVED CV EXPERIENCES:
{experiences}

Start by performing the **Initial Analysis Task**.
"""

content["es"]["prompt"] = r"""
Eres un experto redactor de CVs, especialista en optimizar para Applicant Tracking Systems (ATS) e identificar palabras clave en descripciones de trabajo.
Analiza la descripción del puesto y la lista de experiencias recuperadas del CV que se proporcionan a continuación. Tu objetivo es refinar las experiencias para alinearlas fuertemente con las palabras clave de la descripción del puesto, haciéndolas compatibles con ATS.

**Requisitos de Salida:**
- Lenguaje amigable para ATS.
- Redacción concisa e impactante.
- Enfoque en logros y resultados cuantificables siempre que sea posible.
- Usar voz activa (ej., Desarrollé, Gestioné, Lideré).
- Adherirse estrictamente a los formatos solicitados. Evitar texto conversacional extra.

**Tarea de Análisis Inicial:**
1. Identifica y lista las palabras clave más críticas de la **DESCRIPCIÓN DE LA OFERTA**.
2. Lista las {n_max_exp} **EXPERIENCIAS DESTACADAS** más relevantes recuperadas, ordenadas por relevancia (la más relevante primero). Incluye fechas de inicio/fin (si están disponibles), título/puesto y nombre de la empresa/organización.
3. Lista cualquier **EXPERIENCIA NO SELECCIONADA** restante recuperada de manera similar.

**Formato de Salida del Análisis Inicial:**
**PALABRAS CLAVES:** [PALABRA CLAVE 1], [PALABRA CLAVE 2], ...
**EXPERIENCIAS DESTACADAS:**
- ([FECHA INICIO] - [FECHA FIN]) [TÍTULO PUESTO/PROYECTO 1], [EMPRESA/ORG 1]
- ... (hasta {n_max_exp} ítems) ...
**EXPERIENCIAS NO SELECCIONADAS:**
- ([FECHA INICIO] - [FECHA FIN]) [TÍTULO PUESTO/PROYECTO X], [EMPRESA/ORG X]
- ... (ítems restantes) ...

**Flujo de Interacción:**
Después de presentar el análisis inicial, HAZ LA SIGUIENTE PREGUNTA EXACTAMENTE:
"¿Posees habilidades relevantes para las palabras claves? ¿Debería modificarse alguna EXPERIENCIA DESTACADA específica?"

ESPERA mi respuesta. Te indicaré qué palabras clave son más relevantes y qué EXPERIENCIA DESTACADA específica (por su título/puesto) quiero modificar primero.

**Tarea de Modificación (Realizar SÓLO DESPUÉS de que yo seleccione una experiencia):**
Cuando te pida modificar una experiencia específica:
1. Enfócate SÓLO en la única experiencia que seleccioné.
2. Reescribe su descripción/logros para incorporar las **PALABRAS CLAVES** relevantes identificadas anteriormente.
3. Haz la descripción orientada a logros y amigable para ATS.
4. Presenta la experiencia modificada usando el formato **EXACTO** mostrado en el ejemplo a continuación. Incluye el texto de la descripción original para comparación. Menciona las palabras clave usadas y explica los cambios realizados.

**Ejemplo de Formato de Salida de Modificación:** {example}

**IMPORTANTE:** Modifica y presenta SÓLO UNA experiencia a la vez, según mi selección. Después de presentar una experiencia modificada y preguntar "¿Continuamos? [Y/n]", ESPERA mi confirmación ('Y', 'y' o Enter) antes de preguntar qué *siguiente* experiencia destacada quiero modificar, O espera mi feedback/solicitud de más cambios en la *actual*. Si ingreso 'n' o 'q', detén el proceso.

**Contexto:**
DESCRIPCIÓN DE LA OFERTA: "{description}"
EXPERIENCIAS DEL CV RECUPERADAS:
{experiences}

Comienza realizando la **Tarea de Análisis Inicial**.
"""

# Note: The placeholders {n_max_exp}, {example}, {description}, {experiences}
# will be filled using .format() before sending the prompt to the LLM.
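
A short sketch of how `str.format()` treats these templates, using a made-up miniature template rather than the full prompt. Only the template string is scanned for placeholders, so braces inside a substituted value (for example in a pasted job description) are left untouched, whereas a literal brace written in the template itself has to be doubled.

```python
# Miniature stand-in for the real prompt template
template = 'Top {n_max_exp} items.\nJOB DESCRIPTION: "{description}"'

# Braces inside the substituted value are not interpreted as placeholders
job_description = "Build {microservices} in Go"
filled = template.format(n_max_exp=3, description=job_description)
print(filled)

# A literal brace in the template must be written as {{ or }}
escaped = "Reply as JSON like {{'ok': true}} for {name}".format(name="CV item")
print(escaped)
```

This is why the prompts can safely embed `{example}` and `{description}` values of arbitrary content, as long as the prompt text itself contains no stray single braces.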

### 7.2. Initiate Chat and Generate Initial Response

In [None]:
from IPython.display import Markdown

# Assumes 'client', 'content', and the configuration constants exist from previous cells

# --- 1. Format Retrieved Experiences for Prompt ---
formatted_experiences = []
if "retrieved_documents" in locals() and retrieved_documents:
    for i, doc in enumerate(retrieved_documents):
        meta = retrieved_metadatas[i] if i < len(retrieved_metadatas) else {}
        section = meta.get("section", "Unknown Section")
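        name = meta.get("id_name", "")
        position = meta.get("id_position", "")
        # Join only the fields that are present to avoid a dangling " - "
        title = " - ".join(str(part) for part in (name, position) if part)
        header = f"{title} ({section})" if title else f"({section})"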
        formatted_experiences.append(f"--- Experience {i+1}: {header} ---\n{doc}")
    experiences_context_string = "\n\n".join(formatted_experiences)
else:
    experiences_context_string = (
        "No relevant experiences were retrieved from the database."
    )
    print("Warning: No retrieved experiences to include in the prompt.")  # noqa: T201

# --- 2. Select Language and Format Final Prompt ---
content_lang = content[LANGUAGE]  # Select EN or ES content dict
initial_prompt = content_lang["prompt"].format(
    n_max_exp=MAX_RELEVANT_ITEMS,
    example=content_lang["example"],
    description=JOB_DESCRIPTION,
    experiences=experiences_context_string,
)

# --- 3. Initialize LLM and Chat for Rewriting ---
chat = None  # Initialize chat variable
try:
    # Model used for chat-based rewriting (name comes from the configuration cell)
    rewriting_model = GENERATIVE_MODEL_NAME

    # Generation settings from the configuration cell, passed as a plain dict
    rewriting_generation_config = {
        "temperature": REWRITING_LLM_TEMPERATURE,
        "top_p": REWRITING_LLM_TOP_P,
        "top_k": REWRITING_LLM_TOP_K,
    }

    # Start a new chat session (history is empty)
    chat = client.chats.create(
        model=rewriting_model, history=[], config=rewriting_generation_config
    )
    print(  # noqa: T201
        f"Chat session initialized with model '{GENERATIVE_MODEL_NAME}'."
    )  # noqa: T201

    # --- 4. Send Initial Prompt and Get Response ---
    print("Sending initial prompt to LLM for analysis...")  # noqa: T201
    # Send the formatted prompt; the generation config was set when the chat was created
    initial_response = chat.send_message(initial_prompt)


except NameError as e:
    # Catch errors if 'client' or a configuration constant is not defined
    print(  # noqa: T201
        "ERROR: Required object not defined (e.g., 'client', 'GENERATIVE_MODEL_NAME')."
        " Check cell execution order."
    )
    print(f"Error: {e}")  # noqa: T201
    chat = None
except Exception as e:
    print("\nERROR initializing chat or sending initial message:")  # noqa: T201
    print(f"Error Type: {type(e).__name__}")  # noqa: T201
    print(f"Error details: {e}")  # noqa: T201
    chat = None

# Ensure 'initial_response' exists for the next step ('chat' is always set above)
if "initial_response" not in locals():
    initial_response = None

# Report whether this cell's task succeeded
if chat and initial_response:
    print("Initial LLM response obtained successfully.")  # noqa: T201
else:
    print("Failed to obtain initial LLM response or initialize chat.")  # noqa: T201

### 7.3. Interactive Refinement Loop

The following cell starts the interactive loop. First, it displays the initial analysis received from the LLM in the previous step. Then, it repeatedly prompts you for input.

Based on the LLM's analysis and questions, provide your feedback or select an experience to modify. Your input will be sent back to the LLM, and its response will be displayed.

Enter 'q' or 'quit' (or 'salir' in Spanish) to exit the loop; 'n' and 'exit' also end it. Pressing Enter with no input is sent to the LLM as 'Y'.
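
The input handling described above can be sketched as a small pure function, which makes the rules easy to check without a live chat session. `interpret_input` is a hypothetical helper, not part of the notebook, and trimming surrounding whitespace is an added assumption.

```python
from typing import Optional

# Same exit words the loop checks for
QUIT_OPTIONS = {"q", "quit", "exit", "salir", "n"}


def interpret_input(raw: str) -> Optional[str]:
    """Hypothetical helper: return the message to send, or None to exit the loop."""
    cleaned = raw.strip()
    if cleaned.lower() in QUIT_OPTIONS:
        return None  # User asked to stop
    return cleaned or "Y"  # Empty input is treated as confirmation


for sample in ["Q", "", "  Modify the analyst role  ", "salir"]:
    print(repr(sample), "->", repr(interpret_input(sample)))
```

Keeping the decision separate from `input()` and `chat.send_message()` is one way to make the loop's behaviour testable.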

In [None]:
from IPython.display import Markdown, display  # noqa: F811

# Assumes 'chat', 'initial_response', 'content', and 'LANGUAGE' exist from previous cells.

# --- 1. Display Initial LLM Analysis ---
# Display the response received from the initial prompt (section 7.2)
if "initial_response" in locals() and initial_response:
    print("\n" + "=" * 80)  # noqa: T201
    print("LLM Initial Analysis Response:")  # noqa: T201
    print("=" * 80)  # noqa: T201
    display(Markdown(initial_response.text))
    print("=" * 80)  # noqa: T201
else:
    print("\nWarning: Initial LLM response not available to display.")  # noqa: T201

# --- 2. Interactive Refinement Loop ---
# Proceed only if the chat object was successfully created
if "chat" in locals() and chat:
    print("\n--- Starting Interactive Refinement ---")  # noqa: T201
    # Get the appropriate quit message for the configured language
    quit_msg = content[LANGUAGE].get("quit_msg", "Enter 'q' or 'quit' to exit")
    # Define various inputs that signal exiting the loop (incl. 'n' for No)
    quit_options = {"q", "quit", "exit", "salir", "n"}

    while True:
        # Display quit instructions clearly
        print("\n" + f" {quit_msg} ".center(80, "-"))  # noqa: T201
        try:
            # Prompt user for input (surrounding whitespace is ignored)
            user_input = input("> ").strip()

            # Check if the user wants to exit
            if user_input.lower() in quit_options:
                print("\nExiting interactive refinement loop.")  # noqa: T201
                break

            # Assume 'Yes' (continue/approve) if user just presses Enter
            if not user_input:
                user_input = "Y"
                print(  # noqa: T201
                    "(Interpreting empty input as 'Y' to continue/confirm)"
                )  # noqa: T201

            # Send the user's message to the LLM via the chat session
            print("\nSending message to LLM...")  # noqa: T201
            llm_response = chat.send_message(user_input)
            print("LLM response received:")  # noqa: T201

            # Display the LLM's response using Markdown rendering
            display(Markdown(llm_response.text))

        except KeyboardInterrupt:
            print("\nLoop interrupted by user (Ctrl+C). Exiting.")  # noqa: T201
            break
        except Exception as e:
            # Catch potential errors during chat interaction
            print("\nERROR during interactive chat:")  # noqa: T201
            print(f"Error Type: {type(e).__name__}")  # noqa: T201
            print(f"Error details: {e}")  # noqa: T201
            print("Exiting interactive loop due to error.")  # noqa: T201
            break  # Exit loop on error

    print("--- End of Interactive Refinement ---")  # noqa: T201

else:
    # Message if chat object wasn't created successfully earlier
    print(  # noqa: T201
        "\nSkipping interactive loop: Chat object ('chat') is not available."
    )  # noqa: T201

## 8. Conclusion & Take-aways

This notebook demonstrated a complete workflow for optimizing CV sections against a job description using RAG and Generative AI:

* **CV Structuring:** Successfully parsed a PDF CV into a structured JSON format using an LLM (Gemini).
* **Vector Search (RAG):** Created embeddings for CV items and utilized ChromaDB to retrieve the sections most semantically relevant to a specific job description.
* **LLM-Powered Rewriting:** Employed a Gemini chat model in an interactive loop to refine the retrieved CV sections, incorporating keywords and optimizing for ATS.
* **Self-Contained Workflow:** The entire process, from PDF loading to interactive rewriting, is contained within this single Kaggle notebook.

This approach allows for targeted CV customization, potentially improving application success rates by better aligning the CV with job requirements and ATS filtering.

## 9. Final Remarks / Contact

🌟 CONGRATULATIONS 🌟

If you’ve made it all the way to the end of this notebook (whether you’re an evaluator, a fellow data‑enthusiast, or just curious about the process), you definitely deserve a prize!

Our project is evolving every week, and your feedback keeps it alive.
👉 Show some love by giving the repo a star on GitHub:
https://github.com/framunoz/cv-analyser-with-rag/

A single ⭐ helps the framework grow, motivates new features, and lets us know this work is valuable to the community.

Thanks for reading, building, and experimenting alongside us—see you in the next commit!