# insAIghtCV: Intelligent Resume Assessment Using Retrieval-Augmented Generation

by: Allan Khester Mesa

## Capstone Overview

Recruiters often struggle to evaluate large volumes of technical resumes efficiently and fairly, especially when matching candidates to specific developer roles. Traditional keyword-based filtering can overlook qualified applicants or fail to assess the contextual relevance of their experience and skills. This capstone project introduces **insAIghtCV**, an intelligent resume analyst powered by Retrieval-Augmented Generation (RAG). It uses semantic search to retrieve relevant resume excerpts and a large language model to analyze them against targeted hiring criteria, producing structured feedback on applicant suitability that is grounded in the resume text. The system is designed to improve hiring accuracy, reduce manual review time, and support fairer candidate assessments in the tech industry.

## CAPSTONE PROJECT CREATION

In [8]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

### Setup

First, install ChromaDB and the Gemini API Python SDK.

In [9]:
!pip uninstall -qqy jupyterlab kfp  # Remove unused conflicting packages
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3"



Next, install LangChain and its community integrations, which provide the PDF loader used in the Data section.

In [10]:
!pip install -qU langchain
!pip install -qU langchain-community

#### Import Required Libraries

In [11]:
from google import genai
from google.genai import types
from IPython.display import Markdown

genai.__version__

'1.7.0'

In [12]:
client = genai.Client(api_key=GOOGLE_API_KEY)

for m in client.models.list():
    if "embedContent" in m.supported_actions:
        print(m.name)

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


### Data

- First, load a resume with LangChain's `PyPDFLoader`. The file **must** be a PDF.
- Then preprocess the `page_content` of each document returned by `PyPDFLoader.load()`: replace line breaks with spaces and collapse repeated whitespace so that each page becomes a single-line string.



In [13]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from typing import List
import re

# Load PDF
# pdf_path: str = "/kaggle/input/pdf-data-example/Rsum_Mesa_linkedin (SE).pdf"
pdf_path2: str = "/kaggle/input/pdf-data-example/REsume v2.1.pdf"
py_pdf: PyPDFLoader = PyPDFLoader(file_path=pdf_path2)
documents = py_pdf.load()

# Clean each document page
cleaned_documents2: List[Document] = []

for doc in documents:
    # str.split() with no arguments splits on any run of whitespace
    # (including \n and \r), so joining on a single space yields one clean line.
    cleaned_text = ' '.join(doc.page_content.split())
    
    cleaned_documents2.append(Document(
        page_content=cleaned_text,
        metadata=doc.metadata
    ))

print(cleaned_documents2)

# # Improved Section-aware splitting
# SECTION_HEADERS = [
#     "contacts", "experience", "work experience", "professional experience",
#     "education", "skills", "projects", "certifications",
#     "summary", "objective", "interests", "awards", "languages"
# ]

# # Match headers that start a section (even with trailing colons or extra text)
# SECTION_SPLIT_REGEX = re.compile(
#     rf"(?i)\b({'|'.join(SECTION_HEADERS)})\b[\s:]*", re.IGNORECASE
# )

# def split_resume_by_sections(text: str) -> List[str]:
#     matches = list(SECTION_SPLIT_REGEX.finditer(text))
#     chunks = []

#     if not matches:
#         print("⚠️ No section headers found — returning full text")
#         return [text.strip()]
    
#     for i, match in enumerate(matches):
#         start = match.start()
#         end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        
#         # Include the matched section header in the chunk
#         section_text = text[start:end].strip()
#         if section_text:
#             chunks.append(section_text)

#     return chunks

# # Split long sections
# def final_chunking(section_texts: List[str], chunk_size=500, chunk_overlap=100) -> List[str]:
#     splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
#     final_chunks = []
#     for section in section_texts:
#         final_chunks.extend(splitter.split_text(section))
#     return final_chunks

# # Apply splitter
# final_documents: List[Document] = []

# for doc in cleaned_documents2:
#     section_chunks = split_resume_by_sections(doc.page_content)
#     chunked_sections = final_chunking(section_chunks)
    
#     for chunk in chunked_sections:
#         final_documents.append(Document(
#             page_content=chunk,
#             metadata=doc.metadata
#         ))

# # Preview
# for i, doc in enumerate(final_documents):
#     print(f"\n--- Final Chunk {i+1} ---\n{doc.page_content}\nMetadata: {doc.metadata}")



[Document(metadata={'producer': 'Skia/PDF m136 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'REsume v2', 'source': '/kaggle/input/pdf-data-example/REsume v2.1.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content="Kenn-Roe V. Basseg kennroeb@gmail.com | 09754467109 | github.com/Nooblese SUMMARY Motivated and team-oriented Computer Science student with a foundation in programming, problem-solving, and software development. Eager to apply academic knowledge and technical skills in a dynamic internship environment. Committed to learning from experienced professionals, contributing meaningfully to real-world projects, and continuously growing both personally and professionally within a collaborative and innovative team setting. AWARDS ● 1st place, DICT's AI.DEAS. A regional AI workshop with pitch competition on AI | University of Saint Louis - Tuguegarao | (2024) ● Successfully passed the Civil Service Commission Career Service Examination - Profess
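
The commented-out section-aware splitter in the cell above can be tried on a plain string without LangChain. The following is a minimal sketch of that idea, not the notebook's final method. It assumes section titles appear in capitals (as `SUMMARY` and `AWARDS` do in the sample output), so it matches case-sensitively to avoid splitting on words like "skills" inside a sentence. The names `split_by_sections`, `sample`, and the header list are illustrative.

```python
import re
from typing import List

# Assumed header list; adjust it to the headers the resumes actually use.
SECTION_HEADERS = [
    "SUMMARY", "AWARDS", "EDUCATION", "SKILLS",
    "PROJECTS", "CERTIFICATIONS", "EXPERIENCE",
]

# Case-sensitive on purpose: only upper-case headers start a new section.
SECTION_RE = re.compile(r"\b(" + "|".join(SECTION_HEADERS) + r")\b")


def split_by_sections(text: str) -> List[str]:
    """Split a one-line resume string into chunks that each start at a header."""
    matches = list(SECTION_RE.finditer(text))
    if not matches:
        return [text.strip()]

    chunks: List[str] = []

    # Keep whatever precedes the first header (typically name and contacts).
    preamble = text[:matches[0].start()].strip()
    if preamble:
        chunks.append(preamble)

    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        section = text[match.start():end].strip()
        if section:
            chunks.append(section)
    return chunks


# Made-up sample text, not taken from the resume above.
sample = (
    "Jane Doe jane@example.com SUMMARY Motivated student with technical skills. "
    "AWARDS 1st place, hackathon (2024) SKILLS Python, SQL"
)
sections = split_by_sections(sample)
for section in sections:
    print(section)
```

Smaller, section-aligned chunks would give the retriever more than one candidate per page, which matters here because the collection otherwise holds only one entry per PDF page.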

### Create the embedding database with ChromaDB

In [14]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

from google.genai import types


# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]
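
The `@retry.Retry(predicate=is_retriable)` decorator re-invokes the embedding call when the predicate reports a transient error such as a quota limit. The sketch below is a simplified standard-library stand-in for that idea, not `google.api_core`'s implementation. `FakeAPIError`, `retry_if`, and `flaky_embed` are hypothetical names, and the attempt count and delays are arbitrary.

```python
import time
from functools import wraps


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code: int):
        super().__init__(f"API error {code}")
        self.code = code


def is_retriable(exc: Exception) -> bool:
    # Same rule as the notebook's predicate: retry on 429 (quota) and 503 (unavailable).
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def retry_if(predicate, max_attempts=5, initial_delay=0.01, multiplier=2.0):
    """Retry the wrapped function with exponential backoff while predicate(exc) is true."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Give up on non-retriable errors or when attempts run out.
                    if not predicate(exc) or attempt == max_attempts:
                        raise
                    time.sleep(delay)
                    delay *= multiplier
        return wrapper
    return decorator


calls = {"n": 0}


@retry_if(is_retriable)
def flaky_embed(texts):
    """Fails twice with a 429, then returns dummy 3-dimensional vectors."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeAPIError(429)
    return [[0.0, 0.0, 0.0] for _ in texts]


vectors = flaky_embed(["first page", "second page"])
print(calls["n"], len(vectors))
```

Errors that do not satisfy the predicate (for example an invalid API key) are raised immediately, which is usually what you want: retrying them only delays the failure.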

#### Add documents to the db.

Extract the `page_content` and `metadata` from each `Document` in the list, and give each entry an `id` built from its position in the list (`doc_0`, `doc_1`, and so on).

In [15]:
import chromadb
# Initialize ChromaDB with Gemini embeddings
chroma_client1 = chromadb.Client()
embed_fn = GeminiEmbeddingFunction()
collection = chroma_client1.get_or_create_collection(name="resumedb", embedding_function=embed_fn)


# Add documents
texts = [doc.page_content for doc in cleaned_documents2]
metadatas = [doc.metadata for doc in cleaned_documents2]
ids = [f"doc_{i}" for i in range(len(cleaned_documents2))]

collection.add(
    documents=texts,
    metadatas=metadatas,
    ids=ids
)

collection.peek(0)


{'ids': ['doc_0', 'doc_1'],
 'embeddings': array([[-0.02374197,  0.01962563, -0.01994732, ..., -0.00566674,
          0.03233953, -0.03629556],
        [-0.00889046, -0.02098409, -0.08990833, ..., -0.02340102,
          0.01596081, -0.01379331]]),
 'documents': ["Kenn-Roe V. Basseg kennroeb@gmail.com | 09754467109 | github.com/Nooblese SUMMARY Motivated and team-oriented Computer Science student with a foundation in programming, problem-solving, and software development. Eager to apply academic knowledge and technical skills in a dynamic internship environment. Committed to learning from experienced professionals, contributing meaningfully to real-world projects, and continuously growing both personally and professionally within a collaborative and innovative team setting. AWARDS ● 1st place, DICT's AI.DEAS. A regional AI workshop with pitch competition on AI | University of Saint Louis - Tuguegarao | (2024) ● Successfully passed the Civil Service Commission Career Service Examination 
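
Conceptually, querying the collection means embedding the question and ranking the stored vectors by how close they are to it. The toy sketch below illustrates that ranking with made-up 3-dimensional vectors and cosine similarity. The metric is an assumption for illustration: the distance the collection actually uses depends on how it is configured. `cosine_similarity`, `top_k`, `toy_store`, and `toy_query` are hypothetical names.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored ids most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings standing in for the two resume pages.
toy_store = {
    "doc_0": [0.9, 0.1, 0.0],
    "doc_1": [0.1, 0.9, 0.2],
}
toy_query = [0.8, 0.2, 0.1]

ranked = top_k(toy_query, toy_store, k=2)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

With only two pages in the collection, every query returns both of them, so the ranking mainly matters once the resume is split into smaller chunks or several resumes are indexed together.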

## PROGRAM APPLICATION

### Create initial AI feedback based on the given resume

The project starts with an initial screening step: the model assesses the applicant's qualifications from the retrieved resume excerpts and produces feedback that supports the employer's decision making.

> Example: <br>
> Assume the applicant is applying for a **developer** role:

In [16]:
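def ai_initial_feedback(collection, applicant_question=None, n_results=4):
    if applicant_question is None:
        applicant_question = "Do you think this applicant is a good choice for a developer role in our company?"

    # Embed the question with the "retrieval_query" task type instead of
    # "retrieval_document". Set embed_fn.document_mode back to True before
    # adding more documents to the collection.
    embed_fn.document_mode = False

    results = collection.query(
        query_texts=[applicant_question],
        n_results=n_results
    )

    # results['documents'][0] is a list of strings; join it so the prompt
    # contains plain text rather than a Python list repr.
    retrieved_chunks = "\n\n".join(results['documents'][0])
    query_oneline = applicant_question.replace("\n", " ")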

    prompt = f"""
        You are a Resume Analysis Assistant at **Company XYZ**, a leader in ERP solutions that enhance productivity and performance management.
        
        You specialize in evaluating resumes for **technical roles** (e.g., Software Engineers, Front-End/Back-End Developers, DevOps, Data Engineers).
        
        Your task is to review the following resume excerpts and answer the given question based solely on the content provided.
        
        **Guidelines:**
        1. Focus only on the resume excerpts. Do not use external knowledge or assumptions.
        2. Be clear, specific, and objective. Highlight relevant skills, experience, or gaps.
        3. Maintain a fair, professional, and factual tone.
        4. If there's insufficient info, clearly say so.
        5. End with a concise overall assessment of the applicant’s fit for a developer role at Company XYZ.
        
        **Resume Excerpts:**
        {retrieved_chunks}
        
        **Question:**
        {query_oneline}
    """

    answer = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt)

    return answer.text

Markdown(ai_initial_feedback(collection))

Based on the provided resume excerpts, here's an assessment of the applicant's suitability for a developer role at Company XYZ:

**Strengths:**

*   **Educational Background:** Currently pursuing a B.S. in Computer Science, which is relevant to technical roles at Company XYZ.
*   **Programming Skills:** Lists Java, Python, and C# as languages, which are useful for a developer role.
*   **Development Tools:** Familiarity with Visual Studio Code and Google Colab is a plus.
*   **Project Experience:** Contributed to a project that won an AI competition, demonstrating practical application of skills.
*   **Certifications:** Lists certifications in Python, C#, and SQL/PostgreSQL.
*   **Awards:** Awarded 1st place in DICT's AI competition. Consistent President's Lister. Successfully passed the Civil Service Commission Career Service Examination - Professional Level.
*   **Summary:** Shows that he is motivated and team-oriented, which is great.

**Weaknesses:**

*   **Lack of Professional Experience:** The resume focuses on academic achievements and projects. There is no mention of professional experience as a developer.
*   **Unclear Skill Levels:** The resume lists skills without indicating proficiency levels (e.g., beginner, intermediate, expert).
*   **Future Dated Certifications:** Certifications are future dated (2025), so he is not yet certified.

**Overall Assessment:**

The applicant is a Computer Science student with some relevant skills and project experience. The listed languages (Java, Python, C#) and tools (Visual Studio Code, Google Colab) align with potential needs at Company XYZ. The AI project win suggests problem-solving abilities. However, the absence of professional development experience is a significant gap.
He does look like a good candidate for an entry level position at Company XYZ.
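
Because `results['documents'][0]` is a list of strings, how the excerpts are rendered into the prompt affects what the model sees. The sketch below shows one possible way to number the excerpts and strip the template's indentation before sending it. It is an illustrative variation, not the notebook's prompt: `format_excerpts` and `build_prompt` are hypothetical helpers, and the template is shortened to the two variable sections.

```python
from textwrap import dedent
from typing import List


def format_excerpts(chunks: List[str]) -> str:
    """Render retrieved chunks as numbered, blank-line-separated excerpts."""
    return "\n\n".join(
        f"[Excerpt {i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )


def build_prompt(chunks: List[str], question: str) -> str:
    """Fill a shortened prompt template with the excerpts and a one-line question."""
    # Dedent the static template first, then substitute, so multi-line
    # excerpts cannot interfere with the indentation stripping.
    template = dedent("""\
        **Resume Excerpts:**
        {excerpts}

        **Question:**
        {question}
        """)
    return template.format(
        excerpts=format_excerpts(chunks),
        question=question.replace("\n", " "),
    )


# Made-up chunks and question for demonstration.
demo_prompt = build_prompt(
    ["SUMMARY Motivated student.", "SKILLS Python, SQL"],
    "Is this applicant\na fit?",
)
print(demo_prompt)
```

Numbered excerpts also make it easier to ask the model to cite which excerpt supports each point in its assessment.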


#### Other Queries


The same function can answer more specific follow-up questions about the applicant.

In [20]:
query1 = "What are the applicant's contact details?"
Markdown(ai_initial_feedback(collection, query1))

Based on the provided resume excerpts, the applicant's contact information is:

*   **Email:** kennroeb@gmail.com
*   **Phone:** 09754467109
*   **GitHub:** github.com/Nooblese
