<a href="https://colab.research.google.com/github/dashang/HireFlow-RAG-LangChain/blob/main/%5B23rd_Nov_2025%5D_%5BProject%5D_HireFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data:**

[Resume dataset](https://drive.google.com/file/d/1UBNrMu6bwpRtkCd-M64UqGJFH5i5rbzJ/view?usp=drive_link) (PDF) parsed into structured text + metadata

[Job Description](https://drive.google.com/file/d/1DMKzvpTYDmFx67tjEB_PQ2BJ_R5neT38/view?usp=drive_link) from various domains for testing semantic matching

Note:
All data in this dataset are synthetically generated for research and testing purposes. Any similarity to actual persons, companies, or events is entirely coincidental and unintended.

# **HireFlow - Intelligent Candidate Search and Evaluation**


**[Problem Statement]**

**Introduction**

In modern recruitment, finding the right talent quickly is a decisive factor in organizational success. While traditional Applicant Tracking Systems (ATS) rely on keyword-based filtering, they often fail to capture the deeper semantic alignment between a candidateâ€™s experience and the role requirements. This results in missed opportunities, irrelevant shortlists, and longer hiring cycles.

HireFlow addresses this by combining semantic search with Generative AI to intelligently match candidates to job descriptions (JDs). By understanding the meaning and context of both resumes and job postings, the system ranks candidates based on confidence scoring and provides human-readable insights into their strengths, weaknesses, and overall fit.

Designed as a modular, scalable, and explainable system, HireFlow enables recruiters to go beyond keyword matches and make faster, data-driven hiring decisions.


**Steps**

**Step 1 :** Setup the environment.

**Step 2 :** Reading from 1 resume  for trial then add multiple

**Step 3 :**
Ingesting of resumes
Vector store
Retrieval
Re-ranking

**Step 4 :** Making sure this all works together.

Recruiter Asks HireFlow system :
jd_text = "Looking for a Senior Python Developer with Cloud experience."
Should recommend multiple candidates from it.


# CODE

In [56]:
# Libraries

!pip install faiss-cpu
!pip install rank_bm25
!pip install langchain_google_genai


Collecting langchain-core<2.0.0,>=1.0.5 (from langchain_google_genai)
  Using cached langchain_core-1.1.0-py3-none-any.whl.metadata (3.6 kB)
Using cached langchain_core-1.1.0-py3-none-any.whl (473 kB)
Installing collected packages: langchain-core
  Attempting uninstall: langchain-core
    Found existing installation: langchain-core 0.3.80
    Uninstalling langchain-core-0.3.80:
      Successfully uninstalled langchain-core-0.3.80
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-openai 0.3.35 requires langchain-core<1.0.0,>=0.3.78, but you have langchain-core 1.1.0 which is incompatible.
langchain 0.3.27 requires langchain-core<1.0.0,>=0.3.72, but you have langchain-core 1.1.0 which is incompatible.[0m[31m
[0mSuccessfully installed langchain-core-1.1.0


In [55]:
!pip install ragas

Collecting langchain-core (from ragas)
  Using cached langchain_core-0.3.80-py3-none-any.whl.metadata (3.2 kB)
Using cached langchain_core-0.3.80-py3-none-any.whl (450 kB)
Installing collected packages: langchain-core
  Attempting uninstall: langchain-core
    Found existing installation: langchain-core 1.1.0
    Uninstalling langchain-core-1.1.0:
      Successfully uninstalled langchain-core-1.1.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-google-genai 3.1.0 requires langchain-core<2.0.0,>=1.0.5, but you have langchain-core 0.3.80 which is incompatible.[0m[31m
[0mSuccessfully installed langchain-core-0.3.80


In [57]:
import os
from typing import List

In [58]:
!pip install llama-index



In [59]:
import os
from typing import List
from langchain_community.document_loaders import PyPDFLoader

In [60]:
# 1) Load your documents
from langchain_core.documents import Document as LCDocument # Alias for clarity

resume_collection_path = "/content/ResumeCollection"
docs = []

if not os.path.exists(resume_collection_path):
    print(f"The folder '{resume_collection_path}' does not exist.")
else:
    for filename in os.listdir(resume_collection_path):
        if filename.endswith(".pdf"):
            file_path = os.path.join(resume_collection_path, filename)
            loader = PyPDFLoader(file_path)
            loaded_docs = loader.load()
            for doc in loaded_docs:
                # Ensure each document has metadata like file_name
                doc.metadata["file_name"] = filename
                docs.append(doc)

if not docs:
    print("No PDF documents found in 'ResumeCollection' Directory.")
else:
    print(f"Successfully loaded {len(docs)} documents.")

Successfully loaded 5 documents.


In [61]:
docs

[Document(metadata={'producer': 'ReportLab PDF Library - www.reportlab.com', 'creator': '(unspecified)', 'creationdate': '2025-09-06T20:08:17+05:00', 'author': '(anonymous)', 'keywords': '', 'moddate': '2025-09-06T20:08:17+05:00', 'subject': '(unspecified)', 'title': '(anonymous)', 'trapped': '/False', 'source': '/content/ResumeCollection/Angela_Lewis_Resume_09.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1', 'file_name': 'Angela_Lewis_Resume_09.pdf'}, page_content='ANGELA LEWIS\nVP\nContact Information:\nEmail: angela.lewis@email.com\nPhone: (237) 754-2918\nLocation: San Antonio, TX\nLinkedIn: linkedin.com/in/angelalewis\nPROFESSIONAL SUMMARY\nSenior accounting professional with 10+ years of progressive experience in financial management,\nteam leadership, and strategic planning. Proven track record of driving process improvements,\nreducing costs, and leading successful financial initiatives. Strong expertise in complex accounting\nprinciples and regulatory compliance.\nTECHNICA

In [62]:
print(len(docs))

5


In [63]:
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from google.colab import userdata

# Access the API key from Colab secrets
GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')

# LLM
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    api_key=GEMINI_API_KEY,
    temperature=0.2
)

#Config Embeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004", google_api_key= GEMINI_API_KEY)

In [64]:
from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document as LCDocument # Alias to avoid name collision

# Build Retrievers

# Dense Search (FAISS)
vectorstore = FAISS.from_documents(docs, embeddings)
faiss_retriever = vectorstore.as_retriever(k=2)

# Sparse Search (BM25)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

In [65]:
from langchain_core.prompts import ChatPromptTemplate

In [66]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a Expert HR Specialist. As per question filter the resume and answer from the context. If something asked is out of resume or context then mention 'Out of Context'. Answer ONLY using the provided CONTEXT in professional and polite way."),
    ("system", "CONTEXT:\n{context}"),
    ("human", "{question}")
])

In [67]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document

In [68]:
# Chain :
chain = prompt | llm | StrOutputParser()

In [69]:
# # 4. Manual Ensemble Retriever Implementation
# def ensemble_retrieve(query: str, retrievers: List, weights: List[float] = None, k: int = 2) -> List[Document]:
#     """
#     Combines results from multiple retrievers using weighted scoring.

#     Args:
#         query: The search query
#         retrievers: List of retriever objects
#         weights: Weight for each retriever (default: equal weights)
#         k: Number of top documents to return
#     """
#     if weights is None:
#         weights = [1.0 / len(retrievers)] * len(retrievers)

#     # Get results from all retrievers
#     all_docs = []
#     doc_scores = {}

#     for retriever, weight in zip(retrievers, weights):
#         docs = retriever.invoke(query)

#         # Assign scores (higher position = higher score)
#         for idx, doc in enumerate(docs):
#             score = (len(docs) - idx) * weight
#             doc_id = doc.page_content  # Use content as identifier

#             if doc_id in doc_scores:
#                 doc_scores[doc_id]["score"] += score
#             else:
#                 doc_scores[doc_id] = {"doc": doc, "score": score}

#     # Sort by score and return top k
#     sorted_docs = sorted(doc_scores.values(), key=lambda x: x["score"], reverse=True)
#     return [item["doc"] for item in sorted_docs[:k]]

# 5. Context Formatter & Prompt
def format_context(docs):
    return "\n\n".join([f"[S{i+1}] {d.page_content}" for i, d in enumerate(docs)])


In [70]:
!pip install --upgrade langchain-core



In [77]:
from langchain.retrievers import EnsembleRetriever


q = "Who is applicable for director role?"
# Hybrid :
ctx_docs = EnsembleRetriever(retrievers=[faiss_retriever, bm25_retriever], weights=[0.5, 0.5])


In [78]:
ctx_docs

EnsembleRetriever(retrievers=[VectorStoreRetriever(tags=['FAISS', 'GoogleGenerativeAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x7f5d16e16900>, search_kwargs={}), BM25Retriever(vectorizer=<rank_bm25.BM25Okapi object at 0x7f5e055ae510>, k=2)], weights=[0.5, 0.5])

In [79]:
retrieved_docs = ctx_docs.invoke(q)
content = "\n\n".join([doc.page_content for doc in retrieved_docs])

print("Content successfully stored in 'content' variable. First 500 characters:")
print(content[:500])

Content successfully stored in 'content' variable. First 500 characters:
PROFESSIONAL EXPERIENCE
Financial Planning Analyst | Professional Accounting Partners | 2021 - Present
 Led financial reporting and analysis for $50M+ revenue company
 Managed month-end closing process and prepared financial statements
 Collaborated with cross-functional teams on strategic initiatives
 Implemented process improvements resulting in 20% efficiency gains
 Mentored junior staff and provided training on new procedures
Fixed Asset Accountant | Premier Financial Advisors | 2020 - 


In [80]:
print(format_context(retrieved_docs))

[S1] PROFESSIONAL EXPERIENCE
Financial Planning Analyst | Professional Accounting Partners | 2021 - Present
 Led financial reporting and analysis for $50M+ revenue company
 Managed month-end closing process and prepared financial statements
 Collaborated with cross-functional teams on strategic initiatives
 Implemented process improvements resulting in 20% efficiency gains
 Mentored junior staff and provided training on new procedures
Fixed Asset Accountant | Premier Financial Advisors | 2020 - 2021
 Prepared monthly financial reports and variance analysis
 Processed journal entries and maintained general ledger
 Assisted with budget planning and forecasting processes
 Reconciled bank statements and credit card accounts
 Supported audit preparation and documentation
Accounting Manager | Excellence in Finance | 2019 - 2021
 Prepared monthly financial reports and variance analysis
 Processed journal entries and maintained general ledger
 Assisted with budget planning and for

In [81]:
answer = chain.invoke({"context": content, "question": q})
print(answer)

Based on the provided context:

Angela Lewis is applicable for a director role. Her current title is VP, and her professional summary highlights "10+ years of progressive experience in financial management, team leadership, and strategic planning."


In [83]:
# Question no -2

q = "Who is good in power BI?"   # Question based on skill
# Hybrid :
ctx_docs = EnsembleRetriever(retrievers=[faiss_retriever, bm25_retriever], weights=[0.5, 0.5])

retrieved_docs = ctx_docs.invoke(q)
content = "\n\n".join([doc.page_content for doc in retrieved_docs])
answer = chain.invoke({"context": content, "question": q})
print(answer)

Based on the provided resumes, Lisa Campbell is proficient in Power BI.


In [85]:
# Question no -3

q = "Who has worked in Techflow solution?"   # Question based on Company work history
# Hybrid :
ctx_docs = EnsembleRetriever(retrievers=[faiss_retriever, bm25_retriever], weights=[0.5, 0.5])

retrieved_docs = ctx_docs.invoke(q)
content = "\n\n".join([doc.page_content for doc in retrieved_docs])
answer = chain.invoke({"context": content, "question": q})
print(answer)

Based on the provided context, Lisa Campbell worked as an Internal Auditor at TechFlow Solutions from 2020 to 2021.


**Chat Engine**

In [86]:
# from llama_index.core.memory import ChatMemoryBuffer
# from llama_index.core.chat_engine import ContextChatEngine

In [None]:
# memory = ChatMemoryBuffer(token_limit=2000)

# # 4) Create chat engine with memory
# chat_engine = index.as_chat_engine(
#     chat_mode="context",
#     memory=memory,
#     llm=llm
# )