# Job Candidate Semantic Matching - Embedding Pipeline

This notebook demonstrates the end-to-end pipeline for creating semantic embeddings of job descriptions and candidate profiles, storing them in a vector database (Qdrant), and performing similarity searches.

In [1]:
!pip install python-dotenv



## 1. Install Required Dependencies

Install the packages needed to read data from MongoDB into pandas. The embedding and Qdrant clients are installed in their own sections further down.

In [4]:
!pip install pymongo pandas



## 2. Load Environment Variables

Load environment variables from a `.env` file containing the MongoDB URI and API keys.

In [2]:
%load_ext dotenv
%dotenv
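
A quick check after loading can catch a missing key before any network call is made. The sketch below assumes the two variable names used later in this notebook (`MONGODB_URI` and `GEMINI_API_KEY`); the helper `missing_env_vars` is hypothetical and not part of python-dotenv.

```python
import os
from collections.abc import Mapping


def missing_env_vars(names: list[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


# Hypothetical usage: report missing settings up front
missing = missing_env_vars(["MONGODB_URI", "GEMINI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing the environment as a parameter keeps the helper easy to try out with a plain dictionary.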

## 3. Connect to MongoDB and Load Data

Establish a connection to the MongoDB database and load job descriptions and candidate profiles into pandas DataFrames.

In [3]:
import os

import pandas as pd
from pymongo import MongoClient

# Connect using the URI loaded from the .env file
client = MongoClient(os.getenv("MONGODB_URI"))
db = client["job_matching_db"]

print("My Collections :", db.list_collection_names())


My Collections : ['candidates', 'jobs']


In [4]:
jobs_collection = db['jobs']
jobs_all = list(jobs_collection.find())  
jobs_df = pd.DataFrame(jobs_all)
jobs_df.head() 

| | _id | title | required_skills | experience_required | location | description | search_text |
|---|---|---|---|---|---|---|---|
| 0 | job_001 | Backend Developer | [Node.js, MongoDB, REST] | 2 | Remote | Backend developer needed to build REST APIs us... | Backend Developer\nRemote\nNode.js MongoDB RES... |
| 1 | job_002 | Frontend Developer | [React, JavaScript, CSS] | 1 | Hybrid | Frontend developer to build responsive user in... | Frontend Developer\nHybrid\nReact JavaScript C... |
| 2 | job_003 | Full Stack Engineer | [Node.js, React, PostgreSQL] | 3 | Remote | Looking for a full stack engineer comfortable ... | Full Stack Engineer\nRemote\nNode.js React Pos... |
| 3 | job_004 | Data Analyst | [Python, SQL, Data Visualization] | 2 | Onsite | Analyze business data and create dashboards us... | Data Analyst\nOnsite\nPython SQL Data Visualiz... |
| 4 | job_005 | Machine Learning Engineer | [Python, Machine Learning, Pandas] | 3 | Remote | Build and deploy machine learning models for r... | Machine Learning Engineer\nRemote\nPython Mach... |


In [5]:
jobs_df.shape

(25, 7)
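
The `search_text` column is the field that gets embedded later. Judging only from the preview above, it appears to join the title, location, and skills; how it was actually produced is not shown in this notebook. The following is a hypothetical sketch of one way such a field could be built. The function name `build_job_search_text` and the field order are assumptions.

```python
def build_job_search_text(job: dict) -> str:
    """Join the main job fields into one newline-separated string (assumed layout)."""
    parts = [
        job.get("title", ""),
        job.get("location", ""),
        " ".join(job.get("required_skills", [])),
        job.get("description", ""),
    ]
    # Skip empty fields so the text contains no blank lines
    return "\n".join(part for part in parts if part)


# Example document using the field values shown in the first row above
example_job = {
    "title": "Backend Developer",
    "location": "Remote",
    "required_skills": ["Node.js", "MongoDB", "REST"],
}
print(build_job_search_text(example_job))
```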

### Load Candidates Data

In [6]:
candidates_collection = db.candidates
candidates_all = candidates_collection.find()
candidates_df = pd.DataFrame(candidates_all)
candidates_df.head(6)

| | _id | name | title | experience_years | skills | education | summary | search_text |
|---|---|---|---|---|---|---|---|---|
| 0 | cand_001 | Ahmed Ben Ali | Backend Developer | 3 | [Node.js, MongoDB, REST] | Computer Science | Backend developer with experience building API... | Ahmed Ben Ali Backend Developer Node.js MongoD... |
| 1 | cand_002 | Sara Trabelsi | Frontend Developer | 2 | [React, JavaScript, CSS] | Software Engineering | Frontend developer focused on responsive and a... | Sara Trabelsi Frontend Developer React JavaScr... |
| 2 | cand_003 | Youssef Kacem | Full Stack Developer | 4 | [React, Node.js, PostgreSQL] | Computer Engineering | Full stack engineer experienced in building sc... | Youssef Kacem Full Stack Developer React Node.... |
| 3 | cand_004 | Amira Zribi | Data Analyst | 2 | [Python, SQL, Power BI] | Data Science | Data analyst skilled in extracting insights fr... | Amira Zribi Data Analyst Python SQL Power BI ... |
| 4 | cand_005 | Mohamed Ghali | Machine Learning Engineer | 3 | [Python, Machine Learning, Scikit-learn] | AI Engineering | ML engineer with experience building predictiv... | Mohamed Ghali Machine Learning Engineer Python... |
| 5 | cand_006 | Rania Ben Salah | DevOps Engineer | 4 | [Docker, CI/CD, Linux] | Computer Science | DevOps engineer specializing in automation and... | Rania Ben Salah DevOps Engineer Docker CI/CD L... |


In [7]:
candidates_df.shape


(20, 8)

## 4. Setup Embedding Function

Configure Google Generative AI (Gemini) for creating text embeddings using the text-embedding-004 model.

In [13]:
!pip install google.generativeai

Collecting google.generativeai
  Using cached google_generativeai-0.8.6-py3-none-any.whl.metadata (3.9 kB)
Collecting google-ai-generativelanguage==0.6.15 (from google.generativeai)
  Using cached google_ai_generativelanguage-0.6.15-py3-none-any.whl.metadata (5.7 kB)
Collecting google-api-core (from google.generativeai)
  Using cached google_api_core-2.29.0-py3-none-any.whl.metadata (3.3 kB)
Collecting google-api-python-client (from google.generativeai)
  Using cached google_api_python_client-2.188.0-py3-none-any.whl.metadata (7.0 kB)
Collecting google-auth>=2.15.0 (from google.generativeai)
  Using cached google_auth-2.47.0-py3-none-any.whl.metadata (6.4 kB)
  Downloading google_auth-2.48.0rc0-py3-none-any.whl.metadata (6.0 kB)
Collecting proto-plus<2.0.0dev,>=1.22.3 (from google-ai-generativelanguage==0.6.15->google.generativeai)
  Using cached proto_plus-1.27.0-py3-none-any.whl.metadata (2.2 kB)
Collecting googleapis-common-protos<2.0.0,>=1.56.2 (from google-api-core->google.generat

In [16]:
import google.generativeai as genai

# Configure the API key once instead of on every call
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))


def embed_text(text: str) -> list[float]:
    """Return the text-embedding-004 vector for `text`."""
    response = genai.embed_content(
        model="models/text-embedding-004",
        content=text,
    )
    return response["embedding"]
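
Because `embed_text` makes one remote call per document, a transient API error partway through a loop would abort the whole run. A minimal sketch of a retry wrapper with exponential backoff follows. The name `embed_with_retry` and its parameters are hypothetical, and catching every `Exception` is a simplification; in practice you would likely narrow it to the errors the API client raises.

```python
import time
from typing import Callable


def embed_with_retry(
    embed_fn: Callable[[str], list[float]],
    text: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> list[float]:
    """Call `embed_fn(text)`, retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return embed_fn(text)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

In this notebook it could be called as `embed_with_retry(embed_text, job["search_text"])`.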


## 5. Setup Qdrant Vector Database

Connect to a Qdrant vector database running on localhost to store and query embeddings.

In [9]:
!pip install qdrant_client


Collecting qdrant_client
  Using cached qdrant_client-1.16.2-py3-none-any.whl.metadata (11 kB)
Collecting grpcio>=1.41.0 (from qdrant_client)
  Using cached grpcio-1.76.0-cp313-cp313-win_amd64.whl.metadata (3.8 kB)
Collecting portalocker<4.0,>=2.7.0 (from qdrant_client)
  Using cached portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting h2<5,>=3 (from httpx[http2]>=0.20.0->qdrant_client)
  Using cached h2-4.3.0-py3-none-any.whl.metadata (5.1 kB)
Collecting hyperframe<7,>=6.1 (from h2<5,>=3->httpx[http2]>=0.20.0->qdrant_client)
  Using cached hyperframe-6.1.0-py3-none-any.whl.metadata (4.3 kB)
Collecting hpack<5,>=4.1 (from h2<5,>=3->httpx[http2]>=0.20.0->qdrant_client)
  Using cached hpack-4.1.0-py3-none-any.whl.metadata (4.6 kB)
Using cached qdrant_client-1.16.2-py3-none-any.whl (377 kB)
Using cached portalocker-3.2.0-py3-none-any.whl (22 kB)
Using cached grpcio-1.76.0-cp313-cp313-win_amd64.whl (4.7 MB)
Using cached h2-4.3.0-py3-none-any.whl (61 kB)
Using cached hpack-4.1.0

In [28]:
from qdrant_client import QdrantClient, models

# Qdrant running in Docker on localhost (REST port).
# Note: this rebinds `client`, which referred to the MongoClient until now.
client = QdrantClient("http://localhost:6333")

# Quick smoke test
print(client.get_collections())

collections=[CollectionDescription(name='jobs'), CollectionDescription(name='candidates')]
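
The output shows that both collections already exist; their creation is not part of this notebook. A hedged sketch of how they might be created if missing is given below. The helper `ensure_collection` is hypothetical. The vector size of 768 is what I would expect from text-embedding-004 and is worth confirming with `len(embed_text("test"))`, and cosine distance is assumed because the search section describes the scores as cosine similarity.

```python
def ensure_collection(client, name: str, vectors_config) -> bool:
    """Create collection `name` if it does not exist; return True if it was created."""
    if client.collection_exists(collection_name=name):
        return False
    client.create_collection(collection_name=name, vectors_config=vectors_config)
    return True


# Hypothetical usage with the objects defined in this notebook
# (left commented out because it needs a running Qdrant instance):
#
# vectors_config = models.VectorParams(size=768, distance=models.Distance.COSINE)
# for name in ("jobs", "candidates"):
#     ensure_collection(client, name, vectors_config)
```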


## 6. Generate and Store Job Embeddings

Create embeddings for all job descriptions and store them in the Qdrant "jobs" collection.

In [33]:
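job_records = []
for index, job in jobs_df.iterrows():
    # Embed the pre-built search text for this job
    text_embedding = embed_text(job["search_text"])

    job_records.append(models.PointStruct(
        id=int(index),  # the DataFrame row index doubles as the Qdrant point ID
        payload={
            "title": job["title"],
            "mongodb_id": job["_id"],  # link back to the MongoDB document
        },
        vector=text_embedding,
    ))

# Write all points to the "jobs" collection in one request
client.upsert(
    collection_name="jobs",
    points=job_records,
)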
    

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)
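
With 25 jobs a single `upsert` call is fine. For a much larger dataset it may be preferable to send points in batches. A small sketch of a chunking helper is shown here; `chunked` and the batch size of 100 are illustrative choices, not something the notebook or Qdrant prescribes.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical usage with the notebook's objects (needs a running Qdrant instance):
#
# for batch in chunked(job_records, 100):
#     client.upsert(collection_name="jobs", points=batch)
```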

## 7. Generate and Store Candidate Embeddings

Create embeddings for all candidate profiles and store them in the Qdrant "candidates" collection.

In [36]:
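candidates_records = []
for index, candidate in candidates_df.iterrows():
    # Embed the pre-built search text for this candidate
    text_embedding = embed_text(candidate["search_text"])

    candidates_records.append(models.PointStruct(
        id=int(index),  # the DataFrame row index doubles as the Qdrant point ID
        payload={
            "name": candidate["name"],
            "title": candidate["title"],
            "mongodb_id": candidate["_id"],  # link back to the MongoDB document
        },
        vector=text_embedding,
    ))

# Write all points to the "candidates" collection in one request
client.upsert(
    collection_name="candidates",
    points=candidates_records,
)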

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)
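
Both loops use the DataFrame row index as the point ID, so IDs depend on the order in which documents come back from MongoDB. One possible alternative, sketched below, derives a deterministic UUID from the MongoDB `_id` instead; Qdrant accepts UUID strings as point IDs. The helper `point_id_for` and the choice of namespace are assumptions made for illustration.

```python
import uuid

# Any fixed namespace works; it only has to stay the same between runs
POINT_NAMESPACE = uuid.NAMESPACE_URL


def point_id_for(mongodb_id: str) -> str:
    """Return a stable UUID string derived from a MongoDB document ID."""
    return str(uuid.uuid5(POINT_NAMESPACE, mongodb_id))


# The same input always maps to the same point ID
print(point_id_for("cand_001") == point_id_for("cand_001"))
```

With IDs like these, re-running the upsert after the source data is reordered or extended would overwrite the matching points rather than reassigning them.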

## 8. Test Semantic Search

Retrieve a candidate's vector and find the top 5 most similar jobs based on cosine similarity.

In [40]:
# Fetch the stored vector for the first candidate (point ID 0)
candidate_point = client.retrieve(
    collection_name="candidates",
    ids=[0],
    with_vectors=True,
    with_payload=False,
)
candidate_vector = candidate_point[0].vector

In [50]:
search_result = client.query_points(
    collection_name="jobs",
    query=candidate_vector,                   
    limit=5,
    with_payload=True,
)
print(search_result.points[0])

id=0 version=1 score=0.74189216 payload={'title': 'Backend Developer', 'mongodb_id': 'job_001'} vector=None shard_key=None order_value=None
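
For reference, the cosine similarity mentioned in this section can be computed by hand. The sketch below is plain Python and independent of Qdrant. Whether the reported `score` equals this value depends on the collection being configured with cosine distance, which this notebook does not show.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Orthogonal vectors have similarity 0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Applying it to `candidate_vector` and a job vector retrieved with `with_vectors=True` would be one way to cross-check a score.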


### Display Matching Results

Show the job descriptions and similarity scores for the top matching jobs.

In [54]:
for point in search_result.points:
    mongo_id = point.payload["mongodb_id"]
    print("job description: ", jobs_df.loc[jobs_df["_id"] == mongo_id]["description"], " | score:", point.score)

job description:  0    Backend developer needed to build REST APIs us...
Name: description, dtype: object  | score: 0.74189216
job description:  2    Looking for a full stack engineer comfortable ...
Name: description, dtype: object  | score: 0.6736795
job description:  10    Develop backend services using Java and Spring...
Name: description, dtype: object  | score: 0.62713313
job description:  1    Frontend developer to build responsive user in...
Name: description, dtype: object  | score: 0.58442724
job description:  13    Develop automation scripts and backend service...
Name: description, dtype: object  | score: 0.56576777
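
The loop above prints a whole pandas Series for each match, which is why every result carries a `Name: description, dtype: object` line. A possible alternative is to look descriptions up in a dictionary keyed by `_id` and print one line per match. The sketch below uses hypothetical stand-in data: the descriptions are the truncated previews shown earlier, not the full text, and `describe_matches` is an illustrative name.

```python
def describe_matches(matches: list[tuple[str, float]], jobs_by_id: dict[str, dict]) -> list[str]:
    """Format each (mongodb_id, score) pair as 'score | id | description'."""
    lines = []
    for mongo_id, score in matches:
        description = jobs_by_id[mongo_id]["description"]
        lines.append(f"{score:.3f} | {mongo_id} | {description}")
    return lines


# Stand-in data; descriptions are truncated previews, not the stored text
jobs_by_id = {
    "job_001": {"description": "Backend developer needed to build REST APIs us..."},
    "job_003": {"description": "Looking for a full stack engineer comfortable ..."},
}
matches = [("job_001", 0.74189216), ("job_003", 0.6736795)]

for line in describe_matches(matches, jobs_by_id):
    print(line)
```

With the notebook's own objects, the inputs could be built as `{job["_id"]: job for job in jobs_all}` and `[(p.payload["mongodb_id"], p.score) for p in search_result.points]`.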
