# Applying BERT for Query-Based Result Search

### Step1 : Import necessary libraries

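To generate embeddings, we begin by importing the libraries used throughout the notebook: numpy and pandas for data handling, torch and transformers for the BERT model, sklearn.metrics.pairwise for cosine similarity, and nltk.tokenize, nltk.corpus, and nltk.stem for text preprocessing. The string and warnings modules come from the Python standard library. Note that word_tokenize, stopwords, and WordNetLemmatizer rely on NLTK data packages (such as punkt, stopwords, and wordnet) that must be downloaded once with nltk.download before the later cells will run.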


In [1]:
import numpy as np
import pandas as pd
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import warnings
warnings.filterwarnings('ignore')

### Step2 : Load and visualize the dataset

In [2]:
# Load dataset from CSV file
df = pd.read_csv("/usr/local/datasetsDir/text-dataset/job_title_des.csv")

# Keep only the first twenty job descriptions (copy so later column assignments act on an independent frame)
df = df.head(20).copy()

In [3]:
df

Unnamed: 0.1,Unnamed: 0,Job Title,Job Description
0,0,Flutter Developer,We are looking for hire experts flutter develo...
1,1,Django Developer,PYTHON/DJANGO (Developer/Lead) - Job Code(PDJ ...
2,2,Machine Learning,"Data Scientist (Contractor)\n\nBangalore, IN\n..."
3,3,iOS Developer,JOB DESCRIPTION:\n\nStrong framework outside o...
4,4,Full Stack Developer,job responsibility full stack engineer – react...
5,5,Java Developer,Software Developer - Integration*\nImmediate O...
6,6,Full Stack Developer,senior full stack developer \- 1800026h cwt lo...
7,7,JavaScript Developer,"Job Description:\n\nReactJS + NodeJs, Azure Fu..."
8,8,DevOps Engineer,Main Responsibilities and Deliverables:\nManag...
9,9,Software Engineer,"Overview\n\n\nBased in Silicon Valley, Tintri ..."


### Step3 : Preprocess the text data

After loading the dataset, we preprocess the job descriptions by lowercasing and tokenizing the text, removing stopwords and punctuation, and lemmatizing the remaining tokens.

In [4]:
# Preprocess text data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [5]:
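def preprocess(text):
    # Lowercase the text and split it into tokens
    tokens = word_tokenize(text.lower())
    # Drop stopwords and punctuation tokens, then lemmatize what remains
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words and token not in string.punctuation]
    return ' '.join(tokens)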

df['processed_description'] = df['Job Description'].apply(preprocess)
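
To make the effect of this step concrete, here is a minimal, dependency-free sketch of the same idea. It is not the notebook's `preprocess` function: `simple_preprocess` and `SAMPLE_STOP_WORDS` are hypothetical names, the stopword list is a tiny hand-written stand-in for NLTK's English list, a regular expression replaces `word_tokenize`, and lemmatization is left out.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; the notebook uses NLTK's full English list.
SAMPLE_STOP_WORDS = {"we", "are", "for", "a", "an", "the", "to", "in", "with", "and"}

def simple_preprocess(text):
    """Lowercase, tokenize roughly, and drop stopwords and punctuation (no lemmatization)."""
    # Rough stand-in for word_tokenize: runs of word characters, or single non-space symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    kept = [t for t in tokens if t not in SAMPLE_STOP_WORDS and t not in string.punctuation]
    return " ".join(kept)

sample = "We are looking for a Flutter developer, with experience in Dart!"
print(simple_preprocess(sample))  # looking flutter developer experience dart
```

The real function additionally lemmatizes each surviving token, so plural or inflected forms may be reduced further than in this sketch.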

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,Job Title,Job Description,processed_description
0,0,Flutter Developer,We are looking for hire experts flutter develo...,looking hire expert flutter developer eligible...
1,1,Django Developer,PYTHON/DJANGO (Developer/Lead) - Job Code(PDJ ...,python/django developer/lead job code pdj 04 s...
2,2,Machine Learning,"Data Scientist (Contractor)\n\nBangalore, IN\n...",data scientist contractor bangalore responsibi...
3,3,iOS Developer,JOB DESCRIPTION:\n\nStrong framework outside o...,job description strong framework outside io al...
4,4,Full Stack Developer,job responsibility full stack engineer – react...,job responsibility full stack engineer – react...


### Step4 : Load a pretrained BERT tokenizer and model

In [7]:
# Load pretrained BERT tokenizer and model

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


### Step5 : Define the embedding function

We write a function, generate_embedding, that produces one BERT embedding for each preprocessed job description. It tokenizes the text, feeds it to BERT, and uses the hidden state of the [CLS] token as the embedding of the whole description.

During tokenization we set the truncation parameter to True. Without it, a long description is tokenized into a sequence longer than 512 tokens, the maximum sequence length for this model. The tokenizer warns about the length, and running such a sequence through the model results in indexing errors. With truncation=True, sequences longer than the maximum length the model allows are cut down to that length.
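
The effect of truncation can be illustrated with plain Python lists. This is a simplified sketch of the outcome, not the tokenizer's actual implementation: `truncate_with_special_tokens` is a hypothetical helper, the content ids are made up, and 101 and 102 are the [CLS] and [SEP] ids in the bert-base-uncased vocabulary.

```python
CLS_ID, SEP_ID = 101, 102   # special-token ids in the bert-base-uncased vocabulary
MAX_LEN = 512               # maximum sequence length of the model

def truncate_with_special_tokens(content_ids, max_len=MAX_LEN):
    """Keep at most max_len - 2 content ids, then wrap them in [CLS] ... [SEP]."""
    kept = content_ids[: max_len - 2]
    return [CLS_ID] + kept + [SEP_ID]

long_description = list(range(1000, 1700))   # 700 made-up token ids
ids = truncate_with_special_tokens(long_description)
print(len(ids))   # 512
```

Everything past the cutoff is discarded, so with [CLS]-based embeddings only the beginning of a very long job description influences the result.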
   

In [8]:
# Function to generate BERT embedding for a text using [CLS] token

def generate_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # Extract [CLS] token embedding
    return cls_embedding.numpy()
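
The slicing expression `[:, 0, :]` may be easier to follow on a small stand-in array. The sketch below uses NumPy with random values in place of real BERT activations; the sequence length of 6 is arbitrary, and 768 is the hidden size of bert-base-uncased.

```python
import numpy as np

# Stand-in for outputs.last_hidden_state: (batch, sequence_length, hidden_size).
# Random values replace real BERT activations.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(1, 6, 768))

# Position 0 of every sequence holds the [CLS] token, so this keeps one vector per input text.
cls_embedding = last_hidden_state[:, 0, :]
print(cls_embedding.shape)   # (1, 768)
```

Because the batch dimension is kept, the result is a 2-D array with a single row, which is the shape the similarity step expects.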

### Step6 : Define the similarity function

We define a function, semantic_search, that performs a semantic search over the job descriptions using BERT embeddings. It computes the cosine similarity between the query embedding and the embedding of each job description, then returns the top_n most similar jobs.

The cosine similarity value is read at index [0][0] because both query_embedding and desc_embedding have shape [1, hidden_size]. cosine_similarity therefore returns a similarity matrix of shape [1, 1] that contains a single score, and [0][0] retrieves that value.
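
The shape argument can be checked with a small NumPy sketch. `cosine_similarity_matrix` is a hypothetical helper written for illustration; it computes pairwise cosine similarity between rows, which is intended to mirror what sklearn's cosine_similarity returns for inputs of these shapes. The 3-dimensional vectors are toy values standing in for 768-dimensional embeddings.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

query_embedding = np.array([[1.0, 2.0, 2.0]])   # shape (1, 3)
desc_embedding = np.array([[2.0, 4.0, 4.0]])    # shape (1, 3), same direction as the query

scores = cosine_similarity_matrix(query_embedding, desc_embedding)
print(scores.shape)    # (1, 1)
print(scores[0][0])    # close to 1.0, since the two vectors point the same way
```

With one row on each side the matrix has exactly one entry, which is why indexing with `[0][0]` yields a plain scalar score.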


In [9]:
# Function for semantic search

def semantic_search(query, top_n=3):
    query_embedding = generate_embedding(query)
    similarities = []
    for idx, row in df.iterrows():
        desc_embedding = generate_embedding(row['processed_description'])
        similarity = cosine_similarity(query_embedding, desc_embedding)[0][0]
        similarities.append((row['Job Title'], similarity, row['Job Description']))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]
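
Note that semantic_search recomputes every description embedding on each call. One possible refinement, sketched here under assumptions, is to embed the descriptions once, stack them into a matrix, and rank all rows against the query in a single vectorized step. `rank_by_similarity` is a hypothetical helper, and the 2-D toy vectors stand in for embeddings that would, in the notebook, come from stacking the outputs of generate_embedding (for example with `np.vstack`).

```python
import numpy as np

def rank_by_similarity(query_vec, doc_matrix, top_n=3):
    """Return (row_index, cosine_similarity) pairs for the top_n most similar rows."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                              # one cosine score per document row
    order = np.argsort(scores)[::-1][:top_n]    # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-D vectors stand in for the stacked 768-dimensional description embeddings.
doc_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.1])

print(rank_by_similarity(query_vec, doc_matrix, top_n=2))
```

The returned row indices can then be used to look up the matching job titles and descriptions in df, so repeated queries only need one new embedding each.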

### Step7 : Example usage

In [10]:
# Example usage

query = "python developer with experience in web development"
results = semantic_search(query)
for title, similarity, description in results:
    print(f"Title: {title}, Similarity: {similarity:.2f}")

Title: Django Developer, Similarity: 0.86
Title: DevOps Engineer, Similarity: 0.82
Title: iOS Developer, Similarity: 0.81
