# Information Retrieval
For this feature, we use document embeddings to compute cosine distance scores between resumes and job descriptions, which lets us quantify how similar an applicant's skills are to the requirements of a job. The goal is to show how strongly several resumes match a job description. The embeddings are created with the langchain library. Unlike pure keyword matching, embeddings can also capture context.
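To make the scoring idea concrete, here is a minimal sketch of cosine distance in plain Python. The three-dimensional vectors are made up for illustration; real embeddings have far more dimensions, and the vector store may compute the score differently internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, not produced by any model)
job = [0.9, 0.1, 0.3]
resume_a = [0.8, 0.2, 0.4]
resume_b = [0.1, 0.9, 0.2]

distances = {
    "resume_a": cosine_distance(job, resume_a),
    "resume_b": cosine_distance(job, resume_b),
}

# A smaller distance means the vectors point in a more similar direction
best_match = min(distances, key=distances.get)
print(best_match, distances)
```

With this definition, a distance near 0 indicates a close match and a distance near 1 indicates unrelated content.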

First, we import the required libraries, such as langchain, which gives us access to embeddings. The functions create_resume_vectorstore and create_vectorstore clean the text data by removing ... Afterwards, the text is split into chunks, and finally the embeddings are stored in a Chroma vector store.
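The chunking step can be pictured with the following sketch. It is not the code inside create_resume_vectorstore or create_vectorstore; the function name `split_into_chunks` and the `chunk_size` and `overlap` parameters are hypothetical, and the actual helpers may split text differently.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap slightly.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 5  # 50 characters of placeholder text
chunks = split_into_chunks(sample, chunk_size=20, overlap=5)
print(len(chunks), chunks)
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, so that its context is less likely to be lost when each chunk is embedded separately.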

In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
import pandas as pd
import matplotlib.pyplot as plt

from create_vectorstore_resume import create_resume_vectorstore
from create_job_summary import create_vectorstore, make_chain

import os

# Read the OpenAI API key from the environment instead of hard-coding it
api_key = os.environ["OPENAI_API_KEY"]

To test our information retrieval system, I used the Google Jobs API to scrape SERP results from a Google Jobs search and collect open positions as examples. Here we use an open position titled "Research Scientist-NLP". As a first step, I used a prompt to summarize the technical skills required for the vacancy.

As one can see, the required skills are highly related to NLP and MLOps.

In [2]:
################### Create job vectorstore ###################
open_position="C:/Users/SEPA/lanchain_ir2/Job_data/Research Scientist - NLP.txt"

# The embeddings are persisted in the folder job_embeddings/chroma, so they can be reused later.
# Run create_vectorstore() only once; delete the folder job_embeddings/chroma before rerunning it.
# create_vectorstore(open_position)
chat_history = []
chain = make_chain()
question_job_1 = "Please summarise the technical skills and profile needed for this job? Please return the answer in a concise manner, no more than 250 words. If not found, return 'Not provided'"
response = chain({"question": question_job_1, "chat_history": chat_history})
required_skills = response['answer']
print(required_skills)


The ideal candidate for this job is someone who is passionate about collaborating with teammates, enjoys learning new skills, and is eager to share knowledge with others. They should have expertise in NLP, specifically in question-answering, summarization, dialog systems, reinforcement learning, or distributed systems. Prioritizing question-answering, the candidate should be familiar with dialog-based QA using APIs, program synthesis, and large knowledge-graphs/databases. A PhD in Computer Science or a related field (or a Masters with significant research experience) is preferred. The candidate should have published in top-tier ML/NLP conferences and be proficient in coding with PyTorch, Tensorflow, or JAX. Experience working with large, messy real-world data is also required.


The folder contains the resumes of three applicants. Using langchain and … we will find the resume that best fits the position based on context.

In [None]:
##################### Create resume vectorstore ################
create_resume_vectorstore('C:/Users/SEPA/lanchain_ir2/Resume_data_pdf/')
embedding = OpenAIEmbeddings(openai_api_key=api_key)
resume_vector_store = Chroma(
    collection_name="resume-embeddings",
    embedding_function=embedding,
    persist_directory="embeddings/chroma",
)

In [None]:
retriever = resume_vector_store.as_retriever(search_kwargs={"k": 3})
# docs = retriever.get_relevant_documents("I have one year of experience with NLP and MLOps. Moreover I have worked with AWS, Kubernetes and Docker.")
docs = retriever.get_relevant_documents(required_skills)

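# Note: Chroma configures its distance function on the collection, so whether
# `distance_metric` is honored here may depend on the installed langchain version.
docs_score = resume_vector_store.similarity_search_with_score(query=required_skills, distance_metric="cos", k=3)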

Again, the result shows that Scheppach's resume best fits the job description.

In [None]:
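from pathlib import PureWindowsPath

applicant_values = []
score_values = []
for doc, score in docs_score:
    # Use the file name without its extension as the applicant label
    # (the source paths use Windows separators)
    applicant_values.append(PureWindowsPath(doc.metadata["source"]).stem)
    score_values.append(score)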

data = pd.DataFrame({'Applicant': applicant_values, 'Score': score_values})

fig, ax = plt.subplots(figsize=(10, 10))
data.plot.bar(x='Applicant', y='Score', ax=ax)
ax.set_xlabel('Applicants')
ax.set_ylabel('Cosine distance')
plt.tight_layout()
plt.show()
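Because the vector store holds text chunks rather than whole documents, several of the top hits may come from the same resume. The sketch below shows one possible way to collapse chunk-level scores into one score per applicant by keeping the smallest distance. The `(source, distance)` pairs, file paths, and the helper name `best_score_per_applicant` are hypothetical stand-ins for the entries of docs_score, and it assumes that a lower score means a better match.

```python
from pathlib import PureWindowsPath


def best_score_per_applicant(hits):
    """Keep the smallest distance seen for each applicant, sorted best first.

    `hits` is an iterable of (source_path, distance) pairs.
    """
    best = {}
    for source, distance in hits:
        # File name without extension serves as the applicant label
        name = PureWindowsPath(source).stem
        if name not in best or distance < best[name]:
            best[name] = distance
    return sorted(best.items(), key=lambda item: item[1])


# Made-up chunk-level results: two chunks belong to the same resume
hits = [
    ("C:\\resumes\\applicant_a.pdf", 0.21),
    ("C:\\resumes\\applicant_b.pdf", 0.35),
    ("C:\\resumes\\applicant_a.pdf", 0.28),
]

ranking = best_score_per_applicant(hits)
print(ranking)
```

Taking the minimum rewards a resume for its single best-matching passage; averaging the chunk distances instead would favour resumes that match the job description consistently throughout.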