# 1.4.1. Task 1: Working with the documents 

## 1. Translation.
Find the resumes (or parts of them) that are not in English, and translate them into English using LLM.

## 2. Entities extraction.
Extract useful named entities from the resume using LLM. For example, you can extract the job title, years of experience, highest level of education, language skills, and key skills, or define any entities that you find interesting. As an additional task, you may create an Excel report that contains entities from 20-30 resumes.

## 3. Summarisation.
Make a short summary of the resume. You may choose any size you find useful. Defining the structure of the summary (adding the obligatory entities) or just getting it from LLM is up to you. The general idea is to provide an opportunity for recruiters to read it quickly and not scan 2-3 pages.

## 4. Resume scoring.
Develop a mechanism to provide a ranking of the resumes for a vacancy by providing a score (float value from 0 to 1). A particular vacancy can be found at [https://www.dataart.team/vacancies](https://www.dataart.team/vacancies) (or on LinkedIn). It should work in 2 modes: calculate the score for the provided vacancy and resume, and present the top 10 candidates for the vacancy.
a and algorithms for scoring.
ncy.

### Installations:

In [1]:
!pip install -q openai pandas langdetect PyPDF2 faiss-cpu


[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


### Imports

In [2]:
import os
import re
from pathlib import Path

import numpy as np
import pandas as pd
from PyPDF2 import PdfReader
from langdetect import detect, DetectorFactory
import openai

### Set environment variable for Open Ai client:

In [3]:
%env OPENAI_API_KEY=

env: OPENAI_API_KEY=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


### Create basic Variables, paths and set Open AI client

In [4]:
from pathlib import Path
client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
directory_path = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/test_resumes_dataset")
translated_output_directory = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/resumes_translated")
logs_directory = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs")
max_chunk_size = 3500  
overlap_size = 50  

DetectorFactory.seed = 0

### extract_text_from_pdf

In [5]:
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            text += page_text + "\n"
    return text

### Split text into chunks with overlap

In [6]:
def split_text(text, max_chunk_size, overlap_size=50):
    words = text.split()
    chunks = []
    current_chunk = ""
    for word in words:
        if len(current_chunk) + len(word) + 1 <= max_chunk_size:
            current_chunk += word + " "
        else:
            chunks.append(current_chunk)
            current_chunk = word + " "
    chunks.append(current_chunk)
    return chunks

### Function to translate text using OpenAI's API  for openai>=1.0.0

In [7]:
def translate_text(client, text, target_language="en"):
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=f"Translate the following text to {target_language}:\n\n{text}",
        max_tokens=500
    )
    return response.choices[0].text.strip()

### Function to process PDFs and save translated versions

In [8]:
def process_pdfs(directory_path, max_chunk_size, overlap_size, translated_output_directory, client):
    directory_path = Path(directory_path)
    translated_output_directory = Path(translated_output_directory)
    
    translated_output_directory.mkdir(parents=True, exist_ok=True)

    translated_files_list = []
    english_files_list = []

    for pdf_path in directory_path.glob('*.pdf'):  
        text = extract_text_from_pdf(str(pdf_path))

        if text.strip():
            if detect(text) != 'en':
                chunks = split_text(text, max_chunk_size, overlap_size)
                translated_text = ""

                for chunk in chunks:
                    if detect(chunk) != 'en':
                        chunk = translate_text(client, chunk, target_language="en")
                    translated_text += chunk + " "
                
                translated_filename = f"translated_{pdf_path.stem}.txt"
                translated_path = translated_output_directory / translated_filename
                save_text_to_file(translated_text, translated_path)

                translated_files_list.append(translated_filename)
            else:
                english_files_list.append(pdf_path.name)
        else:
            print(f"Document {pdf_path.name} is empty or contains very little text.")


    save_file_list(translated_files_list, Path(logs_directory), 'translated_files_list.txt')
    save_file_list(english_files_list, Path(logs_directory), 'english_files_list.txt')

def save_file_list(file_list, directory, filename):
    directory = Path(directory)  
    file_path = directory / filename  
    with file_path.open('w', encoding='utf-8') as f:
        for file in file_list:
            f.write(f"{file}\n")


def save_text_to_file(text, file_path):
    file_path = Path(file_path)  
    with file_path.open('w', encoding='utf-8') as f:
        f.write(text)


In [9]:
process_pdfs(directory_path, max_chunk_size, overlap_size, translated_output_directory, client)

Document 12632728.pdf is empty or contains very little text.


### Create named entities to look for in resumes

In [10]:
entities = ["job title", "years of experience", "highest level of education", "language skills", "key skills"]

### Function to extract text from TXT files with different encodings

In [11]:
def extract_text_from_txt(file_path):
    file_path = Path(file_path) 
    encodings = ['utf-8', 'latin1', 'ISO-8859-1', 'cp1252']
    for encoding in encodings:
        try:
            with file_path.open('r', encoding=encoding) as file:
                return file.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Cannot decode file {file_path} with any of the provided encodings.")


### Function to extract entities using the language model

In [19]:
def extract_entities_with_llm(client, text, entities, max_chunk_size, overlap_size):
    extracted_info = ""
    chunks = split_text(text, max_chunk_size, overlap_size) 

    for chunk in chunks:
        prompt = (
                "Extract the following entities from this text, calculating years of experience as a decimal number where months are converted to a fractional year without any additional info: "
                + ", ".join(entities)
                + ". If a value for an entity is not present or cannot be extracted, fill in with just: NaN."
                + "\n\n"
                + chunk
            )
        prompt_length = len(prompt.split())

        max_tokens_for_completion = 4097 - prompt_length
        max_tokens_for_completion = min(max_tokens_for_completion, 300)

        response = client.completions.create(
            model="gpt-3.5-turbo-instruct",
            prompt=prompt,
            max_tokens=max_tokens_for_completion,
            temperature=0.35
        )
        extracted_info += response.choices[0].text.strip() + "\n"

    return extracted_info


### Function to process resumes and extract named entities

In [20]:
def process_resume(directory, filename, client, entities, data_list, is_txt=False):
    file_path = directory / filename 
    text = extract_text_from_txt(file_path) if is_txt else extract_text_from_pdf(file_path)

    if text.strip():
        extracted_info = extract_entities_with_llm(client, text, entities, max_chunk_size, overlap_size)
        info_dict = {'Filename': filename}

        for entity in entities:
            pattern = re.compile(rf"{entity}\s*:\s*(.*)", re.IGNORECASE)
            match = pattern.search(extracted_info)
            if match:
                info_dict[entity] = match.group(1).strip()
            else:
                info_dict[entity] = None

        data_list.append(info_dict)


### Function to create a report of named entities from resumes

In [21]:
def create_entities_report(directory_path, translated_output_directory, client, entities):
    data = []
    directory_path = Path(directory_path)
    translated_output_directory = Path(translated_output_directory)

    for pdf_path in directory_path.glob('*.pdf'):
        process_resume(directory_path, pdf_path.name, client, entities, data)

    for txt_path in translated_output_directory.glob('0_translated_*.txt'):
        process_resume(translated_output_directory, txt_path.name, client, entities, data, is_txt=True)

    df = pd.DataFrame(data)
    report_path = logs_directory / '1_resumes_entities_report.xlsx'
    df.to_excel(report_path, index=False)

create_entities_report(directory_path, translated_output_directory, client, entities)

file_path = logs_directory / '1_resumes_entities_report.xlsx'

if file_path.is_file():
    df = pd.read_excel(file_path)
    df.set_index('Filename', inplace=True)
    df.sort_index(inplace=True)
    years_of_experience = df['years of experience']
else:
    print(f"Error: The file {file_path} does not exist.")

### Load the DataFrame from the Excel file

In [16]:
file_path = os.path.join(logs_directory, '1_resumes_entities_report.xlsx')
df = pd.read_excel(file_path)
df.set_index('Filename', inplace=True)
df.sort_index(inplace=True)

years_of_experience = df['years of experience']

In [17]:
df

Unnamed: 0_level_0,job title,years of experience,highest level of education,language skills,key skills
Filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
10554236.pdf,Financial Accountant,9 years,,,"Financial planning, reporting, analysis, accou..."
10818478.pdf,Retail Sales Consultant,5.5 years (September 2015 to Current),High School Diploma,,"Administrative, Cash handling, Excellent commu..."
10820510.pdf,QA / QC Manager,21 years,,,"Microsoft Word, Excel, Weld Pro, Auto-Cad, AWS..."
11257723.pdf,Claims Representative,10.75,B.S in Journalism,,"Claims file management, Litigation resolution,..."
11409460.pdf,Buyer/Planner,4.5,Bachelor of Science in Petroleum Engineering,"English, Portuguese, Spanish","Solid Works, CAD, Matlab, MS Office, Process I..."
12491898.pdf,Construction Laborer,,,,"Construction, Labor, Equal Opportunity, Traini..."
13518263.pdf,Interior Designer,26.5,Associate of Arts,,"Concept development, space planning, color and..."
13907230.pdf,General Construction Intern,0.25,Bachelor of Science in Construction Management,Bilingual and biliterate in Spanish,"Proficient technical skills in AutoCAD, Micros..."
14724186.pdf,Interior Designer,22.5,High School Diploma,,"Photoshop, art, Budget Preparation, budgets, b..."
15601399.pdf,Self-Sustaining Engineering Fabrication Techni...,20 years (10/2000 to current),Bachelor of Science in Electronic Engineering,,"Technical expertise, people skills, communicat..."


### Function to calculate numeric years of experience from text descriptions

In [18]:
file_path = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/1_resumes_entities_report.xlsx")
df = pd.read_excel(file_path)

text_descriptions = df['years of experience'].astype(str).tolist()

new_file_path = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/2_resumes_updated_years.xlsx")

def calculate_years_of_experience(client, text_descriptions):
    numeric_experience_list = []

    for text in text_descriptions:
        prompt = f"Convert the following description of work experience '{text}' into a numeric value representing total years of experience."

        try:
            response = client.completions.create(
                model="gpt-3.5-turbo-instruct",
                prompt=prompt,
                max_tokens=50,
                temperature=0.1
            )

            extracted_numbers = re.findall(r'\b\d+\.?\d*\b', response.choices[0].text.strip())
            if extracted_numbers:
                numeric_experience = float(extracted_numbers[0])
                numeric_experience_list.append(numeric_experience)
            else:
                numeric_experience_list.append(float('nan'))
        except Exception as e:
            print(f"An error occurred: {e}")
            numeric_experience_list.append(float('nan'))

    return numeric_experience_list

numeric_years_of_experience = calculate_years_of_experience(client, text_descriptions)

df['numeric_years_of_experience'] = numeric_years_of_experience

df.to_excel(new_file_path, index=False)

print(f"The updated DataFrame has been saved to {new_file_path}.")

The updated DataFrame has been saved to C:\Users\apleczkan\PycharmProjects\task1-cv-resumes\logs\2_resumes_updated_years.xlsx.


### Function to summarize resumes

In [22]:
def summarize_text(client, text, max_chunk_size=3000, overlap_size=50):
    chunks = split_text(text, max_chunk_size, overlap_size)
    summary = ""

    for chunk in chunks:
        prompt = (
            "Please summarize the following resume into a short paragraph that includes "
            "the job title, years of experience, highest level of education, language skills, "
            "and key skills:\n\n" + chunk
        )

        try:
            response = client.completions.create(
                model="gpt-3.5-turbo-instruct",
                prompt=prompt,
                max_tokens=150,  
                temperature=0.2
            )
            chunk_summary = response.choices[0].text.strip()
            summary += chunk_summary + "\n"
        except Exception as e:
            print(f"An error occurred: {e}")

    return summary

### Process and Summarize Resumes

In [23]:
for filename in df.index:
    if filename.startswith('translated_'):
        resume_path = translated_output_directory / filename
    else:
        resume_path = directory_path / filename

    if 'translated_files_list' in filename:
        continue

    if resume_path.suffix.lower() == '.pdf':
        try:
            resume_text = extract_text_from_pdf(str(resume_path))
        except Exception as e:
            print(f"An error occurred while reading PDF file: {e}")
            continue
    elif resume_path.suffix.lower() == '.txt':
        try:
            resume_text = extract_text_from_txt(str(resume_path))
        except Exception as e:
            print(f"An error occurred while reading text file: {e}")
            continue
    else:
        print(f"Unsupported file format for file: {resume_path}")
        continue

    summary = summarize_text(client, resume_text)
    df.at[filename, 'Summary'] = summary

print(df.head())
updated_file_path = logs_directory / '3_resumes_summaries.xlsx'
df.to_excel(str(updated_file_path))
df.head()

                                                      job title  \
Filename                                                          
10554236.pdf                               Financial Accountant   
10818478.pdf                            Retail Sales Consultant   
10820510.pdf                                    QA / QC Manager   
11257723.pdf             General Liability Claim Representative   
11409460.pdf  Buyer/Planner, Logistics Analyst, Warehouse Ex...   

                         years of experience  \
Filename                                       
10554236.pdf                         8 years   
10818478.pdf                         5 years   
10820510.pdf                        21 years   
11257723.pdf               5.083333333333333   
11409460.pdf  4.5 years, 1.25 years, 4 years   

                                highest level of education  \
Filename                                                     
10554236.pdf                                           NaN   
1081847

Unnamed: 0_level_0,job title,years of experience,highest level of education,language skills,key skills,Summary
Filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
10554236.pdf,Financial Accountant,8 years,,,"Financial planning, reporting, analysis, accou...",Experienced Financial Accountant with a Bachel...
10818478.pdf,Retail Sales Consultant,5 years,High School Diploma,,"Administrative, Cash handling, Excellent commu...",This experienced Retail Sales Consultant has o...
10820510.pdf,QA / QC Manager,21 years,,,"Microsoft Word, Excel, Weld Pro, Auto-Cad, AWS...",Matt Halderman is a highly experienced QA/QC M...
11257723.pdf,General Liability Claim Representative,5.083333333333333,B.S in Journalism,,"Claims file management processes, Litigation r...",This resume highlights the experience and skil...
11409460.pdf,"Buyer/Planner, Logistics Analyst, Warehouse Ex...","4.5 years, 1.25 years, 4 years",Bachelor of Science in Petroleum Engineering,"English, Portuguese, Spanish","Solid Works, CAD, Matlab, MS Office, Process I...","3D modeling, Warehouse\n\nThis candidate is a ..."


### Scoring criteria based on provided vacancy:

### job description from vacancies.DataArt:

In [24]:
job_description = """
FullStack(NodeJS, ReactJS), Online Genealogy Service
Client
The client is an international company that provides an online genealogy service that helps its clients understand their past and family history.

Project overview
The core programming language is JavaScript (ES2020), a website running on React.js and GraphQL and the back-end platform is based on Node.js (Express). Microservices running under Kubernetes. The project methodology is Scrum.

Team
There are a few Full Stack teams, up to 8 people each. Each team has a team lead and a product owner.

Position overview
We are looking for a specialist to join one of the teams (which is more Frontend oriented) is working on the further development of existing platforms. Regarding the work schedule, each employee should be available till 4 pm UK time.

Technology stack
JavaScript, React.js, GraphQL, Node.js (Express), Kubernetes.
 
Requirements
Development experience using a Node.js (Express) + React.js stack
Experience with SQL Server
Experience with PostgreSQL
Knowledge of Kafka
Knowledge of RabbitMQ
Dev-level experience with K8s/Docker
Knowledge of sound engineering practices like pair programming, upfront automated testing, continuous deployment, and trunk-based development
Spoken English

Nice to have
Knowledge of Apollo engine, Kafka, Postgres
Experience with microservices architecture development
Experience with GraphQL
Experience with RabbitMQ, SQL Server
Experience in development with C#
Experience with SOLR
Software development experience in Python
"""

{
  "job title": "FullStack Developer",
  "years of experience": "3+ years",
  "highest level of education": "Bachelor's degree in Computer Science or related field",
  "language skills": "English (spoken)",
  "key skills": [
    "Node.js (Express)",
    "React.js",
    "GraphQL",
    "JavaScript (ES2020)",
    "Kubernetes",
    "SQL Server",
    "PostgreSQL",
    "Kafka",
    "RabbitMQ",
    "C#",
    "SOLR",
    "Python"
  ]
}


### combine columns in excel:

In [25]:
excel_file_1 = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/3_resumes_summaries.xlsx")
excel_file_2 = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/2_resumes_updated_years.xlsx")

df1 = pd.read_excel(excel_file_1)
df2 = pd.read_excel(excel_file_2)

df1.set_index('Filename', inplace=True)
df2.set_index('Filename', inplace=True)

df1['years of experience'] = df2['numeric_years_of_experience']
df1.reset_index(inplace=True)

save_path = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/4_resumes_years_summaries.xlsx")

df1.to_excel(save_path, index=False)

In [26]:
df1.head(10)

Unnamed: 0,Filename,job title,years of experience,highest level of education,language skills,key skills,Summary
0,10554236.pdf,Financial Accountant,9.0,,,"Financial planning, reporting, analysis, accou...",Experienced Financial Accountant with a Bachel...
1,10818478.pdf,Retail Sales Consultant,5.5,High School Diploma,,"Administrative, Cash handling, Excellent commu...",This experienced Retail Sales Consultant has o...
2,10820510.pdf,QA / QC Manager,21.0,,,"Microsoft Word, Excel, Weld Pro, Auto-Cad, AWS...",Matt Halderman is a highly experienced QA/QC M...
3,11257723.pdf,General Liability Claim Representative,10.75,B.S in Journalism,,"Claims file management processes, Litigation r...",This resume highlights the experience and skil...
4,11409460.pdf,"Buyer/Planner, Logistics Analyst, Warehouse Ex...",4.5,Bachelor of Science in Petroleum Engineering,"English, Portuguese, Spanish","Solid Works, CAD, Matlab, MS Office, Process I...","3D modeling, Warehouse\n\nThis candidate is a ..."
5,12491898.pdf,Construction Laborer,,,,,Construction laborer with several years of exp...
6,13518263.pdf,Interior Designer,26.5,Associate of Arts,,"Concept development, space planning, color and...",This successful Interior Designer has 26 years...
7,13907230.pdf,General Construction Intern,0.25,Bachelor of Science in Construction Management,Bilingual and biliterate in Spanish,"Proficient technical skills in AutoCAD, Micros...",This candidate is a dedicated student with exc...
8,14724186.pdf,Interior Designer,22.5,High School Diploma,,"Photoshop, art, Budget Preparation, budgets, b...",This Interior Designer has over 20 years of ex...
9,15601399.pdf,Self-Sustaining Engineering Technician,20.0,Bachelor of Science in Electronic Engineering,,"Versatile, project management, hardware troubl...",This resume belongs to a Self-Sustaining Engin...


### working embeddings

### Read file as xlsx and save as CSV, clean excel file

In [27]:
import json

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

def list_to_json_str(lst):
    return json.dumps(lst)

excel_file_path = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/4_resumes_years_summaries.xlsx")

df = pd.read_excel(excel_file_path, index_col='Filename')

text_columns = ['job title', 'years of experience', 'highest level of education', 'language skills', 'key skills', 'Summary']

for column in text_columns:
    if df[column].dtype == 'object':
        df[column + ' embedding'] = df[column].apply(lambda x: list_to_json_str(get_embedding(x)) if pd.notnull(x) else np.nan)

save_csv_path = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/5_resumes_embeddings.csv")
df.to_csv(save_csv_path, index=True)

save_excel_path = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/5_resumes_embeddings.xlsx")
df.to_excel(save_excel_path, index=True)

print(f"DataFrame saved to {save_csv_path} and {save_excel_path}")

DataFrame saved to C:\Users\apleczkan\PycharmProjects\task1-cv-resumes\logs\5_resumes_embeddings.csv and C:\Users\apleczkan\PycharmProjects\task1-cv-resumes\logs\5_resumes_embeddings.xlsx


### Check if everything worked as expected

In [28]:
df = pd.read_csv("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/5_resumes_embeddings.csv")
df.head()
df.tail()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 12 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Filename                              38 non-null     object 
 1   job title                             38 non-null     object 
 2   years of experience                   37 non-null     float64
 3   highest level of education            28 non-null     object 
 4   language skills                       10 non-null     object 
 5   key skills                            37 non-null     object 
 6   Summary                               38 non-null     object 
 7   job title embedding                   38 non-null     object 
 8   highest level of education embedding  28 non-null     object 
 9   language skills embedding             10 non-null     object 
 10  key skills embedding                  37 non-null     object 
 11  Summary embedding    

### Scoring using vector similarity search

In [62]:
from numpy.linalg import norm
from ast import literal_eval

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input = [text], model=model).data[0].embedding

def string_to_float_list(s):
    try:
        return np.array(literal_eval(s))
    except:
        return np.nan

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

job_description = {
    "job title": "FullStack Developer",
    "years of experience": "At least 2 years of development experience",
    "highest level of education": "Bachelor's or higher in Computer Science or related field",
    "language skills": "Fluent in spoken English",
    "key skills": "Node.js, React.js, GraphQL, Kubernetes, SQL Server, PostgreSQL, Kafka, RabbitMQ, C#, SOLR, Python, Sound engineering practices, Pair programming, Automated testing, Continuous deployment, Trunk-based development"
}

job_description_embeddings = {key: get_embedding(value) for key, value in job_description.items()}

df_path = "C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/5_resumes_embeddings.csv"
df = pd.read_csv(df_path)

embedding_columns = [col for col in df.columns if 'embedding' in col]
df[embedding_columns] = df[embedding_columns].applymap(string_to_float_list)

def search_resumes(df, job_description_embeddings):
    df['similarity'] = 0.0

    for index, row in df.iterrows():
        similarity_scores = []
        for key, job_embedding in job_description_embeddings.items():
            embedding_col_name = f'{key} embedding'

            if embedding_col_name in df.columns:
                resume_embedding = row.get(embedding_col_name)

                if isinstance(resume_embedding, np.ndarray) and not np.isnan(resume_embedding).any():
                    similarity_scores.append(cosine_similarity(resume_embedding, job_embedding))

        if similarity_scores:
            df.at[index, 'similarity'] = np.mean(similarity_scores)
        else:
            df.at[index, 'similarity'] = 0.0  

    sorted_df = df.sort_values('similarity', ascending=False)

    csv_file_path = "C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/6_scores.csv"
    sorted_df.to_csv(csv_file_path, index=False)
    print(f"DataFrame saved to {csv_file_path}")

    return sorted_df

top_matches = search_resumes(df, job_description_embeddings)
top_matches.head(10)

DataFrame saved to C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/6_scores.csv
                         Filename  \
37             test_resume_PL.pdf   
36  perfect_match_test_resume.pdf   
9                    15601399.pdf   
12                   19007667.pdf   
4                    11409460.pdf   
7                    13907230.pdf   
31                   39434376.pdf   
21                   25157655.pdf   
34                   81011612.pdf   
24                   29521434.pdf   
11                   17660419.pdf   
25                   30529547.pdf   
33                   77626587.pdf   
14                   20488267.pdf   
2                    10820510.pdf   
8                    14724186.pdf   
20                   24533931.pdf   
22                   25930778.pdf   
6                    13518263.pdf   
26                   32518109.pdf   

                                            job title  years of experience  \
37                               Full Stack Developer  

In [63]:
top_matches.head(10)

Unnamed: 0,Filename,job title,years of experience,highest level of education,language skills,key skills,Summary,job title embedding,highest level of education embedding,language skills embedding,key skills embedding,Summary embedding,similarity
37,test_resume_PL.pdf,Full Stack Developer,5.5,Bachelor of Science in Computer Science,English (Fluent),"JavaScript (ES2020), React.js, Node.js (Expres...",Alex Johnson is a dynamic and creative Full St...,"[-0.027571901679039, -0.04344268515706062, 0.0...","[-0.03735457360744476, -0.01849149726331234, 0...","[-0.020018214359879494, 0.026221809908747673, ...","[-0.05989794433116913, 0.019079990684986115, 0...","[-0.053971629589796066, -0.019841112196445465,...",0.818348
36,perfect_match_test_resume.pdf,Full Stack Developer,5.5,Bachelor of Science in Computer Science,English (Fluent),"JavaScript, React.js, Node.js, C#, Python, SQL...",Alex Johnson is a dynamic and creative Full St...,"[-0.027571901679039, -0.04344268515706062, 0.0...","[-0.03735457360744476, -0.01849149726331234, 0...","[-0.020018214359879494, 0.026221809908747673, ...","[-0.059905365109443665, 0.019764361903071404, ...","[-0.05070486292243004, -0.013921516016125679, ...",0.817668
9,15601399.pdf,Self-Sustaining Engineering Technician,20.0,Bachelor of Science in Electronic Engineering,,"Versatile, project management, hardware troubl...",This resume belongs to a Self-Sustaining Engin...,"[0.008474043570458889, 0.02479414828121662, -0...","[-0.03314077481627464, -0.01538262888789177, -...",,"[-0.027127940207719803, 0.02288758009672165, 0...","[-0.005034045781940222, 0.031484685838222504, ...",0.410744
12,19007667.pdf,Chef,11.5,,"English, Spanish","Catering, International cuisine, Food handling...",Experienced catering chef with over 20 years o...,"[0.020689690485596657, -0.04564774036407471, 0...",,"[-0.005115291569381952, 0.0031532866414636374,...","[-0.04559430480003357, 0.0006426058826036751, ...","[-0.013046729378402233, -0.003387243952602148,...",0.396838
4,11409460.pdf,"Buyer/Planner, Logistics Analyst, Warehouse Ex...",4.5,Bachelor of Science in Petroleum Engineering,"English, Portuguese, Spanish","Solid Works, CAD, Matlab, MS Office, Process I...","3D modeling, Warehouse\n\nThis candidate is a ...","[-0.03179485350847244, -0.002672777511179447, ...","[-0.05517318472266197, -0.023150218650698662, ...","[-0.011648587882518768, 0.017744550481438637, ...","[-0.019217632710933685, 0.021598920226097107, ...","[-0.044104475528001785, 0.030383646488189697, ...",0.384999
7,13907230.pdf,General Construction Intern,0.25,Bachelor of Science in Construction Management,Bilingual and biliterate in Spanish,"Proficient technical skills in AutoCAD, Micros...",This candidate is a dedicated student with exc...,"[-0.03458902984857559, 0.03296443819999695, 0....","[-0.02583170495927334, 0.02778453193604946, 0....","[0.0036688519176095724, -0.013028540648519993,...","[0.010658502578735352, 0.034927863627672195, 0...","[-0.013416905887424946, 0.023733437061309814, ...",0.382523
31,39434376.pdf,Graphic Designer and Illustrator,3.5,Bachelor of Science in Fine Arts,Spanish,"Creative, Relational, Engaging, Painting/Drawi...",This graphic designer and illustrator has 5 ye...,"[-0.017479347065091133, 0.024557972326874733, ...","[-0.035439398139715195, -0.0010236717062070966...","[-0.03011108934879303, -0.037121593952178955, ...","[-0.008999950252473354, 0.027606280520558357, ...","[-0.0014269789680838585, 0.028805717825889587,...",0.372915
21,25157655.pdf,"Inside Account Manager, Event Manager and Sale...",10.0,Bachelor of Arts in Psychology,,"Team leadership, customer service, volume lice...",Experienced Inside Account Manager with eight ...,"[-0.02725711464881897, 0.019757211208343506, 0...","[-0.0339832529425621, -0.013096763752400875, 0...",,"[-0.015760717913508415, 0.019223403185606003, ...","[-0.01193070225417614, -0.008359400555491447, ...",0.3651
34,81011612.pdf,Graphic Designer,8.0,,,"Photoshop, Illustrator, Basic HTML coding, Mic...",This graphic designer has eight years of exper...,"[-0.018504373729228973, 0.0258307047188282, 0....",,,"[-0.05779535323381424, 0.0304099228233099, 0.0...","[-0.031176120042800903, 0.05340581387281418, 0...",0.362495
24,29521434.pdf,Cashier,0.25,Associate of Arts in Early Childhood Education...,Conversant in Korean,"Problem Solving, Adaptability, Collaboration, ...",This resume belongs to a Cashier with experien...,"[-0.028124133124947548, -0.01013905368745327, ...","[-0.012859285809099674, -0.013897830620408058,...","[-0.02016090229153633, -0.033935196697711945, ...","[0.011242450214922428, -0.002014272380620241, ...","[0.007931552827358246, 0.008196759968996048, 0...",0.351056


### Combine scores with existing excel file and save under new name

In [66]:
excel_file = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/4_resumes_years_summaries.xlsx")
csv_file_path = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/6_scores.csv")
output_excel_file = Path("C:/Users/apleczkan/PycharmProjects/task1-cv-resumes/logs/7_resumes_summary_scores_sorted.xlsx")

excel_df = pd.read_excel(excel_file)
scores_df = pd.read_csv(csv_file_path)

excel_df = excel_df.sort_values(by='Filename')
scores_df = scores_df.sort_values(by='Filename')

scores_df = scores_df.rename(columns={'similarity': 'scores'})

excel_df['scores'] = scores_df['scores'].values

if "Unnamed: 0" in excel_df.columns:
    excel_df = excel_df.drop(columns=["Unnamed: 0"])

sorted_df = excel_df.sort_values(by='scores', ascending=False)

sorted_df.to_excel(output_excel_file, index=False)

print(f"Sorted and saved DataFrame to {output_excel_file}")

Sorted and saved DataFrame to C:\Users\apleczkan\PycharmProjects\task1-cv-resumes\logs\7_resumes_summary_scores_sorted.xlsx


In [67]:
print("Sorted DataFrame:")
sorted_df.head(10)

Sorted DataFrame:


Unnamed: 0,Filename,job title,years of experience,highest level of education,language skills,key skills,Summary,scores
37,test_resume_PL.pdf,Full Stack Developer,5.5,Bachelor of Science in Computer Science,English (Fluent),"JavaScript (ES2020), React.js, Node.js (Expres...",Alex Johnson is a dynamic and creative Full St...,0.818348
36,perfect_match_test_resume.pdf,Full Stack Developer,5.5,Bachelor of Science in Computer Science,English (Fluent),"JavaScript, React.js, Node.js, C#, Python, SQL...",Alex Johnson is a dynamic and creative Full St...,0.817668
9,15601399.pdf,Self-Sustaining Engineering Technician,20.0,Bachelor of Science in Electronic Engineering,,"Versatile, project management, hardware troubl...",This resume belongs to a Self-Sustaining Engin...,0.410744
12,19007667.pdf,Chef,11.5,,"English, Spanish","Catering, International cuisine, Food handling...",Experienced catering chef with over 20 years o...,0.396838
4,11409460.pdf,"Buyer/Planner, Logistics Analyst, Warehouse Ex...",4.5,Bachelor of Science in Petroleum Engineering,"English, Portuguese, Spanish","Solid Works, CAD, Matlab, MS Office, Process I...","3D modeling, Warehouse\n\nThis candidate is a ...",0.384999
7,13907230.pdf,General Construction Intern,0.25,Bachelor of Science in Construction Management,Bilingual and biliterate in Spanish,"Proficient technical skills in AutoCAD, Micros...",This candidate is a dedicated student with exc...,0.382523
31,39434376.pdf,Graphic Designer and Illustrator,3.5,Bachelor of Science in Fine Arts,Spanish,"Creative, Relational, Engaging, Painting/Drawi...",This graphic designer and illustrator has 5 ye...,0.372915
21,25157655.pdf,"Inside Account Manager, Event Manager and Sale...",10.0,Bachelor of Arts in Psychology,,"Team leadership, customer service, volume lice...",Experienced Inside Account Manager with eight ...,0.3651
34,81011612.pdf,Graphic Designer,8.0,,,"Photoshop, Illustrator, Basic HTML coding, Mic...",This graphic designer has eight years of exper...,0.362495
24,29521434.pdf,Cashier,0.25,Associate of Arts in Early Childhood Education...,Conversant in Korean,"Problem Solving, Adaptability, Collaboration, ...",This resume belongs to a Cashier with experien...,0.351056
