# Resume Recommender

Experimenting with a way to handle (potentially) long inputs that can't be truncated - recommending resumes with retrieve-and-rank!

In [1]:
from typing import *
import os.path
import re
import textwrap
from uuid import uuid4
import zipfile
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import pandas as pd
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from cleantext import clean
import dill

[nltk_data] Downloading package punkt to /home/azureuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
  from .autonotebook import tqdm as notebook_tqdm
Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


## Outline Of Approach
I'll be using a [retrieve and rank approach](https://www.sbert.net/examples/applications/retrieve_rerank/README.html#retrieve-re-rank).  The basic idea is to reduce what could be a huge list of possibilities with an initial coarse filtration (the "retrieval"), then apply a fine filtration (the "rank") to return the best possible results:

* From the universe of available candidates, choose an initial list of candidates with a (relatively) fast and performant technique.  I'll be using the excellent [SentenceTransformer](https://www.sbert.net/docs/quickstart.html#quickstart).
* From this subset, re-rank the candidates with a more sophisticated technique.  Here I'll use sentence-transformers' [CrossEncoder](https://www.sbert.net/examples/applications/cross-encoder/README.html), which is basically just the standard transformer architecture we all know and love.

That's the high level, but how will I handle lengthy documents?  I can't just truncate at 256/512 tokens, because there's a good chance that I'll miss the most relevant parts of the resume.  That leaves me with using a sliding window to extract subsets from each document.  I usually get fairly decent results just splitting on whitespace, but here I'm mainly interested in trying out [NLTK's sent_tokenize](https://www.nltk.org/api/nltk.tokenize.html) function.

So here's the full process:

1. **Tokenize** : split each document into a list of sentences.
2. **Retrieval** : for a given query, run a semantic search against each tokenized resume (list of sentences).  The relevance of a given resume to the query is then the maximum score (dot product) between the query and the resume's sentences, and the most relevant resumes are the top N resumes by score.
3. **Ranking** : for each of the top N resumes found in Step 2, compute the [score](https://www.sbert.net/docs/package_reference/cross_encoder.html#sentence_transformers.cross_encoder.CrossEncoder.predict) between a query string and the most relevant sentence in the resume.  Re-order the top N resumes.

## Data
Here I'll use a set of resumes from [florex's resume_corpus](https://github.com/florex/resume_corpus).  There are several thousand resumes in the corpus, so rather than extract to disk I'll just read them directly from the archive.

### References
Jiechieu, K.F.F., Tsopze, N. Skills prediction based on multi-label resume classification using CNN with model predictions explanation. Neural Comput & Applic (2020). https://doi.org/10.1007/s00521-020-05302-x

In [2]:
data_archive = './data/resumes_corpus.zip'  # https://github.com/florex/resume_corpus/blob/master/resumes_corpus.zip


def read_resume_zip(resume_zip_file: AnyStr, resume_extension: AnyStr = '.txt') -> Dict[AnyStr, Any]:
    """
    Directly reads files with a given filename extension.  Files are tokenized into sentences.
    resume_zip_file: path to the ZIP file
    resume_extension: extension of files to read, defaults to '.txt'.
    
    Returns a dict where keys are unique IDs (uuid4) of the form
    {uuid_str: {'sentences': [list of sentences], 'embeddings': None}}
    """
    resumes = dict()
    with zipfile.ZipFile(data_archive, 'r') as datazip:
        resume_files = [file for file in datazip.namelist() if file.endswith(resume_extension)]    
        for file in resume_files:
            with datazip.open(file) as fidin:
                key = str(uuid4())
                resumes[key] = dict()
                resumes[key]['sentences'] = sent_tokenize(clean_text(fidin.read()))
                resumes[key]['embeddings'] = None  # Placeholder for encoding
        print("Found {0} file(s)".format(len(resume_files)))
    return resumes

In [3]:
TAG_RE = re.compile('<.*?>')  # Basic regex to match tags

def clean_text(txt: AnyStr, remove_tags: bool = True) -> AnyStr:
    """
    Cleans up text - fixing Unicode, normalizing line breaks, and optionally stripping tags.
    txt: text to clean
    remove_tags: if True (default), try to strip tags from the cleaned text.  Otherwise leave as-is.
    
    Returns the (hopefully) cleaner text.
    """
    if txt and len(txt) > 0:
        cleaned = clean(
            txt,
            fix_unicode=True,               # fix various unicode errors
            to_ascii=True,                  # transliterate to closest ASCII representation
            lower=False,                     # lowercase text
            no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
            no_urls=False,                  # replace all URLs with a special token
            no_emails=False,                # replace all email addresses with a special token
            no_phone_numbers=False,         # replace all phone numbers with a special token
            no_numbers=False,               # replace all numbers with a special token
            no_digits=False,                # replace all digits with a special token
            no_currency_symbols=False,      # replace all currency symbols with a special token
            no_punct=False,                 # remove punctuations
            replace_with_punct="",          # instead of removing punctuations you may replace them
            replace_with_url="<URL>",
            replace_with_email="<EMAIL>",
            replace_with_phone_number="<PHONE>",
            replace_with_number="<NUMBER>",
            replace_with_digit="0",
            replace_with_currency_symbol="<CUR>",
            lang="en"                       # set to 'de' for German special handling
        )
        if remove_tags:
            cleaned = re.sub(TAG_RE, '', cleaned)
    else:
        cleaned = txt
    return cleaned

In [4]:
# Bi-encoder - retrieval model
# Full list of options: https://www.sbert.net/docs/pretrained_models.html
# 'multi-qa-MiniLM-L6-cos-v1' with a maximum sequence length of 256 seems to work well
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256     #Truncate long passages to 256 tokens

# Cross-encoder - ranking model
# Full list of options: https://www.sbert.net/docs/pretrained-models/ce-msmarco.html#models-performance
# Fairly good results w. 'cross-encoder/ms-marco-TinyBERT-L-2-v2', very fast doc processing
cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2')

In [5]:
def encode_resumes(
    resumes: Dict[AnyStr, Any], 
    bi_encoder: SentenceTransformer) -> Dict[AnyStr, Any]:
    """
    Encodes a set of documents with a SentenceTransformer model.
    
    resumes: set of documents to encode
    bi_encoder: SentenceTransformer model used to encode the documents
    
    Returns a dict of documents with the embeddings field populated.
    """
    for fid in resumes.keys():
        resumes[fid]['embeddings'] = bi_encoder.encode(
            resumes[fid]['sentences'], 
            convert_to_tensor=True, 
            show_progress_bar=False
        )
    return resumes


def read_encode_resumes(
    resume_zip_file: AnyStr,
    bi_encoder: SentenceTransformer) -> Dict[AnyStr, Any]:
    """
    Convenience function - reads a ZIP and generates embeddings in a single call.
    
    resume_zip_file: path to the ZIP file
    bi_encoder: SentenceTransformer model used to encode the documents
    
    Returns a dict of documents with the embeddings field populated.
    """
    
    resumes = read_resume_zip(resume_zip_file)
    return encode_resumes(resumes=resumes, bi_encoder=bi_encoder)


def load_resume_pickle(pkl_file: AnyStr) -> Any:
    """
    Deserializes a set of documents
    
    pkl_file: name of file from which to load object
    
    Returns deserialized object.
    """
    with open(pkl_file, 'rb') as fidin:
        return dill.load(fidin)


def save_resume_pickle(resumes: Any, pkl_file: AnyStr):
    """
    Serializes an object.
    
    resumes: object to serialize
    pkl_file: name of file to which to save object
    """
    with open(pkl_file, 'wb') as fidout:
        dill.dump(resumes, fidout)

In [10]:
# Run this cell if you need embeddings...
resumes = read_encode_resumes(resume_zip_file=data_archive, bi_encoder=bi_encoder)
save_resume_pickle(resumes=resumes, pkl_file='./data/resumes.dpkl')

Found 29783 file(s)


In [6]:
# ...otherwise it's much faster to just load from storage
resumes = load_resume_pickle('./data/resumes.dpkl')

With the data prep done I can finally get to the good part.  Here I'll plan on casting a fairly wide net with retrieval, on the assumption that I might have a few spurious recommendations from a fast and coarse initial filtering.  The main thing is to reduce the universe of candidates for the ranking, which often uses a complicated (slow) model.

In [8]:
def retrieve_candidates(
    candidate_query: AnyStr,
    resumes: Dict[AnyStr, Any],
    bi_encoder: SentenceTransformer,
    n_candidates: int = 10) -> List[Dict[AnyStr, Any]]:
    """
    Retrieves the most relevant documents from a list of candidates.
    
    candidate_query: free-form text query
    resumes: universe of candidates
    bi_encoder: SentenceTransformer model used to encode the documents
    n_candidates: number of candidates to return (defaults to 10).
    
    Returns a list of the top n_candidates.
    """
    question_embedding = bi_encoder.encode(candidate_query, convert_to_tensor=True).cuda()
    results = list()
    for resid in resumes.keys():
        best_sentence_hit = util.semantic_search(
            question_embedding, 
            resumes[resid]['embeddings'], 
            top_k=1, 
            score_function=util.dot_score
        )[0]
        results.append({
            'resume_id': resid,
            'score': best_sentence_hit[0]['score'],
            'sentence_id': best_sentence_hit[0]['corpus_id'],
            'sentence': resumes[resid]['sentences'][best_sentence_hit[0]['corpus_id']],
            'sentences': resumes[resid]['sentences']
        })
    return sorted(results, key=lambda el: el['score'], reverse=True)[:n_candidates]


def rank_candidates(
    candidate_query: AnyStr,
    candidates: List[Dict[AnyStr, Any]],
    cross_encoder: CrossEncoder,
    n_candidates: int = None) -> List[Dict[AnyStr, Any]]:
    """
    Re-ranks a list of candidates by their relevance to a query.
    
    candidate_query: free-form text query
    candidates: universe of candidates
    cross_encoder: CrossEncoder model used to rank the candidates
    n_candidates: number of candidates to return.  If None (default), returns the entire list of candidates.
    
    Returns a list of the top n_candidates.
    """
    if not n_candidates:
        n_candidates = len(candidates)
    results = list(candidates)
    cross_inp = [[candidate_query, candidate['sentence']] for candidate in candidates]
    cross_scores = cross_encoder.predict(cross_inp)
    for i in range(len(results)):
        results[i]['cross_score'] = cross_scores[i]
    return sorted(results, key=lambda el: el['cross_score'], reverse=True)[:n_candidates]

In [19]:
# Retrieval

query = 'SQL Server Database Administrator in Wilmington Delaware'
num_candidates = 5

retrieved_candidates = retrieve_candidates(
    candidate_query=query,
    resumes=resumes,
    bi_encoder=bi_encoder,
    n_candidates=num_candidates * 10  # Cast a wide initial net, we'll down select in ranking
)

print("Query: '{0}'".format(query))
print("Top {0} Candidates\n".format(len(retrieved_candidates)))
for candidate in retrieved_candidates:
    print("Resume ID: {0}".format(candidate['resume_id']))
    print("Score: {0:.4f}".format(candidate['score']))
    print("Most Relevant Sentence: '{0}'".format(candidate['sentence']))
    print()

Query: 'SQL Server Database Administrator in Wilmington Delaware'
Top 50 Candidates

Resume ID: 112916bf-1c08-43bb-b62a-9d9e2930450a
Score: 0.7459
Most Relevant Sentence: 'Database Administrator using SQL Server.'

Resume ID: d015ba48-3dcb-4f79-bd7e-13afbe07fa8b
Score: 0.7452
Most Relevant Sentence: 'Senior Database Administrator - Technical lead Wilmington Trust Company - Wilmington, DE 2001 to 2010 for a team of seven database administrators that are responsible for administrating over 100 SQL Servers holding approximately 1100 unique user databases.'

Resume ID: a1c6e2b1-164f-4ea4-8606-ea2e2f8ea7f7
Score: 0.7128
Most Relevant Sentence: 'Work Experience MS SQL Server Database Administrator AAA Carolinas - Charlotte, NC February 2017 to Present Responsibilities: ?'

Resume ID: 67bc88ac-4a0d-4fa6-b41f-f8b1ccc4b61f
Score: 0.6938
Most Relevant Sentence: 'Database Administrator ?'

Resume ID: a441fd0d-f462-47a9-b851-e555f817644e
Score: 0.6913
Most Relevant Sentence: 'Database Administrato

In [21]:
# (Re-)Ranking - reorder w. a more sophisticated model, then take the top N

ranked_top_candidates = rank_candidates(
    candidate_query=query,
    candidates=retrieved_candidates,
    cross_encoder=cross_encoder,
    n_candidates=num_candidates
)

print("Query: '{0}'".format(query))
print("Top {0} Candidates\n".format(len(ranked_top_candidates)))
for candidate in ranked_top_candidates:
    print("Resume ID: {0}".format(candidate['resume_id']))
    print("Score: {0:.4f}".format(candidate['score']))
    print("Most Relevant Sentence: '{0}'".format(candidate['sentence']))
    print()

Query: 'SQL Server Database Administrator in Wilmington Delaware'
Top 5 Candidates

Resume ID: d015ba48-3dcb-4f79-bd7e-13afbe07fa8b
Score: 0.7452
Most Relevant Sentence: 'Senior Database Administrator - Technical lead Wilmington Trust Company - Wilmington, DE 2001 to 2010 for a team of seven database administrators that are responsible for administrating over 100 SQL Servers holding approximately 1100 unique user databases.'

Resume ID: c8b6b305-b9fe-4857-85c4-862decda10e0
Score: 0.6874
Most Relevant Sentence: 'Database Administrator Wilmington Trust Company - Wilmington, DE December 1998 to February 2000 DB2 Database Administration, for mainframe applications Maintained DB2 utility jobs (load, unload, image copies, reorg & recovery) for development, QA and production environments Worked with BMC utilities toolsets (Catalog Manager, Change Manager, DASD Manager, Patrol Logmaster & Recovery Manager) SQL Server Administration for release 6.5 applications Monitored SQL Server space usage,

Now I can bring it all back together - given a query, return a list of the top N matches.

In [22]:
def suggest_candidates(
    job_query: AnyStr,
    resumes: Dict[AnyStr, Any],
    bi_encoder: SentenceTransformer,
    cross_encoder: CrossEncoder,
    n_candidates: int = 3) -> List[Dict[AnyStr, Any]]:
    """
    Recommends a list of candidate resumes based on their relevance to a free-form text query.
    
    job_query: free-form text query
    resumes: universe of candidates
    bi_encoder: SentenceTransformer model used to encode the documents
    cross_encoder: CrossEncoder model used to rank the candidates
    n_candidates: number of candidates to return (defaults to 10).
    """
    retrieved_candidates = retrieve_candidates(
        candidate_query=job_query,
        resumes=resumes,
        bi_encoder=bi_encoder,
        n_candidates=num_candidates * 10
    )
    ranked_top_candidates = rank_candidates(
        candidate_query=job_query,
        candidates=retrieved_candidates,
        cross_encoder=cross_encoder,
        n_candidates=num_candidates
    )
    return [
        {'resume_id': candidate['resume_id'],
         'relevant_snippet': candidate['sentence'],
         'resume_text': ' '.join(candidate['sentences'])
        } for candidate in ranked_top_candidates
    ]

In [33]:
job_query = "A software developer with experience in .NET, based in Florida or Georgia"

recommended_candidates = suggest_candidates(
    job_query=job_query,
    resumes=resumes,
    bi_encoder=bi_encoder,
    cross_encoder=cross_encoder,
    n_candidates=5
)

print("Query: '{0}'".format(job_query))
print("Top {0} Candidates\n".format(len(ranked_top_candidates)))

for candidate in recommended_candidates:
    print("Resume ID: {0}".format(candidate['resume_id']))
    print("Relevant snippet:\n'{0}'".format(textwrap.shorten(candidate['relevant_snippet'], width=750))),
    print("\n")
    print("{0}".format(textwrap.shorten(candidate['resume_text'], width=750)))
    print(25 * '-')
    print()

Query: 'A software developer with experience in .NET, based in Florida or Georgia'
Top 5 Candidates

Resume ID: 355a7525-4388-45e4-b4f1-9cc37b82dab6
Relevant snippet:
'b'Software Developer Software Developer Software Developer Atlanta, GA I am a developer with a lot of experience who gets along well with others.'


b'Software Developer Software Developer Software Developer Atlanta, GA I am a developer with a lot of experience who gets along well with others. Authorized to work in the US for any employer Work Experience Software Developer CoreXpand August 2006 to December 2018 Software engineer responsible for programming both desktop and web applications in the ecommerce field.Worked with the Microsoft suite of programming tools and databases to create custom online stores.Extensive backend experience with SQL Server. Software Developer Allison Research Technologies - Atlanta, GA June 1995 to June 2006 Desktop and web application programmer for market research software on Linux and Mic

In [34]:
job_query = '''
Project Manager or a software developer with experience in
project management and is based in Minnesota, Iowa, Wisconsin, 
or Michigan
'''

recommended_candidates = suggest_candidates(
    job_query=job_query,
    resumes=resumes,
    bi_encoder=bi_encoder,
    cross_encoder=cross_encoder,
    n_candidates=5
)

print("Query: '{0}'".format(job_query))
print("Top {0} Candidates\n".format(len(ranked_top_candidates)))

for candidate in recommended_candidates:
    print("Resume ID: {0}".format(candidate['resume_id']))
    print("Relevant snippet:\n'{0}'".format(textwrap.shorten(candidate['relevant_snippet'], width=750))),
    print("\n")
    print("{0}".format(textwrap.shorten(candidate['resume_text'], width=750)))
    print(25 * '-')
    print()

Query: '
Project Manager or a software developer with experience in
project management and is based in Minnesota, Iowa, Wisconsin, 
or Michigan
'
Top 5 Candidates

Resume ID: af7d25b7-1122-4ab7-ba2d-384dbbf0a090
Relevant snippet:
'b'IT Project Manager - Enterprise Technologies IT Project Manager - Enterprise Technologies IT Project & Program Manager Brookfield, WI Authorized to work in the US for any employer Work Experience IT Project Manager - Enterprise Technologies Robert W. Baird - Milwaukee, WI November 2015 to Present Manage project planning & delivery in a matrix organization, using waterfall, Lean, and Agile practices.'


b'IT Project Manager - Enterprise Technologies IT Project Manager - Enterprise Technologies IT Project & Program Manager Brookfield, WI Authorized to work in the US for any employer Work Experience IT Project Manager - Enterprise Technologies Robert W. Baird - Milwaukee, WI November 2015 to Present Manage project planning & delivery in a matrix organization, 