## Project Brief: Résumé parsing and information extraction


### CoNVO

**Context:** Bloc is a career services management platform that builds smart career and data management tools for job-seekers and the organizations serving them. In particular, Bloc seeks to provide and facilitate access to tools for effectively presenting job-seekers' credentials and matching employers' job postings, and thereby improve outcomes.

**Need:** Many job-seekers come to Bloc's platform with a résumé already written. Forcing new users to re-enter all that information before they can use other tools (e.g. for résumé evaluation) is tedious at best. Avoiding that re-entry would also reduce time pressure during in-person sessions facilitated by Bloc, where every second counts.

**Vision:** Automated extraction of key information from existing résumés, submitted as PDFs when new users are onboarded, in order to streamline the onboarding process.

**Outcome:** A standalone, proof-of-concept process for extracting key résumé information and returning it as structured data, complete with unit tests and documentation on expected usage, limitations, and potential improvements.


### Data Summary

Bloc has provided ~125 résumés in PDF format, with a variety of styles, layouts, and contents. Data quality seems good: the files appear to be electronically generated PDFs rather than (much more troublesome) scans of physical documents.

They typically include personal contact information, professional experience, education, and skills; they sometimes include information on other relevant experience (volunteering, leadership, side projects), professional and academic associations, honors and awards, and personal interests; they rarely include a professional objective / statement of purpose and references.

Since the amount of data is relatively small, and since résumés are so structured and standardized in terms of the information they include, a rules-based approach seems likely to succeed.


### Proposed Methodology

Cleanly extracting text from PDFs is tricky, since the format alters or throws out information for the sake of human-friendly layout, formatting, and such. Given this, it's best to use well-established tools for the text extraction, and highly accommodating parsing logic for the texts themselves. Rather than going full-bore on a complex, computer-vision based résumé parsing system, it'll be best to start with more foundational tools of text processing: regular expressions, fuzzy string matching, gazetteers/dictionaries, data sanitization/cleanup, and lots of trial-and-error.

See the "Getting Started Code" section below for something to get you started.
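
As one illustration of the foundational tools named above, here is a minimal sketch of fuzzy string matching against a small gazetteer of section headers, using only the standard library's `difflib`. The `SECTION_HEADERS` gazetteer, the `match_section_header` function, the 40-character length limit, and the `0.8` cutoff are all hypothetical choices made for illustration; a real gazetteer would be built up by inspecting the résumé texts.

```python
import difflib

# Hypothetical gazetteer: canonical section names and common header variants.
SECTION_HEADERS = {
    "experience": ["experience", "work experience", "professional experience", "employment"],
    "education": ["education", "academic background"],
    "skills": ["skills", "technical skills", "skills and tools"],
}
# Flatten into a lookup from each known variant to its canonical section name.
VARIANT_TO_SECTION = {
    variant: section
    for section, variants in SECTION_HEADERS.items()
    for variant in variants
}


def match_section_header(line, cutoff=0.8):
    """
    Return the canonical section name if ``line`` looks like a known
    section header, tolerating small typos or extraction noise; else None.
    """
    candidate = line.strip().strip(":").lower()
    # headers are short, so skip empty or long lines to avoid matching body text
    if not candidate or len(candidate) > 40:
        return None
    matches = difflib.get_close_matches(
        candidate, list(VARIANT_TO_SECTION), n=1, cutoff=cutoff
    )
    return VARIANT_TO_SECTION[matches[0]] if matches else None
```

Calling `match_section_header` on each line of a résumé text gives a rough way to split it into sections; a misspelled header such as "Work Experiance:" should still map to `"experience"`, while long lines of body text are ignored.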


### Definitions of Success

- **Baseline:** A function that accepts a résumé (TBD: as filepath or already-extracted text) and returns structured data for the most common résumé components: contact information, professional experience, education, and skills. The extracted values may be messy or not fully parsed, but no component should contain values belonging to other components. Atypical résumé components may be skipped. This function should have basic unit tests and documentation.
- **Target:** A function that accepts a résumé and returns structured data for the most common résumé components (see Baseline), as well as other relevant experience, professional/academic associations, honors and awards, and personal interests. The extracted values should be almost fully parsed (e.g. no large blocks of relevant but unstructured text), and no component should contain values belonging to other components. Atypical résumé components may be skipped. This function should have unit tests covering a variety of expected scenarios and good documentation.
- **Stretch:** A function that accepts a résumé and returns structured data for any component that could be reasonably expected in such a document. The extracted values should be almost fully parsed (see Target). Particularly unusual résumé components may be skipped. This function should have comprehensive unit tests and documentation.

Note: We should try to get Bloc's buy-in / feedback on a schema, since they already ingest and store some of this data in their systems.
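
To make the idea of "structured data" concrete, here is a rough sketch of what a Baseline-level return value could look like, written with standard-library dataclasses. Every class and field name here (`ContactInfo`, `Resume`, `contact`, `experience`, and so on) is a placeholder assumption, not Bloc's schema; the actual structure should come out of the schema conversation mentioned above.

```python
import dataclasses
from typing import List, Optional


@dataclasses.dataclass
class ContactInfo:
    # hypothetical fields; any of them may be missing from a given résumé
    name: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    address: Optional[str] = None


@dataclasses.dataclass
class Resume:
    contact: ContactInfo = dataclasses.field(default_factory=ContactInfo)
    # at Baseline these may be messy blocks of text, one per entry
    experience: List[str] = dataclasses.field(default_factory=list)
    education: List[str] = dataclasses.field(default_factory=list)
    skills: List[str] = dataclasses.field(default_factory=list)


resume = Resume(
    contact=ContactInfo(name="Jane Doe", email="jane.doe@example.com"),
    skills=["Python", "SQL"],
)
# asdict() converts nested dataclasses into plain dicts and lists,
# which can then be serialized with json.dumps()
record = dataclasses.asdict(resume)
```

Keeping each component as its own field makes it straightforward to unit test that values for one component don't leak into another.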


### Risks

It's possible that the information included in / extracted from résumés is too complex or varied for sufficiently accurate rules-based parsing, in which case a more sophisticated (ML- or DL-based) approach would be necessary, albeit impractical owing to time and data constraints. It's also possible that a rules-based approach is feasible, but too difficult / large a task for a single day's work.

A separate risk concerns personally identifiable information (PII), which is intrinsic to a résumé, but which DataKind typically prefers to strip out of the data assigned to volunteers. A relatively practical solution would entail extracting text from the PDFs beforehand, then replacing direct PII (name and contact info) with placeholder values, but we'd still have volunteers working with indirect PII such as education / employment history. DataKind may not be able to abide such a middle ground.


### DataDive Recommendation

I think this is a good, interesting challenge for a DataDive. It's reasonably chunkable, and could be hacked on by one or multiple people simultaneously.

## Getting Started Code

In [1]:
%load_ext watermark

### Dataset Generation

In [2]:
import io
import operator
import os
import pathlib
import re
import shutil

import ftfy
from faker import Faker
from toolz import itertoolz

In [3]:
%watermark -v -iv

ftfy  5.6
toolz 0.10.0
re    2.2.1
CPython 3.7.4
IPython 7.8.0


In [4]:
def get_filepaths(dirpath, suffixes):
    """
    Get full paths to all files under ``dirpath``
    with a file type in ``suffixes``.

    Args:
        dirpath (:class:`pathlib.Path`)
        suffixes (Set[str])
    
    Returns:
        List[str]
    """
    return sorted(
        str(path) for path in dirpath.resolve().iterdir()
        if path.is_file() and
        path.suffix in suffixes
    )

In [5]:
raw_data_dir = pathlib.Path("/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows")
filepaths = get_filepaths(raw_data_dir, {".pdf"})
print("# files:", len(filepaths))
filepaths[:3]

# files: 128


['/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[002-002].pdf',
 '/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[003-003].pdf',
 '/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[004-004].pdf']

#### (a caveat)

Programmatically extracting text from a PDF with an atypical layout — such as a résumé — is _tricky_. Mistakes happen, and the results aren't always consistent with how a human would type it out.

I tried several options (see below). The Python binding to Apache Tika (`python-tika`) seemed to give the nicest text extractions, although the JVM dependency is unfortunate. `textract` provides a convenient and consistent interface, but results are mediocre and installation involves a lot of extra packages. `pdfminer` and its many forks are highly customizable, but confusing to use and, to be honest, a hot mess as far as code quality goes. I'm surprised Python doesn't have a better solution to this problem, but _whatchagonnado_.
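
Whichever extractor is used, the raw text tends to need some cleanup before parsing. The following is a minimal sketch of whitespace and bullet normalization using only `re`; the function name `normalize_extracted_text`, the particular set of bullet characters, and the choice to collapse runs of blank lines are assumptions for illustration, not part of the pipeline used to generate the dataset.

```python
import re

# runs of spaces, tabs, and non-breaking spaces within a line
RE_INLINE_SPACES = re.compile(r"[ \t\u00a0]+")
# a bullet-like character at the start of a line, followed by whitespace
RE_BULLET = re.compile(r"^[ \t]*[•●▪◦·*-][ \t]+", flags=re.MULTILINE)
# three or more consecutive newlines
RE_BLANK_LINES = re.compile(r"\n{3,}")


def normalize_extracted_text(text):
    """
    Normalize whitespace and bullet characters in text extracted from a PDF.
    """
    # collapse runs of inline whitespace into a single space
    text = RE_INLINE_SPACES.sub(" ", text)
    # replace the various bullet characters with a plain "- "
    text = RE_BULLET.sub("- ", text)
    # drop trailing whitespace on each line
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # collapse runs of blank lines into a single blank line
    text = RE_BLANK_LINES.sub("\n\n", text)
    return text.strip()
```

Note that collapsing whitespace throws away column alignment, which can sometimes carry meaning in a résumé (e.g. dates set flush right), so it may be worth keeping the raw text around as well.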

In [6]:
RE_NAME = re.compile(r"^(([(\"][A-Z]\w+[)\"]|[A-Z]\w+|[A-Z])[.,]?[ -]?){2,5}$", flags=re.UNICODE)
RE_URL = re.compile(
    r"(?:^|(?<![\w/.]))"
    # protocol identifier
    # r"(?:(?:https?|ftp)://)"  <-- alt?
    r"(?:(?:https?://|ftp://|www\d{0,3}\.))"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    r"(?:$|(?![\w?!+&/]))",
    flags=re.UNICODE | re.IGNORECASE,
)
RE_SHORT_URL = re.compile(
    r"(?:^|(?<![\w/.]))"
    # optional scheme
    r"(?:(?:https?://)?)"
    # domain
    r"(?:\w-?)*?\w+(?:\.[a-z]{2,12}){1,3}"
    r"/"
    # hash
    r"[^\s.,?!'\"|+]{2,12}"
    r"(?:$|(?![\w?!+&/]))",
    flags=re.UNICODE | re.IGNORECASE,
)
RE_EMAIL = re.compile(
    r"(?:mailto:)?"
    r"(?:^|(?<=[^\w@.)]))([\w+-](\.(?!\.))?)*?[\w+-]@(?:\w-?)*?\w+(\.([a-z]{2,})){1,3}"
    r"(?:$|(?=\b))",
    flags=re.UNICODE | re.IGNORECASE,
)
RE_PHONE_NUMBER = re.compile(
    # core components of a phone number
    r"(?:^|(?<=[^\w)]))(\+?1[ .-]?)?(\(?\d{3}\)?[ .-]?)?(\d{3}[ .-]?\d{4})"
    # extensions, etc.
    r"(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))",
    flags=re.UNICODE | re.IGNORECASE,
)
RE_STREET_ADDRESS = re.compile(
    # r"[ \w]{3,}([A-Za-z]\.)?([ \w]*\#\d+)?,?[ \w]{3,}, [A-Za-z]{2} \d{5}(-\d{4})?",
    r"(\d+ ((?! \d+ ).)*?) [A-Za-z]{2} \d{5}(-\d{4})?",
    flags=re.UNICODE,
)

faker = Faker(locale="en_US")

In [7]:
def extract_and_clean_text(filepath, min_len=100):
    """
    Extract text from a PDF at ``filepath`` using the first package
    to get the job done in extracting at least ``min_len`` chars.
    
    Args:
        filepath (str)
        min_len (int)
    
    Returns:
        str
    """
    text = ""
    funcs = [extract_text_tika, extract_text_pdfminer, extract_text_textract]
    # extract text from pdf
    for func in funcs:
        _text = func(filepath)
        if len(_text) >= min_len:
            text = _text
            break
    if not text:
        return text
    else:
        # correct any encoding / mojibake / other weirdness
        return ftfy.fix_text(text)


def replace_pii(text):
    """
    Replace personally-identifying information in ``text``
    with randomly generated fake equivalents.
    
    Args:
        text (str)
    
    Returns:
        str
    """
    # let's start with names, which are usually on the first line    
    first_line, *the_rest = text.split("\n", maxsplit=1)
    first_line = RE_NAME.sub(faker.name(), first_line.strip())
    text = "\n".join([first_line] + the_rest)
    # next, let's replace phone numbers, emails, urls, and addresses
    # which are usually in the first "chunk" of info (here, the first 150 chars)
    # first_chunk, *the_rest = re.split(r"\n{2,}", text, maxsplit=1)
    first_chunk = text[:150]
    the_rest = text[150:]
    first_chunk = RE_PHONE_NUMBER.sub(faker.phone_number(), first_chunk)
    first_chunk = RE_EMAIL.sub(faker.email(), first_chunk)
    first_chunk = RE_URL.sub(faker.url(), first_chunk)
    first_chunk = RE_STREET_ADDRESS.sub(faker.address().replace("\n", " "), first_chunk)
    text = first_chunk + the_rest
    return text


def extract_text_textract(filepath):
    """
    Extract text from a PDF at ``filepath`` using ``textract`` + ``pdftotext``.
    
    Args:
        filepath (str)
    
    Returns:
        str
    """
    # hiding the import so folks don't have to worry about installing it
    import textract
    
    return textract.process(
        filepath, method="pdftotext", encoding="utf-8"
    ).decode("utf-8").strip()


def extract_text_pdfminer(filepath):
    """
    Extract text from a PDF at ``filepath`` using ``yapdfminer``.

    Args:
        filepath (str)
    
    Returns:
        str
    """
    # hiding the import so folks don't have to worry about installing it
    import pdfminer.converter
    import pdfminer.layout
    import pdfminer.pdfinterp
    import pdfminer.pdfpage
    
    laparams = pdfminer.layout.LAParams(boxes_flow=0.5)
    retstr = io.StringIO()
    rsrcmgr = pdfminer.pdfinterp.PDFResourceManager()
    device = pdfminer.converter.TextConverter(
        rsrcmgr, retstr, codec="utf-8", laparams=laparams)
    interpreter = pdfminer.pdfinterp.PDFPageInterpreter(rsrcmgr, device)
    # use a context manager so the file is closed even if a page fails to parse
    with io.open(filepath, mode="rb") as fp:
        for page in pdfminer.pdfpage.PDFPage.get_pages(
            fp, set(), maxpages=0, caching=True, check_extractable=True
        ):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text.strip()


def extract_text_tika(filepath):
    """
    Extract text from a PDF at ``filepath`` using ``python-tika``.
    
    Args:
        filepath (str)
    
    Returns:
        str
    """
    # hiding the import so folks don't have to worry about installing it
    from tika import parser
    
    result = parser.from_file(filepath)
    # "content" may be None if Tika finds no text, so fall back to an empty string
    return (result.get("content") or "").strip()

In [8]:
out_data_dir = pathlib.Path("/Users/burtondewilde/Desktop/datakind/bloc/msvdd_Bloc/data/resumes")
for i, filepath in enumerate(filepaths):
    text = extract_and_clean_text(filepath)
    if not text:
        print("unable to extract text from", filepath)
        continue

    text = replace_pii(text)
    fname = "resume_{}.txt".format(i)
    with out_data_dir.joinpath(fname).open(mode="wt", encoding="utf-8") as f:
        f.write(text)

unable to extract text from /Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[024-024].pdf
unable to extract text from /Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[027-027].pdf


In [9]:
# save all resume texts as a zip archive
archive_filepath = out_data_dir.joinpath("resumes_data.zip")
if archive_filepath.is_file():
    os.remove(archive_filepath)
temp_filepath = shutil.make_archive("resumes_data", "zip", root_dir=out_data_dir)
shutil.move(temp_filepath, archive_filepath)

PosixPath('/Users/burtondewilde/Desktop/datakind/bloc/msvdd_Bloc/data/resumes/resumes_data.zip')
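
To pick up from here, the résumé texts can be read straight out of the zip archive with the standard library's `zipfile`. This is a minimal sketch: the function name `read_resume_texts` is hypothetical, and the tiny in-memory archive below merely stands in for `resumes_data.zip` so that the example runs on its own.

```python
import io
import zipfile


def read_resume_texts(archive, suffix=".txt"):
    """
    Read each text file in a zip archive into a dict of filename -> text.
    ``archive`` may be a filepath or a binary file-like object.
    """
    texts = {}
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if name.endswith(suffix):
                texts[name] = zf.read(name).decode("utf-8")
    return texts


# demo: a small in-memory archive standing in for resumes_data.zip
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr("resume_0.txt", "Jane Doe\nEDUCATION\n...")
    zf.writestr("notes.md", "not a résumé")
buffer.seek(0)
texts = read_resume_texts(buffer)
```

In practice, the path to the downloaded archive would be passed in place of `buffer`. Filenames are sorted as strings, so `resume_10.txt` comes before `resume_2.txt`.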