## Task 1: Résumé parsing and information extraction


### CoNVO

**Context:** Bloc is a career services management platform that builds smart career and data management tools for job-seekers and the organizations serving them. In particular, Bloc seeks to provide and facilitate access to tools for effectively presenting job-seekers' credentials and matching employers' job postings, and thereby improve outcomes.

**Need:** Many job-seekers come to Bloc's platform with a résumé already written. Forcing new users to re-enter all that information before they can utilize other tools (e.g. for résumé evaluation) is tedious, at best. This can also reduce time pressure during in-person sessions facilitated by Bloc, where every second counts.

**Vision:** Automated extraction of key information from existing résumés, submitted as PDFs while onboarding new users, in order to facilitate and streamline the process.

**Outcome:** A standalone, proof-of-concept process for extracting key résumé information and returning it as structured data, complete with unit tests and documentation on expected usage, limitations, and potential improvements.


### Data Summary

Bloc has provided ~125 résumés with a variety of styles, layouts, and contents, in PDF format. (Bonus: ~2400 résumés scraped from external sources...) Data quality seems good, and appears to be composed entirely of electronically-generated PDFs rather than (much more troublesome) scans of physical documents.

They typically include personal contact information, professional experience, education, and skills; they sometimes include information on other relevant experience (volunteering, leadership, side projects), professional and academic associations, honors and awards, and personal interests; they rarely include a professional objective / statement of purpose and references.

Since the amount of data is relatively small, and since résumés are so structured and standardized in terms of the information they include, a rules-based approach seems likely to succeed.


### Proposed Methodology

Cleanly extracting text from PDFs is tricky, since the format alters or throws out information for the sake of human-friendly layout, formatting, and such. Given this, it's best to use well-established tools for the text extraction, and highly accommodating parsing logic for the texts themselves. Rather than going full-bore on a complex, computer-vision based résumé parsing system, it'll be best to start with more foundational tools of text processing: regular expressions, fuzzy string matching, gazetteers/dictionaries, data sanitization/cleanup, and lots of trial-and-error.

See the code below for something to get you started.


### Definitions of Success

- **Baseline:** A function that accepts a résumé (TBD: as filepath or already-extracted text) and returns structured data for the most common résumé components: contact information, professional experience, education, and skills. The quality of the extracted values may be messy or not fully parsed, but shouldn't contain values for other components. Atypical résumé components may be skipped. This function should have basic unit tests and documentation.
- **Target:** A function that accepts a résumé and returns structured data for the most common résumé components (see Baseline), as well as other relevant experience, professional/academic associations, honors and awards, and personal interests. The quality of the extracted values should be almost fully parsed (e.g. no large blocks of relevant but unstructured text) and should not contain values for other components. Atypical résumé components may be skipped. This function should have unit tests covering a variety of expected scenarios and good documentation.
- **Stretch:** A function that accepts a résumé and returns structured data for any component that could be reasonably expected in such a document. The quality of the extracted values should be almost fully parsed (see Target). Particularly unusual résumé components may be skipped. This function should have comprehensive unit tests and documentation.

Note: We should try to get Bloc's buy-in / feedback on a schema, since they already ingest and store some of this data in their systems.


### Risks

It's possible that the information included in / extracted from résumés is too complex or varied for sufficiently accurate rules-based parsing, in which case a more sophisticated (ML- or DL-based) approach would be necessary, albeit impractical owing to time and data constraints. It's also possible that a rules-based approach is feasible, but too difficult / large a task for a single day's work.

Another separate risk deals with personally-identifiable information (PII), which is intrinsic to a résumé, but which DataKind typically prefers to strip out of the data assigned to volunteers. A relatively practical solution would entail extracting text from the PDFs beforehand, then replacing direct PII (name and contact info) with placeholder values, but we'd still have volunteers working with indirect PII such as education / employment history. DataKind may not be able to abide such a middle ground.

## Source Code

In [1]:
%load_ext watermark

### Dataset Generation

**Note:** Don't run this section! (Besides, you _can't_, because the raw data has not been made available to you.) This is just to help you understand the data's provenance. Instead, download the already-generated datasets from OneDrive. (There's a link on the Bloc Project Home document in Dropbox.)

In [2]:
import pathlib

from faker import Faker

import msvdd_bloc

In [3]:
%watermark -v -iv

CPython 3.7.4
IPython 7.8.0


In [4]:
FAKER = Faker(local="en_US")

In [5]:
out_dirpath = pathlib.Path("/Users/burtondewilde/Desktop/datakind/bloc/msvdd_Bloc/data/resumes")

#### Bloc Fellows' résumés

**Note:** Programmatically extracting text from a PDF with an atypical layout — such as a résumé — is _tricky_. Mistakes happen, and the results aren't always consistent with how a human would type it out.

I tried several options... The Python binding to Apache Tika (`python-tika`) seemed to give the nicest text extractions, although the JVM dependency is unfortunate. `textract` provides a convenient and consistent interface, but results are mediocre and installation involves a lot of extra packages. `pdfminer` and its many forks are highly customizable, but confusing to use and, to be honest, a hot mess as far as code quality goes. I'm surprised Python doesn't have a better solution to this problem, but _whatchagonnado_.

In [6]:
in_fellows_dirpath = pathlib.Path("/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows")
filepaths = msvdd_bloc.data.fileio.get_filepaths(in_fellows_dirpath, ".pdf")
print("# files:", len(filepaths))
filepaths[:3]

# files: 128


['/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[002-002].pdf',
 '/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[003-003].pdf',
 '/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[004-004].pdf']

In [7]:
text_files = []
for i, filepath in enumerate(filepaths):
    text = msvdd_bloc.data.resumes.extract_text_from_pdf(filepath, min_len=150)
    if not text:
        print("unable to extract text from", filepath)
        continue
    text = msvdd_bloc.data.resumes.clean_fellows_text(text)
    text = msvdd_bloc.data.resumes.replace_pii(text, faker=FAKER)
    fname = "fellows_resume_{}.txt".format(i)
    text_files.append((fname, text))

unable to extract text from /Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[024-024].pdf
unable to extract text from /Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/fellows/2018FellowsResumes[027-027].pdf


In [8]:
out_fellows_fpath = out_dirpath.joinpath("fellows_resumes.zip")
msvdd_bloc.data.fileio.save_text_files_to_zip(out_fellows_fpath, text_files)

#### Amina's bonus scraped résumés

**Note:** Still waiting on details from Amina...

In [9]:
in_bonus_dirpath = pathlib.Path("/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/bonus")
filepaths = msvdd_bloc.data.fileio.get_filepaths(in_bonus_dirpath, ".txt")
print("# files:", len(filepaths))
filepaths[:3]

# files: 2432


['/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/bonus/_resume_ab0iv8_hopewell-fluorescence-stainless-steel-haledon-nj.txt',
 '/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/bonus/_resume_ab2abk_available-upon-request-givers-cmo-icd-new-york-ny.txt',
 '/Users/burtondewilde/Desktop/datakind/bloc/raw_data/resumes/bonus/_resume_ab3i5x_july-2013-5th-pharm-upon-request-harrison-nj-07029.txt']

In [10]:
text_files = []
for i, filepath in enumerate(filepaths):
    text = msvdd_bloc.data.resumes.extract_text_from_pdf(filepath, min_len=150)
    if not text:
        print("unable to extract text from", filepath)
        continue
    text = msvdd_bloc.data.resumes.clean_bonus_text(text)
    text = msvdd_bloc.data.resumes.replace_pii(text, faker=FAKER)
    fname = "bonus_resume_{}.txt".format(i)
    text_files.append((fname, text))

In [11]:
out_bonus_fpath = out_dirpath.joinpath("bonus_resumes.zip")
msvdd_bloc.data.fileio.save_text_files_to_zip(out_bonus_fpath, text_files)

### Getting Started

In [3]:
import collections
import io
import operator
import pathlib
import re

from toolz import itertoolz

import msvdd_bloc

In [4]:
%watermark -v -iv

UsageError: Line magic function `%watermark` not found.


In [5]:
RE_BULLETS = re.compile(r"[\u25cf\u2022\u2023\u2043]", flags=re.UNICODE)
RE_BREAKING_SPACE = re.compile(r"(\r\n|[\n\v]){2,}", flags=re.UNICODE)
RE_NONBREAKING_SPACE = re.compile(r"[^\S\n\v]+", flags=re.UNICODE)
RE_MONTH = re.compile(
    r"(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|sep|september|oct|october|nov|november|dec|december)",
    flags=re.IGNORECASE
)
RE_YEAR = re.compile(r"((19|20)\d{2})")

SECTION_HEADERS = {
    "education": {
        "education",
    },
    "experience": {
        "experience",
        "work experience",
        "professional experience",
        "work & research experience",
        "relevant experience",
        "experiences",
        "additional experience",
        "leadership",
        "leadership experience",
        "leadership and service",
    },
    "skills": {
        "skills",
        "technical skills",
        "skills & expertise",
        "technological skills",
        "tools",
        "languages",
        "programming languages",
        "languages and technologies",
        "language and technologies",
    },
    "achievements": {
        "achievements",
        "awards",
        "honors",
        "honors & awards",
        "honors, awards, and memberships",
        "fellowships & awards",
        "awards and certifications",
    },
    "projects": {
        "projects",
        "side projects",
        "technical projects",
        "programming projects",
        "github projects",
        "other projects",
    },
    "activities": {
        "activities",
        "volunteering",
        "activities and student groups",
    }
}

In [37]:
def preprocess_resume_text(text):
    """
    Args:
        text (str)
        
    Returns:
        str
    """
    # clean up weird stuff
    text = RE_BULLETS.sub("-", text)
    # normalize whitespace
    text = RE_NONBREAKING_SPACE.sub(" ", text).strip()
    text = RE_BREAKING_SPACE.sub(r"\n\n", text)
    # TODO: any other roughness that can be consistently smoothed out
    return text


def get_section_idxs(lines):
    """
    Args:
        lines (List[str])
    
    Returns:
        List[Tuple[str, int]]
    """
    section_idxs = [("START", 0)]
    for idx, line in enumerate(lines):
        for section, headers in SECTION_HEADERS.items():
            if (
                any(line.lower() == header for header in headers) or
                any(line.lower().startswith(header + ":") for header in headers)
            ):
                section_idxs.append((section, idx))
    section_idxs.append(("END", len(lines)))
    return section_idxs


def get_section_lines(lines, section_idxs):
    """
    Args:
        lines (List[str])
        section_idxs (List[Tuple[str, int]])
    
    Returns:
        Dict[str, List[str]]
    """
    section_lines = collections.defaultdict(list)
    for (section, idx1), (_, idx2) in itertoolz.sliding_window(2, section_idxs):
        section_lines[section].extend(lines[idx1 : idx2])
    return dict(section_lines)


def parse_skills_section(lines):
    """
    Super rough example for extracting structured data from skills...
    
    Args:
        lines (List[str])
    
    Returns:
        Dict[str, List[str]]
    """
    skills = [
        skill.lstrip("- ")
        for line in lines
        for skill in re.split(r", +", line)
        if skill.strip() and
        skill.strip().lower() not in SECTION_HEADERS["skills"]
    ]
    return {"skills": skills}


def parse_education_section(lines):
    """
    Super rough example for extracting structured data from education...
    
    Args:
        lines (List[str])
    
    Returns:
        Dict[str, List[str]]
    """
    
    school = [line for line in lines if "University" in line or "university" in line][0]
    graduation = ''
    degree = ''
    coursework = []
    for line in lines:
        graduationSearch = re.search( r'.*Graduation: (.*)', line)
        if graduationSearch:
            graduation = graduationSearch.group(1)
            break
        degreeSearch = re.search( r')
        courses = re.search( r'coursework: )
        

    return {"school": school,
            "graduationDate": graduation,
            "coursework": coursework}
    

# and so on and so forth

In [55]:
resumes_fpath ="/Users/anluc/source/repos/DataKind/msvdd_Bloc/data/resumes/fellows_resumes.zip"
for fname, text in msvdd_bloc.data.fileio.load_text_files_from_zip(resumes_fpath):
    text = preprocess_resume_text(text)
    lines = [line.strip() for line in text.split("\n")]
    section_idxs = get_section_idxs(lines)
    section_lines = get_section_lines(lines, section_idxs)
    break  # just stopping here so we can test things out

In [56]:
section_idxs

[('START', 0),
 ('experience', 3),
 ('experience', 31),
 ('education', 44),
 ('projects', 53),
 ('skills', 58),
 ('END', 65)]

In [51]:
section_lines.get("education", [])

['EDUCATION',
 'University of Minnesota - Twin Cities',
 'Bachelor of Science Computer Science \u200bAnticipated Graduation: May 2020',
 'Relevant coursework:',
 '',
 '- Introduction to C/C++ Programming for Scientists and Engineers; Introduction to Algorithms, Data',
 'Structures, and Program Development; Advanced Programming Principles; Machine Architecture and',
 'Organization; Algorithms & Data Structures',
 '']

In [52]:
parse_education_section(section_lines.get("education", []))

{'school': 'University of Minnesota - Twin Cities',
 'graduationDate': 'May 2020'}

In [53]:
section_lines.get("skills", [])

['SKILLS',
 '',
 '- Java, Python, C/C++, C#, Git, MySQL',
 '- HTML, CSS, JavaScript, PHP, Sass, LESS',
 '',
 '- React, Angular, NodeJS',
 '- AWS, Drupal, WordPress']

In [54]:
parse_skills_section(section_lines.get("skills", []))

{'skills': ['Java',
  'Python',
  'C/C++',
  'C#',
  'Git',
  'MySQL',
  'HTML',
  'CSS',
  'JavaScript',
  'PHP',
  'Sass',
  'LESS',
  'React',
  'Angular',
  'NodeJS',
  'AWS',
  'Drupal',
  'WordPress']}

_just curious_: what are the most common section headers?

In [10]:
header_counts = collections.Counter()
for fname, text in msvdd_bloc.data.fileio.load_text_files_from_zip(resumes_fpath):
    text = preprocess_resume_text(text)
    lines = [line.strip() for line in text.split("\n")]
    section_idxs = get_section_idxs(lines)
    header_counts.update(lines[idx].lower() for _, idx in section_idxs if idx != len(lines))
[item for item in header_counts.most_common() if item[1] > 1]

[('education', 86),
 ('projects', 45),
 ('skills', 44),
 ('experience', 41),
 ('work experience', 26),
 ('technical skills', 21),
 ('awards', 17),
 ('languages', 10),
 ('leadership', 9),
 ('professional experience', 8),
 ('relevant experience', 4),
 ('achievements', 4),
 ('education:', 4),
 ('activities', 4),
 ('experience:', 3),
 ('leadership experience', 3),
 ('additional experience', 2),
 ('side projects', 2),
 ('volunteering', 2),
 ('experiences', 2),
 ('honors & awards', 2),
 ('languages and technologies', 2),
 ('programming projects', 2),
 ('projects:', 2),
 ('tools', 2),
 ('1', 2)]