## Task 1: Résumé parsing and information extraction


### CoNVO

**Context:** Bloc is a career services management platform that builds smart career and data management tools for job-seekers and the organizations serving them. In particular, Bloc seeks to provide and facilitate access to tools for effectively presenting job-seekers' credentials and matching employers' job postings, and thereby improve outcomes.

**Need:** Many job-seekers come to Bloc's platform with a résumé already written. Forcing new users to re-enter all that information before they can use other tools (e.g. for résumé evaluation) is tedious at best. Importing it automatically would also reduce time pressure during in-person sessions facilitated by Bloc, where every second counts.

**Vision:** Automated extraction of key information from existing résumés, submitted as PDFs while onboarding new users, to streamline that onboarding process.

**Outcome:** A standalone, proof-of-concept process for extracting key résumé information and returning it as structured data, complete with unit tests and documentation on expected usage, limitations, and potential improvements.


### Data Summary

Bloc has provided ~125 résumés in PDF format, with a variety of styles, layouts, and contents. (Bonus: ~2400 résumés scraped from external sources...) Data quality seems good: the set appears to consist entirely of electronically generated PDFs rather than (much more troublesome) scans of physical documents.

The résumés typically include personal contact information, professional experience, education, and skills; they sometimes include information on other relevant experience (volunteering, leadership, side projects), professional and academic associations, honors and awards, and personal interests; they rarely include a professional objective / statement of purpose or references.

Since the amount of data is relatively small, and since résumés are so structured and standardized in terms of the information they include, a rules-based approach seems likely to succeed.


### Proposed Methodology

Cleanly extracting text from PDFs is tricky, since the format alters or throws out information for the sake of human-friendly layout, formatting, and such. Given this, it's best to use well-established tools for the text extraction, and highly accommodating parsing logic for the texts themselves. Rather than going full-bore on a complex, computer-vision based résumé parsing system, it'll be best to start with more foundational tools of text processing: regular expressions, fuzzy string matching, gazetteers/dictionaries, data sanitization/cleanup, and lots of trial-and-error.
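
As a rough illustration of the "fuzzy string matching plus gazetteer" idea, here is a minimal sketch that uses only the standard library's `difflib`. The `KNOWN_HEADERS` mapping, the `match_section_header` name, and the 0.85 cutoff are assumptions made for this example, not part of the project code; the cutoff in particular would need tuning against real résumés.

```python
import difflib

# Hypothetical mini-gazetteer: known header spellings mapped to canonical sections.
KNOWN_HEADERS = {
    "education": "education",
    "work experience": "experience",
    "professional experience": "experience",
    "skills": "skills",
    "technical skills": "skills",
    "honors & awards": "achievements",
}


def match_section_header(line, cutoff=0.85):
    """Return the canonical section for a header-like line, or None."""
    candidate = line.strip().rstrip(":").lower()
    # get_close_matches returns the best-scoring candidates first
    matches = difflib.get_close_matches(
        candidate, list(KNOWN_HEADERS), n=1, cutoff=cutoff
    )
    return KNOWN_HEADERS[matches[0]] if matches else None
```

A lower cutoff tolerates more typos and extraction noise, at the cost of ordinary body lines being mistaken for headers.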

See the code below for something to get you started.


### Definitions of Success

- **Baseline:** A function that accepts a résumé (TBD: as filepath or already-extracted text) and returns structured data for the most common résumé components: contact information, professional experience, education, and skills. The quality of the extracted values may be messy or not fully parsed, but shouldn't contain values for other components. Atypical résumé components may be skipped. This function should have basic unit tests and documentation.
- **Target:** A function that accepts a résumé and returns structured data for the most common résumé components (see Baseline), as well as other relevant experience, professional/academic associations, honors and awards, and personal interests. The quality of the extracted values should be almost fully parsed (e.g. no large blocks of relevant but unstructured text) and should not contain values for other components. Atypical résumé components may be skipped. This function should have unit tests covering a variety of expected scenarios and good documentation.
- **Stretch:** A function that accepts a résumé and returns structured data for any component that could be reasonably expected in such a document. The quality of the extracted values should be almost fully parsed (see Target). Particularly unusual résumé components may be skipped. This function should have comprehensive unit tests and documentation.

Note: We should try to get Bloc's buy-in / feedback on a schema, since they already ingest and store some of this data in their systems.


### Risks

It's possible that the information included in / extracted from résumés is too complex or varied for sufficiently accurate rules-based parsing, in which case a more sophisticated (ML- or DL-based) approach would be necessary, albeit impractical owing to time and data constraints. It's also possible that a rules-based approach is feasible, but too difficult / large a task for a single day's work.

A separate risk concerns personally identifiable information (PII), which is intrinsic to a résumé but which DataKind typically prefers to strip out of the data assigned to volunteers. A relatively practical solution would be to extract text from the PDFs beforehand, then replace direct PII (name and contact info) with placeholder values, but volunteers would still be working with indirect PII such as education and employment history. DataKind may not be able to accept such a middle ground.
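
The placeholder replacement could look something like the sketch below. The patterns, the placeholder strings, and the `redact_direct_pii` name are all assumptions for illustration; the patterns are deliberately loose and will miss some formats (for example, phone numbers written with Unicode hyphens), so any real redaction pass would need manual spot checks.

```python
import re

# Loose, illustrative patterns: not exhaustive, and not validated on the real data.
RE_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
RE_URL = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)
RE_PHONE = re.compile(
    r"(?:\+?\d{1,3}[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?:x\d+)?"
)


def redact_direct_pii(text, name=None):
    """Replace emails, URLs, phone numbers, and an optional known name."""
    text = RE_EMAIL.sub("[EMAIL]", text)
    text = RE_URL.sub("[URL]", text)
    text = RE_PHONE.sub("[PHONE]", text)
    if name:
        # the name must be supplied by the caller; it is not detected here
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text
```

Note that this only covers direct PII; employers, schools, and dates are left untouched.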

### Getting Started

In [57]:
import collections
import io
import operator
import pathlib
import re

from toolz import itertoolz

import msvdd_bloc

In [58]:
RE_BULLETS = re.compile(r"[\u25cf\u2022\u2023\u2043]", flags=re.UNICODE)
RE_BREAKING_SPACE = re.compile(r"(\r\n|[\n\v]){2,}", flags=re.UNICODE)
RE_NONBREAKING_SPACE = re.compile(r"[^\S\n\v]+", flags=re.UNICODE)
RE_MONTH = re.compile(
    # optional suffixes so that full names match whole ("january", not just "jan")
    r"(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)",
    flags=re.IGNORECASE
)
RE_YEAR = re.compile(r"((19|20)\d{2})")

SECTION_HEADERS = {
    "education": {
        "education",
    },
    "experience": {
        "experience",
        "work experience",
        "professional experience",
        "work & research experience",
        "relevant experience",
        "experiences",
        "additional experience",
        "leadership",
        "leadership experience",
        "leadership and service",
    },
    "skills": {
        "skills",
        "technical skills",
        "skills & expertise",
        "technological skills",
        "tools",
        "languages",
        "programming languages",
        "languages and technologies",
        "language and technologies",
    },
    "achievements": {
        "achievements",
        "awards",
        "honors",
        "honors & awards",
        "honors, awards, and memberships",
        "fellowships & awards",
        "awards and certifications",
    },
    "projects": {
        "projects",
        "side projects",
        "technical projects",
        "programming projects",
        "github projects",
        "other projects",
    },
    "activities": {
        "activities",
        "volunteering",
        "activities and student groups",
    }
}
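
The month and year patterns are not yet used by the starter functions. One way they might be combined, sketched here under assumptions, is to pull "Month YYYY" date ranges out of experience lines. The `extract_date_range` name and the `YYYY-MM` output format are choices made for this example, and numeric formats such as `06/2018` are not handled.

```python
import re

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
RE_MONTH_YEAR = re.compile(
    r"\b(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sept?(?:ember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
    r"\.?\s+((?:19|20)\d{2})\b",
    flags=re.IGNORECASE,
)


def extract_date_range(line):
    """
    Return (start, end) as "YYYY-MM" strings, or None if the line has no date.
    `end` is None when only one date is found (e.g. "June 2018 - Present").
    """
    dates = [
        "{}-{:02d}".format(year, MONTHS.index(month[:3].lower()) + 1)
        for month, year in RE_MONTH_YEAR.findall(line)
    ]
    if not dates:
        return None
    return dates[0], (dates[1] if len(dates) > 1 else None)
```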

In [204]:
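def preprocess_resume_text(text):
    """
    Normalize bullets and whitespace in text extracted from a résumé.

    Args:
        text (str)

    Returns:
        str
    """
    # replace bullet characters with plain hyphens
    text = RE_BULLETS.sub("-", text)
    # collapse runs of non-breaking whitespace into a single space,
    # and runs of 2+ line breaks into a single blank line
    text = RE_NONBREAKING_SPACE.sub(" ", text).strip()
    text = RE_BREAKING_SPACE.sub(r"\n\n", text)
    # TODO: any other roughness that can be consistently smoothed out
    return text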


def get_section_idxs(lines):
    """
    Find the line index at which each recognized section starts.

    Args:
        lines (List[str])

    Returns:
        List[Tuple[str, int]]: (section name, start index) pairs, bracketed
            by "START" and "END" sentinels
    """
    section_idxs = [("START", 0)]
    for idx, line in enumerate(lines):
        line_lower = line.lower()
        for section, headers in SECTION_HEADERS.items():
            # a header is either the whole line or a "header:" prefix
            if any(
                line_lower == header or line_lower.startswith(header + ":")
                for header in headers
            ):
                section_idxs.append((section, idx))
    section_idxs.append(("END", len(lines)))
    return section_idxs


def get_section_lines(lines, section_idxs):
    """
    Args:
        lines (List[str])
        section_idxs (List[Tuple[str, int]])
    
    Returns:
        Dict[str, List[str]]
    """
    section_lines = collections.defaultdict(list)
    for (section, idx1), (_, idx2) in itertoolz.sliding_window(2, section_idxs):
        section_lines[section].extend(lines[idx1 : idx2])
    return dict(section_lines)

# and so on and so forth
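
Text extracted from PDFs often carries debris that defeats exact header matching: zero-width spaces, Unicode hyphens, ligatures, and `(cid:N)` placeholders for glyphs the extractor could not map. The sketch below is one possible extra cleanup pass for the TODO in `preprocess_resume_text`; the function name and the exact character sets are assumptions, chosen from the kinds of noise seen in extracted résumé text.

```python
import re
import unicodedata

RE_ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")
RE_CID = re.compile(r"\(cid:\d+\)")
RE_DASHES = re.compile(r"[\u2010\u2011\u2012\u2013\u2014]")


def normalize_extracted_text(text):
    """Smooth out common PDF-extraction debris before section splitting."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature into "fi"
    text = unicodedata.normalize("NFKC", text)
    text = RE_ZERO_WIDTH.sub("", text)
    text = RE_CID.sub("", text)
    text = RE_DASHES.sub("-", text)
    return text
```

With zero-width spaces removed, a header such as `"Technical Skills\u200b:"` becomes `"Technical Skills:"` and would satisfy the `header + ":"` check in `get_section_idxs`.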

### Preprocess and Assemble Resumes

In [213]:
resumes_fpath = "/Users/anluc/source/repos/DataKind/msvdd_Bloc/data/resumes/fellows_resumes.zip"
resumes = {}
for fname, text in msvdd_bloc.data.fileio.load_text_files_from_zip(resumes_fpath):
    text = preprocess_resume_text(text)
    lines = [line.strip() for line in text.split("\n")]
    section_idxs = get_section_idxs(lines)
    section_lines = get_section_lines(lines, section_idxs)
    
    # Key each résumé by its first pre-section line, which is usually the name;
    # résumés with no text before the first section header are skipped
    start = section_lines.get("START")
    if start:
        print(start)
        name = start[0]  # TODO: come up with a better method to determine name
        resumes[name] = section_lines

['David Hill', '7574 Stacey Rue Suite 620 Hernandezfort, MA 01644 | 747-444-2791x82067 | sharon26@hotmail.com', '']
['Kayla Frost', 'austinrice@parks-neal.biz', '', '+1-551-219-5669x272', 'https://www.phillips.biz/', '']
['Stacey Miller', '4346, 7884 Arias Course Suite 591 Youngmouth, ID 79995 | markwilliams@hotmail.com | (347)‐748‐0333', '', '', '']
['Matthew Banks', '', '537-547-1777x09099 | fordvincent@gmail.com', 'Github.com/Abel2Code | Linkedin.com/in/Abel-Salinas', '', '']
['Adam Espinoza 273 Michael Ways Hamiltontown, LA 74895', 'Adespinoza in/adamespinoza vhernandez@yahoo.com Ÿ 675.289.5460x8199', '']
['Amanda Lin', '', '(329)180-0989 - chelsea54@nichols.com - Unit 0633 Box 0134 DPO AA 16987']
['900 Cancho Drive', 'La Habra Heights', 'California 90631', '', 'Alexander S. Garcia', '294.706.7930x2018', '', 'joseph33@hotmail.com', 'http://www.johnson-hutchinson.com/', '']
['Alexis Herrera Email : tina64@hotmail.com', 'halexis.me Mobile : (915)827-0406', '']
['Alexy Cruz stephen60@

['- Designed and implemented compatibility and stable matching algorithms', 'to ensure the best possible matches were always made.', '', '- Designed database schema and project architecture to be performant', 'and scalable.', '', '- Used test-driven development and continuous integration to ensure we', 'could quickly iterate without breaking anything.', '', '- Built using Node.js/MongoDB/jQuery/HTML/CSS.', '', 'BruinMeet - Product Manager / Lead Developer', 'February 2017 - Present', '', '- Transformed the coding culture at UCLA to be more beginner friendly.', '- Founded Hack School, an eight week long coding bootcamp that', '', 'teaches full stack JavaScript web development to over 100 students', 'each quarter.', '', '- Founded Hack On The Hill, a beginner focused hackathon with over 100', 'participants.', '', 'ACM Hack - President', 'May 2016 - February 2017', '']
['Bonnie Guerra', '', '(cid:4)(cid:1231)(cid:1229)(cid:1232)(cid:20)(cid:1219)(cid:1225) (cid:20)(cid:1236)(cid:1237)(cid

['1', '', 'Ignacio Delgado-Cay', '274-075-1113x87842', '', 'http://brown-mills.biz/', 'fisherkristina@mcdonald-estrada.biz', '']
['s _ 5', '', '06/2018 - Ongoing', '', '06/2017 - 08/2017', '', '08/2015 - Ongoing', '', 'JavaScript Python Git/Github', '', 'Swift Java 3D Graphics XML', '', 'NodeJs Ruby on Rails SQL React', '', 'HTML/CSS Postgres WebGL', '', '12/2017 - Ongoing', '', '01/2017 - 05/2017', '', '10/2016', '', 'www.enhancv.com Powered by/', '', 'Ivory Brown', 'Software Engineering Intern', '', '510-340-8065 ivoryanna.brown@duke.edu https://github.com/itb2', '']
['Patricia Conway', '001-776-801-6305x9093', '', 'aren D. Lynch', 'github.com/Jaren831', '', 'anthonybaker@tran.com', 'linkedin.com/in/jaren-lynch', '', 'E PL ENT', 'Mobile Engineering Intern Spotify June 2018 - Present', '', '- Member of N C-Infra team. N C-Infra is responsible for the development and maintenance of the', "development infrastructure used by teams working on Spotify's iOS, Android, and Web platforms.", '

['Katherine Smith', '496.808.1043x0521 - susanwoodward@bullock.info - linkedin.com/in/laura-godinez-87b770112/ - github.com/lgodz15', '']
['Joshua Mccarthy', 'williamhernandez@hotmail.com', '', 'Linkedin.com/in/lawrence-lawson/', '', 'Github.com/zagan202', '', '001-089-030-9027', '']
['Todd Scott', '', 'CALONA', '', 'Contact', 'mandy42@mccormick.com', '', '+1-463-071-4960x342', '', 'Luis_Calona', '', 'linkedin.com/in/luiscalona', '', 'lcalona', '', 'Summary', '', 'I am a passionate second-', 'year student enthusiastic to', 'learn and explore more in', 'the �eld of Computer', 'Science and Software', 'Engineering. I am currently', 'seeking to continue', 'building my skills and', 'gaining hands-on, real-life', 'experience in the tech', 'industry.', '']
['\u200bMahamed Yusuf', 'linkedin.com/in/mahamedy cell: 590.908.3114x84498 email: jeffery83@hansen-olson.com', '', 'Technical Skills\u200b:', '', '- Programming Languages: Java, Python, JS, PHP, Ocaml, Swift (Learning)', '- Web Frameworks: 

['Tracy Ramos', '147.299.7554 Ÿ wallspaul@hotmail.com', '']
['Olivia Jackson', '050-263-2165 ⇧ kcummings@miller.biz', '5513 Lisa Pines Blakeview, IL 82882', '', 'https://www.morales.com/', '']
['Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', '', 'Nicolas Magaña', 'Email: connerraymond@ellis-hopkins.biz | Phone: (714) 420–7273 Github: https://github.com/nicooo21', '', 'Snap Score: 210,933', '_________________________________________________________________________ Objective', 'Seeking a Software Engineering Internship in Backend or Full Stack development.', '_________________________________________________________________________ Education', 'University of California, Los Angeles (UCLA) Expected Graduation: March 2019', 'Bachelor of Science: Electrical Engineering GPA: 3.19', '________________________________________________________________________']
['Oluwatamilore Tami Olafunmiloye', 'cookjohn@yahoo.com (741)669-0320x05412', '']
['Emily Rig
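
The START sections printed above mix names, phone numbers, emails, and profile links, often on a single line with assorted separators. A minimal sketch of pulling the contact fields out with regular expressions follows; the patterns and the `extract_contact_fields` name are assumptions for illustration, the sample line in the test data is made up, and phone extensions and non-US formats are ignored.

```python
import re

RE_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
RE_PHONE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")
RE_PROFILE = re.compile(
    r"(github|linkedin)\.com/(?:in/)?([\w-]+)", flags=re.IGNORECASE
)


def extract_contact_fields(lines):
    """Return the first email and phone found, plus any profile links."""
    text = "\n".join(lines)
    email = RE_EMAIL.search(text)
    phone = RE_PHONE.search(text)
    profiles = [
        {"network": m.group(1).lower(), "username": m.group(2), "url": m.group(0)}
        for m in RE_PROFILE.finditer(text)
    ]
    return {
        "email": email.group(0) if email else "",
        "phone": phone.group(0) if phone else "",
        "profiles": profiles,
    }
```

The returned keys mirror the `email`, `phone`, and `profiles` fields of the JSON Resume `basics` object, so the result could be merged into a larger record.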

### Parse Sections
The return values are meant to follow the JSON Resume schema: https://jsonresume.org/schema/

In [214]:
def parse_basic_information(lines):
    """
    Basic parsing of resume information
    
    Args:
        lines (List[str])
    
    Returns:
        "basics": {
            "name": "John Doe",
            "label": "Programmer",
            "picture": "",
            "email": "john@gmail.com",
            "phone": "(912) 555-4321",
            "website": "http://johndoe.com",
            "summary": "A summary of John Doe...",
            "location": {
              "address": "2712 Broadway St",
              "postalCode": "CA 94115",
              "city": "San Francisco",
              "countryCode": "US",
              "region": "California"
            },
            "profiles": [{
              "network": "Twitter",
              "username": "john",
              "url": "http://twitter.com/john"
            }]
          }
    """
    
    # Usually first attribute is the name
    name = lines[0]
    label = ''
    picture = ''
    email = ''
    phone = ''
    website = ''
    summary = ''
    location = {}
    profiles = []
    
    # TODO: enable once validated against more résumés
    #for line in lines:
        #phoneSearch = re.search(r'\(?(\d{3})\)?\D{0,2}(\d{3})\D?(\d{4})', line)
        #if phoneSearch:
            #phone = "-".join(phoneSearch.groups())
        #
        #emailSearch = re.search(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', line)
        #if emailSearch:
            #email = emailSearch.group(0)

    return {
        "name": name,
        "label": label,
        "picture": picture,
        "email": email,
        "phone": phone,
        "website": website,
        "summary": summary,
        "location": location,
        "profiles": profiles
    }

def parse_work_section(lines):
    """
    Parsing of work section
    
    Args:
        lines (List[str])
    
    Returns:
        "work": [{
            "company": "Company",
            "position": "President",
            "website": "http://company.com",
            "startDate": "2013-01-01",
            "endDate": "2014-01-01",
            "summary": "Description...",
            "highlights": [
              "Started the company"
            ]
          }]
    """
    # TODO
    work = []

    return work

def parse_education_section(lines):
    """
    Super rough example for extracting structured data from education...

    Args:
        lines (List[str])

    Returns:
        List[Dict[str, Any]]: one entry per institution found
    """

    education = []

    institution = ''
    area = ''
    studyType = ''
    startDate = ''
    endDate = ''
    gpa = ''
    courses = []
    courseWorkNextLines = False
    for line in lines:
        institutionSearch = re.search(r'(.*)(university|school)(.*)(graduation|graduated|expected)(.*)', line, re.I)
        if institutionSearch:
            courseWorkNextLines = False  # Start of another school's information
            if institution != '':
                education.append(
                    {"institution": institution,
                     "area": area,
                     "studyType": studyType,
                     "startDate": startDate,
                     "endDate": endDate,
                     "gpa": gpa,
                     "courses": courses}
                )

            # Clear fields
            area = ''
            studyType = ''
            startDate = ''
            endDate = ''
            gpa = ''
            courses = []

            # Set institution
            institution = institutionSearch.group(1) + institutionSearch.group(2) + institutionSearch.group(3)
        graduationSearch = re.search(r'.*(graduation|graduated|expected)(.*)', line, re.I)
        if graduationSearch:
            endDate = graduationSearch.group(2).lstrip(":").strip()
        # Degree keywords are anchored at a word boundary so that words such as
        # "jobs" or "programs" don't count; the area of study is assumed to follow
        # the first comma, and lines without a comma are skipped
        for pattern, degree in (
            (r'\b(bs|b\.s|bachelor)', 'Bachelor'),
            (r'\b(ms|m\.s|master)', 'Master'),
            (r'\bphd', 'PhD'),
        ):
            parts = line.split(",")
            if re.search(pattern, line, re.I) and len(parts) > 1:
                area = parts[1]
                studyType = degree
        courseSearch = re.search(r'.*coursework:(.*)', line, re.I)
        if courseSearch:
            courseWorkNextLines = True

            if courseSearch.group(1) == '':
                # Need to extract from next lines
                continue

            for course in re.split(";|,", courseSearch.group(1).lstrip("- ")):
                courses.append(course.strip())
        # NOTE: a "coursework:" line with inline courses also reaches this branch,
        # so its courses are appended a second time
        if courseWorkNextLines and line != '':
            for course in re.split(";|,", line.lstrip("- ")):
                courses.append(course.strip())

    # Add last institution
    education.append(
        {"institution": institution,
         "area": area,
         "studyType": studyType,
         "startDate": startDate,
         "endDate": endDate,
         "gpa": gpa,
         "courses": courses}
    )

    return education

def parse_skills_section(lines):
    """
    Parsing of skill section
    
    Args:
        lines (List[str])
    
    Returns:
        "skills": [{
            "name": "Web Development",
            "level": "Master",
            "keywords": [
              "HTML",
              "CSS",
              "Javascript"
            ]
          }]
    """
    # TODO: return object that follows JSON schema
    skills = [
        skill.lstrip("- ")
        for line in lines
        for skill in re.split(r", +", line)
        if skill.strip() and
        skill.strip().lower() not in SECTION_HEADERS["skills"]
    ]
    return skills
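
For the TODO above, one possible shape for schema-style skills output is sketched below. It assumes each line looks like `Category: item, item, and item`; the `parse_skill_line` name is hypothetical, and proficiency qualifiers such as "Advanced in" are left inside the keywords rather than mapped to `level`.

```python
import re


def parse_skill_line(line):
    """Turn one skills line into a dict shaped like a JSON Resume skill."""
    line = line.lstrip("- ").strip()
    name, sep, rest = line.partition(":")
    if not sep:
        # no "Category:" prefix, so treat the whole line as keywords
        name, rest = "", line
    keywords = [
        # drop a leading "and" left over from "x, y, and z" lists
        re.sub(r"^and\s+", "", keyword.strip(), flags=re.IGNORECASE)
        for keyword in re.split(r"[,;]", rest)
        if keyword.strip()
    ]
    return {"name": name.strip(), "level": "", "keywords": keywords}
```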



In [215]:
# Take Stacey Miller as an example
exampleresume = {}
resumeLines = resumes['Stacey Miller']
basics = parse_basic_information(resumeLines["START"])
# default to an empty list so that a missing section doesn't break the parsers
education = parse_education_section(resumeLines.get("education", []))
skills = parse_skills_section(resumeLines.get("skills", []))
work = parse_work_section(resumeLines.get("experience", []))

exampleresume["basics"] = basics
exampleresume["work"] = work
exampleresume["education"] = education
exampleresume["skills"] = skills

exampleresume

{'basics': {'name': 'Stacey Miller',
  'label': '',
  'picture': '',
  'email': '',
  'phone': '',
  'website': '',
  'summary': '',
  'location': {},
  'profiles': []},
 'work': [],
 'education': [{'institution': 'Columbia University, The Fu Foundation of Engineering and Applied Science ',
   'area': ' Computer Science',
   'studyType': 'Bachelor',
   'startDate': '',
   'endDate': 'May 2019',
   'gpa': '',
   'courses': ['Artificial Intelligence',
    'Advanced Programming',
    'Data Structures in Java',
    'Python',
    'Computer',
    'Relevant Coursework: Artificial Intelligence',
    'Advanced Programming',
    'Data Structures in Java',
    'Python',
    'Computer',
    'Science Theory',
    'Linear Algebra']},
  {'institution': 'Riverside High School, Durham, NC ',
   'area': '',
   'studyType': '',
   'startDate': '',
   'endDate': 'June 2015',
   'gpa': '',
   'courses': []}],
 'skills': ['Programming Languages: Advanced in Java',
  'Python',
  'and C/C++',
  'Basic in Java