---
# Basic Usage

Transform a résumé file (PDF) into structured data (JSON).

In [1]:
from pprint import pprint

import msvdd_bloc.resumes

In [2]:
filepath = msvdd_bloc.ROOT_DIR.joinpath("tests", "data", "fake-resume.pdf")

In [3]:
resume_text = msvdd_bloc.resumes.extract_text_from_pdf(str(filepath))
print(resume_text)

fake-resume


JOHN DOE 
123-456-7890    |    john.doe@fake.com    |    123 Fake St. Apt 456, Fake City, BS 78910 

SUMMARY 
Hard-working, self-motivated individual seeking a position in the field of data science to 
develop my professional skills and do good in the world. 

EXPERIENCE 

Senior Data Scientist,  Bloc;  Chicago, IL    |    Aug 2019 – Present 

• Led a team of volunteers to hack on a data project that proved trickier than anticipated 

• Developed code to generate fake data and train real models to parse résumé, then 
wrote documentation, tests, and scripts for successful usage of project outputs 

Data Scientist,  Datakind;  New York, NY    |    Mar. 2015 – Aug. 2019 

Data do-gooder who just can’t say no to Jake Porway. Contributed to many projects in 
many roles, from Data Creative scoping out potential work to event photographer at one 
of DataKind’s biggest gatherings ever. 

EDUCATION 
University of Fake State – Fake City    |    Sep 2007 - Aug 2012 

Ph.D. in Physic

In [4]:
resume_data = msvdd_bloc.resumes.parse_text(resume_text)
pprint(resume_data, width=120)

{'basics': {'email': 'john.doe@fake.com',
            'location': {'address': '123 Fake St. Apt 456',
                         'city': 'Fake City',
                         'postal_code': '78910',
                         'region': 'BS'},
            'name': 'JOHN DOE',
            'phone': '123-456-7890',
            'summary': 'Hard-working, self-motivated individual seeking a position in the field of data science to\n'
                       'develop my professional skills and do good in the world.'},
 'education': [{'area': 'Physics; Minor in Applied Math',
                'end_date': 'Aug 2012',
                'institution': 'University of Fake State – Fake City',
                'start_date': 'Sep 2007',
                'study_type': 'Ph.D.'},
               {'area': 'Computer Engineering',
                'courses': ['Data Structures & Algorithms',
                            'Database Administration',
                            'Coding 101',
                            'Explo

---

# How it Works

Under the hood, the extracted résumé text is first cleaned up and standardized, then split into lines that are associated with a particular section, such as "basics" or "education". Each section's lines are then tokenized into constituent words, featurized into sequences of numeric and categorical features, and individually tagged with labels such as "name" or "institution". These sequences of labeled tokens are then parsed into structured data, which often involves filtering out "junk" tokens and combining like adjacent tokens into contiguous text strings, such as "JOHN DOE" and "University of Fake State – Fake City". Lastly, the resulting data is validated according to a declared schema.

The code for the high-level parsing function is relatively straightforward, on its surface:

In [5]:
msvdd_bloc.resumes.parse_text??

Signature: msvdd_bloc.resumes.parse_text(text)
Source:
def parse_text(text):
    """
    Parse raw extracted résumé ``text`` into structured data conforming to the schema
    specified in :class:`schemas.ResumeSchema()`.

    Args:
        text (str)

    Returns:
        Dict[str, object]
    """
    data = {}

    norm_text = munge.normalize_text(text)
    text_lines = munge.get_filtered_text_lines(norm_text

Normalizing text clears up any text encoding weirdness, truncates long stretches of spaces/newlines, and, very importantly, transforms a wide variety of bullet symbols into a consistent `-`.

In [6]:
norm_text = msvdd_bloc.resumes.munge.normalize_text(resume_text)
text_lines = msvdd_bloc.resumes.munge.get_filtered_text_lines(norm_text)
print(norm_text[-200:])

s & Algorithms, Database Administration, Coding 101, 
Exploratory Data Analysis 

SKILLS 

- Programming Languages: Python, SQL, HTML/CSS 

- English (native), Spanish (conversational), French (basic)
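
To make the normalization step a bit more concrete, here is a minimal sketch of that kind of cleanup using only the standard library. It is not the actual `munge.normalize_text` implementation: the function name `normalize_text_sketch`, the set of bullet symbols, and the length limits are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumed bullet symbols; the real function may recognize a different set.
RE_BULLET = re.compile(r"^[ \t]*[•●▪‣◦∙][ \t]*", flags=re.MULTILINE)
RE_LONG_SPACES = re.compile(r"[ \t]{5,}")
RE_LONG_NEWLINES = re.compile(r"\n{3,}")


def normalize_text_sketch(text):
    """Roughly mimic the cleanup described above (illustrative only)."""
    # Unify equivalent Unicode representations of the same characters.
    text = unicodedata.normalize("NFC", text)
    # Replace a leading bullet symbol on any line with a consistent "- ".
    text = RE_BULLET.sub("- ", text)
    # Cap runs of spaces/tabs at 4 and runs of newlines at 2 (assumed limits).
    text = RE_LONG_SPACES.sub("    ", text)
    text = RE_LONG_NEWLINES.sub("\n\n", text)
    return text.strip()


print(normalize_text_sketch("• Led a team\n\n\n\n●   Built things      fast"))
```

Converting every bullet variant to the same symbol matters because later steps can then treat `-` as a reliable signal that a new item begins.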


Next, we iterate through lines of normalized text, searching for matches to known patterns that resemble section headers, like `"Education:"` or `"SKILLS"`. These patterns are expressed as regular expressions, and are easily updated if a new, unambiguous pattern is encountered.

In [7]:
section_lines = msvdd_bloc.resumes.segment.get_section_lines(text_lines)
section_lines["skills"]

['',
 '- Programming Languages: Python, SQL, HTML/CSS',
 '- English (native), Spanish (conversational), French (basic)']
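
The header-matching idea can be sketched in a few lines. This is only an illustration, not the library's `segment.get_section_lines`: the `SECTION_HEADERS` patterns, the function name, and the choice to start in a "basics" section are assumptions.

```python
import re
from collections import defaultdict

# Hypothetical header patterns, one regular expression per section.
SECTION_HEADERS = {
    "summary": re.compile(r"^(summary|objective):?$", flags=re.IGNORECASE),
    "experience": re.compile(r"^(work )?experience:?$", flags=re.IGNORECASE),
    "education": re.compile(r"^education:?$", flags=re.IGNORECASE),
    "skills": re.compile(r"^skills:?$", flags=re.IGNORECASE),
}


def get_section_lines_sketch(lines, start_section="basics"):
    """Assign each line to the most recently seen section header."""
    sections = defaultdict(list)
    current = start_section
    for line in lines:
        stripped = line.strip()
        matched = next(
            (name for name, pat in SECTION_HEADERS.items() if pat.match(stripped)),
            None,
        )
        if matched is not None:
            # A header line switches sections and is not kept as content.
            current = matched
        else:
            sections[current].append(stripped)
    return dict(sections)


example = ["JOHN DOE", "SKILLS", "", "- Python, SQL", "Education:", "Ph.D. in Physics"]
print(get_section_lines_sketch(example))
```

Supporting a new header style in this sketch just means adding or widening one pattern in the dictionary.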

A given section's lines are then tokenized — that is, split into individual "words".

In [8]:
tokens = msvdd_bloc.tokenize.tokenize("\n".join(section_lines["skills"]).strip())
tokens

[-,
 Programming,
 Languages,
 :,
 Python,
 ,,
 SQL,
 ,,
 HTML,
 /,
 CSS,
 ,
 -,
 English,
 (,
 native,
 ),
 ,,
 Spanish,
 (,
 conversational,
 ),
 ,,
 French,
 (,
 basic,
 )]
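
For intuition, a very rough approximation of this splitting can be written with a single regular expression. The real tokenizer is more sophisticated and returns token objects rather than plain strings; `tokenize_sketch` below is a hypothetical helper, not part of the package.

```python
import re

# Words, single punctuation marks, and newlines each become one token;
# other whitespace is dropped. (An assumed simplification.)
RE_TOKEN = re.compile(r"\w+|[^\w\s]|\n")


def tokenize_sketch(text):
    """Split text into word, punctuation, and newline tokens."""
    return RE_TOKEN.findall(text)


print(tokenize_sketch("- Python, SQL, HTML/CSS\n- English (native)"))
```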

Each token is then transformed into a collection of numerical and categorical features that a model can use to make good predictions about its label. These features include its position in the sequence, its text length, its case, whether or not it's entirely punctuation or whitespace, whether it looks like a number or email, and more. A token's immediate neighbors' features are also added to its own, nested under a "prev" or "next" key to keep everything separate. Depending on the section, additional section-specific features may be added, such as whether or not a token looks like a month or year, or is a word commonly used to indicate one's level of proficiency in a skill.

As an example, here's what the `Python` token's feature set looks like:

In [9]:
features = msvdd_bloc.resumes.skills.parse.featurize(tokens)
[feats for token, feats in zip(tokens, features) if token.text == "Python"][0]

{'idx': 4,
 'len': 6,
 'shape': 'Xxxxx',
 'prefix': 'P',
 'suffix': 'hon',
 'is_alpha': True,
 'is_digit': False,
 'is_lower': False,
 'is_upper': False,
 'is_title': True,
 'is_punct': False,
 'is_left_punct': False,
 'is_right_punct': False,
 'is_bracket': False,
 'is_quote': False,
 'is_space': False,
 'like_num': False,
 'like_url': False,
 'like_email': False,
 'is_stop': False,
 'is_alnum': True,
 'is_newline': False,
 'is_partial_digit': False,
 'is_partial_punct': False,
 'is_group_sep_text': False,
 'is_item_sep_text': False,
 'is_level_text': False,
 'ppprev': {'idx': 1,
  'len': 11,
  'shape': 'Xxxxx',
  'prefix': 'P',
  'suffix': 'ing',
  'is_alpha': True,
  'is_digit': False,
  'is_lower': False,
  'is_upper': False,
  'is_title': True,
  'is_punct': False,
  'is_left_punct': False,
  'is_right_punct': False,
  'is_bracket': False,
  'is_quote': False,
  'is_space': False,
  'like_num': False,
  'like_url': False,
  'like_email': False,
  'is_stop': False,
  'is_alnum': Tr
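
The nesting of neighbor features can be sketched with plain strings and dictionaries. This is a simplified illustration, not the package's `featurize` function: it computes only a handful of the features shown above, and the helper names are hypothetical.

```python
def token_features(text, idx):
    """Compute a small, assumed subset of per-token features."""
    return {
        "idx": idx,
        "len": len(text),
        "prefix": text[:1],
        "suffix": text[-3:],
        "is_alpha": text.isalpha(),
        "is_digit": text.isdigit(),
        "is_upper": text.isupper(),
        "is_title": text.istitle(),
        "is_punct": bool(text) and all(
            not ch.isalnum() and not ch.isspace() for ch in text
        ),
    }


def featurize_sketch(tokens):
    """Add each token's neighbors' features under "prev" and "next" keys."""
    base = [token_features(tok, i) for i, tok in enumerate(tokens)]
    featurized = []
    for i, feats in enumerate(base):
        feats = dict(feats)  # copy, so nested dicts stay free of neighbors
        if i > 0:
            feats["prev"] = base[i - 1]
        if i < len(base) - 1:
            feats["next"] = base[i + 1]
        featurized.append(feats)
    return featurized


feature_sets = featurize_sketch(["-", "Programming", "Languages", ":", "Python", ",", "SQL"])
print(feature_sets[4])
```

Keeping neighbor features under their own keys means a model can tell "this token is punctuation" apart from "the previous token is punctuation".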

Now, a trained model can use the sequence of features to make per-token predictions about the most likely labels, because it's learned the patterns found in a training dataset that map features to labels.

In [10]:
labeled_tokens = msvdd_bloc.resumes.parse_utils.tag(
    tokens, features,
    tagger=msvdd_bloc.resumes.parse_utils.load_tagger(msvdd_bloc.resumes.skills.FPATH_TAGGER),
)
labeled_tokens[:10]

[(-, 'field_sep'),
 (Programming, 'name'),
 (Languages, 'name'),
 (:, 'field_sep'),
 (Python, 'keyword'),
 (,, 'item_sep'),
 (SQL, 'keyword'),
 (,, 'item_sep'),
 (HTML, 'keyword'),
 (/, 'keyword')]

Finally, this sequence of (token, label) pairs has to be parsed using rules that combine the tokens into structured fields. Each section is parsed differently, depending on the structure and relationships of its constituent fields. Here's how that looks for this skills section:

In [11]:
skills_data = msvdd_bloc.resumes.skills.parse._parse_labeled_tokens(labeled_tokens)
skills_data

[{'name': 'Programming Languages', 'keywords': ['Python', 'SQL', 'HTML/CSS']},
 {'name': 'English', 'level': 'native'},
 {'name': 'Spanish', 'level': 'conversational'},
 {'name': 'French', 'level': 'basic'}]
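
To give a feel for this kind of rule-based parsing, here is a minimal sketch that merges adjacent tokens sharing a label and skips separators. It is not the package's `_parse_labeled_tokens`: the function name is hypothetical, the "level" label is assumed, and tokens are represented as plain strings that keep their trailing whitespace so that merged text reads naturally.

```python
from itertools import groupby


def parse_skills_sketch(labeled_tokens):
    """Combine runs of same-labeled tokens into skill dictionaries."""
    skills = []
    current = None
    # groupby yields one group per run of adjacent tokens with the same label.
    for label, group in groupby(labeled_tokens, key=lambda pair: pair[1]):
        text = "".join(tok for tok, _ in group).strip()
        if label == "name":
            current = {"name": text}
            skills.append(current)
        elif label == "keyword":
            if current is None:
                current = {}
                skills.append(current)
            current.setdefault("keywords", []).append(text)
        elif label == "level" and current is not None:
            current["level"] = text
        # Separator labels (and anything unrecognized) are simply skipped.
    return skills


example = [
    ("- ", "field_sep"), ("Programming ", "name"), ("Languages", "name"),
    (": ", "field_sep"), ("Python", "keyword"), (", ", "item_sep"),
    ("SQL", "keyword"), (", ", "item_sep"), ("HTML", "keyword"),
    ("/", "keyword"), ("CSS", "keyword"), ("\n", "item_sep"),
    ("- ", "field_sep"), ("English ", "name"), ("(", "field_sep"),
    ("native", "level"), (")", "field_sep"),
]
print(parse_skills_sketch(example))
```

Because `HTML`, `/`, and `CSS` are adjacent and share the `keyword` label, they collapse into a single value, while the `item_sep` tokens keep `Python` and `SQL` apart.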

:tada:

---

# Generating Fake Data

The section-specific models used to label each token in a section's lines are [Conditional Random Field](https://en.wikipedia.org/wiki/Conditional_random_field) models, a type of statistical model that predicts items in a sequence while taking their context — i.e. their _neighbors_ — into account. The relationships between tokens-in-context and their labels are learned from already labeled training data; to learn more complicated relationships, models typically require more training data. Since Bloc's supply of real résumés is relatively limited, we make do by generating sufficiently realistic fakes and assigning known labels.

Each résumé section has functionality for randomly generating values for a variety of fields, building upon the framework of the [`faker` package](https://faker.readthedocs.io/en/master/). Field value generators are linked to field keys (shorthand names used as placeholders in template strings) and field labels (the labels we want to predict with a CRF model) by way of a `FIELDS` dictionary. Finally, sequences of fields are generated in randomized template strings, where each field follows the format `{field_key:field_label:probability}`. (Note: the field label and probability components are optional. The default field label is specified in `FIELDS`, and the default probability is 1.0 — as in, it will be generated every time.) A given section comes in many different forms, which entails many different template strings.

For example, a template like `"{uni} {fsep} {city_state} {fsep} {dt}"` will produce sequences of labeled tokens like these:

```
[
    [
        ('State', 'institution'),
        ('University', 'institution'),
        ('of', 'institution'),
        ('West', 'institution'),
        ('Virginia', 'institution'),
        (',', 'institution'),
        ('Harrisonville', 'institution'),
        ('   ', 'field_sep'),
        (',', 'field_sep'),
        ('  ', 'field_sep'),
        ('January', 'end_date'),
        ('1989', 'end_date')
    ],
    [
        ('College', 'institution'),
        ('of', 'institution'),
        ('Graceview', 'institution'),
        ('  ', 'field_sep'),
        (';', 'field_sep'),
        (' ', 'field_sep'),
        ('Nov.', 'end_date'),
        ('2004', 'end_date')
    ],
]
```
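
The placeholder format can be illustrated with a small, self-contained sketch. This is not how `generate_utils` is actually implemented: the regular expression, the toy `FIELDS` mapping, and the function name are assumptions, and the sketch ignores the literal whitespace between placeholders rather than emitting it as tokens.

```python
import random
import re

# Matches {key}, {key:label}, and {key:label:probability} placeholders.
RE_FIELD = re.compile(
    r"\{(?P<key>\w+)(?::(?P<label>\w*))?(?::(?P<prob>[\d.]+))?\}"
)

# Hypothetical FIELDS mapping: field key -> (value generator, default label).
FIELDS = {
    "uni": (lambda rng: rng.choice(["College of Graceview", "Fake State University"]), "institution"),
    "fsep": (lambda rng: rng.choice([",", ";", "|"]), "field_sep"),
    "dt": (lambda rng: rng.choice(["Nov. 2004", "January 1989"]), "end_date"),
}


def generate_labeled_tokens_sketch(template, fields, rng):
    """Fill a template's placeholders and label each generated token."""
    labeled = []
    for match in RE_FIELD.finditer(template):
        generator, default_label = fields[match.group("key")]
        label = match.group("label") or default_label
        prob = float(match.group("prob")) if match.group("prob") else 1.0
        # Skip the field with probability (1 - prob).
        if rng.random() >= prob:
            continue
        value = generator(rng)
        labeled.extend((tok, label) for tok in value.split())
    return labeled


rng = random.Random(0)
print(generate_labeled_tokens_sketch("{uni} {fsep} {dt:start_date} {dt::0.0}", FIELDS, rng))
```

In the example template, the first `{dt:start_date}` overrides the default label, and `{dt::0.0}` keeps the default label but is never generated because its probability is zero.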

By using many different templates, many different fakes of a given section can be produced, ideally with enough variation in structure and values to effectively model real data. Here's how that looks in code:

In [12]:
from msvdd_bloc.resumes import education
from msvdd_bloc.resumes import generate_utils

In [13]:
fakes = list(generate_utils.generate_labeled_tokens(
    education.generate.TEMPLATES,
    education.generate.FIELDS,
    n=10,
    fixed_val_field_keys={"ws", "fsep", "isep"},
))
for i, fake in enumerate(fakes):
    print("\n[fake {}]".format(i))
    print(" ".join(tok for tok, label in fake))


[fake 0]
Doctorate ,   Political Science , Minor , Journalism , Media Studies and Communication 
 Community College of Vermont   –   Rileychester , NC 
 Feb 2005 - Present

[fake 1]
Guerra University ,    Mendozaton , Oklahoma    Expected Graduation : Mar 2015   – Current 
 AA Culinary Arts 
 Current GPA : 1.23

[fake 2]
Nguyen State University 
 Recent Courses- Environmental Studies and Policy ,   Ethnic and Gender Studies ,   Contemporary Resource Management 
 Minor ; Natural Sciences 

 Reese Polytechnic University 

 Grade Point Average : 1.6/2.7 
 July 2007 
 Associate Degree , Business , Major : The Arts 

 Recent Course Work : 
 - Intermediate Environmental Science ,   Physics ,   Marketing 1 ,   Electrical Engineering & Rhetoric ,   Graphic Design I ,   Biology ,   Ecology ,   21st Century Speech and Hearing Sciences

[fake 3]
State College of East Scott , Sep 2007   – Jul 2009 
 AA Psychology , Major in Logic 
 Relevant Course Work : 
 Comparative Literature 201 - Acting & Ea

For good measure, the faked data can be "augmented" by adding in additional random variation that may affect both values and relationships between fields. This entails the use of an `augment_utils.Augmenter` class, which takes a set of transform functions and produces randomly modified versions of the original labeled tokens.

Go ahead, run this next cell a few times in a row to see the additional variation.

In [14]:
aug_fakes = [
    education.augment.AUGMENTER.apply(tok_labels)
    for tok_labels in fakes
]
for i, aug_fake in enumerate(aug_fakes):
    print("\n[aug fake {}]".format(i))
    print(" ".join(tok for tok, label in aug_fake))


[aug fake 0]
Doctorate ,   Political Science , Minor , Journalism , Media Studies and Communication 
 Community College of Vermont   –   Rileychester , NC 
 Feb 2005 - Present

[aug fake 1]
Guerra University ,    Mendozaton , Oklahoma    Expected    Graduation : Mar 2015   – Current 
 AA Culinary Arts 
 Current GPA : 1.23

[aug fake 2]
Nguyen State 
 Recent Courses- Environmental Studies and Policy ,   Ethnic and Gender Studies ,   Contemporary Resource Management 
 Minor ; Natural Sciences 

 Reese Polytechnic University 

 Grade Point Average : 1.6/2.7 
 July 2007 
 Associate Degree , Business , Major : The Arts 

 Recent Course Work : 
 - Intermediate Environmental Science ,   Physics ,   Marketing 1 ,   Electrical Engineering & Rhetoric ,   Graphic Design I ,   Biology ,   Ecology ,   21st Century Speech and Hearing Sciences

[aug fake 3]
State College of East Scott , Sep 2007   – Jul 2009 
 AA Psychology , Major in Logic 
 Relevant Course Work : 
 Comparative Literature 201 - Act
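
The general idea behind such an augmenter can be sketched as follows. This is not the actual `augment_utils.Augmenter` class: the constructor signature, the per-transform probabilities, and the two example transforms are assumptions made for illustration.

```python
import random


class AugmenterSketch:
    """Apply each transform to labeled tokens with a given probability."""

    def __init__(self, transforms, probs, rng=None):
        if len(transforms) != len(probs):
            raise ValueError("need exactly one probability per transform")
        self.transforms = list(transforms)
        self.probs = list(probs)
        self.rng = rng if rng is not None else random.Random()

    def apply(self, labeled_tokens):
        result = list(labeled_tokens)  # never modify the caller's list
        for transform, prob in zip(self.transforms, self.probs):
            if self.rng.random() < prob:
                result = transform(result, self.rng)
        return result


def uppercase_tokens(labeled_tokens, rng):
    """Change token values while leaving labels untouched."""
    return [(tok.upper(), label) for tok, label in labeled_tokens]


def drop_random_token(labeled_tokens, rng):
    """Remove one token at random, changing the sequence's structure."""
    if len(labeled_tokens) < 2:
        return labeled_tokens
    idx = rng.randrange(len(labeled_tokens))
    return labeled_tokens[:idx] + labeled_tokens[idx + 1:]


augmenter = AugmenterSketch([uppercase_tokens, drop_random_token], [0.5, 0.5])
fake = [("Nguyen", "institution"), ("State", "institution"), ("University", "institution")]
print(augmenter.apply(fake))
```

Since each transform fires independently, repeated calls on the same input can yield the original tokens, one change, or both changes combined.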

---

# Next Steps?

You should probably check out the API Reference in the docs. It's _a lot_, but should have the information you need to start using — and tinkering.