# Introduction

This notebook explores how to parse resumes from unstructured text into a structured document. This initial step enables the follow-up steps of redaction and reconstruction. We start from **[Resume as Plain Text]** and end at **[Resume as JSON]**.

![](https://raw.githubusercontent.com/aniruddha-adhikary/blind-recruiter/main/_docs/flow-diagram.drawio.png)

## Setting up Output Format using Pydantic

Set up `pydantic` models for a resume. Pydantic lets us define the data models precisely, and the field descriptions are passed to OpenAI to convey the "meaning" of each field.

I hesitated to use the [Structured output parser](https://python.langchain.com/docs/modules/model_io/output_parsers/structured) because I needed to define complex structures involving nested lists.

References:
 - https://docs.pydantic.dev/latest/
 - https://python.langchain.com/docs/modules/model_io/output_parsers/pydantic

In [1]:
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import date
from enum import Enum

class EducationLevel(str, Enum):
    primary_school = "primary_school"
    secondary_school = "secondary_school"
    junior_college = "junior_college"
    university = "university"
    unknown = "unknown"


class Experience(BaseModel):
    title: Optional[str] = Field(description="Job title")
    organization: Optional[str] = Field(description="Organization name")
    start_date: Optional[date] = Field(description="Start date")
    end_date: Optional[date] = Field(
        description="End date (Null if current job)"
    )
    achievements: List[str] = Field(
        description="List of achievements, succinct (Leave empty if none provided)"
    )
    responsibilities: List[str] = Field(
        description="List of responsibilities (Leave empty if none provided)"
    )


class Education(BaseModel):
    education_level: Optional[EducationLevel] = Field(
        description="Must be any of the following values: {}".format([e.value for e in EducationLevel])
    )
    credential_name: Optional[str] = Field(description="Credential name")
    institution_name: Optional[str] = Field(description="Institution name")
    start_date: Optional[date] = Field(description="Start date")
    end_date: Optional[date] = Field(description="End date")
    description: Optional[str] = Field(description="Education description")


class Experiences(BaseModel):
    experiences: List[Experience] = Field(description="List of experiences")


class EducationHistory(BaseModel):
    education_history: List[Education] = Field(
        description="List of education history"
    )


class Skills(BaseModel):
    skills: List[str] = Field(
        description="List of skills (Programming Languages, Frameworks)"
    )


class Certifications(BaseModel):
    certifications: List[str] = Field(
        description="List of Certifications or Professional Credentials"
    )

class ParsedResume(Certifications, Skills, EducationHistory, Experiences):
    pass
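
As a minimal sketch of why the descriptions matter: they end up in the JSON Schema generated from the model, and that schema is what the parser's format instructions are built from. The two classes below are trimmed copies of the models above, and `Combined` is a hypothetical name used only for this illustration. The `hasattr` check is an assumption about the installed pydantic version, since the schema method is named differently in v1 and v2.

```python
from typing import List

from pydantic import BaseModel, Field


class Skills(BaseModel):
    skills: List[str] = Field(description="List of skills")


class Certifications(BaseModel):
    certifications: List[str] = Field(description="List of certifications")


# Hypothetical composed model: multiple inheritance merges the fields
# of every parent, just like ParsedResume does above.
class Combined(Certifications, Skills):
    pass


# pydantic v2 exposes model_json_schema(); v1 exposes schema().
if hasattr(Combined, "model_json_schema"):
    schema = Combined.model_json_schema()
else:
    schema = Combined.schema()

properties = schema["properties"]
for name, spec in properties.items():
    print(name, "->", spec["description"])
```

Both fields show up in a single schema, each carrying its description, which is the text the model sees when deciding what belongs in a field.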


Create an output parser based on the composed `ParsedResume` model.

In [2]:
from langchain.output_parsers import PydanticOutputParser

resume_output_parser = PydanticOutputParser(pydantic_object=ParsedResume)

## Large Language Model (LLM)

I chose OpenAI's GPT-3.5, specifically the `gpt-3.5-turbo-16k` variant, which supports up to 16,000 tokens. That is enough to fit the data model alongside a resume. The only reason for using `gpt-3.5-turbo-16k` instead of `text-davinci-003` is the token limit.

In [3]:
from langchain.chat_models import ChatOpenAI

chat_llm = ChatOpenAI(model='gpt-3.5-turbo-16k', temperature=0.0)
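
To get a feel for the token budget, here is a rough sketch that checks whether a prompt is likely to fit. It assumes about four characters per token, which is only a rule of thumb for English text. A real tokenizer will give different numbers. `estimate_tokens`, `fits_in_context` and the reply reserve of 4,000 tokens are hypothetical choices for illustration.

```python
CONTEXT_LIMIT_TOKENS = 16_000  # figure quoted above for gpt-3.5-turbo-16k
CHARS_PER_TOKEN = 4            # assumption: rough rule of thumb, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(prompt_text: str, reserved_for_reply: int = 4_000) -> bool:
    """Check that the prompt plus a reserve for the reply stays under the limit."""
    return estimate_tokens(prompt_text) + reserved_for_reply <= CONTEXT_LIMIT_TOKENS


short_prompt = "x" * 40_000  # stands in for schema + resume text
long_prompt = "x" * 60_000

print(fits_in_context(short_prompt))
print(fits_in_context(long_prompt))
```

The reserve matters because the JSON reply shares the same context window as the schema and the resume.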

## Building the Prompt

What the cool kids call Prompt Engineering.

In [4]:
from langchain.prompts import (
    ChatPromptTemplate,
    PromptTemplate,
    SystemMessagePromptTemplate
)

prompt = PromptTemplate(
    template="I am providing a JSON Schema followed by a resume in plain text. Format the plain text into the JSON format.\n{format_instructions}\n{resume}\n",
    input_variables=["resume"],
    partial_variables={
        "format_instructions": resume_output_parser.get_format_instructions()
    },
)

system_message_prompt = SystemMessagePromptTemplate(prompt=prompt)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt])
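
The template has two placeholders: `format_instructions` is filled in once up front, and `resume` is supplied per call. A plain-Python sketch of the same idea follows. It uses `str.format` and `functools.partial` rather than LangChain, and `render`, the placeholder instructions and the sample resume text are all hypothetical.

```python
from functools import partial

TEMPLATE = (
    "I am providing a JSON Schema followed by a resume in plain text. "
    "Format the plain text into the JSON format.\n"
    "{format_instructions}\n{resume}\n"
)


def render(template: str, **variables: str) -> str:
    """Fill the named placeholders in the template."""
    return template.format(**variables)


# Hypothetical stand-in for resume_output_parser.get_format_instructions().
format_instructions = 'Return JSON matching: {"skills": ["..."]}'

# Bind the schema part once; only the resume varies between calls.
render_resume_prompt = partial(
    render, TEMPLATE, format_instructions=format_instructions
)

prompt_text = render_resume_prompt(resume="**Jane Doe**\nSkills: Python")
print(prompt_text)
```

Braces inside the substituted values are left alone, because `str.format` only interprets braces in the template itself.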

# Trying it out

## Taking resume input

The resume below belongs to a **fictional human being**; it was generated by ChatGPT with GPT-4.

In [5]:
with open('test_data/fictional-resume.md') as input_file:
    resume_content = input_file.read()

In [6]:
print(resume_content)

**Jonathon Bracken**

1234 Silicon Valley Rd., San Jose, CA 95126
(408) 123-4567 | jonathon.bracken@example.com
LinkedIn: linkedin.com/in/jonathon-bracken

**Objective**
A versatile Software Engineer with over 10 years of diverse experience in front-end, back-end, and Android firmware engineering roles. Eager to apply problem-solving abilities and technical expertise in a challenging new role.

---

**Skills**

- Programming languages: Java, JavaScript, Python, C++, C#, Kotlin, Swift, SQL
- Web Technologies: HTML, CSS, React, AngularJS, Node.js, Express.js, REST APIs
- Databases: MySQL, PostgreSQL, MongoDB, Redis
- Android Firmware: Android Open Source Project (AOSP), Linux Kernel, Custom ROM Development
- Tools: Git, Docker, Jenkins, Jira, Agile/Scrum methodologies
- Soft Skills: Communication, Leadership, Problem Solving, Teamwork, Adaptability

---

**Work Experience**

**Senior Software Engineer | HoloWare Inc.**
*San Jose, CA | February 2021 - Present*

- Lead the development team

## Feed in the Resume

In [7]:
%%time

formatted_message = chat_prompt.format_prompt(resume=resume_content)

output = chat_llm(formatted_message.to_messages())

CPU times: user 15.6 ms, sys: 8.05 ms, total: 23.7 ms
Wall time: 13.3 s


### Initial Output
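This is the raw output from the first attempt. Dates such as "February 2021" are not in a format that the `date` fields accept, so we may need to push this through `OutputFixingParser`.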

In [8]:
import json

print(output.content)

{
  "experiences": [
    {
      "title": "Senior Software Engineer",
      "organization": "HoloWare Inc.",
      "start_date": "February 2021",
      "end_date": null,
      "achievements": [
        "Lead the development team of HoloLens, a holographic glasses project, developing firmware based on Android Open Source Project (AOSP).",
        "Enhanced system performance by optimizing Linux Kernel and effectively reduced boot-up time by 30%.",
        "Assisted in the development of several in-house tools to facilitate rapid firmware testing and deployment."
      ],
      "responsibilities": []
    },
    {
      "title": "Full Stack Developer",
      "organization": "Nebula Dynamics",
      "start_date": "June 2017",
      "end_date": "February 2021",
      "achievements": [
        "Contributed to a team responsible for developing Nebula Cloud Suite, a comprehensive suite of cloud-based tools for businesses.",
        "Developed RESTful APIs using Node.js and Express, which serve
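
The dates in this output arrive as strings like "February 2021", while the model declares `date` fields. One way to bridge the gap without another LLM call is sketched below. `month_year_to_date` is a hypothetical helper, and mapping each value to the first day of the month is an assumption, since the resume gives no day.

```python
from datetime import date, datetime
from typing import Optional


def month_year_to_date(text: Optional[str]) -> Optional[date]:
    """Convert 'Month YYYY' to a date on the first of that month.

    None (e.g. the end date of a current job) is passed through unchanged.
    Note: %B matches month names in the current locale (English assumed).
    """
    if text is None:
        return None
    return datetime.strptime(text.strip(), "%B %Y").date()


start = month_year_to_date("February 2021")
end = month_year_to_date(None)
print(start, end)
```

Values in any other shape raise a `ValueError`, so unexpected formats surface immediately instead of being silently guessed.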

### Get Output Fixing Parser to Take over

When the output fails `pydantic` validation, `OutputFixingParser` sends it back to OpenAI together with the error and asks for a corrected version.

References:
 - https://python.langchain.com/docs/modules/model_io/output_parsers/retry

In [None]:
%%time

from langchain.output_parsers import OutputFixingParser
output_fixing_parser = OutputFixingParser.from_llm(parser=resume_output_parser, llm=chat_llm)

resume = output_fixing_parser.parse(output.content).dict()
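
The general pattern behind this step can be sketched in plain Python: try to parse, and on failure hand the bad text and the error message to a fixer before trying again. This is an illustration of the idea, not LangChain's implementation. `parse_with_fix` and `strip_trailing_comma` are hypothetical, and the fixer is a trivial stand-in for the LLM call.

```python
import json
from typing import Callable


def parse_with_fix(
    raw: str,
    parse: Callable[[str], dict],
    fix: Callable[[str, str], str],
    max_fixes: int = 1,
) -> dict:
    """Parse raw text, asking `fix` for a corrected version on failure."""
    for _ in range(max_fixes):
        try:
            return parse(raw)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            raw = fix(raw, str(exc))
    return parse(raw)  # final attempt; any error propagates to the caller


def strip_trailing_comma(raw: str, error: str) -> str:
    """Stand-in for an LLM call that would receive the text and the error."""
    return raw.replace(",]", "]")


broken = '{"skills": ["Python", "Go",]}'
result = parse_with_fix(broken, json.loads, strip_trailing_comma)
print(result)
```

Capping the number of fix attempts keeps a stubbornly malformed output from turning into an unbounded series of API calls.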

In [None]:
import json

json_formatted_str = json.dumps(resume, indent=2, default=str)
print(json_formatted_str)
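
The `default=str` argument is needed because `json.dumps` cannot serialize `date` objects on its own. A small standalone sketch with made-up values shows the effect:

```python
import json
from datetime import date

# Made-up record for illustration; None mirrors a current job's end date.
record = {"start_date": date(2021, 2, 1), "end_date": None}

# Without default=str, json.dumps raises TypeError on the date object.
try:
    json.dumps(record)
    failed_without_default = False
except TypeError:
    failed_without_default = True

# default=str is called only for values json cannot handle natively.
serialized = json.dumps(record, default=str)
print(serialized)
```

Dates come out as ISO strings, while `None` is still written as `null` because `json` already knows how to handle it.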