# Extract CV (PDF) To JSON with Mistral

This program extracts structured data from a PDF file (in this case, a CV or resume) using a language model. Here's a step-by-step breakdown of what the code does:

Importing Required Libraries:

1. ChatMistralAI from langchain_mistralai.chat_models:  
The chat model used to process the text.
2. BaseModel and Field from pydantic:  
Used to define the structure of the extracted data and to describe each field.
3. PyPDFLoader from langchain_community.document_loaders:  
Loads the PDF file named cv.pdf. Its load_and_split method returns the content as a list of documents, roughly one per page (long pages may be split further by the default text splitter).
4. List from typing:  
The type hint used for list-valued fields such as skills in the CVDataExtraction model. Type hints improve readability, help with debugging, and enable better support from IDEs and static type checkers.
5. ssl:  
The standard-library module that provides Transport Layer Security (previously known as Secure Sockets Layer) encryption and peer authentication for network sockets. Here it is only used to print the version of the SSL library that Python is linked against.

In [None]:
pip install langchain-core langchain-mistralai pypdf pydantic langchain-community

## Extracting Text from PDF

The code defines a CVDataExtraction model with Pydantic, which describes and validates the structure of the data extracted from a CV. Its fields cover the candidate's full name, email, job titles, career start year, profile, skills, professional experiences, education, publications, distinctions, and certifications. The text extracted from the PDF is joined into a single string and passed to a Mistral AI model wrapped with with_structured_output, so the reply comes back as a CVDataExtraction object that can then be converted to a dictionary or to JSON for further use.
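To illustrate what "structures and validates" means in practice, here is a minimal sketch that assumes Pydantic v2. `MiniCV` is a hypothetical, cut-down stand-in for CVDataExtraction and the input dictionaries are made up for the example; no PDF or language model is involved.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class MiniCV(BaseModel):
    """Hypothetical, cut-down stand-in for CVDataExtraction."""
    full_name: str = Field(description="The full name of the candidate.")
    promotion_years: int = Field(description="The year the candidate started their career.")
    skills: List[str] = Field(description="The candidate's skills.")


# Well-formed input: the numeric string "2014" is coerced to an int.
good = MiniCV.model_validate(
    {"full_name": "Pirate King", "promotion_years": "2014", "skills": ["C#", "Java"]}
)

# Malformed input: a non-numeric year and a missing skills field are both rejected.
try:
    MiniCV.model_validate({"full_name": "Pirate King", "promotion_years": "unknown"})
    error_fields = []
except ValidationError as exc:
    # Each error records the location of the offending field.
    error_fields = sorted(err["loc"][0] for err in exc.errors())

print(good.promotion_years, error_fields)
```

The same checks apply to whatever the language model returns, which is why a reply that does not fit the schema surfaces as a validation error rather than as silently malformed data.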

In [None]:
from langchain_mistralai.chat_models import ChatMistralAI
from pydantic import BaseModel, Field
from langchain_community.document_loaders import PyPDFLoader
from typing import List
import ssl

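# Report the SSL library Python is linked against (useful when debugging HTTPS issues)
print(ssl.OPENSSL_VERSION)

# Load the PDF and split it into a list of documents
loader = PyPDFLoader("cv.pdf")
pages = loader.load_and_split()

# Join the text of all documents into a single string
text = " ".join(page.page_content for page in pages)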

class CVDataExtraction(BaseModel):
    full_name: str = Field(description="The full name of the candidate, used as their username in the system.")
    email: str = Field(description="The candidate's email address for identification and communication purposes.")
    job_titles: str = Field(description="A summary of the candidate's current or most recent job titles.")
    promotion_years: int = Field(description="The year the candidate started their professional career.")
    profile: str = Field(description="A brief overview of the candidate's professional profile, including their key attributes and expertise.")
    skills: List[str] = Field(description="A list of the candidate's soft and technical skills, showcasing their capabilities.")
    professional_experiences: List[str] = Field(description="Detailed information about the candidate's professional work experiences, including roles, responsibilities, and achievements.")
    educations: List[str] = Field(description="Educational qualifications of the candidate, including degrees, institutions, and graduation years.")
    publications: List[str] = Field(description="Any publications authored or co-authored by the candidate, such as articles, papers, or books.")
    distinctions: List[str] = Field(description="Awards, honors, or recognitions received by the candidate throughout their career or education.")
    certifications: List[str] = Field(description="Professional certifications achieved by the candidate, indicating their specialized knowledge and qualifications.")


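# "XXXXXX" is a placeholder: supply your own Mistral API key
model = ChatMistralAI(api_key="XXXXXX", model="mistral-large-latest")

# Wrap the model so that its reply is parsed into a CVDataExtraction instance
structured_llm = model.with_structured_output(CVDataExtraction)

# Returns a CVDataExtraction object (not a JSON string)
structured_llm.invoke(text)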


LibreSSL 2.8.3


CVDataExtraction(full_name='Pirate King', email='pirateking@gmail.com', job_titles='Software Engineer', promotion_years=2014, profile='A highly skilled software engineer with experience in various programming languages and technologies, including C#, .NET, Java, JavaScript, TypeScript, C++, and more. Proficient in both frontend and backend development, with expertise in cloud computing, microservices, and game development. Has worked with major companies like Microsoft, Amazon, and eBay, contributing to significant projects and driving substantial revenue.', skills=['C#', '.NET', 'Java', 'JavaScript', 'TypeScript', 'C++', 'C', 'CosmosDB', 'MSSQL', 'Node', 'Express', 'React', 'Vue', 'Redux', 'jQuery', 'NoSQL', 'Git', 'Azure', 'Cloud Computing', 'CI/CD', 'XUnit', 'Jest', 'Cucumber', 'Nightwatch', 'Unit Testing', 'Lambda', 'OOP', 'Unity 2D', 'Game Development', 'Microservices', 'Distributed Systems', 'Frontend', 'Backend', 'Full-Stack', 'English', 'Korean', 'Japanese'], professional_exper
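The model cell passes the API key as a literal placeholder. One possible way to keep a real key out of the notebook is to read it from an environment variable. The sketch below is only an illustration: the helper `read_api_key` is hypothetical, and the variable name `MISTRAL_API_KEY` is an assumption about how the key is exported on your machine.

```python
import os
from typing import Mapping


def read_api_key(env: Mapping[str, str] = os.environ,
                 name: str = "MISTRAL_API_KEY") -> str:
    """Return the API key stored under `name`, or raise if it is missing.

    Both the helper and the default variable name are assumptions made
    for this sketch; adapt them to your own setup.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key


# Demonstration with a made-up mapping instead of the real environment.
demo_key = read_api_key({"MISTRAL_API_KEY": "  demo-key  "})
```

The returned value could then be passed as the api_key argument in place of the "XXXXXX" placeholder.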

In [7]:
# Parse the text using the structured LLM
parsed_data = structured_llm.invoke(text)

# Convert the parsed data to a dictionary
# (model_dump() replaces the dict() method, which is deprecated in Pydantic V2)
cv_dict = parsed_data.model_dump()

print(cv_dict)

{'full_name': 'Pirate King', 'email': 'pirateking@gmail.com', 'job_titles': 'Software Engineer, YouTuber, Mentor', 'promotion_years': 2014, 'profile': 'A skilled software engineer with experience in various programming languages and technologies, including C#, .NET, Java, JavaScript, and more. Proficient in full-stack development, microservices, and cloud computing. Experienced in leading development projects and mentoring aspiring software engineers. Fluent in English, Korean, and Japanese.', 'skills': ['C#', '.NET', 'Java', 'JavaScript', 'TypeScript', 'C++', 'C', 'CosmosDB', 'MSSQL', 'Node', 'Express', 'React', 'Vue', 'Redux', 'jQuery', 'NoSQL', 'Git', 'Azure', 'Cloud Computing', 'CI/CD', 'XUnit', 'Jest', 'Cucumber', 'Nightwatch', 'Unit Testing', 'Lambda', 'OOP', 'Unity 2D', 'Game Development', 'Microservices', 'Distributed Systems', 'Frontend', 'Backend', 'Full-Stack', 'English', 'Korean', 'Japanese'], 'professional_experiences': ['YouTuber at YouTube, creating content on software e
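The dictionary printed above is a Python object rather than JSON text. A minimal sketch of the final step, serializing it to a JSON string with the standard library, might look like this. The `cv_dict` below is a shortened, hand-written excerpt of the real output so that the example runs on its own, and the helper name `to_json` is hypothetical.

```python
import json

# Shortened excerpt of the dictionary printed above, so the sketch is self-contained.
cv_dict = {
    "full_name": "Pirate King",
    "email": "pirateking@gmail.com",
    "promotion_years": 2014,
    "skills": ["C#", ".NET", "Java"],
}


def to_json(data: dict) -> str:
    """Serialize a dictionary to indented JSON, keeping non-ASCII characters readable."""
    return json.dumps(data, indent=2, ensure_ascii=False)


cv_json = to_json(cv_dict)

# Round-trip check: parsing the JSON text gives back the same dictionary.
restored = json.loads(cv_json)

print(cv_json)
```

With Pydantic v2, calling model_dump_json() on the parsed object is an alternative that skips the intermediate dictionary.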
