# Resume Parsing with spaCy

## Introduction
In this project, we build a resume parsing system with the spaCy library. The goal is to extract entities such as names, organizations, and skills from resumes so that they can be processed and analyzed automatically.

## Dataset
We use a dataset (`dataset.json`) of resumes in JSON format. The dataset is split into training and test sets, which are used to train and evaluate the model.
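
The conversion code later in this notebook iterates over `(text, annotation)` pairs and reads `annotation['entities']` as `[start, end, label]` triples. A minimal sketch of that record shape follows. The sample text and the `validate_record` helper are hypothetical and are not taken from `dataset.json`; they only illustrate the kind of check that can catch bad offsets before conversion.

```python
# Hypothetical sample record; the text and offsets are invented for illustration.
sample = [
    "Jane Doe, Python developer",
    {"entities": [[0, 8, "NAME"], [10, 16, "SKILLS"]]},
]


def validate_record(record):
    """Return a list of problems found in one (text, annotation) record."""
    text, annot = record
    problems = []
    for start, end, label in annot["entities"]:
        if not (0 <= start < end <= len(text)):
            problems.append(f"bad offsets {start}-{end} for {label}")
        elif text[start:end] != text[start:end].strip():
            problems.append(f"span {start}-{end} has leading/trailing whitespace")
    return problems


print(validate_record(sample))
```

An empty list means every span lies inside the text and has no surrounding whitespace, which is a common reason for spans failing to align with tokens.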

## Dependencies
Make sure the following dependencies are installed:
- spaCy
- tqdm
- scikit-learn
- PyMuPDF
- spacy-transformers

## Training the Model
1. Load the dataset and split it into training and test sets.
2. Convert the training and test data into spaCy `DocBin` format.
3. Save the converted data to disk (`train_data.spacy` and `test_data.spacy`).
4. Train the spaCy model using the configuration file (`config.cfg`).
5. Find the trained model in the `output` directory.

## Parsing Resumes
1. Install the PyMuPDF library for processing PDF files.
2. Load the trained model from the `output` directory.
3. Provide the path to the PDF file you want to parse.
4. Extract the text, run the model on it, and print the extracted entities with their labels.

## Conclusion
Resume parsing automates the extraction of information from resumes. With spaCy and a well-prepared dataset, we can train a model that identifies and extracts entities from resumes.

Let's get started!


In [1]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import json
from sklearn.model_selection import train_test_split
import fitz

#### Load CV data from JSON file

In [5]:
# Use a context manager so the file handle is closed after reading
with open('../data/dataset.json', 'r') as f:
    data = json.load(f)

In [6]:
len(data)

1014

In [8]:
!python -m spacy init fill-config ../config/base_config.cfg ../config/config.cfg

✔ Auto-filled config with all values
✔ Saved config
..\config\config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [9]:
# Convert (text, annotations) pairs into a spaCy DocBin
def get_spacy_doc(file, data):
    nlp = spacy.blank("en")
    db = DocBin()

    for text, annot in tqdm(data):
        doc = nlp.make_doc(text)
        ents = []
        entity_indices = set()  # character offsets already claimed by an entity

        for start, end, label in annot['entities']:
            # Skip entities that overlap one seen earlier in this document
            if any(idx in entity_indices for idx in range(start, end)):
                continue
            entity_indices.update(range(start, end))

            try:
                span = doc.char_span(
                    start, end, label=label, alignment_mode='strict')
            except Exception:
                continue

            if span is None:
                # Offsets do not align with token boundaries: log and move on
                file.write(f"{[start, end]}    {text}\n")
            else:
                ents.append(span)

        try:
            doc.ents = ents
            db.add(doc)
        except Exception:
            # Skip documents whose entities spaCy rejects
            continue

    return db
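
The overlap check inside `get_spacy_doc` can be studied on its own, without spaCy. The sketch below is a simplified stand-in rather than the notebook's code: `drop_overlapping` is a hypothetical helper and the spans are invented. It keeps the first span that claims a range of characters and drops later spans that touch any of them, with `end` treated as exclusive.

```python
def drop_overlapping(spans):
    """Keep each (start, end, label) span unless it overlaps an earlier kept one."""
    used = set()   # character offsets already claimed by a kept span
    kept = []
    for start, end, label in spans:
        offsets = range(start, end)
        if any(i in used for i in offsets):
            continue  # overlaps an earlier span, so skip it
        used.update(offsets)
        kept.append((start, end, label))
    return kept


spans = [(0, 8, "NAME"), (5, 12, "SKILLS"), (12, 20, "LOCATION")]
print(drop_overlapping(spans))
# [(0, 8, 'NAME'), (12, 20, 'LOCATION')]
```

Because the first span wins, the order of annotations in the dataset decides which of two overlapping entities survives.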

In [10]:
# Split the dataset into train and test sets (fixed seed for a reproducible split)
train, test = train_test_split(data, test_size=0.2, random_state=42)

In [12]:
# Open a log file for entity spans that cannot be aligned to tokens
file = open('../model/train_file.txt', 'w')

In [13]:
# Convert training data to spaCy DocBin format and save to disk
train_db = get_spacy_doc(file, train)
train_db.to_disk('../model/train_data.spacy')

# Convert test data to spaCy DocBin format and save to disk
test_db = get_spacy_doc(file, test)
test_db.to_disk('../model/test_data.spacy')

100%|██████████| 811/811 [00:08<00:00, 98.40it/s] 
100%|██████████| 203/203 [00:02<00:00, 92.16it/s] 
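
The totals in the two progress bars (811 and 203) follow from `test_size=0.2` applied to the 1014 records. A small arithmetic sketch, assuming scikit-learn rounds the test count up and gives the remainder to the training set (which matches the totals shown):

```python
import math

n_samples = 1014   # len(data) reported earlier in the notebook
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test count is rounded up
n_train = n_samples - n_test                # the rest goes to training

print(n_train, n_test)
```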


In [14]:
file.close()

- Train this model on a GPU and adjust the config file accordingly. Google Colab is a recommended option if no local GPU is available.

In [14]:
!python -m spacy train ../config/config.cfg --output ../model/output --paths.train ../model/train_data.spacy --paths.dev ../model/test_data.spacy --gpu-id 0

✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using GPU: 0
[2023-06-28 19:11:47,085] [INFO] Set up nlp object from config
[2023-06-28 19:11:47,101] [INFO] Pipeline: ['transformer', 'ner']
[2023-06-28 19:11:47,105] [INFO] Created vocabulary
[2023-06-28 19:11:47,105] [INFO] Finished initializing nlp object
Downloading (…)lve/main/config.json: 100% 481/481 [00:00<00:00, 3.36MB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 14.9MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 73.1MB/s]
Downloading (…)/main/tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 55.0MB/s]
Downloading model.safetensors: 100% 499M/499M [00:06<00:00, 72.2MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are 

In [15]:
# Load the trained model
nlp = spacy.load('../model/output/model-best')

In [16]:
# Open the PDF file with PyMuPDF (fitz was imported in the first cell)
fname = '../yashnewresume.pdf'
doc = fitz.open(fname)

In [17]:
# Concatenate the text of every page into one string
text = "".join(page.get_text() for page in doc)

In [18]:
# Run the trained pipeline and print each entity with its label
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, " ---->>>", ent.label_)

Yash Parmar  ---->>> NAME
Ahmedabad, Gujarat, 382440, india  ---->>> LOCATION
yashp3020@gmail.com  ---->>> EMAIL ADDRESS
K.S School of business Management  ---->>> COLLEGE NAME
Kum Kum Vidyalaya  ---->>> NAME
Kum Kum Vidyalaya  ---->>> NAME
EcoSnap  ---->>> SKILLS
SKILLS  ---->>> SKILLS
Python  ---->>> SKILLS
Javascript  ---->>> SKILLS
Mongodb  ---->>> SKILLS
MySql  ---->>> SKILLS
ReactJS  ---->>> SKILLS
Data Science  ---->>> SKILLS
ML/DL  ---->>> SKILLS
AWS  ---->>> SKILLS
NextJS  ---->>> SKILLS
Nestjs  ---->>> SKILLS
Microservices  ---->>> SKILLS
Express  ---->>> SKILLS
NodeJS  ---->>> SKILLS
Tensorflow  ---->>> SKILLS
Docker  ---->>> SKILLS
Scikit-Learn  ---->>> SKILLS
Gujarati  ---->>> LANGUAGE
Hindi  ---->>> LANGUAGE
English  ---->>> LANGUAGE
Techathon Winner  ---->>> AWARDS
Gateway Group  ---->>> AWARDS
JK Laxmipat University Hackathon  ---->>> NAME
Machine learning with python  ---->>> CERTIFICATION
JKLU University,jaipur  ---->>> UNIVERSITY
IBM  ---->>> COMPANIES WORKED AT
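
The printed list contains repeats (for example, `Kum Kum Vidyalaya` appears twice), so a downstream consumer may prefer the entities grouped by label. The sketch below is one possible post-processing step, not part of the original notebook; `group_entities` is a hypothetical helper and the input is a handful of pairs copied from the output above.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the output above.
entities = [
    ("Yash Parmar", "NAME"),
    ("Kum Kum Vidyalaya", "NAME"),
    ("Kum Kum Vidyalaya", "NAME"),
    ("Python", "SKILLS"),
    ("Docker", "SKILLS"),
    ("English", "LANGUAGE"),
]


def group_entities(pairs):
    """Group entity texts by label, dropping repeats but keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        text = text.strip()
        if text and text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


print(group_entities(entities))
```

With a parsed document, the same helper could be fed `[(ent.text, ent.label_) for ent in doc.ents]`.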
