# CV and Resume Parsing with spaCy

+ Resume parsing converts unstructured resume data into a structured format.

+ Resumes from applicants differ in presentation, design, fonts, and layout.

+ An ideal system should extract the essential content of these resumes, such as the candidate's experience, skills, and academic qualifications, as quickly as possible and regardless of how the resumes look, so that recruiters can work with it directly.

# What is NER?

+ Named Entity Recognition (NER) takes a string of text as input (a sentence or a paragraph) and identifies the named entities in it, such as people, places, and organizations, along with other specific words or phrases of interest.
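
To make the input and output of NER concrete, here is a toy sketch in plain Python. It is not how spaCy's statistical model works: it is a simple dictionary lookup, and the `GAZETTEER` table and `find_entities` helper are hypothetical names invented for this illustration. What it shares with a real NER system is the shape of the result, a list of `(start, end, label)` character spans.

```python
import re

# Hypothetical lookup table for illustration only;
# a trained model learns such patterns from annotated data.
GAZETTEER = {
    "Infosys Ltd": "Companies worked at",
    "Pune": "Location",
    "Quality Analyst": "Designation",
}


def find_entities(text):
    """Return sorted (start, end, label) spans for every gazetteer hit in text."""
    spans = []
    for phrase, label in GAZETTEER.items():
        for match in re.finditer(re.escape(phrase), text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


sample = "Quality Analyst at Infosys Ltd, Pune"
for start, end, label in find_entities(sample):
    print(f"{label:<22}- {sample[start:end]}")
```

The labels used here are the same ones that appear in the training data below.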

## Data Preparation

In [1]:
import spacy
import pickle
import random

In [2]:
with open('train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)

In [3]:
train_data[0][1].get('entities')

[(1749, 1755, 'Companies worked at'),
 (1696, 1702, 'Companies worked at'),
 (1417, 1423, 'Companies worked at'),
 (1356, 1793, 'Skills'),
 (1209, 1215, 'Companies worked at'),
 (1136, 1248, 'Skills'),
 (928, 932, 'Graduation Year'),
 (858, 889, 'College Name'),
 (821, 856, 'Degree'),
 (787, 791, 'Graduation Year'),
 (744, 750, 'Companies worked at'),
 (722, 742, 'Designation'),
 (658, 664, 'Companies worked at'),
 (640, 656, 'Designation'),
 (574, 580, 'Companies worked at'),
 (555, 573, 'Designation'),
 (470, 493, 'Companies worked at'),
 (444, 469, 'Designation'),
 (308, 314, 'Companies worked at'),
 (234, 240, 'Companies worked at'),
 (175, 198, 'Companies worked at'),
 (93, 137, 'Email Address'),
 (39, 48, 'Location'),
 (13, 38, 'Designation'),
 (0, 12, 'Name')]
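
Each tuple is a `(start, end, label)` character span into the resume text `train_data[0][0]`. Note that the `Skills` span from 1356 to 1793 encloses several `Companies worked at` spans. spaCy's entity recognizer assumes that entity spans do not overlap, so annotations like these are likely to be rejected during training. The sketch below, with a hypothetical helper `find_overlaps`, shows one way to list such conflicts before training.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that share at least one character."""
    ordered = sorted(entities)          # sort by start offset, then end offset
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:   # every later span starts even further right
                break
            overlaps.append((first, second))
    return overlaps


# A few spans copied from the output above
entities = [
    (1749, 1755, 'Companies worked at'),
    (1696, 1702, 'Companies worked at'),
    (1356, 1793, 'Skills'),
    (0, 12, 'Name'),
]
for first, second in find_overlaps(entities):
    print(first, 'overlaps', second)
```

How to resolve a conflict (drop the longer span, drop the shorter one, or split the labels across separate models) is a modelling decision that this sketch does not make.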

## NER with spaCy

In [4]:
import warnings

# start from a blank English pipeline
# (the training code below uses the spaCy 2.x API)
nlp = spacy.blank('en')


def train_model(train_data):
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")
        
    # add labels
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    
    #--------------------------
    
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
   
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        # reset and initialize the weights randomly – but only if we're
        # training a new model
        optimizer = nlp.begin_training()
            
        for itn in range(20):
            print('Starting iterations ', str(itn))
            random.shuffle(train_data)
            
            
            losses = {}
            index = 0
            for text, annotations in train_data:
                print(index)
                index = index + 1
                try:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,  # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                except Exception:
                    # skip examples that spaCy cannot process (for example,
                    # annotations with conflicting entity spans)
                    pass
                
            print(losses)
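
The training loop above filters warnings about misaligned entity spans. One common cause of such misalignment is an annotated span that includes leading or trailing whitespace, so that its boundaries do not fall on token boundaries. Whether that is the cause in this dataset is an assumption. The sketch below uses a hypothetical helper, `trim_entity_spans`, to move span boundaries inward until they sit on non-whitespace characters.

```python
def trim_entity_spans(data):
    """Return a copy of data with leading/trailing whitespace removed from each span."""
    cleaned = []
    for text, annotations in data:
        entities = []
        for start, end, label in annotations.get("entities", []):
            # move the boundaries inward while they sit on whitespace
            while start < end and text[start].isspace():
                start += 1
            while end > start and text[end - 1].isspace():
                end -= 1
            if start < end:             # drop spans that were only whitespace
                entities.append((start, end, label))
        cleaned.append((text, {"entities": entities}))
    return cleaned


# Made-up record in the same (text, {"entities": [...]}) shape as train_data
example = [("Arpit Jain \nPune", {"entities": [(0, 11, "Name"), (11, 16, "Location")]})]
print(trim_entity_spans(example))
```

In the notebook this could be applied once, as `train_data = trim_entity_spans(train_data)`, before calling `train_model`.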



In [5]:
train_model(train_data)

Starting iterations  0
0
1
2
...
199
{'ner': 13680.804923822867}
Starting iterations  1
0
1
2
...
81


In [11]:
nlp.to_disk('nlp_model')

## Model Testing

In [12]:
nlp_model = spacy.load('nlp_model')

In [13]:
text = train_data[0][0]

In [14]:
doc = nlp_model(text)
for ent in doc.ents:
    print(f'{ent.label_.upper():{30}}- {ent.text}')

NAME                          - Arpit Jain
DESIGNATION                   - Quality Analyst
COMPANIES WORKED AT           - ThoughtWorks Technologies
LOCATION                      - Pune
EMAIL ADDRESS                 - indeed.com/r/Arpit-Jain/3714fe32f98b03a9
DESIGNATION                   - Quality Analyst
DESIGNATION                   - Quality Analyst
COMPANIES WORKED AT           - ThoughtWorks Technologies
COMPANIES WORKED AT           - Infosys Ltd
COMPANIES WORKED AT           - ThoughtWorks Technologies
DESIGNATION                   - Quality Analyst
COMPANIES WORKED AT           - ThoughtWorks Technologies
DEGREE                        - B.Tech
COLLEGE NAME                  - Jaypee Institute Of Information Technology
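
The resume tested above is `train_data[0]`, which the model has already seen during training, so this output says little about how the model handles new resumes. A minimal sketch of holding out part of the data before training follows. The helper `split_data`, the 20% test fraction, and the stand-in `records` list are all assumptions made for illustration.

```python
import random


def split_data(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of data and split it into (train, test) lists."""
    shuffled = list(data)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in records; in the notebook this would be the loaded train_data list
records = [(f"resume {i}", {"entities": []}) for i in range(200)]
train_split, test_split = split_data(records)
print(len(train_split), len(test_split))
```

With such a split, `train_model(train_split)` would be followed by running the saved model on texts from `test_split` only.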


## CV Parsing from PDF Data

In [15]:
!pip install PyMuPDF



In [16]:
import fitz  # PyMuPDF is imported under the name fitz

In [18]:
fname = 'Smith Resume.pdf'
doc = fitz.open(fname)

# collect the text of every page; get_text() is the current PyMuPDF name
# for the older getText() alias
text = "".join(page.get_text() for page in doc)

In [19]:
txt1 = " ".join(text.split('\n'))

In [20]:
txt1

'Michael Smith  BI / Big Data/ Azure  Manchester, UK- Email me on Indeed: indeed.com/r/falicent/140749dace5dc26f    10+ years of Experience in Designing, Development, Administration, Analysis,  Management  inthe  Business  Intelligence  Data  warehousing,  Client  Server  Technologies, Web-based Applications, cloud solutions and Databases.  Data warehouse: Data analysis, star/ snow flake schema data modeling and design  specific todata warehousing and business intelligence environment.  Database: Experience in database designing, scalability, back-up and recovery,  writing andoptimizing SQL code and Stored Procedures, creating functions, views,  triggers and indexes.   Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL  Azure, StreamAnalytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure  data lake analytics(U-SQL).  Big Data: Worked Azure data lake store/analytics for big data processing and Azure  data factoryto schedule U-SQL jobs. Designed and 
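
The joined string still contains runs of two spaces, because each newline is replaced by a space without collapsing its neighbors. A minimal sketch of collapsing all whitespace runs is shown below; `normalize_whitespace` is a hypothetical helper name, and whether this helps the model depends on how the training text itself was formatted.

```python
def normalize_whitespace(text):
    """Collapse every run of spaces, tabs, and newlines into a single space."""
    return " ".join(text.split())


# Short made-up fragment in the style of the extracted PDF text
raw = "Michael Smith \nBI / Big Data/ Azure \nManchester, UK"
print(normalize_whitespace(raw))
```

Calling `str.split()` with no argument splits on any whitespace and discards empty pieces, which is why no doubled spaces survive the join.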

In [21]:
doc = nlp_model(txt1)
for ent in doc.ents:
    print(f'{ent.label_.upper():{30}}- {ent.text}')

NAME                          - Michael Smith
EMAIL ADDRESS                 - indeed.com/r/falicent/140749dace5dc26f
LOCATION                      - Technologies
COMPANIES WORKED AT           - Microsoft
LOCATION                      - StreamAnalytics
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COLLEGE NAME                  - The University of Manchester - UK
SKILLS                        - problem solving (Less than 1 year), project lifecycle (Less th
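
The same company is reported many times because every mention in the text becomes its own entity. To turn this output into a structured record, the entities can be grouped by label with duplicates removed. The sketch below assumes a list of `(label, text)` pairs; `group_entities` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict


def group_entities(pairs):
    """Map each label to its distinct entity texts, keeping first-seen order."""
    grouped = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# (label, text) pairs copied from the output above; with spaCy they would come
# from [(ent.label_.upper(), ent.text) for ent in doc.ents]
pairs = [
    ("NAME", "Michael Smith"),
    ("COMPANIES WORKED AT", "Microsoft"),
    ("COMPANIES WORKED AT", "Microsoft"),
    ("COLLEGE NAME", "The University of Manchester - UK"),
]
print(group_entities(pairs))
```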