In [1]:
import spacy
import pickle
import random

- Collecting and annotating training data is often the most time-consuming part of building an NER model.

- For how to prepare NLP training data, see: https://spacy.io/usage/training#training-data
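
- As a rough illustration of that format, here is a minimal sketch of one spaCy v2-style example: a text paired with character-offset entity spans. The sentence, the label spellings, and the `check_example` helper are all made up for illustration; they are not taken from `train_data.pkl` or from spaCy itself.

```python
# Hypothetical toy example in the (text, {"entities": [(start, end, label)]}) format.
text = "Alice Clark worked at Microsoft in Delhi"
example = (
    text,
    {"entities": [(0, 11, "Name"), (22, 31, "Companies worked at"), (35, 40, "Location")]},
)


def check_example(example):
    """Return the (label, substring) pairs that an example's offsets point to."""
    text, annotations = example
    spans = []
    for start, end, label in annotations["entities"]:
        # Offsets are character positions; the end offset is exclusive.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"Bad span ({start}, {end}) for label {label!r}")
        spans.append((label, text[start:end]))
    return spans


print(check_example(example))
```

- Printing the substrings this way is a quick check that the offsets line up with the intended words before training.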

In [2]:
# Use a context manager so the file handle is closed after loading
with open("C:/Users/Ashish/Desktop/Python Projects/CV and Resume Summarization (NLP)/train_data.pkl", "rb") as f:
    train_data = pickle.load(f)

In [3]:
train_data[0]

('Govardhana K Senior Software Engineer  Bengaluru, Karnataka, Karnataka - Email me on Indeed: indeed.com/r/Govardhana-K/ b2de315d95905b68  Total IT experience 5 Years 6 Months Cloud Lending Solutions INC 4 Month • Salesforce Developer Oracle 5 Years 2 Month • Core Java Developer Languages Core Java, Go Lang Oracle PL-SQL programming, Sales Force Developer with APEX.  Designations & Promotions  Willing to relocate: Anywhere  WORK EXPERIENCE  Senior Software Engineer  Cloud Lending Solutions -  Bangalore, Karnataka -  January 2018 to Present  Present  Senior Consultant  Oracle -  Bangalore, Karnataka -  November 2016 to December 2017  Staff Consultant  Oracle -  Bangalore, Karnataka -  January 2014 to October 2016  Associate Consultant  Oracle -  Bangalore, Karnataka -  November 2012 to December 2013  EDUCATION  B.E in Computer Science Engineering  Adithya Institute of Technology -  Tamil Nadu  September 2008 to June 2012  https://www.indeed.com/r/Govardhana-K/b2de315d95905b68?isid=rex-

In [19]:
nlp = spacy.blank("en")

def train_model(train_data):
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
            
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  #Only train NER
        optimizer = nlp.begin_training()
        for itn in range(10):
            print("Starting Iteration " + str(itn))
            random.shuffle(train_data)
            losses = {}
            for text, annotations in train_data:
                try:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,  # dropout: makes it harder to memorize the data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                except Exception:
                    # Skip examples that spaCy rejects (e.g. overlapping or
                    # misaligned entity spans) instead of aborting training.
                    pass
                    
            print(losses)
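
- The training loop skips any example on which `nlp.update` raises. One plausible cause, which this notebook does not confirm, is overlapping entity spans in the annotations. The sketch below shows a pre-check for that case; `has_overlap` is a hypothetical helper, and the two annotation lists are made up.

```python
def has_overlap(entities):
    """Return True if any two (start, end, label) spans share characters."""
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    # After sorting by start offset, checking neighbouring spans is enough.
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:  # next span begins before the previous one ends
            return True
    return False


# Toy annotations (made up for illustration).
clean = [(0, 11, "Name"), (22, 31, "Companies worked at")]
clashing = [(0, 11, "Name"), (6, 11, "Designation")]
print(has_overlap(clean), has_overlap(clashing))
```

- Under that assumption, something like `[ex for ex in train_data if not has_overlap(ex[1]["entities"])]` would drop the clashing examples up front, so it is visible how many are lost.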


In [20]:
train_model(train_data)

Starting Iteration 0
{'ner': 14612.719882360358}
Starting Iteration 1
{'ner': 9009.208793745103}
Starting Iteration 2
{'ner': 10675.988715216505}
Starting Iteration 3
{'ner': 6225.49919928082}
Starting Iteration 4
{'ner': 6054.286762662675}
Starting Iteration 5
{'ner': 6323.285963396171}
Starting Iteration 6
{'ner': 5541.298066961361}
Starting Iteration 7
{'ner': 5744.006587369071}
Starting Iteration 8
{'ner': 4350.9219955215685}
Starting Iteration 9
{'ner': 4388.279996247049}


### Saving the NLP model for future use

In [21]:
nlp.to_disk("nlp_CVmodel")

### Loading the saved trained NLP model

In [22]:
nlp_model = spacy.load("nlp_CVmodel")

In [23]:
train_data[0]

("Sanand Pal SQL and MSBI Developer with experience in Azure SQL and Data Lake store.  Hyderabad, Telangana - Email me on Indeed: indeed.com/r/Sanand-Pal/5c99c42c3400737c  I intend to establish myself as Software Engineer / architect with an integrated business solution provider through a long time commitment, contributing to the company's growth and in turn ensuring personal growth within the organization. I believe that my technical, functional and communication skills will enable me in facing the challenging career ahead.  Willing to relocate to: Kolkata, West Bengal - hyderbad, Telangana  WORK EXPERIENCE  Assistant Consultant  TCS  • Expertise in SQL Server(2008 R2, 2012, 2014) development, Microsoft BI (SSIS) • Experience with Microsoft BI (SSAS, SSRS), ASP.NET, VSTO, C#. • Experience in all the phases of Software Development Life Cycle (SDLC). • Experience in Business Requirements Analysis, meeting customer expectations. • Have had the opportunity to handle and work in multiple p

- It is not a good idea to test the model on its own training data, because the results will look better than they would on new resumes.
- For illustration only, we pass the training example above to our NLP model.
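
- A common alternative is to hold out part of the data before training and evaluate only on that part. The sketch below is one way to do it with the standard library; `split_data`, its 20% default, and the placeholder `toy_data` are assumptions, not part of the original notebook.

```python
import random


def split_data(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data standing in for the real train_data list.
toy_data = [(f"resume text {i}", {"entities": []}) for i in range(10)]
train_set, test_set = split_data(toy_data)
print(len(train_set), len(test_set))
```

- With a split like this, `train_model` would receive only `train_set`, and the predictions would be inspected on `test_set`.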

In [24]:
doc = nlp_model(train_data[0][0])  # pass only the text, not the entity annotations
for ent in doc.ents:
    print(f'{ent.label_.upper():30}-{ent.text}')

NAME                          -Sanand Pal
DESIGNATION                   -SQL and MSBI Developer
LOCATION                      -Hyderabad
EMAIL ADDRESS                 -indeed.com/r/Sanand-Pal/5c99c42c3400737c
LOCATION                      -Kolkata
DESIGNATION                   -Assistant Consultant
COMPANIES WORKED AT           -TCS
COMPANIES WORKED AT           -Microsoft
DESIGNATION                   -SSIS developer/Sustain resource
COMPANIES WORKED AT           -MICROSOFT
LOCATION                      -Hyderabad
LOCATION                      -Hyderabad
DEGREE                        -Bachelor of Technology in Branch
COLLEGE NAME                  -East Point College of Engg. & Tech.
LOCATION                      -Bengaluru
SKILLS                        -Sql Server, Ssis, T-SQL, ETL, SSRS


### Run entity recognition on unseen data

- Steps:
    1. Open the PDF and extract the text of each page
    2. Combine the page text into a single string
    3. Run the trained NER model on that string

In [25]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading https://files.pythonhosted.org/packages/19/1a/aa35448efb2ec495b515030684f60ba1ea805c314f0109740df04d060d17/PyMuPDF-1.16.17-cp37-cp37m-win_amd64.whl (4.9MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.16.17


In [27]:
import fitz  # PyMuPDF

fname = "Alice Clark CV.pdf"
doc = fitz.open(fname)
text = ""
for page in doc:
    # getText() is the name in PyMuPDF 1.16; newer releases call it get_text()
    text = text + str(page.getText())

In [28]:
print(text)

Alice Clark 
AI / Machine Learning 
 
Delhi, India Email me on Indeed 
• 
20+ years of experience in data handling, design, and development 
• 
Data Warehouse: Data analysis, star/snow flake scema data modelling and design specific to 
data warehousing and business intelligence 
• 
Database: Experience in database designing, scalability, back-up and recovery, writing and 
optimizing SQL code and Stored Procedures, creating functions, views, triggers and indexes. 
Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL Azure, 
Stream Analytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure data lake 
analytics(U-SQL) 
Willing to relocate anywhere 
 
WORK EXPERIENCE 
Software Engineer 
Microsoft – Bangalore, Karnataka 
January 2000 to Present 
1. Microsoft Rewards Live dashboards: 
Description: - Microsoft rewards is loyalty program that rewards Users for browsing and shopping 
online. Microsoft Rewards members can earn points when searching with Bing, bro

In [29]:
# removing new lines that are present in the above text

tx = " ".join(text.split("\n"))
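
- Joining on `"\n"` keeps a double space wherever a line ended with a trailing space. If single spaces are preferred, `str.split()` with no argument collapses every run of whitespace. The sketch below compares the two on a short made-up sample. Note that the training texts shown earlier also contain double spaces, so the collapsed form will not necessarily give better predictions.

```python
# Short sample imitating the PDF text above (trailing spaces before newlines).
raw = "Alice Clark \nAI / Machine Learning \n \nDelhi, India Email me on Indeed \n"

joined = " ".join(raw.split("\n"))  # the notebook's approach: keeps double spaces
collapsed = " ".join(raw.split())   # collapses each whitespace run to one space

print(repr(joined))
print(repr(collapsed))
```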

In [31]:
print(tx)

Alice Clark  AI / Machine Learning    Delhi, India Email me on Indeed  •  20+ years of experience in data handling, design, and development  •  Data Warehouse: Data analysis, star/snow flake scema data modelling and design specific to  data warehousing and business intelligence  •  Database: Experience in database designing, scalability, back-up and recovery, writing and  optimizing SQL code and Stored Procedures, creating functions, views, triggers and indexes.  Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL Azure,  Stream Analytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure data lake  analytics(U-SQL)  Willing to relocate anywhere    WORK EXPERIENCE  Software Engineer  Microsoft – Bangalore, Karnataka  January 2000 to Present  1. Microsoft Rewards Live dashboards:  Description: - Microsoft rewards is loyalty program that rewards Users for browsing and shopping  online. Microsoft Rewards members can earn points when searching with Bing, bro

In [32]:
# testing the model on unseen data

doc = nlp_model(tx)
for ent in doc.ents:
    print(f'{ent.label_.upper():30}- {ent.text}')

NAME                          - Alice Clark
DESIGNATION                   - AI / Machine Learning
LOCATION                      - Delhi
DESIGNATION                   - Software Engineer
COMPANIES WORKED AT           - Microsoft
LOCATION                      - Bangalore
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
DEGREE                        - Indian Institute of Technology
COLLEGE NAME                  - Mumbai
SKILLS                        - Machine Learning, Natural Language Processing, and Big Data Handling    ADDITIONAL INFORMATION  Professional Skills
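
- The same company appears several times in the output above. One way to tidy this is to group the entity texts by label and drop repeats. The following is a minimal sketch; `group_entities` is a hypothetical helper, and `pairs` holds a few rows copied from the output rather than a live `doc.ents`.

```python
def group_entities(pairs):
    """Group (label, text) pairs by label, keeping each text once per label."""
    grouped = {}
    for label, text in pairs:
        texts = grouped.setdefault(label, [])
        if text not in texts:
            texts.append(text)
    return grouped


# A few (label, text) rows copied from the output above.
pairs = [
    ("NAME", "Alice Clark"),
    ("COMPANIES WORKED AT", "Microsoft"),
    ("LOCATION", "Bangalore"),
    ("COMPANIES WORKED AT", "Microsoft"),
]
for label, texts in group_entities(pairs).items():
    print(f"{label:30}- {', '.join(texts)}")
```

- With a real document, the pairs could be built as `[(ent.label_.upper(), ent.text) for ent in doc.ents]`.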
