# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 1 - Data Exploration and Preprocessing

#### Load the Dataset
The dataset we will be using is located in the `data` folder included in the project. Verify that the data is available by executing the code cell below.

In [1]:
import os
dataset_path = "./data/Entity Recognition in Resumes.json"
print("Path exists? {}".format(os.path.exists(dataset_path)))

Path exists? True


In [2]:
## use the "open" function to get a file handle
with open(dataset_path, encoding="utf8") as f:
    ## use the file handle to read all lines into a list of strings (one resume per line)
    lines = f.readlines()
    ## print how many lines were loaded
    print(len(lines))
    ## print one resume/line to see how the resumes look in raw text form
    print("Sample resume:", lines[0])


701
Sample resume: {"content": "Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECH

#### Parse the JSON lines into dictionaries

In [3]:
## import the json module to parse JSON strings
import json
## using a list comprehension, cycle through all lines (loaded above) and convert each one
## to a dictionary with json.loads. All converted resumes are stored in the 'all_resumes' list
all_resumes = [json.loads(line) for line in lines]
## select one resume from the all_resumes list to examine
resume = all_resumes[1]
print(resume)

{'content': 'Alok Khandai\nOperational Analyst (SQL DBA) Engineer - UNISYS\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467\n\n❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,\nDevelopment & Support of MS SQL Servers in Production, Development environments &\nReplication and Cluster Server Environments.\n❖ Working Experience with relational database such as SQL.\n❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.\n❖ Experience in upgrading SQL Server.\n❖ Good experience with implementing DR solution, High Availability of database servers using\nDatabase mirroring and replications and Log Shipping.\n❖ Experience in implementing SQL Server security and Object permissions like maintaining\nDatabase authentication modes, creation of users, configuring permissions and assigning roles\nto users.\n❖ Experience in creating Jobs, Alerts, SQL Mail Agent\n❖ Experience in per
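
Each line of the file is a standalone JSON document, which is why the cell above parses line by line instead of calling `json.load` on the whole file. The sketch below shows the same idea on a few hypothetical in-memory lines (not taken from the dataset) and additionally skips blank lines; the helper name `parse_json_lines` is an assumption made for illustration.

```python
import json

# Hypothetical sample lines standing in for the dataset file
sample_lines = [
    '{"content": "Jane Doe\\nData Engineer", "annotation": []}\n',
    '\n',
    '{"content": "John Roe\\nAnalyst", "annotation": null}\n',
]

def parse_json_lines(lines):
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]

sample_resumes = parse_json_lines(sample_lines)
print(len(sample_resumes))
print(sample_resumes[0]["content"])
```

Note that a JSON `null` annotation becomes Python `None`, which is the case the conversion function later in this notebook guards against.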

##### Explore the resume data structure

In [4]:
## explore the keys in the sample resume
print("keys and values in resume:")
## print out the keys and values for the sample resume
for key, value in resume.items():
    print("key: {} value: {}".format(key, value))

keys and values in resume:
key: content value: Alok Khandai
Operational Analyst (SQL DBA) Engineer - UNISYS

Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467

❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,
Development & Support of MS SQL Servers in Production, Development environments &
Replication and Cluster Server Environments.
❖ Working Experience with relational database such as SQL.
❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.
❖ Experience in upgrading SQL Server.
❖ Good experience with implementing DR solution, High Availability of database servers using
Database mirroring and replications and Log Shipping.
❖ Experience in implementing SQL Server security and Object permissions like maintaining
Database authentication modes, creation of users, configuring permissions and assigning roles
to users.
❖ Experience in creating Jobs, Alerts, SQL Mail Agent
❖ 

In [5]:
## TODO print the resume text
print("resume content:")
print(resume["content"])
## TODO print the resume's list of entity annotations
print("resume entity list:")
print(resume["annotation"])

resume content:
Alok Khandai
Operational Analyst (SQL DBA) Engineer - UNISYS

Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467

❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,
Development & Support of MS SQL Servers in Production, Development environments &
Replication and Cluster Server Environments.
❖ Working Experience with relational database such as SQL.
❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.
❖ Experience in upgrading SQL Server.
❖ Good experience with implementing DR solution, High Availability of database servers using
Database mirroring and replications and Log Shipping.
❖ Experience in implementing SQL Server security and Object permissions like maintaining
Database authentication modes, creation of users, configuring permissions and assigning roles
to users.
❖ Experience in creating Jobs, Alerts, SQL Mail Agent
❖ Experience in performing integr

##### Explore the list of entity labels

In [6]:
## explore the entity list
## print out each key and each value for each entity in the entities list
for entity in resume["annotation"]:
    print("Entity")
    for key, value in entity.items():
        print("key: {} value: {}".format(key, value))

## inspect the 'points' field of one entity and its type
print(resume["annotation"][1]['points'])
print(type(resume["annotation"][1]['points']))

Entity
key: label value: ['Skills']
key: points value: [{'start': 8098, 'end': 8383, 'text': '❖ Operating Environment: […] Windows95/98/XP/NT\n❖ Database Tool: SQL Management Studio (MSSQL), Business\nDevelopment Studio, Visual studio 2005\n❖ Database Language: SQL, PL/SQL\n❖ Ticket Tracking Tool: Service Now\n❖ Reporting Tools: MS Reporting Services, SAS\n❖ Languages: C, C++, PL/SQL'}]
Entity
key: label value: ['Skills']
key: points value: [{'start': 8008, 'end': 8049, 'text': 'Database (3 years), SQL (3 years), Sql Dba'}]
Entity
key: label value: ['Graduation Year']
key: points value: [{'start': 7994, 'end': 7997, 'text': '2012'}]
Entity
key: label value: ['College Name']
key: points value: [{'start': 7955, 'end': 7991, 'text': 'Indira Gandhi Institute Of Technology'}]
Entity
key: label value: ['College Name']
key: points value: [{'start': 7904, 'end': 7952, 'text': 'B.Tech in Computer Science and Engineering in CSE'}]
Entity
key: label value: ['Companies worked at']
key: points valu
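
In the output above, the 'Graduation Year' span runs from 7994 to 7997 yet holds the four characters `2012`, so the `end` index is inclusive. The sketch below uses a hypothetical text and annotation (not from the dataset) to show how that maps onto Python slicing, which excludes the end index.

```python
# Hypothetical text and annotation in the same shape as the dataset
text = "Graduated in 2012 from IGIT"
point = {"start": 13, "end": 16, "text": "2012"}

# Python slices exclude the end index, so add 1 to an inclusive end
inclusive_slice = text[point["start"]:point["end"] + 1]
# Without the + 1 the last character is cut off
exclusive_slice = text[point["start"]:point["end"]]

print(repr(inclusive_slice))
print(repr(exclusive_slice))
```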

##### Convert the data to spaCy offset format
1. Use the provided data conversion function
2. Convert the data with that function, storing the results in a variable
3. Inspect the converted data

In [7]:
## data conversion function
def convert_data(data):
    """
    Create NER training data in spaCy offset format from one record of the JSON dataset.
    Returns a (text, {"entities": [(start, end, label), ...]}) tuple usable for spaCy training.
    """
    text = data['content']
    entities = []
    if data['annotation'] is not None:
        for annotation in data['annotation']:
            # each annotation is assumed to hold a single point; only the first one is used
            point = annotation['points'][0]
            labels = annotation['label']
            # handle both a list of labels and a single label
            if not isinstance(labels, list):
                labels = [labels]
            for label in labels:
                # dataturks indices are inclusive on both ends [start, end],
                # but spaCy expects an exclusive end [start, end), hence the + 1
                entities.append((point['start'], point['end'] + 1, label))
    return (text, {"entities": entities})

## convert each resume in all_resumes with the function above and store the results
converted_resumes = [convert_data(res) for res in all_resumes]
## print the number of converted resumes, then inspect one of them
print(len(converted_resumes))
print(converted_resumes[1])
print(type(converted_resumes[1]))

701
('Alok Khandai\nOperational Analyst (SQL DBA) Engineer - UNISYS\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467\n\n❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,\nDevelopment & Support of MS SQL Servers in Production, Development environments &\nReplication and Cluster Server Environments.\n❖ Working Experience with relational database such as SQL.\n❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.\n❖ Experience in upgrading SQL Server.\n❖ Good experience with implementing DR solution, High Availability of database servers using\nDatabase mirroring and replications and Log Shipping.\n❖ Experience in implementing SQL Server security and Object permissions like maintaining\nDatabase authentication modes, creation of users, configuring permissions and assigning roles\nto users.\n❖ Experience in creating Jobs, Alerts, SQL Mail Agent\n❖ Experience in performing
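
Before training, it can help to sanity-check the converted tuples. The sketch below uses a hypothetical helper, `find_overlaps`, and toy data to flag entity spans that overlap within one document. Whether overlapping spans are actually a problem depends on the spaCy version and training setup, so treat this as an optional check rather than a required step.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that overlap.

    Spans use exclusive end offsets, as produced by convert_data.
    """
    ordered = sorted(entities, key=lambda e: (e[0], e[1]))
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so no later span can overlap 'first'
            overlaps.append((first, second))
    return overlaps

# Hypothetical entity list: the first two spans share characters 5 to 7
toy_entities = [(0, 8, "Name"), (5, 12, "Designation"), (20, 24, "Graduation Year")]
print(find_overlaps(toy_entities))
```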

##### Filter out resumes without annotations
A few of the resumes have an empty entity list. We want to filter these resumes out of our data before continuing. We will do the following:
1. Cycle through all resumes using a for loop or list comprehension
2. For each resume, if the resume has no labeled entities, ignore it. Otherwise, save it to a new resume list

In [8]:
## filter out resumes whose entities list is empty (this can be done in a one-line list comprehension)
## save the result to the converted_resumes variable
converted_resumes = [res for res in converted_resumes if len(res[1]["entities"]) > 0]
## print the length of the filtered converted_resumes list
print(len(converted_resumes))

690


##### Print all entities for one converted resume
1. Store one converted resume in the 'converted_resume' variable
2. Find the entity list in the converted_resume
3. Cycle through the entities and, using the start and end index, print the label and the value of each entity. The value is the substring of the text delimited by the start and end index

In [9]:
## store one resume in the variable
converted_resume = converted_resumes[1]
## find text content and store in variable
text = converted_resume[0]
## find the entities list and store in variable
entities_list = converted_resume[1]
## TODO for each entity, print the label, and the text (text content substring pointed to by start and end index)
for entity in entities_list["entities"]:
    print("label: {}, text: {}".format(entity[2], text[entity[0]:entity[1]]))

label: Skills, text: ❖ Operating Environment: […] Windows95/98/XP/NT
❖ Database Tool: SQL Management Studio (MSSQL), Business
Development Studio, Visual studio 2005
❖ Database Language: SQL, PL/SQL
❖ Ticket Tracking Tool: Service Now
❖ Reporting Tools: MS Reporting Services, SAS
❖ Languages: C, C++, PL/SQL
label: Skills, text: Database (3 years), SQL (3 years), Sql Dba
label: Graduation Year, text: 2012
label: College Name, text: Indira Gandhi Institute Of Technology
label: College Name, text: B.Tech in Computer Science and Engineering in CSE
label: Companies worked at, text: Microsoft Corporation
label: Location, text: Bengaluru
label: Companies worked at, text: HCL Technologies
label: Designation, text: SQL DBA Analyst
label: Designation, text: DBA Support Analyst
label: projects, text:  Finance Support
label: Companies worked at, text: Microsoft Corporation
label: Email Address, text: indeed.com/r/Alok-Khandai/5be849e443b8f467
label: Companies worked at, text: Microsoft Corporation


##### Collect unique labels of all entities in dataset

In [10]:
## collect entity label names across the resume dataset
all_labels = list()
for res in converted_resumes:
    ## NOTE: only the first entity of each resume is read here, so a label
    ## that never appears first in any resume will not be collected
    first_entity = res[1]["entities"][0]
    all_labels.append(first_entity[2])
## all_labels is not yet unique, so make the list a set of unique values
unique_labels = set(all_labels)
## print unique entity labels
print("Entity labels: ",unique_labels)

Entity labels:  {'projects', 'links', 'Skills', 'Email Address', 'Graduation Year', 'Links', 'Rewards and Achievements', 'College Name', 'UNKNOWN', 'Designation', 'state', 'College', 'Years of Experience', 'Companies worked at', 'Degree', 'Can Relocate to', 'Location', 'Name'}
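
The cell above indexes `res[1]["entities"][0]`, that is, the first entity of each resume. A sketch that visits every entity of every resume could look like the following; it runs on hypothetical toy data in the converted format, so its result says nothing about the real dataset.

```python
# Hypothetical converted resumes: (text, {"entities": [(start, end, label), ...]})
toy_converted = [
    ("text a", {"entities": [(0, 4, "Name"), (5, 6, "Skills")]}),
    ("text b", {"entities": [(0, 4, "Name"), (5, 6, "Degree")]}),
]

# Set comprehension over every entity of every resume
toy_labels = {
    label
    for _, annotations in toy_converted
    for _, _, label in annotations["entities"]
}
print(sorted(toy_labels))
```

The printed set above also contains both `links` and `Links`, so label names may need normalizing before they are used for training.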


##### Validate entities

In [11]:
## store the label names of the entities you want to work with in a list
chosen_entity_label = ["Skills", "Years of Experience", "Designation", "Degree", "College Name"]
## for each chosen label, count how many documents contain at least one entity with that label,
## and how many entities with that label there are in total
for chosen in chosen_entity_label:
    found_docs_with_entity = 0
    entity_count = 0
    for resume in converted_resumes:
        entity_list = resume[1]["entities"]
        ## unzip the (start, end, label) tuples and keep only the labels
        _, _, labels = zip(*entity_list)
        if chosen in labels:
            found_docs_with_entity += 1
            entity_count += labels.count(chosen)
    print("Docs with {}: {}".format(chosen, found_docs_with_entity))
    print("Total count of {}: {}".format(chosen, entity_count))

Docs with Skills: 536
Total count of Skills: 2152
Docs with Years of Experience: 217
Total count of Years of Experience: 623
Docs with Designation: 650
Total count of Designation: 2842
Docs with Degree: 606
Total count of Degree: 1012
Docs with College Name: 497
Total count of College Name: 1160
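
The same two statistics can be gathered for all labels in a single pass with `collections.Counter`. This is a sketch on hypothetical toy data, offered as an alternative to the nested loop rather than a replacement for it.

```python
from collections import Counter

# Hypothetical converted resumes: (text, {"entities": [(start, end, label), ...]})
toy_converted = [
    ("doc one", {"entities": [(0, 1, "Skills"), (2, 3, "Skills"), (4, 5, "Degree")]}),
    ("doc two", {"entities": [(0, 1, "Skills")]}),
]

doc_counts = Counter()     # number of documents containing each label
entity_counts = Counter()  # total number of entities per label
for _, annotations in toy_converted:
    labels = [label for _, _, label in annotations["entities"]]
    entity_counts.update(labels)
    doc_counts.update(set(labels))  # count each label once per document

for label in sorted(entity_counts):
    print("Docs with {}: {}".format(label, doc_counts[label]))
    print("Total count of {}: {}".format(label, entity_counts[label]))
```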


##### Save converted data for later use

In [113]:
converted_resumes_path = "./data/converted_resumes.json"
## save the converted resumes to the path using "open" and json's "dump" function
with open(converted_resumes_path, 'w', encoding="utf8") as outfile:
    json.dump(converted_resumes, outfile)
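
One thing to keep in mind when loading the file back: JSON has no tuple type, so the `(text, annotations)` tuples and the `(start, end, label)` tuples come back as lists. The sketch below demonstrates this with an in-memory round trip on hypothetical toy data and rebuilds the tuples in case downstream code expects them.

```python
import json

# Hypothetical converted resume in the same format as converted_resumes
toy_converted = [("Jane Doe", {"entities": [(0, 8, "Name")]})]

# Round trip through a JSON string instead of a file
restored = json.loads(json.dumps(toy_converted))
print(type(restored[0]), type(restored[0][1]["entities"][0]))

# Rebuild the tuple structure after loading
rebuilt = [
    (text, {"entities": [tuple(entity) for entity in annotations["entities"]]})
    for text, annotations in restored
]
print(rebuilt == toy_converted)
```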