# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 1 - Data Exploration and preprocessing
In this first part of the challenge, we will load and examine the dataset we will be working with. We will also prepare the data for training, which starts in the second part of the challenge. You will need to write some basic Python for file loading, data conversion, and basic dictionary and array manipulation. If you are experienced with Python, this will be easy. If you are new to Python and/or programming, it is a good opportunity to learn the basics you will need for data loading and exploration.

* *If you need help setting up Python or running this notebook, please ask the professor's assistants*
* *It might be helpful to try your code out first in a Python IDE such as PyCharm before copying it and running it here in this notebook*

#### Load the Dataset
The dataset we will be using is located in the dataset folder included in the project. Verify that the data is available by executing the code cell below.

In [46]:
import os
dataset_path = "./dataset/Entity Recognition in Resumes.json"
print("Path exists? {}".format(os.path.exists(dataset_path)))

Path exists? True


So far so good? OK, then let's load the dataset. The dataset is structured so that each line of text is a resume.
You will do the following:
1. Using Python's built-in "open" function, get a file handle to the dataset (tip: don't forget the file is UTF-8!)
2. Load the data into an array of resumes (each text line is one resume)
3. Use the print function to print how many resumes were loaded
4. Use the print function to output one of the resumes so we can see how the resumes look in raw text form


In [47]:
import numpy as np
# List Resume
Resume = []
## use the "open" function to get a filehandle.
with open(dataset_path,encoding="utf8") as f:
    ## use the filehandle to read all lines into an array of text lines. 
    for line in f.readlines():
        Resume.append(line)
    Resume = np.array(Resume)    
    ## print how many lines were loaded
    print(Resume.shape)
    ## now print one resume/line to see how the resumes look in raw text form
    print("Sample resume:")
    #TODO print sample resume
    print(Resume[0])


(701,)
Sample resume:
{"content": "Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nT

#### Convert the dataset to json
As we can see, the resumes are not in a convenient human-readable form; each line is a JSON dictionary stored as raw text. We want to work with the resumes as Python dictionaries and not as raw text, so we will convert the resumes from text to dictionaries. We will do the following:
1. Import the json module
2. Loop through all of the text lines and use the json 'loads' function to convert the line to a python dictionary. Tip - you can use a 'for' loop, or if you want to get fancy, a python 'list comprehension' to accomplish this. 
3. Select one of the converted resumes so that we can examine its structure.   


In [48]:
import numpy as np
## import json module to load json strings
import json
## using a for loop or a list comprehension, cycle through all lines (loaded above) and convert them to dictionaries 
## using json loads function. Make sure all converted resumes are stored in the 'all_resumes' array below  
all_resumes = []
for i in range(len(Resume)):
    all_resumes.append(json.loads(Resume[i]))
all_resumes = np.array(all_resumes) 
print(all_resumes.shape)
## select one resume to examine from the all_resumes list
resume = all_resumes[0]
print(resume)
print(len(resume))


(701,)
{'content': 'Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS
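
The tip above mentions a list comprehension as an alternative to the `for` loop. A minimal sketch of that variant follows; the two sample lines are hypothetical stand-ins for lines of the dataset file, not real records.

```python
import json

# Two hypothetical JSON Lines records standing in for lines of the dataset file.
sample_lines = [
    '{"content": "Jane Doe\\nPython developer", "annotation": []}\n',
    '{"content": "John Roe\\nData analyst", "annotation": null}\n',
]

# json.loads tolerates the trailing newline; blank lines are skipped defensively.
parsed = [json.loads(line) for line in sample_lines if line.strip()]

print(len(parsed))                # number of parsed records
print(sorted(parsed[0].keys()))   # keys of the first record
```

A plain Python list is enough here; converting to a NumPy array is only needed if array-style indexing is wanted later.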

##### Explore the resume data structure
You should have one sample resume saved in the "resume" variable. Now we will examine the resume dictionary. Complete the code below to see the keys in the dictionary 

In [49]:
import numpy as np
## explore keys in cv
print("keys and values in resume:")
## TODO print out the keys and values for the sample resume
for key, value in resume.items():
    print(key, value)

keys and values in resume:
content Afreen Jamadar
Active member of IIIT Committee in Third year

Sangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6

I wish to use my knowledge, skills and conceptual understanding to create excellent team
environments and work consistently achieving organization objectives believes in taking initiative
and work to excellence in my work.

WORK EXPERIENCE

Active member of IIIT Committee in Third year

Cisco Networking -  Kanpur, Uttar Pradesh

organized by Techkriti IIT Kanpur and Azure Skynet.
PERSONALLITY TRAITS:
• Quick learning ability
• hard working

EDUCATION

PG-DAC

CDAC ACTS

2017

Bachelor of Engg in Information Technology

Shivaji University Kolhapur -  Kolhapur, Maharashtra

2016

SKILLS

Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT
ACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)

ADDITIONAL INFORMATION

TECHNICAL SKILLS:

• Programming Languages

##### Question: which key do you think points to the text content of the resume?
The key _content_ (first key in order) points to the text content of the resume.
##### Question: which key do you think points to the list of entity annotations? 
The key _annotation_ (second key in order) points to the list of entity annotations.

Based on your answers above, see if you were right by printing the text content and the entity list by completing and executing the code below

In [50]:
## TODO print the resume text
print("resume content:")
print(resume['content'])

print(" ")

## TODO print the resume's list of entity annotations
print("resume entity list:")
print(resume['annotation'])

resume content:
Afreen Jamadar
Active member of IIIT Committee in Third year

Sangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6

I wish to use my knowledge, skills and conceptual understanding to create excellent team
environments and work consistently achieving organization objectives believes in taking initiative
and work to excellence in my work.

WORK EXPERIENCE

Active member of IIIT Committee in Third year

Cisco Networking -  Kanpur, Uttar Pradesh

organized by Techkriti IIT Kanpur and Azure Skynet.
PERSONALLITY TRAITS:
• Quick learning ability
• hard working

EDUCATION

PG-DAC

CDAC ACTS

2017

Bachelor of Engg in Information Technology

Shivaji University Kolhapur -  Kolhapur, Maharashtra

2016

SKILLS

Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT
ACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)

ADDITIONAL INFORMATION

TECHNICAL SKILLS:

• Programming Languages: C, C++, Java, .ne

##### Explore the list of entity labels
The entity list is a list of dictionaries, and we want to explore it:
1. Cycle through the entities in the list. You can use a 'for' loop for this
2. For each entity (which will be a dictionary), print out each key and the value stored under that key

In [51]:
import numpy as np
## explore entity list
resume_entity = np.array(resume['annotation'])
print(resume_entity.shape)
#print(resume_entity)
#print(resume_entity[0])

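##TODO print out each key and each value for each entity in the entities list
## unpacking a dictionary yields its keys, so this prints the two key names of the first entity
key, value = resume_entity[0]
print(key, value)
## print index, key and value for every entity
for i in range(len(resume_entity)):
    for key, value in resume_entity[i].items():
        print (i, key, value)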

(13,)
label points
0 label ['Email Address']
0 points [{'start': 1155, 'end': 1198, 'text': 'indeed.com/r/Afreen-Jamadar/8baf379b705e37c6'}]
1 label ['Links']
1 points [{'start': 1143, 'end': 1239, 'text': 'https://www.indeed.com/r/Afreen-Jamadar/8baf379b705e37c6?isid=rex-download&ikw=download-top&co=IN'}]
2 label ['Skills']
2 points [{'start': 743, 'end': 1140, 'text': 'Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\n• Programming Languages: C, C++, Java, .net, php.\n• Web Designing: HTML, XML\n• Operating Systems: Windows […] Windows Server 2003, Linux.\n• Database: MS Access, MS SQL Server 2008, Oracle 10g, MySql.'}]
3 label ['Graduation Year']
3 points [{'start': 729, 'end': 732, 'text': '2016'}]
4 label ['College Name']
4 points [{'start': 675, 'end': 702, 'text': 'Shivaji University Kolhapur '}]
5 label ['Degree']
5 points [

##### Question: What keys do the entity entries have? What is the datatype of the values of these keys?
The keys are _label_ and _points_. The values are list objects.
##### Question: What do these keys and values mean? (think of their significance as entity labels)
The _label_ value names the entity class (for example _Skills_ or _Degree_) that was assigned to a span of the resume. The _points_ value locates that span: it holds the start and end character offsets in the resume text together with the matching text. These labeled spans will serve as training data, so that our model learns to extract the same classes from a given data set (test data/resume) as correctly as possible.
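
To make the meaning of _points_ concrete, the sketch below slices the text with the stored offsets. It uses a small hypothetical record in the same shape as the dataset entries; as in the output above (for example '2016' at 729 to 732), the end index is treated as inclusive.

```python
# Hypothetical record in the same shape as the dataset entries shown above.
record = {
    "content": "Jane Doe\nPython developer",
    "annotation": [
        {"label": ["Name"],
         "points": [{"start": 0, "end": 7, "text": "Jane Doe"}]},
        {"label": ["Designation"],
         "points": [{"start": 9, "end": 24, "text": "Python developer"}]},
    ],
}

spans = []
for annotation in record["annotation"]:
    point = annotation["points"][0]
    # The end index is inclusive, so the slice needs end + 1.
    span = record["content"][point["start"]:point["end"] + 1]
    spans.append(span)
    print(annotation["label"], repr(span), span == point["text"])
```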

##### Convert data to "spacy" offset format
Before we go any further, we need to convert the data into a slightly more compact format. This format is the format we will be using to train our first models in the next part of the challenge. Here we will do the following:
1. Use the provided data conversion function
2. Convert the data with that function, storing the results in a variable
3. Inspect the converted data

In [52]:
import numpy as np
#import copy
## data conversion method
def convert_data(data):
    """
    Creates NER training data in Spacy format from JSON dataset
    Outputs the Spacy training data which can be used for Spacy training.
    """
    text = data['content']
    entities = []
    if data['annotation'] is not None:
        for annotation in data['annotation']:
            # only a single point in text annotation.
            point = annotation['points'][0]
            labels = annotation['label']
            # handle both list of labels or a single label.
            if not isinstance(labels, list):
                labels = [labels]
            for label in labels:
                # dataturks indices are both inclusive [start, end] but spacy is not [start, end)
                entities.append((point['start'], point['end'] + 1, label))
    return (text, {"entities": entities})
   
## TODO using a loop or list comprehension, convert each resume in all_resumes using the convert function above, storing the result
converted_resumes = []
for i in range(len(all_resumes)):
    converted_resumes.append(convert_data(all_resumes[i]))
#converted_resumes_list = copy.copy(converted_resumes)
converted_resumes = np.array(converted_resumes)    
## TODO print the number of resumes in converted resumes 
print(converted_resumes.shape)
print(converted_resumes)

(701, 2)
[['Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\n• P

##### Filter out duplicates:

In [53]:
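r = converted_resumes
n = 0
## NOTE: the loop below appends r[i] once for every other index j whose text equals r[i]'s text,
## so this list holds the resumes that have at least one duplicate (repeated once per matching pair)
converted_resumes_w_duplicates = []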

for i in range(len(r)):
    for j in range(len(r)):
        if(j != i):
            if(r[i,0] == r[j,0]):
                #print(i, j)
                converted_resumes_w_duplicates.append(r[i])
                n=n+1
              
            
converted_resumes_w_duplicates = np.array(converted_resumes_w_duplicates)
converted_resumes = converted_resumes_w_duplicates
print(n)
print(converted_resumes_w_duplicates.shape)
print(converted_resumes_w_duplicates)


378
(378, 2)
[['Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\
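
The nested loop above collects the resumes whose text occurs at more than one index. If the goal is instead to keep exactly one copy of each distinct resume text, one possible approach is to remember the texts already seen in a set. This is a sketch under that assumption about the intent, using hypothetical data, and not the notebook's own method.

```python
# Hypothetical converted resumes: (text, {"entities": [...]}) pairs.
converted = [
    ("resume A", {"entities": [(0, 6, "Name")]}),
    ("resume B", {"entities": []}),
    ("resume A", {"entities": [(0, 6, "Name")]}),  # same text as the first entry
]

seen_texts = set()
deduplicated = []
for text, annotations in converted:
    if text in seen_texts:
        continue  # a resume with this text was already kept
    seen_texts.add(text)
    deduplicated.append((text, annotations))

print(len(deduplicated))
```

Set membership tests take constant time on average, so this runs in a single pass instead of comparing every pair of resumes.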

##### Question: how is the converted data different than the original data? How is it the same? 
In ``converted_resumes``, each resume now consists of just two values: first the value of the key _content_ as a string, and second a dictionary with a single key, _entities_, whose value is the list of entities built from the key _annotation_. Compared to the previous form of the data, some keys and their values are now missing, and a resume is no longer stored as a dictionary but as a pair of text and annotations. What stays the same is the resume text itself and the labeled spans that point into it.

##### filter out resumes without annotations
A few of the resumes have an empty entity list. We want to filter these resumes out of our data before continuing. We will do the following:
1. cycle through all resumes using for loop or list comprehension
2. for each resume, if the resume has no labeled entities, ignore it. Otherwise save it to a new resume list

In [54]:
import numpy as np
## TODO filter out resumes where resume entities list is empty (you can do this in a one-line list comprehension)
## save to converted_resumes variable
new_resume = []
for i in range(len(converted_resumes)):
    if(converted_resumes[i,1]['entities']!=[]):
        new_resume.append(converted_resumes[i])
converted_resumes = np.array(new_resume)
converted_new_resumes = converted_resumes # second name for the same array (not an independent copy)
## TODO print length of new filtered converted_resumes.  
print(converted_resumes.shape)

(366, 2)
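
The TODO comment mentions a one-line list comprehension. A minimal sketch on hypothetical data, assuming the resumes are held as (text, annotations) pairs:

```python
# Hypothetical converted resumes: (text, {"entities": [...]}) pairs.
converted = [
    ("resume A", {"entities": [(0, 6, "Name")]}),
    ("resume B", {"entities": []}),
    ("resume C", {"entities": None}),
]

# An empty list and None are both falsy, so one truthiness test covers both cases.
with_entities = [res for res in converted if res[1]["entities"]]
print(len(with_entities))
```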


##### Print all entities for one converted resume
The converted data also has an entity list. You should be able to examine it using techniques similar to those we used above. In the next code block you will write code that prints all of the entities for one resume. Tip: each entry in the 'entities' list consists of the start index of the entity in the resume text, an end index, and the entity label. We will do the following:
1. Store one converted resume in the 'converted_resume' variable
2. Find the entity list in the converted_resume
3. Cycle through the entities, and - using the start and end index - print the label of the entity and the value of the entity. This will be the text substring pointed to by the start and end index

In [55]:
# Which resume?
resume_index = 0
## store one resume in the variable
converted_resume = converted_resumes[resume_index]
## find text content and store in variable
text = converted_resume[0]
## find the entities list and store in variable
entities_list = converted_resume[1]['entities']
## look up the original annotations of the same resume
## (index 0 refers to the same resume in all_resumes and converted_resumes here)
resume_entity = np.array(all_resumes[resume_index]['annotation'])

## for each entity, print the label and the annotated text whose offsets match
for start, end, label in entities_list:
    entity_text = None
    for annotation in resume_entity:
        point = annotation['points'][0]
        ## the converted end index is exclusive, the original one is inclusive
        if point['start'] == start and point['end'] == end - 1:
            entity_text = point['text']
    print(label, ": ", entity_text)
    print(" ")

Email Address :  indeed.com/r/Afreen-Jamadar/8baf379b705e37c6
 
Links :  https://www.indeed.com/r/Afreen-Jamadar/8baf379b705e37c6?isid=rex-download&ikw=download-top&co=IN
 
Skills :  Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT
ACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)

ADDITIONAL INFORMATION

TECHNICAL SKILLS:

• Programming Languages: C, C++, Java, .net, php.
• Web Designing: HTML, XML
• Operating Systems: Windows […] Windows Server 2003, Linux.
• Database: MS Access, MS SQL Server 2008, Oracle 10g, MySql.
 
Graduation Year :  2016
 
College Name :  Shivaji University Kolhapur 
 
Degree :  Bachelor of Engg in Information Technology
 
Graduation Year :  2017

 
College Name :  CDAC ACTS
 
Degree :  PG-DAC
 
Companies worked at :  Cisco Networking
 
Email Address :  indeed.com/r/Afreen-Jamadar/8baf379b705e37c6
 
Location :  Sangli
 
Name :  Afreen Jamadar
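
The task description asks for the substring pointed to by the start and end index. A minimal sketch of that direct slicing approach is shown here on hypothetical data; it assumes the converted offsets use an exclusive end index, as produced by `convert_data`.

```python
text = "Jane Doe\nPython developer"
# Offsets in the converted format: start inclusive, end exclusive.
entities = [(0, 8, "Name"), (9, 25, "Designation")]

# Slice the text once per entity and pair the result with its label.
extracted = [(label, text[start:end]) for start, end, label in entities]
for label, value in extracted:
    print(label, ":", value)
```

Slicing needs only the converted resume itself, so no lookup in the original annotations is required.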
 


##### Question: What are some of the entity labels you see? Are there any entity values that seem surprising or particularly interesting? 
The label _Email Address_ appears twice, and its content doesn't look like an email address; rather, it is a URL, which is a little surprising.  
Really interesting is the label _Skills_. I think it contains the most important and relevant information for hiring somebody for a job. Also not to be ignored are _Companies worked at_ and _College Name_. These last two labels can be considered indicators of whether the information in _Skills_ and _Technical Skills_ is true or false.

##### Collect unique labels of all entities in dataset
Now we are interested in finding out all of the (unique) entity labels which exist in our dataset. Complete and execute the code below to do this.

In [56]:
## collect names of all entities in complete resume dataset
all_labels = list()
for res in converted_resumes:
    text, entities = res
    entities = entities['entities']
   ## entity list of res
    entity_list = [entity[2] for entity in entities]
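    ## TODO extend all_labels with labels of entities 
    ## NOTE: this appends only the first label of each resume;
    ## all_labels.extend(entity_list) would collect every label instead
    all_labels.append(entity_list[0])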
## TODO all_labels is not yet unique. Make the list a set of unique values
all_labels = np.array(all_labels)
unique_labels = np.unique(all_labels)
## Print unique entity labels
print(len(unique_labels))
print("Entity labels: ",unique_labels)

14
Entity labels:  ['College' 'College Name' 'Companies worked at' 'Degree' 'Designation'
 'Email Address' 'Graduation Year' 'Links' 'Location' 'Name' 'Skills'
 'UNKNOWN' 'Years of Experience' 'state']
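
The cell above stores one label per resume before taking the unique values. As a point of comparison, the sketch below gathers the labels of every entity with a set comprehension. The data is hypothetical, and the result on the real dataset is not shown here.

```python
# Hypothetical converted resumes: (text, {"entities": [...]}) pairs.
converted = [
    ("resume A", {"entities": [(0, 6, "Name"), (7, 12, "Skills")]}),
    ("resume B", {"entities": [(0, 4, "Skills"), (5, 9, "Degree")]}),
]

# A set comprehension over every entity of every resume keeps each label once.
unique = sorted({label for _, ann in converted for _, _, label in ann["entities"]})
print(unique)
```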


Now we see all entity labels in our dataset. Do some of them seem particularly interesting to you? 

Choose up to 3 Entities from the list that you would like to use for training a named entity recognition model. 
##### Question: which entities did you choose? 
Based on my explanation in the previous section, the first label I'll choose is _Skills_, the second one is _Years of Experience_ and the last one is _Companies worked at_.

##### Validate entities
Now we need to check that there is adequate training data for the entities you have chosen. 

In [57]:
## TODO store entity label names for the entities you want to work with in an array
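## (selecting by name gives the same labels as unique_labels[10], [2] and [3], without depending on the sort order)
chosen_entity_label = ["Skills", "Companies worked at", "Degree"]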
## for each chosen entity label, count how many documents have a labeled entity for that label, and how many labeled entities total there are 
## for that entity
for chosen in chosen_entity_label:
    found_docs_with_entity = 0
    entity_count = 0
    for resume in converted_resumes:
        entity_list = resume[1]["entities"]
        _,_,labels = zip(*entity_list)
        if chosen in labels:
            found_docs_with_entity+=1
            entity_count+=len([l for l in labels if l == chosen])
    print("Docs with {}: {}".format(chosen,found_docs_with_entity))
    print("Total count of {}: {}".format(chosen,entity_count))

Docs with Skills: 304
Total count of Skills: 1131
Docs with Companies worked at: 344
Total count of Companies worked at: 1493
Docs with Degree: 318
Total count of Degree: 528
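
The same two statistics (documents containing a label, and total labeled entities per label) can also be gathered for all labels in one pass. The sketch below uses `collections.Counter` on hypothetical data and is only one possible way to do it.

```python
from collections import Counter

# Hypothetical converted resumes: (text, {"entities": [...]}) pairs.
converted = [
    ("resume A", {"entities": [(0, 6, "Skills"), (7, 12, "Skills"), (13, 20, "Degree")]}),
    ("resume B", {"entities": [(0, 4, "Skills")]}),
    ("resume C", {"entities": [(0, 4, "Name")]}),
]

total_counts = Counter()  # label -> number of labeled entities
doc_counts = Counter()    # label -> number of resumes containing the label
for _, ann in converted:
    labels = [label for _, _, label in ann["entities"]]
    total_counts.update(labels)
    doc_counts.update(set(labels))  # a set counts each label once per resume

print(doc_counts["Skills"], total_counts["Skills"])
```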


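##### Question: Is adequate training data available for the entities you have chosen? (there should be at least several hundred examples total of each entity)
Yes. According to the counts above, each chosen entity has several hundred labeled examples or more: 1131 for _Skills_, 1493 for _Companies worked at_, and 528 for _Degree_.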

##### Save converted data for later use
We are almost done with the first part of the challenge! One more detail: we need to save the "converted_resumes" list so we can load it in the next notebook. We will do the following:
1. Store the location we want to save the data to in the 'converted_resumes_path' variable
2. Using Python's 'open' function and the 'json' module's 'dump' function, save the data to disk. Make sure to create missing directories (if applicable) using Python's "os.makedirs" function. Save the file with a ".json" file extension
3. Check the filesystem to confirm that the file exists and is complete

In [58]:
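##TODO save converted resumes to path using "open" and json's "dump" function. 
import os
converted_resumes_path = "./dataset/converted_resumes.json"
## create the target directory if it does not exist yet
os.makedirs(os.path.dirname(converted_resumes_path), exist_ok=True)
with open(converted_resumes_path, 'w', encoding="utf8") as outfile:
    json.dump(converted_resumes.tolist(), outfile)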
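
Step 3 asks to check that the file exists and is complete. One way to do this is to reload the file and inspect it, sketched below with hypothetical data and a temporary directory so that nothing in the project folder is touched.

```python
import json
import os
import tempfile

# Hypothetical converted resume: a (text, {"entities": [...]}) pair.
converted = [("resume A", {"entities": [(0, 6, "Name")]})]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical target path inside a temporary directory.
    path = os.path.join(tmp_dir, "dataset", "converted_resumes.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf8") as outfile:
        json.dump(converted, outfile)

    file_exists = os.path.exists(path)
    with open(path, encoding="utf8") as infile:
        reloaded = json.load(infile)

# JSON has no tuple type, so tuples come back as lists.
print(file_exists, reloaded)
```

Because tuples are reloaded as lists, code in the next notebook that expects (start, end, label) tuples may need to convert them back.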


### Congratulations!
We are done with part one. Now we will go on to train our own NER Models with the dataset and the entities we have chosen. 