# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 1 - Data Exploration and preprocessing
In this first part of the challenge, we will load and examine the dataset we will be working with. We will also prepare the data for training which we will start in the second part of the challenge. You will be required to program some basic python pertaining to file loading, data conversion, and basic dictionaries and array manipulation. If you are experienced with Python, this will be easy. If you are new to python and/or programming, it will be a good opportunity to learn some basic programming you will need for data loading and exploration.

* *If you need help setting up python or running this notebook, please get help from the  assistants to the professor*
* *It might be helpful to try your code out first in a python ide like pycharm before copying it an running it here in this notebook*  

#### Load the Dataset
The dataset we will be using is located in the dataset folder included in the project. Verify the data is available by executing the code cell below 

In [2]:
import os
dataset_path = "./data/Entity Recognition in Resumes.json"
print("Path exists? {}".format(os.path.exists(dataset_path)))

Path exists? True


So far so good? OK then let's load the dataset. The dataset is structured so that each line of text is a resume. 
You will do the following:
1. using python's built-in "open" function, get a filehandle to the dataset (tip don't forget the file is utf8!)
2. load the data into an array of resumes (each text line is one resume) 
3. use the print function to print how many resumes were loaded
4. use the print function to output one of the resumes so we can see how the resumes look in raw text form 


In [3]:
## use the "open" function to get a filehandle. 
with open(dataset_path,encoding="utf8") as f:
    ## use the filehandle to read all lines into an array of text lines. 
    lines = f.readlines()
    ## print how many lines were loaded
    print(len(lines))
    ## now print one resume/line to see how the resumes look in raw text form
    print("Sample resume:", lines[0])
    #TODO print sample resume


701
Sample resume: {"content": "Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECH

#### Convert the dataset to json
As we can see, the resumes are not in a convenient human-readable form, but are json dictionaries. We want to work with the resumes as python dictionaries and not as raw text, so we will convert the resumes from text to dictionaries. We will do the following:
1. Import the json module
2. Loop through all of the text lines and use the json 'loads' function to convert the line to a python dictionary. Tip - you can use a 'for' loop, or if you want to get fancy, a python 'list comprehension' to accomplish this. 
3. Select one of the converted resumes so that we can examine its structure.   


In [4]:
## import json module to load json strings
import json
## using a for loop or a list comprehension, cycle through all lines (loaded above) and convert them to dictionaries
## using json loads function. Make sure all converted resumes are stored in the 'all_resumes' array below  
all_resumes = [json.loads(line) for line in lines]
## select one resume to examine from the all_resumes list
resume = all_resumes[1]
print(resume)

{'content': 'Alok Khandai\nOperational Analyst (SQL DBA) Engineer - UNISYS\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467\n\n❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,\nDevelopment & Support of MS SQL Servers in Production, Development environments &\nReplication and Cluster Server Environments.\n❖ Working Experience with relational database such as SQL.\n❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.\n❖ Experience in upgrading SQL Server.\n❖ Good experience with implementing DR solution, High Availability of database servers using\nDatabase mirroring and replications and Log Shipping.\n❖ Experience in implementing SQL Server security and Object permissions like maintaining\nDatabase authentication modes, creation of users, configuring permissions and assigning roles\nto users.\n❖ Experience in creating Jobs, Alerts, SQL Mail Agent\n❖ Experience in per

##### Explore the resume data structure
You should have one sample resume saved in the "resume" variable. Now we will examine the resume dictionary. Complete the code below to see the keys in the dictionary 

In [5]:
## explore keys in cv
print("keys and values in resume:")
## TODO print out the keys and values for the sample resume
for key, value in resume.items():
    print("key: {} value: {}".format(key, value))

keys and values in resume:
key: content value: Alok Khandai
Operational Analyst (SQL DBA) Engineer - UNISYS

Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467

❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,
Development & Support of MS SQL Servers in Production, Development environments &
Replication and Cluster Server Environments.
❖ Working Experience with relational database such as SQL.
❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.
❖ Experience in upgrading SQL Server.
❖ Good experience with implementing DR solution, High Availability of database servers using
Database mirroring and replications and Log Shipping.
❖ Experience in implementing SQL Server security and Object permissions like maintaining
Database authentication modes, creation of users, configuring permissions and assigning roles
to users.
❖ Experience in creating Jobs, Alerts, SQL Mail Agent
❖ 

##### Question: which key do you think points to the text content of the resume?
*Answer here*

content
##### Question: which key do you think points to the list of entity annotations? 
*Answer here*

annotation

Based on your answers above, see if you were right by printing the text content and the entity list by completing and executing the code below

In [6]:
## TODO print the resume text
print("resume content:")
print(resume["content"])
## TODO print the resume's list of entity annotations
print("resume entity list:")
print(resume["annotation"])

resume content:
Alok Khandai
Operational Analyst (SQL DBA) Engineer - UNISYS

Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467

❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,
Development & Support of MS SQL Servers in Production, Development environments &
Replication and Cluster Server Environments.
❖ Working Experience with relational database such as SQL.
❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.
❖ Experience in upgrading SQL Server.
❖ Good experience with implementing DR solution, High Availability of database servers using
Database mirroring and replications and Log Shipping.
❖ Experience in implementing SQL Server security and Object permissions like maintaining
Database authentication modes, creation of users, configuring permissions and assigning roles
to users.
❖ Experience in creating Jobs, Alerts, SQL Mail Agent
❖ Experience in performing integr

##### Explore the list of entity labels
The entity list is a list of dictionaries, we want to explore this list
1. Cycle through the entities in the list. You can use a 'for' loop for this
2. For each entity - which will be a dictionary - print out each key and each value for the key

In [7]:
## explore entity list
##TODO print out each key and each value for each entity in the entities list
for entity in resume["annotation"]:
    print("Entity")
    for key, value in entity.items():
        print("key: {} value: {}".format(key, value))
        
print(resume["annotation"][1]['points'])
print(type(resume["annotation"][1]['points']))

Entity
key: label value: ['Skills']
key: points value: [{'start': 8098, 'end': 8383, 'text': '❖ Operating Environment: […] Windows95/98/XP/NT\n❖ Database Tool: SQL Management Studio (MSSQL), Business\nDevelopment Studio, Visual studio 2005\n❖ Database Language: SQL, PL/SQL\n❖ Ticket Tracking Tool: Service Now\n❖ Reporting Tools: MS Reporting Services, SAS\n❖ Languages: C, C++, PL/SQL'}]
Entity
key: label value: ['Skills']
key: points value: [{'start': 8008, 'end': 8049, 'text': 'Database (3 years), SQL (3 years), Sql Dba'}]
Entity
key: label value: ['Graduation Year']
key: points value: [{'start': 7994, 'end': 7997, 'text': '2012'}]
Entity
key: label value: ['College Name']
key: points value: [{'start': 7955, 'end': 7991, 'text': 'Indira Gandhi Institute Of Technology'}]
Entity
key: label value: ['College Name']
key: points value: [{'start': 7904, 'end': 7952, 'text': 'B.Tech in Computer Science and Engineering in CSE'}]
Entity
key: label value: ['Companies worked at']
key: points valu

##### Question: What keys do the entity entries have? What is the datatype of the values of these keys?
*Answer here*

- label: String
- points: List

##### Question: What do these keys and values mean? (think of their significance as entity labels)
*Answer here*

- label: Classified heading for resume parts
- points: Position in content text and value of label

##### Convert  data to "spacy" offset format
Before we go any further, we need to convert the data into a slightly more compact format. This format is the format we will be using to train our first models in the next part of the challenge. Here we will do the following:
1. Use the provided data conversion function
2. Convert the data with that function, storing the results in a variable
3. Inspect the converted data

In [8]:
## data conversion method
def convert_data(data):
    """
    Creates NER training data in Spacy format from JSON dataset
    Outputs the Spacy training data which can be used for Spacy training.
    """
    text = data['content']
    entities = []
    if data['annotation'] is not None:
        for annotation in data['annotation']:
            # only a single point in text annotation.
            point = annotation['points'][0]
            labels = annotation['label']
            # handle both list of labels or a single label.
            if not isinstance(labels, list):
                labels = [labels]
            for label in labels:
                # dataturks indices are both inclusive [start, end] but spacy is not [start, end)
                entities.append((point['start'], point['end'] + 1, label))
    return (text, {"entities": entities})
   
## TODO using a loop or list comprehension, convert each resume in all_resumes using the convert function above, storing the result
converted_resumes = [convert_data(res) for res in all_resumes]
## TODO print the number of resumes in converted resumes 
print(len(converted_resumes))
print(converted_resumes[1])
print(type(converted_resumes[1]))

701
('Alok Khandai\nOperational Analyst (SQL DBA) Engineer - UNISYS\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467\n\n❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,\nDevelopment & Support of MS SQL Servers in Production, Development environments &\nReplication and Cluster Server Environments.\n❖ Working Experience with relational database such as SQL.\n❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.\n❖ Experience in upgrading SQL Server.\n❖ Good experience with implementing DR solution, High Availability of database servers using\nDatabase mirroring and replications and Log Shipping.\n❖ Experience in implementing SQL Server security and Object permissions like maintaining\nDatabase authentication modes, creation of users, configuring permissions and assigning roles\nto users.\n❖ Experience in creating Jobs, Alerts, SQL Mail Agent\n❖ Experience in performing

##### Question: how is the converted data different than the original data? How is it the same? 
*Answer here*

- Converted into a tuple
- All labels grouped together in 'entities' beneth the content part
- 'entities'/labels missing their key parts

##### filter out resumes without annotations
A few of the resumes have an empty entity list. We want to filter these resumes out of our data before continuing. We will do the following:
1. cycle through all resumes using for loop or list comprehension
2. for each resume, if the resume has no labled entities, ignore it. Otherwise save it to new resume list 

In [9]:
## TODO filter out resumes where resume entities list is None (you can do this in a one-line list comprehension)
## sove to converted_resumes variable
converted_resumes =  [res for res in converted_resumes if len(res[1]["entities"])>0]
## TODO print length of new filtered converted_resumes.  
print(len(converted_resumes))

690


##### Print all entities for one converted resume
The converted data also has an entity list. You should be able to examine this using similar techniques we have used above on the converted data. In the next code block you will write code that will print all of the entities for one resume. TIP each entity entry in the 'entities' list consists of a start index of the entity in the resume text, an end index, and the entity label. We will do the following:
1. Store one converted resume in the 'converted_resume' variable
2. Find the entity list in the converted_resume
3. Cycle through the entities, and - using the start and end index - print the label of the entity and the value of the entity. This will be the text substring pointed to by the start and end index

In [10]:
## store one resume in the variable
converted_resume = converted_resumes[1]
## find text content and store in variable
text = converted_resume[0]
## find the entities list and store in variable
entities_list = converted_resume[1]
## TODO for each entity, print the label, and the text (text content substring pointed to by start and end index)
for entity in entities_list["entities"]:
    print("label: {}, text: {}".format(entity[2], text[entity[0]:entity[1]]))

label: Skills, text: ❖ Operating Environment: […] Windows95/98/XP/NT
❖ Database Tool: SQL Management Studio (MSSQL), Business
Development Studio, Visual studio 2005
❖ Database Language: SQL, PL/SQL
❖ Ticket Tracking Tool: Service Now
❖ Reporting Tools: MS Reporting Services, SAS
❖ Languages: C, C++, PL/SQL
label: Skills, text: Database (3 years), SQL (3 years), Sql Dba
label: Graduation Year, text: 2012
label: College Name, text: Indira Gandhi Institute Of Technology
label: College Name, text: B.Tech in Computer Science and Engineering in CSE
label: Companies worked at, text: Microsoft Corporation
label: Location, text: Bengaluru
label: Companies worked at, text: HCL Technologies
label: Designation, text: SQL DBA Analyst
label: Designation, text: DBA Support Analyst
label: projects, text:  Finance Support
label: Companies worked at, text: Microsoft Corporation
label: Email Address, text: indeed.com/r/Alok-Khandai/5be849e443b8f467
label: Companies worked at, text: Microsoft Corporation


##### Question: What are some of the entity labels you see? Are there any entity values that seem surprising or particularly interesting? 
*Answer here*

- Skills, College Name, Companies worked at, Years of Experience
- First Skill content contains more informations then the rest. Includes all informations normaly splitted in different entity label/values

##### Collect unique labels of all entities in dataset
Now we are interested in finding out all of the (unique) entity labels which exist in our dataset. Complete and execute the code below to do this.

In [11]:
## collect names of all entities in complete resume dataset
all_labels = list()
for res in converted_resumes:
    ## entity list of res
    entity_list = res[1]["entities"][0]
    #print(entity_list[2])
    all_labels.append(entity_list[2])
    ## TODO extend all_labels with labels of entities 
    ##all_labels.           
## TODO all_labels is not yet unique. Make the list a set of unique values
# print(all_labels)
unique_labels = set(all_labels)
## Print unique entity labels
print("Entity labels: ",unique_labels)

Entity labels:  {'projects', 'Graduation Year', 'College Name', 'Can Relocate to', 'Location', 'Rewards and Achievements', 'Designation', 'Links', 'Companies worked at', 'state', 'links', 'Email Address', 'Name', 'UNKNOWN', 'College', 'Degree', 'Years of Experience', 'Skills'}


Now we see all entity labels in our dataset. Do some of them seem particularly interesting to you? 

Choose up to 3 Entities from the list that you would like to use for training a named entity recognition model. 
##### Question: which entities did you choose? 
*Answer here*

- Skills
- Years of Expericence
- Designation

##### Validate entities
Now we need to check that there is adequate training data for the entities you have chosen. 

In [12]:
## TODO store entity label names for the entities you want to work with in an array 
chosen_entity_label = ["Skills", "Years of Experience", "Designation", "Degree"]
## for each chosen entity label, count how many documents have a labeled entity for that label, and how many labeled entities total there are 
## for that entity
for chosen in chosen_entity_label:
    found_docs_with_entity = 0
    entity_count = 0
    for resume in converted_resumes:
        entity_list = resume[1]["entities"]
        _,_,labels = zip(*entity_list)
        if chosen in labels:
            found_docs_with_entity+=1
            entity_count+=len([l for l in labels if l == chosen])
    print("Docs with {}: {}".format(chosen,found_docs_with_entity))
    print("Total count of {}: {}".format(chosen,entity_count))

Docs with Skills: 536
Total count of Skills: 2152
Docs with Years of Experience: 217
Total count of Years of Experience: 623
Docs with Designation: 650
Total count of Designation: 2842
Docs with Degree: 606
Total count of Degree: 1012


#####  Question: Is adequate training data available for the entities you have chosen? (there should be at least several hundered examples total of each entity)
*Answer here*

For 'Skills' and 'Designation' the training data is good,
for 'Experience' the amount of data is just enough.
Because of that i switched to 'Degree' which has a good amount of training data

##### Save converted data for later use
We are almost done with the first part of the challenge! One more detail. We need to save the "converted_resumes" list so we can load it in the next notebook. We will do the following:
1. Store the location we want to save the data to in the 'converted_resumes_path' variable
2. Using python's 'open' function and the 'json' module's 'dump' function, save the data to disk. Make sure to create missing directories (if applicable) using python's "os.makedirs" function. Save the file with a ".json" file extension
3. Check the filesystem if the file exists and is complete

In [113]:
converted_resumes_path = "./data/converted_resumes.json"
##TODO save converted resumes to path using "open" and json's "dump" function. 
with open(converted_resumes_path, 'w') as outfile:
    json.dump(converted_resumes, outfile)

### Congratulations!
We are done with part one. Now we will go on to train our own NER Models with the dataset and the entities we have chosen. 

In [13]:
for ent in converted_resumes:
    print(ent[1])

{'entities': [(1155, 1199, 'Email Address'), (1143, 1240, 'Links'), (743, 1141, 'Skills'), (729, 733, 'Graduation Year'), (675, 703, 'College Name'), (631, 673, 'Degree'), (625, 630, 'Graduation Year'), (614, 623, 'College Name'), (606, 612, 'Degree'), (438, 454, 'Companies worked at'), (104, 148, 'Email Address'), (62, 68, 'Location'), (0, 14, 'Name')]}
{'entities': [(8098, 8384, 'Skills'), (8008, 8050, 'Skills'), (7994, 7998, 'Graduation Year'), (7955, 7992, 'College Name'), (7904, 7953, 'College Name'), (6199, 6220, 'Companies worked at'), (5016, 5025, 'Location'), (4996, 5012, 'Companies worked at'), (4969, 4984, 'Designation'), (4922, 4941, 'Designation'), (4836, 4852, 'projects'), (2519, 2540, 'Companies worked at'), (2391, 2433, 'Email Address'), (2339, 2360, 'Companies worked at'), (2318, 2337, 'Designation'), (1607, 1659, 'Skills'), (1577, 1586, 'Years of Experience'), (1522, 1531, 'Location'), (1512, 1518, 'Companies worked at'), (1472, 1510, 'Designation'), (1433, 1442, 'Loc