# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 2 - NER with Spacy
In this second part of the challenge, we will use the preprocessed data from part one to start training NER models. We will use spacy (https://spacy.io/) here to "get our feet wet" with NER, because a spacy model can reasonably be trained on a laptop and does not necessarily require a GPU. Spacy is a powerful, effective, and resource-efficient NLP library, and it might surprise us with its performance on the challenge!

We will run spacy's pretrained models on our data to get a feel for NER, and then we will perform some additional preprocessing on our data before we start training our own NER model using the labelled entities we have identified in part one. 
We will also explore evaluation metrics for NER, and decide how we want to quantify the performance of our trained models. 

* *If you need help setting up python or running this notebook, please get help from the assistants to the professor*
* *It might be helpful to try your code out first in a python IDE like PyCharm before copying it and running it here in this notebook*
* *For solving the programming tasks, use the python reference linked here (Help->Python Reference) as well as Web-searches.* 

##### Reload preprocessed data
Here, we will load the data we saved in part one and save it to a variable. Provide code below to load the data and store it as a list in a variable. (Hint - use 'open' and the json module)

In [1]:
## import json module
import json
path = "../dataset/converted_resumes.json"
## TODO open file and load as json
with open(path,encoding="utf8") as f:
    resumes = json.load(f)
## TODO print length of loaded resumes list to be sure everything ok
print("Loaded: {} resumes".format(len(resumes)))

Loaded: 690 resumes


##### Take Spacy for a spin
Before we train our own NER model to recognize the resume-specific entities we want to capture, let's see how spacy's pretrained NER models do on our data. These pretrained models can't recognize our entities yet, but let's see how they do. Run the next code block to load spacy's English language model 


In [2]:
import spacy
nlp = spacy.load('en')
print(nlp)

<spacy.lang.en.English object at 0x000001ACCBB86E80>


Now we get the EntityRecognizer in the loaded nlp pipeline and display the labels it supports

In [3]:
ner = nlp.get_pipe('ner')
labels = ner.labels
print(labels)

('PERSON', 'ORG', 'DATE', 'LOC', 'QUANTITY', 'FAC', 'EVENT', 'WORK_OF_ART', 'TIME', 'MONEY', 'NORP', 'GPE', 'LAW', 'LANGUAGE', 'PRODUCT', 'PERCENT', 'ORDINAL', 'CARDINAL')


##### Question: What do the 'GPE', 'FAC' and 'NORP' labels stand for? (Tip: use either the spacy.explain method, or google the spacy.io api docs) 
*Answer here*

In [4]:
### if you choose to use spacy's 'explain' method to get the answer to the question above, provide your code here
for label in labels:
    print("{}:  {}".format(label,spacy.explain(label)))

PERSON:  People, including fictional
ORG:  Companies, agencies, institutions, etc.
DATE:  Absolute or relative dates or periods
LOC:  Non-GPE locations, mountain ranges, bodies of water
QUANTITY:  Measurements, as of weight or distance
FAC:  Buildings, airports, highways, bridges, etc.
EVENT:  Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART:  Titles of books, songs, etc.
TIME:  Times smaller than a day
MONEY:  Monetary values, including unit
NORP:  Nationalities or religious or political groups
GPE:  Countries, cities, states
LAW:  Named documents made into laws.
LANGUAGE:  Any named language
PRODUCT:  Objects, vehicles, foods, etc. (not services)
PERCENT:  Percentage, including "%"
ORDINAL:  "first", "second", etc.
CARDINAL:  Numerals that do not fall under another type


As we can see, these entities differ from the entities we will train our custom model on. 
##### Question: what entities do you think this model will find in an example resume?
*Answer here*

Now we will work with one of our resumes, and get spacy to tell us what entities it recognizes. Complete the code block below to get a single resume text out of our resume list. 

In [5]:
### TODO get a single resume text and print it out
restxt = resumes[42][0]
print("\n".join(restxt.split('\n\n')))

Navas Koya
Test Engineer
Mangalore, Karnataka - Email me on Indeed: indeed.com/r/Navas-Koya/23c1e4e94779b465
Willing to relocate to: Mangalore, Karnataka - Bangalore, Karnataka - Chennai, Tamil Nadu
WORK EXPERIENCE
System Engineer
Infosys -
August 2014 to Present
.NET application Maintenance and do the code changes if required
Test Engineer
Infosys -
June 2015 to February 2016
PrProject 2:
Title: RBS W&G Proving testing.
Technology: Manual testing
Role: Software Test Engineer
Domain: Banking
Description:
Write test cases & descriptions. Review the entries. Upload and map the documents into
HP QC. Execute the testing operations in TPROD mainframe. Upload the result in QC along with
the proof.
Roles and Responsibilities:
•Prepared the Test Scenarios
•Prepared and Executed Test Cases
•Performed functional, Regression testing, Sanity testing.
•Reviewed the Test Reports and Preparing Test Summary Report.
•Upload Test cases to the QC.
•Execute in TPROD Mainframe.
•Defect Track and Report.
Te

Extracting entities with spacy is easy with a pretrained model. We simply call the model (here 'nlp') with our text to get a spacy Document. See https://spacy.io/api/doc for more detail. 

Execute the code below to process the resume txt.

In [6]:
doc = nlp(restxt)

The doc object exposes the entities predicted by spacy through its 'ents' attribute. We would like to loop through all of these entities and print their label and associated text to see what spacy predicted for this resume.

Complete the code below to do this. You will probably need to google the spacy api docs to find the solution (Tip: look for 'Doc.ents'). Also, trying code in your IDE (for example PyCharm) before copying it here might help with exploring and debugging to find the solution. 

In [7]:
##TODO loop through the doc's entities, and print the label and text for each entity found. 
for ent in doc.ents:
    print("{:10} {}".format(ent.label_,ent))

PERSON     Navas Koya
PERSON     Test Engineer
PERSON     Karnataka - Email
PERSON     Karnataka - Bangalore
ORG        Karnataka - Chennai
DATE       August 2014
PERSON     Maintenance
PERSON     Test Engineer
DATE       June 2015 to February 2016
ORG        RBS W&G Proving
ORG        HP QC
ORG        TPROD
PERSON     Responsibilities
ORG        Executed Test Cases
ORG        the Test Reports
ORG        Preparing Test Summary Report
ORG        Track and Report
PERSON     Infosys Limited -


DATE       August 2014
DATE       May 2015
CARDINAL   1
PERSON     CAWP
PERSON     Role
ORG        Software Test Executive
ORG        Admin
DATE       annual
ORG        Business Requirement
ORG        Functional
PERSON     Responsibilities
ORG        the Test Scenarios

ORG        Executed Test Cases
ORG        the Test Reports
ORG        Preparing Test Summary Report
ORG        Track and Report
ORG        Computer Applications

Mangalore University
DATE       June 2011
DATE       April 2014
ORG   

##### Questions: What is your first impression of spacy's NER based on the results above? Does it seem accurate/powerful?
##### Does it make many mistakes? Do some entity types seem more accurate than others? How might one go about measuring the accuracy of such an NER system?
*Answers here*

Now as a comparison, we will list the entities contained in the resume's original annotated training data (remember, the existing annotations were created by a human-annotator, and not predicted by a machine like the entities predicted above) 

Complete the code below to do the following: 
* Access the 'entities' list of the example resume you chose, loop through the entities and print them out. 
* *Tip: one entity in the list is a tuple with the following structure: (12,1222,"label"), where the first element is the start index of the entity in the resume text, the second element is the end index, and the third element is the label.*
* Use this tip to print out a formatted list of entities 



In [8]:
##TODO access entities
res = resumes[42]
restext = res[0] 
labeled_ents = res[1]['entities']
## TODO print out formatted list of entities
for ent in labeled_ents:
    enttext = restext[ent[0]:ent[1]]
    enttext = " ".join(enttext.split())
    print("{:20} {}".format(ent[2],enttext))

Skills               SKILL SET • ASP.NET, C# • QA tools • Coding and modularization • Excellent communication skills • VB, VB.net, ASP • Technical specifications creation • HTML • System backups • Sql server 2005, Oracle • System upgrades • Java/C/C++ • Excellent problem-solving abilities Navas Najeer Koya 3
Location             Mangalore
Skills               C# (Less than 1 year), .NET, SQL Server, Css, Html5
Graduation Year      2014
Location             Mangalore
Location             Mangalore
Degree               Bachelor of Computer Application
Graduation Year      2014
Companies worked at  Infosys
Designation          Test Engineer
Companies worked at  Infosys
Designation          Test Engineer
Graduation Year      2014
Companies worked at  Infosys
Designation          System Engineer
Location             Mangalore
Location             Mangalore
Designation          Test Engineer
Name                 Navas Koya


As we already know, the annotated entities in the training data are different from the entities spacy can recognize with its pretrained models, so we need to train a custom NER model. We will get started with that now. 

##### Prepare Training Data for NER model training
We need to do some more preprocessing of our training data before we can train our model.

Remember the entity labels you chose in part 1 of the challenge? We will be training a model to predict those entities.
As a first step, we will gather all resumes that contain at least one training annotation for each of those entities.

Complete and execute the code below to gather your training data. 

In [9]:
##TODO Store the entity labels you want to train for as array in chosen_entity_labels
chosen_entity_labels = ["Companies worked at","Degree"]

## this method gathers all resumes which contain all of the chosen entities above.
def gather_candidates(dataset,entity_labels):
    candidates = list()
    for resume in dataset:
        ## collect the labels of this resume (an empty set if it has no annotations)
        res_ent_labels = {ent[2] for ent in resume[1]["entities"]}
        if set(entity_labels).issubset(res_ent_labels):
            candidates.append(resume)
    return candidates
## TODO use the gather candidates methods and store result in training_data variable
training_data = gather_candidates(resumes,chosen_entity_labels)
print("Gathered {} training examples".format(len(training_data)))

Gathered 560 training examples


Now we have those training examples which contain the entities we are interested in. Do you have at least a few hundred examples? If not, you might need to re-think the entities you chose or try just one or two of them and re-run the notebooks. It is important that we have several hundred examples for training (e.g. more than 200; 300-500 is better). 
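If you are unsure which entity labels give you enough examples, it can help to count how many resumes contain each label before choosing. The following is a minimal sketch of such a count; the helper name `count_resumes_per_label` and the toy data are made up for illustration and only assume the `[text, {"entities": [...]}]` layout used in this notebook.

```python
from collections import Counter

def count_resumes_per_label(dataset):
    """Count, for each entity label, how many resumes contain it at least once."""
    counts = Counter()
    for text, annotations in dataset:
        # use a set so a label is counted once per resume, however often it occurs
        labels_in_resume = {ent[2] for ent in annotations["entities"]}
        counts.update(labels_in_resume)
    return counts

# toy data in the same layout as our resumes (hypothetical, for illustration only)
toy_resumes = [
    ["Jane Doe, BSc, Acme", {"entities": [(10, 13, "Degree"), (15, 19, "Companies worked at")]}],
    ["John Roe, Acme, Acme", {"entities": [(10, 14, "Companies worked at"), (16, 20, "Companies worked at")]}],
    ["No annotations here", {"entities": []}],
]

label_counts = count_resumes_per_label(toy_resumes)
for label, n in label_counts.most_common():
    print("{:20} {}".format(label, n))
```

Calling `count_resumes_per_label(resumes)` on the real data would show which labels are frequent enough to train on.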

##### Remove other entity annotations from training data
Now that we have our training data, we want to remove all but relevant (chosen) entity annotations from this data, so that the model we train will only train for our entities. Complete and execute the code below to do this. 

In [10]:
## keep only the annotations whose label is in the list of labels to keep
def filter_ents(ents, keep_labels):
    filtered = [ent for ent in ents if ent[2] in keep_labels]
    return filtered

## now remove all but relevant (chosen) entity annotations and store in X variable 
X = [[dat[0], dict(entities=filter_ents(dat[1]['entities'], chosen_entity_labels))] for dat in training_data]


##### Remove resumes that cause errors in spacy
Depending on what entities you chose, some of the resumes might cause errors in spacy. We don't need to get into details as to why, suffice to say it has to do with whitespace and syntax in the entity annotations. If these resumes are not removed from our training data, spacy will throw an exception during training, so we need to remove them first. 

We will use the remove_bad_data function below to do this. This function does the following:
* calls train_spacy_ner with debug=True and n_iter=1. This causes spacy to process the documents one-by-one and collect the documents that throw an exception as "bad docs", which it returns. 
* You will complete the function to remove any baddocs (returned by train_spacy_ner) from your training data list. 

You may or may not have any bad documents depending on the entities you chose. In any case, there should not be more than a dozen or so bad docs.  

In [11]:
from spacy_train_resume_ner import train_spacy_ner

def remove_bad_data(training_data):
    model, baddocs = train_spacy_ner(training_data, debug=True, n_iter=1)
    ## training data is list of lists with each list containing a text and annotations
    ## baddocs is a set of strings/resume texts.
    ## TODO complete implementation to filter bad docs and store filter result (good docs) in filtered variable
    filtered = [data for data in training_data if data[0] not in baddocs]
    print("Unfiltered training data size: ",len(training_data))
    print("Filtered training data size: ", len(filtered))
    print("Bad data size: ", len(baddocs))
    return filtered

## call remove method. It may take a few minutes for the method to complete.
## you will know it is complete when the output of the print statements above appears.
X = remove_bad_data(X)

Created blank 'en' model
Losses {'ner': 46533.00248361219}
Unfiltered training data size:  560
Filtered training data size:  560
Bad data size:  0
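
To give a rough idea of what a whitespace problem in an annotation can look like, here is a small sketch that flags entity spans starting or ending with a whitespace character. This is only an assumption about one likely cause; it is not taken from `train_spacy_ner`, and the helper name `find_whitespace_spans` and the toy text are hypothetical.

```python
def find_whitespace_spans(text, entities):
    """Return the entity spans whose first or last character is whitespace."""
    suspicious = []
    for start, end, label in entities:
        span = text[start:end]
        # an empty span, or one that begins/ends with whitespace, is flagged
        if not span or span[0].isspace() or span[-1].isspace():
            suspicious.append((start, end, label))
    return suspicious

toy_text = "Worked at Infosys \nsince 2014"
toy_entities = [
    (10, 17, "Companies worked at"),  # "Infosys"    -> clean span
    (10, 19, "Companies worked at"),  # "Infosys \n" -> trailing whitespace
]
print(find_whitespace_spans(toy_text, toy_entities))
```

A check like this could be run on a resume's `entities` list to inspect suspicious annotations by hand before deciding whether to drop the document.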


##### Question: How many bad docs did you have? What is the size of your new (filtered) training data? 
*Answer here*

##### Train/Test Split
Now before we train our model, we have to split our available training data into training and test sets. Splitting our data into train and test (or holdout) datasets is a fundamental technique in machine learning, and essential to avoid the problem of overfitting.
Before we go on, you should get a grasp of how train/test split helps us avoid overfitting. Please take the time now to do a quick web search on the topic. There are many resources available. You should search for "train test validation overfitting" or some subset of those terms.

Here are a few articles to start with:
* https://machinelearningmastery.com/a-simple-intuition-for-overfitting/
* https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation (Note: you are free to install scikit learn and use the train_test_split method documented here, but it is not necessary. It is the concept that is important)

##### Question: What is overfitting and how does doing a train/test split help us avoid overfitting when training our models? Please answer in your own words. 
*Answer here*

Now that we understand why we do a train/test split, we will write some code that splits our data into train and test sets. Usually we want around 70-80% of the data for train, and the rest for test. 
##### TODO: Complete the code below to create a train and test dataset

In [12]:
##TODO complete the implementation  of the train test split function below
def train_test_split(X,train_percent):
    train_size = int(len(X)*train_percent)
    train = X[:train_size]
    test = X[train_size:]
    assert len(train)+len(test)==len(X)
    return train,test
## do train test split
train,test = train_test_split(X,0.8)
## TODO print the size of train and test sets. Do they add up to length of X? 
print("Train size: ",len(train))
print("Test size: ",len(test))
print("X size: ",len(X))


Train size:  448
Test size:  112
X size:  560
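
The split above takes the first 80% of X in its original order. If the resumes were stored in some systematic order, the test set might not be representative. A possible variant, sketched here under that assumption, shuffles a copy of the data with a fixed seed before splitting. The function name `shuffled_train_test_split` and the seed value are illustrative choices, not part of the challenge code.

```python
import random

def shuffled_train_test_split(data, train_percent, seed=42):
    """Shuffle a copy of the data, then split it into train and test parts."""
    shuffled = list(data)                  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    train_size = int(len(shuffled) * train_percent)
    return shuffled[:train_size], shuffled[train_size:]

toy_data = list(range(10))
toy_train, toy_test = shuffled_train_test_split(toy_data, 0.8)
print(len(toy_train), len(toy_test))
```

Note that switching to a shuffled split would change which resumes end up in `test`, so the outputs of the following cells would differ from the ones shown here.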


##### Train a spacy ner model with our training data
OK, now it is (finally) time to train our own custom NER model using spacy. Because our training data has been preprocessed to only include annotations for the entities we are interested in, the model will only be able to predict/extract those entities. 
*Depending on your computer, this step may take a while.* Training 20 epochs (iterations) using 480 training examples takes around 10 minutes on my machine (core i7 CPU). You will see output like *Losses {'ner':2342.23342342}* after each epoch/iteration. The default number of iterations is 20, so you will see this output 20 times. When this step is done, we will use the trained ner model to perform predictions on our test data in our test set.  

In [13]:
## run this code to train a ner model using spacy
custom_nlp,_= train_spacy_ner(train)

Created blank 'en' model
Losses {'ner': 25392.5586021567}
Losses {'ner': 36393.155208915356}
Losses {'ner': 35491.96489308332}
Losses {'ner': 36124.76872248634}
Losses {'ner': 41391.83340095449}
Losses {'ner': 32358.108264012262}
Losses {'ner': 49447.94876056269}
Losses {'ner': 22969.220427729888}
Losses {'ner': 20394.279494144535}
Losses {'ner': 9964.328544420878}
Losses {'ner': 6869.884511875207}
Losses {'ner': 5060.411938441683}
Losses {'ner': 4725.259286201559}
Losses {'ner': 4381.687212255663}
Losses {'ner': 4325.9589459544595}
Losses {'ner': 3576.2664387296454}
Losses {'ner': 3266.4662465489046}
Losses {'ner': 3357.245543475568}
Losses {'ner': 2972.806551699933}
Losses {'ner': 2803.824792371854}


##### Inspect NER predictions on one sample resume
Now that we have a trained model, let's see how it works on one of our resumes. 

In [14]:
## TODO fetch one resume out of our test dataset and store to the "resume" variable
resume = test[73]
## TODO create a spacy doc out of the resume using our trained model and save to the "doc" variable 
doc = custom_nlp(resume[0])

Now we will output the predicted entities and the existing annotated entities in that doc

In [15]:
## TODO output predicted entities (in "ents" variable of the spacy doc created above)
print("PREDICTED:")
for ent in doc.ents:
    print("{:20} {}".format(ent.label_,ent))
print()
## TODO output labeled entities (in "entities" dictionary of resume)
print("LABELED:")
for ent in resume[1]["entities"]:
    print("{:20} {}".format(ent[2],resume[0][ent[0]:ent[1]]))    


PREDICTED:
Companies worked at  Dhruva hiring services, Krishna Sales Corporation, Everest Enterprises.
Companies worked at  1)Shri Krishna Sales Corporation
Degree               BCom in Commerce
Degree               BCom

LABELED:
Degree               BCom
Degree               BCom in Commerce
Degree               BCom
Companies worked at  Dhruva hiring services, Krishna Sales Corporation, Everest Enterprises.


#### Evaluation Metrics for NER
Now that we can predict entities using our trained model, we can compare our predictions with the original annotations in our training data to evaluate how well our model performs for our task. The original annotations have been annotated manually by human annotators, and represent a "Gold Standard" against which we can compare our predictions. 

For a simple classification task, the following evaluation metrics are usually used:
* accuracy
* precision
* recall
* f1 score
* ROC/AUC

We will be most interested in *accuracy, precision and recall.* In order to understand these metrics, we need to understand the following concepts:
* True positives - How many of the predicted entities are "true" according to the Gold Standard? (training annotation) 
* True negatives - How many entities did the model not predict which are actually not entities according to the Gold Standard?
* False positives - How many entities did the model predict which are NOT entities according to the Gold Standard?  
* False negatives - How many entities did the model "miss" - i.e. did not recognize as entities which are entities according to the Gold Standard? 
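
As a quick illustration of how these four counts turn into metrics, the sketch below computes accuracy, precision and recall from them. The counts are hypothetical numbers chosen only to show the arithmetic, and the function name `classification_metrics` is made up for this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision and recall from the four counts."""
    total = tp + tn + fp + fn
    # each ratio falls back to 0.0 when its denominator would be zero
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# hypothetical counts, chosen only to illustrate the arithmetic
acc, prec, rec = classification_metrics(tp=8, tn=80, fp=2, fn=10)
print("accuracy={:.2f} precision={:.2f} recall={:.3f}".format(acc, prec, rec))
```

With many true negatives, accuracy can look high even when recall is poor, which is one reason precision and recall are reported alongside it.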

Before we go on, it is important that you understand true/false positives/negatives as well as accuracy, precision and recall. Take some time now to research the web in order to find answers to the following questions:

##### Question: How are "accuracy", "precision", and "recall" defined in the context of evaluating Machine Learning models? How do they relate to True/False Positives/Negatives above? Please provide an intuitive description as well as the mathematical formula for each metric. 
*Answers here*

##### Task: Based on the output in the previous cell (based on one single example), please calculate for each entity of interest  a precision and recall score for our model. 
Note - this should be a relatively simple calculation - you can even do this in your head. Keep in mind this is only based on one single example. Please describe your reasoning and calculations below: 
*Answer here*

##### Question: Did you have any problems or confusion calculating precision and recall? If so, what? 
*Answer here*

##### Calculating Metrics based on token-level annotations or full entity-level. 
The concepts above are our first step toward understanding how to evaluate our model effectively. However, in NER, we need to take into account that we can calculate our metrics either based on tokens (words) or on full entity level. 
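
To make the entity-level option concrete, here is a minimal sketch that counts a prediction as correct only if its start, end and label all match a gold-standard annotation exactly. Exact matching is one possible convention among several (partial overlap could also be credited), and the function name `entity_level_counts` and the toy offsets are hypothetical.

```python
def entity_level_counts(predicted, gold):
    """Count exact-match true positives, false positives and false negatives."""
    # convert to sets of tuples so duplicates collapse and set operations work
    predicted_set = {tuple(ent) for ent in predicted}
    gold_set = {tuple(ent) for ent in gold}
    tp = len(predicted_set & gold_set)  # predicted and in the gold standard
    fp = len(predicted_set - gold_set)  # predicted but not in the gold standard
    fn = len(gold_set - predicted_set)  # in the gold standard but not predicted
    return tp, fp, fn

toy_gold = [(0, 7, "Companies worked at"), (20, 23, "Degree"), (40, 56, "Degree")]
toy_pred = [(0, 7, "Companies worked at"), (20, 23, "Degree"), (60, 70, "Degree")]
tp, fp, fn = entity_level_counts(toy_pred, toy_gold)
print("tp: {} fp: {} fn: {}".format(tp, fp, fn))
```

For a spacy doc, the predicted side could be built from `(ent.start_char, ent.end_char, ent.label_)` for each entity in `doc.ents`.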

##### Token-Level evaluation. 
Token-level evaluation measures how accurately the model tagged *each individual word/token* in the input. In order to understand this, we need to understand something called the "BILUO" Scheme (or BILOU or BIO). The spacy docs have a good reference. Please read and familiarize yourself with BILUO. 

https://spacy.io/api/annotation#biluo

Up to now, we have not been working with the BILUO scheme, but with "offsets" (for example: (112,150,"Email") - which says there is an "Email" entity between positions 112 and 150 in the text). We would like to be able to evaluate our models on a token-level using BILUO - so we need to convert our data to BILUO. Fortunately, Spacy provides a helper method to do this for us.
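
To show the idea behind the conversion, the sketch below turns character offsets into BILUO tags for a text split on whitespace. It is a simplified stand-in for illustration only: spacy's helper works on spacy's own tokenization and marks misaligned tokens with "-", which this sketch does not attempt. The function name `simple_biluo_tags` and the toy sentence are hypothetical.

```python
import re

def simple_biluo_tags(text, entities):
    """Tag whitespace-separated tokens with BILUO labels from character offsets."""
    # each token is stored with its start and end character position
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # indices of the tokens that lie completely inside the entity span
        inside = [i for i, (_, t_start, t_end) in enumerate(tokens)
                  if t_start >= start and t_end <= end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label       # Unit: single-token entity
        else:
            tags[inside[0]] = "B-" + label       # Begin
            for i in inside[1:-1]:
                tags[i] = "I-" + label           # In
            tags[inside[-1]] = "L-" + label      # Last
    return [tok for tok, _, _ in tokens], tags

toy_text = "Diploma in Computer Application from Infosys"
toy_entities = [(0, 31, "Degree"), (37, 44, "Companies worked at")]
toy_tokens, toy_tags = simple_biluo_tags(toy_text, toy_entities)
for tok, tag in zip(toy_tokens, toy_tags):
    print("{:12} {}".format(tok, tag))
```

Every token outside an entity keeps the tag "O", which is why most rows in a resume's tag table are "O".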

*Execute the code below to see how our "Gold Standard" and predictions for our example doc above look in BILUO scheme.* 
Note: some of the lines might be omitted for display purposes. 

In [37]:
from spacy.gold import biluo_tags_from_offsets
import pandas as pd
from IPython.display import display, HTML

## returns a pandas dataframe with tokens, prediction, and true (Gold Standard) annotations of tokens
def make_bilou_df(nlp,resume):
    doc = nlp(resume[0])
    bilou_ents_predicted = biluo_tags_from_offsets(doc, [(ent.start_char,ent.end_char,ent.label_)for ent in doc.ents])
    bilou_ents_true = biluo_tags_from_offsets(doc,
                                                   [(ent[0], ent[1], ent[2]) for ent in resume[1]["entities"]])

    
    doc_tokens = [tok for tok in doc]
    bilou_df = pd.DataFrame()
    bilou_df["Tokens"] =doc_tokens
    bilou_df["Predicted"] = bilou_ents_predicted
    bilou_df["True"] = bilou_ents_true
    return bilou_df

bilou_df = make_bilou_df(custom_nlp,test[110])
display(bilou_df)  


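| | Tokens | Predicted | True |
|---|---|---|---|
| 0 | Zaheer | O | O |
| 1 | Uddin | O | O |
| 2 | \n | O | O |
| 3 | Technical | O | O |
| 4 | Project | O | O |
| 5 | Manager | O | O |
| 6 | \n\n | O | O |
| 7 | Hyderabad | O | O |
| 8 | , | O | O |
| 9 | Telangana | O | O |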


Based on this output, it should be very easy to calculate a token-level accuracy. We simply compare the "Predicted" to "True" columns and calculate what percentage are the same. 

In [38]:
## using pandas Dataframe here to get subset where predicted and true are the same. 
same_df = bilou_df[bilou_df["Predicted"]==bilou_df["True"]]
## accuracy is len of same_df/len of bilou_df
accuracy = float(same_df.shape[0])/bilou_df.shape[0]
print("Accuracy on one resume: ",accuracy)


Accuracy on one resume:  0.9888888888888889


The accuracy might be pretty good... if it is not 100%, then let's print out those tokens where the model predicted something different than the gold standard by running the code below. 

Note - if your score on one doc is 100%, pick another document and re-run the last few cells above. 

In [39]:
## diff_df contains all rows where predicted not equal Gold Standard. 
diff_df = bilou_df[bilou_df["Predicted"]!=bilou_df["True"]]
display(diff_df)

| | Tokens | Predicted | True |
|---|---|---|---|
| 264 | Manager | O | - |
| 265 | \n\n | O | - |
| 266 | Microsoft | U-Companies worked at | - |
| 870 | \n\n | O | - |
| 871 | EDUCATION | O | - |
| 873 | BSc | U-Degree | O |
| 892 | Diploma | B-Degree | O |
| 893 | in | I-Degree | O |
| 894 | Computer | I-Degree | O |
| 895 | Application | L-Degree | O |


How is the accuracy on one document? Now we need to calculate the accuracy on all our test resumes and average them for an accuracy score. 

Please complete the code below to report an accuracy score on our test resumes

In [41]:
import numpy as np
doc_accuracy = []
for tres in test:
    tres_df = make_bilou_df(custom_nlp,tres)
    ## compare predicted and true tags within this resume's own dataframe (tres_df)
    same_df = tres_df[tres_df["Predicted"]==tres_df["True"]]
    ## accuracy is len of same_df/len of tres_df
    accuracy = float(same_df.shape[0])/tres_df.shape[0]
    doc_accuracy.append(accuracy)

## average the per-resume accuracies over the whole test set
total_acc = np.mean(doc_accuracy)
print("Accuracy: ",total_acc)


Now, using what you know about precision and recall, supply code below which calculates precision and recall.

*Note - you will remember that:*

Precision = 𝑡𝑝/(𝑡𝑝+𝑓𝑝)

Recall = 𝑡𝑝/(𝑡𝑝+𝑓𝑛) 

In [51]:
true_positives = 0
false_positives = 0
false_negatives = 0
for tres in test:
    tres_df = make_bilou_df(custom_nlp,tres)
    ## true positive: the predicted tag matches the gold tag and is an entity tag (not "O")
    tp = tres_df[(tres_df["Predicted"]==tres_df["True"]) & (tres_df["Predicted"]!="O")]
    ## false positive: an entity tag was predicted that does not match the gold tag
    fp = tres_df[(tres_df["Predicted"]!=tres_df["True"]) & (tres_df["Predicted"]!="O")]
    ## false negative: "O" was predicted where the gold standard has a different tag
    fn = tres_df[(tres_df["Predicted"]!=tres_df["True"]) & (tres_df["Predicted"]=="O")]
    true_positives += tp.shape[0]
    false_positives += fp.shape[0]
    false_negatives += fn.shape[0]

print("tp: {} fp: {} fn: {}".format(true_positives,false_positives,false_negatives))
precision = float(true_positives)/(true_positives+false_positives)
recall = float(true_positives)/(true_positives+false_negatives)
print("Precision: ",precision)
print("Recall: ",recall)


##### Question: how does the model perform on token-level accuracy? What did it miss? In those cases where the predictions didn't match the gold standard, were the predictions plausible or just "spurious" (wrong)? 
*Answer here* 
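
To see which entity types the model struggles with, the token-level counts can also be broken down per label. The sketch below is one possible way to do this on two plain lists of BILUO tags. It is an illustration under stated assumptions: a mismatching gold entity tag is counted as a false negative for its label (slightly stricter than counting only predicted "O" tags), misaligned "-" tags are ignored, and the names `label_of` and `per_label_token_scores` are made up for this example.

```python
from collections import defaultdict

def label_of(tag):
    """Strip the BILUO prefix: 'B-Degree' -> 'Degree'."""
    return tag.split("-", 1)[1]

def per_label_token_scores(predicted_tags, true_tags):
    """Return {label: (precision, recall)} computed on token level."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, true in zip(predicted_tags, true_tags):
        if pred == true:
            if pred not in ("O", "-"):
                counts[label_of(pred)]["tp"] += 1
            continue
        if pred not in ("O", "-"):
            counts[label_of(pred)]["fp"] += 1  # predicted entity tag is wrong
        if true not in ("O", "-"):
            counts[label_of(true)]["fn"] += 1  # gold entity tag was missed
    scores = {}
    for label, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
        scores[label] = (precision, recall)
    return scores

# hypothetical tag sequences, for illustration only
toy_pred = ["B-Degree", "I-Degree", "L-Degree", "O", "U-Companies worked at", "O"]
toy_true = ["B-Degree", "I-Degree", "L-Degree", "O", "O", "U-Degree"]
toy_scores = per_label_token_scores(toy_pred, toy_true)
for label, (p, r) in toy_scores.items():
    print("{:20} precision={:.2f} recall={:.2f}".format(label, p, r))
```

The "Predicted" and "True" columns of a dataframe returned by `make_bilou_df` could be passed in as lists to get the same breakdown for a real resume.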

We are almost done with Part II! We just need to save our BILUO training data for reuse in Part III. Complete the code below to do this. 