# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 2 - NER with Spacy
In this second part of the challenge, we will use the preprocessed data from part one to start training NER models. We will use spacy (https://spacy.io/) here to "get our feet wet" with NER, since training spacy models can reasonably be done on our laptops and does not necessarily require a GPU. Spacy is a powerful, effective, and resource-efficient NLP library. It might surprise us with its performance on the challenge!

We will run spacy's pretrained models on our data to get a feel for NER, and then we will perform some additional preprocessing on our data before we start training our own NER model using the labelled entities we have identified in part one. 
We will also explore evaluation metrics for NER, and decide how we want to quantify the performance of our trained models. 

* *If you need help setting up Python or running this notebook, please ask the professor's assistants.*
* *It might be helpful to try your code out first in a Python IDE like PyCharm before copying it and running it here in this notebook.*
* *To solve the programming tasks, use the Python reference linked here (Help->Python Reference) as well as web searches.* 

##### Reload preprocessed data
Here, we will load the data we saved in part one and save it to a variable. Provide code below to load the data and store it as a list in a variable. (Hint - use 'open' and the json module)

In [138]:
import numpy as np
## import json module
import json
path = "./dataset/converted_resumes.json"
## TODO open file load as json and store in "resumes" variable
with open(path,'r') as f:
    resumes = json.load(f)
## TODO print length of loaded resumes list to be sure everything ok
resumes = np.array(resumes)
print(len(resumes))
print(len(resumes[0]))
print(resumes[0, 0])
print(resumes[0, 1])

366
2
Afreen Jamadar
Active member of IIIT Committee in Third year

Sangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6

I wish to use my knowledge, skills and conceptual understanding to create excellent team
environments and work consistently achieving organization objectives believes in taking initiative
and work to excellence in my work.

WORK EXPERIENCE

Active member of IIIT Committee in Third year

Cisco Networking -  Kanpur, Uttar Pradesh

organized by Techkriti IIT Kanpur and Azure Skynet.
PERSONALLITY TRAITS:
• Quick learning ability
• hard working

EDUCATION

PG-DAC

CDAC ACTS

2017

Bachelor of Engg in Information Technology

Shivaji University Kolhapur -  Kolhapur, Maharashtra

2016

SKILLS

Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT
ACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)

ADDITIONAL INFORMATION

TECHNICAL SKILLS:

• Programming Languages: C, C++, Java, .net, php.
• 

##### Take Spacy for a spin
Before we train our own NER model to recognize the resume-specific entities we want to capture, let's see how spacy's pretrained NER models do on our data. These pretrained models can't recognize our entities yet, but let's see how they do. Run the next code block to load spacy's English language model 


In [139]:
import spacy
nlp = spacy.load('en')
print(nlp)

<spacy.lang.en.English object at 0x000002421317A4E0>


Now we get the EntityRecognizer in the loaded nlp pipeline and display the labels it supports

In [140]:
ner = nlp.get_pipe('ner')
labels = ner.labels
print(len(labels))
print(labels)

18
('ORDINAL', 'LANGUAGE', 'QUANTITY', 'PRODUCT', 'ORG', 'GPE', 'CARDINAL', 'TIME', 'LAW', 'NORP', 'MONEY', 'FAC', 'EVENT', 'PERCENT', 'LOC', 'DATE', 'PERSON', 'WORK_OF_ART')


##### Question: What do the 'GPE', 'FAC' and 'NORP' labels stand for? (Tip: use either the spacy.explain method, or search the spacy.io API docs) 
__GPE__: Countries, cities, states.  
__FAC__: Buildings, airports, highways, bridges, etc.  
__NORP__: Nationalities or religious or political groups.

As we can see, these entities differ from the entities we will train our custom model on. 
##### Question: what entities do you think this model will find in an example resume?
| NER label | Our labels |
|---|---|
| FAC | Companies worked at, College |
| ORG | Companies worked at, College |
| GPE | state, Location |
| DATE | Graduation Year |
| QUANTITY | Years of Experience |

Now we will work with one of our resumes, and get spacy to tell us what entities it recognizes. Complete the code block below to get a single resume text out of our resume list. 

In [141]:
### TODO get a single resume text and print it out
restxt = resumes[0,0]
## print it out, collapsing the blank lines between paragraphs
print("\n".join(restxt.split('\n\n')))

Afreen Jamadar
Active member of IIIT Committee in Third year
Sangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6
I wish to use my knowledge, skills and conceptual understanding to create excellent team
environments and work consistently achieving organization objectives believes in taking initiative
and work to excellence in my work.
WORK EXPERIENCE
Active member of IIIT Committee in Third year
Cisco Networking -  Kanpur, Uttar Pradesh
organized by Techkriti IIT Kanpur and Azure Skynet.
PERSONALLITY TRAITS:
• Quick learning ability
• hard working
EDUCATION
PG-DAC
CDAC ACTS
2017
Bachelor of Engg in Information Technology
Shivaji University Kolhapur -  Kolhapur, Maharashtra
2016
SKILLS
Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT
ACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)
ADDITIONAL INFORMATION
TECHNICAL SKILLS:
• Programming Languages: C, C++, Java, .net, php.
• Web Designing: HTML, XML

Extracting entities with spacy is easy with a pretrained model. We simply call the model (here 'nlp') with our text to get a spacy Document. See https://spacy.io/api/doc for more detail. 

Execute the code below to process the resume txt.

In [142]:
doc = nlp(restxt)

The doc object exposes the entities predicted by spacy in its 'ents' attribute. We would like to loop through all of these entities and print their label and associated text to see what spacy predicted for this resume.

Complete the code below to do this. You will probably need to search the spacy API docs to find the solution (Tip: look for 'Doc.ents'). Also, trying code in your IDE (for example PyCharm) before copying it here might help with exploring and debugging to find the solution. 

In [143]:
##TODO loop through the doc's entities, and print the label and text for each entity found. 
for ent in doc.ents:
    print('{:<20}'.format(ent.label_), '{:<20}'.format(ent.text))

GPE                  Jamadar             
ORG                  IIIT Committee      
DATE                 Third year          
ORG                  Maharashtra         
ORG                  IIIT Committee      
DATE                 Third year          
PERSON               Cisco Networking -  Kanpur
ORG                  Uttar Pradesh       
ORG                  Techkriti IIT Kanpur
PERSON               Azure Skynet        
PERSON               Quick               
ORG                  CDAC                
DATE                 2017                
GPE                  Kolhapur            
ORG                  Maharashtra         
DATE                 2016                
ORG                  HTML                
DATE                 Less than 1 year    
PERSON               Linux               
DATE                 Less than 1 year    
ORG                  MICROSOFT           
PERSON               ACCESS              
PERSON               C++                 
PERSON               Java   

##### Questions: What is your first impression of spacy's NER based on the results above? Does it seem accurate/powerful?
##### Does it make many mistakes? Do some entity types seem more accurate than others? 
Spacy's pretrained NER did not perform well on this resume; there are many misclassifications. In particular, PERSON was not assigned to the author of the resume (the only part of the name that was tagged, "Jamadar", was labeled GPE). Instead, several of the listed skills, such as Linux, ACCESS, C++ and Java, were classified as PERSON. There may be people with such names, but in the context of this resume it makes no sense. DATE seems to work best (2016, 2017, "Less than 1 year"), and ORG is right in some cases (IIIT Committee, CDAC) but wrong in others (Maharashtra, HTML). Overall, the results are not impressive.

Now as a comparison, we will list the entities contained in the resume's original annotated training data (remember, the existing annotations were created by a human-annotator, and not predicted by a machine like the entities predicted above) 

Complete the code below to do the following: 
* Access the 'entities' list of the example resume you chose, loop through the entities and print them out. 
* *Tip: one entity in the list is a tuple with the structure (12, 1222, "label"), where the first element is the start index of the entity in the resume text, the second element is the end index, and the third element is the label.*
* Use this Tip to print out a formatted list of entities 
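
The offset convention from the tip above can be checked on a toy string. This is only a minimal sketch: the text, the `annotation` dictionary and the helper `entity_texts` are made up for illustration and merely mimic the structure of our resumes. Note that the end index is exclusive, as in ordinary Python slicing.

```python
# Toy example (not taken from the dataset): a text plus character-offset annotations.
text = "Jane Doe studied at Example University in 2016."
annotation = {"entities": [(20, 38, "College Name"), (42, 46, "Graduation Year")]}

def entity_texts(text, entities):
    """Return (label, covered text) pairs by slicing text[start:end]."""
    return [(label, text[start:end]) for start, end, label in entities]

for label, span in entity_texts(text, annotation["entities"]):
    print('{:<20}: {}'.format(label, span))
```

The same slicing works on a real resume, where the text is the first element and the annotation dictionary is the second.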



In [144]:
##TODO print original entities for one resume
res = resumes[0]
restext = res[0] 
labeled_ents = res[1]['entities']
## TODO print out formatted list of entity labels and text
for ent in labeled_ents:
    start, end, label = ent
    print('{:<25}:'.format(label), '{:<30}'.format(restext[start:end]))
    print(' ')
  

Email Address            : indeed.com/r/Afreen-Jamadar/8baf379b705e37c6
 
Links                    : https://www.indeed.com/r/Afreen-Jamadar/8baf379b705e37c6?isid=rex-download&ikw=download-top&co=IN
 
Skills                   : Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT
ACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)

ADDITIONAL INFORMATION

TECHNICAL SKILLS:

• Programming Languages: C, C++, Java, .net, php.
• Web Designing: HTML, XML
• Operating Systems: Windows […] Windows Server 2003, Linux.
• Database: MS Access, MS SQL Server 2008, Oracle 10g, MySql.
 
Graduation Year          : 2016                          
 
College Name             : Shivaji University Kolhapur   
 
Degree                   : Bachelor of Engg in Information Technology
 
Graduation Year          : 2017
                         
 
College Name             : CDAC ACTS                     
 
Degree                   : PG-DAC                        
 

As we already know, the annotated entities in the training data differ from the entities spacy can recognize with its pretrained models, so we need to train a custom NER model. We will get started with that now. 

##### Prepare Training Data for NER model training
We need to do some more preprocessing of our training data before we can train our model.

Remember the entity labels you chose in part 1 of the challenge? We will be training a model to predict those entities.
As a first step, we will gather all resumes that contain at least one training annotation for those entities.

Complete and execute the code below to gather your training data. 

In [145]:
# all unique labels
import numpy as np
unique_labels = np.array(['College', 'College Name', 'Companies worked at', 'Degree', 'Designation',
 'Email Address', 'Graduation Year', 'Links', 'Location', 'Name', 'Skills',
 'UNKNOWN', 'Years of Experience', 'state'])
#chosen labels in exercise1
c_label = [unique_labels[1], unique_labels[2], unique_labels[3]]
print(c_label)

['College Name', 'Companies worked at', 'Degree']


In [146]:
##TODO Store the entity labels you want to train for as array in chosen_entity_labels
chosen_entity_labels = c_label
print("Chosen entity labels: ",chosen_entity_labels)
## this method gathers all resumes which contain every one of the chosen entity labels above.
def gather_candidates(dataset,entity_labels):
    candidates = list()
    for resume in dataset:
        ## collect the labels directly, so a resume without any annotations does not raise an IndexError
        res_ent_labels = {ent[2] for ent in resume[1]["entities"]}
        if set(entity_labels).issubset(res_ent_labels):
            candidates.append(resume)
    return candidates
## TODO use the gather candidates methods and store result in training_data variable
training_data = gather_candidates(resumes, chosen_entity_labels)
print("Gathered {} training examples".format(len(training_data)))

Chosen entity labels:  ['College Name', 'Companies worked at', 'Degree']
Gathered 266 training examples


Now we have the training examples which contain the entities we are interested in. Do you have at least a few hundred examples? If not, you might need to re-think the entities you chose, or try just one or two of them, and re-run the notebooks. It is important that we have several hundred examples for training (more than 200; 300-500 is better). 
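
If the number of examples turns out to be too small, it can help to see how many resumes carry each label before choosing. The following is a minimal sketch under the assumption that the dataset has the (text, {"entities": [...]}) structure used above; `count_resumes_per_label` and `toy_dataset` are hypothetical names introduced here for illustration.

```python
from collections import Counter

def count_resumes_per_label(dataset):
    """Count, for each label, how many resumes contain at least one such annotation."""
    counts = Counter()
    for _text, annotations in dataset:
        # a set, so a label counts once per resume however often it occurs
        counts.update({label for _start, _end, label in annotations["entities"]})
    return counts

# Toy data (hypothetical), same structure as our resumes list
toy_dataset = [
    ("text a", {"entities": [(0, 4, "Degree"), (5, 6, "Degree")]}),
    ("text b", {"entities": [(0, 4, "Degree"), (5, 6, "Skills")]}),
    ("text c", {"entities": []}),
]
print(count_resumes_per_label(toy_dataset).most_common())
```

Calling the function on the loaded resumes would give a rough upper bound on the training examples available per label; requiring several labels at once, as `gather_candidates` does, can only lower that number.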

In [147]:
# print training_data
print(training_data)

[array(['Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\n• Prog

      dtype=object)]


##### Remove other entity annotations from training data
Now that we have our training data, we want to remove all but relevant (chosen) entity annotations from this data, so that the model we train will only train for our entities. Complete and execute the code below to do this. 

In [148]:
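## keep only the annotations whose label is in keep_labels
## (the parameter is not called "filter", to avoid shadowing the Python builtin)
def filter_ents(ents, keep_labels):
    filtered = [ent for ent in ents if ent[2] in keep_labels]
    return filtered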

## TODO use method above to remove all but relevant (chosen) entity annotations and store in X variable 
X = []
for i in range(len(training_data)):
    entities = training_data[i][1]['entities']
    entities = filter_ents(entities, chosen_entity_labels)
    X.append([training_data[i][0],{'entities': entities}])
    
print(len(X))
print(X[0][1])

266
{'entities': [[675, 703, 'College Name'], [631, 673, 'Degree'], [614, 623, 'College Name'], [606, 612, 'Degree'], [438, 454, 'Companies worked at']]}


##### Remove resumes that cause errors in spacy
Depending on what entities you chose, some of the resumes might cause errors in spacy. We don't need to get into details as to why, suffice to say it has to do with whitespace and syntax in the entity annotations. If these resumes are not removed from our training data, spacy will throw an exception during training, so we need to remove them first. 

We will use the remove_bad_data function below to do this. This function does the following:
* calls train_spacy_ner with debug=True and n_iter=1. This causes spacy to process the documents one by one and to collect the documents that throw an exception as "bad docs", which it returns. 
* You will complete the function to remove any baddocs (returned by train_spacy_ner) from your training data list. 

You may or may not have any bad documents depending on the entities you chose. In any case, there should not be more than a dozen or so bad docs.  
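
The text above attributes the errors to whitespace in the entity annotations. One possible alternative to discarding such resumes would be to shrink each span so that it neither starts nor ends on whitespace. The helper below is a hypothetical sketch and not part of `spacy_train_resume_ner`; whether it resolves every exception for a given choice of labels is an assumption that has not been verified here.

```python
def trim_entity_spans(text, entities):
    """Shrink (start, end, label) spans so they neither start nor end on whitespace."""
    trimmed = []
    for start, end, label in entities:
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:  # drop spans that consisted of whitespace only
            trimmed.append((start, end, label))
    return trimmed

# Toy example (hypothetical): the span covers a leading space and a trailing newline
toy_text = "Degree: B.Sc. \nNext line"
print(trim_entity_spans(toy_text, [(7, 15, "Degree")]))
```

In this exercise we stay with the simpler approach of removing the affected resumes.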

In [149]:
from spacy_train_resume_ner import train_spacy_ner

def remove_bad_data(training_data):
    model, baddocs = train_spacy_ner(training_data, debug=True, n_iter=1)
    ## training data is list of lists with each list containing a text and annotations
    ## baddocs is a set of strings/resume texts.
    ## TODO complete implementation to filter bad docs and store filter result (good docs) in filtered variable
    filtered = [data for data in training_data if data[0] not in baddocs]
    print("Unfiltered training data size: ",len(training_data))
    print("Filtered training data size: ", len(filtered))
    print("Bad data size: ", len(baddocs))
    return filtered

## call remove method. It may take a few minutes for the method to complete.
## you will know it is complete when the print statements in the function above produce their output. 
X = remove_bad_data(X)

Created blank 'en' model
Exception thrown when processing doc:
("Punit Raghav\nSales Manager - Mukund Overseas - Magnum\n\nThane, Maharashtra - Email me on Indeed: indeed.com/r/Punit-Raghav/f36e9e4d0857ac5b\n\nA competent professional with over 8 years of experience in:\n\n- Handling Dealer & Distributor - Business Development - Handling Projects\n- Architect & Interior Designer - Handling Carpenters & Contractors\n\nCore Functional Skills:\n\n• Effectively meet deadlines, achieve targets and work under pressure.\n• Company success driven - passionate about company's product line.\n• Accounting-related computer literacy.\n• Supervising the performance of dealers / distributors with key emphasis on achieving revenue\ntargets.\n• Excellent communication skills, written and verbal.\n• Effective presentation of complex issues.\n• High level of negotiation skills.\n\nWilling to relocate to: Maharashtra - South india\n\nWORK EXPERIENCE\n\nSales Manager\n\nMukund Overseas - Magnum -  Mumbai, 

##### Question: How many bad docs did you have? What is the size of your new (filtered) training data? 
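The exception shown above indicates that some documents were flagged as bad. The train/test split below yields 208 + 52 = 260 examples, so 6 of the 266 gathered documents were removed as bad docs, leaving a filtered training set of 260 examples.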

##### Train/Test Split
Now before we train our model, we have to split our available training data into training and test sets. Splitting our data into train and test (or holdout) datasets is a fundamental technique in machine learning, and essential to avoid the problem of overfitting.
Before we go on, you should get a grasp of how train/test split helps us avoid overfitting. Please take the time now to do a quick web search on the topic. There are many resources available. You should search for "train test validation overfitting" or some subset of those terms.

Here are a few articles to start with:
* https://machinelearningmastery.com/a-simple-intuition-for-overfitting/
* https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation (Note: you are free to install scikit learn and use the train_test_split method documented here, but it is not necessary. It is the concept that is important)

##### Question: What is overfitting and how does doing a train/test split help us avoid overfitting when training our models? Please answer in your own words. 
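_Overfitting_ means that a model fits the training data too closely, including its noise and peculiarities, so that it does not generalize and makes bad predictions on unseen data. With a train/test split, we hold back part of the data that the model never sees during training. Evaluating on this held-out test set tells us how the model performs on unseen data, so a large gap between training and test performance reveals overfitting.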

Now that we understand why we do a train/test split, we will write some code that splits our data into train and test sets. Usually we want around 70-80% of the data for train, and the rest for test. 
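
A split is easier to debug when it is reproducible. The following minimal sketch uses only the standard library and a fixed seed; the function `seeded_split`, its parameters and the toy data are hypothetical and serve as an illustration, not as the solution expected in the cell below.

```python
import random

def seeded_split(items, test_fraction, seed=42):
    """Shuffle a copy of items with a fixed seed and split off test_fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]

toy_items = list(range(10))  # stand-in for our resumes
toy_train, toy_test = seeded_split(toy_items, test_fraction=0.2)
print(len(toy_train), len(toy_test))
```

Because the seed is fixed, calling the function again yields the same split, which makes results comparable between runs.
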
##### TODO: Complete the code below to create a train and test dataset

In [150]:
import numpy as np
import copy
##TODO complete the implementation  of the train test split function below
def train_test_split(X, test_percent):
    ## note: the second argument is the fraction held out for TESTING (0.2 -> 80% train, 20% test)
    X_array = np.array(X)
    train_size = int(len(X_array) - len(X_array) * test_percent)
    ## shuffle the rows in place so that the split is random
    np.random.shuffle(X_array)
    train = X_array[:train_size]
    print(len(train))
    test = X_array[train_size:]
    print(len(test))
    return train, test
## TODO chose train size percent and call train test split, storing results in "train" and "test" variables.
train, test = train_test_split(X, 0.2)

train_copy = copy.copy(train)
test_copy = copy.copy(test)
## TODO use python assert to assert that the size of train and test sets add up to the size of all the data 
assert len(train) + len(test) == len(X)
## check that the shuffle changed the order (annotation dicts are compared by identity)
assert any(a[1] is not b[1] for a, b in zip(train, X))
## check that train and test do not share any resume
assert not {id(r[1]) for r in train} & {id(r[1]) for r in test}

208
52


##### Train a spacy ner model with our training data
OK, now it is (finally) time to train our own custom NER model using spacy. Because our training data has been preprocessed to only include annotations for the entities we are interested in, the model will only be able to predict/extract those entities. 
*Depending on your computer, this step may take a while.* Training 20 epochs (iterations) using 480 training examples takes around 10 minutes on my machine (core i7 CPU). You will see output like *Losses {'ner':2342.23342342}* after each epoch/iteration. The default number of iterations is 20, so you will see this output 20 times. When this step is done, we will use the trained ner model to perform predictions on our test data in our test set.  

In [151]:
## run this code to train a ner model using spacy
custom_nlp,_ = train_spacy_ner(train, n_iter = 21)

Created blank 'en' model
Losses {'ner': 22778.46796760335}
Losses {'ner': 7699.042200821423}
Losses {'ner': 13603.403596112414}
Losses {'ner': 13042.800633509985}
Losses {'ner': 8602.311598775908}
Losses {'ner': 3395.2406688699944}
Losses {'ner': 739.9971548573317}
Losses {'ner': 202.56857165942435}
Losses {'ner': 114.4411992087397}
Losses {'ner': 24.65546197887443}
Losses {'ner': 0.030643504539188664}
Losses {'ner': 0.19450170255177712}
Losses {'ner': 0.10347951885830613}
Losses {'ner': 0.0038328029601710856}
Losses {'ner': 0.0071200881213467045}
Losses {'ner': 0.03143359986257925}
Losses {'ner': 0.00028600870081909306}
Losses {'ner': 0.0047526732454013215}
Losses {'ner': 0.004162406509009507}
Losses {'ner': 0.011605435798525673}
Losses {'ner': 0.0011857672821065333}


##### Inspect NER predictions on one sample resume
Now that we have a trained model, let's see how it works on one of our resumes. 

In [152]:
## TODO fetch one resume out of our test dataset and store to the "resume" variable
import copy
import random
# pick a random resume; unlike randint, randrange excludes len(test), so the index is always valid
n = random.randrange(len(test))
resume = test[n]
resumetxt = resume[0]
labeled_ents = resume[1]['entities']
## TODO create a spacy doc out of the resume using our trained model and save to the "doc" variable 
doc = custom_nlp(resumetxt)
doc_copy = copy.copy(doc)

Now we will output the predicted entities and the existing annotated entities in that doc

In [153]:
## TODO output label and text of predicted entities (in "ents" variable of the spacy doc created above)
print("PREDICTED: \n")
for ent in doc.ents:
    print('{:<20}:'.format(ent.label_), '{:<20}'.format(ent.text))

PREDICTED: 

Companies worked at : Chartered Institute 
Companies worked at : BC                  
Companies worked at : B.TECH. in          
Degree              : Technology          
College Name        : Amity University    
Degree              : College             


In [154]:
## TODO output labeled entities (in "entities" dictionary of resume)
print("LABELED: \n")
for ent in labeled_ents:
    start, end, label = ent
    print('{:<20}:'.format(label), '{:<20}'.format(resumetxt[start:end]))
    print(' ')

LABELED: 

Companies worked at : Accenture           
 
College Name        : Canossa Convent School

 
College Name        : Canossa Convent Girls Inter College

 
College Name        : Amity University    
 
Degree              : B.TECH. in Information Technology

 
Companies worked at : Accenture           
 
Companies worked at : Accenture           
 
Companies worked at : Accenture           
 
Companies worked at : Accenture           
 
Companies worked at : Accenture           
 


#### Evaluation Metrics for NER
Now that we can predict entities using our trained model, we can compare our predictions with the original annotations in our training data to evaluate how well our model performs for our task. The original annotations have been annotated manually by human annotators, and represent a "Gold Standard" against which we can compare our predictions. 

For most classification tasks, the most common evaluation metrics are:
* accuracy
* precision
* recall
* f1 score

In order to understand these metrics, we need to understand the following concepts:
* True positives - How many of the predicted entities are "true" according to the Gold Standard? (training annotation) 
* True negatives - How many entities did the model not predict which are actually not entities according to the Gold Standard?
* False positives - How many entities did the model predict which are NOT entities according to the Gold Standard?  
* False negatives - How many entities did the model "miss" - e.g. did not recognize as entities which are entities according to the Gold Standard? 
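
The counts above can be sketched at entity level by treating predictions and Gold Standard as sets of (start, end, label) tuples and requiring an exact match. This is a simplified illustration with made-up offsets; `count_tp_fp_fn` is a hypothetical helper. True negatives are left out because, at entity level, every span that is neither predicted nor annotated would count as one.

```python
def count_tp_fp_fn(predicted, gold):
    """Compare two collections of (start, end, label) tuples by exact match."""
    predicted, gold = set(map(tuple, predicted)), set(map(tuple, gold))
    tp = len(predicted & gold)   # predicted and in the Gold Standard
    fp = len(predicted - gold)   # predicted but not in the Gold Standard
    fn = len(gold - predicted)   # in the Gold Standard but missed by the model
    return tp, fp, fn

# Toy spans (hypothetical offsets)
toy_gold = [(0, 9, "Companies worked at"), (20, 36, "College Name"), (40, 46, "Degree")]
toy_pred = [(20, 36, "College Name"), (40, 44, "Degree"), (50, 52, "Companies worked at")]
print(count_tp_fp_fn(toy_pred, toy_gold))
```

Note that the "Degree" prediction with the wrong end offset counts both as a false positive and as a false negative under exact matching.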

Before we go on, it is important that you understand true/false positives/negatives as well as the evaluation metrics above. Take some time now to research the web in order to find answers to the following questions:

##### Question: How are the evaluation metrics above defined in the context of evaluating Machine Learning models? How do they relate to True/False Positives/Negatives above? Please provide an intuitive description as well as the mathematical formula for each metric. 
To build intuition for the evaluation metrics above, it helps to understand the confusion matrix, because all of the metrics mentioned can be derived from it.
The confusion matrix is defined as follows: the columns represent the hypotheses, i.e. the predictions made by our model, 
whereas the rows represent the true labeled values, i.e. the references.
<img src="./dataset/confusion_matrix.png" alt="Drawing" style="width: 400px;"/>    
    
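Let $n_{i,j}$ denote the entry in row $i$ and column $j$ of the confusion matrix (true class $i$, predicted class $j$), $K$ the number of classes, and $N$ the total number of samples.

* __Accuracy:__ the share of all samples that were classified correctly.
$$ A = \dfrac{1}{N}\sum_{k=1}^{K}n_{k,k}\cdot 100\% $$  
In terms of True/False Positives/Negatives, we have to consider either the micro average:
  
  
\begin{align}
    A_{micro} =\frac{\#TP_{1} + \dots + \#TP_{K}+\#TN_{1} + \dots + \#TN_{K}}{\#TP_{1} + \dots + \#TP_{K} + \#FP_{1} + \dots + \#FP_{K} + \#FN_{1} + \dots + \#FN_{K} + \#TN_{1} + \dots + \#TN_{K}}
\end{align}

$\ \ \ \ \ \ $ or the macro average, where $A_{k}$ is the accuracy computed for class $k$ alone:

\begin{align}
    A_{macro} =\frac{A_{1} + \dots + A_{K}}{K}
\end{align}

* __Precision__: of all samples predicted as class $k$, the share that truly belongs to class $k$.
$$ PRE_{k} = \dfrac{n_{k,k}}{\sum_{i=1}^{K}n_{i,k}} = \dfrac{\#TP_{k}}{\#TP_{k}+\#FP_{k}} $$  

  
  
\begin{align}
    PRE_{micro} =\frac{\#TP_{1} + \dots + \#TP_{K}}{\#TP_{1} + \dots + \#TP_{K}+\#FP_{1} + \dots + \#FP_{K}}
\end{align}
  
  
* __Recall__: of all samples that truly belong to class $k$, the share that the model found.
$$ REC_{k} = \dfrac{n_{k,k}}{\sum_{i=1}^{K}n_{k,i}} = \dfrac{n_{k,k}}{N_{k}} = \dfrac{\#TP_{k}}{\#TP_{k}+\#FN_{k}} $$  
where $N_{k}$ is the number of samples whose true class is $k$.
  
  
\begin{align}
    REC_{micro} =\frac{\#TP_{1} + \dots + \#TP_{K}}{\#TP_{1} + \dots + \#TP_{K}+\#FN_{1} + \dots + \#FN_{K}}
\end{align}
  
  
* __F1 score__: the harmonic mean of precision and recall.

\begin{align}
    F_{1} =2 \cdot \frac{PRE \cdot REC}{PRE + REC}
\end{align}
  
 The macro averages of _Precision_ and _Recall_ can be computed analogously to the macro average of _Accuracy_.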
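
To make the difference between micro and macro averaging concrete, here is a minimal sketch that computes precision, recall and F1 from raw counts. The per-class counts in `toy_counts` are invented for illustration, and the macro F1 is taken here as the unweighted mean of the per-class F1 scores, which is one common convention and an assumption of this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 where a denominator is zero."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

# Hypothetical per-class counts: label -> (TP, FP, FN)
toy_counts = {"Degree": (8, 2, 2), "College Name": (1, 1, 3)}

# micro average: sum the counts over all classes first, then compute the metric
micro = precision_recall_f1(*(sum(c) for c in zip(*toy_counts.values())))
# macro average: compute the metric per class, then take the unweighted mean
per_class = [precision_recall_f1(*c) for c in toy_counts.values()]
macro = tuple(sum(m) / len(per_class) for m in zip(*per_class))
print("micro:", micro)
print("macro:", macro)
```

The micro average is dominated by the frequent class, while the macro average gives the rare class the same weight, so the two can differ noticeably on imbalanced data.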

##### Calculating Metrics based on token-level annotations or full entity-level. 
The concepts above are our first step toward understanding how to evaluate our model effectively. However, in NER, we need to take into account that we can calculate our metrics either based on all tokens (words) found in the document, or only on the entities found in the document.  

##### Token-Level evaluation. 
Token-level evaluation measures how accurately the model tagged *each individual word/token* in the input. In order to understand this, we need to understand something called the "BILUO" scheme (or BILOU or BIO). The spacy docs have a good reference. Please read and familiarize yourself with BILUO. 

https://spacy.io/api/annotation#biluo 



Up to now, we have not been working with the BILUO scheme, but with "offsets" (for example: (112,150,"Email") - which says there is an "Email" entity between positions 112 and 150 in the text). We would like to be able to evaluate our models on a token-level using BILUO - so we need to convert our data to BILUO. Fortunately, Spacy provides a helper method to do this for us.
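
To make the scheme concrete before using spacy's helper, here is a hand-rolled sketch of the conversion from offsets to BILUO tags. It is a simplification and not spacy's implementation: tokens are plain whitespace-separated chunks, and an entity whose offsets do not fall on token boundaries is simply skipped. The function name and the toy sentence are hypothetical.

```python
import re

def simple_biluo_tags(text, entities):
    """Tag whitespace-separated tokens with BILUO labels from (start, end, label) offsets.

    Simplification: tokens are maximal non-whitespace runs (spacy's tokenizer differs),
    and an entity whose offsets do not fall on token boundaries is ignored.
    """
    tokens = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    starts = {tok_start: i for i, (tok_start, _, _) in enumerate(tokens)}
    ends = {tok_end: i for i, (_, tok_end, _) in enumerate(tokens)}
    for ent_start, ent_end, label in entities:
        if ent_start not in starts or ent_end not in ends:
            continue  # misaligned span
        first, last = starts[ent_start], ends[ent_end]
        if first == last:
            tags[first] = "U-" + label        # Unit: single-token entity
        else:
            tags[first] = "B-" + label        # Begin of a multi-token entity
            for i in range(first + 1, last):
                tags[i] = "I-" + label        # In: inner token
            tags[last] = "L-" + label         # Last token
    return [(tok, tag) for (_, _, tok), tag in zip(tokens, tags)]

toy_text = "Analyst at Accenture , studied at Amity University"
toy_ents = [(11, 20, "Companies worked at"), (34, 50, "College Name")]
for tok, tag in simple_biluo_tags(toy_text, toy_ents):
    print('{:<12}{}'.format(tok, tag))
```

Every token outside an entity receives the tag "O", which is why most tokens of a resume end up tagged "O".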

*Execute the code below to see how our "Gold Standard" and predictions for our example doc above look in BILUO scheme.* 
Note: some of the lines might be omitted for display purposes. 

In [155]:
from spacy.gold import biluo_tags_from_offsets
import pandas as pd
from IPython.display import display, HTML

## returns a pandas dataframe with tokens, prediction, and true (Gold Standard) annotations of tokens
def make_bilou_df(nlp,resume):
    """
    param nlp - a trained spacy model
    param resume - a resume from our train or test set
    """
    doc = nlp(resume[0])
    bilou_ents_predicted = biluo_tags_from_offsets(doc, [(ent.start_char,ent.end_char,ent.label_)for ent in doc.ents])
    bilou_ents_true = biluo_tags_from_offsets(doc, [(ent[0], ent[1], ent[2]) for ent in resume[1]["entities"]])

    
    doc_tokens = [tok.text for tok in doc]
    bilou_df = pd.DataFrame()
    bilou_df["Tokens"] =doc_tokens
    bilou_df["Tokens"] = bilou_df["Tokens"].str.replace("\\s+","", regex=True)  ## regex=True is required in newer pandas versions
    bilou_df["Predicted"] = bilou_ents_predicted
    bilou_df["True"] = bilou_ents_true
    return bilou_df

## TODO call method above with a resume from test set and store result in bilou_df variable.
bilou_df = make_bilou_df(custom_nlp, resume)
display(bilou_df)  


Unnamed: 0,Tokens,Predicted,True
0,Manjari,O,O
1,Singh,O,O
2,,O,O
3,Senior,O,O
4,Software,O,O
5,Analyst,O,O
6,-,O,O
7,Accenture,O,U-Companies worked at
8,,O,O
9,Bengaluru,O,O


Based on this output, it should be very easy to calculate a token-level accuracy. We simply compare the "Predicted" to "True" columns and calculate what percentage are the same. 

In [156]:
import numpy as np
## TODO bilou_df is a pandas dataframe. Use pandas dataframe api to get a subset where predicted and true are the same. 
same_df = bilou_df[bilou_df["Predicted"] == bilou_df["True"]]
## accuracy is the length of this subset divided by the length of bilou_df
accuracy = len(same_df)/len(bilou_df)
print("Accuracy on one resume: ",accuracy)


Accuracy on one resume:  0.9800738007380074


The accuracy might seem pretty good... if it is not 100%, then let's print out those tokens where the model predicted something different than the gold standard by running the code below. 

Note - if your score on one doc is 100%, pick another document and re-run the last few cells above. 

In [157]:
import numpy as np
import pandas as pd
## TODO find all rows in bilou_df where "Predicted" not equal to "True" column.
diff_df = bilou_df[bilou_df["Predicted"] != bilou_df["True"]].reset_index(drop=True)
## use lower-case column names for the display
diff_df.columns = ['tokens', 'predicted', 'true']
display(diff_df)

Unnamed: 0,tokens,predicted,true
0,Accenture,O,U-Companies worked at
1,Chartered,B-Companies worked at,O
2,Institute,L-Companies worked at,O
3,Accenture,O,U-Companies worked at
4,Accenture,O,U-Companies worked at
5,Accenture,O,U-Companies worked at
6,Accenture,O,U-Companies worked at
7,BC,U-Companies worked at,O
8,B.TECH,B-Companies worked at,-
9,.,I-Companies worked at,-


<img src="./dataset/bilou.png" alt="Drawing" style="width: 500px;"/>    


Now let's calculate the accuracy on all our test resumes and average them for an accuracy score. 

Please complete the code below to report an accuracy score on our test resumes

In [158]:
import numpy as np
doc_accuracy = []
for res in test:
    ## TODO calculate accuracy for each 'res' and append to doc_accuracy list 
    bilou_df_a = make_bilou_df(custom_nlp, res)
    bilou_df_a = np.array(bilou_df_a)
    accuracy = np.sum(np.where(bilou_df_a[:,1] == bilou_df_a[:,2], 1, 0))/len(bilou_df_a)
    doc_accuracy.append(accuracy)

## TODO calculate mean/average of doc_accuracy (Tip: use numpy!)
total_acc = np.mean(doc_accuracy)
print("Accuracy: ",total_acc)

    

Accuracy:  0.9159421629152841


##### Question: how does the model perform on token-level accuracy? What did it miss? In those cases where the predictions didn't match the gold standard, were the predictions plausible or just "spurious" (wrong)? 
Judged by the overall accuracy above, the model performs reasonably well. Most of the mismatches are missed entities: the model labels skill tokens as non-entity tokens ('O'). It also has trouble with punctuation. Sometimes it assigns punctuation to an entity, and sometimes it labels punctuation as 'O'. The latter would be fine, but the gold standard marks some (perhaps all) punctuation tokens as '-', so the row counts as a mismatch even though the model predicted a non-entity token. Punctuation should arguably not be treated as a token at all and should simply be ignored, which our NER model apparently does not do. A smaller share of the mismatches is plausible rather than spurious: for example, the model predicts a single-token entity (U-) where the gold standard marks the first token of a multi-token entity (B-), and other such combinations occur as well. These cases arise when the model does not recognise the remaining tokens of the entity as entity tokens.

##### Question: What might the advantages and disadvantages be of calculating accuracy on token-level? Hint: think about a document with 1000 tokens where only 10 tokens are annotated as entities. What might the accuracy be on such a document?  
It depends on what the score is used for, much like in the "ham or spam" problem. Token-level accuracy is simple to compute, but in our case it will almost always be high, and a high value does not automatically mean that the model is good at extracting the right entities from a given text. In most cases a text contains far more non-entity tokens than entity tokens, so the _TN_ count dominates the accuracy, whereas _TP_, _FP_ and _FN_ hardly matter. In the hinted example, a model that labels all 1000 tokens as non-entity still reaches 99% accuracy without finding a single entity. In summary, accuracy is sufficient only if we are not really interested in how well the trained model extracts the chosen entities from unseen text. For a better and more detailed evaluation we must also consider other metrics such as _Precision_, _Recall_ and _F1-score_.
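
The hint can be checked with a short calculation. The sketch below assumes a hypothetical document of 1000 tokens with 10 entity tokens, and a trivial "model" that predicts `O` for every token. The label `U-Degree` is used only as a placeholder.

```python
# Hypothetical document: 990 non-entity tokens and 10 entity tokens
n_tokens = 1000
n_entity_tokens = 10
true_labels = ["O"] * (n_tokens - n_entity_tokens) + ["U-Degree"] * n_entity_tokens

# Trivial baseline that never predicts an entity
pred_labels = ["O"] * n_tokens

# Token-level accuracy: share of positions where prediction equals gold label
correct = sum(p == t for p, t in zip(pred_labels, true_labels))
accuracy = correct / n_tokens

# Recall on entity tokens: share of gold entity tokens that were predicted correctly
found = sum(p == t and t != "O" for p, t in zip(pred_labels, true_labels))
recall = found / n_entity_tokens

print("Accuracy:", accuracy)
print("Entity recall:", recall)
```

The baseline is right on every non-entity token and wrong on every entity token, which is why accuracy alone says little about entity extraction.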

##### Entity-Level evaluation #####
Another method of evaluating the performance of our NER model is to calculate metrics not on token-level, but on entity level. There is a good blog article that describes this method. 

http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/

The article goes into some detail; the most important part is the set of scenarios described in the section "Comparing NER system output and golden standard". 

##### Question: how do the first 3 scenarios described in the section "Comparing NER system output and golden standard" correlate to  true/false positives/negatives? 
<img src="./dataset/evaluation_metrics.png" alt="Drawing" style="width: 500px;"/>


* __First__ one correlates to __TP__ and __TN__.  
* __Second__ one to __FP__
* __Third__ one to __FN__
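
As a rough illustration of counting on entity level, the sketch below represents each entity as a `(start, end, label)` tuple and counts only exact matches. The spans are hypothetical, and partial overlaps or wrong-type matches (the later scenarios in the article) are deliberately not handled here.

```python
# Hypothetical gold-standard and predicted entities as (start, end, label) tuples
gold = {
    (0, 9, "Companies worked at"),
    (20, 26, "Degree"),
}
predicted = {
    (0, 9, "Companies worked at"),   # exact match with a gold entity
    (30, 39, "College Name"),        # hypothesised entity with no gold counterpart
}

# Exact matches are true positives
tp = len(gold & predicted)
# Predicted entities that are not in the gold standard are false positives
fp = len(predicted - gold)
# Gold entities the system missed are false negatives
fn = len(gold - predicted)

print("TP:", tp, "FP:", fp, "FN:", fn)
```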

##### Precision, Recall, F1 #####

Now we would like to calculate precision, recall, and f1 for each entity type we are interested in (our chosen entities). To do this, we need to understand the formulas for each. A good article for this is https://skymind.ai/wiki/accuracy-precision-recall-f1. 

##### Question: how can we calculate precision, recall and f1 score based on the information above? Please provide the formulas for each #####
We will calculate the __precision__, __recall__ and __f1__ for each entity type $k$ and then compute the macro average of each metric over all $K$ entity types.

* __Precision__:
$$ PRE_{k} = \frac{\#TP_{k}}{\#TP_{k}+ \#FP_{k}} $$  

    \begin{align}
           PRE_{macro} = \frac{\sum_{k=1}^{K} PRE_{k}}{K}
    \end{align}
  
  
* __Recall__:
$$ REC_{k} = \frac{\#TP_{k}}{\#TP_{k}+ \#FN_{k}} $$  

    \begin{align}
           REC_{macro} = \frac{\sum_{k=1}^{K} REC_{k}}{K}
    \end{align}

  
* __F1 score__:

$$ F_{1,k} = 2 \cdot \frac{PRE_{k} \cdot REC_{k}}{PRE_{k} + REC_{k}} $$
    
\begin{align}
  F_{1,macro} = \frac{\sum_{k=1}^{K} F_{1, k}}{K}
\end{align}
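
To make the formulas concrete, here is a minimal sketch that applies them to made-up counts. The helper `precision_recall_f1`, the class names `A` and `B`, and the counts are all hypothetical. Returning 0.0 when a denominator is zero is an assumed convention, not something the formulas define.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from raw counts.

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical (tp, fp, fn) counts for two entity types
counts = {"A": (8, 2, 8), "B": (0, 0, 5)}

# Metrics per entity type k
per_class = {label: precision_recall_f1(*c) for label, c in counts.items()}

# Macro average: sum over the K entity types, divided by K
K = len(per_class)
macro = tuple(sum(m[i] for m in per_class.values()) / K for i in range(3))

print(per_class)
print("Macro precision, recall, F1:", macro)
```

Class `B` shows why the guard matters: with no predictions for a class, the plain formula would divide by zero.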

Now supply code below which calculates precision and recall and F1 on our test data for each entity type we are interested in. 



In [159]:
import numpy as np
import pandas as pd
import numpy.ma as ma

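def contains_string(labels, string):
    # count occurrences of `string` in each label (avoid shadowing the builtin `list`)
    counts = np.char.count(labels, string)
    
    # non-zero counts become True in the returned boolean mask
    return ma.make_mask(counts)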


## TODO cycle through chosen_entity_labels and calculate metrics for each entity using test data
data = []
for label in chosen_entity_labels:
    
    ## variables to store results for all resumes for one entity type
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    for tres in test:
        ## use make_bilou_df on each resume in our test set, and calculate for each entity true and false positives,
        ## and false negatives.
        tres_df = make_bilou_df(custom_nlp, tres)
        tres_df = np.array(tres_df).astype('str')
        true = tres_df[:, 2]
        pred = tres_df[:, 1]
        
        
        ## calculate true false positives and false negatives for each resume
        tp = np.sum(np.where((pred == true) & (contains_string(true, label)), 1, 0))
        fp = np.sum(np.where((pred != true) & (contains_string(pred, label)) & (true == 'O'), 1, 0))
        fn = np.sum(np.where((pred != true) & (pred == 'O') & (contains_string(true, label)), 1, 0))
        
        ## aggregate result for each resume to totals
        true_positives = true_positives   + tp
        false_positives = false_positives + fp
        false_negatives = false_negatives + fn
    
    print("For label '{:<20}' tp: {:<20} fp: {:<20} fn: {:<20}".format(label,true_positives,false_positives,false_negatives))
    
    ## TODO Use the formulas you learned to calculate metrics and print them out        
    precision = true_positives/(true_positives + false_positives)
    recall = true_positives/(true_positives + false_negatives)
    f1 = 2 * (precision * recall/(precision + recall))
    print("Precision: ",precision)
    print("Recall: ",recall)
    print("F1: ",f1)
    row = [precision,recall,f1]
    data.append(row)

## make pandas dataframe with metrics data. Use the chosen entity labels as an index, and the metric names as columns. 
metric_df = pd.DataFrame(data = data, index=chosen_entity_labels, columns=['Precision', 'Recall', 'F1'] )
display(metric_df)

For label 'College Name        ' tp: 42                   fp: 224                  fn: 115                 
Precision:  0.15789473684210525
Recall:  0.267515923566879
F1:  0.19858156028368792
For label 'Companies worked at ' tp: 101                  fp: 305                  fn: 250                 
Precision:  0.24876847290640394
Recall:  0.28774928774928776
F1:  0.2668428005284016
For label 'Degree              ' tp: 7                    fp: 16                   fn: 146                 
Precision:  0.30434782608695654
Recall:  0.0457516339869281
F1:  0.07954545454545454


Unnamed: 0,Precision,Recall,F1
College Name,0.157895,0.267516,0.198582
Companies worked at,0.248768,0.287749,0.266843
Degree,0.304348,0.045752,0.079545


Now we compute an average score for each metric. 

In [160]:
## TODO compute average metrics And print them out. Use pandas dataframe "mean" method to do this
import pandas as pd
data_mean = np.mean(data,axis=0).reshape((1,3))
metric_df_mean = pd.DataFrame(data = data_mean, index = ['mean'], columns=['Precision', 'Recall', 'F1'] )
display(metric_df_mean)

Unnamed: 0,Precision,Recall,F1
mean,0.237004,0.200339,0.181657


##### Question: how do the average metrics here (computed on entity-level) compare to the token-level accuracy score above? Which metric(s) would you prefer to use to evaluate the quality of your model? Why? 
The entity-level averages (precision 0.24, recall 0.20, F1 0.18) are far lower than the token-level accuracy of about 0.92. I would suggest using both kinds of metric. Token-level accuracy tells me how well the model recognises non-entity tokens, whereas the entity-level metrics indicate how well it recognises each entity type. Both pieces of information are needed for a trustworthy picture of the model's performance.

We are almost done with part II! We just need to save our BILUO training data for reuse in Part III. Run the code below to do this. 

In [161]:
# prepare data for flair: in our case a sentence for flair is a run of characters that ends with a period '.'
for res in train_copy:
    
    res[0] = str(np.char.replace (res[0], '\n\n', ' '))
    res[0] = str(np.char.replace (res[0], '\n', ' '))
    res[0] = str(np.char.replace (res[0], '.', '.\n'))
    res[0] = str(np.char.replace (res[0], ': ', ':'))
    
    

In [162]:
for res in test_copy:
    
    res[0] = str(np.char.replace (res[0], '\n\n', ' '))
    res[0] = str(np.char.replace (res[0], '\n', ' '))
    res[0] = str(np.char.replace (res[0], '.', '.\n'))
    res[0] = str(np.char.replace (res[0], ': ', ':'))
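
The two cells above apply the same four replacements to the training and test resumes. A small helper could hold them in one place. The sketch below uses plain `str.replace`, which should behave the same as `np.char.replace` for a single string. The name `to_sentence_lines` and the sample text are hypothetical.

```python
def to_sentence_lines(text):
    """Flatten line breaks, then start a new line after every period."""
    text = text.replace("\n\n", " ")   # paragraph breaks become spaces
    text = text.replace("\n", " ")     # remaining line breaks become spaces
    text = text.replace(".", ".\n")    # each period ends a "sentence" line
    text = text.replace(": ", ":")     # drop the space after a colon
    return text


# Hypothetical resume snippet
sample = "Skills: Python.\nWorked at Accenture.\n\nBengaluru"
cleaned = to_sentence_lines(sample)
print(repr(cleaned))
```

With such a helper, each loop body would reduce to `res[0] = to_sentence_lines(res[0])`.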
    
    

In [167]:
## TODO persist BILUO data as text
print("Make bilou dfs")

training_data_as_bilou = [make_bilou_df(custom_nlp, res) for res in train_copy]
test_data_as_bilou = [make_bilou_df(custom_nlp, res) for res in test_copy]
print("Done!")

training_df = pd.DataFrame(columns = ["text","ner"])
test_df = pd.DataFrame(columns = ["text","ner"])

for idx,df in enumerate(training_data_as_bilou):
    dftrain = pd.DataFrame()
    #train_data = list(filter(None, np.array(df['Tokens'])))
    dftrain["text"] = df['Tokens']
    dftrain["ner"] = df["True"]
   
   
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    training_df = pd.concat([training_df, dftrain])

for idx,df in enumerate(test_data_as_bilou):
    dftest = pd.DataFrame()
    #test_data = list(filter(None, np.array(df['Tokens'])))
    dftest["text"] = df['Tokens']
    dftest["ner"] = df["True"]
    
    
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    test_df = pd.concat([test_df, dftest])
 

te = np.array(test_df).astype("str")
tra = np.array(training_df).astype("str")

for i in range(len(te)):
    if te[i,0]=='':
        te[i,1]=''
        
for i in range(len(tra)):        
    if tra[i,0]=='':
        tra[i,1]=''
        
        
#training_df = pd.DataFrame(data = tra, columns= ["text","ner"])     
test_df = pd.DataFrame(data = te, columns= ["text","ner"])

with open("./dataset/flair/train.txt",'w+',encoding="utf-8") as f:
    #tra.to_csv(f,sep=" ",encoding="utf-8", index=False)
    np.savetxt(f, tra,encoding="utf-8", fmt='%s')
   
with open("./dataset/flair/test.txt",'w+',encoding="utf-8") as f:
    #te.to_csv(f,sep=" ",encoding="utf-8", index=False)
    np.savetxt(f, te,encoding="utf-8", fmt='%s')
    



Make bilou dfs
Done!


Now let's use flair to load the data we persisted before we go on.

In [168]:
from flair.data import Corpus
from flair.datasets import ColumnCorpus
columns = {0:"text",1:"ner"}
data_folder = "./dataset/flair"

corpus: Corpus = ColumnCorpus(data_folder, columns, train_file="train.txt", test_file="test.txt")
print(len(list(corpus.train)))
print(len(list(corpus.test)))

2019-06-18 17:47:06,908 Reading data from dataset\flair
2019-06-18 17:47:06,925 Train: dataset\flair\train.txt
2019-06-18 17:47:06,926 Dev: None
2019-06-18 17:47:06,926 Test: dataset\flair\test.txt
6440
1956


If you could load the corpus without error, you are ready to go on to part 3, where we will work with flair nlp! 