# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 2 - NER with Spacy 

##### Reload preprocessed data

In [135]:
## import json module
import json
path = "./data/converted_resumes.json"
## TODO open file load as json and store in "resumes" variable
with open(path) as json_file:  
    resumes = json.load(json_file)
## TODO print length of loaded resumes list to make sure everything is ok
print(len(resumes))

690


##### Take Spacy for a spin


In [136]:
import spacy
nlp = spacy.load('en')
print(nlp)

<spacy.lang.en.English object at 0x000001800A8DD240>


Next, we fetch the EntityRecognizer component from the loaded nlp pipeline and display the entity labels it supports.

In [137]:
ner = nlp.get_pipe('ner')
labels = ner.labels
print(labels)

('PERSON', 'ORG', 'TIME', 'ORDINAL', 'CARDINAL', 'QUANTITY', 'LANGUAGE', 'GPE', 'LOC', 'MONEY', 'EVENT', 'FAC', 'WORK_OF_ART', 'LAW', 'DATE', 'PERCENT', 'PRODUCT', 'NORP')


In [138]:
### TODO  if you choose to use spacy's 'explain' method to get the answer to the question above, provide your code here
## print description of entities using spacy explain
print(spacy.explain("GPE"))
print(spacy.explain("FAC"))
print(spacy.explain("NORP"))

Countries, cities, states
Buildings, airports, highways, bridges, etc.
Nationalities or religious or political groups


Now we will work with one of our resumes, and get spacy to tell us what entities it recognizes.

In [139]:
### TODO get a single resume text and print it out
restxt = resumes[500][0]
## print it out, removing extraneous spaces
print("\n".join(restxt.split('\n\n')))

Sandeep Mahadik
Warehouse Manager/Logistics Manager - Barrick Trade group
Thane, Maharashtra - Email me on Indeed: indeed.com/r/Sandeep-
Mahadik/4c9901ae64c8e1f2
To become successful in the field of Logistics, SCM, warehousing industry
WORK EXPERIENCE
Warehouse Manager/Logistics Manager
Barrick Trade group -  Luanda, AO -
June 2012 to Present
Roles and Responsblities- Responsible for Supervises and coordinates activities of operations of a
warehouse with wide range of operational responsibilities. Work involves day-to-day warehouse
operations including Shipping and Receiving (local & import), effective manpower allocation,
stock taking and inventory maintenance.
¬ Provides customer support in order taking and liaising
¬ Receives daily incoming shipments - local and international imports (Average 30 -50
containers)
¬ Verifies and approves the various stock transactions and documents
¬ Compiles and reports the near expiry goods stock
¬ Verifies the damaged goods and puts up the claims to

In [140]:
doc = nlp(restxt)

The doc object exposes the entities predicted by spacy through its 'ents' attribute. We loop through these entities and print the label and text of each one to see what spacy predicted for this resume.

In [141]:
##TODO loop through the doc's entities, and print the label and text for each entity found. 
for e in doc.ents:
    print(e.label_)
    print(e.text)

PERSON
Sandeep Mahadik
PERSON
Warehouse Manager
ORG
Barrick Trade
ORG
Maharashtra - Email
ORG
Logistics
ORG
SCM
ORG
Barrick Trade
GPE
AO
DATE
June 2012
PERSON
Supervises
ORG
Shipping and Receiving
NORP
&
ORG
Provides
DATE
daily
CARDINAL
30
ORG
Insurance Company/Suppliers
ORG
FIFO
PERSON
¬ Coordinates
PERSON
¬ Checks SKU
ORG
QIP
PERSON
¬ Supervises
PERSON
¬ Works
PERSON
Supervises
PERSON
KPI
PERSON
¬ Schedules
ORG
QIP
ORG
Safety
ORG
National Warehouse
PERSON
Manages
ORG
Sales & Operations
PERSON
Pvt Ltd
GPE
Mumbai
ORG
Maharashtra -


DATE
August 2010
DATE
May 2012
ORG
India Pvt Ltd.
GPE
India
PRODUCT
V-Xpress
PRODUCT
V-Logis
DATE
year
DATE
year
PERSON
Logis
CARDINAL
15
ORG
Chemical/FMCG/
Telecom
PRODUCT
60000
ORG
Executive - Sales & Operations
PERSON
Job Profile-


ORG
Sales & Operational
ORG
Bills
CARDINAL
➢
ORG
Implementation &
ORG
SOP
PERSON
➢
NORP
Deccan
CARDINAL
360
PERSON
Pvt Ltd -  
GPE
Mumbai
ORG
Maharashtra
DATE
January 2010 to August 2010
ORG
Business Development
ORG
Payment
G
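A long flat listing like the one above is easier to judge once it is tallied per label. The following is a minimal sketch that uses a handful of (label, text) pairs copied from the output instead of a live spacy doc; the name `sample_ents` is an assumption made for illustration.

```python
from collections import Counter

# A few (label, text) pairs copied from the output above. With a real spacy
# doc the same tally would be Counter(e.label_ for e in doc.ents).
sample_ents = [
    ("PERSON", "Sandeep Mahadik"),
    ("PERSON", "Warehouse Manager"),
    ("ORG", "Barrick Trade"),
    ("ORG", "Maharashtra - Email"),
    ("GPE", "AO"),
    ("DATE", "June 2012"),
    ("PERSON", "Supervises"),
]

# Count how often each label occurs and print the most frequent first.
label_counts = Counter(label for label, _ in sample_ents)
for label, count in label_counts.most_common():
    print("{:10} {}".format(label, count))
```

Even in this small sample, several PERSON and ORG hits ("Warehouse Manager", "Supervises", "Maharashtra - Email") are clearly not people or organizations, which is the motivation for training a model on resume-specific labels.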

Now, as a comparison, we list the entities contained in the resume's original annotated training data (remember, the existing annotations were created by a human annotator, not predicted by a machine like the entities above).

In [142]:
##TODO print original entities for one resume
res = resumes[500]
restext = res[0] 
labeled_ents = res[1]["entities"]
#print(labeled_ents)
## TODO print out formatted list of entity labels and text
for ent in labeled_ents:
    print("{} {}".format(ent[2], restext[ent[0]:ent[1]]))

Skills System, Applications MS word, Power point, Excel, Ms-Office, Internet technology
Skills 
Excel (Less than 1 year), MS word (Less than 1 year), WORD (Less than 1 year)
Graduation Year May 2003
College Name Shivaji University
Degree B.Com
Graduation Year May 2006
Location Thane
College Name INSTITUTE OF INTERNATIONAL BUSINEES AND RESEARCH
Degree MBA in IB
Years of Experience June 2005 to July 2005
Companies worked at G.P.GOSWAMY (CHA)
Years of Experience August 2006 to February 2008
Companies worked at Transolution India Pvt.Ltd
Designation Executive - Sales & Operations
Designation Executive - Sales
Designation Management Trainee
Years of Experience March 2008 to December 2009
Companies worked at Credence Logistics Ltd
Designation Management Trainee
Years of Experience January 2010 to August 2010
Companies worked at Deccan 360 Pvt Ltd 
Designation Executive - Sales
Designation Executive - Sales & Operations
Designation Executive - Sales
Companies worked at V Logis India Pvt Ltd
Y
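The cell above relies on each annotation being a (start, end, label) triple of character offsets into the resume text. A minimal sketch of that layout on a made-up resume follows; the text, the `span` helper and `toy_resume` are hypothetical and only mirror the structure used in this notebook.

```python
# Hypothetical miniature resume in the same [text, {"entities": [...]}] layout.
text = "Jane Doe\nStore Manager - Acme Retail\nMBA, Example University"


def span(phrase, label):
    """Return a (start, end, label) triple for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


toy_resume = [
    text,
    {"entities": [
        span("Store Manager", "Designation"),
        span("MBA", "Degree"),
        span("Example University", "College Name"),
    ]},
]

# Slicing the text with the offsets recovers the annotated surface string.
recovered = [(label, toy_resume[0][start:end])
             for start, end, label in toy_resume[1]["entities"]]
for label, surface in recovered:
    print("{:13} {}".format(label, surface))
```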

##### Prepare Training Data for NER model training

In [143]:
##TODO Store the entity labels you want to train for as array in chosen_entity_labels
chosen_entity_labels = ["Designation", "College Name", "Degree"]
##["Skills","Designation", "Years of Experience"]
## this method gathers all resumes which contain every one of the chosen entities above.
def gather_candidates(dataset,entity_labels):
    candidates = list()
    for resume in dataset:
        ## collect the labels annotated in this resume (an empty set if it has no annotations)
        res_ent_labels = {ent[2] for ent in resume[1]["entities"]}
        if set(entity_labels).issubset(res_ent_labels):
            candidates.append(resume)
    return candidates
## TODO use the gather candidates methods and store result in training_data variable
training_data = gather_candidates(resumes, chosen_entity_labels)
print("Gathered {} training examples".format(len(training_data)))

Gathered 450 training examples
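To make the selection rule concrete, here is a small sketch of the same subset test on a toy dataset. The helper `has_all_labels` and the three toy resumes are assumptions for illustration; the offsets are dummy values because only the labels matter here.

```python
def has_all_labels(resume, wanted_labels):
    """True if the resume has at least one annotation for every wanted label."""
    present = {ent[2] for ent in resume[1]["entities"]}
    return set(wanted_labels).issubset(present)


# Toy data: offsets are dummy values, only the labels are inspected.
toy_dataset = [
    ["resume A", {"entities": [(0, 3, "Degree"), (4, 8, "Designation"),
                               (9, 12, "College Name"), (13, 15, "Skills")]}],
    ["resume B", {"entities": [(0, 3, "Degree"), (4, 8, "Skills")]}],
    ["resume C", {"entities": []}],
]

wanted = ["Designation", "College Name", "Degree"]
kept = [resume[0] for resume in toy_dataset if has_all_labels(resume, wanted)]
print(kept)
```

Only "resume A" survives: extra labels such as "Skills" do not disqualify a resume, but a single missing chosen label does.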


##### Remove other entity annotations from training data

In [144]:
## keep only the annotations whose label is in the given list of labels
def filter_ents(ents, labels):
    filtered = [ent for ent in ents if ent[2] in labels]
    return filtered

## TODO use method above to remove all but relevant (chosen) entity annotations and store in X variable 
X = [[dat[0], dict(entities=filter_ents(dat[1]['entities'], chosen_entity_labels))] for dat in training_data]
print(X)






##### Remove resumes that cause errors in spacy

In [145]:
from spacy_train_resume_ner import train_spacy_ner

def remove_bad_data(training_data):
    model, baddocs = train_spacy_ner(training_data, debug=True, n_iter=1)
    ## training data is list of lists with each list containing a text and annotations
    ## baddocs is a set of strings/resume texts.
    ## TODO complete implementation to filter bad docs and store filter result (good docs) in filtered variable
    filtered = [data for data in training_data if data[0] not in baddocs]
    print("Unfiltered training data size: ",len(training_data))
    print("Filtered training data size: ", len(filtered))
    print("Bad data size: ", len(baddocs))
    return filtered

## call remove method. It may take a few minutes for the method to complete.
## you will know it is complete when the print output from the method above appears.
X = remove_bad_data(X)

Created blank 'en' model
Exception thrown when processing doc:
("Punit Raghav\nSales Manager - Mukund Overseas - Magnum\n\nThane, Maharashtra - Email me on Indeed: indeed.com/r/Punit-Raghav/f36e9e4d0857ac5b\n\nA competent professional with over 8 years of experience in:\n\n- Handling Dealer & Distributor - Business Development - Handling Projects\n- Architect & Interior Designer - Handling Carpenters & Contractors\n\nCore Functional Skills:\n\n• Effectively meet deadlines, achieve targets and work under pressure.\n• Company success driven - passionate about company's product line.\n• Accounting-related computer literacy.\n• Supervising the performance of dealers / distributors with key emphasis on achieving revenue\ntargets.\n• Excellent communication skills, written and verbal.\n• Effective presentation of complex issues.\n• High level of negotiation skills.\n\nWilling to relocate to: Maharashtra - South india\n\nWORK EXPERIENCE\n\nSales Manager\n\nMukund Overseas - Magnum -  Mumbai, 

("Sheldon Creado\nSr. Manager - Regional Sales\n\nMumbai, Maharashtra - Email me on Indeed: indeed.com/r/Sheldon-Creado/\nb73c053d2691e84a\n\n• Result-oriented professional with experience of 15 years in Sales Planning/ Execution, Process\nImprovement and Business Development.\n• Excellent track record in performing challenging strategic & leadership roles, building\nstrategic service plans and CSAT.\n• Demonstrated effectiveness in high-profile executive roles driving large scale gains in business\nvolumes through on-ground business strategies and consistent acquisition, deepening &\nretention of customer base.\n\nWilling to relocate to: Mumbai, Maharashtra - Pune, Maharashtra - Bangalore, Karnataka\n\nWORK EXPERIENCE\n\nSr. Manager - Regional Sales\n\nTata Teleservices Ltd\n\nJob Profile:\n• Managed and developed an assigned portfolio of accounts, increasing product penetration and\nrevenue market share.\n• Successfully achieved set Business Acquisition, Revenue Maximization & Retent


Exception thrown when processing doc:
("Alok Gond\nArea Business Manager - Zuventus Healthcare Pvt Ltd\n\nMumbai, Maharashtra - Email me on Indeed: indeed.com/r/Alok-Gond/6e691dc668a54602\n\nTo work in an organization where I am able to contribute to the organization's growth and\nprofitability with my skill and in turn get an opportunity to gain exposure and expertise that would\nhelp me build a strong and successful career.\n\nWilling to relocate to: Mumbai, Maharashtra\n\nWORK EXPERIENCE\n\nArea Business Manager\n\nZuventus Healthcare Pvt Ltd -\n\nJuly 2012 to Present\n\nworking in super speciality division of gastro and respiratory and handling team of 5 medical\nrepresentative.\n\nZuventus Healthcare Pvt Ltd -\n\nNovember 2008 to Present\n\n2008.\n\nDescription: Working in speciality Division of Gastro & Respiratory\n\npharmaceuticals sales\n\nNicholas Piramal Pvt. Ltd -\n\nAugust 2007 to November 2008\n\nDescription: Worked in speciality Division of Cardio & Diabetic.\n\nBlack&W
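The notebook does not say why these documents raise exceptions. One frequent cause with character-offset annotations is overlapping spans, which spacy rejects when they are set as entities. The sketch below checks for that case only, under that assumption; `find_overlaps` is a hypothetical helper and is not part of `spacy_train_resume_ner`.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that share at least one character."""
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    overlaps = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            if right[0] >= left[1]:
                break  # sorted by start, so later spans cannot overlap `left`
            overlaps.append((left, right))
    return overlaps


# Hypothetical annotations: the first two spans cover some of the same characters.
clean = [(0, 5, "Degree"), (10, 20, "Designation")]
clashing = [(0, 13, "Designation"), (6, 13, "Designation"), (20, 25, "Degree")]

print(find_overlaps(clean))
print(find_overlaps(clashing))
```

A check like this could be run before training to see whether the documents reported as bad share this property, instead of relying only on a one-iteration training run to surface them.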

##### Train/Test Split

In [146]:
##TODO complete the implementation  of the train test split function below
def train_test_split(X,train_percent):
    train_size = int(len(X) * (train_percent/100))
    print(train_size)
    train = X[:train_size]
    test = X[train_size:]
    print(len(train))
    print(len(test))
    print(len(X))
    return train,test
## TODO choose train size percent and call train test split, storing results in "train" and "test" variables.
train,test = train_test_split(X, 75)
## TODO use python assert to assert that the size of train and test sets add up to the size of all the data 
assert len(train) + len(test) == len(X), "Training Data + Test Data != X" 

333
333
111
444
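The split above takes the first 75% of X in its existing order. If the resumes happen to be ordered in some systematic way, the test set may not be representative, so a common alternative is to shuffle before cutting. A minimal sketch, assuming a fixed seed is acceptable; `shuffled_split` and the seed value are illustrative choices, not part of the original notebook.

```python
import random


def shuffled_split(data, train_percent, seed=42):
    """Shuffle a copy of data with a fixed seed, then cut it at train_percent."""
    shuffled = list(data)  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    train_size = int(len(shuffled) * train_percent / 100)
    return shuffled[:train_size], shuffled[train_size:]


# Stand-in for the 444 filtered resumes.
toy = list(range(444))
toy_train, toy_test = shuffled_split(toy, 75)
print(len(toy_train), len(toy_test))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.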


##### Train a spacy ner model with our training data 

In [147]:
## run this code to train a ner model using spacy
custom_nlp,_= train_spacy_ner(train,n_iter=20)

Created blank 'en' model
Losses {'ner': 23542.02154053688}
Losses {'ner': 17931.999111959332}
Losses {'ner': 23942.591503306583}
Losses {'ner': 14268.701830841916}
Losses {'ner': 23195.958208328346}
Losses {'ner': 16700.22196586446}
Losses {'ner': 11305.673109984426}
Losses {'ner': 9022.959211523072}
Losses {'ner': 43386.60410847247}
Losses {'ner': 6632.438529685813}
Losses {'ner': 6340.317062260785}
Losses {'ner': 6214.734122767545}
Losses {'ner': 10199.885969128702}
Losses {'ner': 4848.190741455095}
Losses {'ner': 4171.1793907746205}
Losses {'ner': 3768.314296404628}
Losses {'ner': 4045.6210577974684}
Losses {'ner': 3781.0187590244113}
Losses {'ner': 3459.036374136653}
Losses {'ner': 4338.420357389126}


##### Inspect NER predictions on one sample resume

In [148]:
## TODO fetch one resume out of our test dataset and store to the "resume" variable
resume = test[73]
## TODO create a spacy doc out of the resume using our trained model and save to the "doc" variable 
doc = custom_nlp(resume[0])

Now we will output the predicted entities and the existing annotated entities in that doc.

In [149]:
## TODO output label and text of predicted entities (in "ents" variable of the spacy doc created above)
print("PREDICTED:")
for ent in doc.ents:
    print("{:20} {}".format(ent.label_, ent))
print()
## TODO output labeled entities (in "entities" dictionary of resume)
print("LABELED:")
for ent in resume[1]["entities"]:
    print("{} {}".format(ent[2], resume[0][ent[0]:ent[1]]))

PREDICTED:
Designation          Store Manager
Designation          Store Manager
Designation          Store Manager
Designation          Store Manager
Designation          Senior Sales Officer
Designation          Sales Associate
Degree               MBA in Marketing Management
College Name         Intellectual Institute of Management and Technology

LABELED:
College Name Intellectual Institute of Management and Technology
Degree MBA in Marketing Management
Designation Channel Partner

Designation Sales Associate
Designation Senior Sales Officer
Designation Store Manager
Designation Store Manager
Designation Store Manager
Designation Store Manager


#### Evaluation Metrics for NER

In [150]:
from spacy.gold import biluo_tags_from_offsets
import pandas as pd
from IPython.display import display, HTML

## returns a pandas dataframe with tokens, prediction, and true (Gold Standard) annotations of tokens
def make_bilou_df(nlp,resume):
    """
    param nlp - a trained spacy model
    param resume - a resume from our train or test set
    """
    doc = nlp(resume[0])
    bilou_ents_predicted = biluo_tags_from_offsets(doc, [(ent.start_char,ent.end_char,ent.label_)for ent in doc.ents])
    bilou_ents_true = biluo_tags_from_offsets(doc, [(ent[0], ent[1], ent[2]) for ent in resume[1]["entities"]])
    doc_tokens = [tok.text for tok in doc]
    bilou_df = pd.DataFrame()
    bilou_df["Tokens"] =doc_tokens
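    ## strip whitespace from tokens; regex=True is required for the pattern in newer pandas versions
    bilou_df["Tokens"] = bilou_df["Tokens"].str.replace("\\s+","", regex=True)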
    bilou_df["Predicted"] = bilou_ents_predicted
    bilou_df["True"] = bilou_ents_true
    return bilou_df

## TODO call method above with a resume from test set and store result in bilou_df variable.
bilou_df = make_bilou_df(custom_nlp,test[10])
display(bilou_df)  


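| | Tokens | Predicted | True |
|---|---|---|---|
| 0 | Ananya | O | O |
| 1 | Chavan | O | O |
| 2 |  | O | O |
| 3 | lecturer | O | U-Designation |
| 4 | - | O | O |
| 5 | oracle | O | O |
| 6 | tutorials | O | O |
| 7 |  | O | O |
| 8 | Mumbai | O | O |
| 9 | , | O | O |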
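The tags in the table follow the BILOU scheme: B (begin), I (inside) and L (last) mark multi-token entities, U marks a single-token entity, and O marks tokens outside any entity. The sketch below shows the idea on whitespace-separated tokens. It is a simplified stand-in, not spacy's `biluo_tags_from_offsets`; the function `bilou_tags` and the example sentence are assumptions.

```python
import re


def bilou_tags(text, entities):
    """Tag whitespace-delimited tokens with BILOU labels from (start, end, label) spans.

    Simplification: tokens are runs of non-space characters, and an entity is
    tagged only when its span starts and ends exactly on token boundaries.
    """
    tokens = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        inside = [i for i, (t_start, t_end, _) in enumerate(tokens)
                  if t_start >= start and t_end <= end]
        if not inside or tokens[inside[0]][0] != start or tokens[inside[-1]][1] != end:
            continue  # misaligned span; spacy's helper marks such tokens with "-"
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label
        else:
            tags[inside[0]] = "B-" + label
            for i in inside[1:-1]:
                tags[i] = "I-" + label
            tags[inside[-1]] = "L-" + label
    return [(tok, tag) for (_, _, tok), tag in zip(tokens, tags)]


# Hypothetical example sentence with two annotated spans.
sentence = "Jane Doe lecturer at Example College of Engineering"
college = "Example College of Engineering"
annotations = [
    (sentence.index("lecturer"), sentence.index("lecturer") + len("lecturer"), "Designation"),
    (sentence.index(college), sentence.index(college) + len(college), "College Name"),
]
tagged = bilou_tags(sentence, annotations)
for token, tag in tagged:
    print("{:12} {}".format(token, tag))
```

Comparing the "Predicted" and "True" columns token by token, as the following cells do, is therefore a comparison of two such tag sequences over the same tokens.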


In [151]:
## TODO bilou_df is a pandas dataframe. Use pandas dataframe api to get a subset where predicted and true are the same. 
same_df = bilou_df[bilou_df["Predicted"]==bilou_df["True"]]
## accuracy is the length of this subset divided by the length of bilou_df
accuracy = float(same_df.shape[0])/bilou_df.shape[0]
print("Accuracy on one resume: ",accuracy)


Accuracy on one resume:  0.9798136645962733
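A token-level accuracy this high should be read with care, because most tokens in a resume carry the tag "O". The sketch below uses made-up tag counts to show how a model that never predicts any entity would score; the numbers are hypothetical and are not taken from the dataset.

```python
# Hypothetical gold tags for a 100-token resume: 96 "O" tokens and 4 entity tokens.
gold = ["O"] * 96 + ["B-Degree", "L-Degree", "U-Designation", "U-Designation"]

# A degenerate "model" that never predicts an entity.
predicted = ["O"] * len(gold)

# Same accuracy definition as above: matching tags divided by all tokens.
matches = sum(1 for g, p in zip(gold, predicted) if g == p)
baseline_accuracy = matches / len(gold)
print("All-O baseline accuracy:", baseline_accuracy)
```

This is why the per-label precision, recall and F1 computed later in the notebook are more informative than accuracy for this task.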


In [152]:
## TODO find all rows in bilou_df where "Predicted" not equal to "True" column. 
diff_df = bilou_df[bilou_df["Predicted"]!=bilou_df["True"]]
display(diff_df)

| | Tokens | Predicted | True |
|---|---|---|---|
| 3 | lecturer | O | U-Designation |
| 49 | lecturer | O | U-Designation |
| 96 | lecturer | O | U-Designation |
| 160 | lecturer | O | U-Designation |
| 174 | lecturer | O | U-Designation |
| 196 |  | O | - |
| 197 | B.Sc | B-Degree | - |
| 198 | . | I-Degree | - |
| 199 | in | I-Degree | - |
| 200 | Com | I-Degree | - |


In [153]:
import numpy as np
doc_accuracy = []
for tres in test:
    tres_df = make_bilou_df(custom_nlp, tres)
    same_df = tres_df[tres_df["Predicted"]==tres_df["True"]]
    accuracy = float(same_df.shape[0])/tres_df.shape[0]
    doc_accuracy.append(accuracy)

## TODO calculate mean/average of doc_accuracy (Tip: use numpy!)
total_acc = np.mean(doc_accuracy)
print("Accuracy: ",total_acc)

    

Accuracy:  0.965730796629255


In [154]:
## TODO cycle through chosen_entity_labels and calculate metrics for each entity using test data
data = []
print(len(test))
for label in chosen_entity_labels:
    ## variables to store results for all resumes for one entity type
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    for tres in test:
        ## use make_bilou_df on each resume in our test set, and calculate for each entity true and false positives,
        ## and false negatives. 
        tres_df = make_bilou_df(custom_nlp, tres)
        ## calculate true positives, false positives and false negatives for each resume
        tp = tres_df[(tres_df["Predicted"]==tres_df["True"]) & (tres_df["Predicted"].str.contains(label))]
        fp = tres_df[(tres_df["Predicted"]!=tres_df["True"]) & (tres_df["Predicted"].str.contains(label))]
        fn = tres_df[(tres_df["Predicted"]!=tres_df["True"]) & (tres_df["True"].str.contains(label))]
        ## aggregate result for each resume to totals
        true_positives += tp.shape[0]
        false_positives += fp.shape[0]
        false_negatives += fn.shape[0]
    
    print("For label '{}' tp: {} fp: {} fn: {}".format(label,true_positives,false_positives,false_negatives))
    
    ## TODO Use the formulas you learned to calculate metrics and print them out        
    precision = 0.0 if true_positives == 0 else float(true_positives)/(true_positives+false_positives)
    recall = 0.0 if true_positives == 0 else float(true_positives)/(true_positives+false_negatives)
    f1 = 0.0 if precision+recall == 0 else 2*((precision*recall)/(precision+recall))
    #print("Precision: ",precision)
    #print("Recall: ",recall)
    #print("F1: ",f1)
    row = [precision,recall,f1]
    data.append(row)

## make pandas dataframe with metrics data. Use the chosen entity labels as an index, and the metric names as columns. 
metric_df = pd.DataFrame(data, index=chosen_entity_labels, columns=["Precision", "Recall", "f1"])
display(metric_df)

111
For label 'Designation' tp: 723 fp: 249 fn: 342
For label 'College Name' tp: 350 fp: 481 fn: 78
For label 'Degree' tp: 371 fp: 270 fn: 47


| | Precision | Recall | f1 |
|---|---|---|---|
| Designation | 0.743827 | 0.678873 | 0.709867 |
| College Name | 0.421179 | 0.817757 | 0.555997 |
| Degree | 0.578783 | 0.88756 | 0.700661 |
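The table values can be reproduced directly from the printed counts: precision is tp / (tp + fp), recall is tp / (tp + fn), and F1 is the harmonic mean of the two. A minimal sketch follows; the helper name `prf` is an assumption, and the counts are copied from the output above.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Token-level (tp, fp, fn) counts copied from the output above.
counts = {
    "Designation": (723, 249, 342),
    "College Name": (350, 481, 78),
    "Degree": (371, 270, 47),
}

scores = {label: prf(*c) for label, c in counts.items()}
for label, (p, r, f1) in scores.items():
    print("{:13} P={:.6f} R={:.6f} F1={:.6f}".format(label, p, r, f1))
```

For "College Name" the false positives outnumber the true positives (481 against 350), which explains why its precision is the lowest of the three labels even though its recall is high.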


In [155]:
## TODO compute average metrics and print them out. Use pandas dataframe "mean" method to do this
print("Average precision:", metric_df["Precision"].mean())
print("Average recall:", metric_df["Recall"].mean())
print("Average f1:", metric_df["f1"].mean())

Average precision: 0.5812632046218694
Average recall: 0.7947300191316181
Average f1: 0.6555084253183594
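The averages above are macro averages: each label counts equally, regardless of how many tokens it covers. An alternative is the micro average, which pools the counts over all labels before computing the metrics. The sketch below computes it from the counts printed earlier; it is an addition for comparison, not something the original notebook calculates.

```python
# Token-level (tp, fp, fn) counts per label, copied from the earlier output.
counts = {
    "Designation": (723, 249, 342),
    "College Name": (350, 481, 78),
    "Degree": (371, 270, 47),
}

# Pool the counts across labels.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())

# Micro-averaged metrics are computed once on the pooled counts.
micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print("Micro precision:", micro_precision)
print("Micro recall:", micro_recall)
print("Micro f1:", micro_f1)
```

Because "Designation" contributes the most tokens, the micro average leans toward that label's behaviour, while the macro average treats all three labels alike.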


##### "Sentences"
Flair works with "Sentences", each of which is a list of tokens. If we simply write out our csv with one line for every token in our dataset, we will have one giant sentence with many thousands of words. This is not what we want.
We would like to partition our data into a list of "Sentences" that match our intuition for a sentence: a sequence of words that belong together and is not too long, usually delimited by some punctuation.
When we create our .csv strings/files, they need to represent a list of sentences, each sentence consisting of a list of tokens/tags (each token/tag pair being one line in our csv).
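One way to get such a partition is to write a blank line between sentences, which is the usual convention for column-formatted tagging files. The sketch below is an illustration under one assumption: that the empty tokens left behind where the resume had line breaks are good sentence boundaries. The function name `to_column_format` is hypothetical, and the tokens and tags are the first ten rows of the dataframe shown earlier for the test resume.

```python
def to_column_format(tokens, tags):
    """Render token/tag pairs as column text with a blank line between sentences.

    Assumption: an empty token marks a line break in the resume and is used
    as the sentence boundary.
    """
    lines = []
    for token, tag in zip(tokens, tags):
        if token == "":
            if lines and lines[-1] != "":
                lines.append("")  # blank line closes the current sentence
        else:
            lines.append("{} {}".format(token, tag))
    while lines and lines[-1] == "":
        lines.pop()  # no trailing empty sentence
    return "\n".join(lines) + "\n"


# First ten tokens and gold tags of the test resume tabulated earlier.
tokens = ["Ananya", "Chavan", "", "lecturer", "-", "oracle", "tutorials", "", "Mumbai", ","]
tags = ["O", "O", "O", "U-Designation", "O", "O", "O", "O", "O", "O"]

column_text = to_column_format(tokens, tags)
print(column_text)
```

Labels that contain a space, such as "College Name", would need to be rewritten (for example with an underscore) before being written to a space-separated file, otherwise the tag would be split across two columns.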

In [156]:
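## NOTE: "nlp" is the pretrained English model loaded at the top of the notebook, so the
## "Predicted" column (stored below as "ner_spacy") holds its generic labels; pass custom_nlp
## instead to store the predictions of the model trained above.
training_data_as_bilou = [make_bilou_df(nlp,res) for res in train]
test_data_as_bilou = [make_bilou_df(nlp,res) for res in test]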
training_df = pd.DataFrame(columns = ["text","ner","ner_spacy","doc"])
test_df = pd.DataFrame(columns = ["text","ner","ner_spacy","doc"])
for idx,df in enumerate(training_data_as_bilou):
    df2 = pd.DataFrame()
    df['Tokens'].replace('', np.nan, inplace=True)
    df.dropna(subset=['Tokens'], inplace=True)
    df = df.replace('\n','', regex=True)
    df2["text"] = df["Tokens"]
    df2["ner"] = df["True"]
    df2["ner_spacy"]=df["Predicted"]
    df2["doc"]=idx
    training_df = training_df.append(df2, sort=True)
for idx,df in enumerate(test_data_as_bilou):
    print(df["Tokens"])
    df2 = pd.DataFrame()
    df['Tokens'].replace('', np.nan, inplace=True)
    df.dropna(subset=['Tokens'], inplace=True)
    df = df.replace('\n','', regex=True)
    df2["text"] = df["Tokens"]
    df2["ner"] = df["True"]
    df2["ner_spacy"]=df["Predicted"]
    df2["doc"]=idx
    test_df = test_df.append(df2, sort=True)

with open("data/flair/train_res_bilou.txt",'w+',encoding="utf-8", newline='') as f:
    training_df.to_csv(f,sep=" ",encoding="utf-8",index=False)
with open("data/flair/test_res_bilou.txt",'w+',encoding="utf-8", newline='') as f:
    test_df.to_csv(f,sep=" ",encoding="utf-8",index=False)
    
    
#with open("data/flair/test_res_bilou.txt", 'w', newline='', encoding="utf-8") as outfile:
#    writer = csv.writer(outfile)

0                            Harini
1                       Komaravelli
2                                  
3                              Test
4                           Analyst
5                                at
6                            Oracle
7                                 ,
8                         Hyderabad
9                                  
10                        Hyderabad
11                                ,
12                        Telangana
13                                -
14                            Email
15                               me
16                               on
17                           Indeed
18                                :
19                       indeed.com
20                                /
21                                r
22                                /
23                          Harini-
24                                 
25     Komaravelli/2659eee82e435d1b
26                                 
27                          

Name: Tokens, Length: 1051, dtype: object
0                        Jitendra
1                               .
2                       Makhijani
3                                
4                       Solutions
5                         Manager
6                               -
7                        Sterling
8                     Information
9                       Resources
10                          India
11                            Pvt
12                            Ltd
13                               
14                          Thane
15                              ,
16                    Maharashtra
17                              -
18                          Email
19                             me
20                             on
21                         Indeed
22                              :
23                     indeed.com
24                              /
25                              r
26                              /
27                      Jitendra-
28    

Name: Tokens, Length: 1056, dtype: object
0                                              Bangalore
1                                             Tavarekere
2                                                       
3                                              Volunteer
4                                             Contestant
5                                                      ,
6                                                 Yappon
7                                                       
8                                              Bengaluru
9                                                      ,
10                                             Karnataka
11                                                     -
12                                                 Email
13                                                    me
14                                                    on
15                                                Indeed
16                                            

Name: Tokens, Length: 80, dtype: object
0                       Ananya
1                       Chavan
2                             
3                     lecturer
4                            -
5                       oracle
6                    tutorials
7                             
8                       Mumbai
9                            ,
10                 Maharashtra
11                           -
12                       Email
13                          me
14                          on
15                      Indeed
16                           :
17                  indeed.com
18                           /
19                           r
20                           /
21                     Ananya-
22                            
23     Chavan/738779ab71971a96
24                            
25                     Seeking
26                           a
27                 responsible
28                         job
29                        with
                ...           

Name: Tokens, Length: 1582, dtype: object
0                                            Ravi
1                                         Shankar
2                                                
3                                         Working
4                                              as
5                                      Escalation
6                                        Engineer
7                                            with
8                                       Microsoft
9                                               .
10                                               
11                                           Pune
12                                              ,
13                                    Maharashtra
14                                              -
15                                          Email
16                                             me
17                                             on
18                                         Indeed
19      

Name: Tokens, Length: 357, dtype: object
0                                       Nitin
1                                          Tr
2                                            
3                                  PeopleSoft
4                                  Consultant
5                                            
6                                   Bangalore
7                                       Urban
8                                           ,
9                                   Karnataka
10                                          -
11                                      Email
12                                         me
13                                         on
14                                     Indeed
15                                          :
16     indeed.com/r/Nitin-Tr/e7e3a2f5b4c1e24e
17                                           
18                                         An
19                                          e
20                                     

Name: Tokens, Length: 303, dtype: object
0                                         Lokmanya
1                                             Pada
2                                                 
3                                            Thane
4                                                ,
5                                      Maharashtra
6                                                -
7                                            Email
8                                               me
9                                               on
10                                          Indeed
11                                               :
12     indeed.com/r/Lokmanya-Pada/1d6100af0815e98a
13                                                
14                                            WORK
15                                      EXPERIENCE
16                                                
17                                         Company
18                                       

Name: Tokens, Length: 283, dtype: object
0                                             Binoy
1                                           Choubey
2                                                  
3                                            Senior
4                                           Manager
5                                              with
6                                                10
7                                             years
8                                                of
9                                        experience
10                                               in
11                                             Real
12                                           Estate
13                                                ,
14                                                 
15                               Telecommunications
16                                              and
17                                               IT
18                     

Name: Tokens, Length: 234, dtype: object
0                                                 Sanjeev
1                                                   Shahi
2                                                        
3                                                      13
4                                                   years
5                                                      of
6                                                   Sales
7                                              experience
8                                                      in
9                                                 Banking
10                                                 sector
11                                                      –
12                                               thorough
13                                              knowledge
14                                                       
15                                                     of
16                             

Name: Tokens, Length: 516, dtype: object
0                                         Yash
1                                         Raja
2                                             
3                                       Senior
4                                      Manager
5                                            -
6                                     Foodlink
7                                        India
8                                      Private
9                                      Limited
10                                            
11                                      Mumbai
12                                           ,
13                                 Maharashtra
14                                           -
15                                       Email
16                                          me
17                                          on
18                                      Indeed
19                                           :
20     indeed.com/r

Name: Tokens, Length: 968, dtype: object
0                                         Tahir
1                                          Pasa
2                                              
3                                        Mumbai
4                                             ,
5                                   Maharashtra
6                                             -
7                                         Email
8                                            me
9                                            on
10                                       Indeed
11                                            :
12     indeed.com/r/Tahir-Pasa/73aba05cc1730e98
13                                             
14                                        Given
15                                           an
16                                  opportunity
17                                            ,
18                                         will
19                                      perform

Name: Tokens, Length: 161, dtype: object
0                                           Naveed
1                                            Chaus
2                                                 
3                                              MBA
4                                               in
5                                           retail
6                                              and
7                                        marketing
8                                       management
9                                             with
10                                               8
11                                           years
12                                              of
13                                    professional
14                                                
15                                      experience
16                                                
17                                           Thane
18                                       

Name: Tokens, Length: 2934, dtype: object
0                           Pradyuman
1                              Nayyar
2                                    
3                                Area
4                               Sales
5                             Manager
6                                   -
7                                 SBI
8                                Card
9                                    
10                             Mumbai
11                                  ,
12                        Maharashtra
13                                  -
14                              Email
15                                 me
16                                 on
17                             Indeed
18                                  :
19     indeed.com/r/Pradyuman-Nayyar/
20                                   
21                   a2995521b867c98f
22                                   
23                                  •
24                                  A
25      

Name: Tokens, Length: 894, dtype: object
0                                                 Puneet
1                                                  Singh
2                                                       
3                                              Associate
4                                               Software
5                                               Engineer
6                                                       
7                                              Bengaluru
8                                                      ,
9                                              Karnataka
10                                                     -
11                                                 Email
12                                                    me
13                                                    on
14                                                Indeed
15                                                     :
16            indeed.com/r/Puneet-Singh/cb1ede9

Name: Tokens, Length: 972, dtype: object
0                                       Sridevi
1                                             H
2                                              
3                                     Bangalore
4                                             ,
5                                     Karnataka
6                                             -
7                                         Email
8                                            me
9                                            on
10                                       Indeed
11                                            :
12      indeed.com/r/Sridevi-H/63703b24aaaa54e4
13                                             
14                                           To
15                                      further
16                                           my
17                                       career
18                                         with
19                                            a

Name: Tokens, Length: 457, dtype: object
0                                      Ramya
1                                          .
2                                          P
3                                           
4                                  Hyderabad
5                                          ,
6                                  Telangana
7                                          -
8                                      Email
9                                         me
10                                        on
11                                    Indeed
12                                         :
13     indeed.com/r/Ramya-P/00f125c7b9b95a35
14                                          
15                                         (
16                                         2
17                                      year
18                                Experience
19                                         )
20                                          
21            

Name: Tokens, Length: 1263, dtype: object
0                                              Vipin
1                                          Jakhaliya
2                                                   
3                                               Full
4                                               time
5                                              PGDBA
6                                               with
7                                                 12
8                                                  +
9                                              years
10                                                of
11                                        experience
12                                                in
13                                             media
14                                                ad
15                                                 .
16                                             sales
17                                                  
18  

Name: Tokens, Length: 856, dtype: object
0                                          Prashant
1                                             Pawar
2                                                  
3                                           General
4                                           Manager
5                                         Marketing
6                                                 -
7                                            VINATI
8                                          ORGANICS
9                                           LIMITED
10                                                 
11                                           Mumbai
12                                                ,
13                                      Maharashtra
14                                                -
15                                            Email
16                                               me
17                                               on
18                     

Name: Tokens, Length: 152, dtype: object
0                     Bhupesh
1                       Singh
2                            
3                     Manager
4                           -
5                       Sales
6                           -
7                        Dion
8                      Global
9                   Solutions
10                        Ltd
11                           
12                       Navi
13                     Mumbai
14                          ,
15                Maharashtra
16                          -
17                      Email
18                         me
19                         on
20                     Indeed
21                          :
22                 indeed.com
23                          /
24                          r
25                          /
26                   Bhupesh-
27                           
28     Singh/89985037448d838f
29                           
                ...          
592               Electronics

Name: Tokens, Length: 480, dtype: object
0                                               Shivasai
1                                                 Mantri
2                                                       
3                                              Microsoft
4                                               dynamics
5                                                     AX
6                                              Technical
7                                             consultant
8                                                       
9                                              Hyderabad
10                                                     ,
11                                             Telangana
12                                                     -
13                                                 Email
14                                                    me
15                                                    on
16                                             

Name: Tokens, Length: 913, dtype: object
0                                               Krishna
1                                                Prasad
2                                                      
3                                                 Patna
4                                                  City
5                                                     ,
6                                                 Bihar
7                                                     -
8                                                 Email
9                                                    me
10                                                   on
11                                               Indeed
12                                                    :
13         indeed.com/r/Krishna-Prasad/56249a1d0efd3fca
14                                                     
15                                                 WORK
16                                           EXPERIENCE
17     

Name: Tokens, Length: 592, dtype: object
0                                                 Gunjan
1                                                 Nayyar
2                                                       
3                                             Hoshiarpur
4                                                      ,
5                                                 Punjab
6                                                      -
7                                                  Email
8                                                     me
9                                                     on
10                                                Indeed
11                                                     :
12           indeed.com/r/Gunjan-Nayyar/a5819ca6733a0f41
13                                                      
14                                                    To
15                                                  keep
16                                             

Name: Tokens, Length: 497, dtype: object
0                                          Mohini
1                                           Gupta
2                                                
3                                          Server
4                                         Support
5                                        Engineer
6                                                
7                                         Gurgaon
8                                               ,
9                                         Haryana
10                                              -
11                                          Email
12                                             me
13                                             on
14                                         Indeed
15                                              :
16     indeed.com/r/Mohini-Gupta/08b5b8e1acd8cf07
17                                               
18                                        Willing
19       

Name: Tokens, Length: 1036, dtype: object
0                                           Sweety
1                                           Kakkar
2                                                 
3                                           Mumbai
4                                                ,
5                                      Maharashtra
6                                                -
7                                            Email
8                                               me
9                                               on
10                                          Indeed
11                                               :
12     indeed.com/r/Sweety-Kakkar/2459d47174eaa56e
13                                                
14                                              To
15                                          pursue
16                                               a
17                                     challenging
18                                      

Name: Tokens, dtype: object
0                                           Sai
1                                         Patha
2                                              
3                                          Mule
4                                           ESB
5                                   Integration
6                                     Developer
7                                             -
8                                         Cisco
9                                       Systems
10                                             
11                                    Hyderabad
12                                            ,
13                                    Telangana
14                                            -
15                                        Email
16                                           me
17                                           on
18                                       Indeed
19                                            :
20      inde

Name: Tokens, Length: 528, dtype: object
0                        Sougata
1                        Goswami
2                               
3                         Senior
4                         Retail
5                              &
6                      Corporate
7                         Banker
8                               
9                         Mumbai
10                             ,
11                   Maharashtra
12                             -
13                         Email
14                            me
15                            on
16                        Indeed
17                             :
18                    indeed.com
19                             /
20                             r
21                             /
22                      Sougata-
23                              
24      Goswami/90354273928f45f1
25                              
26                      Defining
27                              
28                             *
29

Name: Tokens, Length: 454, dtype: object
0                                        Kanhai
1                                           Jee
2                                              
3                                       Manager
4                                             -
5                                      Vodafone
6                                              
7                                        Mumbai
8                                             ,
9                                   Maharashtra
10                                            -
11                                        Email
12                                           me
13                                           on
14                                       Indeed
15                                            :
16     indeed.com/r/Kanhai-Jee/5e33958b1b36b5c8
17                                             
18                                      Looking
19                                          for

Name: Tokens, Length: 816, dtype: object
0                             Ijas
1                       Nizamuddin
2                                 
3                        Associate
4                       Consultant
5                                -
6                            State
7                           Street
8                                 
9                       Irinchayam
10                             B.O
11                               ,
12                          Kerala
13                               -
14                           Email
15                              me
16                              on
17                          Indeed
18                               :
19                      indeed.com
20                               /
21                               r
22                               /
23                           Ijas-
24                                
25     Nizamuddin/6748d77f76f94eed
26                                
27            

Name: Tokens, Length: 111, dtype: object
0                                             Asha
1                                         Subbaiah
2                                                 
3                                                (
4                                        Microsoft
5                                          Partner
6                                        Readiness
7                                       Operations
8                                          Project
9                                          Manager
10                                               (
11                                            APAC
12                                               )
13                                               -
14                                                
15                                       Microsoft
16                                             GPS
17                                                
18                                       

Name: Tokens, Length: 99, dtype: object
0                             Prashant
1                             Pattekar
2                                     
3                                  Key
4                              Account
5                              Manager
6                                     
7                             Dombivli
8                                    ,
9                          Maharashtra
10                                   -
11                               Email
12                                  me
13                                  on
14                              Indeed
15                                   :
16     indeed.com/r/Prashant-Pattekar/
17                                    
18                    ad5404ce0d76f3be
19                                    
20                           Excellent
21                       Organizations
22                                   ,
23                      Communications
24                      

Name: Tokens, Length: 841, dtype: object
0                                          Ravindra
1                                             Verma
2                                                  
3                                          REGIONAL
4                                             SALES
5                                           MANAGER
6                                                 -
7                                           JUPITER
8                                      ILLUMINATION
9                                               PVT
10                                              LTD
11                                                 
12                                           Mumbai
13                                                ,
14                                      Maharashtra
15                                                -
16                                            Email
17                                               me
18                     

Name: Tokens, Length: 108, dtype: object
0                                          aaryan
1                                           vatts
2                                                
3                                          Mumbai
4                                               ,
5                                     Maharashtra
6                                               -
7                                           Email
8                                              me
9                                              on
10                                         Indeed
11                                              :
12     indeed.com/r/aaryan-vatts/536d7f3aac570f70
13                                               
14                                             To
15                                        enhance
16                                             my
17                                      knowledge
18                                            and
19       

Name: Tokens, Length: 188, dtype: object
0                                             Vikram
1                                           Hirugade
2                                                   
3                                            Manager
4                                           Business
5                                        Development
6                                                  -
7                                              Essel
8                                            Propack
9                                                Ltd
10                                            Mumbai
11                                                  
12                                             Thane
13                                                 ,
14                                       Maharashtra
15                                                 -
16                                             Email
17                                                me
18   

Name: Tokens, Length: 3868, dtype: object
0                                                 Shivam
1                                                  Rathi
2                                                       
3                                              Microsoft
4                                             technology
5                                              Associate
6                                                      (
7                                                    MTA
8                                                      )
9                                                       
10                                         Muzaffarnagar
11                                                     ,
12                                                 Uttar
13                                               Pradesh
14                                                     -
15                                                 Email
16                                            

Name: Tokens, Length: 663, dtype: object
0                                              Jalil
1                                          Bhanwadia
2                                                   
3                                                 Sr
4                                                  .
5                                                 No
6                                                   
7                                             Mumbai
8                                                  ,
9                                        Maharashtra
10                                                 -
11                                             Email
12                                                me
13                                                on
14                                            Indeed
15                                                 :
16     indeed.com/r/Jalil-Bhanwadia/e0705a7988b735fd
17                                                  
18   

In [157]:
import re
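def find_nth(haystack, needle, n):
    # Return the index of the n-th occurrence of needle in haystack, or -1 if there are fewer than n
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

def get_token_of_training_sample(line):
    # Everything after the third space; with fewer than three spaces find_nth returns -1, so the whole line is returned
    token = line[find_nth(line, ' ', 3)+1:]
    return token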

def representsInt(s):
    # True only if s parses as an integer; an empty string is explicitly not an integer
    if s == '':
        return False
    try:
        int(s)
        return True
    except ValueError:
        return False

def new_line_behind_entry(current_line, next_line):
    # True when both lines start with an integer field and the two fields differ
    current_doc = current_line.split(" ", 1)
    next_doc = next_line.split(" ", 1)
    if current_doc[0] != next_doc[0] and representsInt(current_doc[0]) and representsInt(next_doc[0]):
        return True
    return False
    
def make_sentences(content, reg_exp):
    # Replace every line whose token matches a sentence-boundary regex with a blank line
    cleaned_content = []
    for line in content:
        token = get_token_of_training_sample(line)
        if any(re.search(regex, token) is not None for regex in reg_exp):
            cleaned_content.append('\n')
            continue
        cleaned_content.append(line)
    return cleaned_content

def empty_lines_formation(content):
    # Collapse runs of blank lines and add a blank line wherever the leading integer field changes
    cleaned_content = []
    for idx, elem in enumerate(content):
        currelem = elem
        # The modulo makes the neighbours wrap around at both ends of the list
        prevelem = content[(idx - 1) % len(content)]
        nextelem = content[(idx + 1) % len(content)]
        addemptyline = new_line_behind_entry(currelem, nextelem)
        # A line of length 1 is treated as blank (a bare newline); skip it if the previous line was blank too
        doubleempty = len(prevelem) == 1 and len(currelem) == 1

        if not doubleempty:
            cleaned_content.append(currelem)

        if len(currelem) > 1 and addemptyline:
            cleaned_content.append('\n')
    return cleaned_content
    
def replace_special_chars(content, reg_exp):
    # Despite the name, this drops every line whose token matches one of the regexes
    cleaned_content = []
    for line in content:
        token = get_token_of_training_sample(line)
        if any(re.search(regex, token) is not None for regex in reg_exp):
            continue
        cleaned_content.append(line)
    return cleaned_content
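
To make the slicing in `get_token_of_training_sample` concrete, here is a minimal sketch that runs `find_nth` on a made-up line. The sample line and the helper name `token_of` are assumptions for illustration only; the actual layout of the BILOU files is not shown here and may differ.

```python
def find_nth(haystack, needle, n):
    """Index of the n-th occurrence of needle, or -1 if there are fewer than n."""
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start + len(needle))
        n -= 1
    return start

def token_of(line):
    # Hypothetical helper: same slicing as get_token_of_training_sample
    return line[find_nth(line, ' ', 3) + 1:]

# Hypothetical line with four space-separated fields (not the real file format)
sample = "f1 f2 f3 Mumbai\n"
third_space = find_nth(sample, ' ', 3)
token = token_of(sample)

# A line with fewer than three spaces: find_nth returns -1, so the slice starts at 0
short = token_of("Mumbai\n")

print(third_space, repr(token), repr(short))
```

Note that the token keeps its trailing newline, and that a line with fewer than three spaces comes back unchanged rather than raising an error.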

In [158]:
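# Tokens that mark a sentence boundary: a lone period or a lone bullet
new_sentence_regex = [r"^\.$", r"^•$"]
# Single punctuation tokens (and the possessive 's) that are dropped from the data.
# Raw strings avoid invalid-escape warnings. r"^\\$" matches a lone backslash; the
# original non-raw "^\\$" compiled to ^\$ and matched any token starting with "$".
special_char_regex = [r"^\?$", r"^!$", r"^;$", r"^-$", r"^_$", r"^\+$", r"^&$", r"^\($", r"^\)$", r"^:$", r"^,$", r"^/$", r"^\\$", r"^'s$"]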

with open("data/flair/train_res_bilou.txt", encoding="utf8") as f:
    train_content = f.readlines()

with open("data/flair/test_res_bilou.txt", encoding="utf8") as f:
    test_content = f.readlines()

ready_train_set = make_sentences(train_content, new_sentence_regex)
ready_test_set = make_sentences(test_content, new_sentence_regex)
ready_train_set = empty_lines_formation(ready_train_set)
ready_test_set = empty_lines_formation(ready_test_set)
ready_train_set = replace_special_chars(ready_train_set, special_char_regex)
ready_test_set = replace_special_chars(ready_test_set, special_char_regex)
        
with open('data/flair/train_res_bilou_preprocessed.txt', 'w', encoding="utf-8") as f:
    # Every item already ends with its own newline, so write the lines as they are
    f.writelines(ready_train_set)

with open('data/flair/test_res_bilou_preprocessed.txt', 'w', encoding="utf-8") as f:
    f.writelines(ready_test_set)
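
The anchored patterns only work on lines read with `readlines()` because, without `re.MULTILINE`, `$` matches at the very end of the string and also just before a trailing newline. The sketch below checks this on a handful of made-up tokens. The token list and the helper `matches_any` are assumptions for illustration, and it assumes each token ends with exactly one newline.

```python
import re

# Illustrative tokens as they would look after readlines(), each with a trailing newline
tokens = [".\n", "•\n", "Mr.\n", ",\n", "'s\n", "Mumbai\n"]

new_sentence = [r"^\.$", r"^•$"]   # sentence boundaries
special = [r"^,$", r"^'s$"]        # a small subset of the special-character patterns

def matches_any(token, patterns):
    # Hypothetical helper: True if any pattern matches the token
    return any(re.search(p, token) is not None for p in patterns)

boundaries = [t.strip() for t in tokens if matches_any(t, new_sentence)]
dropped = [t.strip() for t in tokens if matches_any(t, special)]
kept = [t.strip() for t in tokens
        if not matches_any(t, new_sentence) and not matches_any(t, special)]

print(boundaries, dropped, kept)
```

A token such as `Mr.` survives because `^` pins the match to the start of the token, so only a period standing alone is treated as a sentence boundary.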