# Convert Cleaned Data into Dictionary

This notebook loads Cleaned_data.csv and repairs the columns whose cell values are lists, since these are read back in as strings. It then takes the cleaned resume text, the identified skill words, and the start/end indices of those words, and produces the dictionary format that can be used for training an NER model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
import warnings
warnings.filterwarnings("ignore")

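Unfortunately, when a DataFrame that holds arrays is saved to CSV, each array is written out as its string representation, so it comes back as a string when the file is read in. The cells below load the data and define functions for repairing it.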

In [3]:
data = pd.read_csv('Cleaned_data.csv')#read in cleaned data
data

Unnamed: 0.1,Unnamed: 0,Category,Resume,Cleaned_text,indices,skill_words
0,0,Data Science,Skills * Programming Languages: Python (pandas...,Skills Programming Languages Python pandas num...,"[array([[ 7, 28],\n [ 29, 35],\n ...","[['Programming Languages', 'Python', 'pandas',..."
1,1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...,Education Details May 2013 to May 2017 B.E U...,"[array([[ 464, 466],\n [ 654, 659],\n ...","[['ML', 'steps', 'machine learning', 'encoding..."
2,2,Data Science,"Areas of Interest Deep Learning, Control Syste...",Areas of Interest Deep Learning Control System...,"[array([[ 18, 31],\n [ 47, 65],\n ...","[['Deep Learning', 'Design Programming', 'Mach..."
3,3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,Skills R Python SAP HANA Tableau SAP HANA...,"[array([[ 8, 9],\n [ 11, 17],\n ...","[['R', 'Python', 'SAP HANA', 'Tableau', 'SAP H..."
4,4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab...",Education Details MCA YMCAUST Faridabad Ha...,"[array([[ 52, 64],\n [325, 333]])]","[['Data Science', 'Database']]"
...,...,...,...,...,...,...
957,957,Testing,Computer Skills: â¢ Proficient in MS office (...,Computer Skills Proficient in MS office Word ...,"[array([[ 0, 15],\n [ 34, 40],\n ...","[['Computer Skills', 'office', 'Word', 'Basic'..."
958,958,Testing,â Willingness to accept the challenges. â ...,Willingness to accept the challenges. Positi...,"[array([[ 347, 358],\n [ 490, 501],\n ...","[['Electronics', 'Electronics', 'Testing', 'El..."
959,959,Testing,"PERSONAL SKILLS â¢ Quick learner, â¢ Eagerne...",PERSONAL SKILLS Quick learner Eagerness to l...,"[array([[ 90, 100],\n [ 408, 416],\n ...","[['leadership', 'Research', 'Testing', 'Testin..."
960,960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...,"[array([[ 0, 15],\n [ 57, 63],\n ...","[['COMPUTER SKILLS', 'Office', 'C', 'PCB Desig..."


In [4]:
def convert_string_to_array(string):
    """Convert a cleaned string such as '[7,28],[29,35]' into a NumPy array of [start, end] rows.

    The values stay as strings here; they are cast to int later, in convert_to_spacy.
    """
    index_list = string.split('],[')  # one 'start,end' chunk per skill word
    index_list[0] = index_list[0][1:]  # cut off the leading '['
    index_list[-1] = index_list[-1][:-1]  # cut off the trailing ']'
    rows = [np.array(chunk.split(',')) for chunk in index_list]  # split each chunk
    # a single pair stays 1-D with shape (2,); create_dictionary handles that case
    return rows[0] if len(rows) == 1 else np.vstack(rows)

def fix_df_arrays(data):
    """Clean up the list columns (get rid of unwanted characters) and rebuild the index arrays."""
    data = data.copy()  # leave the caller's DataFrame untouched
    # regex=False treats each pattern literally; '[([' is not a valid regular expression
    for pattern in ['\n', 'array', '[([', '])]', ' ']:
        data['indices'] = data['indices'].str.replace(pattern, '', regex=False)
    for pattern in ['[[', ']]', "'"]:
        data['skill_words'] = data['skill_words'].str.replace(pattern, '', regex=False)

    # convert each cleaned indices string into a NumPy array
    data['index_array'] = data['indices'].apply(convert_string_to_array)

    return data[['Category', 'Cleaned_text', 'skill_words', 'index_array']]  # grab just the columns we want
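
As an alternative to the string replacements above, the stringified array can be parsed directly with a regular expression. This is only a minimal sketch, not the method this notebook uses: `parse_index_string` and `PAIR_PATTERN` are hypothetical names, and the sketch assumes every pair is written as two non-negative integers in the form `[start, end]`.

```python
import re

# Hypothetical pattern: matches one "[start, end]" pair, allowing padding spaces.
PAIR_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]")

def parse_index_string(raw):
    """Return a list of (start, end) integer tuples found in the raw CSV string."""
    return [(int(start), int(end)) for start, end in PAIR_PATTERN.findall(raw)]

# A string shaped like the values in the 'indices' column
raw = "[array([[  7,  28],\n       [ 29,  35]])]"
pairs = parse_index_string(raw)
print(pairs)

# A row with no skill words gives an empty list instead of placeholder strings
empty_pairs = parse_index_string("[array([], dtype=float64)]")
print(empty_pairs)
```

Because the values are converted to `int` at parse time, a row without any skill words simply produces no pairs.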

    

In [5]:
data = fix_df_arrays(data) #run function

In [6]:
data

Unnamed: 0,Category,Cleaned_text,skill_words,index_array
0,Data Science,Skills Programming Languages Python pandas num...,"Programming Languages, Python, pandas, numpy, ...","[[7, 28], [29, 35], [36, 42], [43, 48], [49, 5..."
1,Data Science,Education Details May 2013 to May 2017 B.E U...,"ML, steps, machine learning, encoding, feature...","[[464, 466], [654, 659], [663, 679], [729, 737..."
2,Data Science,Areas of Interest Deep Learning Control System...,"Deep Learning, Design Programming, Machinery, ...","[[18, 31], [47, 65], [85, 94], [95, 110], [111..."
3,Data Science,Skills R Python SAP HANA Tableau SAP HANA...,"R, Python, SAP HANA, Tableau, SAP HANA, SQL, S...","[[8, 9], [11, 17], [19, 27], [29, 36], [38, 46..."
4,Data Science,Education Details MCA YMCAUST Faridabad Ha...,"Data Science, Database","[[52, 64], [325, 333]]"
...,...,...,...,...
957,Testing,Computer Skills Proficient in MS office Word ...,"Computer Skills, office, Word, Basic, Excel, C...","[[0, 15], [34, 40], [41, 45], [46, 51], [52, 5..."
958,Testing,Willingness to accept the challenges. Positi...,"Electronics, Electronics, Testing, Electronics...","[[347, 358], [490, 501], [570, 577], [587, 598..."
959,Testing,PERSONAL SKILLS Quick learner Eagerness to l...,"leadership, Research, Testing, Testing, Transf...","[[90, 100], [408, 416], [628, 635], [665, 672]..."
960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...,"COMPUTER SKILLS, Office, C, PCB Design, Matlab...","[[0, 15], [57, 63], [64, 65], [74, 84], [105, ..."


## Create a dictionary

Now we need to take the data from this CSV and convert it to dictionary form so that it can be used with spaCy.

In [7]:
# this takes the text, the array of [start, end] indices, and the list of skill words,
# and creates a dictionary in the format desired for training
def create_dictionary(sample_text, indices, words):
    indices = np.asarray(indices)
    if indices.shape == (2,):  # a single [start, end] pair is stored as a 1-D array
        starts = [indices[0]]
        ends = [indices[1]]
    else:
        starts = indices[:, 0]
        ends = indices[:, 1]

    # one annotation per skill word, each labelled 'SKILL'
    items = [
        {'label': 'SKILL', 'points': {'text': word, 'start': start, 'end': end}}
        for word, start, end in zip(words, starts, ends)
    ]
    return {'content': sample_text, 'annotations': items}
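
Since each annotation stores a word together with its character offsets, it can be worth checking that the offsets really slice that word out of the text. The following is a minimal sketch under stated assumptions: `find_misaligned` is a hypothetical helper, the offsets are taken to be end-exclusive (as in `[7, 28]` for 'Programming Languages'), and the third annotation is deliberately wrong for illustration.

```python
def find_misaligned(text, annotations):
    """Return the annotations whose text[start:end] slice differs from the stored word."""
    bad = []
    for item in annotations:
        point = item["points"]
        start, end = int(point["start"]), int(point["end"])
        # words split on ',' keep a leading space, so strip before comparing
        if text[start:end] != point["text"].strip():
            bad.append(item)
    return bad

sample = {
    "content": "Skills Programming Languages Python pandas",
    "annotations": [
        {"label": "SKILL", "points": {"text": "Programming Languages", "start": "7", "end": "28"}},
        {"label": "SKILL", "points": {"text": " Python", "start": "29", "end": "35"}},
        # deliberately wrong end offset: text[36:41] is 'panda'
        {"label": "SKILL", "points": {"text": " pandas", "start": "36", "end": "41"}},
    ],
}

misaligned = find_misaligned(sample["content"], sample["annotations"])
print(misaligned)
```

Only the third annotation is reported, because its slice stops one character short.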


In [8]:

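# row 46 has no identified skill words, so its single annotation holds
# placeholder strings instead of numeric offsets (convert_to_spacy later skips it)
i = 46
dict_ = create_dictionary(data['Cleaned_text'][i], data['index_array'][i], data['skill_words'][i].split(','))  # the words column needs splitting
print(dict_)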

{'content': 'Education Details  MBA  ACN College of engineering & mgt HR  Skill Details  Company Details  company - HR Assistant description - ', 'annotations': [{'label': 'SKILL', 'points': {'text': '', 'start': '', 'end': 'dtype=float64)'}}]}


In [9]:
#sample_text, indices, words are taken directly from the columns
i = 0 #for the first row
dict_ = (create_dictionary(data['Cleaned_text'][i],data['index_array'][i], data['skill_words'][i].split(',')))#the words column needs splitting

In [14]:
import json

In [12]:
dicts  = []
for i in range(len(data)):
    dict_ = (create_dictionary(data['Cleaned_text'][i],data['index_array'][i], data['skill_words'][i].split(',')))#the words column needs splitting
    dicts.append(dict_)

In [15]:
dump = json.dumps(dicts)

In [16]:
json.loads(dump)[0]

{'content': 'Skills Programming Languages Python pandas numpy scipy scikit-learn matplotlib Sql Java JavaScript JQuery. Machine learning Regression SVM Nave Bayes KNN Random Forest Decision Trees Boosting techniques Cluster Analysis Word Embedding Sentiment Analysis Natural Language processing Dimensionality reduction Topic Modelling LDA NMF PCA & Neural Nets. Database Visualizations Mysql SqlServer Cassandra Hbase ElasticSearch D3.js DC.js Plotly kibana matplotlib ggplot Tableau. Others Regular Expression HTML CSS Angular 6 Logstash Kafka Python Flask Git Docker computer vision - Open CV and understanding of Deep learning.Education Details Data Science Assurance Associate Data Science Assurance Associate - Ernst & Young LLP Skill Details  JAVASCRIPT- Exprience - 24 months jQuery- Exprience - 24 months Python- Exprience - 24 monthsCompany Details  company - Ernst & Young LLP description - Fraud Investigations and Dispute Services  Assurance TECHNOLOGY ASSISTED REVIEW TAR Technology Ass

# Reformat Data for Training

In [17]:
def convert_to_spacy(dump):
    training_data = []
    lines = json.loads(dump)
    for data in lines:
        text = data['content'].replace("\n", " ")
        entities = []
        data_annotations = data['annotations']
        if data_annotations is not None:
            for annotation in data_annotations:
                # only a single point in text annotation.
                point = annotation['points']
                # note: the third element (the entity label) is the skill text itself,
                # not annotation['label'], so every distinct skill string becomes its own label
                try:
                    entities.append((int(point['start']), int(point['end']), point['text'].lstrip(' ')))
                except (ValueError, TypeError):
                    # rows without skill words have non-numeric start/end values; skip them
                    continue
        training_data.append((text, {"entities": entities}))
    return training_data
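
The function above puts the skill text in the third slot of each entity tuple, and spaCy reads that slot as the entity label. A variant that uses the annotation's own `'label'` field, so that every span shares the single label `'SKILL'`, could look like the sketch below. It is an illustration only, not the method this notebook trains with: `to_spacy_entities` is a hypothetical helper and the record is made up in the same shape as the dictionaries built earlier.

```python
import json

def to_spacy_entities(record):
    """Build (text, {"entities": [(start, end, label), ...]}) from one record."""
    entities = []
    for annotation in record.get("annotations") or []:
        point = annotation["points"]
        try:
            span = (int(point["start"]), int(point["end"]), annotation["label"])
        except ValueError:
            continue  # skip placeholder rows with non-numeric offsets
        entities.append(span)
    return record["content"], {"entities": entities}

# Made-up record; the last annotation mimics a row without skill words
dump = json.dumps([{
    "content": "Skills R Python",
    "annotations": [
        {"label": "SKILL", "points": {"text": "R", "start": "7", "end": "8"}},
        {"label": "SKILL", "points": {"text": " Python", "start": "9", "end": "15"}},
        {"label": "SKILL", "points": {"text": "", "start": "", "end": "dtype=float64)"}},
    ],
}])

example = to_spacy_entities(json.loads(dump)[0])
print(example)
```

With a single shared label, the NER component has one class to learn rather than one per distinct skill string.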

In [18]:
train_data = convert_to_spacy(dump)

In [19]:
train_data[500]

('Education Details  January 2012 to January 2013 B.E. Electrical Shivaji University September 2008 HSC Pune Maharashtra Pune University July 2006 SSC Pune Maharashtra Pune University Electrical Engineer Electrical Engineer - R K ELECTRICAL PVT. LTD Skill Details  Company Details  company - R K ELECTRICAL PVT. LTD description - Experience - 1 Year 3 Months  Troubleshooting and Maintenance of following Electrical Equipment -  All Type of Maintenance of Utility.  Electrical and Mechanical Maintenance.  Two 625 KVA Diesel Generator Set Kirloskar  HT LT Switchgear With Protection System Using Relays and Provision For Interlocking C&S Kirloskar  Handling HT Vacuum & SF6 Circuit Breaker Transformer Up to 5000 KVA LT Air circuit Breaker 2000A  Maintenance of STP and WTP Plant.  Maintenance of Air Blower Actuators Soft Starter EOT Crane Mono Rail Centrifugal or Vertical Pumps Hydraulic Machine Rolling Machine Lath Machine Drill Machine AHU HVAC Chiller etc.  Basic knowledge of PLC SCADA Operat

In [20]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(train_data, test_size = 0.1, random_state = 42)


In [21]:
import random

import spacy
from spacy.training import Example


def train_spacy():
    nlp = spacy.blank('en')  # create a blank English pipeline
    # add the built-in NER component to the pipeline, or fetch it if it is already there
    if 'ner' not in nlp.pipe_names:
        ner = nlp.add_pipe('ner')
    else:
        ner = nlp.get_pipe('ner')

    # add labels (the third element of each entity tuple)
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.initialize()  # spaCy v3 name for the deprecated begin_training()
        for itn in range(10):
            print("Starting iteration " + str(itn))
            random.shuffle(train_data)
            losses = {}
            for text, annotations in train_data:
                example = Example.from_dict(nlp.make_doc(text), annotations)
                nlp.update(
                    [example],
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses,
                )
            print(losses)
    return nlp

# Train Model

In [22]:
import random
nlp=train_spacy()

Starting iteration 0
{'ner': 81011.56894337424}
Starting iteration 1
{'ner': 53461.45992737572}
Starting iteration 2
{'ner': 40979.45560365889}
Starting iteration 3
{'ner': 29037.122458815633}
Starting iteration 4
{'ner': 21818.10656490605}
Starting iteration 5
{'ner': 17303.556581408764}
Starting iteration 6
{'ner': 14109.431571701924}
Starting iteration 7
{'ner': 12104.553154702548}
Starting iteration 8
{'ner': 10080.716264506746}
Starting iteration 9
{'ner': 8997.392211428945}


In [23]:
import pickle
with open('nlp.pkl', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(nlp, f)

In [24]:
test_data[0][0]

"TECHNICAL SKILLS Programming Languages Java Servlet JSP Spring Boot. Web Technology HTML5 CSS3 Bootstrap JavaScript JQuery Ajax AngularJs. Database MySQL. IDE and Tool Eclipse spring tool Suit Net beans Sublime Text Atom. Operating System Windows XP 7 8 10. ACHIEVEMENT  Java Developer Certificate from Unanth Technical Institute.  Java Certificate from solo Learn.  Command line crash Course certificate from Udemy. JOB DETAILS Education Details  January 2018 M.C.A Pune Maharashtra Pune University January 2015 B.C.A Amravati Maharashtra Amravati University January 2012 H.S.C Amravati Maharashtra Amravati University January 2010 S.S.C Amravati Maharashtra Amravati University Java developer Full Stack Java Developer Skill Details  Css- Exprience - Less than 1 year months Ajax- Exprience - Less than 1 year months Servlet- Exprience - Less than 1 year months Html5- Exprience - Less than 1 year months Spring- Exprience - Less than 1 year months Java- Exprience - Less than 1 year months Jquery

In [25]:
doc = nlp(test_data[0][0])
for ents in doc.ents:
    print(ents.text, "-->>>>", ents.label_)

Programming Languages -->>>> Programming Languages
Java -->>>> Java
Web -->>>> Web
Technology -->>>> Technology
HTML5 -->>>> HTML5
Bootstrap -->>>> Bootstrap
JavaScript -->>>> JavaScript
JQuery -->>>> JQuery
Ajax -->>>> Ajax
Database -->>>> Database
Eclipse -->>>> Eclipse
spring -->>>> spring
beans -->>>> beans
Sublime Text -->>>> Sublime Text
Windows XP -->>>> Windows XP
Java -->>>> Java
Java -->>>> Java
Command -->>>> Command
Java -->>>> Java
Java -->>>> Java
java -->>>> java
java -->>>> java
ajax -->>>> ajax
Technology -->>>> Technology
Core Java -->>>> Core Java
HTML5 -->>>> HTML5
Bootstrap -->>>> Bootstrap
Javascript -->>>> Javascript
Jquery -->>>> Jquery
Ajax -->>>> Ajax
PROJECT -->>>> PROJECT
Status -->>>> Status
Web -->>>> Web
application -->>>> application
Java -->>>> Java
JavaScript -->>>> JavaScript
Jquery -->>>> Jquery
Ajax -->>>> Ajax
MySQL -->>>> MySQL
It -->>>> It
It -->>>> It
print -->>>> print
It -->>>> It
pages -->>>> pages
customer -->>>> customer
item master -->>>> 
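
Printing the entities gives a qualitative view of the predictions. To put a number on how well predicted spans agree with annotated ones, exact-match precision, recall and F1 can be computed over `(start, end)` pairs. This is a minimal sketch: `span_scores` is a hypothetical helper, and the `predicted` list is made up for illustration rather than taken from the model.

```python
def span_scores(predicted, gold):
    """Exact-match precision, recall and F1 over collections of (start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# In the notebook, predicted spans could be collected as
# [(ent.start_char, ent.end_char) for ent in doc.ents]
gold = [(17, 38), (39, 43), (69, 72)]
predicted = [(17, 38), (39, 43), (50, 57)]  # made-up predictions

precision, recall, f1 = span_scores(predicted, gold)
print(precision, recall, f1)
```

Exact matching is strict: a prediction that overlaps an annotated span but differs by one character counts as a miss.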

In [26]:
test_data[0][1]

{'entities': [(17, 38, 'Programming Languages'),
  (39, 43, 'Java'),
  (69, 72, 'Web'),
  (73, 83, 'Technology'),
  (84, 89, 'HTML5'),
  (95, 104, 'Bootstrap'),
  (105, 115, 'JavaScript'),
  (116, 122, 'JQuery'),
  (123, 127, 'Ajax'),
  (139, 147, 'Database'),
  (168, 175, 'Eclipse'),
  (176, 182, 'spring'),
  (197, 202, 'beans'),
  (203, 215, 'Sublime Text'),
  (239, 249, 'Windows XP'),
  (271, 275, 'Java'),
  (332, 336, 'Java'),
  (367, 374, 'Command'),
  (680, 684, 'Java'),
  (706, 710, 'Java'),
  (1296, 1300, 'java'),
  (1339, 1343, 'java'),
  (1356, 1360, 'ajax'),
  (1436, 1446, 'Technology'),
  (1447, 1456, 'Core Java'),
  (1473, 1478, 'HTML5'),
  (1484, 1493, 'Bootstrap'),
  (1494, 1504, 'Javascript'),
  (1505, 1511, 'Jquery'),
  (1512, 1516, 'Ajax'),
  (1533, 1540, 'PROJECT'),
  (1591, 1597, 'Status'),
  (1650, 1653, 'Web'),
  (1654, 1665, 'application'),
  (1680, 1684, 'Java'),
  (1701, 1711, 'JavaScript'),
  (1712, 1718, 'Jquery'),
  (1719, 1723, 'Ajax'),
  (1728, 1733, 'MySQ