# Convert Cleaned Data into Dictionary

This notebook loads Cleaned_data.csv and repairs the columns whose cell values are lists, since these are read back from the CSV as strings. It then takes the cleaned resume text, the identified skill words, and the start/end indices of those words, and produces the dictionary format that can be used for training an NER model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
import warnings
warnings.filterwarnings("ignore")

Unfortunately, when a DataFrame with arrays or lists in its cells is saved to CSV, each of those cells is written out as a string. After the data is read in, the functions further down convert those strings back into usable values.
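To see why the repair step is needed, here is a minimal sketch of the round trip through CSV. The toy frame and its values are made up for illustration; only the column name `indices` comes from the data used in this notebook.

```python
import io

import numpy as np
import pandas as pd

# Toy data (hypothetical): one cell holding a list that contains a NumPy array
cells = np.empty(1, dtype=object)
cells[0] = [np.array([[7, 28], [29, 35]])]
toy = pd.DataFrame({'indices': cells})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # the cell is written out as its string form
buffer.seek(0)

reloaded = pd.read_csv(buffer)
cell = reloaded['indices'][0]
print(type(cell).__name__)  # the reloaded cell is a str, not a list of arrays
print(cell)
```

The reloaded cell has the same `[array([[...` shape as the `indices` column in the output further down, which is why it has to be parsed back by hand.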

In [3]:
data = pd.read_csv('Cleaned_data.csv') # read in cleaned data
data

Unnamed: 0.1,Unnamed: 0,Category,Resume,Cleaned_text,indices,skill_words
0,0,Data Science,Skills * Programming Languages: Python (pandas...,Skills Programming Languages Python pandas num...,"[array([[ 7, 28],\n [ 29, 35],\n ...","[['Programming Languages', 'Python', 'pandas',..."
1,1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...,Education Details May 2013 to May 2017 B.E U...,"[array([[ 464, 466],\n [ 654, 659],\n ...","[['ML', 'steps', 'machine learning', 'encoding..."
2,2,Data Science,"Areas of Interest Deep Learning, Control Syste...",Areas of Interest Deep Learning Control System...,"[array([[ 18, 31],\n [ 47, 65],\n ...","[['Deep Learning', 'Design Programming', 'Mach..."
3,3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,Skills R Python SAP HANA Tableau SAP HANA...,"[array([[ 8, 9],\n [ 11, 17],\n ...","[['R', 'Python', 'SAP HANA', 'Tableau', 'SAP H..."
4,4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab...",Education Details MCA YMCAUST Faridabad Ha...,"[array([[ 52, 64],\n [325, 333]])]","[['Data Science', 'Database']]"
...,...,...,...,...,...,...
957,957,Testing,Computer Skills: â¢ Proficient in MS office (...,Computer Skills Proficient in MS office Word ...,"[array([[ 0, 15],\n [ 34, 40],\n ...","[['Computer Skills', 'office', 'Word', 'Basic'..."
958,958,Testing,â Willingness to accept the challenges. â ...,Willingness to accept the challenges. Positi...,"[array([[ 347, 358],\n [ 490, 501],\n ...","[['Electronics', 'Electronics', 'Testing', 'El..."
959,959,Testing,"PERSONAL SKILLS â¢ Quick learner, â¢ Eagerne...",PERSONAL SKILLS Quick learner Eagerness to l...,"[array([[ 90, 100],\n [ 408, 416],\n ...","[['leadership', 'Research', 'Testing', 'Testin..."
960,960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...,"[array([[ 0, 15],\n [ 57, 63],\n ...","[['COMPUTER SKILLS', 'Office', 'C', 'PCB Desig..."


In [4]:
def convert_string_to_array(string): # convert a cleaned string such as '[7,28],[29,35]' into an (n, 2) integer array
    index_list = string.split('],[') # one entry per [start,end] pair
    index_list[0] = index_list[0][1:] # drop the leading '['
    index_list[-1] = index_list[-1][:-1] # drop the trailing ']'
    rows = [[int(value) for value in pair.split(',')] for pair in index_list] # parse each pair as integers, not strings
    return np.array(rows) # always 2-D, even when the row holds a single pair

def fix_df_arrays(data): # clean up the list columns (get rid of unwanted characters)
    data = data.copy() # leave the caller's DataFrame unchanged
    # regex=False treats every pattern as a literal string; '[([' is not a valid regular expression
    for pattern in ['\n', 'array', '[([', '])]', ' ']:
        data['indices'] = data['indices'].str.replace(pattern, '', regex=False)
    for pattern in ['[[', ']]', "'"]:
        data['skill_words'] = data['skill_words'].str.replace(pattern, '', regex=False)

    # build the new column in one assignment instead of writing cell by cell (avoids chained assignment)
    data['index_array'] = [convert_string_to_array(s) for s in data['indices']]

    return data[['Category','Cleaned_text','skill_words','index_array']] # grab just the columns we want
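As an alternative to stripping characters one pattern at a time, the raw cells could be parsed directly. The sketch below is not the approach used in this notebook: `parse_indices` and `parse_skill_words` are hypothetical helpers, and they assume that the indices are non-negative integers with no `dtype=...` suffix in the string, and that the `skill_words` cell is a valid Python literal of nested lists.

```python
import ast
import re

import numpy as np

def parse_indices(raw):
    """Hypothetical helper: pull every integer out of the raw cell and pair them up."""
    numbers = [int(n) for n in re.findall(r'\d+', raw)]
    return np.array(numbers, dtype=int).reshape(-1, 2)  # one [start, end] row per skill word

def parse_skill_words(raw):
    """Hypothetical helper: read the raw cell as a Python literal and flatten it."""
    nested = ast.literal_eval(raw)  # e.g. [['Programming Languages', 'Python']]
    return [word for group in nested for word in group]

# Example cells shaped like the raw 'indices' and 'skill_words' columns
raw_indices = "[array([[ 7, 28],\n       [ 29, 35]])]"
raw_words = "[['Programming Languages', 'Python']]"
print(parse_indices(raw_indices))
print(parse_skill_words(raw_words))
```

Because `ast.literal_eval` returns a real list, skill words that themselves contain a comma would stay intact, which a later split on commas cannot guarantee.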

    

In [5]:
data = fix_df_arrays(data) #run function

In [6]:
data

Unnamed: 0,Category,Cleaned_text,skill_words,index_array
0,Data Science,Skills Programming Languages Python pandas num...,"Programming Languages, Python, pandas, numpy, ...","[[7, 28], [29, 35], [36, 42], [43, 48], [49, 5..."
1,Data Science,Education Details May 2013 to May 2017 B.E U...,"ML, steps, machine learning, encoding, feature...","[[464, 466], [654, 659], [663, 679], [729, 737..."
2,Data Science,Areas of Interest Deep Learning Control System...,"Deep Learning, Design Programming, Machinery, ...","[[18, 31], [47, 65], [85, 94], [95, 110], [111..."
3,Data Science,Skills R Python SAP HANA Tableau SAP HANA...,"R, Python, SAP HANA, Tableau, SAP HANA, SQL, S...","[[8, 9], [11, 17], [19, 27], [29, 36], [38, 46..."
4,Data Science,Education Details MCA YMCAUST Faridabad Ha...,"Data Science, Database","[[52, 64], [325, 333]]"
...,...,...,...,...
957,Testing,Computer Skills Proficient in MS office Word ...,"Computer Skills, office, Word, Basic, Excel, C...","[[0, 15], [34, 40], [41, 45], [46, 51], [52, 5..."
958,Testing,Willingness to accept the challenges. Positi...,"Electronics, Electronics, Testing, Electronics...","[[347, 358], [490, 501], [570, 577], [587, 598..."
959,Testing,PERSONAL SKILLS Quick learner Eagerness to l...,"leadership, Research, Testing, Testing, Transf...","[[90, 100], [408, 416], [628, 635], [665, 672]..."
960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...,"COMPUTER SKILLS, Office, C, PCB Design, Matlab...","[[0, 15], [57, 63], [64, 65], [74, 84], [105, ..."


## Create a dictionary

Now we need to take each row of this data and convert it to a dictionary form that can be used with spaCy.

In [7]:
# this takes the text, the indices array, and a list of the skill words
# and creates a dictionary of the format desired for training
def create_dictionary(sample_text, indices, words):
    data_dict = {}
    data_dict['content'] = sample_text

    starts = indices[:,0] # first column: start offset of each skill word
    ends = indices[:,1] # second column: end offset of each skill word

    items = []
    for word, start, end in zip(words, starts, ends):
        # int() stores the offsets as plain Python integers rather than NumPy values
        label_dict = {'label':'SKILL','points':{'text':word,'start':int(start),'end':int(end)}}
        items.append(label_dict)

    data_dict['annotations'] = items
    return data_dict
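One way such a dictionary could then be handed to spaCy is as a `(text, {'entities': [(start, end, label)]})` tuple, the shape used by spaCy v2-style training loops. Whether the later training step uses exactly this shape is an assumption, and `to_spacy_example` is a hypothetical helper. The sketch also drops any span whose offsets do not select the recorded word, treating `end` as exclusive.

```python
def to_spacy_example(data_dict):
    """Hypothetical helper: turn one dictionary into a (text, {'entities': [...]}) tuple."""
    text = data_dict['content']
    entities = []
    for item in data_dict['annotations']:
        points = item['points']
        start, end = int(points['start']), int(points['end'])
        # Keep a span only if its offsets really select the recorded word
        if text[start:end] == points['text'].strip():
            entities.append((start, end, item['label']))
    return (text, {'entities': entities})

# A shortened sample in the same format that create_dictionary returns
sample = {
    'content': 'Skills Programming Languages Python pandas',
    'annotations': [
        {'label': 'SKILL', 'points': {'text': 'Programming Languages', 'start': 7, 'end': 28}},
        {'label': 'SKILL', 'points': {'text': 'Python', 'start': 29, 'end': 35}},
    ],
}
print(to_spacy_example(sample))
```

The alignment check is a cheap guard: a span that is off by even one character would teach the model the wrong token boundaries.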


In [8]:
# sample_text, indices, and words are taken directly from the columns
i = 0 # for the first row
create_dictionary(data['Cleaned_text'][i], data['index_array'][i], data['skill_words'][i].split(', ')) # split the words column on ', ' so no word keeps a leading space

{'content': 'Skills Programming Languages Python pandas numpy scipy scikit-learn matplotlib Sql Java JavaScript JQuery. Machine learning Regression SVM Nave Bayes KNN Random Forest Decision Trees Boosting techniques Cluster Analysis Word Embedding Sentiment Analysis Natural Language processing Dimensionality reduction Topic Modelling LDA NMF PCA & Neural Nets. Database Visualizations Mysql SqlServer Cassandra Hbase ElasticSearch D3.js DC.js Plotly kibana matplotlib ggplot Tableau. Others Regular Expression HTML CSS Angular 6 Logstash Kafka Python Flask Git Docker computer vision - Open CV and understanding of Deep learning.Education Details Data Science Assurance Associate Data Science Assurance Associate - Ernst & Young LLP Skill Details  JAVASCRIPT- Exprience - 24 months jQuery- Exprience - 24 months Python- Exprience - 24 monthsCompany Details  company - Ernst & Young LLP description - Fraud Investigations and Dispute Services  Assurance TECHNOLOGY ASSISTED REVIEW TAR Technology Ass