# Resume Parsing

Basically, we want to create an entity ruler that parses real resumes and extracts the "skills".

For your assignment, though, I will ask you to extract education.

## 1. Load data

In [1]:
import pandas as pd
import numpy as np

df_resume = pd.read_csv("../data/resume.csv")

In [5]:
df_resume = df_resume.reindex(np.random.permutation(df_resume.index))
df_resume = df_resume.copy().iloc[0:200, ]  #optional: keeps only 200 rows to save time; skip this if your computer is fast
df_resume.shape

(200, 4)

## 2. Load skill data

If we had to define a pattern for every skill by hand, it would be exhausting.

spaCy anticipates this: the entity ruler can load a ready-made list of patterns from a file, so we do not have to write each one ourselves.
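To make the idea concrete, here is a minimal sketch of how a plain list of skill phrases could be turned into pattern entries in the JSONL shape the entity ruler reads (one JSON object per line, each with a `label` and a `pattern`). The contents of `skills.jsonl` are not shown in this notebook, so the helper name `phrases_to_patterns`, the sample phrases, and the choice of `LOWER` token attributes are assumptions for illustration only.

```python
import json

def phrases_to_patterns(phrases, label="SKILL"):
    """Turn plain phrases into entity-ruler style token patterns."""
    patterns = []
    for phrase in phrases:
        # One {"LOWER": ...} dict per whitespace-separated word,
        # so that matching ignores capitalization.
        tokens = [{"LOWER": word.lower()} for word in phrase.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns

# Hypothetical skill list, not taken from the real skills.jsonl.
skills = ["deep learning", "Python"]
patterns = phrases_to_patterns(skills)

# JSONL means one JSON object per line.
jsonl_text = "\n".join(json.dumps(p) for p in patterns)
print(jsonl_text)
```

Writing `jsonl_text` to a file would give something that could be loaded the same way as `skill_path` in the next cell, assuming the real file follows this layout.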

In [6]:
import spacy

nlp = spacy.load('en_core_web_md')
skill_path = "../data/skills.jsonl"


In [7]:
ruler = nlp.add_pipe("entity_ruler")
ruler.from_disk(skill_path)
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'entity_ruler']

In [8]:
doc = nlp("Chaky loves deep learning.")
doc.ents

(deep learning,)

## 3. Let's try to extract skills from this resume.csv

In [9]:
df_resume.head()

Unnamed: 0,ID,Resume_str,Resume_html,Category
1825,23734441,ACCOUNTANT Professional Summary...,"<div class=""fontsize fontface vmargins hmargin...",ACCOUNTANT
2426,25718772,TSO/FLOATER Career Overview ...,"<div class=""fontsize fontface vmargins hmargin...",AVIATION
469,53169257,DISABILITY ADVOCATE Profess...,"<div class=""fontsize fontface vmargins hmargin...",ADVOCATE
538,38059130,HUMAN RESOURCES MANAGER Summary...,"<div class=""fontsize fontface vmargins hmargin...",ADVOCATE
99,19336728,HR ASSISTANT INTERN Summary ...,"<div class=""fontsize fontface vmargins hmargin...",HR


In [10]:
from spacy.lang.en.stop_words import STOP_WORDS

#before extracting skills, let's clean the text in our resume.csv dataframe
def preprocessing(sentence):
    
    stopwords = set(STOP_WORDS)  #a set gives fast membership checks
    doc = nlp(sentence)
    cleaned_tokens = []
    
    for token in doc:
        #skip stop words, punctuation, whitespace, and symbols
        if token.text not in stopwords and token.pos_ not in ('PUNCT', 'SPACE', 'SYM'):
            #keep the lowercased lemma of everything else
            cleaned_tokens.append(token.lemma_.lower().strip())
                
    return " ".join(cleaned_tokens)

In [16]:
#let's try it on a single resume first, before running it on everything
#the dataframe was shuffled above, so row 5 is effectively a random sample
random_resume = df_resume.Resume_str.iloc[5]
random_resume[:300]

'         INFORMATION TECHNOLOGY COORDINATOR       Career Overview     AVP / Director of Information Technology I Network Engineer with extensive experience.                                                  Strengths - excellent communication skills, strong problem solving skills. Sound work ethic, c'

In [17]:
preprocessing(random_resume[:300])

'information technology coordinator career overview avp director information technology i network engineer extensive experience strengths excellent communication skill strong problem solving skill sound work ethic c'
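Notice that a lone `i` survives in the cleaned text even though "i" is a stop word. The check in `preprocessing` compares `token.text` as written, and the stop-word list holds lowercase words, so a capitalized "I" slips through. The sketch below reproduces the effect in plain Python; the tiny `stop_words` set and the token list are stand-ins for illustration, not spaCy's real data.

```python
# Tiny stand-in for spaCy's STOP_WORDS, which stores lowercase words.
stop_words = {"i", "to", "of", "with"}

# Hand-written tokens standing in for a tokenized resume snippet.
tokens = ["Director", "of", "Information", "Technology", "I",
          "Network", "Engineer", "with", "experience"]

# Case-sensitive check, like `token.text not in stopwords` above.
case_sensitive = [t.lower() for t in tokens if t not in stop_words]

# Lowercasing before the check also catches capitalized stop words.
case_insensitive = [t.lower() for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before the check is a design choice: doing so removes more noise, but the outputs shown in this notebook come from the case-sensitive version.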

In [19]:
#let's apply it to the whole dataframe
df_resume['Clean_resume'] = df_resume.Resume_str.apply(preprocessing)

In [20]:
df_resume.head()

Unnamed: 0,ID,Resume_str,Resume_html,Category,Clean_resume
1825,23734441,ACCOUNTANT Professional Summary...,"<div class=""fontsize fontface vmargins hmargin...",ACCOUNTANT,accountant professional summary skills work hi...
2426,25718772,TSO/FLOATER Career Overview ...,"<div class=""fontsize fontface vmargins hmargin...",AVIATION,tso floater career overview to obtain position...
469,53169257,DISABILITY ADVOCATE Profess...,"<div class=""fontsize fontface vmargins hmargin...",ADVOCATE,disability advocate professional summary dedic...
538,38059130,HUMAN RESOURCES MANAGER Summary...,"<div class=""fontsize fontface vmargins hmargin...",ADVOCATE,human resources manager summary to continue pr...
99,19336728,HR ASSISTANT INTERN Summary ...,"<div class=""fontsize fontface vmargins hmargin...",HR,hr assistant intern summary new graduate seek ...


## 4. Let's really extract skills!!

In [23]:
def get_skills(text):
    #run the text through the pipeline
    doc = nlp(text)  #note that this nlp already includes the skill entity ruler
    
    skills = []
    
    #look at the ents and keep only the ones labeled SKILL
    for ent in doc.ents:
        if ent.label_ == "SKILL":
            skills.append(ent.text)
    
    return skills

def unique_skills(x):
    return list(set(x))
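One thing to keep in mind: `list(set(x))` removes duplicates but does not keep the order in which the skills appeared in the resume. If that order matters, a small variation works. This is only a sketch; the name `unique_skills_ordered` and the sample list are made up for illustration.

```python
def unique_skills_ordered(skills):
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this drops repeats while preserving first-seen order.
    return list(dict.fromkeys(skills))

# Hypothetical output of get_skills for one resume.
found = ["accounting", "software", "accounting", "support", "software"]
print(unique_skills_ordered(found))
```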

In [24]:
df_resume.head(1)

Unnamed: 0,ID,Resume_str,Resume_html,Category,Clean_resume
1825,23734441,ACCOUNTANT Professional Summary...,"<div class=""fontsize fontface vmargins hmargin...",ACCOUNTANT,accountant professional summary skills work hi...


In [25]:
df_resume['Skills'] = df_resume.Clean_resume.apply(get_skills)
df_resume['Skills'] = df_resume.Skills.apply(unique_skills)

In [29]:
df_resume.Skills.iloc[0]

['schedule', 'business administration', 'accounting', 'support', 'software']

## 5. Visualization

Which skills are most important in information management?
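Before plotting anything, one way to answer this is to count how often each skill shows up within a single category. The sketch below uses only the standard library and a few made-up rows standing in for the `Category` and `Skills` columns of `df_resume`; the category label and the helper name `top_skills` are placeholders, not values confirmed by the dataset.

```python
from collections import Counter

# Hypothetical (category, skills) pairs standing in for df_resume rows.
rows = [
    ("INFORMATION-TECHNOLOGY", ["software", "support", "database"]),
    ("INFORMATION-TECHNOLOGY", ["software", "security"]),
    ("ACCOUNTANT", ["accounting", "software"]),
]

def top_skills(rows, category, n=10):
    """Count skills across all resumes in one category."""
    counts = Counter()
    for cat, skills in rows:
        if cat == category:
            counts.update(skills)
    # most_common returns (skill, count) pairs, highest count first.
    return counts.most_common(n)

top = top_skills(rows, "INFORMATION-TECHNOLOGY", n=2)
print(top)
```

Because each resume's skill list was already deduplicated, a count here means "number of resumes that mention the skill", which is a reasonable proxy for importance.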