In this project, I use spaCy for entity recognition on 200 resumes
and experiment with various NLP tools for text analysis.

I have also added a skills match feature that gives hiring managers a metric
to help them decide whether to move a candidate to the interview stage.

I will be using two datasets; the first contains resume texts 
and the second contains skills that we will use to create an entity ruler.

## Inside the CSV
* ID: Unique identifier and file name of the respective PDF.
* Resume_str: The resume text only, in string format.
* Resume_html: The resume data in HTML format, as captured during web scraping.
* Category: The job category the resume was used to apply for.

## Present categories
HR, Designer, Information-Technology, Teacher, Advocate, Business-Development, Healthcare, Fitness, Agriculture, BPO, Sales, Consultant, Digital-Media, Automobile, Chef, Finance, Apparel, Engineering, Accountant, Construction, Public-Relations, Banking, Arts, Aviation

### Jobzilla skill patterns

The Jobzilla skill dataset is a JSONL file containing skills that can be used to create a spaCy entity_ruler.

Each entry contains a label and a pattern: the different words used to describe a skill in various resumes.
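
To make the label and pattern structure concrete, here is a small sketch that parses two JSONL lines with the standard library. The sample lines are hypothetical and only follow the token-pattern format that the spaCy entity ruler accepts; the real entries in the Jobzilla file may look different.

```python
import json

# Hypothetical sample lines in the entity ruler pattern format.
# The actual Jobzilla entries may differ.
sample_jsonl = """\
{"label": "SKILL", "pattern": [{"LOWER": "python"}]}
{"label": "SKILL", "pattern": [{"LOWER": "machine"}, {"LOWER": "learning"}]}
"""

# One JSON object per line: parse each non-empty line on its own.
patterns = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

for p in patterns:
    # Each pattern is a list of token descriptions; join them for display.
    tokens = " ".join(tok["LOWER"] for tok in p["pattern"])
    print(p["label"], "->", tokens)
```

A multi-word skill is one pattern with one dictionary per token, which is why the second entry has two items in its pattern list.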

In [None]:
%pip install spacy

In [None]:
%pip install gensim

In [None]:
%pip install pyLDAvis

In [None]:
%pip install wordcloud

In [None]:
%pip install jsonlines

In [None]:
%pip install nltk

In [None]:
#spacy
import spacy
from spacy.pipeline import EntityRuler
from spacy.lang.en import English
from spacy.tokens import Doc

#gensim
import gensim
from gensim import corpora

#Visualization
from spacy import displacy
import pyLDAvis.gensim_models
from wordcloud import WordCloud
import plotly.express as px
import matplotlib.pyplot as plt

#Data loading/ Data manipulation
import pandas as pd
import numpy as np
import jsonlines

#nltk
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download(['stopwords','wordnet'])

#warning
import warnings 
warnings.filterwarnings('ignore')

## Loading
In this section, I load the spaCy model and the resume dataset, and load the Jobzilla skills dataset directly into the entity ruler.

### Resume Dataset
I use pandas read_csv to read the dataset containing the resume text.

* I shuffle the rows so that the 200 samples cover various job categories instead of one.
* I limit the number of samples to 200, as processing 2400+ resumes takes time.

In [None]:
df = pd.read_csv("../Getting The Data/Resume.csv")
df = df.reindex(np.random.permutation(df.index))
data = df.copy().iloc[0:200,]
data.head()
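
The shuffle above is not seeded, so each run draws a different set of 200 resumes. If reproducible results matter, one option is DataFrame.sample with a fixed random_state. The sketch below uses a small hypothetical DataFrame in place of Resume.csv, and the seed value 42 is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical stand-in for Resume.csv with the same column names.
df = pd.DataFrame({
    "ID": range(10),
    "Resume_str": [f"resume text {i}" for i in range(10)],
    "Category": ["HR", "Chef"] * 5,
})

# random_state fixes the shuffle, so repeated calls return the same rows.
sample_a = df.sample(n=4, random_state=42)
sample_b = df.sample(n=4, random_state=42)

print(sample_a.equals(sample_b))
```

On the real dataset, the equivalent call would use n=200.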

### Loading spaCy model
The spaCy model has to be downloaded once (for example with python -m spacy download en_core_web_sm) before it can be loaded into nlp.

In [None]:
nlp = spacy.load("en_core_web_sm")
skill_pattern_path = "jz_skill_patterns.jsonl"

### Entity Ruler
To create an entity ruler, we add a new component to the pipeline and then load the .jsonl file containing the skills into the ruler.

The output of nlp.pipe_names confirms that the entity_ruler component has been added.

The entity ruler lets us add rules that highlight additional categories within the text, such as skills and job descriptions in our case.

In [None]:
ruler = nlp.add_pipe("entity_ruler")
ruler.from_disk(skill_pattern_path)
nlp.pipe_names

### Skills

I will create two Python functions: one to extract all the skills within a resume as a list, and one to remove duplicates.

Later I am going to apply these functions to the dataset and create a new feature called skills.

This will help us visualize trends and patterns within the dataset.

* get_skills extracts skills from a single text.
* unique_skills removes duplicates.

In [None]:
def get_skills(text):
    # Run the pipeline and keep only the entities labeled SKILL
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "SKILL"]


def unique_skills(x):
    return list(set(x))
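
The skills match metric mentioned in the introduction can be built on top of these skill lists. Below is a minimal sketch, assuming the score is simply the percentage of required skills that also appear in the resume. The function skill_match_score and the sample lists are hypothetical illustrations, not the implementation used in this notebook.

```python
def skill_match_score(resume_skills, required_skills):
    """Return the percentage of required skills found in the resume."""
    # Compare case-insensitively and ignore duplicates.
    required = {s.lower() for s in required_skills}
    if not required:
        return 0.0
    found = {s.lower() for s in resume_skills}
    return round(100 * len(required & found) / len(required), 1)


# Hypothetical inputs: skills extracted from a resume and a job's requirements.
resume_skills = ["python", "sql", "excel"]
required_skills = ["Python", "SQL", "Tableau", "Statistics"]

score = skill_match_score(resume_skills, required_skills)
print(f"Skill match: {score}%")
```

A set intersection keeps the metric simple and easy to explain, but it only counts exact matches, so near-synonyms would have to be normalized beforehand.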

## Cleaning Resume Text

I am going to use the nltk library to clean the dataset in a few steps:

* Using regex to remove hyperlinks, special characters, and punctuation
* Lowercasing the text
* Splitting the text into a list of words on whitespace
* Removing English stopwords
* Lemmatizing each remaining word to its base form for normalization
* Appending the result to a list

In [None]:
clean = []
lm = WordNetLemmatizer()
# Build the stopword set once instead of once per word
stop_words = set(stopwords.words("english"))
# Raw string so the regex escapes reach the re module unchanged
pattern = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"'
for i in range(data.shape[0]):
    review = re.sub(pattern, " ", data["Resume_str"].iloc[i])
    review = review.lower()
    review = review.split()
    review = [lm.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    clean.append(review)
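
To see what the regex and the stopword filter do to a single string, here is a standard-library sketch of the same steps. It skips lemmatization, and the small STOPWORDS set is a hypothetical stand-in for the NLTK English list, so the output only approximates what the cell above produces.

```python
import re

# Same pattern as in the cleaning cell, compiled once.
PATTERN = re.compile(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"')

# Hypothetical stand-in for NLTK's English stopword list.
STOPWORDS = {"i", "a", "the", "and", "in", "at", "with", "of"}


def clean_text(text):
    # Replace handles, URLs, and punctuation with spaces.
    text = PATTERN.sub(" ", text)
    # Lowercase, split on whitespace, and drop stopwords.
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


sample = "Contact @jdoe at https://example.com/cv. Skilled in Python, SQL & Excel!"
cleaned = clean_text(sample)
print(cleaned)  # contact skilled python sql excel
```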

## Applying functions

In this section, we apply the functions created previously:

* creating a Clean_Resume column that holds the cleaned resume text.
* creating a skills column by lowercasing the text and applying the get_skills function.
* removing duplicates from the skills column.

The output below shows the cleaned resume and skills columns.

In [None]:
data["Clean_Resume"] = clean
data["skills"] = data["Clean_Resume"].str.lower().apply(get_skills)
data["skills"] = data["skills"].apply(unique_skills)
data.head()

## Visualization
Now that we have everything we need, we are going to visualize the job distribution and the skill distribution.

## Jobs Distribution
As the chart shows, our 200 random samples contain a variety of job categories. In this run, Accountant, Business-Development, and Advocate are the top categories; because the sample is random, the ranking can change between runs.

In [None]:
fig = px.histogram(
    data, x="Category", title="Distribution of Jobs Categories"
    ).update_xaxes(categoryorder="total descending")
    
fig.show()