# Introduction

This vignette introduces `resumeanalyser`. At a high level, the package extracts text from a resume and a job description, cleans it, scores how closely the two texts match, and plots the most prominent words.

Example: throughout this vignette we compare a sample resume downloaded from www.heinz.cmu.edu with a sample Software Engineer job description.

## Getting Started with Resume Analyser

This vignette assumes that you have already installed the `resumeanalyser` package, as described in the `Installation` section of the README.md file. To confirm that the package is available in your project, run the following code cell:

In [2]:
import resumeanalyser
print(resumeanalyser.__version__)

0.1.0


# Extracting Text 
To compare the sample resume with the sample job description, we first extract the text from each. The sample resume, downloaded from Carnegie Mellon University's Heinz College, is stored as a PDF in the `data` subdirectory under the `tests` directory of this repository, and the sample job description is stored as a .docx document in the same directory. `resumeanalyser` can read both .docx and .pdf documents, so we start by reading the text from both sources. This text is the basis of the analysis for the rest of this vignette.

In [3]:
# Import the text extraction functions
from resumeanalyser.text_reading import pdf_to_text, docx_to_text

Now that our imports are done, we will be using `pdf_to_text` to read the text from the sample resume, and will store this as a string. This function takes in a pathname ending in `.pdf` as an input.

In [18]:
resume_pdf_path = "../tests/data/msppm-sample-resume.pdf"
sample_resume_text = pdf_to_text(resume_pdf_path)
print(sample_resume_text[0:50])

Polly Seapsea@andrew.cmu.edu | 412.889.4687 | link


The resume has now been read and stored in the variable `sample_resume_text`. Next, we use the function `docx_to_text` to extract the text from the job description and store it as a string. It takes a path name ending in `.docx` as input. If you run into problems with either of the text reading functions, check your file path and make sure you are using the function that matches the file type.

In [19]:
job_desc_path = "../tests/data/software_engineer_job_description.docx"
job_desc_text = docx_to_text(job_desc_path)
print(job_desc_text[0:50])

Software Engineer Job Description We are looking f
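
Since each reader expects a specific file extension, one way to avoid calling the wrong function is to choose it from the path's suffix. The helper below is a minimal sketch and is not part of `resumeanalyser`: `reader_name_for` and `READER_BY_SUFFIX` are hypothetical names, and the helper only returns the name of the function to call so that it runs without the package installed.

```python
from pathlib import Path

# Maps each supported extension to the reader function described above.
# This mapping is an illustration, not something the package provides.
READER_BY_SUFFIX = {".pdf": "pdf_to_text", ".docx": "docx_to_text"}


def reader_name_for(path):
    """Return the name of the text reading function that matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READER_BY_SUFFIX:
        raise ValueError(f"Unsupported file type: {suffix or 'no extension'}")
    return READER_BY_SUFFIX[suffix]


print(reader_name_for("../tests/data/msppm-sample-resume.pdf"))
print(reader_name_for("../tests/data/software_engineer_job_description.docx"))
```

Failing early with a `ValueError` for any other extension makes a wrong path or file type easier to spot than an error raised deep inside a reader.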


# Example Usage of Text Cleaning Functions

`resumeanalyser` offers a series of text cleaning functions. These cover:
1. Removing punctuation
2. Tokenization
3. Converting to lower case
4. Removing stop words
5. Lemmatization

You can apply these functions either step-by-step to understand each part of the text cleaning process, 
or you can use the `clean_text` function to apply all these steps in one go for convenience.

In [8]:
from resumeanalyser.text_cleaning import remove_punctuation, tokenize, to_lower, remove_stop_words, lemmatize

NLTK WordNet and stopwords downloaded successfully.


[nltk_data] Downloading package wordnet to /Users/user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
# Example text
sample_text = "The cats are chasing the mice, and one mouse is running faster than the others."

## Demonstrating step-by-step process

In [10]:
# Step 1: Remove punctuation
no_punctuation = remove_punctuation(sample_text)
print("Text without Punctuation:", no_punctuation)

Text without Punctuation: The cats are chasing the mice and one mouse is running faster than the others


In [11]:
# Step 2: Tokenize the text
tokens = tokenize(no_punctuation)
print("Tokenized Text:", tokens)

Tokenized Text: ['The', 'cats', 'are', 'chasing', 'the', 'mice', 'and', 'one', 'mouse', 'is', 'running', 'faster', 'than', 'the', 'others']


In [12]:
# Step 3: Convert to lower case
lower_tokens = to_lower(tokens)
print("Lowercase Tokens:", lower_tokens)

Lowercase Tokens: ['the', 'cats', 'are', 'chasing', 'the', 'mice', 'and', 'one', 'mouse', 'is', 'running', 'faster', 'than', 'the', 'others']


In [13]:
# Step 4: Remove stop words
no_stop_words = remove_stop_words(lower_tokens)
print("Tokens without Stop Words:", no_stop_words)

Tokens without Stop Words: ['cats', 'chasing', 'mice', 'one', 'mouse', 'running', 'faster', 'others']


In [14]:
# Step 5: Lemmatize
lemmatized_tokens = lemmatize(no_stop_words)
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['cat', 'chasing', 'mouse', 'one', 'mouse', 'running', 'faster', 'others']


## Using the clean_text function for an all-in-one solution

In [15]:
from resumeanalyser.text_cleaning import clean_text
cleaned_text = clean_text(sample_text)
print("Cleaned Text:", cleaned_text)

Cleaned Text: cat chasing mouse one mouse running faster others
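
To give a feel for what the first four steps do, here is a minimal sketch that uses only the Python standard library. It is not the package's implementation: `simple_clean` and the tiny `STOP_WORDS` set are hypothetical, the package appears to use NLTK's stop word list instead, and lemmatization is left out because it needs WordNet.

```python
import string

# Hypothetical, deliberately tiny stop word list, for illustration only.
STOP_WORDS = {"the", "are", "and", "is", "than"}


def simple_clean(text):
    """Remove punctuation, tokenize on whitespace, lowercase, and drop stop words."""
    no_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punctuation.split()
    lower_tokens = [token.lower() for token in tokens]
    return [token for token in lower_tokens if token not in STOP_WORDS]


sample_text = "The cats are chasing the mice, and one mouse is running faster than the others."
print(simple_clean(sample_text))
# ['cats', 'chasing', 'mice', 'one', 'mouse', 'running', 'faster', 'others']
```

The order matters: lowercasing before stop word removal ensures that "The" at the start of a sentence is matched against the lowercase stop word list.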


# Example Usage of Metric Functions for Comparing Two Texts

`resumeanalyser` offers two functions to compare the two texts provided by the user:
1. Literal Text Matching
2. Semantic Text Matching

## Literal Text Matching

Literal text matching typically refers to a measure of how closely two pieces of text align character by character or word by word, without considering variations or synonyms.

In [16]:
from resumeanalyser.metrics import SimilarityCV

literal_match_score = SimilarityCV("I am studying Data Science at UBC", "There are many good sources to study Data Science online")
print("Literal Match Score:", literal_match_score)

ModuleNotFoundError: No module named 'sklearn'
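
The cell above stopped at an import error, so no score is shown. The `sklearn` module is provided by the scikit-learn package, so installing it in the active environment should let the import succeed. In the meantime, the sketch below illustrates one common form of word-by-word matching: cosine similarity between word count vectors, using only the standard library. It is an assumption that `SimilarityCV` works along these lines; `count_cosine_similarity` is a hypothetical name, and its scores may differ from the package's.

```python
import math
from collections import Counter


def count_cosine_similarity(text_a, text_b):
    """Cosine similarity between the word count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    shared_words = set(counts_a) & set(counts_b)
    dot_product = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has nothing to match
    return dot_product / (norm_a * norm_b)


score = count_cosine_similarity(
    "I am studying Data Science at UBC",
    "There are many good sources to study Data Science online",
)
print("Count-based similarity:", round(score, 3))
```

Only "data" and "science" are shared here. "studying" and "study" count as different words, which is exactly the limitation that semantic matching is meant to address.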

## Semantic Text Matching

Semantic Text Matching measures the similarity in meaning between two pieces of text. Unlike literal or exact match scores, semantic matching takes into account the context, synonyms, and related concepts to determine how closely the content aligns in terms of intent or significance. 

In [None]:
from resumeanalyser.metrics import SimilaritySpacy

semantic_match_score = SimilaritySpacy("I am studying Data Science at UBC", "There are many good sources to study Data Science online")
print("Semantic Match Score:", semantic_match_score)

# Examples of Using Plotting Functions of the Package

In [None]:
from resumeanalyser.plotting import plot_wordcloud, plot_topwords, plot_suite

In [None]:
test_text = 'I am going to fill in a test text here the the the a a a a'

Users can plot the word cloud of the input resume/job description text:

In [None]:
fig1 = plot_wordcloud(test_text)

Or plot the highest-frequency words, which are the most relevant in the text:

In [None]:
fig2 = plot_topwords(test_text)
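
To see the kind of counts that sit behind a top-words plot, the frequencies can be computed directly with `collections.Counter`. This is a sketch under assumptions rather than the package's code: `top_words` is a hypothetical helper, and `plot_topwords` may clean the text first, in which case filler words such as "a" and "the" would not appear in its plot.

```python
from collections import Counter


def top_words(text, n=5):
    """Return the n most frequent lowercase words with their counts."""
    tokens = text.lower().split()
    return Counter(tokens).most_common(n)


test_text = 'I am going to fill in a test text here the the the a a a a'
print(top_words(test_text, n=2))
# [('a', 5), ('the', 3)]
```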

It is also possible to combine both plots in a single figure with `plot_suite`:

In [None]:
fig3 = plot_suite(test_text)