<h1><center>Text Similarity</center></h1>

<h2>Part 1. PDF Extraction</h2>

<h3>1.1 Libraries</h3>

<ul>
    <li>io</li>
    <li>os</li>
</ul>
<b>PDF</b>
<ul>
    <li>pdfminer</li>
</ul>

In [14]:
import io
import os

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFSyntaxError

<h3>1.2 Text Extract Functions</h3>

In [15]:
# https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
def extract_text_from_pdf(pdf_path):
    '''
    Helper function to extract the plain text from .pdf files
    :param pdf_path: path to PDF file to be extracted (remote or local)
    :return: iterator of string of extracted text
    '''

    if not isinstance(pdf_path, io.BytesIO):
        # extract text from local pdf file
        with open(pdf_path, 'rb') as file:
            try:
                for page in PDFPage.get_pages(
                        file,
                        caching=True,
                        check_extractable=True
                ):
                    resource_manager = PDFResourceManager()
                    fake_file_handle = io.StringIO()
                    converter = TextConverter(
                        resource_manager,
                        fake_file_handle,
#                         codec='utf-8',
                        laparams=LAParams()
                    )
                    page_interpreter = PDFPageInterpreter(
                        resource_manager,
                        converter
                    )
                    page_interpreter.process_page(page)

                    text = fake_file_handle.getvalue()
                    yield text

                    # close open handles
                    converter.close()
                    fake_file_handle.close()
            except PDFSyntaxError:
                return
    else:
        # extract text from remote pdf file
        try:
            for page in PDFPage.get_pages(
                    pdf_path,
                    caching=True,
                    check_extractable=True
            ):
                resource_manager = PDFResourceManager()
                fake_file_handle = io.StringIO()
                converter = TextConverter(
                    resource_manager,
                    fake_file_handle,
                    codec='utf-8',
                    laparams=LAParams()
                )
                page_interpreter = PDFPageInterpreter(
                    resource_manager,
                    converter
                )
                page_interpreter.process_page(page)

                text = fake_file_handle.getvalue()
                yield text

                # close open handles
                converter.close()
                fake_file_handle.close()
        except PDFSyntaxError:
            return

def extract_text(file_path): 
    text = ''
    for page in extract_text_from_pdf(file_path):
            text += ' ' + page

    return text

In [5]:
pdf_path = 'C:\\Users\\Charles Lebensdauer\\Downloads\\sampleres.pdf'

<h2>Part 2. Text Similarity</h2>

In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import spacy
from spacy.matcher import Matcher
#import tkinter as tk
#from tkinter.filedialog import askopenfilename

In [None]:
def cosine_distance_wordembedding_method(s1, s2):
    import scipy
    vector_1 = np.mean([model[word] for word in preprocess(s1)],axis=0)
    vector_2 = np.mean([model[word] for word in preprocess(s2)],axis=0)
    cosine = scipy.spatial.distance.cosine(vector_1, vector_2)
    print('Word Embedding method with a cosine distance asses that our two sentences are similar to',round((1-cosine)*100,2),'%')

In [104]:
# from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np

In [36]:
resume = extract_text(pdf_path)

In [31]:
def process(file):
    # Store the resume in a variable
    #filename = askopenfilename()
    filename = file
    resume = extract_text(filename)

    # Print the resume
    #print(resume)
    stat = dict()

    for filename in os.listdir("./test_job"):
        # Store the job description into a variable
        with open("./test_job/"+filename,'r+',encoding='utf-8') as file:
            job_description = file.read()
#         job_description = docx2txt.process("./test_job/"+filename)

        # Print the job description
        print(job_description)

        # A list of text
        text = [resume, job_description]

        cv = CountVectorizer()
        count_matrix = cv.fit_transform(text)

        #Print the similarity scores
        #print("\nSimilarity Scores:")
        #print(cosine_similarity(count_matrix))

        #get the match percentage
        matchPercentage = cosine_similarity(count_matrix)[0][1] * 100
        matchPercentage = round(matchPercentage, 2) # round to two decimal
        stat[(resume,filename)] = matchPercentage
        print("Your resume matches about "+ str(matchPercentage)+ "% of the job description:"+ filename)

    match = Counter(stat)
    top5 = match.most_common(5)
    output = 'Your top job recommendations are:'
    for (temp_resume,temp_match) in top5:
        print(temp_resume[1],temp_match,"% matching")
        output += "\n"+str(temp_resume[1][:-4])+" "+str(temp_match)+" % macthing"+"|"
    print(output)
    return output

In [34]:
with open("./test_job/"+'Business Analyst.txt','r+',encoding='utf-8') as file:
            job_description = file.read()

In [54]:
with open("./test_job/"+'Data Scientist.txt','r+',encoding='utf-8') as file:
            job_description = file.read()

In [42]:
print(job_description)

﻿Business Analyst
We’re looking for a Business Analyst to evaluate business processes, identify startup needs, and develop strategies to maximize opportunities for the various Forkaia incubator companies. The Business Analyst may work with business, IT and test systems. Some create documentation and manuals. Business Analysts interact with developers, stakeholders, system architects and various subject experts.
Responsibilities
Collecting and analyzing data for potential business expansion
Identifying specific business opportunities
Influencing stakeholders to support business projects
Leading projects and coordinating with other teams to produce better business outcomes
Testing business processes and recommending improvements
Skills and Qualifications
Excellent written and verbal communication skills
Great analytical, critical thinking and problem-solving abilities
Superior presentation and negotiation skills
Proven management and organizational skills
Strong adaptability and capacity

In [43]:
print(resume)

 Data Scientist
ROBERT SMITH

Phone: (123) 456 78 99 
Email: info@qwikresume.com
Website: www.qwikresume.com
LinkedIn:
linkedin.com/qwikresume
Address: 1737 Marshville Road,
Alabama.

Objective
Data Scientist with PhD in Physics and 1+ industrial experience. Two years of working experience 
in Data Analysis team of LIGO Scientific Collaboration [$3M Special Breakthrough Prize winner of 
2016]. Over ten years of successful research experience in both theoretical and computational 
physics. Strong problem-solving and analytical skills. Advanced programming proficiency. Certified
in Data Analysis and Machine Learning.
Skills

Data Mining, Data Analysis, Machine Learning, Python, R, MATLAB, Sphinx, LaTeX, Mathematica, 
Maple, GIT, CVS, HTCondor.
Work Experience
Data Scientist
ABC Corporation ­ May 1994 – May 2005 
 Assisted in determining client needs, deliverable design, estimates and feasibility for 

analytical projects concerning a custom study for a manufacturer who is using the resu

In [74]:
experience = '''Data Scientist with PhD in Physics and 1+ industrial experience. Two years of working experience 
in Data Analysis team of LIGO Scientific Collaboration [$3M Special Breakthrough Prize winner of 
2016]. Over ten years of successful research experience in both theoretical and computational 
physics. Strong problem-solving and analytical skills. Advanced programming proficiency. Certified
in Data Analysis and Machine Learning.
Skills

Data Mining, Data Analysis, Machine Learning, Python, R, MATLAB, Sphinx, LaTeX, Mathematica, 
Maple, GIT, CVS, HTCondor.
Work Experience
Data Scientist
ABC Corporation ­ May 1994 – May 2005 
 Assisted in determining client needs, deliverable design, estimates and feasibility for 

analytical projects concerning a custom study for a manufacturer who is using the results to 
support a litigation claim.

 Served as an internal resource for Jacknife programming and documentation.
 Designed and developed small scale deliverables related to the custom study.
 Participated in the Post Project Review QIP team.
 Responsible for results reporting in the appropriate media and creation of supporting 
 Monitored products from statistical programs for accuracy, consistency and statistical validity.
 Designed and applied statistical and mathematical methods for corporate analytics that were 

documentation for the client.

implemented into client-facing products.

Data Scientist
ABC Corporation ­ 1993 – 1994 
 Maintained automated ETL for reporting.


Implemented Data mining and machine learning algorithms to describe and predict user 
behavior on various retailer websites.
I revamped their &quot;Predictive Marketing&quot; process to be more data driven and 
profitable.
increase in conversion.


 The new process was able to hone in on more useful user segments that had a significant 
 Skills Used Data Cleansing and Data Analysis using Python, Scala, R and Spark.
 Cloud computing on AWS.
 Automation of reporting..
'''
experience = ''.join([l if l.isalpha() else ' ' for l in experience])

In [99]:
#load the spacy module
#Use the large module
nlp = spacy.load('en_core_web_lg')

#list of stopwords in english
#Added -PRON- since spacy will convert every pronouns to -PRON-
stopwords_list = stopwords.words('english') + ['-PRON-']

def remove_stopwords(text):
    no_stopwords = [w for w in text if w not in stopwords_list]
    return no_stopwords

#use spacy to lemmatize each word
def lemma(text):
    text = text.lower()
    doc = nlp(text)
    doc_lemma = " ".join(token.lemma_ for token in doc)
    word_list = word_tokenize(doc_lemma)
    word_list = remove_stopwords(word_list)
    return word_list

In [105]:
experience_lemma = lemma(experience)
find_skill_score(experience_lemma)

array([15.,  2.,  1.,  1.,  1.,  0.])

In [108]:
resume_lemma = lemma(resume_text)
score = find_skill_score(resume_lemma)
score/score.sum()

array([0.75, 0.1 , 0.05, 0.05, 0.05, 0.  ])

In [101]:
#Define lists of keywords that are related to each job
#The first list contains single words,
#while the second list contains phrases

n_job = 6

data_skill = [
              [#single words
                  'data','datum',
                  'sql','mysql','postgresql','python','r',
                  'tabeleau','sas','spark','scala',
                  'database','model','mine','mining',
                  'clean','cleaning',
                  'preprocess','preprocessing'                                       #skills
                  'analysis','analyze','analytic',                                   #analyze
                  'statistics','statistical','visualization',                        #statistics, visualization
                  'ml','machine','scikit','sklearn','keras',
                  'regression','forest','classify',
                  'predictive','prescriptive'                                        #machine learning
              ],
              [#phrases
                  'data scientist','datum scientist',
                  'data analytic','datum analytic',
                  'data analysis','datum analysis',
                  'datum cleaning','datum mining','datum structure',                 #skills
                  'business analytic','business intelligence',
                  'business analysis','business analyst',                            #business-related
                  'machine learning','deep learning','data modeling',
                  'natural language processing'
              ]
             ]

software_skill = [
                  [#single words
                      'python','java','r','c','linux','oracle',                      #languages
                      'software','backend'
                  ],          
                  [#phrases
                      'computer science','computer engineering',
                      'software engineer','software development',
                      'software engineering','software developer'
                  ]
                 ]
business_skill = [
                  [#single words
                      'business','market',
                      'marketing','promotion','campaign',
                      'social','network','networking','ad',
                      'consumer'
                  ],
                  [#phrases
                      'business development',
                      'market research','social medium','social network'
                  ]
                 ]

dev_skill = [
             [#single words
                'js''html','css','javascript','node',                                #languages
                 'web','website','frontend',
                 'ui','ux','uiux'                                                    #ui, ux design
             ],
             [#phrases
                 'web development','web design',
                 'web application','web app','web services',
                 'user interface', 'user experience'
             ]
            ]


create_skill = [
                [#single words
                    'design','photography','illustration',                          #graphics
                    'adobe','photoshop','illustrator','lightroom','premiere',
                    'ps','ai','ae','pr',                                            #adobe suite softwares
                    'write', 'writing','verbal', 'communication',
                    'writer','editor'                                               #content production
                ],
                [#phrases
                    'graphic design','adobe suite'
                ]
               ]

admin_skill = [
               [#single words
                   'finance','financial','accounting',
                   'management'
               ],
               [#phrases
                   'project management'
               ]
              ]


# 1 - Data Scientist
# 2 - Software Engineer
# 3 - Web Developer
# 4 - Business Development
# 5 - Creative Personnel
# 6 - Administration
job_dict = {1 : 'Data Scientist',
            2 : 'Software Engineer',
            3 : 'Web Developer',
            4 : 'Business Development',
            5 : 'Creative Personnel',
            6 : 'Administration'}

skill_dict = {1 : data_skill,
              2 : software_skill,
              3 : dev_skill,
              4 : business_skill,
              5 : create_skill,
              6 : admin_skill}

def find_skill_score(text):
    skill_score = np.zeros(n_job)
    for (key, val) in skill_dict.items():
        for v in val[0]:
            if v in text:
                skill_score[key-1]+=1
        for v in val[1]:
            if v in " ".join(text):
                skill_score[key-1]+=1
    return skill_score

def find_max(arr):
    if (arr == np.zeros(n_job)).all():
        return 0
    else:
        return arr.argmax() + 1

In [76]:
text = [experience, job_description]

cv = CountVectorizer()
count_matrix = cv.fit_transform(text)
# cv.vocabulary_

# Print the similarity scores
print("\nSimilarity Scores:")
print(cosine_similarity(count_matrix)[0][1])


Similarity Scores:
0.6456226260648081


In [66]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [69]:
resume_text = ''.join([l if l.isalpha() else ' ' for l in resume])

In [70]:
print(resume_text)

 Data Scientist ROBERT SMITH  Phone                   Email  info qwikresume com Website  www qwikresume com LinkedIn  linkedin com qwikresume Address       Marshville Road  Alabama   Objective Data Scientist with PhD in Physics and    industrial experience  Two years of working experience  in Data Analysis team of LIGO Scientific Collaboration    M Special Breakthrough Prize winner of         Over ten years of successful research experience in both theoretical and computational  physics  Strong problem solving and analytical skills  Advanced programming proficiency  Certified in Data Analysis and Machine Learning  Skills  Data Mining  Data Analysis  Machine Learning  Python  R  MATLAB  Sphinx  LaTeX  Mathematica   Maple  GIT  CVS  HTCondor  Work Experience Data Scientist ABC Corporation   May        May         Assisted in determining client needs  deliverable design  estimates and feasibility for   analytical projects concerning a custom study for a manufacturer who is using the resu

In [55]:
text = [resume, job_description]

cv = CountVectorizer()
count_matrix = cv.fit_transform(text)
# cv.vocabulary_

# Print the similarity scores
print("\nSimilarity Scores:")
print(cosine_similarity(count_matrix)[0][1])

#get the match percentage
# matchPercentage = cosine_similarity(count_matrix)[0][1] * 100
# matchPercentage = round(matchPercentage, 2) # round to two decimal
# stat[(resume,filename)] = matchPercentage


Similarity Scores:
0.6299192852801511


In [71]:
text = [resume_text, job_description]

cv = CountVectorizer()
count_matrix = cv.fit_transform(text)
# cv.vocabulary_

# Print the similarity scores
print("\nSimilarity Scores:")
print(cosine_similarity(count_matrix)[0][1])


Similarity Scores:
0.6334913752934328


In [33]:
print(process(pdf_path))

﻿Business Analyst
We’re looking for a Business Analyst to evaluate business processes, identify startup needs, and develop strategies to maximize opportunities for the various Forkaia incubator companies. The Business Analyst may work with business, IT and test systems. Some create documentation and manuals. Business Analysts interact with developers, stakeholders, system architects and various subject experts.
Responsibilities
Collecting and analyzing data for potential business expansion
Identifying specific business opportunities
Influencing stakeholders to support business projects
Leading projects and coordinating with other teams to produce better business outcomes
Testing business processes and recommending improvements
Skills and Qualifications
Excellent written and verbal communication skills
Great analytical, critical thinking and problem-solving abilities
Superior presentation and negotiation skills
Proven management and organizational skills
Strong adaptability and capacity