<h1><center>Text Similarity</center></h1>

<h2>Part 1. Resume Preparation</h2>

<h3>1.1 Libraries</h3>

In [1]:
import os
import time

# custom modules
import resume
import indeed_job_scraper as indeed

import pandas as pd
import numpy as np

import re

<h3>1.2 Extract the Resume Text and Its Objective and Experience Sections</h3>

In [1]:
# Sample Resume
# pdf_path = 'C:\\Users\\Charles Lebensdauer\\Downloads\\sampleres.pdf'
pdf_path = 'C:\\Users\\Charles Lebensdauer\\Downloads\\sampleres1.pdf'

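# Extract the raw resume text (also used for skill matching in Part 4)
res = resume.extract_text(pdf_path)

# Manual alternative: join the objective and experience sections directly
# res_sec = resume.extract_entity_sections_grad(res)
# obj = ' '.join(res_sec['objective'])
# exp = ' '.join(res_sec['experience'])
# obj_exp = obj + ' ' + exp

# Objective + experience text, compared against the job descriptions in Part 3
obj_exp = resume.get_obj_exp(pdf_path)
# print(obj_exp)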

<h2>Part 2. Job Description Preparation</h2>

<h3>2.1 Using Job Descriptions Posted on Indeed</h3>

<h4>Example 1: Data Scientist at Irvine</h4>

In [66]:
job_title = 'Data Scientist'
location = 'Irvine'
job_dict = indeed.get_indeed_job(job_title,location)
pd.DataFrame(job_dict).to_csv('indeed_job_'
                              +job_title.lower().replace(' ','_')
                              +'_'
                              +location.lower().replace(' ','_')
                              +'.csv',index=False)

<h4>Example 2: Data Analyst at Irvine</h4>

In [3]:
job_title = 'Data Analyst'
location = 'Irvine'

In [3]:
job_dict = indeed.get_indeed_job(job_title,location)
pd.DataFrame(job_dict).to_csv('indeed_job_'
                              +job_title.lower().replace(' ','_')
                              +'_'
                              +location.lower().replace(' ','_')
                              +'.csv',index=False)

<h4>Alternative: Read from saved csv files</h4>
Read the saved job listings that were previously scraped from Indeed.<br>
Using saved listings makes it possible to test the resume matching algorithm against a fixed dataset, unaffected by changes to the job postings on the live Indeed site.

In [4]:
job_df = pd.read_csv('indeed_job_'
                     +job_title.lower().replace(' ','_')
                     +'_'
                     +location.lower().replace(' ','_')
                     +'.csv')
job_df = job_df.dropna().drop_duplicates()
job_dict = job_df.to_dict(orient='list')

In [2]:
import resume_matching as rm

In [3]:
rm.resume_match(pdf_path)

Unnamed: 0,job,company,description
5,Data Scientist,Karma Automotive LLC,\n\nOverview:\n\n\nSouthern California-based K...
31,Senior Business/Data Analyst,Accurate Background,\n\nOverview:\n\n\nWe are looking for an exper...
50,Manager of Data Science,Niagara Bottling,"\n\n\nAt Niagara, we’re looking for Team Membe..."
20,"VP, Global Data & Analytics - Antech",Mars Petcare,\n\n\n\nDATA &amp; ANALYTICS AT MARS PETCARE\n...
2,Senior Financial Data Analyst w/PE backed Heal...,Alliance Resource Group,\n\n\nSenior Financial Data Analyst w/healthca...
44,Lead Data Scientist,Pacific Life,"\nPacific Life is looking to invest in bright,..."
43,Senior Financial Data Analyst,Alignment Healthcare,\n\n\n\nJob Number:\n 1966\n\n\n\nPosition Tit...
1,Senior Data Analyst (Mortgage Banking),Matrix Resources,\n\n\nThis nationwide mortgage industry leader...
54,Sr. Principal Data Analyst/Engineer,Mr. Cooper,\n\n\n\n\nReady to be a Cooper too? This might...
9,Data Scientist,Driveway,\n\n\nWe are looking for a Data Scientist who ...


In [4]:
!pip freeze > req.txt

<h2>Part 3. Text Similarity</h2>

<h3>3.1. Preprocessing Functions</h3>
Remove non-alphabetic characters and stopwords from the text so that punctuation, digits, and common function words do not distort the text similarity comparisons.

<h4>Libraries</h4>

In [7]:
import scipy

import re

#nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

<h4>Preprocess Function</h4>

In [33]:
stopwords_list = stopwords.words('english')

def preprocess(text):
    '''
    Remove line breaks, non-alphabetic characters, and stopwords from the
    input text and return the remaining words as a list.
    '''
    text = text.lower().replace('\n',' ')
    text = re.sub('[^A-Za-z]',' ',text)

    words = word_tokenize(text)
    no_stopwords = [w for w in words if w not in stopwords_list]
    return no_stopwords

<h3>3.2. CountVectorizer + Cosine Similarity</h3>

<h4>Libraries</h4>

In [5]:
#sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

<h4>Using the full scraped description</h4>

In [124]:
t = time.time()

job_match_df = {'job':[],'count_vectorizer':[],'tfidf_vectorizer':[],'company':[],'description':[]}
for i in range(0,len(job_dict['title'])):
    job_match_df['job'].append(job_dict['title'][i])
    
    #text similarity
    comp_text = [obj_exp, job_dict['description'][i].replace('\n',' ')]
    #count vectorizer
    cv = CountVectorizer(stop_words='english')
    cv_fit = cv.fit_transform(comp_text)
    job_match_df['count_vectorizer'].append(cosine_similarity(cv_fit)[0][1])
    #tfidf vectorizer
    tv = TfidfVectorizer(use_idf=True, stop_words='english')
    tv_fit = tv.fit_transform(comp_text)
    job_match_df['tfidf_vectorizer'].append(cosine_similarity(tv_fit)[0][1])
    
    job_match_df['company'].append(job_dict['company'][i])
    job_match_df['description'].append(job_dict['description'][i])
df_cv = pd.DataFrame(job_match_df).sort_values(by='count_vectorizer',ascending=False)

print(time.time()-t,'seconds')

0.7180328369140625 seconds


In [126]:
df_cv.head()

Unnamed: 0,job,count_vectorizer,tfidf_vectorizer,company,description
31,Quality Data Analyst,0.562781,0.405291,66X,"\n\n\n\n\nQUALITY DATA ANALYST - SANTA ANA, CA..."
18,QA Data Analyst,0.557901,0.410783,PriceSpider,\n\nPriceSpider is a retail technology company...
119,Director of Data Analytics,0.555305,0.411729,MobilityWare,\n\n\n\nCompany Summary\n\n\n\n\nLooking for a...
106,Senior Data Analytics Analyst (Corporate Quality),0.554732,0.401082,Edwards Lifesciences,\n\n\nThis Senior Data Analytics Analyst for C...
22,Data Analyst,0.544337,0.383506,Arize Corporation,\n\n\nTitle: \n Data Analyst\n\n\nCompany: \n ...


In [157]:
df_cv.sort_values(by='tfidf_vectorizer',ascending=False).head()

Unnamed: 0,job,count_vectorizer,tfidf_vectorizer,company,description
119,Director of Data Analytics,0.555305,0.411729,MobilityWare,\n\n\n\nCompany Summary\n\n\n\n\nLooking for a...
18,QA Data Analyst,0.557901,0.410783,PriceSpider,\n\nPriceSpider is a retail technology company...
31,Quality Data Analyst,0.562781,0.405291,66X,"\n\n\n\n\nQUALITY DATA ANALYST - SANTA ANA, CA..."
106,Senior Data Analytics Analyst (Corporate Quality),0.554732,0.401082,Edwards Lifesciences,\n\n\nThis Senior Data Analytics Analyst for C...
22,Data Analyst,0.544337,0.383506,Arize Corporation,\n\n\nTitle: \n Data Analyst\n\n\nCompany: \n ...


In [149]:
print(df_cv.iloc[0,4])






QUALITY DATA ANALYST - SANTA ANA, CA




Position Summary



The Data Analyst oversees and conducts the conversion of data into insights that will lead to informed business and clinical decisions. The position works directly with chief quality officer and the top management and executives within the different departments. The Data Analyst will handle multiple simultaneous tasks, prioritize work, and remain functional under pressure, and aggressive timelines.



Responsibilities




Develop and implement databases, data collection systems, data analytics and other strategies that optimize statistical efficiency and quality of master data.


Interpret data, analyze results using statistical techniques and provide ongoing reports.


Develop and implement databases, data collection systems, data analytics and other strategies that optimize statistical efficiency and quality.


Filter and “clean” data by reviewing computer reports, printouts, and performance indicators to locate and co

In [148]:
print(df_cv.sort_values(by='tfidf_vectorizer',ascending=False).iloc[0,4])





Company Summary




Looking for an innovative, creative, passionate, and fun company in OC? Then, let us introduce ourselves. We are MobilityWare and we make fun for a living! Our mobile games have been rocking the app store since its inception and we regularly show up in top lists for most popular games. We recently reached over 350 million downloads across our portfolio of games!



We have been voted 2015, 2016, 2017 &amp; 2018 Best Places to Work in Orange County by the Orange County Business Journal and OC Register! Headquartered in Irvine, CA., our flagship game, Solitaire, was released on the day the App Store opened in 2008. Other titles include Blackjack, FreeCell, Jigsaw Puzzle and Spider Solitaire, to name a few. Our mission is to bring JOY to others one game at a time.








Position Summary




MobilityWare is looking for a Director of Data Analytics that will be heading multiple teams of passionate and skilled product analysts monetization/marketing analyst and busi

Check the top match given by count vectorizer:

In [120]:
print(df_cv.iloc[0,4])

quality data analyst   santa ana  ca position summary the data analyst oversees and conducts the conversion of data into insights that will lead to informed business and clinical decisions the position works directly with chief quality officer and the top management and executives within the different departments the data analyst will handle multiple simultaneous tasks  prioritize work  and remain functional under pressure  and aggressive timelines responsibilities develop and implement databases  data collection systems  data analytics and other strategies that optimize statistical efficiency and quality of master data interpret data  analyze results using statistical techniques and provide ongoing reports develop and implement databases  data collection systems  data analytics and other strategies that optimize statistical efficiency and quality filter and  clean  data by reviewing computer reports  printouts  and performance indicators to locate and correct code problems creates and

As expected, neither the count vectorizer nor the tfidf vectorizer is a very accurate model for resume matching on its own. None of the top jobs matched by either vectorizer is a business-related role, even though the experience described in the sample resume is closely related to business.<br>
However, the vectorizers can be combined with other models to give a more accurate result.
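One reason term-overlap measures can miss relevant jobs is that they only reward shared vocabulary. The following is a minimal plain-Python sketch of count-based cosine similarity. It is a simplified stand-in for CountVectorizer + cosine_similarity (no scikit-learn tokenization or stop-word handling), and the example sentences are made up for illustration rather than taken from the scraped data.

```python
import math
import re
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    counts_a = Counter(re.findall(r"[a-z]+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z]+", text_b.lower()))
    # Only words present in both texts contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical snippets for illustration only
resume_text = "analyzed sales data and built revenue forecasts"
related_job = "examine business metrics to predict income"
overlapping_job = "data entry clerk for sales data"

print(bow_cosine(resume_text, related_job))      # no shared words
print(bow_cosine(resume_text, overlapping_job))  # shares 'sales' and 'data'
```

In this toy case the semantically related posting scores zero while the clerical posting scores higher, which is the kind of mismatch that the Word2Vec-based approach in Section 3.3 tries to reduce.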

<h4>Using the extracted description</h4>

In [121]:
t = time.time()

job_match_df = {'job':[],'count_vectorizer':[],'tfidf_vectorizer':[],'company':[],'extracted_description':[]}
for i in range(0,len(job_dict['title'])):
    job_match_df['job'].append(job_dict['title'][i])
    
    #text similarity
    comp_text = [obj_exp, job_dict['extracted_description'][i]]
    #count vectorizer
    cv = CountVectorizer(stop_words='english')
    cv_fit = cv.fit_transform(comp_text)
    
    job_match_df['count_vectorizer'].append(cosine_similarity(cv_fit)[0][1])
    #tfidf vectorizer
    tv = TfidfVectorizer(use_idf=True, stop_words='english')
    tv_fit = tv.fit_transform(comp_text)
    job_match_df['tfidf_vectorizer'].append(cosine_similarity(tv_fit)[0][1])
    
    job_match_df['company'].append(job_dict['company'][i])
    job_match_df['extracted_description'].append(job_dict['extracted_description'][i])
df_cv_desc = pd.DataFrame(job_match_df).sort_values(by='count_vectorizer',ascending=False)

print(time.time()-t,'seconds')

0.59342360496521 seconds


In [122]:
df_cv_desc.head()

Unnamed: 0,job,count_vectorizer,tfidf_vectorizer,company,extracted_description
31,Quality Data Analyst,0.597865,0.437712,66X,quality data analyst santa ana ca position ...
106,Senior Data Analytics Analyst (Corporate Quality),0.561488,0.405887,Edwards Lifesciences,this senior data analytics analyst for corpora...
22,Data Analyst,0.549341,0.388417,Arize Corporation,title data analyst company arize corporation l...
19,Quality Assurance Data Analyst,0.527388,0.372951,Pro-Dex,job summary responsible for data research col...
18,QA Data Analyst,0.515588,0.359776,PriceSpider,pricespider is a retail technology company fil...


Check the top match given by count vectorizer:

In [123]:
print(df_cv_desc.iloc[0,4])

quality data analyst   santa ana  ca position summary the data analyst oversees and conducts the conversion of data into insights that will lead to informed business and clinical decisions the position works directly with chief quality officer and the top management and executives within the different departments the data analyst will handle multiple simultaneous tasks  prioritize work  and remain functional under pressure  and aggressive timelines responsibilities develop and implement databases  data collection systems  data analytics and other strategies that optimize statistical efficiency and quality of master data interpret data  analyze results using statistical techniques and provide ongoing reports develop and implement databases  data collection systems  data analytics and other strategies that optimize statistical efficiency and quality filter and  clean  data by reviewing computer reports  printouts  and performance indicators to locate and correct code problems creates and

<h3>3.3 Gensim Word2Vec Model + Similarity Matrix</h3>

<h4>Libraries</h4>

In [77]:
t = time.time()
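# NOTE: this notebook follows the gensim 3.x API (e.g. Word2Vec(size=...));
# gensim 4 renamed the `size` parameter to `vector_size`.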
import time

from gensim.models import TfidfModel
from gensim.models import Word2Vec

from gensim import corpora
from gensim.matutils import softcossim 
from gensim.utils import simple_preprocess

from gensim.models.keyedvectors import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix,SoftCosineSimilarity
print(time.time() - t)

0.0009975433349609375


<h4>Using the full scraped description</h4>

In [151]:
t = time.time()

comp_text = [obj_exp] + job_dict['description']
corpus = [preprocess(txt) for txt in comp_text]
dictionary = corpora.Dictionary(corpus)
text_process = [dictionary.doc2bow(txt) for txt in corpus]

tfidf = TfidfModel(dictionary=dictionary)
w2v = Word2Vec(corpus, min_count=5, size=300, seed=12345)
sim_index = WordEmbeddingSimilarityIndex(w2v.wv,threshold=0.0,exponent=2.0)
sim_mat = SparseTermSimilarityMatrix(sim_index, dictionary,
                                     tfidf,
                                     nonzero_limit=100)

job_match_df = {'job':[],'word2vec':[],'company':[],'description':[]}
for i in range(0,len(job_dict['title'])):
    similarity = sim_mat.inner_product(text_process[0], text_process[i+1], normalized=True)
    job_match_df['job'].append(job_dict['title'][i])
    job_match_df['word2vec'].append(similarity)
    job_match_df['company'].append(job_dict['company'][i])
    job_match_df['description'].append(job_dict['description'][i])
df_wvec = pd.DataFrame(job_match_df).sort_values(by='word2vec',ascending=False)

print(time.time()-t,'seconds')

5.538192987442017 seconds


In [152]:
df_wvec.head()

Unnamed: 0,job,word2vec,company,description
139,Sales Business Intelligence Specialist,0.834327,Agility Logistics,\n\n\n\n\n\nThe Sales Business Intelligence Sp...
39,Senior Financial Analyst – (Tableau/Data),0.823427,Experian,\n\n\nExperian is the power behind the data. A...
119,Director of Data Analytics,0.82253,MobilityWare,\n\n\n\nCompany Summary\n\n\n\n\nLooking for a...
24,Data Intelligence Analyst,0.797245,"EMINENT, INC.",\n\n\n\nREVOLVE is the next-generation fashion...
19,Quality Assurance Data Analyst,0.78744,Pro-Dex,\n\n\nJOB SUMMARY: \n\n\nResponsible for data ...


<h4>Using the extracted description</h4>

In [153]:
t = time.time()

comp_text = [obj_exp] + job_dict['extracted_description']
corpus = [preprocess(txt) for txt in comp_text]
dictionary = corpora.Dictionary(corpus)
text_process = [dictionary.doc2bow(txt) for txt in corpus]

tfidf = TfidfModel(dictionary=dictionary)
w2v = Word2Vec(corpus, min_count=5, size=300, seed=12345)
sim_index = WordEmbeddingSimilarityIndex(w2v.wv,threshold=0.0,exponent=2.0)
sim_mat = SparseTermSimilarityMatrix(sim_index, dictionary,
                                     tfidf,
                                     nonzero_limit=100)

job_match_df = {'job':[],'word2vec':[],'company':[],'extracted_description':[]}
for i in range(0,len(job_dict['title'])):
    similarity = sim_mat.inner_product(text_process[0], text_process[i+1], normalized=True)
    job_match_df['job'].append(job_dict['title'][i])
    job_match_df['word2vec'].append(similarity)
    job_match_df['company'].append(job_dict['company'][i])
    job_match_df['extracted_description'].append(job_dict['extracted_description'][i])
df_wvec_desc = pd.DataFrame(job_match_df).sort_values(by='word2vec',ascending=False)

print(time.time()-t,'seconds')

2.3267405033111572 seconds


In [154]:
df_wvec_desc.head()

Unnamed: 0,job,word2vec,company,extracted_description
48,"Analyst, Finance Data Management & Governance",1.0,Edwards Lifesciences,responsible for conducting thorough and insigh...
113,Sr. Medical Economics Analyst,1.0,ConcertoHealth,job summary this position provides analytical ...
99,Sr. Medical Economics Analyst,1.0,"Concerto Healthcare,Inc",job summary this position provides analytical ...
139,Sales Business Intelligence Specialist,1.0,Agility Logistics,the sales business intelligence specialist rep...
106,Senior Data Analytics Analyst (Corporate Quality),1.0,Edwards Lifesciences,this senior data analytics analyst for corpora...


In theory, matching against only the extracted description portion of each job posting should give more accurate text similarity measures.<br>
Currently, the extraction function searches for the header of the "requirement" section and keeps the text before it, since the job description usually appears before that section.<br>
However, this function fails when:
<ol>
    <li>The company uses a different header for the requirement section. In this case, the whole job posting is extracted as the description.
    </li>
    <li>The keywords ("requirement", "experience") used to detect the requirement section header appear before the actual requirement section. In this case, the description is not extracted.
    </li>
</ol>
The second case is the more harmful one, since a good job could be ranked as the worst match simply because its description was not extracted. For now, the whole description in the job posting is used for matching.
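The two failure cases can be reproduced with a small sketch of header-based extraction. This is not the actual extraction function used to build `extracted_description`; the function name, the keyword list, and the sample postings below are assumptions made only for illustration.

```python
import re

# Assumed keyword list; the text above names "requirement" and "experience"
HEADER_KEYWORDS = ("requirement", "experience")

def extract_description(posting):
    """Return the text before the first line that mentions a header keyword.

    If no line mentions a keyword, the whole posting is returned.
    """
    kept = []
    for line in posting.splitlines():
        if any(k in line.lower() for k in HEADER_KEYWORDS):
            break
        kept.append(line)
    # Collapse runs of whitespace so the result is a single clean string
    return re.sub(r"\s+", " ", " ".join(kept)).strip()

# Hypothetical postings for illustration only
normal = "Build dashboards for sales teams.\nRequirements\nSQL and Python"
other_header = "Build dashboards for sales teams.\nWhat you bring\nSQL and Python"
early_keyword = "5 years of experience preferred.\nBuild dashboards.\nRequirements\nSQL"

print(repr(extract_description(normal)))         # description only
print(repr(extract_description(other_header)))   # case 1: whole posting kept
print(repr(extract_description(early_keyword)))  # case 2: nothing extracted
```

An empty result in the second case would give a similarity near zero regardless of how well the job actually fits, which is why falling back to the full description is the safer choice.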

<h2>Part 4. Skill Match</h2>

In [155]:
res_skills = resume.extract_skills(res)

In [158]:
def skill_score(skills):
    '''
    Score how well the resume skills match a job's '|'-separated skills.
    The score is the share of the job's listed skills that appear in the resume,
    plus 0.5 times the share of the resume's skills that the job lists.
    The second term helps to keep jobs that list only a few skills from
    getting a high skill score. Uses the global res_skills.
    '''
    skills = skills.split('|')
    common_skills = list(set(res_skills) & set(skills))
    percent_skills = len(common_skills) / len(skills) + 0.5*(len(common_skills) / len(res_skills))
    return percent_skills
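A short worked example may make the weighting easier to follow. The sketch below restates the same formula with the resume skills passed in explicitly instead of read from the global `res_skills`; the function name and the skill lists are hypothetical.

```python
def skill_score_sketch(job_skills, resume_skills):
    """Same formula as skill_score, with the resume skills passed explicitly."""
    skills = job_skills.split('|')
    common = set(resume_skills) & set(skills)
    # Share of the job's skills found in the resume,
    # plus half the share of the resume's skills listed by the job
    return len(common) / len(skills) + 0.5 * (len(common) / len(resume_skills))

# Hypothetical skills for illustration only
resume_skills = ['python', 'sql', 'excel', 'tableau']

print(skill_score_sketch('python', resume_skills))                    # 1/1 + 0.5 * 1/4
print(skill_score_sketch('python|sql|excel|tableau', resume_skills))  # 4/4 + 0.5 * 4/4
print(skill_score_sketch('python|sql|java', resume_skills))           # 2/3 + 0.5 * 2/4
```

Both of the first two jobs have all of their listed skills covered by the resume, but the job that lists a single skill scores lower because it covers less of the resume. The maximum possible score is 1.5.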

<b>Key Points:</b>
<ol>
    <li>Report Automation</li>
    <li>Data Cleaning</li>
    <li>Data Analysis</li>
    <li>Data Mining</li>
    <li>Machine Learning</li>
    <li>Predict Behavior</li>
</ol>

In [163]:
t = time.time()

comp_text = [obj_exp] + job_dict['description']
corpus = [preprocess(txt) for txt in comp_text]
dictionary = corpora.Dictionary(corpus)
text_process = [dictionary.doc2bow(txt) for txt in corpus]

tfidf = TfidfModel(dictionary=dictionary)
w2v = Word2Vec(corpus, min_count=5, size=300, seed=12345)
sim_index = WordEmbeddingSimilarityIndex(w2v.wv,threshold=0.0,exponent=2.0)
sim_mat = SparseTermSimilarityMatrix(sim_index, dictionary,
                                     tfidf,
                                     nonzero_limit=100)

job_score = {'job':[],'word2vec':[],'tfidf_vectorizer':[],'company':[],'description':[],'skills':[],'skill_score':[]}
for i in range(0,len(job_dict['title'])):
    job_score['job'].append(job_dict['title'][i])
    
    #word2vec
    similarity = sim_mat.inner_product(text_process[0], text_process[i+1], normalized=True)
    job_score['word2vec'].append(similarity)
    
    #tfidf vectorizer
    tv = TfidfVectorizer(use_idf=True, stop_words='english')
    tv_fit = tv.fit_transform([obj_exp, job_dict['description'][i]])
    job_score['tfidf_vectorizer'].append(cosine_similarity(tv_fit)[0][1])
    
    job_score['company'].append(job_dict['company'][i])
    job_score['description'].append(job_dict['description'][i])
    job_score['skills'].append(job_dict['skill'][i])
    job_score['skill_score'].append(skill_score(job_dict['skill'][i]))
    
job_score_df= pd.DataFrame(job_score)



print(time.time()-t,'seconds')

5.596011400222778 seconds


In [164]:
job_score_df['score'] = (
                         (job_score_df.word2vec / job_score_df.word2vec.max())
                       + (job_score_df.tfidf_vectorizer / job_score_df.tfidf_vectorizer.max())
                       + (job_score_df.skill_score / job_score_df.skill_score.max())
                         )
job_score_df = job_score_df.sort_values(by='score',ascending=False)
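
The combined score divides each measure by its maximum over all jobs, so each term is at most 1 and the total is at most 3 (assuming every maximum is positive). A minimal plain-Python sketch of the same calculation, using made-up numbers rather than values from the results table:

```python
def combined_scores(rows, measures=('word2vec', 'tfidf_vectorizer', 'skill_score')):
    """Sum of max-normalized measures for each row (rows is a list of dicts)."""
    maxima = {m: max(row[m] for row in rows) for m in measures}
    return [sum(row[m] / maxima[m] for m in measures) for row in rows]

# Made-up values for illustration only
jobs = [
    {'job': 'A', 'word2vec': 0.8, 'tfidf_vectorizer': 0.2, 'skill_score': 0.4},
    {'job': 'B', 'word2vec': 0.4, 'tfidf_vectorizer': 0.4, 'skill_score': 0.1},
]

for job, score in zip(jobs, combined_scores(jobs)):
    print(job['job'], round(score, 3))
```

Because each measure is scaled by its own maximum, the three measures contribute on a comparable scale even though their raw ranges differ.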

In [165]:
job_score_df

Unnamed: 0,job,word2vec,tfidf_vectorizer,company,description,skills,skill_score,score
9,Data Analyst,0.731261,0.316303,Vaco,\n\n\n\nData Analyst\n\n\n\n\n\nLocation: Anah...,operations|analysis|reporting|technical|report...,0.425000,2.644899
4,SQL Data Analyst,0.714625,0.294543,Vaco,\n\n\n\n*****This is a full time role so at th...,operations|analysis|reporting|technical|report...,0.425000,2.572106
106,Senior Data Analytics Analyst (Corporate Quality),0.775368,0.401082,Edwards Lifesciences,\n\n\nThis Senior Data Analytics Analyst for C...,Analytics|reports|distribution|metrics|modelin...,0.263561,2.523831
68,Senior Business/Data Analyst,0.719693,0.328388,Accurate Background,\n\nOverview:\n\n\nWe are looking for an exper...,reports|analyze|security|legal|certification|a...,0.362500,2.513325
18,QA Data Analyst,0.772718,0.410783,PriceSpider,\n\nPriceSpider is a retail technology company...,retail|Brand|purchasing|marketing|retail|sales...,0.243995,2.498178
...,...,...,...,...,...,...,...,...
36,DATA CONTROL ANALYST,0.535585,0.099011,"Simplex Construction Management, Inc. (Simplex)",\n\n\n\nMinimum 2 years experience in data ent...,controls|controls|construction|controls|data e...,0.000000,0.882561
5,Economy Hotel Data Analyst/Operations Support,0.453626,0.080327,Income Property Investments,\n\n\nPOSITION SUMMARY: \nThe Economy Hotel Da...,Hotel|Operations|Coaching|Staffing|Supervising...,0.000000,0.738925
59,QA Quality Analyst II,0.411289,0.048994,B. Braun Medical Inc.,\n\nOverview:\n\n\n\nAbout B. Braun\n\n\n\n\n\...,healthcare|pharmacy|safety|process|reports|pro...,0.047454,0.723725
95,Board Certified Behavior Analyst,0.238364,0.032820,Acuity Behavior Solutions,\n\n\nManage cases\n\n\nEnsure quality of beha...,acquisition|writing|Analyze|training|certifica...,0.087340,0.570980


In [167]:
for desc in job_score_df['description'].head(5):
    print(desc)
    print('-'*40)





Data Analyst





Location: Anaheim, CA





(***Unfortunately, NO Corp2Corp or 3rd Parties, Local Candidate ONLY Please***)





Responsibilities




The Data Analyst should gather information from various sources and interprets patterns and trends, which can offer ways to improve operations and guide critical business decisions. By analysis and producing reporting, Data Analyst helps to ensure clients are provided with the most complete and accurate data available. Data Analyst will provide first level technical support, run and compile reports, perform manual data import/export, and administers the reporting database as required.



Add identify opportunities for automation


Audit activities




Qualifications





Required Qualifications:




BS degree in Mathematics, Engineering, Statistics, or Computer Science.


Hands-on experience working with relational databases. MS sql Server, MySQL AWS RDS are a plus.


Have experience in managing data mapping/transfers/upload processe

In [168]:
for desc in job_score_df['description'].tail(5):
    print(desc)
    print('-'*40)





Minimum 2 years experience in data entry involving various project controls applications


Support other office personnel in the entry of data into various project controls applications as directed


Demonstrates proficiency in the use of Microsoft Office applications and construction project controls applications (examples: Primavera Project Planner and Primavera Expedition)






----------------------------------------



POSITION SUMMARY: 
The Economy Hotel Data Analyst/Operations Support is responsible for supporting operation of the various Motel 6 properties. Help delivering consistent quality and value to our guests while achieving profit goals, and maintaining a safe, secure and hospitable environment for our Motel 6 guests and Motel 6 Team Members.


PRIMARY DUTIES &amp; RESPONSIBILITIES
: This document in no way states or implies that these are the only duties to be performed by the individual occupying this position. This is a representative list of the general duties a

<h4>Original (Simple Cosine Similarity)</h4>

In [31]:
from collections import Counter

def process(file):
    '''
    Compare the resume at `file` with every job description in ./test_job
    using CountVectorizer + cosine similarity, and return the top 5 matches.
    '''
    # Extract the resume text (a separate name avoids shadowing the resume module)
    resume_text = resume.extract_text(file)

    stat = dict()

    for filename in os.listdir("./test_job"):
        # Store the job description into a variable
        with open("./test_job/"+filename,'r',encoding='utf-8') as f:
            job_description = f.read()
#         job_description = docx2txt.process("./test_job/"+filename)

        # Print the job description
        print(job_description)

        # Vectorize the resume and the job description together
        text = [resume_text, job_description]
        cv = CountVectorizer()
        count_matrix = cv.fit_transform(text)

        # Get the match percentage, rounded to two decimals
        matchPercentage = cosine_similarity(count_matrix)[0][1] * 100
        matchPercentage = round(matchPercentage, 2)
        stat[filename] = matchPercentage
        print("Your resume matches about "+ str(matchPercentage)+ "% of the job description:"+ filename)

    # Report the five best-matching job descriptions
    match = Counter(stat)
    top5 = match.most_common(5)
    output = 'Your top job recommendations are:'
    for (job_file,job_match) in top5:
        print(job_file,job_match,"% matching")
        # [:-4] drops the file extension (e.g. '.txt') from the file name
        output += "\n"+str(job_file[:-4])+" "+str(job_match)+" % matching"+"|"
    print(output)
    return output

In [33]:
print(process(pdf_path))

﻿Business Analyst
We’re looking for a Business Analyst to evaluate business processes, identify startup needs, and develop strategies to maximize opportunities for the various Forkaia incubator companies. The Business Analyst may work with business, IT and test systems. Some create documentation and manuals. Business Analysts interact with developers, stakeholders, system architects and various subject experts.
Responsibilities
Collecting and analyzing data for potential business expansion
Identifying specific business opportunities
Influencing stakeholders to support business projects
Leading projects and coordinating with other teams to produce better business outcomes
Testing business processes and recommending improvements
Skills and Qualifications
Excellent written and verbal communication skills
Great analytical, critical thinking and problem-solving abilities
Superior presentation and negotiation skills
Proven management and organizational skills
Strong adaptability and capacity