In this notebook, I explore how the JobFunnel duplicate filter works and look for ways to improve it. My analysis uses two pickles I scraped from monster.com. The first pickle acts as the 'master_list.csv', which I got using the search query "Java-Python". The second pickle is the scrape job that gets compared to the master list, where "Python" is the search query. The filter itself uses tf-idf and cosine similarity to identify duplicates; more info about both can be [found here](https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/).
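
To make the idea concrete before looking at real scrapes, here is a minimal sketch of tf-idf plus cosine similarity on three made-up blurbs. The blurbs and the `toy_*` names are assumptions for illustration only; the vectorizer settings mirror the ones used in the filter further down.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy blurbs (made up for illustration, not taken from the scrapes).
toy_blurbs = [
    "python developer for embedded test automation",
    "python developer for embedded test automation and firmware",
    "registered nurse for a long term care facility",
]

toy_vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, analyzer='word')
toy_tfidf = toy_vectorizer.fit_transform(toy_blurbs)  # one row per blurb, one column per word
toy_sim = cosine_similarity(toy_tfidf)                # 3 x 3 matrix of pairwise similarities

# Row i, column j holds the similarity between blurb i and blurb j.
print(toy_sim.round(2))
```

The two near-identical blurbs should score far higher against each other than either does against the unrelated one, which is the signal the duplicate filter thresholds on.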

__Note:__ This is extremely rough, and more of something I made for personal use. I only posted it for those who want to see how the internals of the duplicate filter work. I did a little bit of cleanup and added a few links, but I won't be doing much else in this notebook. I'm planning to do a proper data exploration and analysis later, and to post it on GitHub when I get the time.

In [1]:
## Uncomment commented out code if any packages are missing.
#import sys, subprocess
#for pkg in ['scikit-learn', 'nltk']: subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", pkg])
import pickle, pandas as pd, numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from typing import Dict, Optional
from pprint import pprint

In [2]:
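# Open the pickles with context managers so the file handles are closed after loading.
with open("Data/py_jv_2019-12-11.pkl", 'rb') as f:
    master_py_jv: Dict[str, dict] = pickle.load(f)
with open("Data/py_2019-12-11.pkl", 'rb') as f:
    scrape_py: Dict[str, dict] = pickle.load(f)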
pd.DataFrame.from_dict(master_py_jv, orient='index').head(n = 10)

Unnamed: 0,status,title,company,location,date,blurb,link,id,provider
214184580,new,Manufacturing Test Developer,WorldHire Inc,"Waterloo, ON",2019-12-02,Our Waterloo client is need of a Manufacturing...,https://job-openings.monster.ca/manufacturing-...,214184580,monster
214184939,new,Design Verification Specialist,World Hire Inc,"Waterloo, ON",2019-12-02,Our Waterloo client is need of a multi-discipl...,https://job-openings.monster.ca/design-verific...,214184939,monster
214402155,new,Application Security Analyst,Hays,"Waterloo, ON",2019-12-11,Financial Services Client hiring an Applicatio...,https://job-openings.monster.ca/application-se...,214402155,monster
206413964,new,Software Developer,BROCK Solutions,"Kitchener, ON",2019-12-11,Software Developer Brock Solutions is an engin...,https://job-openings.monster.ca/software-devel...,206413964,monster
064e0158-ad2f-473d-a363-b7fdef291d7a,new,Cloud Tools Developer,FLIR Systems,"Waterloo, ON",2019-11-11,"Be visionaryAt FLIR, we have a simple but ambi...",https://job-openings.monster.ca/cloud-tools-de...,064e0158-ad2f-473d-a363-b7fdef291d7a,monster


In [3]:
pd.DataFrame.from_dict(scrape_py, orient='index').head(n = 10)

Unnamed: 0,status,title,company,location,date,blurb,link,id,provider
214184580,new,Manufacturing Test Developer,WorldHire Inc,"Waterloo, ON",2019-12-02,Our Waterloo client is need of a Manufacturing...,https://job-openings.monster.ca/manufacturing-...,214184580,monster
214184939,new,Design Verification Specialist,World Hire Inc,"Waterloo, ON",2019-12-02,Our Waterloo client is need of a multi-discipl...,https://job-openings.monster.ca/design-verific...,214184939,monster
214402155,new,Application Security Analyst,Hays,"Waterloo, ON",2019-12-11,Financial Services Client hiring an Applicatio...,https://job-openings.monster.ca/application-se...,214402155,monster
206413964,new,Software Developer,BROCK Solutions,"Kitchener, ON",2019-12-11,Software Developer Brock Solutions is an engin...,https://job-openings.monster.ca/software-devel...,206413964,monster
52dd0c65-d021-45d8-941d-66299e773eb8,new,Machine Learning Engineer,EMAGIN,"Kitchener, ON",2019-11-26,About UsWe are changing the way water utilitie...,https://job-openings.monster.ca/machine-learni...,52dd0c65-d021-45d8-941d-66299e773eb8,monster
92b0416c-4f65-470b-9e6c-6723e18314f3,new,Autonomy Engineer - Controls,Clearpath Robotics,"Kitchener, ON",2019-11-23,Position: Autonomy Engineer- Controls ...,https://job-openings.monster.ca/autonomy-engin...,92b0416c-4f65-470b-9e6c-6723e18314f3,monster
064e0158-ad2f-473d-a363-b7fdef291d7a,new,Cloud Tools Developer,FLIR Systems,"Waterloo, ON",2019-11-11,"Be visionaryAt FLIR, we have a simple but ambi...",https://job-openings.monster.ca/cloud-tools-de...,064e0158-ad2f-473d-a363-b7fdef291d7a,monster
9c00becc-96d4-4bdb-a2dc-c48c011c537f,new,Autonomy Engineer - Controls,Clearpath Robotics,"Waterloo, ON",2019-11-11,Position: Autonomy Engineer- Controls Loc...,https://job-openings.monster.ca/autonomy-engin...,9c00becc-96d4-4bdb-a2dc-c48c011c537f,monster
28850739-d127-45af-baad-d27a4c1ff316,new,Autonomy Engineer - Perception,Clearpath Robotics,"Kitchener, ON",2019-11-11,Position: Autonomy Engineer - Perception Locat...,https://job-openings.monster.ca/autonomy-engin...,28850739-d127-45af-baad-d27a4c1ff316,monster


Defined below is a modified tfidf_filter function from the [filters.py](https://github.com/PaulMcInnis/JobFunnel/blob/master/jobfunnel/tools/filters.py) module of JobFunnel and some accompanying helper functions.

In [4]:
# Returns a (title, company, id) tuple for the given job id in the input dictionary.
def key_comp(job_d: Dict[str, dict], job_id: str):
    job = job_d[job_id]
    return (job.get('title'), job.get('company'), job_id)

# Returns two lists: the ids and the blurbs of every job in the dictionary, in that order.
def extract(job_list):
    ids = [job['id'] for job in job_list.values()]
    words = [job['blurb'] for job in job_list.values()]
    return ids, words

# Main tf-idf function that outputs the formatted cosine similarity matrix, the vectorizer used, and the raw similarity matrix.
def tfidf_filter(scrape: Dict[str, dict], master: Dict[str, dict], vectorizer = None):
#def tfidf_filter(cur_dict: Dict[str, dict], prev_dict: Dict[str, dict],max_similarity: float = 0.75):
    # init vectorizer
    if vectorizer is None:
        vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, analyzer='word')
    #vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, analyzer='word')
    
    # get reference words as list
    reference_ids, reference_words = extract(master)
     #reference_words = [job['blurb'] for job in prev_dict.values()]
    
    # get query words and ids as lists
    query_ids, query_words = extract(scrape)
     #query_words, query_ids = TfidfVectorizer(strip_accents='unicode', lowercase=True, analyzer='word', 
     #for job in cur_dict.values():
     #    query_words.append(job['blurb'])
     #    query_ids.append(job['id'])
     #query_ids = [job['id'] for job in cur_dict.values()]
     #query_words = [job['blurb'] for job in cur_dict.values()]
    
    # fit vectorizer to entire corpus
    vectorizer.fit(query_words + reference_words)
    
    # set reference tfidf for cosine similarity later
    references = vectorizer.transform(reference_words)
    
    # calculate cosine similarity between reference and current blurbs
    similarities = cosine_similarity(vectorizer.transform(query_words), references)
    
    ## Removed sections
     # Code for removing duplicate jobs (it can be viewed farther down)
     # Code for logging and output
    
    # Creates a formatted DF of our similarity matrix with job_ids attached and
    # reference columns matching job and company with job_ids.
    df_res = pd.concat(
        [pd.DataFrame(data = [key_comp(scrape, i) for i in query_ids], columns = ['job','company','id']), 
         pd.DataFrame(data = similarities, columns = reference_ids)], 
        axis = 1).style.applymap(lambda df: 'font-weight: bold' , 
                                 subset = pd.IndexSlice[:, ['id']]).hide_index()

    return df_res, vectorizer, similarities
    #return duplicate_ids

Now let's run the filter on our data scrapes.

In [5]:
#vectorizer = tfidf_filter(strip_accents='unicode', lowercase=True, analyzer='word')
x_1, _, s_1 = tfidf_filter(scrape_py, master_py_jv)
x_1

job,company,id,214184580,214184939,214402155,206413964,064e0158-ad2f-473d-a363-b7fdef291d7a
Manufacturing Test Developer,WorldHire Inc,214184580,1.0,0.811423,0.31653,0.438657,0.47688
Design Verification Specialist,World Hire Inc,214184939,0.811423,1.0,0.320275,0.407434,0.47648
Application Security Analyst,Hays,214402155,0.31653,0.320275,1.0,0.252297,0.322094
Software Developer,BROCK Solutions,206413964,0.438657,0.407434,0.252297,1.0,0.412801
Machine Learning Engineer,EMAGIN,52dd0c65-d021-45d8-941d-66299e773eb8,0.351434,0.342604,0.230905,0.32853,0.375632
Autonomy Engineer - Controls,Clearpath Robotics,92b0416c-4f65-470b-9e6c-6723e18314f3,0.476319,0.473035,0.282693,0.390465,0.464714
Cloud Tools Developer,FLIR Systems,064e0158-ad2f-473d-a363-b7fdef291d7a,0.47688,0.47648,0.322094,0.412801,1.0
Autonomy Engineer - Controls,Clearpath Robotics,9c00becc-96d4-4bdb-a2dc-c48c011c537f,0.449296,0.444078,0.267457,0.382575,0.440197
Autonomy Engineer - Perception,Clearpath Robotics,28850739-d127-45af-baad-d27a4c1ff316,0.51456,0.507477,0.307011,0.430268,0.499257


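From the results, it seems one pair of different jobs exceeds the max similarity threshold of 75%: "Manufacturing Test Developer" (214184580) and "Design Verification Specialist" (214184939) score 0.811, so the match would be popped. (The 1.0 entries are jobs matched against their own copy in the master list.) This match should be investigated to see whether the two jobs are really the same. To do this, the Natural Language Toolkit ([nltk](https://www.nltk.org/)) is imported to tokenize the two job descriptions and list the words that appear in one description but not in the other.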

In [6]:
import nltk
try:
    stopwords = nltk.corpus.stopwords.words('english')
except LookupError:
    # The corpus is not available locally: download it, then load it.
    nltk.download('stopwords', quiet=True)
    stopwords = nltk.corpus.stopwords.words('english')
## Ignore for now
#sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
#nltk.download('punkt')

In [7]:
#print('\n-----\n'.join(sent_detector.tokenize(x_1.strip())))
## The above can separate sentences, which seems cool.
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
# Look at different datasets to check whether some blurb formatting issues are specific to the job provider or caused by the scraper.
tokens_1 = tokenizer.tokenize(master_py_jv.get('214184939').get('blurb')) 
tokens_2 = tokenizer.tokenize(master_py_jv.get('214184580').get('blurb'))
tokens_1 = [word.lower() for word in tokens_1 if word not in stopwords]
tokens_2 = [word.lower() for word in tokens_2 if word not in stopwords]
res_1 = [word for word in tokens_1 if word not in tokens_2]
res_2 = [word for word in tokens_2 if word not in tokens_1]
print(len(res_1))
print(sorted(set(res_1)))
print('\n')
print(len(res_2))
print(sorted(set(res_2)))

82
['activitiesparticipate', 'activitiesresearch', 'application', 'applicationsmanagement', 'appropriate', 'bring', 'broader', 'capabilities', 'clients', 'communicating', 'complete', 'coverageparticipate', 'cycleexecution', 'cycleparticipate', 'cyclesable', 'cyclesense', 'defects', 'delivery', 'demonstrated', 'develop', 'developmentexposure', 'developmentprogramming', 'developmentstrong', 'disciplinesour', 'e', 'early', 'efficient', 'embedded', 'environmentrequired', 'execution', 'exploratory', 'external', 'field5', 'firmware', 'following', 'functionsdevelop', 'highly', 'identified', 'implemented', 'internal', 'introductionprogramming', 'issues', 'issuesdevelopment', 'manager', 'medical', 'needsabove', 'objective', 'organization', 'performance', 'plan', 'planning', 'processeseducation', 'processesmanage', 'processreport', 'protocol', 'protocols', 'regulated', 'reporting', 'responsible', 'results', 'running', 'skillsteam', 'specialist', 'specifications', 'specificationstesting', 'stakeh

The results show there are many differing words between the two descriptions, even if you ignore the tokens that got fused together (for example 'activitiesparticipate'). So the two postings appear to be different jobs, and the 0.811 match is most likely a false positive that would be wrongly popped from our job scrape.

Let's try adding stop words to the vectorizer to see how that affects our similarity matrix output.

In [8]:
vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, analyzer='word', 
                             stop_words = stopwords)
x_2, _, s_2 = tfidf_filter(scrape_py, master_py_jv, vectorizer)
x_2

job,company,id,214184580,214184939,214402155,206413964,064e0158-ad2f-473d-a363-b7fdef291d7a
Manufacturing Test Developer,WorldHire Inc,214184580,1.0,0.648447,0.0871111,0.15824,0.117587
Design Verification Specialist,World Hire Inc,214184939,0.648447,1.0,0.0954839,0.113875,0.118413
Application Security Analyst,Hays,214402155,0.0871111,0.0954839,1.0,0.0642077,0.122166
Software Developer,BROCK Solutions,206413964,0.15824,0.113875,0.0642077,1.0,0.127448
Machine Learning Engineer,EMAGIN,52dd0c65-d021-45d8-941d-66299e773eb8,0.0687842,0.0642776,0.0606568,0.112746,0.13651
Autonomy Engineer - Controls,Clearpath Robotics,92b0416c-4f65-470b-9e6c-6723e18314f3,0.0985567,0.100084,0.0474624,0.0862141,0.115413
Cloud Tools Developer,FLIR Systems,064e0158-ad2f-473d-a363-b7fdef291d7a,0.117587,0.118413,0.122166,0.127448,1.0
Autonomy Engineer - Controls,Clearpath Robotics,9c00becc-96d4-4bdb-a2dc-c48c011c537f,0.0929403,0.0891469,0.0455025,0.0897801,0.103522
Autonomy Engineer - Perception,Clearpath Robotics,28850739-d127-45af-baad-d27a4c1ff316,0.107528,0.102242,0.0489271,0.108576,0.139806


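With stop words removed, our cosine similarity scores look a lot better than before. The similarity between 214184580 and 214184939 drops from 0.811 to 0.648, which is below the 0.75 threshold, and the scores between unrelated jobs fall from roughly 0.23 to 0.51 down to roughly 0.05 to 0.16.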
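
Why stop words inflate the scores can be seen on a toy pair. The sketch below is only an illustration under stated assumptions: the two blurbs are made up, `pair_similarity` is a hypothetical helper, and scikit-learn's built-in `'english'` stop list is swapped in for the nltk list purely to keep the block self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up blurbs for unrelated jobs that share only boilerplate wording.
demo_blurbs = [
    "we are looking for a developer to work with the team on python",
    "we are looking for a nurse to work with the team on the ward",
]

def pair_similarity(blurbs, **vectorizer_kwargs):
    """Cosine similarity between the first two blurbs for a given vectorizer setup."""
    vec = TfidfVectorizer(strip_accents='unicode', lowercase=True, analyzer='word',
                          **vectorizer_kwargs)
    tfidf = vec.fit_transform(blurbs)
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Same blurbs, scored once with stop words kept and once with them removed.
sim_stop_words_kept = pair_similarity(demo_blurbs)
# Assumption: scikit-learn's built-in English list stands in for the nltk list here.
sim_stop_words_removed = pair_similarity(demo_blurbs, stop_words='english')

print(f"stop words kept:    {sim_stop_words_kept:.3f}")
print(f"stop words removed: {sim_stop_words_removed:.3f}")
```

Most of the overlap between the two blurbs is filler ("we are", "for", "with the"), so dropping it leaves fewer shared terms and a lower score, the same effect seen in the matrix above.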

In [9]:
reference_ids, _ = extract(master_py_jv)
pd.concat([pd.DataFrame(data = [['New Results']]+s_2.tolist(),columns = reference_ids), 
           pd.DataFrame(data = [['Old Results']]+s_1.tolist(),columns = reference_ids)], 
          axis = 0).reset_index(drop=True).style.apply(
    lambda df: ['font-weight: bold' if v in set([df.loc[i] for i in [0,10]]) else '' for v in df]).hide_index()

214184580,214184939,214402155,206413964,064e0158-ad2f-473d-a363-b7fdef291d7a
New Results,,,,
1,0.648447,0.0871111,0.15824,0.117587
0.648447,1.0,0.0954839,0.113875,0.118413
0.0871111,0.0954839,1.0,0.0642077,0.122166
0.15824,0.113875,0.0642077,1.0,0.127448
0.0687842,0.0642776,0.0606568,0.112746,0.13651
0.0985567,0.100084,0.0474624,0.0862141,0.115413
0.117587,0.118413,0.122166,0.127448,1.0
0.0929403,0.0891469,0.0455025,0.0897801,0.103522
0.107528,0.102242,0.0489271,0.108576,0.139806


Now the final task is to rewrite the tfidf_filter function so that it can also filter a single scrape dictionary. The current implementation does not remove duplicates from the first scrape, which becomes our master list. There is also no way of checking the input scrape for reposts within itself, so reposts can be saved whenever they don't match anything in the master list. This is a problem because reposts on a job board, or across multiple job boards, are very common, and those missed duplicates end up in our master list.

The first step is to write a tfidf_filter_2 function that outputs results whether one or two dicts are passed as input.

In [10]:
def tfidf_filter_2(scrape: Dict[str, dict], master: Optional[Dict[str, dict]] = None, vectorizer = None):
    if vectorizer is None:
        vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, analyzer='word')
    query_ids, query_words = extract(scrape)
    # With no master dictionary, compare the scrape against itself instead.
    if master is None:
        similarities = cosine_similarity(vectorizer.fit_transform(query_words))
        df_c = pd.concat(
            [pd.DataFrame(data = [key_comp(scrape, i) for i in query_ids], columns = ['job','company','id']), 
             pd.DataFrame(data = similarities, columns = query_ids)], axis = 1)
    else:
        reference_ids, reference_words = extract(master)
        vectorizer.fit(query_words + reference_words)
        references = vectorizer.transform(reference_words)
        similarities = cosine_similarity(vectorizer.transform(query_words), references)
        df_c = pd.concat(
            [pd.DataFrame(data = [key_comp(scrape, i) for i in query_ids], columns = ['job','company','id']), 
             pd.DataFrame(data = similarities, columns = reference_ids)], axis = 1)
    df_res = df_c.style.applymap(lambda df: 'font-weight: bold' , subset = pd.IndexSlice[:, ['id']]).hide_index()
    return df_res, vectorizer, similarities

In [11]:
vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, analyzer='word', stop_words=stopwords)
df1, _, s1 = tfidf_filter_2(master_py_jv, vectorizer=vectorizer)
df2, _, s2 = tfidf_filter_2(scrape_py, master_py_jv, vectorizer=vectorizer)
pprint(s1)
pprint(s2)

array([[1.        , 0.66345844, 0.09731095, 0.17593045, 0.13047919],
       [0.66345844, 1.        , 0.10827282, 0.13186546, 0.12717015],
       [0.09731095, 0.10827282, 1.        , 0.0707284 , 0.12617179],
       [0.17593045, 0.13186546, 0.0707284 , 1.        , 0.14235869],
       [0.13047919, 0.12717015, 0.12617179, 0.14235869, 1.        ]])
array([[1.        , 0.64844742, 0.0871111 , 0.15823959, 0.11758716],
       [0.64844742, 1.        , 0.09548391, 0.11387481, 0.11841301],
       [0.0871111 , 0.09548391, 1.        , 0.06420767, 0.1221658 ],
       [0.15823959, 0.11387481, 0.06420767, 1.        , 0.12744822],
       [0.06878422, 0.06427765, 0.06065676, 0.11274559, 0.13651008],
       [0.09855672, 0.10008376, 0.04746236, 0.08621409, 0.11541298],
       [0.11758716, 0.11841301, 0.1221658 , 0.12744822, 1.        ],
       [0.09294027, 0.08914694, 0.0455025 , 0.08978008, 0.10352207],
       [0.10752756, 0.10224172, 0.04892713, 0.1085764 , 0.13980574]])


Success: the results above show that we can create similarity matrices from either one or two dictionaries. Before this function is ready, we need to check whether the duplicate removal method works on a single scrape job.
So let's define the removal function here and feed it the similarity matrices from above to see what would get removed.

In [12]:
def duplicate_rm(cur_dict: Dict[str, dict], similarities, max_similarity: float = 0.75):
    query_ids, _ = extract(cur_dict)
    
    # get duplicate job ids and pop them
    duplicate_ids = []
    for sim, query_id in zip(similarities, query_ids):
        if np.max(sim) >= max_similarity:
            duplicate_ids.append(query_id)
            #duplicate_ids.append(cur_dict.pop(query_id)['id'])
            
            print(query_id, sim, round(np.max(sim)))
    print("\n")
    print("Duplicates to be popped: ")
    return duplicate_ids
print("Results: ")
pprint(duplicate_rm(master_py_jv, s1))

Results: 
214184580 [1.         0.66345844 0.09731095 0.17593045 0.13047919] 1.0
214184939 [0.66345844 1.         0.10827282 0.13186546 0.12717015] 1.0
214402155 [0.09731095 0.10827282 1.         0.0707284  0.12617179] 1.0
206413964 [0.17593045 0.13186546 0.0707284  1.         0.14235869] 1.0
064e0158-ad2f-473d-a363-b7fdef291d7a [0.13047919 0.12717015 0.12617179 0.14235869 1.        ] 1.0


Duplicates to be popped: 
['214184580',
 '214184939',
 '214402155',
 '206413964',
 '064e0158-ad2f-473d-a363-b7fdef291d7a']


Looks like the duplicate filter is working, but it would remove every job because of the diagonal: each job is compared with itself, which always scores 1.0. Let's write a different one that is compatible with a single scrape job, and lower our max_similarity to see how it works.
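
The effect of the diagonal can be checked in isolation. This is a minimal sketch that hard-codes the s1 values printed above (rounded to four decimals) rather than recomputing them; `s1_demo`, `masked`, and the `flagged_*` names are illustrative and not part of JobFunnel.

```python
import numpy as np

# s1 values copied from the output above, rounded to four decimals.
s1_demo = np.array([
    [1.0,    0.6635, 0.0973, 0.1759, 0.1305],
    [0.6635, 1.0,    0.1083, 0.1319, 0.1272],
    [0.0973, 0.1083, 1.0,    0.0707, 0.1262],
    [0.1759, 0.1319, 0.0707, 1.0,    0.1424],
    [0.1305, 0.1272, 0.1262, 0.1424, 1.0],
])

# With the diagonal in place, every row's maximum is 1.0, so every job is flagged.
flagged_with_diagonal = np.max(s1_demo, axis=1) >= 0.75

# Work on a copy so the original matrix is left untouched.
masked = s1_demo.copy()
np.fill_diagonal(masked, 0)
flagged_without_diagonal = np.max(masked, axis=1) >= 0.75

print(flagged_with_diagonal)     # all True
print(flagged_without_diagonal)  # all False: the largest remaining value is 0.6635
```

Zeroing the diagonal leaves only the comparisons between different jobs, which is what a within-scrape duplicate check needs.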

In [13]:
np.fill_diagonal(s1, 0) #Replaces diagonals with 0 to avoid accidental removal
similarities = s1
max_similarity = 0.14
query_ids, _ = extract(master_py_jv)
index = 0
duplicates = []
while True:
    if index == len(similarities):
        print(query_ids)
        print(similarities)
        break
    if np.max(similarities[index]) >= max_similarity:
        print(query_ids)
        print(similarities)
        duplicates.append(query_ids.pop(index))
        similarities = np.delete(similarities, index, axis = 0)
        similarities = np.delete(similarities, index, axis = 1)
    else:
        index += 1

print("\n","Duplicates to be popped: ", duplicates)
query_ids, _ = extract(master_py_jv)
pd.DataFrame(data = s1, columns = query_ids, index = query_ids)

['214184580', '214184939', '214402155', '206413964', '064e0158-ad2f-473d-a363-b7fdef291d7a']
[[0.         0.66345844 0.09731095 0.17593045 0.13047919]
 [0.66345844 0.         0.10827282 0.13186546 0.12717015]
 [0.09731095 0.10827282 0.         0.0707284  0.12617179]
 [0.17593045 0.13186546 0.0707284  0.         0.14235869]
 [0.13047919 0.12717015 0.12617179 0.14235869 0.        ]]
['214184939', '214402155', '206413964', '064e0158-ad2f-473d-a363-b7fdef291d7a']
[[0.         0.10827282 0.13186546 0.12717015]
 [0.10827282 0.         0.0707284  0.12617179]
 [0.13186546 0.0707284  0.         0.14235869]
 [0.12717015 0.12617179 0.14235869 0.        ]]
['214184939', '214402155', '064e0158-ad2f-473d-a363-b7fdef291d7a']
[[0.         0.10827282 0.12717015]
 [0.10827282 0.         0.12617179]
 [0.12717015 0.12617179 0.        ]]

 Duplicates to be popped:  ['214184580', '206413964']


Unnamed: 0,214184580,214184939,214402155,206413964,064e0158-ad2f-473d-a363-b7fdef291d7a
214184580,0.0,0.663458,0.097311,0.17593,0.130479
214184939,0.663458,0.0,0.108273,0.131865,0.12717
214402155,0.097311,0.108273,0.0,0.070728,0.126172
206413964,0.17593,0.131865,0.070728,0.0,0.142359
064e0158-ad2f-473d-a363-b7fdef291d7a,0.130479,0.12717,0.126172,0.142359,0.0


In [14]:
#print(s1)
query_ids, _ = extract(master_py_jv)
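def test_1(similarities, query_ids, max_similarity: float = 0.6):
    # Work on copies so the caller's matrix and id list are left untouched.
    similarities = similarities.copy()
    query_ids = list(query_ids)
    index = 0
    duplicates = []
    # A job always matches itself, so zero the diagonal before comparing.
    np.fill_diagonal(similarities, 0)
    while index < len(similarities):
        if np.max(similarities[index]) >= max_similarity:
            # Drop the job and its row/column so its partner is not flagged too.
            duplicates.append(query_ids.pop(index))
            similarities = np.delete(similarities, index, axis = 0)
            similarities = np.delete(similarities, index, axis = 1)
        else:
            index += 1
    return duplicates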
print("Our duplicate job ids are: ", test_1(s1, query_ids.copy())) 

Our duplicate job ids are:  ['214184580']


To be continued...

Questions:
1. Could tuning hyper-parameters improve filtering?
2. Are there any other models that might prove better than tf-idf and cosine similarity?

In [15]:
#test = CountVectorizer(strip_accents='unicode', lowercase=True, analyzer='word', stop_words = stopwords.words('english'))
#test = CountVectorizer(strip_accents='unicode', lowercase=True, analyzer='word', stop_words = 'english')
#test.fit_transform(reference_words)
#pprint(test.vocabulary_.items())
#{k:v for k, v in sorted(test.vocabulary_.items(), key=lambda item: item[1], reverse=True)}
#print(test.stop_words_)

In [16]:
#import csv
#def read_csv(path, key_by_id=True):
    ## reads csv passed in as path
#    with open(path, 'r') as csvfile:
#        reader = csv.DictReader(csvfile)
#        if key_by_id:
#            return dict([(j['id'], j) for j in reader])
#        else:
#            return [row for row in reader]