# AIT 590 Project TF-IDF Job Data Work 
- Team 3:  Melissa, Fernando, Archer
- Final combined notebook


##### References
- https://medium.com/@adriensieg/text-similarities-da019229c894
- https://www.datacamp.com/community/tutorials/recommender-systems-python
- https://stackoverflow.com/questions/26826002/adding-words-to-stop-words-list-in-tfidfvectorizer-in-sklearn
- https://stackoverflow.com/questions/47557563/lemmatization-of-all-pandas-cells
- https://www.geeksforgeeks.org/python-measure-similarity-between-two-sentences-using-cosine-similarity

In [1]:
!pwd

import pandas as pd
import numpy as np
import nltk
import json
import re
import pickle
import os

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

import spacy
import en_core_web_sm  # english model

# load English tokenizer, tagger, parser, NER and word vectors
nlp = en_core_web_sm.load()



/Users/melissacirtain/work/ait-590-nlp


In [2]:
version = 9
df = None

### Load and view data

- 10,501 job listing records

In [3]:
dataset = '/Users/melissacirtain/Documents/gmu/AIT590-NLP/project/data/career_builder_jobs_10501.json'
# open the file in a context manager so the handle is closed after loading
with open(dataset) as f:
    data = json.load(f)

len(data)  # 10,501 records

# What are the features?
print(data[0].keys())

# Look at a single record
data[10]

dict_keys(['salary', 'domain', 'education', 'crawled_at', 'description', 'title', 'skills', 'country', 'raw_description', 'locality', 'posted_at', 'longitude', 'postalCode', 'url', 'experience', 'address', 'latitude', '_id', 'company', 'region', 'employment_type'])


{'salary': '$65,000.00 - $120,000.00 / year',
 'domain': 'https://www.careerbuilder.com/',
 'education': 'High School',
 'crawled_at': '05/05/2021, 03:01:51',
 'description': 'Job Description\n New York Life and its affiliates are dedicated to prudent financial management, high-quality products, and impeccable service. Our financial professionals help clients develop a long-term financial strategy to achieve their financial goals using a comprehensive array of financial products and services, including life insurance, investments, annuities, and mutual funds. \n As a financial professional with New York Life, you will be able to build your practice and help those in your community plan for their financial futures. We’re looking for people who possess the following characteristics: \n\n Highly self-motivated and self-disciplined with the ability to work effectively and independently \n Outgoing personality with the ability to develop relationships (i.e., “People Person") and a sincere d

In [4]:
df = pd.DataFrame.from_records(data)

# keep only needed fields
df = df[['_id', 'address', 'company', 'country', 'description',
       'education', 'employment_type', 'experience', 
       'locality', 'longitude', 'postalCode', 'posted_at', 'raw_description',
       'region', 'salary', 'skills', 'title', 'url']]
df.head()

Unnamed: 0,_id,address,company,country,description,education,employment_type,experience,locality,longitude,postalCode,posted_at,raw_description,region,salary,skills,title,url
0,2b387592-8148-5720-a661-a2730061d14c,550 East Main Street,"Action for a Better Community, Inc.",US,Job Description\nTo implement the Head Start P...,Bachelor's Degree,FULL_TIME,1 to 2 years experience.,Rochester,-77.59785,14604.0,2021-04-21T20:20Z,\n\n<strong>Job Description</strong>\n<span>To...,NY,$17.23 - $22.00 / hour,Emergency Handling,Head Start Teacher,https://www.careerbuilder.com/job/J8S02F6YRMG5...
1,cfc728ee-e7f8-5538-b1ee-0f6a2d12e1d1,,Magic Ears,US,Job Description\nYou have the magic. We have t...,Graduate Degree,PART_TIME,No experience required.,Atlanta,-84.38799,,2021-04-16T04:49:23Z,\n\n<strong>Job Description</strong>\n<p style...,GA,$19.00 - $26.00 / hour,"Vocabularies, Grammars, Teaching, Lesson Plann...",Teacher of English for Online Groups!,https://www.careerbuilder.com/job/JCL1H264YM6F...
2,8dcd846b-db99-547f-836c-bcda497cff0d,,ExecuSource,US,Job Description\nWe are looking for a CRM deve...,Bachelor's Degree,FULL_TIME,At least 5 years experience.,Duluth,-84.17516,30097.0,2021-04-29T13:59:47Z,\n\n<strong>Job Description</strong>\n<p class...,GA,"$106,250.00 - $125,000.00 / year","PHP (Scripting Language), Debugging, Web Servi...",CRM / PHP Developer,https://www.careerbuilder.com/job/JCM4J76JP92B...
3,e7fc9e40-ac86-5cf2-ba8f-3c1ef9e21982,3001 South Kansas Avenue,Briggs Dodge Ram Fiat,US,Job Description\n\nBriggs Dodge Ram Fiat is lo...,High School,FULL_TIME,No experience required.,Topeka,-95.68294,66611.0,2021-04-19T01:05:50Z,\n\n<strong>Job Description</strong>\n<p style...,KS,,"Driving, Service Delivery, Customer Service, A...",Automotive Service Advisor / Driver,https://www.careerbuilder.com/job/JCH4GH71PST9...
4,0d4d608a-17c0-5e6b-827e-88ad505cad09,1814 Atrium Place Drive,Regency Integrated Health Services,US,"Job Description\n HARLINGEN, TX- RIO GRANDE VA...",4 Year Degree,FULL_TIME,At least 3 years experience.,Harlingen,-97.65767,78550.0,2021-04-13T14:38:05Z,\n\n<strong>Job Description</strong>\n<p style...,TX,"$85,000.00 - $120,000.00 / year","Emergency Handling, Training, Accounting, Heal...",Licensed Nursing Home Administrator,https://www.careerbuilder.com/job/JCM2C66SCC02...


### Add all-text feature for tf-idf matrix

In [6]:
df['all_text'] = df['description'] + ' ' + df['skills'] + ' ' + df['title']
df.head()

Unnamed: 0,_id,address,company,country,description,education,employment_type,experience,locality,longitude,postalCode,posted_at,raw_description,region,salary,skills,title,url,all_text
0,2b387592-8148-5720-a661-a2730061d14c,550 East Main Street,"Action for a Better Community, Inc.",US,Job Description\nTo implement the Head Start P...,Bachelor's Degree,FULL_TIME,1 to 2 years experience.,Rochester,-77.59785,14604.0,2021-04-21T20:20Z,\n\n<strong>Job Description</strong>\n<span>To...,NY,$17.23 - $22.00 / hour,Emergency Handling,Head Start Teacher,https://www.careerbuilder.com/job/J8S02F6YRMG5...,Job Description\nTo implement the Head Start P...
1,cfc728ee-e7f8-5538-b1ee-0f6a2d12e1d1,,Magic Ears,US,Job Description\nYou have the magic. We have t...,Graduate Degree,PART_TIME,No experience required.,Atlanta,-84.38799,,2021-04-16T04:49:23Z,\n\n<strong>Job Description</strong>\n<p style...,GA,$19.00 - $26.00 / hour,"Vocabularies, Grammars, Teaching, Lesson Plann...",Teacher of English for Online Groups!,https://www.careerbuilder.com/job/JCL1H264YM6F...,Job Description\nYou have the magic. We have t...
2,8dcd846b-db99-547f-836c-bcda497cff0d,,ExecuSource,US,Job Description\nWe are looking for a CRM deve...,Bachelor's Degree,FULL_TIME,At least 5 years experience.,Duluth,-84.17516,30097.0,2021-04-29T13:59:47Z,\n\n<strong>Job Description</strong>\n<p class...,GA,"$106,250.00 - $125,000.00 / year","PHP (Scripting Language), Debugging, Web Servi...",CRM / PHP Developer,https://www.careerbuilder.com/job/JCM4J76JP92B...,Job Description\nWe are looking for a CRM deve...
3,e7fc9e40-ac86-5cf2-ba8f-3c1ef9e21982,3001 South Kansas Avenue,Briggs Dodge Ram Fiat,US,Job Description\n\nBriggs Dodge Ram Fiat is lo...,High School,FULL_TIME,No experience required.,Topeka,-95.68294,66611.0,2021-04-19T01:05:50Z,\n\n<strong>Job Description</strong>\n<p style...,KS,,"Driving, Service Delivery, Customer Service, A...",Automotive Service Advisor / Driver,https://www.careerbuilder.com/job/JCH4GH71PST9...,Job Description\n\nBriggs Dodge Ram Fiat is lo...
4,0d4d608a-17c0-5e6b-827e-88ad505cad09,1814 Atrium Place Drive,Regency Integrated Health Services,US,"Job Description\n HARLINGEN, TX- RIO GRANDE VA...",4 Year Degree,FULL_TIME,At least 3 years experience.,Harlingen,-97.65767,78550.0,2021-04-13T14:38:05Z,\n\n<strong>Job Description</strong>\n<p style...,TX,"$85,000.00 - $120,000.00 / year","Emergency Handling, Training, Accounting, Heal...",Licensed Nursing Home Administrator,https://www.careerbuilder.com/job/JCM2C66SCC02...,"Job Description\n HARLINGEN, TX- RIO GRANDE VA..."
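
The concatenation above relies on `description`, `skills`, and `title` all being present: with pandas string columns, a missing value in any one of them makes the whole `all_text` entry missing. Below is a minimal sketch of a more defensive variant, assuming some listings could lack a field. The `toy` frame and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the job listings; the None is a deliberate gap.
toy = pd.DataFrame({
    "description": ["Teach English online", "Enter data"],
    "skills": ["Teaching, Grammar", None],
    "title": ["English Teacher", "Data Entry Clerk"],
})
text_cols = ["description", "skills", "title"]

# Plain `+` propagates the missing value: the second row becomes NaN.
naive = toy["description"] + " " + toy["skills"] + " " + toy["title"]

# Filling gaps with empty strings first keeps every row a usable string.
safe = toy[text_cols].fillna("").apply(lambda row: " ".join(row), axis=1)

print(naive.isna().tolist())
print(safe.tolist())
```

The extra space left where a field was empty is harmless here, because `cleanup_text` collapses repeated spaces.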


### Clean up all_text

In [7]:
%%time
def cleanup_text(text):
    """Normalize raw listing text and return a lowercase, lemmatized string."""
    # replace underscores, line breaks, and asterisks with spaces
    text = text.replace('_', ' ')
    text = text.replace('\r', ' ')
    text = text.replace('\n', ' ')
    text = text.replace('*', ' ')
    # replace runs of digits with a space (raw string avoids an invalid-escape warning)
    text = re.sub(r'\d+', ' ', text)
    # collapse repeated spaces
    text = re.sub(' +', ' ', text)
    # run the spaCy pipeline to get tokens with lemmas and stop-word/punctuation flags
    doc = nlp(text)
    # keep the lowercased lemma of every token that is neither a stop word nor punctuation
    return ' '.join(token.lemma_.lower() for token in doc
                    if not token.is_stop and not token.is_punct)

df['lemma_lower_text'] = df['all_text'].apply(cleanup_text)

CPU times: user 10min 4s, sys: 1min 3s, total: 11min 8s
Wall time: 11min 12s


### Save our work

In [11]:
outfile = '/Users/melissacirtain/work/ait-590-nlp/lemmatized_df.csv'
df.to_csv(outfile, encoding='utf-8')
print(f'wrote {outfile}')

wrote /Users/melissacirtain/work/ait-590-nlp/lemmatized_df.csv


### Inspect Lemmatized dataset

- 10,501 job listing records
- /Users/melissacirtain/work/ait-590-nlp/lemmatized_df_7.csv

In [12]:
%%time
#lemmatized_csv = '/Users/melissacirtain/work/ait-590-nlp/lemmatized_df_7.csv'
#df = pd.read_csv(lemmatized_csv)
print(df.shape)
df.head()

(10501, 20)
CPU times: user 753 µs, sys: 425 µs, total: 1.18 ms
Wall time: 751 µs


In [13]:
df['lemma_lower_text'].iloc[0]

'job description implement head start performance standards overall management classroom include promote social physical intellectual growth provide safe healthy environment developmentally linguistically culturally appropriate responsible oversight assistant teacher classroom volunteer job requirements nys teacher certification prefer bachelor&rsquo;s degree early childhood education year experience teach early childhood setting require work level knowledge early childhood developmentally appropriate practice require administrative analytical evaluative oral write communication skill aptitude training motivate people require proficiency use personal computer require health physical capability work office classroom include sit floor child sized chair bend run climb stair lift child weigh lbs require access reliable transportation employee abc early childhood services division receive maintain clearance justice center new york office children family services allow unsupervised contact c
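
The sample above still contains the HTML entity `&rsquo;` (in `bachelor&rsquo;s`), so entities survive the current cleanup. One possible fix, sketched here under the assumption that decoding should happen before the text reaches spaCy, is the standard library's `html.unescape`. The helper name `decode_entities` is hypothetical.

```python
import html

def decode_entities(text):
    """Turn HTML entities such as '&rsquo;' or '&amp;' into the characters they stand for."""
    return html.unescape(text)

sample = "bachelor&rsquo;s degree in early childhood education"
print(decode_entities(sample))
```

A call like this could run as the first step of `cleanup_text`; the decoded apostrophe is then handled by the tokenizer like any other punctuation.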

## TF-IDF the lemma_lower_text field

In [14]:
%%time
# build the TF-IDF matrix from the lemmatized text (one row per listing, one column per term)
tfidf = TfidfVectorizer(min_df=1, stop_words="english")

tfidf_mtx = tfidf.fit_transform(df['lemma_lower_text'])

CPU times: user 2.76 s, sys: 57.8 ms, total: 2.82 s
Wall time: 2.84 s


In [15]:
print(tfidf_mtx.shape)  
print(type(tfidf_mtx))

(10501, 45165)
<class 'scipy.sparse.csr.csr_matrix'>
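
The shape reads as one row per job listing and one column per vocabulary term, and the matrix is stored sparse because most listings use only a small part of the vocabulary. The toy example below is a sketch with made-up documents that shows the same structure at a readable size.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus standing in for the job listings.
docs = [
    "english teacher high school",
    "data entry clerk",
    "english literature instructor",
]

vec = TfidfVectorizer()
mtx = vec.fit_transform(docs)

rows, cols = mtx.shape
density = mtx.nnz / (rows * cols)   # share of cells that are non-zero

print(mtx.shape)
print(mtx.nnz, round(density, 3))
```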


In [16]:
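# Caution: this densifies the full 10501 x 45165 matrix (roughly 3.8 GB as float64); inspection only
tfidf_mtx.todense()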

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

## Pickle the sparse matrix

In [17]:
pkl_filename = 'our_tfidf_mtx.pkl'
with open(pkl_filename, 'wb') as f:
    pickle.dump(tfidf_mtx, f)
    print(f'saved pickled tfidf matrix to {pkl_filename}')

saved pickled tfidf matrix to our_tfidf_mtx.pkl
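
The pickled matrix alone is not enough to score new text later, because the fitted vectorizer holds the vocabulary and IDF weights. The sketch below assumes both objects should be persisted together. It uses `scipy.sparse.save_npz` for the matrix as an alternative to pickle; the file names and toy documents are placeholders.

```python
import os
import pickle
import tempfile

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; in the notebook this would be df['lemma_lower_text'].
docs = ["english teacher high school", "data entry clerk"]
vec = TfidfVectorizer()
mtx = vec.fit_transform(docs)

out_dir = tempfile.mkdtemp()
mtx_path = os.path.join(out_dir, "toy_tfidf_mtx.npz")
vec_path = os.path.join(out_dir, "toy_tfidf_vectorizer.pkl")

# Save the sparse matrix and the fitted vectorizer side by side.
sparse.save_npz(mtx_path, mtx)
with open(vec_path, "wb") as f:
    pickle.dump(vec, f)

# Reload both and vectorize a new piece of text with the restored vectorizer.
mtx_loaded = sparse.load_npz(mtx_path)
with open(vec_path, "rb") as f:
    vec_loaded = pickle.load(f)

new_vec = vec_loaded.transform(["english teacher"])
print(mtx_loaded.shape, new_vec.shape)
```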


### Features
- Future work: prune and clean up the feature set (the vocabulary listed below includes many fused or misspelled tokens)
- Future work: create a function that runs all of the preparation steps in one call

In [18]:
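# Future work: clean up features (without losing acronyms!)
# get_feature_names() was removed in scikit-learn 1.2; on 1.0+ use get_feature_names_out()
feature_names = tfidf.get_feature_names_out().tolist()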
print(feature_names[0:30])

['aa', 'aaa', 'aab', 'aace', 'aacn', 'aade', 'aag', 'aahaprofessional', 'aak', 'aakspecialize', 'aam', 'aama', 'aami', 'aamira', 'aanalyst', 'aanp', 'aap', 'aapc', 'aapl', 'aarp', 'aas', 'aashto', 'aaspn', 'aat', 'aauv', 'ab', 'aba', 'abachelor', 'abandon', 'abap']
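
Several of the terms above, such as `aahaprofessional` and `aakspecialize`, look like words fused together during extraction, and they probably occur in very few listings. One possible way to prune them is the `min_df` parameter of `TfidfVectorizer`, sketched here on made-up documents. This is an assumption about how the cleanup might go. It would also drop any genuine acronym that appears in only one listing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus with one fused token that occurs a single time.
docs = [
    "registered nurse clinic",
    "registered nurse hospital",
    "nurse aahaprofessional",
]

full = TfidfVectorizer(min_df=1).fit(docs)     # keep every term
pruned = TfidfVectorizer(min_df=2).fit(docs)   # keep terms found in at least 2 documents

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```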


In [19]:
# stopwords are removed
'of' in feature_names

False

## Save Vocabulary Features

In [20]:
# write vocab to file for future cleaning:

with open('vocab.txt', 'w') as f:
    for word in feature_names:
        f.write(f"'{word}', ")

print(os.path.join(os.getcwd(), 'vocab.txt'))


/Users/melissacirtain/work/ait-590-nlp/vocab.txt


In [22]:
print(feature_names)



## Process a new incoming record

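- Clean, lowercase, and lemmatize the applicant text with the same data prep steps used for the corpus
- Use TF-IDF to turn the result into a document vector over the corpus vocabulary
- Compare that vector with each row of the corpus matrix to get a cosine similarity score
- Take the highest-scoring listings (the top 10 in the cells that follow) as recommendations for the applicant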
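
The steps in this list can be wrapped in a single call. The sketch below is one possible shape for such a function, not the notebook's own implementation. The name `recommend`, its parameters, and the toy data are hypothetical; `preprocess` stands in for `cleanup_text`; and the sketch reuses the fitted corpus vectorizer to transform the new text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(text, vectorizer, corpus_mtx, titles, k=5, preprocess=str.lower):
    """Return the k (title, score) pairs most similar to `text`, best first."""
    query_vec = vectorizer.transform([preprocess(text)])
    scores = cosine_similarity(query_vec, corpus_mtx).flatten()
    top = np.argsort(scores)[::-1][:k]          # indices of the k largest scores
    return [(titles[i], float(scores[i])) for i in top]

# Hypothetical toy corpus and titles.
titles = ["English Teacher", "Data Entry Clerk", "Literature Instructor"]
docs = [
    "english teacher high school classroom",
    "data entry clerk office",
    "english literature instructor college",
]
vec = TfidfVectorizer()
mtx = vec.fit_transform(docs)

results = recommend("English teacher for a high school", vec, mtx, titles, k=2)
for title, score in results:
    print(f"{score:.3f}  {title}")
```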

In [23]:
new_applicant = """
I am looking for a job as an english teacher in a high school.  
I am interested in working with students to develop their confidence 
and problem solving skills.  I have 5 years English Language teaching 
experience in the classroom.  I have no problem learning new skills and am a 
continuous learner.  I love working with adolescents and youth
to show them the value of the english language and literature.
"""

applicant_lemmas = cleanup_text(new_applicant)

In [24]:
corpus_tfidf = tfidf
corpus_tfidf_mtx = tfidf_mtx
corpus_vocab = feature_names

### Fit incoming text to _corpus vocabulary_

In [25]:
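# Fit a second vectorizer on the corpus vocabulary list: its columns line up with the corpus matrix,
# but each term appears in exactly one "document", so every term gets the same IDF weight.
applicant_tfidf = TfidfVectorizer().fit(corpus_vocab)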

### Create tfidf-vector transforming applicant lemmas

In [27]:
# Transform the new applicant's lemmas into a TF-IDF vector
applicant_tfidf_vector = applicant_tfidf.transform([applicant_lemmas])  # <-- and this (fit-corpus, transform applicant)

In [28]:
applicant_tfidf_vector.shape

(1, 45165)
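
The matching shape confirms that the applicant vector has the same columns as the corpus matrix. A vectorizer fitted on the vocabulary list sees each term in exactly one "document", so its IDF weights are all identical and the applicant vector is effectively a normalized term count. The sketch below uses made-up documents to contrast that with reusing the vectorizer fitted on the corpus, which keeps the corpus IDF weights. It is offered as an alternative to consider, not as the notebook's method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "english" is common, every other term is rare.
docs = [
    "english teacher high school",
    "data entry clerk",
    "english literature instructor",
    "customer service english",
]

corpus_vec = TfidfVectorizer().fit(docs)       # fitted on the documents
vocab = sorted(corpus_vec.vocabulary_)
vocab_vec = TfidfVectorizer().fit(vocab)       # fitted on the vocabulary list

query = ["english teacher"]
a = corpus_vec.transform(query)                # corpus IDF: rare "teacher" outweighs "english"
b = vocab_vec.transform(query)                 # uniform IDF: both terms weigh the same

print(a.shape == b.shape)
print(np.allclose(vocab_vec.idf_, vocab_vec.idf_[0]))
print(a.data, b.data)
```

In the notebook's terms, the corpus-weighted variant would be `tfidf.transform([applicant_lemmas])`.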

## Calculate cosine similarities 
- Determine cosine similarity between applicant and each job listing

In [29]:
cosine_similarities = cosine_similarity(applicant_tfidf_vector, corpus_tfidf_mtx).flatten()

# Indices of the 10 highest similarity scores, best match first
best_job_match_indices = cosine_similarities.argsort()[:-11:-1]
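
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, and the slice `[:-11:-1]` walks the sorted indices backwards to pick the ten largest scores. The NumPy sketch below illustrates both on small made-up arrays. The helper names `cosine` and `top_k` are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return np.argsort(scores)[::-1][:k]

print(cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0])))

scores = np.array([0.10, 0.90, 0.30, 0.70])
print(top_k(scores, 2))            # same selection as the slice on the next line
print(scores.argsort()[:-3:-1])    # the notebook's slicing idiom, here for k = 2
```

Because `TfidfVectorizer` L2-normalizes each row by default, the imported `linear_kernel` would give the same scores as `cosine_similarity` on these matrices.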

## Peek at best job matches

In [30]:
df['title'].iloc[best_job_match_indices]

2831     Teacher(English) - Online& High-paid Job - Eas...
6717     ✨✨English Teachers for young learners, Teach R...
1991     English Teachers for young learners, Teach Rem...
1                    Teacher of English for Online Groups!
6742     Teach English Online for Chinese Kids! Work Re...
7159     🚗English Teachers for young learners,🚗 Teach R...
8753                       ENGLISH & LITERATURE INSTRUCTOR
8042     Bilingual - Customer Service - Field Support R...
8624                      Associate-Environmental Services
10482                                     Data Entry Clerk
Name: title, dtype: object