
aschneem/Job-Matcher

Job Matcher

This project came out of some Python scripts I started writing to assist with my job search. During a previous job search I had experimented with using TF-IDF cosine similarity to match my resume against job postings and determine which positions I might be the best match for. Drawing on that experience, and knowing how much effort it takes to manually collect the text from job posts, I started writing Python scripts to automate the process. Along the way I also stumbled upon the Resume-Matcher project, which provided additional inspiration. Some of the initial scripts ran into anti-web-scraping techniques used on some job sites, which led me down the path of using Playwright to get around some of those barriers. Eventually the scripts and their output grew to the point that they were difficult to navigate, and I decided to turn them into more of an actual application. I selected Flask for the backend since it seemed like an easy option to stand up in front of some of my existing code. I chose Angular for the frontend because I had some experience with it, though mostly tweaking things around the edges, and I wanted to develop a deeper understanding. Note that the code is currently at best prototype / proof-of-concept level and that I haven't implemented unit tests yet. So far I've mostly been working to get something usable and publishable, since some of the initial code was full of personal information and used paths specific to my computer.

Backend

The backend uses Flask, Playwright, spaCy (and some other NLP code), and persists data with MongoDB.

Setup

Consider using a virtual environment. The venv module used below ships with Python 3; if you would rather use the virtualenv module, install it with pip install virtualenv

Create the virtual environment: python -m venv my_venv

Activate the virtual environment using the appropriate script for your shell in the my_venv/Scripts folder (my_venv/bin on Linux and macOS) created by the venv command, then install the dependencies:

pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m playwright install

The last step before you can run the application is setting the MONGODB_CONNECTIONSTRING environment variable. Set it to a connection string for the MongoDB database that the application will read from and write to. MongoDB Community Edition is available here: Download MongoDB Community
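As a minimal sketch of how that variable might be checked at startup, the snippet below fails fast with a readable message instead of a connection timeout later on. The function name and the error text are illustrative assumptions, not code taken from this repository.

```python
import os


def get_mongo_connection_string(env=None):
    """Return the MongoDB connection string or raise a clear error."""
    # `env` defaults to the process environment; passing a dict makes this easy to test.
    env = os.environ if env is None else env
    value = env.get("MONGODB_CONNECTIONSTRING", "").strip()
    if not value:
        raise RuntimeError(
            "MONGODB_CONNECTIONSTRING is not set; point it at a MongoDB "
            "instance, for example mongodb://localhost:27017"
        )
    return value
```

The returned string is what you would hand to your MongoDB client when the app starts.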

Running the backend

The backend can be run like a normal Flask app with python -m flask run, or in production mode with python -m gunicorn --bind 0.0.0.0:5000 app:app. It can also be run in a container. To that end I've included a dockerfile and docker-compose.yml. Most of the setup can be skipped when using them, with a couple of caveats:

  1. You will still need to set the MONGODB_CONNECTIONSTRING environment variable.
  2. Running as a container seems to prevent using non-headless mode for Playwright. I've found non-headless mode necessary in rare instances, and it also makes debugging much easier since the backend does not currently have great observability.

Frontend

The frontend is an Angular 17 app using standalone components. Please refer to the autogenerated README.md in the frontend folder if you need help setting up or running an Angular app on Node.js. Longer term I will also containerize this piece and include it in the docker-compose.yml.

Using Job Matcher

Currently, Job Matcher tracks three things: resumes, search definitions, and job posts. I'd recommend that your first step be to upload a resume.

Resumes

When you upload a resume, Job Matcher extracts keywords and does other NLP analysis of the text. It also begins searching for synonyms of the extracted words so that it can suggest words to swap out to make the resume text more aligned with a specific job post. You can set a resume as the default resume, which makes it the one the application uses by default when displaying a score for a job post and the default option in the compare view. Note that because synonym lookups are rate limited and the results are cached, the initial run can take a while.
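To illustrate why the first run is slow and later runs are fast, here is a minimal sketch of a cached, rate-limited lookup. The class name, the fetch callable, and the one-second interval are all hypothetical; the actual synonym source and caching code in this project may look quite different.

```python
import time


class CachedSynonymLookup:
    """Wrap a slow, rate-limited synonym lookup with an in-memory cache."""

    def __init__(self, fetch, min_interval_seconds=1.0):
        self._fetch = fetch  # hypothetical callable: word -> iterable of synonyms
        self._min_interval = min_interval_seconds
        self._cache = {}
        self._last_call = None

    def synonyms(self, word):
        key = word.lower()
        if key in self._cache:
            return self._cache[key]  # cache hit: no request and no waiting
        # Space out real requests so the upstream rate limit is respected.
        if self._last_call is not None:
            wait = self._min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
        result = list(self._fetch(key))
        self._last_call = time.monotonic()
        self._cache[key] = result
        return result
```

Every new word pays for a real request plus any enforced delay, while repeated words come straight from the cache.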

Searches

The next step I'd recommend is to upload a search definition JSON file. I have plans to overhaul this piece and hopefully will eventually make it more user friendly, but for now this was the quickest way for me to hit the ground running.

Here is a description of the JSON schema:

{
 "url": "String REQ - The url that the search opens up to. Note that it can be helpful to include filters in the form of query parameters for some searches",
 "skip": "Boolean OPT - This likely won't be particularly useful, but if true the search service will skip over this definition when running searches",
 "extraDelay": "Number OPT - Number of extra seconds to wait between posts.",
 "acceptCookiesGetBy": "String OPT - The Playwright get_by for an element to accept cookies for the search. This isn't necessarily a required step for all sites",
 "acceptCookiesRole": "String OPT - The role of the element when using get_by_role",
 "acceptCookiesKey": "String OPT - The identifier for the element",
 "searchBoxGetBy": "String OPT - The get_by to use to retrieve a search box element to fill with searches for different types of posts. Note that the special value url can be used here; see searchUrl for more info",
 "searchUrl": "String OPT - If searchBoxGetBy is url then, instead of performing a search with a search box and button click, the search service will navigate to the URL specified here. The URL should include the placeholder {search}, which will be replaced by the search terms",
 "searchSpaceChar": "String OPT - If searchBoxGetBy is url and you need to override how spaces in the search term are encoded, you can specify the override character here",
 "searchBoxKey": "String OPT - The search box element identifier",
 "searchBoxKeyExact": "Boolean OPT - If the search box key requires an exact match",
 "searchButtonGetBy": "String OPT - The search button get_by to click once the search has been entered",
 "searchButtonRole": "String OPT - The search button role when using get_by_role",
 "searchButtonKey": "String OPT - The search button element identifier",
 "searchButtonKeyExact": "Boolean OPT - If the search button key requires an exact match",
 "filters": [
  {
   "getBy": "String OPT - get_by for the filter to apply",
   "role": "String OPT - role if using get_by_role",
   "type": "String OPT - Specifies whether the key is a regular expression. If it is, the value of this field should be re",
   "key": "String OPT - The element identifier for the filter to click. If the type is re then you can use a regular expression, for example Minnesota \\(\\d+\\)"
  }
 ],
 "jobPostsCSSSelector": "String REQ - CSS selector for the link to click to navigate / open the job post when crawling the page",
 "jobPostsContentCSSSelector": "String REQ - CSS selector for the content of the job description. Note that you may want to get more specific and exclude more content than you might otherwise. The main reason is that a hash of the content is used to determine if a post is a duplicate, and if too much is included you can pick up things like chatbot bubbles that display slightly different welcome messages",
 "backToSearchGetBy": "String REQ - The method to navigate back to the job post listing being investigated. In the current implementation it supports role, label, and back. Back is a special case where it uses the browser's back navigation instead of clicking an element on the website",
 "backToSearchRole": "String OPT - The role when using get_by_role",
 "backToSearchKey": "String OPT - The element identifier for the back link / button / close post",
 "headless": "Boolean OPT - If true the browser will launch in headless mode with Playwright. The current default is False",
 "name": "String REQ - The name / ID of the search configuration",
 "pageMax": "Number OPT - The maximum number of job post links to investigate on a page",
 "runData": "Object OPT - This is generated by the search service when the configuration is being run"
}
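Two parts of the schema are easy to get wrong: the keys marked REQ, and the {search} placeholder in searchUrl. The sketch below checks for missing required keys and fills the placeholder. The function names are hypothetical, and encoding spaces as %20 when searchSpaceChar is absent is an assumption on my part; the search service may encode them differently.

```python
from urllib.parse import quote

# Keys marked REQ in the schema description above.
REQUIRED_KEYS = (
    "url",
    "jobPostsCSSSelector",
    "jobPostsContentCSSSelector",
    "backToSearchGetBy",
    "name",
)


def missing_required_keys(definition):
    """Return the REQ keys that are absent or empty in a search definition."""
    return [key for key in REQUIRED_KEYS if not definition.get(key)]


def build_search_url(search_url, terms, space_char=None):
    """Fill the {search} placeholder in searchUrl with the search terms."""
    if "{search}" not in search_url:
        raise ValueError("searchUrl must contain the {search} placeholder")
    # Assumption: percent-encode each word and join with %20 unless overridden.
    separator = "%20" if space_char is None else space_char
    encoded = separator.join(quote(word, safe="") for word in terms.split())
    return search_url.replace("{search}", encoded)
```

For example, build_search_url("https://example.com/jobs?q={search}", "data engineer", "+") yields a URL ending in q=data+engineer, where example.com stands in for a real job site.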

Outside of the parameter descriptions, I want to share a few helpful notes from my experience:

  1. In general, if the site supports search parameters / filters as query params, use those and don't mess with having the search manually apply the filters and conduct searches with the search box. If you want to perform searches with multiple keywords on those sites, create multiple searches.
  2. The content of the post is hashed into a content id, borrowing the idea from git. That content id is used to check whether the post was already found. Along those lines, be careful about the content that is matched by the post content selector so that the content of the post is as stable as possible.
  3. Use Playwright codegen or the Playwright extension to find the right get_by selectors to use, but for the CSS selectors using inspect / dev tools will be the most useful. For the record, the search runs in the Chromium browser.
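The content id idea from note 2 can be sketched in a few lines. The hash algorithm (SHA-256) and the whitespace normalization are assumptions made for illustration, and the function names are hypothetical; the search service may hash the content differently.

```python
import hashlib


def content_id(post_text):
    """Hash the job post text into a stable content id."""
    # Assumption: collapse whitespace first so layout-only changes keep the same id.
    normalized = " ".join(post_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(post_text, seen_ids):
    """Return True if this content was already seen; otherwise record it."""
    cid = content_id(post_text)
    if cid in seen_ids:
        return True
    seen_ids.add(cid)
    return False
```

Because any change in the text changes the id, a chatbot greeting or other rotating snippet captured by the content selector would make the same post look new on every run.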

Job Posts

Job posts are added by running searches. Eventually, I think it would be good to add an endpoint and create a browser extension that calls it to allow for a manual add. There are a number of statuses that you can place a job post into, both to filter the list down and to note that you have reviewed the post. Posts are sorted with the best match on top. The match is determined with the default resume and default score. The default score is determined by the matchers, which are currently hard-coded in app.py. The two scores that I personally have found most useful are Text Rank and TF-IDF. The Job Post list also has a compare link, which compares the post to the default resume and shows how the texts intersect and differ, with potential suggestions for changes using the previously mentioned cached synonyms for the resume text.
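For readers unfamiliar with TF-IDF scoring, here is a self-contained sketch of ranking posts against a resume by TF-IDF cosine similarity. It is not the matcher from app.py: the tokenizer, the smoothed IDF formula, and the function names are assumptions chosen to keep the example dependency-free.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using a smoothed IDF."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_posts(resume_text, posts):
    """Return (score, post) pairs with the best match first."""
    vectors = tfidf_vectors([resume_text] + posts)
    resume_vec, post_vecs = vectors[0], vectors[1:]
    scored = [(cosine_similarity(resume_vec, vec), post)
              for vec, post in zip(post_vecs, posts)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Posts that share more distinctive terms with the resume score closer to 1, while a post with no terms in common scores 0.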

Future Roadmap Items

These are some things that I have on my mind to do in the future with this project. Note that this doesn't mean these features will be coming particularly soon or that I will be working on them in any particular order.

  1. Separate out browser interaction from the search service and create a browser service to run scripts and build output that can be invoked and then processed by the search service instead.
  2. Give the UI some love. Make proper use of variables in the CSS and look at properly styling some things. Additionally, just improve the overall readability and usability.
  3. Make use of the separated out browser service to allow for the creation of scripts to automate submitting applications
  4. Create a browser extension to allow for saving more bespoke posts if creating a search seems overkill
  5. Add in some resume building functionality to allow for more automated resume tailoring for a particular post. (In general I'm thinking there would be one set LaTeX template, or possibly a couple of different ones, that the builder would use to output the specific resume. The builder would select bullet points for the position based on how they match the role and possibly make minor edits based on synonym suggestions.)
  6. Add support for multiple users
  7. Add Tests
  8. Make the application more configurable and easier to configure from the UI
  9. Add actual logging
  10. Create and save better metrics about searches to make it easier to determine if they are working as intended
  11. Create a Google Apps Script web app to inspect emails about applications to automate tracking
  12. Improve deduplication of job posts beyond the exact content match