
Welcome to causal-jobs


Find the latest job report here (updates every Tuesday)


causal-jobs is a project dedicated to revealing the status of causal inference in the European job market.


It is based on automatically parsing job alert emails generated with "causal inference" as the search term.

Since May 12, 2022, every EU vacancy that contains this term is extracted from the corresponding email, transformed into an analysis-ready dataframe, and loaded into a PostgreSQL database living in a Docker container. This is done via an ETL pipeline that runs daily. Once a week, an HTML report is automatically generated by analyzing the data and deployed on GitHub Pages. The report contains the latest causal-jobs vacancies, as well as company names and other relevant info. The project is developed and maintained by @ggiannarakis.


Architecture


How to set up

Below is a brief step-by-step guide in case you would like to set up your own causal-jobs project, e.g. to change the analysis, focus on another region or country, or even use another search term.

0. Create the relevant job alert

Set up a daily job alert email from LinkedIn, specifying the search term ("causal inference") and location ("European Union"). In the Gmail account that will receive the alerts, create a new label named causal-jobs. Finally, create a Gmail filter that automatically files the daily job alert emails under the causal-jobs label. This is where the extract.py script will look for new emails.

1. Set up the GmailAPI and virtual environment

Begin by setting up the Gmail API to create a simple Python command-line application that makes requests. This is done through the Google Cloud console. Make sure to complete the OAuth consent screen and push the Gmail API app you have created to production. Remember to put your credentials.json and token.json files in .gitignore. The following links will help:

Also, create a virtual environment that you will activate before running any script (see requirements.txt). Using conda, I have named mine gmailapi.
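
For orientation, the authentication boilerplate looks roughly like the following — a minimal sketch adapted from Google's official Python quickstart, reusing the credentials.json / token.json file names from above (your own extract.py may structure this differently):

```python
# Minimal Gmail API authentication sketch, following Google's Python quickstart.
import os.path

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]


def get_gmail_service():
    creds = None
    # token.json stores the user's access/refresh tokens after the first run.
    if os.path.exists("token.json"):
        creds = Credentials.from_authorized_user_file("token.json", SCOPES)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            # credentials.json comes from the Google Cloud console (step above).
            flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
            creds = flow.run_local_server(port=0)
        with open("token.json", "w") as token:
            token.write(creds.to_json())
    return build("gmail", "v1", credentials=creds)
```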

2. Create the Extract & Transform scripts

Develop an extract.py and a transform.py Python script (see mine or the links above). The extract.py script requests the latest job alert email(s) from the causal-jobs label of your Gmail account and downloads them locally. Then, the transform.py script turns each email into a Pandas dataframe where each row is a job the email contains and each column holds an attribute of that job.
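
Here is a hedged sketch of the idea; the label id, the MIME handling, the column names, and the parse_jobs_from_html helper are all illustrative assumptions, since the real parsing depends on LinkedIn's email markup:

```python
# Sketch of extract + transform. Assumes the alert body lives in a text/html
# MIME part; label id and column names are placeholders.
import base64

import pandas as pd


def extract_latest_email(service, label_id="Label_causal_jobs"):  # hypothetical label id
    resp = service.users().messages().list(
        userId="me", labelIds=[label_id], maxResults=1
    ).execute()
    msg_id = resp["messages"][0]["id"]
    msg = service.users().messages().get(userId="me", id=msg_id, format="full").execute()
    for part in msg["payload"].get("parts", []):
        if part["mimeType"] == "text/html":
            # Gmail returns message bodies base64url-encoded.
            return msg_id, base64.urlsafe_b64decode(part["body"]["data"]).decode()
    return msg_id, ""


def parse_jobs_from_html(html):
    # Hypothetical placeholder: the real logic depends on LinkedIn's markup,
    # e.g. locating job cards with BeautifulSoup.
    return []


def transform(email_id, html):
    # Target shape: one row per vacancy, one column per attribute.
    df = pd.DataFrame(parse_jobs_from_html(html), columns=["title", "company", "location"])
    df["email_id"] = email_id
    return df
```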

3. Create the database

A database is needed for saving the transformed job alert emails and for the weekly analysis script to retrieve all relevant data. For simplicity, I pulled a PostgreSQL image from Docker Hub and used pgAdmin 4 to create and manage the database. The database contains only one table, holding the causal-jobs data.
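
If you would rather create the table from code than from pgAdmin 4, a sketch along these lines works; the connection details and schema are assumptions — match them to whatever columns your transform.py produces:

```python
# Create the single causal-jobs table via psycopg2 (schema is illustrative).
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="causal_jobs",
    user="postgres", password="postgres",  # match your Docker container's settings
)
with conn, conn.cursor() as cur:  # "with conn" commits the transaction on success
    cur.execute("""
        CREATE TABLE IF NOT EXISTS causal_jobs (
            email_id TEXT,
            title    TEXT,
            company  TEXT,
            location TEXT,
            date     DATE
        );
    """)
conn.close()
```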

4. Create the Load script

Develop a load.py Python script (see mine). load.py first executes the extract.py and transform.py scripts, i.e. it retrieves the latest email as a Pandas dataframe that is ready for insertion into the database. Then, it connects to the PostgreSQL database in the running Docker container and performs the following actions (sketched after the list):

  • Retrieve the causal-jobs table containing all past emails
  • If the email_id of the just-retrieved email does not already exist in the table, append the new data to it
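
A minimal sketch of that append-if-new logic, assuming pandas + SQLAlchemy and the hypothetical table/column names from the earlier sketches:

```python
# Append-if-new load step: insert the latest email's jobs only once.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/causal_jobs")


def load(df: pd.DataFrame) -> None:
    # Retrieve the ids of all emails already indexed in the causal-jobs table.
    existing = pd.read_sql("SELECT DISTINCT email_id FROM causal_jobs", engine)
    # Append only if this email has not been seen before.
    if not df.empty and df["email_id"].iloc[0] not in set(existing["email_id"]):
        df.to_sql("causal_jobs", engine, if_exists="append", index=False)
```

The email_id check makes the load idempotent: re-running the pipeline after a hiccup cannot insert the same email twice.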

5. Schedule the daily ETL routine

Create a PowerShell / Bash script that fires up the ETL routine daily, performing all relevant actions (see causal-jobs-powershell.ps1; Windows in my case). Then, using Task Scheduler / cron, set up a daily task at a convenient time to execute that script. If you miss a day or two (e.g. because localhost was offline), you can always run the task manually, provided the missing email is the last among the emails labeled causal-jobs.
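
If you prefer to keep the scheduled entry point in Python rather than PowerShell / Bash, the scheduler can instead call a small wrapper like this (file names and log location are hypothetical):

```python
# run_etl.py — hypothetical Python alternative to the .ps1 wrapper; Task
# Scheduler or cron can invoke it with the virtual environment's interpreter.
import logging
import subprocess
import sys

logging.basicConfig(filename="etl.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

try:
    # load.py itself triggers extract.py and transform.py (see step 4).
    subprocess.run([sys.executable, "load.py"], check=True)
    logging.info("Daily ETL run succeeded")
except subprocess.CalledProcessError:
    logging.exception("Daily ETL run failed")
    raise
```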

6. Set up alerts

Besides proper logging (see the ETL scripts), it is essential to have an alert system in place for when the pipeline execution fails. A simple idea that combines monitoring and alerting is to create an extra send_email.py script that is executed at the end of the daily ETL. This script composes an email with the data (causal-job) that was just indexed in the database (if any) and sends it to your own Gmail address. Note that for privacy reasons I am not including my own send_email.py script on GitHub, but this tutorial should be straightforward to follow.
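
As a generic substitute, here is a sketch of such an alert script using smtplib with a Gmail app password — an assumption on my part; the Gmail API's send endpoint works just as well:

```python
# Generic alert sketch: email yourself the just-indexed data via Gmail SMTP.
import smtplib
from email.message import EmailMessage


def send_alert(subject, body, address, app_password):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = address
    msg["To"] = address  # send the summary to yourself
    msg.set_content(body)
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(address, app_password)  # requires a Gmail app password
        smtp.send_message(msg)
```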

7. Create the Analysis script

Develop a Python script that connects to the PostgreSQL database of the running Docker container and performs all the causal-jobs analysis you would like to see! See the Analysis.ipynb notebook for inspiration.
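
As a tiny starting point, assuming the table and column names from the earlier sketches, an analysis cell might look like:

```python
# Pull the full causal-jobs table and count vacancies per company.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/causal_jobs")
jobs = pd.read_sql("SELECT * FROM causal_jobs", engine)

# Which companies post the most causal inference vacancies?
print(jobs.groupby("company").size().sort_values(ascending=False).head(10))
```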

8. Schedule the weekly Analysis report & upload it

Just like step 5, schedule a weekly PowerShell / Bash script that executes the Jupyter notebook Analysis.ipynb, generating the weekly report. The report is then exported to HTML via the command line. Create a GitHub Pages site for your repo. Finally, using the causal-analysis-weekly-powershell.ps1 script above, push the latest weekly report as index.html (required for GitHub Pages deployment) from localhost to the GitHub repo. Your report is live!
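
For reference, the same two actions can also be driven from Python; the nbconvert flags are standard, and the output name index.html matches the GitHub Pages requirement above (the repo is assumed to be the current directory):

```python
# Execute the notebook, export it to HTML, and push the report to GitHub Pages.
import subprocess

# nbconvert runs Analysis.ipynb and writes the result as index.html in one call.
subprocess.run([
    "jupyter", "nbconvert", "--to", "html", "--execute",
    "Analysis.ipynb", "--output", "index.html",
], check=True)

# Push the refreshed report so GitHub Pages serves it.
for cmd in (["git", "add", "index.html"],
            ["git", "commit", "-m", "Weekly causal-jobs report"],
            ["git", "push"]):
    subprocess.run(cmd, check=True)
```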
