Find the latest job report here (updates every Tuesday)
causal-jobs is a project dedicated to revealing the status of causal inference in the European job market. It is based on automatically parsing the job alert emails generated for the search term "causal inference".
Since May 12, 2022, every EU vacancy that contains this term has been extracted from the corresponding email, transformed into an analysis-ready dataframe, and loaded into a PostgreSQL database living in a Docker container. This is done via an ETL pipeline that runs daily. Once a week, an HTML report is automatically generated from the data and deployed on GitHub Pages. The report contains the latest causal-jobs vacancies, as well as company names and other relevant info. The project is developed and maintained by @ggiannarakis.
Below is a brief step-by-step guide in case you would like to set up your own causal-jobs project, e.g. to change the analysis, focus on another region or a specific country, or even consider another search term.
Set up a daily job alert email from LinkedIn, specifying the search term ("causal inference") and the location ("European Union"). In the Gmail account that will receive the alerts, create a new label named causal-jobs. Finally, create a Gmail filter that automatically files the daily job alert emails under the causal-jobs label. This is where the `extract.py` script will look for new emails.
Begin by setting up the Gmail API to create a simple Python command-line application that makes requests. This is done through the Google Cloud console. Make sure to complete the OAuth consent screen and push the Gmail API app you have created to production. Remember to put your `credentials.json` and `token.json` files on `.gitignore`. The official Gmail API Python quickstart will help with these steps. Also, create a virtual environment that you will activate before running any script (see `requirements.txt`). Using conda, I have named mine `gmailapi`.
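For reference, here is a minimal authentication sketch along the lines of the official quickstart; it assumes `credentials.json` sits next to the script and caches the OAuth token in `token.json`:

```python
# Minimal Gmail API authentication sketch, adapted from the official Python
# quickstart. Assumes credentials.json was downloaded from the Cloud console.
import os.path

from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]

creds = None
if os.path.exists("token.json"):
    creds = Credentials.from_authorized_user_file("token.json", SCOPES)
if not creds or not creds.valid:
    # First run: opens a browser window for the OAuth consent flow.
    flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
    creds = flow.run_local_server(port=0)
    with open("token.json", "w") as token:
        token.write(creds.to_json())

service = build("gmail", "v1", credentials=creds)  # ready to make requests
```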
Develop an `extract.py` and a `transform.py` Python script (see mine, or the links above). The `extract.py` script requests the latest job alert email(s) from the causal-jobs label of your Gmail and downloads them to localhost via Python. The `transform.py` script then transforms each email into a Pandas dataframe where each row is a job contained in the email and each column holds a piece of information about it.
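To make the flow concrete, here is a rough sketch of both steps. The Gmail API calls are standard; the MIME walking and the CSS selector are hypothetical, since the exact structure of LinkedIn's alert emails varies:

```python
# Rough sketch of extract.py / transform.py. Assumes the `service` object
# from the authentication sketch above. The MIME walking and the selector
# are hypothetical: inspect a real LinkedIn alert email to adapt them.
import base64

import pandas as pd
from bs4 import BeautifulSoup


def extract_latest(service):
    """Fetch the newest email filed under the causal-jobs label."""
    resp = service.users().messages().list(
        userId="me", q="label:causal-jobs", maxResults=1
    ).execute()
    msg_id = resp["messages"][0]["id"]  # assumes at least one alert exists
    msg = service.users().messages().get(
        userId="me", id=msg_id, format="full"
    ).execute()
    # Pick the HTML part of the multipart message (structure may differ).
    part = next(p for p in msg["payload"]["parts"] if p["mimeType"] == "text/html")
    data = part["body"]["data"]
    html = base64.urlsafe_b64decode(data + "=" * (-len(data) % 4)).decode("utf-8")
    return msg_id, html


def transform(email_id, html):
    """Turn one alert email into a dataframe with one row per job."""
    soup = BeautifulSoup(html, "html.parser")
    rows = [
        {"email_id": email_id, "job_title": a.get_text(strip=True)}
        for a in soup.select("a[href*='/jobs/view/']")  # hypothetical selector
    ]
    return pd.DataFrame(rows)
```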
A database is needed to store the transformed job alert emails and to let the weekly analysis script retrieve all relevant data. For simplicity, I pulled a PostgreSQL image from Docker Hub and used pgAdmin 4 to create and manage the database. The database contains a single table featuring the causal-jobs data.
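If you would rather create the table in code than through pgAdmin 4, a sketch like the following works too; the connection string and column set are assumptions to adapt to your own setup:

```python
# Hypothetical one-off table creation via SQLAlchemy; the original project
# does this through pgAdmin 4 instead. Connection string and columns are
# assumptions to adapt to your container's settings.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5432/causal_jobs"
)

with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS causal_jobs (
            email_id   TEXT,
            job_title  TEXT,
            company    TEXT,
            location   TEXT,
            email_date DATE
        )
    """))
```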
Develop a `load.py` Python script (see mine). `load.py` first executes the `extract.py` and `transform.py` scripts, i.e. it retrieves the latest email as a Pandas dataframe that is ready for insertion into the database. Then it connects to the PostgreSQL database in the running Docker container and performs the following actions (a sketch follows the list):

- Retrieve the causal-jobs table containing all past emails
- If the `email_id` of the newly retrieved email does not exist in the table, append the data to it
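Here is a minimal sketch of that logic; the connection string and the `get_latest_email` / `email_to_df` helpers are hypothetical stand-ins for your own extract / transform code:

```python
# Sketch of load.py's check-then-append step. get_latest_email() and
# email_to_df() are hypothetical stand-ins for the extract/transform scripts,
# and the connection string is an assumption.
import pandas as pd
from sqlalchemy import create_engine

from extract import get_latest_email    # hypothetical names; adapt to yours
from transform import email_to_df

engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5432/causal_jobs"
)

new_jobs = email_to_df(get_latest_email())  # one row per job, incl. email_id

existing_ids = set(
    pd.read_sql("SELECT DISTINCT email_id FROM causal_jobs", engine)["email_id"]
)
if new_jobs["email_id"].iloc[0] not in existing_ids:
    new_jobs.to_sql("causal_jobs", engine, if_exists="append", index=False)
```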
Create a PowerShell / Bash script that fires up the ETL routine daily, performing all relevant actions (see `causal-jobs-powershell.ps1`; Windows in my case). Then, using Task Scheduler / cron, set up a daily task at a convenient time to execute the aforementioned script. If you miss a day or two (e.g. due to localhost being offline), you can always execute the task manually, provided the missing email is the most recent among the emails labeled causal-jobs.
Besides proper logging (see the ETL scripts), it is essential to have an alert system in place for when the pipeline execution fails. A simple idea that combines monitoring and alerting is to create an extra `send_email.py` script that is executed at the end of the daily ETL. This script creates an email with the data (causal-job) that was just indexed in the database (if any) and sends it to your own Gmail address. Note that for privacy reasons I am not including my own `send_email.py` script on GitHub, but this tutorial should make it straightforward.
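As a hint of what such a script could look like, here is a hedged sketch using `smtplib`; the addresses and app password are placeholders:

```python
# Hedged sketch of a send_email.py-style alert via smtplib. Sending through
# Gmail's SMTP server requires an app password (with 2-Step Verification
# enabled); the address and password below are placeholders.
import smtplib
from email.message import EmailMessage


def send_alert(subject: str, body: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "you@gmail.com"   # placeholder address
    msg["To"] = "you@gmail.com"
    msg.set_content(body)

    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login("you@gmail.com", "your-app-password")  # placeholders
        server.send_message(msg)


# e.g. called at the end of the daily ETL run
send_alert("causal-jobs ETL", "Daily run finished.")
```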
Develop a Python script that connects to the PostgreSQL database of the running Docker container and performs whatever causal-jobs analysis you would like to see! See the `Analysis.ipynb` notebook for inspiration.
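As a starting point, something along these lines pulls the whole table into Pandas and ranks the most active companies; the connection string and column names are assumptions based on the schema sketched above:

```python
# Sketch of a weekly analysis: load all causal-jobs rows and rank companies
# by vacancy count. Connection string and column names are assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5432/causal_jobs"
)

jobs = pd.read_sql("SELECT * FROM causal_jobs", engine)
print(jobs["company"].value_counts().head(10))  # top hiring companies
```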
Just like #5, schedule a weekly PowerShell / Bash script that executes the Jupyter notebook `Analysis.ipynb`, generating the weekly report. The report is then exported to HTML via the command line (e.g. with `jupyter nbconvert --to html`). Create a GitHub page for your repo. Finally, using the `causal-analysis-weekly-powershell.ps1` script above, push the latest weekly report as `index.html` (required for GitHub Pages deployment) from localhost to the GitHub repo. Your report is live!