This repo contains Jupyter notebooks and helper functions exploring the use of the Structural Topic Model (STM) on a corpus of job offers scraped from indeed.co.uk. This exercise is meant as an example of how to use STM and its R package, stm; the results presented do not have any real statistical validity.
The repo contains the following files:
- CleanIndeedG.R: incorporates "CoordIndeedG.R" and "CleanSalaryIndeedG.R";
- CoordIndeedG.R: function to extract coordinates from vacancies on Indeed and store them in two new columns;
- CleanSalaryIndeedG.R: function to extract the minimum and maximum salary and the factor by which the rate is computed;
- Indeed RJupyterNB.ipynb: notebook with data scraping and cleaning workflow;
- Indeed RJupyterNB2b.ipynb: notebook with model selection and overview of main functions of stm;
- Indeed RJupyterNB3_Salaries_Location.ipynb: notebook with analysis of topic content metadata (salary and a dummy variable for location in Newcastle);
- ScrapingIndeedCodeG.R: the function presented in the first section of "Indeed RJupyterNB.ipynb" to scrape data from Indeed;
- totaljobs.txt: dataset;
- totaljobsCoord.txt: dataset with coordinates;
- totaljobsCoordRates.txt: dataset with coordinates and salary rates.
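
To give an idea of what a salary-cleaning step like the one in "CleanSalaryIndeedG.R" might involve, here is a minimal sketch in base R. It is not the code from this repo: the function name `parse_salary`, the output column names, and the assumed input format (strings such as "£25,000 - £30,000 a year") are all illustrative assumptions.

```r
# Hypothetical helper (not the repo's implementation): parse a salary string
# into a minimum, a maximum and the period the rate refers to.
# Assumed input format: "£25,000 - £30,000 a year", "£10.50 an hour", etc.
parse_salary <- function(x) {
  # Extract every number, allowing thousands separators and decimals.
  nums <- regmatches(x, gregexpr("[0-9][0-9,]*(\\.[0-9]+)?", x))[[1]]
  nums <- as.numeric(gsub(",", "", nums))

  # Detect the rate period from the wording (assumed vocabulary).
  periods <- c("hour", "day", "week", "month", "year")
  found <- vapply(periods, function(p) grepl(p, x, ignore.case = TRUE), logical(1))
  period <- if (any(found)) periods[found][1] else NA_character_

  # No number found: return missing values for both bounds.
  if (length(nums) == 0) {
    return(data.frame(min_salary = NA_real_, max_salary = NA_real_,
                      period = period, stringsAsFactors = FALSE))
  }

  # A single figure yields identical minimum and maximum.
  data.frame(min_salary = min(nums), max_salary = max(nums),
             period = period, stringsAsFactors = FALSE)
}

# Example usage on two assumed input strings.
parse_salary("£25,000 - £30,000 a year")
parse_salary("£10.50 an hour")
```

Keeping the period as a separate column, rather than converting everything to a yearly figure straight away, leaves the choice of conversion factor (hours per week, weeks per year) to a later step.
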
The documents have also been published on my personal blog.