Note: as of 2/5/2021, changes to Indeed's page structure mean the scraper may no longer work correctly.
This repo contains Jupyter notebooks and supporting functions that explore the use of the Structural Topic Model (STM) on a corpus of job offers scraped from indeed.co.uk. This exercise is meant as an example of how to use STM and the accompanying R package, stm; the results presented have no real statistical validity.
The repository contains the following files:
- CleanIndeedG.R: incorporates "CoordIndeedG.R" and "CleanSalaryIndeedG.R";
- CoordIndeedG.R: function to extract coordinates from vacancies on Indeed and store them in two new columns;
- CleanSalaryIndeedG.R: function to extract the minimum and maximum salary and the factor by which the rate is computed;
- Indeed RJupyterNB.ipynb: notebook with data scraping and cleaning workflow;
- Indeed RJupyterNB2b.ipynb: notebook with model selection and overview of main functions of stm;
- Indeed RJupyterNB3_Salaries_Location.ipynb: notebook with analysis of topic content metadata (salary and dummy variable with location in Newcastle);
- ScrapingIndeedCodeG.R: the function presented in the first section of "Indeed RJupyterNB.ipynb" to scrape data from Indeed;
- totaljobs.txt: dataset;
- totaljobsCoord.txt: dataset with coordinates;
- totaljobsCoordRates.txt: dataset with coordinates and salary rates.
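
To give a feel for what the salary-cleaning step involves, here is a minimal sketch in base R. It is not the code in "CleanSalaryIndeedG.R": the function name `parse_salary_sketch`, the output column names, and the assumed input format (strings such as "£25,000 - £30,000 a year") are all hypothetical and chosen for illustration only. It also returns the pay period as a label rather than as a numeric factor.

```r
# Hypothetical helper for illustration; NOT the repository's CleanSalaryIndeedG.R.
# Assumes salary strings look like "£25,000 - £30,000 a year" or "£12.50 an hour".
parse_salary_sketch <- function(salary_text) {
  # Drop thousands separators so "25,000" becomes "25000".
  cleaned <- gsub(",", "", salary_text, fixed = TRUE)

  # Extract every number (with optional decimals) from each string.
  matches <- regmatches(cleaned, gregexpr("[0-9]+(\\.[0-9]+)?", cleaned))
  amounts <- lapply(matches, as.numeric)

  # Smallest and largest amount per string; NA when no number was found.
  min_salary <- vapply(amounts, function(x) if (length(x)) min(x) else NA_real_, numeric(1))
  max_salary <- vapply(amounts, function(x) if (length(x)) max(x) else NA_real_, numeric(1))

  # Label the pay period by the first period keyword found in the string.
  rate <- rep(NA_character_, length(salary_text))
  for (period in c("year", "month", "week", "day", "hour")) {
    hit <- is.na(rate) & grepl(period, salary_text, ignore.case = TRUE)
    rate[hit] <- period
  }

  data.frame(min_salary, max_salary, rate, stringsAsFactors = FALSE)
}

# "\u00a3" is the pound sign; the empty string stands for a vacancy with no salary.
example <- c("\u00a325,000 - \u00a330,000 a year", "\u00a312.50 an hour", "")
print(parse_salary_sketch(example))
```

A single quoted amount yields the same value for the minimum and the maximum, and a string with no number yields `NA` in every column, so vacancies without salary information can be filtered out later.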
The notebooks have also been published on my personal blog.