
A tool to assign standard occupational classification codes to job vacancy descriptions

Given a job title, job description, and job sector the algorithm assigns a UK 3-digit standard occupational classification (SOC) code to the job. The algorithm uses the SOC 2010 standard, more details of which can be found on the ONS' website.
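As a small illustration of what a 3-digit code encodes: in SOC 2010 the first digit identifies the major group, the first two digits the sub-major group, and all three digits the minor group. The helper below is a hypothetical sketch for reading a code at each level; `soc_levels` is not part of occupationcoder.

```python
def soc_levels(code):
    """Split a 3-digit SOC 2010 code into its hierarchy levels.

    Hypothetical helper for illustration only; not part of occupationcoder.
    """
    code = str(code).strip()
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"expected a 3-digit SOC code, got {code!r}")
    return {
        "major_group": code[0],       # 1 digit
        "sub_major_group": code[:2],  # 2 digits
        "minor_group": code,          # 3 digits
    }


levels = soc_levels(211)
```

Grouping results by `major_group` or `sub_major_group` is one way to aggregate coded vacancies to a coarser level.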

This code was originally written by Jyldyz Djumalieva, Arthur Turrell, David Copple, James Thurgood, and Bradley Speigner. Martin Wood has provided more recent code updates and improvements.

If you use this code please cite:

Turrell, A., Speigner, B., Djumalieva, J., Copple, D., & Thurgood, J. (2019). Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings (No. w25837). National Bureau of Economic Research.

  title={Transforming naturally occurring text data into economic statistics: The case of online job vacancy postings},
  author={Turrell, Arthur and Speigner, Bradley and Djumalieva, Jyldyz and Copple, David and Thurgood, James},
  institution={National Bureau of Economic Research}


See requirements.txt for a full list of Python packages.

occupationcoder is built on top of NLTK and uses 'Wordnet' (a corpus, number 82 on their list) and the Punkt Tokenizer Models (number 106 on their list). When the coder is run, it expects to find these in their usual directories. If you have NLTK installed, you can fetch them with NLTK's downloader, which installs them in the right directories, or you can download them manually from the NLTK data pages (and follow the install instructions).
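A minimal sketch of fetching the two resources programmatically, assuming the downloader identifiers `wordnet` and `punkt` and a standard NLTK data layout (both are assumptions here; identifiers can differ between NLTK versions). The function name `ensure_nltk_data` is hypothetical.

```python
# Downloader identifier -> path NLTK searches for it (assumed standard layout).
REQUIRED_NLTK_DATA = {
    "wordnet": "corpora/wordnet",
    "punkt": "tokenizers/punkt",
}


def ensure_nltk_data(resources=REQUIRED_NLTK_DATA):
    """Download each resource only if NLTK cannot already find it.

    Returns the list of resource names that had to be downloaded.
    """
    import nltk  # imported lazily so this module loads without NLTK installed

    missing = []
    for name, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(name)
    for name in missing:
        nltk.download(name)
    return missing
```

Calling `ensure_nltk_data()` once before creating the coder needs internet access only when something is actually missing.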

A couple of the other packages, such as rapidfuzz do not come with the Anaconda distribution of Python. You can install these via pip (if you have access to the internet) or download the relevant binaries and install them manually.

File and folder description

  • occupationcoder/ applies SOC codes to job descriptions
  • occupationcoder/ contains helper functions that mostly manipulate strings
  • occupationcoder/createdictionaries turns the ONS' index of SOC codes into dictionaries used by occupationcoder/
  • occupationcoder/dictionaries contains the dictionaries used by occupationcoder/
  • occupationcoder/outputs is the default output directory
  • occupationcoder/tests/test_vacancies.csv contains 'test' vacancies to run the code on, used by unittests, accessible by you!

Installation via terminal using pip

Download the package and navigate to the download directory. Then use

python setup.py sdist
cd dist
pip install occupationcoder-<version>.tar.gz

The first line creates the .tar.gz file, the second navigates to the directory with the packaged code in, and the third line installs the package. The version number to use will be evident from the name of the .tar.gz file.

Running the code as a Python script

Importing the coder and creating an instance

import pandas as pd
from occupationcoder.coder import SOCCoder
myCoder = SOCCoder()

To code a single record, call the code_record(job_title, job_description, job_sector) method:

if __name__ == '__main__':
    myCoder.code_record('Physicist', 'Calculations of the universe', 'Professional scientific')

Note that you can leave some of the fields blank and the algorithm will still return a SOC code.
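The sketch below shows one way to code several records when some fields are missing. It assumes that a "blank" field means an empty string, which the text above does not specify. `code_records` and `EchoCoder` are hypothetical names; `EchoCoder` only stands in for SOCCoder so the example runs without the package installed.

```python
FIELDS = ("job_title", "job_description", "job_sector")


def code_records(coder, records):
    """Return one result per record, passing '' for any missing field.

    `coder` is any object with a code_record(title, description, sector)
    method, such as an SOCCoder instance.
    """
    results = []
    for record in records:
        # Treat absent keys and None values alike as blank (assumption).
        title, description, sector = (record.get(f) or "" for f in FIELDS)
        results.append(coder.code_record(title, description, sector))
    return results


class EchoCoder:
    """Hypothetical stand-in: returns its arguments instead of a SOC code."""

    def code_record(self, title, description, sector):
        return (title, description, sector)


echoed = code_records(EchoCoder(), [{"job_title": "Physicist"}])
```

With the real class, the same call would be `code_records(SOCCoder(), records)`.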

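To run the code on a file (e.g. a CSV file named 'job_file.csv'), give the file the structure shown in the table below, then load it with pandas and pass the resulting dataframe to code_data_frame: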

| job_title | job_description | job_sector |
| --- | --- | --- |
| Physicist | Make calculations about the universe, do research, perform experiments and understand the physical environment. | Professional, scientific & technical activities |


df = pd.read_csv('path/to/job_file.csv')
df = myCoder.code_data_frame(df, title_column='job_title', sector_column='job_sector', description_column='job_description')

The column name arguments are optional; the values shown above are the defaults. The call returns a new dataframe with the SOC codes appended in a new column:

| job_title | job_description | job_sector | SOC_code |
| --- | --- | --- | --- |
| Physicist | Make calculations about the universe, do research, perform experiments and understand the physical environment. | Professional, scientific & technical activities | 211 |

Running the code from the command line

If you have installed all the packages listed in requirements.txt, download the code and navigate to the occupationcoder folder (which contains the README). Then run

python -m occupationcoder.coder path/to/foo.csv

This will create a 'processed_jobs.csv' file in the outputs/ folder which has the original text and an extra 'SOC_code' column with the assigned SOC codes.
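As a quick check on the result, the sketch below tallies how many vacancies received each code. It assumes only what is stated above: a CSV with a 'SOC_code' column. The sample rows are made up for illustration and `count_soc_codes` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_soc_codes(handle, column="SOC_code"):
    """Count non-empty values of the SOC code column in an open CSV file."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader if row.get(column))


# Made-up stand-in for the contents of outputs/processed_jobs.csv.
sample_output = io.StringIO(
    "job_title,job_description,job_sector,SOC_code\n"
    "Physicist,Research,Professional,211\n"
    "Chemist,Lab work,Professional,211\n"
    "Unknown,,,\n"
)
counts = count_soc_codes(sample_output)
```

For the real file, open it with `open('outputs/processed_jobs.csv', newline='')` and pass the handle to `count_soc_codes`. Rows with an empty code are skipped rather than counted.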


To run the tests in your virtual environment, use

python -m unittest

in the top level occupationcoder directory. Look in occupationcoder/tests/ for what is run and for examples of use. The output appears in the 'processed_jobs.csv' file in the outputs/ folder.


We are very grateful to Emmet Cassidy for testing this algorithm.


This code is provided 'as is'. We would love it if you made it better or extended it to work for other countries. All views expressed are our personal views, not those of any employer.


The development of this package was supported by the Bank of England.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

