Example based on the Qdrant vector search [tutorial](https://qdrant.tech/documentation/tutorials/search-beginners/). This notebook helps classify job postings according to the International Standard Classification of Occupations (ISCO) [(link)](https://ilostat.ilo.org/resources/concepts-and-definitions/classification-occupation/).

In [1]:
!pip install -U sentence-transformers qdrant-client > /dev/null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorboard 2.15.1 requires protobuf<4.24,>=3.19.6, but you have protobuf 4.25.2 which is incompatible.
tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 4.25.2 which is incompatible.

In [27]:
import pandas as pd
import numpy as np
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

Read the data from the [ILO website](https://ilostat.ilo.org/resources/concepts-and-definitions/classification-occupation/).

In [11]:
data_df = pd.read_excel("isic index.xlsx")

In [12]:
data_df.head()

Unnamed: 0,Sortorder,Code,Description,ExplanatoryNoteInclusion,ExplanatoryNoteExclusion
0,20,A,"Agriculture, forestry and fishing",This section includes the exploitation of vege...,
1,30,01,"Crop and animal production, hunting and relate...","This division includes two basic activities, n...",
2,40,011,Growing of non-perennial crops,This group includes the growing of non-perenni...,
3,50,0111,"Growing of cereals (except rice), leguminous c...",This class includes all forms of growing of ce...,This class excludes:\n- growing of maize for f...
4,60,0112,Growing of rice,This class includes:\n- growing of rice (inclu...,


Filter only level 4 categories.

In [9]:
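# Keep only the 4-character codes; .copy() avoids pandas' SettingWithCopyWarning when columns are added later.
data_df = data_df[data_df["Code"].apply(lambda x: len(str(x)) == 4)].copy()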

In [10]:
data_df.head()

Combine the description, inclusion notes, and exclusion notes into a single column, and check the maximum text length in words.

In [19]:
data_df["text"] = data_df.apply(lambda row: " ".join([str(row["Description"]), str(row["ExplanatoryNoteInclusion"]), str(row["ExplanatoryNoteExclusion"])]), axis=1)
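One thing to watch in the cell above: `str()` turns a missing note (`NaN`) into the literal word `nan`, which then becomes part of the text that gets embedded. Below is a minimal sketch of one way to skip missing values. `combine_fields` is a hypothetical helper, not part of the original notebook, and the sample strings are placeholders.

```python
import math


def combine_fields(*fields):
    """Join the non-missing fields with single spaces.

    Hypothetical helper for illustration: None and float NaN (the value
    pandas uses for empty cells) are treated as missing and skipped.
    """
    parts = []
    for value in fields:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


# Placeholder values: a title, an inclusion note, and a missing exclusion note.
example = combine_fields("Growing of rice", "Inclusion note text", float("nan"))
print(example)
```

In the notebook this could be applied row-wise, for example `data_df.apply(lambda row: combine_fields(row["Description"], row["ExplanatoryNoteInclusion"], row["ExplanatoryNoteExclusion"]), axis=1)`.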

In [20]:
data_df["words"] = data_df.apply(lambda row: len(row["text"].split(" ")), axis=1)

In [21]:
print("Maximum description length (words): {}".format(data_df.words.max()))

Maximum description length (words): 91


Select the necessary columns, rename them to avoid spaces in variable names, and convert the result into a list of dictionaries (one per row) for insertion into the Qdrant vector DB.

In [23]:
# The payload key must match the key read back from the search hits below ("isco_code").
data_dict = data_df[["Code", "Description", "text"]].rename(columns={"Description": "title", "Code": "isco_code"}).to_dict(orient="records")

Encode the data and add it to the Qdrant vector DB. This example uses an in-memory database; you can find examples of persistent data storage on the [Qdrant website](https://qdrant.tech/documentation/).

In [24]:
encoder = SentenceTransformer("all-mpnet-base-v2")
qdrant = QdrantClient(":memory:")
qdrant.recreate_collection(
    collection_name="jobs",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)

True
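The collection is configured with `models.Distance.COSINE`, so the scores returned by a search are cosine similarities between the query vector and the stored vectors. As a rough illustration in plain Python (the function name `cosine_similarity` and the toy vectors are made up for this sketch; Qdrant computes the score internally):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings have many more dimensions.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score close to 0.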

In [34]:
print("Embedding dimensions: {}".format(encoder.get_sentence_embedding_dimension()))

Embedding dimensions: 768


In [37]:
qdrant.upload_points(
    collection_name="jobs",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(job["text"]).tolist(), payload=job
        )
        for idx, job in enumerate(data_dict)
    ],
)


RuntimeError: Numpy is not available

Job description and title for an example [job posting](https://www.jobs.co.ug/job/finance-&-accountancy-manager/350). If the job description exceeds the encoder's maximum input length (measured in tokens, not in embedding dimensions), the excess text is truncated, which may degrade the search results.
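One possible way to handle a very long job description is to split it into overlapping windows and search with each window separately. The sketch below is only an illustration: `split_into_chunks` is a hypothetical helper, the word count is a rough proxy for the model's token count, and the default `max_words` and `overlap` values are arbitrary assumptions.

```python
def split_into_chunks(text, max_words=200, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Ten placeholder words, windows of 4 words that share 1 word with the previous one.
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_into_chunks(sample, max_words=4, overlap=1)
print(chunks)
```

Each chunk could then be encoded and searched on its own, leaving it to the human operator to compare the hits from the different chunks.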

In [52]:
job_title = "RELIGIOUS LEADER"

job_description = """

"""

In [25]:
hits_description = qdrant.search(
    collection_name="jobs",
    query_vector=encoder.encode(job_description).tolist(),
    limit=3,
)

In [26]:
for hit in hits_description:
  print("ISCO Code: {} - Title: {} - Score: {:.2f}".format(hit.payload.get("isco_code"), hit.payload.get("title"), hit.score))

ISCO Code: 5242 - Title: Sales Demonstrators - Score: 0.19
ISCO Code: 2431 - Title: Advertising and Marketing Professionals - Score: 0.13
ISCO Code: 5241 - Title: Fashion and Other Models - Score: 0.12


In [53]:
hits_title = qdrant.search(
    collection_name="jobs",
    query_vector=encoder.encode(job_title).tolist(),
    limit=3,
)

In [54]:
for hit in hits_title:
  print("ISCO Code: {} - Title: {} - Score: {:.2f}".format(hit.payload.get("isco_code"), hit.payload.get("title"), hit.score))

ISCO Code: 2636 - Title: Religious Professionals - Score: 0.52
ISCO Code: 3413 - Title: Religious Associate Professionals - Score: 0.47
ISCO Code: 1113 - Title: Traditional Chiefs and Heads of Villages - Score: 0.21


As you can see, the long description, with its several medical references, steers the model towards Health service manager jobs, whereas the title points more towards finance jobs. The search also returns a similarity score for each result, which can be read as a rough indication of how confident the model is that the result matches the original query. REMEMBER THAT MODELS CAN BE VERY CONFIDENTLY WRONG; even so, this score may guide the human operator towards a choice.
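As a sketch of how the score might guide the human operator, the snippet below flags a result for manual review when the best score is low or when the two best scores are close together. `flag_for_review` is a hypothetical helper, the thresholds `min_score` and `min_margin` are arbitrary assumptions rather than recommended values, and the hits are plain `(code, title, score)` tuples copied from the outputs above instead of Qdrant result objects.

```python
def flag_for_review(hits, min_score=0.4, min_margin=0.1):
    """Return (best_hit, needs_review) for hits given as (code, title, score)."""
    if not hits:
        return None, True
    ranked = sorted(hits, key=lambda hit: hit[2], reverse=True)
    best = ranked[0]
    # A low top score means nothing in the collection is a close match.
    too_weak = best[2] < min_score
    # A small gap to the runner-up means the top choice is ambiguous.
    too_close = len(ranked) > 1 and (best[2] - ranked[1][2]) < min_margin
    return best, too_weak or too_close


title_hits = [
    ("2636", "Religious Professionals", 0.52),
    ("3413", "Religious Associate Professionals", 0.47),
    ("1113", "Traditional Chiefs and Heads of Villages", 0.21),
]
best, needs_review = flag_for_review(title_hits)
print(best[0], needs_review)
```

With these thresholds the title search above would be flagged, because the two religious occupation groups score within 0.05 of each other.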