<a href="https://colab.research.google.com/github/ayundina/job_posts_analysis/blob/main/extract_key_phrases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Extract phrases**

In [2]:
%%capture
# openai.ChatCompletion (used below) was removed in openai 1.0, so pin an earlier release
!pip install "openai<1.0"

Prompt an OpenAI model to extract key phrases from job descriptions.

In [12]:
SYSTEM_PROMPT = ("You are a smart and intelligent Named Entity Recognition "
                 "(NER) system. I will provide you the definition of the "
                 "entities you need to extract, the text samples from where "
                 "you extract the entities and the output format with "
                 "examples.")
USER_PROMPT_1 = "Are you clear about your role?"
ASSISTANT_PROMPT_1 = ("Sure, I'm ready to help you with your NER task. "
                      "Please provide me with the necessary information "
                      "to get started.")
GUIDELINES_PROMPT = (
    "Entity Definition:\n"
    "1. EDU: Name of a degree.\n"
    "2. EXP: Number of years of work experience. Years can also be in natural "
    "language.\n"
    "3. TOOL: Name of any programming language or engineering tool.\n"
    "4. TECH: Any type of technology or skill.\n"
    "\n"
    "Entity Format:\n"
    "All entities must be formatted as follows:\n"
    "- lowercased\n"
    "- 'Ph.D' or 'PhD' or other modifications of a Doctorate degree should "
    "become 'phd'\n"
    "- 'Master’s' or 'Master' or 'Masters' or 'MSc' or other modifications of "
    "a Master’s degree should become 'msc'\n"
    "- 'BSc' or any other modifications of a bachelor degree should be shown "
    "as 'bsc'\n"
    "- 'Machine Learning' or 'machine learning' or its abbreviation - 'ML' "
    "should become 'ml'\n"
    "- 'Artificial Intelligence' or 'artificial intelligence' or its "
    "abbreviation - 'AI'  should become 'ai'\n"
    "- 'AI/ML' or 'ML/AI' or 'Machine Learning/Artificial Intelligence' or "
    "'Artificial Intelligence/Machine Learning' should become two separate "
    "entities: 'ai' and 'ml'\n"
    "- 'Natural Language Processing' or its abbreviation - 'NLP' should "
    "become 'nlp'\n"
    "- 'Large Language Model' or its abbreviation 'LLM' should become 'llm'\n"
    "- from the working experience remove spaces and words, leave digits and "
    "pluses or dashes only. For example '3 – 5 years of experience' should "
    "become '3-5' and '3+ years' should become '3+' and '3 years of "
    "experience' should become '3'\n"
    "\n"
    "Output Format:\n"
    "{'EDU': {set of entities present}, 'EXP': {set of entities present}, "
    "'TOOL': {set of entities present}, 'TECH': {set of entities present}}\n"
    "If no entities are presented in any categories keep it as an empty set: "
    "{}\n"
    "\n"
    "Examples:\n"
    "\n"
    "1. Text sample: Have a strong quantitative background (statistics, "
    "machine learning, sciences, etc.). Master’s or Ph.D. in NLP, "
    "Computational Linguistics, Data Science, Computer Science, or another "
    "relevant field. 4+ years experience in applied R&D in the field of NLP / "
    "LLM, reasoning or related fields.\n"
    "{'EDU': {'msc', 'phd'}, 'EXP': {'4+'}, 'TOOL': {}, 'TECH': "
    "{'statistics', 'ml', 'sciences', 'computational linguistics', "
    "'data science', 'computer science', 'nlp', 'llm', 'reasoning'}}\n"
    "\n"
    "2. Text sample: 3+ years of relevant work experience (or equivalent), "
    "involved with the application of Machine Learning and Artificial "
    "Intelligence to business problems in a commercial environment. "
    "Demonstrable experience of multiple machine learning facets, such as "
    "working with large data sets and large language models, experimentation, "
    "scalability and optimization. Fluency in at least one programming "
    "language, with a strong preference for Python. Strong working knowledge "
    "of Spark and SQL. Experience working in a cloud environment is a plus. "
    "Experience with data-driven product development: analytics, A/B testing, "
    "etc. You have a ‘can do’ attitude and you act proactively and not "
    "reactively. BSc or higher in Computer Science, Artificial Intelligence, "
    "Software Engineering, or related fields. Excellent English communication "
    "skills, both written and verbal.\n"
    "{'EDU': {'bsc'}, 'EXP': {'3+'}, 'TOOL': {'python', 'spark', 'sql'}, "
    "'TECH': {'ml', 'ai', 'large data sets', 'llm', 'experimentation', "
    "'scalability', 'optimization', 'cloud environment', 'analytics', "
    "'a/b testing', 'computer science', 'software engineering', "
    "'english'}}\n"
    "\n"
    "3. Text sample: University graduate (Master or PhD level) in Computer "
    "Science, Artificial Intelligence, Computational Linguistics or an "
    "associated area. 3 – 5 years of experience in Machine Learning or a "
    "similar role. Solid software engineering skills and experience including "
    "coding, testing, troubleshooting and deployment. Experience using key "
    "languages like JVM-based languages (Java, Clojure), C++ and Python. "
    "Large scale data processing experience using Spark or Hadoop/MapReduce. "
    "Solid Experience in machine learning including supervised or "
    "unsupervised learning techniques and algorithms (e.g. k-NN, SVM, RVM, "
    "Naïve Bayes, Decision trees, etc.). Familiarity with cloud computing "
    "(AWS). Experience and/or interest in Scala is a plus. Relevant "
    "certificates (Spark, Hadoop/Cloudera or CBIP) is a plus. Experience with "
    "Git or a similarly distributed revision control system. You think a "
    "working proof-of-concept is the best way to make a point. Experience "
    "working with a variety of stakeholders at the mid and senior management "
    "level and ability to coach junior members.\n"
    "{'EDU': {'msc', 'phd'}, 'EXP': {'3-5'}, 'TOOL': {'java', 'clojure', "
    "'c++', 'python', 'spark', 'hadoop', 'mapreduce', 'aws', 'scala', "
    "'cloudera', 'cbip', 'git'}, 'TECH': {'computer science', 'ai', "
    "'computational linguistics', 'ml', 'software engineering', 'coding', "
    "'testing', 'troubleshooting', 'deployment', 'large scale data "
    "processing', 'supervised', 'unsupervised learning', 'k-nn', 'svm', "
    "'rvm', 'naïve bayes', 'decision trees', 'cloud computing'}}"
)
ASSISTANT_PROMPT_2 = ("The instructions are clear. Please provide the job "
                      "description.")


Function to extract key phrases from a single job description. OpenAI requires an API key, which can be generated on [their website](https://openai.com) after registration.

In [4]:
import openai

openai.api_key = "sk-..." # insert API key

def extract_edu_exp_tech_tool(job_description: str) -> str:
  """Return the first line of the model's reply for one job description.

  On any error, return a string with empty sets for all four categories.
  """
  try:
    # Replay the role and guideline exchange, then send the job description
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT_1},
            {"role": "assistant", "content": ASSISTANT_PROMPT_1},
            {"role": "user", "content": GUIDELINES_PROMPT},
            {"role": "assistant", "content": ASSISTANT_PROMPT_2},
            {"role": "user", "content": job_description}
        ]
    )
    # Keep only the first line of the reply, where the entity dict is expected
    skills = response['choices'][0]['message']['content'].split('\n', 1)[0]
  except Exception as e:
    print(f"Exception in extract_edu_exp_tech_tool() - {e}")
    print(f"Job description: {job_description[0:150]}")
    skills = "{'EDU': {}, 'EXP': {}, 'TOOL': {}, 'TECH': {}}"
  return skills
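
The function returns a plain string in the Output Format described in the prompt. Below is a minimal sketch of how such a string might be turned into Python sets for later analysis. The helper name `parse_key_requirements` and the `CATEGORIES` tuple are assumptions for illustration and are not part of the original notebook. Note that `{}` is an empty dict literal in Python, so the sketch maps it to an empty set, and any reply that does not follow the format falls back to empty sets.

```python
import ast

# Hypothetical constant: the four categories named in the prompt
CATEGORIES = ("EDU", "EXP", "TOOL", "TECH")

def parse_key_requirements(raw: str) -> dict:
  """Parse a reply in the prompt's Output Format into a dict of sets."""
  empty = {category: set() for category in CATEGORIES}
  try:
    parsed = ast.literal_eval(raw)
  except (ValueError, SyntaxError, TypeError):
    # The reply is not a Python literal, e.g. free-form text
    return empty
  if not isinstance(parsed, dict):
    return empty
  result = {}
  for category in CATEGORIES:
    value = parsed.get(category, set())
    # `{}` parses as an empty dict, so anything that is not a collection
    # of entities becomes an empty set
    result[category] = set(value) if isinstance(value, (set, list, tuple)) else set()
  return result

example = "{'EDU': {'msc', 'phd'}, 'EXP': {'4+'}, 'TOOL': {}, 'TECH': {'nlp'}}"
parsed_example = parse_key_requirements(example)
print(parsed_example["EDU"] == {"msc", "phd"})  # True
print(parsed_example["TOOL"])                   # set()
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, which is safer for text produced by a model.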

Function to apply key-phrase extraction to every row of a DataFrame

In [5]:
import pandas as pd
from typing import Callable

def df_apply(df: pd.DataFrame, func: Callable, new_col: str, col: str) -> pd.DataFrame:
  try:
    df[new_col] = df.apply(lambda x: func(x[col]), axis=1)
  except Exception as e:
    print(f"Exception in df_apply() - {e}")
  return df

Function to save/append results to a file

In [6]:
import os

def df_append_to_file(df: pd.DataFrame, path: str) -> None:
  # Write the header only when the file does not exist yet
  header = not os.path.exists(path)
  df.to_csv(path, mode="a", header=header, index=False)

def save_to_file(df: pd.DataFrame, path: str) -> None:
  df.to_csv(path, index=False)

def read_from_file(path: str) -> pd.DataFrame:
  return pd.read_csv(path)
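
As a small illustration of the append behaviour, the sketch below writes two frames to the same file and reads the result back. It repeats `df_append_to_file` so that it runs on its own. The temporary directory and the file name `demo.csv` are assumptions made only for this example.

```python
import os
import tempfile

import pandas as pd

def df_append_to_file(df: pd.DataFrame, path: str) -> None:
  # Write the header only when the file does not exist yet
  header = not os.path.exists(path)
  df.to_csv(path, mode="a", header=header, index=False)

with tempfile.TemporaryDirectory() as tmp_dir:
  path = os.path.join(tmp_dir, "demo.csv")  # hypothetical file name
  df_append_to_file(pd.DataFrame({"title": ["a", "b"]}), path)  # creates the file with a header
  df_append_to_file(pd.DataFrame({"title": ["c"]}), path)       # appends rows without a header
  combined = pd.read_csv(path)

print(combined["title"].tolist())  # ['a', 'b', 'c']
```

Because the header is skipped whenever the file already exists, re-running a job against an old output file appends to it. Deleting or renaming the file first avoids mixing runs.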

Process the data in batches and append each finished batch to the file, so that completed work is kept if a later batch fails

In [16]:
from tqdm import tqdm

def process_in_batches(df: pd.DataFrame, start: int, batch_size: int, path: str, batch_func: Callable, cell_func: Callable, new_col: str, col: str) -> None:
  n_rows = df.shape[0]
  batch_start = start
  try:
    # `start` was previously unused; beginning the range there lets a run
    # pick up from the row index reported in "Batch failed"
    for batch_start in tqdm(range(start, n_rows, batch_size)):
      batch_end = min(batch_start + batch_size, n_rows)
      batch_df = df[batch_start:batch_end].copy()
      batch_func(batch_df, cell_func, new_col, col)
      # Append each finished batch so earlier results survive a later failure
      df_append_to_file(batch_df, path)
  except Exception as e:
    print(f"Exception in process_in_batches() - {e}")
    print(f"Batch failed = {batch_start}")
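
To make the batch arithmetic concrete, here is a small sketch that lists the row ranges such a loop would visit. The helper `batch_bounds` is hypothetical and only mirrors the slicing logic. The values 159 and 10 are the row count and batch size used in this notebook, and treating `start` as a resume offset is an assumption.

```python
def batch_bounds(n_rows: int, batch_size: int, start: int = 0) -> list:
  """Return (batch_start, batch_end) pairs; batch_end is exclusive."""
  return [(s, min(s + batch_size, n_rows)) for s in range(start, n_rows, batch_size)]

bounds = batch_bounds(159, 10)
print(len(bounds))  # 16
print(bounds[0])    # (0, 10)
print(bounds[-1])   # (150, 159)

# Resuming from row 40 would skip the first four batches
resumed = batch_bounds(159, 10, start=40)
print(len(resumed))  # 12
```

The last batch is shorter than the others because 159 is not a multiple of 10.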

Open the jobs file from Google Drive

In [15]:
folder = "/content/drive/MyDrive/jobs"
file_name = "/data_science_jobs.csv"

jobs_df = read_from_file(f"{folder}{file_name}")
n_rows = jobs_df.shape[0]
print(f"rows = {n_rows}")
jobs_df.head(n_rows)

rows = 159


Unnamed: 0,title,company_name,location,description
0,Graduate Data Scientist,Optiver,"Amsterdam, Netherlands",Can you solve this puzzle?\n\nAn ant leaves it...
1,Process Data Scientist,FrieslandCampina,"Amersfoort, Netherlands",• Work with stakeholders (supply chain experts...
2,Data Scientist,felyx,"Amsterdam, Netherlands",Company Description\n\nWith the intensifying t...
3,Data Scientist,Adyen,"Amsterdam, Netherlands","This is Adyen\n\nAdyen provides payments, data..."
4,Data Scientist,HEINEKEN,"Amsterdam, Netherlands",Data Scientist\n\nThe mission of Global Analyt...
...,...,...,...,...
154,PhD position: Predicting radiotherapy outcome ...,University Medical Centre Groningen (UMCG),"Groningen, Netherlands",Radiotherapy is an important treatment modalit...
155,"Internship: Computer Science, Robotics, Comput...",Lely,"Maassluis, Netherlands",Job DescriptionAre you following a degree in C...
156,Data Scientist Artificial Intelligence ervarin...,CareerValue,"Amersfoort, Netherlands",Voor een toffe consultancy bedrijf die het geb...
157,PhD position on Hybrid argumentation using lar...,Vrije Universiteit Amsterdam VU,"Amsterdam, Netherlands",A 4-year full-time Ph.D. student position is a...



Process the jobs in batches

In [17]:
file_name_processed = "/processed_jobs.csv"

process_in_batches(jobs_df, 0, 10, f"{folder}{file_name_processed}", df_apply, extract_edu_exp_tech_tool, 'key_requirements', 'description')

  6%|▋         | 1/16 [01:43<25:54, 103.65s/it]

Exception in extract_edu_exp_tech_tool() - The server is overloaded or not ready yet.
Job description: We are the fastest growing omnichannel supermarket in the Netherlands. We are continuously improving, so if you feel you can help Jumbo become even mo


 19%|█▉        | 3/16 [03:16<12:43, 58.70s/it]

Exception in extract_edu_exp_tech_tool() - The server is overloaded or not ready yet.
Job description: Overview

Are you a passionate Lead Data Scientist with a keen interest in NLP? Are you looking for a challenging and rewarding role where you can app


 25%|██▌       | 4/16 [04:08<11:14, 56.19s/it]

Exception in extract_edu_exp_tech_tool() - The server is overloaded or not ready yet.
Job description: What does a Data Scientist at Coolblue do?
You work with state-of-the-art methods, the latest technologies, and a mature data infrastructure from many


 31%|███▏      | 5/16 [04:53<09:34, 52.21s/it]

Exception in extract_edu_exp_tech_tool() - That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 73f0cdc023a97aef9463927a7b1578e7 in your message.)
Job description: Become a Part of the NIKE, Inc. Team

NIKE, Inc. does more than outfit the world’s best athletes. It is a place to explore potential, obliterate bound


 62%|██████▎   | 10/16 [10:03<05:40, 56.80s/it]

Exception in extract_edu_exp_tech_tool() - The server is overloaded or not ready yet.
Job description: Passionate about developing and implementing cutting-edge solutions that can impact the lives of millions of people? Interested in establishing yourse


 75%|███████▌  | 12/16 [12:00<03:51, 57.93s/it]

Exception in extract_edu_exp_tech_tool() - The server is overloaded or not ready yet.
Job description: Machine Learning Engineer

Up to €60,000

Amsterdam

The Company

I am working with an exciting Media and Entertainment company in the Netherlands. Th


100%|██████████| 16/16 [16:12<00:00, 60.79s/it]
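
Several descriptions in the run above failed with "server is overloaded" errors and were saved with empty results. One common way to reduce such losses is to retry a failed call after a growing delay. The sketch below is a generic helper, not part of the original notebook: `call_with_retries`, `max_attempts` and `base_delay` are assumed names, and `flaky` only simulates a call that fails twice. Since `extract_edu_exp_tech_tool` catches exceptions itself, a wrapper like this would have to go around the API call inside that function.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(func: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0) -> T:
  """Call func(), retrying with exponential backoff when it raises."""
  if max_attempts < 1:
    raise ValueError("max_attempts must be at least 1")
  for attempt in range(max_attempts):
    try:
      return func()
    except Exception:
      # Give up after the last attempt and let the caller see the error
      if attempt == max_attempts - 1:
        raise
      # Wait base_delay, then 2x, then 4x, ... before the next attempt
      time.sleep(base_delay * 2 ** attempt)
  raise RuntimeError("unreachable")

attempts = []

def flaky() -> str:
  # Simulated call: fails on the first two attempts, then succeeds
  attempts.append(1)
  if len(attempts) < 3:
    raise RuntimeError("The server is overloaded or not ready yet.")
  return "ok"

result = call_with_retries(flaky, max_attempts=3, base_delay=0.0)
print(result)         # ok
print(len(attempts))  # 3
```

In real use a non-zero `base_delay` gives the service time to recover. The zero delay here only keeps the example fast.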


In [19]:
jobs_processed_df = read_from_file(f"{folder}{file_name_processed}")
n_rows = jobs_processed_df.shape[0]

jobs_processed_df.head(n_rows)

Unnamed: 0,title,company_name,location,description,key_requirements
0,Graduate Data Scientist,Optiver,"Amsterdam, Netherlands",Can you solve this puzzle?\n\nAn ant leaves it...,"Sure, here is the solution to the puzzle quest..."
1,Process Data Scientist,FrieslandCampina,"Amersfoort, Netherlands",• Work with stakeholders (supply chain experts...,"Based on the job description provided, here is..."
2,Data Scientist,felyx,"Amsterdam, Netherlands",Company Description\n\nWith the intensifying t...,"{'EDU': {'masters', 'bachelors'}, 'EXP': {'2+'..."
3,Data Scientist,Adyen,"Amsterdam, Netherlands","This is Adyen\n\nAdyen provides payments, data...","{'EDU': {}, 'EXP': {'3+'}, 'TOOL': {'spark', '..."
4,Data Scientist,HEINEKEN,"Amsterdam, Netherlands",Data Scientist\n\nThe mission of Global Analyt...,"{'EDU': {'msc', 'phd'}, 'EXP': {'3-5'}, 'TOOL'..."
...,...,...,...,...,...
154,PhD position: Predicting radiotherapy outcome ...,University Medical Centre Groningen (UMCG),"Groningen, Netherlands",Radiotherapy is an important treatment modalit...,"{'EDU': {'phd'}, 'EXP': {}, 'TOOL': {'proton t..."
155,"Internship: Computer Science, Robotics, Comput...",Lely,"Maassluis, Netherlands",Job DescriptionAre you following a degree in C...,"{'EDU': {'university'}, 'EXP': {}, 'TOOL': {'o..."
156,Data Scientist Artificial Intelligence ervarin...,CareerValue,"Amersfoort, Netherlands",Voor een toffe consultancy bedrijf die het geb...,"{'EDU': {'bsc', 'master', 'msc'}, 'EXP': {'2-3..."
157,PhD position on Hybrid argumentation using lar...,Vrije Universiteit Amsterdam VU,"Amsterdam, Netherlands",A 4-year full-time Ph.D. student position is a...,"{'EDU': {'phd', 'msc', 'computer science', 'ar..."
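
Some rows above still contain degree labels such as 'masters', 'bachelors' or 'master', although the prompt asks for 'msc' and 'bsc'. A post-processing step could map the leftover variants to the intended labels. The sketch below is one possible approach under stated assumptions: the `EDU_ALIASES` table and the `normalize_edu` helper are hypothetical, and the alias list is illustrative, not exhaustive.

```python
# Hypothetical alias table: variants seen in the output mapped to the
# labels requested in the prompt
EDU_ALIASES = {
  "master": "msc",
  "masters": "msc",
  "master's": "msc",
  "bachelor": "bsc",
  "bachelors": "bsc",
  "bachelor's": "bsc",
  "ph.d": "phd",
  "ph.d.": "phd",
  "doctorate": "phd",
}

def normalize_edu(labels: set) -> set:
  """Lowercase each label and replace known variants with the target label."""
  normalized = set()
  for label in labels:
    key = label.strip().lower()
    # Unknown labels are kept as they are, only lowercased and stripped
    normalized.add(EDU_ALIASES.get(key, key))
  return normalized

print(normalize_edu({"masters", "bachelors"}) == {"msc", "bsc"})       # True
print(normalize_edu({"bsc", "master", "msc"}) == {"bsc", "msc"})       # True
```

Labels that are not degrees at all, such as 'university' or 'computer science' in the EDU column, pass through unchanged and would need a separate filtering rule.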
