<a href="https://colab.research.google.com/github/ayundina/job_posts_analysis/blob/main/extract_key_phrases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Extract phrases**
Prompt an OpenAI model to extract key phrases from job descriptions.

In [1]:
%%capture
# openai.ChatCompletion (used below) was removed in openai 1.0, so pin an earlier release.
!pip install "openai<1"

In [None]:
SYSTEM_PROMPT = "You are a smart and intelligent Named Entity Recognition (NER) system. I will provide you with the definitions of the entities you need to extract, the text samples from which to extract the entities, and the output format with examples."
USER_PROMPT_1 = "Are you clear about your role?"
ASSISTANT_PROMPT_1 = "Sure, I'm ready to help you with your NER task. Please provide me with the necessary information to get started."
GUIDELINES_PROMPT = (
    "Entity Definition:\n"
    "1. EDU: Name of a degree.\n"
    "2. EXP: Number of years of work experience. Years can also be in natural language.\n"
    "3. TOOL: Name of any programming language or engineering tool.\n"
    "4. TECH: Any type of technology or skill.\n"
    "\n"
    "Entity Format:\n"
    "All entities must be formatted as follows:\n"
    "- lowercased\n"
    "- 'Ph.D' or 'PhD' or other modifications of a Doctorate degree should become 'phd'\n"
    "- 'Master’s' or 'Master' or 'Masters' or 'MSc' or other modifications of a Master’s degree should become 'msc'\n"
    "- 'BSc' or any other modifications of a bachelor degree should be shown as 'bsc'\n"
    "- 'Machine Learning' or 'machine learning' or its abbreviation - 'ML' should become 'ml'\n"
    "- 'Artificial Intelligence' or 'artificial intelligence' or its abbreviation - 'AI' should become 'ai'\n"
    "- 'AI/ML' or 'ML/AI' or 'Machine Learning/Artificial Intelligence' or 'Artificial Intelligence/Machine Learning' should become two separate entities: 'ai' and 'ml'\n"
    "- 'Natural Language Processing' or its abbreviation - 'NLP' should become 'nlp'\n"
    "- 'Large Language Model' or its abbreviation 'LLM' should become 'llm'\n"
    "- from the working experience remove spaces and words, leave digits and pluses or dashes only. For example '3 – 5 years of experience' should become '3-5' and '3+ years' should become '3+' and '3 years of experience' should become '3'\n"
    "\n"
    "Output Format:\n"
    "{'EDU': {set of entities present}, 'EXP': {set of entities present}, 'TOOL': {set of entities present}, 'TECH': {set of entities present}}\n"
    "If no entities are present in a category, keep it as an empty set: {}\n"
    "\n"
    "Examples:\n"
    "\n"
    "1. Text sample: Have a strong quantitative background (statistics, machine learning, sciences, etc.). Master’s or Ph.D. in NLP, Computational Linguistics, Data Science, Computer Science, or another relevant field. 4+ years experience in applied R&D in the field of NLP / LLM, reasoning or related fields.\n"
    "{'EDU': {'msc', 'phd'}, 'EXP': {'4+'}, 'TOOL': {}, 'TECH': {'statistics', 'ml', 'sciences', 'computational linguistics', 'data science', 'computer science', 'nlp', 'llm', 'reasoning'}}\n"
    "\n"
    "2. Text sample: 3+ years of relevant work experience (or equivalent), involved with the application of Machine Learning and Artificial Intelligence to business problems in a commercial environment. Demonstrable experience of multiple machine learning facets, such as working with large data sets and large language models, experimentation, scalability and optimization. Fluency in at least one programming language, with a strong preference for Python. Strong working knowledge of Spark and SQL. Experience working in a cloud environment is a plus. Experience with data-driven product development: analytics, A/B testing, etc. You have a ‘can do’ attitude and you act proactively and not reactively. BSc or higher in Computer Science, Artificial Intelligence, Software Engineering, or related fields. Excellent English communication skills, both written and verbal.\n"
    "{'EDU': {'bsc'}, 'EXP': {'3+'}, 'TOOL': {'python', 'spark', 'sql'}, 'TECH': {'ml', 'ai', 'large data sets', 'llm', 'experimentation', 'scalability', 'optimization', 'cloud environment', 'analytics', 'a/b testing', 'computer science', 'software engineering', 'english'}}\n"
    "\n"
    "3. Text sample: University graduate (Master or PhD level) in Computer Science, Artificial Intelligence, Computational Linguistics or an associated area. 3 – 5 years of experience in Machine Learning or a similar role. Solid software engineering skills and experience including coding, testing, troubleshooting and deployment. Experience using key languages like JVM-based languages (Java, Clojure), C++ and Python. Large scale data processing experience using Spark or Hadoop/MapReduce. Solid Experience in machine learning including supervised or unsupervised learning techniques and algorithms (e.g. k-NN, SVM, RVM, Naïve Bayes, Decision trees, etc.). Familiarity with cloud computing (AWS). Experience and/or interest in Scala is a plus. Relevant certificates (Spark, Hadoop/Cloudera or CBIP) is a plus. Experience with Git or a similarly distributed revision control system. You think a working proof-of-concept is the best way to make a point. Experience working with a variety of stakeholders at the mid and senior management level and ability to coach junior members.\n"
    "{'EDU': {'msc', 'phd'}, 'EXP': {'3-5'}, 'TOOL': {'java', 'clojure', 'c++', 'python', 'spark', 'hadoop', 'mapreduce', 'aws', 'scala', 'cloudera', 'cbip', 'git'}, 'TECH': {'computer science', 'ai', 'computational linguistics', 'ml', 'software engineering', 'coding', 'testing', 'troubleshooting', 'deployment', 'large scale data processing', 'supervised', 'unsupervised learning', 'k-nn', 'svm', 'rvm', 'naïve bayes', 'decision trees', 'cloud computing'}}"
)
ASSISTANT_PROMPT_2 = "The instructions are clear. Please provide the job description."
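
The EXP formatting rule in the guidelines (keep only digits, pluses and dashes) can also be applied deterministically, for example to double-check what the model returns. The following is a minimal sketch of that rule; `normalize_experience` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def normalize_experience(text: str) -> str:
  """Reduce an experience phrase to digits, '+' and '-' only."""
  # Treat an en dash (U+2013) as a plain hyphen.
  text = text.replace("\u2013", "-")
  # Drop every character that is not a digit, a plus sign or a hyphen.
  cleaned = re.sub(r"[^0-9+\-]", "", text)
  # Remove stray hyphens left over from hyphenated words at either end.
  return cleaned.strip("-")

print(normalize_experience("3 \u2013 5 years of experience"))  # 3-5
print(normalize_experience("3+ years"))                        # 3+
print(normalize_experience("3 years of experience"))           # 3
```

Hyphenated words in the middle of a phrase would still leave extra hyphens behind, so this is only suitable for short experience phrases like the ones in the guidelines.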

Function to extract key phrases from a single job description

In [None]:
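import openai

openai.api_key = ""  # set your OpenAI API key here

# Returned when the API call fails, in the same string format the model is asked to produce.
EMPTY_RESULT = "{'EDU': {}, 'EXP': {}, 'TOOL': {}, 'TECH': {}}"

def extract_edu_exp_tech_tool(job_description: str) -> str:
  """Ask the model for the EDU/EXP/TOOL/TECH entities in one job description."""
  try:
    # Legacy (openai<1.0) chat interface; the first five messages prime the model with the NER guidelines.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT_1},
            {"role": "assistant", "content": ASSISTANT_PROMPT_1},
            {"role": "user", "content": GUIDELINES_PROMPT},
            {"role": "assistant", "content": ASSISTANT_PROMPT_2},
            {"role": "user", "content": job_description}
        ]
    )
    # Keep only the first line of the reply, which holds the entity dictionary.
    skills = response['choices'][0]['message']['content'].split('\n', 1)[0]
  except Exception as e:
    print(f"Exception in extract_edu_exp_tech_tool() - {e}")
    skills = EMPTY_RESULT
  return skills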
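
The function above returns the model's reply as a string. If the reply follows the Output Format from the guidelines, it can be turned into Python sets with `ast.literal_eval`. This is a sketch under that assumption; `parse_entities` and `CATEGORIES` are hypothetical names that do not appear elsewhere in this notebook.

```python
import ast

CATEGORIES = ("EDU", "EXP", "TOOL", "TECH")

def parse_entities(raw: str) -> dict:
  """Parse a reply such as "{'EDU': {'msc'}, 'EXP': {}}" into a dict of sets."""
  empty = {key: set() for key in CATEGORIES}
  try:
    parsed = ast.literal_eval(raw.strip())
  except (ValueError, SyntaxError, TypeError):
    # The reply was not a valid Python literal.
    return empty
  if not isinstance(parsed, dict):
    return empty
  result = {}
  for key in CATEGORIES:
    value = parsed.get(key, ())
    if isinstance(value, (str, int, float)):
      value = [value]  # a bare value counts as a single entity
    # literal_eval reads "{}" as an empty dict, which iterates as empty.
    result[key] = {str(item) for item in value}
  return result

reply = "{'EDU': {'msc', 'phd'}, 'EXP': {'4+'}, 'TOOL': {}, 'TECH': {'nlp'}}"
entities = parse_entities(reply)
print(entities["EXP"])   # {'4+'}
print(entities["TOOL"])  # set()
```

Replies that do not parse fall back to empty sets, which mirrors the empty result the extraction function uses on failure.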

Function to apply key phrases extraction to a dataframe

In [None]:
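from typing import Callable

import pandas as pd

# `function` is not a built-in name in Python, so the annotation uses typing.Callable instead.
def df_apply(df: pd.DataFrame, func: Callable[[str], str], new_col: str, col: str) -> pd.DataFrame:
  """Apply func to every value of df[col] and store the results in df[new_col]."""
  try:
    df[new_col] = df[col].apply(func)
  except Exception as e:
    print(f"Exception in df_apply() - {e}")
  return df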

Function to save/append results to a file

In [4]:
import os

def df_append_to_file(df: pd.DataFrame, path: str) -> None:
  """Append df to the CSV at path, writing the header only when the file is new."""
  header = not os.path.exists(path)
  df.to_csv(path, mode="a", header=header, index=False)

Process in batches, so that completed batches are already saved if something fails

In [None]:
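from typing import Callable

def process_in_batches(df: pd.DataFrame, start: int, batch_size: int, path: str, batch_func: Callable, cell_func: Callable[[str], str], new_col: str, col: str) -> None:
  """Process df in slices of batch_size rows, beginning at row start, and append each slice to path."""
  batch = start
  try:
    for batch in range(start, len(df), batch_size):
      # Slice by position; the upper bound is the batch start plus the batch size.
      batch_df = df.iloc[batch:batch + batch_size].copy()
      batch_func(batch_df, cell_func, new_col, col)
      df_append_to_file(batch_df, path)
  except Exception as e:
    print(f"Exception in process_in_batches() - {e}")
    # Pass this value as start to resume from the failed batch.
    print(f"Batch failed = {batch}")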
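
A small sketch of the row ranges that batching with a start offset and a batch size is expected to cover, including a resume after a failure. `batch_bounds` is a hypothetical helper for illustration only and is not part of this notebook.

```python
def batch_bounds(total_rows: int, start: int, batch_size: int) -> list:
  """Return (first_row, end_row) pairs, end exclusive, from start up to total_rows."""
  return [
      (first, min(first + batch_size, total_rows))
      for first in range(start, total_rows, batch_size)
  ]

# 25 rows in batches of 10, from the beginning.
print(batch_bounds(25, 0, 10))   # [(0, 10), (10, 20), (20, 25)]
# Resuming after a failure reported as "Batch failed = 10".
print(batch_bounds(25, 10, 10))  # [(10, 20), (20, 25)]
```

Each pair corresponds to one positional slice of the dataframe, so restarting from the reported batch number skips the rows that were already written to the file.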

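Open the jobs file from Google Drive

In [None]:
# Assumes Google Drive is already mounted at /content/drive
# (in Colab: from google.colab import drive; drive.mount("/content/drive")).
folder = "/content/drive/MyDrive/data_science_jobs"
file_to_read = "/data_science_jobs.csv"
job_df = pd.read_csv(f"{folder}{file_to_read}")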


Process the jobs in batches

In [None]:
file_to_write = "/processed_jobs.csv"
process_in_batches(job_df, 0, 10, f"{folder}{file_to_write}", df_apply, extract_edu_exp_tech_tool, 'key_requirements', 'description')