<a href="https://colab.research.google.com/github/connorsisacat/RPAI2024/blob/main/Kopi_af_04_LLMLabelSynthesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Let's create LLM training data

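In this tutorial we will create data for LLM instruction tuning. The original goal is to teach a 7B-parameter LLM (Mistral 7B) to extract structured information from patent titles/abstracts and also perform some "higher-level" evaluation. This copy of the notebook applies the same workflow to a job-descriptions dataset and asks a larger model to generate synthetic applicant profiles as JSON.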

In [None]:
# Install necessary packages
!pip install openai datasets -qqq

In [None]:
# Import required libraries
from openai import OpenAI
from datasets import load_dataset
from google.colab import userdata
import json

In [None]:
# Setup OpenAI client with custom API key and base URL
TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')
client = OpenAI(base_url="https://api.together.xyz/v1", api_key=TOGETHER_API_KEY)

In [None]:
dataset = load_dataset("jacob-hugging-face/job-descriptions")['train']
print(dataset.to_pandas().head())


  company_name                                    job_description  \
0       Google  minimum qualifications\nbachelors degree or eq...   
1        Apple  description\nas an asc you will be highly infl...   
2      Netflix  its an amazing time to be joining netflix as w...   
3  Robert Half  description\n\nweb designers looking to expand...   
4    TrackFive  at trackfive weve got big goals were on a miss...   

                              position_title  description_length  \
0                           Sales Specialist                2727   
1                 Apple Solutions Consultant                 828   
2  Licensing Coordinator - Consumer Products                3205   
3                               Web Designer                2489   
4                              Web Developer                3167   

                                      model_response  
0   {\n  "Core Responsibilities": "Responsible fo...  
1   {\n  "Core Responsibilities": "as an asc you ...  
2   {\n  "C

In [None]:
# Load the columns needed to build a combined "input" text per row
company_names = dataset["company_name"]
position_titles = dataset["position_title"]
model_responses = dataset["model_response"]

# Join company, position title, and the existing model response, one per line
inputs = [
    f"{company}\n{title}\n{response}"
    for company, title, response in zip(company_names, position_titles, model_responses)
]
dataset = dataset.add_column("input", inputs)

# Select a random sample from the dataset
dataset_sample = dataset.shuffle(seed=42).select(range(5))

In [None]:
dataset_sample.to_pandas().head()

Unnamed: 0,company_name,job_description,position_title,description_length,model_response,input
0,"Prince George's County, Maryland Upper Marlbo...",this is a county position in maryland i am rec...,Investigator 1G,4238,"{\n ""Core Responsibilities"": ""Conduct invest...","Prince George's County, Maryland Upper Marlbo..."
1,Hotel Cleaning Services,main duties and responsibilities may vary depe...,"Overnight Janitor - Los Angeles, California",4599,"{\n ""Core Responsibilities"": ""Cleaning and d...",Hotel Cleaning Services\nOvernight Janitor - L...
2,"Localize.city, Inc.",description\n\nas the vp of marketing you will...,VP Marketing,3106,"{\n ""Core Responsibilities"": ""Set brand stra...","Localize.city, Inc.\nVP Marketing\n {\n ""Core..."
3,martinwolf | M&A Advisors,marketing manager overview\n\nat martinwolf ma...,Marketing Manager,3266,"{\n ""Core Responsibilities"": ""Develop and ex...",martinwolf | M&A Advisors\nMarketing Manager\n...
4,Neon Pizza,responsibilities\n\npackage and label pizzas\n...,Kitchen prep / cleaning,321,"{\n ""Core Responsibilities"": ""Package and la...","Neon Pizza\nKitchen prep / cleaning\n {\n ""Co..."


In [None]:
# Select a random sample from the dataset (small for demo)
# dataset_sample = dataset.shuffle(seed=42).select(range(10))

In [37]:
# Define system prompt and instructions for JSON extraction

system_prompt = """
You are a highly experienced recruiting consultant tasked with extracting and summarizing
 key details from job and position descriptions and turning them into possible applicant details in JSON format.
 """

instruct_prompt = """
Given a text fragment describing a job position, including the required background and skills,
 analyze the text and create a random person who might realistically apply for the job, to fill out a JSON template.
 The JSON should provide details about the applicant's job title, the company they are
  applying to, first name, last name, age, gender, and potential interests.

Please ensure that you:
- make the age realistic for the Danish job market.
- satisfy the modulus-11 check on the CPR number.

Here is the text fragment:"""

json_template = """
Here's the JSON template you should follow:

{
  "job_title": "Job title from the job description",
  "company": "Company that we hire for",
  "first_name": "First name of the applicant",
  "last_name": "Last name of the applicant",
  "age": "Age of the applicant, must be an integer",
  "gender": "Gender of the applicant, must be LGBTQ inclusive",
  "interests": "Applicant's potential interests",
  "skills": "A string of concatenated skills",
  "experience": "An array of key-value pairs, where each follows the form {"experience description": "years of experience"}",
  "cpr_number": "Danish CPR number in the format ddmmyy-xxxx"
}

Output JSON only.
"""

In [38]:

# Function to generate an applicant JSON for a given job description
def extract_json(example):
    # Combine the instructions, the row's input text, and the JSON template
    prompt = f"{instruct_prompt} {example['input']} {json_template}"
    completion = client.chat.completions.create(
        model="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2,
    )
    content = completion.choices[0].message.content
    try:
        # Only check that the reply parses as JSON; the raw string is what gets stored
        json.loads(content)
        return {'completion': content}
    except json.JSONDecodeError:
        # Mark unparseable replies so they can be spotted later
        return {'completion': "COMPLETION FAILED " + content}
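
The prompt asks the model to satisfy the modulo check on the CPR number, but nothing in the pipeline verifies the result. Below is a minimal sketch of such a check, assuming the classic modulus-11 rule (weights 4, 3, 2, 7, 6, 5, 4, 3, 2, 1 over the ten digits) and the usual ten-digit `ddmmyy-xxxx` layout. The helper name `cpr_passes_mod11` and the sample numbers are made up for illustration.

```python
import re

# Weights of the classic CPR modulus-11 rule, one per digit
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)

def cpr_passes_mod11(cpr: str) -> bool:
    """Return True if a 'ddmmyy-xxxx' string passes the modulus-11 check."""
    # Accept the number with or without the hyphen (hypothetical helper)
    match = re.fullmatch(r"(\d{6})-?(\d{4})", cpr.strip())
    if match is None:
        return False
    digits = [int(ch) for ch in match.group(1) + match.group(2)]
    weighted_sum = sum(w * d for w, d in zip(CPR_WEIGHTS, digits))
    return weighted_sum % 11 == 0

# Synthetic numbers constructed for this example, not real CPR numbers
print(cpr_passes_mod11("010190-0010"))  # weighted sum 66 -> True
print(cpr_passes_mod11("010190-0011"))  # weighted sum 67 -> False
```

The sketch checks only the digit pattern and the weighted sum, not whether the first six digits form a real date. As far as I know, some CPR numbers issued in recent years do not satisfy this rule, so a failed check is a signal for synthetic data quality rather than proof that a number is invalid.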


In [39]:
# Apply the extraction function to the dataset sample
dataset_sample = dataset_sample.map(extract_json)


In [None]:
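# Note: extract_json always returns a string, so this filter also keeps rows marked "COMPLETION FAILED"
dataset_sample_filter = dataset_sample.filter(lambda x: x['completion'] is not None)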


In [None]:
dataset_sample_filter.to_pandas().head()

Unnamed: 0,company_name,job_description,position_title,description_length,model_response,input,completion
0,"Prince George's County, Maryland Upper Marlbo...",this is a county position in maryland i am rec...,Investigator 1G,4238,"{\n ""Core Responsibilities"": ""Conduct invest...","Prince George's County, Maryland Upper Marlbo...",COMPLETION FAILED Here are 10 random people wh...
1,Hotel Cleaning Services,main duties and responsibilities may vary depe...,"Overnight Janitor - Los Angeles, California",4599,"{\n ""Core Responsibilities"": ""Cleaning and d...",Hotel Cleaning Services\nOvernight Janitor - L...,COMPLETION FAILED Here are 10 random people wh...
2,"Localize.city, Inc.",description\n\nas the vp of marketing you will...,VP Marketing,3106,"{\n ""Core Responsibilities"": ""Set brand stra...","Localize.city, Inc.\nVP Marketing\n {\n ""Core...",COMPLETION FAILED Here are 10 random people wh...
3,martinwolf | M&A Advisors,marketing manager overview\n\nat martinwolf ma...,Marketing Manager,3266,"{\n ""Core Responsibilities"": ""Develop and ex...",martinwolf | M&A Advisors\nMarketing Manager\n...,COMPLETION FAILED Here are 10 random people wh...
4,Neon Pizza,responsibilities\n\npackage and label pizzas\n...,Kitchen prep / cleaning,321,"{\n ""Core Responsibilities"": ""Package and la...","Neon Pizza\nKitchen prep / cleaning\n {\n ""Co...",COMPLETION FAILED Here are 10 random people wh...
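
Every completion in the table above starts with `COMPLETION FAILED`, yet all five rows pass the filter, because `extract_json` returns a string even when parsing fails. A stricter predicate could look like the sketch below. `REQUIRED_KEYS` is copied from the JSON template in the prompt; the function name `is_valid_completion` and the toy rows are hypothetical.

```python
import json

# Keys listed in the JSON template of the prompt
REQUIRED_KEYS = {
    "job_title", "company", "first_name", "last_name", "age",
    "gender", "interests", "skills", "experience", "cpr_number",
}

def is_valid_completion(row) -> bool:
    """Keep a row only if its completion is a JSON object with all template keys."""
    text = row.get("completion")
    if not isinstance(text, str) or text.startswith("COMPLETION FAILED"):
        return False
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_KEYS.issubset(record)

# Toy rows standing in for dataset rows
good = {"completion": json.dumps({key: "x" for key in REQUIRED_KEYS})}
bad = {"completion": "COMPLETION FAILED Here are 10 random people"}
print(is_valid_completion(good), is_valid_completion(bad))  # True False
```

With the notebook's objects this could presumably be applied as `dataset_sample.filter(is_valid_completion)`. On the run shown above it would leave an empty dataset, which is a clearer signal that the prompt needs work than five silently kept failures.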


In [None]:
dataset_sample_filter.push_to_hub("connorsisacat/neuro_patents_sample_finetune_2")


CommitInfo(commit_url='https://huggingface.co/datasets/connorsisacat/neuro_patents_sample_finetune_2/commit/ce3e1086df8fc0e9b8054a2d795d7a044b413cd4', commit_message='Upload dataset', commit_description='', oid='ce3e1086df8fc0e9b8054a2d795d7a044b413cd4', pr_url=None, pr_revision=None, pr_num=None)

### Let's check performance

In [None]:
import json

In [None]:
dataset_sample_filter['input'][0]

"SYSTEM AND METHOD FOR COLLECTING, ANALYZING, AND UTILIZING COGNITIVE, BEHAVIORAL, NEUROPSYCHOLOGICAL, AND BIOMETRIC DATA FROM A USER'S INTERACTION WITH A SMART DEVICE WITH EITHER PHYSICALLY INVASIVE OR PHYSICALLY NON-INVASIVE MEANS\nA software utility that collects a suite of psychobehavioral, neuropsychological, and biometrically relevant data from neuropsychological tests, and from passive and active interaction with a smart device. Passive interaction is a user's interaction that is not explicitly goal directed. Active interaction is explicitly goal directed (e.g., navigating menus, or interacting with an application). This data is used to: 1) provide an objective profile of memory, cognition, perception, motor function, verbal ability, and fluid intelligence; 2) adapt hardware, software, and user interface settings to make informed decisions regarding accessibility options; 3) to detect usage by someone other than the native user of the device, and 4) to provide a unifying protoco

In [None]:
json.loads(dataset_sample_filter['completion'][0])
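
`json.loads` raises `json.JSONDecodeError` when the stored completion carries the `COMPLETION FAILED` prefix or any other text around the JSON. A more forgiving parser, sketched below, scans for the first position at which a JSON object decodes, using the standard library's `json.JSONDecoder.raw_decode`. The helper name `first_json_object` and the sample reply are hypothetical.

```python
import json
from typing import Optional

def first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Parsing failed here, so try the next opening brace
        start = text.find("{", start + 1)
    return None

reply = 'COMPLETION FAILED Sure, here it is: {"job_title": "Web Designer", "age": 31} Hope this helps.'
print(first_json_object(reply))  # {'job_title': 'Web Designer', 'age': 31}
```

This recovers replies where the model wrapped valid JSON in extra prose. It does not help when the model ignores the template altogether, as in the "Here are 10 random people" replies above.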