# Let's create LLM training data

In this tutorial we will create data for LLM instruction tuning. The goal is to teach a 7B-parameter LLM (Mistral 7B) to extract structured information from patent titles and abstracts, and also to perform some "higher-level" evaluation, such as assessing a risk level.

In [1]:
# Install necessary packages
#!pip install openai datasets python-dotenv -qqq

In [1]:
# Import required libraries
from openai import OpenAI
from datasets import load_dataset
import json

In [2]:
# Set up the OpenAI client to call the Together API (custom API key and base URL)
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
client = OpenAI(base_url="https://api.together.xyz/v1", api_key=TOGETHER_API_KEY)

In [3]:
# Load dataset and prepare data
dataset = load_dataset("RJuro/neuro_patents")['train']
titles = dataset["appln_title"]
abstracts = dataset["appln_abstract"]
titles_abstracts = [f"{title}\n{abstract}" for title, abstract in zip(titles, abstracts)]
dataset = dataset.add_column("input", titles_abstracts)

# Select a random sample from the dataset
dataset_sample = dataset.shuffle(seed=42).select(range(10))

In [4]:
# Define system prompt and instructions for JSON extraction

system_prompt = "You are a highly skilled data analyst tasked with extracting and summarizing key details from patent descriptions into a JSON format."

patent_instruct_prompt = """Given a text fragment describing a patent, including its title,
and abstract, analyze the text and extract relevant information
to fill out a JSON template. The JSON should provide a concise summary of the patent,
focusing on its main label, application (purpose and use cases), direct use on people (how it's applied in relation to humans),
input description (key components or methodologies), risk description (potential risks or side effects unless None obvious),
and risk level (overall assessment of potential harm). Use the following scale for risk level: None, Low, Moderate, High.

Please ensure to:

Clearly distinguish between direct and indirect uses of the patent on humans.
Provide specific examples or descriptions of inputs when mentioned in the text.
Outline any mentioned risks, including how they might impact users or society.
Assess the risk level based on the information provided, using the predefined scale.

Here is the text fragment:"""

json_template = """
Here's the JSON template you should follow:

{
  "label": "Short, descriptive title of the invention",
  "application": "Brief description of what the invention is used for",
  "direct_use": "Direct/Indirect/Tool/Machine",
  "input_description": "Description of inputs or components, if applicable. Use 'None' or 'Not Applicable' for patents where this doesn't apply.",
  "risk_description": "Outline of potential risks or side effects. If no risks are present, indicate 'None' and provide a brief explanation.",
  "risk_level": "Low/Moderate/High/None. Use 'None' for patents with no identifiable risks."
}

Output JSON only.
"""

In [5]:

# Function to extract JSON from a given patent description
def extract_json(example):
    # Build the user prompt: instructions, patent text, then the JSON template
    prompt = f"{patent_instruct_prompt} {example['input']} {json_template}"
    completion = client.chat.completions.create(
        model="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2,
    )
    content = completion.choices[0].message.content
    try:
        # Parse only to validate; the raw JSON string is what gets stored
        json.loads(content)
        return {'completion': content}
    except (json.JSONDecodeError, TypeError):
        # Invalid JSON or an empty response: mark the row so it can be filtered out
        return {'completion': None}
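
Note that `json.loads` only confirms that the response is valid JSON, not that it follows the template. A minimal sketch of a stricter check is shown below. The helper name `is_valid_record` and the decision to tolerate extra keys are assumptions for illustration, not part of the original notebook. The key names and risk levels are copied from the template and prompt above.

```python
import json

# Keys and risk levels taken from the JSON template and the prompt's risk scale
EXPECTED_KEYS = {"label", "application", "direct_use",
                 "input_description", "risk_description", "risk_level"}
RISK_LEVELS = {"None", "Low", "Moderate", "High"}

def is_valid_record(raw):
    """Return True if raw parses to a JSON object that follows the template.

    Hypothetical helper: extra keys are tolerated, missing keys are not.
    """
    try:
        record = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(record, dict):
        return False
    if not EXPECTED_KEYS.issubset(record):
        return False
    level = record["risk_level"]
    return isinstance(level, str) and level in RISK_LEVELS
```

A check like this could replace the bare `json.loads` call inside the extraction function, or be applied afterwards in a `filter` step.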


In [6]:
# Apply the extraction function to the dataset sample
dataset_sample = dataset_sample.map(extract_json)




In [7]:
dataset_sample_filter = dataset_sample.filter(lambda x: x['completion'] is not None)


In [8]:
dataset_sample_filter

Dataset({
    features: ['appln_id', 'appln_filing_date', 'docdb_family_id', 'granted', 'appln_abstract', 'appln_abstract_lg', 'appln_title', 'applt_coun', 'invt_coun', 'cpc', 'ipc', '__index_level_0__', 'input', 'completion'],
    num_rows: 7
})
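
Only 7 of the 10 sampled rows survive the filter, so three responses did not parse as JSON. The notebook does not show why. One plausible cause (an assumption, not verified here) is that the model wrapped the JSON in extra text or a code fence. A minimal sketch of a more lenient parser that keeps only the outermost `{...}` span could look like this; `parse_lenient` is a hypothetical helper name.

```python
import json

def parse_lenient(raw):
    """Parse the outermost {...} span in raw, or return None if that fails.

    Hypothetical helper: tolerates prose or a code fence around the JSON object.
    """
    if not raw:
        return None
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        # No complete-looking object in the text
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None

print(parse_lenient('Here is the JSON:\n{"risk_level": "Low"}\nHope this helps.'))
```

Recovering such rows trades strictness for yield, so it is worth inspecting a few recovered completions by hand before adding them to the training data.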

In [9]:
dataset_sample_filter.push_to_hub("RJuro/neuro_patents_bds")


In [12]:
print(dataset_sample_filter['completion'][0])

{
  "label": "Cognitive Data Collection System",
  "application": "Collecting, analyzing, and utilizing cognitive, behavioral, neuropsychological, and biometric data from a user's interaction with a smart device",
  "direct_use": "Direct",
  "input_description": "Neuropsychological tests, passive and active interaction with a smart device",
  "risk_description": "Potential risks include privacy concerns due to the collection of sensitive data, and potential misuse of the data for unauthorized purposes.",
  "risk_level": "Moderate"
}
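
For instruction tuning, each row eventually has to be packaged as a prompt paired with its target completion. The notebook does not specify the target format, so the chat-style `messages` layout below is an assumption, and `to_chat_example` is a hypothetical helper. The strings passed in are placeholders rather than real dataset rows.

```python
import json

def to_chat_example(system_prompt, user_prompt, completion):
    """Package one row as a list of chat messages.

    Hypothetical helper: the "messages" layout is an assumed target format.
    """
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": completion},
        ]
    }

example = to_chat_example(
    "You are a data analyst.",                      # placeholder system prompt
    "Instructions, then the patent title and abstract",  # placeholder user prompt
    '{"label": "Example", "risk_level": "None"}',   # placeholder completion
)
print(json.dumps(example, indent=2))
```

With the real data, the user prompt would be the instruction text plus the row's `input` column, and the assistant message would be the row's `completion` column.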
