# Generating Synthetic Entities with `Outlines`


Plan: given a domain/industry, generate synthetic entities in two steps:

1. Given a domain/industry name and a description of that domain, generate a list of N possible job titles, each with a job description.

   IndustryJobs
   - Industry Name
   - Industry Description
   - Job Titles

2. For each generated job title/description, generate a job entity:

   Job Entity
   - Job Title (str)
   - Job Description (str)
   - Associated Job Postings/Positions (List[str])
   - Job Skills (List[str])
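
The two steps can be sketched as a small pipeline. This is a minimal sketch, not the notebook's implementation: `JobEntity`, `generate_entities`, and the two stub generators are hypothetical names, and each generator stands in for one constrained LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class JobEntity:
    title: str
    description: str
    postings: List[str] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)


# Hypothetical signatures: each callable stands in for one constrained LLM call.
TitleGenerator = Callable[[str, str, int], List[Tuple[str, str]]]
EntityGenerator = Callable[[str, str], JobEntity]


def generate_entities(
    industry_name: str,
    industry_description: str,
    n_titles: int,
    generate_titles: TitleGenerator,
    generate_entity: EntityGenerator,
) -> List[JobEntity]:
    # Step 1: one call yields N (title, description) pairs for the industry.
    titles = generate_titles(industry_name, industry_description, n_titles)
    # Step 2: one call per pair expands it into a full job entity.
    return [generate_entity(title, description) for title, description in titles]


# Stub generators so the sketch runs without a model.
def fake_titles(name: str, description: str, n: int) -> List[Tuple[str, str]]:
    return [(f"{name} Role {i}", f"Description {i}") for i in range(1, n + 1)]


def fake_entity(title: str, description: str) -> JobEntity:
    return JobEntity(title, description, [f"{title} - posting"], ["skill"])


entities = generate_entities(
    "Hospitality", "Lodging and tourism.", 3, fake_titles, fake_entity
)
print([entity.title for entity in entities])
```

Keeping the two steps behind separate callables means the title list can be inspected or filtered before the more expensive per-job calls run.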


In [4]:
import os
from typing import List
import outlines
import huggingface_hub
from pydantic import BaseModel, conlist

In [5]:
# huggingface_hub.interpreter_login()

In [6]:
# class Job(BaseModel):
#     title: str
#     description: str
#     skills: conlist(str, min_length=3, max_length=3)  # type: ignore
#     relevant_postings: conlist(str, min_length=3, max_length=3)  # type: ignore


# class IndustryJobs(BaseModel):
#     industry_name: str
#     industry_description: str
#     industry_jobs: conlist(Job, min_length=3, max_length=3)  # type: ignore

In [7]:
class Job(BaseModel):
    title: str
    description: str
    skills: List[str]
    relevant_postings: List[str]


class IndustryJobs(BaseModel):
    industry_name: str
    industry_description: str
    industry_jobs: List[Job]

In [8]:
@outlines.prompt
def industry_jobs_prompt(name: str, description: str) -> str:
    """
    You are an expert human resources professional with broad and deep knowledge of talent profiles across every industry.
    Your job is to generate a list of 3 diverse and popular Job Profiles that cover a range of functions, from foundational
    roles to innovative and emerging positions based on a provided industry name and description.

    For each Job Profile, you need to provide the following details:
    - Job Title: The title of the job
    - Job Description: A brief description of the job role and responsibilities
    - Skills: A list of 3 skills required for the job
    - Relevant Job Postings: A list of 3 relevant job postings as they might appear on popular job portals

    Here is the new industry you need to generate jobs for:
    Industry Name: {{ name }}
    Industry Description: {{ description }}
    Jobs Profiles:
    """

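The decorated function above treats its docstring as a template and fills the `{{ name }}` and `{{ description }}` placeholders from the call arguments. The sketch below imitates only that simple substitution with the standard library. `render_docstring` and `toy_prompt` are hypothetical names, and this is a simplified stand-in rather than how `outlines.prompt` is implemented (it does not handle loops or other template syntax).

```python
import inspect
import re


def render_docstring(fn, **values) -> str:
    """Fill {{ name }} placeholders in fn's docstring (simplified stand-in)."""
    # cleandoc strips the common indentation and surrounding blank lines.
    template = inspect.cleandoc(fn.__doc__)

    def substitute(match: re.Match) -> str:
        # group(1) is the placeholder name between the double braces.
        return str(values[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def toy_prompt(name: str, description: str):
    """
    Industry Name: {{ name }}
    Industry Description: {{ description }}
    Jobs Profiles:
    """


rendered = render_docstring(
    toy_prompt, name="Hospitality", description="Lodging and tourism."
)
print(rendered)
```
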
In [9]:
import torch
from outlines import models

model_id = "google/gemma-7b-it"

model = models.transformers(
    model_id, model_kwargs={"device_map": "auto", "torch_dtype": torch.bfloat16}
)

In [10]:
model.device

device(type='cuda', index=0)

In [11]:
prompt = industry_jobs_prompt(
    name="Hospitality",
    description="The hospitality industry is a broad category of fields within the service industry that includes lodging, event planning, theme parks, transportation, cruise line, and additional fields within the tourism industry.",
)

In [12]:
print(prompt)

You are an expert human resources professional with broad and deep knowledge of talent profiles across every industry.
Your job is to generate a list of 3 diverse and popular Job Profiles that cover a range of functions, from foundational
roles to innovative and emerging positions based on a provided industry name and description.

For each Job Profile, you need to provide the following details:
- Job Title: The title of the job
- Job Description: A brief description of the job role and responsibilities
- Skills: A list of 3 skills required for the job
- Relevant Job Postings: A list of 3 relevant job postings as they might appear on popular job portals

Here is the new industry you need to generate jobs for:
Industry Name: Hospitality
Industry Description: The hospitality industry is a broad category of fields within the service industry that includes lodging, event planning, theme parks, transportation, cruise line, and additional fields within the tourism industry.
Jobs Profiles:


In [13]:
jobs_generator = outlines.generate.json(model, IndustryJobs)

In [17]:
%%time

out = jobs_generator(prompt)

RuntimeError: The size of tensor a (8192) must match the size of tensor b (8193) at non-singleton dimension 3

In [16]:
out.model_dump()

{'industry_name': 'Hospitality',
 'industry_description': 'The hospitality industry is a broad category of fields within the service industry that includes lodging, event planning, theme parks, transportation, cruise line, and additional fields within the tourism industry.',
 'industry_jobs': [{'title': 'Front Desk Agent',
   'description': 'Provides a warm and professional welcome to guests, assists with check-in/out procedures, and handles various customer service inquiries.',
   'skills': ['Excellent communication and interpersonal skills',
    'Strong customer service orientation',
    'Proficient in a variety of software systems'],
   'relevant_postings': ['Front Desk Agent - The Peninsula, New York, NY',
    'Front Desk Agent - The Ritz-Carlton, Chicago, IL',
    'Front Desk Agent - The Four Seasons Hotel, Los Angeles, CA']},
  {'title': 'Chef',
   'description': 'Plans, prepares, and cooks a variety of culinary creations for guests, maintains kitchen cleanliness and safety stand

In [9]:
%%time

out = outlines.generate.json(model, IndustryJobs)(prompt)

RuntimeError: The size of tensor a (8192) must match the size of tensor b (8193) at non-singleton dimension 3

In [11]:
out

IndustryJobs(industry_name='Hospitality', industry_description='The hospitality industry is a broad category of fields within the service industry that includes lodging, event planning, theme parks, transportation, cruise line, and additional fields within the tourism industry.', industry_jobs=[Job(title='Front Desk Agent', description='The Front Desk Agent is responsible for providing a welcoming and efficient environment for guests by completing various tasks such as registration, room assignments, and resolving complaints.', skills=['Excellent communication and interpersonal skills', 'Strong customer service orientation', 'Proficient in various software systems', 'Ability to work independently and collaboratively', 'Attention to detail'], relevant_postings=['Hotel Front Desk Agent Jobs', 'Front Desk Agent Jobs in New York', 'Front Desk Agent Job Posting']), Job(title='Hotel Room Attendant', description='The Hotel Room Attendant is responsible for maintaining cleanliness and providin

In [12]:
out.model_dump()

{'industry_name': 'Hospitality',
 'industry_description': 'The hospitality industry is a broad category of fields within the service industry that includes lodging, event planning, theme parks, transportation, cruise line, and additional fields within the tourism industry.',
 'industry_jobs': [{'title': 'Front Desk Agent',
   'description': 'The Front Desk Agent is responsible for providing a welcoming and efficient environment for guests by completing various tasks such as registration, room assignments, and resolving complaints.',
   'skills': ['Excellent communication and interpersonal skills',
    'Strong customer service orientation',
    'Proficient in various software systems',
    'Ability to work independently and collaboratively',
    'Attention to detail'],
   'relevant_postings': ['Hotel Front Desk Agent Jobs',
    'Front Desk Agent Jobs in New York',
    'Front Desk Agent Job Posting']},
  {'title': 'Hotel Room Attendant',
   'description': 'The Hotel Room Attendant is res
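
The dump above lists five skills for "Front Desk Agent" although the prompt asks for three, which is consistent with plain `List[str]` fields placing no bound on list length. A minimal post-hoc check, assuming the dict layout shown above, might look like this. `list_length_violations` is a hypothetical helper, and `sample_job` is an abbreviated copy of the first job in the dump.

```python
from typing import Any, Dict, List


def list_length_violations(job: Dict[str, Any], expected: int = 3) -> List[str]:
    """Return one message per list field whose length differs from `expected`."""
    problems = []
    for key in ("skills", "relevant_postings"):
        count = len(job.get(key, []))
        if count != expected:
            problems.append(f"{key}: expected {expected}, got {count}")
    return problems


# Abbreviated copy of the first job from the dump above (description omitted).
sample_job = {
    "title": "Front Desk Agent",
    "skills": [
        "Excellent communication and interpersonal skills",
        "Strong customer service orientation",
        "Proficient in various software systems",
        "Ability to work independently and collaboratively",
        "Attention to detail",
    ],
    "relevant_postings": [
        "Hotel Front Desk Agent Jobs",
        "Front Desk Agent Jobs in New York",
        "Front Desk Agent Job Posting",
    ],
}

violations = list_length_violations(sample_job)
print(violations)
```

The same check could be applied to each entry of `out.model_dump()["industry_jobs"]`. The commented-out `conlist(..., min_length=3, max_length=3)` variant of the models would instead express the bound in the schema itself.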

In [11]:
out

IndustryJobs(industry_name='Software Development', industry_description='Software development is the process of conceiving, specifying, designing, programming, documenting, testing, and bug fixing involved in creating and maintaining applications, frameworks, or other software components.', industry_jobs=[Job(job_title='Software Engineer', job_description='Software engineers design, code, test, and maintain software applications. They are responsible for all aspects of the software development process, from initial planning to implementation.'), Job(job_title='Full-Stack Developer', job_description='Full-stack developers are responsible for building and maintaining software applications across all platforms. They have strong proficiency in web development technologies like HTML, CSS, and JavaScript, as well as in back-end languages like Python or Java.'), Job(job_title='Junior Software Engineer', job_description='Junior software engineers assist senior engineers in designing, coding, a

In [12]:
out.industry_name

'Software Development'

In [13]:
out.industry_description

'Software development is the process of conceiving, specifying, designing, programming, documenting, testing, and bug fixing involved in creating and maintaining applications, frameworks, or other software components.'

In [15]:
for job in out.industry_jobs:
    print(job.job_title)
    print(job.job_description)
    print()

Software Engineer
Software engineers design, code, test, and maintain software applications. They are responsible for all aspects of the software development process, from initial planning to implementation.

Full-Stack Developer
Full-stack developers are responsible for building and maintaining software applications across all platforms. They have strong proficiency in web development technologies like HTML, CSS, and JavaScript, as well as in back-end languages like Python or Java.

Junior Software Engineer
Junior software engineers assist senior engineers in designing, coding, and testing software applications. They are typically responsible for smaller tasks and are often paired with a senior engineer for guidance.



## Testing


In [1]:
from transformers import AutoTokenizer, pipeline
import torch

model_id = "google/gemma-7b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Use a distinct name so the imported `pipeline` factory is not shadowed.
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

In [2]:
messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
# Render the chat messages into the model's prompt format as a plain string.
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Sample up to 256 new tokens as a continuation of the rendered prompt.
outputs = pipe(
    prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95
)

In [6]:
outputs[0]["generated_text"]

"<bos><start_of_turn>user\nWho are you? Please, answer in pirate-speak.<end_of_turn>\n<start_of_turn>model\nAvast, me heartie, I am a swabblin' digital pirate, ready to pillage the high seas of the digital realm."

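The generated text above starts with the rendered chat prompt, so the model's reply has to be separated from it. A minimal sketch of doing that by slicing off the prompt prefix follows. `strip_prompt` is a hypothetical helper, and `example_prompt` is copied from the output above on the assumption that it matches the rendered prompt exactly.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the completion when the text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Leave the text untouched if the prefix does not match.
    return generated_text


# Prefix copied from the output above (assumed equal to the rendered prompt).
example_prompt = (
    "<bos><start_of_turn>user\n"
    "Who are you? Please, answer in pirate-speak.<end_of_turn>\n"
    "<start_of_turn>model\n"
)
example_output = example_prompt + (
    "Avast, me heartie, I am a swabblin' digital pirate, "
    "ready to pillage the high seas of the digital realm."
)

reply = strip_prompt(example_output, example_prompt)
print(reply)
```

The text-generation pipeline also accepts `return_full_text=False`, which should return only the newly generated text and make the slicing unnecessary.
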
## Testing demo example


In [6]:
from dataclasses import dataclass
from enum import Enum

import torch
import transformers
from pydantic import BaseModel, conlist, constr, StringConstraints
from typing_extensions import Annotated

import outlines

In [None]:
import torch
from outlines import models

model_id = "google/gemma-7b-it"

model = models.transformers(
    model_id, model_kwargs={"device_map": "auto", "torch_dtype": torch.bfloat16}
)

In [7]:
class QuestionChoice(str, Enum):
    A = "The key to my heart is"
    B = "The first item on my bucket list is"
    C = "Perks of dating me"
    D = "Message me if you also love"
    E = "People would describe me as"
    F = "I can beat you in a game of"


@dataclass
class QuestionAnswer:
    question: QuestionChoice
    answer: str


class DatingProfile(BaseModel):
    bio: Annotated[str, StringConstraints(min_length=10, max_length=300)]
    job: Annotated[str, StringConstraints(max_length=50)]
    interests: conlist(str, min_length=1, max_length=5)  # type: ignore
    qna1: QuestionAnswer
    qna2: QuestionAnswer

In [8]:
@dataclass
class Example:
    description: str
    profile: DatingProfile

In [9]:
@outlines.prompt
def dating_profile_prompt(description: str, examples: list[Example]):
    """
    You are a world-renowned matchmaker who understands the modern dating
    market. Your job is to generate dating app profiles for male clients
    interested in women based on a provided description. The profiles should be
    authentic, show off their strengths, and maximize their likelihood of
    getting matches on dating apps.  Here are some examples of past clients that
    you have successfully created profiles for:

    {% for example in examples %}
    Description:
    {{ example.description }}
    Profile:
    {{ example.profile }}
    {% endfor %}

    Here is the new client who you need to create a profile for:
    Description: {{ description }}
    Profile:
    """

In [10]:
samples: list[Example] = [
    Example(
        description="I'm an author and former professional soccer player living in Seattle who publishes popular fiction books. A typical day for me starts by hanging out with my cat, drinking a coffee, and reading as much as I can in a few hours. Then, I'll prepare a quick smoothie before starting to write for a few hours, take a break with soccer or running a few miles, and finally meet friends for dinner at a new, hip restaurant in the evening. Sometimes we go axe-throwing afterwards, or play poker, or watch a comedy show, or visit a dive bar. On my vacations, I travel extensively to countries in South America, Europe, and Asia, with the goal of visiting them all!",
        profile=DatingProfile(
            bio="Adventurer, dreamer, author, and soccer enthusiast. Life’s too short to waste time so I make the most of each day by exploring new places and playing with my friends on the pitch. What’s your favorite way to get out and have fun?",
            job="Famous Soccer Player -> Famous Author",
            interests=["Soccer", "Travel", "Friends", "Books", "Fluffy Animals"],
            qna1=QuestionAnswer(
                question=QuestionChoice.B, answer="swim in all seven oceans!"
            ),
            qna2=QuestionAnswer(
                question=QuestionChoice.E,
                answer="fun-loving, adventurous, and a little bit crazy",
            ),
        ),
    ),
    Example(
        description="I run my company and build houses for a living. I'm a big fan of the outdoors and love to go hiking, camping, and fishing. I don't like video games, but do like to watch movies. My love language is home-cooked food, and I'm looking for someone who isn't afraid to get their hands dirty.",
        profile=DatingProfile(
            bio="If you're looking for a Montana man who loves to get outdoors and hunt, and who's in-tune with his masculinity then I'm your guy!",
            job="House Construction Manager / Entrepreneur",
            interests=["Hunting", "Hiking", "The outdoors", "Home-cooked food"],
            qna1=QuestionAnswer(question=QuestionChoice.A, answer="food made at home"),
            qna2=QuestionAnswer(
                question=QuestionChoice.C,
                answer="having a man in your life who can fix anything",
            ),
        ),
    ),
    Example(
        description="I run my own Youtube channel with 10M subscribers. I love working with kids, and my audience skews pretty young too. In my free time, I play Fortnite and Roblox. I'm looking for someone who is also a gamer and likes to have fun. I'm learning Japanese in my free time as well as how to cook.",
        profile=DatingProfile(
            bio="Easy on the eyes (find me on Youtube!) and great with kids. What more do you need?",
            job="Youtuber 10M+ subscribers",
            interests=["Kids", "Gaming", "Japanese"],
            qna1=QuestionAnswer(question=QuestionChoice.D, answer="anime and gaming!"),
            qna2=QuestionAnswer(question=QuestionChoice.F, answer="Fortnite, gg ez"),
        ),
    ),
]

In [11]:
new_description = """I'm a laid-back lawyer who spends a lot of his free-time
gaming. I work in a corporate office, but ended up here after the start-up I
cofounded got acquired, so still play ping pong with my cool coworkers every
day. I have a bar at home where I make cocktails, which is great for
entertaining friends. I secretly like to wear suits and get a new one tailored
every few months. I also like weddings because I get to wear those suits, and
it's a good excuse for a date. I watch the latest series because I'm paying,
with my hard-earned money, for every streaming service."""

In [12]:
prompt = dating_profile_prompt(new_description, samples)
profile = outlines.generate.json(model, DatingProfile)
out = profile(prompt)

Exception ignored in: <bound method IPythonKernel._clean_thread_parent_frames of <ipykernel.ipkernel.IPythonKernel object at 0x7f917c73d130>>
Traceback (most recent call last):
  File "/home/user/hf-notebooks/synthetic-entity-generation/.venv/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 770, in _clean_thread_parent_frames
    def _clean_thread_parent_frames(
KeyboardInterrupt: 
