<a href="https://colab.research.google.com/github/10udCryp7/TV-command-synthesis/blob/main/src_prototype/Phase1_TextSynthesis_PromptGenerator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
non_active_prompt = """
You are to generate {generated_nums} sentences of type "non_active".

**Definition of "non_active":**
- Sentences represent natural, everyday indoor human conversation.
- Absolutely no TV/device commands.
- Mentioning or talking about a command is still considered non_active — only direct device instructions count as active.

**Input content:**
- Use the following content list (after removing prefixes): {content_list}

**Rules:**
1. Use only human-to-human conversation, not device commands.
2. Even if the sentence contains TV-related words, if it is not a direct command to the device, it is still non_active.
3. Sentences must sound natural and conversational.
4. Use the specific content ideas from the list.

**Output format (JSON only):**
```json
{{
  "non_active": [
    {{ "text": "..." }}
  ]
}}
```
"""
single_active_prompt = """
You are to generate {generated_nums} sentences of type "single_active".

**Definition of "single_active":**
- Each sentence is a direct, clear TV/device command.
- No human conversation or extra wording beyond the command.
- The command must be something that could be spoken to a device to perform an action immediately.

**Input:**
- TV commands list: {command_list}
- Content list (prefix removed): {content_list}

**Rules:**
1. Each sentence = exactly one direct TV/device command from the TV commands list.
2. Do not use questions, hypotheticals, or descriptions — only imperative commands.
3. No unrelated conversation or comments.
4. Must sound natural as a spoken device instruction.

**Output format (JSON only):**
```json
{{
  "single_active": [
    {{ "text": "..." }}
  ]
}}
```
"""
single_mix_prompt = """
You are to generate {generated_nums} sentences of type "single_mix".

**Definition of "single_mix":**
- Each sentence must contain exactly ONE direct, clear TV/device command + one unrelated human conversation.
- The command must be an imperative statement addressed to the device, not a question or discussion.
- The two parts must be completely unrelated.

**Position requirement:**
- In this task, the TV/device command should appear {command_position} in the sentence.

**Input:**
- TV commands list: {command_list}
- Content list (prefix removed): {content_list}

**Rules:**
1. Exactly one direct TV/device command + one unrelated human conversation.
2. The command must be in imperative form (telling the device to do something immediately).
3. No hypotheticals, descriptions, or indirect mentions of commands.
4. Commands and conversation must follow the specified position rule above.
5. Sentences must sound natural.
6. When splitting into segments, include every single word from the original sentence — no removals.

**Output format (JSON only):**
```json
{{
  "single_mix": [
    {{
      "text": "...",
      "segments": [
        {{ "0": "...", "type": "active" }},
        {{ "1": "...", "type": "non_active" }}
        // More segments if needed, but all words from the original sentence must be included
      ]
    }}
  ]
}}
```
"""
chain_active_prompt = """
You are to generate {generated_nums} sentences of type "chain_active".

**Definition of "chain_active":**
- Sentences contain multiple direct TV/device commands only.
- All commands must be imperative instructions to the device.

**Input:**
- TV commands list: {command_list}
- Content list (prefix removed): {content_list}

**Rules:**
1. Each sentence must have two or more imperative device commands from the provided list.
2. No human conversation, questions, or descriptive phrases.
3. Commands can be related or unrelated, but all must be valid direct instructions.
4. Sentences should sound natural as a spoken command sequence.

**Output format (JSON only):**
```json
{{
  "chain_active": [
    {{ "text": "..." }}
  ]
}}
```
"""
chain_mix_prompt = """
You are to generate {generated_nums} sentences of type "chain_mix".

**Definition of "chain_mix":**
- Sentences must contain multiple direct TV/device commands + at least one unrelated human conversation.
- Commands must be imperative instructions to the device, not questions or descriptions.

**Position requirement:**
- In this task, the sequence of TV/device commands should appear {command_position} in the sentence.

**Input:**
- TV commands list: {command_list}
- Content list (prefix removed): {content_list}

**Rules:**
1. Each sentence must contain 2 or more direct imperative commands from the TV commands list.
2. Must also contain at least one unrelated human conversation segment.
3. Commands and conversation must follow the specified position rule above.
4. No hypotheticals, indirect mentions, or descriptions — only direct instructions are active.
5. Sentences must sound natural.
6. Segments must cover the entire sentence with no missing words or characters.
7. In segments, `"type"` must be `"active"` for commands and `"non_active"` for conversation.

**Output format (JSON only):**
```json
{{
  "chain_mix": [
    {{
      "text": "...",
      "segments": [
        {{ "0": "...", "type": "active" }},
        {{ "1": "...", "type": "non_active" }}
        // More segments as needed, but must cover 100% of the sentence
      ]
    }}
  ]
}}
```
"""

In [2]:
PROMPT_CORPUS = {
    "non_active": non_active_prompt,
    "single_active": single_active_prompt,
    "single_mix": single_mix_prompt,
    "chain_active": chain_active_prompt,
    "chain_mix": chain_mix_prompt
}
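
Before formatting, it can help to check which placeholders each template actually expects. The following is a minimal sketch using the standard-library `string.Formatter`; `template_fields`, `demo_corpus`, and `allowed` are hypothetical names introduced for illustration, and the tiny corpus stands in for `PROMPT_CORPUS` so the block runs on its own.

```python
from string import Formatter

def template_fields(template: str) -> set[str]:
    """Return the named replacement fields that str.format would look up."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for pure literal text, including escaped braces.
    return {name for _, name, _, _ in Formatter().parse(template) if name}

# Hypothetical stand-in for the real corpus.
demo_corpus = {
    "plain": 'Generate {generated_nums} items from {content_list}. {{ "text": "..." }}',
    "mix": "Put the command {command_position}; commands: {command_list}",
}

# Placeholder names the caller is prepared to supply (an assumption for this sketch).
allowed = {"generated_nums", "content_list", "command_list", "command_position"}

# Any non-empty set here points at a placeholder the caller would not fill.
unknown = {name: template_fields(t) - allowed for name, t in demo_corpus.items()}
print(unknown)
```

Running the same check over the real mapping would surface any mismatch between placeholder names and the keyword arguments used when formatting.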

In [4]:
import random

class PromptGenerator():
  def __init__(self, num_samples_command: int, num_samples_content: int,
               chain_length: int, prompt_dir: str, prompt_types: list,
               list_corpus: list[list[str]], list_command: list[str],
               list_position: list[str], generated_num: int):
    self.num_samples_command = num_samples_command
    self.num_samples_content = num_samples_content
    self.chain_length = chain_length # for chain case
    self.prompt_dir = prompt_dir
    self.prompt_types = prompt_types
    self.list_corpus = list_corpus
    self.list_command = list_command
    self.list_position = list_position
    self.generated_num = generated_num

  def content_sample(self):
    # list_corpus is a list of corpora (each a list of strings);
    # draw num_samples_content items, without replacement, from each one.
    return [random.sample(corpus, self.num_samples_content)
            for corpus in self.list_corpus]

  def command_sample(self):
    sample = random.sample(self.list_command, self.num_samples_command)
    return sample

  def get_prompt(self, prompt_type, sampled_command, sampled_contents):
    # Look up the template in the PROMPT_CORPUS mapping defined above.
    prompt_template = PROMPT_CORPUS[prompt_type]

    # The templates spell the count placeholder {generated_nums}.
    # str.format ignores keyword arguments a template does not reference,
    # so command_list can be passed even to the non_active template.
    format_args = {"command_list": sampled_command,
                   "content_list": sampled_contents,
                   "generated_nums": self.generated_num}
    if prompt_type in ("single_mix", "chain_mix"):
      # Only the mix templates contain {command_position}.
      format_args["command_position"] = self.position_sample()

    return prompt_template.format(**format_args)

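  def position_sample(self):
    # The mix templates insert this value into a sentence
    # ("should appear {command_position} in the sentence"),
    # so pick a single position phrase rather than a list of three.
    return random.choice(self.list_position)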

  def generate_prompt(self):
    prompts = {}
    for prompt_type in self.prompt_types:
      # Draw fresh samples for every prompt type. The sampling methods
      # read their inputs from self and take no arguments.
      sampled_command = self.command_sample()
      sampled_contents = self.content_sample()

      # get_prompt names its parameters sampled_command / sampled_contents,
      # so pass them positionally.
      prompt = self.get_prompt(prompt_type, sampled_command, sampled_contents)

      prompts[prompt_type] = prompt
    return prompts
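
The mix prompts require that the segments cover every word of the sentence. Once a model has answered, that rule can be checked mechanically. The sketch below is one possible check, not part of the notebook: `segments_cover_text` and the hand-written `response` are hypothetical, and the comparison is done on whitespace-separated words.

```python
import json

def segments_cover_text(item: dict) -> bool:
    """True when the segment texts, joined in order, match the words of item["text"]."""
    parts = []
    for segment in item["segments"]:
        # Each segment stores its text under a numeric-string key ("0", "1", ...)
        # next to "type", as in the output format of the mix prompts.
        parts.extend(value for key, value in segment.items() if key != "type")
    return " ".join(parts).split() == item["text"].split()

# Hypothetical model response, written by hand for illustration.
response = json.loads("""
{
  "single_mix": [
    {
      "text": "Turn up the volume, did you feed the cat?",
      "segments": [
        { "0": "Turn up the volume,", "type": "active" },
        { "1": "did you feed the cat?", "type": "non_active" }
      ]
    }
  ]
}
""")

checks = [segments_cover_text(item) for item in response["single_mix"]]
print(checks)

# A response that drops a word from a segment fails the check.
incomplete = {
    "text": "Mute the TV, I am tired",
    "segments": [{"0": "Mute the TV,", "type": "active"},
                 {"1": "I am", "type": "non_active"}],
}
print(segments_cover_text(incomplete))
```

Note that `json.loads` rejects `//` comments, so a response that echoes the comment line from the output-format example would need to be cleaned before parsing.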
