### Prepare In-context Demonstrations

In [11]:
from prosody_incontext_examples import in_context_examples_selected, in_context_examples_random

print(f'Number of selected in-context examples: {len(in_context_examples_selected)}')
print(f'Number of random in-context examples: {len(in_context_examples_random)}')

Number of selected in-context examples: 16
Number of random in-context examples: 128


### Prepare Test Samples

In [12]:
# Read the test sentences and their annotated ground truth
with open('data/databaker_prosody/chatgpt_test_1k_input.txt', 'r', encoding='utf-8') as f:
    test_input = f.readlines()
with open('data/databaker_prosody/chatgpt_test_1k_ground_truth.txt', 'r', encoding='utf-8') as f:
    test_ground_truth = f.readlines()

# Strip surrounding whitespace, including the trailing '\n'
test_input = [line.strip() for line in test_input]
test_ground_truth = [line.strip() for line in test_ground_truth]

print(f'Number of test examples: {len(test_input)}')

Number of test examples: 1000
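
Because the input and ground-truth files are read separately, it can be worth confirming that they stay aligned line by line. The following is a minimal sketch, assuming markers always have the form `#` plus one digit; `strip_prosody_markers`, `sample_inputs`, and `sample_truths` are hypothetical names, and the sentences are copied from the outputs shown later in this notebook.

```python
import re

def strip_prosody_markers(annotated):
    # Remove every '#<digit>' marker, leaving the plain sentence
    return re.sub(r'#\d', '', annotated)

# Hypothetical stand-ins for test_input and test_ground_truth
sample_inputs = ['聊天止于呵呵啊，兄弟。', '第二垄断血液骨髓。']
sample_truths = ['聊天#2止于#1呵呵啊#2，兄弟#4。', '第二#1垄断#2血液#1骨髓#4。']

# Indices where the annotated sentence does not reduce to the input sentence
mismatches = [
    i for i, (plain, annotated) in enumerate(zip(sample_inputs, sample_truths))
    if strip_prosody_markers(annotated) != plain
]
print(f'Number of misaligned pairs: {len(mismatches)}')
```

The same check could be run on the full `test_input` and `test_ground_truth` lists after loading.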


### Construct Prompt

In [13]:
import random, re

def get_knowledge_prompt(levels=['#1', '#2', '#3', '#4'], mode='default', separator='\n'):

  '''
  Build a string of prosodic hierarchy knowledge for the prompt.
  - levels: A list of the prosodic levels to include in the prompt.
  - mode: A string indicating how to display the knowledge. Currently, only the 'default' mode is available.
  - separator: A string appended after the knowledge of each prosodic level.
  '''

  # Dictionary holding the knowledge text for each prosodic level
  knowledges = {
      '#1': "韵律词 （#1）：韵律词常常是一个词典词，有时也会包含多个词典词。一般包含2~3个音节，包含一个词及它的附着成分。"
            "单音节的韵律词一般由单音节词延长，例如连词、介词等。"
            "#1主要出现在韵律词之后以及1） 不带“的”或“地”的短语或词组的两词之间；2） 带“的”名词短语或词组的“的”字之后",

      '#2': "韵律短语（#2）：是介于韵律词（#1）和语调短语（#3）之间的中等节奏模块。它可能小于句法上的短语。是人耳能听到的可分辨的界限。"
            "由1个以上的韵律词构成，包含4~9个音节，是最常见的自然韵律单元。"
            "#2主要出现在：1）修饰语与中心语之间；2）动宾之间、介宾之间或系表之间。",

      '#3': "语调短语（#3）：语调短语由多个韵律短语（#2）构成，一般对应句中的逗号，会有明显的停顿。"
            "#3主要出现在：1）小句之间；2）复杂的主谓之间。",

      '#4': "语段边界（#4）：语段边界一般是句子的末尾。"
            "汉语中的代表句子末尾的句末点号有三个：句末，有句号、问号、叹号，#4主要集中在句号、问号、叹号前。"
  }

  # English translation of the knowledge strings above (for reference only; not used in the prompt):
  # '#1': Prosodic word. Often a single dictionary word, occasionally several. It typically has 2 to 3 syllables and consists of a word plus its attached elements. Monosyllabic prosodic words usually come from lengthened monosyllabic words such as conjunctions and prepositions. #1 mainly occurs after a prosodic word, and 1) between the two words of a phrase that does not contain "的" or "地"; 2) after the "的" in a noun phrase that contains "的".
  # '#2': Prosodic phrase. A mid-level rhythmic unit between the prosodic word (#1) and the intonational phrase (#3), possibly smaller than a syntactic phrase. It marks a boundary that the human ear can distinguish. It consists of one or more prosodic words, spans 4 to 9 syllables, and is the most common natural prosodic unit. #2 mainly occurs 1) between a modifier and its head; 2) between verb and object, preposition and object, or copula and predicative.
  # '#3': Intonational phrase. It consists of several prosodic phrases (#2), usually corresponds to a comma in the sentence, and carries a clear pause. #3 mainly occurs 1) between clauses; 2) between a complex subject and its predicate.
  # '#4': Utterance boundary. It usually falls at the end of a sentence. Chinese has three sentence-final marks (full stop, question mark, exclamation mark), and #4 mainly occurs directly before them.


  knowledge_prompt = ''

  if mode=='default':
    # Iterate over the requested prosodic levels
    for level in levels:
      # Append the knowledge text for the current level
      knowledge_prompt += knowledges[level]
      # Append the separator after each knowledge entry
      knowledge_prompt += separator

  return knowledge_prompt


def get_demonstration_prompt(num_in_context_examples=4, mode='input+output', index=True, selected=True, seed=42):

    '''
    Build a string of in-context examples for the prompt.
    - num_in_context_examples: An integer giving the number of examples to include in the prompt.
    - mode: A string indicating how to display the examples. Options are 'input+output' or 'output'.
    - index: A boolean indicating whether to prefix each example with an index line.
    - selected: A boolean; if True, draw from in_context_examples_selected, otherwise from in_context_examples_random.
    - seed: An int used to shuffle the examples reproducibly, or None to keep their original order.
    '''

    examples_random = in_context_examples_random.copy()
    examples_selected = in_context_examples_selected.copy()
    if seed is not None:
        random.Random(seed).shuffle(examples_random)
        random.Random(seed).shuffle(examples_selected)

    # Pick the pool once so the size check below applies to the pool actually used
    examples = examples_selected if selected else examples_random

    # Internal helper that strips prosodic markers from an annotated sentence,
    # e.g., converts "吃个#1甜筒#2降#1降温吧#4。" to "吃个甜筒降降温吧。"
    def process_sentence(sentence):
        # Remove markers written as <#number>
        sentence = re.sub(r'<#\d+>', '', sentence)
        # Remove markers written as #number that follow a word
        input_str = re.sub(r'(\w+)#(\d+)', r'\1', sentence)
        return input_str

    # Initialize an empty string to collect the generated examples
    demonstration_prompt = ''
    # Ensure the requested number of examples does not exceed the pool size
    assert num_in_context_examples <= len(examples)
    # Iterate through the requested number of examples
    for i in range(num_in_context_examples):
        # The annotated example is the output; the input is the same sentence without markers
        output_str = examples[i].strip()
        input_str = process_sentence(output_str)

        # If adding example index is required
        if index:
            demonstration_prompt += f'\n示例{i+1}\n'
        else:
            demonstration_prompt += '\n'

        # If the mode is 'input+output', provide both input and output
        if mode=='input+output':
            demonstration_prompt += '输入：' + input_str + '\n' + '输出：' + output_str + '\n'

        # If the mode is 'output', only provide the output with prosodic markers, not the original input
        elif mode=='output':
            demonstration_prompt += output_str + '\n'

    return demonstration_prompt


def get_task_instruction_prompt(sentence_to_process, mode='default'):

    '''
    Build the task instruction string for the prompt.
    - sentence_to_process: A string representing the sentence to be processed.
    - mode: A string indicating how to display the task instruction. Options are 'default', 'output_prompt', or 'output_prompt_only'.
    '''

    task_instruction = ("\n你已经学习了韵律层级的理论知识，以及从示例中学习了韵律层级标注的规律。"
                        "接下来，请仔细理解下面的句子，并进行韵律层级结构的标注，直接输出结果句子，"
                        "不要加入任何额外的内容（例如'输出：'或换行符）。")
    # English translation of task_instruction (for reference only; not used in the prompt):
    # "You have learned the theoretical knowledge of the prosodic hierarchy and the rules of prosodic hierarchy annotation from the examples.
    # Next, please read the following sentence carefully and annotate its prosodic hierarchy, outputting the resulting sentence directly,
    # without adding any extra content (such as '输出：' or line breaks)."

    if mode == 'default':
        # Enclose the sentence to be processed in double quotes
        return f'{task_instruction} "{sentence_to_process}":'
    elif mode == 'output_prompt':
        # Follow the instruction with an input/output template
        return f'{task_instruction}\n输入：{sentence_to_process}\n输出：'
    elif mode == 'output_prompt_only':
        # Omit the instruction and give only the input/output template
        return f'\n\n输入：{sentence_to_process}\n输出：'
    raise ValueError(f'Unknown mode: {mode}')


def get_prompt(prompt_components, prompt='', sentence_to_process=''):

    '''
    Build the complete prompt by concatenating the outputs of get_knowledge_prompt, get_demonstration_prompt, and get_task_instruction_prompt, in that order.

    - prompt_components: A dictionary with the following keys, each configuring one part of the prompt:
      'knowledge': A dict with 'levels' and 'mode', passed to get_knowledge_prompt.
      'demonstration': A dict with 'num_demonstration', 'mode', 'index', 'selected', and 'seed', passed to get_demonstration_prompt.
      'task_instruction': The mode string passed to get_task_instruction_prompt.

    - prompt: An optional string that the generated parts are appended to.

    - sentence_to_process: A string representing the sentence to be processed.

    To add a new configuration option, add a key-value pair to prompt_components and pass it to the corresponding function.
    '''

    # Add the prosodic hierarchy knowledge to the prompt string
    prompt += get_knowledge_prompt(
        levels=prompt_components['knowledge']['levels'],
        mode=prompt_components['knowledge']['mode'],
    )

    # Add the in-context demonstrations to the prompt string
    prompt += get_demonstration_prompt(
        num_in_context_examples=prompt_components['demonstration']['num_demonstration'],
        mode=prompt_components['demonstration']['mode'],
        index=prompt_components['demonstration']['index'],
        selected=prompt_components['demonstration']['selected'],
        seed=prompt_components['demonstration']['seed']
    )

    # Add the task instruction for the sentence to process
    prompt += get_task_instruction_prompt(
        sentence_to_process=sentence_to_process,
        mode=prompt_components['task_instruction']
    )

    return prompt
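
The `seed` argument of `get_demonstration_prompt` relies on `random.Random(seed).shuffle`, which reorders a list the same way every time for a given seed, while `seed=None` skips the shuffle so the first examples in the list are always used. A minimal sketch of that behavior follows; `shuffled` and the placeholder example strings are hypothetical and only illustrate the pattern.

```python
import random

# Hypothetical placeholder examples standing in for the real annotated sentences
examples = ['ex0', 'ex1', 'ex2', 'ex3', 'ex4']

def shuffled(items, seed):
    # Work on a copy so the caller's list is left untouched
    items = list(items)
    if seed is not None:
        # A private Random instance keeps the global random state unaffected
        random.Random(seed).shuffle(items)
    return items

first_run = shuffled(examples, seed=42)
second_run = shuffled(examples, seed=42)
unshuffled = shuffled(examples, seed=None)

print(first_run == second_run)   # same seed, same order
print(unshuffled == examples)    # no seed, original order kept
```

Fixing the seed makes a prompt reproducible across runs, while varying it gives a simple way to test sensitivity to which demonstrations are shown.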

In [14]:
# Define the structure design of the prompt
prompt_components = {
    'knowledge': {
        'levels': ['#1', '#2', '#3', '#4'],
        'mode': 'default'
    },
    'demonstration': {
        'num_demonstration': 4,
        'mode': 'input+output',  # Options: 'input+output' or 'output'
        'index': False,  # Options: True or False
        'selected': True, # Options: True or False
        'seed': None, # Options: int or None
    },
    'task_instruction': 'output_prompt',  # Options: 'default', 'output_prompt', or 'output_prompt_only'
}

# Construct prompt
prompt = get_prompt(
    prompt_components,
    sentence_to_process='当时离学校不过几十米远。',
    )
print(prompt)

韵律词 （#1）：韵律词常常是一个词典词，有时也会包含多个词典词。一般包含2~3个音节，包含一个词及它的附着成分。单音节的韵律词一般由单音节词延长，例如连词、介词等。#1主要出现在韵律词之后以及1） 不带“的”或“地”的短语或词组的两词之间；2） 带“的”名词短语或词组的“的”字之后
韵律短语（#2）：是介于韵律词（#1）和语调短语（#3）之间的中等节奏模块。它可能小于句法上的短语。是人耳能听到的可分辨的界限。由1个以上的韵律词构成，包含4~9个音节，是最常见的自然韵律单元。#2主要出现在：1）修饰语与中心语之间；2）动宾之间、介宾之间或系表之间。
语调短语（#3）：语调短语由多个韵律短语（#2）构成，一般对应句中的逗号，会有明显的停顿。#3主要出现在：1）小句之间；2）复杂的主谓之间。
语段边界（#4）：语段边界一般是句子的末尾。汉语中的代表句子末尾的句末点号有三个：句末，有句号、问号、叹号，#4主要集中在句号、问号、叹号前。

输入：巨大的雷鸣声一浪浪地滚来，大地轻颤。
输出：巨大的#1雷鸣声#2一浪浪地#1滚来#3，大地#1轻颤#4。

输入：常务副省长陈敏尔也已经在昨天晚上赶到舟山。
输出：常务#1副省长#2陈敏尔#3也已经#1在#1昨天#1晚上#2赶到#1舟山#4。

输入：乌鸦说我真不幸，但实际上，他是因为运气吗？
输出：乌鸦说#2我真#1不幸#3，但#1实际上#3，他是#1因为#1运气吗#4？

输入：在栾城县红日永和豆浆店，记者看到了一块诚信监管牌。
输出：在#1栾城县#2红日#2永和#1豆浆店#3，记者#1看到了#1一块#2诚信#1监管牌#4。

你已经学习了韵律层级的理论知识，以及从示例中学习了韵律层级标注的规律。接下来，请仔细理解下面的句子，并进行韵律层级结构的标注，直接输出结果句子，不要加入任何额外的内容（例如'输出：'或换行符）。
输入：当时离学校不过几十米远。
输出：


### Inference with OpenAI API (Multi-threading)

In [15]:
import os
import openai
import concurrent.futures

# Read the API key from the environment rather than hard-coding it in the notebook
openai.api_key = os.environ['OPENAI_API_KEY']
model_engine = "text-davinci-003"

num_test_samples = 3 # or 1000
test_input = test_input[:num_test_samples]
test_ground_truth = test_ground_truth[:num_test_samples]

num_thread = 1 # Number of threads
num_worker = min(num_test_samples, num_thread)

def fetch_completion(i, prompt):
    completions = openai.Completion.create(
        engine=model_engine,
        prompt=prompt,
        max_tokens=100,
        n=1,
        stop=None,
        temperature=0.2,
    )
    return i, completions.choices[0].text.strip()


predictions = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=num_worker) as executor:
    futures = {}
    for i in range(len(test_input)):
        prompt = get_prompt(prompt_components, sentence_to_process=test_input[i])
        futures[executor.submit(fetch_completion, i, prompt)] = i

    for future in concurrent.futures.as_completed(futures):
        i = futures[future]
        result = future.result()[1]
        predictions[i] = result


def print_result(i, test_input, ground_truth, result):
    print(f'[Sample {i+1}/{len(test_input)}]:\t{test_input[i]}')
    #print(f'Prompt: \t', prompt)
    print(f'Ground Truth: \t{ground_truth[i]}')
    print(f'Prediction:  \t{result}')
    print('- ' * 64)

for i in range(len(test_input)):
    print_result(i, test_input, test_ground_truth, predictions[i])

[Sample 1/3]:	曾志伟坦言各人没有计较出场次序，但安排上仍有困难。
Ground Truth: 	曾志伟#2坦言#2各人#2没有#1计较#1出场#1次序#3，但#1安排上#2仍有#1困难#4。
Prediction:  	曾#1志伟#2坦言#3各人#1没有#1计较#1出场#1次序#3，但#1安排上#3仍有#1困难#4。
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
[Sample 2/3]:	聊天止于呵呵啊，兄弟。
Ground Truth: 	聊天#2止于#1呵呵啊#2，兄弟#4。
Prediction:  	聊天#1止于#1呵呵#2啊#3，兄弟#4。
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
[Sample 3/3]:	第二垄断血液骨髓。
Ground Truth: 	第二#1垄断#2血液#1骨髓#4。
Prediction:  	第#1二垄#2断血#1液骨#1髓#4。
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
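
With more than one worker, `concurrent.futures.as_completed` yields futures in completion order rather than submission order, which is why the cell above stores each result under its sample index. The sketch below illustrates the pattern without calling the API; `fake_completion` is a hypothetical stand-in for `fetch_completion` that deliberately finishes later items first.

```python
import concurrent.futures
import time

def fake_completion(i, prompt):
    # Hypothetical stand-in for the API call: earlier items sleep longer
    time.sleep(0.01 * (3 - i))
    return i, prompt.upper()

prompts = ['a', 'b', 'c']
results = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = {
        executor.submit(fake_completion, i, p): i
        for i, p in enumerate(prompts)
    }
    for future in concurrent.futures.as_completed(futures):
        # Each result carries its own index, so arrival order does not matter
        i, text = future.result()
        results[i] = text

# Rebuild the original order from the index keys
ordered = [results[i] for i in range(len(prompts))]
print(ordered)
```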


### Evaluate Results

In [16]:
import prosody_evaluation
recall, precision, fscore, completematch = prosody_evaluation.evaluate(test_ground_truth, predictions)

Testing sample 0
	Ground Truth: 曾志伟#2坦言#2各人#2没有#1计较#1出场#1次序#3，但#1安排上#2仍有#1困难#4。
	Prediction  : 曾#1志伟#2坦言#3各人#1没有#1计较#1出场#1次序#3，但#1安排上#3仍有#1困难#4。
	Ground Truth: WWW12WW12WW12WW1WW1WW1WW123W1WWW2WW1WW4
	Prediction  : W1WW12WW123WW1WW1WW1WW1WW123W1WWW123WW1WW4
	Ground Truth: 002020201010103100201004
	Prediction  : 102030101010103100301004
Testing sample 1
	Ground Truth: 聊天#2止于#1呵呵啊#2，兄弟#4。
	Prediction  : 聊天#1止于#1呵呵#2啊#3，兄弟#4。
	Ground Truth: WW2WW1WWW2WW4
	Prediction  : WW1WW1WW12W123WW4
	Ground Truth: 0201002004
	Prediction  : 0101023004
Testing sample 2
	Ground Truth: 第二#1垄断#2血液#1骨髓#4。
	Prediction  : 第#1二垄#2断血#1液骨#1髓#4。
	Ground Truth: WW1WW2WW1WW4
	Prediction  : W1WW2WW1WW1W4
	Ground Truth: 010201004
	Prediction  : 102010104
| Level | Precision | Recall | F-Score |
| ----- | --------- | ------ | ------- |
|PW  #1 | 68.42 | 81.25 | 74.29 |
|PPH #2 | 71.43 | 62.5 | 66.67 |
|IPH #3 | 25.0 | 100.0 | 40.0 |
|Average | 54.95 | 81.25 | 60.32 |
Exact Match: 0 / 3 = 0.0%
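
The per-level scores above are consistent with a hierarchical reading in which a boundary of level k or higher (up to #3) also counts as a level-k boundary, and #4 is not scored. The sketch below is a reconstruction under that assumption, not the actual code of `prosody_evaluation`; `boundary_levels` and `score_level` are hypothetical names, and the sentences are copied from the output above.

```python
import re

def boundary_levels(annotated):
    # One entry per character: the level of the '#n' marker after it, or 0
    return [int(level) if level else 0
            for _char, level in re.findall(r'([^#])(?:#(\d))?', annotated)]

def score_level(gold_sentences, pred_sentences, k):
    # Assumption: a boundary of level k..3 counts at level k; '#4' is ignored
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = boundary_levels(gold), boundary_levels(pred)
        if len(g) != len(p):
            continue  # characters differ, so positions cannot be compared
        for g_lvl, p_lvl in zip(g, p):
            g_hit, p_hit = k <= g_lvl <= 3, k <= p_lvl <= 3
            n_gold += g_hit
            n_pred += p_hit
            tp += g_hit and p_hit
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore

gold = ['曾志伟#2坦言#2各人#2没有#1计较#1出场#1次序#3，但#1安排上#2仍有#1困难#4。',
        '聊天#2止于#1呵呵啊#2，兄弟#4。',
        '第二#1垄断#2血液#1骨髓#4。']
pred = ['曾#1志伟#2坦言#3各人#1没有#1计较#1出场#1次序#3，但#1安排上#3仍有#1困难#4。',
        '聊天#1止于#1呵呵#2啊#3，兄弟#4。',
        '第#1二垄#2断血#1液骨#1髓#4。']

for k in (1, 2, 3):
    scores = [round(100 * value, 2) for value in score_level(gold, pred, k)]
    print(f'#{k}', *scores)
# Prints:
# #1 68.42 81.25 74.29
# #2 71.43 62.5 66.67
# #3 25.0 100.0 40.0
```

Under this reading, the low #3 precision comes from the three extra #3 markers the model predicted where the ground truth has #2.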
