<div class="alert alert-block alert-info">
    
 <b>Generate training and dev datasets for SLM fine-tuning using PPO/GRPO</b>

 1. ASTE only: 3,634 train and 887 dev examples in total across the four datasets.
 2. ASTE + AOPE + AESC: 10,902 train and 2,661 dev examples in total across the four datasets (3 × the ASTE counts).
 3. AE + OE + ASTE + AOPE + AESC: 18,170 train and 4,435 dev examples in total across the four datasets (5 × the ASTE counts).
</div>
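
The multi-task configurations listed above can, in principle, be derived from the ASTE triplets alone. The sketch below is one possible way to do that and is not taken from this notebook: `derive_targets` is a hypothetical helper, and it assumes that AE, OE, AOPE and AESC stand for aspect extraction, opinion extraction, aspect-opinion pair extraction and aspect-sentiment extraction, with targets formatted the same way as the prompts in this notebook request.

```python
def derive_targets(triplets):
    """Derive per-task targets from ASTE strings of the form
    'aspect ; opinion ; sentiment' (hypothetical helper, not from the notebook)."""
    parsed = [tuple(part.strip() for part in t.split(';')) for t in triplets]

    def unique(items):
        # Drop duplicates while keeping the first-seen order
        return list(dict.fromkeys(items))

    return {
        'ae': unique(a for a, _, _ in parsed),
        'oe': unique(o for _, o, _ in parsed),
        'aope': unique(f"{a} ; {o}" for a, o, _ in parsed),
        'aesc': unique(f"{a} ; {s}" for a, _, s in parsed),
        'aste': unique(f"{a} ; {o} ; {s}" for a, o, s in parsed),
    }


targets = derive_targets(['food ; great ; POS', 'food ; fresh ; POS'])
print(targets['ae'])    # aspect terms only
print(targets['aesc'])  # aspect ; sentiment pairs
```

Because every sentence yields one example per task, the example counts grow as a multiple of the ASTE-only counts, which matches the totals in the list.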

In [1]:
import os
import json
from jinja2 import Environment
import pandas as pd
import random
import re
import sys
from tqdm import tqdm
from datasets import Dataset

data_dir = "/home/azureuser/localfiles/data/aste"
datasets = ["14res", "15res", "16res", "lap14"]



In [2]:
# Current Error: NotImplementedError: sequence_length=274 is larger than max_length=256
base_reasoning_template_prefix = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. User: """
aspect_template = base_reasoning_template_prefix + """You are an AI agent skilled at identifying aspect terms from a given sentence. For the sentence provided below extract all the aspect terms and return as a python list of strings, e.g., ['aspect_1', 'aspect_2', ...]. \nIf no aspect term can be identified then return ['NULL']. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> ['battery', 'screen'] </answer>. \nSentence: {sentence}
Assistant: Let me solve this step by step.
<think>"""
opinion_template = base_reasoning_template_prefix + """You are an AI agent skilled at identifying opinion terms from a given sentence. For the sentence provided below extract all the opinion terms and return as a python list of strings, e.g., ['opinion_1', 'opinion_2', ...]. \nIf no opinion term can be identified then return ['NULL']. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> ['amazing', 'slow'] </answer>. \nSentence: {sentence}
Assistant: Let me solve this step by step.
<think>"""
aope_template = base_reasoning_template_prefix + """You are an AI agent skilled at identifying aspect and opinion terms from a given sentence. For the sentence provided below extract all the aspect terms and the corresponding opinion terms and return as a python list of strings, e.g., ['aspect_1 ; opinion_1', 'aspect_2 ; opinion_2', ...]. \nIf either an aspect or opinion term is not present in the sentence then return 'NULL' in its place. Make sure every element in the list has two sub-elements in it. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> ['battery ; long', 'screen ; bright'] </answer>. \nSentence: {sentence}
Assistant: Let me solve this step by step.
<think>"""
aesc_template = base_reasoning_template_prefix + """You are an AI agent skilled at identifying aspect and sentiment terms from a given sentence. For the sentence provided below extract all the aspect terms and the corresponding sentiment terms and return as a python list of strings, e.g., ['aspect_1 ; sentiment_1', 'aspect_2 ; sentiment_2', ...]. \nIf either an aspect or sentiment term is not present in the sentence then return 'NULL' in its place. Make sure every element in the list has two sub-elements in it. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> ['battery ; POS', 'screen ; NEG'] </answer>. \nSentence: {sentence}
Assistant: Let me solve this step by step.
<think>"""
aste_template = base_reasoning_template_prefix + """For the sentence provided below extract all the aspect terms, opinion terms and sentiments and return as a python list of strings, e.g., ['aspect_1 ; opinion_1 ; sentiment_1', ...]. \nThe sentiment is one of the following three: 'POS', 'NEG' or 'NEU'. Return 'NULL' for missing values. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags. \nSentence: {sentence}
Assistant: Let me solve this step by step.
<think>"""

def templatize(input_variables: list, template: str):
    """
    Dynamically create prompts based on a template and
    user defined values. The template can have placeholders
    indicated within curly braces, e.g., {placeholder}
    """
    # Match {placeholder} names; 0-9 (not 1-9) so that names containing a zero are found
    res = re.findall(r"[{][a-z_:0-9]+[}]", template)
    if len(input_variables) == 0 or len(res) == 0:
        # nothing to decorate
        return template

    var = [x.strip('{}') for x in res]  # identify the variable names
    nvr = ["{{ "+v+" }}" for v in var]  # rewrite in the Jinja format
    s = template
    for v1, v2 in zip(res, nvr):
        s = s.replace(v1, v2)
    # Render with Jinja, pairing placeholders with values by position
    rendered = Environment().from_string(s)
    return rendered.render(dict(zip(var, input_variables)))


def make_prefix(dp, template):
    sentence = dp['sentence']
    prefix = templatize([sentence], template)
    return prefix
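
The placeholder substitution that `templatize` performs can also be sketched without Jinja. The version below is a standard-library alternative for illustration only, not the notebook's implementation: `fill_placeholders` and `PLACEHOLDER` are hypothetical names, and unlike `templatize` it pairs values with unique placeholder names in order of first appearance.

```python
import re

# {name} placeholders made of lowercase letters, digits and underscores
PLACEHOLDER = re.compile(r"\{([a-z_0-9]+)\}")


def fill_placeholders(values, template):
    """Replace {name} placeholders with values, matched by order of first appearance."""
    names = list(dict.fromkeys(PLACEHOLDER.findall(template)))
    mapping = dict(zip(names, values))
    # Placeholders without a supplied value are left untouched
    return PLACEHOLDER.sub(
        lambda m: str(mapping.get(m.group(1), m.group(0))), template
    )


prompt = fill_placeholders(
    ['But the staff was so horrible to us .'],
    "Sentence: {sentence}\nAssistant: Let me solve this step by step.\n<think>",
)
print(prompt)
```

Literal braces that do not look like a placeholder, such as a Python dict shown inside a prompt, are not matched by the pattern and pass through unchanged.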

In [3]:
split = 'train'
train_dataset = {'dataset': [], 'sentence': [], 'target': []}
for dataset in datasets:
    with open(os.path.join(data_dir, dataset, f'{split}.sent'), 'r') as fr:
        sentences = fr.readlines()
        sentences = [e.strip() for e in sentences]

    with open(os.path.join(data_dir, dataset, f'{split}.tup'), 'r') as fr:
        labels = fr.readlines()
        labels = [t.strip() for t in labels]
    assert len(sentences) == len(labels), f"Mismatch in X and Y length for {dataset}/{split}: {len(sentences)} vs {len(labels)}"

    for x, y in zip(sentences, labels):
        train_dataset['dataset'].append(dataset)
        train_dataset['sentence'].append(x)
        train_dataset['target'].append(f"{y.split('|')}")

split = 'dev'
dev_dataset = {'dataset': [], 'sentence': [], 'target': []}
for dataset in datasets:
    with open(os.path.join(data_dir, dataset, f'{split}.sent'), 'r') as fr:
        sentences = fr.readlines()
        sentences = [e.strip() for e in sentences]

    with open(os.path.join(data_dir, dataset, f'{split}.tup'), 'r') as fr:
        labels = fr.readlines()
        labels = [t.strip() for t in labels]
    assert len(sentences) == len(labels), f"Mismatch in X and Y length for {dataset}/{split}: {len(sentences)} vs {len(labels)}"

    for x, y in zip(sentences, labels):
        dev_dataset['dataset'].append(dataset)
        dev_dataset['sentence'].append(x)
        dev_dataset['target'].append(f"{y.split('|')}")

train_dataset = Dataset.from_dict(train_dataset)
dev_dataset = Dataset.from_dict(dev_dataset)
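
The train and dev loops above differ only in the split name, so they could be folded into one function. The following is a minimal sketch of that refactoring, assuming the same `<data_dir>/<dataset>/<split>.sent` and `.tup` layout; `load_split` is a hypothetical helper name, and the explicit UTF-8 encoding is an assumption about the files.

```python
import os


def load_split(data_dir, datasets, split):
    """Read <split>.sent and <split>.tup for every dataset into column lists."""
    columns = {'dataset': [], 'sentence': [], 'target': []}
    for name in datasets:
        with open(os.path.join(data_dir, name, f'{split}.sent'), encoding='utf-8') as fr:
            sentences = [line.strip() for line in fr]
        with open(os.path.join(data_dir, name, f'{split}.tup'), encoding='utf-8') as fr:
            labels = [line.strip() for line in fr]
        assert len(sentences) == len(labels), (
            f"Mismatch for {name}/{split}: {len(sentences)} sentences vs {len(labels)} labels"
        )
        for sentence, label in zip(sentences, labels):
            columns['dataset'].append(name)
            columns['sentence'].append(sentence)
            # Same target format as the loops above: the string form of a list of triplets
            columns['target'].append(str(label.split('|')))
    return columns
```

With this helper, each split would be built as `Dataset.from_dict(load_split(data_dir, datasets, 'train'))` and likewise for `'dev'`.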

In [4]:
train_dataset[0]

{'dataset': '14res',
 'sentence': 'But the staff was so horrible to us .',
 'target': "['staff ; horrible ; NEG']"}
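
Note that `target` is stored as the string form of a Python list, not as a list. A minimal sketch of how it could be turned back into structured triplets follows; `parse_target` is a hypothetical helper, and it assumes every item has exactly three `;`-separated fields.

```python
import ast


def parse_target(target):
    """Parse a stored target string into (aspect, opinion, sentiment) tuples."""
    triplets = []
    for item in ast.literal_eval(target):  # safely evaluates the list literal
        aspect, opinion, sentiment = (part.strip() for part in item.split(';'))
        triplets.append((aspect, opinion, sentiment))
    return triplets


print(parse_target("['staff ; horrible ; NEG']"))
```

Stripping each field also removes the stray spaces that splitting a multi-triplet line on `|` leaves around the separator.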

In [5]:
data_source = 'aste'
def make_map_fn(split):
    def process_fn(example, idx):
        question = make_prefix(example, template=aste_template)
        solution = {
            "target": example['target'],
        }
        qtype = 'aste'
        data = {
            "data_source": data_source,
            "prompt": [{
                "role": "user",
                "content": question,
            }],
            "ability": "math",
            "reward_model": {
                "style": "rule",
                "ground_truth": solution
            },
            "extra_info": {
                'split': split,
                'index': idx,
                'type': qtype,
            }
        }
        return data
    return process_fn

train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
dev_dataset = dev_dataset.map(function=make_map_fn('test'), with_indices=True)

Map: 100%|██████████| 3634/3634 [00:01<00:00, 2042.18 examples/s]
Map: 100%|██████████| 887/887 [00:00<00:00, 1698.62 examples/s]
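
Each record now carries a `reward_model` entry with `style: rule` and the ground-truth target. The notebook does not show the scoring function, so the sketch below is only one way a rule-based scorer might consume that field: `triplet_f1` and `normalise` are hypothetical names, and exact-match F1 over triplets is an assumed metric.

```python
import ast
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalise(items):
    """Lower-case items and canonicalise spacing around ';' before comparison."""
    return {' ; '.join(p.strip().lower() for p in str(i).split(';')) for i in items}


def triplet_f1(completion, ground_truth):
    """Exact-match F1 between the last <answer> list and ground_truth['target']."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0  # no answer tags at all
    try:
        predicted = ast.literal_eval(matches[-1].strip())
    except (ValueError, SyntaxError):
        return 0.0  # answer is not a valid Python literal
    if not isinstance(predicted, list):
        return 0.0
    pred = normalise(predicted)
    gold = normalise(ast.literal_eval(ground_truth['target']))
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


gt = {'target': "['staff ; horrible ; NEG']"}
print(triplet_f1("...</think>\n<answer> ['staff ; horrible ; NEG'] </answer>", gt))
```

Malformed or missing answers score zero here; a real reward would likely also give partial credit for following the `<think>`/`<answer>` format.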


In [6]:
train_dataset.to_parquet(os.path.join(data_dir, 'train.parquet'))
dev_dataset.to_parquet(os.path.join(data_dir, 'test.parquet'))

Creating parquet from Arrow format: 100%|██████████| 4/4 [00:00<00:00, 360.51ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 411.61ba/s]


1105319

In [7]:
train_dataset[0]

{'dataset': '14res',
 'sentence': 'But the staff was so horrible to us .',
 'target': "['staff ; horrible ; NEG']",
 'data_source': 'aste',
 'prompt': [{'content': "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. User: For the sentence provided below extract all the aspect terms, corresponding opinion terms and sentiments and return as a python list of strings, e.g., ['aspect_1 ; opinion_1 ; sentiment_1', 'aspect_2 ; opinion_2 ; sentiment_2', ...]. \nThe sentiment is one of the following three, 'POS', 'NEG' and 'NEU' for positive, negative and neutral sentiment, respectively. If either an aspect or opinion term is not present in the sentence then return 'NULL' in its place. Make sure every element in the list has three sub-elements in it. Show your work in <think> </think> tags. And return the final answer in <answer> </answer>