# Full ABSA task with Phi-3.5 mini (testing), compared to Phi-3 mini

- do EDA on the dataset (get the list of all possible aspect categories and sentiments)
- run some of the dataset through a baseline phi3 mini to see how it does
- study how to parse the results, and how to evaluate them

In [1]:
from datasets import load_dataset

In [2]:
absa_quad = load_dataset("NEUDM/absa-quad")

In [3]:
absa_quad

DatasetDict({
    train: Dataset({
        features: ['task_type', 'dataset', 'input', 'output', 'situation', 'label', 'extra', 'instruction'],
        num_rows: 2098
    })
    validation: Dataset({
        features: ['task_type', 'dataset', 'input', 'output', 'situation', 'label', 'extra', 'instruction'],
        num_rows: 525
    })
    test: Dataset({
        features: ['task_type', 'dataset', 'input', 'output', 'situation', 'label', 'extra', 'instruction'],
        num_rows: 1081
    })
})

In [4]:
import ast

In [5]:
ALL_SENTIMENTS = []
ALL_ASPECT_CATEGORIES = []

for example in absa_quad["train"].iter(batch_size=1):
    output_as_str = example["output"][0]
    
    # list of quads
    xs = ast.literal_eval(output_as_str)
    for quad in xs:
        ALL_SENTIMENTS.append(quad[2])
        #if quad[2] == "neutral":
        #    print(example) # <-- used to get examples with neutral for the prompt
        ALL_ASPECT_CATEGORIES.append(quad[1])

In [6]:
len(ALL_SENTIMENTS), len(ALL_ASPECT_CATEGORIES)

(3343, 3343)

In [7]:
from collections import Counter

sentiment_cnt = Counter(ALL_SENTIMENTS)
aspect_category_cnt = Counter(ALL_ASPECT_CATEGORIES)

sentiment_cnt, aspect_category_cnt

(Counter({'positive': 2374, 'negative': 873, 'neutral': 96}),
 Counter({'food quality': 1195,
          'service general': 597,
          'restaurant general': 502,
          'ambience general': 379,
          'food style_options': 163,
          'restaurant miscellaneous': 121,
          'food prices': 103,
          'restaurant prices': 98,
          'drinks quality': 67,
          'drinks style_options': 51,
          'location general': 39,
          'drinks prices': 27,
          'food general': 1}))
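
A small sketch of how these counters could be reused: rendering the label lists for the prompt instead of typing them by hand, and checking how imbalanced the sentiments are. The helper names `as_bullets` and `shares` are made up for illustration, and only the first three aspect categories are copied in to keep it short.

```python
from collections import Counter

# Counts copied from the output above (only the first three categories, for brevity).
sentiment_cnt = Counter({"positive": 2374, "negative": 873, "neutral": 96})
aspect_category_cnt = Counter(
    {"food quality": 1195, "service general": 597, "restaurant general": 502}
)


def as_bullets(counter: Counter) -> str:
    """Render labels as a '- label' list, most frequent first (hypothetical helper)."""
    return "\n".join(f"- {label}" for label, _ in counter.most_common())


def shares(counter: Counter) -> dict:
    """Fraction of all quads carrying each label, rounded to 3 decimals."""
    total = sum(counter.values())
    return {label: round(n / total, 3) for label, n in counter.items()}


print(as_bullets(aspect_category_cnt))
print(shares(sentiment_cnt))
```

Roughly 71% of the training quads are positive and under 3% are neutral, which is worth keeping in mind when picking few-shot examples.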

# Load model and prepare for baseline evaluation

In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed

import torch

In [9]:
!pip install tiktoken


In [10]:
!pip install einops



In [11]:
!pip install triton



In [12]:
device = "cuda" if torch.cuda.is_available() else "cpu"

#MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
#MODEL_NAME = "microsoft/Phi-3-small-8k-instruct" # try later, too slow to download
MODEL_NAME = "microsoft/Phi-3.5-mini-instruct"

model = AutoModelForCausalLM.from_pretrained( 
    MODEL_NAME,  
    device_map=device,  
    torch_dtype="auto",  
    trust_remote_code=True,  
) 

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


## Prompt template

**UPDATE:**

Merely adding or removing the whitespace between `Finally there is a query which consists of an actual restaurant review.` and the next part, `Perform the same aspect-based ....`, changes the number of predicted quads O_o The model is very sensitive to the prompt.

In [13]:
template = """## INSTRUCTIONS ##

Aspect-based sentiment analysis requires you to identify terms in a sentence that correspond to a person expressing an opinion. Each sentence may have several opinion terms. An opinion term is the exact string of words about which a sentiment is being expressed. The justification is the exact string of words, if any, that justifies or explains the corresponding opinion term.

Below are some examples of aspect-based sentiment analysis performed on different restaurant reviews.

Then there is a list of Aspect Categories: each category is a feature of interest to classifying a specific part of the review, such as "ambience general" or "food style_options".

Then there is a list of allowed Sentiments that you can label each opinion term with.

Finally there is a query which consists of an actual restaurant review.

Perform the same aspect-based sentiment analysis as shown in the example. Extract the terms and the justifications directly from the review, without any modification. Only use labels given in the Aspect Category list and in the Sentiment list. Use the exact word NULL if there is no occurrence in the sentence for any of the individual requested items in the output. Follow the exact same output formatting without deviation.

## EXAMPLES OF ASPECT-BASED SENTIMENT ANALYSIS ##

{examples_absa}

## LIST OF ASPECT CATEGORIES ##

- food quality
- service general
- restaurant general
- ambience general
- food style_options
- restaurant miscellaneous
- food prices
- restaurant prices
- drinks quality
- drinks style_options
- location general
- drinks prices
- food general

## LIST OF SENTIMENTS ##

- positive
- negative
- neutral

## QUERY ##

restaurant review: {query}
absa output:
"""

In [14]:
# TODO - try adding a summarize-first step, then generate ABSA
examples_absa = """
restaurant review: I can not imagine better Indian food in all of the city.
absa output:
[{ \"opinion term\": \"Indian food\", \"aspect category\": \"food quality\", \"sentiment\": \"positive\", \"justification\": \"better\" }]

restaurant review: I've been many times and have never been disappointed.
absa output:
[{ \"opinion term\": \"NULL\", \"aspect category\": \"restaurant general\", \"sentiment\": \"positive\", \"justification\": \"never been disappointed\" }]

restaurant review: "Skip this restaurant, it 's a big disappointment."
absa output:
[{ \"opinion term\": \"restaurant\", \"aspect category\": \"restaurant general\", \"sentiment\": \"negative\", \"justification\": \"Skip\" }, { \"opinion term\": \"restaurant\", \"aspect category\": \"restaurant general\", \"sentiment\": \"negative\", \"justification\": \"disappointment\" }]

restaurant review: "With the exception of our lemon salad that had so much pepper on it that our eyes started watering, the food here was decent, not great."
absa output:
[{ \"opinion term\": \"food\", \"aspect category\": \"food quality\", \"sentiment\": \"neutral\", \"justification\": \"decent\" }, { \"opinion term\": \"food\", \"aspect category\": \"food quality\", \"sentiment\": \"negative\", \"justification\": \"not great\" }, { \"opinion term\": \"lemon salad\", \"aspect category\": \"food quality\", \"sentiment\": \"negative\", \"justification\": \"exception\" }]
"""

In [15]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

## Test single example

In [16]:
for example in absa_quad["train"].iter(batch_size=1):
    example_text = example["input"][0]
    #print(example_text, type(example_text))
    input_text = ast.literal_eval(example_text)[0]
    
    prompt = template.format(examples_absa=examples_absa, query=input_text)

    messages = [ 
        {"role": "system", "content": "You are helpful asssistant that performs aspect-based sentiment analysis on restaurant reviews. You strictly follow all formatting instructions without deviation."}, 
        {"role": "user", "content": prompt}, 
    ] 

    generation_args = { 
        "max_new_tokens": 500, 
        "return_full_text": False, 
        "temperature": 0.0, 
        "do_sample": False, 
    }
    
    output = pipe(messages, **generation_args)
    
    print("==== INPUT TEXT ====")
    print(input_text)
    print("---- MODEL OUTPUT ----")
    print(output[0]['generated_text'])
    
    break

==== INPUT TEXT ====
The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative .
---- MODEL OUTPUT ----
 [{ "opinion term": "wait time", "aspect category": "service general", "sentiment": "negative", "justification": "long" }, { "opinion term": "dim sum atmosphere", "aspect category": "ambience general", "sentiment": "neutral", "justification": "not your gig" }, { "opinion term": "restaurant", "aspect category": "restaurant general", "sentiment": "positive", "justification": "sleek ( for Chinatown ) alternative" }]


In [17]:
import json

# note: the completion has leading whitespace, but json.loads ignores it
test_json = json.loads( output[0]['generated_text'] )

test_json

[{'opinion term': 'wait time',
  'aspect category': 'service general',
  'sentiment': 'negative',
  'justification': 'long'},
 {'opinion term': 'dim sum atmosphere',
  'aspect category': 'ambience general',
  'sentiment': 'neutral',
  'justification': 'not your gig'},
 {'opinion term': 'restaurant',
  'aspect category': 'restaurant general',
  'sentiment': 'positive',
  'justification': 'sleek ( for Chinatown ) alternative'}]
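
Plain `json.loads` is enough here, but in case a completion ever carries extra text around the JSON array (an assumption, not something seen in this single example), a more defensive parser could look like the sketch below. `parse_completion` is a hypothetical helper, not part of any library.

```python
import json
from typing import List, Optional


def parse_completion(raw: str) -> Optional[List[dict]]:
    """Return the list of quad dicts in a raw completion, or None if it cannot be parsed."""
    # Keep only the outermost [...] span, dropping any text before or after it.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Accept only a list whose items are all dicts.
    if isinstance(parsed, list) and all(isinstance(q, dict) for q in parsed):
        return parsed
    return None


print(parse_completion(' [{ "opinion term": "wait time", "sentiment": "negative" }]'))
print(parse_completion("__DID_NOT_COMPLETE__"))
```

Returning `None` instead of raising makes it easy to count failed completions without a `try`/`except` around every call.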

# Run train dataset through model

- store raw completions, and the input text + its gold annotations

In [18]:
import time

In [19]:
all_results = [] # store a list of jsons, each json contains input text, gold annotations, raw model completion

EXAMPLES_TO_TEST = 50
example_counter = 0

start_time = time.time()

for example in absa_quad["train"].iter(batch_size=1):
    
    curr_example_data = {"example_id":example_counter, "input_text":None, "gold_quads":[]}
    
    # === process gold data and save it ===
    example_text = example["input"][0]
    input_text = ast.literal_eval(example_text)[0]
    curr_example_data["input_text"] = input_text
    
    gold_labels = example["output"][0]
    
    # go through list of quads
    golds = ast.literal_eval(gold_labels)
    for quad in golds:
        tmp_d = {}
        tmp_d["opinion term"] = quad[0]
        tmp_d["aspect category"] = quad[1]
        tmp_d["sentiment"] = quad[2]
        tmp_d["justification"] = quad[3]
        curr_example_data["gold_quads"].append(tmp_d)
    
    # === GET MODEL COMPLETION ===
    if example_counter % 20 == 0:
        print(f"=== Sending example number {example_counter} to model ===")
        #print(f"--- current example is : {input_text} ---")
    
    prompt = template.format(examples_absa=examples_absa, query=input_text)

    messages = [ 
        {"role": "system", "content": "You are helpful asssistant that performs aspect-based sentiment analysis on restaurant reviews. You strictly follow all formatting instructions without deviation. You extract opinion terms and justifications from the given review directly, you do not make any changes."}, 
        {"role": "user", "content": prompt}, 
    ] 

    generation_args = { 
        "max_new_tokens": 1000, 
        "return_full_text": False, 
        "temperature": 0.0, 
        "do_sample": False,  # greedy decoding, same setting as in the single-example test
    }
    
    try:
        output = pipe(messages, **generation_args)
        raw_completion = output[0]['generated_text']
        
        curr_example_data["raw_completion"] = raw_completion
    except Exception as e:
        print(example_counter, "===>", e)
        curr_example_data["raw_completion"] = "__DID_NOT_COMPLETE__"
    
    # === append full dict to all results list ===
    all_results.append(curr_example_data)
    #print("--- heading to next example ---")
    
    # === loop logic ===
    # stop after EXAMPLES_TO_TEST examples
    example_counter += 1
    if example_counter >= EXAMPLES_TO_TEST:
        break

# timing: stop the clock before printing
end_time = time.time()
print("Finished this many samples :", len(all_results))


=== Sending example number 0 to model ===


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


=== Sending example number 20 to model ===
=== Sending example number 40 to model ===
Finished this many samples : 50


In [20]:
total_time = end_time - start_time

time_per_sample = round( (total_time) / EXAMPLES_TO_TEST , 2)
print("Total time taken = ", end_time - start_time)
print("Time per sample with single batch pipeline : ", time_per_sample)

Total time taken =  262.8929646015167
Time per sample with single batch pipeline :  5.26


## Save results to JSON for later/offline

- note: I'm not yet using the pipeline with `KeyDataset` and a batch size, i.e. the GPU is fed one example at a time, so inference is slow
- **TODO** figure out how this works
- this is why, for now, I only run 50 samples, to check that the data processing afterwards works too
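
For the TODO above, a rough sketch of what chunked inference could look like. It assumes the pipeline accepts a list of chats plus a `batch_size` argument and returns, for each chat, a list of dicts with a `generated_text` key; that return shape should be checked against the installed `transformers` version. `run_in_batches` and `fake_pipe` are hypothetical names, and `fake_pipe` only stands in for the real `pipe` so the sketch runs without a model.

```python
from typing import Any, Callable, Dict, List


def run_in_batches(pipe: Callable[..., Any],
                   chats: List[List[Dict[str, str]]],
                   batch_size: int = 8,
                   **generation_args: Any) -> List[str]:
    """Send chats to a text-generation pipeline in chunks and collect the completions."""
    completions: List[str] = []
    for start in range(0, len(chats), batch_size):
        chunk = chats[start:start + batch_size]
        # Assumption: one list of candidate dicts comes back per chat in the chunk.
        outputs = pipe(chunk, batch_size=batch_size, **generation_args)
        completions.extend(out[0]["generated_text"] for out in outputs)
    return completions


def fake_pipe(chats, batch_size=1, **generation_args):
    """Stand-in for the real pipeline: echoes the last message of each chat."""
    return [[{"generated_text": f"echo: {chat[-1]['content']}"}] for chat in chats]


chats = [[{"role": "user", "content": f"review {i}"}] for i in range(5)]
print(run_in_batches(fake_pipe, chats, batch_size=2, max_new_tokens=10))
```

With the real pipeline, batching generally also needs a padding token set on the tokenizer, so that is something to verify before swapping `fake_pipe` for `pipe`.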

In [40]:
with open('absa-quad-phi35-baseline-inference-50-train-samples.json', 'w') as file:
    json.dump(all_results, file)

In [22]:
!ls

absa-quad-phi3-baseline-inference-50-train-samples.json


## Quick data analysis

- check how many of the 50 completions load directly as JSON
- check how many times the model outputs an Aspect Category or Sentiment label that is not in the allowed list

In [23]:
# do this in pandas later, just quickly for now
does_not_load_to_json = 0

REFERENCE_KEYS = set( ["opinion term", "aspect category", "sentiment", "justification"])
quad_dict_has_bad_keys = 0
total_quad_dicts = 0

aspect_category_not_in_allowed = 0

sentiment_not_in_allowed = 0

for single_result in all_results:
    try:
        json_completion = json.loads(single_result["raw_completion"])
        # go over aspect categories and sentiments
        
        for quad_dict in json_completion:
            # check keys are those expected
            if set(quad_dict.keys()) != REFERENCE_KEYS:
                quad_dict_has_bad_keys += 1
            total_quad_dicts += 1
            
            if quad_dict.get("aspect category", None) not in aspect_category_cnt:
                aspect_category_not_in_allowed += 1
                
            if quad_dict.get("sentiment", None) not in sentiment_cnt:
                sentiment_not_in_allowed += 1
    except Exception as e:
        does_not_load_to_json += 1
        print(e)

In [24]:
print(total_quad_dicts)
print(quad_dict_has_bad_keys)
print(aspect_category_not_in_allowed)
print(sentiment_not_in_allowed)

79
0
1
1
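
The counts say that one category and one sentiment fall outside the allowed lists, but not which ones. A possible follow-up, sketched below, collects the offending values themselves. `invalid_labels` is a hypothetical helper and the two sample quads are invented for illustration; they are not taken from the run.

```python
from collections import Counter
from typing import Iterable, Tuple

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
ALLOWED_CATEGORIES = {
    "food quality", "service general", "restaurant general", "ambience general",
    "food style_options", "restaurant miscellaneous", "food prices",
    "restaurant prices", "drinks quality", "drinks style_options",
    "location general", "drinks prices", "food general",
}


def invalid_labels(pred_quads: Iterable[dict]) -> Tuple[Counter, Counter]:
    """Count every aspect category / sentiment value that is not in the allowed sets."""
    bad_categories, bad_sentiments = Counter(), Counter()
    for quad in pred_quads:
        category = quad.get("aspect category")
        sentiment = quad.get("sentiment")
        if category not in ALLOWED_CATEGORIES:
            bad_categories[category] += 1
        if sentiment not in ALLOWED_SENTIMENTS:
            bad_sentiments[sentiment] += 1
    return bad_categories, bad_sentiments


# Invented sample quads, only to show the output shape.
sample = [
    {"aspect category": "food quality", "sentiment": "positive"},
    {"aspect category": "food taste", "sentiment": "mixed"},
]
print(invalid_labels(sample))
```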


## Show some results and inspect

- check for non-extractive answers etc

In [25]:
# display some results for illustration
for result in all_results[:5]:
    print(result)
    print("===========")
    print()
    print()

{'example_id': 0, 'input_text': "The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative .", 'gold_quads': [{'opinion term': 'wait', 'aspect category': 'service general', 'sentiment': 'negative', 'justification': 'long'}, {'opinion term': 'atmosphere', 'aspect category': 'ambience general', 'sentiment': 'negative', 'justification': 'raucous'}, {'opinion term': 'NULL', 'aspect category': 'restaurant miscellaneous', 'sentiment': 'negative', 'justification': 'sleek'}], 'raw_completion': ' [{ "opinion term": "wait time", "aspect category": "service general", "sentiment": "negative", "justification": "long" }, { "opinion term": "dim sum atmosphere", "aspect category": "ambience general", "sentiment": "neutral", "justification": "not your gig" }, { "opinion term": "restaurant", "aspect category": "restaurant general", "sentiment": "positive", "justification": "sleek 

## Observation - TODO for Phi 3.5: the notes below are copied from the Phi 3 run, I haven't read through the 3.5 samples yet

In the Phi 3 run, the 2nd example (example_id = 1) **does actually show the model reformatting the text:**

`input_text : "Just because it 's cheap does NOT mean the portions are small or the food is nasty , IT IS GREAT !"`

and the last dict in the `raw_completion` is:

`{ "opinion term": "food", "aspect category": "food quality", "sentiment": "positive", "justification": "great" }`

which **lowercases GREAT to great**.

---

**Also, more importantly, in the same example:**

`{ "opinion term": "portions", "aspect category": "food general", "sentiment": "positive", "justification": "not small" }`

Notice that the justification `not small` **does not actually appear in the `input_text`: it is a paraphrase of the fact that the portions are not small.**


# Temporary conclusions

- JSON parsing works for 100% of the completions so far, without any structured generation, just with good prompting
- 50 samples took about 263 s (5.26 s per sample) the slow way, with no batching (the pipeline warns to use it with a dataset instead of sequentially)

# After restart

**NOTE: learnt that after a restart you can't access the Output files!?! I had to upload the local copy I had saved to disk as an Input and load it from there.**

In [43]:
import json

with open("/kaggle/input/absa-35-phi/absa-quad-phi35-baseline-inference-50-train-samples.json", 'r') as fo:
    results = json.load(fo)

In [44]:
results[0]

{'example_id': 0,
 'input_text': "The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative .",
 'gold_quads': [{'opinion term': 'wait',
   'aspect category': 'service general',
   'sentiment': 'negative',
   'justification': 'long'},
  {'opinion term': 'atmosphere',
   'aspect category': 'ambience general',
   'sentiment': 'negative',
   'justification': 'raucous'},
  {'opinion term': 'NULL',
   'aspect category': 'restaurant miscellaneous',
   'sentiment': 'negative',
   'justification': 'sleek'}],
 'raw_completion': ' [{ "opinion term": "wait time", "aspect category": "service general", "sentiment": "negative", "justification": "long" }, { "opinion term": "dim sum atmosphere", "aspect category": "ambience general", "sentiment": "neutral", "justification": "not your gig" }, { "opinion term": "restaurant", "aspect category": "restaurant general", "sentiment": "p

In [45]:
type(results[0]["gold_quads"])

list

In [46]:
type(results[0]["raw_completion"])

str

# Parse raw completions to json

In [47]:
# parse raw_completion to list
for elem in results:
    elem["model_predicted_quads"] = json.loads(elem["raw_completion"])

In [48]:
results[0]

{'example_id': 0,
 'input_text': "The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative .",
 'gold_quads': [{'opinion term': 'wait',
   'aspect category': 'service general',
   'sentiment': 'negative',
   'justification': 'long'},
  {'opinion term': 'atmosphere',
   'aspect category': 'ambience general',
   'sentiment': 'negative',
   'justification': 'raucous'},
  {'opinion term': 'NULL',
   'aspect category': 'restaurant miscellaneous',
   'sentiment': 'negative',
   'justification': 'sleek'}],
 'raw_completion': ' [{ "opinion term": "wait time", "aspect category": "service general", "sentiment": "negative", "justification": "long" }, { "opinion term": "dim sum atmosphere", "aspect category": "ambience general", "sentiment": "neutral", "justification": "not your gig" }, { "opinion term": "restaurant", "aspect category": "restaurant general", "sentiment": "p
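
With `gold_quads` and `model_predicted_quads` side by side, one possible way to score the model is exact-match precision, recall and F1 over whole quads. This is only a sketch of a strict metric, not necessarily the evaluation this notebook will end up using. `to_tuple` and `quad_prf` are hypothetical helpers and the toy result below is invented.

```python
from typing import Dict, List, Tuple

QUAD_KEYS = ("opinion term", "aspect category", "sentiment", "justification")


def to_tuple(quad: Dict[str, str]) -> Tuple[str, ...]:
    """Turn a quad dict into a hashable tuple; missing keys count as NULL."""
    return tuple(quad.get(key, "NULL") for key in QUAD_KEYS)


def quad_prf(results: List[dict]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 with exact matching on all four fields."""
    n_pred = n_gold = n_hit = 0
    for result in results:
        # Sets are used, so duplicate quads within one example are counted once.
        gold = {to_tuple(q) for q in result["gold_quads"]}
        pred = {to_tuple(q) for q in result["model_predicted_quads"]}
        n_gold += len(gold)
        n_pred += len(pred)
        n_hit += len(gold & pred)
    precision = n_hit / n_pred if n_pred else 0.0
    recall = n_hit / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented toy example: one of the two predicted quads matches the gold exactly.
toy = [{
    "gold_quads": [
        {"opinion term": "staff", "aspect category": "service general",
         "sentiment": "positive", "justification": "friendly"},
        {"opinion term": "NULL", "aspect category": "restaurant general",
         "sentiment": "positive", "justification": "great"},
    ],
    "model_predicted_quads": [
        {"opinion term": "staff", "aspect category": "service general",
         "sentiment": "positive", "justification": "friendly"},
        {"opinion term": "restaurant", "aspect category": "restaurant general",
         "sentiment": "positive", "justification": "great"},
    ],
}]
print(quad_prf(toy))
```

Exact matching is harsh: in example 0 above, `wait time` versus the gold `wait` would count as a miss even though the rest of the quad agrees.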

# Results analysis - focus on extractive performance

- for now, looking at how well the model does extractive answering (without structured generation etc.)
- for example, how often does the opinion term or justification NOT appear in the original input text?

In [49]:
import pandas as pd

In [50]:
pd.set_option('display.max_colwidth', None)

In [51]:
# for each of the 50 samples:
# 1/ get each of the individual predicted quads
# 2/ check if the OPINION TERM and the JUSTIFICATION appear VERBATIM in the text (ignore NULL)
# 3/ append as row to dataframe

def check_verbatim(span, text):
    """Return (raw_ok, lower_ok) for a span, or (None, None) if the span is NULL."""
    if span == "NULL":
        return None, None
    raw_ok = span in text
    lower_ok = raw_ok or span.lower() in text.lower()
    return raw_ok, lower_ok

exact_substring_data = []

for result in results:
    example_id = result["example_id"]
    input_text = result["input_text"]

    for pred_quad in result["model_predicted_quads"]:
        ot = pred_quad["opinion term"]
        justi = pred_quad["justification"]

        ot_raw_ok, ot_lower_ok = check_verbatim(ot, input_text)
        justi_raw_ok, justi_lower_ok = check_verbatim(justi, input_text)

        exact_substring_data.append({
            "example_id": example_id,
            "input_text": input_text,
            "opinion_term": ot,
            "justification": justi,
            "ot_raw_ok": ot_raw_ok,
            "ot_lower_ok": ot_lower_ok,
            "justi_raw_ok": justi_raw_ok,
            "justi_lower_ok": justi_lower_ok,
        })


In [52]:
df = pd.DataFrame(exact_substring_data)

with pd.option_context("display.max_rows", 1000):
    display(df.head(100)) # show all

| | example_id | input_text | opinion_term | justification | ot_raw_ok | ot_lower_ok | justi_raw_ok | justi_lower_ok |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative . | wait time | long | False | False | True | True |
| 1 | 0 | The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative . | dim sum atmosphere | not your gig | True | True | True | True |
| 2 | 0 | The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative . | restaurant | sleek ( for Chinatown ) alternative | False | False | True | True |
| 3 | 1 | Just because it 's cheap does NOT mean the portions are small or the food is nasty , IT IS GREAT ! | food | GREAT | True | True | True | True |
| 4 | 1 | Just because it 's cheap does NOT mean the portions are small or the food is nasty , IT IS GREAT ! | prices | cheap does NOT mean the portions are small | False | False | True | True |
| 5 | 2 | Food is excellent . | Food | excellent | True | True | True | True |
| 6 | 3 | As always we had a great glass of wine while we waited . | glass of wine | great | True | True | True | True |
| 7 | 4 | I can not imagine a friendlier staff working in a restaurant . | staff | friendlier | True | True | True | True |
| 8 | 5 | Also , specify if you like your food spicy- its rather bland if you do n't . | food spicy | rather bland | True | True | True | True |
| 9 | 6 | Big Wong gets big Ups for a fine establishment . | Big Wong | big Ups | True | True | True | True |


In [53]:
df[["ot_raw_ok", "ot_lower_ok", "justi_raw_ok", "justi_lower_ok"]].apply(pd.Series.value_counts)

| | ot_raw_ok | ot_lower_ok | justi_raw_ok | justi_lower_ok |
|---|---|---|---|---|
| True | 62 | 63 | 70 | 70 |
| False | 16 | 15 | 9 | 9 |
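
To make the counts easier to compare, here is a short sketch that turns them into percentages of non-NULL spans. The numbers are copied from the table above; `verbatim_rate` is a hypothetical helper name.

```python
# (n_true, n_false) per column, copied from the value_counts output above.
counts = {
    "ot_raw_ok": (62, 16),
    "ot_lower_ok": (63, 15),
    "justi_raw_ok": (70, 9),
    "justi_lower_ok": (70, 9),
}


def verbatim_rate(n_true: int, n_false: int) -> float:
    """Percentage of non-NULL spans that appear verbatim in the review."""
    return round(100 * n_true / (n_true + n_false), 1)


rates = {name: verbatim_rate(*tf) for name, tf in counts.items()}
print(rates)
# {'ot_raw_ok': 79.5, 'ot_lower_ok': 80.8, 'justi_raw_ok': 88.6, 'justi_lower_ok': 88.6}
```

So roughly one opinion term in five is not a verbatim span, and lowercasing recovers only one of them; justifications do better at about 89%.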


In [54]:
# show where opinion term even after lowercasing does NOT appear in input text

df.loc[
    df["ot_lower_ok"] == False
][["example_id", "input_text", "opinion_term"]]

| | example_id | input_text | opinion_term |
|---|---|---|---|
| 0 | 0 | The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative . | wait time |
| 2 | 0 | The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative . | restaurant |
| 4 | 1 | Just because it 's cheap does NOT mean the portions are small or the food is nasty , IT IS GREAT ! | prices |
| 10 | 7 | I was pleasantly suprised . | pleasantly surprised |
| 12 | 9 | I ca n't wait to go back . | I can't wait to go back |
| 18 | 13 | DO not try unless you 're just going there to hang out like the rest of the hipsters who apparently have no sense of taste . | restaurant |
| 26 | 18 | It 's somewhere you can eat and be happy . | restaurant |
| 31 | 23 | It is very overpriced and not very tasty . | restaurant |
| 32 | 23 | It is very overpriced and not very tasty . | food |
| 33 | 24 | Worth the trip from Manhattan . | restaurant |


### Notes

It seems that when the model gets the opinion term wrong, it often outputs a generic, category-like word (`restaurant`, `food`, `prices`) that does not appear in the review. Maybe this can be fixed with prompting or more examples?

In [55]:
# check for example the id-49 : is model really outputting Aspect Categories in the Opinion Term sections??
results[49]
# yes seems so O_o rest of answer in 49 is good though

{'example_id': 49,
 'input_text': 'Great friendly service , Fast seating , Fast Delivery , Excellent sushi .',
 'gold_quads': [{'opinion term': 'service',
   'aspect category': 'service general',
   'sentiment': 'positive',
   'justification': 'Great friendly'},
  {'opinion term': 'seating',
   'aspect category': 'service general',
   'sentiment': 'positive',
   'justification': 'Fast'},
  {'opinion term': 'Delivery',
   'aspect category': 'service general',
   'sentiment': 'positive',
   'justification': 'Fast'},
  {'opinion term': 'sushi',
   'aspect category': 'food quality',
   'sentiment': 'positive',
   'justification': 'Excellent'}],
 'raw_completion': ' [{ "opinion term": "service", "aspect category": "service general", "sentiment": "positive", "justification": "Great friendly" },\n{ "opinion term": "service", "aspect category": "service general", "sentiment": "positive", "justification": "Fast seating" },\n{ "opinion term": "service", "aspect category": "service general", "sen

In [56]:
# show where justification even after lowercasing does NOT appear in input text

df.loc[
    df["justi_lower_ok"] == False
][["example_id", "input_text", "justification"]]

| | example_id | input_text | justification |
|---|---|---|---|
| 12 | 9 | I ca n't wait to go back . | can't wait |
| 18 | 13 | DO not try unless you 're just going there to hang out like the rest of the hipsters who apparently have no sense of taste . | DO not try unless you're just going there to hang out like the rest of the hipsters who apparently have no sense of taste |
| 23 | 16 | Kind , attentive wait staff . | Kind, attentive |
| 30 | 22 | The sandwiches are dry , tasteless and way overpriced . | dry, tasteless, overpriced |
| 37 | 25 | I like the somosas , chai , and the chole , but the dhosas and dhal were kinda dissapointing . | kinda disappointing |
| 38 | 25 | I like the somosas , chai , and the chole , but the dhosas and dhal were kinda dissapointing . | kinda disappointing |
| 39 | 26 | We wo n't go to this place again for a good meal . | won't go to this place again for a good meal |
| 51 | 33 | Their bagels are fine , but they are a little overcooked , and not really a 'special ' bagel experience . | not really a'special' bagel experience |
| 66 | 42 | Most of the servers are very attentive , friendly and quite attractive . | very attentive, friendly and quite attractive |


### Notes

- Here the errors seem to be mostly due to the dataset's unusual tokenization, e.g. `ca n't`, which the model "fixes" to `can't`; it also silently corrects typos (`dissapointing` -> `disappointing`)
- Still some genuine reformulation though: `The sandwiches are dry , tasteless and way overpriced` -> `dry, tasteless, overpriced`
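
One way to separate tokenization noise from real reformulation would be to undo the spaced-out tokenization on both sides before the substring check. The sketch below is a rough heuristic under that assumption; `detokenize` and `appears_in` are hypothetical helpers, and the regexes only cover common English clitics and punctuation (the spaced-out quotes around `'special '` are not handled).

```python
import re


def detokenize(text: str) -> str:
    """Undo spaced-out tokenization, e.g. "ca n't" -> "can't" and "dry ," -> "dry,"."""
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)  # re-attach clitics
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)  # drop the space before punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse any remaining whitespace runs


def appears_in(span: str, review: str) -> bool:
    """Case-insensitive substring check after detokenizing both sides."""
    return detokenize(span).lower() in detokenize(review).lower()


print(appears_in("can't wait", "I ca n't wait to go back ."))
print(appears_in("Kind, attentive", "Kind , attentive wait staff ."))
print(appears_in("dry, tasteless, overpriced",
                 "The sandwiches are dry , tasteless and way overpriced ."))
```

The first two checks now pass while the third still fails, which matches the reading above: the sandwiches example is an actual reformulation, not a tokenization artifact. Typo corrections such as `suprised` -> `surprised` would still be flagged.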