# Full ABSA task with Phi3

- do EDA on the dataset (list every aspect category and sentiment that occurs)
- run part of the dataset through a baseline Phi-3 mini to see what the raw completions look like
- study how to parse the results and how to evaluate them

In [1]:
from datasets import load_dataset

In [2]:
absa_quad = load_dataset("NEUDM/absa-quad")

In [3]:
absa_quad

DatasetDict({
    train: Dataset({
        features: ['task_type', 'dataset', 'input', 'output', 'situation', 'label', 'extra', 'instruction'],
        num_rows: 2098
    })
    validation: Dataset({
        features: ['task_type', 'dataset', 'input', 'output', 'situation', 'label', 'extra', 'instruction'],
        num_rows: 525
    })
    test: Dataset({
        features: ['task_type', 'dataset', 'input', 'output', 'situation', 'label', 'extra', 'instruction'],
        num_rows: 1081
    })
})

In [11]:
import ast

In [22]:
ALL_SENTIMENTS = []
ALL_ASPECT_CATEGORIES = []

for example in absa_quad["train"].iter(batch_size=1):
    output_as_str = example["output"][0]
    
    # list of quads
    xs = ast.literal_eval(output_as_str)
    for quad in xs:
        ALL_SENTIMENTS.append(quad[2])
        #if quad[2] == "neutral":
        #    print(example) # <-- used to get examples with neutral for the prompt
        ALL_ASPECT_CATEGORIES.append(quad[1])

In [16]:
len(ALL_SENTIMENTS), len(ALL_ASPECT_CATEGORIES)

(3343, 3343)

In [17]:
from collections import Counter

sentiment_cnt = Counter(ALL_SENTIMENTS)
aspect_category_cnt = Counter(ALL_ASPECT_CATEGORIES)

sentiment_cnt, aspect_category_cnt

(Counter({'positive': 2374, 'negative': 873, 'neutral': 96}),
 Counter({'food quality': 1195,
          'service general': 597,
          'restaurant general': 502,
          'ambience general': 379,
          'food style_options': 163,
          'restaurant miscellaneous': 121,
          'food prices': 103,
          'restaurant prices': 98,
          'drinks quality': 67,
          'drinks style_options': 51,
          'location general': 39,
          'drinks prices': 27,
          'food general': 1}))
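
The label counts above are heavily skewed: neutral makes up under 3% of the quads. A minimal sketch of turning the counts into shares, with the sentiment counts copied from the output above; the helper name `relative_freq` is hypothetical.

```python
from collections import Counter

# Counts copied from the train-split output above.
sentiment_cnt = Counter({"positive": 2374, "negative": 873, "neutral": 96})

def relative_freq(counter):
    """Return each label's share of the total, most frequent first."""
    total = sum(counter.values())
    return {label: n / total for label, n in counter.most_common()}

sentiment_share = relative_freq(sentiment_cnt)
for label, share in sentiment_share.items():
    print(f"{label:10s} {share:6.1%}")
```

The same helper can be applied to `aspect_category_cnt` to see which categories are too rare to evaluate reliably on a small sample.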

# Load model and prepare for baseline evaluation

In [18]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed

import torch

In [19]:
device = "cuda" if torch.cuda.is_available() else "cpu"

MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"

model = AutoModelForCausalLM.from_pretrained( 
    MODEL_NAME,  
    device_map=device,  
    torch_dtype="auto",  
    trust_remote_code=True,  
) 

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

## Prompt template

In [30]:
template = """## INSTRUCTIONS ##

Aspect-based sentiment analysis requires you to identify terms in a sentence that correspond to a person expressing an opinion. Each sentence may have several opinion terms. An opinion term is the exact string of words about which a sentiment is being expressed. The justification is the exact string of words, if any, that justifies or explains the corresponding opinion term.

Below are some examples of aspect-based sentiment analysis performed on different restaurant reviews.

Then there is a list of Aspect Categories: each category is a feature of interest to classifying a specific part of the review, such as "ambience general" or "food style_options".

Then there is a list of allowed Sentiments that you can label each opinion term with.

Finally there is a query which consists of an actual restaurant review. Perform the same aspect-based sentiment analysis as shown in the examples. Extract the terms and the justifications directly from the review, without any modification. Only use labels given in the Aspect Category list and in the Sentiment list. Use the exact word NULL if there is no occurrence in the sentence for any of the individual requested items in the output. Follow the exact same output formatting without deviation.

## EXAMPLES OF ASPECT-BASED SENTIMENT ANALYSIS ##

{examples_absa}

## LIST OF ASPECT CATEGORIES ##

- food quality
- service general
- restaurant general
- ambience general
- food style_options
- restaurant miscellaneous
- food prices
- restaurant prices
- drinks quality
- drinks style_options
- location general
- drinks prices
- food general

## LIST OF SENTIMENTS ##

- positive
- negative
- neutral

## QUERY ##

restaurant review: {query}
absa output:
"""

In [32]:
# TODO - try adding a summarize-first step, then generate ABSA
examples_absa = """
restaurant review: I can not imagine better Indian food in all of the city.
absa output:
[{ \"opinion term\": \"Indian food\", \"aspect category\": \"food quality\", \"sentiment\": \"positive\", \"justification\": \"better\" }]

restaurant review: I've been many times and have never been disappointed.
absa output:
[{ \"opinion term\": \"NULL\", \"aspect category\": \"restaurant general\", \"sentiment\": \"positive\", \"justification\": \"never been disappointed\" }]

restaurant review: "Skip this restaurant, it 's a big disappointment."
absa output:
[{ \"opinion term\": \"restaurant\", \"aspect category\": \"restaurant general\", \"sentiment\": \"negative\", \"justification\": \"Skip\" }, { \"opinion term\": \"restaurant\", \"aspect category\": \"restaurant general\", \"sentiment\": \"negative\", \"justification\": \"disappointment\" }]

restaurant review: "With the exception of our lemon salad that had so much pepper on it that our eyes started watering, the food here was decent, not great."
absa output:
[{ \"opinion term\": \"food\", \"aspect category\": \"food quality\", \"sentiment\": \"neutral\", \"justification\": \"decent\" }, { \"opinion term\": \"food\", \"aspect category\": \"food quality\", \"sentiment\": \"negative\", \"justification\": \"not great\" }, { \"opinion term\": \"lemon salad\", \"aspect category\": \"food quality\", \"sentiment\": \"negative\", \"justification\": \"exception\" }]
"""

In [27]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

## Test single example

In [33]:
for example in absa_quad["train"].iter(batch_size=1):
    example_text = example["input"][0]
    #print(example_text, type(example_text))
    input_text = ast.literal_eval(example_text)[0]
    
    prompt = template.format(examples_absa=examples_absa, query=input_text)

    messages = [ 
        {"role": "system", "content": "You are helpful asssistant that performs aspect-based sentiment analysis on restaurant reviews. You strictly follow all formatting instructions without deviation."}, 
        {"role": "user", "content": prompt}, 
    ] 

    generation_args = { 
        "max_new_tokens": 500, 
        "return_full_text": False, 
        "temperature": 0.0, 
        #"do_sample": False, 
    }
    
    output = pipe(messages, **generation_args)
    
    print("==== INPUT TEXT ====")
    print(input_text)
    print("---- MODEL OUTPUT ----")
    print(output[0]['generated_text'])
    
    break

==== INPUT TEXT ====
The wait here is long for dim sum , but if you do n't like sharing tables or if the typical raucous dim sum atmosphere is not your gig , this is a sleek ( for Chinatown ) alternative .
---- MODEL OUTPUT ----
 [{ "opinion term": "wait", "aspect category": "service general", "sentiment": "negative", "justification": "long" }, { "opinion term": "dim sum", "aspect category": "food quality", "sentiment": "neutral", "justification": "for" }, { "opinion term": "tables", "aspect category": "restaurant general", "sentiment": "negative", "justification": "don't like sharing" }, { "opinion term": "dim sum atmosphere", "aspect category": "ambience general", "sentiment": "negative", "justification": "not your gig" }, { "opinion term": "sleek", "aspect category": "restaurant general", "sentiment": "positive", "justification": "alternative" }, { "opinion term": "Chinatown", "aspect category": "location general", "sentiment": "positive", "justification": "for" }]


In [39]:
import json

# Note: the completion starts with a space; json.loads accepts leading whitespace, so no strip() is needed
test_json = json.loads( output[0]['generated_text'] )

test_json

[{'opinion term': 'wait',
  'aspect category': 'service general',
  'sentiment': 'negative',
  'justification': 'long'},
 {'opinion term': 'dim sum',
  'aspect category': 'food quality',
  'sentiment': 'neutral',
  'justification': 'for'},
 {'opinion term': 'tables',
  'aspect category': 'restaurant general',
  'sentiment': 'negative',
  'justification': "don't like sharing"},
 {'opinion term': 'dim sum atmosphere',
  'aspect category': 'ambience general',
  'sentiment': 'negative',
  'justification': 'not your gig'},
 {'opinion term': 'sleek',
  'aspect category': 'restaurant general',
  'sentiment': 'positive',
  'justification': 'alternative'},
 {'opinion term': 'Chinatown',
  'aspect category': 'location general',
  'sentiment': 'positive',
  'justification': 'for'}]
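
`json.loads` works here because the completion is a bare JSON array. If the model ever wraps the array in extra prose, a more tolerant parser may help. The following is a minimal sketch, not what the notebook ran; `extract_quads` is a hypothetical helper, and the "first `[` to last `]`" fallback is an assumption about how such failures would look.

```python
import json

def extract_quads(raw_completion):
    """Try to parse a list of quad dicts from a raw completion.

    Returns the parsed list, or None if no JSON array can be recovered.
    """
    text = raw_completion.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the span between the first '[' and the last ']'.
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            return None
        try:
            parsed = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    return parsed if isinstance(parsed, list) else None

clean = ' [{"opinion term": "wait", "aspect category": "service general", "sentiment": "negative", "justification": "long"}]'
wrapped = "Sure, here is the output: " + clean + " Hope this helps!"

print(extract_quads(clean))
print(extract_quads(wrapped))
print(extract_quads("__DID_NOT_COMPLETE__"))
```

Returning `None` instead of raising keeps the failure count separate from the label checks that follow.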

# Run train dataset through model

- store raw completions, and the input text + its gold annotations

In [48]:
all_results = [] # store a list of jsons, each json contains input text, gold annotations, raw model completion

EXAMPLES_TO_TEST = 50
example_counter = 0

for example in absa_quad["train"].iter(batch_size=1):
    
    curr_example_data = {"example_id":example_counter, "input_text":None, "gold_quads":[]}
    
    # === process gold data and save it ===
    example_text = example["input"][0]
    input_text = ast.literal_eval(example_text)[0]
    curr_example_data["input_text"] = input_text
    
    gold_labels = example["output"][0]
    
    # go through list of quads
    golds = ast.literal_eval(gold_labels)
    for quad in golds:
        tmp_d = {}
        tmp_d["opinion term"] = quad[0]
        tmp_d["aspect category"] = quad[1]
        tmp_d["sentiment"] = quad[2]
        tmp_d["justification"] = quad[3]
        curr_example_data["gold_quads"].append(tmp_d)
    
    # === GET MODEL COMPLETION ===
    print(f"=== Sending example number {example_counter} to model ===")
    #print(f"--- current example is : {input_text} ---")
    
    prompt = template.format(examples_absa=examples_absa, query=input_text)

    messages = [ 
        {"role": "system", "content": "You are helpful asssistant that performs aspect-based sentiment analysis on restaurant reviews. You strictly follow all formatting instructions without deviation."}, 
        {"role": "user", "content": prompt}, 
    ] 

    generation_args = { 
        "max_new_tokens": 500, 
        "return_full_text": False, 
        "temperature": 0.0, 
        #"do_sample": False, 
    }
    
    try:
        output = pipe(messages, **generation_args)
        raw_completion = output[0]['generated_text']
        
        curr_example_data["raw_completion"] = raw_completion
    except Exception as e:
        print(example_counter, "===>", e)
        curr_example_data["raw_completion"] = "__DID_NOT_COMPLETE__"
    
    # === append full dict to all results list ===
    all_results.append(curr_example_data)
    #print("--- heading to next example ---")
    
    # === loop logic ===
    example_counter += 1
    if example_counter >= EXAMPLES_TO_TEST:
        break   

=== Sending example number 0 to model ===
=== Sending example number 1 to model ===
=== Sending example number 2 to model ===
=== Sending example number 3 to model ===
=== Sending example number 4 to model ===
=== Sending example number 5 to model ===
=== Sending example number 6 to model ===
=== Sending example number 7 to model ===
=== Sending example number 8 to model ===
=== Sending example number 9 to model ===
=== Sending example number 10 to model ===
=== Sending example number 11 to model ===
=== Sending example number 12 to model ===
=== Sending example number 13 to model ===
=== Sending example number 14 to model ===
=== Sending example number 15 to model ===
=== Sending example number 16 to model ===
=== Sending example number 17 to model ===
=== Sending example number 18 to model ===
=== Sending example number 19 to model ===
=== Sending example number 20 to model ===
=== Sending example number 21 to model ===
=== Sending example number 22 to model ===
=== Sending example n

## Save 50 results to JSON for later/offline

In [50]:
with open('absa-quad-phi3-baseline-inference-50-samples.json', 'w') as file:
    json.dump(all_results, file)
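
For the later/offline analysis, the saved file can be read back with `json.load`. A small round-trip sketch, assuming the same record keys the loop above stores; `sample_results` is a placeholder record and the temporary path is only for illustration.

```python
import json
import os
import tempfile

# Placeholder record with the same keys as the entries in all_results.
sample_results = [{
    "example_id": 0,
    "input_text": "placeholder review text",
    "gold_quads": [],
    "raw_completion": " []",
}]

path = os.path.join(tempfile.mkdtemp(), "results.json")

# Write with explicit UTF-8 so non-ASCII review text survives the round trip.
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_results, f, ensure_ascii=False, indent=2)

with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded[0]["example_id"], repr(reloaded[0]["raw_completion"]))
```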

In [51]:
!ls

absa-quad-phi3-baseline-inference-50-samples.json


## Quick data analysis before closing the session

- check how many of the 50 completions load directly as JSON
- check how often an Aspect Category or Sentiment label is not in the allowed list

In [55]:
# do this in pandas later, just quickly for now
does_not_load_to_json = 0

REFERENCE_KEYS = set( ["opinion term", "aspect category", "sentiment", "justification"])
quad_dict_has_bad_keys = 0
total_quad_dicts = 0

aspect_category_not_in_allowed = 0

sentiment_not_in_allowed = 0

for single_result in all_results:
    try:
        json_completion = json.loads(single_result["raw_completion"])
        # go over aspect categories and sentiments
        
        for quad_dict in json_completion:
            # check keys are those expected
            if set(quad_dict.keys()) != REFERENCE_KEYS:
                quad_dict_has_bad_keys += 1
            total_quad_dicts += 1
            
            if quad_dict.get("aspect category", None) not in aspect_category_cnt:
                aspect_category_not_in_allowed += 1
                
            if quad_dict.get("sentiment", None) not in sentiment_cnt:
                sentiment_not_in_allowed += 1
    except Exception as e:
        does_not_load_to_json += 1
        print(e)

In [56]:
print(total_quad_dicts)
print(quad_dict_has_bad_keys)
print(aspect_category_not_in_allowed)
print(sentiment_not_in_allowed)

85
0
0
0


# Temporary conclusions

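- so far the output format is 100% valid on the 50 sampled train examples: the check printed no parse errors, and all 85 predicted quads had the expected keys and used only allowed aspect categories and sentiments
- this was achieved without any structured decoding, just with good prompting
- only the format has been checked; agreement with the gold quads has not been measured yet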
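
For the evaluation step, one common choice is exact-match precision, recall and F1 over quads. The sketch below is one way this could be done, not the notebook's method. The names `quad_key` and `quad_prf` are hypothetical, the lowercasing and whitespace stripping are assumptions, and the gold and predicted quads are made-up toy data.

```python
FIELDS = ("opinion term", "aspect category", "sentiment", "justification")

def quad_key(quad):
    """Normalize a quad dict to a hashable tuple for exact-match comparison."""
    return tuple(str(quad.get(f, "NULL")).strip().lower() for f in FIELDS)

def quad_prf(gold_quads, pred_quads):
    """Exact-match precision, recall and F1 for a single example."""
    gold = {quad_key(q) for q in gold_quads}
    pred = {quad_key(q) for q in pred_quads}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy quads, invented for illustration only.
gold = [
    {"opinion term": "wait", "aspect category": "service general",
     "sentiment": "negative", "justification": "long"},
    {"opinion term": "NULL", "aspect category": "ambience general",
     "sentiment": "positive", "justification": "sleek"},
]
pred = [
    {"opinion term": "wait", "aspect category": "service general",
     "sentiment": "negative", "justification": "long"},
    {"opinion term": "dim sum", "aspect category": "food quality",
     "sentiment": "neutral", "justification": "for"},
]

precision, recall, f1 = quad_prf(gold, pred)
print(precision, recall, f1)
```

Exact match is strict: a prediction with the right category and sentiment but a slightly different span counts as wrong, so a looser per-field score may be worth reporting alongside it.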