# Assignment 2 - In-Context Learning

In this assignment, students experiment with in-context learning by selecting and ordering demonstrations that teach a large language model, at inference time, to classify text. In this task, an online store wants to determine whether a review discusses one or more general topics of interest. The topics are specific to a class of product, in this case vacuum cleaners; other products would have other relevant topics.

The dataset has been divided into development, training and test sets. Students should practice setting up their experiments and writing their prompts using only the development set. Demonstrations for in-context learning can be drawn from the training set. Final evaluation prior to submission should use the test set.

In [2]:
import configparser
from openai import OpenAI

# Read API key from config file
config = configparser.ConfigParser()
config.read('config.ini')
api_key = config['openai']['api_key']

# Create a new OpenAI client
client = OpenAI(api_key=api_key)

## Load Reviews with Hashtags

The dataset is partitioned into development, training and test sets. While writing the code to set up your experiments and write your prompts, use only the development set. The training set should be used to sample demonstrations. Run your experiment on the test set only when your code is complete and you are ready to turn in your assignment.

In [3]:
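import json

def load_json(path):
    # Use a context manager so the file handle is closed after reading
    with open(path, 'r') as f:
        return json.load(f)

data_dev = load_json('dataset-dev.json')
data_train = load_json('dataset-train.json')
data_test = load_json('dataset-test.json')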

print('\nDataset Sizes: Dev %i, Train %i, Test %i\n' % (len(data_dev), len(data_train), len(data_test)))

data_dev[0]


Dataset Sizes: Dev 100, Train 100, Test 300



{'text': 'Used the product and was very happy with it until about a month ago. Motor sounded like it was working harder; thought maybe I was imagining things. Look all through hoses and brush roller assembly for any blockages. Today it was not getting good suction; then motor suddenly cut back on output. Barely runs; does not run in upright position. No suction. Bought this as an "inexpensive" replacement to Dyson that died after 5 years. You get what you pay for evidently. Wondering if manufacturer warranty in effect, though I failed to send in the warranty card.',
 'expected': ['#PerformanceAndFunctionality',
  '#ValueForMoneyAndInvestment',
  '#CustomerExperienceAndExpectations'],
 'sentiment': ['N', 'N', 'N']}
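
A quick sanity check before experimenting is to confirm that no review appears in more than one split, since a test review that also sits in the training set could inflate the scores. The sketch below is not part of the assignment code: `find_overlap` is a hypothetical helper, it assumes each record carries a `text` field as in the example above, and it runs on toy records rather than the real files.

```python
def find_overlap(split_a, split_b):
    """Return the set of review texts that appear in both splits."""
    texts_a = {item["text"] for item in split_a}
    texts_b = {item["text"] for item in split_b}
    return texts_a & texts_b

# Toy records standing in for the real splits
toy_train = [{"text": "Great suction."}, {"text": "Battery died fast."}]
toy_test = [{"text": "Battery died fast."}, {"text": "Easy to assemble."}]

overlap = find_overlap(toy_train, toy_test)
print(overlap)
```

With the real data, the same call could be made on each pair of splits, for example `find_overlap(data_train, data_test)`; an empty set means the two splits share no review text.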

## Define the Hashtag List for Prediction

In [4]:
tags = [
    '#DesignAndUsabilityIssues',
    '#PerformanceAndFunctionality',
    '#BatteryAndPowerIssues',
    '#DurabilityAndMaterialConcerns',
    '#MaintenanceAndCleaning',
    '#CustomerExperienceAndExpectations',
    '#ValueForMoneyAndInvestment',
    '#AssemblyAndSetup'
]

## Review the Hashtag Distribution

In general, it is good practice when classifying items to know the distribution of the target categories. Categories that are underrepresented, especially in the training data, tend to be predicted less accurately.

In [5]:
from collections import Counter

def review_hashtag_distribution(data):
    hashtag_counter = Counter()
    for item in data:
        hashtag_counter.update(item['expected'])
    
    return hashtag_counter

# Review hashtag distribution for the training set
distribution = review_hashtag_distribution(data_train)

# Sort by count in descending order
sorted_distribution = distribution.most_common()

print("Hashtag distribution in training set (sorted by count):")
for tag, count in sorted_distribution:
    print(f"{tag}: {count}")

Hashtag distribution in training set (sorted by count):
#PerformanceAndFunctionality: 58
#DurabilityAndMaterialConcerns: 37
#CustomerExperienceAndExpectations: 33
#ValueForMoneyAndInvestment: 31
#DesignAndUsabilityIssues: 18
#MaintenanceAndCleaning: 12
#AssemblyAndSetup: 7
#BatteryAndPowerIssues: 4
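
Raw counts are easier to compare as shares of the training reviews. The following sketch reuses the counts printed above and flags tags that fall below a cutoff. The 10% threshold and the names `train_size`, `min_share` and `rare_tags` are illustrative assumptions, not part of the assignment.

```python
from collections import Counter

# Counts copied from the training-set output above
distribution = Counter({
    "#PerformanceAndFunctionality": 58,
    "#DurabilityAndMaterialConcerns": 37,
    "#CustomerExperienceAndExpectations": 33,
    "#ValueForMoneyAndInvestment": 31,
    "#DesignAndUsabilityIssues": 18,
    "#MaintenanceAndCleaning": 12,
    "#AssemblyAndSetup": 7,
    "#BatteryAndPowerIssues": 4,
})
train_size = 100  # number of training reviews reported earlier
min_share = 0.10  # hypothetical cutoff for calling a tag "rare"

# Tags that appear in fewer than min_share of the training reviews
rare_tags = [tag for tag, count in distribution.items()
             if count / train_size < min_share]

for tag, count in distribution.most_common():
    flag = " (rare)" if tag in rare_tags else ""
    print(f"{tag}: {count / train_size:.0%}{flag}")
```

With random sampling of a handful of demonstrations, a tag that appears in only 4 of 100 training reviews will often be absent from the prompt entirely, which is one reason to look at these shares before choosing a sampling strategy.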


## Define the Prompt and Experiment

The experiment generally has the following steps: (1) sample the training data to select k demonstrations, where 0 <= k < training set size; (2) linearize the demonstrations into text; (3) iterate over the test data, inserting each test review and the linearized demonstrations into the prompt template; (4) send the prompt to the model and receive the response; (5) validate the response: if it passes, store it for later evaluation; if it fails, save it to a list of errors. It is generally good practice to save responses and errors with an index that links back to the test data.
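
Because step (1) is random, two runs with the same k can use different demonstrations, which makes ordering strategies harder to compare. A minimal sketch of reproducible sampling follows, assuming a local `random.Random` instance with a fixed seed is acceptable for the experiment. `sample_demonstrations_seeded` and its `seed` parameter are hypothetical and not part of the assignment code.

```python
import random

def sample_demonstrations_seeded(dataset, k, seed=0):
    """Sample k demonstrations without replacement, reproducibly for a given seed."""
    if not 0 <= k <= len(dataset):
        raise ValueError(f"k must be between 0 and {len(dataset)}, got {k}")
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(dataset, k)

# Toy training set standing in for data_train
toy_train = [{"text": f"review {i}", "expected": []} for i in range(10)]

first = sample_demonstrations_seeded(toy_train, 3, seed=42)
second = sample_demonstrations_seeded(toy_train, 3, seed=42)
print(first == second)  # the same seed yields the same demonstrations
```

Holding the seed fixed while switching ordering strategies keeps the set of demonstrations constant, so differences in the metrics are more likely to reflect the ordering itself.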

After running the experiment, the evaluation metrics should be computed from the answers and the errors should be inspected. Adjustments to the prompt and/or experiment can be made to reduce the errors, e.g., by post-processing the responses prior to validation.
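
As one illustration of post-processing, a model may return hashtags with different capitalization or repeat the same hashtag. The sketch below maps case-insensitive matches back to the canonical tags and removes duplicates while keeping their order. `normalize_hashtags` is a hypothetical helper, and whether case-insensitive matching is appropriate is an assumption to check against the errors actually observed.

```python
import re

def normalize_hashtags(response, tag_list):
    """Map hashtags in the response to canonical tags, dropping unknowns and repeats."""
    canonical = {tag.lower(): tag for tag in tag_list}
    seen = set()
    normalized = []
    for raw in re.findall(r"#\w+", response):
        tag = canonical.get(raw.lower())
        if tag is not None and tag not in seen:
            seen.add(tag)
            normalized.append(tag)
    return normalized

# Toy tag list and response for illustration
toy_tags = ["#AssemblyAndSetup", "#MaintenanceAndCleaning"]
toy_response = "#assemblyandsetup, #MaintenanceAndCleaning, #AssemblyAndSetup, #Other"
print(normalize_hashtags(toy_response, toy_tags))
```

Removing repeats matters for scoring when a repeated prediction would otherwise be counted as an extra false positive.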

In [31]:
import random
import re

def sample_demonstrations(dataset, k):
    """
    Samples k random demonstrations from the dataset.
    """
    return random.sample(dataset, k)

## Where the prompt engineering begins
def linearize_demonstrations(demonstrations):
    """
    Converts a list of demonstration examples into a formatted text string.
    """
    prompt_text = ""
    for demo in demonstrations:
        prompt_text += f"Review: {demo['text']}\n"
        prompt_text += f"Hashtags: {', '.join(demo['expected'])}\n\n"
    return prompt_text

# ordering strategies start
def cluster_by_similarity(demonstrations):
    """
    Orders examples by clustering based on overlapping hashtags.
    """
    # Create a map of hashtags to examples
    hashtag_map = {}
    for demo in demonstrations:
        for hashtag in demo["expected"]:
            if hashtag not in hashtag_map:
                hashtag_map[hashtag] = []
            hashtag_map[hashtag].append(demo)

    # Cluster examples by hashtag
    clustered_examples = []
    seen = set()
    for hashtag, examples in hashtag_map.items():
        for example in examples:
            if id(example) not in seen:  # Prevent duplicates
                clustered_examples.append(example)
                seen.add(id(example))

    return linearize_demonstrations(clustered_examples)

def progressive_complexity(demonstrations):
    """
    Orders examples by increasing complexity (fewer hashtags to more hashtags).
    """
    sorted_examples = sorted(demonstrations, key=lambda demo: len(demo["expected"]))
    return linearize_demonstrations(sorted_examples)

def rare_hashtags_last(demonstrations):
    """
    Orders examples so that those with rare hashtags are placed last in the context.
    """
    # Count the frequency of each hashtag among the sampled demonstrations
    hashtag_counts = {}
    for demo in demonstrations:
        for hashtag in demo["expected"]:
            hashtag_counts[hashtag] = hashtag_counts.get(hashtag, 0) + 1

    # Sort by the count of each example's rarest hashtag, in descending order,
    # so that examples containing rare hashtags end up last
    sorted_examples = sorted(
        demonstrations,
        key=lambda demo: min(hashtag_counts[hashtag] for hashtag in demo["expected"]),
        reverse=True
    )
    return linearize_demonstrations(sorted_examples)

def relevant_hashtags_last(demonstrations, query_hashtags):
    """
    Orders examples so that those most relevant to the query hashtags are placed last.
    """
    sorted_examples = sorted(
        demonstrations,
        key=lambda demo: len(set(demo["expected"]) & set(query_hashtags)),
    )
    return linearize_demonstrations(sorted_examples)
# ordering strategies end

def construct_prompt(demonstrations_text, review, tag_list):
    """
    Constructs the full prompt for the model.
    
    Parameters:
        demonstrations_text (str): The formatted text of the demonstrations.
        review (str): The text of the review to classify.
        tag_list (list): The full list of valid hashtags for prediction.
    
    Returns:
        str: The constructed prompt as a string.
    """
    bulletpoint_tag_list = "\n".join([f"- {tag}" for tag in tag_list])
    prompt = f"Below are customer reviews and the hashtags that describe them:\n{demonstrations_text}"
    prompt += "Select the hashtags from the options below most applicable to the review. Return only the selected hashtags separated by commas.\n"
    prompt += f"Review: {review}\n"
    prompt += f"Options:\n{bulletpoint_tag_list}"
    return prompt
## Where the prompt engineering ends

def prompt_model(prompt):
    """
    Sends the prompt to the model and returns the response.
    """
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        store=True,
        messages=[
            {"role": "user", 'content': prompt}
        ]
    )
    return completion.choices[0].message.content

def extract_valid_hashtags(response, tag_list):
    """
    Extracts valid hashtags from the model's response.
    
    Parameters:
        response (str): The raw response string from the model.
        tag_list (list): The list of valid hashtags.
    
    Returns:
        list: A list of valid hashtags extracted from the response.
    """
    # Strip leading and trailing whitespace from the response
    response = response.strip()
    
    # Use a regular expression to extract all words starting with '#'
    extracted_tags = re.findall(r"#\w+", response)
    
    # Filter the extracted tags to include only those in the valid tag list
    valid_tags = [tag for tag in extracted_tags if tag in tag_list]
    
    return valid_tags

def save_to_json(filename, data):
    with open(filename, "w") as f:
        json.dump(data, f, indent=4)

# Parameters
k = 8  # Number of demonstrations to include
tag_list = tags  # Use the global tag list

# Run the experiment
results = []
errors = []
test_cases = data_test # the reviews to classify

# Sample and linearize demonstrations
demonstrations = sample_demonstrations(data_train, k)

# Order and linearize demonstrations
# demonstrations_text = your_strategy_here(demonstrations)
    
for idx, test_data in enumerate(test_cases):
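    # This strategy must be computed inside the loop because it depends on the review being classified.
    # Note: it reads that review's gold labels (test_data['expected']), which would be unavailable for unlabeled reviews.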
    demonstrations_text = relevant_hashtags_last(demonstrations, test_data['expected'])

    # Construct the full prompt and get response from the model
    prompt = construct_prompt(demonstrations_text, test_data['text'], tag_list)
    try:
        response = prompt_model(prompt)
        print(idx)
        
        # Remove invalid tags from the response
        predicted_categories = extract_valid_hashtags(response, tag_list)
        results.append({
            "index": idx,  # position in test_cases, to link back to the test data
            "review": test_data["text"],
            "true_labels": test_data["expected"],
            "predicted": predicted_categories,
        })
    except Exception as e:
        # Record the failure with its index so it can be inspected later
        errors.append({
            "index": idx,
            "review": test_data["text"],
            "true_labels": test_data["expected"],
            "error": str(e),
        })

# Save the results and errors to files
save_to_json("results.json", results)
save_to_json("errors.json", errors)

# Print summary
print(f"Experiment completed. {len(results)} results saved to 'results.json'.")
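
The `relevant_hashtags_last` strategy is given the `expected` labels of the review being classified. One label-free alternative, sketched here under the assumption that word overlap between review texts is a usable proxy for relevance, orders demonstrations by Jaccard similarity to the query text. `order_by_text_overlap` and `tokenize` are hypothetical helpers, not part of the assignment code.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def order_by_text_overlap(demonstrations, query_text):
    """Order demonstrations so those most similar to the query text come last."""
    query_tokens = tokenize(query_text)

    def jaccard(demo):
        demo_tokens = tokenize(demo["text"])
        union = demo_tokens | query_tokens
        return len(demo_tokens & query_tokens) / len(union) if union else 0.0

    return sorted(demonstrations, key=jaccard)

# Toy demonstrations for illustration
toy_demos = [
    {"text": "The battery dies after ten minutes.", "expected": ["#BatteryAndPowerIssues"]},
    {"text": "Easy to assemble out of the box.", "expected": ["#AssemblyAndSetup"]},
]
ordered = order_by_text_overlap(toy_demos, "Battery life is short, it dies quickly.")
print([demo["expected"] for demo in ordered])
```

The ordered list can then be passed to `linearize_demonstrations`, as the other strategies do.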


## Evaluate the Experimental Results

The evaluation metrics are precision, recall and F1 score. For each review, count the true positives (tp), false positives (fp) and false negatives (fn) among its predicted hashtags, compute the metrics below, and then average each metric across all reviews:
* Precision = tp / (tp + fp)
* Recall = tp / (tp + fn)
* F1 = 2tp / (2tp + fp + fn)
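
As a worked example, take the development review shown earlier, whose expected hashtags are `#PerformanceAndFunctionality`, `#ValueForMoneyAndInvestment` and `#CustomerExperienceAndExpectations`, and suppose (hypothetically) that the model predicts `#PerformanceAndFunctionality`, `#ValueForMoneyAndInvestment` and `#DurabilityAndMaterialConcerns`. Then tp = 2, fp = 1 and fn = 1. The sketch below treats labels as sets, which assumes no label is repeated; `prf` is an illustrative helper name.

```python
def prf(true_labels, predicted_labels):
    """Compute precision, recall and F1 for one review, treating labels as sets."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    tp = len(true_set & pred_set)   # predicted and expected
    fp = len(pred_set - true_set)   # predicted but not expected
    fn = len(true_set - pred_set)   # expected but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return precision, recall, f1

expected = [
    "#PerformanceAndFunctionality",
    "#ValueForMoneyAndInvestment",
    "#CustomerExperienceAndExpectations",
]
# Hypothetical model prediction, for illustration only
predicted = [
    "#PerformanceAndFunctionality",
    "#ValueForMoneyAndInvestment",
    "#DurabilityAndMaterialConcerns",
]

precision, recall, f1 = prf(expected, predicted)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
# Precision: 0.667, Recall: 0.667, F1: 0.667
```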

In [32]:
import json

def read_from_json(filename):
    with open(filename, "r") as f:
        return json.load(f)

def evaluate_results(results):
    """
    Evaluates the results of the experiment by calculating overall precision, recall, and F1 score.
    Metrics are calculated per instance and then averaged across all results.
    
    Parameters:
        results (list): The list of results. Each result object should contain a list of true and a list of predicted labels.
        
    Returns:
        dict: A dictionary containing overall precision, recall, and F1 score.
    """
    total_precision = 0
    total_recall = 0
    total_f1 = 0
    
    for result in results:
        true_labels = result["true_labels"]
        predicted_labels = result["predicted"]
        
        # Count true positives, false positives, and false negatives
        tp = 0  # True positives
        fp = 0  # False positives
        fn = 0  # False negatives

        # Count occurrences in true and predicted lists
        true_counts = {}
        predicted_counts = {}

        for label in true_labels:
            true_counts[label] = true_counts.get(label, 0) + 1

        for label in predicted_labels:
            predicted_counts[label] = predicted_counts.get(label, 0) + 1

        # Calculate true positives
        for label in predicted_counts:
            if label in true_counts:
                tp += min(predicted_counts[label], true_counts[label])

        # Calculate false positives
        for label in predicted_counts:
            if label not in true_counts:
                fp += predicted_counts[label]
            else:
                fp += max(0, predicted_counts[label] - true_counts[label])

        # Calculate false negatives
        for label in true_counts:
            if label not in predicted_counts:
                fn += true_counts[label]
            else:
                fn += max(0, true_counts[label] - predicted_counts[label])

        # Precision, recall, and F1 for this result
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = (2 * tp) / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0

        # Accumulate metrics
        total_precision += precision
        total_recall += recall
        total_f1 += f1
    
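    # Calculate averages (all zeros if there are no results, to avoid division by zero)
    n = len(results)
    if n == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    return {
        "precision": total_precision / n,
        "recall": total_recall / n,
        "f1": total_f1 / n,
    }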

# Evaluate the results
results = read_from_json("results.json")
metrics = evaluate_results(results)

# Print the overall metrics
print("Evaluation Metrics:")
print(f"Precision: {metrics['precision']:.3f}")
print(f"Recall: {metrics['recall']:.3f}")
print(f"F1 Score: {metrics['f1']:.3f}")

Evaluation Metrics:
Precision: 0.638
Recall: 0.918
F1 Score: 0.727


In [None]:
## 5 demos, 20 test cases, best of 2 runs (precision, recall, F1)
# no ordering: .66, .89, .74
# cluster by similarity: .66, .94, .76
# progressive complexity: .68, .94, .78
# rare hashtags last: .67, .90, .75
# relevant hashtags last: .74, .88, .78

## 10 demos, 20 test cases, best of 2 runs (precision, recall, F1)
# no ordering: .69, .93, .78
# cluster by similarity: .72, .89, .78
# progressive complexity: .70, .90, .77
# rare hashtags last: .71, .92, .78
# relevant hashtags last: .71, .93, .80

## relevant hashtags last, 20 test cases, 1 run f1-score
# 8 demos: .787
# 9 demos: .709
# 10 demos: .721