# Assignment 1 - In-Context Learning

### Ananya Agrawal (ananyaa2)

In this assignment, students experiment with in-context learning by selecting and ordering demonstrations that condition a large language model at inference time to classify text. In this task, an online store wants to know whether a review discusses one or more general topics of interest. The topics are specific to a class of product, in this case vacuum cleaners; other products would have other topics.

The dataset has been divided into development, training and test sets. Students should practice setting up their experiments and writing their prompts using only the development set. Demonstrations for in-context learning can be drawn from the training set. Final evaluation prior to submission should use the test set.

# In-Context Learning for Amazon Review Categorization

## **Objective**
This notebook explores In-Context Learning (ICL) for classifying Amazon reviews into predefined categories.
We will experiment with:
- Different numbers of demonstrations.
- Varying demonstration selections.
- Changing demonstration order.

## **1. Import Libraries & Load Data**


In [1]:
# Import necessary libraries
import json
from openai import OpenAI
import os
import random

## Load Reviews with Hashtags

The dataset is partitioned into development, training and testing sets. While writing the code to set up your experiments and write your prompts, use only the development set. The training set should be used to sample demonstrations. Run your experiment on the test set only when your code is complete and you are ready to turn in your assignment.

In [2]:
def load_json(path):
    """Read a JSON file and close the file handle afterwards."""
    with open(path, 'r') as f:
        return json.load(f)

data_dev = load_json('dataset-dev.json') # For prompt testing
data_train = load_json('dataset-train.json') # For in-context demonstrations
data_test = load_json('dataset-test.json') # Final evaluation

print('\nDataset Sizes: Dev %i, Train %i, Test %i\n' % (len(data_dev), len(data_train), len(data_test)))

data_dev[0]


Dataset Sizes: Dev 100, Train 100, Test 300



{'text': 'Used the product and was very happy with it until about a month ago. Motor sounded like it was working harder; thought maybe I was imagining things. Look all through hoses and brush roller assembly for any blockages. Today it was not getting good suction; then motor suddenly cut back on output. Barely runs; does not run in upright position. No suction. Bought this as an "inexpensive" replacement to Dyson that died after 5 years. You get what you pay for evidently. Wondering if manufacturer warranty in effect, though I failed to send in the warranty card.',
 'expected': ['#PerformanceAndFunctionality',
  '#ValueForMoneyAndInvestment',
  '#CustomerExperienceAndExpectations'],
 'sentiment': ['N', 'N', 'N']}

In [3]:
print(f"Loaded {len(data_train)} training samples, {len(data_dev)} dev samples, and {len(data_test)} test samples.")

Loaded 100 training samples, 100 dev samples, and 300 test samples.


In [5]:
# Collect the categories observed in the training data
# (replaced further down by the fixed hashtag list used for validation)
category_set = {tag for example in data_train for tag in example["expected"]}

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## **2. Define Prompting Strategy**
- Construct few-shot prompts using training examples.
- Experiment with different numbers of demonstrations.
- Shuffle demonstration order and compare performance.
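
To compare demonstration selection and demonstration order separately, it helps to make both reproducible. The following is a minimal sketch, not the notebook's own method: `pick_demonstrations`, `seed`, `shuffle_seed` and the toy records are hypothetical names introduced for illustration. One seed fixes which examples are chosen and an optional second seed changes only their order.

```python
import random

def pick_demonstrations(train, k, seed, shuffle_seed=None):
    """Select k training examples reproducibly; optionally reorder them."""
    rng = random.Random(seed)  # controls WHICH examples are chosen
    chosen = rng.sample(train, k)
    if shuffle_seed is not None:
        # A second generator changes only the ORDER of the same examples.
        random.Random(shuffle_seed).shuffle(chosen)
    return chosen

# Toy stand-in for data_train (hypothetical records with the dataset's keys).
toy_train = [{"text": f"review {i}", "expected": ["#AssemblyAndSetup"]} for i in range(10)]

same_a = pick_demonstrations(toy_train, k=4, seed=0)
same_b = pick_demonstrations(toy_train, k=4, seed=0)
reordered = pick_demonstrations(toy_train, k=4, seed=0, shuffle_seed=1)

print([ex["text"] for ex in same_a])
print([ex["text"] for ex in reordered])
```

Holding `seed` fixed while varying `shuffle_seed` isolates the effect of order; varying `seed` isolates the effect of selection.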


## Define the Hashtag List for Prediction

In [6]:
tags = [
    '#DesignAndUsabilityIssues',
    '#PerformanceAndFunctionality',
    '#BatteryAndPowerIssues',
    '#DurabilityAndMaterialConcerns',
    '#MaintenanceAndCleaning',
    '#CustomerExperienceAndExpectations',
    '#ValueForMoneyAndInvestment',
    '#AssemblyAndSetup'
]

tag_list = ' '.join(tags)

In [7]:
# Ensure we validate predictions against this list
category_set = set(tags)

## Review the Hashtag Distribution

In general, it is good practice when classifying items to know the distribution of the target categories. Categories that are underrepresented, especially among the training examples used as demonstrations, are likely to be predicted less accurately.
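
A minimal sketch of how the distribution could be inspected, assuming each record stores its hashtags under the `expected` key as in the sample shown earlier. The helper name `tag_distribution` and the toy records are hypothetical; in the notebook the function would be called on `data_train`.

```python
from collections import Counter

def tag_distribution(examples):
    """Count how many examples carry each hashtag, most frequent first."""
    counts = Counter(tag for ex in examples for tag in ex["expected"])
    return counts.most_common()

# Toy stand-in for data_train (hypothetical records).
toy = [
    {"expected": ["#PerformanceAndFunctionality", "#ValueForMoneyAndInvestment"]},
    {"expected": ["#PerformanceAndFunctionality"]},
    {"expected": ["#AssemblyAndSetup"]},
]

dist = tag_distribution(toy)
for tag, n in dist:
    # Share of reviews that mention the tag (a review can carry several tags).
    print(f"{tag:40s} {n:3d}  ({n / len(toy):.0%} of reviews)")
```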

In [8]:
def sample_demonstrations(k=32, random_order=True):
    """Sample k examples from training data and format them for in-context learning."""
    examples = random.sample(data_train, k) if random_order else data_train[:k]

    formatted_examples = []
    for ex in examples:
        formatted_examples.append(f"Review: {ex['text']}\nCategories: {', '.join(ex['expected'])}\n")
    
    return "\n".join(formatted_examples)

def construct_prompt(k=32, random_order=True):
    """Construct a prompt template with sampled demonstrations.

    The returned string keeps a {test_review} placeholder for str.format().
    """
    # Escape literal braces in the review text so that the later
    # str.format() call does not mistake them for replacement fields
    demonstrations = sample_demonstrations(k, random_order).replace("{", "{{").replace("}", "}}")

    prompt_template = (
        "Below are examples of customer reviews categorized into relevant product features.\n"
        "Your task is to categorize a new review using only the following hashtags:\n"
        f"{tag_list}\n\n"
        "Respond with a comma-separated list of valid hashtags from the predefined list.\n\n"
        f"{demonstrations}\n"
        "Review: {test_review}\nCategories:"
    )

    return prompt_template


## **3. Query Model with Prompt**

## Define the Prompt and Experiment

The experiment generally has the following steps:

1. Sample the training data to select k demonstrations, for 0 ≤ k < training set size.
2. Linearize the demonstrations into text.
3. Iterate over the test data and insert each test review and the linearized demonstrations into the prompt template.
4. Send the prompt to the model and receive the response.
5. Validate the response. If it passes validation, store it for later evaluation; otherwise save it to a list of errors.

It is generally good to save responses and errors with an index that can be linked back to the test data.

After running the experiment, the evaluation metrics should be computed from the answers and the errors should be inspected. Adjustments to the prompt and/or experiment can be made to reduce the errors, e.g., by post-processing the responses prior to validation.
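
One possible form of such post-processing is sketched below, under stated assumptions: the model may separate hashtags with spaces, commas or newlines, and may vary the capitalization. The names `VALID_TAGS` and `extract_tags` are hypothetical and not part of the assignment code; the tag list is copied from this notebook.

```python
import re

# The eight hashtags defined in this notebook.
VALID_TAGS = [
    "#DesignAndUsabilityIssues",
    "#PerformanceAndFunctionality",
    "#BatteryAndPowerIssues",
    "#DurabilityAndMaterialConcerns",
    "#MaintenanceAndCleaning",
    "#CustomerExperienceAndExpectations",
    "#ValueForMoneyAndInvestment",
    "#AssemblyAndSetup",
]

# Lower-cased lookup so that capitalization differences are tolerated.
_CANONICAL = {tag.lower(): tag for tag in VALID_TAGS}

def extract_tags(response):
    """Return (valid, unknown) hashtags found anywhere in a raw response."""
    valid, unknown = [], []
    for tag in re.findall(r"#\w+", response):
        canonical = _CANONICAL.get(tag.lower())
        target, value = (valid, canonical) if canonical else (unknown, tag)
        if value not in target:  # drop duplicates, keep first-seen order
            target.append(value)
    return valid, unknown

valid, unknown = extract_tags(
    "Categories: #performanceandfunctionality #AssemblyAndSetup, #MadeUpTag."
)
print(valid)    # hashtags that map onto the predefined list
print(unknown)  # hashtags the model invented, kept for error inspection
```

Returning the unknown hashtags separately keeps hallucinated categories available for the error analysis instead of discarding them silently.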

In [9]:
def prompt_model(prompt):
    """Send a structured prompt to OpenAI API and return the response."""
    try:
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            store=True,
            messages=[{"role": "user", "content": prompt}]
        )
        return completion.choices[0].message.content.strip()
    except Exception as e:
        return f"ERROR: {e}"


## **4. Run Experiments on Dev Dataset & Validate Responses**
- Generate model responses using different prompts.
- Store predictions for analysis.


In [10]:
def validate_response(response):
    """Filter out invalid responses and hallucinations."""
    predicted_categories = {tag.strip() for tag in response.split(",") if tag.strip().startswith("#")}
    return list(predicted_categories.intersection(category_set))

def run_experiment(dataset, k=32, random_order=True):
    """Run prompt-based experiment on a given dataset."""
    results, errors = [], []

    for idx, review in enumerate(dataset):
        prompt = construct_prompt(k, random_order).format(test_review=review['text'])
        response = prompt_model(prompt)

        validated_response = validate_response(response)

        if validated_response:
            results.append({
                "index": idx,
                "review": review["text"],
                "expected": review["expected"],
                "predicted": validated_response
            })
        else:
            errors.append({"index": idx, "review": review["text"], "response": response})

    return results, errors

# Run experiment on dev data for tuning prompt design
dev_results, dev_errors = run_experiment(data_dev, k=32, random_order=True)


## **5. Evaluate Model Performance on Dev Dataset**
- Measure exact-match accuracy, precision, recall and F1 of the category classification.


## Evaluate the Experimental Results

The evaluation metrics include precision, recall and F1 score. For the total number of true positives (tp), false positives (fp) and false negatives (fn), these calculations should be used to report results:
* Precision = tp / (tp + fp)
* Recall = tp / (tp + fn)
* F1 = 2tp / (2tp + fp + fn)
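
As a worked example of these formulas, the sketch below pools the counts over two toy reviews before dividing. The labels `#A`, `#B` and `#C` are hypothetical placeholders, not hashtags from the dataset.

```python
# Gold and predicted label sets for two toy reviews (hypothetical labels).
expected = [{"#A", "#B"}, {"#C"}]
predicted = [{"#A", "#C"}, {"#C"}]

# Pool the counts over all reviews before dividing.
tp = sum(len(e & p) for e, p in zip(expected, predicted))  # 1 + 1 = 2
fp = sum(len(p - e) for e, p in zip(expected, predicted))  # 1 + 0 = 1
fn = sum(len(e - p) for e, p in zip(expected, predicted))  # 1 + 0 = 1

precision = tp / (tp + fp)        # 2 / 3
recall = tp / (tp + fn)           # 2 / 3
f1 = 2 * tp / (2 * tp + fp + fn)  # 4 / 6

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```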

In [11]:
def evaluate_results(results):
    """Calculate accuracy based on exact category matches."""
    correct = sum(1 for res in results if set(res["predicted"]) == set(res["expected"]))
    accuracy = (correct / len(results)) * 100 if results else 0
    return {"accuracy": accuracy, "total_reviews": len(results)}

# Evaluate dev results
dev_metrics = evaluate_results(dev_results)
print("Dev Set Evaluation:", dev_metrics)


Dev Set Evaluation: {'accuracy': 11.0, 'total_reviews': 100}


In [12]:
def calculate_metrics(results):
    """
    Calculate Precision, Recall, and F1-score for the experiment results.
    """
    tp, fp, fn = 0, 0, 0  # Initialize counts
    
    for res in results:
        expected_set = set(res["expected"])
        predicted_set = set(res["predicted"])
        
        tp += len(expected_set & predicted_set)  # True Positives (Correctly predicted)
        fp += len(predicted_set - expected_set)  # False Positives (Incorrect predictions)
        fn += len(expected_set - predicted_set)  # False Negatives (Missed correct labels)
    
    # Calculate Precision, Recall, and F1-score
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0
    
    return {
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1_score": round(f1, 4),
        "total_reviews": len(results)
    }


In [13]:
dev_metrics = calculate_metrics(dev_results)
print("Dev Set Evaluation Metrics:", dev_metrics)


Dev Set Evaluation Metrics: {'precision': 0.6171, 'recall': 0.758, 'f1_score': 0.6803, 'total_reviews': 100}


## **6. Run Final Experiment on Test Data & Save Results**
- Store experiment outcomes in `results.json`.


In [15]:
# Once we finalize the best prompt strategy from dev, run on test data
test_results, test_errors = run_experiment(data_test, k=32, random_order=True)

# Evaluate final test results
test_metrics = calculate_metrics(test_results)
print("Final Test Set Evaluation Metrics:", test_metrics)

Final Test Set Evaluation Metrics: {'precision': 0.6169, 'recall': 0.7582, 'f1_score': 0.6803, 'total_reviews': 300}


In [16]:
# Save final test results to match dataset-test.json format
# Index predictions by review position so each lookup is a single dict access
predicted_by_index = {res["index"]: res["predicted"] for res in test_results}

final_results = []
for idx, review in enumerate(data_test):
    # If no prediction found, default to an empty list
    predicted_tags = predicted_by_index.get(idx, [])

    # Create a structured dictionary
    final_results.append({
        "index": idx,
        "text": review["text"],
        "expected": review["expected"],  # Keep original expected categories
        "predicted": predicted_tags      # Add model's predicted categories
    })

# Save results to JSON file ("w" overwrites; appending on a rerun would corrupt the JSON)
with open("results.json", "w") as f:
    json.dump(final_results, f, indent=4)

print("results.json saved successfully!")


results.json saved successfully!


## Open Source Models (Optional)

If students wish to evaluate their solution on open source models, they may use Ollama, if their hardware supports it.

In [17]:
# from ollama import chat
# from ollama import ChatResponse

# def prompt_ollama(prompt):
#     response: ChatResponse = chat(model='llama3.3', messages=[{
#         'role': 'user',
#         'content': prompt,
#       },
#     ])
#     return response['message']['content']