Assignment: Natural Language Inference

In this assignment, you will build a system that does "natural language inference" (NLI). It will determine, for a "premise" sentence and a "hypothesis" sentence, whether the hypothesis follows from the premise.

We will use the Stanford NLI Corpus (SNLI), which is also available on Hugging Face. As explained on the SNLI website (which you should read), the premise can either entail the hypothesis; the hypothesis can contradict the premise; or they can be neutral with respect to each other. The dataset is balanced: One third of instances are of each type, so a system that guesses at random achieves an accuracy of 33%.

For this final assignment, you are free to use any Pytorch or Hugging Face classes that you like. However, this makes it even more important that you document and explain your code in detail. Document in your submission which models you used and describe them briefly, and be sure to submit your prompt templates.

Finetuning (40 points)

Train an NLI model and evaluate it on the SNLI dataset. You can start from a general-purpose pretrained model, such as RoBERTa or T5, if you wish.

Report the accuracy and the runtime on the validation set, and report the training time.

For a reduced number of points, you can instead use an existing model that has already been finetuned on SNLI, such as nli-roberta-base. In this case, you won't be able to report the training time or a learning curve.

Prompting (40 points)

Now use in-context learning (aka prompting) with a pre-trained large language model (LLM) to solve SNLI. You can try to provide detailed natural-language instructions on what you want the LLM to do; you can provide one or more examples; or you can try a mixture of both.

Run the LLM on at least the first 100 examples from the validation set and report accuracy and runtime. If you don't evaluate on the entire validation set, make sure you evaluate the finetuned model on the same set as the prompting model, to ensure comparability of your results. Also submit three interesting instantiated prompts with the LLM-generated outputs.

You have a choice regarding the LLM you want to use. The most accurate option will be to use the OpenAI API with an up-to-date model. You will need to create an OpenAI account and pay a few dollars in usage fees.

Alternatively, you can choose to use an open-source LLM and run it on your own computer or a server. One simple way to do this is to use the Ollama Python API, e.g. with Llama 3.2. Note that this will require a substantial amount of memory.

Discussion (20 points)

Discuss the findings from your finetuning and prompting experiments, with regard to speed, accuracy, and the amount of required training data, and to anything else you found interesting. If you used a pre-trained NLI model in the finetuning part of the assignment, you may have to speculate about training time.

As a sanity check, evaluate your models on inputs where only the hypothesis and not the premise is given. If the dataset was built carefully enough and SNLI didn't leak into the LLM's training data, this should reduce the accuracy of the models to chance level (33% accuracy = even choice among three alternative labels). Discuss your findings.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assignment: Natural Language Inference

Finetuning (40 points)

Prompting (40 points)

Discussion (20 points)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally