
Assignment: Natural Language Inference

Alexander Koller edited this page Jan 15, 2024 · 11 revisions

In this assignment, you will build a system that does "natural language inference" (NLI). It will determine, for a "premise" sentence and a "hypothesis" sentence, whether the hypothesis follows from the premise.

We will use the Stanford NLI Corpus (SNLI), which is also available on Hugging Face. As explained on the SNLI website (which you should read), each instance carries one of three labels: the premise entails the hypothesis, the hypothesis contradicts the premise, or the two are neutral with respect to each other. The dataset is balanced: about one third of the instances are of each type, so a system that guesses at random achieves an accuracy of about 33%.
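As a starting point, the following sketch loads one split of the corpus and checks the class balance. It rests on several assumptions that you should check against the dataset card: the Hugging Face `datasets` package, the dataset id `stanfordnlp/snli`, the fields `premise`, `hypothesis`, and `label`, the integer coding 0 = entailment, 1 = neutral, 2 = contradiction, and the coding -1 for instances without a gold label. The helper names are hypothetical.

```python
from collections import Counter

# Assumed integer coding of the SNLI labels on Hugging Face.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def load_snli_split(split="validation"):
    """Load one SNLI split as a list of dicts (needs the `datasets` package)."""
    # Imported lazily so the helpers below also work without the package.
    from datasets import load_dataset
    return list(load_dataset("stanfordnlp/snli", split=split))


def drop_unlabeled(examples):
    """Keep only instances whose label is one of the three NLI classes."""
    return [ex for ex in examples if ex["label"] in LABEL_NAMES]


def label_distribution(examples):
    """Return the fraction of instances per label name ({} for empty input)."""
    counts = Counter(LABEL_NAMES[ex["label"]] for ex in examples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts[name] / total for name in LABEL_NAMES.values()}
```

Calling `label_distribution(drop_unlabeled(load_snli_split("validation")))` should then show whether the three fractions are each close to one third.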

For this final assignment, you are free to use any PyTorch or Hugging Face classes that you like. However, this makes it even more important that you document and explain your code in detail. Document in your submission which models you used and describe them briefly, and be sure to submit your prompt templates.

Supervised Training (40 points)

Implement, train, or obtain an NLI model and evaluate it on the SNLI dataset. You can develop your own model, e.g. by finetuning RoBERTa or adding a classification model on top of it, or you can use an existing model, such as nli-roberta-base.
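If you take the existing-model route, one possible sketch looks as follows. It assumes the `sentence-transformers` package and the checkpoint name `cross-encoder/nli-roberta-base`, and it assumes that the model returns three scores per pair in the order contradiction, entailment, neutral. That order differs from the assumed SNLI coding (0 = entailment, 1 = neutral, 2 = contradiction), so the predictions are mapped explicitly. Verify both orders against the model card and the dataset card; the function names are hypothetical.

```python
# Assumed output order of cross-encoder/nli-roberta-base (check the model card).
MODEL_LABELS = ["contradiction", "entailment", "neutral"]
# Assumed integer coding of the SNLI labels on Hugging Face.
SNLI_IDS = {"entailment": 0, "neutral": 1, "contradiction": 2}


def scores_to_snli_ids(scores):
    """Map each row of per-class scores to an SNLI label id via argmax."""
    ids = []
    for row in scores:
        best = max(range(len(MODEL_LABELS)), key=lambda i: row[i])
        ids.append(SNLI_IDS[MODEL_LABELS[best]])
    return ids


def predict_with_cross_encoder(pairs, model_name="cross-encoder/nli-roberta-base"):
    """Classify (premise, hypothesis) pairs; needs `sentence-transformers`."""
    # Imported lazily so the mapping above can be used without the package.
    from sentence_transformers import CrossEncoder
    model = CrossEncoder(model_name)
    scores = model.predict(list(pairs))  # one row of three scores per pair
    return scores_to_snli_ids(scores)
```

Getting the label mapping wrong silently produces an accuracy far below what the model can achieve, so it is worth testing `scores_to_snli_ids` on a few hand-written score rows first.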

Report the accuracy, training time, and the runtime on the validation set.
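Accuracy and runtime can be measured with one small helper. The sketch below assumes that each example is a dict with the fields `premise`, `hypothesis`, and `label`, and that your model is wrapped in a function that takes a list of (premise, hypothesis) pairs and returns one label per pair in the same coding as `label`; both the helper and that wrapper interface are hypothetical.

```python
import time


def evaluate(predict_fn, examples):
    """Return (accuracy, seconds) for predict_fn on labeled examples.

    predict_fn takes a list of (premise, hypothesis) pairs and returns one
    predicted label per pair, in the same coding as ex["label"].
    """
    pairs = [(ex["premise"], ex["hypothesis"]) for ex in examples]
    start = time.perf_counter()
    predictions = predict_fn(pairs)
    seconds = time.perf_counter() - start
    correct = sum(p == ex["label"] for p, ex in zip(predictions, examples))
    accuracy = correct / len(examples) if examples else 0.0
    return accuracy, seconds
```

The timer covers only the call to `predict_fn`, so model loading should happen before the call if you want to report pure inference time.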

Prompting (40 points)

Now use in-context learning (aka prompting) with a pre-trained large language model (LLM) to solve SNLI. You can try to provide detailed natural-language instructions on what you want the LLM to do; you can provide one or more examples; or you can try a mixture of both.
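A mixture of instructions and examples could be assembled as in the sketch below. The instruction wording, the `Premise:` / `Hypothesis:` / `Answer:` layout, and the function names are all illustrative assumptions, not a recommended template; finding a template that works well is part of the assignment. Because an LLM may answer with more than the bare label, the sketch also includes a simple parser that picks the label word occurring first in the response.

```python
LABELS = ("entailment", "neutral", "contradiction")

# Illustrative instruction text; the wording is an assumption.
INSTRUCTION = (
    "Decide whether the hypothesis follows from the premise. "
    "Answer with exactly one word: entailment, neutral, or contradiction."
)


def build_prompt(premise, hypothesis, demonstrations=()):
    """Build a prompt from the instruction, optional examples, and the query.

    demonstrations is a sequence of (premise, hypothesis, label) triples.
    """
    parts = [INSTRUCTION, ""]
    for demo_premise, demo_hypothesis, demo_label in demonstrations:
        parts += [
            f"Premise: {demo_premise}",
            f"Hypothesis: {demo_hypothesis}",
            f"Answer: {demo_label}",
            "",
        ]
    parts += [f"Premise: {premise}", f"Hypothesis: {hypothesis}", "Answer:"]
    return "\n".join(parts)


def parse_label(response):
    """Return the label word that occurs first in the response, or None."""
    text = response.lower()
    found = [(text.find(label), label) for label in LABELS if label in text]
    return min(found)[1] if found else None
```

Responses for which `parse_label` returns `None` should be counted as errors rather than dropped, so that the reported accuracy stays comparable to that of the supervised model.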

Run the LLM on at least the first 100 examples from the validation set and report accuracy and runtime. If you don't evaluate on the entire validation set, make sure you evaluate the finetuned model on the same set as the prompting model, to ensure comparability of your results. Also submit three interesting instantiated prompts with the LLM-generated outputs.

You have a choice regarding the LLM you want to use. The easiest and most accurate option will be to use the OpenAI API with the gpt-4-1106-preview model (aka GPT-4 Turbo). This will require you to make an OpenAI account and pay a small amount of money.

Alternatively, you can choose to use any open-source LLM and run it on your own computer or a server. One way to do this is to use the GPT4All Python API, e.g. with an instruction-tuned Mistral model; you can get one in the GGUF format that GPT4All can read from this website. Note that these models require a substantial amount of main memory. Feel free to try out the quantized models (Q4, Q5), especially if you run on CPU.

Discussion (20 points)

Discuss the findings from your finetuning and prompting experiments, with regard to speed, accuracy, and the amount of required training data, and to anything else you found interesting. If you used a pre-trained NLI model in the finetuning part of the assignment, you may have to speculate about training time.

Double-check whether SNLI is actually a well-designed task by evaluating your models on inputs where only the hypothesis and not the premise is given. Without the premise, the correct label should not be predictable, so if the dataset was built carefully enough, this should reduce the accuracy of the models to chance level (about 33%, i.e. a uniform choice among the three labels). Discuss your findings.
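One way to set up this check is sketched below. It assumes the same hypothetical interface as elsewhere in your code: examples are dicts with `premise`, `hypothesis`, and `label`, and a model is wrapped in a function that maps a list of (premise, hypothesis) pairs to predicted labels. Replacing the premise with an empty string is itself an assumption; for a prompting model you might instead use a template that omits the premise line entirely.

```python
def hypothesis_only(examples, placeholder=""):
    """Copy the examples with every premise replaced by a placeholder."""
    return [{**ex, "premise": placeholder} for ex in examples]


def accuracy(predict_fn, examples):
    """Fraction of examples for which predict_fn returns the gold label."""
    pairs = [(ex["premise"], ex["hypothesis"]) for ex in examples]
    predictions = predict_fn(pairs)
    correct = sum(p == ex["label"] for p, ex in zip(predictions, examples))
    return correct / len(examples)


def premise_ablation(predict_fn, examples):
    """Return accuracy with the full input and with the premise removed."""
    return {
        "full": accuracy(predict_fn, examples),
        "hypothesis_only": accuracy(predict_fn, hypothesis_only(examples)),
    }
```

A `hypothesis_only` accuracy clearly above one third would suggest that the hypotheses alone carry information about the label.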
