In [1]:
import pandas as pd
from string import Template
from pathlib import Path

import warnings
warnings.simplefilter("ignore")

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

data_path = Path('/kaggle/input/kaggle-llm-science-exam')

## We'll use `FLAN-T5-base` from Kaggle's Model Hub

You'll probably want to turn on the GPU option for the notebook! (Remember though, since this is a Code competition, you'll need to set Internet to Off for Notebook submissions to the competition.)

In [2]:
llm = '/kaggle/input/flan-t5/pytorch/base/2'


device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = T5ForConditionalGeneration.from_pretrained(llm).to(device)
tokenizer = T5Tokenizer.from_pretrained(llm)

The data is formatted as follows. For each `prompt` (i.e., the question) there are five possible answers labeled `[A-E]`. Only one of the answers is correct.

In [3]:
test = pd.read_csv(data_path / 'test.csv', index_col='id')
test.head()

id,prompt,A,B,C,D,E
0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...
1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...
2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...
3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...


## Creating a preamble template

How you format the prompt you give the LLM can make a big difference in the output you get. Here, we try to instruct the LLM to rank all of the options from most likely to least likely.

In [4]:
preamble = \
    'Answer the following question by outputting the letters A, B, C, D, and E '\
    'in order of the most likely to be correct to the least likely to be correct.'

template = Template('$preamble\n\n$prompt\n\nA) $a\nB) $b\nC) $c\nD) $d\nE) $e')

In [5]:
def format_input(df, idx):
    
    prompt = df.loc[idx, 'prompt']
    a = df.loc[idx, 'A']
    b = df.loc[idx, 'B']
    c = df.loc[idx, 'C']
    d = df.loc[idx, 'D']
    e = df.loc[idx, 'E']

    input_text = template.substitute(
        preamble=preamble, prompt=prompt, a=a, b=b, c=c, d=d, e=e)
    
    return input_text

This is an example of a formatted question that would be used as input to the LLM.

In [6]:
print(format_input(test, 0))

Answer the following question by outputting the letters A, B, C, D, and E in order of the most likely to be correct to the least likely to be correct.

Which of the following statements accurately describes the impact of Modified Newtonian Dynamics (MOND) on the observed "missing baryonic mass" discrepancy in galaxy clusters?

A) MOND is a theory that reduces the observed missing baryonic mass in galaxy clusters by postulating the existence of a new form of matter called "fuzzy dark matter."
B) MOND is a theory that increases the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 20.
C) MOND is a theory that explains the missing baryonic mass in galaxy clusters that was previously considered dark matter by demonstrating that the mass is in the form of neutrinos and axions.
D) MOND is a theory that reduces the discrepancy between the observed missing baryonic mass in galaxy cl

In [7]:
inputs = tokenizer(format_input(test, 0), return_tensors="pt").to(device)
outputs = model.generate(**inputs)
answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(answer)

['A']


## Post-processing

You can see from the above that the LLM did not properly follow instructions: it returned a single letter instead of a ranking. You'll need to figure out how to ensure your model provides at least the top three predictions, and have checks and post-processing in place for when it doesn't (such as in our example!)

This notebook provides a naive and **very** fragile example of how to do this. You'll want to make something more robust!

In [8]:
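def post_process(predictions):
    # `predictions` is the list of decoded strings from `batch_decode`, e.g. ['A'].
    # Each string only counts if it is exactly one of the five letters, so an
    # output such as 'A B C' is treated as invalid here.
    valid = {'A', 'B', 'C', 'D', 'E'}
    # If there are no valid choices, return something and hope for partial credit
    if set(predictions).isdisjoint(valid):
        final_pred = 'A B C D E'
    else:
        final_pred = []
        for prediction in predictions:
            if prediction in valid:
                final_pred.append(prediction)
        # add remaining letters (set iteration order is arbitrary,
        # so the order of the padded letters is not meaningful)
        to_add = valid - set(final_pred)
        final_pred.extend(list(to_add))
        # put in space-delimited format
        final_pred = ' '.join(final_pred)

    return final_pred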
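
A slightly more defensive variant of the function above might look like the following. This is only a sketch: `post_process_robust`, `VALID`, and `n_answers` are names made up for illustration, and the regular expression assumes the model writes its choices as standalone capital letters (for example `A`, `B)`, or `C,`).

```python
import re

VALID = 'ABCDE'  # hypothetical constant: the five answer labels


def post_process_robust(predictions, n_answers=5):
    """Extract a ranked, space-delimited list of letters from model output.

    `predictions` may be a single string or a list of decoded strings.
    """
    if isinstance(predictions, str):
        text = predictions
    else:
        text = ' '.join(predictions)

    ranked = []
    # \b[A-E]\b matches standalone capital letters, keeping first occurrences only
    for letter in re.findall(r'\b[A-E]\b', text):
        if letter not in ranked:
            ranked.append(letter)

    # Pad with the unused letters in alphabetical order so the result
    # is deterministic from run to run
    ranked += [letter for letter in VALID if letter not in ranked]
    return ' '.join(ranked[:n_answers])


print(post_process_robust(['C, A, C then E']))
```

It still has blind spots: letters run together without separators (such as `ABC`) are not matched, and a capital "A" used as an English article would be counted as an answer.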

## Making a submission

We can now write a simple loop that generates a prediction for every test question and stores it in the submission file.

In [9]:
submission = pd.read_csv(
    data_path / 'sample_submission.csv', index_col='id')

for idx in test.index:
    inputs = tokenizer(format_input(test, idx), return_tensors="pt").to(device)
    outputs = model.generate(**inputs)
    answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    submission.loc[idx, 'prediction'] = post_process(answer)

Token indices sequence length is longer than the specified maximum sequence length for this model (596 > 512). Running this sequence through the model will result in indexing errors
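
The tokenizer warning above indicates that at least one formatted question is longer than the 512 tokens the tokenizer reports as the model's maximum, which may hurt the quality of the answer. One possible way to handle this, sketched below, is to shorten the longest answer option until the prompt fits. The names `build_within_budget`, `whitespace_token_count`, and `TEMPLATE` are hypothetical, and the default counter simply counts whitespace-separated words as a stand-in for a real tokenizer.

```python
from string import Template

# Same layout as the notebook's template (redefined so the sketch is self-contained)
TEMPLATE = Template('$preamble\n\n$prompt\n\nA) $a\nB) $b\nC) $c\nD) $d\nE) $e')


def whitespace_token_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_within_budget(preamble, prompt, options, max_tokens,
                        count_tokens=whitespace_token_count):
    """Drop trailing words from the longest option until the prompt fits.

    In the notebook you could pass something like
    `count_tokens=lambda s: len(tokenizer(s)["input_ids"])` and `max_tokens=512`.
    """
    words = [option.split() for option in options]  # five lists of words
    while True:
        a, b, c, d, e = (' '.join(w) for w in words)
        text = TEMPLATE.substitute(
            preamble=preamble, prompt=prompt, a=a, b=b, c=c, d=d, e=e)
        if count_tokens(text) <= max_tokens:
            return text
        longest = max(words, key=len)
        if len(longest) <= 1:
            # Nothing sensible left to trim; the caller must handle the overflow
            return text
        longest.pop()


print(build_within_budget('Rank.', 'Q?', ['x y z w', 'x', 'x', 'x', 'x'], max_tokens=13))
```

A simpler alternative is to pass `truncation=True, max_length=512` to the tokenizer call, but that cuts off the end of the prompt, so the last options would be the ones lost.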


You can include all five possible answers, but only the first three will be counted!
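
To see why the order of the first three letters matters, here is a minimal sketch of a local scoring function. It assumes the score is mean average precision at 3 (MAP@3) with exactly one correct letter per question; `average_precision_at_3` and `map_at_3` are illustrative names, not part of the competition's tooling.

```python
def average_precision_at_3(prediction, answer):
    """Score one question.

    prediction: space-delimited letters, e.g. 'A E C D B'
    answer: the single correct letter
    """
    # Only the first three letters are considered
    for rank, letter in enumerate(prediction.split()[:3], start=1):
        if letter == answer:
            return 1.0 / rank
    return 0.0


def map_at_3(predictions, answers):
    """Mean of the per-question scores."""
    scores = [average_precision_at_3(p, a) for p, a in zip(predictions, answers)]
    return sum(scores) / len(scores)


# Toy example: correct letter in 1st, 2nd, 3rd, and 5th position
print(map_at_3(['A B C', 'B A C', 'C D E', 'D E A B C'], ['A', 'A', 'E', 'C']))
```

Under this assumption a correct answer in second place earns half credit and one in third place earns a third, so getting the ranking right matters more than merely listing the correct letter.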

In [10]:
submission.head()

id,prediction
0,A E C D B
1,A E C D B
2,C E A D B
3,A E C D B
4,A E C D B


In [11]:
submission.to_csv('submission.csv')