<a href="https://colab.research.google.com/github/fhasan8/iglu22/blob/main/IGLU_2022_NLP_Baseline_BERT_Classifier_BM25_Ranker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://images.aicrowd.com/uploads/ckeditor/pictures/914/content_b1f1e024bb2e2e095d0d.png)

## Overview
This notebook contains the official baseline for the NeurIPS 2022 IGLU Challenge NLP Task.

This task has two parts:

1. **"When to ask clarifying question"** which is a **binary classification problem**: whether to ask a clarification question or not.
2.  **"What to ask as a clarifying question"** which is a **ranking problem**: rank a list of human-issued clarifying questions by how well each one fits the instruction.

In this notebook we will:

1. Train a [BERT](https://huggingface.co/docs/transformers/model_doc/bert) based binary classifier for the first task.
2. Set up a [BM25](https://pypi.org/project/rank-bm25/) based ranker for the second task.
3. Package both models in the AIcrowd submission format and submit them.

**Authors**

[Negar Arabzadeh](https://twitter.com/NegarEmpr) (IGLU Team)

[Dipam Chakraborty](https://twitter.com/__dipam__) (AIcrowd)

# Prerequisites 

Sign up for the competition 🚀

https://www.aicrowd.com/challenges/neurips-2022-iglu-challenge/problems/neurips-2022-iglu-challenge-nlp-task

Login to AIcrowd and download the data (run cells below)

In [2]:
!pip install -U aicrowd-cli
%load_ext aicrowd.magic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
%aicrowd login
# Log in to AIcrowd to download the data

Please login here: https://api.aicrowd.com/auth/qof2aTlev38QUVLKIKhWbcI1LqZrKZXAHq6zZHhT2po
API Key valid
Gitlab access token valid
Saved details successfully!


In [4]:
!mkdir public_data
%aicrowd dataset download -c neurips-2022-iglu-challenge-nlp-task -o public_data

ERROR:root:Error while reading the git config, 'NoneType' object has no attribute 'config_reader'


clarifying_questions_train.csv:   0%|          | 0.00/2.55M [00:00<?, ?B/s]

iglu-2022-nlp-task-states-v1.0.zip:   0%|          | 0.00/13.9M [00:00<?, ?B/s]

question_bank.csv:   0%|          | 0.00/60.3k [00:00<?, ?B/s]

# Train a BERT classifier - "When to ask clarifying question" ❓

We fine-tune a large pre-trained language model (BERT) with a classification layer on top. 

Model performance is reported using the official metric, the macro-averaged F1 score.
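To make the metric concrete, here is a minimal, library-free sketch of macro-averaged F1: compute F1 for each class separately, then take the unweighted mean. The helper names (`f1_for_class`, `macro_f1`) and the toy labels are our own illustration; the notebook itself relies on `sklearn.metrics.f1_score(..., average='macro')`.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no correct positive predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels (made up): 0 = instruction is clear, 1 = clarification needed
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
toy_macro_f1 = macro_f1(y_true, y_pred)
print(toy_macro_f1)  # class 0 F1 = 0.75, class 1 F1 = 0.5, so the mean is 0.625
```

Because every class counts equally, a classifier that ignores the rare "clarification needed" class is penalised much more than plain accuracy would suggest.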




## Install Huggingface transformers and other libraries 🤗


In [5]:
!pip3 install transformers
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
from dataclasses import dataclass
import os
import datetime
import pickle
import numpy as np
import random
import torch
import pandas as pd
from torch.utils.data import TensorDataset
from transformers import BertTokenizer, RobertaTokenizer, BartTokenizer
from sklearn.metrics import precision_score, recall_score, f1_score
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler)
from torch.nn import CrossEntropyLoss, MSELoss
from transformers import BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
from transformers import RobertaForSequenceClassification, BartForSequenceClassification
from tqdm.auto import tqdm

## Hyperparameters

In [7]:
model_name = 'bert'
max_seq_length=128
batch_size=16
epoch=2
lr=5e-6
seed_val = 42

## Preprocessing - Tokenize the text data for BERT

In [8]:
def get_tensor_dataset(df,tokenizer):
    
    # Tokenize all of the sentences and map the tokens to their word IDs.
    input_ids = []
    token_type_ids = []
    attention_masks = []
    labels = []
    topic_ids = []
    
    for count, item in tqdm(enumerate(zip(df["GameId"], 
                                          df['InputInstruction'], df["IsInstructionClear"])),
                            total=len(df), desc='Tokenizing data'):
        z, x, y = item
        encoded_dict = tokenizer.encode_plus(
                            x.lower(),
                            add_special_tokens = True, 
                            max_length = max_seq_length,           
                            padding = 'max_length',  # replaces the deprecated pad_to_max_length=True
                            truncation=True, 
                            return_attention_mask = True,   
                            return_tensors = 'pt',     
                       )

        input_ids.append(encoded_dict['input_ids'])

        if "token_type_ids" in encoded_dict:        
            token_type_ids.append(encoded_dict['token_type_ids'])

        attention_masks.append(encoded_dict['attention_mask'])
        labels.append(y)

        topic_ids.append(z)
    

    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)
    return  TensorDataset(input_ids, attention_masks, labels)
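As a rough picture of what padding and truncating to `max_seq_length` does, here is a library-free sketch. It is only an illustration under simplifying assumptions: the token ids are made up, `pad_or_truncate` is a hypothetical helper, and the real tokenizer additionally inserts and preserves BERT's special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop anything past max_length
    n_pad = max_length - len(kept)           # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut (ids are made up)
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset stores it next to the input ids.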

In [9]:
from sklearn.model_selection import train_test_split

data_path='public_data/clarifying_questions_train.csv'
df = pd.read_csv(data_path, sep=",")
dftrain, dfdev = train_test_split(df, test_size=0.15)
dfdev.to_csv('public_data/clarifying_questions_val.csv', index=False)

# Pair each split with its name explicitly rather than inferring it from the DataFrame shape
for data_type, df in [('dev', dfdev), ('train', dftrain)]:
  

  df['IsInstructionClear'] = df.IsInstructionClear.replace({'Yes': 0, 'No': 1})

  if model_name == 'bart':
      tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
  elif model_name == 'roberta':
      tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
  elif model_name == 'bert':
      tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

  dataset = get_tensor_dataset(df,tokenizer) 

  tensor_data=f"{model_name}_{data_type}.pkl"
  with open(tensor_data, 'wb') as f:
      pickle.dump(dataset, f)
      
tokenizer.save_pretrained("saved_tokenizer")

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Tokenizing data:   0%|          | 0/1025 [00:00<?, ?it/s]



Tokenizing data:   0%|          | 0/5803 [00:00<?, ?it/s]

('saved_tokenizer/tokenizer_config.json',
 'saved_tokenizer/special_tokens_map.json',
 'saved_tokenizer/vocab.txt',
 'saved_tokenizer/added_tokens.json')

## Train Classifier 🏋️‍♀️

In [10]:
def train_model( model_name, model, train_dataloader, scheduler, optimizer, criterion, epochs, lr):
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

    if torch.cuda.is_available():    
        device = torch.device("cuda")

    else:
        device = torch.device("cpu")
    print(device)

    # For each epoch...
    for epoch in range(epochs):
        
        total_train_loss = 0
        train_n_correct = 0
        nb_tr_examples = 0

        model.train()

        for _, batch in tqdm(enumerate(train_dataloader), 
                             total=len(train_dataloader),
                             desc=f'Train epoch {epoch+1}/{epochs}'):

            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device)

            model.zero_grad()        
            result = model(b_input_ids, attention_mask=b_input_mask)

            loss = criterion(result.logits, b_labels)
            loss.backward()
            total_train_loss += loss.item()
            
            optimizer.step()
            scheduler.step()

            logits = result.logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()
            _, _, _, accuracy = eval_result(logits, label_ids) 
            train_n_correct += accuracy
            nb_tr_examples+=b_input_ids.size(0)
        
        avg_train_loss = total_train_loss / len(train_dataloader)            
        train_acc = train_n_correct / len(train_dataloader)


        print('Epoch [{}/{}], Train Loss: {:.4f}, Train Accuracy: {:.4f} '.format(epoch+1, epochs, avg_train_loss, train_acc  ))

    print("Training complete!")

    model.save_pretrained(f"saved_model/{model_name}_{epochs}e_{lr}lr")
    model.save_pretrained(f"drive/MyDrive/IGLU-cq-data/{model_name}_{epochs}e_{lr}lr")

In [11]:
def eval_result(preds, labels):
    """ Calculate the accuracy, f1, precision, recall of our predictions vs labels
    """

    y_pred = np.argmax(preds, axis=1).flatten()
    y_true = labels.flatten()

    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average='macro')
    accuracy = np.sum(y_pred == y_true) / len(y_true) 

    return (precision, recall, f1, accuracy)

from sklearn.metrics import classification_report

def eval_model( model, epoch,lr,test_dataloader):
    
    if torch.cuda.is_available():    
        device = torch.device("cuda")
        model.cuda()
    else:
        device = torch.device("cpu")


    model.eval()

    test_results = []
    test_labels = []
    test_results_predicted_lavels = [] 
    for batch in tqdm(test_dataloader, desc='Eval model'):
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        with torch.no_grad():        
            result = model(b_input_ids, 
                       attention_mask=b_input_mask, 
                       labels=b_labels,
                       return_dict=True)

        logits = result.logits
        logits = logits.detach().cpu().numpy()
        test_results.extend(logits.tolist())
        tmp = np.asarray(logits.tolist())
        test_results_predicted_lavels.extend(np.argmax(tmp, axis=1).flatten())
        label_ids = b_labels.to('cpu').numpy()
        test_labels.extend(label_ids)

    print(classification_report( np.asarray(test_labels), np.asarray(test_results_predicted_lavels)))
    (precision, recall, f1, accuracy) = eval_result(np.asarray(test_results), np.asarray(test_labels))

    print('Test Precision: {:.4f}, Test Recall: {:.4f}, Test Macro F1: {:.4f}, Test Accuracy: {:.4f} ' .format(precision, recall, f1, accuracy))

    df = pd.DataFrame(test_results, columns=["0", "1"])

    df['y_True'] = test_labels
    #print(df.head())
    df.to_csv( f"Final_Test_Probanility_Distribution_{model_name}_{epoch}e_{lr}lr_eval.csv", index=False)

In [12]:
print("============================================================")
print(f" {epoch} -- {lr} -- {datetime.datetime.now()}")
print("============================================================")

with open(f'{model_name}_train.pkl', 'rb') as f:
    train_dataset = pickle.load(f)

train_dataloader = DataLoader(
            train_dataset,  
            sampler = RandomSampler(train_dataset), 
            batch_size = batch_size, 
            num_workers= 4
        )

if model_name == 'bart':
    model = BartForSequenceClassification.from_pretrained(
        "facebook/bart-base",
        num_labels=2,
    )
elif model_name == 'roberta':
    model = RobertaForSequenceClassification.from_pretrained(
        "roberta-base",
        num_labels=2,
    )
else:
    model = BertForSequenceClassification.from_pretrained(
        'bert-base-uncased',
        num_labels=2,
    )
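# Use the GPU when one is available, otherwise fall back to the CPU
model.to('cuda' if torch.cuda.is_available() else 'cpu')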

optimizer = AdamW(
    model.parameters(),
    lr = lr,
)


criterion = CrossEntropyLoss()
total_steps = len(train_dataloader) * epoch
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, 
                                            num_training_steps = total_steps)

# Train Model
train_model( model_name,model, train_dataloader, scheduler, optimizer, criterion, epoch, lr)

 2 -- 5e-06 -- 2022-10-11 14:32:51.558242


  cpuset_checked))


Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

cuda




Train epoch 1/2:   0%|          | 0/363 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))

Epoch [1/2], Train Loss: 0.3582, Train Accuracy: 0.8637 


  _warn_prf(average, modifier, msg_start, len(result))
  cpuset_checked))


Train epoch 2/2:   0%|          | 0/363 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))

Epoch [2/2], Train Loss: 0.2921, Train Accuracy: 0.8972 
Training complete!
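The training cell uses a linear schedule with zero warmup steps, so the learning rate decays from `lr` to 0 over all training steps. Below is a minimal sketch of that shape as we understand it; `linear_schedule_lr` is a hypothetical helper written for illustration, not the `transformers` implementation. The numbers come from this run: `lr=5e-6` and 363 batches per epoch for 2 epochs, i.e. 726 steps.

```python
def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 5e-6
total_steps = 363 * 2  # batches per epoch * epochs, as in the training log above

lr_start = linear_schedule_lr(0, base_lr, total_steps)
lr_after_epoch_1 = linear_schedule_lr(363, base_lr, total_steps)
lr_end = linear_schedule_lr(total_steps, base_lr, total_steps)
print(lr_start, lr_after_epoch_1, lr_end)
```

With so few epochs, half of the training already happens at less than half the initial learning rate, which is worth keeping in mind when tuning `lr` and `epoch` together.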


## Evaluate classifier

In [13]:
model = BertForSequenceClassification.from_pretrained(f"saved_model/{model_name}_{epoch}e_{lr}lr")
with open(f"{model_name}_dev.pkl", 'rb') as f:
    test_dataset = pickle.load(f)

# Evaluation does not need shuffling, so iterate over the dev set in order
test_dataloader = DataLoader(
            test_dataset,
            sampler = SequentialSampler(test_dataset),
            batch_size = batch_size
)
eval_model( model, epoch,lr,test_dataloader)

Eval model:   0%|          | 0/65 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0       0.91      0.97      0.94       892
           1       0.64      0.35      0.45       133

    accuracy                           0.89      1025
   macro avg       0.77      0.66      0.69      1025
weighted avg       0.87      0.89      0.88      1025

Test Precision: 0.6389, Test Recall: 0.3459, Test Macro F1: 0.6938, Test Accuracy: 0.8898 
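As a sanity check, the headline numbers can be reproduced from a confusion matrix. The counts below are not printed by the notebook; they are inferred from the report above (133 positive examples with recall 0.3459 gives 46 true positives, and precision 0.6389 then gives 26 false positives), so treat them as an assumption.

```python
# Inferred counts for class 1 ("clarification needed"); an assumption, see above
tp, fp, fn, tn = 46, 26, 87, 866


def f1_from_counts(true_pos, false_pos, false_neg):
    """F1 written directly in terms of confusion-matrix counts."""
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)


f1_class_1 = f1_from_counts(tp, fp, fn)
# For class 0 the roles swap: its true positives are class 1's true negatives
f1_class_0 = f1_from_counts(tn, fn, fp)
macro_f1_check = (f1_class_0 + f1_class_1) / 2
accuracy_check = (tp + tn) / (tp + fp + fn + tn)
print(round(f1_class_0, 4), round(f1_class_1, 4))
print(round(macro_f1_check, 4), round(accuracy_check, 4))  # 0.6938 0.8898
```

The gap between the two per-class F1 scores shows where the baseline is weak: it rarely flags instructions that need clarification.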


# Ranking using BM25 - "What to ask as a clarifying question" 📝

We won't train any model for this baseline; we'll simply use BM25, which is already a strong ranker.

What are your ideas for improving on the BM25 scores? 😉
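For intuition, here is a small textbook-style sketch of BM25 scoring. It is not the `rank_bm25` implementation: the IDF variant, the defaults `k1=1.5` and `b=0.75`, and the name `bm25_scores` are assumptions chosen for illustration, and the toy corpus is made up.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in the corpus against a tokenized query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight (non-negative IDF variant)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalised by document length via b
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


corpus = [["which", "color", "block"], ["where", "place", "block"], ["how", "many"]]
scores = bm25_scores(["color", "block"], corpus)
print(scores)
```

A question sharing both query terms outranks one sharing a single common term, and a question with no overlap scores zero. That dependence on exact token overlap is the main limitation a learned ranker could address.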

## Installing requirements and dependencies



In [14]:
!pip install pytrec_eval
!pip install rank_bm25
# ast is part of the Python standard library, so it does not need to be installed

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytrec_eval
  Downloading pytrec_eval-0.5.tar.gz (15 kB)
Building wheels for collected packages: pytrec-eval
  Building wheel for pytrec-eval (setup.py) ... done
  Created wheel for pytrec-eval: filename=pytrec_eval-0.5-cp37-cp37m-linux_x86_64.whl size=264243 sha256=44a769ed51410108742f99ce0a81519100b2ac50a20952acd91093a1954e50a7
  Stored in directory: /root/.cache/pip/wheels/42/96/77/0829b8b2606f90f61ba10a51277629d2b615604e122ee932f4
Successfully built pytrec-eval
Installing collected packages: pytrec-eval
Successfully installed pytrec-eval-0.5
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev

In [15]:
import logging
import os
import random
import sys
from statistics import mean

import nltk
import numpy as np
import pandas as pd
from nltk.stem.porter import PorterStemmer
from rank_bm25 import BM25Okapi

nltk.download('punkt')
nltk.download('stopwords')

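def stem_tokenize(text, remove_stopwords=True):
  """Tokenize text, optionally drop English stopwords, and Porter-stem each token."""
  stemmer = PorterStemmer()
  tokens = [word for sent in nltk.sent_tokenize(text)
            for word in nltk.word_tokenize(sent)]
  if remove_stopwords:
    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
  return [stemmer.stem(word) for word in tokens]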

np.random.seed(42)
random.seed(42)


logging.basicConfig(
  level=logging.INFO,
  format="%(asctime)s [%(levelname)s] %(message)s",
  handlers=[
      logging.StreamHandler(sys.stdout)
  ]
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Dataloader


In [16]:
from sklearn.model_selection import train_test_split
#selecting part of data for dev set for testing
data_path='public_data/clarifying_questions_train.csv'
df = pd.read_csv(data_path, sep=",")

test = df.drop(columns=['InitializedWorldPath','IsInstructionClear','Partition'])
test = test[~test["ClarifyingQuestion"].isnull()]

#preprocessing question bank for BM25
question_bank_path = 'public_data/question_bank.csv'
question_bank = pd.read_csv(question_bank_path).fillna('')
question_bank['tokenized_question_list'] = question_bank['ClarifyingQuestion'].map(stem_tokenize)
question_bank['tokenized_question_str'] = question_bank['tokenized_question_list'].map(lambda x: ' '.join(x))

## Run BM25

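For each instruction, we'll use its 'qbank' column (a list of candidate question ids) to select the matching rows of the question bank by 'qrel' and build a small corpus to rank from. BM25 then ranks every candidate question against the input instruction. Only instructions that have a clarifying question in the dataset are ranked.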
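The two bookkeeping steps around BM25 can be sketched without pandas: parsing a 'qbank' cell into candidate ids, and writing ranked ids as run-file lines. The miniature question bank, the `run_lines` helper and the example ranking are hypothetical; the line layout and the descending pseudo-score mirror the `run_file.write(...)` call in the cell below.

```python
import ast

# Hypothetical miniature question bank: qrel id -> question text
question_bank = {
    "q_1": "Which color blocks?",
    "q_2": "Where should I place them?",
    "q_3": "How many blocks?",
}

# A 'qbank' cell is a string of quoted ids; literal_eval turns it into a tuple
qbank_cell = "'q_1', 'q_3'"
candidate_ids = ast.literal_eval(qbank_cell)
candidates = {qid: question_bank[qid] for qid in candidate_ids}


def run_lines(game_id, ranked_ids, tag="BM25-reranker"):
    """Format ranked question ids as run-file lines: id, 0, question, rank, score, tag."""
    n = len(ranked_ids)
    return [
        f"{game_id} 0 {qid} {rank} {(n - rank + 1) / 100} {tag}"
        for rank, qid in enumerate(ranked_ids, start=1)
    ]


# Pretend the ranker put q_3 ahead of q_1
lines = run_lines("CQ-game-1000", ["q_3", "q_1"])
print(candidates)
print("\n".join(lines))
```

The score column only needs to decrease with rank, because the evaluation further down sorts by it to recover the order.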

In [17]:
import ast
print(test.head())
run_file_path = 'dev_bm25'
run_file= open(run_file_path, 'w')
for index, row in test.iterrows():
    list_of_qs=ast.literal_eval(test['qbank'][index]) 
    GameId= test['GameId'][index]
    temp_q_bank=question_bank[question_bank['qrel'].isin(list_of_qs)]
    bm25_corpus = temp_q_bank['tokenized_question_list'].tolist() # creating a separate corpus for each GameId
    bm25 = BM25Okapi(bm25_corpus)
    #print(temp_q_bank.head())
    top_k=len(bm25_corpus)
    # Runs bm25 for every query and stores output in file.
    examples = []
    all_preds_bm25 = []
    query = test.loc[test['GameId']==GameId, 'InputInstruction'].tolist()[0]
    bm25_ranked_list = bm25.get_top_n(stem_tokenize(query, True), 
                                    bm25_corpus, 
                                    n=top_k)
    bm25_q_list = [' '.join(sent) for sent in bm25_ranked_list]
    docs = temp_q_bank.set_index('tokenized_question_str').loc[bm25_q_list, 'ClarifyingQuestion'].tolist()
    preds = temp_q_bank.set_index('tokenized_question_str').loc[bm25_q_list, 'qrel'].tolist()
    for i, questionid in enumerate(preds):
        #writing results to runfile
        run_file.write('{} 0 {} {} {} BM25-reranker\n'.format(GameId,questionid, i+1, (len(preds)-i)/100))

run_file.close()

          GameId                                 ClarifyingQuestion  \
3   CQ-game-1000                                Which color blocks?   
5   CQ-game-1002  After you remove the one green block there are...   
16  CQ-game-1011              in any square west of the red blocks?   
25  CQ-game-1020             Should I destory east or west puyrple?   
55  CQ-game-1055        Where exactly am I placing the blue blocks?   

                                     InputInstruction   qrel  \
3   Place four blocks to the east of the highest b...  q_149   
5   facing north destroy a green block located on ...  q_436   
16  Stack seven green blocks immediately to the we...  q_111   
25  Facing north place one purple block to the lef...  q_653   
55  facing northdelete top 2 purple blocks on Righ...  q_170   

                                                qbank  
3   'q_696', 'q_203', 'q_516', 'q_677', 'q_769', '...  
5   'q_928', 'q_46', 'q_191', 'q_462', 'q_400', 'q...  
16  'q_101', 'q_186'

## Evaluate on whole dataset

Since there is no training step, here we evaluate BM25 on the entire dataset.
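The metric reported here is mean reciprocal rank (MRR) at a cutoff k: each instruction scores 1/rank of its correct question if that question appears in the top k, and 0 otherwise. A minimal sketch with made-up rankings follows; the function names and toy ids are our own.

```python
def reciprocal_rank(ranked_ids, relevant_id, k):
    """1 / rank of the relevant id within the top k, or 0 if it is not there."""
    top_k = ranked_ids[:k]
    if relevant_id not in top_k:
        return 0.0
    return 1.0 / (top_k.index(relevant_id) + 1)


def mean_reciprocal_rank(runs, qrels, k):
    """Average reciprocal rank over all instructions that have a relevant question."""
    total = sum(reciprocal_rank(runs.get(game_id, []), rel, k)
                for game_id, rel in qrels.items())
    return total / len(qrels)


# Toy data: ranked question ids per game, and the single relevant id per game
runs = {"g1": ["q_2", "q_1", "q_3"], "g2": ["q_9", "q_8", "q_7"]}
qrels = {"g1": "q_1", "g2": "q_7"}

mrr_at_2 = mean_reciprocal_rank(runs, qrels, k=2)  # g1 -> 1/2, g2 -> 0
mrr_at_3 = mean_reciprocal_rank(runs, qrels, k=3)  # g1 -> 1/2, g2 -> 1/3
print(mrr_at_2, mrr_at_3)
```

Raising k can only add credit for questions ranked further down, which is why MRR@20 is never lower than MRR@5.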

In [18]:
# Evaluating run_file_path in terms of MRR@5, 10, 20
topic_df = test
topic_question_set_dict = topic_df.groupby('GameId')['qrel'].agg(set).to_dict()
ambigous_questions_count=0
for v in topic_question_set_dict.values():
    ambigous_questions_count+= len(v)

run_df = pd.read_csv(run_file_path, sep=' ', header=None)
run_df = run_df.sort_values(by=[0, 4], ascending=False).drop_duplicates(subset=[0, 4], keep='first')
run_question_set_list = run_df.groupby(0)[2].agg(list).to_dict()
topk_list = [5, 10, 20]

mrr_score_dict = {}

for topk in topk_list:
    metric_name = 'MRR{}'.format(topk)
    mrr_score_dict[metric_name] = {}
    for tid in topic_question_set_dict:
        try:
            qrel = list(topic_question_set_dict[tid])[0]
            rr = 1 / (run_question_set_list[tid][:topk].index(qrel) + 1)
        except (KeyError, ValueError):
            # No run for this GameId, or the relevant question is outside the top k
            rr = 0
        
        mrr_score_dict[metric_name][tid] = rr

mean_performance = {}
for metric in mrr_score_dict:
    mean_performance[metric] = sum(mrr_score_dict[metric][k] for k in mrr_score_dict[metric])/ambigous_questions_count
    
for metric in mrr_score_dict:
    print('{}: {}'.format(metric, mean_performance[metric]))

MRR5: 0.32202247191011213
MRR10: 0.33552434456928826
MRR20: 0.34432871423613903


# Submit to AIcrowd 🎯

We need to prepare the models so that the AIcrowd evaluator can talk to them.

Check [this link](https://gitlab.aicrowd.com/aicrowd/challenges/iglu-challenge-2022/iglu-2022-clariq-nlp-starter-kit/-/blob/master/models/README.md) for a detailed explanation.
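Before wiring in the real models, it can help to see the smallest objects that satisfy the two method signatures used by the baseline classes below. The classes here are hypothetical stand-ins, not part of the starter kit: one never asks for clarification, the other ranks by naive word overlap. They also omit the `raise_aicrowd_error` method that the baseline classes define, which a real submission should keep.

```python
class AlwaysClearClassifier:
    """Hypothetical trivial classifier: never asks for clarification."""

    def clarification_required(self, instruction, gridworld_state):
        return 0  # 0 = no clarification needed, 1 = clarification needed


class OverlapRanker:
    """Hypothetical ranker: orders questions by word overlap with the instruction."""

    def rank_questions(self, instruction, gridworld_state, question_bank):
        instruction_words = set(instruction.lower().split())

        def overlap(question):
            return len(instruction_words & set(question.lower().split()))

        # Best-matching question first; ties keep their original order
        return sorted(question_bank, key=overlap, reverse=True)


decision = AlwaysClearClassifier().clarification_required("place the red blocks", None)
ranked = OverlapRanker().rank_questions(
    "place the red blocks", None, ["How many?", "Which red blocks?"]
)
print(decision, ranked)
```

Any model can be swapped in as long as the classifier returns 0 or 1 and the ranker returns the question bank reordered from best to worst.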

## Clone starter kit

In [19]:
!git clone http://gitlab.aicrowd.com/aicrowd/challenges/iglu-challenge-2022/iglu-2022-clariq-nlp-starter-kit.git

Cloning into 'iglu-2022-clariq-nlp-starter-kit'...
remote: Enumerating objects: 219, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 219 (delta 33), reused 34 (delta 16), pack-reused 158
Receiving objects: 100% (219/219), 177.19 KiB | 283.00 KiB/s, done.
Resolving deltas: 100% (115/115), done.


In [20]:
gitfolder = '/content/iglu-2022-clariq-nlp-starter-kit/'

## Copy data and model weights

Note: NLTK data is copied because internet is not available during evaluation

In [21]:
# Make folder for bert model
!mkdir -p {gitfolder}/models/classifiers/bert_baseline/saved_model

# Copy saved tokenizer and model
!cp -r saved_tokenizer/ {gitfolder}/models/classifiers/bert_baseline/
!cp -r saved_model/{model_name}_{epoch}e_{lr}lr/* {gitfolder}/models/classifiers/bert_baseline/saved_model

# Copy nltk data
!mv /root/nltk_data/ {gitfolder}/models/rankers/nltk_data/

## Create the BERT classifier inference class as per AIcrowd's spec


In [22]:
%%writefile {gitfolder}/models/classifiers/bert_baseline.py

import numpy as np
import torch
from transformers import BertTokenizer
from transformers import BertForSequenceClassification

class BERTClassifier:
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained("models/classifiers/bert_baseline/saved_tokenizer")
        self.max_seq_length = 128
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = BertForSequenceClassification.from_pretrained("models/classifiers/bert_baseline/saved_model")
        self.model.to(self.device)

    def raise_aicrowd_error(self, msg):
        """ Will be used by the evaluator to provide logs, DO NOT CHANGE """
        raise NameError(msg)
    
    def clarification_required(self, instruction, gridworld_state):
        """
        Implements classifier for given instruction - whether a clarifying question is required or not
        Inputs:
            instruction - Single instruction string

            gridworld_state - Internal state from the iglu-gridworld simulator corresponding to the instruction
                              NOTE: The state will only contain the "avatarInfo" and "worldEndingState"

        Outputs:
            0 or 1 - 0 if clarification is not required, 1 if clarification is required 

        """

        with torch.no_grad():
            encoded_dict = self.tokenizer.encode_plus(
                                instruction.lower(),
                                add_special_tokens = True, 
                                max_length = self.max_seq_length,           
                                padding = 'max_length',  # replaces the deprecated pad_to_max_length=True
                                truncation=True, 
                                return_attention_mask = True,   
                                return_tensors = 'pt',     
                        )
            inputs = encoded_dict['input_ids'].to(self.device)
            attention_mask = encoded_dict['attention_mask'].to(self.device)
            results = self.model(inputs, attention_mask=attention_mask)
            
        return np.argmax(results.logits.cpu().numpy())

Writing /content/iglu-2022-clariq-nlp-starter-kit//models/classifiers/bert_baseline.py


## Create BM25 ranker class as per AIcrowd's spec


In [23]:
%%writefile {gitfolder}/models/rankers/bm25_baseline.py

from rank_bm25 import BM25Okapi
import nltk
from nltk.stem.porter import PorterStemmer
import numpy as np
import os

nltk.data.path = [os.path.join(os.getcwd(), 'models/rankers/nltk_data')]

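def stem_tokenize(text, remove_stopwords=True):
  """Tokenize text, optionally drop English stopwords, and Porter-stem each token."""
  stemmer = PorterStemmer()
  tokens = [word for sent in nltk.sent_tokenize(text)
            for word in nltk.word_tokenize(sent)]
  if remove_stopwords:
    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
  return [stemmer.stem(word) for word in tokens]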

class BM25Ranker:
    def __init__(self):
        pass

    def raise_aicrowd_error(self, msg):
        """ Will be used by the evaluator to provide logs, DO NOT CHANGE """
        raise NameError(msg)
    
    def rank_questions(self, instruction, gridworld_state, question_bank):
        """
        Implements the ranking function for a given instruction
        Inputs:
            instruction - Single instruction string, may or may not need any clarifying question
                          The evaluator may pass instructions that don't need clarification,
                          but only instructions requiring clarifying questions will be scored

            gridworld_state - Internal state from the iglu-gridworld simulator corresponding to the instruction
                              NOTE: The state will only contain the "avatarInfo" and "worldEndingState"

            question_bank - List of clarifying questions to rank

        Outputs:
            ranks - A sorted list of questions from the question bank
                    Such that the first index corresponds to the best ranked question

        """

        tokenized_questions = [stem_tokenize(q) for q in question_bank]
        token_question_map = {' '.join(tq): q for q, tq in zip(question_bank, tokenized_questions)}
        bm25 = BM25Okapi(tokenized_questions)
        tokenized_instruction = stem_tokenize(instruction, True)
        bm25_ranked_tokenized_questions = bm25.get_top_n(tokenized_instruction, tokenized_questions, n=len(tokenized_questions))
        ranked_joined_sentences = [' '.join(tq) for tq in bm25_ranked_tokenized_questions]
        ranked_question_list = [token_question_map[sent] for sent in ranked_joined_sentences]
        return ranked_question_list

Writing /content/iglu-2022-clariq-nlp-starter-kit//models/rankers/bm25_baseline.py


## Set up our model paths in user_config

In [24]:
%%writefile {gitfolder}/models/user_config.py

from models.classifiers.bert_baseline import BERTClassifier
from models.rankers.bm25_baseline import BM25Ranker

UserClassifer = BERTClassifier
UserRanker = BM25Ranker

Overwriting /content/iglu-2022-clariq-nlp-starter-kit//models/user_config.py


## Set up GPU use during evaluation 💻

In [25]:
%%writefile {gitfolder}/aicrowd.json

{
    "challenge_id": "neurips-2022-iglu-challenge-nlp-task",
    "authors": [
      "iglu-team"
    ],
    "description": "BERT Classifier and BM25 Ranker",
    "gpu": true
  }
  


Overwriting /content/iglu-2022-clariq-nlp-starter-kit//aicrowd.json


## Add libraries we need to install during evaluation

In [26]:
%%writefile -a {gitfolder}/requirements.txt

transformers
rank_bm25
nltk
torch

Appending to /content/iglu-2022-clariq-nlp-starter-kit//requirements.txt


In [27]:
!cp public_data/clarifying_questions_val.csv {gitfolder}/public_data/clarifying_questions_train.csv
!cp public_data/question_bank.csv {gitfolder}/public_data/question_bank.csv

## Check the local evaluator ✅

In [28]:
%cd {gitfolder}
!python local_evaluation.py

/content/iglu-2022-clariq-nlp-starter-kit
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
Moving 0 files to the new cache system
0it [00:00, ?it/s]
Running classifier: 100% 1025/1025 [00:14<00:00, 72.02it/s]
Running ranker: 100% 133/133 [00:36<00:00,  3.62it/s]
Binned F1 Score 0.65
MRR for Ranker 0.34246214084469523
F1 Score for Classifier 0.6937669376693767


## Final submission 🚀

In [29]:
%cd {gitfolder}
!source submit.sh "fhasan - Baseline"

/content/iglu-2022-clariq-nlp-starter-kit
Git setup dont have email defined, setting it to "21979-fhasan8@users.noreply.gitlab.aicrowd.com"
Making submission as "fhasan8"
Checking git remote settings...
Using gitlab.aicrowd.com/fhasan8/iglu-2022-clariq-nlp-starter-kit as the submission repository
Updated git hooks.
Git LFS initialized.
[master 16ff0ab] Changes for submission-BERT-Classifier-and-BM25-Ranker-Official-Baseline
 82 files changed, 2370582 insertions(+), 9 deletions(-)
 create mode 100644 models/classifiers/bert_baseline.py
 create mode 100644 models/classifiers/bert_baseline/saved_model/config.json
 create mode 100644 models/classifiers/bert_baseline/saved_model/pytorch_model.bin
 create mode 100644 models/classifiers/bert_baseline/saved_tokenizer/special_tokens_map.json
 create mode 100644 models/classifiers/bert_baseline/saved_tokenizer/tokenizer_config.json
 create mode 100644 models/classifiers/bert_baseline/saved_tokenizer/vocab.txt
 cr