Inspired by [@jhoward](https://www.kaggle.com/jhoward)'s [notebook](https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster/notebook) and [@asfurnica](https://www.kaggle.com/asfurnica)'s [notebook](https://www.kaggle.com/code/asfurnica/fast-ai-watson-entry).

Augmented data sources (thanks to [@tuckerarrants](https://www.kaggle.com/tuckerarrants)):
- [back translated](https://www.kaggle.com/datasets/tuckerarrants/contradictorywatsontwicetranslatedaug)
- [multiple translations](https://www.kaggle.com/datasets/tuckerarrants/contradictorytranslatedtrain)

In [1]:
!pip install -q --user datasets sentencepiece accelerate


In [2]:
import pandas as pd
import numpy as np

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict, load_metric
from sklearn.model_selection import StratifiedGroupKFold
from scipy import stats as st

In [3]:
path = '../input/contradictory-my-dear-watson'
orig = {
    'train': pd.read_csv(f'{path}/train.csv'),
    'test': pd.read_csv(f'{path}/test.csv'),
}

In [4]:
INCLUDE_FULL_TRANSLATION = True
INCLUDE_BACK_TRANSLATION = True

In [5]:
df = orig['train']
test_df = orig['test']

if INCLUDE_FULL_TRANSLATION:
    path = '../input/contradictorytranslatedtrain'
    full_translations = {
        'en': pd.read_csv(f'{path}/train_en.csv'),
        'bg': pd.read_csv(f'{path}/train_bg.csv'),
        'hi': pd.read_csv(f'{path}/train_hi.csv'),
        'vi': pd.read_csv(f'{path}/train_vi.csv'),
    }
    df = pd.concat((df, *[full_translations[lang] for lang in full_translations.keys()]), ignore_index=True)

if INCLUDE_BACK_TRANSLATION:    
    path = '../input/contradictorywatsontwicetranslatedaug/'
    back_translations = {
        'train': {
            'once': pd.read_csv(f'{path}/translation_aug_train.csv'),
            'twice': pd.read_csv(f'{path}/twice_translated_aug_train.csv'),
            'thrice': pd.read_csv(f'{path}/thrice_translation_aug_train.csv').drop('Unnamed: 0', axis=1),
        },
        'test': {
            'once': pd.read_csv(f'{path}/translation_aug_test.csv'),
            'twice': pd.read_csv(f'{path}/twice_translated_aug_test.csv'),
            'thrice': pd.read_csv(f'{path}/thrice_translation_aug_test.csv').drop('Unnamed: 0', axis=1),
        }
    }
    df = pd.concat((df, *[back_translations['train'][k] for k in back_translations['train'].keys()]), ignore_index=True)
    

df.rename(columns={'label': 'labels'}, inplace=True)
df.shape

(96960, 6)

In [6]:
df.head()

| | id | premise | hypothesis | lang_abv | language | labels |
|---|---|---|---|---|---|---|
| 0 | 5130fd2cb5 | and these comments were considered in formulat... | The rules developed in the interim were put to... | en | English | 0 |
| 1 | 5b72532a0b | These are issues that we wrestle with in pract... | Practice groups are not permitted to work on t... | en | English | 2 |
| 2 | 3931fbe82a | Des petites choses comme celles-là font une di... | J'essayais d'accomplir quelque chose. | fr | French | 0 |
| 3 | 5622f0c60b | you know they can't really defend themselves l... | They can't defend themselves because of their ... | en | English | 0 |
| 4 | 86aaa48b45 | ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด... | เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร | th | Thai | 1 |


In [7]:
df.tail()

| | id | premise | hypothesis | lang_abv | language | labels |
|---|---|---|---|---|---|---|
| 96955 | 2b78e2a914 | Even well-designed epidemiological outcomes ar... | All studies have the same uncertainty for them. | en | English | 2 |
| 96956 | 7e9943d152 | But there are two kinds of joy in doing, and t... | But there are two types of pleasures of doing ... | en | English | 0 |
| 96957 | 5085923e6c | The important thing is that it has taken a lon... | It cannot be moved, not now or ever. | en | English | 2 |
| 96958 | fc8e2fd1fe | At the western end is a detailed model of the ... | The temple complex is located at the eastern end. | en | English | 2 |
| 96959 | 44301dfb14 | Solve yourself? Choose RK, or the father of Tu... | Atatرکrk was the father of the Turkish nation. | en | English | 0 |


In [8]:
np.random.seed(7)

The dataset covers 15 languages: `Arabic, Bulgarian, Chinese, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, and Vietnamese`

In [9]:
model_nm = [
#     'bert-base-multilingual-cased',
#     'microsoft/deberta-v3-small',
#     'facebook/bart-large-mnli',
    'MoritzLaurer/mDeBERTa-v3-base-mnli-xnli',
#     'MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c',
    # 'MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli'
#     'alan-turing-institute/mt5-large-finetuned-mnli-xtreme-xnli',    
][-1]

In [10]:
tkn = AutoTokenizer.from_pretrained(model_nm, use_fast=False)


In [11]:
tkn.all_special_tokens

['[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]']

In [12]:
tkn.sep_token

'[SEP]'

In [13]:
if tkn.all_special_tokens[0].startswith('<'):
    special_token = '<:-:>'
    if not tkn.sep_token:
        sep_token = '<sep>'
        tkn.sep_token = sep_token
    tkn.add_special_tokens({'additional_special_tokens': [special_token, tkn.sep_token]})
else:
    special_token = '[:-:]'
    tkn.add_special_tokens({'additional_special_tokens': [special_token,]})

In [14]:
tkn.all_special_tokens

['[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]', '[:-:]']

In [15]:
df.loc[:, 'sectok'] = special_token
test_df.loc[:, 'sectok'] = special_token

df.loc[:, "inputs"] = (df.sectok + tkn.sep_token + df.premise + tkn.sep_token + df.hypothesis)
test_df.loc[:, "inputs"] = (test_df.sectok + tkn.sep_token + test_df.premise + tkn.sep_token + test_df.hypothesis)
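To make the input format concrete, here is a minimal sketch of the same string concatenation on a toy frame. The toy row and the hard-coded `'[SEP]'` and `'[:-:]'` strings are assumptions standing in for `tkn.sep_token` and `special_token`, so the snippet runs without downloading a tokenizer.

```python
import pandas as pd

# Stand-ins for tkn.sep_token and special_token (assumed values for illustration).
sep_token = '[SEP]'
special_token = '[:-:]'

# A made-up premise/hypothesis pair, not taken from the competition data.
toy = pd.DataFrame({
    'premise': ['The cat sat on the mat.'],
    'hypothesis': ['An animal is on the mat.'],
})

# Same layout as the cell above: marker, separator, premise, separator, hypothesis.
toy['inputs'] = special_token + sep_token + toy.premise + sep_token + toy.hypothesis
example_input = toy.inputs.iloc[0]
print(example_input)
```

The tokenizer still adds its own `[CLS]` and trailing `[SEP]` around this string when it encodes the `inputs` column.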

In [16]:
def tokenize_text(x):
    return tkn(x['inputs'], padding=True, truncation=True) #padding='max_length'

In [17]:
ds = Dataset.from_pandas(df)
ds = ds.map(tokenize_text, batched=True, remove_columns=('premise', 'hypothesis', 'language', 'inputs', 'id', 'lang_abv', 'sectok'))


In [18]:
test_ds = Dataset.from_pandas(test_df)
test_ds = test_ds.map(tokenize_text, batched=True, remove_columns=('premise', 'hypothesis', 'language', 'lang_abv', 'inputs', 'sectok'))


In [19]:
ds

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 96960
})

In [20]:
accuracy = load_metric('accuracy')
# f1 = load_metric('f1')

def compute_metrics(pred):
    labels=pred.label_ids
    preds=pred.predictions.argmax(1)
#     f1_score=f1.compute(predictions=preds, references=labels, average='weighted')
    ac_score=accuracy.compute(predictions=preds, references=labels)
    print(ac_score)
    return ac_score
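For reference, the accuracy reported by `compute_metrics` can be reproduced with plain NumPy. This is only a sketch of the same calculation (argmax over the class axis, then the fraction of matches); `accuracy_from_logits` and the toy arrays are hypothetical names and values, not part of the notebook.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = np.asarray(logits).argmax(axis=1)
    return float((preds == np.asarray(labels)).mean())

# Four made-up rows of logits over the three NLI classes.
toy_logits = np.array([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 1.5],
                       [0.9, 1.1, 0.4],
                       [0.3, 0.2, 0.1]])
toy_labels = np.array([0, 2, 0, 0])

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```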


In [21]:
def get_model():
    return (AutoModelForSequenceClassification
            .from_pretrained(model_nm, num_labels=3,
#                              hidden_dropout_prob=0.15,
#                              attention_probs_dropout_prob=0.15
                            ))

def get_trainer(dds, model=None):
    if model is None: model = get_model()

    # lr, bs, epochs and wd are notebook-level globals set in a later cell,
    # so that cell must run before get_trainer is called.
    args = TrainingArguments(
        'outputs',
        learning_rate=lr, weight_decay=wd, warmup_ratio=0.1, lr_scheduler_type='cosine',
        num_train_epochs=epochs, fp16=True, group_by_length=True, label_smoothing_factor=0.2,
        per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2, auto_find_batch_size=True,
        evaluation_strategy='epoch', save_strategy='epoch', save_total_limit=1,
        load_best_model_at_end=True, metric_for_best_model='accuracy', greater_is_better=True,
        report_to='none',
    )

    return Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['valid'],
                   tokenizer=tkn, compute_metrics=compute_metrics)
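The `label_smoothing_factor=0.2` setting changes the training loss, which is one reason the reported loss values stay well above zero even when accuracy is high. The sketch below shows one common formulation of label-smoothed cross-entropy in NumPy; I believe it matches what `Trainer` applies, but treat that as an assumption. `smoothed_cross_entropy` is a hypothetical helper, not a library function.

```python
import numpy as np

def smoothed_cross_entropy(logits, labels, epsilon=0.2):
    """(1 - epsilon) * NLL of the true class + epsilon * mean NLL over all classes."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Standard cross-entropy term: negative log-probability of the true class.
    nll = -log_probs[np.arange(len(labels)), labels].mean()
    # Smoothing term: negative log-probability averaged over every class.
    uniform = -log_probs.mean(axis=1).mean()
    return (1 - epsilon) * nll + epsilon * uniform

# Two confident, correct predictions (made-up logits).
confident = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
targets = np.array([0, 1])
plain_loss = smoothed_cross_entropy(confident, targets, epsilon=0.0)
smooth_loss = smoothed_cross_entropy(confident, targets, epsilon=0.2)
print(plain_loss, smooth_loss)
```

With smoothing, very confident predictions are penalised through the second term, so the smoothed loss is larger than the plain cross-entropy on the same logits.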

In [22]:
def get_fold(folds, fold_num):
    trn_idxs, val_idxs = folds[fold_num]
    return DatasetDict({
        'train': ds.select(trn_idxs),
        'valid': ds.select(val_idxs)
    })

In [23]:
def train(ds):
    model = get_model()
    model.resize_token_embeddings(len(tkn))

    print(f'total rows: {ds.num_rows}')
    trainer = get_trainer(ds, model)
    trainer.train()
    return trainer

In [24]:
def ensemble_train(df, n_folds=4):
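    cv = StratifiedGroupKFold(n_splits=n_folds)
    df = df.sample(frac=1)
    # cv.split yields positions into the shuffled frame, so map them back to the
    # original index labels, which are the row positions in `ds`.
    idx = df.index.values
    folds = [(idx[trn], idx[val]) for trn, val in cv.split(idx, df.labels, groups=df.lang_abv)]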

    trainers = []
    for i in range(n_folds):
        print(f'training with fold:{i}')
        dds = get_fold(folds, i)
        model = get_model()
        model.resize_token_embeddings(len(tkn))
        trainer = get_trainer(dds, model)
        trainer.train()
        trainers.append(trainer)
    
    return trainers

In [25]:
def train_valid_split(df, valid_pct=0.25):

    valid_idxs = df.groupby('lang_abv').apply(lambda x: x.sample(frac=valid_pct)).droplevel(0).index.values
    train_idxs = np.setdiff1d(df.index.values, valid_idxs)

    np.random.shuffle(train_idxs)
    np.random.shuffle(valid_idxs)

    dds = DatasetDict({
        'train': ds.select(train_idxs),
        'valid': ds.select(valid_idxs)
    })
    
    return dds
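The per-language sampling in `train_valid_split` can be checked on a toy frame. This is a sketch under assumptions: the 12 made-up rows, the fixed `random_state`, and the explicit loop over groups (instead of `groupby(...).apply(...)`) are illustrative choices, not the notebook's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Toy data: 8 English rows and 4 French rows with random labels.
toy = pd.DataFrame({
    'lang_abv': ['en'] * 8 + ['fr'] * 4,
    'labels': rng.integers(0, 3, size=12),
})

valid_pct = 0.25
# Sample valid_pct of the rows inside each language, keeping the original index labels.
valid_idxs = np.concatenate([
    grp.sample(frac=valid_pct, random_state=7).index.values
    for _, grp in toy.groupby('lang_abv')
])
# Everything not sampled for validation goes to training.
train_idxs = np.setdiff1d(toy.index.values, valid_idxs)

print(len(train_idxs), len(valid_idxs))
print(toy.loc[valid_idxs, 'lang_abv'].value_counts().to_dict())
```

The index labels can be passed to `ds.select` only because `df` has a default `RangeIndex`, so each label equals the row position in `ds`.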

In [26]:
train_df = df.sample(frac=.5)
dds = train_valid_split(train_df)
dds

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 36361
    })
    valid: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 12119
    })
})

In [27]:
bs = 16
epochs = 1
lr = 4e-5
wd=0.02

# ENSEMBLE = True
ENSEMBLE = False
n_folds = 2
MODE_OR_MEAN = 'mode'  #valid only if ENSEMBLE == True

In [28]:
trainer = ensemble_train(train_df, n_folds=n_folds) if ENSEMBLE else train(dds)


total rows: {'train': 36361, 'valid': 12119}


Using cuda_amp half precision backend
***** Running training *****
  Num examples = 36361
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1137


Epoch,Training Loss,Validation Loss,Accuracy
1,1.3631,1.488931,0.815166


***** Running Evaluation *****
  Num examples = 12119
  Batch size = 64


{'accuracy': 0.8151662678438815}


Saving model checkpoint to outputs/checkpoint-1137
Configuration saved in outputs/checkpoint-1137/config.json
Model weights saved in outputs/checkpoint-1137/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-1137/tokenizer_config.json
Special tokens file saved in outputs/checkpoint-1137/special_tokens_map.json
added tokens file saved in outputs/checkpoint-1137/added_tokens.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from outputs/checkpoint-1137 (score: 0.8151662678438815).


# Predictions

In [29]:
if not ENSEMBLE:
    preds = trainer.predict(test_ds).predictions.argmax(1)
else:
    # np.column_stack needs a sequence (list or tuple), not a generator.
    preds = np.column_stack([tr.predict(test_ds).predictions.argmax(1) for tr in trainer])
    if MODE_OR_MEAN == 'mode':
        preds = st.mode(preds, axis=1).mode.ravel()
    else: # mean
        preds = np.round(preds.mean(axis=1)).astype(int)    

The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: id. If id are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 5195
  Batch size = 64
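The two ensemble modes can behave quite differently, because the classes are categories rather than an ordered scale. The sketch below compares a majority vote, the rounded mean of class indices (what the `'mean'` branch computes), and averaging logits before the argmax. Logit averaging is an alternative that this notebook does not use, and the toy logits are made up.

```python
import numpy as np

# Hypothetical logits with shape (n_models=3, n_rows=4, n_classes=3).
toy_logits = np.array([
    [[3, 0, 0], [2, 0, 1], [0, 3, 0], [0, 0, 3]],   # model A
    [[3, 0, 0], [0, 0, 2], [0, 3, 0], [0, 0, 3]],   # model B
    [[3, 0, 0], [0, 0, 2], [0, 3, 0], [2, 0, 1]],   # model C
], dtype=float)

# One column of hard predictions per model, as np.column_stack produces above.
votes = toy_logits.argmax(axis=2).T

# 'mode': the most frequent class per row.
majority = np.array([np.bincount(row, minlength=3).argmax() for row in votes])
# 'mean': rounded average of class indices (treats labels as ordinal).
mean_of_labels = np.round(votes.mean(axis=1)).astype(int)
# Alternative: average the logits across models, then take the argmax.
mean_of_logits = toy_logits.mean(axis=0).argmax(axis=1)

print(majority, mean_of_labels, mean_of_logits)
```

In the second row the models vote 0, 2, 2. The rounded mean gives class 1, which no model predicted, while the majority vote and the logit average both give class 2.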


# Submission

In [30]:
pd.DataFrame({'id': test_ds['id'], 'prediction': preds}).to_csv('submission.csv', index=False)