# PolEval2021 evaluation

This notebook evaluates PolEval2021 dataset on pretrained models: 
- **bert-base-multilingual-cased-finetuned-polish-squad1**
- **bert-base-multilingual-cased-finetuned-polish-squad2**

## Imports

In [1]:
pip install editdistance

Collecting editdistance
  Downloading editdistance-0.6.0-cp37-cp37m-manylinux2010_x86_64.whl (285 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m285.6/285.6 kB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: editdistance
Successfully installed editdistance-0.6.0
[0mNote: you may need to restart the kernel to use updated packages.


In [2]:
import re
from collections import namedtuple
from itertools import product
from copy import deepcopy

import pandas as pd
import numpy as np
import torch

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import pipeline

import editdistance

## Data

In [3]:
def load_dataset(path_w_context, path_ans):
    def lextend_answers(data, path):
        with open(path, 'r') as temp_f:
            col = [ l.replace('\n', '').split("\t") for l in temp_f.readlines() ]
            data["answer"] = col

        return data

    data = pd.read_csv(path_w_context, sep='\t', header=None)
    data[['question', 'context']] = data[0].str.split('context', 1, expand=True)
    data = data.drop(columns=[0,1])
    data = lextend_answers(data, path_ans)
    
    return data

## Loss

In [4]:
def poleval_acc(preds, gtss):
    def numerical_similarity(p, gt):
        numerical_regex = '[0-9]+'
        p_num = "".join(re.findall(numerical_regex, p))
        gt_num = "".join(re.findall(numerical_regex, gt))
        
        return p_num != "" and gt_num != "" and p_num == gt_num
        
        
    assert len(preds) == len(gtss)
    
    scores = []
    for p, gts in zip(preds, gtss):
        min_score = False
        
        for gt in gts:
            score = editdistance.eval(str.lower(p), str.lower(gt)) < 0.5 * len(gt)
            score = score or numerical_similarity(p, gt)
            min_score = score or min_score
        
        scores.append(score)
    
    return np.count_nonzero(scores), 100 * round(np.count_nonzero(scores) / len(preds), 4)

## Model

In [5]:
# mtp = model, tokenizer, pipeline
def create_mtp(model_name):
    MTP = namedtuple("MTP", ["model", "tokenizer", "pipeline"])
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    
    qa_pipeline = pipeline(
        "question-answering",
        model=model,
        tokenizer=tokenizer,
        device=0 if torch.cuda.is_available() else -1
    )
    
    return MTP(model, tokenizer, qa_pipeline)

## Evaluation

In [6]:
def evaluate_mtp(mtp, df, questions_limit=None):
    if questions_limit:
        df = df[df.index < questions_limit]
    
    predictions = mtp.pipeline(df[['question', 'context']].to_dict(orient='list'))
    gt_answers = df[['answer']].to_dict(orient='list')['answer']
    preds_ans = [p['answer'] for p in predictions]
    if 'label2' in df:
        for i, (category, pre)  in enumerate(zip(df['label2'], preds_ans)):
            if category == 2 and str.lower(pre) != "tak" and str.lower(pre) != "nie":
                preds_ans[i] = "tak"
    
    no_correct, accuracy = poleval_acc(preds_ans, gt_answers)
    
    return preds_ans, no_correct, accuracy

## Test pretrained models

In [7]:
def test_all_configurations(mtps, dfs, verbose=True):
    results = {}
    
    for mtp, df in product(mtps, dfs):
        config = (mtp.model.name_or_path.split('-')[-1], df.name)
        preds_ans, no_correct, accuracy = evaluate_mtp(mtp, df)

        results[config] = {"predicted_answers": preds_ans,
                           "no_correct": no_correct,
                           "accuracy": accuracy}
        
        if verbose:
            print(f"{config} correct answers: {no_correct} / {len(df)} ({accuracy}%)")
    
    return results

In [8]:
dev_df = load_dataset("../input/poleval2021-with-context/dev-0-input-510.tsv",
                      "../input/poleval2021/dev-0/expected.tsv")
dev_df.name = "Dev"

test_a_df = load_dataset("../input/poleval2021-with-context/test-A-input-510.tsv",
                         "../input/poleval2021/test-A/expected.tsv")
test_a_df.name = "Test A"

test_b_df = load_dataset("../input/poleval2021-with-context/test-B-input-510.tsv",
                         "../input/poleval2021-test-b-expected/test_b_expected.tsv")
test_b_df.name = "Test B"

test_a_df.head()

Unnamed: 0,question,context,answer
0,Czy poeta Lucjan Rydel tworzył także sztuki te...,"Stefan Rydel (senator) Lucjan Rydel, lekarz o...",[tak]
1,W którym państwie została ogłoszona „Deklaracj...,Konstytucja dyrektorialna Deklaracja Praw i O...,[we Francji]
2,Która kawa zawiera alkohol: po turecku czy po ...,Kawa po turecku Parzenie kawy po turecku Do p...,[po irlandzku]
3,W którym mieście zmarł Sławomir Mrożek?,1992 w literaturze Język polski Sławomir Mroż...,[w Nicei]
4,Jak nazywał się autor powieści „Wierna rzeka”?,Łosośna (dopływ Białej Nidy) Łosośna upamiętn...,[Stefan Żeromski]


In [9]:
mtp_squad1 = create_mtp("henryk/bert-base-multilingual-cased-finetuned-polish-squad1")
mtp_squad2 = create_mtp("henryk/bert-base-multilingual-cased-finetuned-polish-squad2")

Downloading:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/700 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/679M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/700 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/679M [00:00<?, ?B/s]

In [10]:
# results = test_all_configurations([mtp_squad1, mtp_squad2], [dev_df, test_a_df, test_b_df])

In [11]:
# results

## Question classificationresults = test_all_configurations([mtp_squad1, mtp_squad2], [dev_df, test_a_df, test_b_df]) 

In [12]:
class ProblemClassifier:
    @staticmethod
    def categorize_df(df, verbose=True):
        df_cp = deepcopy(df)
        df_cp = ProblemClassifier.categorize(df_cp, verbose)
        df_cp = ProblemClassifier.categorize_based_on_question(df_cp, verbose)
        
        return df_cp
    
    @staticmethod
    def categorize(df, verbose=True):
        def containsNumber(value):
            return True in [char.isdigit() for char in value.answer]
        
        df['label'] = 0
        df['label'][df.apply(containsNumber, axis=1)] = 1
        df['label'][df.answer.isin(['tak', 'nie'])] = 2
        return df
    
    @staticmethod
    def categorize_based_on_question(df, verbose=True):
        def containsCzy(value):
            return "Czy " in value.question and not " czy " in value.question
        def containsIle(value):
            return True in [word in value.question.lower() for word in ["ile ", "kiedy ", "ilu ", " wieku ", "w którym roku "]]
            
        df['label2'] = 0
        df['label2'][df.apply(containsIle, axis=1) ] = 1
        df['label2'][df.apply(containsCzy, axis=1) ] = 2
        return df

In [13]:
dev_cl_df = ProblemClassifier.categorize_df(dev_df)
dev_cl_df.name = "Dev classification"
test_a_cl_df = ProblemClassifier.categorize_df(test_a_df)
test_a_cl_df.name = "Test A classification"
test_b_cl_df = ProblemClassifier.categorize_df(test_b_df)
test_b_cl_df.name = "Test B classification"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [14]:
results = test_all_configurations([mtp_squad1, mtp_squad2], [dev_df, test_a_df, test_b_df, dev_cl_df, test_a_cl_df, test_b_cl_df])

  tensor = as_tensor(value)
  for span_id in range(num_spans)


('squad1', 'Dev') correct answers: 175 / 1000 (17.5%)
('squad1', 'Test A') correct answers: 430 / 2500 (17.2%)
('squad1', 'Test B') correct answers: 480 / 2500 (19.2%)
('squad1', 'Dev classification') correct answers: 224 / 1000 (22.400000000000002%)
('squad1', 'Test A classification') correct answers: 565 / 2500 (22.6%)
('squad1', 'Test B classification') correct answers: 576 / 2500 (23.04%)
('squad2', 'Dev') correct answers: 163 / 1000 (16.3%)
('squad2', 'Test A') correct answers: 393 / 2500 (15.72%)
('squad2', 'Test B') correct answers: 438 / 2500 (17.52%)
('squad2', 'Dev classification') correct answers: 213 / 1000 (21.3%)
('squad2', 'Test A classification') correct answers: 528 / 2500 (21.12%)
('squad2', 'Test B classification') correct answers: 534 / 2500 (21.36%)
