## Feature analysis

This notebook operates on the data generated by the SubTaskA-zeroshot and SubTaskA-zeroshot-bert notebooks:
- The stored features are loaded from the data directory for the training, development, evaluation and test sets.
- The BERT model data is loaded from the models directory.

The models are combined in various ways. Baselines for various features are also calculated.

The external outputs of this notebook include:
- CSV files for submission to CodaLab
- LaTeX tables for results and examples

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width: 98% !important; }</style>"))
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [None]:
import sys
import os
import re
import numpy as np
import pandas as pd

from tqdm.notebook import tqdm
from sklearn.metrics import f1_score

In [None]:
from lib import util, embeds, fitter, masker, features, sentiment, translate

In [None]:
datapath = 'SemEval_2022_Task2-idiomaticity/SubTaskA/Data'
testpath = 'SemEval_2022_Task2-idiomaticity/SubTaskA/TestData'
multilingual_model = 'distiluse-base-multilingual-cased-v1'

In [None]:
frames = util.load_csv_dataframes(datapath)
tframes = util.load_csv_dataframes(testpath)

In [None]:
zdf = frames['train_zero_shot.csv']
odf = frames['train_one_shot.csv']
ddf = frames['dev.csv']
ddf_gold = frames['dev_gold.csv']
edf = frames['eval.csv']
tdf = tframes['test.csv']

In [None]:
z_emb_multi = embeds.get_embeddings(zdf, modelname=multilingual_model, append=['MWE'])
d_emb_multi = embeds.get_embeddings(ddf, modelname=multilingual_model, append=['MWE'])
e_emb_multi = embeds.get_embeddings(edf, modelname=multilingual_model, append=['MWE'])
t_emb_multi = embeds.get_embeddings(tdf, modelname=multilingual_model, append=['MWE'])

#### Sentence-transformers

In [None]:
z_score, z_probs, z_results = fitter.get_fit_results(z_emb_multi, zdf['Label'], d_emb_multi, ddf_gold['Label'])

In [None]:
z_score

In [None]:
f1_score(z_results, ddf_gold['Label'], average='macro')

Using only sentence-transformers embeddings results in a score of 61.4%.

Create a result dataframe.

In [None]:
resdf_dev = pd.DataFrame(columns=['Name', 'Score', 'ScoreEN', 'ScorePT'])

### Reload stored data

These are the stored features for the feature model.

In [None]:
zdf_bt3 = pd.read_pickle('data/zdf_bt3_20220104_1.pkl')
ddf_bt3 = pd.read_pickle('data/ddf_bt3_20220104_1.pkl')
edf_bt3 = pd.read_pickle('data/edf_bt3_20220105_1.pkl')
tdf_bt3 = pd.read_pickle('data/tdf_bt3_20220111_1.pkl')

#### Baselines

##### Training set

In [None]:
f1_score(zdf_bt3['Hassub'].map({True: '1', False: '0'}), zdf_bt3['Label'], average='macro')

In [None]:
f1_score(zdf_bt3['Trans'].map({True: '1', False: '0'}), zdf_bt3['Label'], average='macro')

In [None]:
z_sentiment_mean_en = zdf_bt3[zdf_bt3['Language'] == 'EN']['Sentiment'].mean()
z_sentiment_mean_pt = zdf_bt3[zdf_bt3['Language'] == 'PT']['Sentiment'].mean()
(z_sentiment_mean_en, z_sentiment_mean_pt)

In [None]:
zdf_s = zdf_bt3.copy()
zdf_s.loc[(zdf_s['Language'] == 'EN') & (zdf_s['Sentiment'] > z_sentiment_mean_en), 'SentM'] = '1'
zdf_s.loc[(zdf_s['Language'] == 'EN') & (zdf_s['Sentiment'] <= z_sentiment_mean_en), 'SentM'] = '0'
zdf_s.loc[(zdf_s['Language'] == 'PT') & (zdf_s['Sentiment'] > z_sentiment_mean_pt), 'SentM'] = '1'
zdf_s.loc[(zdf_s['Language'] == 'PT') & (zdf_s['Sentiment'] <= z_sentiment_mean_pt), 'SentM'] = '0'

In [None]:
f1_score(zdf_s['SentM'], zdf_s['Label'], average='macro')

In [None]:
zbase_majority = np.max([np.sum(zdf['Label'].astype('int')), len(zdf) - np.sum(zdf['Label'].astype('int'))])/len(zdf)
zbase_majority

##### Development set

In [None]:
def get_lang(df: pd.DataFrame, lang) -> pd.DataFrame:
    return df[df['Language'] == lang]

In [None]:
base_hassub = f1_score(ddf_bt3['Hassub'].map({True: '1', False: '0'}), ddf_gold['Label'], average='macro')
base_trans = f1_score(ddf_bt3['Trans'].map({True: '1', False: '0'}), ddf_gold['Label'], average='macro')
sentiment_mean_en = ddf_bt3[ddf_bt3['Language'] == 'EN']['Sentiment'].mean()
sentiment_mean_pt = ddf_bt3[ddf_bt3['Language'] == 'PT']['Sentiment'].mean()
(sentiment_mean_en, sentiment_mean_pt)

In [None]:
base_hassub_en = f1_score(get_lang(ddf_bt3, 'EN')['Hassub'].map({True: '1', False: '0'}), get_lang(ddf_gold, 'EN')['Label'], average='macro')
base_hassub_pt = f1_score(get_lang(ddf_bt3, 'PT')['Hassub'].map({True: '1', False: '0'}), get_lang(ddf_gold, 'PT')['Label'], average='macro')
base_trans_en = f1_score(get_lang(ddf_bt3, 'EN')['Trans'].map({True: '1', False: '0'}), get_lang(ddf_gold, 'EN')['Label'], average='macro')
base_trans_pt = f1_score(get_lang(ddf_bt3, 'PT')['Trans'].map({True: '1', False: '0'}), get_lang(ddf_gold, 'PT')['Label'], average='macro')

In [None]:
ddf_s = ddf_bt3.copy()
ddf_s.loc[(ddf_s['Language'] == 'EN') & (ddf_s['Sentiment'] > sentiment_mean_en), 'SentM'] = '1'
ddf_s.loc[(ddf_s['Language'] == 'EN') & (ddf_s['Sentiment'] <= sentiment_mean_en), 'SentM'] = '0'
ddf_s.loc[(ddf_s['Language'] == 'PT') & (ddf_s['Sentiment'] > sentiment_mean_pt), 'SentM'] = '1'
ddf_s.loc[(ddf_s['Language'] == 'PT') & (ddf_s['Sentiment'] <= sentiment_mean_pt), 'SentM'] = '0'
ddf_s.loc[ddf_s['Sentiment'] > ddf_bt3['Sentiment'].mean(), 'SentM2'] = '1'
ddf_s.loc[ddf_s['Sentiment'] <= ddf_bt3['Sentiment'].mean(), 'SentM2'] = '0'

Since the sentiment distribution is quite different for English and Portuguese, we want to use different mean values for each language. A score above the mean is taken as literal.
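
The per-language thresholding can also be written with a grouped transform. The following is a minimal sketch on a made-up toy frame, not the notebook's data. The column names `Language`, `Sentiment` and `SentM` follow the cells above, while the helper name `sentiment_baseline` is hypothetical.

```python
import pandas as pd


def sentiment_baseline(df: pd.DataFrame) -> pd.Series:
    """Label a row '1' (literal) if its sentiment is above its language's mean, else '0'."""
    # Mean sentiment of each row's own language, aligned with df's index.
    lang_mean = df.groupby('Language')['Sentiment'].transform('mean')
    return (df['Sentiment'] > lang_mean).map({True: '1', False: '0'})


# Toy data (made up for illustration only).
toy = pd.DataFrame({
    'Language': ['EN', 'EN', 'PT', 'PT'],
    'Sentiment': [0.2, 0.6, 0.7, 0.9],
})
toy['SentM'] = sentiment_baseline(toy)
```

One difference from the `.loc` assignments: a row with a missing sentiment value would be labelled '0' here instead of being left as NaN.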

In [None]:
# Total sentiment mean
f1_score(ddf_s['SentM2'], ddf_gold['Label'], average='macro')

In [None]:
# Different mean values for English and Portuguese
base_sentiment = f1_score(ddf_s['SentM'], ddf_gold['Label'], average='macro')
base_sentiment

In [None]:
base_sentiment_en = f1_score(get_lang(ddf_s, 'EN')['SentM'], get_lang(ddf_gold, 'EN')['Label'], average='macro')
base_sentiment_pt = f1_score(get_lang(ddf_s, 'PT')['SentM'], get_lang(ddf_gold, 'PT')['Label'], average='macro')
base_sentiment_en

In [None]:
base_majority = np.max([np.sum(ddf_gold['Label'].astype('int')), len(ddf_gold) - np.sum(ddf_gold['Label'].astype('int'))])/len(ddf_gold)
base_majority

In [None]:
base_majority_en = np.max([np.sum(get_lang(ddf_gold, 'EN')['Label'].astype('int')), len(get_lang(ddf_gold, 'EN')) - np.sum(get_lang(ddf_gold, 'EN')['Label'].astype('int'))])/len(get_lang(ddf_gold, 'EN'))
base_majority_pt = np.max([np.sum(get_lang(ddf_gold, 'PT')['Label'].astype('int')), len(get_lang(ddf_gold, 'PT')) - np.sum(get_lang(ddf_gold, 'PT')['Label'].astype('int'))])/len(get_lang(ddf_gold, 'PT'))
base_majority_en

In [None]:
resdf_dev.loc['base_hassub'] = ['Baseline: hassub', base_hassub, base_hassub_en, base_hassub_pt]
resdf_dev.loc['base_trans'] = ['Baseline: trans', base_trans, base_trans_en, base_trans_pt]
resdf_dev.loc['base_sentiment'] = ['Baseline: sentiment', base_sentiment, base_sentiment_en, base_sentiment_pt]
resdf_dev.loc['base_majority'] = ['Baseline: majority', base_majority, base_majority_en, base_majority_pt]

In [None]:
en_idx = ddf_gold['Language'] == 'EN'
pt_idx = ddf_gold['Language'] == 'PT'

In [None]:
f1_score(z_results[en_idx], get_lang(ddf_gold, 'EN')['Label'], average='macro')

In [None]:
f1_score(z_results[pt_idx], get_lang(ddf_gold, 'PT')['Label'], average='macro')

In [None]:
resdf_dev.loc['sbert'] = ['Sentence transformers',
                          f1_score(z_results, ddf_gold['Label'], average='macro'),
                          f1_score(z_results[en_idx], get_lang(ddf_gold, 'EN')['Label'], average='macro'),
                          f1_score(z_results[pt_idx], get_lang(ddf_gold, 'PT')['Label'], average='macro')
                         ]

In [None]:
resdf_dev

#### Sentence transformers + feature model

Classification with the selected features.

In [None]:
# dropcols = ['Top score', 'FS', 'SS', 'Quotes', 'MWEdiff']
# dropcols = ['Hassub', 'FS', 'Nextdiff']
# dropcols = ['MWEdiff', 'FS']
# dropcols = ['Top score 1', 'Top score 2', 'SS', 'FS', 'MWEdiff']
dropcols = ['Top score 1', 'FS', 'MWEdiff']
zdf_t4 = fitter.get_trainable(zdf_bt3).drop(dropcols, axis=1)
ddf_t4 = fitter.get_trainable(ddf_bt3).drop(dropcols, axis=1)
ddf5_feat_score, ddf5_feat_probs, ddf5_feat_results = fitter.get_fit_results(zdf_t4, zdf['Label'], ddf_t4, ddf_gold['Label'])

In [None]:
mup = fitter.multi_results(ddf_bt3, ddf_gold, z_results, z_probs, ddf5_feat_results, ddf5_feat_probs, ['Caps', 'Hassub'],['Quotes'])
f1_score(mup['Prediction'], mup['Label'], average='macro')

In [None]:
mup_x = fitter.multi_results(ddf_bt3, ddf_gold, z_results, z_probs, ddf5_feat_results, ddf5_feat_probs, ['Caps'],['Quotes'])
f1_score(mup_x['Prediction'], mup_x['Label'], average='macro')

In [None]:
resdf_dev.loc['sbert_feat'] = ['Sentence transformers + feature',
                               f1_score(mup['Prediction'], mup['Label'], average='macro'),
                               f1_score(get_lang(mup, 'EN')['Prediction'], get_lang(mup, 'EN')['Label'], average='macro'),
                               f1_score(get_lang(mup, 'PT')['Prediction'], get_lang(mup, 'PT')['Label'], average='macro')
                              ]

In [None]:
ddf_sub = frames['dev_submission_format.csv']
ddf_res = ddf_sub.copy()
ddf_res.loc[ddf_res['Setting'] == 'zero_shot', 'Label'] = mup['Prediction']
# ddf_res.to_csv('data/ddf_sub_20220121_1.csv', index=False)

#### Normalization of sentiment scores

Since the sentiment distribution is different for each language, let's see if normalizing sentiment helps.

In [None]:
print(zdf_bt3[zdf_bt3['Language'] == 'EN']['Sentiment'].mean())
print(zdf_bt3[zdf_bt3['Language'] == 'PT']['Sentiment'].mean())
sentdiff = zdf_bt3[zdf_bt3['Language'] == 'PT']['Sentiment'].mean() - zdf_bt3[zdf_bt3['Language'] == 'EN']['Sentiment'].mean()
print(sentdiff)

In [None]:
zdf_bt4 = zdf_bt3.copy()
zdf_bt4['SentNorm'] = zdf_bt4['Sentiment']
zdf_bt4.loc[zdf_bt4['Language'] == 'PT', 'SentNorm'] -= sentdiff

In [None]:
print(ddf_bt3[ddf_bt3['Language'] == 'EN']['Sentiment'].mean())
print(ddf_bt3[ddf_bt3['Language'] == 'PT']['Sentiment'].mean())
sentdiff_d = ddf_bt3[ddf_bt3['Language'] == 'PT']['Sentiment'].mean() - ddf_bt3[ddf_bt3['Language'] == 'EN']['Sentiment'].mean()
print(sentdiff_d)

There's a difference of roughly 0.08-0.09 between the sentiment means. Let's use that for normalization.
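
As a rough illustration of this shift, the sketch below measures the PT-minus-EN mean difference on a reference frame and subtracts it from the PT rows, so that both language means coincide on the reference data. The toy values are made up, and the helper names `language_mean_diff` and `shift_language_mean` are hypothetical.

```python
import pandas as pd


def language_mean_diff(ref: pd.DataFrame) -> float:
    """PT mean minus EN mean of the Sentiment column in a reference frame."""
    means = ref.groupby('Language')['Sentiment'].mean()
    return float(means['PT'] - means['EN'])


def shift_language_mean(df: pd.DataFrame, diff: float) -> pd.DataFrame:
    """Return a copy of df with a SentNorm column where PT rows are shifted by -diff."""
    out = df.copy()
    out['SentNorm'] = out['Sentiment']
    out.loc[out['Language'] == 'PT', 'SentNorm'] -= diff
    return out


# Toy reference data (made up for illustration only).
train = pd.DataFrame({
    'Language': ['EN', 'EN', 'PT', 'PT'],
    'Sentiment': [0.25, 0.75, 0.5, 1.0],
})
diff = language_mean_diff(train)
normed = shift_language_mean(train, diff)
norm_means = normed.groupby('Language')['SentNorm'].mean()
```

Applying the difference measured on the training frame to other sets keeps the normalization independent of the data being evaluated.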

In [None]:
ddf_bt4 = ddf_bt3.copy()
ddf_bt4['SentNorm'] = ddf_bt4['Sentiment']
ddf_bt4.loc[ddf_bt4['Language'] == 'PT', 'SentNorm'] -= sentdiff

In [None]:
dropcols_s = ['Top score 1', 'FS', 'MWEdiff', 'Sentiment']
zdf_t6 = fitter.get_trainable(zdf_bt4).drop(dropcols_s, axis=1)
ddf_t6 = fitter.get_trainable(ddf_bt4).drop(dropcols_s, axis=1)
ddf7_feat_score, ddf7_feat_probs, ddf7_feat_results = fitter.get_fit_results(zdf_t6, zdf['Label'], ddf_t6, ddf_gold['Label'])
mup_s = fitter.multi_results(ddf_bt3, ddf_gold, z_results, z_probs, ddf7_feat_results, ddf7_feat_probs, ['Caps', 'Hassub'],['Quotes'])
f1_score(mup_s['Prediction'], mup_s['Label'], average='macro')

Normalization of the Sentiment scores doesn't seem to help.

#### One-hot language variables

In [None]:
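zdf_bt5 = pd.get_dummies(zdf_bt3, columns=['Language'])
ddf_bt5 = pd.get_dummies(ddf_bt3, columns=['Language'])
# NOTE: the next line overwrites the one-hot encoded dev frame, so only the
# training frame ends up with one-hot Language columns.
ddf_bt5 = ddf_bt3.copy()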

In [None]:
dropcols_m = ['Top score 1', 'FS', 'MWEdiff', 'Top score', 'Top score 2']
zdf_t7 = fitter.get_trainable(zdf_bt5).drop(dropcols_m, axis=1)
ddf_t7 = fitter.get_trainable(ddf_bt5).drop(dropcols_m, axis=1)
ddf8_feat_score, ddf8_feat_probs, ddf8_feat_results = fitter.get_fit_results(zdf_t7, zdf['Label'], ddf_t7, ddf_gold['Label'])
mup_m = fitter.multi_results(ddf_bt3, ddf_gold, z_results, z_probs, ddf8_feat_results, ddf8_feat_probs, ['Caps', 'Hassub'],['Quotes'])
f1_score(mup_m['Prediction'], mup_m['Label'], average='macro')

Encoding the languages as one-hot variables doesn't change the results.
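
When the training and development frames are one-hot encoded separately, both need the same dummy columns in the same order before they are passed to a classifier. A minimal sketch on made-up toy data, assuming a `Language` column as above; the helper name `one_hot_aligned` is hypothetical.

```python
import pandas as pd


def one_hot_aligned(train: pd.DataFrame, other: pd.DataFrame, column: str):
    """One-hot encode `column` in both frames and align `other` to train's columns."""
    train_oh = pd.get_dummies(train, columns=[column])
    other_oh = pd.get_dummies(other, columns=[column])
    # Add dummy columns missing from `other` (filled with 0) and drop extras.
    other_oh = other_oh.reindex(columns=train_oh.columns, fill_value=0)
    return train_oh, other_oh


# Toy data (made up): the dev frame happens to contain no PT rows.
train = pd.DataFrame({'Language': ['EN', 'PT'], 'Sentiment': [0.1, 0.2]})
dev = pd.DataFrame({'Language': ['EN', 'EN'], 'Sentiment': [0.3, 0.4]})
train_oh, dev_oh = one_hot_aligned(train, dev, 'Language')
```

The `reindex` step matters mostly when one split lacks a category; with both EN and PT present in every split, `get_dummies` alone already produces matching columns.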

### Results from BERT model

##### Multilingual model for both EN,PT

In [None]:
ddf_bert0 = pd.read_csv('models/ZeroShot/0/eval-dev/test_results_None.txt', sep='\t')
ddf_bert0_probs = pd.read_csv('models/ZeroShot/0/eval-dev/test_results_None.txt.probs', sep='\t')

In [None]:
f1_score(ddf_bert0['prediction'].astype('str'), ddf_gold['Label'], average='macro')

In [None]:
resdf_dev.loc['BERT'] = ['BERT', 
                         f1_score(ddf_bert0['prediction'].astype('str'), ddf_gold['Label'], average='macro'),
                         f1_score(ddf_bert0['prediction'][en_idx].astype('str'), get_lang(ddf_gold, 'EN')['Label'], average='macro'),
                         f1_score(ddf_bert0['prediction'][pt_idx].astype('str'), get_lang(ddf_gold, 'PT')['Label'], average='macro')
                        ]

##### English model for English, multilingual (full) model for PT

In [None]:
ddf_bert1 = pd.read_csv('models/ZeroShot/1/eval-dev/test_results_None.txt', sep='\t')
ddf_bert1_probs = pd.read_csv('models/ZeroShot/1/eval-dev/test_results_None.txt.probs', sep='\t')

bert0 is the multilingual model. The results from the English model (bert1) are copied over the corresponding rows.
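
Copying one model's rows over another's can also be done without a Python loop. The sketch below assumes that both result frames have an `index` column holding row positions in the development set, that the base frame uses the default integer index, and that the value column is named `prediction`. The helper name `overlay` and the toy values are hypothetical.

```python
import pandas as pd


def overlay(base: pd.DataFrame, update: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `base` where `column` is replaced by `update`'s values
    for every row listed in update['index']."""
    out = base.copy()
    # Row labels to overwrite; assumes base has a default 0..n-1 index.
    positions = update['index'].astype(int).to_numpy()
    out.loc[positions, column] = update[column].to_numpy()
    return out


# Toy data (made up): the second model only covers the first two rows.
multi = pd.DataFrame({'index': [0, 1, 2, 3], 'prediction': [1, 1, 0, 0]})
english = pd.DataFrame({'index': [0, 1], 'prediction': [0, 1]})
combined = overlay(multi, english, 'prediction')
```

The same helper could be applied to the probability frames by passing the name of their value column instead of `prediction`.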

In [None]:
ddf_bert_comb = ddf_bert0.copy()
ddf_bert_comb_probs = ddf_bert0_probs.copy()
for idx,row in ddf_bert1.iterrows():
    _idx,_pred = row
    ddf_bert_comb.loc[_idx,'prediction'] = _pred
for idx,row in ddf_bert1_probs.iterrows():
    _idx,_prob = row
    # Update the 'probs' column, which is the one read downstream
    ddf_bert_comb_probs.loc[int(_idx),'probs'] = _prob

In [None]:
f1_score(ddf_bert_comb['prediction'].astype('str'), ddf_gold['Label'], average='macro')

In [None]:
f1_score(ddf_bert_comb['prediction'][en_idx].astype('str'), ddf_gold['Label'][en_idx], average='macro')

In [None]:
f1_score(ddf_bert_comb['prediction'][pt_idx].astype('str'), ddf_gold['Label'][pt_idx], average='macro')

##### English model for English, multilingual (PT-only) model for PT

In [None]:
dres_1 = util.load_df('models/ZeroShot/1/eval-dev/test_results_None.txt', delimiter="\t")
dres_2 = util.load_df('models/ZeroShot/2/eval-dev/test_results_None.txt', delimiter="\t")
dres_1['index'] = dres_1['index'].astype(int)
dres_2['index'] = dres_2['index'].astype(int)
dres_2['index'] += len(dres_1)
dres_3 = pd.concat([dres_1, dres_2], ignore_index=True)

In [None]:
dres_1_p = util.load_df('models/ZeroShot/1/eval-dev/test_results_None.txt.probs', delimiter="\t")
dres_2_p = util.load_df('models/ZeroShot/2/eval-dev/test_results_None.txt.probs', delimiter="\t")
dres_1_p['index'] = dres_1_p['index'].astype(float)
dres_2_p['index'] = dres_2_p['index'].astype(float)
dres_2_p['index'] += len(dres_1_p)
dres_3_p = pd.concat([dres_1_p, dres_2_p], ignore_index=True)
dres_3_p['probs'] = dres_3_p['probs'].astype(float)

In [None]:
f1_score(dres_3['prediction'].astype('str'), ddf_gold['Label'], average='macro')

In [None]:
f1_score(dres_3['prediction'][pt_idx].astype('str'), ddf_gold['Label'][pt_idx], average='macro')

In [None]:
resdf_dev.loc['BERT_multi1'] = ['BERT multilingual, separate, PT results from full model', 
                                f1_score(ddf_bert_comb['prediction'].astype('str'), ddf_gold['Label'], average='macro'),
                                f1_score(ddf_bert_comb['prediction'][en_idx].astype('str'), ddf_gold['Label'][en_idx], average='macro'),
                                f1_score(ddf_bert_comb['prediction'][pt_idx].astype('str'), ddf_gold['Label'][pt_idx], average='macro')
                               ]
resdf_dev.loc['BERT_multi2'] = ['BERT multilingual, separate',
                                f1_score(dres_3['prediction'].astype('str'), ddf_gold['Label'], average='macro'),
                                f1_score(dres_3['prediction'][en_idx].astype('str'), ddf_gold['Label'][en_idx], average='macro'),
                                f1_score(dres_3['prediction'][pt_idx].astype('str'), ddf_gold['Label'][pt_idx], average='macro')
                               ]

#### Combine BERT with feature model

In [None]:
mup0 = fitter.multi_results(ddf_bt3, ddf_gold, ddf_bert0['prediction'].astype('str'), ddf_bert0_probs['probs'],
                            ddf5_feat_results, ddf5_feat_probs,
                            ['Caps', 'Hassub'],
                            ['Quotes'], agreeonly=True)
f1_score(mup0['Prediction'], mup0['Label'], average='macro')

Adding Sentiment as a forced feature, as in the next cell, doesn't help.

In [None]:
mup0s = fitter.multi_results(ddf_bt3, ddf_gold, ddf_bert0['prediction'].astype('str'), ddf_bert0_probs['probs'],
                             ddf5_feat_results, ddf5_feat_probs,
                             ['Caps', 'Hassub', 'Sentiment'],
                             ['Quotes'], agreeonly=True)
f1_score(mup0s['Prediction'], mup0s['Label'], average='macro')

##### Combine BERT (en+pt) with feature model

Using different BERT models for English and Portuguese.

A couple of different options are used here:
 - Using !Trans as a boolean feature (i.e. not having a good translation is considered idiomatic)
 - agreeonly: only consider boolean features if both models agree

The first results are without the agreeonly option: first for the case where the PT model is trained on all data, then for the case where it is trained on PT data only.

In [None]:
mup1 = fitter.multi_results(ddf_bt3, ddf_gold, ddf_bert_comb['prediction'].astype('str'),
                            ddf_bert_comb_probs['probs'], ddf5_feat_results, ddf5_feat_probs,
                            ['Caps', 'Hassub'],
                            ['Quotes'])
f1_score(mup1['Prediction'], mup1['Label'], average='macro')

In [None]:
ddf_res2 = ddf_sub.copy()
ddf_res2.loc[ddf_res2['Setting'] == 'zero_shot', 'Label'] = mup1['Prediction']
# ddf_res2.to_csv('data/ddf_sub_20220121_2.csv', index=False)

In [None]:
mup1p = fitter.multi_results(ddf_bt3, ddf_gold, dres_3['prediction'].astype('str'),
                             dres_3_p['probs'], ddf5_feat_results, ddf5_feat_probs,
                             ['Caps', 'Hassub'],
                             ['Quotes'])
f1_score(mup1p['Prediction'], mup1p['Label'], average='macro')

Add the agreeonly feature to the results.

In [None]:
mup2 = fitter.multi_results(ddf_bt3, ddf_gold, ddf_bert_comb['prediction'].astype('str'),
                            ddf_bert_comb_probs['probs'], ddf5_feat_results, ddf5_feat_probs,
                            ['Caps', 'Hassub'],
                            ['Quotes'], agreeonly=True)
f1_score(mup2['Prediction'], mup2['Label'], average='macro')

In [None]:
mup2p = fitter.multi_results(ddf_bt3, ddf_gold, dres_3['prediction'].astype('str'),
                            dres_3_p['probs'], ddf5_feat_results, ddf5_feat_probs,
                            ['Caps', 'Hassub'],
                            ['Quotes'], agreeonly=True)
f1_score(mup2p['Prediction'], mup2p['Label'], average='macro')

The same with !Trans, with or without agreeonly.

In [None]:
mup3 = fitter.multi_results(ddf_bt3, ddf_gold, ddf_bert_comb['prediction'].astype('str'),
                            ddf_bert_comb_probs['probs'], ddf5_feat_results, ddf5_feat_probs,
                            ['Caps', 'Hassub'],
                            ['Quotes', '!Trans'])
f1_score(mup3['Prediction'], mup3['Label'], average='macro')

In [None]:
mup3p = fitter.multi_results(ddf_bt3, ddf_gold, dres_3['prediction'].astype('str'),
                            dres_3_p['probs'], ddf5_feat_results, ddf5_feat_probs,
                            ['Caps', 'Hassub'],
                            ['Quotes', '!Trans'])
f1_score(mup3p['Prediction'], mup3p['Label'], average='macro')

In [None]:
mup4 = fitter.multi_results(ddf_bt3, ddf_gold, ddf_bert_comb['prediction'].astype('str'),
                            ddf_bert_comb_probs['probs'], ddf5_feat_results, ddf5_feat_probs,
                            ['Caps', 'Hassub'],
                            ['Quotes', '!Trans'], agreeonly=True)
f1_score(mup4['Prediction'], mup4['Label'], average='macro')

In [None]:
mup4p = fitter.multi_results(ddf_bt3, ddf_gold, dres_3['prediction'].astype('str'),
                             dres_3_p['probs'], ddf5_feat_results, ddf5_feat_probs,
                             ['Caps', 'Hassub'],
                             ['Quotes', '!Trans'], agreeonly=True)
f1_score(mup4p['Prediction'], mup4p['Label'], average='macro')

In the end, using a multilingual model trained only on Portuguese data does not consistently improve the results.

In [None]:
mup4s = fitter.multi_results(ddf_bt3, ddf_gold, ddf_bert_comb['prediction'].astype('str'),
                             ddf_bert_comb_probs['probs'], ddf5_feat_results, ddf5_feat_probs,
                             ['Caps', 'Hassub', 'Sentiment'],
                             ['Quotes', '!Trans'], agreeonly=True)
f1_score(mup4s['Prediction'], mup4s['Label'], average='macro')

In [None]:
resdf_dev.loc['BERT_feat'] = ['BERT + feature',
                              f1_score(mup0['Prediction'], mup0['Label'], average='macro'),
                              f1_score(mup0['Prediction'][en_idx], mup0['Label'][en_idx], average='macro'),
                              f1_score(mup0['Prediction'][pt_idx], mup0['Label'][pt_idx], average='macro')
                             ]
resdf_dev.loc['BERT_multi_feat'] = ['BERT (multi) + feature',
                                    f1_score(mup1['Prediction'], mup1['Label'], average='macro'),
                                    f1_score(mup1['Prediction'][en_idx], mup1['Label'][en_idx], average='macro'),
                                    f1_score(mup1['Prediction'][pt_idx], mup1['Label'][pt_idx], average='macro')
                                   ]
resdf_dev.loc['BERT_multi_feat_agree'] = ['BERT (multi) + feature (agreeonly)',
                                          f1_score(mup2['Prediction'], mup2['Label'], average='macro'),
                                          f1_score(mup2['Prediction'][en_idx], mup2['Label'][en_idx], average='macro'),
                                          f1_score(mup2['Prediction'][pt_idx], mup2['Label'][pt_idx], average='macro')
                                         ]
resdf_dev.loc['BERT_multi_feat_trans'] = ['BERT (multi) + feature + trans',
                                          f1_score(mup3['Prediction'], mup3['Label'], average='macro'),
                                          f1_score(mup3['Prediction'][en_idx], mup3['Label'][en_idx], average='macro'),
                                          f1_score(mup3['Prediction'][pt_idx], mup3['Label'][pt_idx], average='macro')
                                         ]
resdf_dev.loc['BERT_multi_feat_trans_agree'] = ['BERT (multi) + feature + trans (agreeonly)',
                                                f1_score(mup4['Prediction'], mup4['Label'], average='macro'),
                                                f1_score(mup4['Prediction'][en_idx], mup4['Label'][en_idx], average='macro'),
                                                f1_score(mup4['Prediction'][pt_idx], mup4['Label'][pt_idx], average='macro')
                                               ]
resdf_dev.loc['BERT_multi2_feat_trans_agree'] = ['BERT (multi2) + feature + trans (agreeonly)',
                                                 f1_score(mup4p['Prediction'], mup4p['Label'], average='macro'),
                                                 f1_score(mup4p['Prediction'][en_idx], mup4p['Label'][en_idx], average='macro'),
                                                 f1_score(mup4p['Prediction'][pt_idx], mup4p['Label'][pt_idx], average='macro')
                                                ]

In [None]:
resdf_dev

#### Majority voting classifier

In [None]:
mup_m1 = fitter.majority_results(ddf_bt3, ddf_gold,
                                 ddf_bert_comb['prediction'].astype('str'),
                                 ddf_bert_comb_probs['probs'],
                                 ddf5_feat_results, ddf5_feat_probs,
                                 z_results, z_probs,
                                 ['Caps', 'Hassub'],
                                 ['Quotes', '!Trans'])
f1_score(mup_m1['Prediction'], mup_m1['Label'], average='macro')

In [None]:
mup_m2 = fitter.majority_results(ddf_bt3, ddf_gold,
                                 ddf_bert_comb['prediction'].astype('str'),
                                 ddf_bert_comb_probs['probs'],
                                 ddf5_feat_results, ddf5_feat_probs,
                                 z_results, z_probs,
                                 ['Caps', 'Hassub'],
                                 ['Quotes', '!Trans'], agreeonly=True)
f1_score(mup_m2['Prediction'], mup_m2['Label'], average='macro')

In [None]:
resdf_dev.loc['majority_voting_trans'] = ['Majority voting + trans',
                                          f1_score(mup_m1['Prediction'], mup_m1['Label'], average='macro'),
                                          f1_score(mup_m1[en_idx]['Prediction'], mup_m1[en_idx]['Label'], average='macro'),
                                          f1_score(mup_m1[pt_idx]['Prediction'], mup_m1[pt_idx]['Label'], average='macro')
                                         ]

In [None]:
resdf_dev.loc['majority_voting_trans_agree'] = ['Majority voting + trans (agreeonly)',
                                                f1_score(mup_m2['Prediction'], mup_m2['Label'], average='macro'),
                                                f1_score(mup_m2[en_idx]['Prediction'], mup_m2[en_idx]['Label'], average='macro'),
                                                f1_score(mup_m2[pt_idx]['Prediction'], mup_m2[pt_idx]['Label'], average='macro')
                                               ]

In [None]:
resdf_dev

### Eval results

For the evaluation results, the scores are copied from the CodaLab website.

In [None]:
edf_sub = frames['eval_submission_format.csv']
tdf_sub = tframes['test_submission_format.csv']

In [None]:
edf_bert0 = pd.read_csv('models/ZeroShot/0/eval-eval/test_results_None.txt', sep='\t')
edf_bert0_probs = pd.read_csv('models/ZeroShot/0/eval-eval/test_results_None.txt.probs', sep='\t')

In [None]:
ez_score, ez_probs, ez_results = fitter.get_fit_results(z_emb_multi, zdf['Label'], e_emb_multi)

In [None]:
dropcols3 = ['Top score 1', 'FS', 'MWEdiff']
zdf_et3 = fitter.get_trainable(zdf_bt3).drop(dropcols3, axis=1)
edf_et3 = fitter.get_trainable(edf_bt3).drop(dropcols3, axis=1)

edf_feat_score3, edf_feat_probs3, edf_feat_results3 = fitter.get_fit_results(zdf_et3, zdf['Label'], edf_et3)
edf_comb3 = fitter.multi_results(edf_bt3, None, ez_results, ez_probs, edf_feat_results3, edf_feat_probs3, ['Caps','Hassub'],['Quotes','!Trans'])
edf_comb3.groupby(['Language','Prediction'])['ID'].count()

In [None]:
edf_res1 = edf_sub.copy()
edf_res1.loc[edf_res1['Setting'] == 'zero_shot', 'Label'] = edf_comb3['Prediction']
edf_res1.to_csv('data/eval_sub_20220113_sft.csv', index=False)

In [None]:
edf_comb4 = fitter.multi_results(edf_bt3, None, ez_results, ez_probs, edf_feat_results3, edf_feat_probs3, ['Caps','Hassub'],['Quotes'])
edf_res2 = edf_sub.copy()
edf_res2.loc[edf_res2['Setting'] == 'zero_shot', 'Label'] = edf_comb4['Prediction']
# edf_res2.to_csv('data/eval_sub_20220113_sf.csv', index=False)

In [None]:
edf_comb_bert1 = fitter.multi_results(edf_bt3, None, edf_bert0['prediction'].astype('str'), edf_bert0_probs['probs'],
                                      edf_feat_results3, edf_feat_probs3,
                                      ['Caps','Hassub'],['Quotes','!Trans'])

In [None]:
edf_res_bert1 = edf_sub.copy()
edf_res_bert1.loc[edf_res_bert1['Setting'] == 'zero_shot', 'Label'] = edf_comb_bert1['Prediction']
# edf_res_bert1.to_csv('data/eval_sub_20220113_1.csv', index=False)

This one scores 0.6711207561 on CodaLab.

In [None]:
edf_comb_bert1p = fitter.multi_results(edf_bt3, None, edf_bert0['prediction'].astype('str'), edf_bert0_probs['probs'],
                                       edf_feat_results3, edf_feat_probs3,
                                       ['Caps','Hassub'],['Quotes'])

In [None]:
edf_res_bert1p = edf_sub.copy()
edf_res_bert1p.loc[edf_res_bert1p['Setting'] == 'zero_shot', 'Label'] = edf_comb_bert1p['Prediction']
# edf_res_bert1p.to_csv('data/eval_sub_20220113_6.csv', index=False)

In [None]:
edf_comb_bert2 = fitter.multi_results(edf_bt3, None, edf_bert0['prediction'].astype('str'),
                                      edf_bert0_probs['probs'], edf_feat_results3, edf_feat_probs3,
                                      ['Caps','Hassub'],['Quotes'], agreeonly=True)

In [None]:
edf_res_bert2 = edf_sub.copy()
edf_res_bert2.loc[edf_res_bert2['Setting'] == 'zero_shot', 'Label'] = edf_comb_bert2['Prediction']
# edf_res_bert2.to_csv('data/eval_sub_20220113_2.csv', index=False)

This model finally gets a better result than the baseline (72.3% vs 70.2%). This essentially means:
 - use the prediction if the models agree
 - if they disagree
   - use the BERT base model (because of confidence) except when the boolean features Quotes, Caps or Hassub say otherwise.
   - the Trans feature (idiomatic interpretation if Trans==False) didn't turn out to be useful.

However, using the base BERT model alone (English for English and multilingual for Portuguese) is just a tiny bit higher (72.30% vs 72.29%).
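
The rule described in the list above can be sketched roughly as follows. This is an interpretation of that description, not the code of `fitter.multi_results`. The function name `combine`, its signature, the convention that '1' means literal and '0' means idiomatic, the direction of each boolean feature (Caps and Hassub towards literal, Quotes towards idiomatic) and the order in which overrides are applied are all assumptions made for illustration.

```python
def combine(bert_pred: str, feat_pred: str, flags: dict,
            literal_flags=('Caps', 'Hassub'), idiomatic_flags=('Quotes',)) -> str:
    """Combine two model predictions for one example.

    Assumed convention: '1' = literal, '0' = idiomatic.
    """
    if bert_pred == feat_pred:
        # Both models agree: keep their shared prediction.
        return bert_pred
    # Models disagree: start from the BERT prediction...
    result = bert_pred
    # ...and let the boolean features override it (assumed directions).
    if any(flags.get(name, False) for name in literal_flags):
        result = '1'
    if any(flags.get(name, False) for name in idiomatic_flags):
        result = '0'
    return result


# Hypothetical examples of the three cases.
agree = combine('1', '1', {'Quotes': True})
bert_wins = combine('0', '1', {})
caps_override = combine('0', '1', {'Caps': True})
quotes_override = combine('1', '0', {'Quotes': True})
```

In this sketch the boolean features are consulted only on disagreement; the actual precedence between features and models in the notebook's library may differ.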

In [None]:
edf_bert1 = pd.read_csv('models/ZeroShot/1/eval-eval/test_results_None.txt', sep='\t')
edf_bert1_probs = pd.read_csv('models/ZeroShot/1/eval-eval/test_results_None.txt.probs', sep='\t')

In [None]:
edf_bert_comb = edf_bert0.copy()
edf_bert_comb_probs = edf_bert0_probs.copy()
for idx,row in edf_bert1.iterrows():
    _idx,_pred = row
    edf_bert_comb.loc[_idx,'prediction'] = _pred
for idx,row in edf_bert1_probs.iterrows():
    _idx,_prob = row
    # Update the 'probs' column, which is the one read downstream
    edf_bert_comb_probs.loc[int(_idx),'probs'] = _prob

In [None]:
edf_comb_bert3 = fitter.multi_results(edf_bt3, None, 
                                      edf_bert_comb['prediction'].astype('str'), edf_bert_comb_probs['probs'], 
                                      edf_feat_results3, edf_feat_probs3,
                                      ['Caps','Hassub'],['Quotes'])

In [None]:
edf_comb_bert4 = fitter.multi_results(edf_bt3, None, 
                                      edf_bert_comb['prediction'].astype('str'), edf_bert_comb_probs['probs'], 
                                      edf_feat_results3, edf_feat_probs3,
                                      ['Caps','Hassub'],['Quotes'], agreeonly=True)

In [None]:
edf_res_bert3 = edf_sub.copy()
edf_res_bert3.loc[edf_res_bert3['Setting'] == 'zero_shot', 'Label'] = edf_comb_bert3['Prediction']
# edf_res_bert3.to_csv('data/eval_sub_20220113_3.csv', index=False)

In [None]:
edf_res_bert4 = edf_sub.copy()
edf_res_bert4.loc[edf_res_bert4['Setting'] == 'zero_shot', 'Label'] = edf_comb_bert4['Prediction']
# edf_res_bert4.to_csv('data/eval_sub_20220113_5.csv', index=False)

Using the English BERT for English improves the result slightly, to 72.5%.

In [None]:
resdf_eval = pd.DataFrame(columns=['Name', 'Score'])

In [None]:
resdf_eval.loc['sbert'] = ['Sentence transformers', 0.5580687255]
resdf_eval.loc['sbert_feat'] = ['Sentence transformers + feature', 0.6459566501]
resdf_eval.loc['BERT'] = ['BERT', 0.7024610232]
resdf_eval.loc['BERT_multi'] = ['BERT (multi)', 0.7229592232]
resdf_eval.loc['BERT_feat'] = ['BERT + feature', 0.7144442603]
resdf_eval.loc['BERT_feat_trans'] = ['BERT + feature + trans', 0.6711207561]
resdf_eval.loc['BERT_feat_agree'] = ['BERT + feature (agree)', 0.7229308218]
resdf_eval.loc['BERT_multi_feat'] = ['BERT (multi) + feature', 0.7202048507]
resdf_eval.loc['BERT_multi_feat_agree'] = ['BERT (multi) + feature (agree)', 0.7252418204]

In [None]:
resdf_eval

### Test results

In [None]:
tdf_bert0 = pd.read_csv('models/ZeroShot/0/eval-test/test_results_None.txt', sep='\t')
tdf_bert0_probs = pd.read_csv('models/ZeroShot/0/eval-test/test_results_None.txt.probs', sep='\t')

#### Sentence transformers + features

In [None]:
tz_score, tz_probs, tz_results = fitter.get_fit_results(z_emb_multi, zdf['Label'], t_emb_multi)

In [None]:
dropcols4 = ['Top score 1', 'FS', 'MWEdiff']
zdf_tt1 = fitter.get_trainable(zdf_bt3).drop(dropcols4, axis=1)
tdf_tt1 = fitter.get_trainable(tdf_bt3).drop(dropcols4, axis=1)

tdf_feat_score, tdf_feat_probs, tdf_feat_results = fitter.get_fit_results(zdf_tt1, zdf['Label'], tdf_tt1)
tdf_comb = fitter.multi_results(tdf_bt3, None, tz_results, tz_probs, tdf_feat_results, tdf_feat_probs, ['Caps','Hassub'],['Quotes'])
tdf_comb.groupby(['Language','Prediction'])['ID'].count()

In [None]:
tdf_res1 = tdf_sub.copy()
tdf_res1.loc[tdf_res1['Setting'] == 'zero_shot', 'Label'] = tdf_comb['Prediction']
# tdf_res1.to_csv('data/test_sub_20220114_1.csv', index=False)

#### Baseline BERT + features

In [None]:
tdf_comb_bert = fitter.multi_results(tdf_bt3, None, tdf_bert0['prediction'].astype('str'), tdf_bert0_probs['probs'],
                                     tdf_feat_results, tdf_feat_probs,
                                     ['Caps','Hassub'],['Quotes'])
tdf_comb_bert.groupby(['Language','Prediction'])['ID'].count()

In [None]:
tdf_res2 = tdf_sub.copy()
tdf_res2.loc[tdf_res2['Setting'] == 'zero_shot', 'Label'] = tdf_comb_bert['Prediction']
# tdf_res2.to_csv('data/test_sub_20220114_2.csv', index=False)

In [None]:
tdf_comb_bert2 = fitter.multi_results(tdf_bt3, None, tdf_bert0['prediction'].astype('str'), tdf_bert0_probs['probs'],
                                      tdf_feat_results, tdf_feat_probs,
                                      ['Caps','Hassub'],['Quotes'], agreeonly=True)
tdf_comb_bert2.groupby(['Language','Prediction'])['ID'].count()

In [None]:
tdf_res3 = tdf_sub.copy()
tdf_res3.loc[tdf_res3['Setting'] == 'zero_shot', 'Label'] = tdf_comb_bert2['Prediction']
# tdf_res3.to_csv('data/test_sub_20220114_3.csv', index=False)

#### Baseline BERT (multi) + features

In [None]:
tdf_bert1 = pd.read_csv('models/ZeroShot/1/eval-test/test_results_None.txt', sep='\t')
tdf_bert1_probs = pd.read_csv('models/ZeroShot/1/eval-test/test_results_None.txt.probs', sep='\t')

In [None]:
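tdf_bert_comb = tdf_bert0.copy()
tdf_bert_comb_probs = tdf_bert0_probs.copy()
# Overwrite the baseline rows with the predictions of the second model.
for _, row in tdf_bert1.iterrows():
    _idx, _pred = row
    tdf_bert_comb.loc[_idx, 'prediction'] = _pred
# Overwrite the matching probabilities. They belong in the 'probs' column,
# which is the column passed to multi_results in the cells below.
for _, row in tdf_bert1_probs.iterrows():
    _idx, _prob = row
    tdf_bert_comb_probs.loc[_idx, 'probs'] = _prob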
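
The row-by-row overwrite in the cell above can also be written as a single label-based assignment. This is a minimal sketch on toy frames. The column names `index` and `prediction` are assumptions about the layout of the result files, which are not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the baseline predictions (one row per test example).
base = pd.DataFrame({'index': [0, 1, 2, 3], 'prediction': [1, 0, 1, 1]})
# Toy stand-in for the second model, which only covers rows 2 and 3.
other = pd.DataFrame({'index': [2, 3], 'prediction': [0, 0]})

combined = base.copy()
# Use the stored row ids as labels and assign the new values positionally.
combined.loc[other['index'].to_numpy(), 'prediction'] = other['prediction'].to_numpy()
```

Rows that the second model does not cover keep their baseline prediction.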

In [None]:
tdf_comb_bert3 = fitter.multi_results(tdf_bt3, None, 
                                      tdf_bert_comb['prediction'].astype('str'), tdf_bert_comb_probs['probs'], 
                                      tdf_feat_results, tdf_feat_probs,
                                      ['Caps','Hassub'],['Quotes'])
tdf_comb_bert3.groupby(['Language','Prediction'])['ID'].count()

In [None]:
tdf_res4 = tdf_sub.copy()
tdf_res4.loc[tdf_res4['Setting'] == 'zero_shot', 'Label'] = tdf_comb_bert3['Prediction']
# tdf_res4.to_csv('data/test_sub_20220114_4.csv', index=False)

In [None]:
tdf_comb_bert4 = fitter.multi_results(tdf_bt3, None, 
                                      tdf_bert_comb['prediction'].astype('str'), tdf_bert_comb_probs['probs'], 
                                      tdf_feat_results, tdf_feat_probs,
                                      ['Caps','Hassub'],['Quotes'], agreeonly=True)
tdf_comb_bert4.groupby(['Language','Prediction'])['ID'].count()

This is the final submission.

In [None]:
tdf_res5 = tdf_sub.copy()
tdf_res5.loc[tdf_res5['Setting'] == 'zero_shot', 'Label'] = tdf_comb_bert4['Prediction']
# tdf_res5.to_csv('data/test_sub_20220114_5.csv', index=False)
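
The submission cells rely on pandas aligning the right-hand Series with the selected rows by index label, not by position. The sketch below illustrates this on toy data. The frames `sub` and `preds` are hypothetical stand-ins for the submission template and the combined predictions.

```python
import pandas as pd

# Toy submission template: two zero-shot rows and two one-shot rows.
sub = pd.DataFrame({
    'ID': [10, 11, 12, 13],
    'Setting': ['zero_shot', 'zero_shot', 'one_shot', 'one_shot'],
    'Label': [None, None, None, None],
})
# Toy predictions whose index labels match the zero-shot rows.
preds = pd.DataFrame({'Prediction': [1, 0]}, index=[0, 1])

res = sub.copy()
mask = res['Setting'] == 'zero_shot'
res.loc[mask, 'Label'] = preds['Prediction']

# The same predictions with non-matching index labels leave the labels empty.
shifted = preds.copy()
shifted.index = [5, 6]
res_bad = sub.copy()
res_bad.loc[mask, 'Label'] = shifted['Prediction']
```

A quick check such as `res.loc[mask, 'Label'].isna().sum()` before writing the CSV would catch a misaligned index.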

## Results

Code for exporting the results as LaTeX tables.

In [None]:
resdf_dev2 = resdf_dev.copy()

In [None]:
resdf_dev2['Configuration'] = [
    'Hassub', 'Trans', 'Sentiment', 'Majority class',
    'Sentence transformers', '+ feature',
    'BERT baseline',
    '+ multilingual 1: PT from full model',
    '+ multilingual 2: PT from separate model',
    '+ feature',
    '+ multilingual 1 + feature',
    '+ multilingual 1 + feature, agree',
    '+ multilingual 1 + feature + trans',
    '+ multilingual 1 + feature + trans, agree',
    '+ multilingual 2 + feature + trans, agree',
    'Majority voting + trans',
    'Majority voting + trans, agree'
]
resdf_dev2['F1'] = resdf_dev2['Score']
resdf_dev2['EN'] = resdf_dev2['ScoreEN']
resdf_dev2['PT'] = resdf_dev2['ScorePT']

In [None]:
resdf_dev2.drop(['Name', 'Score', 'ScoreEN', 'ScorePT'], axis=1)

In [None]:
# util.save_table(resdf_dev2.drop(['Name', 'Score', 'ScoreEN', 'ScorePT'], axis=1), 'dev_results', index=False, hlines=[6, 8, 17])

In [None]:
resdf_eval2 = resdf_eval.copy()

In [None]:
resdf_eval2['Configuration'] = [
    'Sentence transformers', '+ feature',
    'BERT baseline',
    '+ multilingual 1',
    '+ feature',
    '+ feature + trans',
    '+ feature, agree',
    '+ multilingual 1 + feature',
    '+ multilingual 1 + feature, agree',
]
resdf_eval2['F1'] = resdf_eval2['Score']

In [None]:
# util.save_table(resdf_eval2.drop(['Name', 'Score'], axis=1), 'eval_results', index=False, hlines=[4])
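
The commented-out `util.save_table` calls write the result tables as LaTeX. That helper belongs to this project and its implementation is not shown here. The function below is a hypothetical, simplified stand-in that renders rows as a `tabular` environment; its `hlines` argument (row counts after which a rule is drawn) is an assumption and may differ from the real helper.

```python
def to_latex_table(header, rows, hlines=()):
    """Render rows as a LaTeX tabular with optional extra horizontal rules."""
    colspec = 'l' * len(header)
    lines = [r'\begin{tabular}{' + colspec + '}', r'\hline',
             ' & '.join(header) + r' \\', r'\hline']
    for i, row in enumerate(rows, start=1):
        # Floats are rounded to four decimals; special characters are not escaped.
        cells = [f'{c:.4f}' if isinstance(c, float) else str(c) for c in row]
        lines.append(' & '.join(cells) + r' \\')
        if i in hlines:
            lines.append(r'\hline')
    lines += [r'\hline', r'\end{tabular}']
    return '\n'.join(lines)


# The first three rows of the evaluation table defined earlier in the notebook.
rows = [
    ('Sentence transformers', 0.5580687255),
    ('+ feature', 0.6459566501),
    ('BERT baseline', 0.7024610232),
]
table = to_latex_table(['Configuration', 'F1'], rows, hlines=(2,))
```

Printing `table` gives text that can be pasted into a paper or written to a `.tex` file.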

## Examples

In [None]:
hassub_drops = ['Quotes', 'Caps', 'Sentiment', 'BT', 'Trans',
                'Prevdiff', 'Nextdiff', 'MWEdiff', 'Previous', 'Next',
                'Top terms 1', 'Top terms 2', 'Top score 1', 'Top score 2', 'FS', 'SS',
                'Language', 'Setting', 'MWE']

In [None]:
zdf_bt3 = zdf_bt3.astype({'Hassub': bool, 'Quotes': bool, 'Caps': bool, 'Trans': bool})

In [None]:
zdf_hassub = zdf_bt3.drop(hassub_drops, axis=1).astype({'Short': int, 'FoundIdx': int})

In [None]:
zdf_quotes = zdf_bt3.drop(['Previous', 'Next', 'Language', 'Setting'], axis=1)
zdf_trans = zdf_quotes.copy()
for col in zdf_quotes.columns[4:]:
    if col not in ['Caps', 'Quotes']:
        zdf_quotes.drop(col, axis=1, inplace=True)
    if col not in ['BT','Trans']:
        zdf_trans.drop(col, axis=1, inplace=True)       

In [None]:
zdf_hassub.iloc[[0, 1, 3509, 3512, 3520]]

In [None]:
hascolformat = {'Target': 'p{5cm}', 'Top terms': 'p{2.2cm}', 
                'Top score': 'p{1cm}',
                'FoundIdx': 'p{1cm}', 'FoundScore': 'p{1cm}',
               }
# util.save_table(zdf_hassub.iloc[[0, 1, 3509, 3512, 3520]], 'zdf_hassub2', index=False, colformat=hascolformat, hlines=[2,3,4,5,6])

In [None]:
zdf_quotes.iloc[[1, 2, 7, 12]]

In [None]:
quotecolformat = {'Target': 'p{10cm}'}
# util.save_table(zdf_quotes.iloc[[1, 2, 7, 12]], 'zdf_quotes', index=False, colformat=quotecolformat)

In [None]:
zdf_trans.iloc[[6,7,4085,4088,4474]]

In [None]:
transcolformat = {'Target': 'p{5cm}', 'BT': 'p{5cm}'}
# util.save_table(zdf_trans.iloc[[6,7,4085,4088,4474]], 'zdf_trans', index=False, colformat=transcolformat)

#### Get literal/idiomatic percentages for good MWE examples.

In [None]:
z_counts = util.get_counts(zdf_bt3, 'DataID').drop(columns='Pct correct')
z_counts[(z_counts['Pct literal'] > 0.4) & (z_counts['Pct literal'] < 0.6)].sort_values(by=['Language','MWE'])
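
`util.get_counts` is a project helper whose code is not shown here. As a rough sketch of the idea, the share of literal uses per MWE can be computed with a plain `groupby`. The toy data and the assumption that `Label == 1` marks a literal use are illustrative.

```python
import pandas as pd

# Toy training rows; Label 1 is assumed to mean "literal", 0 "idiomatic".
df = pd.DataFrame({
    'Language': ['EN', 'EN', 'EN', 'EN', 'PT', 'PT'],
    'MWE': ['big fish', 'big fish', 'night owl', 'night owl', 'olho gordo', 'olho gordo'],
    'Label': [1, 0, 0, 0, 1, 1],
})

# The mean of a boolean column is the fraction of True values.
counts = (df.assign(literal=df['Label'].eq(1))
            .groupby(['Language', 'MWE'])['literal']
            .mean()
            .rename('Pct literal')
            .reset_index())

# Keep MWEs that are used literally and idiomatically about equally often.
balanced = counts[(counts['Pct literal'] > 0.4) & (counts['Pct literal'] < 0.6)]
```

MWEs in this band make good illustrative examples because context alone has to decide the reading.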

#### Analyze Hassub results for Galician.

In [None]:
tdf_comb_bert4[(tdf_comb_bert4['Language'] == 'GL') & (tdf_comb_bert4['Hassub'])]

# Extra sets for post-competition

In [None]:
tdf_featonly = fitter.multi_results(tdf_bt3, None, 
                                    tdf_feat_results, tdf_feat_probs,
                                    tdf_feat_results, tdf_feat_probs,
                                    ['Caps','Hassub'],['Quotes'])
tdf_featonly.groupby(['Language','Prediction'])['ID'].count()

In [None]:
tdf_res_new1 = tdf_sub.copy()
tdf_res_new1.loc[tdf_res_new1['Setting'] == 'zero_shot', 'Label'] = tdf_featonly['Prediction']
# tdf_res_new1.to_csv('data/test_sub_20220205_1.csv', index=False)

In [None]:
tdf_featonly2 = fitter.multi_results(tdf_bt3, None, 
                                     tdf_feat_results, tdf_feat_probs,
                                     tdf_feat_results, tdf_feat_probs,
                                     ['Caps'],['Quotes'])
tdf_featonly2.groupby(['Language','Prediction'])['ID'].count()

In [None]:
tdf_featonly2[tdf_featonly['Prediction'] != tdf_featonly2['Prediction']]

In [None]:
tdf_res_new1p = tdf_sub.copy()
tdf_res_new1p.loc[tdf_res_new1p['Setting'] == 'zero_shot', 'Label'] = tdf_featonly2['Prediction']
# tdf_res_new1p.to_csv('data/test_sub_20220205_1p.csv', index=False)

In [None]:
tdf_comb_bert4[tdf_featonly['Prediction'] != tdf_comb_bert4['Prediction']]

In [None]:
tdf_comb_bert4.groupby(['Language','Prediction'])['ID'].count()

In [None]:
tdf_bert_new1 = fitter.multi_results(tdf_bt3, None, 
                                     tdf_bert_comb['prediction'].astype('str'), tdf_bert_comb_probs['probs'], 
                                     tdf_feat_results, tdf_feat_probs,
                                     ['Caps','Hassub'],['Quotes'])
tdf_bert_new1.groupby(['Language','Prediction'])['ID'].count()

In [None]:
tdf_res_new2 = tdf_sub.copy()
tdf_res_new2.loc[tdf_res_new2['Setting'] == 'zero_shot', 'Label'] = tdf_bert_new1['Prediction']
# tdf_res_new2.to_csv('data/test_sub_20220205_2.csv', index=False)

In [None]:
tdf_bert_new1[tdf_bert_new1['Prediction'] != tdf_comb_bert4['Prediction']]

In [None]:
tdf_bert_new2 = fitter.multi_results(tdf_bt3, None, 
                                     tdf_bert_comb['prediction'].astype('str'), tdf_bert_comb_probs['probs'], 
                                     tdf_feat_results, tdf_feat_probs,
                                     ['Caps'],['Quotes'], agreeonly=True)
tdf_bert_new2.groupby(['Language','Prediction'])['ID'].count()

In [None]:
tdf_res_new3 = tdf_sub.copy()
tdf_res_new3.loc[tdf_res_new3['Setting'] == 'zero_shot', 'Label'] = tdf_bert_new2['Prediction']
# tdf_res_new3.to_csv('data/test_sub_20220205_3.csv', index=False)

In [None]:
tdf_res_new4 = tdf_sub.copy()
tdf_res_new4.loc[(tdf_res_new4['Setting'] == 'zero_shot') & 
                 (tdf_res_new4['Language'] == 'EN'), 'Label'] = tdf_bert_new2.loc[tdf_bert_new2['Language'] == 'EN', 'Prediction']
tdf_res_new4.loc[(tdf_res_new4['Setting'] == 'zero_shot') & 
                 (tdf_res_new4['Language'] != 'EN'), 'Label'] = tdf_featonly2.loc[tdf_featonly2['Language'] != 'EN', 'Prediction']

In [None]:
tdf_res_new4[tdf_res_new4['Label'] != tdf_res_new3['Label']]

In [None]:
tdf_res_new4[tdf_res_new4['Label'] != tdf_res_new1p['Label']]

In [None]:
# tdf_res_new4.to_csv('data/test_sub_20220214_1.csv', index=False)