## Feature building

This notebook builds feature-based models for idiomaticity detection using several pipelines and models from Hugging Face.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# `display` and `HTML` live in IPython.display; importing them from IPython.core.display is deprecated.
from IPython.display import display, HTML
display(HTML("<style>.container { width: 98% !important; }</style>"))
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [None]:
import sys
import os
import re
import numpy as np
import pandas as pd

from tqdm.notebook import tqdm
from sklearn.metrics import f1_score

In [None]:
from lib import util, embeds, fitter, masker, features, sentiment, translate

Uncomment these two lines to download the required data for the task.

In [None]:
# !git clone https://github.com/H-TayyarMadabushi/SemEval_2022_Task2-idiomaticity.git
# !git clone https://github.com/H-TayyarMadabushi/AStitchInLanguageModels.git

In [None]:
datapath = 'SemEval_2022_Task2-idiomaticity/SubTaskA/Data'

In [None]:
testpath = 'SemEval_2022_Task2-idiomaticity/SubTaskA/TestData'

Load all the CSV files into dataframes.

In [None]:
frames = util.load_csv_dataframes(datapath)

In [None]:
frames.keys()

In [None]:
zdf = frames['train_zero_shot.csv']
odf = frames['train_one_shot.csv']
ddf = frames['dev.csv']
ddf_gold = frames['dev_gold.csv']
edf = frames['eval.csv']

Load the test data.

In [None]:
tframes = util.load_csv_dataframes(testpath)

In [None]:
tframes.keys()

In [None]:
tdf = tframes['test.csv']

Test basic embeddings with sentence-transformers.

In [None]:
z_emb = embeds.get_embeddings(zdf)
z_emb_i = embeds.get_embeddings(zdf, append=['MWE'])

In [None]:
multilingual_model = 'distiluse-base-multilingual-cased-v1'

### Sentence transformers embeddings

Get sentence-transformers embeddings with the best method (appending the MWE to the text and ignoring the context).

Calling this the "best" method is not entirely accurate: the original paper applies the "idiomatic principle", encoding the MWE as a single token during tokenization.

In [None]:
z_emb_multi = embeds.get_embeddings(zdf, modelname=multilingual_model, append=['MWE'])

In [None]:
d_emb_multi = embeds.get_embeddings(ddf, modelname=multilingual_model, append=['MWE'])

Fit a logistic regression classifier on the embeddings.

In [None]:
z_score, z_probs, z_results = fitter.get_fit_results(z_emb_multi, zdf['Label'], d_emb_multi, ddf_gold['Label'])

In [None]:
z_score

In [None]:
dres = fitter.add_results(ddf, z_results, ddf_gold)

In [None]:
dres

In [None]:
dres_counts = util.get_counts(dres)

Show the MWEs that the model gets wrong more than half of the time. Are there any patterns?

In [None]:
dres_counts[dres_counts['Pct correct'] < 0.5].sort_values(by=['Language','MWE'])
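
The per-MWE error analysis above can be sketched without the project helpers. This is a minimal illustration, not the implementation of `util.get_counts`: the function name and the row layout (dicts with `Language`, `MWE`, `Prediction` and `Label` keys) are assumptions, and the rows are made-up toy data.

```python
from collections import defaultdict


def pct_correct_by_mwe(rows):
    """Return {(language, mwe): share of correct predictions}.

    `rows` is an iterable of dicts with the (assumed) keys
    'Language', 'MWE', 'Prediction' and 'Label'.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        key = (row['Language'], row['MWE'])
        totals[key] += 1
        if row['Prediction'] == row['Label']:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Toy rows for illustration only.
rows = [
    {'Language': 'EN', 'MWE': 'banana republic', 'Prediction': '1', 'Label': '0'},
    {'Language': 'EN', 'MWE': 'banana republic', 'Prediction': '0', 'Label': '0'},
    {'Language': 'EN', 'MWE': 'banana republic', 'Prediction': '1', 'Label': '0'},
    {'Language': 'PT', 'MWE': 'círculo virtuoso', 'Prediction': '1', 'Label': '1'},
]
pct = pct_correct_by_mwe(rows)

# MWEs that are wrong more than half of the time, sorted by language and MWE.
hard = sorted(key for key, share in pct.items() if share < 0.5)
print(hard)
```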

## Feature generation

If you have already generated the features, skip all cells until the "Reload data from disk" section.

### Mask filling (lexical substitution)

Get several features based on the mask-filling pipeline.

Rationale: It should be more difficult to get mask filling to work when the MWE is idiomatic.

There are three ways to do mask filling for the MWE:
- replace the whole expression: banana republic -\> \<mask\>
- replace the first term: \<mask\> republic
- replace the second term: banana \<mask\>
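
The three masking strategies can be sketched with plain string handling. This is a simplified illustration under stated assumptions: `masked_variants` is a hypothetical helper (not part of `lib.masker`), it only handles two-word MWEs, and it matches the MWE literally, so inflected forms such as plurals are not found.

```python
import re


def masked_variants(text, mwe, mask='<mask>'):
    """Return the three masked versions of `text` for a two-word MWE."""
    first, second = mwe.split()
    # Case-insensitive, literal match of the MWE inside the text.
    pattern = re.compile(re.escape(mwe), flags=re.IGNORECASE)
    replacements = {
        'whole': mask,                  # banana republic -> <mask>
        'first': f'{mask} {second}',    # <mask> republic
        'second': f'{first} {mask}',    # banana <mask>
    }
    # A function is used as the replacement so that the mask text is
    # inserted verbatim; only the first occurrence is replaced.
    return {
        name: pattern.sub(lambda match, repl=repl: repl, text, count=1)
        for name, repl in replacements.items()
    }


variants = masked_variants(
    'The country was called a banana republic by critics.',
    'banana republic',
)
for name, sentence in variants.items():
    print(name, '->', sentence)
```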

The mask filling generates several features:
- Hassub: whether one of the top-5 predicted terms matches a term in the MWE exactly
  - FoundIndex: the index of the matching term
  - FoundScore: the score of the matching term
- Top score: the confidence score of the top term
- Short/FS/SS: the number of "short" terms (fewer than three characters) when masking the whole expression, the first term, and the second term, respectively

Additionally, the top terms and their scores are recorded in the Top terms and Top score columns (for the whole expression and for the first and second terms):
- The top score is recorded only for an "acceptable" term (at least three characters and no non-word characters).
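
As a rough sketch of how such features could be derived from a list of mask-filling predictions: the input mirrors the shape of Hugging Face fill-mask output (dicts with `token_str` and `score`), but `summarize_predictions`, the `-1` sentinel for "not found", and the scores are illustrative assumptions rather than the behavior of `masker.get_masked_features`.

```python
import re


def summarize_predictions(predictions, mwe):
    """Derive Hassub, FoundIndex, FoundScore, Top score and Short."""
    mwe_words = {word.lower() for word in mwe.split()}
    feats = {
        'Hassub': False,
        'FoundIndex': -1,   # assumed sentinel when no MWE term is predicted
        'FoundScore': 0.0,
        'Top score': 0.0,
        'Short': 0,
    }
    top_recorded = False
    for index, pred in enumerate(predictions):
        term = pred['token_str'].strip()
        # "Short" terms have fewer than three characters.
        if len(term) < 3:
            feats['Short'] += 1
        # Record the first prediction that exactly matches an MWE term.
        if not feats['Hassub'] and term.lower() in mwe_words:
            feats.update(Hassub=True, FoundIndex=index, FoundScore=pred['score'])
        # The top score is taken from the first "acceptable" term only.
        acceptable = len(term) >= 3 and re.fullmatch(r'\w+', term) is not None
        if acceptable and not top_recorded:
            feats['Top score'] = pred['score']
            top_recorded = True
    return feats


# Made-up top-5 predictions for "banana <mask>".
predictions = [
    {'token_str': ' of', 'score': 0.4},
    {'token_str': ' republic', 'score': 0.3},
    {'token_str': ' tree', 'score': 0.1},
    {'token_str': ' -', 'score': 0.05},
    {'token_str': ' split', 'score': 0.02},
]
feats = summarize_predictions(predictions, 'banana republic')
print(feats)
```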

In [None]:
zdf_masked = masker.get_masked_features(zdf)

In [None]:
zdf_masked.groupby(['Language','Label','Hassub'])['DataID'].count()

In [None]:
ddf_masked = masker.get_masked_features(ddf)

In [None]:
ddf_masked[ddf_masked['Hassub'] == False][535:600]

In [None]:
ddf_masked.groupby(['Language','Hassub'])['ID'].count()

In [None]:
str_prob = 'Além de ter sido um fracasso de bilheteria e crítica, o filme acabou marcado pelos seus efeitos especiais, principalmente ao antropomorfizar os gatos, que, bem, ficam um pouco bisonhos.'
str_prob_2 = 'Professor livre docente da Unesp, Fortaleza é presidente da Sociedade Paulista de Infectologia e membro do Comitê de Contingência da COVID-19, do Governo do Estado de São Paulo.'
str_prob_3 = 'Com a segurança da imunização em massa e os números traduzindo sua eficácia, fica mais fácil para o americano médio sentir-se confiante em marcar sua próxima viagem, gerando um circulo virtuoso para o setor nos próximos meses.'


In [None]:
masker.replacer2(str_prob_3, 'círculo virtuoso', '<mask>', ' ')

In [None]:
masker.replacer2(str_prob, 'efeito especial', 'efeito <mask>', ' ')

In [None]:
masker.replace_mask_token(str_prob_2, 'livre-docente', 'livre-<mask>')

In [None]:
masker.replace_mask_token(str_prob, 'efeito especial', 'efeito <mask>', '<mask>')

In [None]:
masker.replace_mask_token(str_prob, 'efeito especial', '<mask>')

### Boolean features

Get features: Caps and Quotes.

Rationale:
- Capitalized MWEs (Banana Republic vs banana republic) are more likely to be proper nouns (PN)
- Quoted MWEs are more likely to be idiomatic
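
A minimal sketch of how the two boolean features might be computed. `caps_and_quotes` is a hypothetical helper, and the rules it uses (every word of the matched MWE starts with an uppercase letter; a quote character directly on both sides of the match) are assumptions, not necessarily what `features.get_features` does.

```python
import re

QUOTE_CHARS = '"\'“”‘’«»'


def caps_and_quotes(text, mwe):
    """Return the Caps and Quotes flags for the first MWE match in `text`."""
    match = re.search(re.escape(mwe), text, flags=re.IGNORECASE)
    if match is None:
        return {'Caps': False, 'Quotes': False}
    found = match.group(0)
    # Caps: every word of the MWE, as written in the text, is capitalized.
    caps = all(word[0].isupper() for word in found.split())
    # Quotes: a quote character immediately before and after the match.
    before = text[match.start() - 1] if match.start() > 0 else ''
    after = text[match.end()] if match.end() < len(text) else ''
    quotes = (
        before != '' and after != ''
        and before in QUOTE_CHARS and after in QUOTE_CHARS
    )
    return {'Caps': caps, 'Quotes': quotes}


proper = caps_and_quotes('She shops at Banana Republic.', 'banana republic')
quoted = caps_and_quotes('It is a "banana republic" now.', 'banana republic')
print(proper, quoted)
```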

In [None]:
zdf_masked_feats = features.get_features(zdf_masked)

In [None]:
zdf_masked_feats.groupby(['Language','Label','Caps'])['DataID'].count()

In [None]:
ddf_masked_feats = features.get_features(ddf_masked)

### Sentiment classifier

Rationale: idiomatic expressions are more likely to be affective (positive or negative).

Neutral sentiment probability is used as a proxy for literality.
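
To make the proxy concrete, the sketch below turns the raw scores (logits) of a three-class sentiment classifier into probabilities with a softmax and picks out the neutral class. The label order `('negative', 'neutral', 'positive')` and the function names are assumptions; the actual order depends on the configuration of the model that `sentiment.get_classifier_tokenizer` loads.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def neutral_probability(logits, labels=('negative', 'neutral', 'positive')):
    """Return the probability assigned to the neutral class."""
    probs = dict(zip(labels, softmax(logits)))
    return probs['neutral']


# Made-up logits: a clearly neutral sentence and an undecided one.
print(neutral_probability([-1.0, 3.0, -1.0]))
print(neutral_probability([0.0, 0.0, 0.0]))
```

Under the rationale above, a high neutral probability would point toward a literal reading and a low one toward an affective, possibly idiomatic, reading.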

In [None]:
sentiment_classifier, sentiment_tokenizer, sentiment_config = sentiment.get_classifier_tokenizer()

In [None]:
sentiment.get_sentiment(ddf_masked_feats['Target'].values[0], sentiment_classifier, sentiment_tokenizer, sentiment_config)

In [None]:
ddf_masked_feats_sent = sentiment.get_df_sentiments(ddf_masked_feats, sentiment_classifier, sentiment_tokenizer, sentiment_config)

In [None]:
zdf_masked_feats_sent = sentiment.get_df_sentiments(zdf_masked_feats, sentiment_classifier, sentiment_tokenizer, sentiment_config)

In [None]:
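# numeric_only=True skips the text columns, which recent pandas versions refuse to average.
zdf_masked_feats_sent[zdf_masked_feats_sent['Label'] == '0'].mean(numeric_only=True)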

### Backtranslation

Translate text from English to Portuguese and back (and vice versa if the source language is Portuguese).

Rationale: the expression is more likely to be idiomatic if it is not found in the backtranslation.
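
The check itself can be sketched as a round trip followed by a substring test. Everything here is illustrative: `record_trans_feature` and `survives_backtranslation` are hypothetical names, the `BT` and `Trans` keys simply mirror the column names used in this notebook, and the two "translators" are stand-in stubs, not MarianMT models.

```python
def survives_backtranslation(mwe, backtranslated):
    """True if the MWE appears verbatim (ignoring case) in the text."""
    return mwe.casefold() in backtranslated.casefold()


def record_trans_feature(mwe, text, translate_out, translate_back):
    """Translate out and back, then record whether the MWE survived."""
    round_trip = translate_back(translate_out(text))
    return {'BT': round_trip, 'Trans': survives_backtranslation(mwe, round_trip)}


sentence = 'They live in a banana republic.'

# Stub 1: a "faithful" round trip that only changes the letter case.
faithful = record_trans_feature('banana republic', sentence, str.upper, str.lower)


# Stub 2: a round trip that paraphrases the expression away.
def paraphrasing_back(text):
    return text.lower().replace('banana republic', 'unstable country')


paraphrased = record_trans_feature(
    'banana republic', sentence, str.upper, paraphrasing_back
)
print(faithful)
print(paraphrased)
```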

In [None]:
btmodel1, btmodel2, bttoken1, bttoken2 = translate.get_marian_models()

In [None]:
zdf_bt = translate.backtranslate(zdf_masked_feats_sent, btmodel1, btmodel2, bttoken1, bttoken2, batch_len=10)

In [None]:
ddf_bt = translate.backtranslate(ddf_masked_feats_sent, btmodel1, btmodel2, bttoken1, bttoken2, batch_len=10)

In [None]:
# zdf_bt.sort_values(by="BT", key=lambda x: x.str.len())

In [None]:
zdf_bt2 = translate.record_trans(zdf_bt)

In [None]:
ddf_bt2 = translate.record_trans(ddf_bt)

In [None]:
zdf_bt2.groupby(['Language','Label','Trans'])['DataID'].count()

In [None]:
ddf_bt2.groupby(['Language','Trans'])['ID'].count()

### Previous/next difference

Compare the embeddings of the Target to those of Previous/Next sentence.

Rationale: Idioms are semantic outliers, thus they are more likely to be dissimilar to the context.
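
One common way to quantify this dissimilarity is the cosine distance between sentence embeddings. The sketch below uses that measure on toy vectors; whether `embeds.get_prev_next_diff` uses cosine distance is an assumption, and the `Prevdiff` key is named by analogy with the `Nextdiff` column that appears later in this notebook.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal.

    Both vectors must be non-zero, otherwise the norm product is zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def prev_next_diff(prev_vec, target_vec, next_vec):
    """Distance of the target sentence to its previous and next sentence."""
    return {
        'Prevdiff': cosine_distance(prev_vec, target_vec),
        'Nextdiff': cosine_distance(target_vec, next_vec),
    }


# Toy 2-d "embeddings": the target matches the previous sentence
# and is orthogonal to the next one.
diffs = prev_next_diff([1.0, 0.0], [2.0, 0.0], [0.0, 3.0])
print(diffs)
```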


In [None]:
zdf_bt3 = embeds.get_prev_next_diff(zdf_bt2, modelname=multilingual_model)
ddf_bt3 = embeds.get_prev_next_diff(ddf_bt2, modelname=multilingual_model)

In [None]:
# zdf_bt3[zdf_bt3['Label'] == '0'].mean()
# zdf_bt3[zdf_bt3['Label'] == '1'].mean()

### Save data to disk

In [None]:
util.save_pickle(zdf_bt3, 'data/zdf_bt3')

In [None]:
util.save_pickle(ddf_bt3, 'data/ddf_bt3')
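
The file names used in the reload cells (a prefix followed by a date stamp and a run number) suggest that the save helper appends such a suffix. The sketch below shows one way this could work; it is a guess at the naming scheme, and `save_dated_pickle` is a hypothetical function, not the code of `util.save_pickle`.

```python
import datetime
import os
import pickle
import tempfile


def save_dated_pickle(obj, prefix, today=None):
    """Pickle `obj` to '<prefix>_<YYYYMMDD>_<n>.pkl' without overwriting."""
    today = today or datetime.date.today()
    stamp = today.strftime('%Y%m%d')
    run = 1
    # Increase the run number until an unused file name is found.
    while os.path.exists(f'{prefix}_{stamp}_{run}.pkl'):
        run += 1
    path = f'{prefix}_{stamp}_{run}.pkl'
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle)
    return path


# Demonstration in a temporary directory with a fixed date.
with tempfile.TemporaryDirectory() as tmp:
    day = datetime.date(2022, 1, 4)
    prefix = os.path.join(tmp, 'zdf_bt3')
    first = save_dated_pickle({'rows': 3}, prefix, today=day)
    second = save_dated_pickle({'rows': 3}, prefix, today=day)
    names = [os.path.basename(first), os.path.basename(second)]
print(names)
```

With a scheme like this, earlier runs are never overwritten, which is why the reload cells have to name a specific date and run number.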

### Reload data from disk

In [None]:
zdf_bt3 = pd.read_pickle('data/zdf_bt3_20220104_1.pkl')

In [None]:
ddf_bt3 = pd.read_pickle('data/ddf_bt3_20220104_1.pkl')

### Training

Combine the classifiers.

In [None]:
zdf_t = fitter.get_trainable(zdf_bt3)
ddf_t = fitter.get_trainable(ddf_bt3)

ddf_feat_score, ddf_feat_probs, ddf_feat_results = fitter.get_fit_results(zdf_t, zdf['Label'], ddf_t, ddf_gold['Label'])
ddf_feat_score

In [None]:
mu = fitter.multi_results(ddf_bt3, ddf_gold, z_results, z_probs, ddf_feat_results, ddf_feat_probs, ['Caps', 'Hassub'], ['Quotes'])
co = len(mu[mu['Prediction'] == mu['Label']])
print(co/len(mu))

In [None]:
# scikit-learn expects the gold labels first, then the predictions.
f1_score(mu['Label'], mu['Prediction'], average='macro')

In [None]:
mux = fitter.multi_results(ddf_bt3, ddf_gold, z_results, z_probs, ddf_feat_results, ddf_feat_probs, ['Caps', 'Hassub'],['!Trans','Quotes'])
cox = len(mux[mux['Prediction'] == mux['Label']])
print(cox/len(mux))

In [None]:
# scikit-learn expects the gold labels first, then the predictions.
f1_score(mux['Label'], mux['Prediction'], average='macro')

### One-shot model

In [None]:
# o_emb_multi = embeds.get_embeddings(odf, modelname=multilingual_model, append=['MWE'])
# ozdf = pd.concat([zdf,odf])
# oz_emb_multi = np.concatenate([z_emb_multi, o_emb_multi])

# oz_score, oz_probs, oz_results = fitter.get_fit_results(oz_emb_multi, ozdf['Label'], d_emb_multi, ddf_gold['Label'])
# oz_score

In [None]:
# odf_masked = masker.get_masked_features(odf)
# odf_masked_feats = features.get_features(odf_masked)
# odf_masked_feats_sent = sentiment.get_df_sentiments(odf_masked_feats, sentiment_classifier, sentiment_tokenizer, sentiment_config)
# odf_bt = translate.backtranslate(odf_masked_feats_sent, btmodel1, btmodel2, bttoken1, bttoken2, batch_len=10)
# odf_bt2 = translate.record_trans(odf_bt)
# ozdf_bt2 = pd.concat([zdf_bt2,odf_bt2])
# ozdf_t = fitter.get_trainable(ozdf_bt2)

# oddf_feat_score, oddf_feat_probs, oddf_feat_results = fitter.get_fit_results2(ozdf_t, ozdf['Label'], ddf_t, ddf_gold['Label'])
# oddf_feat_score

In [None]:
# omu = fitter.multi_results(ddf_bt2, ddf_gold, oz_results, oz_probs, oddf_feat_results, oddf_feat_probs, ['Caps', 'Hassub'])
# co = len(omu[omu['Prediction'] == omu['Label']])
# print(co/len(omu))

### Get the best features

Let's check some statistics first.

In [None]:
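# numeric_only=True restricts the mean to the numeric and boolean feature columns.
zdf_bt3[zdf_bt3['Label'] == '0'].mean(numeric_only=True)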

In [None]:
zdf_bt3[zdf_bt3['Label'] == '1'].mean(numeric_only=True)

In [None]:
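# Difference between the per-class feature means (label 1 minus label 0).
zdf_bt3[zdf_bt3['Label'] == '1'].mean(numeric_only=True) - zdf_bt3[zdf_bt3['Label'] == '0'].mean(numeric_only=True)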

Check the best and worst feature combinations, using at most three features at a time.

In [None]:
ff2 = fitter.check_feats(zdf_t, zdf['Label'], ddf_t, ddf_gold['Label'], minfeats=1, maxfeats=3)
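
A search like this can be sketched as an exhaustive loop over feature subsets. The code below is a simplified illustration, not the implementation of `fitter.check_feats`: `check_feature_subsets` is a hypothetical name, and the scoring function is a toy stand-in (a sum of made-up weights) for what would really be fitting a classifier on the chosen columns and scoring it on the dev set.

```python
from itertools import combinations


def check_feature_subsets(features, score_fn, minfeats=1, maxfeats=3):
    """Score every subset of `features` with minfeats..maxfeats members."""
    results = []
    for size in range(minfeats, maxfeats + 1):
        for subset in combinations(features, size):
            results.append((score_fn(subset), subset))
    # Best-scoring subsets first.
    return sorted(results, reverse=True)


# Toy scoring function with made-up weights, for illustration only.
weights = {'Caps': 0.3, 'Quotes': 0.2, 'Trans': 0.1, 'FS': -0.1}


def toy_score(subset):
    return sum(weights[name] for name in subset)


ranked = check_feature_subsets(['Caps', 'Quotes', 'Trans', 'FS'], toy_score)
print(len(ranked), ranked[0])
```

The number of subsets grows quickly with the feature count, which is one reason to cap the subset size or to prune weak features before a larger search.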

In [None]:
for col in ff2.columns[3:]:
    if col != 'Score':
        print(col, ff2[ff2[col]]['Score'].mean())

In [None]:
ff2.sort_values(by=['Score'], ascending=False)[:30]

Prune the worst-performing features.

In [None]:
dropcols = ['Top score 1', 'Top score 2', 'SS', 'FS', 'MWEdiff']

In [None]:
zdf_t2 = fitter.get_trainable(zdf_bt3).drop(dropcols, axis=1)
ddf_t2 = fitter.get_trainable(ddf_bt3).drop(dropcols, axis=1)

In [None]:
ff3 = fitter.check_feats(zdf_t2, zdf['Label'], ddf_t2, ddf_gold['Label'], minfeats=2)

In [None]:
ff3.sort_values(by=['Score'], ascending=False)

In [None]:
best = fitter.get_best_features(ff3, topn=2000)

In [None]:
bestres, bestest = fitter.get_bestest_features(best, zdf_bt3, ddf_bt3, ddf_gold, z_results, z_probs, autodrop=dropcols)

In [None]:
bestest.sort_values(by=['Score'], ascending=False)[:30]

In [None]:
# dropcols = ['Top score', 'FS', 'SS', 'Quotes', 'MWEdiff']
# dropcols = ['Top score', 'Top score 1', 'Top score 2', 'Hassub', 'FS', 'Nextdiff']
# dropcols = ['FoundIdx', 'MWEdiff', 'FS', 'Hassub']
dropcols2 = ['MWEdiff', 'FS']
zdf_t4 = fitter.get_trainable(zdf_bt3).drop(dropcols2, axis=1)
ddf_t4 = fitter.get_trainable(ddf_bt3).drop(dropcols2, axis=1)
ddf5_feat_score, ddf5_feat_probs, ddf5_feat_results = fitter.get_fit_results(zdf_t4, zdf['Label'], ddf_t4, ddf_gold['Label'])

In [None]:
mup = fitter.multi_results(ddf_bt3, ddf_gold, z_results, z_probs, ddf5_feat_results, ddf5_feat_probs, ['Caps', 'Hassub'],['Quotes'])
co = len(mup[mup['Prediction'] == mup['Label']])
print(co/len(mup))

In [None]:
# scikit-learn expects the gold labels first, then the predictions.
f1_score(mup['Label'], mup['Prediction'], average='macro')

### Evaluation data

In [None]:
e_emb_multi = embeds.get_embeddings(edf, modelname=multilingual_model, append=['MWE'])

In [None]:
ez_score, ez_probs, ez_results = fitter.get_fit_results(z_emb_multi, zdf['Label'], e_emb_multi)

In [None]:
edf_masked = masker.get_masked_features(edf)

In [None]:
edf_masked_feats = features.get_features(edf_masked)

In [None]:
edf_masked_feats_sent = sentiment.get_df_sentiments(edf_masked_feats, sentiment_classifier, sentiment_tokenizer, sentiment_config)

In [None]:
edf_bt = translate.backtranslate(edf_masked_feats_sent, btmodel1, btmodel2, bttoken1, bttoken2, batch_len=10)

In [None]:
edf_bt2 = translate.record_trans(edf_bt)

In [None]:
edf_bt3 = embeds.get_prev_next_diff(edf_bt2, modelname=multilingual_model)

Save evaluation data to disk.

In [None]:
util.save_pickle(edf_bt3, 'data/edf_bt3')

In [None]:
edf_bt3 = pd.read_pickle('data/edf_bt3_20220105_1.pkl')

#### Runs with test data

The test data was released on January 10, 2022.

In [None]:
t_emb_multi = embeds.get_embeddings(tdf, modelname=multilingual_model, append=['MWE'])

In [None]:
tz_score, tz_probs, tz_results = fitter.get_fit_results(z_emb_multi, zdf['Label'], t_emb_multi)

In [None]:
tdf_masked = masker.get_masked_features(tdf)

In [None]:
tdf_masked_feats = features.get_features(tdf_masked)

In [None]:
tdf_masked_feats_sent = sentiment.get_df_sentiments(tdf_masked_feats, sentiment_classifier, sentiment_tokenizer, sentiment_config)

Check how Galician sentences are translated.

In [None]:
tdf_gl = tdf_masked_feats_sent[tdf_masked_feats_sent['Language'] == 'GL']

In [None]:
tdf_gl_bt = translate.backtranslate(tdf_gl, btmodel1, btmodel2, bttoken1, bttoken2, batch_len=10)

In [None]:
tdf_gl_bt2 = translate.record_trans(tdf_gl_bt)

In [None]:
tdf_gl_bt2

Run backtranslation for the whole data.

In [None]:
tdf_bt = translate.backtranslate(tdf_masked_feats_sent, btmodel1, btmodel2, bttoken1, bttoken2, batch_len=10)

In [None]:
tdf_bt2 = translate.record_trans(tdf_bt)

In [None]:
tdf_bt3 = embeds.get_prev_next_diff(tdf_bt2, modelname=multilingual_model)

In [None]:
util.save_pickle(tdf_bt3, 'data/tdf_bt3')

In [None]:
# tdf_bt3 = pd.read_pickle('data/tdf_bt3_20220111_1.pkl')