# Ensemble and Result Analysis

## Importing Libraries

In [94]:
import pickle
import numpy as np
import pandas as pd
from scipy.stats import mode
import plotly.graph_objects as go
from sklearn.metrics import classification_report

### Loading Results

#### SentenceBERT

First, we need to unzip the data for SentenceBERT.

In [6]:
! unzip ../../models/SentenceBERT/SentenceBERT_dfs.zip

Archive:  ../../models/SentenceBERT/SentenceBERT_dfs.zip
  inflating: val_df_SentenceBERT.csv  
  inflating: __MACOSX/._val_df_SentenceBERT.csv  
  inflating: test_df_SentenceBERT.csv  
  inflating: __MACOSX/._test_df_SentenceBERT.csv  


In [7]:
sentence_bert_df = pd.read_csv('test_df_SentenceBERT.csv')
sentence_bert_df.head()

Unnamed: 0.1,Unnamed: 0,id,src,tgt,hyp,task,labels,label,p(Hallucination),src_embeddings,hyp_embeddings,tgt_embeddings,src_tgt,src_hyp,hyp_tgt
0,0,1,"Ты удивишься, если я скажу, что на самом деле ...",Would you be surprised if I told you my name i...,You're gonna be surprised if I say my real nam...,MT,"['Not Hallucination', 'Not Hallucination', 'No...",Not Hallucination,0.0,"[0.01028267852962017, -0.045175742357969284, 0...","[0.021117474883794785, -0.04136405885219574, 0...","[0.00677498197183013, -0.031440578401088715, 0...",0.929008,0.917389,0.972094
1,1,2,Еды будет полно.,There will be plenty of food.,The food will be full.,MT,"['Hallucination', 'Not Hallucination', 'Halluc...",Hallucination,0.8,"[-0.05674311891198158, -0.06271328032016754, -...","[-0.06109685078263283, -0.053402844816446304, ...","[-0.045577678829431534, -0.062114838510751724,...",0.792944,0.887604,0.824929
2,2,3,"Думаете, Том будет меня ждать?",Do you think that Tom will wait for me?,You think Tom's gonna wait for me?,MT,"['Not Hallucination', 'Not Hallucination', 'No...",Not Hallucination,0.2,"[-0.008699537254869938, -0.025732094421982765,...","[0.00991713348776102, -0.029770933091640472, 0...","[0.02544650062918663, -0.014093918725848198, 0...",0.881809,0.957938,0.941529
3,3,6,Два брата довольно разные.,The two brothers are pretty different.,There's a lot of friends.,MT,"['Hallucination', 'Hallucination', 'Hallucinat...",Hallucination,1.0,"[0.0023963418789207935, -0.0030373588670045137...","[-0.006189093459397554, -0.06312105804681778, ...","[0.007739310618489981, 0.018740057945251465, -...",0.893029,0.40799,0.37847
4,4,7,<define> Infradiaphragmatic </define> intra- a...,(medicine) Below the diaphragm.,(anatomy) Relating to the diaphragm.,DM,"['Hallucination', 'Hallucination', 'Hallucinat...",Hallucination,0.8,"[-0.01360238529741764, 0.0369502529501915, 0.0...","[-0.04313211515545845, -0.00162073178216815, 0...","[-0.06311369687318802, 0.029431713744997978, 0...",0.38899,0.355295,0.673098


In [15]:
with open('../../models/SentenceBERT/LR_SBERT.pickle', 'rb') as f:
    sentence_bert_lr = pickle.load(f)
sentence_bert_lr



In [8]:
# Remove the unzipped files
! rm test_df_SentenceBERT.csv
! rm val_df_SentenceBERT.csv
! rm -r __MACOSX

#### UniEval

Then, we load the results for UniEval.

In [10]:
unieval_df = pd.read_csv('../UniEval/test_df_UniEval.csv')
unieval_df.head()

Unnamed: 0.1,Unnamed: 0,id,src,tgt,hyp,task,labels,label,p(Hallucination),coherence,consistency,fluency,relevance,overall
0,0,1,"Ты удивишься, если я скажу, что на самом деле ...",Would you be surprised if I told you my name i...,You're gonna be surprised if I say my real nam...,MT,"['Not Hallucination', 'Not Hallucination', 'No...",Not Hallucination,0.0,0.962807,0.959963,0.946093,0.973031,0.960473
1,1,2,Еды будет полно.,There will be plenty of food.,The food will be full.,MT,"['Hallucination', 'Not Hallucination', 'Halluc...",Hallucination,0.8,0.886008,0.94417,0.926647,0.908745,0.916392
2,2,3,"Думаете, Том будет меня ждать?",Do you think that Tom will wait for me?,You think Tom's gonna wait for me?,MT,"['Not Hallucination', 'Not Hallucination', 'No...",Not Hallucination,0.2,0.934748,0.944839,0.962526,0.956602,0.949679
3,3,6,Два брата довольно разные.,The two brothers are pretty different.,There's a lot of friends.,MT,"['Hallucination', 'Hallucination', 'Hallucinat...",Hallucination,1.0,0.903896,0.931672,0.957336,0.843209,0.909028
4,4,7,<define> Infradiaphragmatic </define> intra- a...,(medicine) Below the diaphragm.,(anatomy) Relating to the diaphragm.,DM,"['Hallucination', 'Hallucination', 'Hallucinat...",Hallucination,0.8,0.931151,0.947555,0.930656,0.743565,0.888232


In [30]:
with open('../../models/UniEval/LR_UniEval.pickle', 'rb') as f:
    unieval_lr = pickle.load(f)
unieval_lr



#### Fine-Tuned Llama 3

Finally, we load the results for the fine-tuned Llama 3 model.

In [11]:
llm_df = pd.read_csv('../Fine-tuned LLM/test_df_LLM.csv')
llm_df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,id,src,tgt,hyp,task,labels,label,p(Hallucination),text,prediction
0,0,0,1,"Ты удивишься, если я скажу, что на самом деле ...",Would you be surprised if I told you my name i...,You're gonna be surprised if I say my real nam...,MT,"['Not Hallucination', 'Not Hallucination', 'No...",Not Hallucination,0.0,### user: For the task MT Given a source sente...,0
1,1,1,2,Еды будет полно.,There will be plenty of food.,The food will be full.,MT,"['Hallucination', 'Not Hallucination', 'Halluc...",Hallucination,0.8,### user: For the task MT Given a source sente...,1
2,2,2,3,"Думаете, Том будет меня ждать?",Do you think that Tom will wait for me?,You think Tom's gonna wait for me?,MT,"['Not Hallucination', 'Not Hallucination', 'No...",Not Hallucination,0.2,### user: For the task MT Given a source sente...,1
3,3,3,6,Два брата довольно разные.,The two brothers are pretty different.,There's a lot of friends.,MT,"['Hallucination', 'Hallucination', 'Hallucinat...",Hallucination,1.0,### user: For the task MT Given a source sente...,1
4,4,4,7,<define> Infradiaphragmatic </define> intra- a...,(medicine) Below the diaphragm.,(anatomy) Relating to the diaphragm.,DM,"['Hallucination', 'Hallucination', 'Hallucinat...",Hallucination,0.8,### user: For the task DM Given a source sente...,1


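We also need the gold labels for the test set. An example is labeled as a hallucination (1) when its `p(Hallucination)` is above 0.5, and as not a hallucination (0) otherwise.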

In [13]:
labels = np.array([1 if x > 0.5 else 0 for x in sentence_bert_df['p(Hallucination)'].tolist()])
labels

array([0, 1, 0, ..., 0, 0, 0])
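
The list comprehension above can also be written as a vectorized comparison. Below is a minimal sketch on made-up probabilities rather than the test set (the `probs` array is hypothetical). Note that a value of exactly 0.5 maps to class 0 under the strict `>` used in the cell above.

```python
import numpy as np

# Hypothetical p(Hallucination) values, for illustration only
probs = np.array([0.0, 0.8, 0.2, 1.0, 0.5])

# Strict threshold, matching `1 if x > 0.5 else 0`
toy_labels = (probs > 0.5).astype(int)
print(toy_labels)  # [0 1 0 1 0]
```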

## Results Analysis

Here, we first compute the predictions and classification report for each model, and then compare the models.

### SentenceBERT

In [24]:
sentence_bert_preds = sentence_bert_lr.predict(sentence_bert_df[['src_tgt', 'src_hyp', 'hyp_tgt']].values)
sentence_bert_preds = np.array([1 if x > 0.5 else 0 for x in sentence_bert_preds])
sentence_bert_results = classification_report(labels, sentence_bert_preds, output_dict=True)
print(classification_report(labels, sentence_bert_preds))

              precision    recall  f1-score   support

           0       0.69      0.84      0.76       889
           1       0.66      0.45      0.53       611

    accuracy                           0.68      1500
   macro avg       0.67      0.64      0.65      1500
weighted avg       0.68      0.68      0.67      1500



### UniEval

In [31]:
unieval_preds = unieval_lr.predict(unieval_df[['coherence', 'consistency', 'fluency', 'relevance']].values)
unieval_preds = np.array([1 if x > 0.5 else 0 for x in unieval_preds])
unieval_results = classification_report(labels, unieval_preds, output_dict=True)
print(classification_report(labels, unieval_preds))

              precision    recall  f1-score   support

           0       0.65      0.85      0.74       889
           1       0.61      0.34      0.43       611

    accuracy                           0.64      1500
   macro avg       0.63      0.59      0.59      1500
weighted avg       0.63      0.64      0.61      1500



### Fine-Tuned Llama 3

In [29]:
llm_preds = llm_df['prediction'].values
llm_results = classification_report(labels, llm_preds, output_dict=True)
print(classification_report(labels, llm_preds))

              precision    recall  f1-score   support

           0       0.85      0.61      0.71       889
           1       0.60      0.84      0.70       611

    accuracy                           0.70      1500
   macro avg       0.72      0.73      0.70      1500
weighted avg       0.75      0.70      0.71      1500



### Ensemble

In [93]:
ensemble_preds = np.vstack([sentence_bert_preds, unieval_preds, llm_preds])
majority_vote_predictions, _ = mode(ensemble_preds, axis=0)
majority_vote_predictions = majority_vote_predictions.flatten()
ensemble_results = classification_report(labels, majority_vote_predictions, output_dict=True)
print(classification_report(labels, majority_vote_predictions))

              precision    recall  f1-score   support

           0       0.74      0.85      0.79       889
           1       0.71      0.56      0.63       611

    accuracy                           0.73      1500
   macro avg       0.72      0.70      0.71      1500
weighted avg       0.73      0.73      0.72      1500
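
To make the voting rule concrete: with three binary voters, the per-example mode is simply "at least two models predicted 1". The sketch below illustrates this on a hypothetical `votes` array (not the notebook's predictions) using NumPy only.

```python
import numpy as np

# Hypothetical binary predictions: three models (rows), five examples (columns)
votes = np.array([
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
])

# With an odd number of binary voters, the mode is 1 exactly when
# more than half of the voters predicted 1
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print(majority)  # [0 1 0 1 0]
```

Because the number of voters is odd and the labels are binary, ties cannot occur in this setting.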



Now, let's analyze the results by plotting different metrics in the classification report.

In [64]:
results = {
    'SentenceBERT': sentence_bert_results,
    'UniEval': unieval_results,
    'LLM': llm_results,
    'Majority Vote': ensemble_results
}

results_df = pd.DataFrame(results).transpose()
results_df['precision'] = results_df['weighted avg'].apply(lambda x: x['precision'])
results_df['recall'] = results_df['weighted avg'].apply(lambda x: x['recall'])
results_df['f1-score'] = results_df['weighted avg'].apply(lambda x: x['f1-score'])
results_df.head()

Unnamed: 0,0,1,accuracy,macro avg,weighted avg,precision,recall,f1-score
SentenceBERT,"{'precision': 0.6894639556377079, 'recall': 0....","{'precision': 0.6578947368421053, 'recall': 0....",0.680667,"{'precision': 0.6736793462399067, 'recall': 0....","{'precision': 0.6766047605149658, 'recall': 0....",0.676605,0.680667,0.666354
UniEval,"{'precision': 0.6509028374892519, 'recall': 0....","{'precision': 0.6083086053412463, 'recall': 0....",0.641333,"{'precision': 0.6296057214152491, 'recall': 0....","{'precision': 0.633552786927631, 'recall': 0.6...",0.633553,0.641333,0.613447
LLM,"{'precision': 0.8462732919254659, 'recall': 0....","{'precision': 0.5981308411214953, 'recall': 0....",0.704667,"{'precision': 0.7222020665234805, 'recall': 0....","{'precision': 0.745196600297982, 'recall': 0.7...",0.745197,0.704667,0.705728
Majority Vote,"{'precision': 0.735812133072407, 'recall': 0.8...","{'precision': 0.7133891213389121, 'recall': 0....",0.728667,"{'precision': 0.7246006272056595, 'recall': 0....","{'precision': 0.72667849295963, 'recall': 0.72...",0.726678,0.728667,0.72154
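
The nested dictionaries in the table above come from `classification_report(..., output_dict=True)`, which stores `accuracy` as a float and each averaged row as a dict. As a rough sketch of an alternative to the `apply` calls, the helper below (the name `summarize` and the toy numbers are assumptions, not part of the notebook) pulls out only the four plotted values.

```python
def summarize(report):
    """Extract accuracy and the weighted-average metrics from one report dict."""
    weighted = report["weighted avg"]
    return {
        "accuracy": report["accuracy"],
        "precision": weighted["precision"],
        "recall": weighted["recall"],
        "f1-score": weighted["f1-score"],
    }

# Toy report with the same keys as the real output; the values are
# placeholders, not results from this notebook
toy_report = {
    "accuracy": 0.75,
    "weighted avg": {"precision": 0.83, "recall": 0.75, "f1-score": 0.73, "support": 4},
}

summary = {name: summarize(rep) for name, rep in {"ToyModel": toy_report}.items()}
print(summary)
```

A dict of this shape can be passed to `pd.DataFrame(summary).transpose()` to get one row per model with flat numeric columns.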


In [74]:
## Plotting the results (Accuracy, Precision, Recall, F1-Score)
fig = go.Figure()
fig.add_trace(go.Bar(x=results_df.index, y=results_df['accuracy'], name='Accuracy'))
fig.add_trace(go.Bar(x=results_df.index, y=results_df['precision'], name='Precision'))
fig.add_trace(go.Bar(x=results_df.index, y=results_df['recall'], name='Recall'))
fig.add_trace(go.Bar(x=results_df.index, y=results_df['f1-score'], name='F1-Score'))
fig.update_layout(barmode='group', title='Models Performance Metrics', xaxis_title='Model', yaxis_title='Score')
fig.show()

As shown above, the majority-vote ensemble has the best accuracy (0.729), weighted recall (0.729), and weighted F1-score (0.722). The fine-tuned Llama 3 model has the best weighted precision (0.745), followed by the ensemble (0.727) and SentenceBERT (0.677). UniEval has the worst performance on all four metrics.
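
Note that the `recall` and `accuracy` columns of `results_df` are identical for every model. This is expected: the support-weighted average of the per-class recalls is the overall fraction of correct predictions. The sketch below checks this on hypothetical labels and predictions (not the notebook's data).

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

accuracy = (y_true == y_pred).mean()

# Support-weighted average of the per-class recalls
weighted_recall = 0.0
for cls in np.unique(y_true):
    mask = y_true == cls                       # examples whose true class is cls
    recall_cls = (y_pred[mask] == cls).mean()  # fraction of them predicted correctly
    weighted_recall += recall_cls * mask.mean()  # weight by the class share

# The two values agree up to floating-point rounding
print(accuracy, weighted_recall)
```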

### Results Analysis for Tasks

Now, let's analyze the results for each of the three tasks of Definition Modeling (DM), Machine Translation (MT), and Paraphrase Generation (PG).

In [89]:
def analyze_task(task_name):
    
    ## Getting the index of the task in the test df
    task_idx = sentence_bert_df[sentence_bert_df['task'] == task_name].index
    labels_task = labels[task_idx]

    ## Getting each model's predictions for the given task
    sentence_bert_task_preds = sentence_bert_preds[task_idx]
    unieval_task_preds = unieval_preds[task_idx]
    llm_task_preds = llm_preds[task_idx]
    majority_vote_task_preds = majority_vote_predictions[task_idx]

    ## Getting the classification report of each model for the given task
    sentence_bert_task_results = classification_report(labels_task, sentence_bert_task_preds, output_dict=True)
    print("SentenceBERT")
    print(classification_report(labels_task, sentence_bert_task_preds))
    unieval_task_results = classification_report(labels_task, unieval_task_preds, output_dict=True)
    print("UniEval")
    print(classification_report(labels_task, unieval_task_preds))
    llm_task_results = classification_report(labels_task, llm_task_preds, output_dict=True)
    print("LLM")
    print(classification_report(labels_task, llm_task_preds))
    ensemble_task_results = classification_report(labels_task, majority_vote_task_preds, output_dict=True)
    print("Majority Vote")
    print(classification_report(labels_task, majority_vote_task_preds))

    task_results = {
        'SentenceBERT': sentence_bert_task_results,
        'UniEval': unieval_task_results,
        'LLM': llm_task_results,
        'Majority Vote': ensemble_task_results
    }

    task_results_df = pd.DataFrame(task_results).transpose()
    task_results_df['precision'] = task_results_df['weighted avg'].apply(lambda x: x['precision'])
    task_results_df['recall'] = task_results_df['weighted avg'].apply(lambda x: x['recall'])
    task_results_df['f1-score'] = task_results_df['weighted avg'].apply(lambda x: x['f1-score'])

    ## Plotting the results (Accuracy, Precision, Recall, F1-Score) for the given task
    fig = go.Figure()
    fig.add_trace(go.Bar(x=task_results_df.index, y=task_results_df['accuracy'], name='Accuracy'))
    fig.add_trace(go.Bar(x=task_results_df.index, y=task_results_df['precision'], name='Precision'))
    fig.add_trace(go.Bar(x=task_results_df.index, y=task_results_df['recall'], name='Recall'))
    fig.add_trace(go.Bar(x=task_results_df.index, y=task_results_df['f1-score'], name='F1-Score'))
    fig.update_layout(barmode='group', title=f'Models Performance Metrics on {task_name} Task', xaxis_title='Model', yaxis_title='Score')
    fig.show()
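
`analyze_task` slices the NumPy prediction arrays with the index labels of `sentence_bert_df`, which works because the DataFrame has the default 0..n-1 index. A boolean mask does not rely on that assumption. The sketch below shows the idea on hypothetical `tasks` and `preds` arrays (not the notebook's data).

```python
import numpy as np

# Hypothetical task tags and predictions, for illustration only
tasks = np.array(["MT", "DM", "MT", "PG", "DM"])
preds = np.array([0, 1, 1, 0, 1])

# Boolean mask: True where the row belongs to the requested task
dm_mask = tasks == "DM"
dm_preds = preds[dm_mask]
print(dm_preds)  # [1 1]
```

With a DataFrame, the equivalent mask would be `(sentence_bert_df['task'] == task_name).to_numpy()`.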

#### DM

In [90]:
analyze_task('DM')

SentenceBERT
              precision    recall  f1-score   support

           0       0.60      0.62      0.61       275
           1       0.63      0.61      0.62       288

    accuracy                           0.61       563
   macro avg       0.61      0.61      0.61       563
weighted avg       0.61      0.61      0.61       563

UniEval
              precision    recall  f1-score   support

           0       0.54      0.87      0.67       275
           1       0.70      0.30      0.42       288

    accuracy                           0.58       563
   macro avg       0.62      0.58      0.54       563
weighted avg       0.62      0.58      0.54       563

LLM
              precision    recall  f1-score   support

           0       0.89      0.39      0.54       275
           1       0.62      0.95      0.75       288

    accuracy                           0.68       563
   macro avg       0.76      0.67      0.65       563
weighted avg       0.75      0.68      0.65      

#### MT

In [87]:
analyze_task('MT')

SentenceBERT
              precision    recall  f1-score   support

           0       0.65      0.98      0.78       336
           1       0.88      0.23      0.36       226

    accuracy                           0.68       562
   macro avg       0.77      0.60      0.57       562
weighted avg       0.74      0.68      0.61       562

UniEval
              precision    recall  f1-score   support

           0       0.69      0.91      0.78       336
           1       0.74      0.38      0.51       226

    accuracy                           0.70       562
   macro avg       0.71      0.65      0.64       562
weighted avg       0.71      0.70      0.67       562

LLM
              precision    recall  f1-score   support

           0       0.81      0.74      0.77       336
           1       0.66      0.75      0.70       226

    accuracy                           0.74       562
   macro avg       0.74      0.74      0.74       562
weighted avg       0.75      0.74      0.74      

#### PG

In [91]:
analyze_task('PG')

SentenceBERT
              precision    recall  f1-score   support

           0       0.84      0.88      0.86       278
           1       0.60      0.51      0.55        97

    accuracy                           0.79       375
   macro avg       0.72      0.70      0.71       375
weighted avg       0.78      0.79      0.78       375

UniEval
              precision    recall  f1-score   support

           0       0.76      0.77      0.77       278
           1       0.33      0.32      0.32        97

    accuracy                           0.65       375
   macro avg       0.55      0.54      0.54       375
weighted avg       0.65      0.65      0.65       375

LLM
              precision    recall  f1-score   support

           0       0.87      0.68      0.76       278
           1       0.44      0.70      0.54        97

    accuracy                           0.69       375
   macro avg       0.65      0.69      0.65       375
weighted avg       0.76      0.69      0.71      

As you can see, the majority-vote ensemble performs somewhat better on all tasks. Llama 3 and SentenceBERT also perform reasonably well and are generally slightly better than UniEval, although UniEval edges out SentenceBERT on MT accuracy (0.70 vs. 0.68). Among the individual models, SentenceBERT performs best on the PG task (accuracy 0.79), while Llama 3 performs best on the DM (0.68) and MT (0.74) tasks.