# UniEval Model

### Installing and Importing Libraries

In [None]:
! git clone https://github.com/maszhongming/UniEval.git

In [2]:
import os

os.chdir('UniEval')

In [None]:
! pip install -r requirements.txt

In [67]:
import numpy as np
import pandas as pd
from tqdm import tqdm, trange
from utils import convert_to_json
from metric.evaluator import get_evaluator
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.linear_model import LinearRegression
from sklearn.metrics import precision_recall_curve

## Loading the data

As we are going to use the UniEval model in a zero-shot setting, we only load the val and test datasets.

In [40]:
val_df = pd.read_json('/kaggle/input/shroom/train-dev-test-split/SHROOM_dev-v2/val.model-agnostic.json')
test_df = pd.read_json('/kaggle/input/shroom/train-dev-test-split/SHROOM_test-labeled/test.model-agnostic.json')

In [41]:
val_df.head()

Unnamed: 0,hyp,ref,src,tgt,model,task,labels,label,p(Hallucination)
0,Resembling or characteristic of a weasel.,tgt,The writer had just entered into his eighteent...,Resembling a weasel (in appearance).,,DM,"[Hallucination, Not Hallucination, Not Halluci...",Not Hallucination,0.2
1,Alternative form of sheath knife,tgt,Sailors ' and fishermen 's <define> sheath - k...,.,,DM,"[Hallucination, Hallucination, Hallucination, ...",Hallucination,0.8
2,(obsolete) A short period of time.,tgt,"As to age , Bead could not form any clear impr...","(poetic) An instant, a short moment.",,DM,"[Not Hallucination, Not Hallucination, Not Hal...",Not Hallucination,0.0
3,(slang) An incel.,tgt,Because redpillers are usually normies or <def...,"(incel, _, slang) A man of a slightly lower ra...",,DM,"[Not Hallucination, Not Hallucination, Halluci...",Not Hallucination,0.2
4,"An island in Lienchiang County, Taiwan.",tgt,On the second day of massive live - fire drill...,"An island in Dongyin, Lienchiang, Taiwan, in t...",,DM,"[Not Hallucination, Not Hallucination, Not Hal...",Not Hallucination,0.0
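
The `labels`, `label`, and `p(Hallucination)` columns look related: the rows shown are consistent with `p(Hallucination)` being the fraction of annotators who voted "Hallucination", and `label` being the majority vote. The following is a minimal sketch under that assumption; the vote list and both helper functions are hypothetical and not taken from the dataset.

```python
# Hypothetical annotator votes for one sample (illustrative, not from the dataset).
votes = [
    "Hallucination",
    "Not Hallucination",
    "Not Hallucination",
    "Not Hallucination",
    "Not Hallucination",
]


def p_hallucination(votes):
    """Fraction of annotators who voted 'Hallucination' (assumed definition)."""
    return sum(v == "Hallucination" for v in votes) / len(votes)


def majority_label(votes):
    """Majority label, assuming p > 0.5 maps to 'Hallucination'."""
    return "Hallucination" if p_hallucination(votes) > 0.5 else "Not Hallucination"


p = p_hallucination(votes)
label = majority_label(votes)
print(p, label)
```

The same `> 0.5` cut-off is what the later cells use to turn `p(Hallucination)` into binary labels.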


In [42]:
test_df.head()

Unnamed: 0,id,src,tgt,hyp,task,labels,label,p(Hallucination)
0,1,"Ты удивишься, если я скажу, что на самом деле ...",Would you be surprised if I told you my name i...,You're gonna be surprised if I say my real nam...,MT,"[Not Hallucination, Not Hallucination, Not Hal...",Not Hallucination,0.0
1,2,Еды будет полно.,There will be plenty of food.,The food will be full.,MT,"[Hallucination, Not Hallucination, Hallucinati...",Hallucination,0.8
2,3,"Думаете, Том будет меня ждать?",Do you think that Tom will wait for me?,You think Tom's gonna wait for me?,MT,"[Not Hallucination, Not Hallucination, Not Hal...",Not Hallucination,0.2
3,6,Два брата довольно разные.,The two brothers are pretty different.,There's a lot of friends.,MT,"[Hallucination, Hallucination, Hallucination, ...",Hallucination,1.0
4,7,<define> Infradiaphragmatic </define> intra- a...,(medicine) Below the diaphragm.,(anatomy) Relating to the diaphragm.,DM,"[Hallucination, Hallucination, Hallucination, ...",Hallucination,0.8


## Loading the UniEval model

Here we use the summarization evaluator of UniEval, since it is more aligned with our type of hallucinations (sentences here are rather short).

In [43]:
task = 'summarization'

evaluator = get_evaluator(task)



In [44]:
## Evaluate Val data
val_src_list = val_df['src'].tolist()
val_hyp_list = val_df['hyp'].tolist()
val_tgt_list = val_df['tgt'].tolist()

val_data = convert_to_json(output_list=val_hyp_list, src_list=val_src_list, ref_list=val_tgt_list)
val_eval_scores = evaluator.evaluate(val_data)

## Evaluate Test data
test_src_list = test_df['src'].tolist()
test_hyp_list = test_df['hyp'].tolist()
test_tgt_list = test_df['tgt'].tolist()

test_data = convert_to_json(output_list=test_hyp_list, src_list=test_src_list, ref_list=test_tgt_list)
test_eval_scores = evaluator.evaluate(test_data)

Evaluating coherence of 499 samples !!!


100%|██████████| 63/63 [00:07<00:00,  8.07it/s]


Evaluating consistency of 499 samples !!!


100%|██████████| 63/63 [00:07<00:00,  8.26it/s]


Evaluating fluency of 499 samples !!!


100%|██████████| 63/63 [00:03<00:00, 16.60it/s]


Evaluating relevance of 499 samples !!!


100%|██████████| 63/63 [00:05<00:00, 12.33it/s]


Evaluating coherence of 1500 samples !!!


100%|██████████| 188/188 [00:26<00:00,  6.97it/s]


Evaluating consistency of 1500 samples !!!


100%|██████████| 190/190 [00:26<00:00,  7.08it/s]


Evaluating fluency of 1500 samples !!!


100%|██████████| 190/190 [00:11<00:00, 16.33it/s]


Evaluating relevance of 1500 samples !!!


100%|██████████| 188/188 [00:16<00:00, 11.74it/s]


### Train a linear regression (LR) model for the classification

Here we fit a simple linear regression (LR) model that predicts p(Hallucination) from the scores (coherence, consistency, fluency, and relevance) obtained from the UniEval model; thresholding its prediction at 0.5 then classifies each sentence into one of the 2 classes.

In [45]:
## Concat the results to the main dfs

val_eval_scores_df = pd.DataFrame(val_eval_scores)
val_df = pd.concat([val_df, val_eval_scores_df], axis=1)

test_eval_scores_df = pd.DataFrame(test_eval_scores)
test_df = pd.concat([test_df, test_eval_scores_df], axis=1)
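
`pd.concat(..., axis=1)` pairs rows by index label, not by position. That appears to be safe here because both frames carry the default 0..n-1 index, but it is worth keeping in mind if rows are filtered before the scores are attached. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

# Hypothetical frame with a non-default index (e.g. after filtering rows).
df = pd.DataFrame({"hyp": ["a", "b", "c"]}, index=[0, 2, 5])

# Hypothetical per-sample scores, in the same row order as df.
scores = [{"overall": 0.9}, {"overall": 0.4}, {"overall": 0.7}]
scores_df = pd.DataFrame(scores)  # gets the default index 0, 1, 2

# Index labels differ, so the outer join produces extra rows filled with NaN.
misaligned = pd.concat([df, scores_df], axis=1)

# Resetting the index first keeps rows paired by position.
aligned = pd.concat([df.reset_index(drop=True), scores_df], axis=1)

print(len(misaligned), len(aligned))
```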

Below are the updated datasets with the scores from the UniEval model.

In [46]:
val_df.head()

Unnamed: 0,hyp,ref,src,tgt,model,task,labels,label,p(Hallucination),coherence,consistency,fluency,relevance,overall
0,Resembling or characteristic of a weasel.,tgt,The writer had just entered into his eighteent...,Resembling a weasel (in appearance).,,DM,"[Hallucination, Not Hallucination, Not Halluci...",Not Hallucination,0.2,0.945565,0.96525,0.946174,0.910977,0.941992
1,Alternative form of sheath knife,tgt,Sailors ' and fishermen 's <define> sheath - k...,.,,DM,"[Hallucination, Hallucination, Hallucination, ...",Hallucination,0.8,0.315116,0.413345,0.751151,0.145348,0.40624
2,(obsolete) A short period of time.,tgt,"As to age , Bead could not form any clear impr...","(poetic) An instant, a short moment.",,DM,"[Not Hallucination, Not Hallucination, Not Hal...",Not Hallucination,0.0,0.906511,0.916859,0.911176,0.82243,0.889244
3,(slang) An incel.,tgt,Because redpillers are usually normies or <def...,"(incel, _, slang) A man of a slightly lower ra...",,DM,"[Not Hallucination, Not Hallucination, Halluci...",Not Hallucination,0.2,0.870522,0.909992,0.41205,0.845877,0.759611
4,"An island in Lienchiang County, Taiwan.",tgt,On the second day of massive live - fire drill...,"An island in Dongyin, Lienchiang, Taiwan, in t...",,DM,"[Not Hallucination, Not Hallucination, Not Hal...",Not Hallucination,0.0,0.916784,0.959942,0.926151,0.728723,0.8829


In [47]:
test_df.head()

Unnamed: 0,id,src,tgt,hyp,task,labels,label,p(Hallucination),coherence,consistency,fluency,relevance,overall
0,1,"Ты удивишься, если я скажу, что на самом деле ...",Would you be surprised if I told you my name i...,You're gonna be surprised if I say my real nam...,MT,"[Not Hallucination, Not Hallucination, Not Hal...",Not Hallucination,0.0,0.962807,0.959963,0.946093,0.973031,0.960473
1,2,Еды будет полно.,There will be plenty of food.,The food will be full.,MT,"[Hallucination, Not Hallucination, Hallucinati...",Hallucination,0.8,0.886008,0.94417,0.926647,0.908745,0.916392
2,3,"Думаете, Том будет меня ждать?",Do you think that Tom will wait for me?,You think Tom's gonna wait for me?,MT,"[Not Hallucination, Not Hallucination, Not Hal...",Not Hallucination,0.2,0.934748,0.944839,0.962526,0.956602,0.949679
3,6,Два брата довольно разные.,The two brothers are pretty different.,There's a lot of friends.,MT,"[Hallucination, Hallucination, Hallucination, ...",Hallucination,1.0,0.903896,0.931672,0.957336,0.843209,0.909028
4,7,<define> Infradiaphragmatic </define> intra- a...,(medicine) Below the diaphragm.,(anatomy) Relating to the diaphragm.,DM,"[Hallucination, Hallucination, Hallucination, ...",Hallucination,0.8,0.931151,0.947555,0.930656,0.743565,0.888232


Now, let's train the linear regression model.

In [48]:
val_ph = val_df['p(Hallucination)'].tolist()
test_ph = test_df['p(Hallucination)'].tolist()

X_train = val_df[['coherence', 'consistency', 'fluency', 'relevance']].values
y_train = np.array(val_ph)

X_test = test_df[['coherence', 'consistency', 'fluency', 'relevance']].values
y_test = np.array(test_ph)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"MSE Loss: {mse}")

MSE Loss: 0.11096954257055298
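
To make the fit-then-threshold step concrete, here is a minimal sketch of ordinary least squares with an intercept, which is what `LinearRegression` fits by default, written with plain NumPy. The data is synthetic and the target function is a hypothetical stand-in; none of it comes from the SHROOM files.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the four UniEval scores (coherence, consistency,
# fluency, relevance). These are NOT the notebook's data.
X_train = rng.uniform(0.0, 1.0, size=(200, 4))
X_test = rng.uniform(0.0, 1.0, size=(50, 4))


def make_target(X):
    """Hypothetical target: lower scores mean a higher hallucination probability."""
    return 1.0 - X.mean(axis=1)


y_train, y_test = make_target(X_train), make_target(X_test)

# Ordinary least squares with an intercept column appended to the features.
A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
y_pred = A_test @ coef

mse = float(np.mean((y_test - y_pred) ** 2))

# Binarize both the target and the prediction at 0.5.
true_labels = (y_test > 0.5).astype(int)
pred_labels = (y_pred > 0.5).astype(int)
print(f"MSE: {mse:.6f}, agreement: {(true_labels == pred_labels).mean():.2f}")
```

A linear model's output is not constrained to the [0, 1] range, so predictions should be read as scores to be thresholded rather than as calibrated probabilities.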


## Results

Here we compare two approaches: (1) thresholding the UniEval overall score, using the threshold that maximizes the F1 score, and (2) thresholding at 0.5 the output of the linear regression (LR) fitted on the 4 evaluation dimensions.

In [79]:
test_labels = [1 if x > 0.5 else 0 for x in y_test]

precision, recall, thresholds = precision_recall_curve(test_labels, y_pred)

fscore = (2 * precision * recall) / (precision + recall)
ix = np.argmax(fscore)

print(f'Best threshold: {thresholds[ix]:.4f}, F-Score: {fscore[ix]:.4f}')

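# Note: the threshold above was selected on y_pred (the regression outputs),
# but it is applied here to the 'overall' column.
test_pred = [1 if x > thresholds[ix] else 0 for x in test_df['overall'].tolist()]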

print("'Overall' Column  Results: ")
print(classification_report(test_labels, test_pred))

Best threshold: 0.3909, F-Score: 0.6124
'Overall' Column  Results: 
              precision    recall  f1-score   support

           0       0.30      0.01      0.02       889
           1       0.40      0.97      0.57       611

    accuracy                           0.40      1500
   macro avg       0.35      0.49      0.29      1500
weighted avg       0.34      0.40      0.24      1500
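
Two details of the cell above may contribute to the low accuracy: the threshold is selected on `y_pred` but applied to `overall`, and a higher `overall` score indicates better quality, so `overall > threshold` marks high-quality outputs as hallucinations. The sketch below shows one possible alternative, assuming `1 - overall` is used as the hallucination score and the threshold is tuned on that same score. The helper functions and the toy values are hypothetical.

```python
def f1_at(threshold, scores, labels):
    """F1 of the positive class when predicting 1 for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try every observed score as a threshold and keep the best F1."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))


# Hypothetical 'overall' scores and binary labels (1 = hallucination).
overall = [0.95, 0.90, 0.88, 0.41, 0.35, 0.76, 0.30, 0.92]
labels = [0, 0, 0, 1, 1, 0, 1, 0]

# Higher 'overall' means better quality, so flip it into a hallucination score.
halluc_score = [1.0 - s for s in overall]

t = best_threshold(halluc_score, labels)
preds = [1 if s >= t else 0 for s in halluc_score]
print(f"threshold={t:.2f}, F1={f1_at(t, halluc_score, labels):.2f}")
```

In practice the threshold would typically be chosen on the validation split and then applied unchanged to the test split, so that the test labels are not used for tuning.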



In [51]:
test_labels = [1 if x > 0.5 else 0 for x in y_test]
test_pred = [1 if x > 0.5 else 0 for x in y_pred]

print("Linear Regression Results: ")
print(classification_report(test_labels, test_pred))

Linear Regression Results: 
              precision    recall  f1-score   support

           0       0.65      0.85      0.74       889
           1       0.61      0.34      0.43       611

    accuracy                           0.64      1500
   macro avg       0.63      0.59      0.59      1500
weighted avg       0.63      0.64      0.61      1500



As you can see, the LR model performs better overall than thresholding the overall column score (accuracy 0.64 vs. 0.40, macro-average F1 0.59 vs. 0.29).
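
For reading the reports, it helps to know how the `macro avg` and `weighted avg` rows relate to the per-class rows: the macro average is the plain mean over classes, while the weighted average weights each class by its support. A small sketch using the rounded per-class F1 values and supports from the LR report above:

```python
# Per-class f1-score and support, copied from the LR report above (rounded values).
f1 = {0: 0.74, 1: 0.43}
support = {0: 889, 1: 611}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class weighted by its number of samples.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"macro={macro_f1:.4f}, weighted={weighted_f1:.4f}")
```

The results (about 0.585 and 0.614) are in line with the reported 0.59 and 0.61; small differences are expected because the inputs are already rounded to two decimals.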

## Results for each task

### DM

In [52]:
## Get the UniEval scores where task is DM
test_df_dm = test_df[test_df['task'] == 'DM']
X_test_dm = test_df_dm[['coherence', 'consistency', 'fluency', 'relevance']].values
y_test_dm = np.array(test_df_dm['p(Hallucination)'].tolist())
y_pred_dm = reg.predict(X_test_dm)
test_labels_dm = [1 if x > 0.5 else 0 for x in y_test_dm]
test_pred_dm = [1 if x > 0.5 else 0 for x in y_pred_dm]
print("DM Task Results: ")
print(classification_report(test_labels_dm, test_pred_dm))

DM Task Results: 
              precision    recall  f1-score   support

           0       0.54      0.87      0.67       275
           1       0.70      0.30      0.42       288

    accuracy                           0.58       563
   macro avg       0.62      0.58      0.54       563
weighted avg       0.62      0.58      0.54       563



### MT

In [53]:
## Get the UniEval scores where task is MT
test_df_mt = test_df[test_df['task'] == 'MT']
X_test_mt = test_df_mt[['coherence', 'consistency', 'fluency', 'relevance']].values
y_test_mt = np.array(test_df_mt['p(Hallucination)'].tolist())
y_pred_mt = reg.predict(X_test_mt)
test_labels_mt = [1 if x > 0.5 else 0 for x in y_test_mt]
test_pred_mt = [1 if x > 0.5 else 0 for x in y_pred_mt]
print("MT Task Results: ")
print(classification_report(test_labels_mt, test_pred_mt))

MT Task Results: 
              precision    recall  f1-score   support

           0       0.69      0.91      0.78       336
           1       0.74      0.38      0.51       226

    accuracy                           0.70       562
   macro avg       0.71      0.65      0.64       562
weighted avg       0.71      0.70      0.67       562



### PG

In [54]:
## Get the UniEval scores where task is PG
test_df_pg = test_df[test_df['task'] == 'PG']
X_test_pg = test_df_pg[['coherence', 'consistency', 'fluency', 'relevance']].values
y_test_pg = np.array(test_df_pg['p(Hallucination)'].tolist())
y_pred_pg = reg.predict(X_test_pg)
test_labels_pg = [1 if x > 0.5 else 0 for x in y_test_pg]
test_pred_pg = [1 if x > 0.5 else 0 for x in y_pred_pg]
print("PG Task Results: ")
print(classification_report(test_labels_pg, test_pred_pg))

PG Task Results: 
              precision    recall  f1-score   support

           0       0.76      0.77      0.77       278
           1       0.33      0.32      0.32        97

    accuracy                           0.65       375
   macro avg       0.55      0.54      0.54       375
weighted avg       0.65      0.65      0.65       375



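As seen in the results, the model achieves its best accuracy on the MT task (0.70), followed by the PG (0.65) and DM (0.58) tasks. On the hallucination class, however, F1 is lowest for PG (0.32, versus 0.51 for MT and 0.42 for DM).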
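
The three per-task cells repeat the same steps, so they could be folded into a single loop. The following is a minimal sketch in plain Python with hypothetical rows; with the actual dataframe, something like `test_df.groupby('task')` would likely play the role of the grouping step.

```python
from collections import defaultdict

# Hypothetical rows: (task, true p(Hallucination), regression prediction).
rows = [
    ("DM", 0.8, 0.62), ("DM", 0.2, 0.55), ("DM", 0.0, 0.10),
    ("MT", 1.0, 0.71), ("MT", 0.2, 0.30),
    ("PG", 0.6, 0.40), ("PG", 0.0, 0.20),
]

# Group binarized (true, predicted) label pairs by task.
by_task = defaultdict(list)
for task, p_true, p_pred in rows:
    by_task[task].append((int(p_true > 0.5), int(p_pred > 0.5)))

# Compute accuracy per task.
accuracy = {}
for task, pairs in by_task.items():
    accuracy[task] = sum(t == p for t, p in pairs) / len(pairs)
    print(f"{task}: accuracy={accuracy[task]:.2f} (n={len(pairs)})")
```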