# BLEU scores to compare BERT to DistilBERT


In [71]:
import nltk
import json
import re
import numpy
import pandas as pd

### Adjust the BLEU function to take two inputs at once
NLTK's BLEU function compares one candidate translation against one or more reference translations. Here we want to score two candidates, the BERT output and the DistilBERT output, against the same gold standard: the correct answers.

BLEU takes a tokenized document (a list of words) as input and computes n-gram scores from it. The answers in our dataset are short, and BLEU works best on longer texts, so we first concatenate all answers into one document per model (BERT and DistilBERT) and split that document into words.

Second, we calculate BLEU scores for each question-answer pair in the dataset to see whether the per-pair scores are meaningful. Some answers may be too short to contain any higher-order n-grams, but the per-pair scores could still give us a better idea of what is happening in the dataset.
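The first step, pooling all answers into one tokenized document per model, could be sketched as follows. The helper name `pool_answers` and the sample answers are made up for illustration and are not part of the original notebook. Note that pooling creates n-grams that span the boundary between two answers, which may slightly inflate or deflate the higher-order scores.

```python
def pool_answers(answers):
    """Join a list of answer strings into one token list (one 'document')."""
    tokens = []
    for answer in answers:
        # Whitespace tokenization, the same as calling .split() on each answer
        tokens.extend(answer.split())
    return tokens

# Made-up answers for illustration only
bert_answers = ['hi hello lots more lots more', 'this is another answer']
bert_document = pool_answers(bert_answers)
print(bert_document)
```

The resulting token list can be passed directly to a function that expects split text, such as `bleu_lists`.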

#### BLEU function 1 - takes in full documents per BERT and DistilBERT outputs

In [72]:
# Define a function that calculates BLEU scores for two candidate texts at once
def bleu_lists(golden_standard, text1, text2):
    # n-gram weights list - each tuple gives the cumulative BLEU score up to 1-grams, 2-grams, 3-grams and 4-grams
    weight_list = [(1,0,0,0), (0.5,0.5,0,0), (0.33,0.33,0.33,0), (0.25,0.25,0.25,0.25)]
    # n-gram score lists for text1 (BERT) and text2 (DistilBERT)
    ngram_list = [[],[]]

    # sentence_bleu scores one candidate at a time, so we call it once per text.
    # Pairing each text with its score list by position avoids comparing the texts
    # themselves, which would append every score twice if text1 == text2.
    for scores, txt in zip(ngram_list, [text1, text2]):
        for weights in weight_list:
            # Append the BLEU score for this text and this set of n-gram weights
            scores.append(nltk.translate.bleu_score.sentence_bleu([golden_standard], txt, weights = weights))

    # Return the scores as a dataframe with one row per model
    df = pd.DataFrame(ngram_list, index =['BERT', 'DistilBERT'], columns = ["1-gram", "2-gram", "3-gram", "4-gram"])
    return df

In [73]:
# Trying it out - let's make some fake data
bert = 'hi hello lots more lots more'
distil = 'i dont know but hi lots more lots more'
gold = 'hi hello there lots more lots more'

# The inputs must be lists of words, so split each string before running the function
bleu_lists(gold.split(), bert.split(), distil.split())

| | 1-gram | 2-gram | 3-gram | 4-gram |
|---|---|---|---|---|
| BERT | 0.846482 | 0.757116 | 0.625601 | 0.511508 |
| DistilBERT | 0.555556 | 0.456435 | 0.394138 | 0.315598 |
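To see where the 1-gram values 0.846482 and 0.555556 come from, the score can be reproduced by hand as the clipped unigram precision multiplied by the brevity penalty. The sketch below uses only the standard library; the function name `unigram_bleu` is hypothetical, and it assumes a single reference and a non-empty candidate. It is not NLTK's implementation, only the same arithmetic for this simple case.

```python
import math
from collections import Counter

def unigram_bleu(reference, candidate):
    """Clipped unigram precision times brevity penalty, for one reference."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # A candidate word is counted at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / len(candidate)
    # Candidates shorter than the reference are penalised
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * precision

gold = 'hi hello there lots more lots more'.split()
bert = 'hi hello lots more lots more'.split()
distil = 'i dont know but hi lots more lots more'.split()

print(round(unigram_bleu(gold, bert), 6))    # 0.846482
print(round(unigram_bleu(gold, distil), 6))  # 0.555556
```

The BERT answer matches the reference on all six of its words but is one word shorter, so its score is the brevity penalty alone. The DistilBERT answer is longer than the reference, so it has no penalty, but only five of its nine words match.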


#### BLEU function 2 - calculates a BLEU score per line in the documents per BERT and DistilBERT outputs

In [74]:
# Define a function that calculates BLEU scores for two candidate texts at once
def bleu_lists_no_df(golden_standard, text1, text2):
    # n-gram weights list
    weight_list = [(1,0,0,0), (0.5,0.5,0,0), (0.33,0.33,0.33,0), (0.25,0.25,0.25,0.25)]
    # n-gram score lists for text1 (BERT) and text2 (DistilBERT)
    ngram_list = [[],[]]

    # sentence_bleu scores one candidate at a time, so we call it once per text.
    # Pairing each text with its score list by position avoids comparing the texts
    # themselves, which would append every score twice if text1 == text2.
    for scores, txt in zip(ngram_list, [text1, text2]):
        for weights in weight_list:
            # Append the BLEU score for this text and this set of n-gram weights
            scores.append(nltk.translate.bleu_score.sentence_bleu([golden_standard], txt, weights = weights))

    # This one returns ngram_list directly for the function that makes a row per answer
    return ngram_list

In [75]:
# Trying it out - let's make some fake data that has several rows
bert = ['hi hello lots more lots more', "this is another answer"]
distil = ['i dont know but hi lots more lots more', "this is another another answer"]
gold = ['hi hello there lots more lots more', "this is another one of those answers"]

df = pd.DataFrame(zip(bert, distil, gold), columns=['BERT', 'DistilBERT', 'Answers'])

df

| | BERT | DistilBERT | Answers |
|---|---|---|---|
| 0 | hi hello lots more lots more | i dont know but hi lots more lots more | hi hello there lots more lots more |
| 1 | this is another answer | this is another another answer | this is another one of those answers |


In [76]:
# Define a function that calculates BLEU scores for every row of a dataframe
# The input is a dataframe with the columns BERT, DistilBERT and Answers
def bleu_more(dataset):
    columns = ["1-gram", "2-gram", "3-gram", "4-gram"]
    # Collect one small results dataframe per row
    frames = []

    for i in range(len(dataset)):
        # Define the inputs to the bleu function from the dataset (row i by position)
        gold = dataset.Answers.iloc[i]
        bert = dataset.BERT.iloc[i]
        distil = dataset.DistilBERT.iloc[i]

        # Get the bleu scores per line in dataframe with the predefined function
        ngram_list = bleu_lists_no_df(gold.split(), bert.split(), distil.split())

        # Build a dataframe for this row, index is model name + iteration
        dff = pd.DataFrame(ngram_list, index =['BERT' + str(i), 'DistilBERT' + str(i)], columns = columns)
        frames.append(dff)

    # DataFrame.append was removed in pandas 2.0, so concatenate all rows at once
    if not frames:
        return pd.DataFrame(columns = columns)
    return pd.concat(frames)

In [77]:
# Run the function on the dataframe made previously
bleu_more(df)

The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


| | 1-gram | 2-gram | 3-gram | 4-gram |
|---|---|---|---|---|
| BERT0 | 0.846482 | 0.757116 | 0.625601 | 0.5115078 |
| DistilBERT0 | 0.555556 | 0.456435 | 0.394138 | 0.3155985 |
| BERT1 | 0.354275 | 0.334014 | 0.298951 | 4.079437e-78 |
| DistilBERT1 | 0.402192 | 0.367149 | 0.313532 | 4.60382e-78 |
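The near-zero 4-gram scores for the second row are the case the warning describes: the short answers share no 4-gram with the gold answer. The warning suggests `SmoothingFunction()`, and a minimal sketch of that is shown below. The choice of `method1`, which adds a small epsilon to n-gram counts that would otherwise be zero, is an assumption made for illustration; the notebook itself does not apply any smoothing, and other smoothing methods would give different values.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Second row of the fake data: no 4-gram overlap with the gold answer
gold = 'this is another one of those answers'.split()
bert = 'this is another answer'.split()

# method1 replaces zero n-gram match counts with a small epsilon
smoother = SmoothingFunction().method1
score = sentence_bleu([gold], bert, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smoother)
print(score)
```

Smoothed scores are no longer comparable with the unsmoothed ones in the table, so if smoothing is used it should be applied to both models and to every row.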
