In [1]:
import os

# Mount google drive (for Colab only)
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive',force_remount=True)
    base_folder = '/content/drive/My Drive/unibo/NLP_project/BarneyBot'
    # Install Huggingface libraries for running the notebook in Colab
    os.system("pip install datasets")
    os.system("pip install transformers")
else:
    base_folder = os.getcwd()

# Import the character dictionaries (which map each character to its data) and a fixed random seed
from Data.data_dicts import character_dict, source_dict, random_state
# Import the BBMetrics library, used to compute the metric scores
from Lib.BBMetrics import BBMetric

# Import Hugging Face transformers and load_dataset, used to run the model and load the datasets
from transformers import TFAutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

In [2]:
# Loads the tokenizer of the pretrained DialoGPT-small model
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small', cache_dir=os.path.join(os.getcwd(), "cache"))
# Token used for padding by the tokenizer
tokenizer.pad_token = '#'

# Human Metrics

In this notebook we ask the user to perform a subjective evaluation according to three criteria:
* _Coherence_: the chatbot does not contradict itself over time
* _Consistency_: the chatbot follows the flow of a conversation naturally
* _Style_: the chatbot has a distinct personality, including related quirks.
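
As a quick reference, the three criteria map onto the metric names and file names used by the evaluation function later in this notebook. The sketch below only collects those strings in one place; the dictionary `HUMAN_METRICS` and the helper `result_path` are hypothetical names introduced for illustration and are not part of `BBMetric`.

```python
import os

# Criterion -> (metric name passed to BBMetric.load_metric,
#               file name passed as `filepath` to metric.train).
# The strings are copied from this notebook; the dictionary and the
# helper below are hypothetical and exist only for illustration.
HUMAN_METRICS = {
    "coherence": ("human - coherence", "humancoherence.csv"),
    "consistency": ("human - consistency", "humanconsistency.csv"),
    "style": ("human - style", "humanstyle.csv"),
}

def result_path(character_folder, criterion):
    """Build the CSV path for one criterion inside a character folder."""
    _, filename = HUMAN_METRICS[criterion]
    return os.path.join(character_folder, filename)

print(result_path(os.path.join("Data", "Characters", "Barney"), "style"))
```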

The following cell loads the common dataset, which is used to evaluate every character bot.

In [3]:
# Loads the common dataset used to evaluate every character bot
df_common = load_dataset('csv',
                         data_files=os.path.join(base_folder, 'Data', 'common_dataset.csv'), 
                         cache_dir=os.path.join(base_folder, "cache"))

Using custom data configuration default-1255b828ba93d8fb
Reusing dataset csv (C:\Users\User\Documents\GitHub\cache\csv\default-1255b828ba93d8fb\0.0.0\433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/1 [00:00<?, ?it/s]

The function below performs three steps to evaluate the character specified by the `character` parameter:
1. **Chat evaluation**, for estimating the chatbot's coherence (i.e. whether the chatbot contradicts itself over time)
 * by giving a score from 0 to 5 (half scores are not allowed)
 
2. **Responses evaluation**, for estimating the chatbot's consistency (i.e. how true the chatbot's answers are with respect to what the user previously said)
 * by giving a score from 0 to 5 (half scores are not allowed)
 
3. **Style evaluation**, for estimating the chatbot's style (i.e. how close the chatbot's answers are to what the user thinks the real character would say in response)
 * by giving a score from 0 to 5 (half scores are not allowed)
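
The scoring rule shared by all three steps (an integer from 0 to 5, no half scores) can be sketched as a small validation helper. This is only an illustration of the rule: `parse_score` is a hypothetical function, and how `BBMetric` actually validates the rating is not shown in this notebook.

```python
def parse_score(raw, low=0, high=5):
    """Convert user input to an integer score in [low, high].

    Raises ValueError for non-integer input such as "2.5" or "good",
    and for integers outside the allowed range.
    """
    # int() rejects strings like "2.5", so half scores fail here
    value = int(raw.strip())
    if not low <= value <= high:
        raise ValueError(f"score must be between {low} and {high}, got {value}")
    return value

print(parse_score("3"))
```

In an interactive cell, such a helper would typically be wrapped in a loop that repeats the `input()` prompt until a valid score is given.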

In [4]:
def eval_character(character='Default'):
    # Takes the source location from the dictionary
    source = character_dict[character]['source']
    
    # Checks if the character was trained 
    character_folder = os.path.join(base_folder, 'Data', 'Characters', character)
    if not os.path.exists(character_folder):
        raise Exception("The character " + character + " doesn't exist")
    
    # Loads the pretrained model from the specified checkpoint folder `checkpoint_folder`
    checkpoint_folder = os.path.join(character_folder, character_dict[character]['checkpoint_folder'])
    model = TFAutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=checkpoint_folder)
    
    # NOTE: the CSV paths below use `character_folder`, which is rooted at `base_folder`,
    # so that they also resolve on Colab (where os.getcwd() is not the project folder)

    ### Compute human - coherence
    print("Step 1) Chat with", character, "\n\tPlease evaluate your chat with this character:", flush=True)
    # Loads the metric
    metric = BBMetric.load_metric("human - coherence")
    # Performs the metric evaluation
    metric.train(model=model, tokenizer=tokenizer,
                 filepath=os.path.join(character_folder, "humancoherence.csv"),
                 length=5) # length is optional, defaults to 5
    
    ### Compute human - consistency
    print("Step 2) Answers from", character, "\n\tPlease evaluate how true these responses are for the character:", flush=True)
    # Loads the metric
    metric = BBMetric.load_metric("human - consistency")
    # Performs the metric evaluation
    metric.train(model=model, tokenizer=tokenizer,
                 filepath=os.path.join(character_folder, "humanconsistency.csv"))
    
    ### Compute human - style
    print("Step 3) Answers from", character, "\n\tPlease evaluate the style of the responses.", flush=True)
    print("\tDo you think they are responses that", character, "would say?", flush=True)
    # Loads the metric
    metric = BBMetric.load_metric("human - style")
    # Performs the metric evaluation, using only the common-dataset contexts from the character's source
    metric.train(model=model, tokenizer=tokenizer,
                 filepath=os.path.join(character_folder, "humanstyle.csv"),
                 questions=df_common['train'].filter(lambda x: x['source'] == source)['context'])
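
The `questions` argument of the style step keeps only the rows of the common dataset whose `source` matches the character's source, and then takes their `context` column. A minimal sketch of that selection in plain Python follows. The rows and the source labels (`"show_a"`, `"show_b"`) are invented for illustration, and `questions_for_source` is a hypothetical helper; only the column names `source` and `context` come from the notebook.

```python
# Invented rows standing in for the common dataset; only the column
# names `source` and `context` are taken from the notebook.
rows = [
    {"source": "show_a", "context": "Where are you going?"},
    {"source": "show_b", "context": "Did you finish the experiment?"},
    {"source": "show_a", "context": "Nice to see you again."},
]

def questions_for_source(rows, source):
    """Return the `context` of every row whose `source` matches."""
    return [row["context"] for row in rows if row["source"] == source]

questions = questions_for_source(rows, "show_a")
print(questions)
```

With the Hugging Face `datasets` library the same selection is expressed as `dataset.filter(...)` followed by column access, as in the function above.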

# Barney
Evaluation of the chatbot for _Barney Stinson_ from _How I Met Your Mother_

In [5]:
eval_character(character='Barney')

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at C:\Users\User\Documents\GitHub\Data\Characters\Barney\barney_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Step 1) Chat with Barney 
	Please evaluate your chat with this character:
>> User:What is that suit?
DialoGPT: Hook, line and sinker.
>> User:Wow1
DialoGPT: Wait. What... what are you talking about?
>> User:I don't know, what are you talking about?
DialoGPT: Look, Lily, I know you're excited about this. But it's not like I was saying hey, a million bucks if you make me wear a suit.
>> User:I will give you a million bucks if you will wear a suit!
DialoGPT: I want my money back!
>> User:Which money?
DialoGPT: What is it that you want?
How do you rate this conversation (0 to 5)? 3
Step 2) Answers from Barney 
	Please evaluate how true these responses are for the character:
Question: Who are you?
DialoGPT: Hey, guys. Tell us about yourselves.
Question: What is your name?
DialoGPT: Hann.
Question: What is your job?
DialoGPT: To tell people how great my job is.
Question: Where do you live?
DialoGPT: I don't remember saying the whole I don't like gambling thing. I just said it's bad for me. B

Loading cached processed dataset at C:\Users\User\Documents\GitHub\cache\csv\default-1255b828ba93d8fb\0.0.0\433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519\cache-dc68f8cad46f3147.arrow


Question: Barney, this is about the building.
DialoGPT: I just have to say...
Question: All right. I'll be right there. Stay where you are.
DialoGPT: Okay, but only because I'm in a hurry, and I gotta take my dog to the vet. I'll be right back, and I gotta take care of the baby. Oh, and if you're still interested in that baby, get in here and get him fixed.
Question: I think there's a pretty girl smiling at me there.
DialoGPT: Look, I know I won't be sleeping through it, but it's still good to have you, either. Good luck.
Question: I love you, man.
DialoGPT: I'm so proud of you, dude.
Question: Not even if she's hot?
DialoGPT: Oh, well, I'd say yes. If she's hot. If she's hot...
How do you rate these answers (0 to 5)? 2


# Sheldon
Evaluation of the chatbot for _Sheldon Cooper_ from _The Big Bang Theory_

In [None]:
eval_character(character='Sheldon')

# Harry
Evaluation of the chatbot for _Harry Potter_ from the _Harry Potter_ saga

In [None]:
eval_character(character='Harry')

# Fry
Evaluation of the chatbot for _Fry_ from _Futurama_

In [None]:
eval_character(character='Fry')

# Bender
Evaluation of the chatbot for _Bender_ from _Futurama_

In [None]:
eval_character(character='Bender')

# Vader
Evaluation of the chatbot for _Darth Vader_ from _Star Wars_

In [None]:
eval_character(character='Vader')

# Joey
Evaluation of the chatbot for _Joey_ from _Friends_

In [None]:
eval_character(character='Joey')

# Phoebe
Evaluation of the chatbot for _Phoebe_ from _Friends_

In [None]:
eval_character(character='Phoebe')