# Introduction to our data loading and preprocessing

The methods described here are later used by the `pre_train_gpt2` script, which takes them from the source code in `/nlp-2021-vda/data/load_dataset.py`. This notebook exists for explanation purposes only.

In [2]:
%cd ..

/Users/viviane/University/NLP-Lab/OUR-CODE/nlp-2021-vda


(Changing the directory was needed so that the imports below work in our file structure.)

In [3]:
import numpy as np
from models import encoder
import random

## Data sample to tensor

This preprocessing step makes a training sample readable to our system by
* encoding it,
* inserting the `<SOC>` token between the context and the response (the last turn of the sample),
* appending the `<EOT>` token to mark that this sample's dialog ends here,
* padding or cutting the token sequence to `maxlen` tokens so that it fits the GPT-2 input size.

In [26]:
def sentence_to_tensor(context, response, encoder, maxlen):
    """
    Parameters
    ----------
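    context:
        context (dialog history) of the sample, as a string
    response:
        response to the context, as a string
    encoder:
        GPT-2 encoder (must provide an `encode` method)
    maxlen: 
        length in tokens that the sample is padded or cut to

    Returns
    -------
        list of `maxlen` token ids, or None if the sample cannot be cut to fit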
    """
    
    # encode history/context
    context_tokens = encoder.encode(context)
    # encode start of conversation marker to show that response comes next
    start_of_conv = encoder.encode(" <SOC> ")
    # encode end of dialog marker
    end_of_text = encoder.encode(" <EOT> ")
    # encode response
    response_tokens = encoder.encode(response)
    # total length of the sample; the length of the two markers is computed
    # from their encodings (the original code hard-coded it as 10)
    marker_len = len(start_of_conv) + len(end_of_text)
    input_tokens_len = len(context_tokens) + len(response_tokens) + marker_len
    padding = []
    # if the sample is too long (or already exactly maxlen tokens):
    if input_tokens_len >= maxlen:
        overflow = input_tokens_len - maxlen
        if len(context_tokens) > overflow:
            # the context is long enough to absorb the overflow:
            # drop its oldest tokens (from the beginning)
            context_tokens = context_tokens[overflow:]
        else:
            # otherwise keep the context and cut the response from the end;
            # give up if not even the context and the markers fit
            response_budget = maxlen - len(context_tokens) - marker_len
            if response_budget < 0:
                return None
            response_tokens = response_tokens[:response_budget]
    # if the sample is too short:
    else:
        # number of missing tokens
        remaining_length = maxlen - input_tokens_len
        # pad with the encoding of "#" (assumed to be a single token)
        padding = encoder.encode("#") * remaining_length
    # concatenate every encoded piece + padding
    tokens = context_tokens + start_of_conv + response_tokens + end_of_text + padding
    
    return tokens
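
To make the three length cases (padding, cutting the context, cutting the response) concrete, the following sketch restates the length handling on plain lists of integers. It is only an illustration: the helper name `fit_to_length` and the toy ids `SOC`, `EOT` and `PAD` are made up, and the marker length is taken from the toy markers instead of from GPT-2 encodings.

```python
def fit_to_length(context, response, soc, eot, pad, maxlen):
    """Toy version of the length handling, on plain lists of token ids."""
    marker_len = len(soc) + len(eot)
    overflow = len(context) + len(response) + marker_len - maxlen
    padding = []
    if overflow >= 0:
        if len(context) > overflow:
            # context can absorb the overflow: drop its oldest tokens
            context = context[overflow:]
        else:
            # otherwise cut the response from the end, if anything fits at all
            budget = maxlen - len(context) - marker_len
            if budget < 0:
                return None
            response = response[:budget]
    else:
        # too short: fill up with the padding id
        padding = [pad] * (-overflow)
    return context + soc + response + eot + padding


# hypothetical single-token markers and padding id
SOC, EOT, PAD = [-1], [-2], 0

padded = fit_to_length([1, 2, 3], [4, 5], SOC, EOT, PAD, maxlen=10)
cut_context = fit_to_length(list(range(1, 9)), [21, 22], SOC, EOT, PAD, maxlen=8)
cut_response = fit_to_length([1], list(range(11, 20)), SOC, EOT, PAD, maxlen=8)

print(padded)        # [1, 2, 3, -1, 4, 5, -2, 0, 0, 0]
print(cut_context)   # [5, 6, 7, 8, -1, 21, 22, -2]
print(cut_response)  # [1, -1, 11, 12, 13, 14, 15, -2]
```

In every case that does not return `None`, the result has exactly `maxlen` entries, which is what allows the samples to be batched later.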
    

## Getting what we need from the EmpatheticDialogues training files:

This method takes the EmpatheticDialogues dataset and applies the following preprocessing steps:

* Use only the dialog text (not the other columns)
* Create 3/4/5 samples per 4/5/6-turn dialog (in the dataset, 4/5/6 turns between two speakers make up a dialog) by using
  * the first turn as context for the second turn
  * the first and second turns as context for the third turn
  * the first, second and third turns as context for the fourth turn
  * [for 5/6-turn dialogs] the first, second, third and fourth turns as context for the fifth turn
  * [for 6-turn dialogs] the first, second, third, fourth and fifth turns as context for the sixth turn
  * (`history_len` configures how many of the preceding turns are kept as context for each sample)
* Replace all `_comma_`s with `,`s
* Add `<SOC>` markers between turns to show where the next turn, from the other dialog partner, begins
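
The sample creation described above can be sketched on a single toy dialog. This is a simplified illustration, not the code from `load_dataset.py`: the helper name `dialog_to_samples` and the example turns are made up, and encoding and padding are left out.

```python
def dialog_to_samples(turns, history_len=1, sep=" <SOC> "):
    """Turn one dialog (list of turn strings) into (context, response) pairs."""
    turns = [t.replace("_comma_", ",").strip() for t in turns]
    samples = []
    for i in range(1, len(turns)):
        # keep at most `history_len` of the turns preceding turn i
        history = turns[:i][-history_len:]
        samples.append((sep.join(history), turns[i]))
    return samples


toy_dialog = ["Hi_comma_ how are you?", "Fine.", "Good.", "Bye."]
samples = dialog_to_samples(toy_dialog, history_len=2)
for context, response in samples:
    print(repr(context), "->", repr(response))
# 'Hi, how are you?' -> 'Fine.'
# 'Hi, how are you? <SOC> Fine.' -> 'Good.'
# 'Fine. <SOC> Good.' -> 'Bye.'
```

A 4-turn dialog yields 3 samples, and with `history_len=2` the context of the last sample no longer contains the first turn.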
    

In [27]:
# function to pre-process training data
def process_training_data(source_path, encoder, maxlen=100, history_len=1, reactonly=False):
    """
    Parameters
    ----------
    source_path:
        path to the csv file
    encoder: 
        GPT-2 encoder
    maxlen: 
        length in tokens of every sample
    history_len: 
        maximum number of previous turns used as context
    reactonly: 
        if True, only even-numbered turns are used as responses; useful for testing
    
    Returns
    -------
        a list of samples, each holding the token ids of one input used to train the model

    """

    # read all lines of the file
    with open(source_path) as f:
        lines = f.readlines()
    max_hist_len = history_len
    data = []
    history = []
    # for every pair of consecutive lines
    for i in range(1, len(lines)):
        # fields of the previous line
        cparts = lines[i - 1].strip().split(",")
        # fields of the current line
        sparts = lines[i].strip().split(",")
        # if previous and current line belong to the same dialog
        # (same value in the first column)
        if cparts[0] == sparts[0]:
            # take the turn (the dialog text) of the previous line
            prevsent = cparts[5].replace("_comma_", ",").strip()
            # append the turn of the previous line to the history list
            history.append(prevsent)
            # turn number of the current line within its dialog
            idx = int(sparts[1])
            # reactonly is for later testing purposes and False by default,
            # so every turn yields a sample; if it is True, only
            # even-numbered turns (2, 4, 6) are used as responses
            if not reactonly or ((idx % 2) == 0):
                # join the last max_hist_len turns of the history with <SOC>
                # to show where the next turn starts
                prev_str = " <SOC> ".join(history[-max_hist_len :])
                # get current dialog line (the response to the history so far)
                sent = sparts[5].replace("_comma_", ",").strip()
                # encode and pad the sample with the method above
                # (use history so far, current turn, encoder and chosen
                # maxlen for padding)
                tokens = sentence_to_tensor(prev_str, sent, encoder, maxlen)
                # if the sample could be fitted to maxlen, append it to the data list
                if tokens:
                    data.append(np.stack(tokens))
        # if previous and current line do not belong to the same dialog
        else:
            # reset the history for the next dialog
            history = []
    
    return data

## Pre-Processing

The methods above are now used to create `data` in the form the GPT-2 model expects as input. All 64615 samples are stored in a list (`data`) and can be fed to the model.

In [28]:
checkpoint_path = 'gpt-2/models/124M'
enc = encoder.get_encoder(checkpoint_path)
data = process_training_data('datasets/empatheticdialogues/train.csv', enc, history_len=4)

## Cutting data samples into batches:
This class takes the data and the batch size (number of samples per batch). Each call to its `sample` method returns the next batch (a collection of `batch_size` data samples) and starts again from the first sample once no full batch is left.

In [29]:
class batch_sampler(object):

    def __init__(self, data, batch_size, shuffle=False):
        self.data = data
        self.total_size = len(data)
        self.batch_size = batch_size
        self.current_index = 0
        if shuffle:
            # note: shuffles the passed list in place
            random.shuffle(data)

    # Get next batch
    def sample(self):
        # if the next batch would exceed the total number of samples
        if self.current_index + self.batch_size > self.total_size:
            # reset the index to 0
            self.current_index = 0
        # start index of the batch is the current index
        prev_index = self.current_index
        # end index of the batch is current index + batch size;
        # this will be the new current index
        self.current_index = self.current_index + self.batch_size
        return self.data[prev_index:self.current_index]
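
Because `sample` resets the index whenever a full batch no longer fits, the samples at the end of the list that do not fill a whole batch are never returned (with 64615 samples and a batch size of 20, that would be the last 15 of the list). If every sample should be visited once per pass, a generator is one possible alternative. This is only a sketch and not part of `load_dataset.py`; the name `iterate_batches` and its `seed` parameter are hypothetical.

```python
import random


def iterate_batches(data, batch_size, shuffle=False, seed=None):
    """Yield successive batches that together cover every sample exactly once."""
    order = list(range(len(data)))
    if shuffle:
        # shuffle the indices only, so the caller's list stays untouched
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        # the last batch may be smaller than batch_size
        yield [data[i] for i in order[start:start + batch_size]]


toy = list(range(7))
batches = list(iterate_batches(toy, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The trade-off is that the last batch can be smaller than `batch_size`, which the training loop would have to tolerate.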
        

Number of all samples in our dataset:

In [32]:
print(len(data))

64615


## How to get one batch:

In [31]:
bs = batch_sampler(data, batch_size=20, shuffle=True)
d = bs.sample()
print(len(d))

20
