In [21]:
import pickle

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

### Sentence boundaries

When dealing with language, it helps to know where a sentence starts and where it ends. This gives the model some context at the beginning of the prediction, when there are no previous words to condition on. For that purpose, we are going to pad each sentence with a start-of-sentence symbol _"&lt;s>"_ and an end-of-sentence symbol _"&lt;/s>"_.

Since you already did a similar thing in the n-grams exercise, this function is already implemented for you.

In [4]:
def add_sentence_boundaries(data):
    """
    Takes the data, where each line is a sentence, and adds a <s> token at the beginning and a </s> token at the end of each sentence
    Example input: I live in Helsinki
    Example output: <s> I live in Helsinki </s>
    
    Arguments
    ---------
    data : list
            a list of sentences
    
    Returns
    -------
    res : list
            a list of sentences, where each sentence has <s> at the beginning and </s> at the end
    """
    res = []
    for sent in data:
        sent = '<s> ' + sent.rstrip() + ' </s>'
        res.append(sent)
    
    return res
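
A minimal usage sketch of the function above on two made-up sentences. The function is restated in compact form only so that the snippet runs on its own. Note that `rstrip()` also removes a trailing newline before the end token is added.

```python
def add_sentence_boundaries(data):
    # Compact restatement of the function above, so this snippet is self-contained
    return ['<s> ' + sent.rstrip() + ' </s>' for sent in data]

# Toy sentences, made up for illustration; the first one ends with a newline
toy_sentences = ["I live in Helsinki\n", "You look forward"]
padded = add_sentence_boundaries(toy_sentences)

for sent in padded:
    print(sent)
```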

### Index dictionaries <a class="anchor" id="task_1_1"></a> 
Neural networks can't process words as raw strings, so we need to represent the words with numbers. The first step in doing that is creating two dictionaries: word2idx and idx2word.

The word2idx dictionary contains unique words as keys and unique indices for each of the words as values. <br>
The idx2word dictionary contains unique indices as keys and unique words for each of those indices as values. It is essentially a reversed word2idx, where the keys are the values and the values are the keys.

Example sentences: ["I look forward", "You look forward"] <br>
word2idx = {"I": 1, "look": 2, "forward": 3, "You": 4} <br>
idx2word = {1: "I", 2: "look", 3: "forward", 4: "You"} <br>

Write a function that creates two dictionaries: word2idx and idx2word. The dictionaries should contain all the unique words in the data. <b>The indices should start from 1 and not from 0.</b>

In [6]:
def create_indices(data):
    """
    This function creates two dictionaries: word2idx and idx2word, containing each unique word in the dataset
    and its corresponding index.
    Remember that the starting index should be 1 and not 0
    
    Arguments
    ---------
    data - list
            a list of sentences, where each sentence starts with <s>
            and ends with </s> token
    
    Returns
    -------
    word2idx - dictionary
                a dictionary, where the keys are the words and the values are the indices
                
    idx2word - dictionary
                a dictionary, where the keys are the indices and the values are the words
    """
    
    # YOUR CODE HERE
    #raise NotImplementedError()
    word2idx = dict()
    idx2word = dict()

    # Assign indices in order of first appearance, starting from 1.
    # A dictionary lookup keeps this linear in the number of words.
    for sentence in data:
        for word in sentence.split(' '):
            if word not in word2idx:
                idx = len(word2idx) + 1
                word2idx[word] = idx
                idx2word[idx] = word
    
    return word2idx, idx2word
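
As a sketch of the expected result, here is what the two dictionaries look like for the example sentences once the boundary tokens are added. `build_vocab` is a hypothetical helper name used only for this illustration. Starting the indices at 1 leaves index 0 unused, which matters later because 0 is the value used for zero-padding.

```python
def build_vocab(sentences):
    # Hypothetical helper: assign 1-based indices in order of first appearance
    word2idx = {}
    for sentence in sentences:
        for word in sentence.split(' '):
            if word not in word2idx:
                word2idx[word] = len(word2idx) + 1
    # idx2word is simply word2idx with keys and values swapped
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word

toy = ["<s> I look forward </s>", "<s> You look forward </s>"]
word2idx, idx2word = build_vocab(toy)

print(word2idx)
print(idx2word)
```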

### Index data <a class="anchor" id="task_1_2"></a>
After we have created the word2idx and idx2word dictionaries, it is time to index the data. In other words, we need to replace each word in the data with its corresponding index.

Write a function that reads each sentence from the data and replaces each word in the sentence with its index from the word2idx dictionary.

In [9]:
def index_data(data, word2idx):
    """
    This function replaces each word in the data with its corresponding index
    
    Arguments
    ---------
    data - list
            a list of sentences, where each sentence starts with <s>
            and ends with </s> token
    
    word2idx - dict
            a dictionary where the keys are the unique words in the data
            and the values are the unique indices corresponding to the words
    
    Returns
    -------
    data_indexed - list
                a list of sentences, where each word in the sentence is replaced with its index
    """
    
    data_indexed = []
    # YOUR CODE HERE
    #raise NotImplementedError()
    
    for sentence in data:
        sentence_index = []
        for word in sentence.split(' '):
            sentence_index.append(word2idx[word])
        data_indexed.append(sentence_index)
    

    return data_indexed
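
A short sketch of what indexing does, using a hand-written `word2idx` for two toy sentences (the dictionary values here are assumed for illustration). Decoding with `idx2word` recovers the original text, which is a handy sanity check. A word that is missing from `word2idx` would raise a `KeyError`.

```python
# Hand-written dictionaries for illustration only
word2idx = {'<s>': 1, 'I': 2, 'look': 3, 'forward': 4, '</s>': 5, 'You': 6}
idx2word = {idx: word for word, idx in word2idx.items()}

toy = ["<s> I look forward </s>", "<s> You look forward </s>"]

# Replace every word with its index
indexed = [[word2idx[word] for word in sentence.split(' ')] for sentence in toy]

# Map the indices back to words to check the round trip
decoded = [' '.join(idx2word[idx] for idx in sentence) for sentence in indexed]

print(indexed)
print(decoded)
```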

### Convert sentences to tensors

This function converts each indexed sentence to a LongTensor data type. This is required in order to process it later using PyTorch.

You don't have to modify this function. It is already implemented for you.

In [11]:
def convert_to_tensor(data_indexed):
    """
    This function converts the indexed sentences to LongTensors
    
    Arguments
    ---------
    data_indexed - list
            a list of sentences, where each word in the sentence
            is replaced by its index
    
    Returns
    -------
    tensor_array - list
                a list of sentences, where each sentence
                is a LongTensor
    """
    
    tensor_array = []
    for sent in data_indexed:
        tensor_array.append(torch.LongTensor(sent))    
        
    return tensor_array

### Combine features and labels in a tuple

This function combines each indexed sentence and its corresponding labels into a tuple. This will be useful when we zero-pad the data later, so that the samples in each batch have equal length.

You don't have to modify this function. It is already implemented for you.

In [13]:
def combine_data(input_data, labels_data):
    """
    This function converts the input features and the labels into tuples
    where each tuple corresponds to one sentence in the format (features, labels)
    
    Arguments
    ---------
    input_data - list
            a list of tensors containing the training features
    
    labels_data - list
            a list of tensors containing the training labels
    
    Returns
    -------
    res - list
            a list of tuples, where each tuple corresponds to one sentence pair
            in the format (features, labels)
    """
    
    res = []
    
    for i in range(len(input_data)):
        res.append((input_data[i], labels_data[i]))

    return res
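
For intuition, the pairing can be sketched with the built-in `zip`. Plain lists stand in for the tensors here, and the values are made up. One difference worth noting: `zip` silently stops at the shorter list, while the index-based loop above raises an `IndexError` if there are fewer labels than inputs.

```python
# Plain lists stand in for the tensors; values are made up for illustration
inputs = [[1, 2, 3], [1, 6]]
labels = [[2, 3, 5], [6, 5]]

# One (features, labels) tuple per sentence
pairs = list(zip(inputs, labels))

for features, target in pairs:
    print(features, '->', target)
```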

### Remove extra data

Since we will be processing the data in equal-sized batches during training, we need to make sure that each batch has the same number of sentences. If the last batch contains fewer sentences than the batch size, that batch is discarded.

This function discards the extra data that doesn't fit in a batch.

You don't have to modify this function. It is already implemented for you.

In [14]:
def remove_extra(data, batch_size):
    """
    This function removes the extra data that does not fit in a batch   
    
    Arguments
    ---------
    data - list
            a list of tuples, where each tuple corresponds to a
            sentence in a format (features, labels)
            
    batch_size - integer
                    the size of the batch
    
    
    Returns
    -------
    data - list
            a list of tuples, where each tuple corresponds to a
            sentence in a format (features, labels)
    """
    
    extra = len(data) % batch_size
    if extra != 0:
        data = data[:-extra]

    return data
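
A small worked example of the arithmetic, with integers standing in for the (features, labels) tuples. With 10 samples and a batch size of 4, the remainder is 2, so the last two samples are dropped. The `if` check is needed because `data[:-0]` would return an empty list. The function is restated compactly so the snippet runs on its own.

```python
def remove_extra(data, batch_size):
    # Compact restatement of the function above
    extra = len(data) % batch_size
    return data[:-extra] if extra != 0 else data

samples = list(range(10))          # integers stand in for the tuples
trimmed = remove_extra(samples, batch_size=4)
unchanged = remove_extra(samples, batch_size=5)

print(len(trimmed))    # 10 % 4 == 2, so two samples are dropped
print(len(unchanged))  # 10 % 5 == 0, so nothing is dropped
```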

### Zero-pad the data

In order to process the data in batches, we need to make sure that the sentences in each batch have equal lengths. Since sentences differ in their number of words, we make each sentence as long as the longest sentence in its batch. We do that by appending zeros to each shorter sentence until it matches the length of the longest one in the batch.

This function implements the zero-padding.

You don't have to modify this function. It is already implemented for you.

In [16]:
def collate(list_of_samples):
    """
    This function zero-pads the training data in order to process the sentences
    in a batch during training
    
    Arguments
    ---------
    list_of_samples - list
                        a list of tuples, where each tuple corresponds to a
                        sentence in a format (features, labels)
    
    
    Returns
    -------
    pad_input_data - tensor
                        a tensor of input features of shape (max_length, batch_size),
                        where shorter sentences are zero-padded to the longest one
                        
    input_data_lengths - list
                        the unpadded length of each input sentence, in the same
                        (decreasing-length) order as the columns of pad_input_data
    
    pad_labels_data - tensor
                        a tensor of labels of shape (max_length, batch_size),
                        zero-padded in the same way as the input features
            
    """
    
    
    list_of_samples.sort(key=lambda x: len(x[0]), reverse=True)
    input_data, labels_data = zip(*list_of_samples)

    input_data_lengths = [len(seq) for seq in input_data]
    
    padding_value = 0

    # pad input
    pad_input_data = pad_sequence(input_data, padding_value=padding_value)
    
    # pad labels
    pad_labels_data = pad_sequence(labels_data, padding_value=padding_value)

    return pad_input_data, input_data_lengths, pad_labels_data
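
The padding step can be sketched on three toy sentences of different lengths (the index values are made up). The sort by decreasing length mirrors what `collate` does. Because `pad_sequence` is called with its default `batch_first=False`, the result is laid out as (max_length, batch_size), so each column is one sentence.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy indexed sentences; index 0 never appears because it is reserved for padding
samples = [torch.tensor([1, 5]), torch.tensor([1, 2, 3, 4]), torch.tensor([1, 6, 7])]

# Longest sentence first, as in collate
samples.sort(key=len, reverse=True)
lengths = [len(seq) for seq in samples]

# Shorter sentences are filled with zeros at the end
padded = pad_sequence(samples, padding_value=0)

print(lengths)
print(padded.shape)
print(padded)
```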

### Prepare features and labels <a class="anchor" id="task_1_3"></a> 
During training, the model takes an input word and outputs a prediction. We need to compare this prediction to the 'true label', which is simply the next word in the text. We therefore need to organize the data so that each word serves as the true label for the word that precedes it.

In the label sentence, every word is shifted one step forward in time, and in the input sentence the last word is dropped.

Example sentence: oops i did it again <br>
INPUT: oops i did it <br>
LABEL: i did it again

Note: the first word in the sentence is the start-of-sentence symbol and the last one is the end-of-sentence symbol.

Write a function that takes as input the indexed data and returns two arrays: the input array where the last word from each sentence is missing, and the label array, where every word is moved a step in time.

In [18]:
def prepare_for_training(data_indexed):
    """
    This function creates the input features and their corresponding labels
    
    Arguments
    ---------
    data_indexed - list
            a list of sentences, where each word in the sentence
            is replaced by its index
    
    
    Returns
    -------
    input_data - list
            a list of indexed sentences, where the last element of each sentence is removed
            
    labels_data - list
            a list of indexed sentences, where the first element of each sentence is removed
    """
    
    input_data = []
    labels_data = []

    # YOUR CODE HERE
    #raise NotImplementedError()
    for data in data_indexed:    
        input_data.append(data[:-1])
        labels_data.append(data[1:])
    
    return input_data, labels_data
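
To make the shift concrete, here is the example sentence with its boundary symbols, shown as words rather than indices. At every position, the label is the word that follows the input word.

```python
tokens = "<s> oops i did it again </s>".split(' ')

input_tokens = tokens[:-1]   # drop the last token
label_tokens = tokens[1:]    # drop the first token

# Each input word is paired with the word the model should predict next
for inp, lab in zip(input_tokens, label_tokens):
    print(f"{inp:>6} -> {lab}")
```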

### Preprocess data <a class="anchor" id="task_1_4"></a>
At this point, we have all the necessary functions to prepare the data for training. What is left to do is to run them one by one and get the data in the desired format.

Write a function that takes the data and prepares it for training. You need to do the following steps:

    1. Add sentence boundaries
    2. Create index dictionaries (word2idx and idx2word)
    3. Index the data in a way that each word is replaced by its index
    4. Convert the indexed data to a list of tensors, where each tensor is a sentence
    5. Split each sentence to input and labels

In [20]:
def preprocess_data(data):
    """
    This function runs the whole preprocessing pipeline and returns the prepared
    input features and labels, along with the word2idx and idx2word dictionaries
    
    Arguments
    ---------
    data - list
            a list of sentences that need to be prepared for training
    
    
    Returns
    -------
    input_data - list
            a list of tensors, where each tensor is an indexed sentence used as input feature
            
    labels_data - list
            a list of tensors, where each tensor is an indexed sentence used as a true label
    
    word2idx - dictionary
                a dictionary, where the keys are the words and the values are the indices
                
    idx2word - dictionary
                a dictionary, where the keys are the indices and the values are the words
    """
    
    # YOUR CODE HERE
    #raise NotImplementedError()
    #1. Add sentence boundaries    
    res = add_sentence_boundaries(data)
    
    #2. Create index dictionaries (word2idx and idx2word)
    word2idx, idx2word = create_indices(res)    
    
    #3. Index the data in a way that each word is replaced by its index
    indexed_data = index_data(res, word2idx)
    
    #4. Convert the indexed data to a list of tensors, where each tensor is a sentence
    tensor_array = convert_to_tensor(indexed_data)    
    
    #5. Split each sentence to input and labels
    input_data, labels_data = prepare_for_training(tensor_array)
    
    return input_data, labels_data, word2idx, idx2word

In [22]:
# Load the pickled list of sentences
with open("data.txt", "rb") as fp:
    sentences = pickle.load(fp)

print(sentences[22:35])

['We know you love Chewy.', "We know you're here.", "We know you know the Chewy-RyanCohen-GameStop connection, but it wasn't real enough for you yet.", "Well, I don't have to tell you, because you're not stupid, but I will anyway:  it's gotten really real enough for you now.", 'The Chewy executive triumvirate joining the GameStop board of directors is your signal, friend.', 'You may start pumping GME to your boomer audience.', 'Now.', "I don't believe reddit has been too kind to you in the past, but worry not, follow through with this and you'll have lots of friends here and we'll have your back forever.", 'Well, definitely not forever, but at least for a while.', 'What better time to start than today?', 'With love, brother.', 'P.S.', "- don't be afraid to use the rocket 🚀, it feels good."]


In [None]:
# batch_size must be defined before this cell runs; it is not set anywhere in this section
train_input, train_labels, word2idx, idx2word = preprocess_data(sentences) # run the preprocessing pipeline
train_data = combine_data(train_input, train_labels)
train_data = remove_extra(train_data, batch_size)

In [None]:


pairs_batch_train = DataLoader(dataset=train_data,
                               batch_size=batch_size,
                               shuffle=True,
                               collate_fn=collate,
                               pin_memory=True)
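
A rough sketch of what iterating over such a loader yields, using four toy (features, labels) pairs and a batch size of 2. `toy_collate` is a hypothetical stand-in for `collate`, and all values are made up. Shuffling and `pin_memory` are left out here so the output is deterministic and the snippet runs on a CPU-only machine.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def toy_collate(samples):
    # Hypothetical stand-in for collate: sort by input length, then zero-pad
    samples = sorted(samples, key=lambda pair: len(pair[0]), reverse=True)
    inputs, labels = zip(*samples)
    lengths = [len(seq) for seq in inputs]
    return (pad_sequence(inputs, padding_value=0),
            lengths,
            pad_sequence(labels, padding_value=0))

# Four toy (features, labels) pairs; labels are the inputs shifted by one step
toy_data = [
    (torch.tensor([1, 2, 3]), torch.tensor([2, 3, 4])),
    (torch.tensor([1, 5]), torch.tensor([5, 4])),
    (torch.tensor([1, 6, 7, 8]), torch.tensor([6, 7, 8, 4])),
    (torch.tensor([1, 9]), torch.tensor([9, 4])),
]

loader = DataLoader(dataset=toy_data, batch_size=2, shuffle=False, collate_fn=toy_collate)
batches = list(loader)

# Each batch is padded only up to the longest sentence it contains
for pad_inputs, lengths, pad_labels in batches:
    print(tuple(pad_inputs.shape), lengths)
```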