### Modelling Durrant et al. (2011)

To explore the effect of sleep on statistical learning, Durrant et al. (2011) constructed two types of sequence, both made up of regular tones at differing frequencies. In the structured type, the preceding two tones determined the next, except for a few transitions which were random to avoid repetition. The unstructured type was the reverse: most transitions were random. After listening to a structured sequence, participants were tested on their ability to distinguish short structured and unstructured sequences. Delayed recall was then tested, after a night’s sleep for one group and after a waking rest for the other. Durrant et al. (2011) found that sleep improved performance more than waking rest, suggesting that systems consolidation promotes statistical learning.

Here, we generate a set of sequences based on the transition structure in Durrant et al. (2011). A model with the GPT-2 architecture is trained from scratch on the structured sequences only. After one, two, and three epochs of training (each run starting from scratch), the perplexity is calculated for two test sets, one of structured and one of unstructured sequences. We find that the difference in perplexity between these two sets increases with training, corresponding to an improved ability to distinguish them. In addition, outputs from the trained model are structured in the same way as the training data.
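
Perplexity here is the standard language-model quantity: the exponential of the mean negative log-likelihood per token, so lower values mean a sequence is less surprising to the model. As a minimal sketch (the function name `perplexity` and the example probabilities are illustrative and not part of this notebook's pipeline):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that guesses uniformly among five options has perplexity of about 5.
uniform = perplexity([0.2] * 10)
# A model that is confident and correct has perplexity close to 1.
confident = perplexity([0.9] * 10)
print(uniform, confident)
```

A widening gap between the two test sets therefore means the model finds structured sequences increasingly predictable relative to unstructured ones.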

Tested with the conda_pytorch_latest_p36 kernel in AWS SageMaker.

#### Installation:

In [None]:
!pip install simpletransformers

#### Imports:

In [None]:
import pandas as pd
import random
import logging
from random import shuffle
from matplotlib import pyplot as plt
from simpletransformers.language_modeling import (
    LanguageModelingModel,
    LanguageModelingArgs,
)
from simpletransformers.language_generation import (
    LanguageGenerationModel,
)

#### Construct training data

In [None]:
# Transition structure copied from Durrant et al. (2011):
transition_structure = {
    (1,1): 4,
    (1,2): 3,
    (1,3): 2,
    (1,4): 1,
    (1,5): 5,
    (2,1): 5,
    (2,2): 4,
    (2,3): 3,
    (2,4): 2,
    (2,5): 1,
    (3,1): 3,
    (3,2): 2,
    (3,3): 1,
    (3,4): 5,
    (3,5): 4,
    (4,1): 1,
    (4,2): 5,
    (4,3): 4,
    (4,4): 3,
    (4,5): 2, 
    (5,1): 2,
    (5,2): 1,
    (5,3): 5,
    (5,4): 4,
    (5,5): 3
}

def get_sequence():
    # Start from two random tones, then append 50 more.
    start = [random.randint(1,5),random.randint(1,5)]
    for i in range(50):
        if random.uniform(0, 1) > 0.1:
            # 90% of the time, the previous two tones determine the next.
            next_val = transition_structure[tuple(start[-2:])]
        else:
            # Otherwise pick any of the five tones with equal probability.
            next_val = random.randint(1,5)
        start.append(next_val)
    return ','.join([str(i) for i in start])

seqs = [get_sequence() for i in range(5000)]
seqs[0:3]
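
A quick sanity check on the generated data is to count how often a sequence obeys the transition rule. The sketch below uses a hypothetical helper, `rule_fraction`, which is not part of the original notebook; it rebuilds the same table in compact form so that it runs on its own. With 10% of transitions drawn uniformly from five tones, roughly 92% of transitions in a structured sequence should match the rule, against roughly 20% in a fully random one.

```python
# Same table as transition_structure: rows[a][b - 1] is the tone that follows (a, b).
rows = {
    1: [4, 3, 2, 1, 5],
    2: [5, 4, 3, 2, 1],
    3: [3, 2, 1, 5, 4],
    4: [1, 5, 4, 3, 2],
    5: [2, 1, 5, 4, 3],
}
table = {(a, b): nxt for a, row in rows.items() for b, nxt in enumerate(row, start=1)}

def rule_fraction(seq, transitions=table):
    """Fraction of transitions in a comma-separated sequence that follow the rule."""
    tones = [int(t) for t in seq.split(',')]
    hits = sum(
        transitions[(a, b)] == c
        for a, b, c in zip(tones, tones[1:], tones[2:])
    )
    return hits / (len(tones) - 2)

# A hand-built sequence that follows the rule at every step.
print(rule_fraction('1,1,4,1,1,4'))
```

In the notebook, `sum(rule_fraction(s) for s in seqs) / len(seqs)` would give the average over the generated set.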

In [None]:
def get_random_sequence():
    start = [random.randint(1,5),random.randint(1,5)]
    for i in range(50):
        next_val = random.randint(1,5)
        start.append(next_val)
    return ','.join([str(i) for i in start])

# Write the training set and the two test sets, one sequence per line.
with open("train.txt", "w") as text_file:
    walks = [get_sequence() for i in range(5000)]
    shuffle(walks)
    text_file.write('\n'.join(walks))

with open("structured_test.txt", "w") as text_file:
    walks = [get_sequence() for i in range(1000)]
    shuffle(walks)
    text_file.write('\n'.join(walks))

with open("unstructured_test.txt", "w") as text_file:
    walks = [get_random_sequence() for i in range(1000)]
    shuffle(walks)
    text_file.write('\n'.join(walks))
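
For reference, the generation procedure fixes what an ideal predictor could achieve on the tones themselves. In a structured sequence the rule-consistent tone has probability 0.9 + 0.1/5 = 0.92 and each other tone 0.02; in an unstructured sequence every tone has probability 0.2. The sketch below turns these into per-tone perplexities. It is an idealised calculation under those assumptions, not a prediction of the values the trained model reports, since the tokenizer operates on the comma-separated text and its tokens need not correspond one-to-one to tones. The helper name `ideal_perplexity` is hypothetical.

```python
import math

def ideal_perplexity(probs):
    """exp(entropy) of a next-tone distribution: the best achievable perplexity."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)

noise = 0.1   # probability of a random transition in get_sequence
n_tones = 5

p_rule = (1 - noise) + noise / n_tones   # the rule tone can also be drawn by chance
p_other = noise / n_tones
structured = ideal_perplexity([p_rule] + [p_other] * (n_tones - 1))
unstructured = ideal_perplexity([1 / n_tones] * n_tones)

print(round(structured, 2), round(unstructured, 2))
```

Under these assumptions the floor is roughly 1.5 per tone for structured sequences and 5 for unstructured ones, which gives a sense of how large the gap between the two test sets could become.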

#### Train generative model

Train GPT-2 from scratch on dataset created above.

In [None]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

model_args = LanguageModelingArgs()
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.num_train_epochs = 3
model_args.dataset_type = "simple"
model_args.save_model_every_epoch = False
model_args.evaluate_during_training = False
model_args.mlm = False
model_args.learning_rate = 1e-5
model_args.vocab_size = 10
model_args.use_early_stopping = True
model_args.manual_seed = 123

train_file = "train.txt"
structured_test_file = "structured_test.txt"
unstructured_test_file = "unstructured_test.txt"

perplexity_structured = []
perplexity_unstructured = []

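# Train a fresh model for 1, 2 and 3 epochs, evaluating each on both test sets.
for num in [1, 2, 3]:
    model_args.num_train_epochs = num
    # Passing None as the model name trains from scratch rather than loading weights.
    model = LanguageModelingModel(
        "gpt2", None, train_files=train_file, args=model_args
    )

    # Train the model
    model.train_model(train_file)

    # eval_model returns a dict of results that includes the perplexity.
    p = model.eval_model(structured_test_file)
    perplexity_structured.append(p)
    p = model.eval_model(unstructured_test_file)
    perplexity_unstructured.append(p)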

Let's sample some outputs from the trained model:

In [None]:
model = LanguageGenerationModel(
    "gpt2", "outputs"
)

model.generate('1,')

#### Plot perplexity against time

In [None]:
# Extract the perplexity from each evaluation result and convert it to a plain float.
perplexity_structured = [float(p['perplexity']) for p in perplexity_structured]
perplexity_unstructured = [float(p['perplexity']) for p in perplexity_unstructured]
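
The quantity of interest is the gap between the two perplexities at each epoch. A minimal sketch of that comparison follows; `perplexity_gap` and `is_widening` are hypothetical helpers, and the numbers in the usage line are made-up placeholders, not results from this notebook.

```python
def perplexity_gap(structured, unstructured):
    """Per-epoch difference: unstructured minus structured perplexity."""
    return [u - s for s, u in zip(structured, unstructured)]

def is_widening(gaps):
    """True if each gap is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(gaps, gaps[1:]))

# Placeholder values for illustration only.
example_gaps = perplexity_gap([3.0, 2.5, 2.0], [4.0, 4.5, 5.0])
print(example_gaps, is_widening(example_gaps))
```

In the notebook this would be called as `perplexity_gap(perplexity_structured, perplexity_unstructured)`.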

In [None]:
plt.rcParams.update({'font.size' : 15})

epochs = [1,2,3]

# One row per epoch, one bar per test set; df.plot.bar creates its own figure.
df = pd.DataFrame({'Structured': perplexity_structured, 'Unstructured': perplexity_unstructured}, index=epochs)
plot = df.plot.bar(title='Perplexity of unstructured vs. structured sequences', 
            color={'Structured': 'lightcoral', 'Unstructured': 'skyblue'},
           xlabel='Epoch', figsize=(7,5), ylabel='Perplexity')

plt.savefig('durrant_data.png')