# Imports

In [188]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [189]:
# global imports
import random
import numpy as np
from accelerate import Accelerator
from sentence_transformers import SentenceTransformer
accelerator = Accelerator()
model = accelerator.prepare(SentenceTransformer("all-mpnet-base-v2", device="cuda")) #all-MiniLM-L6-v2
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
# Get the max token size
max_token_size = model.max_seq_length
print(f"Max token size: {max_token_size}")

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


.gitattributes: 100%|██████████| 1.18k/1.18k [00:00<00:00, 2.95MB/s]
1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 420kB/s]
README.md: 100%|██████████| 10.6k/10.6k [00:00<00:00, 17.7MB/s]
config.json: 100%|██████████| 571/571 [00:00<00:00, 1.23MB/s]
config_sentence_transformers.json: 100%|██████████| 116/116 [00:00<00:00, 256kB/s]
data_config.json: 100%|██████████| 39.3k/39.3k [00:00<00:00, 35.4MB/s]
pytorch_model.bin: 100%|██████████| 438M/438M [00:01<00:00, 408MB/s] 
sentence_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 120kB/s]
special_tokens_map.json: 100%|██████████| 239/239 [00:00<00:00, 559kB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 961kB/s]
tokenizer_config.json: 100%|██████████| 363/363 [00:00<00:00, 1.09MB/s]
train_script.py: 100%|██████████| 13.1k/13.1k [00:00<00:00, 24.8MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.29MB/s]
modules.json: 100%|██████████| 349/349 [00:00<00:00, 767kB/s]


Max token size: 384


In [190]:
# local imports 
import loader
import distances
import algos
import data_processing as dp
import classifier

# Database

The miscellaneous database consists of 20 copyright-free book from children's literature obtained from [The Project Gutenberg](https://www.gutenberg.org/ebooks/bookshelf/20).  
The Tom Swift database consists of 27 books of the series Tom Swift by Victor Appleton obtained from [The Project Gutenberg](https://www.gutenberg.org/ebooks/search/?query=victor+appleton&submit_search=Go%21).

In [191]:
#Choosing the database ("Miscellaneous", "Tom Swift") and the number of books to load (1 min)
nb_books = 25
mode = "chunks"
all_sentences = loader.load_books("Tom Swift", mode, max_token_size, tokenizer=model.tokenizer)
sentences = all_sentences[:nb_books]

In [192]:
#choose a random book and print 1 chunk
r = random.randrange(0, len(sentences)) #not included
r2 = random.randrange(0, len(sentences[r]))
for i in range(r2, r2+1):
    print(sentences[r][i])

wouldn't make much of a moving picture. Well, if we go to Peru, we won't be far from the United States, and we can fly back home in the airship. But we've got to take the Flyer apart, and pack up again." "Will you have time?" asked Mr. Nestor. "Maybe the volcano will get into action before you arrive, and the performance will be all over with." "I think not," spoke Tom, as he again read the cablegram. "Mr. Period says he has advices from Peru to the effect that, on other occasions, it took about a month from the time smoke was first seen coming from the crater, before the fireworks started up. I guess we've got time enough, but we won't waste any." "And I guess Montgomery and Kenneth won't be there to make trouble for us," put in Ned. "It will be some time before they get away from that African town, I think." They began work that day on taking the airship apart for transportation to the steamer that was to carry them across the ocean. Tom decided on going to Panama, to get a series of

# Embeddings
Here we use Sentence-BERT to embed the sentences in the database.

In [193]:
#embeddings of the sentences (1 min)
sentence_embedding = [model.encode(sentences[r]) for r in range(len(sentences))] #cannot stack because different number of pages

# Minimizing the distances
Here we want to find an oprtimal order by maximizing the semantic proximity between neighboring sentences. We have $n!$ possible orderings, so we can't use brute force. Our problem is similar to the traveling salesman problem, which is NP-hard, so we can't solve it optimally, therefore we design some algorithms to try to find a local minimum instead.

In [194]:
#run the local minimum algorithms on a subset of the sentences of the book and compare the permutation distances (1 sec)
pairwise_dist = distances.pairwise_dist(sentence_embedding)
distances2 = pairwise_dist[0][:100,:100]

default_order, random_order = list(range(len(distances2))), np.random.permutation(len(distances2))

for algo in [(lambda x, y: y), algos.greedy_add, algos.greedy_sort]:
    for order in [default_order, random_order]:
        for dist, dist_name in [(lambda o : distances.avg_consecutive_dist(o, distances2), "avg_consecutive_dist"), (lambda o : distances.avg_swap_dist(o, default_order), "avg_swap_dist"), (lambda o : distances.avg_R_dist(o, default_order), "avg_R_dist")]:
            ordered = algo(distances2, order)
            d = dist(order)
            print(f"{algo.__name__:15}{dist_name:25}{d}")
        print()

<lambda>       avg_consecutive_dist     0.2835012674331665
<lambda>       avg_swap_dist            0.0
<lambda>       avg_R_dist               0.0

<lambda>       avg_consecutive_dist     0.4096827208995819
<lambda>       avg_swap_dist            0.2625
<lambda>       avg_R_dist               1.0

greedy_add     avg_consecutive_dist     0.2835012674331665
greedy_add     avg_swap_dist            0.0
greedy_add     avg_R_dist               0.0

greedy_add     avg_consecutive_dist     0.4096827208995819
greedy_add     avg_swap_dist            0.2625
greedy_add     avg_R_dist               1.0

greedy_sort    avg_consecutive_dist     0.2835012674331665
greedy_sort    avg_swap_dist            0.0
greedy_sort    avg_R_dist               0.0

greedy_sort    avg_consecutive_dist     0.4096827208995819
greedy_sort    avg_swap_dist            0.2625
greedy_sort    avg_R_dist               1.0



The first algorithm improves significantly the swap distance compared to a random permutation.

# Classifier: 
### Heuristics
The two metrics discussed were the following: distance betweeen pages or probability that a page is before another
For the distance between pages, we don't have the ground truth so to train a model it's easier to start with a classifier that takes in two pages and outputs a probability that the first page is before the second.

![Classifier Architecture](../img/Classifier_architecture.png)

### Training and Testing Datasets

In [195]:
#split the sentences into discard, training, validation and testing sets keeping the order
print(sentence_embedding[0].shape)
sentences_train, sentences_val, sentences_test = dp.split_sentences(sentence_embedding, 1)
print(sentences_train[0].shape, sentences_val[0].shape, sentences_test[0].shape)

(149, 768)
torch.Size([119, 768]) torch.Size([15, 768]) torch.Size([15, 768])


In [196]:
#create the database of the pairs of a subset of sentences (30 secs for 25 books and 100%)
X_train, y_train = dp.create_database(sentences_train)
X_val, y_val = dp.create_database(sentences_val)
X_test, y_test = dp.create_database(sentences_test)
print(X_train.shape, X_val.shape, X_test.shape)     #size n x (n-1)

  sentence_embeddings = torch.tensor(sentence_embeddings).to("cuda")


torch.Size([366740, 1536]) torch.Size([5792, 1536]) torch.Size([5404, 1536])


### PyTorch Classifier

In [197]:
#hyperparameters
input_dim = X_train[0].shape[0]
output_dim = 1
hidden_dim = 128
learning_rate_list = [0.1, 0.01, 0.001, 0.0001]
epochs_list = [10, 100, 1000]
L2_alphas = [0, 0.01, 0.001, 0.0001]

In [198]:
#Create the classifier
network = classifier.Classifier(input_dim, hidden_dim, output_dim, accelerator)

The best values for the hyperparameters were found to be:
- Number of epochs: 10 000
- Learning Rate: 0.001
- L2 Regularization: 0

In [199]:
#set the best hyperparameters
learning_rate = 0.001
epochs = 1000
L2_alpha = 0

In [200]:
#train the network with the best hyperparameters (1 mins)
loss = network.train(X_train, y_train, X_val, y_val, epochs, learning_rate, L2_alpha, True)

Epoch 0: train loss 0.693014
Epoch 100: train loss 0.491927
Epoch 200: train loss 0.454125
Epoch 300: train loss 0.425547
Epoch 400: train loss 0.386358
Epoch 500: train loss 0.337621
Epoch 600: train loss 0.291686
Epoch 700: train loss 0.254032
Epoch 800: train loss 0.223724
Epoch 900: train loss 0.199020
Validation loss 0.435678


In [201]:
#test on the GPU
y_pred = network(X_test.float().cuda())
#BCE loss
loss = network.loss_fn()(y_pred, y_test.float().cuda())
#convert to numpy arrays and round predictions
y_test_array, y_pred_array = y_test.cpu().detach().numpy(), y_pred.cpu().detach().numpy().round()
#accuracy
accuracy = accuracy_score(y_test_array, y_pred_array)
#F1 score
F1 = f1_score(y_test_array, y_pred_array)
#AUC score
AUC = roc_auc_score(y_test_array, y_pred_array)

print("Number of values %d" % y_test_array.shape[0])
print("Test BCE loss %f" % loss.item())
print("Test accuracy %f" % accuracy)
print("Test F1 %f" % F1.item())
print("Test AUC %f" % AUC.item())

Number of values 5404
Test BCE loss 0.451961
Test accuracy 0.815507
Test F1 0.814581
Test AUC 0.815507


### Exploiting the classifier

In [202]:
#embed a new book (2.5 sec)
new_embedding = [model.encode(all_sentences[nb_books])]

In [203]:
#predict the pairwise page order in the new book using the network (2 secs)
reduced_embedding = dp.split_sentences(new_embedding, 1, 1)[0]
X2, y2 = dp.create_database(reduced_embedding)
y_pred2 = network(X2.float().cuda())

  sentence_embeddings = torch.tensor(sentence_embeddings).to("cuda")


In [204]:
#BCE loss
loss = network.loss_fn()(y_pred2, y2.float().cuda())
#convert to numpy arrays and round predictions
y2_array, y_pred2_array = y2.cpu().detach().numpy(), y_pred2.cpu().detach().numpy().round()
#accuracy
accuracy = accuracy_score(y2_array, y_pred2_array)
#F1 score
F1 = f1_score(y2_array, y_pred2_array)
#AUC score
AUC = roc_auc_score(y2_array, y_pred2_array)

print("Number of values %d" % y2_array.shape[0])
print("Test BCE loss %f" % loss.item())
print("Test accuracy %f" % accuracy)
print("Test F1 %f" % F1.item())
print("Test AUC %f" % AUC.item())

Number of values 20880
Test BCE loss 0.885846
Test accuracy 0.651772
Test F1 0.651889
Test AUC 0.651772


In [205]:
#compute the permutation distances to the real order using an TSP - Traveling Salesman Problem - approximation algorithm

#compute the real order of the masked sentences

# Transformer:
### Heuristics 
Full transformer network that takes in pages as tokens and outputs an order, and the loss function would be a distance between two permutations.

In [206]:
#end-to-end transformer model using swap distance or R-distance as a loss function

In [207]:
#ChatGPT approach:
"""
Ordering the pages of a book using a transformer model can be challenging, especially when dealing with limited token size. Here's how you can create a transformer model for this problem and address the token size limitation:

1. Data Preparation:

    Start by collecting a dataset of books with pages not in order. Each page should be a separate input example, and the pages should be represented as text.

2. Text Tokenization:

    Tokenize the text from each page into smaller units, such as words or subwords, using a tokenizer. Popular tokenization libraries like Hugging Face Transformers provide tokenizers that can handle large texts and split them into tokens without worrying about the token size limitation.

3. Sliding Window Approach:

    As you mentioned, most transformer models have a token size limitation. To overcome this limitation, you can use a sliding window approach. Split each page into overlapping segments or windows of tokens. This will allow you to work with manageable token sizes for your model.

4. Model Architecture:

    You can use a standard transformer architecture for this task. However, you might need to modify it slightly to account for the specific requirements of ordering pages. Your model should be capable of learning the relationships between pages and their optimal order.

5. Training:

    Train your transformer model on the dataset of disordered pages. You can use a contrastive loss function to ensure that the model learns to distinguish between correct and incorrect page orderings. This involves providing pairs of pages where one pair is in the correct order, and another pair is not.

6. Inference:

    When you want to order a book with disordered pages, you can feed the pages into your trained model. The model should provide a probability score or ranking for each possible page order. You can then select the order with the highest score as the predicted correct order.

7. Evaluation:

    To evaluate the model's performance, you can use metrics like mean squared error (MSE) or Kendall's Tau rank correlation to compare the predicted order with the ground truth order.

Keep in mind that this is a challenging NLP task, and it may require a significant amount of data and computational resources. Additionally, your sliding window approach should be carefully designed to minimize information loss while breaking down the text into manageable token-sized chunks.
"""

"\nOrdering the pages of a book using a transformer model can be challenging, especially when dealing with limited token size. Here's how you can create a transformer model for this problem and address the token size limitation:\n\n1. Data Preparation:\n\n    Start by collecting a dataset of books with pages not in order. Each page should be a separate input example, and the pages should be represented as text.\n\n2. Text Tokenization:\n\n    Tokenize the text from each page into smaller units, such as words or subwords, using a tokenizer. Popular tokenization libraries like Hugging Face Transformers provide tokenizers that can handle large texts and split them into tokens without worrying about the token size limitation.\n\n3. Sliding Window Approach:\n\n    As you mentioned, most transformer models have a token size limitation. To overcome this limitation, you can use a sliding window approach. Split each page into overlapping segments or windows of tokens. This will allow you to wor