## Set up environment

As usual, we first install Hugging Face Transformers and Datasets.

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Nov 17 00:59:10 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   24C    P0    46W / 400W |   1218MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [6]:
!pip install -q git+https://github.com/huggingface/transformers.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


In [7]:
!pip install -q datasets

## Prepare data

Here we load the IMDB dataset, a binary text classification dataset ("is a movie review positive or negative?"). The commented-out line loads only a small portion of it, which is handy for quick experiments.

In [8]:
from datasets import load_dataset

train_ds, test_ds = load_dataset("imdb", split=['train', 'test'])
# train_ds, test_ds = load_dataset("imdb", split=['train[:10]+train[-10:]', 'test[:5]+test[-5:]'])



  0%|          | 0/2 [00:00<?, ?it/s]

We create id2label and label2id mappings, which are handy at inference time.

In [9]:
labels = train_ds.features['label'].names
print(labels)

['neg', 'pos']


In [10]:
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
print(id2label)

{0: 'neg', 1: 'pos'}


Next, we prepare the data for the model using the tokenizer. 

In [11]:
from transformers import PerceiverTokenizer

tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")

train_ds = train_ds.map(lambda examples: tokenizer(examples['text'], padding="max_length", truncation=True),
                        batched=True)
test_ds = test_ds.map(lambda examples: tokenizer(examples['text'], padding="max_length", truncation=True),
                      batched=True)

Using unk_token, but it is not set yet.


We set the format to PyTorch tensors, and create familiar PyTorch dataloaders.

In [12]:
train_ds.set_format(type="torch", columns=['input_ids', 'label'])
test_ds.set_format(type="torch", columns=['input_ids', 'label'])

In [13]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_ds, batch_size=10, shuffle=True)
test_dataloader = DataLoader(test_ds, batch_size=50)

Here we sanity-check a batch (it is always important to inspect your data!).

In [14]:
batch = next(iter(train_dataloader))
for k,v in batch.items():
  print(k,v.shape)

label torch.Size([10])
input_ids torch.Size([10, 2048])
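The `[10, 2048]` shape follows from `padding="max_length"` and `truncation=True`: every review is cut or padded to the same length. Below is a minimal pure-Python sketch of that behavior. The helper name `pad_or_truncate` is hypothetical, the pad id of 0 is an assumption, and the `[CLS]`=4 and `[SEP]`=5 ids are taken from the "hello world" ids printed later in this notebook. It illustrates the idea and is not the library's implementation.

```python
def pad_or_truncate(body_ids, max_length=2048, pad_id=0, cls_id=4, sep_id=5):
    """Wrap ids in [CLS] ... [SEP], then truncate or right-pad to max_length."""
    # Leave room for the two special tokens before truncating the body.
    body = list(body_ids)[: max_length - 2]
    ids = [cls_id] + body + [sep_id]
    # Right-pad with the (assumed) pad id until the target length is reached.
    return ids + [pad_id] * (max_length - len(ids))


short = pad_or_truncate([110, 111], max_length=8)
long = pad_or_truncate(list(range(6, 30)), max_length=8)
print(short)
print(long)
```

Padding everything to 2048 keeps batches rectangular at the cost of spending compute on `[PAD]` positions, which is visible in the decoded example in the next cell.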


In [15]:
tokenizer.decode(batch['input_ids'][3])

"[CLS]Man, this movie sucked big time! I didn't even manage to see the hole thing (my girlfriend did though). Really bad acting, computer animations so bad you just laugh (woman to werewolf), strange clips, the list goes on and on. Don't know if its just me or does this movie remind you of a porn movie? And I don't mean all the naked ladys... It's something about the light or something... This could maybee become a classic just because of the bad acting and all the naked women, but not because it's an original movie white a nice plot twist. My final words are: Don't see it! It's not worth the time. If you wanna see it because the nakedness there's lots of better ones to see![SEP][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][P

In [82]:
batch['label']

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0])

In [96]:
import numpy as np
train_ds['label'].double().mean()

tensor(0.5000, dtype=torch.float64)

## Define model

Next, we define our model, and put it on the GPU.

In [18]:
# preprocessor we customized to use the tagkop encoder
from tagkop_encoding_functions import (
    PerceiverImagePreprocessor,
    TagkopPerceiverTextPreprocessor,
)
from transformers import PerceiverForSequenceClassification

import torch

from transformers.models.perceiver.modeling_perceiver import (
    PerceiverConfig,
    PerceiverModel,
    PerceiverClassificationDecoder,
    PerceiverTextPreprocessor,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


config = PerceiverConfig(
    num_self_attends_per_block = 4,
    d_model = 64
)

print('config', config)

# Vanilla Perceiver Encodings


preprocessor = PerceiverTextPreprocessor(config)

# Our new awesome encodings
# preprocessor = TagkopPerceiverTextPreprocessor(config)


# preprocessor = PerceiverImagePreprocessor(config,
#                                           in_channels=1,
#                                           prep_type="1d",
#                                           position_encoding_type="fourier",
                                          

#                                           concat_or_add_pos="add",
#                                           out_channels=64,
#                                           project_pos_dim=64,
#                                           # tagkop_position_encoding_kwargs=dict(
#                                           #   num_channels=64,
#                                           #   index_dims=config.image_size**2,
#                                           #   ds="imdb"
#                                           #   ),
#                                           fourier_position_encoding_kwargs = dict(
#                                               concat_pos=False, max_resolution=(224, 224), num_bands=16, sine_only=False
#                                           )
#                                       )

decoder = PerceiverClassificationDecoder(config,
                                          num_channels=config.d_latents,
                                          trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
                                          use_query_residual=True,
                                         )

# TODO: set num_self_attends_per_block, num_self_attention_heads, and num_cross_attention_heads to something more reasonable, and out_channels, project_pos_dim, and num_channels to 64
model = PerceiverModel(config, input_preprocessor=preprocessor, decoder=decoder)



model.to(device)

config PerceiverConfig {
  "attention_probs_dropout_prob": 0.1,
  "audio_samples_per_frame": 1920,
  "cross_attention_shape_for_attention": "kv",
  "cross_attention_widening_factor": 1,
  "d_latents": 1280,
  "d_model": 64,
  "hidden_act": "gelu",
  "image_size": 56,
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "perceiver",
  "num_blocks": 1,
  "num_cross_attention_heads": 8,
  "num_frames": 16,
  "num_latents": 256,
  "num_self_attends_per_block": 4,
  "num_self_attention_heads": 8,
  "output_shape": [
    1,
    16,
    224,
    224
  ],
  "qk_channels": null,
  "samples_per_patch": 16,
  "self_attention_widening_factor": 1,
  "train_size": [
    368,
    496
  ],
  "transformers_version": "4.25.0.dev0",
  "use_query_residual": true,
  "v_channels": null,
  "vocab_size": 262
}



PerceiverModel(
  (input_preprocessor): PerceiverTextPreprocessor(
    (embeddings): Embedding(262, 64)
    (position_embeddings): Embedding(2048, 64)
  )
  (embeddings): PerceiverEmbeddings()
  (encoder): PerceiverEncoder(
    (cross_attention): PerceiverLayer(
      (attention): PerceiverAttention(
        (self): PerceiverSelfAttention(
          (layernorm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (layernorm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (query): Linear(in_features=1280, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (output): PerceiverSelfOutput(
          (dense): Linear(in_features=64, out_features=1280, bias=True)
        )
      )
      (layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (mlp): PerceiverMLP(
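
The printed config helps explain why the Perceiver can take 2048 byte-level tokens as input: the inputs are read once through cross-attention into a small latent array, and all further self-attention happens over the latents only. A rough back-of-the-envelope sketch, using the values from the config above and counting attention-score entries per head only (projections and MLPs are ignored, so this is an illustration rather than a full cost model):

```python
seq_len = 2048        # padded input length (max_position_embeddings)
num_latents = 256     # config.num_latents
num_self_attends = 4  # config.num_self_attends_per_block, with num_blocks = 1

# One cross-attention from the latents (queries) to the inputs (keys/values).
cross_scores = num_latents * seq_len
# Four self-attention layers over the latent array.
latent_scores = num_self_attends * num_latents * num_latents
# For comparison: a single full self-attention layer over the raw inputs.
full_self_scores = seq_len * seq_len

perceiver_total = cross_scores + latent_scores
print("Perceiver (1 cross + 4 latent layers):", perceiver_total)
print("One full self-attention layer over inputs:", full_self_scores)
print("Ratio:", round(full_self_scores / perceiver_total, 2))
```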
 

In [37]:
# you can then do a forward pass as follows:
tokenizer = PerceiverTokenizer()
text = "hello world"
inputs = tokenizer(text, return_tensors="pt").input_ids
print(inputs)
inputs = inputs.to(device)  # .to() returns a new tensor; it does not move `inputs` in place
with torch.no_grad():
    outputs = model(inputs=inputs.unsqueeze(1))
logits = outputs.logits
print('list(logits.shape): ', list(logits.shape))
# to train, use the standard cross-entropy loss:
criterion = torch.nn.CrossEntropyLoss()
labels = torch.tensor([1]).to(device)
loss = criterion(logits, labels)

tensor([[  4, 110, 107, 114, 114, 117,  38, 125, 117, 120, 114, 106,   5]])
using imdb dataset


RuntimeError: ignored
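
The ids printed above for "hello world" can be reproduced by hand, which makes the byte-level vocabulary easier to follow: each UTF-8 byte is shifted by the number of special tokens, and the sequence is wrapped in `[CLS]` (4) and `[SEP]` (5). The sketch below is a pure-Python illustration; the function names and the assumed count of 6 special tokens (consistent with `vocab_size` 262 = 256 + 6 in the config) are ours, not the tokenizer's actual code.

```python
NUM_SPECIAL_TOKENS = 6  # assumed: 262 vocabulary entries minus 256 byte values
CLS_ID, SEP_ID = 4, 5   # as seen at both ends of the printed tensor


def encode_bytes(text):
    """Map text to ids: UTF-8 bytes shifted past the special-token ids."""
    body = [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")]
    return [CLS_ID] + body + [SEP_ID]


def decode_bytes(ids):
    """Invert encode_bytes, dropping any special-token ids."""
    raw = bytes(i - NUM_SPECIAL_TOKENS for i in ids if i >= NUM_SPECIAL_TOKENS)
    return raw.decode("utf-8", errors="replace")


ids = encode_bytes("hello world")
print(ids)
print(decode_bytes(ids))
```

Because there is one id per byte, sequences are much longer than with subword tokenizers, which is why the maximum length here is 2048.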

## Train the model

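Here we train the model using native PyTorch. As written, the loop repeatedly fits a single fixed batch, which is useful as a sanity check; the commented-out `for batch in tqdm(train_dataloader)` line iterates over the full training set instead.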

In [19]:
# transformers.AdamW is deprecated; torch.optim.AdamW replaces it
# (note that its default eps and weight_decay differ slightly)
from torch.optim import AdamW
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score

optimizer = AdamW(model.parameters(), lr=1e-4)
# standard cross-entropy loss for classification
criterion = torch.nn.CrossEntropyLoss()

model.train()

batch = next(iter(train_dataloader))
for epoch in range(100):  # loop over the dataset multiple times
    # checkpoint at the start of every epoch
    torch.save(model.state_dict(), '/content/drive/MyDrive/saved_model/small_network_model_fourier.pt')
    print('saved model')
    print("Epoch:", epoch)
    for i in range(10):
    # for batch in tqdm(train_dataloader):
        # get the inputs
        inputs = batch["input_ids"].to(device)
        # attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs=inputs)
        logits = outputs.logits
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

        # evaluate on the current batch
        predictions = logits.argmax(-1).cpu().detach().numpy()
        accuracy = accuracy_score(y_true=batch["label"].numpy(), y_pred=predictions)
        print(f"Loss: {loss.item()}, Accuracy: {accuracy}")



saved model
Epoch: 0
Loss: 0.9116876721382141, Accuracy: 0.4
Loss: 4.553509712219238, Accuracy: 0.6
Loss: 0.7355924844741821, Accuracy: 0.7
Loss: 3.4581539630889893, Accuracy: 0.4
Loss: 1.4204754829406738, Accuracy: 0.4
Loss: 1.0859565734863281, Accuracy: 0.6
Loss: 1.576385736465454, Accuracy: 0.6
Loss: 1.2882251739501953, Accuracy: 0.6
Loss: 0.7216700315475464, Accuracy: 0.6
Loss: 0.7490414381027222, Accuracy: 0.6
saved model
Epoch: 1
Loss: 1.0918055772781372, Accuracy: 0.4
Loss: 0.9054195284843445, Accuracy: 0.4
Loss: 0.6239315867424011, Accuracy: 0.6
Loss: 0.691170334815979, Accuracy: 0.6
Loss: 0.8408317565917969, Accuracy: 0.6
Loss: 0.8485921025276184, Accuracy: 0.6
Loss: 0.7366556525230408, Accuracy: 0.6
Loss: 0.6212990880012512, Accuracy: 0.7
Loss: 0.6274121403694153, Accuracy: 0.6
Loss: 0.736826479434967, Accuracy: 0.8
saved model
Epoch: 2
Loss: 0.7458535432815552, Accuracy: 0.7
Loss: 0.6682078838348389, Accuracy: 0.7
Loss: 0.6056714057922363, Accuracy: 0.7
Loss: 0.6214624643325

KeyboardInterrupt: ignored
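
To put the logged losses in context: with two classes, a model that outputs equal logits has a cross-entropy of ln 2 ≈ 0.693, so losses hovering around 0.6 to 0.7 are only slightly better than chance. The sketch below computes, in pure Python, the per-example quantity that `torch.nn.CrossEntropyLoss` averages over a batch (log-softmax followed by negative log-likelihood). The helper name `cross_entropy` and the example logits are illustrative.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


chance = cross_entropy([0.0, 0.0], target=1)            # undecided model
confident_right = cross_entropy([-2.0, 2.0], target=1)  # correct and confident
confident_wrong = cross_entropy([2.0, -2.0], target=1)  # wrong and confident
print("chance:", round(chance, 4))
print("confident and right:", round(confident_right, 4))
print("confident and wrong:", round(confident_wrong, 4))
```

Confidently wrong predictions are penalized heavily, which is one plausible reading of the occasional loss spikes above 3 in the first epoch.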

## Evaluate the model

Finally, we evaluate the model on the test set. We use the Datasets library to compute the accuracy.

In [40]:
torch.save(model.state_dict(), '/content/drive/MyDrive/saved_model/small_network_model_fourier_embeddings.pt')

Mounted at /content/drive


In [16]:

# import torch
# checkpoint = torch.load('/content/drive/MyDrive/saved_model/small_network_model.pt')
# model.load_state_dict(checkpoint)
# model.eval()

RuntimeError: ignored

In [18]:
from tqdm.notebook import tqdm
from datasets import load_metric

accuracy = load_metric("accuracy")

model.eval()
with torch.no_grad():  # no gradients are needed for evaluation
    for batch in tqdm(test_dataloader):
        # get the inputs; set_format above kept only input_ids and label,
        # so there is no attention_mask column to pass here
        inputs = batch["input_ids"].to(device)
        labels = batch["label"].to(device)

        # forward pass
        outputs = model(inputs=inputs)
        logits = outputs.logits
        predictions = logits.argmax(-1).cpu().numpy()
        references = batch["label"].numpy()
        accuracy.add_batch(predictions=predictions, references=references)

final_score = accuracy.compute()
print("Accuracy on test set:", final_score)



Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

  0%|          | 0/500 [00:00<?, ?it/s]

Accuracy on test set: {'accuracy': 0.62816}
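
The metric object above accumulates predictions batch by batch and computes the accuracy once at the end. Since `load_metric` has been deprecated in favor of the separate Evaluate library, here is a dependency-free sketch of the same accumulate-then-compute pattern. The class name `RunningAccuracy` is hypothetical and only mirrors the `add_batch`/`compute` interface used above.

```python
class RunningAccuracy:
    """Accumulate correct/total counts over batches, then report accuracy."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        if len(predictions) != len(references):
            raise ValueError("predictions and references must have equal length")
        self.correct += sum(int(p == r) for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        if self.total == 0:
            raise ValueError("no batches were added")
        return {"accuracy": self.correct / self.total}


metric = RunningAccuracy()
metric.add_batch(predictions=[0, 1, 1], references=[0, 1, 0])
metric.add_batch(predictions=[1, 0], references=[1, 0])
result = metric.compute()
print(result)
```

Counting across batches, rather than averaging per-batch accuracies, gives the right answer even when the last batch is smaller than the others.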


In [45]:
# Evaluation with the Tagkop encodings (note the unsqueeze(1) on the inputs)
from tqdm.notebook import tqdm
from datasets import load_metric

accuracy = load_metric("accuracy")

model.eval()
for batch in tqdm(test_dataloader):
      # get the inputs; 
      inputs = batch["input_ids"].to(device)
      labels = batch["label"].to(device)

      # forward pass
      outputs = model(inputs=inputs.unsqueeze(1))
      logits = outputs.logits 
      predictions = logits.argmax(-1).cpu().detach().numpy()
      references = batch["label"].numpy()
      accuracy.add_batch(predictions=predictions, references=references)

final_score = accuracy.compute()
print("Accuracy on test set:", final_score)

  0%|          | 0/500 [00:00<?, ?it/s]

using imdb dataset

## Inference

In [22]:
text = "I hated this movie, it's really bad."

input_ids = tokenizer(text, return_tensors="pt").input_ids

# forward pass
outputs = model(inputs=input_ids.to(device))
logits = outputs.logits 
predicted_class_idx = logits.argmax(-1).item()

print("Predicted:", model.config.id2label[predicted_class_idx])

Predicted: LABEL_1
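
The output is `LABEL_1` rather than `pos` because the `PerceiverConfig` above was created without an `id2label` mapping, so the config falls back to generic `LABEL_<i>` names. One option is to pass `id2label=id2label, label2id=label2id` when building the config; another is to look the index up in the mapping built earlier. A minimal sketch of the latter, using made-up logits in place of a real forward pass (the helper name `predicted_label` is hypothetical):

```python
labels = ["neg", "pos"]  # as returned by train_ds.features['label'].names
id2label = {idx: label for idx, label in enumerate(labels)}


def predicted_label(logits, mapping):
    """Return the human-readable label of the highest-scoring class."""
    best_idx = max(range(len(logits)), key=lambda i: logits[i])
    return mapping[best_idx]


example_logits = [-0.3, 0.9]  # illustrative values, not actual model output
print("Predicted:", predicted_label(example_logits, id2label))
```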
