# PixelBytes: Catching Insights in Unified Multimodal Sequences

This notebook presents **PixelBytes**, a model designed to generate text and images together as a single sequence, pixel by pixel. The goal is to explore a unified embedding that supports coherent multimodal generation.

## Background and Proposed Architecture

### Theoretical Foundations
- **Image Transformer**: [Pixel-by-pixel image generation](https://arxiv.org/abs/1802.05751)
- **Bi-Mamba+**: [Bidirectional model for time series forecasting](https://arxiv.org/abs/2404.15772)
- **MambaByte**: [Token-free selective state space model](https://arxiv.org/abs/2401.13660)

### Key Concept
The PixelByte model generates mixed sequences of text and images. It must:
- Handle transitions between text and image using line feeds (ASCII 0A).
- Keep the dimensions of the generated images consistent.
- Learn the "copy" task in order to reproduce complex patterns.
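
The first two requirements can be illustrated with a small sketch. It assumes, purely for illustration, that a sample is flattened into a token list in which `\n` (ASCII 0x0A) separates modalities and `\t` separates pixel rows, which is the convention used in the pre-test example of this notebook. `flatten_sample` and `image_widths` are hypothetical helpers, not part of the `pixelbytes` package, and representing pixels as integers is also an assumption.

```python
from typing import List, Sequence, Union

# Assumption: text elements are characters, pixels are integer palette indices.
Token = Union[str, int]


def flatten_sample(text: str, image_rows: Sequence[Sequence[int]]) -> List[Token]:
    """Flatten a caption and an image into one sequence (hypothetical layout)."""
    tokens: List[Token] = list(text)
    tokens.append("\n")  # line feed (0x0A) marks the switch from text to image
    for i, row in enumerate(image_rows):
        tokens.extend(row)
        is_last_row = i == len(image_rows) - 1
        # "\t" ends a pixel row, "\n" ends the image (and the modality)
        tokens.append("\n" if is_last_row else "\t")
    return tokens


def image_widths(tokens: Sequence[Token]) -> List[int]:
    """Return the width of every pixel row found in a flattened sequence."""
    widths, current = [], 0
    for tok in tokens:
        if isinstance(tok, int):
            current += 1
        elif current:  # a non-pixel token closes a non-empty pixel row
            widths.append(current)
            current = 0
    if current:
        widths.append(current)
    return widths


tokens = flatten_sample("ab", [[1, 2, 3], [4, 5, 6]])
widths = image_widths(tokens)
print(tokens)
print("consistent image width:", len(set(widths)) == 1)
```

A generated image has consistent dimensions, in this simplified view, when every pixel row reported by `image_widths` has the same length.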

This notebook runs on Kaggle's T4 x2 GPUs to experiment with several architectures and large datasets, with the aim of addressing the challenges of unified multimodal generation.

## Project Resources

### Dataset
This project uses the **PixelBytes-Pokemon** dataset, built specifically for this multimodal generation task. The dataset, created by the author of this notebook, is available on Hugging Face: [PixelBytes-Pokemon](https://huggingface.co/datasets/ffurfaro/PixelBytes-Pokemon). It contains Pokémon text and image sequences, encoded so that the PixelByte model can be trained on multimodal data.

### Implementation
The model implementation and the training scripts are available in the **Mamba-Bys** GitHub repository: [Mamba-Bys](https://github.com/fabienfrfr/Mamba-Bys). The repository contains the source code needed to reproduce the experiments, along with detailed instructions for configuring and using the PixelByte model.

## Models to train

- 8 LSTM (bidirectional + 1, 2, 3 layers) + (p_embed + bi-2 layers)
- 6 Mamba (bidirectional + 1, 2, 3 layers)
- 3 Transformers (1, 2, 3 layers)

# Pre-test

Before training the 8 LSTMs, take the pembed-bi-2 LSTM and test generation (at 40%, though this figure is influenced by repeated characters and pixels).

The model is very likely to predict the same pixel several times in a row, and to predict spaces and vowels.
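
To judge how much of such a score could come from repetition alone, one option is to compare it with a trivial baseline that always predicts the previous element. The sketch below is only an illustration: `repeat_baseline_accuracy` is a hypothetical helper and the toy sequence is made up, not taken from the dataset.

```python
from typing import Sequence


def repeat_baseline_accuracy(seq: Sequence) -> float:
    """Accuracy of a predictor that always repeats the previous element."""
    if len(seq) < 2:
        return 0.0
    # Count positions where the element equals its predecessor.
    hits = sum(1 for prev, cur in zip(seq, seq[1:]) if prev == cur)
    return hits / (len(seq) - 1)


# Made-up sequence with long runs, similar in spirit to flat pixel regions.
toy = [7, 7, 7, 7, 3, 3, 9, 9, 9, 2]
print(f"repeat-previous baseline: {repeat_baseline_accuracy(toy):.2%}")
```

If this baseline, computed on the real sequences, is close to the model's accuracy, the model has probably learned little beyond repetition.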

For generation, the model produces only the next central element. The generation algorithm must therefore rebuild the 2D structure itself. For example,
(T,T,\n,T,\n,P,P,P,\t,P,P,P,\n,T,T) gives
([[0,0,0],[0,T,0],[0,0,0]],
[[0,0,0],[T,T,0],[0,0,0]],
[[0,0,0],[T,\n,0],[0,0,0]],
[[0,0,0],[\n,T,0],[0,0,0]],
[[0,0,0],[T,\n,0],[0,0,0]],
[[0,0,0],[0,P,0],[0,0,0]],
[[0,0,0],[P,P,0],[0,0,0]],
[[0,0,0],[P,P,0],[0,0,0]],
[[0,0,0],[P,\t,0],[0,0,0]],
[[0,P,P],[0,P,0],[0,0,0]],
[[P,P,P],[P,P,0],[0,0,0]],
[[P,P,P],[P,P,0],[0,0,0]],
[[P,\t,0],[P,\n,0],[0,0,0]],
[[0,0,0],[0,T,0],[0,0,0]],
[[0,0,0],[T,T,0],[0,0,0]])
where T is a text element, P is a pixel, \n is a line break that also marks a change of modality, and \t marks a change of pixel row within the image and/or a tab in text.
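
One possible reading of this reconstruction can be sketched as follows. This is a simplified interpretation, not the algorithm used by the `pixelbytes` package, and it does not reproduce the listing above in every position. The rules are assumptions: each element sees its left neighbour and the three elements above it, the right column and bottom row stay at 0 because they lie in the future, `\t` closes a row, and `\n` closes a row and discards the row above so that no context crosses a modality boundary. `build_contexts` is a hypothetical helper.

```python
from typing import List, Sequence

PAD = 0  # placeholder for "no context available"


def _at(row: Sequence, j: int):
    """Return row[j], or PAD when j falls outside the row."""
    return row[j] if 0 <= j < len(row) else PAD


def build_contexts(seq: Sequence, row_break: str = "\t",
                   modality_break: str = "\n") -> List[list]:
    """Build one causal 3x3 neighbourhood per element of a flat sequence."""
    contexts = []
    prev_row: list = []  # last completed row of the current block
    cur_row: list = []   # row being filled
    for tok in seq:
        c = len(cur_row)  # column of the current element
        contexts.append([
            [_at(prev_row, c - 1), _at(prev_row, c), _at(prev_row, c + 1)],
            [_at(cur_row, c - 1), tok, PAD],
            [PAD, PAD, PAD],
        ])
        cur_row.append(tok)
        if tok == row_break:
            prev_row, cur_row = cur_row, []  # next row can look upward
        elif tok == modality_break:
            prev_row, cur_row = [], []       # start a fresh block
    return contexts


seq = ["T", "T", "\n", "T", "\n", "P", "P", "P", "\t",
       "P", "P", "P", "\n", "T", "T"]
contexts = build_contexts(seq)
for ctx in contexts:
    print(ctx)
```

During generation, the same bookkeeping (current column, previous row) would be updated after each predicted element, so the model only ever has to output the central value.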

In [1]:
!pip install -q git+https://github.com/fabienfrfr/PixelBytes.git@main

In [2]:
# Kaggle only: read the Hugging Face token from the notebook secrets
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_TOKEN")
# silence FutureWarning messages during training
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# the PixelBytes package
from pixelbytes import *

In [3]:
# load the dataset and hold out 10% of it for testing
hf_dataset = load_dataset("ffurfaro/PixelBytes-Pokemon")
ds = hf_dataset["train"].train_test_split(test_size=0.1)

# windows of 256 elements, taken every 32 elements
train_dataset = PxByDataset(ds["train"]["pixelbyte"], seq_length=256, stride=32)
test_dataset = PxByDataset(ds["test"]["pixelbyte"], seq_length=256, stride=32)

# the tokenizer defines the vocabulary size
pixelbyte = PixelBytesTokenizer()
vocab_size = len(pixelbyte)

Downloading readme:   0%|          | 0.00/426 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.59M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/964 [00:00<?, ? examples/s]
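
The `seq_length=256, stride=32` arguments suggest a sliding-window split of each sequence. The internals of `PxByDataset` are not shown in this notebook, so the sketch below is only a generic illustration of that idea. `sliding_windows` is a hypothetical helper, and dropping the incomplete tail is an assumption.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence, seq_length: int, stride: int) -> List[Sequence]:
    """Cut a sequence into fixed-length windows that start every `stride` elements."""
    last_start = len(tokens) - seq_length  # negative if the sequence is too short
    return [tokens[s:s + seq_length] for s in range(0, last_start + 1, stride)]


# Small example that is easy to check by eye.
demo = sliding_windows(list(range(10)), seq_length=4, stride=2)
print(demo)

# Same parameters as in the cell above, on a made-up sequence of 1000 elements.
windows = sliding_windows(list(range(1000)), seq_length=256, stride=32)
print(len(windows), "windows, overlap of", 256 - 32, "elements")
```

With these parameters, consecutive windows share 224 elements, which multiplies the number of training samples but also makes them strongly correlated.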

In [14]:
### LSTM config
# Note: model_config is reassigned on each line below, so only the last assignment in this cell takes effect.
# reference model (LSTM, bidirectional, pxby, 81 dim (9 embed), 64 state, 2 layers) (done)
model_config = ModelConfig(dim=81, d_state=64, depth=2, vocab_size=vocab_size)
# LSTM model (LSTM, bidirectional, center, 81 dim (9 embed), 64 state, 2 layers) # validator 1 (done)
model_config = ModelConfig(dim=81, d_state=64, depth=2, vocab_size=vocab_size, pxby_embed=False, pembed=False)
# LSTM model (LSTM, bidirectional, pxby-noconv, 81 dim (9 embed), 64 state, 2 layers) # validator 2 (done)
model_config = ModelConfig(dim=81, d_state=64, depth=2, vocab_size=vocab_size, pembed=False)
# LSTM model (LSTM, unidirectional, pxby, 81 dim (9 embed), 64 state, 2 layers) (done)
model_config = ModelConfig(dim=81, d_state=64, depth=2, vocab_size=vocab_size, bidirectional=False)
# LSTM model (LSTM, bidirectional, pxby, 36 dim (9 embed), 64 state, 2 layers) (done)
model_config = ModelConfig(dim=36, d_state=64, depth=2, vocab_size=vocab_size)
# LSTM model (LSTM, bidirectional, pxby, 162 dim (18 embed), 64 state, 2 layers) (done)
model_config = ModelConfig(dim=162, d_state=64, depth=2, vocab_size=vocab_size)
# LSTM model (LSTM, bidirectional, pxby, 81 dim (9 embed), 32 state, 2 layers) (done)
model_config = ModelConfig(dim=81, d_state=32, depth=2, vocab_size=vocab_size)
# LSTM model (LSTM, bidirectional, pxby, 81 dim (9 embed), 128 state, 2 layers) (done)
model_config = ModelConfig(dim=81, d_state=128, depth=2, vocab_size=vocab_size)
# LSTM model (LSTM, bidirectional, pxby, 81 dim (9 embed), 64 state, 1 layer) (done)
model_config = ModelConfig(dim=81, d_state=64, depth=1, vocab_size=vocab_size)
# LSTM model (LSTM, bidirectional, pxby, 81 dim (9 embed), 64 state, 3 layers) (done)
model_config = ModelConfig(dim=81, d_state=64, depth=3, vocab_size=vocab_size)
# LSTM model (LSTM, bidirectional, pxby, 81 dim (9 embed), 64 state, 2 layers) # special (done)

### LSTM model
#model = SimpleRNNModel(model_config)

### Mamba config
# Mamba model (Mamba, bidirectional, pxby, 81 dim (9 embed), 64 state, 2 layers)
model_config = ModelConfig(dim=81, d_state=64, depth=2, vocab_size=vocab_size)
# Mamba model (Mamba, unidirectional, pxby, 81 dim (9 embed), 64 state, 2 layers)
model_config = ModelConfig(dim=81, d_state=64, depth=2, vocab_size=vocab_size, bidirectional=False)
# Mamba model (Mamba, bidirectional, pxby, 81 dim (9 embed), 64 state, 1 layer)
model_config = ModelConfig(dim=81, d_state=64, depth=1, vocab_size=vocab_size)

### Mamba model
#model = bMamba(model_config)

### Transformer config
# Transformer model (Transformer, //, pxby, 81 dim (9 embed), 64 state, 1 layer)
model_config = ModelConfig(dim=81, d_state=64, depth=1, vocab_size=vocab_size)
# Transformer model (Transformer, //, pxby, 81 dim (9 embed), 64 state, 2 layers)
model_config = ModelConfig(dim=81, d_state=64, depth=2, vocab_size=vocab_size)

### Transformer model
model = SimpleTransformerModel(model_config)


## show the model
print(model)

SimpleTransformerModel(
  (embedding): PxByEmbed(
    (projection): Linear(in_features=81, out_features=80, bias=True)
    (norm): LayerNorm((80,), eps=1e-05, elementwise_affine=True)
    (linear_embedding): Embedding(113, 9)
    (patch_embedding): Conv2d(9, 9, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=80, out_features=80, bias=True)
        )
        (linear1): Linear(in_features=80, out_features=320, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=320, out_features=80, bias=True)
        (norm1): LayerNorm((80,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((80,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inpla

In [15]:
train_config = TrainConfig(model=model, model_config=model_config, dataset_name="PixelBytes-Pokemon", hf_token=hf_token,
                           train_dataset=train_dataset,test_dataset=test_dataset, num_epochs=200, repo_name="PixelBytes-Pokemon")
trainer = Trainer(train_config)

Complete path of pytorch model '.pth': models/attention_bi_pxby_conv_81-dim_64-state_2-layer_PixelBytes-Pokemon


In [16]:
trainer.train_and_evaluate()

Training:   2%|▎         | 5/200 [01:32<1:00:36, 18.65s/it]


Epoch 5: Train Loss: 1.5959, Test Loss: 1.6556, Test Acc: 52.26%


Training:   5%|▌         | 10/200 [03:04<58:49, 18.58s/it] 


Epoch 10: Train Loss: 1.4668, Test Loss: 1.6535, Test Acc: 53.23%


Training:   8%|▊         | 15/200 [04:36<57:17, 18.58s/it]


Epoch 15: Train Loss: 1.3796, Test Loss: 1.6471, Test Acc: 52.61%


Training:  10%|█         | 20/200 [06:08<55:43, 18.57s/it]


Epoch 20: Train Loss: 1.3011, Test Loss: 1.6858, Test Acc: 52.85%


Training:  12%|█▎        | 25/200 [07:40<54:03, 18.54s/it]


Epoch 25: Train Loss: 1.2511, Test Loss: 1.7465, Test Acc: 52.79%


Training:  15%|█▌        | 30/200 [09:12<52:34, 18.55s/it]


Epoch 30: Train Loss: 1.2003, Test Loss: 1.8080, Test Acc: 51.81%


Training:  18%|█▊        | 35/200 [10:44<51:03, 18.56s/it]


Epoch 35: Train Loss: 1.1591, Test Loss: 1.8476, Test Acc: 51.05%


Training:  20%|██        | 40/200 [12:16<49:31, 18.57s/it]


Epoch 40: Train Loss: 1.1340, Test Loss: 1.8541, Test Acc: 50.93%


Training:  22%|██▎       | 45/200 [13:48<47:59, 18.58s/it]


Epoch 45: Train Loss: 1.0932, Test Loss: 1.9134, Test Acc: 51.02%


Training:  25%|██▌       | 50/200 [15:20<46:26, 18.58s/it]


Epoch 50: Train Loss: 1.0720, Test Loss: 1.9615, Test Acc: 50.01%


Training:  28%|██▊       | 55/200 [16:52<44:52, 18.57s/it]


Epoch 55: Train Loss: 1.0554, Test Loss: 1.9537, Test Acc: 51.46%


Training:  30%|███       | 60/200 [18:24<43:18, 18.56s/it]


Epoch 60: Train Loss: 1.0304, Test Loss: 1.9836, Test Acc: 50.93%


Training:  32%|███▎      | 65/200 [19:56<41:45, 18.56s/it]


Epoch 65: Train Loss: 1.0163, Test Loss: 1.9856, Test Acc: 51.28%


Training:  35%|███▌      | 70/200 [21:27<40:12, 18.56s/it]


Epoch 70: Train Loss: 0.9993, Test Loss: 2.0339, Test Acc: 50.28%


Training:  38%|███▊      | 75/200 [22:59<38:40, 18.56s/it]


Epoch 75: Train Loss: 0.9874, Test Loss: 2.0518, Test Acc: 50.81%


Training:  40%|████      | 80/200 [24:31<37:07, 18.56s/it]


Epoch 80: Train Loss: 0.9698, Test Loss: 2.0991, Test Acc: 49.90%


Training:  42%|████▎     | 85/200 [26:03<35:33, 18.55s/it]


Epoch 85: Train Loss: 0.9535, Test Loss: 2.0786, Test Acc: 50.55%


Training:  45%|████▌     | 90/200 [27:35<33:59, 18.54s/it]


Epoch 90: Train Loss: 0.9479, Test Loss: 2.1175, Test Acc: 50.28%


Training:  48%|████▊     | 95/200 [29:07<32:25, 18.52s/it]


Epoch 95: Train Loss: 0.9358, Test Loss: 2.1332, Test Acc: 50.66%


Training:  50%|█████     | 100/200 [30:39<30:50, 18.51s/it]


Epoch 100: Train Loss: 0.9202, Test Loss: 2.1489, Test Acc: 50.16%


Training:  52%|█████▎    | 105/200 [32:10<29:18, 18.51s/it]


Epoch 105: Train Loss: 0.9121, Test Loss: 2.1602, Test Acc: 50.55%


Training:  55%|█████▌    | 110/200 [33:42<27:45, 18.51s/it]


Epoch 110: Train Loss: 0.9071, Test Loss: 2.1127, Test Acc: 50.10%


Training:  57%|█████▊    | 115/200 [35:14<26:13, 18.51s/it]


Epoch 115: Train Loss: 0.8952, Test Loss: 2.1678, Test Acc: 49.60%


Training:  60%|██████    | 120/200 [36:45<24:40, 18.51s/it]


Epoch 120: Train Loss: 0.8894, Test Loss: 2.1809, Test Acc: 50.16%


Training:  62%|██████▎   | 125/200 [38:17<23:07, 18.51s/it]


Epoch 125: Train Loss: 0.8820, Test Loss: 2.1815, Test Acc: 49.87%


Training:  65%|██████▌   | 130/200 [39:49<21:35, 18.51s/it]


Epoch 130: Train Loss: 0.8595, Test Loss: 2.2618, Test Acc: 49.54%


Training:  68%|██████▊   | 135/200 [41:20<20:03, 18.51s/it]


Epoch 135: Train Loss: 0.8750, Test Loss: 2.1694, Test Acc: 50.66%


Training:  70%|███████   | 140/200 [42:52<18:30, 18.51s/it]


Epoch 140: Train Loss: 0.8636, Test Loss: 2.2026, Test Acc: 50.16%


Training:  72%|███████▎  | 145/200 [44:24<16:58, 18.51s/it]


Epoch 145: Train Loss: 0.8541, Test Loss: 2.2183, Test Acc: 49.66%


Training:  75%|███████▌  | 150/200 [45:55<15:25, 18.51s/it]


Epoch 150: Train Loss: 0.8450, Test Loss: 2.2015, Test Acc: 49.75%


Training:  78%|███████▊  | 155/200 [47:27<13:52, 18.51s/it]


Epoch 155: Train Loss: 0.8381, Test Loss: 2.2182, Test Acc: 49.40%


Training:  80%|████████  | 160/200 [48:59<12:20, 18.51s/it]


Epoch 160: Train Loss: 0.8347, Test Loss: 2.2290, Test Acc: 48.78%


Training:  82%|████████▎ | 165/200 [50:31<10:47, 18.51s/it]


Epoch 165: Train Loss: 0.8324, Test Loss: 2.2542, Test Acc: 49.25%


Training:  85%|████████▌ | 170/200 [52:02<09:15, 18.50s/it]


Epoch 170: Train Loss: 0.8315, Test Loss: 2.2373, Test Acc: 49.16%


Training:  88%|████████▊ | 175/200 [53:34<07:43, 18.52s/it]


Epoch 175: Train Loss: 0.8225, Test Loss: 2.2766, Test Acc: 49.10%


Training:  90%|█████████ | 180/200 [55:06<06:10, 18.52s/it]


Epoch 180: Train Loss: 0.8138, Test Loss: 2.2989, Test Acc: 49.60%


Training:  92%|█████████▎| 185/200 [56:37<04:37, 18.52s/it]


Epoch 185: Train Loss: 0.8117, Test Loss: 2.2375, Test Acc: 49.87%


Training:  95%|█████████▌| 190/200 [58:09<03:05, 18.53s/it]


Epoch 190: Train Loss: 0.8026, Test Loss: 2.2569, Test Acc: 48.86%


Training:  98%|█████████▊| 195/200 [59:41<01:32, 18.52s/it]


Epoch 195: Train Loss: 0.8097, Test Loss: 2.2663, Test Acc: 50.55%


Training: 100%|██████████| 200/200 [1:01:13<00:00, 18.37s/it]



Epoch 200: Train Loss: 0.7973, Test Loss: 2.2919, Test Acc: 49.10%
Repository 'ffurfaro/PixelBytes-Pokemon' created or already exists.


model.safetensors:   0%|          | 0.00/697k [00:00<?, ?B/s]

Model pushed successfully to PixelBytes-Pokemon, subfolder: attention_bi_pxby_conv_81_dim_64_state_2_layer_best
Repository 'ffurfaro/PixelBytes-Pokemon' created or already exists.


model.safetensors:   0%|          | 0.00/697k [00:00<?, ?B/s]

Model pushed successfully to PixelBytes-Pokemon, subfolder: attention_bi_pxby_conv_81_dim_64_state_2_layer_last
Training completed. Results and models saved.
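
The log above is easier to analyze once parsed. The sketch below is a hypothetical helper, not part of the `pixelbytes` package. It assumes the log line format shown in this output, extracts the metrics with a regular expression from three lines copied from the log, and picks the entry with the lowest test loss.

```python
import re

# Matches lines such as:
# "Epoch 5: Train Loss: 1.5959, Test Loss: 1.6556, Test Acc: 52.26%"
LOG_PATTERN = re.compile(
    r"Epoch (\d+): Train Loss: ([\d.]+), Test Loss: ([\d.]+), Test Acc: ([\d.]+)%"
)


def parse_log(text: str) -> list:
    """Return one dict of metrics per epoch line found in the text."""
    records = []
    for match in LOG_PATTERN.finditer(text):
        epoch, train_loss, test_loss, test_acc = match.groups()
        records.append({
            "epoch": int(epoch),
            "train_loss": float(train_loss),
            "test_loss": float(test_loss),
            "test_acc": float(test_acc),
        })
    return records


# Three lines copied from the training output above.
sample = """
Epoch 5: Train Loss: 1.5959, Test Loss: 1.6556, Test Acc: 52.26%
Epoch 15: Train Loss: 1.3796, Test Loss: 1.6471, Test Acc: 52.61%
Epoch 200: Train Loss: 0.7973, Test Loss: 2.2919, Test Acc: 49.10%
"""
records = parse_log(sample)
best = min(records, key=lambda r: r["test_loss"])
print("lowest test loss at epoch", best["epoch"])
```

On these three lines the lowest test loss is at epoch 15, while the train loss keeps falling through epoch 200. A widening gap of this kind is a common sign of overfitting, and it is one reason to compare the `_best` and `_last` checkpoints.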


In [19]:
ls models

attention_bi_pxby_conv_81-dim_64-state_1-layer_PixelBytes-Pokemon/
attention_bi_pxby_conv_81-dim_64-state_1-layer_best/
attention_bi_pxby_conv_81-dim_64-state_1-layer_last/
attention_bi_pxby_conv_81-dim_64-state_2-layer_PixelBytes-Pokemon/
attention_bi_pxby_conv_81-dim_64-state_2-layer_best/
attention_bi_pxby_conv_81-dim_64-state_2-layer_last/
ssm_bi_pxby_conv_81-dim_64-state_1-layer_PixelBytes-Pokemon/
ssm_bi_pxby_conv_81-dim_64-state_1-layer_best/
ssm_bi_pxby_conv_81-dim_64-state_1-layer_last/
ssm_uni_pxby_conv_81-dim_64-state_2-layer_PixelBytes-Pokemon/
ssm_uni_pxby_conv_81-dim_64-state_2-layer_best/
ssm_uni_pxby_conv_81-dim_64-state_2-layer_last/


In [12]:
model_ = bMamba.from_pretrained("ffurfaro/PixelBytes-Pokemon", subfolder="ssm_bi_pxby_conv_81_dim_64_state_2_layer_last")
model_

bMamba(
  (embedding): PxByEmbed(
    (projection): Linear(in_features=81, out_features=81, bias=True)
    (norm): LayerNorm((81,), eps=1e-05, elementwise_affine=True)
    (linear_embedding): Embedding(113, 9)
    (patch_embedding): Conv2d(9, 9, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (_mamba): Mamba(
    (in_proj): Linear(in_features=81, out_features=324, bias=False)
    (conv1d): Conv1d(162, 162, kernel_size=(4,), stride=(1,), padding=(3,), groups=162)
    (act): SiLU()
    (x_proj): Linear(in_features=162, out_features=134, bias=False)
    (dt_proj): Linear(in_features=6, out_features=162, bias=True)
    (out_proj): Linear(in_features=162, out_features=81, bias=False)
  )
  (layers): ModuleList(
    (0): Mamba(
      (in_proj): Linear(in_features=81, out_features=324, bias=False)
      (conv1d): Conv1d(162, 162, kernel_size=(4,), stride=(1,), padding=(3,), groups=162)
      (act): SiLU()
      (x_proj): Linear(in_features=162, out_features=134, bias=False)
      (dt