<h1> <a href="http://arxiv.org/abs/2406.07524">Simple and Effective Masked Diffusion Language Models</a> by Sahoo et al., 2024 </h1>

This Colab demonstrates how to generate samples with the Hugging Face (HF) checkpoint released with our paper. The model has a context length of `1024` and was trained on the OpenWebText dataset for 1 million training steps, processing approximately `33B` tokens.


# Install Dependencies

In [None]:
# Please ignore any warnings while installing the dependencies

! pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
! pip install causal-conv1d
! pip install datasets==2.18.0
! pip install einops==0.7.0
! pip install fsspec
! pip install git-lfs==1.6
! pip install h5py==3.10.0
! pip install hydra-core==1.3.2
! pip install lightning==2.2.1
! pip install mamba-ssm
! pip install nvitop==1.3.2
! pip install omegaconf==2.3.0
! pip install packaging==23.2
! pip install pandas
! pip install rich==13.7.1
! pip install seaborn==0.13.2
! pip install scikit-learn==1.4.0
! pip install timm==0.9.16
! pip install transformers==4.38.2
! pip install triton==2.2.0
! pip install wandb==0.13.5
! pip install flash-attn==2.5.6

Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch==2.2.2
  Downloading https://download.pytorch.org/whl/cu121/torch-2.2.2%2Bcu121-cp310-cp310-linux_x86_64.whl (757.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 757.3/757.3 MB 1.8 MB/s eta 0:00:00
Collecting torchvision==0.17.2
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.17.2%2Bcu121-cp310-cp310-linux_x86_64.whl (7.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 102.4 MB/s eta 0:00:00
Collecting torchaudio==2.2.2
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.2.2%2Bcu121-cp310-cp310-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 26.9 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.2.2)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_cuda_nvrtc_cu

Collecting lightning==2.2.1
  Downloading lightning-2.2.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 11.6 MB/s eta 0:00:00
Collecting lightning-utilities<2.0,>=0.8.0 (from lightning==2.2.1)
  Downloading lightning_utilities-0.11.2-py3-none-any.whl (26 kB)
Collecting torchmetrics<3.0,>=0.7.0 (from lightning==2.2.1)
  Downloading torchmetrics-1.4.0.post0-py3-none-any.whl (868 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 868.8/868.8 kB 18.9 MB/s eta 0:00:00
Collecting pytorch-lightning (from lightning==2.2.1)
  Downloading pytorch_lightning-2.3.0-py3-none-any.whl (812 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 812.2/812.2 kB 22.4 MB/s eta 0:00:00
Installing collected packages: lightning-utilities, torchmetrics, pytorch-lightning, lightning
Successfully installed lightning-2.2.1 lightning-utilities-0.11.2 pytorch-lightni

# Git clone

In [None]:
! git clone https://github.com/kuleshov-group/mdlm.git

Cloning into 'mdlm'...
remote: Enumerating objects: 122, done.
remote: Counting objects: 100% (122/122), done.
remote: Compressing objects: 100% (93/93), done.
remote: Total 122 (delta 38), reused 107 (delta 25), pack-reused 0
Receiving objects: 100% (122/122), 89.98 KiB | 4.74 MiB/s, done.
Resolving deltas: 100% (38/38), done.


# Imports

In [None]:
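import os
# Work from the repository root so that the local modules and the
# `configs` directory used below can be found.
os.chdir('mdlm')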

import fsspec
import hydra
import lightning as L
import omegaconf
import rich.syntax
import rich.tree
import torch

import dataloader
import diffusion
import main
import utils

# Sample generation

In [None]:
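# Hydra overrides applied on top of `configs/config.yaml`.
overrides = ['mode=sample_eval',
             # Pretrained checkpoint to load.
             'eval.checkpoint_path=kuleshov-group/mdlm-no_flashattn-fp32-owt',
             'data=openwebtext-split',
             # Context length of the model.
             'model.length=1024',
             'sampling.predictor=ddpm_cache',
             'sampling.steps=1000',
             'loader.eval_batch_size=1',
             'sampling.num_sample_batches=1',
             'backbone=hf_dit']

with hydra.initialize(version_base=None,
                      config_path='configs'):
  config = hydra.compose(config_name='config', overrides=overrides)
  # Separate config object for the semi-autoregressive section, so that
  # changing its sampling settings leaves `config` untouched.
  sar_config = hydra.compose(config_name='config', overrides=overrides)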
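
The override list above is a flat list of `key=value` strings. When experimenting with different sampler settings, it can be convenient to keep them in a dictionary and build the list from it. The following is a minimal sketch under that assumption: `build_overrides`, `sample_settings`, and `overrides_from_dict` are hypothetical names introduced here for illustration and are not part of the `mdlm` repository or of Hydra.

```python
def build_overrides(settings):
  """Turn a {key: value} mapping into a list of 'key=value' strings."""
  return [f'{key}={value}' for key, value in settings.items()]


# Hypothetical dictionary mirroring the settings used in this notebook.
sample_settings = {
    'mode': 'sample_eval',
    'eval.checkpoint_path': 'kuleshov-group/mdlm-no_flashattn-fp32-owt',
    'data': 'openwebtext-split',
    'model.length': 1024,
    'sampling.predictor': 'ddpm_cache',
    'sampling.steps': 1000,
    'loader.eval_batch_size': 1,
    'sampling.num_sample_batches': 1,
    'backbone': 'hf_dit',
}

# Dictionaries preserve insertion order, so the list keeps the same order.
overrides_from_dict = build_overrides(sample_settings)
for entry in overrides_from_dict:
  print(entry)
```

The resulting list could be passed as the `overrides` argument of `hydra.compose` in place of the hand-written one.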

In [None]:
L.seed_everything(config.seed)

logger = utils.get_logger(__name__)
tokenizer = dataloader.get_tokenizer(config)

samples = main.generate_samples(config, logger, tokenizer)
for sample in samples:
  print(sample)

<|endoftext|> do something else and then just work on yourself. You give them a lot of time. So it’s quicker.

KO: You’ve been placed there and there’s been the work there already: working on your game; working on effort; working on progress. How does that feel to be around?

BODYBOARD GETHER: It does give everybody a job to keep spending their time in the game, working on everything. And doing stuff better is just getting around to contributions already made.

RANDON WILLIGAN: There are a lot of stars in the game who have a little bit more experience in South Australia. How was it like to have that?

JOSEL LA PASCO: It was a huge difference. So much for my age group, you were never going to play here in Adelaide, which I did because I wasn’t offer a contract, pretty much everything.

In my third year I didn’t earn a home contract for that. I had a lot of time overseas, varying in the skills and competitions, in the smaller league, but I had so much further time at home. That, yeah. Th

## Semi-autoregressive sample generation
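
The cell below sets `stride_length` to 512 and `num_strides` to 2, and its comment gives the resulting sample length as the context length plus `num_strides * stride_length`. The arithmetic can be sketched as follows. `semi_ar_total_tokens` and `running_lengths` are hypothetical helpers written for illustration, not functions from the `mdlm` repository, and the running totals assume that each stride adds exactly `stride_length` new tokens, as the formula implies.

```python
def semi_ar_total_tokens(context_length, stride_length, num_strides):
  """Total sample length: context_length + num_strides * stride_length."""
  return context_length + num_strides * stride_length


def running_lengths(context_length, stride_length, num_strides):
  """Sample length after the initial block and after each stride.

  Assumes every stride appends exactly `stride_length` tokens.
  """
  return [context_length + i * stride_length
          for i in range(num_strides + 1)]


total = semi_ar_total_tokens(context_length=1024,
                             stride_length=512,
                             num_strides=2)
lengths = running_lengths(context_length=1024,
                          stride_length=512,
                          num_strides=2)
print(total)    # 2048
print(lengths)  # [1024, 1536, 2048]
```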

In [None]:
sar_config['sampling']['semi_ar'] = True
sar_config['sampling']['stride_length'] = 512
sar_config['sampling']['num_strides'] = 2

# Generates context_length + num_strides * stride_length tokens.
# In this case, we generate 1024 + 2 * 512 = 2048 tokens.

samples = main.generate_samples(sar_config, logger, tokenizer)
for sample in samples:
  print(sample)

INFO:__main__:Generating samples.


Text samples: ['<|endoftext|> fundamental principles of equality and fair treatment, but in order to get more important like fairness, what is the need to eliminate? If the accuser and the defendant have never had an equal conviction I mean how did they all know about both playing a role in this situation?\n\nL: I’m pretty sure someone was devastated with the Free Circuit’s decision. In the matter as it stands, I was able to take a amicus brief on the court that was in favor of the defendant. That is strong enough to be the dissenting opinion on this particular issue. So I’ve been asking you for a long time. I had more than 10 years to weigh in the Free Circuit’s decision. How do you weigh it in today’s second?\n\nCL: Now the court’s message for me is propaganda. Obviously, liberals should not embrace what conservatives said. I will act in a way based on conservative statements, and any leftist will. If the court gets the other way with me then they take it away from the liberal. In ot