diff --git a/docs/package-lock.json b/docs/package-lock.json
index bf06b3d..6a98f0d 100644
--- a/docs/package-lock.json
+++ b/docs/package-lock.json
@@ -1,10 +1,11 @@
 {
-  "name": "docs",
+  "name": "modular-diffusion-docs",
   "version": "0.0.1",
   "lockfileVersion": 3,
   "requires": true,
   "packages": {
     "": {
+      "name": "modular-diffusion-docs",
       "version": "0.0.1",
       "dependencies": {
         "@astrojs/mdx": "^0.19.7",
diff --git a/docs/src/pages/guides/custom-modules.mdx b/docs/src/pages/guides/custom-modules.mdx
index 4f2c2cd..14d20b4 100644
--- a/docs/src/pages/guides/custom-modules.mdx
+++ b/docs/src/pages/guides/custom-modules.mdx
@@ -152,7 +152,7 @@ The `schedule` method precomputes `alpha` and `delta` (cumulative product of `al
 
 ## Denoiser neural network
 
-Modular Diffusion comes with general-use `UNet` and `Transformer` classes, which have proven to be effective denoising networks in the context of Diffusion Models. However, it is not uncommon to see authors make modifications to these networks to achieve even better results. To design your own original network, extend the base abstract `Net` class. This class acts as only a thin wrapper over the standard PyTorch `nn.Module` class, meaning you can use it exactly the same way. The `forward` method should take three tensor arguments: the noisy input `x`, the conditioning matrix `y`, and the diffusion time steps `t`.
+Modular Diffusion comes with general-use `UNet` and `Transformer` classes, which have proven to be effective denoising networks in the context of Diffusion Models. However, it is not uncommon to see authors make modifications to these networks to achieve even better results. To design your own original network, extend the base abstract `Net` class. This class acts as only a thin wrapper over the standard Pytorch `nn.Module` class, meaning you can use it exactly the same way. The `forward` method should take three tensor arguments: the noisy input `x`, the conditioning matrix `y`, and the diffusion time steps `t`.
 
 > Network output shape
 >
diff --git a/docs/src/pages/guides/getting-started.mdx b/docs/src/pages/guides/getting-started.mdx
index 683c4fd..b3d4369 100644
--- a/docs/src/pages/guides/getting-started.mdx
+++ b/docs/src/pages/guides/getting-started.mdx
@@ -20,11 +20,11 @@ Before you start, please install Modular Diffusion in your local Python environm
 python -m pip install modular-diffusion
 ```
 
-Additionally, ensure you've installed the correct [PyTorch distribution](https://pytorch.org/get-started/locally/) for your system.
+Additionally, ensure you've installed the correct [Pytorch distribution](https://pytorch.org/get-started/locally/) for your system.
 
 ## Train a simple model
 
-The first step before training a Diffusion Model is to load your dataset. In this example, we will be using [MNIST](http://yann.lecun.com/exdb/mnist/), which includes 70,000 grayscale images of handwritten digits, and is a great simple dataset to prototype your image models. We are going to load MNIST with [PyTorch Vision](https://pytorch.org/vision/stable/index.html), but you can load your dataset any way you like, as long as it results in a `torch.Tensor` object. We are also going to discard the labels and scale the data to the commonly used $[-1, 1]$ range.
+The first step before training a Diffusion Model is to load your dataset. In this example, we will be using [MNIST](http://yann.lecun.com/exdb/mnist/), which includes 70,000 grayscale images of handwritten digits, and is a great simple dataset to prototype your image models. We are going to load MNIST with [Pytorch Vision](https://pytorch.org/vision/stable/index.html), but you can load your dataset any way you like, as long as it results in a `torch.Tensor` object. We are also going to discard the labels and scale the data to the commonly used $[-1, 1]$ range.
 
 ```python
 import torch
diff --git a/docs/src/pages/modules/denoising-network.mdx b/docs/src/pages/modules/denoising-network.mdx
index 75e5656..60404c5 100644
--- a/docs/src/pages/modules/denoising-network.mdx
+++ b/docs/src/pages/modules/denoising-network.mdx
@@ -7,7 +7,7 @@ visualizations: maybe
 
 # {frontmatter.title}
 
-The backbone of Diffusion Models is a denoising network, which is trained to gradually denoise data. While earlier works used a **U-Net** architecture, newer research has shown that **Transformers** can be used to achieve comparable or superior results. Modular Diffusion ships with both types of denoising network. Both are implemented in PyTorch and thinly wrapped in a `Net` module.
+The backbone of Diffusion Models is a denoising network, which is trained to gradually denoise data. While earlier works used a **U-Net** architecture, newer research has shown that **Transformers** can be used to achieve comparable or superior results. Modular Diffusion ships with both types of denoising network. Both are implemented in Pytorch and thinly wrapped in a `Net` module.
 
 > Future warning
 >
diff --git a/docs/src/pages/modules/diffusion-model.mdx b/docs/src/pages/modules/diffusion-model.mdx
index cd87c30..e56c3a6 100644
--- a/docs/src/pages/modules/diffusion-model.mdx
+++ b/docs/src/pages/modules/diffusion-model.mdx
@@ -16,6 +16,7 @@ In Modular Diffusion, the `Model` class is a high-level interface that allows yo
 - `net` -> Denoising network module.
 - `loss` -> Loss function module.
 - `guidance` (Default: `None`) -> Optional guidance module.
+- `optimizer` (Default: `partial(Adam, lr=1e-4)`) -> Pytorch optimizer constructor function.
 - `device` (Default: `"cpu"`) -> Device to train the model on.
 - `compile` (Default: `true`) -> Whether to compile the model with `torch.compile` for faster training.
 
@@ -28,6 +29,8 @@ from diffusion.loss import Simple
 from diffusion.net import UNet
 from diffusion.noise import Gaussian
 from diffusion.schedule import Cosine
+from torch.optim import AdamW
+from functools import partial
 
 model = diffusion.Model(
     data=Identity(x, y, batch=128, shuffle=True),
@@ -36,6 +39,7 @@ model = diffusion.Model(
     net=UNet(channels=(1, 64, 128, 256), labels=10),
     loss=Simple(parameter="epsilon"),
     guidance=ClassifierFree(dropout=0.1, strength=2),
+    optimizer=partial(AdamW, lr=3e-4),
     device="cuda" if torch.cuda.is_available() else "cpu",
 )
 ```