# Stable Diffusion - Concepts

* https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features
* https://techhenzy.com/stable-diffusion-ultimate-beginners-guide/
* https://www.reddit.com/r/StableDiffusion/comments/z6y6n4/comment/iy4adq5/?utm_source=share&utm_medium=web2x&context=3

## Txt2Img

### Prompts

### Negative Prompt

### Attention

* a `(word)` - increase attention to word by a factor of 1.1
* a `((word))` - increase attention to word by a factor of 1.21 (= 1.1 * 1.1)
* a `[word]` - decrease attention to word by a factor of 1.1
* a `(word:1.5)` - increase attention to word by a factor of 1.5
* a `(word:0.25)` - decrease attention to word by a factor of 4 (= 1 / 0.25)
* a `\(word\)` - use literal () characters in prompt
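
The weighting rules above can be sketched in a few lines of Python. This is a simplified illustration for a single token, not the webui's actual prompt parser; the function name `attention_weight` is made up for this example.

```python
import re


def attention_weight(token: str) -> tuple[str, float]:
    """Return (word, weight) for one emphasised token (simplified sketch)."""
    # Explicit weight, e.g. (word:1.5)
    m = re.fullmatch(r"\(([^():\[\]]+):([0-9]*\.?[0-9]+)\)", token)
    if m:
        return m.group(1), float(m.group(2))

    weight = 1.0
    # Peel matching outer brackets one layer at a time.
    while len(token) >= 2:
        if token[0] == "(" and token[-1] == ")":
            weight *= 1.1   # each () layer multiplies by 1.1
        elif token[0] == "[" and token[-1] == "]":
            weight /= 1.1   # each [] layer divides by 1.1
        else:
            break
        token = token[1:-1]
    return token, weight


print(attention_weight("((word))"))
print(attention_weight("(word:0.25)"))
```

Escaped brackets such as `\(word\)` start with a backslash, so this sketch leaves them untouched with a weight of 1.0.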

### Prompt editing

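* `[from:to:when]` - replaces `from` with `to` after a fixed number of steps (`when`)
* `[to:when]` - adds `to` to the prompt after a fixed number of steps (`when`)
* `[from::when]` - removes `from` from the prompt after a fixed number of steps (`when`)

For example, `a [fantasy:cyberpunk:16] landscape` starts out as a fantasy landscape and switches to a cyberpunk landscape after step 16.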
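
To make the switching behaviour concrete, here is a minimal sketch that resolves such a prompt for a given step. It is an illustration only: the function `prompt_at_step` is hypothetical, it handles only integer `when` values and no nesting, and treating "after step 16" as "from step 17 onward" is an assumption.

```python
import re

# Matches [from:to:when], [to:when] and [from::when] with an integer "when".
EDIT_PATTERN = re.compile(r"\[([^\[\]:]*):(?:([^\[\]:]*):)?(\d+)\]")


def prompt_at_step(template: str, step: int) -> str:
    """Return the prompt text that is active at the given sampling step."""
    def resolve(match: re.Match) -> str:
        first, second, when = match.group(1), match.group(2), int(match.group(3))
        if second is None:
            # [to:when]: nothing before the switch, "to" afterwards
            before, after = "", first
        else:
            # [from:to:when] or [from::when] (empty "to")
            before, after = first, second
        return before if step <= when else after

    return EDIT_PATTERN.sub(resolve, template)


print(prompt_at_step("a [fantasy:cyberpunk:16] landscape", 10))
print(prompt_at_step("a [fantasy:cyberpunk:16] landscape", 17))
```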

### Prompt Chunks and BREAK

Typing past the standard 75 tokens that Stable Diffusion usually accepts raises the prompt size limit from 75 to 150 tokens; typing past that raises it further. This works by breaking the prompt into chunks of 75 tokens, processing each chunk independently with CLIP's Transformer neural network, and then concatenating the results before feeding them into the next component of Stable Diffusion, the U-Net.

Adding a `BREAK` keyword (must be uppercase) fills the current chunk with padding characters. Any text after `BREAK` starts a new chunk.
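
The chunking can be pictured with a small sketch. It assumes, for simplicity, that every whitespace-separated word is one token (real CLIP tokenization splits text differently), and both `chunk_prompt` and the `<pad>` marker are made-up names for illustration.

```python
def chunk_prompt(words, chunk_size=75, pad="<pad>"):
    """Split a list of tokens into fixed-size chunks, honouring BREAK."""
    chunks, current = [], []
    for word in words:
        if word == "BREAK":
            # Fill the rest of the current chunk with padding and start a new one.
            chunks.append(current + [pad] * (chunk_size - len(current)))
            current = []
            continue
        if len(current) == chunk_size:
            # Current chunk is full: close it and open the next one.
            chunks.append(current)
            current = []
        current.append(word)
    if current or not chunks:
        # Pad the final (possibly partial) chunk to full length.
        chunks.append(current + [pad] * (chunk_size - len(current)))
    return chunks


long_prompt = ["token"] * 80
print(len(chunk_prompt(long_prompt)))                # 80 tokens need two chunks
print(len(chunk_prompt("a cat BREAK a dog".split())))  # BREAK forces a second chunk
```

Each chunk would then be encoded separately and the results concatenated, as described above.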

### Clip Skip

This is a slider in the settings that controls how early the processing of the prompt by the CLIP network is stopped.

A more detailed explanation:

CLIP is a neural network that transforms your prompt text into a numerical representation. Neural networks work well with such representations, which is why the developers of SD chose CLIP as one of the three models involved in Stable Diffusion's method of producing images. Like most neural networks, CLIP consists of many layers. Your prompt is first digitized in a simple way and then fed through the layers: the output of the first layer is fed into the second, the result of that into the third, and so on until the last layer, whose output is what Stable Diffusion uses. This corresponds to a slider value of 1. You can also stop early and use the output of the next-to-last layer, which corresponds to a slider value of 2. The earlier you stop, the fewer layers of the network have worked on the prompt.

Some models were trained with this kind of tweak, so setting this value helps produce better results on those models.
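
The relationship between the slider value and the number of layers used can be sketched with a toy stack of layers. This is only an illustration of the "stop early" idea: the layers here are plain Python functions rather than real CLIP transformer blocks, and `encode_with_clip_skip` is a hypothetical name.

```python
def encode_with_clip_skip(x, layers, clip_skip=1):
    """Run x through the layer stack, stopping (clip_skip - 1) layers early."""
    if not 1 <= clip_skip <= len(layers):
        raise ValueError("clip_skip must be between 1 and the number of layers")
    last_used = len(layers) - (clip_skip - 1)
    for layer in layers[:last_used]:
        x = layer(x)
    return x


# Toy stack: 12 "layers" that each just add 1, so the result counts layers used.
toy_layers = [lambda value: value + 1] * 12

print(encode_with_clip_skip(0, toy_layers, clip_skip=1))  # all 12 layers
print(encode_with_clip_skip(0, toy_layers, clip_skip=2))  # stops one layer early
```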

### Prompt matrix

Separate multiple prompts using the `|` character, and the system will produce an image for every combination of them. For example, if you use the prompt `a busy city street in a modern city|illustration|cinematic lighting`, there are four possible combinations (the first part of the prompt is always kept):

* a busy city street in a modern city
* a busy city street in a modern city, illustration
* a busy city street in a modern city, cinematic lighting
* a busy city street in a modern city, illustration, cinematic lighting
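
The same combinations can be generated programmatically. The sketch below only builds the prompt strings and is not the webui's own implementation; `prompt_matrix` is a name chosen for this example.

```python
from itertools import combinations


def prompt_matrix(prompt: str) -> list[str]:
    """Expand 'base|opt1|opt2|...' into every combination of the options."""
    base, *options = [part.strip() for part in prompt.split("|")]
    results = []
    # Choose 0, 1, 2, ... options; the base part is always kept.
    for count in range(len(options) + 1):
        for combo in combinations(options, count):
            results.append(", ".join([base, *combo]))
    return results


for p in prompt_matrix("a busy city street in a modern city|illustration|cinematic lighting"):
    print(p)
```

With n optional parts this yields 2^n prompts, so the number of images grows quickly as options are added.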

### Face restoration

Lets you improve faces in pictures using either GFPGAN or CodeFormer. There is a checkbox in every tab to use face restoration, and also a separate tab that just allows you to use face restoration on any picture, with a slider that controls how visible the effect is. You can choose between the two methods in settings.

### Refiner

This secondary model is designed to process the 1024×1024 SD-XL image near completion, to further enhance and refine details in your final output picture. As of version 1.6.0, this is now implemented in the webui natively.

### VAE

A VAE is a variational autoencoder.

An autoencoder is a model (or part of a model) that is trained to produce its input as output. By giving the model less information to represent the data than the input contains, it's forced to learn about the input distribution and compress the information. A stereotypical autoencoder has an hourglass shape - let's say it starts with 100 inputs and reduces it to 50 then 20 then 10 (encoder) and then 10 to 20 to 50 to 100 (decoder). The 10 dimensions that the encoder produces and the decoder consumes are called the latent representation.

Autoencoders can be a powerful paradigm and can be trained in an unsupervised way (without needing to label data since we only need the input data). However, if we want to sample from the input distribution, a vanilla autoencoder makes this difficult or impossible. One variation on the autoencoder is the variational autoencoder where the latent is normally distributed, which allows for the output distribution to be sampled from.

SD is somewhat unique in the vision class of diffusion models in that the diffusion process operates in the autoencoder space instead of pixel space. This makes the diffusion process more computationally efficient / memory efficient compared to a vanilla pixel space diffusion model. One other related technique some models use is to start the diffusion at a lower spatial resolution and progressively upscale to save compute.
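
A quick calculation shows why working in the autoencoder space is cheaper. The numbers below are assumptions based on the commonly described SD v1 setup (an 8x spatial downscale and 4 latent channels); other model versions may differ, and the function names are made up for this sketch.

```python
def latent_shape(height, width, downscale=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size."""
    return (latent_channels, height // downscale, width // downscale)


def compression_ratio(height, width, image_channels=3, downscale=8, latent_channels=4):
    """How many pixel-space values correspond to one latent value."""
    c, h, w = latent_shape(height, width, downscale, latent_channels)
    return (height * width * image_channels) / (c * h * w)


print(latent_shape(512, 512))       # latent for a 512x512 RGB image
print(compression_ratio(512, 512))  # pixel values per latent value
```

Under these assumptions the diffusion process handles roughly 48 times fewer values per image than it would in pixel space.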

In practice, the VAE in SD compresses images quite aggressively, and the training dataset is filtered (indirectly, through the aesthetic score) in a way that removes images with a lot of text. Combined with the autoencoder's compression, this is a significant reason why SD struggles more with producing text than models like DALL-E.

From the above, an autoencoder is essential in SD. Generally speaking, there's no reason to modify the autoencoder unless the image distribution you're training on is dramatically different than the natural images given to SD. In this case, you'd likely need to retrain all parts of the model (or at least the unet). One example case where this might be useful is if you wanted to train an audio diffuser using the same components as SD but on "pixel" data from a spectrogram.

### Sampling Steps


The AI model starts from random noise and then iteratively denoises the image until you get your final image. This setting decides how many denoising steps it goes through. A common default is 50, which works well for most scenarios. For reference, at around 10 steps you generally have a good idea of the composition and whether you will like the image, and at around 20 steps it is very close to finished. If cfg_scale and sampler are at their default settings, the difference between 20 steps and 150 (the maximum) is often hard to tell. So if you want to generate images faster, try lowering the number of steps. Increasing the steps often (but not always) adds finer detail and fixes artifacts. Example prompt: `!dream "Your prompt here" -s 20`

### Upscaler

### Hires. fix

A convenience option to partially render your image at a lower resolution, upscale it, and then add details at a high resolution. In other words, this is equivalent to generating an image in txt2img, upscaling it via a method of your choice, and running a second pass on the now upscaled image in img2img to further refine the upscale and create the final result.

### CFG Scale

Simply put, the CFG scale (classifier-free guidance scale) or guidance scale is a parameter that controls how much the image generation process follows the text prompt. The higher the value, the more the image sticks to a given text input.
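
In the usual formulation of classifier-free guidance, the model makes one noise prediction with the prompt and one without it, and the scale controls how far the result is pushed towards the prompted prediction. The sketch below shows that combination on plain lists of numbers; the names `uncond`, `cond` and `apply_cfg` are chosen for this example and the values are made up.

```python
def apply_cfg(uncond, cond, scale):
    """Combine unconditional and prompt-conditioned predictions.

    result = uncond + scale * (cond - uncond)
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]   # prediction without the prompt (toy values)
cond = [1.0, 1.0]     # prediction with the prompt (toy values)

print(apply_cfg(uncond, cond, 1.0))   # scale 1: just the prompted prediction
print(apply_cfg(uncond, cond, 7.5))   # higher scale: difference is amplified
```

Where both predictions agree (the second value), the scale has no effect; where they differ, a higher scale exaggerates the direction suggested by the prompt.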

### Seed

The seed determines the random noise that image generation starts from, which makes it the key to creating pseudo-variations. If you reuse a prompt with the same seed (and all other settings, such as steps and CFG scale, unchanged), you will get exactly the same image.
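
The reproducibility comes from how seeded random number generators work, which can be demonstrated with Python's standard library. This is only an analogy for the starting noise, not the noise generation used by any particular Stable Diffusion implementation; `initial_noise` is a hypothetical helper.

```python
import random


def initial_noise(seed, count=4):
    """Draw `count` Gaussian noise values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


print(initial_noise(42) == initial_noise(42))  # same seed, same noise
print(initial_noise(42) == initial_noise(43))  # different seed, different noise
```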

### Sampling Method

* https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/images/sampling.jpg
* https://www.reddit.com/r/StableDiffusion/comments/wwm2at/sampler_vs_steps_comparison_low_to_mid_step_counts/

## Img2Img

### Outpainting

### Inpainting

### Variations

### img2img alternative test

https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#img2img-alternative-test

Adjust your settings for the reconstruction process:

* Use a brief description of the scene: "A smiling woman with brown hair." Describing features you want to change helps. Set this as your starting prompt, and as 'Original Input Prompt' in the script settings.
* You MUST use the Euler sampling method, as this script is built on it.
* Sampling steps: 50-60. This MUST match the decode steps value in the script, or you'll have a bad time. Use 50 for this demo.
* CFG scale: 2 or lower. For this demo, use 1.8. (Hint: you can edit ui-config.json to change "img2img/CFG Scale/step" to .1 instead of .5.)
* Denoising strength - this does matter, contrary to what the old docs said. Set it to 1.
* Width/Height - Use the width/height of the input image.
* Seed...you can ignore this. The reverse Euler is generating the noise for the image now.
* Decode cfg scale - Somewhere lower than 1 is the sweet spot. For the demo, use 1.
* Decode steps - as mentioned above, this should match your sampling steps. 50 for the demo, consider increasing to 60 for more detailed images.
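
The demo values from the list above can be collected in one place, together with a check for the constraints the list insists on. The dictionary keys and the `check_settings` helper are invented for this sketch and do not correspond to actual webui configuration names.

```python
# Demo values taken from the list above; key names are hypothetical.
demo_settings = {
    "prompt": "A smiling woman with brown hair.",
    "sampler": "Euler",
    "sampling_steps": 50,
    "cfg_scale": 1.8,
    "denoising_strength": 1.0,
    "decode_cfg_scale": 1.0,
    "decode_steps": 50,
}


def check_settings(settings):
    """Return a list of problems with respect to the rules stated above."""
    problems = []
    if settings["sampler"] != "Euler":
        problems.append("sampler must be Euler")
    if settings["sampling_steps"] != settings["decode_steps"]:
        problems.append("sampling steps must match decode steps")
    if settings["cfg_scale"] > 2:
        problems.append("CFG scale should be 2 or lower")
    return problems


print(check_settings(demo_settings))
```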

## Bonus: Txt2Img -> Img2Snd

SD models can be trained to output images of spectrograms, which can then be converted into music. Speech generated by text-to-speech can subsequently be adapted in its modulation and tonality to match the music.

![](images/spectrogram.png)
[CC BY-SA 4.0](https://en.wikipedia.org/wiki/Riffusion)

<div style="display: flex; justify-content: center; align-items: center;">
    <div style="text-align: right; width: 40%; flex-grow: 5; margin-right: 3em;">
    Riffusion - <a href="https://www.riffusion.com/">Website</a> and <a href="https://github.com/riffusion/riffusion">source code</a>
    </div>
    <video style="height: 300px; flex-grow: 5;">
        <source src="images/RoboticPoetryRebellion.mp4" type="video/mp4" />
    </video>
    <div style="height: 300px; flex-shrink: 10;"></div>
</div>



