In [None]:
from init_notebook import *
from src.clipig.app.images import LImage
from src.util.files.filestream import Filestream

In [None]:
def load_limage(filename, task: int = 0):
    with Filestream(filename) as fs:
        limage = LImage()
        limage.load_from_filestream(fs, f"task_{task:02}/limage/config.yaml")
    return limage


tiling_ec = load_limage("../src/clipig/projects/wang-ec2-grass-water.clipig.tar").tiling
#tiling_ec.attributes_map

In [None]:

def show_set(filename, tiling=tiling_ec, name="tileset"):
    if isinstance(filename, PIL.Image.Image):
        image_pil = filename
    else:
        image_pil = PIL.Image.open(filename).convert("RGB")
    image = VF.to_tensor(image_pil)
    # `tile_map` instead of `map`, to avoid shadowing the Python builtin
    tile_map = tiling.render_tile_map(
        image,
        tiling.create_map_stochastic_perlin(size=(7*4, 7*4)),
    )
    display(f"image:{name}-preview")
    display(VF.to_pil_image(make_grid([
        resize(image, 4),
        tile_map,
    ])))
    display(f"image:{name}-ec-16x16px-7x7map")
    display(image_pil)

#show_set("/home/bergi/prog/data/game-art/clipig/sdf-gen_grass-100d-cthulhu-dungeon-masked.png")

# Generating wang tile sets with CLIP

*Tile sets* in computer games are images containing a number of graphics that fit together at certain edges. 
With these one can draw nice minimal game maps and backgrounds. It would be awesome to create those via text prompts,
and I've been trying just long enough...

This has been in the making for a couple of years. I played with the OpenAI CLIP model **a lot**, and 
here's a brief summary of my findings so far:

### 1. Use *Robust CLIP*

*Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models,
Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein* ([arxiv:2402.12336](https://arxiv.org/abs/2402.12336))

To recapitulate quickly: CLIP (**C**ontrastive **L**anguage-**I**mage **P**re-training) is a framework to train an 
image encoder and a text encoder to output a similar stream of numbers (an encoding) for similar images and/or texts, and, of course, 
a dissimilar stream of numbers for dissimilar images and/or texts. The stream of numbers as such is not so important.
What matters is how close or far apart different encodings are. 
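
To make "close or far" concrete: CLIP compares encodings by cosine similarity. Here is a minimal sketch with made-up 4-dimensional vectors (real CLIP encodings have hundreds of dimensions, and the names `text_cat`, `image_cat` and `image_car` are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy encodings, NOT produced by any real model
text_cat = [0.9, 0.1, 0.0, 0.4]
image_cat = [0.8, 0.2, 0.1, 0.5]
image_car = [-0.1, 0.9, 0.7, 0.0]

sim_match = cosine_similarity(text_cat, image_cat)     # close to 1
sim_mismatch = cosine_similarity(text_cat, image_car)  # much lower
```

A matching text/image pair should score clearly higher than a mismatching one, and that difference is all the rest of this article relies on.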

A trained model can be used to search for images using a text prompt. But it also can be used to change an image (e.g. noise) 
to match a certain text prompt. It does not create images in the quality of Stable Diffusion but, in comparison, 
the process is very easy to control. One can apply a few steps of CLIP prompt-matching to the source image, then make some adjustments,
apply some transformations and apply a few more CLIP steps. All in the actual image space.
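
The "few steps, adjust, few more steps" loop can be sketched without any model at all. In the toy below, a fixed random linear projection `W` stands in for the CLIP image encoder and a random vector `target` stands in for the encoded prompt. Both are assumptions for illustration only. The real thing would use autograd and a real encoder, but the control flow is the same idea:

```python
import math
import random

rng = random.Random(0)
NUM_PIXELS, EMBED_DIM = 16, 4

# Stand-in for the image encoder: a fixed random linear projection (assumption)
W = [[rng.uniform(-1, 1) for _ in range(NUM_PIXELS)] for _ in range(EMBED_DIM)]
# Stand-in for the encoded text prompt (assumption)
target = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]

def encode(pixels):
    return [sum(w * p for w, p in zip(row, pixels)) for row in W]

def similarity(pixels):
    e = encode(pixels)
    dot = sum(a * b for a, b in zip(e, target))
    return dot / (math.hypot(*e) * math.hypot(*target))

def similarity_gradient(pixels):
    """Analytic gradient of the cosine similarity with respect to each pixel."""
    e = encode(pixels)
    norm_e, norm_t = math.hypot(*e), math.hypot(*target)
    s = sum(a * b for a, b in zip(e, target)) / (norm_e * norm_t)
    # d(similarity)/d(encoding), then the chain rule through the linear encoder
    d_e = [t / (norm_e * norm_t) - s * a / norm_e ** 2 for a, t in zip(e, target)]
    return [sum(d_e[k] * W[k][i] for k in range(EMBED_DIM)) for i in range(NUM_PIXELS)]

def clip_steps(pixels, steps, lr=0.05):
    """Gradient ascent directly on the pixel values."""
    for _ in range(steps):
        grad = similarity_gradient(pixels)
        pixels = [p + lr * g for p, g in zip(pixels, grad)]
    return pixels

pixels = [rng.random() for _ in range(NUM_PIXELS)]  # start from noise
before = similarity(pixels)
pixels = clip_steps(pixels, 20)
pixels = [min(1.0, max(0.0, p)) for p in pixels]    # manual adjustment: keep valid colors
pixels = clip_steps(pixels, 20)
after = similarity(pixels)
```

The clamping line is where any other image-space operation could go: blurring, masking, pasting a template, and so on.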

The OpenAI CLIP model was trained on 400 million image/text pairs, if I remember right, which was unheard of at that time. 
They released the model code and the weights because, I guess, they thought it could not be used for evil purposes. However,
the enormous dataset was kept closed. 

Soon after, [LAION](https://laion.ai/projects/) took on the effort of creating similar datasets of image/text pairs and released them.
The largest of them contains 5 billion image/text pairs!
The [Open CLIP](https://github.com/mlfoundations/open_clip) project reimplemented the model **and** the training code,
and researchers with access to a cluster of expensive GPUs trained new models and released new weights.

In the following, I use the Robust CLIP model [chs20/FARE4-ViT-B-32-laion2B-s34B-b79K](https://huggingface.co/chs20/FARE4-ViT-B-32-laion2B-s34B-b79K), released on Hugging Face. As the name suggests, it was initially trained on 2 billion LAION image/text pairs
and then *adversarially fine-tuned* on the ImageNet dataset.

This fine-tuning makes the model much more usable to *create* images.  
Compare these renderings, 1000 steps each, run on a simple gray head template, with prompt: "the face of h.p. lovecraft"

In [None]:
display("image:portrait-lovecraft-clip-vs-robustclip")
limage = load_limage("../src/clipig/projects/lovecraft-face-pixelart.clipig.tar", 2)
VF.to_pil_image(resize(make_grid([
    limage.layers[2].to_torch(),
    limage.layers[3].to_torch(),
]), 2))

OpenAI CLIP is on the left, Robust CLIP on the right.
Neither really looks like a photo of Lovecraft, and the usual CLIP generation issues, like duplicate mouths and such,
are visible, but Robust CLIP produces much more pronounced renderings and fewer indefinite color gradients in general. 

### 2. Know your tiling

I certainly recommend this resource **cr31.co.uk**, which is now gone but fortunately mirrored at
[boristhebrave.com](https://www.boristhebrave.com/permanent/24/06/cr31/stagecast/wang/intro.html)

There is ongoing research about optimal packings of larger tile sets. For the **edge-and-corner-tiles** shown below, I used
a variant released by user [caeles](https://opengameart.org/users/caeles) on opengameart.org.

When `T`, `TL`, `L` and so on are constants that each occupy their own bit, you can use this definition to represent the tileset:
```python
EDGE_AND_CORNER_TILESET = [
    [TL|L|BL|T|B|TR|R|BR, TL|T|TR|L|BL|B|BR, TL|T|TR|BL, TL|T|TR, TL|T|TR|BR, TL|T|TR|R|BR|BL, TL|L|BL|T|B|TR|R|BR],
    [BL|L|TL|T|TR|R|BR, BL|L|TL|T|TR|BR, TL|BL|BR, BL, TR|R|BR, TL|L|BL|TR|BR, TL|T|TR|R|BR|B|BL],
    [TL|L|BL|TR, TL|TR|BR, BL|TL|TR|BR, TL|BL|B|BR, BL|TR, TL|TR, TL|T|TR|R|BR],  
    [TL|L|BL, TR, TL|TR|R|BR, BL|L|TL|T|TR, TL, BR, BL|TR|R|BR],
    [TL|L|BL|BR, BL|B|BR, BL|BR|TR, TL|BL, 0, TR|BR, TL|BL|TR|R|BR],
    [TL|L|BL|B|BR|TR, TL|T|TR|BL|BR, TL|TR|BL, TL|BR, BL|BR, BL|B|BR|R|TR, TL|L|BL|TR|R|BR],
    [TL|L|BL|T|B|TR|R|BR, TL|L|BL|B|BR|R|TR, TL|L|BL|B|BR, BL|B|BR|TR, TL|TR|BL|B|BR, TL|T|TR|BL|B|BR, TL|BL|B|BR|R|TR], 
]
```
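
For completeness, here is one way the bit constants could be defined. The bit order (clockwise, starting at the top) and the `describe` helper are my own assumptions for this sketch; any assignment works as long as every constant gets its own bit:

```python
# One bit per edge/corner, clockwise from the top (the order is an assumption)
NAMES = ["T", "TR", "R", "BR", "B", "BL", "L", "TL"]
T, TR, R, BR, B, BL, L, TL = (1 << i for i in range(8))

def describe(tile: int) -> list:
    """Return the names of all edges/corners that are set in a tile bitmask."""
    return [name for i, name in enumerate(NAMES) if tile & (1 << i)]

left_wall = TL | L | BL           # left edge plus its two adjacent corners
has_top_edge = bool(left_wall & T)
```

With this, `describe(left_wall)` lists the set bits by name, which is handy when debugging a tileset definition by hand.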

To just run experiments without immediately falling back to using a painting application, let's have a little framework to 
render tile templates. They are useful for testing and as *suggestions* to the image generation pipeline.

Thanks to the articles by [iQ](https://iquilezles.org/articles/), every graphics programmer knows
about *Signed Distance Functions*. The implicit distance-based representation 
makes it easy to render smooth masks, boundaries or normal-maps for lighting.
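
As a tiny illustration of why the distance representation is so convenient, here is a sketch that turns a circle SDF into a soft mask. The function names, the choice of a circle and the `softness` parameter are assumptions for this example, not the framework used for the images in this article:

```python
import math

def sdf_circle(x, y, cx, cy, radius):
    """Signed distance to a circle: negative inside, zero on the edge, positive outside."""
    return math.hypot(x - cx, y - cy) - radius

def smoothstep(edge0, edge1, x):
    """Smooth 0..1 ramp between edge0 and edge1."""
    t = min(1.0, max(0.0, (x - edge0) / (edge1 - edge0)))
    return t * t * (3.0 - 2.0 * t)

def render_mask(size, radius, softness):
    """Render a size x size mask: 1 inside the circle, 0 outside, smooth in between."""
    center = size / 2
    return [
        [
            1.0 - smoothstep(-softness, softness,
                             sdf_circle(x + 0.5, y + 0.5, center, center, radius))
            for x in range(size)
        ]
        for y in range(size)
    ]

mask = render_mask(size=16, radius=5.0, softness=1.5)
```

Setting `softness` close to zero gives a hard mask, and larger values give a wider transition band around the boundary.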

In [None]:
display("image:edge-corner-demo")
VF.to_pil_image(resize(make_grid([
    VF.to_tensor(PIL.Image.open("/home/bergi/Pictures/nn/tileset-sdf-gen5.png")),
    VF.to_tensor(PIL.Image.open("/home/bergi/Pictures/nn/tileset-sdf-gen5-norm.png"))
]), 2))
#PIL.Image.open("/home/bergi/Pictures/nn/tileset-sdf-gen.png")

A random map generator can compare the edge and corner settings of adjacent tiles, and generate endless, seamless random maps:
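
A minimal sketch of such a generator, filling the map row by row and only picking tiles that match the left and top neighbors. Two assumptions: the bit order of the constants is arbitrary, and the tile list is generated from the rule "a set edge implies both of its adjacent corners", which is what the tileset above appears to follow. This is not necessarily how the map generator in the notebook works:

```python
import random

T, TR, R, BR, B, BL, L, TL = (1 << i for i in range(8))

def valid_tiles():
    """All masks in which a set edge implies both of its adjacent corners (47 tiles)."""
    edge_corners = {T: TL | TR, R: TR | BR, B: BL | BR, L: TL | BL}
    return [
        m for m in range(256)
        if all((m & c) == c for edge, c in edge_corners.items() if m & edge)
    ]

def side(tile, bits):
    """Tuple of booleans telling which of the given bits are set in the tile."""
    return tuple(bool(tile & b) for b in bits)

def fits_right(left, right):
    """The right side of `left` must match the left side of `right`."""
    return side(left, (TR, R, BR)) == side(right, (TL, L, BL))

def fits_below(top, bottom):
    """The bottom side of `top` must match the top side of `bottom`."""
    return side(top, (BL, B, BR)) == side(bottom, (TL, T, TR))

def random_map(width, height, rng):
    tiles = valid_tiles()
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            candidates = [
                t for t in tiles
                if (x == 0 or fits_right(row[x - 1], t))
                and (y == 0 or fits_below(grid[y - 1][x], t))
            ]
            row.append(rng.choice(candidates))
        grid.append(row)
    return grid

tile_map = random_map(8, 6, random.Random(1))
```

The map stores bitmasks; to draw it, each mask still has to be looked up in the tileset layout to find the position of the matching graphic.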

In [None]:
show_set(
    "/home/bergi/Pictures/nn/tileset-sdf-gen5-16.png",
    #load_limage("../src/clipig/projects/wang-ec2-masking.openclip.tar", 15).layers[0].to_pil()
    name="edge-corner",
)

(You can download the small image and use it as a tileset template. It has 7x7 tiles of 16x16 pixels each.)

Playing with the objects used on the edges and corners creates quite versatile templates:

In [None]:
show_set(
    load_limage("../src/clipig/projects/wang-ec2-grass-water.clipig.tar", 11).layers[0].to_pil(),
    name="edge-corner-round",
)

These endless random maps can then be fed into CLIP, and the gradient of the prompt target passes back to the tileset image.
The map images are slightly rotated and some noise is added to make the result look smoother.
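
To see why the gradient ends up in the tileset: every map pixel is a copy of some tileset pixel, so the gradients of all copies are simply summed at their source. With PyTorch, autograd does this by itself when the map is built by indexing into the tileset tensor. The sketch below is a hand-written stand-in for that bookkeeping. All names are hypothetical, and the rotation and noise are left out:

```python
def render_map(tileset, tile_map, tile_size):
    """Copy tiles into a map image and remember where each map pixel came from."""
    height = len(tile_map) * tile_size
    width = len(tile_map[0]) * tile_size
    image = [[0.0] * width for _ in range(height)]
    source = [[None] * width for _ in range(height)]
    for my, row in enumerate(tile_map):
        for mx, (ty, tx) in enumerate(row):
            for y in range(tile_size):
                for x in range(tile_size):
                    sy, sx = ty * tile_size + y, tx * tile_size + x
                    image[my * tile_size + y][mx * tile_size + x] = tileset[sy][sx]
                    source[my * tile_size + y][mx * tile_size + x] = (sy, sx)
    return image, source

def backpropagate(map_grad, source, tileset_shape):
    """Sum the gradient of every map pixel into the tileset pixel it was copied from."""
    rows, cols = tileset_shape
    tileset_grad = [[0.0] * cols for _ in range(rows)]
    for y, row in enumerate(source):
        for x, (sy, sx) in enumerate(row):
            tileset_grad[sy][sx] += map_grad[y][x]
    return tileset_grad

# Two 2x2 tiles side by side; the map uses tile (0, 0) twice and tile (0, 1) once
tileset = [[0.1, 0.2, 0.5, 0.6],
           [0.3, 0.4, 0.7, 0.8]]
layout = [[(0, 0), (0, 1), (0, 0)]]
image, source = render_map(tileset, layout, tile_size=2)
map_grad = [[1.0] * len(image[0]) for _ in image]  # pretend gradient from the model
tileset_grad = backpropagate(map_grad, source, (2, 4))
```

Tiles that appear more often in the map collect more gradient, so a map that uses all tiles roughly equally keeps the training balanced.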

Starting from random noise with the prompt "rpg tile map", the algorithm yielded:

In [None]:
show_set(
    load_limage("../src/clipig/projects/wang-ec2-masking.openclip.tar", 16).layers[1].to_pil(),
    name="rpg-tileset-from-noise",
)

It's nice and smooth and tileable. But it's also pretty chaotic. 
Using the spiky lighting template from above as source creates a tileset that better follows the inside/outside 
logic of the tiling:

In [None]:
show_set(
    load_limage("../src/clipig/projects/wang-ec2-masking.openclip.tar", 16).layers[4].to_pil(),
    name="rpg-tileset-from-template",
)

To gain complete control over the image generation, I use a soft or hard mask to apply different prompts to
the inside and outside of the tiling.
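
One plausible way to combine two prompts with a mask is to blend their per-pixel updates. This is only a sketch under that assumption; `blend_updates` and the toy numbers are hypothetical:

```python
def blend_updates(grad_inside, grad_outside, mask):
    """Per pixel: mask = 1 takes the 'inside' update, mask = 0 the 'outside' update."""
    return [
        [m * gi + (1.0 - m) * go for gi, go, m in zip(row_in, row_out, row_mask)]
        for row_in, row_out, row_mask in zip(grad_inside, grad_outside, mask)
    ]

def harden(mask, threshold=0.5):
    """Turn a soft mask into a hard one."""
    return [[1.0 if m >= threshold else 0.0 for m in row] for row in mask]

# Toy 1x3 image: pretend per-pixel updates from two different prompts
grad_inside = [[1.0, 1.0, 1.0]]
grad_outside = [[-1.0, -1.0, -1.0]]
soft_mask = [[1.0, 0.75, 0.0]]

soft_result = blend_updates(grad_inside, grad_outside, soft_mask)
hard_result = blend_updates(grad_inside, grad_outside, harden(soft_mask))
```

A soft mask lets both prompts influence the transition zone, while a hard mask gives a crisp boundary between the two themes.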

In this case "desert wasteland" and "top-down view of an ancient castle":

In [None]:
show_set(
    load_limage("../src/clipig/projects/wang-ec2-masking.openclip.tar", 17).layers[6].to_pil(),
    name="desert-castle",
)

In [None]:
show_set("/home/bergi/prog/data/game-art/clipig/sdf-gen_grass-100d-cthulhu-dungeon-masked.png")