# Generative AI with Python (with some Machine Learning)

## Diffusion Models

### Text-to-Image

To start with, we need to import PyTorch (the `torch` package) as well as some useful tools from the `diffusers` library.

I asked you to choose an interpreter called `generative-ai-workshop` to run the code. We are using this because the installation scripts you ran earlier installed these specific libraries into that Python environment.

In [None]:
import torch
from diffusers import (
    AutoPipelineForText2Image,
    DEISMultistepScheduler,
    StableDiffusionUpscalePipeline,
    DiffusionPipeline,
    AutoPipelineForImage2Image,
)
from PIL import Image
from diffusers.utils import export_to_video, make_image_grid
from IPython.display import Video, Audio

It's good to check if CUDA is working.

In [None]:
print(torch.cuda.is_available())

The different models will be stored in variables. This is simply for neatness.

In [None]:
TINY_SD = "segmind/tiny-sd"
GHIBLI_DIFFUSION = "nitrosocke/Ghibli-Diffusion"
LCM_DREAMSHAPER = "SimianLuo/LCM_Dreamshaper_v7"
LYKON_DREAMSHAPER = "lykon/dreamshaper-8"
KANDINSKY = "kandinsky-community/kandinsky-2-2-decoder"

We also have to tell the code whether we want to generate content with CUDA (the graphics card) or the CPU. We **always** want to be using CUDA, so let's save the string "cuda" as a variable.

In [None]:
CUDA = "cuda"
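
If the CUDA check above prints `False`, hard-coding `"cuda"` will make the later cells fail. Here is a minimal sketch of choosing the device at runtime instead. `detect_device` is a hypothetical helper (it is not part of `torch` or `diffusers`), and it assumes that falling back to `"cpu"` is acceptable when PyTorch is missing or no GPU is visible.

```python
import importlib.util


def detect_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    # hypothetical helper: guard the import so this also runs without PyTorch
    if importlib.util.find_spec("torch") is None:
        return "cpu"

    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


DEVICE = detect_device()
print("Generating on:", DEVICE)
```

Bear in mind that the `float16` weights used throughout this notebook are really meant for the GPU, so a CPU fallback would probably also need the `torch_dtype=torch.float16` argument removed.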

Now we can start creating images. We can use the `tiny-sd` model. This model is specifically designed to use very little space, relative to your typical diffusion model. To start with, we can simply ask it to draw "stonehenge" and see how it does.

In [None]:
# the autopipeline method is a one-size-fits-all function for creating images with the diffusers library
pipe = AutoPipelineForText2Image.from_pretrained(TINY_SD, torch_dtype=torch.float16)

# we are ensuring the image generation happens on our graphics card
pipe = pipe.to(CUDA)

# the prompt determines what the diffusion model will draw
prompt = "stonehenge"

# now we can create and display the image
stonehenge_take_one = pipe(prompt).images[0]
display(stonehenge_take_one)

Let's try to use a different model to draw for us. This one is called `Ghibli-Diffusion`. We can ask it to draw a castle for us.

In [None]:
pipe = AutoPipelineForText2Image.from_pretrained(
    GHIBLI_DIFFUSION, torch_dtype=torch.float16
)
pipe = pipe.to(CUDA)

prompt = "castle"
image = pipe(prompt).images[0]
display(image)

But the castle won't look Ghibli-ish unless we provide the trigger phrase. Now we can try again making sure to add "ghibli style" to our prompt.

In [None]:
prompt = "ghibli style, castle"
image = pipe(prompt).images[0]
display(image)

Certain models require trigger phrases in order for a certain style to be "activated." This is because they have been built on top of an existing model.
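
Since it is easy to forget the trigger phrase, we could let a small function add it for us. This is only a sketch: `with_trigger` is a hypothetical helper, and it assumes the phrase should simply go at the front of the prompt, separated by a comma, as we did by hand above.

```python
def with_trigger(prompt: str, trigger: str) -> str:
    """Prepend the trigger phrase unless the prompt already contains it."""
    if trigger.lower() in prompt.lower():
        return prompt
    return f"{trigger}, {prompt}"


print(with_trigger("castle", "ghibli style"))
print(with_trigger("ghibli style, castle", "ghibli style"))
```

Both calls give the same prompt, so it is safe to run the helper on a prompt that already has the phrase in it.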

Now let's try using the `dreamshaper-8` model to draw a blue alien guy.

In [None]:
pipe = AutoPipelineForText2Image.from_pretrained(
    LYKON_DREAMSHAPER, torch_dtype=torch.float16, variant="fp16"
)
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to(CUDA)

prompt = "portrait photo of blue alien man, light bokeh, intricate, elegant, sharp focus, soft lighting, vibrant colors"

image = pipe(prompt).images[0]
display(image)

Now let's take a look at the `LCM_Dreamshaper_v7` model. This one is the most intensive of the bunch.

In [None]:
pipe = AutoPipelineForText2Image.from_pretrained(LCM_DREAMSHAPER)
pipe.to(torch_device=CUDA, torch_dtype=torch.float16)

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"

# Can be set to 1~50 steps. LCM supports fast inference even with <= 4 steps. Recommended: 1~8 steps.
num_inference_steps = 3
image = pipe(
    prompt=prompt, num_inference_steps=num_inference_steps, guidance_scale=8.0
).images[0]
display(image)

`kandinsky-2-2-decoder` is another popular model.

In [None]:
pipe = AutoPipelineForText2Image.from_pretrained(KANDINSKY, torch_dtype=torch.float16)
pipe = pipe.to(CUDA)

prompt = "portrait of a young woman, blue eyes, cinematic"

image = pipe(prompt=prompt, prior_guidance_scale=1.0, height=500, width=500).images[0]
display(image)

#### Prompt Enhancement

There are certain key phrases that can cause text-to-image models to generate nicer images. Here are some examples of free tools that can help us find these phrases:

https://www.neuralframes.com/tools/stable-diffusion-prompt-generator  
https://www.feedough.com/stable-diffusion-prompt-generator/

Now let's try drawing Stonehenge again, but this time with the other terms suggested by the prompt enhancers.

In [None]:
pipe = AutoPipelineForText2Image.from_pretrained(TINY_SD, torch_dtype=torch.float16)
pipe = pipe.to(CUDA)

enhanced_prompt = input("Paste your enhanced prompt: ")

# generate an image with the enhanced prompt
stonehenge_take_two = pipe(enhanced_prompt).images[0]

# display both of our stonehenge images
make_image_grid([stonehenge_take_one, stonehenge_take_two], rows=1, cols=2)

These pictures were both drawn by the _same_ model. But these models may need a "nudge" to generate better results.
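
An enhanced prompt is mostly the original subject followed by a list of style phrases. As a rough sketch, we could build one ourselves rather than pasting it in. `enhance_prompt` is a hypothetical helper, and the modifiers below are just examples borrowed from other prompts in this notebook, not a recommended set.

```python
def enhance_prompt(subject: str, modifiers: list[str]) -> str:
    """Join a subject and style modifiers into one comma-separated prompt."""
    seen = set()
    parts = []
    for part in [subject, *modifiers]:
        cleaned = part.strip()
        key = cleaned.lower()
        # skip empty entries and phrases we have already added
        if cleaned and key not in seen:
            seen.add(key)
            parts.append(cleaned)
    return ", ".join(parts)


# example modifiers only; the last one is a deliberate duplicate
MODIFIERS = ["photorealistic", "volumetric lighting", "sharp focus", "Sharp Focus"]

enhanced = enhance_prompt("stonehenge", MODIFIERS)
print(enhanced)
```

The resulting string could be passed to `pipe(...)` in place of the pasted prompt.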

#### Negative Prompts

We can provide **negative prompts** when generating an image. This is a way of telling our model what _not_ to do. For example, let's try using a model to generate an image of a nature-y scene.

In [None]:
pipe = AutoPipelineForText2Image.from_pretrained(
    LYKON_DREAMSHAPER, torch_dtype=torch.float16
)
prompt = "nature scene, wilderness, hyperrealistic, octane render, 8k, photorealistic, volumetric lighting, epic, dramatic, dark fantasy, National Geographic photo, breathtaking, intricate details, highly detailed, sharp focus, masterpiece."
pipe = pipe.to(CUDA)

image = pipe(prompt).images[0]
display(image)

We can use a negative prompt to make trees and the colour green _less_ prominent in our image.

In [None]:
negative_prompt = "green:0.5, grass, trees, scary, animal, beast"
image = pipe(prompt, negative_prompt=negative_prompt).images[0]
display(image)
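
Roughly speaking, Stable Diffusion style pipelines make two noise predictions at every step, one for the prompt and one for the negative prompt, and then push the result away from the negative one. The sketch below shows only that arithmetic with made-up numbers. `guided_prediction` is a hypothetical function, not a `diffusers` API, and real predictions are large tensors rather than two-item lists.

```python
def guided_prediction(
    negative: list[float], positive: list[float], scale: float
) -> list[float]:
    """Start at the negative prediction and move `scale` times
    the distance towards the positive one."""
    return [n + scale * (p - n) for n, p in zip(negative, positive)]


# made-up values: the two predictions only disagree on the first number
noise_for_negative = [0.0, 1.0]
noise_for_prompt = [1.0, 1.0]

print(guided_prediction(noise_for_negative, noise_for_prompt, scale=7.5))
```

Where the two predictions agree, nothing changes. Where they differ, the difference is exaggerated in favour of the prompt, which is why the things named in the negative prompt become less likely to show up.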

### Upscaling

It is also possible to use models to artificially "upscale" an image.

Let's load a low-resolution image of a cat.

In [None]:
low_res_img = Image.open("../pictures/low-res-cat.jpg")
display(low_res_img)

We can use an upscaler to attempt to make the image larger. To do this, we can use the `stable-diffusion-x4-upscaler` model. This can be "assisted" by providing a prompt.

In [None]:
model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
)
pipeline = pipeline.to(CUDA)

prompt = "A white kitten wearing a jingling bell collar, perched on a plush sapphire sofa, cozy indoor lighting, shallow depth of field, sharp focus, volumetric light, photorealistic, 8k, hyperdetailed, cinematic, editorial photography."
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
display(upscaled_image)

Looks a bit weird IMO. But I want to be comprehensive in this workshop...

### Text-to-Video

There are also models that can generate videos from text, although the models that can run on your own hardware are going to be a bit more limited. We can ask this model to give us a video of Spiderman skateboarding.

In [None]:
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", variant="fp16"
)
pipe = pipe.to(CUDA)

prompt = "spiderman is skateboarding"
video_frames = pipe(prompt).frames[0]
video_path = export_to_video(
    video_frames,
    output_video_path="../output-spiderman.mp4",
)
Video(video_path)

### Image to Image

The `kandinsky-2-2-decoder` model can also be used for `Image2Image` operations. To start with, we can take this picture of a clown.

In [None]:
init_image = Image.open("../pictures/clown.jpg")
display(init_image)

This clown picture can then be handed to the `kandinsky-2-2-decoder` with a prompt that allows us to place the clown in a different environment.

In [None]:
pipeline = AutoPipelineForImage2Image.from_pretrained(
    KANDINSKY, torch_dtype=torch.float16, use_safetensors=True
)
pipeline.enable_model_cpu_offload()

prompt = "A whimsical clown floats in the inky blackness of space, amidst swirling nebulae and distant stars, volumetric lighting, ethereal, hyperdetailed, vibrant colors, galactic masterpiece by Van Gogh and H.R. Giger, trending on ArtStation, breathtaking."
image = pipeline(prompt, image=init_image, strength=0.4).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

### Image Interpolation

The Kandinsky model also allows us to combine two images through a process called interpolation. First, we need to do a bit of memory clearing for this.

In [None]:
import gc

gc.collect()
torch.cuda.empty_cache()
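
One thing to keep in mind: `gc.collect()` can only reclaim objects that nothing refers to any more, so a pipeline that is still stored in a variable stays in memory. The sketch below shows this with a stand-in object. `FakePipeline` is hypothetical and holds no real model; a weak reference lets us watch the object without keeping it alive.

```python
import gc
import weakref


class FakePipeline:
    """Hypothetical stand-in for a large diffusers pipeline."""


pipe = FakePipeline()
watcher = weakref.ref(pipe)  # observes the object without keeping it alive

print("alive before:", watcher() is not None)

del pipe      # drop the last reference to the pipeline
gc.collect()  # then ask Python to reclaim anything left over

freed = watcher() is None
print("freed after:", freed)
```

With a real pipeline, the same `del` would come first, followed by `gc.collect()` and `torch.cuda.empty_cache()` as in the cell above.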

Now we can load the two images we would like to combine.

In [None]:
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to(CUDA)

img_1 = Image.open("../pictures/ginger-cat.webp")
img_2 = Image.open("../pictures/space.jpg")

make_image_grid([img_1.resize((500, 500)), img_2.resize((500, 500))], rows=1, cols=2)

We assign weights for the images and a prompt, to determine how much influence they have in our "combined" image.

In [None]:
images_texts = ["a spacey cat", img_1, img_2]
weights = [0.2, 0.4, 0.4]
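
The weights above add up to 1, with one weight per item in `images_texts`. If it is easier to think in proportions ("the images matter twice as much as the text"), a small helper can do the scaling. This is a sketch: `normalise_weights` is hypothetical, the strings stand in for the real images, and whether the pipeline strictly requires weights that sum to 1 is an assumption here.

```python
import math


def normalise_weights(weights: list[float]) -> list[float]:
    """Scale weights so they keep their proportions but sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]


# placeholders for the prompt and the two images
items = ["a spacey cat", "<img_1>", "<img_2>"]
raw_weights = [1, 2, 2]  # each image counts twice as much as the text

weights = normalise_weights(raw_weights)
assert len(weights) == len(items), "we need exactly one weight per item"
print(weights, "sum:", math.fsum(weights))
```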

Now we ask the model to combine these two images, with the prompt in mind.

In [None]:
prior_out = prior_pipeline.interpolate(images_texts, weights)

pipeline = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to(CUDA)

image = pipeline(
    image_embeds=prior_out.image_embeds,
    negative_image_embeds=prior_out.negative_image_embeds,
    height=300,
    width=300,
    num_inference_steps=50,
).images[0]
display(image)

### AI Weirdness

We could ask these models to draw _nothing_ and see what happens. They will then start from random noise and..._do whatever._

In [None]:
# put all of our text-to-image models in a list
DIFFUSION_MODELS = [
    TINY_SD,
    GHIBLI_DIFFUSION,
    LCM_DREAMSHAPER,
    LYKON_DREAMSHAPER,
]

# prepare a list for storing our images
images = []

# loop through the different text-to-image models
for model in DIFFUSION_MODELS:
    pipe = AutoPipelineForText2Image.from_pretrained(model, torch_dtype=torch.float16)
    pipe = pipe.to(CUDA)

    # add a picture to our list - use an empty string as a prompt
    images.append(pipe("").images[0])

# display all of the generated pictures
make_image_grid(images, 2, 2)

### Other Options

- ComfyUI
- Automatic1111
- Fooocus
- InvokeAI
- SD.Next
- VoltaML

### Text-to-Speech

We can also use AI tools to convert text into speech. To test this out, let's get a random Dad Joke from https://icanhazdadjoke.com/.

In [None]:
import requests

dad_joke_url = "https://icanhazdadjoke.com/"
headers = {"Accept": "text/plain", "User-Agent": "UAL-Generative-AI-Workshop"}

response = requests.get(dad_joke_url, headers=headers)

random_dad_joke = response.text
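
The request above assumes the website is reachable. In a workshop with patchy Wi-Fi it can be handy to have a fallback, so here is a minimal sketch. It swaps `requests` for the standard library's `urllib` so that it has no extra dependencies; `fetch_plain_text` and `FALLBACK_JOKE` are hypothetical names, and the fallback joke is made up.

```python
import urllib.request

# made-up joke to use when the website cannot be reached
FALLBACK_JOKE = "I would tell you a UDP joke, but you might not get it."


def fetch_plain_text(url: str, fallback: str, timeout: float = 5.0) -> str:
    """Return the response body as text, or `fallback` if the request fails."""
    request = urllib.request.Request(
        url,
        headers={"Accept": "text/plain", "User-Agent": "UAL-Generative-AI-Workshop"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except OSError:
        # URLError, HTTPError and timeouts are all subclasses of OSError
        return fallback
```

Calling `fetch_plain_text("https://icanhazdadjoke.com/", FALLBACK_JOKE)` would then give us either a fresh joke or the fallback, and never an error page read aloud.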

Now we can have our AI model speak the Dad Joke for us.

In [None]:
from kokoro import KPipeline

pipeline = KPipeline(lang_code="b")

generator = pipeline(random_dad_joke, voice="af_heart")
for i, (gs, ps, audio) in enumerate(generator):
    display(Audio(data=audio, rate=24000))

### Text-to-Audio

There are some text-to-audio tools out there too. We can try to have one generate some "music" based on a prompt we give it. Run the code below to generate a random genre name. Some of the resulting genres really exist, but others won't.

In [None]:
import random

prefix = [
    "Post-",
    "Progressive ",
    "Proto ",
    "Experimental ",
    "Deconstructed ",
    "Doom ",
    "Blackened ",
    "Digital ",
    "Garage ",
    "Noise ",
]
genre = [
    "Ambient",
    "Club",
    "Hip Hop",
    "Metal",
    "Hardcore",
    "Noise",
    "Drone",
    "Techno",
    "Opera",
    "Gabber",
    "Gregorian Chant",
    "Dubstep",
    "Industrial",
]

genre_name = random.choice(prefix) + random.choice(genre)

In [None]:
from transformers import pipeline
import scipy

synthesiser = pipeline("text-to-audio", "facebook/musicgen-small")

music = synthesiser(genre_name, forward_params={"do_sample": True})

scipy.io.wavfile.write(
    "musicgen_out.wav", rate=music["sampling_rate"], data=music["audio"]
)
print("Let's listen to some AI-generated", genre_name)
Audio("musicgen_out.wav")

## Large Language Models (LLMs)

### What are Large Language Models?

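Large Language Models (LLMs) are trained on huge amounts of text to predict the next word (or _token_) in a sequence. You will sometimes hear them described as "auto-correct on steroids."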

[A short introduction to LLMs](https://www.youtube.com/watch?v=LPZh9BOjkQs)

### Ollama & Generating Text

**Ollama** is a tool that allows us to run LLMs locally. It can be downloaded and used entirely for _free_.

But what does it mean to run something _locally_? That means you're running it _solely_ on your own machine, rather than sending information back and forth with an online service.

This has some key advantages:
- cost
- privacy
- doesn't depend on stable/fast internet access
- performance isn't affected by how many other people are using the same online services at a given time

To test our Ollama installation, we can see the output from inputting `ollama` in the command line.

It's also possible to do this within Python by using the `subprocess` library. So that's one option...

In [None]:
import subprocess

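# Run the `ollama` command and capture its output
result = subprocess.run(["ollama"], capture_output=True, text=True)

print("Output from command line:")
# the usage text may be written to stdout or stderr, so fall back to stderr
print(result.stdout or result.stderr)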
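
`subprocess.run` only fills in `result.stdout` when we ask it to capture the output; otherwise the command prints straight to the terminal and `result.stdout` is `None`. Here is a small sketch of the capturing form. It runs the current Python interpreter as a stand-in command, so that it works even on a machine without Ollama.

```python
import subprocess
import sys

# stand-in command: run the current Python interpreter instead of `ollama`
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,  # decode the captured bytes into str
)

print("exit code:", result.returncode)
print("stdout:", result.stdout.strip())
```

An exit code of 0 conventionally means the command succeeded.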

This gives us a list of commands that we can use with Ollama. For our purposes, we're mainly concerned with being able to pull models, list what models are on our system, and remove the ones we no longer want to use. In a fresh installation, Ollama comes with zero models, but the installation script has already _pulled_ a few to make things easier. We can use the `ollama` Python library to list these models. Of course, the command line works too.

In [None]:
import ollama

for model in ollama.list().models:
    print(model.model)

Now we have a plain list of the models on the system - this shows us what was downloaded (or _pulled_) by running the installation script.

To start with, I'm going to create a _variable_ for storing the name of the model I wish to use. This is going to be a _parameter_ that we give repeatedly to the ollama python library, so it makes sense to write it down once and avoid repeating ourselves.

In [None]:
# dolphin-phi is 2.7b
DOLPHIN_PHI = "dolphin-phi"
# this particular deepseek model is 7b
DEEPSEEK = "deepseek-r1:7b"
# glm4 9b version
GLM4 = "glm4:latest"
# moondream
MOONDREAM = "moondream"

A convention when programming in Python is to write constants -- variables that are set once and never change -- in all-caps. This doesn't affect how your code runs, but it can be nice for making things more ordered. I feel it tells me this bit of information is "important" in some way, while using less mental effort.

In [None]:
from ollama import chat

response = chat(
    model=DOLPHIN_PHI,
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        },
    ],
)
print(response["message"]["content"])
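
Notice that `messages` is a _list_. The model does not remember earlier calls, so for a follow-up question we send the whole conversation so far: our messages with the role `"user"` and the model's replies with the role `"assistant"`. The sketch below only builds such a list and does not call Ollama. `add_turn` is a hypothetical helper, and the assistant's reply is typed in by hand.

```python
def add_turn(history: list[dict], role: str, content: str) -> list[dict]:
    """Return a new history with one more message appended."""
    # hypothetical sanity check on the roles used in this notebook
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unexpected role: {role}")
    return history + [{"role": role, "content": content}]


history: list[dict] = []
history = add_turn(history, "user", "What is the capital of France?")
# in the notebook, this reply would come from response["message"]["content"]
history = add_turn(history, "assistant", "The capital of France is Paris.")
history = add_turn(history, "user", "And roughly how many people live there?")

for message in history:
    print(f'{message["role"]:>9}: {message["content"]}')
```

Passing `history` as the `messages` argument would let the model see that "there" refers to Paris.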

### Streaming

With `stream=True`, the reply arrives in small chunks while it is being generated, so we can print the text as it appears instead of waiting for the whole response.

In [None]:
stream = chat(
    model=DOLPHIN_PHI,
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
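
When streaming, each chunk only holds a small piece of the reply, so if we want the full text afterwards we have to collect the pieces ourselves. A minimal sketch, using a hypothetical `fake_stream` generator in place of a real model so that it runs without Ollama:

```python
def fake_stream():
    """Hypothetical stand-in that yields chunks shaped like the ones above."""
    for piece in ["The sky ", "is ", "blue."]:
        yield {"message": {"content": piece}}


pieces = []
for chunk in fake_stream():
    text = chunk["message"]["content"]
    print(text, end="", flush=True)  # show each piece as it arrives
    pieces.append(text)  # and remember it for later

full_reply = "".join(pieces)
print()  # finish the line once the stream ends
```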

### Vision Language Models (VLMs)

Vision Language Models can be used to describe images. Let's try this out with this clown image.

![](../pictures/clown.jpg)

First, we need to load the image. To do this, we need to make use of the `base64` library as it allows us to convert the image into a format that a VLM can understand.

In [None]:
import base64

# load an image as base64
with open("../pictures/clown.jpg", "rb") as image_file:
    data = base64.b64encode(image_file.read()).decode("utf-8")
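
To see what `base64` actually does, we can try it on a few bytes instead of a whole picture. It rewrites arbitrary bytes using only plain text characters, which makes them safe to place inside a message. The bytes below are arbitrary stand-ins for image data.

```python
import base64

raw = b"abc"  # three arbitrary bytes standing in for image data
encoded = base64.b64encode(raw).decode("utf-8")
print(encoded)

# decoding gives back exactly the bytes we started with
decoded = base64.b64decode(encoded)
print(decoded == raw)

# every 3 bytes become 4 characters, so the text is about a third larger
print(len(raw), "bytes ->", len(encoded), "characters")
```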

Now that the image has been loaded, we can send it to the VLM `moondream`, and ask it to tell us what the image contains.

In [None]:
response = ollama.chat(
    model=MOONDREAM,
    messages=[
        {
            "role": "user",
            "content": "What's in this image?",
            "images": [data],  # pass the image in the images field
        },
    ],
)
print(response["message"]["content"])

We can also ask moondream to explain certain details in the image to us.

In [None]:
response = ollama.chat(
    model=MOONDREAM,
    messages=[
        {
            "role": "user",
            "content": "What colour is his hair?",
            "images": [data],  # pass the image in the images field
        },
    ],
)
print(response["message"]["content"])

We can see a list of vision models that work with Ollama here: https://ollama.com/search?c=vision

### Small Language Models

Language Models come in very small sizes too. Some examples include `smollm` and `tinyllama`. While these models are more prone to hallucination, and have more limited "intelligence," they can run quite fast even on less powerful hardware such as Raspberry Pis and computers with older GPUs.

### Hallucination

![](../pictures/how-to-cook-your-dragon.webp)

In [None]:
response = chat(
    model=DOLPHIN_PHI,
    messages=[
        {
            "role": "user",
            "content": "What are some good cookbooks on how to use dragon meat?",
        },
    ],
)
print(response["message"]["content"])

### Thinking Models

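Thinking (or reasoning) models, such as the DeepSeek-R1 model we stored in `DEEPSEEK`, are trained to write out intermediate reasoning steps before committing to a final answer. This tends to help with puzzles and multi-step problems, at the cost of longer and slower responses.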

#### The Strawberry Test

In [None]:
stream = chat(
    model=DEEPSEEK,
    messages=[
        {
            "role": "user",
            "content": "How many times does the letter R appear in the word strawberry?",
        }
    ],
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
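
This question trips models up because they read text as tokens (chunks of several characters) rather than as individual letters. For us, getting a reference answer to compare against takes one line of ordinary Python:

```python
word = "strawberry"
letter = "r"

# count the occurrences and note where they are (counting from 1)
count = word.lower().count(letter)
positions = [i for i, ch in enumerate(word.lower(), start=1) if ch == letter]

print(f"'{letter}' appears {count} times, at positions {positions}")
```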

### Finding the "Best" Model

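Choosing a model is a trade-off between how sensible its output is and how large and slow it is to run. In practice, finding the right one comes down to trial-and-error experimentation.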

We can create a quick comparison test by asking various models to generate text based on the same prompt, and see which output we like the most.

Firstly, we can take all the models that are on the system right now, and place them in a Python list. This will make things easier in a moment.

In [None]:
models = [DOLPHIN_PHI, DEEPSEEK, GLM4]

Now, we can create a _function_ for sending the same prompt to different models.

In [None]:
def limerick_creator(model: str):
    response = chat(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Write a limerick about the nature of time.",
            },
        ],
    )

    print("Model:", model)
    print(response["message"]["content"])
    print("\n")

Now we can _call_ this function with our different models, and see how the output varies.

In [None]:
for model in models:
    limerick_creator(model)
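
Since speed is half of the trade-off, it can help to time each model as well as read its output. Here is a sketch of a small timing helper. `timed` and `fake_model` are hypothetical names; `fake_model` stands in for a real call to `chat` so that the example runs without Ollama.

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a call such as chat(model=..., messages=...)."""
    time.sleep(0.01)  # pretend to think for a moment
    return prompt.upper()


reply, seconds = timed(fake_model, "time flies")
print(f"{seconds:.3f}s -> {reply}")
```

The same helper could wrap a function that calls each model in `models`, giving us a duration to set beside each poem.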

## Other ML Tools

### Whisper

The `whisper` library can transcribe speech from audio and detect which language is being spoken.

In [None]:
import whisper
import IPython

model = whisper.load_model("turbo")

This function will tell us the detected language of some audio.

In [None]:
def prepare_audio_and_detect_language(audio):

    # prepare the audio for analysis
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio, n_mels=128).to(model.device)

    # detect the spoken language
    probs = model.detect_language(mel)[1]
    detected_language = max(probs, key=probs.get)

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)

    return detected_language, result.text
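
The line `max(probs, key=probs.get)` is worth a closer look. `probs` is a dictionary that maps language codes to probabilities, and `max` with `key=probs.get` returns the code with the highest probability. The sketch below uses made-up numbers rather than real Whisper output.

```python
# made-up probabilities, shaped like the dictionary used in the function above
probs = {"en": 0.03, "de": 0.91, "nl": 0.04, "da": 0.02}

# max() compares the keys by their probability, so we get the likeliest language
detected_language = max(probs, key=probs.get)

# the three likeliest languages, most probable first
top_three = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:3]

print(detected_language)
print(top_three)
```

Looking at the runners-up can be useful when the audio is short or noisy and the top guess is not very confident.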

Now let's have a look at a file with some introductory German speech.

In [None]:
GERMAN_AUDIO_PATH = "../audio-files/german.wav"
Audio(GERMAN_AUDIO_PATH)

Now we can use whisper to determine the language and tell us what's being said.

In [None]:
audio = whisper.load_audio(GERMAN_AUDIO_PATH)
lang, content = prepare_audio_and_detect_language(audio)

print(lang)
print(content)

### OpenCV

In [None]:
import cv2
import numpy as np


with open("../object-detection/yolov3.txt", "r") as f:
    CLASSES = [line.strip() for line in f.readlines()]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))


def get_output_layers(net):

    layer_names = net.getLayerNames()
    try:
        output_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers()]
    except:
        output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]

    return output_layers


def draw_prediction(img, class_id, x, y, x_plus_w, y_plus_h):

    label = str(CLASSES[class_id])

    color = COLORS[class_id]

    cv2.rectangle(img, (x, y), (x_plus_w, y_plus_h), color, 2)

    cv2.putText(img, label, (x - 10, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

In [None]:
image = cv2.imread("../object-detection/person.jpg")
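# OpenCV loads images as BGR, so convert to RGB before handing them to PIL
display(Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB)))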

In [None]:
Width = image.shape[1]
Height = image.shape[0]
scale = 0.00392

net = cv2.dnn.readNet(
    "../object-detection/yolov3.weights", "../object-detection/yolov3.cfg"
)

blob = cv2.dnn.blobFromImage(image, scale, (416, 416), (0, 0, 0), True, crop=False)

net.setInput(blob)

outs = net.forward(get_output_layers(net))

class_ids = []
confidences = []
boxes = []
conf_threshold = 0.5
nms_threshold = 0.4


for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > conf_threshold:
            center_x = int(detection[0] * Width)
            center_y = int(detection[1] * Height)
            w = int(detection[2] * Width)
            h = int(detection[3] * Height)
            x = center_x - w / 2
            y = center_y - h / 2
            class_ids.append(class_id)
            confidences.append(float(confidence))
            boxes.append([x, y, w, h])


indices = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)

for i in indices:
    try:
        box = boxes[i]
    except:
        i = i[0]
        box = boxes[i]

    x = box[0]
    y = box[1]
    w = box[2]
    h = box[3]
    draw_prediction(image, class_ids[i], round(x), round(y), round(x + w), round(y + h))

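# convert from OpenCV's BGR to RGB so the colours display correctly
display(Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB)))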
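
Two steps in the cell above deserve a word. The model reports each box by its centre, so we convert it to a top-left corner plus width and height. Then `cv2.dnn.NMSBoxes` removes near-duplicate boxes: it keeps the most confident box and drops others that overlap it by more than `nms_threshold`. Overlap is measured as "intersection over union" (IoU). The sketch below shows both ideas in plain Python with made-up boxes; `to_corner_box` and `iou` are hypothetical helpers, not OpenCV functions.

```python
def to_corner_box(center_x, center_y, w, h):
    """Convert a centre-based box into [x, y, w, h], with (x, y) the top-left corner."""
    return [center_x - w / 2, center_y - h / 2, w, h]


def iou(box_a, box_b):
    """Intersection over union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # width and height of the overlapping region (zero if the boxes do not touch)
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


first = to_corner_box(5, 5, 10, 10)  # top-left corner at (0, 0)
second = to_corner_box(10, 5, 10, 10)  # same size, shifted 5 to the right

print(iou(first, second))
```

These two boxes share half of their area, which works out to an IoU of one third. That is below the 0.4 used for `nms_threshold` above, so both would be kept.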

## Further Reading

### Stable Diffusion

Negative Prompts: https://blog.segmind.com/beginners-guide-to-understanding-negative-prompts-in-stable-diffusion/