
Feat/inversion pp #104

Merged: 9 commits merged into develop from feat/inversion_pp on Jan 5, 2023

Conversation

@cloneofsimo (Owner) commented Dec 30, 2022

A few utilities, including CLIP evaluation, CLIP evaluation preparation, and random initialization with sigma.

I've implemented CLIP text alignment and image alignment in this PR, as discussed in #67 (comment).

[figure: CLIP image-alignment vs. text-alignment results from Custom Diffusion]

Expect to see results like the figure above, from Custom Diffusion (https://arxiv.org/abs/2212.04488).

@cloneofsimo (Owner, Author)

A few more commits coming...

@brian6091 (Collaborator)

Will start to look at this today. Just to clarify, the target_images input:

def evaluate_pipe(
    pipe,
    target_images: List[Image.Image],
    ...
):

should be a list of images matching the example prompts. I'm guessing these could be generated by the same pipe if necessary?

@cloneofsimo (Owner, Author)

> Just to clarify, the target_images input ... should be a list of images matching the example prompts. I'm guessing these could be generated by the same pipe if necessary?

Target images are the reference images, so they are like ground-truth images.

@brian6091 (Collaborator)

> Target images are the reference images, so they are like ground-truth images.

Right, got it, but I imagine when we actually use this, we won't have ground-truth images for all the prompts, so we can generate "ground truth" using the pipe (probably better to use the original model, rather than the trained one).

@cloneofsimo (Owner, Author)

These seem to work very well; I'll add them along with an updated example runfile and example dataset.

@cloneofsimo (Owner, Author) commented Jan 5, 2023

> Right, got it, but I imagine when we actually use this, we won't have ground-truth images for all the prompts, so we can generate "ground truth" using the pipe (probably better to use the original model, rather than the trained one).

Oh, so I was thinking: we have a target subject X to train on, and we test on prompt Y to see how well the model creates an image. The generated image Z should be faithful both to prompt Y and to subject X. Those measures, sim(Z, Y) and sim(Z, X), are what we are trying to get here.

X : <Custom Yellow Clock>
Y : "photo of clock on the tree"
Z : (Photo of custom yellow clock on tree)

So our only source (ground-truth) images are X, since Y is text and Z is generated with SD. At least that's what I've understood from the textual inversion and CLIP score papers... It's not explained clearly in the papers either, so please correct me if I'm wrong!
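
For concreteness, the two similarities above can be computed roughly as follows. This is a minimal sketch using the Hugging Face transformers CLIP API; the function name and model checkpoint are illustrative, not the PR's actual evaluate_pipe implementation:

import torch
from typing import List
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(generated: Image.Image, prompt: str, references: List[Image.Image]):
    # Embed the generated image Z and L2-normalize.
    z = model.get_image_features(**processor(images=generated, return_tensors="pt"))
    z = z / z.norm(dim=-1, keepdim=True)

    # sim(Z, Y): cosine similarity between Z and the prompt Y.
    y = model.get_text_features(**processor(text=prompt, return_tensors="pt", padding=True))
    y = y / y.norm(dim=-1, keepdim=True)
    text_alignment = (z @ y.T).item()

    # sim(Z, X): mean cosine similarity between Z and the reference images X.
    x = model.get_image_features(**processor(images=references, return_tensors="pt"))
    x = x / x.norm(dim=-1, keepdim=True)
    image_alignment = (z @ x.T).mean().item()

    return text_alignment, image_alignment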

@cloneofsimo (Owner, Author)

I am seeing incredible quality improvements from combining all the latest fancy tricks. In particular, giving a high norm prior added much more editability than before.

I'm so proud of this 🤣

[image: generated output samples]

@cloneofsimo (Owner, Author)

I'll merge this, I guess.

@cloneofsimo merged commit b379afd into develop on Jan 5, 2023
@brian6091 (Collaborator)

> So our only source (ground-truth) images are X, since Y is text and Z is generated with SD.

Ok this makes sense now. Thanks!

@brian6091 (Collaborator)

> I am seeing incredible quality improvements from combining all the latest fancy tricks. In particular, giving a high norm prior added much more editability than before.

Amazing! But you can't just drop the image without telling us what tricks you used! And what is a high norm prior???

@cloneofsimo deleted the feat/inversion_pp branch on January 5, 2023 at 20:20
@cloneofsimo (Owner, Author) commented Jan 5, 2023

In this PR I made 5 changes to get it to work:

  1. Multivector initialization, so it's an extended latent. Quite surprisingly, this isn't yet implemented in HF textual inversion.
  2. Gradient accumulation.
  3. Face-conditioned loss; but for textual inversion, we set a high blur amount so that other features are recognized as well.
  4. Norm prior: we put a Gaussian prior on the norm being 0.4, so if the norm is too large, we project it closer to 0.4 (see the sketch after this list).
  5. Full precision on textual inversion: high precision is needed during inversion. I don't know why, but that seems to be the case.
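
As a rough illustration of point 4, the projection could look like the sketch below. This is one reading of the description above, with an assumed interpolation strength lam; it is not a copy of the PR's code:

import torch

def enforce_norm_prior(embeddings: torch.Tensor, target_norm: float = 0.4, lam: float = 0.5):
    # Soft-project each token embedding's norm toward target_norm.
    # A Gaussian prior on the norm acts like an L2 pull toward target_norm,
    # applied here only when the norm overshoots.
    with torch.no_grad():
        norms = embeddings.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        pulled = (1 - lam) * norms + lam * target_norm
        new_norms = torch.where(norms > target_norm, pulled, norms)
        embeddings.mul_(new_norms / norms)
    return embeddings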

@brian6091 (Collaborator)

Thanks for the secret sauce. The multivector initialization is very clever. Does that mean your prompts include all the tokens together?

@hafriedlander (Collaborator)

This has gotten bigger since I last looked :). I haven't had time to understand all the changes, but the results speak for themselves. Great work!

@cloneofsimo (Owner, Author)

The prompts contain <krk>, which is substituted by the extended tokens (<s1><s2><s3> in my case).
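
For readers following along, here is a sketch of how such multivector placeholders can be wired up with transformers, including the random initialization with sigma mentioned at the top of the PR. The token names, sigma value, and checkpoint are illustrative; this is not necessarily how the PR implements it:

import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the extended placeholder tokens and grow the embedding table.
placeholders = ["<s1>", "<s2>", "<s3>"]
tokenizer.add_tokens(placeholders)
text_encoder.resize_token_embeddings(len(tokenizer))

# Random initialization with sigma: draw each new embedding from N(0, sigma^2).
sigma = 0.01  # illustrative value
embeds = text_encoder.get_input_embeddings().weight
with torch.no_grad():
    for tok in placeholders:
        embeds[tokenizer.convert_tokens_to_ids(tok)] = sigma * torch.randn(embeds.shape[-1])

# At prompt time, <krk> is replaced by the concatenated extended tokens.
prompt = "photo of <krk> on the tree".replace("<krk>", "".join(placeholders))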
