# In this notebook we will run some experiments to figure out what the benchmark design should be.

## Experiment 1: what context to include
We have filtered the dataset down to documents that work with wgpu (or at least with naga) and are permissively licensed, and extracted functions that have both a comment directly before them and a docstring directly at the top of the function body.
We compare four scenarios:
- comment + header
- comment + header + docstring (both)
- header + docstring
- header (none)
That is 4x150 generations. We will use deepseek-coder (1.3b or 6.7b), as it showed promising signs before.
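
The four scenarios map onto the `input_*` columns used in the generation loop further down. The following is a minimal sketch of how such prompts could be assembled, assuming they are plain concatenations of the `comment`, `header`, and `docstring` fields. The helper `build_inputs` and the example row are hypothetical and only for illustration; the actual dataset columns may have been built differently.

```python
def build_inputs(row: dict) -> dict:
    """Assemble the four prompt variants for one function (assumed layout)."""
    comment, header, docstring = row["comment"], row["header"], row["docstring"]
    return {
        "input_both": comment + header + docstring,  # comment + header + docstring
        "input_comment": comment + header,           # comment + header
        "input_docstring": header + docstring,       # header + docstring
        "input_none": header,                        # header only
    }


# Made-up row, mirroring the field layout of the dataset rows.
example_row = {
    "comment": "// signed distance to a sphere\n",
    "header": "float sdSphere(vec3 p, float r)\n{",
    "docstring": "\n    // p: sample position, r: radius",
}
inputs = build_inputs(example_row)
```

Every variant ends inside the function body, so the model always continues from the same point and only the amount of natural-language context differs.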

This notebook only runs the generations. The postprocessing might be done externally (for example here: https://github.com/Vipitis/bigcode-evaluation-harness/tree/shadereval),
or we throw something together tomorrow.

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, StopStringCriteria
from accelerate import Accelerator
from datasets import load_dataset

from tqdm.auto import tqdm


accelerator = Accelerator()
device = accelerator.device

experiment_ds = load_dataset("Vipitis/Shadereval-experiments-dev")# , download_mode="force_redownload")

# model_id = "deepseek-ai/deepseek-coder-6.7b-base"
model_id = "deepseek-ai/deepseek-coder-1.3b-base"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
stop_words = ["\nfloat", "\nvec", "\nint", "\nmat"] # should cover the really common cases to speed things up.

# TODO: do we continue to use do_sample=False, because that has a huge impact on quality? -> more experiments
# careful about the do_sample setting here, I changed it to True for a little speed comparison - but no luck.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device, do_sample=False, stop_strings=stop_words, return_full_text=False)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
m = 3
k = min(len(experiment_ds["train"]), m + 2)  # end index of the slice, clamped to the dataset size
generations = {}
for experiment in tqdm(["input_both", "input_comment", "input_docstring", "input_none"]):
    gen_list = []
    for row in tqdm(experiment_ds["train"].select(range(m,k))): # only a few rows to test this locally...
        
        gens = pipe(row[experiment], max_length=512, num_return_sequences=1, tokenizer=tokenizer)
        generated_text = gens[0]["generated_text"]
        # inputs = tokenizer(row[experiment], return_tensors="pt").to(device)
        # generation = model.generate(**inputs, max_length=512, num_return_sequences=1, do_sample=False, stop_strings=stop_words, tokenizer=tokenizer, return_full_text=False)
        # generated_text = tokenizer.decode(generation[0], skip_special_tokens=True)
        gen_list.append(generated_text)
    generations[experiment] = gen_list

# is a 1.3b model really ~10 tok/s for me on an A750? that seems really slow - maybe I am doing it wrong... it could also be caching in a bit. I updated transformers and accelerate.
# seems like stop_words don't work on xpu?
generations

  0%|          | 0/4 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


KeyboardInterrupt: 
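
If the stop strings really do not take effect during generation on xpu, the same cut could be applied afterwards as a postprocessing step. The sketch below assumes the generations contain only the continuation (as with `return_full_text=False`); the helper `truncate_at_stop` and the sample text are hypothetical.

```python
def truncate_at_stop(text: str, stop_words: list[str]) -> str:
    """Cut the text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_words:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


stop_words = ["\nfloat", "\nvec", "\nint", "\nmat"]
# Made-up continuation that runs past the end of the target function.
sample = "\n    return length(p) - r;\n}\n\nfloat sdBox(vec3 p)\n{"
trimmed = truncate_at_stop(sample, stop_words)
```

This only removes text after generation; it does not save any generation time, which was the original reason for the stop strings.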

In [None]:
# save the generations to a file
import json
with open(f"exp1_generations_{m}-{k}.json", "w", encoding="utf-8") as outfile:
    json.dump(generations, outfile)

In [10]:
experiment_ds["train"][1]

{'id': '4dSXDd',
 'comment': '// This is the big money function that makes the crazy fractally shape\n',
 'header': 'float DistanceToObject(vec3 p)\n{',
 'docstring': '\n    //p += (1.0/p.y)*0.6;\n\n    // Rotate, but only the part that is on the side of rotDir',
 'body': "\n    if (dot(p, rotDir) > 1.0) p *= rotMat;\n\n    // Repeat our position so we can carve out many cylindrical-like things from our solid\n    vec3 rep = fract(p)-0.5;\n    //final = max(final, -(length(rep.xz*rep.xz)*1.0 - 0.0326));\n    float final = -(length(rep.xy*rep.xz) - 0.109);\n    final = max(final, -(length(rep.zy) - 0.33));\n\n    //final = max(final, -(length(rep.xz*rep.xz) - 0.03));\n    //final = max(final, -(length(rep.yz*rep.yz) - 0.03));\n    //final = max(final, -(length(rep.xy*rep.xy) - 0.030266));\n\n    // Repeat the process of carving things out for smaller scales\n    vec3 rep2 = fract(rep*2.0)-0.5;\n    final = max(final, -(length(rep2.xz)*0.5 - 0.125));\n    final = max(final, -(length(rep2

## Experiment 2: Do we include the main function as a 1-shot task?
Motivation: LiveCodeBench does 1-shot for base models. This might help them understand which language is used (via syntax clues), and it also gives them an idea of what is available outside of the context we provide.
Every shader will have a main function, and we likely don't want to generate those anyway.
The dataset for this is not prepared yet. It would contain only non-main functions, with an additional column for the main function (or at least its start byte, as there should be nothing past it).
Open questions: do we add comments to explain this? Do we list the function names of everything else that is available (and later on also what is in the Common tab)?
Main issue: the mainImage function is at the bottom, so we would mess up the order. That might not be too bad for transformers, as it's all parallel and they might not understand this strict definition... we have to see the results.

Alternative: give the model the whole context. But context length will be a problem. FIM is likely the goal, but not all models support that.
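
A minimal sketch of what such a 1-shot prompt could look like, assuming the main function is simply placed in front of the function to be completed. Since the dataset for this is not prepared, the helper `build_one_shot_prompt`, the `main_function` argument, and the example shader code are all hypothetical.

```python
def build_one_shot_prompt(main_function: str, target_input: str) -> str:
    """Put the shader's main function before the function to be completed."""
    return main_function.rstrip() + "\n\n" + target_input


# Made-up main function, used as the single "shot".
main_function = (
    "void mainImage(out vec4 fragColor, in vec2 fragCoord)\n"
    "{\n"
    "    vec2 uv = fragCoord / iResolution.xy;\n"
    "    fragColor = vec4(uv, 0.0, 1.0);\n"
    "}\n"
)
# Made-up target: comment + header, as in Experiment 1.
target_input = "// signed distance to a sphere\nfloat sdSphere(vec3 p, float r)\n{"
prompt = build_one_shot_prompt(main_function, target_input)
```

Note that this puts the main function above the target function, which is exactly the reordering raised as the main issue above.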

In [None]:
# TODO: implement this experiment (but prepare the data first).

## Experiment 3: do_sample True or False? Also, which generation parameters?
This is motivated by observations.
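
One way to organize this comparison is a small list of generation settings: one greedy baseline plus a grid of sampling configurations. This is only a sketch; the helper `generation_settings` is hypothetical and the temperature and top_p values are placeholders, not settings that have been tested here.

```python
from itertools import product


def generation_settings(temperatures, top_ps):
    """Return one greedy setting plus every sampling combination."""
    settings = [{"do_sample": False}]  # greedy decoding baseline
    for temperature, top_p in product(temperatures, top_ps):
        settings.append(
            {"do_sample": True, "temperature": temperature, "top_p": top_p}
        )
    return settings


# Placeholder values, chosen only to show the shape of the grid.
settings = generation_settings(temperatures=[0.2, 0.8], top_ps=[0.95])
```

Each dict could then be passed as keyword arguments to the generation call, with the outputs stored per setting in the same way as the per-experiment generations above.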
