**WARNING**: This notebook is a sandbox and has not been cleaned up.

# Semantic augmentation with LLM

**WARNING**:
- Zephyr-7B-beta, even quantized, is large for this PC configuration: **almost 7 GB of the 8 GB of VRAM**.  
--> Unload every NLP or other model that uses the GPU before running this notebook.
- Watch the maximum token length with respect to GPU memory. The limit currently lies between $2^{12}$ and $2^{13}$ tokens.
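
As a rough sanity check on this limit, the key/value cache alone grows linearly with the sequence length. The sketch below assumes the published Mistral-7B architecture that Zephyr-7B-beta is fine-tuned from (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache. The function name `kv_cache_bytes` is hypothetical, and the real footprint also depends on how the fused AWQ layers allocate their buffers and on activation memory.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate the key/value cache size for one sequence of n_tokens."""
    # Factor 2: one key and one value vector per token, per layer and per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


for exponent in (12, 13):
    n_tokens = 2**exponent
    print(f"2^{exponent} tokens: {kv_cache_bytes(n_tokens) / 1024**3:.2f} GiB")
```

Under these assumptions the cache costs about 0.5 GiB at $2^{12}$ tokens and 1 GiB at $2^{13}$ tokens, which is plausible for a GPU with little VRAM left once the model is loaded.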

In [1]:
import torch
import gc
import json
from pprint import pprint
from tqdm import tqdm
import pickle as pkl
from datasets import Dataset
from pathlib import Path
import numpy as np
import logging
from time import perf_counter

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

gc.collect()
torch.cuda.empty_cache()

#### Load data

In [2]:
data = Dataset.load_from_disk(Path("../data/163/"))

#### Log management

In [7]:
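LOG_PATH = Path("./log")
LOG_PATH.mkdir(exist_ok=True)  # iterdir() fails if the directory is missing

# Find the highest index among existing files named llm_<index>.log
n = 0
for log in LOG_PATH.iterdir():
    if log.suffix == ".log":  # Path.suffix includes the leading dot
        index = log.stem.split("_")[-1]
        if index.isdigit():
            n = max(n, int(index))
log_file = LOG_PATH / f"llm_{n + 1}.log"
logging.basicConfig(filename=log_file.resolve(), encoding="utf-8", level=logging.DEBUG)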

# Content extraction with LLM

In [8]:
model_name_or_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)
# Load model
llm = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
    max_new_tokens=2**12,
)

Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 115766.35it/s]
Replacing layers...: 100%|██████████| 32/32 [00:05<00:00,  5.53it/s]
Fusing layers...: 100%|██████████| 32/32 [00:01<00:00, 24.78it/s]


Load the prompt context (the initial messages) that steers the model toward the expected output:

In [10]:
import json

PROMPT_PATH = Path('../data/prompt/')
with open(PROMPT_PATH / '2048_extract.json', 'r') as f:
    prompt_2048 = json.load(f)

In [10]:
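def format_prompt(content, init_query):
    """Build the chat prompt: context messages followed by the text to analyze."""
    user_query = f"""text: ```{content}```"""
    query = init_query.copy()
    # apply_chat_template expects role/content dicts, not bare strings
    # (assumes init_query is a list of such message dicts)
    query.append({"role": "user", "content": user_query})
    prompt = tokenizer.apply_chat_template(
        query, tokenize=False, add_generation_prompt=True
    )
    return prompt


def query_llm(content, init_query=None, return_token=False):
    """Generate an answer; return (text, n_token_input, n_token_output)."""
    if init_query is None:
        init_query = prompt_2048  # default context loaded above
    prompt = format_prompt(content, init_query)
    token_input = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    n_token_input = token_input.shape[1]
    generation_output = llm.generate(
        token_input,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**12,
    )

    # The generated sequence holds the prompt tokens followed by the answer
    token_output = generation_output[0]
    n_token_output = token_output.shape[0]
    decoded_token = tokenizer.decode(token_output, skip_special_tokens=True)
    if return_token:
        result = decoded_token
    else:
        # Keep only the text after the last assistant marker
        result = decoded_token.split("<|assistant|>")[-1]
    return result, n_token_input, n_token_output


def semantic_augmentation(content, init_query=None):
    """Query the LLM and parse its answer as JSON."""
    # query_llm formats the prompt itself and returns a 3-tuple
    json_result, _, _ = query_llm(content, init_query=init_query)
    result = json.loads(json_result.strip())
    return result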
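
The `json.loads` call above fails whenever the model wraps its JSON in extra prose or a code fence. A more tolerant parser could look like the sketch below. The helper name `extract_json` is hypothetical and not part of this notebook, and it only handles answers whose payload is a JSON object.

```python
import json


def extract_json(text):
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None


answer = 'Sure, here is the result:\n```json\n{"keywords": ["llm"], "n": 1}\n```'
parsed = extract_json(answer)
```

Such a helper could replace `json.loads(json_result.strip())`, with a `None` result treated as a failed generation.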

## Extraction for input < 2k tokens

### Check previous work

In [11]:
# check if a run cache exists
# if yes, load it
# if none, create it
CACHE_PATH = Path("./run")
CACHE_PATH.mkdir(exist_ok=True)
save_path = CACHE_PATH / "llm_analysis.pkl"

In [12]:
unique_elements, unique_indices = np.unique(
    data["embeddings"], axis=0, return_index=True
)
print("Number of duplicates:", len(data) - len(unique_indices))

Number of duplicates: 1168


### Run extraction

In [13]:
prev_indexes = set()
generations = []

if save_path.exists():
    with open(save_path, "rb") as file:
        generations = pkl.load(file)
    prev_indexes = {x["id"] for x in generations}
    n = len(prev_indexes)
    print(f"Cache file found. Size {n}")

# prev_indexes is a set, so each membership test is O(1)
indices = [i for i in unique_indices if i not in prev_indexes]
index_errors = []


def save_generations():
    # Serialize first so a pickling error cannot truncate the cache file
    serialized_content = pkl.dumps(generations)
    with open(save_path, "wb") as file:
        file.write(serialized_content)


for i, item in enumerate(tqdm(data.select(indices))):
    success = False
    id_ = indices[i]
    content = item["content"]

    # Count the prompt tokens on the CPU; only query_llm needs the GPU
    prompt = format_prompt(content, init_query=prompt_2048)
    n_prompt_tokens = len(tokenizer(prompt).input_ids)

    if n_prompt_tokens <= 2048:
        # Defaults so the log line stays valid when the query fails
        n_input = n_output = dt = None
        try:
            t0 = perf_counter()
            json_results, n_input, n_output = query_llm(content)
            dt = perf_counter() - t0
            result = json.loads(json_results.strip())
            success = True
        except Exception:  # KeyboardInterrupt is not an Exception and still propagates
            index_errors.append(id_)

        msg = f"{id_};{n_input=};{n_output=};{dt};{success}"
        logging.debug(msg)

        if success:
            result["id"] = id_
            generations.append(result)

    # save every 50 iterations
    if (i + 1) % 50 == 0:
        save_generations()

# final save so the last partial batch is not lost
save_generations()

Cache file found. Size 22273


100%|██████████| 30357/30357 [20:45:55<00:00,  2.46s/it]    


**TODO**
- clean up the log format
- analyze the model's behavior when context + answer exceed 2048 tokens
    - decide how to define and process chunks
    - explore the langchain framework to speed up implementation
- analyze the impact of the context vs. the user command
- define a more systematic evaluation

## Extraction for input > 2k tokens

**Purpose / context**:
- texts with more than 2k tokens are too big for the hardware/GPU
- the longer the text, the more the LLM gets lost in the middle

**Protocol**:
1. select the inputs with more than 2k tokens (the content not yet processed could be used)
2. extract chunks from each document
    - test the langchain / llamaindex frameworks for chunking
3. iteratively summarize the ideas with a summarizing prompt: keywords + main ideas
    - make sure to merge / join identical or similar ideas
4. generate the JSON with the last prompt
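
Steps 2 and 3 of this protocol could be prototyped without any framework, as in the sketch below. It is only an illustration under stated assumptions: `chunk_tokens`, `iterative_summary` and `fake_summarize` are hypothetical names, the "tokens" are whitespace-separated words, and the summarizer is a stand-in for a real LLM call with a summarizing prompt.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def iterative_summary(chunks, summarize):
    """Fold the chunks into one running summary.

    `summarize(previous_summary, chunk)` is a hypothetical callable that would
    wrap the LLM call (keywords + main ideas) and merge similar ideas into
    the previous summary.
    """
    summary = ""
    for chunk in chunks:
        summary = summarize(summary, chunk)
    return summary


# Stand-in summarizer so the sketch runs without a GPU:
# it keeps each distinct word once, in order of appearance.
def fake_summarize(previous_summary, chunk):
    seen = previous_summary.split()
    for word in chunk:
        if word not in seen:
            seen.append(word)
    return " ".join(seen)


tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
summary = iterative_summary(chunks, fake_summarize)
```

With the real model, the token list would come from `tokenizer(content).input_ids`, each chunk would be turned back into text with `tokenizer.decode`, and the final summary would feed the JSON-generating prompt of step 4.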