**WARNING**: this notebook is a sandbox and has not been cleaned up.

# Semantic augmentation with LLM

**WARNING**:
- Zephyr-7B, even quantized, is large for this PC configuration: **almost 7 GB of the 8 GB of VRAM**.  
--> Unload all NLP or other models that use the GPU.
- Watch the maximum token length with respect to GPU memory. The limit is currently between $2^{12}$ and $2^{13}$ tokens.
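
A rough back-of-the-envelope estimate helps explain the token limit above. The sketch below assumes a Mistral-7B-style configuration (32 layers, 8 key/value heads, head dimension 128) and an fp16 key/value cache with batch size 1; these figures are assumptions, not values read from the loaded model, and activations and allocator overhead are not counted.

```python
# Rough KV-cache size estimate. All constants are assumptions
# (Mistral-7B-style architecture, fp16 cache, batch size 1).
N_LAYERS = 32        # transformer blocks
N_KV_HEADS = 8       # key/value heads (grouped-query attention)
HEAD_DIM = 128       # dimension per head
BYTES_PER_VALUE = 2  # fp16


def kv_cache_bytes(n_tokens: int) -> int:
    """Bytes needed to cache keys and values for n_tokens tokens."""
    # The leading factor 2 accounts for storing both keys and values.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * n_tokens


for exp in (11, 12, 13):
    n = 2**exp
    print(f"2^{exp} = {n} tokens -> {kv_cache_bytes(n) / 2**20:.0f} MiB")
```

Under these assumptions the cache alone needs about 512 MiB at $2^{12}$ tokens and 1 GiB at $2^{13}$, which is consistent with running out of memory somewhere between the two when only about 1 GB of VRAM is left.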

In [None]:
import torch
import gc
import json
from pprint import pprint
from tqdm import tqdm
import pickle as pkl
from datasets import Dataset
from pathlib import Path
import numpy as np
import logging
from time import perf_counter

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

gc.collect()
torch.cuda.empty_cache()



Nothing to clear


#### Load data

In [None]:
data = Dataset.load_from_disk(Path("../data/163/"))

#### Log management

In [2]:
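LOG_PATH = Path("./log")
LOG_PATH.mkdir(exist_ok=True)

# Find the highest index among the existing llm_<n>.log files.
n = 0
for log in LOG_PATH.iterdir():
    # Path.suffix includes the leading dot.
    if log.suffix == ".log":
        i = log.stem.split("_")[-1]
        # Compare as integers; skip files without a numeric index.
        if i.isdigit() and int(i) > n:
            n = int(i)
log_file = LOG_PATH / f"llm_{n+1}.log"

# Send the DEBUG messages of the extraction loop to the new log file.
logging.basicConfig(filename=log_file, level=logging.DEBUG)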

# Content extraction with LLM

In [None]:
model_name_or_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)
# Load model
llm = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
    max_new_tokens=2**12,
    
)

Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 115766.35it/s]
Replacing layers...: 100%|██████████| 32/32 [00:03<00:00,  9.73it/s]
Fusing layers...: 100%|██████████| 32/32 [00:00<00:00, 55.56it/s]


The prompt context used to obtain the desired output format:

In [None]:
# French for: "What should be done to make taxation fairer and more efficient?"
question = """Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?"""
treatment = """- create a dict / JSON document with the following keys: relevant, main_content, keyword
- reply "true" or "false" depending on whether the text answers the question. key:"relevant"
- extract the main ideas and propositions from the text into a list.
- reformulate these main ideas and propositions with an infinitive verb and an action. key:"main_content"
- determine the word, feeling, emotion or quality that summarizes the content with 0 to 5 keywords. key:"keyword".
"""
system_prompt = f"""
- Treat the text as a French analyst who works in economic, tax, financial and public policy.
- Apply the following analysis treatment and return the responses in JSON format.
- Respect the JSON format at all costs.
- Answer in fewer than 3000 tokens.

question : ```{question}```
treatment : ```{treatment}```
"""

example_user_1 = f"""text : ```--- 1) Réforme de l'impôt sur le revenu : il faut un paiement de l'impôt par tous dés le 1er euro perçu en prenant en compte tous les revenus.
2) Remise à plat de toutes les niches fiscales et suppression de celles inefficaces et inutiles.
2) Suppression de la taxe d'habitation pour 100% des français et non pas 80% car si cet impôt est bête et injuste , il l'est pour l'ensemble des français.
3) Taxation des entreprises à un taux réel avec là aussi une revue des niches, crédits d'impôts et autres réductions qui permettent à bcp d'entreprises de se soustraire à l'impôt.
```"""
example_assistant_1 = """{"relevant":true,
"main_content": ["payer l'impôt sur le revenu dés le 1er euro perçu", "supprimer les niches fiscales", "supprimer la taxe d'habitation", "Taxer les entreprises à un taux réel"],
"keyword":["justice", "réforme", "égalité"]}"""


# Wrapped like the first example so both few-shot inputs share the same format.
example_user_2 = """text : ```Que tous les français devraient travailler, ça nous coûterait moins cher.```"""
example_assistant_2 = """{"relevant":false,
"main_content": ["avoir tous les français au travail"],
"keyword":["travail"]}"""

In [None]:
def format_prompt(content):
    user_prompt = f"""text: ```{content}```"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": example_user_1},
        {"role": "assistant", "content": example_assistant_1},
        {"role": "user", "content": example_user_2},
        {"role": "assistant", "content": example_assistant_2},
        {"role": "user", "content": user_prompt},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def query_llm(content, return_token=False):
    prompt = format_prompt(content)
    token_input = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    n_token_input = token_input.shape[1]
    generation_output = llm.generate(
        token_input,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**12,
    )

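    # generate() returns the prompt tokens followed by the new tokens.
    token_output = generation_output[0]
    # Count only the generated tokens, not the prompt.
    n_token_output = token_output.shape[0] - n_token_input
    decoded_token = tokenizer.decode(token_output, skip_special_tokens=True)
    if return_token:
        # Full decoded text: prompt and answer.
        result = decoded_token
    else:
        # Keep only the text after the last assistant tag.
        result = decoded_token.split("<|assistant|>")[-1]
    return result, n_token_input, n_token_output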


def semantic_augmentation(content):
    # query_llm formats the prompt itself and returns (text, n_input, n_output).
    json_result, _, _ = query_llm(content)
    result = json.loads(json_result.strip())
    return result
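
`json.loads` fails whenever the model surrounds the JSON with extra text (a greeting, a trailing comment). A minimal sketch of a more tolerant parser, assuming the answer contains at least one complete JSON object; `extract_json` is a hypothetical helper, not part of the notebook:

```python
import json


def extract_json(text: str) -> dict:
    """Parse the first complete JSON object found in text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace: try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


answer = 'Here is the result: {"relevant": false, "main_content": [], "keyword": ["travail"]} Done.'
print(extract_json(answer))
```

Such a helper could replace the `json.loads(json_result.strip())` call, at the cost of silently accepting answers that do not fully respect the requested format.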

## Run the analysis

### Check previous work

In [None]:
# check if a run cache exists
# if yes, load it
# if none, create it
CACHE_PATH = Path("./run")
CACHE_PATH.mkdir(exist_ok=True)
save_path = CACHE_PATH / "llm_analysis.pkl"

In [None]:
unique_elements, unique_indices = np.unique(
    data["embeddings"], axis=0, return_index=True
)
print("Number of duplicates:", len(data) - len(unique_indices))

Number of duplicates: 1168
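
With `axis=0` and `return_index=True`, `np.unique` returns, for each distinct row, the index of its first occurrence, ordered by sorted row values rather than by position in the dataset. The pure-Python sketch below makes the same selection on toy data but keeps the dataset order; `first_occurrence_indices` and the toy rows are illustrative only, and rows count as duplicates only when they are exactly equal.

```python
def first_occurrence_indices(rows):
    """Return the index of the first occurrence of each distinct row."""
    seen = set()
    keep = []
    for i, row in enumerate(rows):
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return keep


toy = [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2], [0.0, 0.9], [0.3, 0.4]]
keep = first_occurrence_indices(toy)
print(keep)                                         # [0, 1, 3]
print("Number of duplicates:", len(toy) - len(keep))  # Number of duplicates: 2
```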


### Run extraction

**TODO**
- clean up the logs
- analyze the behavior of the model when context + answer exceed 2048 tokens.
    - decide how to define and process chunks
    - explore the LangChain framework to speed up implementation.
- analyze the impact of the context vs. the user command
- define a more systematic evaluation
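
One possible starting point for the chunking item above is to pack paragraphs greedily under a token budget. This is only a sketch: `chunk_text` is a hypothetical helper, paragraphs are assumed to be separated by newlines, and a whitespace word count stands in for the real tokenizer.

```python
def chunk_text(text, max_tokens, count_tokens):
    """Greedily pack paragraphs into chunks of at most max_tokens tokens."""
    chunks, current = [], []
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = "\n".join(current + [paragraph])
        if current and count_tokens(candidate) > max_tokens:
            # The budget would be exceeded: close the current chunk.
            chunks.append("\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n".join(current))
    return chunks


def count_words(s):
    # Stand-in for a real token count.
    return len(s.split())


print(chunk_text("a b c\nd e\nf g h i", max_tokens=5, count_tokens=count_words))
# ['a b c\nd e', 'f g h i']
```

A single paragraph longer than the budget is kept as its own oversized chunk, so it would still need to be skipped or split further. In the notebook, `count_tokens` could be something like `lambda s: len(tokenizer(s).input_ids)`.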

In [None]:
prev_indexes = []
generations = []

if save_path.exists():
    with open(save_path, "rb") as file:
        generations = pkl.load(file)
        prev_indexes = [x["id"] for x in generations]
        n = len(prev_indexes)
        print(f"Cache file found. Size {n}")

# Use a set for fast membership tests on already-processed ids.
done = set(prev_indexes)
indices = [i for i in unique_indices if i not in done]
index_errors = []

for i, item in enumerate(tqdm(data.select(indices))):

    success = False
    id_ = indices[i]
    content = item["content"]

    # Count prompt tokens on the CPU; query_llm moves them to the GPU itself.
    prompt = format_prompt(content)
    n_prompt = len(tokenizer(prompt).input_ids)

    if n_prompt <= 2048:
        # Defaults so the log line stays valid even when the query fails.
        n_input, n_output = n_prompt, 0
        t0 = perf_counter()
        try:
            json_results, n_input, n_output = query_llm(content)
            result = json.loads(json_results.strip())
            success = True
        except Exception:
            # KeyboardInterrupt is not an Exception, so Ctrl-C still stops the run.
            index_errors.append(id_)
        dt = perf_counter() - t0

        msg = f"{id_};{n_input=};{n_output=};{dt};{success}"
        logging.debug(msg)

        if success:
            result["id"] = id_
            generations.append(result)

    # save every 50 iterations
    if (i + 1) % 50 == 0:
        with open(save_path, "wb") as file:
            pkl.dump(generations, file)

# Final save so the last partial batch is not lost.
with open(save_path, "wb") as file:
    pkl.dump(generations, file)

Cache file found. Size 12390


 25%|██▍       | 10027/40240 [7:20:57<22:08:41,  2.64s/it] 


KeyboardInterrupt: 
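
The run above was stopped with a `KeyboardInterrupt`. If such an interrupt lands while the cache file is being written, the pickle on disk can end up truncated. A minimal sketch of a safer save, writing to a temporary file and renaming it afterwards; `save_checkpoint` is a hypothetical helper, not part of the notebook:

```python
import os
import pickle as pkl
import tempfile
from pathlib import Path


def save_checkpoint(obj, path):
    """Pickle obj to path atomically: write a temp file, then rename it."""
    path = Path(path)
    # Create the temp file in the target directory so the rename
    # stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as file:
            pkl.dump(obj, file)
        # The previous cache stays intact until this rename succeeds.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the partial file, including on KeyboardInterrupt.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

In the loop, `save_checkpoint(generations, save_path)` could then stand in for the `open(save_path, "wb")` block.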