# Llama2 7b tuning & inference

So, what if we try something... bigger?...

This contestant is out of league for several reasons:
1. It has much more parameters than other models (60 and 124M vs 7b). It would be much more appropriate to compare it with heavier versions of GPT2 and T5, however I do not have enough resources for that 🙃
2. This model takes way more time to infere than lighter models (for obvious reasons)

Why then I chosed this model as an option? Well, Llama2 is one of the most powerful publicly available transformers at the moment, and I consider skills I acuired while working with it quite handy.

## Llama2 7b tuning

Obviously, I could not do it easily either on the kaggle or locally, because I do not have enough resources available for that. That's why I decided to use service called [modal.com](https://modal.com/) and utilize their computational power to run the evaluations.


### Reproduction steps

First of all, you have to gain access to the service:
1. Regiseter in [modal.com](https://modal.com/) (1 minute, requires GitHub authentication)
2. Enter secret from Huggingface (enter the hf token in the `HUGGINGFACE_TOKEN` field and name it `huggingface`), which could be found in the `Settings/API tokens`.
 
The tool is much easier to use via the terminal, because it generates way too much output. Here is the list of commands to launch it in CLI (and corresponding cell with these commands):
```bash
# Authorization in modal account
modal token new   
# Launch training process
modal run src/models/llama/train_modal.py --dataset llama2_dataset.py --base chat7 --run-id chat7-nontoxic
# Copying PEFT pretrained model from modal cloud to local dir
modal volume get example-results-vol 'chat7-nontoxic/*' models/llama2 
# Running inference for the model in cloud
modal run inference.py --base chat7 --run-id chat7-nontoxic --prompt "[INST]<<SYS>>\nYou are a Twitch moderator that paraphrases sentences to be non-toxic.\n<<SYS>> \n\nCould you paraphrase this: ...?\n [/INST]"

```

The implementation of inference with model running locally is represented below

## Inference

In [1]:
!pip install -q peft
!pip install -q --upgrade bitsandbytes
!pip install -q --upgrade accelerate

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from collections.abc import Iterable
from tqdm.auto import trange
import torch
import numpy as np
import peft
import transformers, accelerate, bitsandbytes



The functions for wrapping the message and running the inference. Note, that we are loading quantified model, since the resources of the Colab is not enough to run it without compression.

In [3]:
def wrap_messages(msgs):
    B_INST, E_INST = "[INST] ", " [/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    prefixed_queries = [
        B_INST
        + B_SYS
        + "You are a Twitch moderator that paraphrases sentences to be non-toxic.\n"
        + E_SYS
        + "Could you paraphrase this: "
        + msg
        + "?\n"
        + E_INST
        for msg in msgs
    ]
    return prefixed_queries


def predict(requests, greb_answer = False, batch_size = 1, max_length = 64):
    requests = wrap_messages(requests)
    
    model = AutoModelForCausalLM.from_pretrained(
        'daryl149/llama-2-7b-chat-hf', 
        load_in_4bit=True, 
        bnb_4bit_compute_dtype=torch.float16
    )
    model.load_adapter('domrachev03/llama2_7b_detoxification')
    model.eval()
        
    tokenizer = AutoTokenizer.from_pretrained('daryl149/llama-2-7b-chat-hf')
    tokenizer.pad_token = tokenizer.eos_token
    
    
    results = []
    for i in trange(0, len(requests), batch_size):
        batch = [t for t in requests[i: i + batch_size]]
        inputs = tokenizer(
            batch, 
            padding=True, 
            truncation=True, 
            max_length = max_length, 
            return_tensors='pt'
        ).input_ids.to(model.device)
        
        with torch.no_grad():
            out = model.generate(inputs, max_new_tokens=max_length+1)
            decoded = [tokenizer.decode(out_i, skip_special_tokens=True,temperature=0) for out_i in out]
            
            if greb_answer:
                decoded = [
                    decoded[k][decoded[k].find('[/INST]')+len('[/INST]') : decoded[k].find('</s>')] 
                    for k in range(len(decoded))
                ]
            results.extend(decoded)
    
    return results

Test launch

In [4]:
queries = ['Fuck you!', 'This freaking chair makes me nuts', 'This fucking sause, I love it', 'I hate gays', 'Pupupu']

predict(queries, greb_answer=True, batch_size=2)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

["you know, I'm sorry, but I can't do that.",
 'this chair is driving me crazy.',
 'this sauce, I love it.',
 'I hate gays.',
 "I'm just a little puppy."]

## Computing the results

Now, let's load the test dataset and check the performance of the model. Note, that the inference on the whole test dataset would take too much time, and hence only a fraction of it is utilized

> Note: The current setup utilizes 20Gb of RAM and 15.9Gb of videomemory. This is barely enough to run on `Nvidia P100` in Kaggle. 

In [5]:
import datasets

dataset = datasets.load_dataset("domrachev03/toxic_comments_subset")
test_subset = dataset['test'].select(range(5000))

  0%|          | 0/2 [00:00<?, ?it/s]

In [6]:
test_preds = predict([*test_subset['reference']], greb_answer=True, batch_size=64)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

## Metrics & saving

Now, let's compute the metrics for the model 

In [7]:
!pip install -q sacrebleu
!pip install -q evaluate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [17]:
import gc
import tqdm
from tqdm.auto import trange
import torch
import numpy as np

from transformers import AutoModelForSequenceClassification, AutoTokenizer, \
    RobertaTokenizer, RobertaForSequenceClassification

import evaluate


def cleanup():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def get_toxicity(preds, soft=False, batch_size=1, device='cuda'):
    results = []

    model_name = 'SkolkovoInstitute/roberta_toxicity_classifier'

    tokenizer = RobertaTokenizer.from_pretrained(model_name)
    model = RobertaForSequenceClassification.from_pretrained(model_name)
    device = device
    model.to(device)

    model.eval()
    for i in tqdm.tqdm(range(0, len(preds), batch_size)):
        batch = tokenizer(preds[i:i + batch_size], return_tensors='pt', max_length=-1, padding=True).to(device)

        with torch.no_grad():
            logits = model(**batch).logits
            out = torch.softmax(logits, -1)[:, 1].cpu().numpy()
            results.append(out)
    return 1 - np.concatenate(results)


def get_sacrebleu(inputs, preds):
    metric = evaluate.load("sacrebleu")

    result = metric.compute(predictions=preds, references=inputs)
    return result['score']


def get_fluency(preds, soft=False, batch_size=1, device='cuda'):
    path = 'cointegrated/roberta-large-cola-krishna2020'

    model = RobertaForSequenceClassification.from_pretrained(path)
    tokenizer = AutoTokenizer.from_pretrained(path)
    device = device
    model.to(device)

    results = []
    for i in trange(0, len(preds), batch_size):
        batch = [t for t in preds[i: i + batch_size]]
        inputs = tokenizer(batch, max_length=-1, padding=True, return_tensors='pt').to(device)
        with torch.no_grad():
            out = torch.softmax(model(**inputs).logits, -1)[:, 0].cpu().numpy()
            results.append(out)
    return np.concatenate(results)


def compute_metrics(eval_preds, tokenizer=None, print_results=False, batch_size=1, device='cuda'):
    preds, labels = eval_preds
    
    if tokenizer is not None:
        detokenized_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        filtered_labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        detokenized_labels = tokenizer.batch_decode(filtered_labels, skip_special_tokens=True)
    else:
        detokenized_preds = preds
        detokenized_labels = labels

    results = {}
    results['toxic'] = get_toxicity(detokenized_preds, batch_size=batch_size, device=device)
    results['avg_toxic'] = sum(results['toxic']) / len(results['toxic'])
    cleanup()

    results['bleu'] = get_sacrebleu(detokenized_labels, detokenized_preds) / 100
    cleanup()

    results['fluency'] = get_fluency(detokenized_preds, batch_size=batch_size, device=device)
    results['avg_fluency'] = sum(results['fluency']) / len(results['fluency'])
    cleanup()

    # count metrics
    results['joint'] = sum(results['toxic'] * results['bleu'] * results['fluency']) / len(preds)
    if print_results:
        print("--------------")
        print("Metric   | Value")
        print("--------------")
        print(f"toxic    | {results['avg_toxic']:.2f}")
        print(f"bleu (n) | {results['bleu']:.2f}")
        print(f"fluency  | {results['avg_fluency']:.2f}")
        print("===============")
        print(f"Total    | {results['joint']:.2f}")
        print("--------------")
    return results

In [18]:
labels_list = [*test_subset['translation']]

In [19]:
compute_metrics((test_preds, labels_list), print_results=True, batch_size=64)

Some weights of the model checkpoint at SkolkovoInstitute/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 79/79 [00:10<00:00,  7.60it/s]


  0%|          | 0/79 [00:00<?, ?it/s]

--------------
Metric   | Value
--------------
toxic    | 0.69
bleu (n) | 0.21
fluency  | 0.91
Total    | 0.13
--------------


{'toxic': array([0.9999584 , 0.99572176, 0.9999203 , ..., 0.9960717 , 0.9992476 ,
        0.01411206], dtype=float32),
 'avg_toxic': 0.6933714614152908,
 'bleu': 0.2101026067924932,
 'fluency': array([0.98663205, 0.99079084, 0.9894153 , ..., 0.9943605 , 0.9873584 ,
        0.9846197 ], dtype=float32),
 'avg_fluency': 0.9135820226486773,
 'joint': 0.13385127502502436}

In [11]:
import pandas as pd

preds_dict = pd.DataFrame([[orig_i, pred_i, label_i] for orig_i, pred_i, label_i in zip(test_subset['reference'], test_preds, labels_list)], columns=['Query', 'Predictions', 'Labels'])

preds_dict.head(10)

Unnamed: 0,Query,Predictions,Labels
0,"If they say they don't want your help, you can...","if they say they don't want your help, you can...","if they say they don't need help, you can bet ..."
1,"When you go home, they'il call you ""Mr. No-Leg...","When you get home, they'll call you ""Mr. No-Le...","when you get home, they'll call you Mr."
2,Mrs. Sanders ain't goingfor that late shit no ...,Mrs. Sanders won't be doing that late stuff an...,Mrs. Sanders is not curious about being late.
3,I want every bloody one of them to hear.,I want everyone to hear it.,I want everyone to hear it.
4,That shit hurt.,that hurt.,and that really hurts.
5,How do I know if I tell you what you need to k...,I don't know if you'll tell me what I need to ...,how do I know if I tell you what you want to k...
6,"This guy ain't going nowhere. Oh, shit!",this guy's not going anywhere.,this guy's not going anywhere.
7,Fucked if I can remember.,I can't remember.,if only I could remember.
8,I fucked one stitch and he looks at me now as ...,"I'm sorry, but I'm not sure what you mean.","I mess up a suture, and now he's looking at me..."
9,Earth! Meet my lovely assistant... Tiny tits.,"earth, meet my assistant... tiny tits.","please welcome my beautiful assistant, Maloprs..."


In [12]:
preds_dict.to_csv('llama_test.csv')