# LLM - Meta - LLama-3.1-8b - Transfer Learning

En este Notebook se realiza el proceso de finetuning para los modelos llama de meta sobre el conjunto de datos ISEAR con el objetivo de identificar cuál tiene el mejor rendimiento en términos de f1_score sobre el conjunto de datos de entrenamiento


## Libraries


In [1]:
import transformers
from transformers import AutoTokenizer, set_seed
from datasets import Dataset, DatasetDict, ClassLabel
import pandas as pd
import numpy as np
import evaluate
import torch
from transformers import (
    pipeline,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from trl import setup_chat_format
import bitsandbytes as bnb

2024-10-22 22:16:39.768547: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-22 22:16:39.785993: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-22 22:16:39.790510: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-22 22:16:39.802874: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Es necesario loggearse con una cuenta de Hugging face que tenga acceso a los modelos


In [2]:
from huggingface_hub import login

# Log in to Hugging Face using your API token
login(token="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")

## Dataset

Se cargan los datasets de `training`, `validation` y `test`. Se utiliza cada uno para:

- `training` = Realizar el fine tunning del modelo.
- `validation` = validar el proceso de fine tunning.
- `test` = testear el modelo con nuevos datos.


In [3]:
df_train = pd.read_csv("../data/data_to_model/train_data.csv").rename(
    columns={"emotion": "label"}
)
df_val = pd.read_csv("../data/data_to_model/val_data.csv").rename(
    columns={"emotion": "label"}
)
df_test = pd.read_csv("../data/data_to_model/test_data.csv").rename(
    columns={"emotion": "label"}
)

In [20]:
df_dict_test = Dataset.from_pandas(df_test)
df_dict_test = df_dict_test.class_encode_column("label")

Casting to class labels:   0%|          | 0/754 [00:00<?, ? examples/s]

# Preparing dataset to model

### Create Queries


In [4]:
# Define the prompt generation functions
def generate_prompt(data_point):
    return f"""
        You are an advanced assistant specialized in analyzing and detecting emotions in short text. 
        You will be provided with a text, and your task is to classify it into **exactly one emotion** from the following list:
        [shame, sadness, joy, guilt, fear, disgust, anger].

        **Important Rules:**
        1. You must return only **one** of the emotions from the list without additional text.
        2. Do **not** create or infer any emotions outside the list.
        3. If the text does not match any emotion exactly, return the closest emotion from the list. 
        4. Do **not** return 'none', 'jealousy', or any other emotion that is not in the list.
        5. Do not return additional text, only **exactly one emotion** of the list.
        
        Your answer can only be shame, sadness, joy, guilt, fear, disgust, or anger.
        If you're unsure, assign the closest emotion from the allowed list.
        
text: {data_point["text"]}
label: {data_point["label"]}""".strip()


def generate_test_prompt(data_point):
    return f"""
        You are an advanced assistant specialized in analyzing and detecting emotions in short text. 
        You will be provided with a text, and your task is to classify it into **exactly one emotion** from the following list:
        [shame, sadness, joy, guilt, fear, disgust, anger].
text: {data_point["text"]}
label: """.strip()

In [5]:
df_train["text"] = df_train.apply(generate_prompt, axis=1)
df_val["text"] = df_val.apply(generate_prompt, axis=1)
df_test["text"] = df_test.apply(generate_test_prompt, axis=1)

# Generate prompts for training and evaluation data
X_train = df_train["text"]
X_val = df_val["text"]
X_test = df_test["text"]

# Generate test prompts and extract true labels
y_test = df_test.loc[:, "label"]

In [23]:
X_test

0      You are an advanced assistant specialized in a...
1      You are an advanced assistant specialized in a...
2      You are an advanced assistant specialized in a...
3      You are an advanced assistant specialized in a...
4      You are an advanced assistant specialized in a...
                             ...                        
749    You are an advanced assistant specialized in a...
750    You are an advanced assistant specialized in a...
751    You are an advanced assistant specialized in a...
752    You are an advanced assistant specialized in a...
753    You are an advanced assistant specialized in a...
Name: text, Length: 754, dtype: object

### Convert df formated to datasets


In [6]:
# Convert to datasets
df_dict_train = Dataset.from_pandas(df_train[["text"]])
df_dict_val = Dataset.from_pandas(df_val[["text"]])

# Crear el DatasetDict con train, validation y test
emotions = DatasetDict(
    {
        "train": df_dict_train,
        "validation": df_dict_val,
    }
)

emotions

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 6027
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 753
    })
})

## Loading model and tokenizer


In [7]:
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="float16",
    # quantization_config=bnb_config,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

tokenizer.pad_token_id = tokenizer.eos_token_id

In [41]:
import wandb


wb_token = "f605878c9f982d8824041df6842ee9e5baaa993e"

wandb.login(key=wb_token)
run = wandb.init(
    project="Predictions llama-3.2-1b-2epoch-it on Emotion Analysis Dataset ISEAR",
    job_type="training",
    anonymous="allow",
)



[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/dalopeza/.netrc


In [17]:
def predict(test, model, tokenizer):
    y_pred = []
    categories = ["shame", "sadness", "joy", "guilt", "fear", "disgust", "anger"]

    for i in X_test:
        prompt = i
        pipe = pipeline(
            task="text-generation",
            model=model,
            tokenizer=tokenizer,
            max_new_tokens=2,
            temperature=0.1,
        )

        result = pipe(prompt)
        answer = result[0]["generated_text"].split("label:")[-1].strip()

        # Determine the predicted category
        for category in categories:
            if category.lower() in answer.lower():
                y_pred.append(category)
                break
        else:
            y_pred.append("none")

    return y_pred


In [12]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="float16",
    quantization_config=bnb_config,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

import bitsandbytes as bnb


def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:  # needed for 16 bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)


modules = find_all_linear_names(model)
modules

['up_proj', 'q_proj', 'k_proj', 'gate_proj', 'o_proj', 'v_proj', 'down_proj']

In [15]:
output_dir = "models/llama-3.2-1b-fine-tuned-model-isear"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)

training_arguments = TrainingArguments(
    output_dir=output_dir,  # directory to save and repository id
    num_train_epochs=10,  # number of training epochs
    per_device_train_batch_size=1,  # batch size per device during training
    gradient_accumulation_steps=8,  # number of steps before performing a backward/update pass
    gradient_checkpointing=True,  # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=1,
    learning_rate=2e-4,  # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,  # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,  # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",  # use cosine learning rate scheduler
    report_to="wandb",  # report metrics to w&b
    eval_strategy="steps",  # save checkpoint every epoch
    eval_steps=0.2,
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=emotions["train"],
    eval_dataset=emotions["validation"],
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=512,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    },
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/6027 [00:00<?, ? examples/s]

Map:   0%|          | 0/753 [00:00<?, ? examples/s]

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [16]:
trainer.train()

  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Epoch,Training Loss,Validation Loss
0,0.2251,0.305847
1,0.2281,0.303177
2,0.1697,0.3183
3,0.1992,0.350974
4,0.183,0.4065
5,0.0872,0.467132
6,0.0593,0.531201
8,0.0433,0.603956
9,0.0435,0.621997


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  return fn(*args, **kwargs)
  with torch.enab

TrainOutput(global_step=7530, training_loss=0.15084042684243099, metrics={'train_runtime': 26910.945, 'train_samples_per_second': 2.24, 'train_steps_per_second': 0.28, 'total_flos': 8.9917827577344e+16, 'train_loss': 0.15084042684243099, 'epoch': 9.995022399203584})

In [17]:
wandb.finish()
model.config.use_cache = True

VBox(children=(Label(value='0.042 MB of 0.042 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/loss,▁▁▁▂▃▅▆▇██
eval/runtime,█▁▂▂▂▂▁▂▂▁
eval/samples_per_second,▁█▇▆▇▆█▆▆█
eval/steps_per_second,▁█▇▇▇▆█▆▆█
train/epoch,▁▁▁▂▂▂▂▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇█████
train/global_step,▁▁▁▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇███
train/grad_norm,▁▂▂▂▂▂▂▂▃▃▄▃▃▅▄█▇▇▆▃▆▄▆▆▃▃▃▂▅▁▂▂▂▁▁▂▁▂▂▂
train/learning_rate,▂█████▇▇▇▇▇▇▇▆▆▆▆▆▆▅▄▄▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁
train/loss,█▂▂▂▂▂▂▂▁▁▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

0,1
eval/loss,0.622
eval/runtime,98.9185
eval/samples_per_second,7.612
eval/steps_per_second,0.96
total_flos,8.9917827577344e+16
train/epoch,9.99502
train/global_step,7530.0
train/grad_norm,0.10393
train/learning_rate,0.0
train/loss,0.0435


In [25]:
df_dict_test

Dataset({
    features: ['text', 'label'],
    num_rows: 754
})

In [32]:
df_test[["text"]][:100]

Unnamed: 0,text
0,You are an advanced assistant specialized in a...
1,You are an advanced assistant specialized in a...
2,You are an advanced assistant specialized in a...
3,You are an advanced assistant specialized in a...
4,You are an advanced assistant specialized in a...
...,...
95,You are an advanced assistant specialized in a...
96,You are an advanced assistant specialized in a...
97,You are an advanced assistant specialized in a...
98,You are an advanced assistant specialized in a...


In [42]:
df_dict_test = Dataset.from_pandas(df_test[["text"]][:10])

In [47]:
X_test

0      You are an advanced assistant specialized in a...
1      You are an advanced assistant specialized in a...
2      You are an advanced assistant specialized in a...
3      You are an advanced assistant specialized in a...
4      You are an advanced assistant specialized in a...
                             ...                        
749    You are an advanced assistant specialized in a...
750    You are an advanced assistant specialized in a...
751    You are an advanced assistant specialized in a...
752    You are an advanced assistant specialized in a...
753    You are an advanced assistant specialized in a...
Name: text, Length: 754, dtype: object

In [57]:
from tqdm import tqdm


def predict(test, model, tokenizer):
    y_pred = []
    categories = ["Normal", "Depression", "Anxiety", "Bipolar"]

    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]  # ["text"]
        pipe = pipeline(
            task="text-generation",
            model=model,
            tokenizer=tokenizer,
            max_new_tokens=2,
            temperature=0.1,
        )

        result = pipe(prompt)
        answer = result[0]["generated_text"].split("label:")[-1].strip()

        # Determine the predicted category
        for category in categories:
            if category.lower() in answer.lower():
                y_pred.append(category)
                break
        else:
            y_pred.append("none")

    return y_pred


# y_pred = predict(X_test, model, tokenizer)

In [55]:
for i in tqdm(range(len(X_test[:10]))):
    print(i)

100%|██████████| 10/10 [00:00<00:00, 2778.24it/s]

0
1
2
3
4
5
6
7
8
9





In [58]:
y_pred = predict(X_test[:10], model, tokenizer)
# evaluate(y_test, y_pred)

100%|██████████| 10/10 [00:02<00:00,  3.35it/s]


In [43]:
# Tokenize the test data
test_encodings = df_dict_test.map(
    lambda batch: tokenizer(
        batch["text"], padding=True, truncation=True
    ),  # , max_length=512),
    batched=True,
)

# Ensure the tokenizer does not return special tokens as labels
test_encodings = test_encodings.remove_columns(["text"])
test_encodings.set_format("torch")

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [44]:
# Generate predictions
predictions = trainer.predict(test_encodings)

# Extract predicted labels
y_pred = predictions.predictions.argmax(axis=-1)

In [45]:
y_test[:10]

0      anger
1    sadness
2      shame
3    sadness
4      guilt
5    disgust
6      shame
7       fear
8    disgust
9    disgust
Name: label, dtype: object

In [59]:
y_pred

['none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none']

In [21]:
# Evaluar el modelo en el conjunto de prueba
test_results = trainer.evaluate(eval_dataset=df_dict_test)

# Imprimir los resultados de las métricas en el conjunto de prueba
test_results

ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']

In [19]:
y_pred = predict(X_test, model, tokenizer)
evaluate(y_test, y_pred)

NameError: name 'predict' is not defined

In [18]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix


def plot_confusion_matrix(y_pred, y_true, labels):
    cm = confusion_matrix(y_true, y_pred, normalize="true")
    _, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()

In [None]:
# Initialize an empty list to store results
results = []

# Step 1: Classification metrics
print("Classification Report:")
report = classification_report(
    y_test,
    y_pred,
    output_dict=True,
    target_names=df_test["label"].unique(),
)
print(
    classification_report(
        y_test,
        y_pred,
        target_names=df_test["label"].unique(),
    )
)

# Extract important metrics from the classification report for the test set
accuracy = report["accuracy"]
macro_precision = report["macro avg"]["precision"]
macro_recall = report["macro avg"]["recall"]
macro_f1 = report["macro avg"]["f1-score"]

# Store the metrics in the results list
results.append(
    {
        "Model": "llama-3.1-8b",
        "Method": "Transfer Learning",
        "Test Accuracy": accuracy,
        "Test Macro Precision": macro_precision,
        "Test Macro Recall": macro_recall,
        "Test Macro F1-Score": macro_f1,
    }
)

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

plot_confusion_matrix(y_pred, y_test, labels)

results_df

### Consolidating training, validation y test:


In [5]:
# Convert the pandas DataFrame into a Dataset
df_dict_train = Dataset.from_pandas(df_train)
df_dict_train = df_dict_train.class_encode_column("label")

df_dict_val = Dataset.from_pandas(df_val)
df_dict_val = df_dict_val.class_encode_column("label")

df_dict_test = Dataset.from_pandas(df_test)
df_dict_test = df_dict_test.class_encode_column("label")


# Crear el DatasetDict con train, validation y test
emotions = DatasetDict(
    {
        "train": df_dict_train,
        "validation": df_dict_val,
        "test": df_dict_test,
    }
)


# Verificar el resultado
emotions

Casting to class labels:   0%|          | 0/6027 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/753 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/754 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 6027
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 753
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 754
    })
})

## Tokenize Text


In [6]:
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

tokenizer.pad_token_id = tokenizer.eos_token_id

print("The vocabulary size is:", tokenizer.vocab_size)
print("Maximum context size:", tokenizer.model_max_length)
print(
    "Name of the fields, model need in the forward pass:", tokenizer.model_input_names
)

The vocabulary size is: 128000
Maximum context size: 131072
Name of the fields, model need in the forward pass: ['input_ids', 'attention_mask']


In [7]:
# Tokenize
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=False)

In [8]:
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

Map:   0%|          | 0/6027 [00:00<?, ? examples/s]

Map:   0%|          | 0/753 [00:00<?, ? examples/s]

Map:   0%|          | 0/754 [00:00<?, ? examples/s]

In [9]:
emotions_encoded["train"].column_names

['text', 'label', 'input_ids', 'attention_mask']

## Analyzing encoded text


In [10]:
import torch
import torch.nn as nn

from transformers import AutoModel

model_ckpt = base_model_name
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

In [11]:
def extract_hidden_states(batch):
    inputs = {
        k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names
    }
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state

    return {"hidden state": last_hidden_state[:, 0].cpu().numpy()}

In [None]:
emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])
print(type(emotions_encoded["train"]["input_ids"]))

<class 'torch.Tensor'>


In [13]:
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)

: 