**torch**

Description: PyTorch is an open-source machine learning library based on the Torch library. It is used for applications such as computer vision and natural language processing.
Purpose: PyTorch provides the building blocks necessary for designing, training, and deploying deep learning models. It's known for its dynamic computational graph and ease of use.

**peft==0.4.0**

Description: PEFT (Parameter-Efficient Fine-Tuning) is a library designed to make fine-tuning large language models more efficient by using parameter-efficient techniques.
Purpose: It allows you to fine-tune large pre-trained models without having to update all the parameters of the model, which saves on computational resources and time.


**bitsandbytes==0.40.2**

Description: Bitsandbytes is a lightweight library for 8-bit optimizers and quantization methods.
Purpose: It reduces the memory footprint and computational requirements of training large models by using 8-bit or 4-bit numerical representations instead of the typical 16-bit or 32-bit ones. This notebook uses it to load the base model in 4-bit precision.
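
To make the idea of low-bit representations concrete, here is a minimal sketch of absmax 8-bit quantization in plain Python. It is a simplified illustration only and is not the algorithm bitsandbytes actually implements; the function names and the sample weights are made up for this example.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    # Fall back to a scale of 1.0 when every value is zero
    scale = (max(abs(v) for v in values) / 127.0) or 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]  # hypothetical weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

print(q)         # small integers, each storable in a single byte
print(restored)  # close to the original weights, but not identical
```

Each weight is stored as one byte plus a shared scale, at the cost of a small rounding error bounded by half the scale.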

**transformers==4.31.0**

Description: The Transformers library by Hugging Face provides thousands of pre-trained models to perform tasks on different modalities such as text, vision, and audio.
Purpose: This library simplifies the use of state-of-the-art transformer models such as GPT-2, BERT, and T5 for natural language processing and other tasks.


**trl==0.4.7**

Description: TRL (Transformer Reinforcement Learning) is a library for training transformer language models with supervised fine-tuning and reinforcement learning.
Purpose: It is used to fine-tune language models, for example to optimize text generation against specific criteria with reinforcement learning. This notebook uses its SFTTrainer class for supervised fine-tuning.

**einops**

Description: Einops (Einstein-inspired notation for operations) is a library that provides a flexible and readable way to perform tensor operations.
Purpose: It simplifies the manipulation and transformation of tensors, making it easier to implement complex operations in deep learning models.

In [1]:
!pip install torch peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 accelerate einops



In [2]:
!pip install tqdm scipy



dataclasses:

Purpose: Provides a decorator and functions for creating data classes in Python. Data classes are a way to automatically generate special methods like __init__() and __repr__() in classes.
Usage Example: @dataclass decorator to define a class with fields.


typing:

Purpose: Provides type hints such as Optional. Type hints improve code readability and enable static type checking, making the codebase more maintainable and less error-prone.
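
As a small illustration of the two modules described above, the sketch below defines a data class with type-hinted fields. The class name and its fields are hypothetical; they only mirror values that appear later in this notebook.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScriptArguments:
    # Hypothetical fields, chosen to mirror values used later in the notebook
    model_name: str = "microsoft/phi-2"
    max_seq_length: int = 690
    # Mutable defaults must be created through default_factory
    target_modules: List[str] = field(default_factory=lambda: ["Wqkv", "fc1", "fc2"])
    output_dir: Optional[str] = None


args = ScriptArguments(output_dir="./results")
print(args)  # __repr__ is generated automatically by @dataclass
```

The decorator generates `__init__()` and `__repr__()` from the annotated fields, so no boilerplate constructor is needed.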

In [3]:
import os
from dataclasses import dataclass, field
from typing import Optional

import torch
from datasets import load_dataset
from datasets import load_from_disk
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
)
from tqdm.notebook import tqdm

from trl import SFTTrainer

In [4]:
from huggingface_hub import interpreter_login

In [5]:
dataset = load_dataset("Amod/mental_health_counseling_conversations", split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
dataset

Dataset({
    features: ['Context', 'Response'],
    num_rows: 3512
})

In [7]:
import pandas as pd
# Convert to DataFrame
df = pd.DataFrame(dataset)

# Display the first few rows of the DataFrame
df.head(2)

Unnamed: 0,Context,Response
0,I'm going through some things with my feelings...,"If everyone thinks you're worthless, then mayb..."
1,I'm going through some things with my feelings...,"Hello, and thank you for your question and see..."


In [8]:
# Function to transform the row into desired format
def format_row(row):
    question = row['Context']
    answer = row['Response']
    formatted_string = f"[INST] {question} [/INST] {answer} "
    return formatted_string

# Apply the function to each row of the dataframe
df['Formatted'] = df.apply(format_row, axis=1)

# Display the formatted column
df['Formatted']

0       [INST] I'm going through some things with my f...
1       [INST] I'm going through some things with my f...
2       [INST] I'm going through some things with my f...
3       [INST] I'm going through some things with my f...
4       [INST] I'm going through some things with my f...
                              ...                        
3507    [INST] My grandson's step-mother sends him to ...
3508    [INST] My boyfriend is in recovery from drug a...
3509    [INST] The birth mother attempted suicide seve...
3510    [INST] I think adult life is making him depres...
3511    [INST] I just took a job that requires me to t...
Name: Formatted, Length: 3512, dtype: object

In [9]:
# Rename the 'Formatted' column to 'Text'
new_df = df.rename(columns={'Formatted': 'Text'})

new_df

Unnamed: 0,Context,Response,Text
0,I'm going through some things with my feelings...,"If everyone thinks you're worthless, then mayb...",[INST] I'm going through some things with my f...
1,I'm going through some things with my feelings...,"Hello, and thank you for your question and see...",[INST] I'm going through some things with my f...
2,I'm going through some things with my feelings...,First thing I'd suggest is getting the sleep y...,[INST] I'm going through some things with my f...
3,I'm going through some things with my feelings...,Therapy is essential for those that are feelin...,[INST] I'm going through some things with my f...
4,I'm going through some things with my feelings...,I first want to let you know that you are not ...,[INST] I'm going through some things with my f...
...,...,...,...
3507,My grandson's step-mother sends him to school ...,Absolutely not! It is never in a child's best ...,[INST] My grandson's step-mother sends him to ...
3508,My boyfriend is in recovery from drug addictio...,I'm sorry you have tension between you and you...,[INST] My boyfriend is in recovery from drug a...
3509,The birth mother attempted suicide several tim...,"The true answer is, ""no one can really say wit...",[INST] The birth mother attempted suicide seve...
3510,I think adult life is making him depressed and...,How do you help yourself to believe you requir...,[INST] I think adult life is making him depres...


In [10]:
new_df['Text']

0       [INST] I'm going through some things with my f...
1       [INST] I'm going through some things with my f...
2       [INST] I'm going through some things with my f...
3       [INST] I'm going through some things with my f...
4       [INST] I'm going through some things with my f...
                              ...                        
3507    [INST] My grandson's step-mother sends him to ...
3508    [INST] My boyfriend is in recovery from drug a...
3509    [INST] The birth mother attempted suicide seve...
3510    [INST] I think adult life is making him depres...
3511    [INST] I just took a job that requires me to t...
Name: Text, Length: 3512, dtype: object

In [11]:
new_df[['Text']]

Unnamed: 0,Text
0,[INST] I'm going through some things with my f...
1,[INST] I'm going through some things with my f...
2,[INST] I'm going through some things with my f...
3,[INST] I'm going through some things with my f...
4,[INST] I'm going through some things with my f...
...,...
3507,[INST] My grandson's step-mother sends him to ...
3508,[INST] My boyfriend is in recovery from drug a...
3509,[INST] The birth mother attempted suicide seve...
3510,[INST] I think adult life is making him depres...


In [12]:
new_df = new_df[['Text']]

In [13]:
new_df.head(3)

Unnamed: 0,Text
0,[INST] I'm going through some things with my f...
1,[INST] I'm going through some things with my f...
2,[INST] I'm going through some things with my f...


In [14]:
# If you want to save the new dataframe to a CSV file:
new_df.to_csv('formatted_data.csv', index=False)

In [15]:
final_df = pd.read_csv("formatted_data.csv")

In [16]:
final_df.head(2)

Unnamed: 0,Text
0,[INST] I'm going through some things with my f...
1,[INST] I'm going through some things with my f...


In [17]:
training_dataset = load_dataset("csv", data_files="formatted_data.csv", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [18]:
training_dataset

Dataset({
    features: ['Text'],
    num_rows: 3512
})

trust_remote_code=True:

Some models on the Hugging Face Hub ship custom Python code that defines how the model is loaded, initialized, or used. Setting trust_remote_code=True allows this code to be downloaded and executed on your machine, so enable it only for repositories you trust. It is needed for models that:

- Have non-standard architectures.
- Require specific preprocessing or postprocessing steps.
- Include custom layers or operations.

How It Works

When you load a model with trust_remote_code=True, the Transformers library fetches not only the model weights but also any custom Python code from the model's repository. This code is then executed to initialize and configure the model.

flash_attn=True:

Type: bool
Description: Enables FlashAttention, a faster and more memory-efficient implementation of the attention mechanism. This flag and the next two are not arguments of from_pretrained itself; they are forwarded to the model's custom configuration and rely on the optional flash-attn package.

flash_rotary=True:

Type: bool
Description: Uses a fused rotary-embedding kernel for the attention layers. Rotary embeddings encode token positions by rotating the query and key vectors; this flag selects a faster implementation of them rather than turning them on.

fused_dense=True:

Type: bool
Description: Enables fused dense layers, which combine a matrix multiplication with the operations that follow it (such as the bias addition) into a single kernel for better performance.

low_cpu_mem_usage=True:

Type: bool
Description: Reduces CPU memory usage during model loading. This can be particularly useful when working with large models on machines with limited CPU memory.

revision="refs/pr/23":

Type: str
Description: Specifies a particular revision of the model to load. This can be useful if you want to use a specific version of the model from its Git repository.

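model.config.use_cache = False: Disables the key/value cache. The cache only speeds up text generation and is not needed during training; turning it off saves memory and avoids a conflict with gradient checkpointing.

model.config.pretraining_tp = 1: Sets the tensor parallelism degree recorded in the model configuration to 1. This field comes from Llama-style configurations; architectures that do not read it are unaffected by the assignment.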


Tensor Parallelism
Tensor parallelism is a technique used to distribute the computation of a single layer of a neural network across multiple devices (such as GPUs). This allows for more efficient utilization of hardware resources, especially for very large models that might not fit into the memory of a single device.

Setting pretraining_tp = 1

When you set model.config.pretraining_tp = 1, you specify a tensor parallelism degree of 1, which means no splitting: the computation for each layer is carried out as a single unit using the ordinary, unsliced weight matrices.
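
The idea behind tensor parallelism can be sketched without any GPU. The toy example below splits the columns of a weight matrix into two shards, as if each shard lived on its own device, and shows that concatenating the partial results reproduces the unsplit computation. The helper names and the numbers are illustrative assumptions, not part of any library.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split the columns of w into equal shards, one per simulated device."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)                      # degree 1: the whole layer at once
shards = split_columns(w, parts=2)       # degree 2: two column shards
partial_outputs = [matmul(x, shard) for shard in shards]
combined = [value for part in partial_outputs for value in part]

print(full)
print(combined)
```

With a degree of 1 there is a single shard, so the sharded path and the plain matrix multiplication are the same computation.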

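prepare_model_for_kbit_training: Prepares a quantized (8-bit or 4-bit) model for training. It freezes the base model's weights, casts the remaining half-precision parameters (such as layer norms) to 32-bit floats for numerical stability, and enables gradient checkpointing when requested.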

use_gradient_checkpointing=True: Enables gradient checkpointing, which saves memory by trading off computation. It allows storing fewer intermediate results and recomputing them during backpropagation.

**Gradient Accumulation**

In deep learning, training a model typically involves the following steps in each iteration:

1. Forward Pass: Compute the output (predictions) from the input data.
2. Loss Calculation: Compute the loss, which measures the difference between the predicted output and the actual target.
3. Backward Pass (Backpropagation): Compute the gradients of the loss with respect to the model parameters.
4. Optimizer Step: Update the model parameters using the computed gradients.

Usually, these steps are performed for each batch of data. However, there are situations where the batch size needs to be increased beyond what can fit into the memory of a single GPU. This is where gradient accumulation comes into play.

Gradient Accumulation Steps

Gradient accumulation allows you to simulate a larger batch size by accumulating gradients over several mini-batches and only then performing an optimizer step. This is especially useful when the available hardware (e.g., GPUs) cannot handle a large batch size due to memory constraints.



Practical Example
Scenario
You are training a language model, and your GPU can only handle a batch size of 2 due to memory limitations. However, for stable training and better performance, you want to use an effective batch size of 64.

Solution
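By setting gradient_accumulation_steps=32, you accumulate gradients over 32 mini-batches, each with a batch size of 2. After 32 mini-batches, the optimizer updates the model parameters, giving an effective batch size of 2 × 32 = 64. The training cell in this notebook uses per_device_train_batch_size=1 with the same 32 accumulation steps, so its effective batch size is 32.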
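
The following sketch checks the claim numerically on a toy problem: averaging the gradients of equally sized mini-batches gives the same gradient as one large batch. The one-parameter model, the data points, and the helper name are assumptions made up for this illustration.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]  # toy (x, y) pairs
w = 0.5
micro_batch_size = 2
accumulation_steps = len(data) // micro_batch_size

accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    # Scale each mini-batch gradient so the sum equals the full-batch mean
    accumulated += mse_grad(w, micro_batch) / accumulation_steps

full_batch_grad = mse_grad(w, data)
effective_batch_size = micro_batch_size * accumulation_steps

print(accumulated, full_batch_grad, effective_batch_size)
```

Only after the loop finishes would the optimizer apply `accumulated` to the weights, which is why memory use stays at the mini-batch level.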





evaluation_strategy="steps":

Evaluation is triggered every fixed number of training steps rather than at the end of each epoch.

eval_steps=2000:

Evaluation runs every 2000 optimizer steps on the evaluation dataset. Note that the trainer in this notebook is not given an evaluation dataset, and with about 3,500 examples and an effective batch size of 32 the whole run takes only a few hundred optimizer steps, so this threshold (like save_steps=2000) is never reached.

optim="paged_adamw_8bit": Uses the AdamW optimizer from bitsandbytes, which stores optimizer states in 8 bits and pages them between GPU and CPU memory when GPU memory runs short. This lowers the memory needed for training.

learning_rate=2e-4:

The value 2e-4 means $2 \times 10^{-4}$, which is equal to 0.0002. This is a small learning rate.


warmup_ratio=0.05:

The learning rate is increased linearly from 0 to its target value over the first 5% of training steps, after which the scheduler (here cosine) takes over.

Stabilization: During the initial stages of training, a learning rate that is too high can lead to unstable updates, causing the model's parameters to oscillate or diverge.
Enhanced Training: Warmup increases the learning rate gradually, allowing the model to move smoothly from cautious early updates to full-size ones.
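
A minimal sketch of such a schedule, linear warmup followed by cosine decay to zero, is shown below. It is written from the description above rather than copied from the Transformers scheduler, and the total step count is a hypothetical value chosen for readability.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.05):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: progress runs from 0.0 to 1.0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical run length
for step in (0, 25, 50, 525, 1000):
    print(step, lr_at_step(step, total_steps))
```

The printed values rise during the first 5% of steps, peak near the base learning rate, and then fall smoothly toward zero.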


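weight_decay=0.01:

Control Overfitting: Weight decay shrinks the weights slightly toward zero at every update, which discourages large weights and helps prevent overfitting. With AdamW the decay is applied directly to the weights (decoupled weight decay) rather than added to the loss as an L2 penalty.
Simplification: It encourages the model to prefer smaller weights, which can lead to simpler models that generalize better on unseen data.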

max_steps=-1: Maximum number of training steps. A value of -1 disables this limit, so the length of training is determined by num_train_epochs.

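r:

Description: The rank of the LoRA update matrices. LoRA keeps each targeted weight matrix frozen and learns a low-rank update to it, expressed as the product of two small matrices whose inner dimension is r.
Usage: It controls the capacity of the adapter and the number of trainable parameters. Higher values of r give the adapter more expressive power at the cost of more memory and computation. The related lora_alpha parameter scales the update by lora_alpha / r, and lora_dropout applies dropout to the adapter's input.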
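
To show what the rank means in practice, here is a toy sketch of a LoRA forward pass, computing W x + (alpha / r) · B (A x) with plain Python lists. The matrices and values are invented for illustration; real adapters operate on tensors with thousands of dimensions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Frozen base output plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: 3 inputs, 2 outputs, rank r = 1
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]        # frozen pre-trained weight (2 x 3)
A = [[0.5, -0.5, 1.0]]       # trainable, r x in_features
B_zero = [[0.0], [0.0]]      # trainable, out_features x r, starts at zero
B_trained = [[1.0], [2.0]]   # hypothetical values after some training

x = [1.0, 2.0, 3.0]
before = lora_forward(x, W, A, B_zero, alpha=2, r=1)
after = lora_forward(x, W, A, B_trained, alpha=2, r=1)

print(before)  # identical to the base model's output
print(after)   # shifted by the learned low-rank update
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and training only has to learn the difference.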

Role of Bias in Neural Networks
Adjustment of Activation: Bias terms allow each neuron to shift the activation function left or right, providing an additional degree of freedom in the model.
Improved Fit: By adding a bias term, the model can better fit the training data, especially when the data is not centered around zero.
Flexibility: Bias terms can help the model learn more complex patterns by enabling the neurons to activate even when the input is zero.


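bias="none": Controls which bias terms are trained alongside the LoRA matrices. With "none", no bias parameters are updated and all biases stay frozen at their pre-trained values, which keeps the number of trainable parameters small. The other options are "all" and "lora_only".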

task_type:

Description: Defines the type of task the model is designed for.
Usage: task_type="CAUSAL_LM" indicates that the model is configured for a causal language modeling task. Causal language models predict the next token in a sequence given the preceding tokens, which is useful for tasks like text generation.


target_modules Parameter
Description: The target_modules parameter specifies which modules (layers) of the model receive LoRA adapters. The names must match module names in the loaded model; the ones below come from the custom phi-2 model code loaded in this notebook.


- Wqkv (Weight Query, Key, Value) Module:

Purpose: This module belongs to the attention mechanism. It holds the fused projection that produces the query, key, and value vectors used in the self-attention computation.
Modification: Adapting this module fine-tunes the attention mechanism, potentially improving how the model attends to different parts of the input sequence.

- fc1 (Fully Connected Layer 1):

Purpose: This is the first fully connected layer of the feed-forward network (MLP) in each transformer block. It expands the hidden representation, and its output is passed through an activation function.
Modification: Adapting this layer changes how hidden representations are transformed, enhancing the model's ability to learn complex patterns.

- fc2 (Fully Connected Layer 2):

Purpose: This is the second fully connected layer of the same feed-forward network. It projects the activated output of fc1 back to the model's hidden size.
Modification: Adapting this layer fine-tunes the final transformation applied within each block, potentially improving the model's overall performance.
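
The number of parameters that LoRA adds for these target modules can be estimated with a short calculation. The sketch below assumes phi-2 has a hidden size of 2560, 32 transformer blocks, and a feed-forward width of four times the hidden size; these dimensions are not stated in this notebook and should be checked against the model configuration.

```python
def lora_param_count(in_features, out_features, r):
    """A has shape (r, in_features) and B has shape (out_features, r)."""
    return r * (in_features + out_features)


# Assumed phi-2 dimensions (not taken from this notebook)
hidden = 2560
n_layers = 32
r = 32

shapes = {
    "Wqkv": (hidden, 3 * hidden),   # fused query/key/value projection
    "fc1": (hidden, 4 * hidden),    # MLP expansion
    "fc2": (4 * hidden, hidden),    # MLP projection back to hidden size
}

per_block = sum(lora_param_count(i, o, r) for i, o in shapes.values())
total = per_block * n_layers

# Adding the attention output projection as a fourth target module
with_out_proj = total + lora_param_count(hidden, hidden, r) * n_layers

print(f"{total:,} trainable LoRA parameters")
print(f"{with_out_proj:,} with out_proj included")
```

Under these assumptions the four-module variant comes to roughly 41.9 million parameters, which is in line with the "41M params" comment next to target_modules in the training cell.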

In [19]:
base_model = "microsoft/phi-2"
new_model = "phi-2-mental-health"

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    # use_flash_attention_2=True, # Phi does not support yet.
    trust_remote_code=True,
    flash_attn=True,
    flash_rotary=True,
    fused_dense=True,
    low_cpu_mem_usage=True,
    device_map={"": 0},
    revision="refs/pr/23",

)

model.config.use_cache = False
model.config.pretraining_tp = 1

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    evaluation_strategy="steps",
    eval_steps=2000,
    logging_steps=15,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    save_steps=2000,
    warmup_ratio=0.05,
    weight_decay=0.01,
    max_steps=-1,
    fp16=True,  # Enable mixed precision training
)

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules= ["Wqkv", "fc1", "fc2" ] # ["Wqkv", "out_proj", "fc1", "fc2" ], - 41M params
    # modules_to_save=["embed_tokens","lm_head"]
)

trainer = SFTTrainer(
    model=model,
    train_dataset=training_dataset,
    peft_config=peft_config,
    dataset_text_field="Text",
    max_seq_length=690,
    tokenizer=tokenizer,
    args=training_arguments,

)



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Map:   0%|          | 0/3512 [00:00<?, ? examples/s]

In [1]:
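trainer.train()
trainer.model.save_pretrained(new_model)  # save the LoRA adapter so the merge cell below can reload it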

NameError: name 'trainer' is not defined

In [None]:
from transformers import pipeline

In [None]:
# Run text generation pipeline with our new model
prompt = "I am not able to sleep in night. Do you have any suggestions?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=250)
result = pipe(f"[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

In [None]:
del model
del pipe
del trainer
import gc
gc.collect()
torch.cuda.empty_cache()  # release cached GPU memory held by PyTorch

In [None]:
model_name = "microsoft/phi-2"

In [None]:
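# Reload model in FP16 and merge it with LoRA weights
from peft import PeftModel  # not imported in the earlier import cell

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},  # `device_map` was undefined; reuse the single-GPU placement from training
    trust_remote_code=True,  # load the same custom model code as in training
    revision="refs/pr/23",  # same revision, so module names (Wqkv, fc1, fc2) match the adapter
)
# Load the adapter saved under `new_model` and fold it into the base weights
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()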

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"