<a href="https://colab.research.google.com/github/charnkmr/Resume_App/blob/main/RA_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 1: Install Required Libraries

In [1]:
!pip install transformers datasets peft accelerate rouge-score

Collecting datasets
  Downloading datasets-3.3.0-py3-none-any.whl.metadata (19 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.13.0->peft)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.13.0->peft)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.13.0->peft)
  Downloading nvidi

## Step 2: Load and Preprocess the Data

In [3]:
import json
from datasets import Dataset

# Load the JSON data
with open('/content/resume_dataset.json', 'r') as f:
    data = json.load(f)

# Convert to Hugging Face Dataset
dataset = Dataset.from_dict({
    "resume_text": [item["resume_text"] for item in data],
    "instruction": [item["instruction"] for item in data],
    "feedback": [item["feedback"] for item in data]
})

# Split the dataset into training and evaluation sets (fixed seed for a reproducible split)
dataset = dataset.train_test_split(test_size=0.1, seed=42)
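
Before building the `Dataset`, it can help to confirm that every record carries the three fields that Step 2 indexes into. The block below is a minimal sketch and not part of the original notebook: `validate_records` and `REQUIRED_KEYS` are hypothetical names, and it assumes the JSON file is a list of flat objects.

```python
REQUIRED_KEYS = ("resume_text", "instruction", "feedback")  # fields read in Step 2

def validate_records(records):
    """Return the records unchanged, or raise ValueError naming the first bad one."""
    for index, item in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"record {index} is missing keys: {missing}")
        empty = [key for key in REQUIRED_KEYS if not str(item[key]).strip()]
        if empty:
            raise ValueError(f"record {index} has empty fields: {empty}")
    return records

# Hypothetical in-memory record standing in for the loaded JSON
sample = [{
    "resume_text": "Skills: Python",
    "instruction": "Give feedback.",
    "feedback": "Add projects.",
}]
validate_records(sample)
```

Calling `validate_records(data)` right after `json.load` would surface a malformed record with its index instead of a bare `KeyError` from the list comprehensions.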

## Step 3: Tokenize the Data

In [4]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

def preprocess_function(examples):
    inputs = [f"Instruction: {inst}\nResume: {resume}" for inst, resume in zip(examples['instruction'], examples['resume_text'])]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

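    # Tokenize targets
    labels = tokenizer(examples['feedback'], max_length=128, truncation=True, padding="max_length")
    # Replace padding token ids with -100 so the loss ignores padded label positions
    model_inputs["labels"] = [
        [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
        for seq in labels["input_ids"]
    ]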
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Map:   0%|          | 0/27 [00:00<?, ? examples/s]

Map:   0%|          | 0/3 [00:00<?, ? examples/s]
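
The same prompt layout is built by hand in Steps 3, 6, and 7, so a small shared helper may keep training and inference inputs identical. This is a sketch only: `PROMPT_TEMPLATE`, `build_prompt`, and `build_prompts` are hypothetical names that do not appear in the notebook.

```python
PROMPT_TEMPLATE = "Instruction: {instruction}\nResume: {resume}"

def build_prompt(instruction, resume):
    """Format one example exactly like the f-string in preprocess_function."""
    return PROMPT_TEMPLATE.format(instruction=instruction, resume=resume)

def build_prompts(examples):
    """Batched variant for a dict of columns, as passed by dataset.map(batched=True)."""
    return [
        build_prompt(inst, resume)
        for inst, resume in zip(examples["instruction"], examples["resume_text"])
    ]

# Hypothetical batch with a single row
batch = {"instruction": ["Give feedback."], "resume_text": ["Skills: Python"]}
prompts = build_prompts(batch)
```

With such a helper, a change to the prompt wording would only need to be made in one place.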

## Step 4: Apply LoRA to the Model

In [5]:
from transformers import T5ForConditionalGeneration
from peft import get_peft_model, LoraConfig, TaskType

# Load the base model
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

# Setup LoRA configuration
lora_config = LoraConfig(
    r=15,  # Rank of the LoRA
    lora_alpha=16,  # Scaling factor
    lora_dropout=0.1,  # Dropout rate
    task_type=TaskType.SEQ_2_SEQ_LM
)

# Wrap the model with LoRA
model = get_peft_model(model, lora_config)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]
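
To get a feel for how small the LoRA adapter is, the trainable parameter count can be estimated by hand. The sketch below rests on assumptions that are not stated in the notebook: that PEFT's default LoRA targets for T5 are the `q` and `v` attention projections, and that `flan-t5-base` has 12 encoder and 12 decoder blocks with 768-to-768 projections. Each adapted weight gains two low-rank matrices, `A` (r x d_in) and `B` (d_out x r).

```python
def lora_param_count(r, d_in, d_out, num_modules):
    """Parameters added by LoRA: each module gets A (r x d_in) and B (d_out x r)."""
    return num_modules * r * (d_in + d_out)

r, lora_alpha = 15, 16          # values from the LoraConfig in Step 4
d_model = 768                   # assumed projection size for flan-t5-base

encoder_modules = 12 * 2        # assumed: q and v in each encoder self-attention
decoder_modules = 12 * 4        # assumed: q and v in decoder self- and cross-attention

total = lora_param_count(r, d_model, d_model, encoder_modules + decoder_modules)
scaling = lora_alpha / r        # factor applied to the low-rank update
size_mb = total * 4 / 1e6       # float32 storage, in MB

print(total, round(scaling, 4), round(size_mb, 2))  # 1658880 1.0667 6.64
```

Under these assumptions the estimate is roughly in line with the 6.66M `adapter_model.safetensors` file uploaded in Step 8. Calling `model.print_trainable_parameters()` on the PEFT-wrapped model reports the actual count.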

## Step 5: Fine-Tune the Model

In [6]:
import torch
torch.cuda.empty_cache()

In [24]:
from transformers import Trainer, TrainingArguments

# training_args = TrainingArguments(
#     output_dir="./flan-t5-resume-feedback",
#     evaluation_strategy="epoch",
#     learning_rate=3e-5,
#     per_device_train_batch_size=3,
#     per_device_eval_batch_size=3,
#     num_train_epochs=6,
#     weight_decay=0.01,
#     save_total_limit=2,
#     save_steps=500,
#     logging_dir="./logs",
#     logging_steps=10,
#     max_grad_norm=1.0,
#     push_to_hub=False,
#     hub_model_id="charnkmr/flan-t5-base"
# )

training_args = TrainingArguments(
    output_dir="./flan-t5-resume-feedback",
    evaluation_strategy="epoch",  # Keep evaluation at the end of each epoch
    learning_rate=1e-3,          # Increased learning rate - start here
    per_device_train_batch_size=4,  # Batch size per device (adjust based on GPU memory)
    per_device_eval_batch_size=4,   # Same as train batch size for consistency
    # gradient_accumulation_steps=2,    # Simulate larger batch size if needed
    num_train_epochs=10,           # Slightly increased epochs
    weight_decay=0.01,           # Keep weight decay
    save_total_limit=3,          # Increased save limit
    save_steps=250,              # More frequent saves
    logging_dir="./logs",
    logging_steps=10,
    max_grad_norm=1.0,           # Keep gradient clipping
    # bf16=True, # Try this if fp16 gives nan's
    push_to_hub=False,
    hub_model_id="charnkmr/flan-t5-base"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)

# Start training
trainer.train()


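| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 3.289371 |
| 2 | 3.807900 | 1.956696 |
| 3 | 2.768300 | 1.472143 |
| 4 | 2.768300 | 1.141338 |
| 5 | 2.123600 | 1.029879 |
| 6 | 1.636700 | 0.853387 |
| 7 | 1.636700 | 0.807474 |
| 8 | 1.491700 | 0.722872 |
| 9 | 1.268300 | 0.740964 |
| 10 | 1.162700 | 0.720628 |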


TrainOutput(global_step=70, training_loss=2.037022590637207, metrics={'train_runtime': 40.0373, 'train_samples_per_second': 6.744, 'train_steps_per_second': 1.748, 'total_flos': 186260426588160.0, 'train_loss': 2.037022590637207, 'epoch': 10.0})
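
The step count and the repeated training-loss values in the log can be reproduced with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; the 27 training examples come from the `Map` progress line in Step 3.

```python
import math

train_examples = 27     # size of the training split
batch_size = 4          # per_device_train_batch_size
epochs = 10             # num_train_epochs
logging_steps = 10      # training loss is logged every 10 optimizer steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Step of the most recent training-loss log available at the end of each epoch
last_logged = [
    (epoch * steps_per_epoch) // logging_steps * logging_steps
    for epoch in range(1, epochs + 1)
]

print(steps_per_epoch, total_steps)  # 7 70
print(last_logged)                   # [0, 10, 20, 20, 30, 40, 40, 50, 60, 70]
```

Epoch 1 ends at step 7, before the first log at step 10, which explains the "No log" entry. Epochs 3 and 4 both report the loss logged at step 20, and epochs 6 and 7 the loss logged at step 40, which explains the repeated values. A smaller `logging_steps` would give a distinct training loss for every epoch.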

## Step 6: Evaluate the Model

In [25]:
from rouge_score import rouge_scorer

def evaluate_model(model, tokenizer, examples):
    """Print the average ROUGE-1 and ROUGE-L F-measures of generated vs. reference feedback."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

    results = []
    for example in examples:
        input_text = f"Instruction: {example['instruction']}\nResume: {example['resume_text']}"
        # Move the input to the same device as the model
        input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)

        # Generate feedback (input passed as a keyword argument)
        outputs = model.generate(input_ids=input_ids, max_length=128)
        predicted_feedback = tokenizer.decode(outputs[0], skip_special_tokens=True)
        target_feedback = example['feedback']

        # RougeScorer.score takes (target, prediction)
        results.append(scorer.score(target_feedback, predicted_feedback))

    # Average the F-measures over all examples
    avg_rouge1 = sum(score['rouge1'].fmeasure for score in results) / len(results)
    avg_rougel = sum(score['rougeL'].fmeasure for score in results) / len(results)

    print(f"Average Rouge-1: {avg_rouge1}")
    print(f"Average Rouge-L: {avg_rougel}")

# Note: `data` is the full dataset, so these scores include examples seen during training
evaluate_model(model, tokenizer, data)

Average Rouge-1: 0.3997673180755987
Average Rouge-L: 0.3626635384067414
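
To make the ROUGE-1 number easier to interpret, here is a simplified sketch of the F-measure it reports: the harmonic mean of unigram precision and recall between prediction and reference. It is not the `rouge_score` implementation; it assumes lowercase whitespace tokenization and skips stemming, and `rouge1_f` is a hypothetical name. The two sentences are made-up examples.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Unigram-overlap F-measure with clipped counts (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count per token (clipping)
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "highlight your experience with java and sql"
prediction = "highlight experience with java"
score = rouge1_f(reference, prediction)
print(round(score, 3))  # 0.727
```

Here all 4 predicted words appear in the 7-word reference, so precision is 1.0, recall is 4/7, and the F-measure is 8/11. A short but accurate prediction can therefore still score well below 1.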


## Step 7: Test on Real Data

In [26]:
sample_resume = "Summary: Experienced software engineer with 5+ years of expertise in Python, Java, and cloud technologies. Skills: Python, Java, AWS, SQL, Docker. Experience: Software Engineer at XYZ Corp (2018-2023). Education: B.Tech in Computer Science from ABC University."
sample_instruction = "Provide detailed 5 liner feedback for a back-end developer role."

# Tokenize the sample input with padding and truncation
inputs = tokenizer(
    f"Instruction: {sample_instruction}\nResume: {sample_resume}",
    return_tensors="pt",
    max_length=512,
    truncation=True,
    padding="max_length"
).to(model.device) # Move the input tensors to the same device as the model

# Generate feedback using keyword arguments
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128
)

# Decode the generated feedback
generated_feedback = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated Feedback: {generated_feedback}")

Generated Feedback: Highlight your experience with back-end technologies like Java, SQL, and Docker.


## Step 8: Push the Model to the Hub

In [31]:
from google.colab import userdata
hf_token = userdata.get('hftoken')

In [32]:
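# Push the trained adapter (and the tokenizer passed to the Trainer) to the Hub
trainer.push_to_hub(token=hf_token)
# Note: save_pretrained writes to a local directory with this name; it does not upload
tokenizer.save_pretrained("charnkmr/flan-t5-base")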

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/6.66M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.30k [00:00<?, ?B/s]

('charnkmr/flan-t5-base/tokenizer_config.json',
 'charnkmr/flan-t5-base/special_tokens_map.json',
 'charnkmr/flan-t5-base/spiece.model',
 'charnkmr/flan-t5-base/added_tokens.json')