This notebook shows how to:

*   Fine-tune Phi-3 mini with QLoRA and LoRA
*   Quantize Phi-3 mini with bitsandbytes and GPTQ
*   Run Phi-3 mini with Transformers

Each section of this notebook can be run independently.

Details and comments here: [Phi-3: Fine-tuning and Quantization on Your Computer](https://kaitchup.substack.com/p/phi-3-fine-tuning-and-quantization)


# Inference

With Hugging Face's Transformers (16-bit version)

In [None]:
!pip install -qqq accelerate transformers auto-gptq optimum


In [None]:
import os
from huggingface_hub import login

# Read the access token from the environment instead of hard-coding it in the notebook
hf_token = os.environ["HF_TOKEN"]
login(hf_token)


In [None]:
import os
import wandb

# Read the Weights & Biases API key from the environment instead of hard-coding it
wb_token = os.environ["WANDB_API_KEY"]

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B on Medical COT Dataset',
    job_type="training",
    anonymous="allow"
)

In [None]:

import os

# Kaggle credentials are read from the environment (KAGGLE_USERNAME / KAGGLE_KEY).
# Do not hard-code them. The kaggle package authenticates when it is imported,
# so both variables must be set before the import below.
for var in ("KAGGLE_USERNAME", "KAGGLE_KEY"):
    if var not in os.environ:
        raise RuntimeError(f"Set {var} before running this cell")

from kaggle import api  # import the already authenticated API client

api.dataset_download_files('netflix-inc/netflix-prize-data', path='data', unzip=True)


Using the original model (16-bit version)

It requires 7.4 GB of GPU RAM

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

set_seed(1234)  # For reproducibility

prompt = "Artificial intelligence is"

checkpoint = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint,trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, torch_dtype="auto", device_map="cuda")

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=150)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)


Artificial intelligence is an extremely powerful tool to aid in many fields, but it is also necessary to proceed responsibly. In the financial industry, AI can be used to automate processes, provide personalized customer experiences, and detect fraudulent activities. However, it is crucial to ensure that AI systems are transparent, unbiased, and adhere to ethical guidelines. 


With Hugging Face's Transformers with the model quantized with GPTQ 4-bit

It requires 2.7 GB of GPU RAM.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

set_seed(1234)  # For reproducibility

prompt = "Artificial intelligence is"

checkpoint = "kaitchup/Phi-3-mini-4k-instruct-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, device_map='cuda')

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=150)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)


Artificial intelligence is an extremely powerful tool to aid in many fields, but it is likely at least 10 years away from actually doing most of the cognitive heavy lifting for scientists. But it can perform some simple, tedious tasks very well, such as cataloging and annotating datasets. Artificiar intelligence is already being used to catalog and annotate biological datasets at the University of California, San Francisco. These datasets can then be accessed and used by scientists.

== Data collection and annotation process ==
Each dataset (or "dataset card") is an Excel spreadsheet file that contains various sheets. Each sheet (or "page") represents one image or measurement in the dataset, in turn, as the image or measurements are captured
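The GPU memory figures quoted in this section (7.4 GB for 16-bit, 2.7 GB for GPTQ 4-bit) can be sanity-checked with a back-of-the-envelope estimate of the weights alone. The sketch below assumes roughly 3.82 billion parameters for Phi-3 mini and reports GiB. Both the parameter count and the unit are assumptions, since the notebook does not state them. Measured usage is higher than this lower bound because of activations, the KV cache, CUDA overhead, and layers that quantized checkpoints usually keep in higher precision.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB (2**30 bytes)."""
    return num_params * bits_per_weight / 8 / 2**30

# Assumption: approximate parameter count of Phi-3 mini
PHI3_MINI_PARAMS = 3.82e9

for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(PHI3_MINI_PARAMS, bits):.2f} GiB")
```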


# Quantization

Bitsandbytes NF4

In [None]:
!pip install -qqq --upgrade transformers bitsandbytes accelerate datasets

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
else:
  compute_dtype = torch.float16

model_name = "microsoft/Phi-3-mini-4k-instruct"
quant_path = 'Phi-3-mini-4k-instruct-bnb-4bit'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, trust_remote_code=True
)


model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)


('./Phi-3-mini-4k-instruct-bnb-4bit/tokenizer_config.json',
 './Phi-3-mini-4k-instruct-bnb-4bit/special_tokens_map.json',
 './Phi-3-mini-4k-instruct-bnb-4bit/tokenizer.model',
 './Phi-3-mini-4k-instruct-bnb-4bit/added_tokens.json',
 './Phi-3-mini-4k-instruct-bnb-4bit/tokenizer.json')
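
To give an intuition for what `bnb_4bit_quant_type="nf4"` does, here is a simplified pure-Python sketch of block-wise 4-bit quantization with a codebook placed at quantiles of a normal distribution. It is an illustration only: the codebook built by the hypothetical `normal_codebook` helper is not the exact set of NF4 constants used by bitsandbytes, and double quantization (compressing the per-block `absmax` values themselves) is left out.

```python
from statistics import NormalDist

def normal_codebook(bits=4):
    """2**bits values at evenly spaced normal quantiles, rescaled to [-1, 1]."""
    n = 2 ** bits
    nd = NormalDist()
    # Midpoint quantiles avoid the infinite tails at p=0 and p=1
    values = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]

def quantize_block(block, codebook):
    """Scale one block by its absmax, then store the index of the nearest codebook value."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero for an all-zero block
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / absmax))
        for w in block
    ]
    return indices, absmax

def dequantize_block(indices, absmax, codebook):
    """Look up each index in the codebook and undo the absmax scaling."""
    return [codebook[i] * absmax for i in indices]

codebook = normal_codebook(bits=4)
block = [0.02, -0.15, 0.40, -0.03, 0.11, 0.00, -0.27, 0.08]
indices, absmax = quantize_block(block, codebook)
restored = dequantize_block(indices, absmax, codebook)
print(indices)                          # one 4-bit index per weight
print([round(x, 3) for x in restored])  # approximate reconstruction
```

Each weight is stored as a 4-bit index plus one shared scale per block, which is where the memory saving comes from.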

GPTQ

More details about the GPTQ quantization in this article:

[Quantize and Fine-tune LLMs with GPTQ Using Transformers and TRL](https://kaitchup.substack.com/p/quantize-and-fine-tune-llms-with)


In [None]:
!pip install -qqq --upgrade transformers auto-gptq accelerate datasets
!python -m pip install -qqq git+https://github.com/huggingface/optimum.git


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch
model_path = 'microsoft/Phi-3-mini-4k-instruct'
w = 4 #quantization to 4-bit. Change to 2, 3, or 8 to quantize with another precision

quant_path = 'Phi-3-mini-4k-instruct-gptq-'+str(w)+'bit'

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
quantizer = GPTQQuantizer(bits=w, dataset="c4", model_seqlen = 2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

quantized_model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)

('./Phi-3-mini-4k-instruct-gptq-4bit/tokenizer_config.json',
 './Phi-3-mini-4k-instruct-gptq-4bit/special_tokens_map.json',
 './Phi-3-mini-4k-instruct-gptq-4bit/tokenizer.model',
 './Phi-3-mini-4k-instruct-gptq-4bit/added_tokens.json',
 './Phi-3-mini-4k-instruct-gptq-4bit/tokenizer.json')
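
The `w` parameter above sets the number of bits per weight. As a rough illustration of how precision trades off against reconstruction error, the sketch below applies plain round-to-nearest (RTN) quantization to one small group of weights at 2, 3, 4, and 8 bits. This is not GPTQ: GPTQ additionally uses calibration data and approximate second-order information to compensate for the rounding error, which this sketch does not attempt. The helper names and the sample weights are made up for the example.

```python
def quantize_rtn(weights, bits):
    """Asymmetric round-to-nearest quantization of one group of weights to `bits` bits."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 if all weights are equal
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize_rtn(q, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale + lo for v in q]

group = [-0.8, -0.1, 0.0, 0.3, 0.7, 1.2]
for bits in (2, 3, 4, 8):
    q, scale, lo = quantize_rtn(group, bits)
    restored = dequantize_rtn(q, scale, lo)
    max_err = max(abs(a - b) for a, b in zip(group, restored))
    # With RTN the error per weight is at most half a quantization step
    print(f"{bits}-bit: max error {max_err:.4f} (bound {scale / 2:.4f})")
```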

AWQ

**Not supported yet**

More details about the AWQ quantization in this article:

[Fast and Small Llama 2 with Activation-Aware Quantization (AWQ)](https://kaitchup.substack.com/p/fast-and-small-llama-2-with-activation)


In [None]:
!pip install -qqq --upgrade transformers autoawq optimum accelerate

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'microsoft/Phi-3-mini-4k-instruct'
quant_path = 'Phi-3-mini-4k-instruct-awq-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model with safetensors
model.save_quantized("./"+quant_path, safetensors=True)
tokenizer.save_pretrained("./"+quant_path)


TypeError: phi3 isn't supported yet.

# Fine-tuning

QLoRA

More details about QLoRA fine-tuning in this article:

[QLoRa: Fine-Tune a Large Language Model on Your GPU](https://kaitchup.substack.com/p/qlora-fine-tune-a-large-language-model-on-your-gpu-27bed5a03e2b)

In [None]:
!pip install -qqq --upgrade bitsandbytes transformers peft accelerate datasets trl flash_attn

It requires 7.7 GB of GPU RAM.

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from trl import SFTConfig, SFTTrainer

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

model_name = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'

ds = load_dataset("timdettmers/openassistant-guanaco")

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=compute_dtype, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
)

model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)


from trl import SFTConfig

training_arguments = SFTConfig(
        output_dir="./Phi-3_QLoRA",
        evaluation_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        per_device_eval_batch_size=4,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=100,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=100,
        num_train_epochs=3,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 921
  Number of trainable parameters = 8,912,896


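| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 1.3754 | 1.302107 |
| 200 | 1.1972 | 1.281482 |
| 300 | 1.1801 | 1.27474 |
| 400 | 1.1802 | 1.271885 |
| 500 | 1.1834 | 1.269624 |
| 600 | 1.1684 | 1.267865 |
| 700 | 1.1781 | 1.267419 |
| 800 | 1.17 | 1.266805 |
| 900 | 1.1515 | 1.266451 |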
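
The training arguments combine `warmup_ratio=0.1` with `lr_scheduler_type="linear"`, so the learning rate ramps up linearly to `1e-4` and then decays linearly to zero. The sketch below reproduces that shape in plain Python for the 921 optimization steps of this run. It is a minimal approximation of the usual linear-warmup schedule, assuming warmup steps are computed as the ceiling of `total_steps * warmup_ratio`; the function name is hypothetical.

```python
import math

def linear_schedule_lr(step, total_steps, warmup_ratio, peak_lr):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)

# Values from the configuration above: 921 steps, 10% warmup, peak LR 1e-4
for step in (0, 50, 93, 500, 921):
    print(step, f"{linear_schedule_lr(step, 921, 0.1, 1e-4):.2e}")
```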



TrainOutput(global_step=921, training_loss=1.1977755477192866, metrics={'train_runtime': 17993.5639, 'train_samples_per_second': 1.642, 'train_steps_per_second': 0.051, 'total_flos': 3.2142890105554944e+17, 'train_loss': 1.1977755477192866, 'epoch': 2.992688870836718})
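
The log reports 8,912,896 trainable parameters. That number is worth checking against the `LoraConfig`, because it is much smaller than adapting all seven listed modules would give. The arithmetic below assumes Phi-3 mini's dimensions (hidden size 3072, intermediate size 8192, 32 decoder layers) and shows that the count matches adapters on `o_proj` and `down_proj` only. The likely reason is that Phi-3 fuses the query/key/value projections into `qkv_proj` and the gate/up projections into `gate_up_proj`, so the names `q_proj`, `k_proj`, `v_proj`, `gate_proj`, and `up_proj` match no module. Adding the fused names to `target_modules` would probably adapt those layers too, but that would change the results reported here.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) to each adapted linear layer."""
    return r * (in_features + out_features)

# Assumed Phi-3 mini dimensions and the LoRA rank from the config above
HIDDEN, INTERMEDIATE, LAYERS, R = 3072, 8192, 32, 16

o_proj = lora_params(HIDDEN, HIDDEN, R)           # attention output projection
down_proj = lora_params(INTERMEDIATE, HIDDEN, R)  # MLP down projection
per_layer = o_proj + down_proj
total = per_layer * LAYERS

print(f"per layer: {per_layer:,}  total: {total:,}")
```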

LoRA

It requires 14.5 GB of GPU RAM

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from trl import SFTConfig, SFTTrainer

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

model_name = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'

ds = load_dataset("timdettmers/openassistant-guanaco")

model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=compute_dtype, trust_remote_code=True, device_map={"": 0}, attn_implementation=attn_implementation
)

model.gradient_checkpointing_enable()

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)


from trl import SFTConfig

training_arguments = SFTConfig(
        output_dir="./Phi-3_LoRA",
        evaluation_strategy="steps",
        do_eval=True,
        optim="adamw_torch",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=8,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=100,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=100,
        num_train_epochs=3,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

Using auto half precision backend
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 9,846
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 921
  Number of trainable parameters = 8,912,896


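| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 1.3277 | 1.254653 |
| 200 | 1.1527 | 1.239603 |
| 300 | 1.1416 | 1.233861 |
| 400 | 1.1419 | 1.231433 |
| 500 | 1.1464 | 1.229961 |
| 600 | 1.1373 | 1.22849 |
| 700 | 1.1477 | 1.228517 |
| 800 | 1.1357 | 1.227974 |
| 900 | 1.1182 | 1.227681 |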



TrainOutput(global_step=921, training_loss=1.1607017910571102, metrics={'train_runtime': 11649.9432, 'train_samples_per_second': 2.535, 'train_steps_per_second': 0.079, 'total_flos': 3.35642840068608e+17, 'train_loss': 1.1607017910571102, 'epoch': 2.992688870836718})
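
Both runs report 921 optimization steps and a final epoch of about 2.9927 rather than exactly 3, even though they use different per-device batch sizes. The sketch below reconstructs those numbers from the 9,846 training examples. It assumes the trainer rounds the number of batches per epoch up and the number of optimizer updates per epoch down, which is consistent with the logged values; the helper name is hypothetical.

```python
import math

def optimization_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Return (batches per epoch, optimizer updates per epoch, total optimizer updates)."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return batches_per_epoch, updates_per_epoch, math.ceil(epochs * updates_per_epoch)

for name, batch, accum in (("QLoRA", 4, 8), ("LoRA", 8, 4)):
    batches, updates, total = optimization_steps(9846, batch, accum, 3)
    # Fraction of the data actually seen when training stops after `total` updates
    final_epoch = total * accum / batches
    print(f"{name}: {batches} batches/epoch, {updates} updates/epoch, "
          f"{total} total steps, final epoch {final_epoch:.4f}")
```

The leftover batches at the end of each epoch do not fill a complete accumulation window, which is why the final epoch falls slightly short of 3.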

In [None]:
!pip install gradio langchain_huggingface langchain langchain_community transformers

In [None]:
from langchain_huggingface.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
# from langchain.chains import LLMChain
import re


def qa_with_context(model, context, question):
    # Wrap the Transformers pipeline so it can be used in a LangChain chain
    hf = HuggingFacePipeline(pipeline=model)
    print("context:", context)
    print("question:", question)

    # Prompt template for QA with context. This is a plain string, not an f-string:
    # PromptTemplate fills in {context} and {question} when the chain is invoked.
    template = """
    Here is the context:
    {context}

    Based on the above context, provide an answer to the following question:
    {question}

    Answer:
    """

    final_answer = ""
    try:
        prompt = PromptTemplate.from_template(template)

        chain = prompt | hf
        result = chain.invoke({"question": question, "context": context})

        # Extract the text that follows "Answer:" in the generated output
        answer = re.search(r"Answer:\s*(.*)", result)
        if answer:
            # Strip punctuation and surrounding whitespace from the answer
            final_answer = re.sub(r"[^\w\s]", "", answer.group(1)).strip()
            print(final_answer)
        else:
            print("No answer found.")
    except Exception as e:
        print(e)

    return final_answer


def qa_without_context(model, question):
    # Wrap the Transformers pipeline so it can be used in a LangChain chain
    hf = HuggingFacePipeline(pipeline=model)
    print("question:", question)

    # Prompt template for QA without context. This is a plain string, not an f-string:
    # PromptTemplate fills in {question} when the chain is invoked.
    template = """
    Please answer the following question to the best of your ability:
    Question:
    {question}

    Answer:
    """

    final_answer = ""
    try:
        # Build a PromptTemplate that takes only the question
        prompt = PromptTemplate.from_template(template)

        chain = prompt | hf
        # Run the model to answer the question
        result = chain.invoke({"question": question})

        # Extract the text that follows "Answer:" in the generated output
        answer = re.search(r"Answer:\s*(.*)", result)
        if answer:
            # Strip punctuation and surrounding whitespace from the answer
            final_answer = re.sub(r"[^\w\s]", "", answer.group(1)).strip()
            print(final_answer)
        else:
            print("No answer found.")
    except Exception as e:
        print(e)

    return final_answer

def text_classification(model, context, categories):
    hf = HuggingFacePipeline(pipeline=model)


    # Tạo prompt cho task phân loại văn bản
    template = f"""Classify the following text into one of the following categories:
    {categories}

    The text is: {context}

    Classification:
    """

    final_answer = ""
    try:
        # Create the prompt from the template, using the text to classify
        # classification_prompt_template = PromptTemplate(template=classification_prompt, input_variables=["context"])

        # # Create an LLMChain for the text classification task
        # classification_chain = LLMChain(llm=llm, prompt=classification_prompt_template)

        # # Run the model on the text to classify
        # result = classification_chain.run({"context": context})
        prompt = PromptTemplate.from_template(template)

        chain = prompt | hf
        result = chain.invoke({"context": context,"categories":categories})

        # Use a regex to extract the Classification part
        classification = re.search(r"Classification:\s*(.*)", result)

        if classification:
            final_answer = re.sub(r"[^\w\s]", "", classification.group(1)).strip()
            print("Classification:", final_answer)
        else:
            print("No classification found.")

    except Exception as e:
        print(e)

    return final_answer

def text_summarization(model, context):
    hf = HuggingFacePipeline(pipeline=model)

    # Build the prompt for the text summarization task.
    # Plain string (not an f-string) so PromptTemplate fills {context} at invoke time.
    template = """
    Summarize the following text into a single, concise paragraph focusing on the key ideas and important points:

    Text:
    {context}

    Summary:
    """

    final_summary = ""
    try:
        # Create the prompt from the template, using the text to summarize
        # summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["context"])

        # # Create an LLMChain for the text summarization task
        # summary_chain = LLMChain(llm=llm, prompt=summary_prompt_template)

        # # Run the model on the text to summarize
        # result = summary_chain.run({"context": context})

        prompt = PromptTemplate.from_template(template)

        chain = prompt | hf

        result =  chain.invoke({"context": context})
        print(result)

        # Use a regex to extract the Summary part
        summary = re.search(r"Summary:\s*(.+)", result, re.DOTALL)

        if summary:
            final_summary = re.sub(r"[^\w\s.,!?]", "", summary.group(1)).strip()
            print("Summary:", final_summary)
        else:
            print("No summary found.")

    except Exception as e:
        print(e)

    return final_summary

def text_ner(model, context, categories):
    hf = HuggingFacePipeline(pipeline=model)

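    # Build the prompt for the NER task.
    # Plain string (not an f-string) so PromptTemplate fills {context} and {categories} at invoke time.
    template = """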
    The text is: {context}

    Extract all named entities from the context and classify them into the categories:
    {categories}

    Named Entities-classification:
    """

    final_entities = ""
    try:
        # Create the prompt from the template, using the text to analyze for NER
        # ner_prompt_template = PromptTemplate(template=ner_prompt, input_variables=["context"])

        # # Create an LLMChain for the NER task
        # ner_chain = LLMChain(llm=llm, prompt=ner_prompt_template)

        # # Run the model on the text to analyze
        # result = ner_chain.run({"context": context})

        # print("Raw Result:\n", result)
        prompt = PromptTemplate.from_template(template)

        chain = prompt | hf

        result = chain.invoke({"context": context,"categories":categories})
        print(result)

        # Extract the "Named Entities-classification:" part of the output
        ner_classification = re.search(r"Named Entities-classification:\s*(.*)", result, re.DOTALL)

        if ner_classification:
            # Keep the extracted entity text as a single string (per-line parsing is disabled below)
            final_entities = ner_classification.group(1).strip()

            # entities = entities_text.split("\n")

            # # Iterate over the entities and convert them to the desired format
            # for entity in entities:
            #     match = re.match(r"\d+\.\s*(\w+):\s*(.*)", entity.strip())
            #     if match:
            #         entity_type = match.group(1).upper()  # Entity type (Person, Location, Organization)
            #         entity_value = match.group(2).strip()  # Entity value

            #         # If the value lists several locations, split them
            #         if entity_type == 'LOCATION' and ',' in entity_value:
            #             # Split the value on commas
            #             location_values = [val.strip() for val in entity_value.split(',')]
            #             # Add each part to final_entities as a separate object
            #             for location in location_values:
            #                 final_entities.append({"type": "LOCATION", "value": location})
            #         else:
            #             # Add the entity as-is if it is not a LOCATION or has no comma
            #             final_entities.append({"type": entity_type, "value": entity_value})

    except Exception as e:
        print(f"Error: {e}")

    return final_entities

def chatbot_with_history(model, conversation_history, user_input):
    """
    Chatbot function that incorporates conversational history for more context-aware responses.

    Args:
        model: The language model for generating responses.
        conversation_history: A list of tuples containing (user_message, bot_response) pairs.
        user_input: The latest message from the user.

    Returns:
        bot_response: The chatbot's response to the user input.
    """
    # Create HuggingFacePipeline from the model
    hf = HuggingFacePipeline(pipeline=model)

    # Format conversation history into a dialogue structure
    # history_text = "\n".join(
    #     [f"User: {user_msg}\nBot: {bot_msg}" for user_msg, bot_msg in conversation_history]
    # )
    # history_text = "\n".join([f"User: {m['text']}" if m['is_user'] else f"Bot: {m['text']}" for m in conversation_history])
    history_text = "\n".join([f"{m['role'].capitalize()}: {m['content']}" for m in conversation_history])

    # Add the latest user input to the dialogue.
    # Plain string (not an f-string) so PromptTemplate fills {history_text} and {user_input} at invoke time;
    # this also keeps braces typed by the user from being parsed as template variables.
    template = """
    You are a helpful and friendly chatbot. Below is the conversation history:

    {history_text}

    Now, the user says:
    {user_input}

    Respond to the user in a thoughtful and engaging manner:
    """

    try:
        # Create a prompt from the template
        prompt = PromptTemplate.from_template(template)

        # Define the pipeline for processing the input through the model
        chain = prompt | hf
        result = chain.invoke({"user_input": user_input, "history_text": history_text}).strip()
        print("result response:", result, "============")

        # Use regex to extract the response part
        match = re.search(r"Respond to the user in a thoughtful and engaging manner:\s*(.*)", result, re.DOTALL)
        bot_response = match.group(1).splitlines()[0].strip() if match else "Sorry, I couldn't understand the response."

        if "User:" in bot_response:
            # The model echoed dialogue turns; keep only the assistant's reply if one is present
            match = re.search(r"Assistant:\s*\"?(.*?)(?=\n|$)", bot_response, re.DOTALL)
            if match:
                bot_response = match.group(1).strip()

        print("chatbot response:", bot_response, "============")
    except Exception as e:
        print("An error occurred:", e)
        bot_response = "Sorry, I encountered an error. Could you rephrase your message?"

    # Update the conversation history
    # conversation_history.append((user_input, bot_response))
    conversation_history.append({"role": "user", "content": user_input})
    conversation_history.append({"role": "assistant", "content": bot_response})

    print(conversation_history)
    return conversation_history, ""
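
The helper functions above share one post-processing step: find a marker such as `Answer:` or `Summary:` in the generated text and keep what follows it. A minimal standalone sketch of that step is shown below. The function name `extract_after_marker` and its `first_line_only` flag are hypothetical and not part of the notebook; the sample string is made up for illustration.

```python
import re


def extract_after_marker(generated: str, marker: str, first_line_only: bool = True) -> str:
    """Return the text that follows `marker` in `generated`, or "" if the marker is absent."""
    # re.escape keeps punctuation in the marker from being read as regex syntax
    match = re.search(re.escape(marker) + r"\s*(.*)", generated, re.DOTALL)
    if not match:
        return ""
    text = match.group(1).strip()
    if first_line_only and text:
        # Keep only the first line, which is how the short-answer helpers behave
        text = text.splitlines()[0].strip()
    return text


sample = "Question:\nWho developed relativity?\n\nAnswer:\n Albert Einstein\nExtra text"
print(extract_after_marker(sample, "Answer:"))  # prints "Albert Einstein"
```

Returning an empty string when the marker is missing mirrors the helpers above, which fall back to an empty result instead of raising.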

In [None]:
import gradio as gr
from transformers import pipeline
import torch
task = "chat-bot"
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
project_id = 1

import os
from huggingface_hub import login

# Read the Hugging Face token from the environment (set HF_TOKEN before running)
# instead of hardcoding a secret in the notebook.
hf_access_token = os.environ.get("HF_TOKEN")
if hf_access_token:
    login(token=hf_access_token)

if torch.cuda.is_available():
    # Prefer bfloat16 when the GPU supports it, otherwise fall back to float16
    if torch.cuda.is_bf16_supported():
        dtype = torch.bfloat16
    else:
        dtype = torch.float16

    print("CUDA is available.")

    _model = pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=dtype,
        device_map="auto",  # or try "cpu" if this does not work
        max_new_tokens=256,
        token=hf_access_token,
    )
else:
    print("No GPU available, using CPU.")
    _model = pipeline(
        "text-generation",
        model=model_id,
        device_map="cpu",
        max_new_tokens=256,
        token=hf_access_token,
    )
def load_model(task, model_id, project_id, temperature=None, top_p=None, top_k=None, max_new_token=None):
    import os
    from huggingface_hub import login

    # Read the Hugging Face token from the environment instead of hardcoding it
    hf_access_token = os.environ.get("HF_TOKEN")
    if hf_access_token:
        login(token=hf_access_token)

    pipeline_kwargs = {"model": model_id, "token": hf_access_token}

    if torch.cuda.is_available():
        print("CUDA is available.")
        # Prefer bfloat16 when the GPU supports it, otherwise fall back to float16
        if torch.cuda.is_bf16_supported():
            pipeline_kwargs["torch_dtype"] = torch.bfloat16
        else:
            pipeline_kwargs["torch_dtype"] = torch.float16
        pipeline_kwargs["device_map"] = "auto"  # or try "cpu" if this does not work
    else:
        print("No GPU available, using CPU.")
        # No dtype is set on CPU, so the pipeline uses its default precision
        pipeline_kwargs["device_map"] = "cpu"

    # Pass the sampling parameters only when a temperature was provided
    if temperature:
        pipeline_kwargs.update(
            max_new_tokens=int(max_new_token),
            temperature=float(temperature),
            top_k=int(top_k),  # top_k must be an integer
            top_p=float(top_p),
        )

    return pipeline(task, **pipeline_kwargs)

# load_model(task,model_id,project_id)

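def generate_response(input_text, temperature, top_p, top_k, max_new_token):
    # Use the pipeline returned by load_model so the sampling settings take effect
    model = load_model("text-generation", model_id, project_id, temperature, top_p, top_k, max_new_token)

    messages = [
        {"role": "system", "content":  "You are a helpful assistant."},
        {"role": "user", "content": input_text},
    ]

    result = model(messages, max_length=1024)
    print(result)

    generated_text = result[0]['generated_text']
    # With chat-style input the pipeline returns the whole message list;
    # keep only the content of the last (assistant) message
    if isinstance(generated_text, list):
        generated_text = generated_text[-1]["content"]

    return generated_text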

def summarization_response(input_text, temperature, top_p, top_k, max_new_token):
    # Use the pipeline returned by load_model so the sampling settings take effect
    model = load_model("text-generation", model_id, project_id, temperature, top_p, top_k, max_new_token)
    generated_text = text_summarization(model, input_text)

    return generated_text

def text_classification_response(input_text, categories_text, temperature, top_p, top_k, max_new_token):
    # Use the pipeline returned by load_model so the sampling settings take effect
    model = load_model("text-generation", model_id, project_id, temperature, top_p, top_k, max_new_token)
    generated_text = text_classification(model, input_text, categories_text)
    return generated_text

def question_answering_response(context_textbox, question_textbox, temperature, top_p, top_k, max_new_token):
    # Use the pipeline returned by load_model so the sampling settings take effect
    model = load_model("text-generation", model_id, project_id, temperature, top_p, top_k, max_new_token)
    if context_textbox and question_textbox:
        # Both a context and a question were provided
        generated_text = qa_with_context(model, context_textbox, question_textbox)
    elif question_textbox:
        # Only a question was provided
        generated_text = qa_without_context(model, question_textbox)
    else:
        # Nothing to answer without a question
        generated_text = "Please enter a question."

    return generated_text

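def chatbot_continuous_chat(history, user_input, temperature, top_p, top_k, max_new_token):
    # Use the pipeline returned by load_model so the sampling settings take effect
    model = load_model("text-generation", model_id, project_id, temperature, top_p, top_k, max_new_token)
    # chatbot_with_history returns (updated_history, "") to refresh the chat and clear the input box
    return chatbot_with_history(model, history, user_input)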

with gr.Blocks() as demo_text_generation:
    with gr.Row():
        with gr.Column():
            with gr.Group():
                input_text = gr.Textbox(label="Input text")
                # prompt_text = gr.Textbox(label="Prompt text")
                temperature = gr.Slider(
                    label="Temperature",
                    minimum=0.0,
                    maximum=100.0,
                    step=0.1,
                    value=0.9
                )
                top_p = gr.Slider(
                    label="Top_p",
                    minimum=0.0,
                    maximum=1.0,
                    step=0.1,
                    value=0.6
                )
                top_k = gr.Slider(
                    label="Top_k",
                    minimum=0,
                    maximum=100,
                    step=1,
                    value=0
                )
                max_new_token = gr.Slider(
                    label="Max new tokens",
                    minimum=1,
                    maximum=1024,
                    step=1,
                    value=256
                )
            btn = gr.Button("Submit")
        with gr.Column():
            output_text = gr.Textbox(label="Output text")

    btn.click(
        fn=generate_response,
        inputs=[input_text, temperature, top_p, top_k, max_new_token],
        outputs=output_text
    )

with gr.Blocks() as demo_summarization:
    with gr.Row():
        with gr.Column():
            with gr.Group():
                input_text = gr.Textbox(label="Input text")
                temperature = gr.Slider(
                    label="Temperature",
                    minimum=0.0,
                    maximum=100.0,
                    step=0.1,
                    value=0.9
                )
                top_p = gr.Slider(
                    label="Top_p",
                    minimum=0.0,
                    maximum=1.0,
                    step=0.1,
                    value=0.6
                )
                top_k = gr.Slider(
                    label="Top_k",
                    minimum=0,
                    maximum=100,
                    step=1,
                    value=0
                )
                max_new_token = gr.Slider(
                    label="Max new tokens",
                    minimum=1,
                    maximum=1024,
                    step=1,
                    value=256
                )
                # prompt_text = gr.Textbox(label="Prompt text")
            btn = gr.Button("Submit")
        with gr.Column():
            output_text = gr.Textbox(label="Output text")
    btn.click(
        fn=summarization_response,
        inputs=[input_text, temperature, top_p, top_k, max_new_token],
        outputs=output_text
    )

with gr.Blocks() as demo_question_answering:
    with gr.Row():
        with gr.Column():
            with gr.Group():
                context_textbox = gr.Textbox(label="Context text")
                question_textbox = gr.Textbox(label="Question text")
                temperature = gr.Slider(
                    label="Temperature",
                    minimum=0.0,
                    maximum=100.0,
                    step=0.1,
                    value=0.9
                )
                top_p = gr.Slider(
                    label="Top_p",
                    minimum=0.0,
                    maximum=1.0,
                    step=0.1,
                    value=0.6
                )
                top_k = gr.Slider(
                    label="Top_k",
                    minimum=0,
                    maximum=100,
                    step=1,
                    value=0
                )
                max_new_token = gr.Slider(
                    label="Max new tokens",
                    minimum=1,
                    maximum=1024,
                    step=1,
                    value=256
                )

            btn = gr.Button("Submit")
        with gr.Column():
            output_text =   gr.Textbox(label="Response:")

    btn.click(
        fn=question_answering_response,
        inputs=[context_textbox, question_textbox, temperature, top_p, top_k, max_new_token],
        outputs=output_text
    )

with gr.Blocks() as demo_text_classification:
    with gr.Row():
        with gr.Column():
            with gr.Group():
                input_text = gr.Textbox(label="Input text")
                categories_text = gr.Textbox(label="Categories text")
                temperature = gr.Slider(
                    label="Temperature",
                    minimum=0.0,
                    maximum=100.0,
                    step=0.1,
                    value=0.9
                )
                top_p = gr.Slider(
                    label="Top_p",
                    minimum=0.0,
                    maximum=1.0,
                    step=0.1,
                    value=0.6
                )
                top_k = gr.Slider(
                    label="Top_k",
                    minimum=0,
                    maximum=100,
                    step=1,
                    value=0
                )
                max_new_token = gr.Slider(
                    label="Max new tokens",
                    minimum=1,
                    maximum=1024,
                    step=1,
                    value=256
                )
            btn = gr.Button("Submit")
        with gr.Column():
            output_text = gr.Textbox(label="Response:")

    btn.click(
        fn=text_classification_response,
        inputs=[input_text, categories_text, temperature, top_p, top_k, max_new_token],
        outputs=output_text
    )

def sentiment_classifier(text, temperature, top_p, top_k, max_new_token):
    # The sampling arguments are accepted only to match the shared UI inputs;
    # a sentiment-analysis pipeline classifies text and does not generate tokens, so they are not used.
    import json
    try:
        classifier = pipeline("sentiment-analysis")
        sentiment_response = classifier(text)
        # label = sentiment_response[0]['label']
        # score = sentiment_response[0]['score']
        print(sentiment_response)
        return json.dumps(sentiment_response)
    except Exception as e:
        return str(e)

with gr.Blocks() as demo_sentiment_analysis:
    with gr.Row():
        with gr.Column():
            with gr.Group():
                input_text = gr.Textbox(label="Input text")
                temperature = gr.Slider(
                    label="Temperature",
                    minimum=0.0,
                    maximum=100.0,
                    step=0.1,
                    value=0.9
                )
                top_p = gr.Slider(
                    label="Top_p",
                    minimum=0.0,
                    maximum=1.0,
                    step=0.1,
                    value=0.6
                )
                top_k = gr.Slider(
                    label="Top_k",
                    minimum=0,
                    maximum=100,
                    step=1,
                    value=0
                )
                max_new_token = gr.Slider(
                    label="Max new tokens",
                    minimum=1,
                    maximum=1024,
                    step=1,
                    value=256
                )
            btn = gr.Button("Submit")
        with gr.Column():

            # sentiment_classifier returns a single JSON string, so show it in one textbox
            # that belongs to this Blocks context
            output_text = gr.Textbox(label="Response:")
    btn.click(
        fn=sentiment_classifier,
        inputs=[input_text, temperature, top_p, top_k, max_new_token],
        outputs=output_text
    )

def predict_entities(input_text, categories_text, temperature, top_p, top_k, max_new_token):
    # Use the pipeline returned by load_model so the sampling settings take effect
    model = load_model("text-generation", model_id, project_id, temperature, top_p, top_k, max_new_token)
    generated_text = text_ner(model, input_text, categories_text)
    return generated_text

with gr.Blocks() as demo_ner:
    with gr.Row():
        with gr.Column():
            with gr.Group():
                input_text = gr.Textbox(label="Input text")
                categories_text = gr.Textbox(label="Categories text")
                temperature = gr.Slider(
                    label="Temperature",
                    minimum=0.0,
                    maximum=100.0,
                    step=0.1,
                    value=0.9
                )
                top_p = gr.Slider(
                    label="Top_p",
                    minimum=0.0,
                    maximum=1.0,
                    step=0.1,
                    value=0.6
                )
                top_k = gr.Slider(
                    label="Top_k",
                    minimum=0,
                    maximum=100,
                    step=1,
                    value=0
                )
                max_new_token = gr.Slider(
                    label="Max new tokens",
                    minimum=1,
                    maximum=1024,
                    step=1,
                    value=256
                )
            # with gr.Group():
            #     input_text = gr.Textbox(label="Input text")
            btn = gr.Button("Submit")
        with gr.Column():
            output_text = gr.Textbox(label="Response:")

    btn.click(
        fn=predict_entities,
        inputs=[input_text, categories_text, temperature, top_p, top_k, max_new_token],
        outputs=output_text
    )

with gr.Blocks() as demo_text2text_generation:
    with gr.Row():
        with gr.Column():
            with gr.Group():
                input_text = gr.Textbox(label="Input text", placeholder="Enter your text here")
                temperature = gr.Slider(
                    label="Temperature",
                    minimum=0.0,
                    maximum=100.0,
                    step=0.1,
                    value=1
                )
                top_p = gr.Slider(
                    label="Top_p",
                    minimum=0.0,
                    maximum=1.0,
                    step=0.1,
                    value=0.9
                )
                top_k = gr.Slider(
                    label="Top_k",
                    minimum=0,
                    maximum=100,
                    step=1,
                    value=0
                )
                max_new_token = gr.Slider(
                    label="Max new tokens",
                    minimum=1,
                    maximum=1024,
                    step=1,
                    value=256
                )
            btn = gr.Button("Submit")
        with gr.Column():
            output_text = gr.Textbox(label="Output text")

    # Bind the click event to the Submit button
    btn.click(
        fn=generate_response,
        inputs=[input_text, temperature, top_p, top_k, max_new_token],
        outputs=output_text
    )

with gr.Blocks() as demo_chatbot:
    chatbot = gr.Chatbot(type="messages")
    with gr.Row():
        user_input = gr.Textbox(label="Your message", placeholder="Type your message here...")
    with gr.Row():
        temperature = gr.Slider(
                    label="Temperature",
                    minimum=0.0,
                    maximum=100.0,
                    step=0.1,
                    value=0.9
                )
        top_p = gr.Slider(
            label="Top_p",
            minimum=0.0,
            maximum=1.0,
            step=0.1,
            value=0.6
        )
        top_k = gr.Slider(
            label="Top_k",
            minimum=0,
            maximum=100,
            step=1,
            value=0
        )
        max_new_token = gr.Slider(
            label="Max new tokens",
            minimum=1,
            maximum=10000,
            step=1,
            value=1024
        )
    with gr.Row():
        btn = gr.Button("Send")

    # Bind function to button click
    btn.click(
        fn=chatbot_continuous_chat,
        inputs=[chatbot, user_input, temperature, top_p, top_k, max_new_token],  # Pass the sampling parameters to the function
        outputs=[chatbot, user_input]  # Update the chatbot with the new reply and clear the input field
    )


DESCRIPTION = """\
# LLM UI
This is a demo of LLM UI.
"""
with gr.Blocks(css="style.css") as demo:
    gr.Markdown(DESCRIPTION)

    with gr.Tabs():
        if task == "text-generation":
            with gr.Tab(label=task):
                demo_text_generation.render()
        elif task == "summarization":
            with gr.Tab(label=task):
                demo_summarization.render()
        elif task == "question-answering":
            with gr.Tab(label=task):
                demo_question_answering.render()
        elif task == "text-classification":
            with gr.Tab(label=task):
                demo_text_classification.render()
        elif task == "sentiment-analysis":
            with gr.Tab(label=task):
                demo_sentiment_analysis.render()
        elif task == "ner":
            with gr.Tab(label=task):
                demo_ner.render()
        # elif task == "fill-mask":
        #   with gr.Tab(label=task):
        #         demo_fill_mask.render()
        elif task == "text2text-generation":
            with gr.Tab(label=task):
                demo_text2text_generation.render()
        elif task == "chat-bot":
            with gr.Tab(label=task):
                demo_chatbot.render()


gradio_app, local_url, share_url = demo.launch(share=True, quiet=True, prevent_thread_lock=True, server_name='0.0.0.0',show_error=True)
print(share_url)
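
The sliders above hand their values to the model loader as plain numbers, and those need the right types before they reach generation: `top_k` and the token budget must be integers, while `temperature` and `top_p` are floats. A minimal sketch of that conversion follows. The helper name `build_generation_kwargs` is hypothetical and not part of the notebook, and treating a zero temperature or a zero `top_k` as "not set" is an assumption made for this example.

```python
def build_generation_kwargs(temperature=None, top_p=None, top_k=None, max_new_tokens=None):
    """Convert raw slider values into keyword arguments with the expected types."""
    kwargs = {}
    if max_new_tokens is not None:
        kwargs["max_new_tokens"] = int(max_new_tokens)
    # Assumption: a missing or zero temperature means "use the model defaults"
    if temperature:
        kwargs["temperature"] = float(temperature)
        if top_p is not None:
            kwargs["top_p"] = float(top_p)
        # Assumption: top_k == 0 means top-k filtering is switched off, so it is left out
        if top_k:
            kwargs["top_k"] = int(top_k)
    return kwargs


# Example with the default slider values used in most tabs above
print(build_generation_kwargs(temperature=0.9, top_p=0.6, top_k=0, max_new_tokens=256))
```

A dictionary built this way could be unpacked into a `pipeline(...)` call with `**kwargs`, which keeps the type conversions in one place instead of repeating them per branch.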