Huge loss when applying QAT on YOLOv8n #337
It seems that you are using QATQuantizer for PTQ. The correct way to do that is shown below:

```python
from tinynn.graph.quantization.fake_quantize import set_ptq_fake_quantize

quantizer = QATQuantizer(
    model, dummy_input, work_dir='out', config={'override_qconfig_func': set_ptq_fake_quantize}
)
```
|
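For readers unfamiliar with what the PTQ override sets up: post-training quantization boils down to inserting observers, running calibration data through the model, and converting. A minimal sketch of that flow with vanilla PyTorch eager-mode quantization (the `TinyNet` toy model is a hypothetical stand-in, not part of TinyNN or YOLOv8):

```python
import torch
import torch.nn as nn
import torch.quantization as torch_q

class TinyNet(nn.Module):
    """Toy model standing in for the real network (illustration only)."""
    def __init__(self):
        super().__init__()
        self.quant = torch_q.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = torch_q.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

# Pick an available quantized kernel backend
engine = 'fbgemm' if 'fbgemm' in torch.backends.quantized.supported_engines else 'qnnpack'
torch.backends.quantized.engine = engine

model = TinyNet().eval()
model.qconfig = torch_q.get_default_qconfig(engine)
prepared = torch_q.prepare(model)

# Calibration: run representative batches so observers collect ranges
for _ in range(8):
    prepared(torch.rand(1, 3, 32, 32))

quantized = torch_q.convert(prepared)
out = quantized(torch.rand(1, 3, 32, 32))
```

TinyNN's `QATQuantizer` with `set_ptq_fake_quantize` follows the same observe-calibrate pattern, but with fake-quantize modules so the model stays in float and can be evaluated before conversion.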
Hi @peterjc123, here is what I did:

```python
raw_model = torch.load(args.model_pretrained_dir)
model = model_rewrite(raw_model, dummy_input, work_dir='TinyNeuralNet/out')
with model_tracer():
    quantizer = QATQuantizer(model, dummy_input, work_dir='TinyNeuralNet/out',
                             config={'asymmetric': True, 'per_tensor': True, 'override_qconfig_func': set_ptq_fake_quantize})
    qat_model = quantizer.quantize()
```

Now the model can be trained, but the training loss and validation metrics are terrible. I expect the losses to go below 1, like the original model, but the optimizer seems to get stuck at 3.0. Do you know of any methods I could try to resolve this problem?
|
It is advisable to first attempt Post-Training Quantization (PTQ) to quickly observe the loss introduced by quantization. |
Alternatively, could you share the YOLOv8 model file (or the open-source repository you used) and your QAT training script? |
Thanks @zk1998, here is my full training script:

import sys, os, argparse, random, time

import numpy as np
sys.path.append(os.getcwd())
RANK = int(os.getenv("RANK", -1))
import torch
import torch.nn as nn
import torch.optim as optim
import torch.quantization as torch_q
from ultralytics import YOLO
from ultralytics.yolo.utils.loss import v8DetectionLoss
from tinynn.util.train_util import AverageMeter, DLContext, train, get_device
from tinynn.graph.quantization.quantizer import QATQuantizer
from tinynn.graph.tracer import model_tracer
from tinynn.converter import TFLiteConverter
from tinynn.graph.quantization.algorithm.cross_layer_equalization import cross_layer_equalize, model_rewrite, model_fuse_bn, clear_model_fused_bn
from tinynn.util.bn_restore import model_restore_bn
from tinynn.util.quantization_analysis_util import graph_error_analysis, layer_error_analysis
from tinynn.graph.quantization.fake_quantize import set_ptq_fake_quantize
from datetime import datetime
from pathlib import Path
from ultralytics.yolo.cfg import get_cfg
from ultralytics.utils import *
from ultralytics.yolo.utils import (DEFAULT_CFG, DEFAULT_CFG_DICT, DEFAULT_CFG_KEYS, LOGGER, NUM_THREADS, RANK, ROOT,
callbacks, is_git_dir, yaml_load)
from torchsummary import summary
from copy import copy, deepcopy
from ultralytics.yolo import v8
from ultralytics.yolo.utils.checks import check_imgsz
from ultralytics.yolo.data.utils import check_det_dataset
from ultralytics.yolo.data import build_dataloader
from ultralytics.yolo.utils.torch_utils import de_parallel, torch_distributed_zero_first
from ultralytics.yolo.data.dataset import YOLODataset
from ultralytics.yolo.utils.files import increment_path
from ultralytics.utils.torch_utils import (
EarlyStopping,
de_parallel,
init_seeds,
one_cycle,
select_device,
strip_optimizer,
)
def get_default_config(DEFAULT_CFG_PATH="ultralytics/cfg/default.yaml"):
# Default configuration
DEFAULT_CFG_DICT = yaml_load(DEFAULT_CFG_PATH)
for k, v in DEFAULT_CFG_DICT.items():
if isinstance(v, str) and v.lower() == "none":
DEFAULT_CFG_DICT[k] = None
DEFAULT_CFG_KEYS = DEFAULT_CFG_DICT.keys()
DEFAULT_CFG = IterableSimpleNamespace(**DEFAULT_CFG_DICT)
return DEFAULT_CFG
"""Build Dataset
"""
def get_dataloader(model, dataset_path, cfg, data, batch_size=16, rank=0, mode='train'):
def build_dataset(model, img_path, mode='train', batch=None):
gs = max(int(de_parallel(model).stride.max() if model else 0), 32)
return YOLODataset(
img_path=img_path,
imgsz=320,
batch_size=batch,
augment=False, # augmentation
hyp=cfg, # TODO: probably add a get_hyps_from_cfg function
rect=False, # rectangular batches
cache=None,
single_cls=False,
stride=gs,
pad=0.0 if mode == 'train' else 0.5,
prefix=colorstr(f'{mode}: '),
use_segments=False,
use_keypoints=False,
classes=None,
data=data,
fraction=1.0)
assert mode in ['train', 'val']
with torch_distributed_zero_first(rank): # init dataset *.cache only once if DDP
dataset = build_dataset(model, dataset_path, mode, batch_size)
shuffle = mode == 'train'
if getattr(dataset, 'rect', False) and shuffle:
LOGGER.warning("WARNING ⚠️ 'rect=True' is incompatible with DataLoader shuffle, setting shuffle=False")
shuffle = False
workers = 2 if mode == 'train' else 2
return build_dataloader(dataset, batch_size, workers, shuffle, rank)
def get_validator(batch_size, saved_experiment_dir):
cb = callbacks.get_default_callbacks()
overrides = {'imgsz': 320, 'batch': batch_size, 'conf': 0.25, 'iou': 0.6, 'device': 'cuda:0'}
overrides['rect'] = True
overrides['mode'] = 'val'
overrides.update(overrides)
args = get_cfg(cfg=DEFAULT_CFG, overrides=overrides)
args.data = 'sparseml_quantization/data.yaml'
args.task = 'detect'
args.rect = False
args.imgsz = check_imgsz(320, max_dim=1)
return v8.detect.DetectionValidator(save_dir=saved_experiment_dir, args=args, _callbacks=cb)
def calibrate(model, context: DLContext, eval=True):
model.to(device=context.device)
if eval:
model.eval()
else:
model.train()
avg_batch_time = AverageMeter()
with torch.no_grad():
end = time.time()
for i, batch in enumerate(context.train_loader):
if context.max_iteration is not None and i >= context.max_iteration:
break
img = batch['img'].to(context.device, non_blocking=True).float() / 255.0
model(img)
# measure elapsed time
avg_batch_time.update(time.time() - end)
end = time.time()
if i % 10 == 0:
print(f'Calibrate: [{i}/{len(context.train_loader)}]\tTime {avg_batch_time.avg:.5f}\t')
context.iteration += 1
def setup_seed(seed, cuda_deterministic=True):
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
if cuda_deterministic: # slower, more reproducible
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
else: # faster, less reproducible
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True
def train(args, model, optimizer, scheduler, criterion, train_loader, validator, context, max_epochs, cfg, saved_experiment_dir):
loss_names = ['box_loss', 'cls_loss', 'dfl_loss']
model = model.to(context.device)
tloss = None
best_fitness = -1e5
early_stopper = 0
for epoch in range(max_epochs):
# For each epoch
pbar = TQDM(enumerate(train_loader), total=len(train_loader))
if epoch == 5:
train_loader.dataset.close_mosaic(hyp=cfg)
train_loader.reset()
# TRAIN!!!
model.train()
print(('\n' + '%11s' * (4 + len(loss_names))) % ('Epoch', 'GPU_mem', *loss_names, 'Instances', 'Size'))
for i, batch in pbar:
# Warm Up!
# ni = i + len(train_loader) * epoch
# nw = max(round(3 * len(train_loader)), 100)
# if ni <= nw:
# xi = [0, nw]
# for j, x in enumerate(optimizer.param_groups):
# x["lr"] = np.interp(
# ni, xi, [0.1 if j == 0 else 0.0, x["initial_lr"] * one_cycle(1, 0.01, max_epochs)(epoch)])
# if "momentum" in x:
# x["momentum"] = np.interp(ni, xi, [0.8, 0.937])
optimizer.zero_grad()
# Preprocess batch!
batch['img'] = batch['img'].to(context.device, non_blocking=True).float() / 255.0
output = model(batch['img'])
loss, loss_items = criterion(output, batch)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
# Loss infos
tloss = ((tloss * i + loss_items) / (i + 1) if tloss is not None else loss_items)
mem = f"{torch.cuda.memory_reserved() / 1E9 if torch.cuda.is_available() else 0:.3g}GB" #(GB)
loss_len = tloss.shape[0] if len(tloss.shape) else 1
losses = tloss if loss_len > 1 else torch.unsqueeze(tloss, 0)
# Progress bar
pbar.set_description(
("%11s" * 2 + "%11.4g" * (2 + loss_len))
% (f"{epoch}/{max_epochs-1}", mem, *losses, batch["cls"].shape[0], batch["img"].shape[-1])
)
if (epoch == max_epochs // 3):
print("[INFO] Freeze quantizer parameters!")
model.apply(torch.quantization.disable_observer)
elif (epoch == max_epochs // 3 * 2):
print("[INFO] Freeze batch-norm mean and variance estimates!")
model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)
validate_model = deepcopy(model)
metrics = validator(model=validate_model)
fitness = metrics["fitness"]
if not best_fitness or best_fitness < fitness:
best_fitness = fitness
else:
early_stopper += 1
save_model(epoch, best_fitness, fitness, model, optimizer,
save_path=saved_experiment_dir / "weights")
scheduler.step()
del validate_model
torch.cuda.empty_cache()
if early_stopper == 5:
break
return model
def save_model(epoch, best_fitness, fitness, model, optimizer, save_path):
"""Save model checkpoints based on various conditions."""
ckpt = {
'epoch': epoch,
'best_fitness': best_fitness,
'model': deepcopy(de_parallel(model)),
'optimizer': optimizer.state_dict(),
'date': datetime.now().isoformat()}
try:
import dill as pickle
except ImportError:
import pickle
if not os.path.exists(save_path):
os.makedirs(save_path)
last = save_path / "last.pt"
best = save_path / "best.pt"
torch.save(ckpt, last, pickle_module=pickle)
if best_fitness == fitness:
torch.save(ckpt, best, pickle_module=pickle)
del ckpt
def qat(args):
dataset_yaml_path = '/data/hoangtv23/workspace_AIOT/model_compression_flow/sparseml_quantization/data.yaml'
# uncompressed_model_dir = "yolov8n.pt"
saved_experiment_dir = "TinyNeuralNet/tinynn_runs/exp_yolov8n"
saved_experiment_dir = increment_path(Path(saved_experiment_dir))
context = DLContext()
# Config consists of Augmentation Informations
cfg = get_default_config()
cfg.data = dataset_yaml_path
# Declare Model
raw_model = torch.load(args.model_pretrained_dir)
raw_model.args = cfg
dummy_input = torch.rand(1, 3, 320, 320)
data = check_det_dataset(dataset_yaml_path)
trainset, testset = data['train'], data.get('val') or data.get('test')
device = get_device()
context.device = device
train_loader = get_dataloader(raw_model, trainset, cfg, data, args.batch_size, RANK, "train")
context.train_loader = train_loader
test_loader = get_dataloader(raw_model, testset, cfg, data, args.batch_size*2, RANK, "val")
validator = v8.detect.DetectionValidator(test_loader, save_dir=saved_experiment_dir, args=copy(cfg))
# validator = get_validator(batch_size=args.batch_size, saved_experiment_dir=saved_experiment_dir)
model = model_rewrite(raw_model, dummy_input, work_dir='TinyNeuralNet/out')
# context.max_iteration = 100
# calibrate(model, context=context, eval=True)
with model_tracer():
quantizer = QATQuantizer(model, dummy_input, work_dir='TinyNeuralNet/out',
config={'asymmetric': True, 'per_tensor': True, 'override_qconfig_func': set_ptq_fake_quantize})
qat_model = quantizer.quantize()
qat_model.to(device=device)
""" Build Optimizer!
"""
optimizer = torch.optim.Adam(qat_model.parameters(), lr= args.base_lr, weight_decay=args.weight_decay)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=args.num_epochs + 1, eta_min=0)
""" Build Criterion!
"""
criterion = v8DetectionLoss(raw_model)
""" Train!
"""
qat_model.train()
qat_model.apply(torch_q.enable_fake_quant)
qat_model.apply(torch_q.enable_observer)
""" Train Model!
"""
qat_model = train(args, qat_model, optimizer, scheduler, criterion, train_loader,
validator, context, args.num_epochs, cfg, saved_experiment_dir)
""" Error Analysis!
"""
    # note: .type(torch.FloatTensor) would move the tensor back to the CPU; keep it on the GPU
    dummy_input_real = next(iter(train_loader))['img'].cuda().float() / 255.0
graph_error_analysis(qat_model, dummy_input_real, metric='cosine')
layer_error_analysis(qat_model, dummy_input_real, metric='cosine')
qat_model.apply(torch_q.disable_observer)
metrics = validator(model=deepcopy(qat_model))
print(metrics)
with torch.no_grad():
qat_model.eval()
qat_model.cpu()
qat_model = quantizer.convert(qat_model)
torch.backends.quantized.engine = quantizer.backend
converter = TFLiteConverter(qat_model, dummy_input.cpu(), tflite_path='TinyNeuralNet/out/quantized_model.tflite')
converter.convert()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--workers', type=int, default=4)
parser.add_argument('--num-epochs', type=int, default=50)
parser.add_argument('--batch-size', type=int, default=32)
parser.add_argument('--base-lr', type=float, default=1e-1)
parser.add_argument('--weight-decay', type=float, default=1e-6)
parser.add_argument('--model-pretrained-dir', type=str, default='oto_pruning/cache/version_split_bbox_conf/DetectionModel_compressed.pt')
setup_seed(seed=2048, cuda_deterministic=False)
args = parser.parse_args()
    qat(args)
|
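The observer- and BN-freezing schedule in the training loop above (freeze quantizer ranges at one third of the epochs, freeze BN statistics at two thirds) is the standard QAT recipe. A minimal sketch with vanilla PyTorch on a toy module (the module, sizes, and epoch counts are hypothetical, not the YOLOv8 setup):

```python
import torch
import torch.nn as nn
import torch.quantization as torch_q

# Toy QAT model standing in for the rewritten network
model = torch_q.QuantWrapper(nn.Sequential(nn.Conv2d(3, 4, 3), nn.ReLU()))
model.qconfig = torch_q.get_default_qat_qconfig('qnnpack')
torch_q.prepare_qat(model.train(), inplace=True)

max_epochs = 9
for epoch in range(max_epochs):
    model(torch.rand(2, 3, 16, 16))  # stand-in for one epoch of training
    if epoch == max_epochs // 3:
        # freeze quantizer ranges for the rest of training
        model.apply(torch_q.disable_observer)
    elif epoch == max_epochs // 3 * 2:
        # freeze BN statistics (only affects fused Conv+BN QAT modules; a no-op here)
        model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)
```

Freezing in this order lets the fake-quantize ranges stabilize first, so the last third of training fine-tunes weights against fixed quantization parameters.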
I saved the rewritten model and loaded it with |
If the rewritten model's mAPs show no difference from the original model, I strongly recommend that you first use the rewritten model (the .py and .pth files in the out dir) for post-training quantization, and then evaluate the mAPs of the ptq_model (after the PyTorch quantization convert step) to see the impact of quantization on the mAPs without any training. BTW, I am trying to reproduce the YOLOv8 quantization pipeline on ImageNet. I will discuss the details with you later. |
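Loading the generated .py/.pth pair amounts to a dynamic import of an ordinary module, roughly what TinyNN's `import_from_path` helper does internally. A self-contained sketch with stdlib `importlib` (the file contents and class name here are hypothetical stand-ins for a TinyNN-generated model file):

```python
import importlib.util
import os
import tempfile

# Hypothetical contents of a generated model file; real files are produced
# by model_rewrite / quantizer.quantize() under the work_dir
generated = (
    "class DetectionModelRewrite:\n"
    "    def describe(self):\n"
    "        return 'rewritten model'\n"
)
work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, 'detectionmodel_rewrite.py')
with open(path, 'w') as f:
    f.write(generated)

# Dynamic import of the generated module by file path
spec = importlib.util.spec_from_file_location('out.detectionmodel_rewrite', path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
model = module.DetectionModelRewrite()
print(model.describe())  # rewritten model
```

With the real generated files you would additionally call `model.load_state_dict(torch.load(...))` on the matching .pth file, as shown elsewhere in this thread.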
Thank you very much for your support @zk1998, |
@hoangtv2000 Hi, reproducing your issue with QAT would be very time-consuming, so we use PTQ instead. We believe that if PTQ works, then QAT will certainly work better. |
Nope, it doesn't sound right. For a correct setup, it should not be dropping that much. |
I use the rewrite function below:

```python
# assumes trace and import_from_path are also imported from tinynn.graph.tracer
def model_rewrite(model, dummy_input, work_dir='out') -> nn.Module:
    """rewrite model to non-block style"""
    with model_tracer():
        graph = trace(model, dummy_input)
        model_name = type(model).__name__
        model_rewrite = f'{model_name}_cle_Rewrite'
        model_name_rewrite_lower = model_rewrite.lower()
        model_ns = f'out.{model_name_rewrite_lower}'
        model_code_path = os.path.join(work_dir, f'{model_name_rewrite_lower}.py')
        model_weights_path = os.path.join(work_dir, f'{model_name_rewrite_lower}.pth')
        graph.eliminate_dead_graph_pass()
        if not os.path.exists(work_dir):
            os.makedirs(work_dir)
        graph.generate_code(model_code_path, model_weights_path, model_rewrite)
        # Import the new model
        rewritten_model = import_from_path(model_ns, model_code_path, model_rewrite)()
        rewritten_model.load_state_dict(torch.load(model_weights_path))
        # os.unlink(model_weights_path)
        return rewritten_model
```

and call it like this:

```python
# args.model_pretrained_dir is the trained and pruned model path.
raw_model = torch.load(args.model_pretrained_dir)
dummy_input = torch.rand(1, 3, 320, 320).to(device)
model = model_rewrite(raw_model, dummy_input, work_dir='TinyNeuralNet/out')
```
|
Would you please skip this step for now? |
In my case, a model that is not rewritten will be missing layers and cannot produce correct outputs. Specifically, it lacks some layers for bbox and confidence score prediction, so the model cannot be used later. The model losses are also not reduced, so I think rewriting the model is not the problem.

Setup:

```python
raw_model = torch.load(args.model_pretrained_dir)
with model_tracer():
    raw_model.to(device)
    quantizer = QATQuantizer(raw_model, dummy_input, work_dir='TinyNeuralNet/out',
                             config={'asymmetric': True, 'per_tensor': True, 'override_qconfig_func': set_ptq_fake_quantize})
    qat_model = quantizer.quantize()
```

Error report:
|
This use case can be traced correctly. |
@hoangtv2000 Actually, we will also perform rewrite if you call |
I think I know the reason. it is because the computation logic is slightly different between training mode and evaluation mode. Given the fact that we actually rewrite the model and trace only once in a single mode, it is inevitable that the traced graph won't work for the other mode. You may have to patch the generated model code a little bit so that it works for both training and evaluation. |
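The patch described above amounts to making the generated forward branch on `self.training`, so one traced graph serves both modes. A toy sketch of the pattern (the class, method, and return values are hypothetical stand-ins for the generated YOLO head code, which returns raw per-scale outputs in training and additionally decoded predictions in eval):

```python
class PatchedDetectHead:
    """Toy stand-in for a generated head forward patched for both modes."""
    def __init__(self):
        self.training = True

    def forward(self, feats):
        # Training mode: return only the raw per-scale outputs for the loss.
        if self.training:
            return feats
        # Eval mode: also return a decoded result, mirroring the two-branch
        # return seen in the generated code later in this thread.
        decoded = sum(feats)
        return decoded, feats

head = PatchedDetectHead()
train_out = head.forward([1, 2, 3])
head.training = False
decoded, raw = head.forward([1, 2, 3])
print(decoded)  # 6
```

The key point is that both branches must reuse the same traced submodules; only the final packaging of outputs differs between modes.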
Thank you @peterjc123, |
Hi @hoangtv2000, you can try to quantize the YOLOv8 detection model as below:
|
Thanks for your response @zk1998, I tried:

```python
quantizer = QATQuantizer(model, dummy_input, work_dir='TinyNeuralNet/out',
                         config={'asymmetric': True, 'per_tensor': True, 'force_rewrite': False, 'rewrite_graph': False})
qat_model = quantizer.quantize()
```

The mAP of the QAT model dropped to zero, so I think the problem comes from here. |
There is no |
I know, but it will delete my edits:

```python
if self.training:
    return [fake_dequant_1_0, fake_dequant_1_1, fake_dequant_1_2]
else:
    return [fake_dequant_0_0, fake_dequant_0_1], [fake_dequant_1_0, fake_dequant_1_1, fake_dequant_1_2]
```

if I set |
@hoangtv2000 It will not delete your edits if you pass |
Wow, that is strange; here is my script, and it breaks my model. I will try your solution, thank you for helping me so much.

```python
from TinyNeuralNet.out.detectionmodel_q import QDetectionModel

model = QDetectionModel()
dummy_input = torch.rand(1, 3, 320, 320)
model.load_state_dict(torch.load('TinyNeuralNet/out/detectionmodel_q.pth'))
model.to(device=device)
quantizer = QATQuantizer(model, dummy_input, work_dir='TinyNeuralNet/out',
                         config={'asymmetric': True, 'per_tensor': True, 'force_rewrite': False, 'rewrite_graph': True})
qat_model = quantizer.quantize()
qat_model.to(device=device)
```
|
No, you didn't follow our usage. Come on. It's |
If you pass in the rewritten model as shown in this piece of code, then you should pass |
I was so silly for not paying attention to the answer, sorry about that.
|
So, how about the fake-quantized, QAT-prepared model's mAPs before training, which I mentioned in #337 (comment), step 3?
Hi @zk1998,
|
Hi @hoangtv2000 . It seems that the quantization error is unacceptable, and QAT will degrade to training from scratch. I am trying to reproduce the YOLOv8 detection quantization experiment, which will take some time. |
Yeah, I tried the quantization with:

```python
with model_tracer():
    model = torch.load(args.model_pretrained_dir).cpu()
    quantizer = PostQuantizer(model, torch.rand(1, 3, 320, 320), work_dir='TinyNeuralNet/out')
    ptq_model = quantizer.quantize()

# Do calibration/inference to get the quantization params
ptq_model.eval()
ptq_model.apply(torch_q.disable_fake_quant)
ptq_model.apply(torch_q.enable_observer)
context.max_iteration = 100
calibrate(ptq_model, context)

# Disable observers and enable fake quantization to validate the model with quantization error
ptq_model.apply(torch_q.disable_observer)
ptq_model.apply(torch_q.enable_fake_quant)
metrics = validator(model=deepcopy(ptq_model))
```

And the mAPs come out to nearly zero.
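When PTQ collapses like this, the per-layer error analysis helpers imported in the script (`graph_error_analysis` / `layer_error_analysis` with `metric='cosine'`) can locate where quantization error explodes. The cosine metric they report is essentially cosine distance between the float and fake-quantized activations; a self-contained sketch of that computation on plain lists (the sample values are hypothetical):

```python
import math

def cosine_error(a, b):
    """1 - cosine similarity between two flattened activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

float_out = [0.9, 0.1, 0.5]          # hypothetical float activations
quant_out = [0.875, 0.125, 0.5]      # hypothetical 8-bit-rounded values
err = cosine_error(float_out, quant_out)
```

Layers whose error jumps by orders of magnitude relative to their inputs are the usual suspects for per-tensor range problems (e.g. wide-range bbox regression outputs sharing a scale with confidence scores).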
|
@hoangtv2000 Would you please organize the data and the code you used to perform training? I think an end-to-end example helps here because we are new to the YOLO framework. There's just too much info in this thread and we may get distracted by something unrelated. |
I tested the dummy input/output and the evaluation mAPs with both the rewritten model and the original one. Their output tensors on the dummy input are identical, but the mAPs are very different, so I think the problem comes from a conflict between the training/evaluation code and the rewritten model. I will investigate the cause further and notify you later. |
Update: the rewritten model still preserves the mAPs, but the QAT-ready model's mAPs change. So the problem comes from this step:

```python
from TinyNeuralNet.out.detectionmodel_q import QDetectionModel

model = QDetectionModel()
model.load_state_dict(torch.load('TinyNeuralNet/out/detectionmodel_q.pth'))
model.to(device=device)
metrics = validator(model=deepcopy(model))  # mAP = 0.65, not changed

quantizer = QATQuantizer(model, dummy_input, work_dir='TinyNeuralNet/out',
                         config={'asymmetric': True, 'per_tensor': True, 'force_overwrite': False, 'rewrite_graph': False})
qat_model = quantizer.quantize()
qat_model.to(device=device)
metrics = validator(model=deepcopy(qat_model))  # mAP = 0.0286, dropped
```
|
I ran a new experiment on YOLOv5, and its mAP50 also dropped from 0.7 to 0.0. |
All of the code used to train and evaluate YOLOv8 comes from the official Ultralytics repo, and the code for YOLOv5 can be found here. I just took their pretrained models, loaded them, and tested them on your QAT engine. |
It will take some time for us to replicate the issue since we are not familiar with the YOLO pipeline. Let's not discuss unrelated matters here and focus on one problem at a time, because there are already plenty of them in this thread. Thanks for your patience and understanding. |
Hi @hoangtv2000, I replicated the quantization pipeline on COCO8 with yolov8n.pt using:
The script is modified from your code in #337 (comment). Here is my example script; the quantization error is acceptable.
Some things you must pay attention to:
So I believe YOLOv8 can achieve full INT8 quantization, whether using PTQ or QAT. You can check the script for more details. |
Hi @zk1998, |
Hi @peterjc123,
I am adapting your quantization-aware training method to my YOLOv8 compression flow.
I follow this pipeline to train my model:
And I hit this problem: the loss metrics of the QAT flow are huge.
I think incorrect fake quantization of the input causes this error.
Do you know how to fix it?
Thank you.