## An End-to-End Transformer Model for Crowd Localization

This notebook presents my analysis of the paper "An End-to-End Transformer Model for Crowd Localization", together with implementation details from my perspective.

### Metadata of the Paper

| Field      | Value                                     |
|------------|-------------------------------------------|
| Title      | An end-to-end transformer model for crowd localization |
| Author(s)  | Liang, Dingkang<br>Xu, Wei<br>Bai, Xiang  |
| Book Title | Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part I |
| Pages      | 38--54                                    |
| Year       | 2022                                      |
| Organization | Springer                                 |

### The Aim of This Study

Liang et al. (2022) argue that <u>predicting head positions for crowd localization</u> is a more practical and higher-level task than simply counting the number of people in a crowd. They propose CLTR, an end-to-end Crowd Localization Transformer that solves this task in a regression-based paradigm. The approach treats crowd localization as a direct set prediction problem, <u>using extracted features and trainable embeddings as inputs to the transformer decoder</u>. To generate more reasonable matching results and reduce ambiguous points, the authors introduce a <u>KMO-based Hungarian matcher</u> that uses nearby context as an auxiliary matching cost. The method is evaluated on five datasets with various data settings, and it achieves the best localization performance on the NWPU-Crowd, UCF-QNRF, and ShanghaiTech Part A datasets.

<div style="text-align: center;">
    <img src="./images/1.PNG" alt="Image" />
</div>

### Implementation Details

The approach proposed by the authors aims to <u>predict crowd instances directly, without additional pre-processing or post-processing steps</u>. The implementation consists of a CNN-based backbone, a transformer encoder, a transformer decoder, and a KMO-based matcher.

<div style="text-align: center;">
    <img src="./images/2.PNG" alt="Image" />
</div>


* Below are the key components and steps involved in the implementation:

    * __CNN-based Backbone__: The first step is to extract feature maps from the input image. In this study, the ResNet50 architecture is utilized as the backbone network for its strong feature extraction capabilities.

    * __Feature Map Flattening__: The extracted feature maps are flattened into a 1D sequence, which is then enriched with positional embedding (Fp) to provide spatial information.

    * __Transformer Encoder__: A 1x1 convolution reduces the channel dimension of the extracted feature maps before they are flattened. The flattened sequence with positional embedding (Fp) is then passed through the transformer encoder, resulting in encoded features (Fe).

    * __Transformer Decoder__: The transformer decoder layers take the trainable head queries (Qh) and the encoded features (Fe) as input. The queries interact with each other through self-attention and with the encoded features through cross-attention, generating the decoded embedding (Fd), which contains both point (person's head) and category information.

    * __Point Regression and Classification Heads__: The decoded embeddings (Fd) are subsequently decoupled into point coordinates and confidence scores using a point regression head and a classification head, respectively. Each query therefore yields a candidate head location and a score that classifies it as a head or as background.

    * __KMO-based Matcher__: During the model training process, it is necessary to match the predictions with ground truth (GT) by employing a one-to-one correspondence. Unmatched predicted points are considered as belonging to the "background" class.

    * __Data Augmentation__: To enhance the model's robustness and generalization, various data augmentation techniques are employed during training. These include random cropping, random scaling, and horizontal flipping of the training data.

    * __Optimizer and Learning Rate__: The Adam optimizer with a learning rate of 1e-4 is utilized to optimize the model parameters.

    * __Datasets__: The authors evaluate the model on benchmark datasets including UCF-QNRF, JHU-Crowd++, NWPU-Crowd, and ShanghaiTech Part A.
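
The output of the point regression and classification heads can be turned into head locations with a simple thresholding step. The following is a minimal sketch of that idea, not the authors' code: the function name `decode_predictions`, the assumption that coordinates are normalised to [0, 1], the threshold of 0.5, and the toy values are all hypothetical.

```python
def decode_predictions(scores, points, image_size, threshold=0.5):
    """Keep the queries whose head confidence reaches the threshold.

    scores:     per-query head confidence in [0, 1]
    points:     per-query (x, y), assumed normalised to [0, 1]
    image_size: (width, height) in pixels
    """
    width, height = image_size
    heads = []
    for score, (x, y) in zip(scores, points):
        if score >= threshold:
            # Scale the normalised point back to pixel coordinates.
            heads.append((x * width, y * height))
    return heads


# Four hypothetical queries; two of them fall below the threshold.
scores = [0.92, 0.10, 0.75, 0.40]
points = [(0.25, 0.50), (0.80, 0.20), (0.50, 0.50), (0.10, 0.90)]
heads = decode_predictions(scores, points, image_size=(640, 480))
print(len(heads), heads)  # 2 [(160.0, 240.0), (320.0, 240.0)]
```

The predicted count is simply the number of queries that survive the threshold, so localization and counting come from the same output.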

### Additional Keynotes

* The authors point out that regression-based methods, which predict coordinates directly, are more straightforward than detection-based and map-based methods. One advantage of these methods is that they can be trained end-to-end, without the need for preprocessing steps such as creating pseudo ground truth boxes or maps. Moreover, they do not rely on complex multi-scale fusion mechanisms to produce high-quality feature maps.

* The proposed method is inspired by DETR, introduced in the paper "End-to-end object detection with transformers", which delivers accurate object detection in a simpler and more effective way. However, the authors note that DETR cannot be directly applied to crowd localization because of intrinsic limitations of its matcher. The key component of DETR is the L1-based Hungarian matcher, which combines the L1 distance between bounding boxes with class confidence to match prediction and ground-truth box pairs, and which performs very well in object detection. Crowd datasets, however, provide no bounding boxes, and for crowd localization the L1 distance alone can easily lead to ambiguous matching between point pairs. Crowd images contain only one category (heads), and dense heads usually have similar textures and therefore similar confidence scores, which can confuse the matcher. The authors therefore introduce a new k-nearest neighbors (KNN) matching objective, named KMO, as an auxiliary matching cost. The KMO-based Hungarian matcher considers the context of nearby heads, which helps to reduce ambiguous points and generate more reasonable matching results.
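
To make the role of the auxiliary cost concrete, here is a minimal sketch of a KMO-style matching cost. It is my own illustration rather than the authors' code: the cost form (L1 point distance, plus the difference in average distance to the k nearest neighbours, minus the predicted confidence), the unweighted sum, the helper names, and the toy data are all assumptions. A brute-force search over permutations stands in for the Hungarian algorithm and is only practical for tiny inputs.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two 2D points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mean_knn_distance(points, k):
    """Average L1 distance from each point to its k nearest neighbours."""
    context = []
    for i, p in enumerate(points):
        dists = sorted(l1(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        context.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return context


def matching_cost(gt, pred, conf, k=2):
    """Cost matrix: rows are ground-truth points, columns are predictions."""
    gt_ctx = mean_knn_distance(gt, k)
    pred_ctx = mean_knn_distance(pred, k)
    return [
        [l1(g, p) + abs(gc - pc) - c for p, pc, c in zip(pred, pred_ctx, conf)]
        for g, gc in zip(gt, gt_ctx)
    ]


def match(cost):
    """One-to-one assignment by brute force (needs predictions >= GT points)."""
    n_gt, n_pred = len(cost), len(cost[0])
    best = min(
        permutations(range(n_pred), n_gt),
        key=lambda cols: sum(cost[i][c] for i, c in enumerate(cols)),
    )
    return list(best)


gt = [(10, 10), (50, 50)]
pred = [(52, 49), (11, 12), (90, 90)]
conf = [0.9, 0.8, 0.3]

assignment = match(matching_cost(gt, pred, conf))
background = [j for j in range(len(pred)) if j not in assignment]
print(assignment, background)  # [1, 0] [2]
```

Each ground-truth point is paired with exactly one prediction, and the leftover prediction (index 2 here) is treated as background, as described for the matcher above.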

### Benchmark Results

* As part of the CMP719 lecture project, the following tables serve as benchmarks, with the aim of achieving results comparable to those presented in the paper. They show the crowd counting performance reported by the authors on the NWPU-Crowd and UCF-QNRF datasets.

<div style="text-align: center;">
    <img src="./images/3.PNG" alt="Image" width="50%"/>
</div>

<div style="text-align: center;">
    <img src="./images/4.PNG" alt="Image" width="50%"/>
</div>

* Visual results will also be provided to give some intuition about the achieved results.
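
Counting performance is summarised in the evaluation code of this notebook by two numbers, `mae` and `mse`, where the second is the square root of the mean squared error. As a small worked sketch of that arithmetic (the function name and the toy counts are made up for illustration):

```python
import math


def counting_errors(pred_counts, gt_counts):
    """Return (MAE, root-MSE) over per-image predicted and true counts."""
    errors = [abs(p - g) for p, g in zip(pred_counts, gt_counts)]
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Three hypothetical images with absolute errors 2, 10 and 0.
mae, rmse = counting_errors([98, 210, 35], [100, 200, 35])
print(mae, round(rmse, 3))  # 4.0 5.888
```

The root-MSE penalises large per-image errors more strongly than the MAE, which is why the two values are usually reported together.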

In [None]:
from __future__ import division

import os
import warnings
import torch
from config import return_args, args
torch.cuda.set_device(int(args.gpu_id[0]))
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id
import torch.nn as nn
from torchvision import transforms
import dataset
import math
from utils import get_root_logger, setup_seed
import nni
from nni.utils import merge_parameter
import time
import util.misc as utils
from utils import save_checkpoint
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import numpy as np
from torch.utils.tensorboard import SummaryWriter  # TensorBoard logging

if args.backbone == 'resnet50' or args.backbone == 'resnet101':
    from Networks.CDETR import build_model

warnings.filterwarnings('ignore')
# Fix the random seed for reproducibility.
setup_seed(args.seed)


def main(args):
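    # Pick the validation list for the selected dataset.
    if args['dataset'] == 'jhu':
        test_file = './npydata/jhu_val.npy'
    elif args['dataset'] == 'nwpu':
        test_file = './npydata/nwpu_val.npy'
    else:
        # Fail early instead of hitting a NameError on test_file below.
        raise ValueError("unsupported dataset: {}".format(args['dataset']))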

    with open(test_file, 'rb') as outfile:
        test_list = np.load(outfile).tolist()

    utils.init_distributed_mode(return_args)
    model, criterion, postprocessors = build_model(return_args)

    model = model.cuda()

    # Split on commas so that multi-digit GPU ids (e.g. '10') are parsed correctly.
    model = nn.DataParallel(model, device_ids=[int(gpu) for gpu in args['gpu_id'].split(',') if gpu.strip()])
    path = './save_file/log_file/debug/'
    args['save_path'] = path
    if not os.path.exists(args['save_path']):
        os.makedirs(path)
    logger = get_root_logger(path + 'debug.log')
    writer = SummaryWriter(path)

    num_params = 0
    for param in model.parameters():
        num_params += param.numel()
    print("model params:", num_params / 1e6)
    logger.info("model params: = {:.3f}\t".format(num_params / 1e6))

    optimizer = torch.optim.Adam(
        [
            {'params': model.parameters(), 'lr': args['lr']},
        ], lr=args['lr'], weight_decay=args['weight_decay'])
    if args['local_rank'] == 0:
        logger.info(args)

    if not os.path.exists(args['save_path']):
        os.makedirs(args['save_path'])

    if args['pre']:
        if os.path.isfile(args['pre']):
            logger.info("=> loading checkpoint '{}'".format(args['pre']))
            checkpoint = torch.load(args['pre'])
            model.load_state_dict(checkpoint['state_dict'], strict=False)
            args['start_epoch'] = checkpoint['epoch']
            args['best_pred'] = checkpoint['best_prec1']
        else:
            logger.info("=> no checkpoint found at '{}'".format(args['pre']))

    print('best result:', args['best_pred'])
    logger.info('best result = {:.3f}'.format(args['best_pred']))
    torch.set_num_threads(args['workers'])

    if args['local_rank'] == 0:
        logger.info('best result={:.3f}\t start epoch={:.3f}'.format(args['best_pred'], args['start_epoch']))

    test_data = test_list
    if args['local_rank'] == 0:
        logger.info('start evaluation!')

    eval_epoch = 0

    pred_mae, pred_mse, visi = validate(test_data, model, criterion, logger, args)

    writer.add_scalar('Metrics/MAE', pred_mae, eval_epoch)
    writer.add_scalar('Metrics/MSE', pred_mse, eval_epoch)

    # save_result
    if args['save']:
        is_best = pred_mae < args['best_pred']
        args['best_pred'] = min(pred_mae, args['best_pred'])
        save_checkpoint({
            'arch': args['pre'],
            'state_dict': model.state_dict(),
            'best_prec1': args['best_pred'],
            'optimizer': optimizer.state_dict(),
        }, visi, is_best, args['save_path'])

    if args['local_rank'] == 0:
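        # Three placeholders, three values: the stray args['epochs'] argument
        # previously shifted every metric by one position in the log line.
        logger.info(
            'mae={:.3f}\t mse={:.3f}\t best_mae={:.3f}\t'.format(
                pred_mae, pred_mse,
                args['best_pred']))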


def collate_wrapper(batch):
    targets = []
    imgs = []
    fname = []

    for item in batch:

        if return_args.train_patch:
            fname.append(item[0])

            for i in range(0, len(item[1])):
                imgs.append(item[1][i])

            for i in range(0, len(item[2])):
                targets.append(item[2][i])
        else:
            fname.append(item[0])
            imgs.append(item[1])
            targets.append(item[2])

    return fname, torch.stack(imgs, 0), targets


def validate(Pre_data, model, criterion, logger, args):
    if args['local_rank'] == 0:
        logger.info('begin test')
    test_loader = torch.utils.data.DataLoader(
        dataset.listDataset(Pre_data, args['save_path'],
                            shuffle=False,
                            transform=transforms.Compose([
                                transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                                            std=[0.229, 0.224, 0.225]),

                            ]),
                            args=args, train=False),
        batch_size=1,
    )

    model.eval()

    mae = 0.0
    mse = 0.0
    visi = []

    for i, (fname, img, kpoint, targets, patch_info) in enumerate(test_loader):

        if len(img.shape) == 5:
            img = img.squeeze(0)
        if len(img.shape) == 3:
            img = img.unsqueeze(0)
        if len(kpoint.shape) == 5:
            kpoint = kpoint.squeeze(0)

        with torch.no_grad():
            img = img.cuda()
            outputs = model(img)

        out_logits, out_point = outputs['pred_logits'], outputs['pred_points']
        prob = out_logits.sigmoid()
        prob = prob.view(1, -1, 2)
        out_logits = out_logits.view(1, -1, 2)
        topk_values, topk_indexes = torch.topk(prob.view(out_logits.shape[0], -1),
                                               kpoint.shape[0] * args['num_queries'], dim=1)
        count = 0
        gt_count = torch.sum(kpoint).item()
        for k in range(topk_values.shape[0]):
            sub_count = topk_values[k, :]
            sub_count[sub_count < args['threshold']] = 0
            sub_count[sub_count > 0] = 1
            sub_count = torch.sum(sub_count).item()
            count += sub_count

        mae += abs(count - gt_count)
        mse += abs(count - gt_count) * abs(count - gt_count)

        if i % 30 == 0:
            print('{fname} Gt {gt:.2f} Pred {pred}'.format(fname=fname[0], gt=gt_count, pred=count))

    mae = mae / len(test_loader)
    mse = math.sqrt(mse / len(test_loader))

    print('mae', mae, 'mse', mse)
    return mae, mse, visi


if __name__ == '__main__':
    tuner_params = nni.get_next_parameter()
    params = vars(merge_parameter(return_args, tuner_params))

    main(params)