## Motivation 

:::: {.columns}

::: {.column width="60%"}
The motivations are:

- <p align="justify">Safety: Improve safety on the road.</p>
- <p align="justify">Traffic management: Manage traffic flow by identifying areas with high congestion.</p>
- <p align="justify">Improved navigation: Better path planning and trajectory calculation.</p>
- <p align="justify">Use LiDAR and images: LiDAR and images work best when combined.</p>
:::

::: {.column width="40%"}
![](images/Acrobat_Jy4Az0Sjhm.png)
:::

::::

- <p align="justify">Flexibility: The approach can also be used in robotics and augmented reality.</p>



## Problem statement 

:::: {.columns}

::: {.column width="40%"}
- <p align="justify">Develop a solution to locate 3D bounding boxes in a point cloud.</p>
- <p align="justify">Encode the point cloud efficiently.</p>
:::

::: {.column width="60%"}
![](images/Acrobat_UClLO5CYIH.png)
:::

::::

- <p align="justify">Apply a transformer directly to the LiDAR point cloud, without voxelization.</p>
- <p align="justify">Develop an accurate, efficient, and robust model that can generalize to new environments and tasks.</p>


## Challenges 

:::: {.columns}

::: {.column width="60%"}
![](images/Acrobat_YOwpXCawvB.png)
:::

::: {.column width="40%"}
- <p align="justify">LiDAR point clouds are inherently sparse.</p>
- <p align="justify">LiDAR point cloud density varies due to sensor range, scanning pattern, and object-sensor pose.</p>
- Occlusion issues
- Algorithm design challenges
- 3D detection hurdles

:::

::::




## Existing Methods and Limitations 

- PointNet [@DBLP:conf/cvpr/QiSMG17]:
  - Achieves permutation invariance via symmetric functions.
  - Lacks efficient capture of local structures.

- VoxelNet [@VoxelNet_2018]:
  - Exclusively employs LiDAR data.
  - Grid-based voxelization can sacrifice detail, especially at low resolutions.

- PointPillars [@PointPillars_2019]:
  - Encodes LiDAR points as pillars.
  - Limits the local resolution.

- Pseudo LiDAR [@DBLP:conf/cvpr/WangCGHCW19]:
  - Converts depth images to pseudo LiDAR.
  - Claimed to suffer from overfitting, as per [@DBLP:conf/iccv/ParkAG0G21].

## Proposed approach

- 3D Transformer Types:
  - Global [@DBLP:conf/cvpr/YuTR00L22]
  - Local [@DBLP:conf/cvpr/PanXSLH21]
  - Point-wise [@DBLP:journals/cvm/GuoCLMMH21]
  - Channel-wise [@DBLP:conf/accv/QiuAB22]
- Point-BERT strategy:
  - BERT-style pre-training for 3D global features [@DBLP:conf/cvpr/YuTR00L22]
  - Boosts pure transformer performance but overlooks local features
- Global transformers excel in classification; for localization, both local and global features are vital.
- Our approach draws from the aforementioned studies.

## Notation

- LiDAR points $P=\{p_1,p_2,\dots,p_N\} \in \mathbb R^{N \times D}$
- Embedded feature map $X\in \mathbb R^{N \times C}$
- Learnable weight matrices for query $W_Q \in \mathbb R^{C \times C_Q},$ for key $W_K \in \mathbb R^{C \times C_K},$ and for value $W_V\in \mathbb R^{C \times C}$, typically $C_K=C_Q$
- A typical transformer used as an encoder generally has six components:

   ![Transformer Encoder Architecture, courtesy:  @DBLP:journals/corr/abs-2205-07417](images/2023-09-13_15-16.png){#fig-1}



## Formulation 

$$\begin{cases}
    \text{Query}(Q) &=XW_Q \\
    \text{Key}(K) &=XW_K \tag{1}\\
    \text{Value}(V) &=XW_V \\
\end{cases}$$

- Query, key, and value are the core components of a transformer.
- Multiplying the query by the key produces the attention map.
- In the simplest case, where the query and key weights are identity matrices, the product reduces to $XX^T$, which is just a correlation between the point features.
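
As a minimal sketch of Eq. (1), the projections can be written with plain Python lists so that no libraries are needed. The feature values, the sizes ($N=3$, $C=2$), and the weight matrices are illustrative assumptions; in a trained model the weights are learned. Identity weights for query and key are chosen here to show the special case in which the raw scores equal $XX^T$.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


# Toy embedded feature map X with N = 3 points and C = 2 channels (illustrative).
X = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]

# Hypothetical weights: identity for query and key, a simple scaling for value.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[0.5, 0.0], [0.0, 0.5]]

Q = matmul(X, W_Q)  # N x C_Q
K = matmul(X, W_K)  # N x C_K
V = matmul(X, W_V)  # N x C

# Unnormalised scores: one row per query point, one column per key point.
scores = matmul(Q, transpose(K))  # N x N

# With identity W_Q and W_K the scores are the Gram matrix X X^T.
gram = matmul(X, transpose(X))
```

Each entry `scores[i][j]` is the dot product between the features of points `i` and `j`, which is why the identity-weight case can be read as a plain correlation.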


## Formulation 


Attention can now be formulated as follows. Point-wise attention:
$$\text{attention map}=\text{Softmax}\left(\frac{QK^T}{\sqrt{C_K}} \right)\tag{2}$$
Channel-wise attention:
$$\text{attention map}=\text{Softmax}\left(\frac{Q^TK}{\sqrt{C_K}} \right)\tag{3}$$

- Point-wise transformer: captures spatial relationships between points.
- Channel-wise transformer: captures contextual relationships between feature channels.
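
The difference between Eq. (2) and Eq. (3) is easiest to see in the shapes of the two maps. The sketch below is dependency-free and illustrative only: the toy `Q` and `K` values are made up, and applying the softmax row by row is an assumption, since the slides do not state the normalisation axis.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Numerically stable softmax applied to every row (assumed axis)."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def pointwise_attention_map(Q, K):
    """Eq. (2): Softmax(Q K^T / sqrt(C_K)), an N x N map over points."""
    scale = math.sqrt(len(K[0]))
    scores = matmul(Q, transpose(K))
    return softmax_rows([[v / scale for v in row] for row in scores])


def channelwise_attention_map(Q, K):
    """Eq. (3): Softmax(Q^T K / sqrt(C_K)), a C_Q x C_K map over channels."""
    scale = math.sqrt(len(K[0]))
    scores = matmul(transpose(Q), K)
    return softmax_rows([[v / scale for v in row] for row in scores])


# Toy query and key with N = 3 points and C_Q = C_K = 2 channels (illustrative).
Q = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

A_point = pointwise_attention_map(Q, K)  # 3 x 3: point-to-point weights
A_chan = channelwise_attention_map(Q, K)  # 2 x 2: channel-to-channel weights
```

The point-wise map grows with the number of points, while the channel-wise map depends only on the feature width.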

## Point Cloud sparsity example

:::: {.columns}

::: {.column width="60%"}
![A typical point cloud](images/2023-09-13_17-45.png){#fig-pcloud}
:::

::: {.column width="40%"}
 ![Corresponding image of point cloud in @fig-pcloud](images/2023-09-13_17-54.png){#fig-image}
:::

::::


- @fig-pcloud shows how sparse the point cloud data are, whereas in @fig-image the same scene is well represented by the camera image.

## Point Cloud sparsity example 2

:::: {.columns}

::: {.column width="40%"}
![Another example](images/2023-09-13_19-19.png){#fig-pcloud2}
:::

::: {.column width="60%"}
![Corresponding image of @fig-pcloud2](images/2023-09-13_19-25.png){#fig-immage2}
:::

::::


- Here again, @fig-pcloud2 contains few points for distant objects, whereas @fig-immage2 represents them well.


## Methodology 


![Proposed Architecture](images/arch_pic1_a_1.png){#fig-fullArch}

## Attention  Encoder 

![Attention based Encoder](images/GlobalFeatures.drawio_1.png){#fig-attEncd}


- We use a transformer to encode the point cloud directly, so that we obtain a long-range attention map.
- We use both point-based and channel-based attention, so that our network has spatial attention as well as contextual attention.
- We take inspiration from @DBLP:journals/pr/FengZLGM20. __As a novelty, we compute channel-wise and point-wise attention together and concatenate the results; we also use farthest point sampling (FPS).__ These choices set our method apart from @DBLP:journals/pr/FengZLGM20.
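
Farthest point sampling (FPS) can be sketched as follows. This is a generic, dependency-free version of the standard greedy algorithm, not the implementation used in this work; the function name, the `start_index` parameter, and the toy points are assumptions, and where exactly FPS sits in the encoder is not specified here.

```python
import math


def farthest_point_sampling(points, num_samples, start_index=0):
    """Return indices of `num_samples` points picked greedily by FPS."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    selected = [start_index]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        # Pick the point that is farthest from the current selection.
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected


# Toy 3D points (illustrative values only).
points = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (10.0, 0.0, 0.0),
    (0.0, 5.0, 0.0),
    (0.5, 0.5, 0.0),
]
sampled = farthest_point_sampling(points, 3)
```

Because each new sample maximises the distance to the points already chosen, the subset covers the scene more evenly than uniform random sampling, which helps with sparse, unevenly dense LiDAR data.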


## Backbone

![Modified backbone](images/Second.png){#fig-backbone}

- We use a modified version of the SECOND [@Second] architecture.
- We use only its 2D part; we do not process the data in the same way as SECOND [@Second].
- The features from three different layers are passed in parallel to the FPN [@FPN].
- We use the FPN [@FPN] unchanged, so its architecture is not shown.

## Loss function

- The loss is a weighted combination of a localization loss, a classification loss, and a directional loss.
- For classification we use focal loss, because the classes are imbalanced.
- For localization we use smooth L1 loss, as in PointPillars [@PointPillars_2019] and VoxelNet [@VoxelNet_2018].
- The directional loss is a simple cross-entropy loss.

$$\begin{align*}
\mathcal  L = \frac{1}{N}(\beta_{\text{loc}}\mathcal L_{\text{loc}} + \beta_{\text{cls}}\mathcal L_{\text{cls}} + \beta_{\text{dir}}\mathcal L_{\text{dir}})
\end{align*}$$
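
A minimal sketch of how the three terms combine is given below. The default values ($\alpha=0.25$, $\gamma=2$ for focal loss and $\beta_{\text{loc}}=2$, $\beta_{\text{cls}}=1$, $\beta_{\text{dir}}=0.2$) are those reported for PointPillars; that this work uses exactly the same numbers is an assumption. Reading $N$ as the number of positive anchors also follows PointPillars and is an assumption, as are the function names and input formats.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction; p is the positive-class probability."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def smooth_l1(residual, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large residuals."""
    x = abs(residual)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta


def cross_entropy(probs, target_index):
    """Cross entropy for one sample given class probabilities."""
    return -math.log(max(probs[target_index], 1e-12))


def total_loss(loc_residuals, cls_preds, dir_preds, num_positive,
               beta_loc=2.0, beta_cls=1.0, beta_dir=0.2):
    """Weighted sum of the three terms, normalised by N (assumed: positive anchors)."""
    l_loc = sum(smooth_l1(r) for r in loc_residuals)
    l_cls = sum(focal_loss(p, t) for p, t in cls_preds)
    l_dir = sum(cross_entropy(probs, t) for probs, t in dir_preds)
    return (beta_loc * l_loc + beta_cls * l_cls + beta_dir * l_dir) / num_positive


# Toy inputs (illustrative): two box residuals, one class score, one heading bin.
loss = total_loss(
    loc_residuals=[0.5, 2.0],
    cls_preds=[(0.9, 1)],
    dir_preds=[([0.5, 0.5], 0)],
    num_positive=1,
)
```

The factor $(1-p_t)^\gamma$ shrinks the contribution of well-classified examples, so the many easy samples of the dominant class do not swamp the rarer classes.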

## Experiment and Result

- The KITTI dataset is used for training.
- There are three classes: car, pedestrian, and cyclist. A single network is trained for all three classes.
- Adam, SGD, and other optimizers were tried, and the optimizer was selected based on validation-set performance; the same goes for the learning rate and other hyperparameters.
- The $\gamma,\beta$ parameters of the learning rate are selected experimentally.
- The loss weights are chosen as in PointPillars [@PointPillars_2019].

## Dataset

:::: {.columns}

::: {.column width="50%"}
- Training Dataset length =  3712

    - Class Distribution 

        | **category** | **number** |
        |--------------|------------|
        | Pedestrian   | 2207       |
        | Cyclist      | 734        |
        | Car          | 14357      |
:::
::: {.column width="50%"}
- Validation Dataset length =  3769

    - Class Distribution 

        | **category** | **number** |
        |--------------|------------|
        | Pedestrian   | 2280       |
        | Cyclist      | 893        |
        | Car          | 14385      |
:::
::::


- The tables show a clear class imbalance, which is why we use focal loss for classification.
- Evaluation is based on AP11 and AP40, as suggested by the KITTI benchmark [@DBLP:conf/cvpr/GeigerLU12].
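
The degree of imbalance can be quantified directly from the two tables. The short calculation below uses only the counts shown on this slide; the variable and function names are illustrative.

```python
# Object counts copied from the left and right tables on this slide.
left_counts = {"Pedestrian": 2207, "Cyclist": 734, "Car": 14357}
right_counts = {"Pedestrian": 2280, "Cyclist": 893, "Car": 14385}


def class_shares(counts):
    """Fraction of all labelled objects that belongs to each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


left_shares = class_shares(left_counts)
right_shares = class_shares(right_counts)

# How many cars there are for every cyclist in the left table.
car_to_cyclist = left_counts["Car"] / left_counts["Cyclist"]
```

Cars make up more than 80% of the labelled objects in both tables, and in the left table there are nearly 20 cars for every cyclist.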
    



## Implementation : 

- Dual Attention Encoder (__Novelty__)
```python
    VoxelDualAttentionEncoder(
    (linear): Linear(in_features=10, out_features=16, bias=True)
    (layers): ModuleList(
        (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
            (values): Linear(in_features=8, out_features=8, bias=False)
            (keys): Linear(in_features=8, out_features=8, bias=False)
            (queries): Linear(in_features=8, out_features=8, bias=False)
            (fc_out): Linear(in_features=16, out_features=16, bias=True)
            (pointWiseFeatureTransform): Conv2d(2, 16, kernel_size=(2, 2), stride=(2, 2))
            (channelWiseFeatureTransform): ConvTranspose2d(32, 16, kernel_size=(8, 8), stride=(8, 8))
            (combinedFeatureTrnasform): ConvTranspose2d(16, 2, kernel_size=(2, 2), stride=(2, 2))
        )
        (norm1): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.3, inplace=False)
        )
    )
    (dropout): Dropout(p=0.3, inplace=False)
    (fc_out): Linear(in_features=16, out_features=64, bias=True)
    )

```

## Implementation

- Block 1
```python
    (0): Sequential(
      (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (4): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (5): ReLU(inplace=True)
      (6): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (7): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (8): ReLU(inplace=True)
      (9): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (10): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (11): ReLU(inplace=True)
    )
```


## Implementation


- Block 2
```python
    (1): Sequential(
      (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (4): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (5): ReLU(inplace=True)
      (6): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (7): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (8): ReLU(inplace=True)
      (9): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (10): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (11): ReLU(inplace=True)
      (12): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (13): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (14): ReLU(inplace=True)
      (15): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (16): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (17): ReLU(inplace=True)
    )
```


## Implementation

- Block 3
```python
    (2): Sequential(
      (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (4): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (5): ReLU(inplace=True)
      (6): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (7): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (8): ReLU(inplace=True)
      (9): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (10): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (11): ReLU(inplace=True)
      (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (13): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (14): ReLU(inplace=True)
      (15): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (16): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (17): ReLU(inplace=True)
    )
```

## Implementation 

### Neck 

```python
    (0): Sequential(
      (0): ConvTranspose2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (1): Sequential(
      (0): ConvTranspose2d(128, 128, kernel_size=(2, 2), stride=(2, 2), bias=False)
      (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (2): Sequential(
      (0): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(4, 4), bias=False)
      (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )

```
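
The kernel and stride choices in the neck can be checked with the standard output-size formulas for `Conv2d` and `ConvTranspose2d` (dilation 1, no output padding). The sketch below assumes that Blocks 1 to 3 are chained in sequence, which matches their channel counts, and uses a hypothetical 496 × 432 input grid; neither is stated on the slides.

```python
def conv2d_out(size, kernel, stride, padding):
    """Output length of a Conv2d along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose2d_out(size, kernel, stride, padding=0):
    """Output length of a ConvTranspose2d along one axis (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def neck_output_sizes(height, width):
    """Spatial size of each neck branch for a given backbone input size."""
    neck = [(1, 1), (2, 2), (4, 4)]  # (kernel, stride) of the three neck branches
    sizes = []
    h, w = height, width
    for kernel, stride in neck:
        # The first conv of each block is 3x3, stride 2, padding 1 and halves the grid;
        # the remaining 3x3, stride 1, padding 1 convs keep the size unchanged.
        h, w = conv2d_out(h, 3, 2, 1), conv2d_out(w, 3, 2, 1)
        sizes.append((conv_transpose2d_out(h, kernel, stride),
                      conv_transpose2d_out(w, kernel, stride)))
    return sizes


# Hypothetical input grid; any size divisible by 8 behaves the same way.
sizes = neck_output_sizes(496, 432)
```

Under these assumptions all three branches land on the same 248 × 216 grid with 128 channels each, so their outputs can be combined without further resizing. Input sizes that are not divisible by 8 can produce mismatched branch sizes.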

## Implementation : 

### Training  

- Training used grid-based hyperparameter tuning:
    - Stochastic gradient descent (SGD) was used with a learning rate of 0.01 for the first 150 epochs, decreased to 0.001 for the last 10 epochs.
- Data augmentation:
    - Perturbation applied independently to each ground-truth 3D bounding box, together with the LiDAR points within the box.
    - Global scaling.
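
The step schedule and the global scaling augmentation can be sketched as below. Epochs are assumed to be counted from 0, points are assumed to be `(x, y, z)` tuples, and boxes are assumed to be `(x, y, z, length, width, height, yaw)` tuples; these formats and the function names are illustrative, and the scaling factor is left as a parameter because its range is not given here.

```python
def learning_rate(epoch, high_lr=0.01, low_lr=0.001, switch_epoch=150):
    """Step schedule: high_lr for the first 150 epochs, low_lr afterwards."""
    return high_lr if epoch < switch_epoch else low_lr


def global_scale(points, boxes, factor):
    """Scale the whole scene about the origin; yaw angles are left unchanged."""
    scaled_points = [tuple(c * factor for c in p) for p in points]
    scaled_boxes = [tuple(v * factor for v in box[:6]) + tuple(box[6:]) for box in boxes]
    return scaled_points, scaled_boxes


# 160 epochs in total: 150 at the higher rate, 10 at the lower rate.
schedule = [learning_rate(epoch) for epoch in range(160)]

# Toy scene with one point and one box (illustrative values).
scaled_points, scaled_boxes = global_scale(
    points=[(1.0, 2.0, 3.0)],
    boxes=[(1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 0.3)],
    factor=2.0,
)
```

Scaling box centres and dimensions by the same factor as the points keeps every ground-truth box aligned with the points it contains.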

## Implementation : Evaluation

 
- We follow the official KITTI evaluation protocol, the standard way to measure object detection performance on this dataset.
- True negatives are not defined for object detection, so we do not report a confusion matrix.
- For the same reason we cannot compute a ROC curve. Instead we use the precision-recall curve, which does not involve true negatives.
- We first generate predictions with the model and assign class labels, then compute IoU, precision, and recall.
- The area under the precision-recall curve is the average precision (AP).
- There are several ways to compute the area under the PR curve, e.g. AP11 and AP40.
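
The two AP variants differ only in the recall thresholds at which the interpolated precision is sampled: AP11 uses 0, 0.1, ..., 1.0 and AP40 uses 1/40, 2/40, ..., 1. The sketch below is a simplified illustration, not the official KITTI evaluation code. It assumes detections are already sorted by descending confidence and already matched to ground truth by IoU, so each detection is just a true-positive flag; the function names are hypothetical.

```python
def precision_recall(matches, num_gt):
    """Precision and recall after each detection, in descending score order."""
    precisions, recalls = [], []
    true_positives = 0
    for rank, is_match in enumerate(matches, start=1):
        true_positives += int(is_match)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt)
    return precisions, recalls


def interpolated_ap(precisions, recalls, num_points):
    """Mean interpolated precision over 11 (AP11) or 40 (AP40) recall thresholds."""
    if num_points == 11:
        thresholds = [i / 10 for i in range(11)]
    elif num_points == 40:
        thresholds = [i / 40 for i in range(1, 41)]
    else:
        raise ValueError("num_points must be 11 or 40")
    total = 0.0
    for t in thresholds:
        # Interpolated precision: best precision at any recall >= t.
        candidates = [p for p, r in zip(precisions, recalls) if r >= t]
        total += max(candidates, default=0.0)
    return total / len(thresholds)


# Toy example: four detections, the third is a false positive, four ground-truth boxes.
precisions, recalls = precision_recall([True, True, False, True], num_gt=4)
ap11 = interpolated_ap(precisions, recalls, 11)
ap40 = interpolated_ap(precisions, recalls, 40)
```

AP40 omits the recall-0 threshold and samples the curve more densely, so the two numbers generally differ slightly for the same detections.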


## Experiments 

### Losses 

![](images/loss_compare_1.png)

## Results: Quantitative

![](images/metrices_compare.png)

## Results: Qualitative

:::: {.columns}

::: {.column width="65%"}
![](images/images_000004.png)
:::

::: {.column width="35%"}
![](images/lidar_000004.png)
:::

::::


:::: {.columns}

::: {.column width="65%"}
![](images/images_000005.png)
:::

::: {.column width="35%"}
![](images/lidar_000005.png)
:::

::::






## 

:::: {.columns}

::: {.column width="65%"}
![](images/images_000058.png)
:::

::: {.column width="35%"}
![](images/lidar_000058.png)
:::

::::

:::: {.columns}

::: {.column width="65%"}
![](images/images_000059.png)
:::

::: {.column width="35%"}
![](images/lidar_000059.png)
:::

::::

##

:::: {.columns}

::: {.column width="65%"}
![](images/images_000062.png)
:::

::: {.column width="35%"}
![](images/lidar_000062.png)
:::

::::


:::: {.columns}

::: {.column width="65%"}
![](images/images_000098.png)
:::

::: {.column width="35%"}
![](images/lidar_000098.png)
:::

::::

##

:::: {.columns}

::: {.column width="65%"}
![](images/images_000191.png)
:::

::: {.column width="35%"}
![](images/lidar_000191.png)
:::

::::


:::: {.columns}

::: {.column width="65%"}
![](images/images_000211.png)
:::

::: {.column width="35%"}
![](images/lidar_000211.png)
:::

::::

## Future work 1 

![Local Features](images/LocalFeatures.png){#fig-local}

## Future work 1 

![Overall Architecture](images/arch_pic1.png){#fig-strechArch}

## Future  work 1 


- We can also obtain local features using a patch-based network.
- Point-BERT [@DBLP:conf/cvpr/YuTR00L22] pre-trains a network on point clouds, but it tokenizes the point cloud and then performs positional encoding.
- This can be done more efficiently: the point cloud already carries position information in its coordinates, so without tokenization the coordinates can serve directly as the positional encoding feature.
- __Novelty__: Use a patch-based attention encoder to obtain local features, and use the coordinate locations as positional encoding.

## Future work  2 

![Proposed Architecture](images/overAll.drawio.png){#fig-arch}

## Future work  2 


- @fig-arch shows the overall architecture. We extract global and local features from the image and the 3D point cloud using a transformer-based encoder with point-wise and channel-wise attention.
- These features are then fused with a cross-attention mechanism, as explained in CAT-Det [@DBLP:conf/cvpr/ZhangC022].
- __Novelty__: Use feature-based multi-modality fusion of channel-wise and point-wise attention; for cross attention, use both global and local attention.

## References  