# R2Plus1D
## Algorithm Overview
Paper: [[1711.11248] A Closer Look at Spatiotemporal Convolutions for Action Recognition (arxiv.org)](https://arxiv.org/abs/1711.11248)

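In their CVPR 2018 paper *A Closer Look at Spatiotemporal Convolutions for Action Recognition*, Tran et al. proposed R(2+1)D and showed that factorizing 3D convolutional filters into separate spatial and temporal components yields a significant gain in accuracy. In the $i$-th block, the (2+1)D convolution module replaces the $N_{i}$ 3D convolutional filters of size $N_{i-1} \times t \times d \times d$ with $M_{i}$ 2D spatial filters of size $N_{i-1} \times 1 \times d \times d$, followed by $N_{i}$ 1D temporal filters of size $M_{i} \times t \times 1 \times 1$. The hyperparameter $M_{i}$ determines the dimensionality of the intermediate subspace onto which the signal is projected between the spatial and temporal convolutions. The paper sets $M_{i}$ to: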
$$
M_{i}= \left \lfloor \frac{td^{2}N_{i-1}N_{i}}{d^{2}N_{i-1}+tN_{i}} \right \rfloor
$$

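Here $i$ indexes the $i$-th convolutional block of the residual network. This choice keeps the number of parameters in the (2+1)D module approximately equal to that of the full 3D convolution.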
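
As a quick check of the formula above, the following sketch computes $M_{i}$ and compares the weight counts of a full 3D convolution and its (2+1)D factorization. The function names and the example layer (64 input channels, 64 output channels, $t = d = 3$) are illustrative assumptions, and bias terms are ignored.

```python
def mid_channels(n_in, n_out, t, d):
    """M_i = floor(t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i))."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_3d(n_in, n_out, t, d):
    """Weights in N_i full 3D filters of size N_{i-1} x t x d x d."""
    return n_out * n_in * t * d * d


def params_2plus1d(n_in, n_out, t, d):
    """Weights in M_i spatial filters plus N_i temporal filters."""
    m = mid_channels(n_in, n_out, t, d)
    spatial = m * n_in * d * d   # M_i filters of size N_{i-1} x 1 x d x d
    temporal = n_out * m * t     # N_i filters of size M_i x t x 1 x 1
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical layer: 64 -> 64 channels with a 3x3x3 kernel.
    print(mid_channels(64, 64, 3, 3))
    print(params_3d(64, 64, 3, 3), params_2plus1d(64, 64, 3, 3))
```

Because of the floor in the formula, the factorized count never exceeds the 3D count and matches it exactly whenever the division has no remainder.
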
<div align=center>
    <img src=./pics/r2plus1d.png> 
</div>

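Compared with full 3D convolution, the (2+1)D factorization has two advantages. First, although the number of parameters is unchanged, the additional activation function between the 2D and 1D convolutions in each block doubles the number of nonlinearities in the network, which increases the complexity of the functions it can represent. Second, forcing the 3D convolution into separate spatial and temporal components makes optimization easier: compared with 3D convolutional networks of the same parameter count, (2+1)D networks reach a lower training error.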

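The table below shows the architectures of the 18-layer and 34-layer R3D networks. Replacing the 3D convolutions in R3D with (2+1)D convolutions gives the R(2+1)D network of the corresponding depth.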
<div align=center>
    <img src=./pics/r3d_block.png> 
</div>



### Environment Setup
```text
git clone https://gitee.com/yanlq46462828/zjut_mindvideo.git
cd zjut_mindvideo

# Please first install mindspore according to instructions on the official website: https://www.mindspore.cn/install

pip install -r requirements.txt
pip install -e .
```
### Training


```python
from mindspore import nn
from mindspore import context, load_checkpoint, load_param_into_net
from mindspore.context import ParallelMode
from mindspore.communication import init, get_rank, get_group_size
from mindspore.train import Model
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits

from mindvideo.utils.check_param import Validator, Rel
```

##### Dataset Loading

The Kinetics-400 dataset is loaded through the `Kinetic400` class, which is built on `VideoDataset`.

```python
from mindvideo.data.kinetics400 import Kinetic400
# Data Pipeline.
dataset = Kinetic400(path='/home/publicfile/kinetics-400',
                     split="train",
                     seq=32,
                     num_parallel_workers=1,
                     shuffle=True,
                     batch_size=6,
                     repeat_num=1)
ckpt_save_dir = './r2plus1d'
```

/home/publicfile/kinetics-400/cls2index.json


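##### Data Processing

`VideoRescale` rescales the video's pixel values, `VideoResize` resizes the frames, `VideoRandomCrop` takes a random crop from the resized video, `VideoRandomHorizontalFlip` flips the video horizontally with the given probability, `VideoReOrder` permutes the dimensions, and `VideoNormalize` normalizes the result.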

```python
from mindvideo.data.transforms import VideoRandomCrop, VideoRandomHorizontalFlip, VideoRescale
from mindvideo.data.transforms import VideoNormalize, VideoResize, VideoReOrder

transforms = [VideoRescale(shift=0.0),
              VideoResize([128, 171]),
              VideoRandomCrop([112, 112]),
              VideoRandomHorizontalFlip(0.5),
              VideoReOrder([3, 0, 1, 2]),
              VideoNormalize(mean=[0.43216, 0.394666, 0.37645],
                             std=[0.22803, 0.22145, 0.216989])]
dataset.transform = transforms
dataset_train = dataset.run()
Validator.check_int(dataset_train.get_dataset_size(), 0, Rel.GT)
step_size = dataset_train.get_dataset_size()
```
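
To make the effect of these transforms on tensor shapes concrete, here is a small pure-Python sketch that traces a clip's shape through the resize, crop, and reorder steps and normalizes a single pixel value. It assumes that the decoded clip arrives as (T, H, W, C), so that `VideoReOrder([3, 0, 1, 2])` produces (C, T, H, W), and that rescaling multiplies 8-bit pixel values by 1/255. Both are assumptions about mindvideo rather than confirmed behavior, and the 240 x 320 input size and the helper names are hypothetical.

```python
def trace_shapes(shape, resize=(128, 171), crop=(112, 112), order=(3, 0, 1, 2)):
    """Return the clip shape after resize, crop, and axis reordering.

    `shape` is assumed to be (T, H, W, C).
    """
    t, _, _, c = shape
    resized = (t, resize[0], resize[1], c)        # VideoResize([128, 171])
    cropped = (t, crop[0], crop[1], c)            # VideoRandomCrop([112, 112])
    reordered = tuple(cropped[i] for i in order)  # VideoReOrder([3, 0, 1, 2])
    return resized, cropped, reordered


def normalize_pixel(value, mean, std, rescale=1.0 / 255.0, shift=0.0):
    """Rescale one 8-bit pixel value, then standardize it with mean and std."""
    scaled = value * rescale + shift
    return (scaled - mean) / std


if __name__ == "__main__":
    # Hypothetical 32-frame clip of 240 x 320 RGB frames.
    for step in trace_shapes((32, 240, 320, 3)):
        print(step)
    # First-channel statistics taken from the VideoNormalize call above.
    print(normalize_pixel(128, mean=0.43216, std=0.22803))
```

Under these assumptions, each sample reaches the network as a 3 x 32 x 112 x 112 tensor before batching.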



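##### Network Construction
1. In R2Plus1d18, the input first passes through a (2+1)D convolution module and a max pooling layer, then through 4 residual blocks built from (2+1)D convolution modules, and finally through an average pooling layer, a flatten layer, and a fully connected layer.

2. The first (2+1)D convolution module is a Conv3d with kernel size (1,7,7) followed by a Conv3d with kernel size (3,1,1), with Batch Normalization and ReLU layers between the two convolutions.

3. R2Plus1d18 contains 4 residual blocks, each stacked twice in the model. Every block consists of two (2+1)D convolution modules, and every (2+1)D convolution is a Conv3d with kernel size (1,3,3) followed by a Conv3d with kernel size (3,1,1), again with Batch Normalization and ReLU layers between the convolutions. The input and output of each block are joined by a residual connection.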
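
One way to see why a (1,3,3) convolution followed by a (3,1,1) convolution can stand in for a single 3x3x3 convolution is to compare receptive fields. The sketch below assumes stride 1 and no dilation, and the helper name is illustrative.

```python
def stacked_receptive_field(kernels):
    """Receptive field (T, H, W) of stride-1 convolutions applied in sequence."""
    field = [1, 1, 1]
    for kernel in kernels:
        # Each stride-1 layer widens the field by (kernel size - 1) per axis.
        field = [f + k - 1 for f, k in zip(field, kernel)]
    return tuple(field)


if __name__ == "__main__":
    # (2+1)D module used inside the residual blocks.
    print(stacked_receptive_field([(1, 3, 3), (3, 1, 1)]))
    # (2+1)D module at the network input.
    print(stacked_receptive_field([(1, 7, 7), (3, 1, 1)]))
```

The first call covers the same 3x3x3 extent as one full 3D kernel, and the second covers 3x7x7.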

```python
from mindvideo.models.r2plus1d import R2Plus1d18
# Create model
network = R2Plus1d18(num_classes=400)
```

```python
from mindvideo.schedule.lr_schedule import warmup_cosine_annealing_lr_v1
# Set learning rate scheduler.
learning_rate = warmup_cosine_annealing_lr_v1(lr=0.01,
                                              steps_per_epoch=step_size,
                                              warmup_epochs=4,
                                              max_epoch=100,
                                              t_max=100,
                                              eta_min=0)
```
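
For intuition about the shape of such a schedule, here is a pure-Python sketch of linear warmup followed by cosine annealing. It is not the mindvideo implementation of `warmup_cosine_annealing_lr_v1`: the linear ramp toward `lr`, the per-epoch cosine update, and the function name are assumptions chosen for illustration.

```python
import math


def warmup_cosine_lr(lr, steps_per_epoch, warmup_epochs, max_epoch, t_max, eta_min=0.0):
    """Return one learning rate per training step."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = max_epoch * steps_per_epoch
    schedule = []
    for step in range(total_steps):
        if step < warmup_steps:
            # Linear ramp from lr / warmup_steps up to lr.
            value = lr * (step + 1) / warmup_steps
        else:
            # Cosine decay, evaluated once per epoch.
            epoch = step // steps_per_epoch
            value = eta_min + (lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2
        schedule.append(value)
    return schedule


if __name__ == "__main__":
    # Tiny hypothetical run: 10 steps per epoch instead of the real dataset size.
    rates = warmup_cosine_lr(lr=0.01, steps_per_epoch=10, warmup_epochs=4,
                             max_epoch=100, t_max=100, eta_min=0.0)
    print(len(rates), max(rates), rates[-1])
```

In this sketch the rate peaks at `lr` on the last warmup step and then decays toward `eta_min` by the final epoch.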

```python
# Define optimizer.
network_opt = nn.Momentum(network.trainable_params(),
                          learning_rate=learning_rate,
                          momentum=0.9,
                          weight_decay=0.00004)
# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
```


```python
# Set the checkpoint config for the network.
ckpt_config = CheckpointConfig(
    save_checkpoint_steps=step_size,
    keep_checkpoint_max=10)
ckpt_callback = ModelCheckpoint(prefix='r2plus1d_kinetics400',
                                directory=ckpt_save_dir,
                                config=ckpt_config)
```

```python
# Init the model.
model = Model(network, loss_fn=network_loss, optimizer=network_opt, metrics={'acc'})
```

```python
# Begin to train.
print('[Start training `{}`]'.format('r2plus1d_kinetics400'))
print("=" * 80)
model.train(1,
            dataset_train,
            callbacks=[ckpt_callback, LossMonitor()],
            dataset_sink_mode=False)
print('[End of training `{}`]'.format('r2plus1d_kinetics400'))
```



```text
[Start training `r2plus1d_kinetics400`]
epoch: 1 step: 1, loss is 5.998835563659668
epoch: 1 step: 2, loss is 5.921803951263428
epoch: 1 step: 3, loss is 6.024421691894531
epoch: 1 step: 4, loss is 6.08278751373291
epoch: 1 step: 5, loss is 6.014780044555664
epoch: 1 step: 6, loss is 5.945815086364746
epoch: 1 step: 7, loss is 6.078174114227295
epoch: 1 step: 8, loss is 6.0565361976623535
epoch: 1 step: 9, loss is 5.952683448791504
epoch: 1 step: 10, loss is 6.033120632171631
epoch: 1 step: 11, loss is 6.05575704574585
epoch: 1 step: 12, loss is 5.9879350662231445
epoch: 1 step: 13, loss is 6.006839275360107
epoch: 1 step: 14, loss is 5.9968180656433105
epoch: 1 step: 15, loss is 5.971335411071777
epoch: 1 step: 16, loss is 6.0620856285095215
epoch: 1 step: 17, loss is 6.081112861633301
epoch: 1 step: 18, loss is 6.106649398803711
epoch: 1 step: 19, loss is 6.095144271850586
epoch: 1 step: 20, loss is 6.00246000289917
epoch: 1 step: 21, loss is 6.061524868011475
epoch: 1 step: 22, loss

KeyboardInterrupt: 
```
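
The loss values in this log sit close to 6.0, which is roughly what an untrained 400-class classifier would be expected to produce. A short sketch of the arithmetic, assuming the network starts out predicting approximately uniform probabilities (the helper name is illustrative):

```python
import math


def sparse_cross_entropy(logits, label):
    """Softmax cross-entropy for one sample with an integer class label."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]


if __name__ == "__main__":
    num_classes = 400
    uniform_logits = [0.0] * num_classes
    # Equal logits give probability 1/400 per class, so the loss is ln(400).
    print(sparse_cross_entropy(uniform_logits, label=7))
    print(math.log(num_classes))
```

Both values are about 5.99, in line with the first steps of the log above.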

### Evaluation

```python
from mindspore import context
from mindvideo.data.kinetics400 import Kinetic400

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

# Data Pipeline.
dataset_eval = Kinetic400("/home/publicfile/kinetics-400",
                          split="val",
                          seq=32,
                          seq_mode="interval",
                          num_parallel_workers=1,
                          shuffle=False,
                          batch_size=8,
                          repeat_num=1)
```

/home/publicfile/kinetics-400/cls2index.json


```python
from mindvideo.data.transforms import VideoCenterCrop, VideoRescale, VideoReOrder
from mindvideo.data.transforms import VideoNormalize, VideoResize

transforms = [VideoResize([128, 171]),
              VideoRescale(shift=0.0),
              VideoCenterCrop([112, 112]),
              VideoReOrder([3, 0, 1, 2]),
              VideoNormalize(mean=[0.43216, 0.394666, 0.37645],
                             std=[0.22803, 0.22145, 0.216989])]
dataset_eval.transform = transforms
dataset_eval = dataset_eval.run()
```

```python
from mindspore import nn
from mindspore import context, load_checkpoint, load_param_into_net
from mindspore.train import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindvideo.utils.callbacks import EvalLossMonitor
from mindvideo.models.r2plus1d import R2Plus1d18

# Create model
network = R2Plus1d18(num_classes=400)

# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")

param_dict = load_checkpoint('/home/zhengs/r2plus1d/r2plus1d18_kinetic400.ckpt')
load_param_into_net(network, param_dict)

# Define eval_metrics.
eval_metrics = {'Loss': nn.Loss(),
                'Top_1_Accuracy': nn.Top1CategoricalAccuracy(),
                'Top_5_Accuracy': nn.Top5CategoricalAccuracy()}

# Init the model.
model = Model(network, loss_fn=network_loss, metrics=eval_metrics)

print_cb = EvalLossMonitor(model)
```


```python
# Begin to eval.
print('[Start eval `{}`]'.format('r2plus1d_kinetics400'))
result = model.eval(dataset_eval,
                    callbacks=[print_cb],
                    dataset_sink_mode=False)
print(result)
```



```text
[Start eval `r2plus1d_kinetics400`]
step:[    1/ 2484], metrics:[], loss:[3.070/3.070], time:1923.473 ms, 
step:[    2/ 2484], metrics:['Loss: 3.0702', 'Top_1_Accuracy: 0.3750', 'Top_5_Accuracy: 0.7500'], loss:[0.808/1.939], time:169.314 ms, 
step:[    3/ 2484], metrics:['Loss: 1.9391', 'Top_1_Accuracy: 0.5625', 'Top_5_Accuracy: 0.8750'], loss:[2.645/2.175], time:192.965 ms, 
step:[    4/ 2484], metrics:['Loss: 2.1745', 'Top_1_Accuracy: 0.5417', 'Top_5_Accuracy: 0.8750'], loss:[2.954/2.369], time:172.657 ms, 
step:[    5/ 2484], metrics:['Loss: 2.3695', 'Top_1_Accuracy: 0.5000', 'Top_5_Accuracy: 0.8438'], loss:[2.489/2.393], time:176.803 ms, 
step:[    6/ 2484], metrics:['Loss: 2.3934', 'Top_1_Accuracy: 0.4750', 'Top_5_Accuracy: 0.8250'], loss:[1.566/2.256], time:172.621 ms, 
step:[    7/ 2484], metrics:['Loss: 2.2556', 'Top_1_Accuracy: 0.4792', 'Top_5_Accuracy: 0.8333'], loss:[0.761/2.042], time:172.149 ms, 
step:[    8/ 2484], metrics:['Loss: 2.0420', 'Top_1_Accuracy: 0.5357', 'Top_5

KeyboardInterrupt: 
```
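
The `Top_1_Accuracy` and `Top_5_Accuracy` values in the log above are cumulative over the batches evaluated so far. As an illustration of what the metric measures, here is a pure-Python sketch of top-k accuracy. It is not the MindSpore implementation, and the function name and toy scores are made up for the example.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


if __name__ == "__main__":
    # Toy batch: 3 samples, 4 classes (hypothetical scores).
    scores = [[0.10, 0.70, 0.20, 0.00],
              [0.50, 0.10, 0.30, 0.10],
              [0.30, 0.15, 0.05, 0.50]]
    labels = [1, 2, 0]
    print(top_k_accuracy(scores, labels, k=1))
    print(top_k_accuracy(scores, labels, k=2))
```

Top-k accuracy can only stay the same or rise as k grows, which is why the Top-5 figures in the log are never below the Top-1 figures.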

## Code
The code repositories are available at:

Gitee   https://gitee.com/yanlq46462828/zjut_mindvideo

Github  https://github.com/ZJUT-ERCISS/r2plus1d_mindspore