# FACE RECOGNITION WITH ARCFACE

### WITH THE BEST SUCCESS RATE

#### https://arxiv.org/pdf/1804.06655.pdf

* ArcFace
* GAN

## 1. FACE RECOGNITION SYSTEMS


<img src="https://imgur.com/jUJZ4tc.png"  width="600"> <br>

* **Data** : We should have images of faces for model training.
* **Data Process** : GAN is used here.
* **Architecture** : Model selection is done in this section.
* **Loss** : The loss function guides training; a well-chosen loss improves what the model learns. <br>

<img src="https://imgur.com/xFd67t4.png"  width="1000"><br> <br>
* **Face Alignment** : Data Cleaning.
* **Anti - Spoofing** : Fake image detection.
* **Face Processing** : Data preprocessing. (GAN)
* **Feature Extraction** : The model extracts features from the image, focusing on a specific area, and outputs them as a vector, for example of size 512.
* The output vector is passed to the loss function during training. (The loss function is not used during testing.)
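
Since the loss function is not used at test time, two faces are typically compared directly through their feature vectors. The snippet below is a minimal sketch of that comparison using cosine similarity; the 4-dimensional toy vectors and the `threshold` value of 0.5 are assumptions chosen only for illustration (real embeddings here would have 512 values, and a threshold is normally tuned on validation pairs).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_person(emb_a, emb_b, threshold=0.5):
    # threshold is a hypothetical value, not taken from the notebook
    return cosine_similarity(emb_a, emb_b) >= threshold

# toy "embeddings" standing in for 512-dimensional feature vectors
anchor = [0.9, 0.1, 0.0, 0.4]
same = [0.8, 0.2, 0.1, 0.5]
other = [-0.3, 0.9, 0.4, -0.1]

print(same_person(anchor, same))   # vectors point in a similar direction
print(same_person(anchor, other))  # vectors point in different directions
```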



<img src="https://imgur.com/0C8sHj2.png"  width="800"><br>


### History Of Face Recognition Systems

<img src="https://imgur.com/37cOL3n.png"  width="1000">

### Modern Face Recognition Deep Learning Models

##### ImageNet Classification with Deep Convolutional Neural Networks (Details of the AlexNet model)

https://papers.nips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

##### Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG Networks)

https://arxiv.org/pdf/1409.1556.pdf

##### Deep Residual Learning for Image Recognition (ResNet)

https://arxiv.org/pdf/1512.03385.pdf

#### 1- Alexnet

<img src="https://imgur.com/ZL5xBvr.png"  width="400"> <br>
* The first layer has 96 **filters** of size 11x11. **Max-pooling** reduces the spatial size of the **feature map**; for example, a 2x2 pooling with stride 2 halves a **256x256** map to **128x128**.
* The output of the layers is used as input for the next layer.
* The second layer has 5x5 filters and the next layers have 3x3 filters.
* In the last part of the model, there are 3 **fully-connected layers**.
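
How much a convolution or pooling layer shrinks a feature map follows a simple formula: `floor((input + 2 * padding - kernel) / stride) + 1`. The sketch below applies it to a few cases. The 227x227 input size and the stride of 4 for the first AlexNet layer are assumptions taken from common descriptions of AlexNet, not from the figure above.

```python
def output_size(input_size, kernel, stride=1, padding=0):
    """Spatial size (width or height) after a convolution or pooling layer."""
    return (input_size + 2 * padding - kernel) // stride + 1

# 2x2 max-pooling with stride 2 halves the feature map
print(output_size(256, kernel=2, stride=2))

# assumed AlexNet first layer: 11x11 filters, stride 4, on a 227x227 input
after_conv1 = output_size(227, kernel=11, stride=4)
print(after_conv1)

# assumed overlapping 3x3 max-pooling with stride 2 after the first layer
print(output_size(after_conv1, kernel=3, stride=2))
```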



#### 2- VGGNet


<img src="https://imgur.com/X2htpH4.png"  width="800">  <br>
* It achieved better results than AlexNet.
* Its disadvantage compared to modern models is the fully-connected layers at the end.
* Because of these layers it contains a very large number of parameters, so it takes up more space and runs more slowly than modern models.
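
To see how much the fully-connected layers weigh, their parameters can be counted by hand. The sketch below assumes the VGG-16 layout from the VGG paper (a 7x7x512 feature map feeding two 4096-unit layers and a 1000-class output); these sizes are not shown in the notes above, so treat them as an assumption.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully-connected layer."""
    return in_features * out_features + out_features

# assumed VGG-16 classifier: 7*7*512 -> 4096 -> 4096 -> 1000
fc_layers = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]
fc_total = sum(linear_params(i, o) for i, o in fc_layers)

# for comparison: one 3x3 convolution with 512 input and 512 output channels
conv_example = 512 * 512 * 3 * 3 + 512

print(f"fully-connected layers: {fc_total:,} parameters")
print(f"one 3x3 conv (512->512): {conv_example:,} parameters")
```

Under these assumptions the three fully-connected layers alone hold well over 100 million parameters, far more than any single convolution layer.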



#### GoogleNet & ResNet

<img src="https://imgur.com/rV6eIyo.png"  width="800"> <br>
* GoogleNet addressed parameter redundancy.
* 1x1 convolutions change the number of channels (the depth) of the feature map, not its width or height, which results in models that require fewer parameters.
* ResNet enables the creation of deeper models.
* The highest success rate in face recognition was obtained using the ResNet model.
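
The parameter saving from 1x1 convolutions can be checked with simple arithmetic. The sketch below compares a direct 3x3 convolution with a bottleneck that first reduces the channels with a 1x1 convolution and restores them afterwards. The channel counts (256 and 64) are hypothetical values chosen for illustration.

```python
def conv_params(in_channels, out_channels, kernel):
    """Number of weights in a convolution layer (bias ignored)."""
    return in_channels * out_channels * kernel * kernel

# direct 3x3 convolution on 256 channels
direct = conv_params(256, 256, 3)

# bottleneck: 1x1 reduces 256 -> 64, 3x3 works on 64, 1x1 restores 64 -> 256
bottleneck = (
    conv_params(256, 64, 1)
    + conv_params(64, 64, 3)
    + conv_params(64, 256, 1)
)

print(direct, bottleneck)
print(round(direct / bottleneck, 2))  # how many times fewer weights
```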


<img src="https://imgur.com/OMqR2Oj.png"  width="800"> <br>

* At the bottom are the most successful models of the time.


### Face Recognition Data Sets

##### The Devil of Face Recognition is in the Noise

https://arxiv.org/pdf/1807.11649.pdf


<img src="https://imgur.com/Uk0EcYK.png"  width="800"> <br>

- The datasets shown in red are **training sets**; the other colors are **test sets**.
- In this project we use **MS-Celeb-1M** (Microsoft's dataset of celebrity images) for training and **LFW** for testing.


<img src="https://imgur.com/lGioF2l.png"  width="600"> <br>
- Models trained on more users (identities) achieve higher performance.

<img src="https://imgur.com/gMgHqqF.png"  width="800"> <br>
- The table lists how many different ids and images each dataset contains.
- **Source** : indicates how the dataset was obtained.
- **Cleaned** : indicates how the dataset was preprocessed.



### LOSS FUNCTIONS

##### ArcFace: Additive Angular Margin Loss for Deep Face Recognition
https://arxiv.org/pdf/1801.07698.pdf


<img src="https://imgur.com/N1H7uF5.png"  width="920"> <br>

- Red : methods using the **Softmax** loss function.
- Green : methods using a **Euclidean Distance-Based** loss function.
- Blue : methods using an **Angular Margin Based** loss function.
- Yellow : methods using **Softmax Variations**.

<img src="https://imgur.com/PvEuNPC.png"  width="920"> <br>
- **Softmax** loss is widely used in deep learning.
- However, for tasks with high intra-class appearance variation, such as face recognition, softmax loss does not explicitly optimize the features to be similar within a class and different between classes.
- ArcFace extracts discriminative features from face images with its proposed **Angular Margin Loss**.
- The **purpose** of the proposed method (angular margin) is to help the model learn better during training: it adds a penalty margin between the images of different users while gathering the images of the same user in the same region of the feature space.
- Figures a and b on the right show the training results using different loss functions.
- Each color represents a different class.
- ArcFace does a better job of keeping **different classes in different regions** of the space.
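
The angular margin described above can be written compactly. The formula below follows the ArcFace paper linked at the top of this section: $N$ is the batch size, $y_i$ the true class of sample $i$, $\theta_j$ the angle between the normalized feature and the normalized weight of class $j$, $m$ the angular margin, and $s$ the feature scale.

```latex
L = -\frac{1}{N} \sum_{i=1}^{N} \log
    \frac{e^{s \cos(\theta_{y_i} + m)}}
         {e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos \theta_j}}
```

With $m = 0$ this reduces to softmax loss on normalized features and weights; a positive $m$ makes the true class harder to satisfy, which pushes features of the same class closer together.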

<img src="https://imgur.com/X4Cui7C.png"  width="920"> <br>
- The figure shows a deep learning model trained using ArcFace.
- The variable x denotes features and w denotes normalized weights.
- The **first part** represents the model, while the **middle part** shows the addition of the **angular margin penalty**. The other parts are identical to previous studies.
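
The margin step in the middle of the figure can be sketched in a few lines of plain Python. This is only an illustration, not the code in `ArcMarginProduct.py`: the function names are hypothetical, the margin of 0.5 is the value used in the ArcFace paper, the scale of 32.0 matches `conf.scale_size` used later in this notebook, and the edge case where the angle plus the margin exceeds pi is ignored.

```python
import math

def arcface_logits(cosines, target, margin=0.5, scale=32.0):
    """cosines[j] is cos(theta_j) between the normalized feature and class weight j."""
    logits = []
    for j, cos_theta in enumerate(cosines):
        if j == target:
            # add the angular margin only to the true class
            theta = math.acos(max(-1.0, min(1.0, cos_theta)))
            cos_theta = math.cos(theta + margin)
        logits.append(scale * cos_theta)
    return logits

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target]

cosines = [0.8, 0.1, -0.2]  # toy similarities to three classes; class 0 is the true one
plain = cross_entropy([32.0 * c for c in cosines], target=0)
with_margin = cross_entropy(arcface_logits(cosines, target=0), target=0)

print(plain < with_margin)  # the margin makes the same sample "harder"
```

Because the margin lowers the score of the true class, the model has to produce a smaller angle to reach the same loss, which is what tightens each class in the feature space.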


<img src="https://imgur.com/5oJnXNc.png"  width="400"> <br>
- Different methods are tested and compared on LFW test data.
- **Data Set :** MS1MV2 (Microsoft 1 Million)  -  **Model :** ResNet100  -  **Loss Function :** ArcFace   --> highest success rate


## 2. PROJECT

#### Preliminary Work
##### PyTorch -  Classification Application with CIFAR10
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py

---

##### ArcFace / backbone / arcfacenet.py  : MODEL
- A ResNet-based model
- An adapted model for face recognition
##### ArcFace / data /
- Contains the code for loading the test and training data.
##### ArcFace / dataset : MS1M
- Microsoft 1 Million (only a part of it is used here) : training on the full dataset takes weeks!
##### ArcFace / dataset : lfw...
- Test data, used to evaluate the model after training.
##### ArcFace / margin / ArcMarginProduct.py
- Python script containing the ArcFace margin code.
##### ArcFace / util / utils.py
- A few helper functions used during training.
---

### LIBRARIES

In [1]:
# import py files(data,margin,backbone...) for kaggle
import sys
sys.path.append('/kaggle/input/arcface/')

# file operations
import os
from pathlib import Path

# progress bar for the training loop
from tqdm import tqdm

# configuration
from easydict import EasyDict as edict

# pytorch
import torch
import torch.nn as nn       # functions required for neural networks
import torch.optim as optim
import torchvision.utils as vutils
from torchvision import transforms as trans


# data loading
from data.ms1m import get_train_loader
from data.lfw import LFW

# MODEL
from backbone.arcfacenet import SEResNet_IR
from margin.ArcMarginProduct import ArcMarginProduct

from util.utils import save_checkpoint, test




### CONFIGURATION

#### Batch Size & Learning Rate & Epoch
- **Batch Size :** the number of images given to the model in one iteration.
- **Epoch :** the number of times the model sees the whole training set. One epoch consists of many iterations (number of images / batch size).
- 1 Epoch = the model has seen all the images once.
- As a rule of thumb, there is a roughly linear relationship between **Learning Rate** and **Batch Size**.
- If the Batch Size is **reduced**, the Learning Rate should also be **reduced.**
- With fewer images per iteration, each gradient estimate is **noisier**, so the step size (learning rate) should be smaller to keep training stable.
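
The relationships above can be checked with a little arithmetic. The sketch below uses the numbers from this notebook (29148 training images, batch size 96, learning rate 0.01); the linear scaling rule is a common heuristic rather than a guarantee, and the function names are only illustrative.

```python
import math

def iterations_per_epoch(num_images, batch_size):
    """Number of batches needed to show every image once."""
    return math.ceil(num_images / batch_size)

def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling heuristic: the learning rate changes in proportion to the batch size."""
    return base_lr * new_batch_size / base_batch_size

print(iterations_per_epoch(29148, 96))  # iterations in one epoch at batch size 96
print(iterations_per_epoch(29148, 16))  # more, smaller steps at batch size 16
print(scaled_lr(0.01, 96, 16))          # suggested learning rate for batch size 16
```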

In [16]:
conf = edict()

conf.train_root = '/kaggle/input/arcface/dataset/MS1M'
conf.lfw_test_root = '/kaggle/input/arcface/dataset/lfw_aligned_112'
conf.lfw_file_list = '/kaggle/input/arcface/dataset/lfw_pair.txt'

conf.mode = 'se_ir'   # 'ir' : ResNet-based, 'se_ir' : ResNet blocks plus SE (squeeze-and-excitation) blocks.
conf.depth = 50       # arcfacenet.py > def get_blocks(num_layers) : 50, 100 or 152 can be selected. (DEPTH)
                      # If depth = 100, it may be better to use 'ir' mode.
conf.margin_type = 'ArcFace'
conf.feature_dim = 512   # size of the vector the model outputs for one image. According to the article : 512.
conf.scale_size = 32.0
conf.batch_size = 96     # number of images given to the model per iteration. (16 can be selected if GPU memory is low.)
conf.lr = 0.01
conf.milestones = [8,10,12] # the Learning Rate is divided by 10 at epochs 8, 10 and 12 (to reduce the train loss further)
conf.total_epoch = 14

# SAVING CHECKPOINTS & MODELS
conf.save_folder = './saved'
conf.save_dir = os.path.join(conf.save_folder, conf.mode + '_' + str(conf.depth))       # ./saved/se_ir_50
# The folder name is built from the configuration, so changing mode or depth changes the save location. (other settings can be added to the name)

# GPU
conf.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') # Use the GPU if an NVIDIA card is available, otherwise the CPU.
conf.num_workers = 4       # number of worker processes (on the CPU) used to load the data.
conf.pin_memory = True


# WARNING : WITH THESE VALUES, TRAINING USES APPROXIMATELY 6GB-12GB OF GPU MEMORY.

In [3]:
os.makedirs(conf.save_dir, exist_ok = True)     # If saved folder does not exist, create it.

### LOADING DATA

In [17]:

# DATA AUGMENTATION : a technique that reduces overfitting by training the model on several
# slightly modified copies of the existing data. The training loader applies it itself (see RandomHorizontalFlip in its printout);
# the transform below only converts and normalizes the images and is used for the test data.

transform = trans.Compose([               # Compose chains several transforms into one.
    trans.ToTensor(),        # ToTensor converts the image to a tensor so it can be given to the model.
    # range [0,255] -> [0.0, 1.0] : RGB values between 0 and 255 are scaled to 0-1.

    # NORMALIZATION : with mean 0.5 and std 0.5, each channel is mapped from [0.0, 1.0] to [-1.0, 1.0].
    trans.Normalize(mean=(0.5,0.5,0.5), std=(0.5,0.5,0.5))
    # One mean and one std value per channel (r, g, b). (aim: to train more resilient models)

])

trainloader, class_num = get_train_loader(conf)
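
The effect of `ToTensor` followed by `Normalize(mean=0.5, std=0.5)` on a single pixel value can be reproduced by hand. The two helper functions below are illustrative stand-ins written in plain Python, not the torchvision implementations.

```python
def to_tensor_value(pixel):
    """Scale an 8-bit pixel value from [0, 255] to [0.0, 1.0]."""
    return pixel / 255.0

def normalize(value, mean=0.5, std=0.5):
    """Subtract the mean and divide by the standard deviation."""
    return (value - mean) / std

for pixel in (0, 128, 255):
    print(pixel, normalize(to_tensor_value(pixel)))
```

Black (0) maps to -1.0 and white (255) maps to 1.0, so every channel ends up centered around zero.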


In [5]:
print('number of id : ', class_num)

number of id :  200


In [6]:
print(trainloader.dataset)

Dataset ImageFolder
    Number of datapoints: 29148
    Root location: /kaggle/input/arcface/dataset/MS1M
    StandardTransform
Transform: Compose(
               RandomHorizontalFlip(p=0.5)
               ToTensor()
               Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
           )


In [7]:
# we do the same for the test data:
lfwdataset = LFW(conf.lfw_test_root, conf.lfw_file_list, transform = transform)
# Get the lfw_test_root data, get the lfw_file_list labels and apply the transform.

lfwloader = torch.utils.data.DataLoader(lfwdataset, batch_size = 128, num_workers = conf.num_workers)
# The batch_size standard for lfw is usually 128.

### MODEL

In [18]:
print(conf.device)
# if the output is cuda, an NVIDIA GPU is available and will be used.

cuda:0


In [9]:
# we create the model and send it to the device.
net = SEResNet_IR(conf.depth, feature_dim = conf.feature_dim, mode = conf.mode).to(conf.device)

margin = ArcMarginProduct(conf.feature_dim, class_num).to(conf.device) # ArcFace Loss Model
# The margin layer has its own learnable weights, so it is created and moved to the device like a model.


In [10]:
print(net)

SEResNet_IR(
  (input_layer): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): PReLU(num_parameters=64)
  )
  (output_layer): Sequential(
    (0): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): Dropout(p=0.4, inplace=False)
    (2): Flatten()
    (3): Linear(in_features=25088, out_features=512, bias=True)
    (4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (body): Sequential(
    (0): BottleNeck_IR_SE(
      (shortcut_layer): MaxPool2d(kernel_size=1, stride=2, padding=0, dilation=1, ceil_mode=False)
      (res_layer): Sequential(
        (0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (2): BatchNorm2d(64, eps=1e-05, moment

In [11]:
criterion = nn.CrossEntropyLoss()
# Calculates the loss between the predicted output and the true labels.

In [19]:
optimizer = optim.SGD([ # SGD (Stochastic Gradient Descent)
# SGD is an iterative optimization method that moves the parameters step by step toward a minimum of the loss function.
    {'params': net.parameters(), 'weight_decay':5e-4},
    # We give the model parameters (BatchNorm2d, Conv2d, PReLU...) to the optimizer. The optimizer updates them in each iteration.
    # weight_decay is a regularization method that, like dropout, helps prevent overfitting.

    {'params': margin.parameters(), 'weight_decay':5e-4},
    # We also give the parameters of the margin layer (ArcFace).

], lr = conf.lr, momentum = 0.9, nesterov = True)
# Plain Stochastic Gradient Descent can converge slowly. Combining it with momentum usually speeds up convergence.
# nesterov : enables Nesterov momentum, a variant that applies the gradient with a look-ahead step.

# Since these parameters are learnable, we ensure that they are updated during training.
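
To make the roles of `momentum`, `nesterov` and `weight_decay` concrete, here is a simplified sketch of one update step for a single scalar parameter. It is modelled on the update rule described in the `torch.optim.SGD` documentation (with dampening 0), but it is an illustration written for this note, not PyTorch's code; the function name and the toy objective `f(p) = p**2` are assumptions.

```python
def sgd_nesterov_step(param, grad, buf, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """One SGD step with weight decay and Nesterov momentum for a scalar parameter."""
    grad = grad + weight_decay * param   # weight decay adds an L2 penalty to the gradient
    buf = momentum * buf + grad          # momentum buffer accumulates past gradients
    update = grad + momentum * buf       # Nesterov look-ahead combines both
    return param - lr * update, buf

# minimize the toy objective f(p) = p**2, whose gradient is 2 * p
p, buf = 1.0, 0.0
for _ in range(200):
    p, buf = sgd_nesterov_step(p, 2.0 * p, buf, lr=0.1)

print(abs(p) < 1e-3)  # the parameter has moved very close to the minimum at 0
```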

In [13]:
print(optimizer)

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0005

Parameter Group 1
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0005
)


The first parameter group updates the weights of the model, while the second parameter group updates the weights of ArcFace (margin).

In [20]:
def schedule_lr():
    for params in optimizer.param_groups:
        params['lr'] /=10
    print(optimizer, flush = True)   # we print to check the optimizer each time the function is called.
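
The learning rate that results from calling `schedule_lr()` at the milestone epochs can be computed ahead of time. The helper below is a hypothetical stand-alone function using the values from the configuration (base rate 0.01, milestones 8, 10 and 12, division by 10). PyTorch also ships `torch.optim.lr_scheduler.MultiStepLR`, which expresses the same idea as a built-in scheduler.

```python
def lr_at_epoch(epoch, base_lr=0.01, milestones=(8, 10, 12), gamma=0.1):
    """Learning rate in effect during the given (1-based) epoch."""
    passed = sum(1 for m in milestones if epoch >= m)  # milestones already reached
    return base_lr * gamma ** passed

for epoch in (1, 7, 8, 10, 12, 14):
    print(epoch, lr_at_epoch(epoch))
```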

### TRAINING

In [15]:
best_acc = 0

for epoch in range(1, conf.total_epoch + 1):
    net.train()
    print('epoch {}/{}'.format(epoch, conf.total_epoch))

    # divide the learning rate by 10 at each milestone epoch (8, 10, 12)
    if epoch in conf.milestones:
        schedule_lr()

    for data in tqdm(trainloader):
        img, label = data[0].to(conf.device), data[1].to(conf.device)
        optimizer.zero_grad()

        embeddings = net(img)                  # feature vectors of size conf.feature_dim
        output = margin(embeddings, label)     # class scores with the angular margin applied
        total_loss = criterion(output, label)
        total_loss.backward()
        optimizer.step()

    # test on LFW after every epoch

    net.eval()
    lfw_acc = test(conf, net, lfwdataset, lfwloader)
    # train_loss below is the loss of the last batch of the epoch only
    print('\nLFW: {:.4f} | train_loss: {:.4f}\n'.format(lfw_acc, total_loss.item()))

    is_best = lfw_acc > best_acc
    best_acc = max(lfw_acc, best_acc)

    save_checkpoint({
        'epoch' : epoch,
        'net_state_dict' : net.state_dict(),
        'margin_state_dict' : margin.state_dict(),
        'best_acc' : best_acc
    }, is_best, checkpoint = conf.save_dir)



epoch 1/14


100%|██████████| 304/304 [04:02<00:00,  1.26it/s]



LFW: 0.7798 | train_loss: 10.4869

best model saved

epoch 2/14


100%|██████████| 304/304 [04:07<00:00,  1.23it/s]



LFW: 0.8138 | train_loss: 7.3863

best model saved

epoch 3/14


100%|██████████| 304/304 [04:08<00:00,  1.22it/s]



LFW: 0.8282 | train_loss: 5.0021

best model saved

epoch 4/14


100%|██████████| 304/304 [04:09<00:00,  1.22it/s]



LFW: 0.8452 | train_loss: 4.5683

best model saved

epoch 5/14


100%|██████████| 304/304 [04:09<00:00,  1.22it/s]



LFW: 0.8553 | train_loss: 4.1904

best model saved

epoch 6/14


100%|██████████| 304/304 [04:10<00:00,  1.22it/s]



LFW: 0.8620 | train_loss: 1.6346

best model saved

epoch 7/14


100%|██████████| 304/304 [04:08<00:00,  1.22it/s]



LFW: 0.8637 | train_loss: 2.9025

best model saved

epoch 8/14
SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.001
    maximize: False
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0005

Parameter Group 1
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.001
    maximize: False
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0005
)


100%|██████████| 304/304 [04:08<00:00,  1.22it/s]



LFW: 0.8722 | train_loss: 1.0106

best model saved

epoch 9/14


100%|██████████| 304/304 [04:08<00:00,  1.22it/s]



LFW: 0.8737 | train_loss: 0.3112

best model saved

epoch 10/14
SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.0001
    maximize: False
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0005

Parameter Group 1
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.0001
    maximize: False
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0005
)


100%|██████████| 304/304 [04:08<00:00,  1.22it/s]



LFW: 0.8765 | train_loss: 0.8469

best model saved

epoch 11/14


100%|██████████| 304/304 [04:08<00:00,  1.22it/s]



LFW: 0.8753 | train_loss: 0.2048

epoch 12/14
SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 1e-05
    maximize: False
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0005

Parameter Group 1
    dampening: 0
    differentiable: False
    foreach: None
    lr: 1e-05
    maximize: False
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0005
)


100%|██████████| 304/304 [04:08<00:00,  1.22it/s]



LFW: 0.8775 | train_loss: 0.5172

best model saved

epoch 13/14


100%|██████████| 304/304 [04:08<00:00,  1.22it/s]



LFW: 0.8763 | train_loss: 0.9107

epoch 14/14


100%|██████████| 304/304 [04:08<00:00,  1.22it/s]



LFW: 0.8770 | train_loss: 0.4790
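
The `train_loss` values in the log above are the loss of the last batch of each epoch (`total_loss` is overwritten on every iteration), which is one reason they jump around, for example from 1.6346 in epoch 6 to 2.9025 in epoch 7. A running average over the whole epoch gives a smoother signal. The class below is a minimal sketch of such a helper; the name `AverageMeter` and the toy batch losses are assumptions, not part of the project's `utils.py`.

```python
class AverageMeter:
    """Keeps a running, sample-weighted average of a value such as the batch loss."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # value is the mean loss of a batch, n the number of samples in that batch
        self.total += value * n
        self.count += n

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0

meter = AverageMeter()
for batch_loss, batch_size in [(2.0, 96), (1.0, 96), (4.0, 48)]:
    meter.update(batch_loss, batch_size)

print(meter.average)
```

In the training loop, `meter.update(total_loss.item(), img.size(0))` after each step and a fresh meter per epoch would report the epoch average instead of the last batch.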



#### SOTA : the State Of The Art
1. Train on the complete MS1M dataset
2. conf.mode = 'ir'
3. conf.depth = 100
4. conf.total_epoch = 20
5. conf.milestones = [12,16,18]

With these settings, the model reaches an accuracy of 99.83% on LFW (the highest rate in the comparison shown in the Loss Functions section).
- Training takes about 5 days with two V100 (32GB) GPUs.

NOTE : MobileFaceNet can be used to run the trained model on a device (mobile etc.).