# SimCLR
The goal of this report is to show our effort in reproducing the paper "A Simple Framework for Contrastive Learning of Visual Representations" ([link to paper](https://arxiv.org/pdf/2002.05709.pdf)). We reimplement SimCLR in PyTorch on the basis of the official TensorFlow version. As required by the course, we reproduce the results of Table 8 on the CIFAR10 dataset and obtain a clear visualization of the trained image vectors. We also extend our work to a new dataset, RPLAN, and obtain good visualization results there as well. In general, our work can be divided into the following parts:

- Reimplement the paper in PyTorch in a Jupyter notebook.

- Reproduce the results of Table 8 in the paper using different training strategies, namely finetuning and linear evaluation, with the pretrained ResNet(1X) and ResNet(4X) models.

- Extend SimCLR to the RPLAN dataset. This work includes applying transforms to the RPLAN images (so that they fit the model) and pretraining on these images.

- Visualize the trained image vectors on the CIFAR10 and RPLAN datasets using PCA and SNE, and analyze the pretraining performance of the model.




## First, we need to set up the repository for the following work. For our reproduction, we adapted the code in this [repository](https://github.com/Spijkervet/SimCLR) and made a number of adjustments to it. We also need to install the required packages so that our code runs correctly.

In [None]:
!git clone https://github.com/harrychen23235/SimCLR.git
%cd SimCLR
!mkdir -p logs && cd logs && cd ../
!sh setup.sh || python3 -m pip install -r requirements.txt || exit 1
!pip install  pyyaml --upgrade

# Part 2:
#### This part focuses on reproducing the Table 8 results on CIFAR10. We use the official pretrained checkpoint to initialize the ResNet model before finetuning. To fit the model to the downstream classification task, we add a logistic regression layer at the end of the ResNet model. We train the model with two different strategies: finetuning and linear evaluation. Finetuning works like a normal training process: the gradient flows through the whole model and all parameters are updated after each backward pass. For linear evaluation, the parameters of the ResNet model are frozen during training, and only the parameters of the logistic regression layer are updated. The two strategies share almost the same code; the main difference lies in the training and testing functions. Two versions of the ResNet model are used, ResNet50(1X) and ResNet50(4X). We train the model for 500 epochs and compare the test accuracy with the results in the paper. The results are shown in a later part of the report.
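
The difference between the two strategies can be illustrated with a minimal sketch. It uses small stand-in modules (`toy_encoder` and `toy_head` are hypothetical names, not the ResNet or the repository's classes) and freezes the encoder through `requires_grad_(False)`, which is one common way to do it; the notebook itself achieves the same effect by training the logistic regression layer on precomputed features.

```python
import torch
from torch import nn


def run_one_step(freeze_encoder: bool):
    """Run one SGD step and report whether each module's weights changed."""
    torch.manual_seed(0)
    toy_encoder = nn.Linear(8, 4)   # stand-in for the pretrained ResNet
    toy_head = nn.Linear(4, 3)      # stand-in for the logistic regression layer

    if freeze_encoder:
        # Linear evaluation: the encoder receives no gradient updates.
        toy_encoder.requires_grad_(False)

    all_params = list(toy_encoder.parameters()) + list(toy_head.parameters())
    trainable = [p for p in all_params if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.1)

    encoder_before = toy_encoder.weight.detach().clone()
    head_before = toy_head.weight.detach().clone()

    x = torch.randn(16, 8)
    y = torch.randint(0, 3, (16,))
    loss = nn.functional.cross_entropy(toy_head(toy_encoder(x)), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    encoder_changed = not torch.equal(encoder_before, toy_encoder.weight)
    head_changed = not torch.equal(head_before, toy_head.weight)
    return encoder_changed, head_changed


print("linear evaluation (encoder changed, head changed):", run_one_step(True))
print("finetuning (encoder changed, head changed):", run_one_step(False))
```

In the frozen case only the head's weights move after the step, whereas in the finetuning case both modules are updated.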


#### Import PyTorch and the core SimCLR modules.

In [None]:
import os
import torch
import numpy as np
import torchvision
import argparse

from torch.utils.tensorboard import SummaryWriter

apex = False
try:
    from apex import amp
    apex = True
except ImportError:
    print(
        "Install the apex package from https://www.github.com/nvidia/apex to use fp16 for training"
    )

from model import save_model, load_optimizer
from simclr import SimCLR
from simclr.modules import get_resnet, NT_Xent
from simclr.modules.transformations import TransformsSimCLR

#### Import the packages from our repository and PyTorch.

In [None]:
import torch
import torchvision
import numpy as np
import argparse
from simclr.modules import LogisticRegression
from simclr.modules.resnet_wider import resnet50x4,resnet50x1

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### This is the training function used in the linear evaluation scenario. Only the logistic regression layer participates in training.

In [None]:
def train_linear_evaluation(args, loader, simclr_model, model, criterion, optimizer):
    loss_epoch = 0
    accuracy_epoch = 0
    for step, (x, y) in enumerate(loader):
        optimizer.zero_grad()

        x = x.to(args.device)
        y = y.to(args.device)

        output = model(x)
        loss = criterion(output, y)

        predicted = output.argmax(1)
        acc = (predicted == y).sum().item() / y.size(0)
        accuracy_epoch += acc

        loss.backward()
        optimizer.step()

        loss_epoch += loss.item()
        # if step % 100 == 0:
        #     print(
        #         f"Step [{step}/{len(loader)}]\t Loss: {loss.item()}\t Accuracy: {acc}"
        #     )

    return loss_epoch, accuracy_epoch

#### This is the training function used in the finetuning scenario. Both the logistic regression layer and the ResNet model participate in training.

In [None]:
def train_finetune(args, loader, simclr_model, model, criterion,optimizer_simclr, optimizer_model):
    loss_epoch = 0
    accuracy_epoch = 0
    for step, (x, y) in enumerate(loader):
        optimizer_simclr.zero_grad()
        optimizer_model.zero_grad()

        x = x.to(args.device)
        y = y.to(args.device)

        output_first = simclr_model(x)
        output = model(output_first)
        loss = criterion(output, y)

        predicted = output.argmax(1)
        acc = (predicted == y).sum().item() / y.size(0)
        accuracy_epoch += acc

        loss.backward()

        optimizer_simclr.step()
        optimizer_model.step()
        loss_epoch += loss.item()
        if step % 10 == 0:
             print(
                 f"Step [{step}/{len(loader)}]\t Loss: {loss.item()}\t Accuracy: {acc}"
             )

    return loss_epoch, accuracy_epoch

#### Test function used in the linear evaluation scenario. It follows the same procedure as the training function, without the parameter update.

In [None]:
@torch.no_grad()  # no gradients are needed during evaluation
def test_linear_evaluation(args, loader, simclr_model, model, criterion, optimizer):
    loss_epoch = 0
    accuracy_epoch = 0
    model.eval()
    for step, (x, y) in enumerate(loader):
        model.zero_grad()

        x = x.to(args.device)
        y = y.to(args.device)

        output = model(x)
        loss = criterion(output, y)

        predicted = output.argmax(1)
        acc = (predicted == y).sum().item() / y.size(0)
        accuracy_epoch += acc

        loss_epoch += loss.item()

    return loss_epoch, accuracy_epoch



#### Test function used in the finetuning scenario. It follows the same procedure as the training function, without the parameter update.

In [None]:
@torch.no_grad()  # no gradients are needed during evaluation, which also saves memory
def test_finetune(args, loader, simclr_model, model, criterion):
    loss_epoch = 0
    accuracy_epoch = 0
    model.eval()
    simclr_model.eval()
    for step, (x, y) in enumerate(loader):
        model.zero_grad()
        simclr_model.zero_grad()
        x = x.to(args.device)
        y = y.to(args.device)

        output_first = simclr_model(x)
        output = model(output_first)

        loss = criterion(output, y)

        predicted = output.argmax(1)
        acc = (predicted == y).sum().item() / y.size(0)
        accuracy_epoch += acc

        loss_epoch += loss.item()

    return loss_epoch, accuracy_epoch

In [None]:
from pprint import pprint
from utils import yaml_config_hook

parser = argparse.ArgumentParser(description="SimCLR")
config = yaml_config_hook("./config/config.yaml")
for k, v in config.items():
    parser.add_argument(f"--{k}", default=v, type=type(v))

args = parser.parse_args([])
args.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
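
The pattern in the cell above turns every entry of the YAML config into a command-line option whose default is the config value. The following sketch reproduces it with a plain dictionary standing in for the output of `yaml_config_hook` (the keys and values here are made up for illustration and are not the contents of `config.yaml`).

```python
import argparse

# Hypothetical config values; the real ones come from ./config/config.yaml.
example_config = {"batch_size": 128, "dataset": "CIFAR10", "learning_rate": 0.001}

parser = argparse.ArgumentParser(description="config-to-argparse sketch")
for key, value in example_config.items():
    # Each key becomes --key, typed like its default value.
    parser.add_argument(f"--{key}", default=value, type=type(value))

# An empty list keeps all defaults, as parse_args([]) does in a notebook.
defaults = parser.parse_args([])
# Passing strings overrides a default and converts it with the registered type.
overridden = parser.parse_args(["--batch_size", "32"])

print(defaults.batch_size, overridden.batch_size, overridden.dataset)
```

One caveat of `type=type(v)`: for boolean defaults, argparse calls `bool()` on the given string, so any non-empty string, including `"False"`, is parsed as `True`.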

#### We set the batch size to 64 for the ResNet1X model and 32 for the ResNet4X model. These are the largest sizes we can use; larger values cause out-of-memory (OOM) errors. We run our code on the CIFAR10 dataset for the reproduction, but it can also be applied to other image classification datasets, including STL10.

In [None]:
args.batch_size = 32
args.dataset = "CIFAR10" # You can also set it as "STL10"
args.epoch_num = 100
args.logistic_epochs = 500
args.logistic_batch_size = 32

#### In this part we load the dataset into train/test dataloaders. Unlike in pretraining, we do not apply any data augmentation to the training set; both splits use the same test-time transform.

In [None]:
if args.dataset == "STL10":
    train_dataset = torchvision.datasets.STL10(
        args.dataset_dir,
        split="train",
        download=True,
        transform=TransformsSimCLR(size=args.image_size).test_transform,
    )
    test_dataset = torchvision.datasets.STL10(
        args.dataset_dir,
        split="test",
        download=True,
        transform=TransformsSimCLR(size=args.image_size).test_transform,
    )
elif args.dataset == "CIFAR10":
    train_dataset = torchvision.datasets.CIFAR10(
        args.dataset_dir,
        train=True,
        download=True,
        transform=TransformsSimCLR(size=args.image_size).test_transform,
    )
    test_dataset = torchvision.datasets.CIFAR10(
        args.dataset_dir,
        train=False,
        download=True,
        transform=TransformsSimCLR(size=args.image_size).test_transform,
    )
else:
    raise NotImplementedError

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.logistic_batch_size,
    shuffle=True,
    drop_last=True,
    num_workers=args.workers,
)

test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=args.logistic_batch_size,
    shuffle=False,
    drop_last=True,
    num_workers=args.workers,
)
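
Because both loaders set `drop_last=True`, the last incomplete batch of each split is discarded. A short calculation, assuming the standard CIFAR10 split of 50,000 training and 10,000 test images and the batch size of 32 set above, shows how many batches and images are actually used:

```python
def batches_with_drop_last(num_samples: int, batch_size: int):
    """Return (number of full batches, number of dropped samples)."""
    full_batches = num_samples // batch_size
    dropped = num_samples - full_batches * batch_size
    return full_batches, dropped


for split, size in [("train", 50_000), ("test", 10_000)]:
    n_batches, n_dropped = batches_with_drop_last(size, batch_size=32)
    print(f"{split}: {n_batches} batches, {n_dropped} images dropped")
```

Since every remaining batch has exactly 32 images, averaging the per-batch accuracies, as the training and test loops in this notebook do, equals the accuracy over the images that were kept. Note that under these assumptions a few test images are never evaluated.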

#### Load the ResNet encoder and its weights. For the pretrained checkpoint, we use [this repo](https://github.com/tonylins/simclr-converter) to convert the official checkpoint to a PyTorch checkpoint. As shown in the code, two versions of ResNet are used; switch between them by commenting and uncommenting the corresponding lines.

In [None]:
#encoder = resnet50x4()
encoder = resnet50x1() 
n_features = encoder.fc.out_features 
simclr_model = encoder
#encoder.load_state_dict(torch.load("/content/drive/MyDrive/simclr-converter/resnet50-4x.pth", map_location=args.device.type)['state_dict'])
encoder.load_state_dict(torch.load("/content/drive/MyDrive/simclr-converter/resnet50-1x.pth", map_location=args.device.type)['state_dict'])
simclr_model = simclr_model.to(args.device)
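
The loading step assumes that the converted checkpoint is a dictionary whose `'state_dict'` entry holds the weights. A minimal sketch of that round trip, with a tiny stand-in model and an in-memory buffer instead of the Google Drive path (both are illustrative assumptions), looks like this:

```python
import io

import torch
from torch import nn

torch.manual_seed(0)
source_model = nn.Linear(4, 2)   # stand-in for the converted ResNet
target_model = nn.Linear(4, 2)   # freshly initialised model to be loaded

# Save in the same layout as the converted checkpoint: {"state_dict": ...}.
buffer = io.BytesIO()
torch.save({"state_dict": source_model.state_dict()}, buffer)
buffer.seek(0)

# map_location="cpu" places the tensors on the CPU wherever they were saved.
checkpoint = torch.load(buffer, map_location="cpu")
result = target_model.load_state_dict(checkpoint["state_dict"])

weights_match = torch.equal(source_model.weight, target_model.weight)
print(result, weights_match)
```

By default `load_state_dict` is strict, so a missing or unexpected key raises an error instead of being silently skipped.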

#### We add a logistic regression layer at the end of the ResNet model to fit it to the classification task.

In [None]:
n_classes = 10 
model = LogisticRegression(n_features, n_classes)
model = model.to(args.device)
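
For reference, a logistic regression layer of this kind is typically a single linear layer whose outputs are passed to the cross-entropy loss. The class below is a sketch under that assumption (`LinearHead` is a hypothetical name; the repository's `LogisticRegression` may differ in detail), and the feature size of 2048 is only an example value.

```python
import torch
from torch import nn


class LinearHead(nn.Module):
    """Hypothetical stand-in for a logistic regression layer."""

    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.linear = nn.Linear(n_features, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Return raw logits; CrossEntropyLoss applies log-softmax itself.
        return self.linear(x)


head = LinearHead(n_features=2048, n_classes=10)
features = torch.randn(4, 2048)   # a batch of 4 example feature vectors
logits = head(features)
n_params = sum(p.numel() for p in head.parameters())
print(logits.shape, n_params)
```

Such a head has only `n_features * n_classes + n_classes` parameters, which is why linear evaluation is far cheaper than finetuning the whole ResNet.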

#### We use cross entropy as the criterion for the image classification loss. For the optimizer, we use SGD, which is also used in the paper. We also tried Adam and found almost no performance difference compared with SGD.

In [None]:
#optimizer_model = torch.optim.Adam(model.parameters(), lr=3e-4)
optimizer_model = torch.optim.SGD(model.parameters(), lr=0.0001, weight_decay=1e-6,momentum = 0.9)
optimizer_simclr = torch.optim.SGD(simclr_model.parameters(), lr=0.0001, weight_decay=1e-6,momentum = 0.9)
criterion = torch.nn.CrossEntropyLoss()
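
As a reminder of what the criterion computes, the cross-entropy loss for one sample is the negative log of the softmax probability assigned to the true class. The plain-Python sketch below (the helper name `cross_entropy` is ours, not a PyTorch function) evaluates it for a single sample; with 10 classes and identical logits, i.e. a classifier guessing uniformly, the loss is $\ln 10 \approx 2.303$.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross entropy for one sample, computed stably."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


# Uniform guess over 10 classes versus a confident, correct prediction.
uniform_loss = cross_entropy([0.0] * 10, target=3)
confident_loss = cross_entropy([0.0] * 9 + [10.0], target=9)
print(uniform_loss, confident_loss)
```

A loss near $\ln 10$ at the start of training therefore indicates a classifier that has not yet learned anything on a 10-class dataset such as CIFAR10, while a confident correct prediction drives the loss toward 0.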

#### Helper functions to map all input data $X$ to their latent representations $h$ that are used in linear evaluation (they only have to be computed once).

In [None]:
def inference(loader, simclr_model, device):
    feature_vector = []
    labels_vector = []
    # Extract features in eval mode so BatchNorm uses its running statistics;
    # the previous mode is restored after the loop.
    was_training = simclr_model.training
    simclr_model.eval()
    for step, (x, y) in enumerate(loader):
        x = x.to(device)

        # get encoding (no gradients are needed for a frozen encoder)
        with torch.no_grad():
            h = simclr_model(x)

        feature_vector.extend(h.cpu().numpy())
        labels_vector.extend(y.numpy())

        if step % 20 == 0:
            print(f"Step [{step}/{len(loader)}]\t Computing features...")

    simclr_model.train(was_training)
    feature_vector = np.array(feature_vector)
    labels_vector = np.array(labels_vector)
    print("Features shape {}".format(feature_vector.shape))
    return feature_vector, labels_vector


def get_features(context_model, train_loader, test_loader, device):
    train_X, train_y = inference(train_loader, context_model, device)
    test_X, test_y = inference(test_loader, context_model, device)
    return train_X, train_y, test_X, test_y


def create_data_loaders_from_arrays(X_train, y_train, X_test, y_test, batch_size):
    train = torch.utils.data.TensorDataset(
        torch.from_numpy(X_train), torch.from_numpy(y_train)
    )
    train_loader = torch.utils.data.DataLoader(
        train, batch_size=batch_size, shuffle=False
    )

    test = torch.utils.data.TensorDataset(
        torch.from_numpy(X_test), torch.from_numpy(y_test)
    )
    test_loader = torch.utils.data.DataLoader(
        test, batch_size=batch_size, shuffle=False
    )
    return train_loader, test_loader

In [None]:


print("### Creating features from pre-trained context model ###")
(train_X, train_y, test_X, test_y) = get_features(
    encoder, train_loader, test_loader, args.device
)

arr_train_loader, arr_test_loader = create_data_loaders_from_arrays(
    train_X, train_y, test_X, test_y, args.logistic_batch_size
)

## Training & testing procedure in the linear evaluation scenario

In [None]:
for epoch in range(args.logistic_epochs):
    loss_epoch, accuracy_epoch = train_linear_evaluation(args, arr_train_loader, simclr_model, model, criterion, optimizer_model)
    
    if epoch % 10 == 0:
        # Average over the batches of the loader that was actually iterated.
        print(f"Epoch [{epoch}/{args.logistic_epochs}]\t Loss: {loss_epoch / len(arr_train_loader)}\t Accuracy: {accuracy_epoch / len(arr_train_loader)}")


# final testing
loss_epoch, accuracy_epoch = test_linear_evaluation(
    args, arr_test_loader, simclr_model, model, criterion, optimizer_model
)
print(
    f"[FINAL]\t Loss: {loss_epoch / len(arr_test_loader)}\t Accuracy: {accuracy_epoch / len(arr_test_loader)}"
)

## Training & testing procedure in the finetuning scenario

In [None]:
for epoch in range(args.logistic_epochs):
    loss_epoch, accuracy_epoch = train_finetune(args, train_loader, simclr_model, model, criterion, optimizer_simclr, optimizer_model)
    print(accuracy_epoch)
    if epoch % 10 == 0:
      print(f"Epoch [{epoch}/{args.logistic_epochs}]\t Loss: {loss_epoch / len(train_loader)}\t Accuracy: {accuracy_epoch / len(train_loader)}")


loss_epoch, accuracy_epoch = test_finetune(
    args, test_loader, simclr_model, model, criterion
)
print(
    f"[FINAL]\t Loss: {loss_epoch / len(test_loader)}\t Accuracy: {accuracy_epoch / len(test_loader)}"
)

# Analysis on loss curve

### Linear Evaluation

#### We use Weights & Biases to record the loss curve of the training process, as shown in the following graphs. During linear evaluation, the model converges quickly and the loss almost reaches 0 in around 200 epochs. We also compare the convergence speed of ResNet1X and ResNet4X. The result shows that a larger pretrained model does not make the logistic regression layer converge faster.
<img src="https://s1.ax1x.com/2022/04/08/LpA8N6.png" alt="i1" style="zoom:100%;" /><img src="https://cdn.discordapp.com/attachments/884910103428476989/961713567076352130/WB_Chart_4_7_2022_9_41_15_PM.png" alt="i2" style="zoom:20%;" /><img src="https://cdn.discordapp.com/attachments/884910103428476989/961717185229779024/WB_Chart_4_7_2022_10_00_22_PM.png" alt="i2" style="zoom:20%;" /><img src="https://cdn.discordapp.com/attachments/884910103428476989/961717185468850206/WB_Chart_4_7_2022_10_00_28_PM.png" alt="i2" style="zoom:0%;" />


### Finetuning

We also plot the learning curve of the finetuning phase and compare it with the learning curve of the same model trained from scratch.

## Result comparison

In this section, we compare the performance of the different training strategies with each other and with the results reported in the paper. All results are shown in the following table. We could not run ResNet4X finetuning because of a lack of computational resources. Our ResNet1X finetuning reaches almost the same performance as the original paper. However, even after trying different training settings, our linear evaluation does not reach the performance reported in the paper. After reading the paper and comparing our code with the official code, we think this might result from the small batch size we use. The authors show that a larger batch size, over 512, can significantly increase performance, but because of our limited memory this is not feasible for us.

| Training setup             | Note                          | Accuracy |
| -------------------------- | ----------------------------- | -------- |
| ResNet1X finetune          | loading pretrained checkpoint | 0.955    |
| ResNet1X finetune          | learn from scratch            | 0.823    |
| ResNet1X finetune          | in the original paper         | 0.977    |
| ResNet1X linear evaluation | our implementation            | 0.852    |
| ResNet1X linear evaluation | in the original paper         | 0.906    |
| ResNet4X linear evaluation | our implementation            | 0.897    |
| ResNet4X linear evaluation | in the original paper         | 0.953    |
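
The gaps discussed in this section can be read off the table directly. The snippet below simply recomputes them from the accuracies listed in the table; the dictionary keys are our own labels.

```python
# Accuracies copied from the table above.
ours = {
    "ResNet1X finetune (pretrained)": 0.955,
    "ResNet1X linear evaluation": 0.852,
    "ResNet4X linear evaluation": 0.897,
}
paper = {
    "ResNet1X finetune (pretrained)": 0.977,
    "ResNet1X linear evaluation": 0.906,
    "ResNet4X linear evaluation": 0.953,
}
scratch_accuracy = 0.823  # ResNet1X finetune, learned from scratch

# Difference between the paper's accuracy and ours for each setup.
gaps = {name: round(paper[name] - ours[name], 3) for name in ours}
pretraining_gain = round(ours["ResNet1X finetune (pretrained)"] - scratch_accuracy, 3)

for name, gap in gaps.items():
    print(f"{name}: {gap:.3f} below the paper")
print(f"Gain from the pretrained checkpoint: {pretraining_gain:.3f}")
```

The finetuning gap is about 2 percentage points, while both linear evaluation gaps are above 5 points, and starting from the pretrained checkpoint improves finetuning accuracy by about 13 points over training from scratch.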