# Image classification pressures in language emergence
This repository contains PyTorch code for the short paper "Visual concepts pressure in language emergence". 

## Installation
Open Anaconda Prompt, create a new conda environment with Python 3.7 and activate it.
```
conda create -n le-nlp2 python=3.7
conda activate le-nlp2
```
Install PyTorch and related libraries.
```
conda install pytorch=1.8.1 torchvision=0.9.1 torchaudio=0.8.1 cudatoolkit=10.2 -c pytorch
```
Install other packages.
```
pip install -r requirements.txt
```

Install EGG library from Github repository (commit `ba7ba8f`).
```
pip install git+https://github.com/facebookresearch/EGG.git@ba7ba8f
```

If you wish to run this whole notebook, make sure that it runs in the newly created enviroment.

## Code structure
We divided our research code into multiple python files for better modularization. `main.py` is the most important file which is executed to perform experiments. It consists of the following steps: <br>

1. Parsing arguments specified in command line such as the task (baseline or one of the pressures) and fixed parameters such as embedding size.

2. Loading the `Vision` module (defined in `vision.py`) which is used to extract features from images.

3. Creating the signaling game train and test datasets (`SignalGameDataset` defined in `dataset.py`)  based on Cifar100. `SignalGameDataset` reads Cifar100 images, extracts their features with the `Vision` and samples images used in the game. Sampling differs between tasks, as different labels are used.

4. If `--eval_noise` argument was added in the command line, the Gaussian noise images dataset (`GaussianNoiseDataset` defined in `dataset.py`) is created. Noise images are created with the `_init_dataset()` method and afterwards the functionality is similar to `SignalGameDataset` - the images are passed through the `Vision` module and sampled for games.

5. Initializing the Sender and the Receiver (defined in `sender.py` and `receiver.py`). When trying to reproduce results of Bouchacourt and Baroni, 2018, we tried multiple game types such as variable-length (`SenderReceiverRnnReinforce`) or single symbol (`SymbolGameReinforce`) game optimized with the Reinforce algorithm. However, in the paper results we only use a variable-length game optimized with Gumbel-softmax estimator (`SenderReceiverRnnGS`).

6. Initializing the `BestEpochCheckpointSaver` callback. This is a custom callback, a subclass of `CheckpointSaver` implemented in the EGG library. It is implemented in `best_epoch_checkpoint.py`. As oppose to `CheckpointSaver`, `BestEpochCheckpointSaver` doesn't save the model checkpoint every n epochs, but only when the validation accuracy in the current epoch is the highest accuracy obtained so far. This is especially useful later in the pipeline.

7. Setting the loss depending on the task. In this part, we were slightly constrained by the EGG library which requires passing the loss function as an argument of the `Game` object. Our obstacle was that for the pressure losses we wanted to use the `ic_loss_weight` parameter which is a weight determining how much the additional image classification loss contributes to the total loss. We wanted to the user to have the opportunity to specify this parameter from command line. Thus we created a wrapper class for the loss (`ImageClasLoss` and `TargetClasLoss` in `loss.py`) where this parameter is specified. Each wrapper implements a method `get_loss()` which calculates the pressure loss and utilizes the instance variable `ic_loss_weight`.

8. Initializing the `Game` object, adding callbacks.

9. Initializing the Trainer object and starting training.

10. If `--eval_noise` argument was added in the command line, once training on Cifar100 is finished, the trained model is loaded and evaluated on the noise dataset.


## Example usage
It is possible to run 3 kinds of tasks: the `standard` signaling game with a single loss (baseline) and two tasks with additional visual pressures. You should run all experiments from the project root directory. It seems that the `tqdm` progress bar conflicts with printing in a notebook, so the outputs below look a bit messy. For better visual experience we recommend running these commands in the command line.

### 1. Baseline
Run the following code to train and evaluate the baseline model.

In [1]:
!python main.py


Log path: 28_03_2021_09_26_27_task_standard_seed_7
Parameters specified in the command line:
Image classification task: standard
Game type:  SenderReceiverRnnGS
Image classification loss weight:  1.0
Number of images in the game:  2
Same class probability:  0.0
Evaluate on Gaussian noise images? False

Cell type of the agents: gru
Device: cuda
Files already downloaded and verified
Extract image features from train set
Extract image features from test set Cifar100
Start training
{"loss": 0.31022676825523376, "acc": 0.8313999772071838, "length": 5.730979919433594, "mode": "train", "epoch": 1}
{"loss": 0.20310336351394653, "acc": 0.9107999801635742, "length": 7.791299819946289, "mode": "test", "epoch": 1}
{"loss": 0.10048744082450867, "acc": 0.9520000219345093, "length": 7.361539840698242, "mode": "train", "epoch": 2}
{"loss": 0.08790679275989532, "acc": 0.961899995803833, "length": 8.959699630737305, "mode": "test", "epoch": 2}
{"loss": 0.06602536141872406, "acc": 0.9698200225830078, "l

### 2. MultiLabel Binary image classification pressure (MLB)
As our first pressure, the system additionally predicts for each image (distractors and target image) whether it is of the same class as the target image. The total loss is the loss for that task summed with the standard signaling game loss.

In [42]:
!python main.py --task img_clas

2021-03-28 18:35:03.666160: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
Log path: 28_03_2021_18_35_00_task_img_clas_seed_7


Extracting features in batches:   0%|          | 0/50 [00:00<?, ?it/s]Parameters specified in the command line:
Extracting features in batches:   2%|▏         | 1/50 [00:07<06:02,  7.40s/it]
Image classification task: img_clas
Extracting features in batches:   4%|▍         | 2/50 [00:07<02:33,  3.20s/it]


### 3. Multiclass image classification pressure
As the second pressure, they system additionally predicts the target class of the target image. The total loss is the loss for that task summed with the standard signaling game loss.

In [43]:
!python main.py --task target_clas


Game type:  SenderReceiverRnnGS
Extracting features in batches:   6%|▌         | 3/50 [00:07<01:28,  1.89s/it]Image classification loss weight:  1.0

Extracting features in batches:   8%|▊         | 4/50 [00:08<00:58,  1.26s/it]Number of images in the game:  2
Extracting features in batches:  10%|█         | 5/50 [00:08<00:41,  1.09it/s]
Same class probability:  0.0

Extracting features in batches:  12%|█▏        | 6/50 [00:08<00:31,  1.40it/s]Evaluate on Gaussian noise images? False

Extracting features in batches:  14%|█▍        | 7/50 [00:09<00:25,  1.72it/s]

Extracting features in batches:  16%|█▌        | 8/50 [00:09<00:20,  2.01it/s]Cell type of the agents: gru
Extracting features in batches:  18%|█▊        | 9/50 [00:09<00:17,  2.28it/s]
Device: cuda
Extracting features in batches:  20%|██        | 10/50 [00:10<00:15,  2.50it/s]
Files already downloaded and verified
Extracting features in batches:  22%|██▏       | 11/50 [00:10<00:14,  2.68it/s]

Extract image features from tra

## Reproducing paper results
In this section, we reproduce the results presented in the paper. Bear in mind that these experiments take in total 10-20 hours to run on a GPU.

### Table 1 from the paper

In [None]:
!python run_multiple.py

Running above code will output a list of log names for all runs. Now we will gather the results and create a table to summarize them. Copy this list in the cell below and run the next cells.

In [35]:
log_names = ['27_03_2021_21_13_38_task_standard_seed_7', '27_03_2021_21_18_39_task_standard_seed_122', '27_03_2021_21_22_35_task_standard_seed_809', '27_03_2021_21_26_32_task_standard_seed_7', '27_03_2021_21_37_58_task_standard_seed_122', '27_03_2021_21_46_57_task_standard_seed_809', '27_03_2021_21_58_31_task_standard_seed_7', '27_03_2021_22_10_58_task_standard_seed_122', '27_03_2021_22_23_37_task_standard_seed_809', '27_03_2021_22_35_48_task_img_clas_seed_7', '27_03_2021_22_40_54_task_img_clas_seed_122', '27_03_2021_22_52_54_task_img_clas_seed_809', '27_03_2021_22_57_58_task_img_clas_seed_7', '27_03_2021_23_12_40_task_img_clas_seed_122', '27_03_2021_23_28_27_task_img_clas_seed_809', '27_03_2021_23_43_42_task_img_clas_seed_7', '27_03_2021_23_59_23_task_img_clas_seed_122', '28_03_2021_00_13_58_task_img_clas_seed_809', '28_03_2021_00_58_18_task_target_clas_seed_7', '28_03_2021_01_04_29_task_target_clas_seed_122', '28_03_2021_01_09_23_task_target_clas_seed_809', '28_03_2021_01_23_40_task_target_clas_seed_7', '28_03_2021_01_37_44_task_target_clas_seed_122', '28_03_2021_01_52_23_task_target_clas_seed_809', '28_03_2021_03_06_28_task_target_clas_seed_7', '28_03_2021_03_20_31_task_target_clas_seed_122', '28_03_2021_03_34_47_task_target_clas_seed_809']

In [3]:
from tensorboard.backend.event_processing import event_accumulator
import os
import numpy as np
import pandas as pd
import json
import torch
pd.options.display.float_format = '{:,.3f}'.format

def load_metric_from_event(event_name): 
    event_path = os.path.join("runs", event_name)
    event_name = os.listdir(event_path)[0]
    event_path = os.path.join(event_path, event_name)

    # only read scalars
    ea = event_accumulator.EventAccumulator(event_path, size_guidance={event_accumulator.SCALARS: 0})

    # load scalars
    ea.Reload()

    return get_best_epoch_metrics(ea)

def get_best_epoch_metrics(ea):
    test_accs = ea.Scalars(f"test/acc")
    test_acc_vals = [acc.value for acc in test_accs]
    test_best_acc_idx = np.argmax(test_acc_vals)
    test_best_acc = np.max(test_acc_vals)
    test_best_epoch = test_accs[test_best_acc_idx].step

    train_accs = ea.Scalars(f"train/acc")
    train_best_acc = train_accs[test_best_acc_idx].value
    try:
        train_img_clas_accs = ea.Scalars(f"train/img_class_acc")
        train_best_img_clas_acc = train_img_clas_accs[test_best_acc_idx].value

        
        test_img_clas_accs = ea.Scalars(f"train/img_class_acc")
        test_best_img_clas_acc = test_img_clas_accs[test_best_acc_idx].value
    except KeyError:
        train_best_img_clas_acc = np.nan
        test_best_img_clas_acc = np.nan
    
    return [test_best_epoch, train_best_acc, test_best_acc, train_best_img_clas_acc, test_best_img_clas_acc]

In [4]:
# choose which params to include in the
include_params = ["task", "ic_loss_weight", "num_imgs", "same_class_prob", "seed"]

In [5]:
results = []
for event_name in log_names:
    # get parameters
    with open(f"args//args_{event_name}.json") as json_file:
        params = json.load(json_file)
    # wrap each value into a list
    params = {key:[value] for key, value in params.items()}

    # create dataframe and add parameters
    results_df = pd.DataFrame(params)[include_params]

    # add metrics
    metric_cols = ["best epoch", "best train acc", "best test acc", "best train img acc", "best test img acc"]
    results_df.loc[:, metric_cols] = load_metric_from_event(event_name)

    # add noise acc
    try:
        path = f'interactions//{event_name}//validation'
        interactions = torch.load(f"{path}//interactions_epoch11")
        results_df["noise acc"] = interactions.aux["acc"].mean().item()
    except FileNotFoundError:
        results_df["noise acc"] = 0.0

    # append to final table
    results.append(results_df)


results = pd.concat(results)

from scipy.stats import pearsonr, spearmanr
pearson = pearsonr(results["best test acc"], results["noise acc"])
spearman = spearmanr(results["best test acc"], results["noise acc"])

results.head()

Unnamed: 0,task,ic_loss_weight,num_imgs,same_class_prob,seed,best epoch,best train acc,best test acc,best train img acc,best test img acc,noise acc
0,standard,1.0,2,0.0,7,4,0.974,0.978,,,0.644
0,standard,1.0,2,0.0,122,3,0.968,0.978,,,0.766
0,standard,1.0,2,0.0,809,3,0.968,0.973,,,0.795
0,standard,1.0,2,0.5,7,10,0.933,0.954,,,0.824
0,standard,1.0,2,0.5,122,7,0.955,0.97,,,0.734


In [6]:
# results["success"] = (results["best test acc"]>0.97).astype(int)
results = results.groupby(by=["task", "same_class_prob"]).mean()
results = results.reindex(index = ['standard','img_clas','target_clas'], level="task").fillna('-')
results[["best test acc", "best test img acc"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,best test acc,best test img acc
task,same_class_prob,Unnamed: 2_level_1,Unnamed: 3_level_1
standard,0.0,0.976,-
standard,0.5,0.955,-
standard,1.0,0.945,-
img_clas,0.0,0.973,0.918
img_clas,0.5,0.923,0.856
img_clas,1.0,0.935,1.000
target_clas,0.0,0.972,0.379
target_clas,0.5,0.918,0.455
target_clas,1.0,0.905,0.414


### Gaussian noise images

In [8]:
noise_results = results.reset_index(level="same_class_prob")
noise_results = noise_results.loc[noise_results["same_class_prob"]==0.5,["best test acc", "noise acc"]]
noise_results

Unnamed: 0_level_0,best test acc,noise acc
task,Unnamed: 1_level_1,Unnamed: 2_level_1
standard,0.955,0.809
img_clas,0.923,0.746
target_clas,0.918,0.594


##### Correlation between validation accuracy and noise accuracy

In [None]:
print(f"Pearson correlation: {pearson[0]} (p={np.round(pearson[1],2)})")
print(f"Spearman correlation: {spearman.correlation} (p={np.round(spearman.pvalue, 2)})")

### Topographic Similarity

In [32]:
import os 
from egg.core.language_analysis import TopographicSimilarity

def get_best_epoch(path):
        best_epoch_score = 0.0
        best_epoch = 1
        epoch_interactions = os.listdir(path)[:-1] # without noise
        for epoch, epoch_path in enumerate(epoch_interactions):
            interactions = torch.load(f"{path}//{epoch_path}")
            current_epoch_score = interactions.aux["acc"].mean()
            if current_epoch_score > best_epoch_score:
                best_epoch_score = current_epoch_score
                best_epoch = epoch+1
        return best_epoch

def get_params_df(log_name, include_params=["task", "same_class_prob", "seed"]):
     # get parameters
    with open(f"args//args_{log_name}.json") as json_file:
        params = json.load(json_file)

    # wrap each value into a list
    params = {key:[value] for key, value in params.items()}

    # create dataframe and add parameters
    results_df = pd.DataFrame(params)[include_params]
    return results_df
    
def get_top_sim(log_name, size):
    path = f'interactions//{log_name}//validation'
    best_epoch = get_best_epoch(path)
    interactions = torch.load(f"{path}//interactions_epoch{best_epoch}")
    messages = np.argmax(interactions.message, axis=-1)
    sender_input = interactions.sender_input.squeeze(dim=1)
    random_idx = np.random.randint(low=0, high=sender_input.shape[0], size=size)
    messages = messages[random_idx]
    sender_input = sender_input[random_idx]
    top_sim = TopographicSimilarity.compute_topsim(sender_input, messages, 'cosine', 'euclidean')

    return top_sim

In [36]:
# only calculate TS for runs with same class probability = 0.5
selected_idxs = [3, 4, 5, 12, 13, 14, 21, 22, 23]
log_names = [log_names[i] for i in range(len(log_names)) if i in selected_idxs]
# log_names = ['27_03_2021_16_26_11_task_standard_seed_7', '27_03_2021_16_47_36_task_standard_seed_122']

In [37]:
from tqdm import tqdm

top_sim_results = []
size = 10000 # how many messages do you want to sample - (0, 10000]
for log_name in tqdm(log_names):
    
    top_sim_results_df = get_params_df(log_name)
    top_sim_results_df["top_sim"] = get_top_sim(log_name, size=size)

    # append to final table
    top_sim_results.append(top_sim_results_df)

top_sim = pd.concat(top_sim_results).groupby(by="task").mean()

100%|██████████| 9/9 [5:26:08<00:00, 2174.22s/it]


In [41]:
top_sim

Unnamed: 0_level_0,same_class_prob,seed,top_sim
task,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
img_clas,0.5,312.667,0.049
standard,0.5,312.667,0.119
target_clas,0.5,312.667,0.176
