# Template notebook

A notebook template with data loading and evaluation already set up.

## Steps to do before first run

1. Download the [dataset](https://raganato.github.io/vwsd/)
2. Unpack the downloaded archive into the project folder so that the archive contents are available at `data/train_v1`
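
Before the first run, it can help to confirm that the archive was unpacked to the expected location. The sketch below is a hypothetical helper (`missing_dataset_paths` is not part of the repo); it assumes the file naming used in the Config section of this notebook and uses only the standard library.

```python
from pathlib import Path


def missing_dataset_paths(root, part="train", version="v1"):
    """Return the expected dataset paths that do not exist under `root`.

    The file names mirror the Config section of this notebook; the helper
    itself is illustrative and not part of the repo.
    """
    base = Path(root) / f"{part}_{version}"
    expected = [
        base / f"{part}.data.{version}.txt",   # words, contexts, candidate images
        base / f"{part}.gold.{version}.txt",   # gold image per row
        base / f"{part}_images_{version}",     # image folder
    ]
    return [path for path in expected if not path.exists()]
```

Calling `missing_dataset_paths("data")` from the project root should return an empty list when the layout is complete.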

In [89]:
from pathlib import Path
import logging
import json

import pandas as pd
import numpy as np
import torch
from PIL import Image, ImageFile

from src.data import CustomSplitLoader
from src.utils import evaluate

## Config

Path resolution:

In [2]:
DATASET_VERSION = "v1"
PART = "train"
PATH = Path("data").resolve() / f"{PART}_{DATASET_VERSION}"
DATA_PATH = PATH / f"{PART}.data.{DATASET_VERSION}.txt"
LABELS_PATH = PATH / f"{PART}.gold.{DATASET_VERSION}.txt"
IMAGES_PATH = PATH / f"{PART}_images_{DATASET_VERSION}"
TRAIN_SPLIT_PATH = PATH / "split_train.txt"
VALIDATION_SPLIT_PATH = PATH / "split_valid.txt"
TEST_SPLIT_PATH = PATH / "split_test.txt"

Environment settings:

In [3]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# without these settings, some training images may fail to load or trigger warnings
Image.MAX_IMAGE_PIXELS = None
ImageFile.LOAD_TRUNCATED_IMAGES = True

In [4]:
RANDOM_STATE = 42
torch.manual_seed(RANDOM_STATE)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on {DEVICE}")

Running on cuda


## Loading data

This loader creates custom splits. **Note that for a fair comparison of your solution you should use the predefined splits from the repo.**

In [5]:
# loads pandas dataframes of corresponding sizes
# rows are shuffled; by default splits with different words in each subset
# original index is preserved in dataframes

# split_loader = CustomSplitLoader(
#     split_parts={
#         "train": 0.7,
#         "validation": 0.1,
#         "test": 0.2,
#     },
#     data_path=DATA_PATH,
#     labels_path=LABELS_PATH,
#     random_state=RANDOM_STATE
# )
# splits = split_loader.get_splits() 
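
The comment above says that custom splits keep different words in each subset. A minimal sketch of that idea follows; it is not the implementation of `CustomSplitLoader`, and `word_disjoint_split` is a hypothetical name. It assumes the fractions apply to unique words rather than to rows and that they sum to 1.

```python
import random
from collections import defaultdict


def word_disjoint_split(words, split_parts, random_state=42):
    """Assign row positions to subsets so that no word appears in two subsets.

    words: sequence of target words, one per row.
    split_parts: mapping of subset name to fraction of unique words.
    """
    rows_by_word = defaultdict(list)
    for position, word in enumerate(words):
        rows_by_word[word].append(position)

    # Shuffle unique words (not rows) so every row of a word stays together.
    unique_words = sorted(rows_by_word)
    random.Random(random_state).shuffle(unique_words)

    splits, start = {}, 0
    names = list(split_parts)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last subset takes the remainder so every word is assigned.
            end = len(unique_words)
        else:
            end = start + round(split_parts[name] * len(unique_words))
        chosen = unique_words[start:end]
        splits[name] = sorted(p for w in chosen for p in rows_by_word[w])
        start = end
    return splits
```

Splitting on unique words prevents a model from being evaluated on a target word it has already seen during training.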

In most cases, the recommended approach is to use the predefined splits, which should be frozen in our repo. If the split files are missing, regenerate them with the `generate_dev_test_split.py` script.

In [9]:
df = pd.read_csv(DATA_PATH, sep='\t', header=None)
df.columns = ["word", "context"] + [f"image{i}" for i in range(10)]
# the gold file has a single column; select it explicitly as a Series
df["label"] = pd.read_csv(LABELS_PATH, sep='\t', header=None)[0]

train_df = df.loc[pd.read_csv(TRAIN_SPLIT_PATH, sep='\t', header=None).T.values[0]]
validation_df = df.loc[pd.read_csv(VALIDATION_SPLIT_PATH, sep='\t', header=None).T.values[0]]
test_df = df.loc[pd.read_csv(TEST_SPLIT_PATH, sep='\t', header=None).T.values[0]]
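
After loading the predefined splits, a quick sanity check is that no row index appears in more than one subset. The helper below is a hypothetical sketch (`split_overlaps` is not part of the repo) that works on any collections of indices.

```python
from itertools import combinations


def split_overlaps(splits):
    """Return {(name_a, name_b): shared indices} for every overlapping pair.

    splits: mapping of subset name to an iterable of row indices.
    An empty result means the subsets are pairwise disjoint.
    """
    overlaps = {}
    for (name_a, idx_a), (name_b, idx_b) in combinations(splits.items(), 2):
        shared = set(idx_a) & set(idx_b)
        if shared:
            overlaps[(name_a, name_b)] = sorted(shared)
    return overlaps
```

With the dataframes above, this could be called as `split_overlaps({"train": train_df.index, "validation": validation_df.index, "test": test_df.index})`.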

In [10]:
train_df

index,word,context,image0,image1,image2,image3,image4,image5,image6,image7,image8,image9,label
0,moorhen,moorhen swamphen,image.3.jpg,image.8.jpg,image.4.jpg,image.1.jpg,image.2.jpg,image.0.jpg,image.5.jpg,image.6.jpg,image.7.jpg,image.9.jpg,image.0.jpg
1,serinus,serinus genus,image.3.jpg,image.23.jpg,image.4.jpg,image.1.jpg,image.2.jpg,image.20.jpg,image.5.jpg,image.24.jpg,image.22.jpg,image.21.jpg,image.20.jpg
2,pegmatite,pegmatite igneous,image.41.jpg,image.39.jpg,image.42.jpg,image.43.jpg,image.40.jpg,image.44.jpg,image.37.jpg,image.38.jpg,image.36.jpg,image.35.jpg,image.35.jpg
4,bonxie,bonxie skua,image.3.jpg,image.77.jpg,image.78.jpg,image.4.jpg,image.1.jpg,image.2.jpg,image.5.jpg,image.79.jpg,image.76.jpg,image.75.jpg,image.75.jpg
5,ixia,ixia genus,image.90.jpg,image.3.jpg,image.91.jpg,image.4.jpg,image.92.jpg,image.1.jpg,image.2.jpg,image.94.jpg,image.93.jpg,image.5.jpg,image.90.jpg
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12861,ducking,ducking hunting,image.964.jpg,image.6176.jpg,image.6742.jpg,image.12919.jpg,image.9996.jpg,image.966.jpg,image.967.jpg,image.12662.jpg,image.4312.jpg,image.965.jpg,image.12919.jpg
12862,tarnish,tarnish discoloration,image.7862.jpg,image.11086.jpg,image.11714.jpg,image.5269.jpg,image.2789.jpg,image.11230.jpg,image.3341.jpg,image.224.jpg,image.222.jpg,image.220.jpg,image.11714.jpg
12865,tragopogon,tragopogon genus,image.3.jpg,image.6250.jpg,image.15001.jpg,image.4.jpg,image.1.jpg,image.2.jpg,image.12074.jpg,image.5.jpg,image.4087.jpg,image.12806.jpg,image.12074.jpg
12866,illustrator,illustrator artist,image.10633.jpg,image.723.jpg,image.13372.jpg,image.881.jpg,image.12635.jpg,image.726.jpg,image.5985.jpg,image.722.jpg,image.724.jpg,image.725.jpg,image.10633.jpg


In [11]:
validation_df

index,word,context,image0,image1,image2,image3,image4,image5,image6,image7,image8,image9,label
18,maja,maja genus,image.3.jpg,image.310.jpg,image.4.jpg,image.309.jpg,image.1.jpg,image.2.jpg,image.312.jpg,image.5.jpg,image.311.jpg,image.313.jpg,image.309.jpg
24,entoloma,entoloma genus,image.3.jpg,image.405.jpg,image.404.jpg,image.4.jpg,image.1.jpg,image.2.jpg,image.5.jpg,image.406.jpg,image.407.jpg,image.408.jpg,image.404.jpg
25,foulard,foulard fabric,image.340.jpg,image.418.jpg,image.423.jpg,image.343.jpg,image.421.jpg,image.344.jpg,image.422.jpg,image.342.jpg,image.208.jpg,image.420.jpg,image.418.jpg
27,biryani,biryani dish,image.454.jpg,image.436.jpg,image.455.jpg,image.453.jpg,image.451.jpg,image.456.jpg,image.437.jpg,image.457.jpg,image.434.jpg,image.433.jpg,image.451.jpg
28,sobriquet,sobriquet appellation,image.466.jpg,image.478.jpg,image.477.jpg,image.475.jpg,image.476.jpg,image.471.jpg,image.474.jpg,image.473.jpg,image.472.jpg,image.329.jpg,image.466.jpg
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12851,marattia,marattia genus,image.3.jpg,image.11008.jpg,image.6217.jpg,image.4.jpg,image.1.jpg,image.2.jpg,image.11414.jpg,image.5.jpg,image.9223.jpg,image.9977.jpg,image.6217.jpg
12852,tragulus,tragulus genus,image.3.jpg,image.10621.jpg,image.8594.jpg,image.4.jpg,image.1.jpg,image.2.jpg,image.5.jpg,image.4061.jpg,image.14197.jpg,image.12606.jpg,image.4061.jpg
12853,barbwire,barbwire wire,image.58.jpg,image.59.jpg,image.57.jpg,image.5849.jpg,image.56.jpg,image.4112.jpg,image.2784.jpg,image.151.jpg,image.60.jpg,image.6659.jpg,image.4112.jpg
12855,sample,sample distribution,image.65.jpg,image.13362.jpg,image.3623.jpg,image.6254.jpg,image.12852.jpg,image.290.jpg,image.10966.jpg,image.3473.jpg,image.3474.jpg,image.3472.jpg,image.10966.jpg


In [12]:
test_df

index,word,context,image0,image1,image2,image3,image4,image5,image6,image7,image8,image9,label
3,bangalores,bangalores torpedo,image.58.jpg,image.59.jpg,image.64.jpg,image.57.jpg,image.55.jpg,image.56.jpg,image.62.jpg,image.63.jpg,image.61.jpg,image.60.jpg,image.55.jpg
6,leucaena,leucaena genus,image.105.jpg,image.3.jpg,image.106.jpg,image.109.jpg,image.4.jpg,image.1.jpg,image.2.jpg,image.108.jpg,image.5.jpg,image.107.jpg,image.105.jpg
7,mahonia,mahonia genus,image.3.jpg,image.124.jpg,image.122.jpg,image.4.jpg,image.120.jpg,image.123.jpg,image.1.jpg,image.2.jpg,image.121.jpg,image.5.jpg,image.120.jpg
10,gangster,gangster outlaw,image.166.jpg,image.173.jpg,image.172.jpg,image.165.jpg,image.174.jpg,image.170.jpg,image.171.jpg,image.167.jpg,image.168.jpg,image.169.jpg,image.165.jpg
12,brevicipitidae,brevicipitidae family,image.3.jpg,image.207.jpg,image.206.jpg,image.4.jpg,image.1.jpg,image.2.jpg,image.5.jpg,image.205.jpg,image.208.jpg,image.209.jpg,image.205.jpg
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12854,make,make persuade,image.442.jpg,image.9126.jpg,image.7574.jpg,image.7582.jpg,image.5015.jpg,image.5704.jpg,image.4933.jpg,image.1022.jpg,image.2288.jpg,image.1208.jpg,image.9126.jpg
12856,gunboat,gunboat boat,image.615.jpg,image.364.jpg,image.58.jpg,image.59.jpg,image.122.jpg,image.57.jpg,image.4149.jpg,image.56.jpg,image.680.jpg,image.60.jpg,image.615.jpg
12857,francisella,francisella bacteria,image.3.jpg,image.4.jpg,image.1147.jpg,image.2798.jpg,image.1.jpg,image.2.jpg,image.14303.jpg,image.5.jpg,image.8422.jpg,image.4973.jpg,image.1147.jpg
12863,lookout,lookout watcher,image.5338.jpg,image.11952.jpg,image.58.jpg,image.59.jpg,image.57.jpg,image.56.jpg,image.10445.jpg,image.15132.jpg,image.4060.jpg,image.60.jpg,image.4060.jpg


## Preprocessing

In [None]:
# your code here

## Model setup

In [None]:
# your code here

## Training

In [None]:
# your code here

## Evaluation

In [None]:
# build `predictions` here as a list with one mapping per test row,
# from image file name (str) to prediction score (float)
# see sample_submission.json for an example
# your code here
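
As an illustration of the expected format, the sketch below converts per-row scores into the list of mappings described above. `to_predictions` is a hypothetical helper, and it assumes the model produces one score per candidate image, in the same order as the `image0` to `image9` columns.

```python
def to_predictions(image_names, scores):
    """Pair each candidate image name with its score, one mapping per row.

    image_names: per-row sequences of candidate image file names.
    scores: per-row sequences of numeric scores, aligned with image_names.
    """
    predictions = []
    for names, row_scores in zip(image_names, scores):
        if len(names) != len(row_scores):
            raise ValueError("each row needs one score per candidate image")
        predictions.append(
            {name: float(score) for name, score in zip(names, row_scores)}
        )
    return predictions
```

With the dataframe above, `image_names` could be `test_df.iloc[:, 2:-1].values`, the same slice that is passed to `evaluate` below.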

In [23]:
# sample predictions generation
# please, REMOVE this code block after implementing real predictions generation
predictions = [{row[f"image{i}"]: np.random.random() for i in range(10)} for _, row in test_df.iterrows()]

In [86]:
evaluate(
    test_df.iloc[:, 2:-1].values,
    test_df["label"].values.reshape(-1, 1),
    predictions,
)

{'acc1': 0.10816448152562574,
 'acc3': 0.30154946364719903,
 'mrr': 0.2968386401044327}
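
For reference, the three reported metrics can be sketched with their usual definitions: accuracy at rank 1, accuracy within the top 3, and mean reciprocal rank. This is an assumption about what `evaluate` computes, not its actual implementation, and `ranking_metrics` is a hypothetical name. The sketch assumes the gold image is always among the scored candidates.

```python
def ranking_metrics(gold_labels, predictions):
    """Compute acc1, acc3 and MRR from per-row {image name: score} mappings."""
    hits_at_1 = hits_at_3 = 0
    reciprocal_rank_sum = 0.0
    for gold, scores in zip(gold_labels, predictions):
        # Candidates ordered from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        rank = ranked.index(gold) + 1  # 1-based rank of the gold image
        hits_at_1 += rank == 1
        hits_at_3 += rank <= 3
        reciprocal_rank_sum += 1.0 / rank
    n = len(predictions)
    return {
        "acc1": hits_at_1 / n,
        "acc3": hits_at_3 / n,
        "mrr": reciprocal_rank_sum / n,
    }


# Expected MRR of a uniformly random ranking over 10 candidates:
# the gold image is equally likely to land at each rank 1..10.
random_mrr = sum(1.0 / k for k in range(1, 11)) / 10
```

Under these definitions, random scores over 10 candidates give an expected acc1 of 0.1, acc3 of 0.3, and MRR of about 0.293, which is consistent with the values printed above for the sample predictions.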

In [90]:
# writes the predictions in the target submission format to PATH
# (<project root>/data/train_v1 with the config above)
with open(PATH / "template_submission.json", 'w') as f:
    json.dump([{k: str(v) for k, v in p.items()} for p in predictions], f, indent=2)

## Further steps

1. If you consider your attempt successful, do not forget to **save your model** and to change the submission file name (`template_submission.json` in the cell above) so that earlier results are not overwritten
2. Submit a copy of your notebook to the repo (via a PR, if necessary)
3. **Study the cases where the model does not predict the correct label**
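
One possible starting point for the error study in step 3 is to collect the rows where the top-scored image differs from the gold label. The helper below is a hypothetical sketch (`misclassified_cases` is not part of the repo) and assumes predictions in the format used in the Evaluation section.

```python
def misclassified_cases(rows, predictions):
    """Return the rows whose top-scored image differs from the gold label.

    rows: iterable of (word, context, gold_image) tuples.
    predictions: one {image name: score} mapping per row, in the same order.
    """
    errors = []
    for (word, context, gold), scores in zip(rows, predictions):
        predicted = max(scores, key=scores.get)
        if predicted != gold:
            errors.append({
                "word": word,
                "context": context,
                "gold": gold,
                "predicted": predicted,
                # Score assigned to the gold image (None if it was not scored).
                "gold_score": scores.get(gold),
            })
    return errors
```

With the test dataframe, the rows could be supplied as `test_df[["word", "context", "label"]].itertuples(index=False, name=None)`.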