#### Prerequisites
1. A plain C compiler is needed right away.
2. In the future, you may also need a Java compiler (a JDK, not just a JRE), but not right now.
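
A quick way to confirm the first prerequisite is to look for a compiler on `PATH`. The sketch below is only an illustration: the candidate names (`cc`, `gcc`, `clang`) are an assumption, and the installer may look for a specific compiler.

```python
import shutil

def find_c_compiler(candidates=("cc", "gcc", "clang")):
    """Return the path of the first C compiler found on PATH, or None."""
    # The candidate names are an assumption, not a list taken from FlexNeuART.
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None

compiler = find_c_compiler()
print("C compiler:", compiler if compiler else "not found, install one before continuing")
```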

#### Install flexneuart from PyPI
Alternatively, you can install it from source: git clone the framework [from this location](https://github.com/oaqa/FlexNeuART) and run the following command from the root folder:
```
pip install .
```

In [None]:
import os

# Unpack the FlexNeuART scripts to a directory of your choice;
# pick a different directory than the one used in this notebook if needed
os.environ['FLEXNEUART_SCRIPTS_DIR']=os.path.expanduser('~/flexneuart_scripts')

In [None]:
!flexneuart_install_extra.sh $FLEXNEUART_SCRIPTS_DIR 0

 Installing additional scripts & binaries 
 log: /disk3/test_mcds2024/install.log
            INSTALL IS COMPLETE!


#### FlexNeuART installs PyTorch, but you may still need to reinstall PyTorch yourself
[Official PyTorch download page for older PyTorch versions](https://pytorch.org/get-started/previous-versions/)

In [None]:
# But first check that PyTorch installed successfully and supports the GPU (it is easy to install a CPU-only version by mistake)
import torch
torch.FloatTensor([3]).cuda()

tensor([3.], device='cuda:0')
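
The check above raises an exception on a CPU-only build. A softer variant, sketched below, reports what is installed without failing. The helper name `torch_gpu_status` is hypothetical; it relies only on `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()`.

```python
def torch_gpu_status():
    """Summarize the PyTorch install without raising on CPU-only setups."""
    try:
        import torch
    except ImportError:
        return {"installed": False, "version": None,
                "cuda_build": None, "gpu_available": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "gpu_available": torch.cuda.is_available(),
    }

print(torch_gpu_status())
```

If `cuda_build` is `None`, the installed wheel is CPU-only. If it is set but `gpu_available` is `False`, the driver or CUDA version is the more likely culprit.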

#### If the check failed, you do need to reinstall PyTorch

In [None]:
# Use nvidia-smi to check your CUDA version; it determines which PyTorch distribution to use.
# Note that I install a relatively old version of PyTorch because I also have old NVIDIA drivers.
!pip uninstall -y torch
!pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116
Collecting torch==1.13.1+cu116
  Using cached https://download.pytorch.org/whl/cu116/torch-1.13.1%2Bcu116-cp39-cp39-linux_x86_64.whl (1977.9 MB)
Installing collected packages: torch
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchdata 0.7.1 requires torch>=2, but you have torch 1.13.1+cu116 which is incompatible.
torchtext 0.17.1 requires torch==2.2.1, but you have torch 1.13.1+cu116 which is incompatible.
Successfully installed torch-1.13.1+cu116
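
In the install command, the CUDA tag appears twice: in the version suffix and in the index URL, and the two must match. The hypothetical helper below makes that explicit. It only formats the arguments; not every version and tag combination exists, so check the previous-versions page before using the result.

```python
def torch_pip_args(torch_version, cuda_tag):
    """Build pip arguments for a CUDA-specific PyTorch wheel.

    cuda_tag is a string such as 'cu116' (CUDA 11.6).
    """
    return [
        "install",
        f"torch=={torch_version}+{cuda_tag}",
        "--extra-index-url",
        f"https://download.pytorch.org/whl/{cuda_tag}",
    ]

# Reproduces the command used in this notebook
print("pip " + " ".join(torch_pip_args("1.13.1", "cu116")))
```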


#### Data setup

In [None]:
# Define the root dataset directory
os.environ['COLLECT_ROOT']=os.path.expanduser('~/datasets')
os.environ['DATASET_NAME']='msmarco_pass'

In [None]:
# Create a collection directory and a directory to store training data

In [None]:
!echo "Your dataset directory: $COLLECT_ROOT/$DATASET_NAME"

Your dataset directory: /home/leo/datasets/msmarco_pass


In [None]:
!mkdir -p $COLLECT_ROOT/$DATASET_NAME/derived_data
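
The same directory layout can be created from Python, which is safe to re-run because existing directories are not treated as an error. This is a minimal sketch; the function name `make_dataset_dirs` is made up for illustration.

```python
import os

def make_dataset_dirs(collect_root, dataset_name, subdirs=("derived_data",)):
    """Create <collect_root>/<dataset_name>/<subdir> for each subdir.

    Returns the list of directory paths; existing directories are left as-is.
    """
    dataset_dir = os.path.join(collect_root, dataset_name)
    created = []
    for sub in subdirs:
        path = os.path.join(dataset_dir, sub)
        os.makedirs(path, exist_ok=True)  # also creates missing parents
        created.append(path)
    return created
```

In this notebook it would be called as `make_dataset_dirs(os.environ['COLLECT_ROOT'], os.environ['DATASET_NAME'])`.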

In [None]:
# Download sample training data
!cd $COLLECT_ROOT/$DATASET_NAME/derived_data ; \
    wget https://file.io/S3RibiDh6Bhn ; \
    mv S3RibiDh6Bhn cedr_train_pass_50K_200_0_5_0_s1_bitext_2022-03-24.tar.bz2 ; \
    tar jxvf cedr_train_pass_50K_200_0_5_0_s1_bitext_2022-03-24.tar.bz2

--2024-03-11 21:59:54--  https://file.io/S3RibiDh6Bhn
Resolving file.io (file.io)... 45.55.107.24
Connecting to file.io (file.io)|45.55.107.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-bzip2]
Saving to: ‘S3RibiDh6Bhn’

S3RibiDh6Bhn            [         <=>        ]  80.05M  47.7MB/s    in 1.7s    

2024-03-11 21:59:56 (47.7 MB/s) - ‘S3RibiDh6Bhn’ saved [83938913]

cedr_train_pass_50K_200_0_5_0_s1_bitext/
cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/
cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/qrels.txt
cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/data_query.tsv
cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/test_run.txt
cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/train_pairs.tsv
cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/data_docs.tsv
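
Before training, it can be worth confirming that the archive unpacked completely. The sketch below checks for the five files shown in the listing above. Treating all five as required is an assumption based on that listing, not on the training script itself.

```python
import os

# File names taken from the tar listing above
EXPECTED_FILES = (
    "qrels.txt",
    "data_query.tsv",
    "test_run.txt",
    "train_pairs.tsv",
    "data_docs.tsv",
)

def missing_training_files(train_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from train_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(train_dir, name))]
```

An empty list from `missing_training_files(...)` on the `text_raw` directory means every listed file is present.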


#### Training a sample model

In [None]:
# Create a joint model & training configuration and save it to disk
!mkdir -p $COLLECT_ROOT/$DATASET_NAME/model_conf

In [None]:
import json
MODEL_TRAIN_CONF={
    "max_query_len": 64,
    "max_doc_len": 445,

    "epoch_lr_decay": 0.95,

    "lr_schedule": "const_with_warmup",
    "warmup_pct": 0.2,

    "init_lr": 0.0002,
    "init_bert_lr": 2e-05,

    "loss_func": "pairwise_margin",

    "model.dropout": 0.05,
    "weight_decay": 1e-07,

    "backprop_batch_size": 1,
    "batch_size": 16,
    "batch_size_val": 16,

    "eval_metric": "recip_rank"
}
with open(os.environ['COLLECT_ROOT'] + '/' + os.environ['DATASET_NAME'] + '/model_conf/vanilla_bert.json', 'w') as out_model_file:
    json.dump(MODEL_TRAIN_CONF, out_model_file, indent=4)
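
After writing the configuration, it can be read back and sanity-checked before a long training run. The checks below are illustrative assumptions about sensible values (a warm-up fraction in [0, 1], positive integer batch sizes), not validation rules taken from FlexNeuART.

```python
import json

def check_train_conf(path):
    """Load a JSON training config and return (config, list of problems)."""
    with open(path) as conf_file:
        conf = json.load(conf_file)
    problems = []
    # Assumption: warmup_pct is a fraction of training, so it lies in [0, 1].
    warmup = conf.get("warmup_pct")
    if warmup is not None and not 0.0 <= warmup <= 1.0:
        problems.append(f"warmup_pct out of range: {warmup}")
    # Assumption: every batch size must be a positive integer.
    for key in ("batch_size", "batch_size_val", "backprop_batch_size"):
        value = conf.get(key)
        if value is not None and (not isinstance(value, int) or value < 1):
            problems.append(f"{key} must be a positive integer, got {value!r}")
    return conf, problems
```

For the configuration above, the returned problem list should be empty.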

In [None]:
# Go to the directory where the FlexNeuART scripts were unpacked
os.chdir(os.environ['FLEXNEUART_SCRIPTS_DIR'])

In [None]:
# Re-setting all key environment variables
%env TRAINING_DATA_SUBDIR=cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/

env: TRAINING_DATA_SUBDIR=cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/


In [None]:
# Run the training script.
# Alternatively, you can run it in a shell, but make sure to set these variables first:
# export COLLECT_ROOT=$HOME/datasets
# export DATASET_NAME=msmarco_pass
# export TRAINING_DATA_SUBDIR=cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/
!./train_nn/train_model.sh \
    $DATASET_NAME \
    $TRAINING_DATA_SUBDIR \
    vanilla_bert \
    -seed 0 \
    -add_exper_subdir todays_experiment \
    -json_conf model_conf/vanilla_bert.json \
    -epoch_qty 1
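
If you prefer to launch training from a Python script rather than a notebook cell, the same command can be assembled as an argument list. This is a sketch under stated assumptions: the helpers `build_train_cmd` and `run_training` are hypothetical, they mirror only the arguments used in the cell above, and they assume the script reads `COLLECT_ROOT` from the environment, as the shell instructions suggest.

```python
import os
import subprocess

def build_train_cmd(dataset_name, train_subdir, model_name, options):
    """Assemble the argument list for train_model.sh.

    options maps flag names (without the leading dash) to values.
    """
    cmd = ["./train_nn/train_model.sh", dataset_name, train_subdir, model_name]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd

def run_training(cmd, scripts_dir, collect_root):
    """Run the command from the scripts directory with COLLECT_ROOT set."""
    env = {**os.environ, "COLLECT_ROOT": collect_root}
    return subprocess.run(cmd, cwd=scripts_dir, env=env, check=True)

cmd = build_train_cmd(
    "msmarco_pass",
    "cedr_train_pass_50K_200_0_5_0_s1_bitext/text_raw/",
    "vanilla_bert",
    {
        "seed": 0,
        "add_exper_subdir": "todays_experiment",
        "json_conf": "model_conf/vanilla_bert.json",
        "epoch_qty": 1,
    },
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` turns a non-zero exit status into an exception.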