# ANLI Dataset Filtering

This Google Colab notebook is based on the [*Start Your NLI Research*](https://github.com/facebookresearch/anli/blob/main/mds/start_your_nli_research.md) instructions in the [ANLI](https://github.com/facebookresearch/anli) GitHub repo. It is intended to run on a [Google Colab Pro](https://colab.research.google.com/signup) or [Pro+](https://colab.research.google.com/signup) account with a GPU-backed runtime.

## Connect to Google Drive

We will connect to [Google Drive](https://drive.google.com) so that model weights and data persist in the cloud. This is needed because a Google Colab session has a maximum runtime of 24 hours, even with Pro and Pro+ accounts, and files on the runtime's local disk are lost when the session ends. After running the cell below, you will be prompted to connect your Google Drive account.

In [None]:
# Mount into drive
from google.colab import drive
drive.mount("/content/drive")

The following cell creates an `ANLI Project Data` folder, with `scripts` and `checkpoints` subfolders, inside your `Colab Notebooks` folder on Google Drive.

In [None]:
%mkdir -p /content/drive/MyDrive/Colab\ Notebooks/ANLI\ Project\ Data
%mkdir -p /content/drive/MyDrive/Colab\ Notebooks/ANLI\ Project\ Data/scripts
%mkdir -p /content/drive/MyDrive/Colab\ Notebooks/ANLI\ Project\ Data/checkpoints
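The same folders can also be created from Python, which avoids escaping the spaces in the path. This is a minimal sketch that assumes the default Colab mount point used above; `ensure_project_dirs` is a hypothetical helper name, not part of the ANLI repo.

```python
from pathlib import Path

# Default project location once Drive is mounted at /content/drive.
PROJECT_ROOT = Path("/content/drive/MyDrive/Colab Notebooks/ANLI Project Data")


def ensure_project_dirs(root=PROJECT_ROOT, subdirs=("scripts", "checkpoints")):
    """Create the project folder and its subfolders; return the subfolder paths."""
    root = Path(root)
    created = []
    for name in subdirs:
        path = root / name
        # parents=True and exist_ok=True give the same behavior as `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

After mounting Drive, calling `ensure_project_dirs()` with no arguments creates the same layout as the `%mkdir` cell.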

## GPU Allocation

It is a good idea to record which GPU has been allocated to the session. The following commands report it in summary form (`-L`) and in full detail (`-q`).

In [None]:
!nvidia-smi -L

In [None]:
!nvidia-smi -q
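If you want the GPU name available as a Python value (for example, to log it next to your results), the sketch below wraps `nvidia-smi` with its `--query-gpu` option. The helper name `list_gpus` is hypothetical, and the function returns an empty list when no NVIDIA tooling is present rather than raising.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, total memory' string per visible GPU, or [] if none."""
    if shutil.which("nvidia-smi") is None:
        return []  # nvidia-smi is not installed on this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # the tool exists but could not talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU detected; check Runtime > Change runtime type.")
```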

## Project Setup

### Code Setup

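First, we need to download the [ANLI](https://github.com/facebookresearch/anli) repo and run its setup script. Because each `!` command runs in its own subshell, anything exported by `setup.sh` does not persist into later cells; the variables we need are set explicitly under *Environment Variables*.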

In [None]:
!git clone https://github.com/facebookresearch/anli.git 2>/dev/null
!source anli/setup.sh

Then, we'll change directory into the source code directory, `anli`.

In [None]:
import os
import sys

try:
    os.chdir('anli/')
except FileNotFoundError as e:
    print(f"Could not change directory: {str(e)}")

Finally, we need the `transformers` library from [Hugging Face](https://huggingface.co/docs/transformers/index) and `sentencepiece`, which is required for experiments with the XLNet model.

In [None]:
!pip install transformers sentencepiece

#### Environment Variables

Before moving on to dataset setup, we'll set some environment variables to prepare for the Bash and Python scripts that follow.

In [None]:
# %env stores the value literally, so it must not be wrapped in quotes.
%env PYTHONPATH=/env/python:/content/anli/src:/content/anli/utest:/content/anli/src/dataset_tools
%env MASTER_ADDR=localhost

### Dataset Setup

We can't train a model without data, so this cell downloads the SNLI, MNLI, FEVER-NLI, and ANLI datasets.

In [None]:
!bash ./script/download_data.sh

Now, we'll transform the dataset into a format that the ANLI project expects.

In [None]:
!python ./src/dataset_tools/build_data.py

## Update Training Script

If you've placed a modified `training.py` script in your Google Drive `Colab Notebooks/ANLI Project Data/scripts/` directory, uncomment and run the following line so that your updated script is used in the ***Model Training*** section.

In [None]:
#!cp /content/drive/MyDrive/Colab\ Notebooks/ANLI\ Project\ Data/scripts/training.py ./src/nli/training.py

Alternatively, if you are storing an updated `training.py` in a public GitHub/GitLab repo, uncomment and run the following line after updating the URL to point to the *raw* file. In a browser, this will look like a plaintext version of the file.

In [None]:
#!curl https://raw.githubusercontent.com/username/project/main/src/nli/training.py -o ./src/nli/training.py
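If you only have the regular GitHub page URL for the file (the one containing `/blob/`), it can be rewritten into the raw form mechanically. The sketch below covers GitHub only; `to_raw_github_url` is a hypothetical helper, and the `username/project` URL is the same placeholder used in the cell above.

```python
from urllib.parse import urlparse


def to_raw_github_url(url):
    """Convert a github.com 'blob' URL to its raw.githubusercontent.com form."""
    parsed = urlparse(url)
    if parsed.netloc == "raw.githubusercontent.com":
        return url  # already a raw URL, nothing to do
    parts = parsed.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<branch>/<path to file>
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {url}")
    user, repo, _blob, *rest = parts
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])


print(to_raw_github_url(
    "https://github.com/username/project/blob/main/src/nli/training.py"
))
```

The printed URL is the one to pass to `curl`.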

## Update Data

If any custom datasets are used for training or evaluation, the following lines bring them from Google Drive into Colab.

In [None]:
#%mkdir -p experiments

In [None]:
#!cp -R /content/drive/MyDrive/Colab\ Notebooks/ANLI\ Project\ Data/data/* ./experiments

## Model Training

Note that the list of supported models and additional, undocumented command-line arguments can be found in `/content/anli/src/nli/training.py`. The changelog below describes where the command differs from the [*Start Your NLI Research*](https://github.com/facebookresearch/anli/blob/main/mds/start_your_nli_research.md) instructions.

During training, model checkpoints will be automatically saved in a `saved_models` directory.

***Changelog***

* `-g 1`: This was changed to 1, since we only have one GPU.
* `--single_gpu`: This was added to keep the PyTorch multiprocessing logic from kicking in.
* `--experiment_name`: The name of the experiment. During training, model checkpoints will be saved in `saved_models/{TRAINING_START_TIME}_[experiment_name]`.

In [None]:
!python ./src/nli/training.py \
    --model_class_name "roberta-large" \
    -n 1 \
    -g 1 \
    --single_gpu \
    -nr 0 \
    --max_length 156 \
    --gradient_accumulation_steps 1 \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 16 \
    --save_prediction \
    --train_data snli_train:none,mnli_train:none \
    --train_weights 1,1 \
    --eval_data snli_dev:none \
    --eval_frequency 2000 \
    --experiment_name "roberta-large|snli|nli"
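When adapting this command to a GPU with less memory, it helps to keep track of the effective batch size per optimizer step. The sketch below uses the common convention that it is the product of the per-GPU batch size, the number of GPUs, the number of nodes, and the gradient accumulation steps. Whether `training.py` follows exactly this convention is an assumption, as is reading `-n` as the node count; `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(per_gpu_batch_size, gpus_per_node,
                         gradient_accumulation_steps, nodes=1):
    """Examples contributing to one optimizer step under the usual convention."""
    return per_gpu_batch_size * gpus_per_node * nodes * gradient_accumulation_steps


# Values from the command above: batch size 16, one GPU, no accumulation.
print(effective_batch_size(16, 1, 1))

# Halving the per-GPU batch size and doubling accumulation keeps the product the same.
print(effective_batch_size(8, 1, 2))
```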

**Make sure to queue the following cell right after your training cell** so that it runs as soon as training finishes. It copies the saved checkpoints from Colab to your Google Drive.

In [None]:
!cp -R ./saved_models/* /content/drive/MyDrive/Colab\ Notebooks/ANLI\ Project\ Data/checkpoints/
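The copy can also be done from Python, which makes it easy to rerun when the destination already holds earlier checkpoints. This is a minimal sketch, assuming Python 3.8 or later (for `dirs_exist_ok`) and the folder layout created at the top of this notebook; `sync_checkpoints` is a hypothetical helper name.

```python
import shutil
from pathlib import Path

DRIVE_CHECKPOINTS = "/content/drive/MyDrive/Colab Notebooks/ANLI Project Data/checkpoints"


def sync_checkpoints(src="./saved_models", dst=DRIVE_CHECKPOINTS):
    """Copy everything under src into dst; return the number of files in src."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        return 0  # nothing has been saved yet
    # dirs_exist_ok=True merges into an existing destination instead of failing.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for path in src.rglob("*") if path.is_file())
```

Calling `sync_checkpoints()` after training has the same effect as the `cp -R` cell, and it simply returns `0` if `saved_models` does not exist yet.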