
### [1] SSH onto the HPC

`ssh SJSU_ID@coe-hpc1.sjsu.edu`

### [2] Clone Atlas to your home directory

`git clone https://github.com/facebookresearch/atlas.git`

### [3] Copy files from /data to atlas_data folder
##### You need the 10-million-passage dataset for retrieval, the nq_data folder, and the models folder

`mkdir -p /home/SJSU_ID/atlas/atlas_data/corpora/wiki/enwiki-dec2018`

`cp /data/cmpe259-fa23/atlas_data/corpora/wiki/enwiki-dec2018/text-list-10mil.jsonl /home/SJSU_ID/atlas/atlas_data/corpora/wiki/enwiki-dec2018/text-list-10mil.jsonl`

`cp -r /data/cmpe259-fa23/atlas_data/nq_data /home/SJSU_ID/atlas/atlas_data/`

`cp -r /data/cmpe259-fa23/atlas_data/models/ /home/SJSU_ID/atlas/atlas_data`
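
Before moving on, it may help to confirm that all three copies landed where the batch script expects them. The following is a minimal sketch, assuming the directory layout used in the commands above; `check_atlas_data` is a hypothetical helper name, not part of Atlas.

```shell
#!/bin/bash
# check_atlas_data (hypothetical helper): report which expected paths are
# missing under the given atlas_data directory.
# Returns 0 if everything is present, 1 otherwise.
check_atlas_data() {
    local data_dir="$1"
    local missing=0
    local rel
    for rel in \
        "corpora/wiki/enwiki-dec2018/text-list-10mil.jsonl" \
        "nq_data" \
        "models"; do
        if [ ! -e "${data_dir}/${rel}" ]; then
            echo "missing: ${data_dir}/${rel}"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_atlas_data /home/SJSU_ID/atlas/atlas_data && echo "all data in place"` prints the confirmation only when nothing is missing.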

### [4] Change directory to Atlas

`cd atlas`

### [5] Load Python 3.8.8

`module load python3/3.8.8`

### [6] Install Requirements 

##### TORCH with CUDA 

`pip3 install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113`


##### requirements.txt file

`pip3 install -r requirements.txt`
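
A quick sanity check after installation can save a failed batch job later. The sketch below assumes `python3` on your `PATH` is the interpreter you just installed into; `check_torch` is a hypothetical helper name. It prints the installed PyTorch version and whether CUDA is visible. On a login node without a GPU, the second value may well be `False` even when the install is fine.

```shell
#!/bin/bash
# check_torch (hypothetical helper): print the installed torch version and
# whether torch can see a CUDA device from the current node.
check_torch() {
    python3 -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
}
```

Run `check_torch` after the two `pip3 install` commands; an `ImportError` here means the install did not go into the loaded Python.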

### [7] Copy the batch script below and save it as atlas_demo.sh in the atlas directory

#### Update the necessary fields (SJSU ID, email, home directory location, etc.) accordingly. Make sure the job name contains your SJSU ID.

```
#!/bin/bash
#SBATCH --mail-user=SJSU_EMAIL@sjsu.edu
# To discard email notifications instead, disable the line above and enable this one:
##SBATCH --mail-user=/dev/null
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --job-name=gpuTest_SJSU_ID
#SBATCH --output=gpuTest_%j.out
#SBATCH --error=gpuTest_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=48:00:00     
##SBATCH --mem-per-cpu=2000
##SBATCH --gres=gpu:p100:1
#SBATCH --partition=gpu   

# on coe-hpc1 cluster load
# module load python3/3.8.8
#
# on coe-hpc2 cluster load:
module load python-3.10.8-gcc-11.2.0-c5b5yhp slurm

export http_proxy=http://172.16.1.2:3128; export https_proxy=http://172.16.1.2:3128

cd /home/SJSU_ID/atlas

DATA_DIR=/home/SJSU_ID/atlas/atlas_data
port=$(shuf -i 15000-16000 -n 1)
TRAIN_FILE="${DATA_DIR}/nq_data/train.64-shot.jsonl"
EVAL_FILES="${DATA_DIR}/nq_data/dev.jsonl"
SAVE_DIR=${DATA_DIR}/experiments/
EXPERIMENT_NAME=my_ten_mil_exp
TRAIN_STEPS=30


# Run Atlas training (Slurm executes this once the script is submitted with sbatch)
python3 /home/SJSU_ID/atlas/train.py \
    --shuffle --train_retriever --gold_score_mode pdist \
    --use_gradient_checkpoint_reader --use_gradient_checkpoint_retriever \
    --precision bf16 --shard_optim --shard_grads \
    --temperature_gold 0.01 --refresh_index -1 --query_side_retriever_training \
    --target_maxlength 16 --reader_model_type google/t5-base-lm-adapt \
    --dropout 0.1 --weight_decay 0.01 --lr 4e-5 --lr_retriever 4e-5 --scheduler linear \
    --text_maxlength 256 \
    --model_path "${DATA_DIR}/models/atlas/base/" \
    --train_data "${TRAIN_FILE}" \
    --eval_data "${EVAL_FILES}" \
    --per_gpu_batch_size 1 --n_context 10 --retriever_n_context 10 \
    --name "${EXPERIMENT_NAME}" \
    --checkpoint_dir "${SAVE_DIR}" \
    --eval_freq 30 --log_freq 4 \
    --total_steps "${TRAIN_STEPS}" --warmup_steps 5 --save_freq "${TRAIN_STEPS}" \
    --main_port "${port}" \
    --write_results --task qa --index_mode flat \
    --passages "${DATA_DIR}/corpora/wiki/enwiki-dec2018/text-list-10mil.jsonl" \
    --save_index_path "${SAVE_DIR}/${EXPERIMENT_NAME}/saved_index"
```
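
Editing every placeholder by hand is error-prone, so one option is to substitute them with `sed`. This is a minimal sketch: `fill_placeholders` is a hypothetical helper, and it assumes your ID and email user name contain no `/` or `&` characters, which would confuse the `sed` expressions.

```shell
#!/bin/bash
# fill_placeholders (hypothetical helper): read a script on stdin and replace
# the SJSU_ID and SJSU_EMAIL placeholders with the given values.
#   $1 = your SJSU ID
#   $2 = the part of your SJSU email address before the @
fill_placeholders() {
    local sjsu_id="$1"
    local email_user="$2"
    sed -e "s/SJSU_EMAIL/${email_user}/g" -e "s/SJSU_ID/${sjsu_id}/g"
}
```

A possible invocation, with made-up values, is `fill_placeholders 012345678 jane.doe < atlas_demo.sh > atlas_demo.filled.sh`. Running `grep -n 'SJSU_' atlas_demo.filled.sh` afterwards should print nothing if every placeholder was replaced.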

### [8] SSH to coe-hpc2

`ssh coe-hpc2`

### [9] Run the batch script through slurm 

`cd atlas`

`sbatch atlas_demo.sh`
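
On success, `sbatch` prints a line of the form `Submitted batch job <id>`, and that ID is what Slurm substitutes for `%j` in the output and error file names. A small sketch for capturing it is shown here; `parse_job_id` is a hypothetical helper name.

```shell
#!/bin/bash
# parse_job_id (hypothetical helper): extract the numeric job ID from the
# "Submitted batch job <id>" line that sbatch writes to stdout.
parse_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}
```

Usage would look like `job_id=$(sbatch atlas_demo.sh | parse_job_id)`, after which the log file is `gpuTest_${job_id}.out`. Alternatively, `sbatch --parsable` prints the job ID without the surrounding text.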

### [10] Check status of your job through squeue

`squeue`

`squeue -u SJSU_ID`
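
Rather than rerunning `squeue` by hand, you could poll until the job leaves the queue. The sketch below assumes you know the job ID; `wait_for_job` and its default 30-second interval are illustrative choices, not part of Slurm.

```shell
#!/bin/bash
# wait_for_job (hypothetical helper): block until the given job no longer
# appears in squeue output.
#   $1 = job ID
#   $2 = polling interval in seconds (defaults to 30)
wait_for_job() {
    local job_id="$1"
    local interval="${2:-30}"
    # -h suppresses the header line, so empty output means the job is
    # neither pending nor running any more.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job ${job_id} has left the queue"
}
```

Leaving the queue only means the job ended, not that it succeeded, so the log files are still worth checking.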

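### [11] Check log files

##### Replace JOBID with the job ID that sbatch printed (Slurm substitutes it for %j in the file name)

`cat gpuTest_JOBID.out`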
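
If you have lost track of the job ID, the most recently modified log is usually the one you want. This is a minimal sketch, assuming the logs follow the `gpuTest_<jobid>.out` naming from the batch script; `latest_log` is a hypothetical helper name.

```shell
#!/bin/bash
# latest_log (hypothetical helper): print the most recently modified
# gpuTest_*.out file in the given directory (defaults to the current one).
latest_log() {
    local dir="${1:-.}"
    # ls -t sorts by modification time, newest first.
    ls -t "$dir"/gpuTest_*.out 2>/dev/null | head -n 1
}
```

For instance, `tail -n 50 "$(latest_log /home/SJSU_ID/atlas)"` shows the end of the newest output log.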

### [12] Check experiment run log 

`cat /home/SJSU_ID/atlas/atlas_data/experiments/my_ten_mil_exp/run.log`

### [13] After code execution, check result file

`head -100 /home/SJSU_ID/atlas/atlas_data/experiments/my_ten_mil_exp/dev-step-30.jsonl`


