<a href="https://colab.research.google.com/github/codylw2/CourseProject/blob/main/colab_tfr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# General Setup
This notebook should be run with a GPU runtime and standard memory. I believe a standard Colab account is sufficient.

This initial code block installs all of the modules that this notebook requires.

In [None]:
!pip install -q PyDrive
!pip install -q metapy
!pip install -q pytoml
!pip install -q tensorflow_ranking

# Authenticate for Google Drive
My GitHub repository has a file size limit, so I have added the files used in this session to my Google Drive and made them publicly shareable. To access them you must authenticate to Google. It should not matter which account you use, since the files are public.

In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Acquire Files
This section acquires the files that the later scripts require. It also downloads a tuned version of the model, so there is no need to run training unless desired. All of the scripts are stored in the git repo, which is the first thing downloaded. The remaining supporting files are stored in my Google Drive account because of file size limits on GitHub. The files are publicly available, so anyone should be able to download and use them. The Google Drive files are downloaded through the PyDrive Python module.

In [None]:
!git clone https://github.com/codylw2/CourseProject.git

This code block is where a selection of variables that will be used throughout this notebook are defined.

In [None]:
%cd /content/CourseProject/competition/tfr_custom

WORKDIR = !pwd
WORKDIR = WORKDIR[0]
BASE = WORKDIR + "/../.."
DATASET_DIR = BASE + "/competition/datasets"
JSON_DIR = BASE + "/competition/json_data"
TUNED_MODEL_DIR = WORKDIR + "/finetuned"
VOCAB_FILE = DATASET_DIR + "/scibert_vocab.txt"

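# `!export` only affects a throwaway subshell; %env sets the variable for the notebook and later `!` commands.
%env CUDA_VISIBLE_DEVICES=0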
QUERY_TOKENS = "narrative_tokens"
SEQ_LENGTH = 512

The next code block contains the definition of the function that is used to download folders from Google Drive. It iterates over all of the objects in the folder that it is called on. If the object is a file then it will download it but if it is another folder then it will recurse over it as well until everything has been downloaded.

Initial source for how to download a folder: https://stackoverflow.com/questions/47002558/downloading-all-of-the-files-in-a-specific-folder-with-pydrive

In [None]:
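import os

def download_folder(dir_name, dir_id):
  """Recursively download the Google Drive folder `dir_id` into `dir_name`."""
  curr_wrkdir = os.getcwd()
  os.makedirs(dir_name, exist_ok=True)
  os.chdir(dir_name)

  try:
    # List every non-trashed object whose parent is this folder.
    file_list = drive.ListFile({'q': "'{}' in parents and trashed=false".format(dir_id)}).GetList()
    for i, file1 in enumerate(sorted(file_list, key=lambda x: x['title']), start=1):
      print('Downloading from GDrive ({}/{}): {} '.format(i, len(file_list), os.path.join(dir_name, file1['title'])))
      try:
        file1.GetContentFile(file1['title'])
      except Exception:
        # A folder cannot be downloaded as a file, so treat the failure as a subfolder and recurse.
        download_folder(os.path.join(dir_name, file1['title']), file1['id'])
  finally:
    # Always restore the caller's working directory, even if a download fails.
    os.chdir(curr_wrkdir)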

The links shown below are what Google Drive generates when you get the sharing link for a file or folder. Each link contains an 'id' for the file or folder that you can use to download it.
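
As a small illustration of where the 'id' sits inside a sharing link, here is a sketch that pulls it out with a regular expression. The helper name `drive_id_from_link` is hypothetical and is not part of this repo or of PyDrive; it only assumes the two common link shapes (`.../folders/<id>` and `.../file/d/<id>/...`).

```python
import re

def drive_id_from_link(link):
    """Return the id segment of a Google Drive sharing link, or None if not found."""
    # Folder links look like .../folders/<id>?usp=sharing,
    # file links look like .../file/d/<id>/view?usp=sharing.
    match = re.search(r"/(?:folders|d)/([A-Za-z0-9_-]+)", link)
    return match.group(1) if match else None

link = "https://drive.google.com/drive/folders/1A8gdkcwpypbqlOZKHjaR7aYWymYbjWWd?usp=sharing"
print(drive_id_from_link(link))
```

The cells below simply hard-code the ids, which is equivalent and avoids the extra helper.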

In [None]:
print('Downloading '+JSON_DIR)
# https://drive.google.com/drive/folders/1A8gdkcwpypbqlOZKHjaR7aYWymYbjWWd?usp=sharing
folder_id = "1A8gdkcwpypbqlOZKHjaR7aYWymYbjWWd"
download_folder(JSON_DIR, folder_id)

print('\nDownloading '+TUNED_MODEL_DIR)
# https://drive.google.com/drive/folders/1ZnFHLsl_F0bT0BBtOGuen3IgMELpUpel?usp=sharing
folder_id = "1ZnFHLsl_F0bT0BBtOGuen3IgMELpUpel"
download_folder(TUNED_MODEL_DIR, folder_id)

# Train Model (not recommended)
Running this section will recreate the training data that was downloaded in the previous section and it will take considerable time to complete. I advise against running this section unless you really want to verify that absolutely everything works. Your results will also not necessarily be the same as mine.

This code block generates the "Example List with Context" (ELWC) files that will be used when training the model. The document list associated with each query is broken up into chunks whose size is set by 'list_size'.
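
To make the chunking concrete, here is a minimal sketch of splitting one query's document list into pieces of at most 'list_size'. The function `chunk_documents` is a hypothetical stand-in; the actual logic in `tfr_convert_json_to_elwc.py` may pad, shuffle, or handle the final short chunk differently.

```python
def chunk_documents(doc_ids, list_size):
    """Split one query's document list into consecutive chunks of at most list_size."""
    return [doc_ids[i:i + list_size] for i in range(0, len(doc_ids), list_size)]

# 1200 made-up document ids for a single query.
docs = ["doc{}".format(i) for i in range(1200)]
chunks = chunk_documents(docs, 500)
print([len(c) for c in chunks])  # [500, 500, 200]
```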

In [None]:
!python $WORKDIR/tfr_convert_json_to_elwc.py \
    --vocab_file $VOCAB_FILE \
    --sequence_length=$SEQ_LENGTH \
    --query_file=$JSON_DIR/train_queries.json \
    --qrel_file=$JSON_DIR/train_qrels.json \
    --doc_file=$JSON_DIR/train_docs.json \
    --query_key=$QUERY_TOKENS \
    --output_train_file=$WORKDIR/tfrecord_data/train.elwc.tfrecord \
    --output_eval_file=$WORKDIR/tfrecord_data/eval.elwc.tfrecord \
    --list_size=500 \
    --do_lower_case


The next code block deletes any existing tuning data for the model. If it were not cleared, training would resume from the last saved step in that data. Because the downloaded tuning data was already trained past the number of steps requested below, the training script would exit without training.

In [22]:
!rm -rf $TUNED_MODEL_DIR

The next script is what actually trains the model. It was lifted with minimal changes from the cognitiveai source described in the project documentation. The original version required the user to run bazel to compile the script and its dependencies into an executable, whereas this version can be run directly with Python. The script creates a pipeline that trains the model until it reaches the defined number of training steps. If the saved model in 'TUNED_MODEL_DIR' already equals or exceeds this number of steps, the script notes that and ends without performing any training.
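
The stopping rule described above can be sketched in plain Python. This is only an illustration: it assumes checkpoint files are named like `model.ckpt-<step>.index`, which is the usual TensorFlow Estimator convention but is not confirmed for `tfr_train.py`, and both function names are hypothetical.

```python
import re

def latest_checkpoint_step(filenames):
    """Return the highest step found in names like 'model.ckpt-10000.index', or 0."""
    steps = []
    for name in filenames:
        match = re.search(r"model\.ckpt-(\d+)", name)  # assumed naming scheme
        if match:
            steps.append(int(match.group(1)))
    return max(steps, default=0)

def remaining_steps(filenames, num_train_steps):
    """Number of steps still to train; 0 means training would be skipped."""
    return max(num_train_steps - latest_checkpoint_step(filenames), 0)

# A model directory already trained past the target: nothing left to do.
print(remaining_steps(["checkpoint", "model.ckpt-12000.index"], 10000))  # 0
# An empty (freshly deleted) model directory: train the full 10000 steps.
print(remaining_steps([], 10000))  # 10000
```

This is why the previous cell removes 'TUNED_MODEL_DIR' before training.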

In [None]:
!python "$WORKDIR/tfr_train.py" \
    --train_path="$WORKDIR/tfrecord_data/train.elwc.tfrecord" \
    --eval_path="$WORKDIR/tfrecord_data/eval.elwc.tfrecord" \
    --vocab_path=$VOCAB_FILE \
    --model_dir=$TUNED_MODEL_DIR \
    --data_format=example_list_with_context \
    --num_train_steps=10000 \
    --learning_rate=.005 \
    --dropout_rate=0.65 \
    --list_size=500 \
    --embedding_dim=$SEQ_LENGTH \
    --loss=approx_ndcg_loss \
    --listwise_inference \
    --config=cuda


# Predict with Model
This section uses a trained model to predict the relevance scores of documents.

This is the most important block of code in the notebook. It runs the script that predicts the relevance scores for a list of documents. The documents to score are taken from the output of another ranker. There are two main reasons to run this model on the output of another ranker instead of ranking every document in the corpus. First, ranking every document in the corpus can take a very long time; for this corpus it can take approximately 24 hours or more. Second, the NDCG of the results generated from the full corpus is very low, whereas re-ranking another ranker's predictions runs quickly and produces more accurate results than that ranker alone.
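
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k. It uses the exponential-gain form of DCG, which is one common definition; the competition's evaluator may use a different gain or cutoff, and the relevance labels below are made up for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance labels, in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

first_stage = [0, 2, 1, 0]  # hypothetical labels in the first ranker's order
reranked = [2, 1, 0, 0]     # the same documents after an ideal re-ranking
print(ndcg_at_k(first_stage, 4), ndcg_at_k(reranked, 4))
```

Moving relevant documents toward the top of the list raises the score, which is the effect re-ranking aims for.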

The script itself loads all of the necessary documents and queries and then loops over the queries. The list of documents for each query is broken up into chunks of size 'docs_at_once', and each chunk is converted into ELWC format. The query tokens used for each chunk are determined by the 'query_key' argument, which for this model will likely be either "narrative_tokens" or "question_tokens". The formatted chunks are then fed into the model to generate predictions. The list is broken up so that the GPU does not run out of memory while predicting.
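
The chunk, score, and merge loop can be sketched as follows. This is not the code in `tfr_predict.py`: the `rerank` function and the `score_chunk` callback are hypothetical, and a dictionary of made-up scores stands in for the ELWC conversion and the model call.

```python
def rerank(candidates, score_chunk, docs_at_once):
    """Score candidates in chunks of docs_at_once and return them sorted by score, best first."""
    scores = {}
    for start in range(0, len(candidates), docs_at_once):
        chunk = candidates[start:start + docs_at_once]
        # score_chunk stands in for "convert chunk to ELWC and run the model".
        for doc_id, score in zip(chunk, score_chunk(chunk)):
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Made-up scores that play the role of model predictions.
fake_scores = {"d1": 0.1, "d2": 0.9, "d3": 0.5}
ranked = rerank(["d1", "d2", "d3"], lambda chunk: [fake_scores[d] for d in chunk], docs_at_once=2)
print(ranked)  # ['d2', 'd3', 'd1']
```

Only one chunk is held by the scorer at a time, which is what keeps GPU memory use bounded.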

This script is a heavily modified version of a script that comes from the cognitiveai git repo, which is recognized in the resources section of the project documentation. The initial version required compiling the script with bazel and using Docker to run a prediction server that fed results back to the script via gRPC. This version of the script does neither of those things, with no performance degradation.

In [None]:
!mkdir -p $WORKDIR/scores
!python $WORKDIR/tfr_predict.py \
    --vocab_file $VOCAB_FILE \
    --sequence_length $SEQ_LENGTH \
    --query_file $JSON_DIR/test_queries.json \
    --query_key $QUERY_TOKENS \
    --doc_file $JSON_DIR/test_docs.json \
    --output_file $WORKDIR/scores/test_scores.json \
    --model_path $TUNED_MODEL_DIR \
    --docs_at_once 500 \
    --rerank_file $BASE/predictions.txt \
    --do_lower_case


This final code block prints the predictions file that was generated by the prediction script.

In [None]:
!cat $BASE/predictions.txt