<a href="https://colab.research.google.com/github/codylw2/CourseProject/blob/main/colab_cranfield_metapy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Authenticate for Google Drive
My GitHub repository has a file size limit, so the files used by this notebook are stored in my Google Drive and shared publicly. To access them you must authenticate with Google. Any account should work, since the files are public.

In [None]:
!pip install -q PyDrive
!pip install -q metapy
!pip install -q pytoml

In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Acquire Files
This section acquires the files required to run the later scripts. It also downloads a tuned version of the model so that there is no need to run training unless desired. All of the scripts are stored in the git repo, which is the first thing downloaded. The remaining supporting files are stored in my Google Drive account because of file size limits on GitHub. They are publicly available, so anyone should be able to download and use them. The Google Drive files are downloaded through the PyDrive Python module.

In [None]:
!git clone https://github.com/codylw2/CourseProject.git

In [None]:
%cd /content/CourseProject/competition
!mkdir json_data
%cd json_data

The links shown below are what Google Drive generates when you request a shareable link for a file. Each link contains an 'id' for the file or folder, which you can use to download it.

Initial source for how to download a file: https://buomsoo-kim.github.io/colab/2018/04/16/Importing-files-from-Google-Drive-in-Google-Colab.md/
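
As a minimal sketch of where the 'id' sits inside a share link, the snippet below pulls it out with a regular expression. The helper name `extract_drive_file_id` is hypothetical and not part of this project or of PyDrive; it only assumes links of the `/file/d/<id>/view` form used in this notebook.

```python
import re

# Matches the file id in links of the form
# https://drive.google.com/file/d/<id>/view?usp=sharing
_FILE_ID_PATTERN = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def extract_drive_file_id(share_link):
    """Return the file id embedded in a Google Drive file share link.

    Hypothetical helper for illustration; raises ValueError when the
    link does not contain a '/file/d/<id>' segment.
    """
    match = _FILE_ID_PATTERN.search(share_link)
    if match is None:
        raise ValueError(f"no file id found in link: {share_link!r}")
    return match.group(1)


link = "https://drive.google.com/file/d/1Dp2ExBJtUBE3UOpSuh1AwQnCH6vbYnaD/view?usp=sharing"
file_id = extract_drive_file_id(link)
print(file_id)
```

The resulting string is the value passed as `'id'` to `drive.CreateFile` in the download cell.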

In [5]:
# test_queries.json : https://drive.google.com/file/d/1XqHy17_eOGk-BE91AC3gmKyJ1PiEjc6A/view?usp=sharing
downloaded = drive.CreateFile({'id':"1XqHy17_eOGk-BE91AC3gmKyJ1PiEjc6A"})
downloaded.GetContentFile('test_queries.json')

# test_docs.json : https://drive.google.com/file/d/1XrBInztxbKW9FdNn8Tso9wPvnzkLbGaT/view?usp=sharing
downloaded = drive.CreateFile({'id':"1XrBInztxbKW9FdNn8Tso9wPvnzkLbGaT"})
downloaded.GetContentFile('test_docs.json')

# train_queries.json : https://drive.google.com/file/d/1Dp2ExBJtUBE3UOpSuh1AwQnCH6vbYnaD/view?usp=sharing
downloaded = drive.CreateFile({'id':"1Dp2ExBJtUBE3UOpSuh1AwQnCH6vbYnaD"})
downloaded.GetContentFile('train_queries.json')

# train_qrels.json : https://drive.google.com/file/d/1tyGyuYtbGJHcKQIYoF4yaYCL9HRF2hku/view?usp=sharing
downloaded = drive.CreateFile({'id':"1tyGyuYtbGJHcKQIYoF4yaYCL9HRF2hku"})
downloaded.GetContentFile('train_qrels.json')

# train_docs.json : https://drive.google.com/file/d/1iyJ5F1BAT6BFKLOyumMt6z8jUaJ6KZLQ/view?usp=sharing
downloaded = drive.CreateFile({'id':"1iyJ5F1BAT6BFKLOyumMt6z8jUaJ6KZLQ"})
downloaded.GetContentFile('train_docs.json')


We delete the current predictions file so that you can be sure it is created fresh with the appropriate values. Don't worry if the output says "No such file or directory"; that is expected.

In [None]:
!rm -f /content/CourseProject/competition/predictions.txt
!cat /content/CourseProject/competition/predictions.txt

# Create Cranfield Datasets
This section creates the datasets in the standard Cranfield format so that metapy can use a ranker to rank the documents.
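
The Cranfield dataset that ships with MeTA is a line corpus, meaning each document occupies exactly one line of the data file. The sketch below shows one way a JSON document could be flattened into such a line. It is an illustration only: the helper name `doc_to_line` and the assumption that each document is a dict of text fields are mine, and `create_cranfield.py` may handle this differently.

```python
def doc_to_line(doc, keys):
    """Join the selected text fields of a document into a single line.

    Hypothetical helper: assumes `doc` is a dict mapping field names to
    text. Newlines and repeated whitespace are collapsed so the result
    fits on one line of a line corpus; missing or empty fields are skipped.
    """
    parts = []
    for key in keys:
        text = " ".join(str(doc.get(key, "")).split())
        if text:
            parts.append(text)
    return " ".join(parts)


example_doc = {
    "title": "Boundary layer\nflow",
    "abstract": "  An experimental   study. ",
}
line = doc_to_line(example_doc, ["title", "abstract", "intro"])
print(line)
```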

In [None]:
%cd /content/CourseProject/competition/cranfield_metapy

WORKDIR = !pwd
WORKDIR = WORKDIR[0]
BASE = WORKDIR + "/../.."
DATASET_DIR = BASE + "/competition/datasets"
JSON_DIR = BASE + "/competition/json_data"


The next code block runs the script that generates the datasets from the JSON files containing the queries, the documents, and (when training) the query relevance judgements. The API is documented in the project documentation.

In [None]:
!python $WORKDIR/create_cranfield.py \
    --run_type "train;test" \
    --query_keys "query;question;narrative" \
    --doc_keys "title:abstract:intro" \
    --cranfield_dir $WORKDIR \
    --input_dir $JSON_DIR

# Generate Predictions
This section uses a ranker to predict the relevance scores of documents.

The next code block is the most important one in the notebook. It runs the script that generates the document relevance predictions, using the ranker named by the 'ranker' argument. The 'run_type' and 'dat_keys' arguments together determine which datasets are used for ranking. If you use multiple datasets, you can assign each one a weight through the 'doc_weights' argument; the weights are applied when the predictions are combined into a single ranking.

In [None]:
!python $WORKDIR/search_eval.py \
    --run_type "test" \
    --ranker "bm25" \
    --params "2.0;0.75;4450" \
    --dat_keys "title" \
    --doc_weights "1.0" \
    --cranfield_dir $WORKDIR \
    --predict_dir $BASE \
    --remove_idx
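
To make the idea of per-dataset weights concrete, here is a minimal sketch of one plausible combination scheme: a weighted sum of the scores each dataset assigns to a document. This is an assumption for illustration, not a description of what `search_eval.py` actually does; the function name `combine_scores` and the example scores are hypothetical.

```python
from collections import defaultdict


def combine_scores(per_dataset_scores, weights):
    """Combine per-dataset document scores with a weighted sum.

    Hypothetical helper: `per_dataset_scores` is a list of dicts mapping
    doc id to score, one dict per dataset, and `weights` holds one weight
    per dataset. Returns (doc_id, score) pairs sorted best first.
    """
    if len(per_dataset_scores) != len(weights):
        raise ValueError("need exactly one weight per dataset")
    combined = defaultdict(float)
    for scores, weight in zip(per_dataset_scores, weights):
        for doc_id, score in scores.items():
            combined[doc_id] += weight * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for two datasets, e.g. one built from titles and one
# from abstracts, weighted 1.0 and 0.5 respectively.
title_scores = {"d1": 2.0, "d2": 1.0}
abstract_scores = {"d2": 4.0, "d3": 1.0}
ranking = combine_scores([title_scores, abstract_scores], [1.0, 0.5])
print(ranking)
```

With a single dataset and a weight of 1.0, as in the command above, a scheme like this leaves the ranker's own ordering unchanged.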


This final code block prints the predictions file generated by the predictions script.

In [None]:
!cat $BASE/predictions.txt