## TestTemplate Notebook

Use this notebook to submit your model for evaluation on the test set and for inclusion on the general leaderboard. Please copy it to your own submission folder and fill it in. It is important that you pip install any dependencies your model needs, so that we can easily run the model. In some cases, you might want to upload an already trained model to be evaluated instead of training your model from scratch. This is HIGHLY recommended for models that take a long time to train; for small sklearn models it is not really necessary.

## Submission instructions

If you want to submit your model to be tested on the secret test sets, please implement your model in this 'SubmitRun' notebook and make sure it works on the sample data. The easiest way to do this is to download the 'Competition' folder and check that everything works. If so, please put your notebook and any other files you need (like trained model files) in your submission folder.

In [1]:
# Please install any packages you need in this cell
# For example: !pip install scikit-learn

In [2]:
import os
import json
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import defaultdict
from IPython.display import display, Markdown

# Metric import
import metricutils

# import any packages your code might need here

You will not have access to the test folder with the test data from both corpora, but you can test that your model will run properly with some samples that we have provided for you in the 'sample_data' directory.

## Adjusting paths

Depending on where you put your notebook relative to the data, you might have slightly different paths to the training and sample data. To ensure the code works, please adjust the path in the cell below to point to the 'Competition' folder. In the case of this notebook, the path is just one folder up, so '../'. You can use either absolute or relative paths.

In [3]:
path_to_competition_folder = '../'
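
A quick way to confirm that the path is right is to check that the subfolders used later in this notebook exist underneath it. The sketch below is only an illustration: `missing_subfolders` is a hypothetical helper (not part of the provided code), and the toy folder is a temporary directory created for the demonstration.

```python
import os
import tempfile


def missing_subfolders(competition_folder, expected=('sample_data', 'train_data')):
    """Return the expected subfolders that are not found under competition_folder."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(competition_folder, name))]


# Toy check on a temporary folder that only contains 'sample_data'
toy_folder = tempfile.mkdtemp()
os.mkdir(os.path.join(toy_folder, 'sample_data'))

missing = missing_subfolders(toy_folder)
print(missing)  # ['train_data']
```

In the notebook itself you would call `missing_subfolders(path_to_competition_folder)`; an empty list means both folders were found.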

In [29]:
# This function can be used to load the pdf file names, gold standard json and the text dataframe

def get_data(data_path, train=True):
    """
    This function takes as input the path to a single corpus folder
    (for example train_data/corpus1 or test_data/corpus1) and loads the
    OCR text and the gold standard for that corpus. The output is a dict
    containing the dataframe with the text, including a 'png' column with
    the path to the png file of each page, and the gold standard in
    binary vector format.
    
    """
    if train:
        gold_standard_path = '%s/Doclengths_of_the_individual_docs_TRAIN.json' % data_path
    else:
        gold_standard_path = '%s/Doclengths_of_the_individual_docs_TEST.json' % data_path       
    
    dataframe = pd.read_csv('%s/ocred_text.csv.gz' % data_path)
    with open(gold_standard_path, 'r') as json_file:
        gold_json = json.load(json_file)
    
    # Add the png column
    png_column = dataframe['name'] + '-' + dataframe['page'].astype(str) + '.png'
    
    # Make sure it points to the correct path
    png_column_joined = os.path.join(data_path, 'png') + os.sep + png_column
    dataframe['png'] = png_column_joined
    
    binary_gold = {key: metricutils.length_list_to_bin(val) for key, val in gold_json.items()}

    return {'csv': dataframe.sort_values(by=['name', 'page']), 'json': binary_gold}


def get_train_data(data_path):
    """
    This function takes as input the path to 'train_data' and combines
    the information present in the corpus1 and corpus2 subfolders to allow
    you to train on all the data.
    """
    # We train on both corpus 1 and corpus 2
    c1_path = os.path.join(data_path, 'corpus1')
    c2_path = os.path.join(data_path, 'corpus2')

    c1_data = get_data(c1_path, train=True)
    c2_data = get_data(c2_path, train=True)

    combined_dataframe = pd.concat([c1_data['csv'], c2_data['csv']])
    combined_json = {**c1_data['json'], **c2_data['json']}
    
    return {'csv': combined_dataframe.sort_values(by=['name', 'page']), 'json': combined_json}


def get_sample_data(data_path):
    """
    This function takes as input the path to the 'sample_data' folder,
    and outputs a dictionary with the text dataframe (including a 'png'
    column with the path to each page image) and the gold standard in
    binary vector format.
    """
    dataframe = pd.read_csv('%s/sample.csv' % data_path)
    with open('%s/sample.json' % data_path, 'r') as json_file:
        gold_json = json.load(json_file)

    # Add the png column
    png_column = dataframe['name'] + '-' + dataframe['page'].astype(str) + '.png'
    
    # Make sure it points to the correct path
    png_column_joined = os.path.join(data_path, 'png') + os.sep + png_column
    dataframe['png'] = png_column_joined
        
    binary_gold = {key: metricutils.length_list_to_bin(val) for key, val in gold_json.items()}

    return {'csv': dataframe.sort_values(by=['name', 'page']), 'json': binary_gold}

## Setting up the model

Unless you are loading in a trained model, this is where you want to insert the code to train your model.

The data is provided as a dictionary with two entries: `{'csv': _, 'json': _}`. `csv` contains the loaded csv file as a dataframe, and `json` contains the gold standard in binary vector format.

We are going to follow an approach similar to that of scikit-learn, where you make a model that has `train` and `predict` functions. The function to score a model will be provided by us.

Please also read the `Evaluation` notebook on the surdrive carefully: it shows how the metrics are calculated and what format your output should have. Your model should return a dictionary where each key is the document ID and the value is the stream in binary vector format.
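
To make the binary vector format concrete, here is a minimal sketch. It assumes the usual page stream convention in which a 1 marks the first page of a document and a 0 marks a continuation page. `lengths_to_binary` is a hypothetical stand-in written for this illustration; the authoritative conversion is `metricutils.length_list_to_bin`, which may differ in details such as the return type.

```python
def lengths_to_binary(document_lengths):
    """Turn a list of document lengths (in pages) into a binary vector.

    Assumed convention: 1 = first page of a document, 0 = continuation page.
    Every length is assumed to be at least 1.
    """
    binary_vector = []
    for length in document_lengths:
        binary_vector.extend([1] + [0] * (length - 1))
    return binary_vector


# A stream of 6 pages holding documents of 2, 3 and 1 pages
example_vector = lengths_to_binary([2, 3, 1])
print(example_vector)  # [1, 0, 1, 0, 0, 1]
```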

In [5]:
class Model:
    def __init__(self):
        pass

    def train(self, train_data):
        # This should train using the input dictionary
        pass
    def predict(self, test_data):
        # same input dictionary as in the train function
        # YOUR CODE HERE
        pass
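
As an illustration of the expected interface, the sketch below shows a trivial baseline that predicts a single document per stream. `SingleDocumentBaseline` is a hypothetical example, not a provided class, and it assumes that the values in the `name` column match the keys of the gold standard dictionary. The toy input uses a plain dict of lists as a stand-in for the dataframe so the sketch runs without the competition data; selecting the `name` column works the same way on a pandas dataframe.

```python
from collections import Counter


class SingleDocumentBaseline:
    """Hypothetical baseline: every stream is predicted to be one document."""

    def train(self, train_data):
        # Nothing to learn for this baseline
        pass

    def predict(self, test_data):
        # Count the pages of each stream via the 'name' column
        pages_per_stream = Counter(test_data['csv']['name'])
        # First page starts a document (1), all other pages continue it (0)
        return {stream_id: [1] + [0] * (n_pages - 1)
                for stream_id, n_pages in pages_per_stream.items()}


# Toy stand-in for the data dictionary: two streams of 3 and 2 pages
toy_data = {
    'csv': {'name': ['a', 'a', 'a', 'b', 'b'], 'page': [1, 2, 3, 1, 2]},
    'json': {},
}
baseline_predictions = SingleDocumentBaseline().predict(toy_data)
print(baseline_predictions)  # {'a': [1, 0, 0], 'b': [1, 0]}
```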
        

## Training the model (optional)

If your model requires training, you can set the `from_scratch` boolean below to `True`. If not, you can leave it as-is.

In [6]:
MyModel = Model()

from_scratch = False
if from_scratch:
    train_data = get_train_data(os.path.join(path_to_competition_folder, 'train_data'))
    MyModel.train(train_data)
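
If you upload an already trained model instead, the model has to be restored from a file when `from_scratch` is `False`. The sketch below shows one possible way to do this with Python's built-in `pickle` module. The helper names, the file name `my_model.pkl` and the contents of the state dictionary are all made up for illustration; your own model may need a different serialization method.

```python
import os
import pickle
import tempfile


def save_model_state(state, file_path):
    """Write the learned state of a model to disk."""
    with open(file_path, 'wb') as model_file:
        pickle.dump(state, model_file)


def load_model_state(file_path):
    """Read a previously saved model state back from disk."""
    with open(file_path, 'rb') as model_file:
        return pickle.load(model_file)


# Toy state standing in for whatever your model learned during training
model_state = {'threshold': 0.5, 'vocabulary': ['invoice', 'page']}

# A temporary folder is used here; in a submission the file would sit next to the notebook
file_path = os.path.join(tempfile.mkdtemp(), 'my_model.pkl')
save_model_state(model_state, file_path)

restored_state = load_model_state(file_path)
print(restored_state == model_state)  # True
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that you created yourself.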

## Checking the model

For this we will make a prediction with the model on the sample data and see whether it works and gives us some scores. Please also add a descriptive name for your model below (preferably also containing your name). It will be used as the title of the plots shown for the test sets, and it helps us keep clear which plots came from which model.

In [7]:
short_model_description = ""

In [8]:
run_on_test = False

if run_on_test:
    
    corpus1_test_data = get_data(os.path.join(path_to_competition_folder, 'test_data/corpus1'), train=False)
    corpus2_test_data = get_data(os.path.join(path_to_competition_folder, 'test_data/corpus2'), train=False)
    
    predictions_corpus1 = MyModel.predict(corpus1_test_data)
    predictions_corpus2 = MyModel.predict(corpus2_test_data)

else:
    
    data = get_sample_data(os.path.join(path_to_competition_folder, 'sample_data'))
    
    # You should not have to adjust anything here; the model should just return predictions in binary format
    predictions = MyModel.predict(data)
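
Before running the evaluation, it can help to check that the predictions have the expected shape. The sketch below is a hypothetical helper, not part of `metricutils`. It assumes that each prediction must contain only 0s and 1s and must have the same number of entries as the gold standard vector for the same key.

```python
def find_format_problems(gold, predictions):
    """Return a list of human-readable problems found in the predictions."""
    if predictions is None:
        return ['predict returned None']
    problems = []
    for doc_id, gold_vector in gold.items():
        if doc_id not in predictions:
            problems.append('%s: no prediction' % doc_id)
            continue
        predicted = list(predictions[doc_id])
        if len(predicted) != len(gold_vector):
            problems.append('%s: expected %d pages, got %d'
                            % (doc_id, len(gold_vector), len(predicted)))
        if any(value not in (0, 1) for value in predicted):
            problems.append('%s: values other than 0 and 1' % doc_id)
    return problems


# Toy example: the second prediction is one page too short
toy_gold = {'stream_a': [1, 0, 1], 'stream_b': [1, 0]}
toy_predictions = {'stream_a': [1, 1, 0], 'stream_b': [1]}
problems = find_format_problems(toy_gold, toy_predictions)
print(problems)  # ['stream_b: expected 2 pages, got 1']
```

In this notebook the call would be `find_format_problems(data['json'], predictions)`; an empty list means no problems were found by these checks.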



In [11]:
if run_on_test:
    display(Markdown("<b>Scores for the model on corpus 1</b>"))
    metricutils.evaluation_report(corpus1_test_data['json'], predictions_corpus1, title=short_model_description)
    
    display(Markdown("<b>Scores for the model on corpus 2</b>"))
    metricutils.evaluation_report(corpus2_test_data['json'], predictions_corpus2, title=short_model_description)
else:
    display(Markdown("<b>Scores for the model on the sample data</b>"))
    metricutils.evaluation_report(data['json'], predictions, title=short_model_description)