# Model Evaluation Pipeline
This is a simple but functional notebook that generates overall accuracy results for a given model. Below, enter the paths to your model's .pt and .bin files and a directory where output should be saved, then run the notebook. Theoretically, that's it.

Look for "overall_accs.txt" and "rnn.output" in your working directory. Those are the results.
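
If you want to inspect the accuracies programmatically, here is a minimal sketch for reading them back. It assumes the `name: accuracy` line format that the last cell of this notebook writes; `parse_overall_accs` is a hypothetical helper, not part of the repo, and the sample values are made up.

```python
def parse_overall_accs(text):
    """Parse lines of the form 'case_name: accuracy' into a dict."""
    accs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last ': ' so the case name is kept whole.
        name, _, value = line.rpartition(": ")
        accs[name] = float(value)
    return accs

# Made-up values for illustration only; real numbers come from your model.
sample = "simple_agrmt: 0.75\nvp_coord: 0.5\n"
accs = parse_overall_accs(sample)

# Print the cases from weakest to strongest.
for name, acc in sorted(accs.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} {acc:.3f}")
```

To use it on a real run, pass in the contents of `overall_accs.txt` from your working directory instead of `sample`.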

This code works as intended on my machine, but **please let me know if it breaks on yours**.

This pipeline relies heavily on other functions scattered throughout the repo, so please keep this notebook in its original location. It also relies heavily on the model structure and output files generated during training with our base code. It is finicky enough that if you have any issues, it's probably easiest to just ask me to fix it (or at least show me the error).

# USER INPUT HERE

In [1]:
# Enter the directory where you would like final result files to be saved.
# Then enter the path for your model .pt and .bin files. You must have both.
WORKING_DIR = 'C:/Users/eweeding/Documents/MSE/DeepLearning/FinalProject'
MODEL_PT = WORKING_DIR+'/model_mini.pt'
MODEL_BIN = WORKING_DIR+'/model_mini.bin'

# Set model type. For now, only single-task RNN is supported.
MODEL_TYPE = 'RNN'

# Set analysis mode. For now, only "overall" is supported.
MODE = 'overall'

### OPTIONAL ###
# Other paths you may need or want to change depending on your setup:
TEMPLATE_DIR = './EMNLP2018/templates' # Location of the paired sentence dataset
OUTPUT_FILE = 'all_test_sents.txt'     # Name of .txt file which will contain all test sentences, saved in TEMPLATE_DIR

# END USER INPUT #

In [2]:
# Import statements
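import logging   # used by score_rnn and is_more_probable
import math
import operator
import os
import pickle    # used to load the templates and to save results
import sys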
import torch
import torch.nn as nn
from torch.autograd import Variable

# Import a module within the base code
sys.path.append(os.path.join(os.getcwd(), 'word-language-model'))
import data

# Check for GPU
gpu_avail = torch.cuda.is_available()
print('GPU available:', gpu_avail)

GPU available: True


## Step 1: Generate paired sentences
This generates the default sentence templates for testing. These already exist in ./EMNLP2018/templates, but you may need to regenerate them if the existing pickle files fail to load on your system.

In [3]:
%run ./src/make_templates.py $TEMPLATE_DIR

case: obj_rel_across_anim
case: obj_rel_within_anim
case: obj_rel_across_inanim
case: obj_rel_within_inanim
case: subj_rel
case: prep_anim
case: prep_inanim
case: obj_rel_no_comp_across_anim
case: obj_rel_no_comp_within_anim
case: obj_rel_no_comp_across_inanim
case: obj_rel_no_comp_within_inanim
case: simple_agrmt
case: sent_comp
case: vp_coord
case: long_vp_coord
case: reflexives_across
case: simple_reflexives
case: reflexive_sent_comp
case: npi_across_anim
case: npi_across_inanim
case: simple_npi_anim
case: simple_npi_inanim


## Step 2: Test model on the sentences

In [4]:
# Complexity measures and other basic functions
def get_entropy(o):
    ## o should be a vector scoring possible classes
    probs = nn.functional.softmax(o,dim=0)
    logprobs = nn.functional.log_softmax(o,dim=0) #numerically more stable than two separate operations
    return -1 * torch.sum(probs * logprobs)

def get_surps(o):
    ## o should be a vector scoring possible classes
    logprobs = nn.functional.log_softmax(o,dim=0)
    return -1 * logprobs

def get_complexity_apply(o,t,sentid,tags=False):
    ## Use apply() method
    Hs = torch.squeeze(apply(get_entropy,o))
    surps = apply(get_surps,o)
    
    for corpuspos,targ in enumerate(t):
        if tags:
            word = corpus.dictionary.idx2tag[int(targ)]
        else:
            word = corpus.dictionary.idx2word[int(targ)]
        if word == '<eos>' or word == '<EOS>':
            #don't output the complexity of EOS
            continue
        surp = surps[corpuspos][int(targ)]
        
        with open(WORKING_DIR+'/rnn.output', 'a') as f:
            f.write('\n' + str(word)+' '+str(sentid)+' '+str(corpuspos)+' '+str(len(word))+' '+str(float(surp))+' '+str(float(Hs[corpuspos])))

def apply(func, M):
    ## applies a function along a given dimension
    tList = [func(m) for m in torch.unbind(M,dim=0) ]
    res = torch.stack(tList)
    return res
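
To make the two complexity measures above concrete, here is a dependency-free sketch of the same quantities on toy score vectors. It uses only the standard library rather than torch, the function names are hypothetical, and the scores are made up. Both values are in nats, because they come from natural logarithms.

```python
import math

def log_softmax(scores):
    """Log-probabilities from raw scores, shifted by the max for stability."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def surprisals(scores):
    """Surprisal of each class: the negative log-probability."""
    return [-lp for lp in log_softmax(scores)]

def entropy(scores):
    """Entropy of the distribution: -sum(p * log p)."""
    logprobs = log_softmax(scores)
    return -sum(math.exp(lp) * lp for lp in logprobs)

# Equal scores give a uniform distribution over 4 classes.
uniform = [0.0, 0.0, 0.0, 0.0]
# One dominant score gives a confident, low-entropy distribution.
peaked = [10.0, 0.0, 0.0, 0.0]

print("uniform entropy:", entropy(uniform))
print("peaked entropy: ", entropy(peaked))
print("surprisal of the likely class:  ", surprisals(peaked)[0])
print("surprisal of an unlikely class: ", surprisals(peaked)[1])
```

Entropy describes how uncertain the model is before seeing the next word, while surprisal describes how unexpected the word that actually occurred was.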

In [5]:
# Functions for getting batches and evaluating on test data
def repackage_hidden(h):
    """Detaches hidden states from their history so no graph is kept across sentences."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
    
def test_get_batch(source, evaluation=False):
    if isinstance(source, tuple):
        seq_len = len(source[0]) - 1
        data = Variable(source[0][:seq_len], requires_grad=False)
        target = Variable(source[1][:seq_len], requires_grad=False)
    else:
        seq_len = len(source) - 1
        data = Variable(source[:seq_len], requires_grad=False)
        target = Variable(source[1:1+seq_len].view(-1))
    if gpu_avail:
        return data.cuda(), target.cuda()
    else:
        return data, target

def test_evaluate(test_lm_sentences, test_ccg_sentences, lm_data_source, ccg_data_source):
    
    criterion = nn.CrossEntropyLoss()    
    
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(corpus.dictionary)
    
    with open(WORKING_DIR+'/rnn.output', 'w') as f:
        f.write('word sentid sentpos wlen surp entropy')
    
    for i in range(len(lm_data_source)+len(ccg_data_source)):
        
        if i % 1000 == 0:
            print(f'{i} / {len(lm_data_source)} sentences')
        
        if i >= len(lm_data_source):
            sent_ids = ccg_data_source[i-len(lm_data_source)]
            sent = test_ccg_sentences[i-len(lm_data_source)]
        else:
            sent_ids = lm_data_source[i]
            sent = test_lm_sentences[i]
        
        if gpu_avail:
            sent_ids = sent_ids.cuda()
            
        hidden = model.init_hidden(1) # number of parallel sentences being processed
        
        data, targets = test_get_batch(sent_ids, evaluation=True)
        data=data.unsqueeze(1) # only needed if there is just a single sentence being processed
        output, hidden = model(data, hidden)
        output_flat = output.view(-1, ntokens)
        curr_loss = criterion(output_flat, targets).item()
        total_loss += curr_loss

        # output word-level complexity metrics
        if i >= len(lm_data_source):
            get_complexity_apply(output_flat,targets,i-len(lm_data_source),tags=True)
        else:
            get_complexity_apply(output_flat,targets,i)

        hidden = repackage_hidden(hidden)

    return total_loss / (len(lm_data_source)+len(ccg_data_source))
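
The non-tuple branch of `test_get_batch` pairs each token with the token that follows it. A minimal sketch of that shift on a plain Python list (the token ids are made up, and `make_input_and_target` is a hypothetical name):

```python
def make_input_and_target(token_ids):
    """Split one sentence into model inputs and next-token targets."""
    seq_len = len(token_ids) - 1
    inputs = token_ids[:seq_len]         # every token except the last
    targets = token_ids[1:1 + seq_len]   # every token except the first
    return inputs, targets

# Hypothetical ids standing in for a four-token sentence.
ids = [11, 42, 7, 3]
inputs, targets = make_input_and_target(ids)

# Position i of the model output is scored against targets[i].
for x, y in zip(inputs, targets):
    print(f"input {x:>3} -> target {y:>3}")
```

This is why a sentence of n tokens contributes n - 1 predictions to the loss.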

In [6]:
# A "main" function that coordinates testing
def run_main(lm_data, save, save_lm_data, testfname):
    
    global corpus
    corpus = data.SentenceCorpus(lm_data, False, save_lm_data, True, testfname=testfname)
    
    test_lm_sentences, test_lm_data = corpus.test_lm
    test_ccg_sentences = []
    test_ccg_data = []
    
    # Load the saved model
    global model
    if gpu_avail:
        model = torch.load(save, map_location = 'cuda:0')
    else:
        model = torch.load(save, map_location = 'cpu')    
    print('Your model is:')
    print(model)

    # Run on test data
    test_loss = test_evaluate(test_lm_sentences, test_ccg_sentences, test_lm_data, test_ccg_data)
    
    print('=' * 89)
    print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(test_loss, math.exp(test_loss)))
    print('=' * 89)
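
The reported perplexity is the exponential of the returned loss. Note that `test_evaluate` averages one mean loss per sentence, which is not the same as averaging over all tokens. The sketch below illustrates the difference with made-up per-token losses; the helper names are hypothetical.

```python
import math

def sentence_averaged_loss(token_losses_by_sentence):
    """Mean of per-sentence mean losses (one mean loss per sentence)."""
    means = [sum(s) / len(s) for s in token_losses_by_sentence]
    return sum(means) / len(means)

def token_averaged_loss(token_losses_by_sentence):
    """Mean loss over all tokens, regardless of sentence boundaries."""
    flat = [loss for s in token_losses_by_sentence for loss in s]
    return sum(flat) / len(flat)

# Made-up per-token losses: a short easy sentence and a longer hard one.
losses = [[2.0, 2.0], [4.0, 4.0, 4.0, 4.0]]

sent_avg = sentence_averaged_loss(losses)
tok_avg = token_averaged_loss(losses)
print(f"sentence-averaged loss {sent_avg:.3f} -> ppl {math.exp(sent_avg):.2f}")
print(f"token-averaged loss    {tok_avg:.3f} -> ppl {math.exp(tok_avg):.2f}")
```

Short sentences carry as much weight as long ones under the sentence-averaged version, so the two figures can differ noticeably when sentence lengths vary.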


In [7]:
from src.tester.TestWriter import TestWriter
from src.template.TestCases import TestCase

writer = TestWriter(TEMPLATE_DIR, OUTPUT_FILE)
testcase = TestCase()
tests = testcase.all_cases

all_test_sents = {}
for test_name in tests:
    # Use a context manager so each template file is closed after loading
    with open(TEMPLATE_DIR+"/"+test_name+".pickle", 'rb') as f:
        test_sents = pickle.load(f)
    all_test_sents[test_name] = test_sents

writer.write_tests(all_test_sents, 'word')
name_lengths = writer.name_lengths
key_lengths = writer.key_lengths

def score_rnn():
    print("Scoring RNN...")
    with open(WORKING_DIR+'/rnn.output', 'r') as f:
        all_scores = {}
        first = False
        score = 0.
        sent = []
        prev_sentid = -1
        for line in f:
            if not first:
                first = True
                continue
            #print(line)
            #print(line.strip())
            if first and len(line.strip().split()) == 6 and "torch.cuda" not in line:
                wrd, sentid, wrd_score = [line.strip().split()[i] for i in [0,1,4]]
                score = -1 * float(wrd_score) # multiply by -1 to turn surps back into logprobs
                sent.append((wrd, score))
                if wrd == ".":
                    name_found = False
                    for (k1,v1) in sorted(name_lengths.items(), key=operator.itemgetter(1)):
                        if float(sentid) < v1 and not name_found:
                            name_found = True
                            if k1 not in all_scores:
                                all_scores[k1] = {}
                            key_found = False
                            for (k2,v2) in sorted(key_lengths[k1].items(), key=operator.itemgetter(1)):
                                if int(sentid) <  v2 and not key_found:
                                    key_found = True
                                    if k2 not in all_scores[k1]:
                                        all_scores[k1][k2] = []
                                    all_scores[k1][k2].append(sent)
                    sent = []
                    if float(sentid) != prev_sentid+1:
                        # prev_sentid is a float, so format it instead of concatenating strings
                        logging.info("Error at sents %s and %s", sentid, prev_sentid)
                    prev_sentid = float(sentid)
    return all_scores

def test_LM():      
    print("Testing RNN...")
    run_main(TEMPLATE_DIR, MODEL_PT, MODEL_BIN, OUTPUT_FILE)
    results = score_rnn()
    with open(WORKING_DIR+'/'+MODEL_TYPE+'_results.pickle', 'wb') as f:
        pickle.dump(results, f)

INFO:root:Writing tests...
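
The nested loops in `score_rnn` assign each sentence to the first test case whose boundary exceeds its sentence id. The sketch below shows the same lookup with `bisect`. It assumes that `name_lengths` maps each test name to a cumulative, exclusive upper bound on sentence ids; that is inferred from how `score_rnn` uses it, not confirmed against `TestWriter`. The boundaries here are made up and `find_case` is a hypothetical helper.

```python
import bisect

# Hypothetical boundaries: ids 0-3, 4-9, and 10-15 respectively.
name_lengths = {"simple_agrmt": 4, "vp_coord": 10, "subj_rel": 16}

def find_case(sentid, lengths):
    """Return the first case whose upper bound is greater than sentid."""
    items = sorted(lengths.items(), key=lambda kv: kv[1])
    bounds = [bound for _, bound in items]
    # bisect_right gives the index of the first bound strictly above sentid.
    idx = bisect.bisect_right(bounds, sentid)
    if idx == len(items):
        return None  # sentid lies past the last boundary
    return items[idx][0]

for sentid in (0, 3, 4, 15, 16):
    print(sentid, "->", find_case(sentid, name_lengths))
```

The same idea applies one level down, where `key_lengths[k1]` selects the sub-case within a test.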


In [8]:
# Generate testing output
test_LM()

Testing RNN...
Your model is:
RNNModel(
  (drop): Dropout(p=0.2, inplace=False)
  (encoder): Embedding(50001, 200)
  (rnn): LSTM(200, 200, num_layers=2, dropout=0.2)
  (decoder): Linear(in_features=200, out_features=50001, bias=True)
)
0 / 11916 sentences
1000 / 11916 sentences
2000 / 11916 sentences
3000 / 11916 sentences
4000 / 11916 sentences
5000 / 11916 sentences
6000 / 11916 sentences
7000 / 11916 sentences
8000 / 11916 sentences
9000 / 11916 sentences
10000 / 11916 sentences
11000 / 11916 sentences
| End of training | test loss  7.12 | test ppl  1236.42
Scoring RNN...
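
Each data line of `rnn.output` holds the six space-separated fields named in its header. Here is a minimal sketch of reading such lines and summing per-sentence log-probabilities, using made-up sample values; `parse_line` and `sentence_logprobs` are hypothetical helpers, not functions from the repo.

```python
HEADER = "word sentid sentpos wlen surp entropy"

def parse_line(line):
    """Return (word, sentid, logprob) for a data line, or None otherwise."""
    stripped = line.strip()
    fields = stripped.split()
    if len(fields) != 6 or stripped == HEADER:
        return None
    word, sentid, surp = fields[0], int(fields[1]), float(fields[4])
    # Surprisal is -log p, so negate it to recover the log-probability.
    return word, sentid, -surp

def sentence_logprobs(text):
    """Sum word log-probabilities for each sentence id."""
    totals = {}
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, sentid, logprob = parsed
        totals[sentid] = totals.get(sentid, 0.0) + logprob
    return totals

# Made-up sample in the same layout as rnn.output.
sample = (
    "word sentid sentpos wlen surp entropy\n"
    "the 0 0 3 2.5 6.1\n"
    "author 0 1 6 9.25 7.0\n"
    ". 0 2 1 1.0 3.2"
)
totals = sentence_logprobs(sample)
print(totals)
```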


## Step 3: Analyze the results

In [9]:
from src.template.TestCases import TestCase

testcase = TestCase()
tests = testcase.all_cases

# Load the file written by test_LM(), which is named after MODEL_TYPE
with open(WORKING_DIR+'/'+MODEL_TYPE+'_results.pickle', 'rb') as f:
    results = pickle.load(f)

joined_results = {}
for name in tests:
    print(name)
    if 'anim' in name:
        new_name = '_'.join(name.split("_")[:-1])
    else:
        new_name = name
    for sub_case in results[name]:
        if new_name not in joined_results:
            joined_results[new_name] = {}
        if sub_case not in joined_results[new_name]:
            joined_results[new_name][sub_case] = []
        joined_results[new_name][sub_case] += results[name][sub_case]
# dump joined results to .pickle file
with open(WORKING_DIR+'/'+MODEL_TYPE+'_results.joined.pickle', 'wb') as f:
    pickle.dump(joined_results, f)

def is_more_probable(sent_a, sent_b):
    # Each sentence is a list of (word, logprob) pairs; compare the summed log-probabilities.
    # Tests are written at the word level above, so paired sentences should have equal length.
    if len(sent_a) != len(sent_b):
        logging.info("ERROR: Mismatch in sentence lengths: (1) %s vs (2) %s", sent_a, sent_b)
    return sum(score for _, score in sent_a) > sum(score for _, score in sent_b)

def analyze_agrmt_results(results):
    correct_sents = {}
    incorrect_sents = {}
    for case in results.keys():
        correct_sents[case] = []
        incorrect_sents[case] = []
        for i in range(0,len(results[case]),2):
            grammatical = results[case][i]
            ungrammatical = results[case][i+1]
            if is_more_probable(grammatical, ungrammatical):
                correct_sents[case].append((grammatical, ungrammatical))
            else:
                incorrect_sents[case].append((grammatical, ungrammatical))
    return correct_sents, incorrect_sents

def display_agrmt_results(name, sents):
    # return the overall accuracy, pooled across all sub-cases of this test
    correct_sents, incorrect_sents = sents
    overall_correct = 0.
    total = 0.
    for case in correct_sents.keys():
        overall_correct += len(correct_sents[case])
        total += len(correct_sents[case]) + len(incorrect_sents[case])
    return float(overall_correct)/total

obj_rel_across_anim
obj_rel_within_anim
obj_rel_across_inanim
obj_rel_within_inanim
subj_rel
prep_anim
prep_inanim
obj_rel_no_comp_across_anim
obj_rel_no_comp_within_anim
obj_rel_no_comp_across_inanim
obj_rel_no_comp_within_inanim
simple_agrmt
sent_comp
vp_coord
long_vp_coord
reflexives_across
simple_reflexives
reflexive_sent_comp
npi_across_anim
npi_across_inanim
simple_npi_anim
simple_npi_inanim
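
The joining step above merges the animate and inanimate variants of a test by dropping the last underscore-separated piece of any name containing "anim". Since "inanim" also contains "anim", both variants map to the same key. A small sketch of just that rule (`merged_name` is a hypothetical helper name):

```python
def merged_name(name):
    """Drop the trailing _anim / _inanim suffix so both variants share a key."""
    if "anim" in name:
        return "_".join(name.split("_")[:-1])
    return name

for name in ["obj_rel_across_anim", "obj_rel_across_inanim",
             "prep_anim", "prep_inanim", "subj_rel"]:
    print(f"{name:<24} -> {merged_name(name)}")
```

This is why the analysis step reports `obj_rel_across` and `prep` rather than the separate animate and inanimate cases.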


In [10]:
with open(WORKING_DIR+'/overall_accs.txt', 'w') as f:
    for name in joined_results.keys():
        if "npi" in name:
            continue
        else:
            print(f'Analyzing results for {name}')
            sents = analyze_agrmt_results(joined_results[name])
            overall = display_agrmt_results(name, sents)
            f.write(name+": "+str(overall)+"\n")

Analyzing results for obj_rel_across
Analyzing results for obj_rel_within
Analyzing results for subj_rel
Analyzing results for prep
Analyzing results for obj_rel_no_comp_across
Analyzing results for obj_rel_no_comp_within
Analyzing results for simple_agrmt
Analyzing results for sent_comp
Analyzing results for vp_coord
Analyzing results for long_vp_coord
Analyzing results for reflexives_across
Analyzing results for simple_reflexives
Analyzing results for reflexive_sent_comp
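
Each accuracy is the fraction of sentence pairs in which the grammatical sentence receives a higher total log-probability than its ungrammatical partner. The sketch below works through two made-up pairs, assuming the same layout the analysis code relies on: sentences alternate grammatical, ungrammatical, and each sentence is a list of `(word, logprob)` tuples. The helper names are hypothetical.

```python
def sentence_logprob(sent):
    """Total log-probability of a sentence given (word, logprob) pairs."""
    return sum(score for _, score in sent)

def pairwise_accuracy(sents):
    """Fraction of (grammatical, ungrammatical) pairs ranked correctly."""
    pairs = list(zip(sents[0::2], sents[1::2]))
    correct = sum(
        1 for gram, ungram in pairs
        if sentence_logprob(gram) > sentence_logprob(ungram)
    )
    return correct / len(pairs)

# Made-up scores; real ones come from rnn.output.
sents = [
    [("the", -2.0), ("author", -8.0), ("laughs", -6.0), (".", -1.0)],  # grammatical
    [("the", -2.0), ("author", -8.0), ("laugh", -9.0), (".", -1.0)],   # ungrammatical
    [("the", -2.0), ("pilots", -8.0), ("smile", -7.0), (".", -1.0)],   # grammatical
    [("the", -2.0), ("pilots", -8.0), ("smiles", -5.0), (".", -1.0)],  # ungrammatical
]

print("accuracy:", pairwise_accuracy(sents))
```

In this toy input the model prefers the grammatical sentence in the first pair but not in the second, so it gets one of two pairs right.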
