# CNTK 208: ReasoNet for Machine Comprehension

## Introduction and Background

This hands-on tutorial walks through implementing [ReasoNet](https://posenhuang.github.io/papers/reasonet_iclr_2017.pdf) in the Microsoft Cognitive Toolkit (CNTK). A machine comprehension task asks a model to find the answer to a question given a paragraph of text.
In this tutorial, we use the [CNN data](https://github.com/deepmind/rc-data) as an example. The data consists of tuples (q, d, a, A), where q is the query, d is the document, a is the list of candidate answers and A is the true answer.

### Model Structure

![](ReasoNet/components.png) 
![](ReasoNet/reasonet.png) 

## Data preparation


### Download data
The data can be downloaded from (https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTTljRDVZMFJnVWM) or (https://github.com/deepmind/rc-data).
The download is packaged as a gzipped archive and must be reformatted before it can be fed to CNTK. Unpacking the archive yields three folders (training, test and validation). Each folder contains many files, and each file consists of a paragraph of text, a question, the answer to the question and a list of entities. Here is an example of one instance:

> __*1:*__  http://web.archive.org/web/20150731215720id_/http://edition.cnn.com/2015/04/07/sport/wladimir-klitschko-ukraine-crisis-boxing/<br/>
>__*2:*__  <br/>
>__*3:*__  @entity3 ( @entity2 ) @entity1 heavyweight boxing champion @entity0 has an important title defense coming up , but his thoughts continue to be dominated by the ongoing fight for democracy in @entity8 . speaking to @entity2 from his @entity3 training base ahead of the april 25 showdown with @entity12 challenger @entity11 in @entity13 , @entity0 said the crisis in his homeland has left him shocked and upset . " my country is unfortunately suffering in the war with @entity18 -- not that @entity8 tried to give any aggression to any other nation , in this particular case @entity18 , unfortunately it 's the other way around , " @entity0 told @entity2 . " i never thought that our brother folk is going to have war with us , so that @entity8 and @entity18 are going to be divided with blood , " he added . " unfortunately , we do n't know how far it 's going to go and how worse it 's going to get . the aggression , in the military presence of ( @entity18 ) soldiers and military equipment in my country , @entity8 , is upsetting . " @entity0 is the reigning @entity33 , @entity34 , @entity35 and @entity36 champion and has , alongside older brother @entity37 , dominated the heavyweight division in the 21st century . @entity37 , who retired from boxing in 2013 , is a prominent figure in @entity8 politics . the 43 - year - old has led the @entity43 since 2010 and was elected mayor of @entity45 in may last year . tensions in the former @entity48 state remain high despite a ceasefire agreed in february as @entity50 , led by @entity52 chancellor @entity51 and president of france @entity53 , tries to broker a peace deal between the two sides . the crisis in @entity8 began in november 2013 when former president @entity58 scuttled a trade deal with the @entity60 in favor of forging closer economic ties with @entity18 . 
the move triggered a wave of anti-government protests which came to a head @entity45 's @entity67 in february 2014 when clashes between protesters and government security forces left around 100 dead . the following month , @entity18 troops entered @entity8 's @entity74 peninsula before @entity18 president @entity75 completed the annexation of @entity74 -- a move denounced by most of the world as illegitimate -- after citizens of the region had voted in favor of leaving @entity8 in a referendum . more than 5,000 people have been killed in the conflict to date . " people are dying in @entity8 every single day , " @entity0 said . " i do not want to see it , nobody wants to see it ... it 's hard to believe these days something like that in @entity50 -- and @entity8 is @entity50 -- can happen . " but with the backing of the international community , @entity0 is confident @entity8 can forge a democratic future rather than slide back towards a @entity48 - era style dictatorship . " i really wish and want this conflict to be solved and it can only be solved with @entity98 help , " he said . " @entity8 is looking forward to becoming a democratic country and live under @entity98 democracy . this is our decision and this is our will to get what we want . " if somebody wants to try to put ( us ) back to the @entity48 times and be part of the @entity108 , we disagree with that . we want to be in freedom . " we have achieved many things in moving forward and showed to the world that we do not want to live under a dictatorship . " @entity0 , whose comments were made as part of a wide - ranging interview for @entity2 's @entity118 series , is routinely kept abreast of developments in @entity8 by brother @entity37 but also returns home whenever he can . " as much time as i can spend , i am there in the @entity8 . it 's not like i am getting the news from mass media and making my own adjustments and judgments on what 's going on . 
it 's an actual presence and understanding from the inside ... it obviously affects my life , it affects the life of my family . " the 39 - year - old and his fiancée @entity137 celebrated happier times last december when the @entity12 actress gave birth to a baby daughter , @entity142 . " i need to get used to it that i 'm a father , which is really exciting . i hope i 'm going to have a big family with multiple kids , " he said . @entity0 is n't sure when he 'll finally hang up his gloves . " i do n't know how long i can last ... motivation and health have to be there to continue . " but after leaving almost all his boxing opponents battered and bruised -- the @entity8 is seeking an impressive 18th consecutive title defense against @entity11 -- @entity0 is keen to carry on fighting his own and his country 's corner in the opposite way outside the ring . " i just really want that we 'll have less violence in the world ... i hope in peace we can do anything , but if we have war then it 's definitely going to leave us dull and numb . " watch @entity0 's @entity118 interview on @entity2 's @entity165 on wednesday april 8 at 1130 , 1245 , 1445 , 2130 , 2245 and 2345 and thursday april 9 at 0445 ( all times gmt ) and here online .<br/>
>__*4:*__  <br/>
>__*5:*__  @placeholder faces @entity12 challenger @entity11 in @entity13 on april 25<br/>
>__*6:*__  <br/>
>__*7:*__  @entity0<br/>
>__*8:*__  <br/>
>__*9:*__  @entity118:Human to Hero<br/>
>__*10:*__  @entity13:New York<br/>
>__*11:*__  @entity137:Hayden Panettiere<br/>
>__*12:*__  @entity12:American<br/>
>__*13:*__  @entity3:Miami<br/>
>__*14:*__  @entity2:CNN<br/>
>__*15:*__  @entity1:World<br/>
>__*16:*__  @entity0:Klitschko<br/>
>__*17:*__  @entity11:Bryant Jennings<br/>
>__*18:*__  @entity8:Ukraine<br/>
>__*19:*__  @entity53:Francois Hollande<br/>
>__*20:*__  @entity52:German<br/>
>__*21:*__  @entity51:Angela Merkel<br/>
>__*22:*__  @entity50:Europe<br/>
>__*23:*__  @entity75:Vladimir Putin<br/>
>__*24:*__  @entity74:Crimea<br/>
>__*25:*__  @entity58:Victor Yanukovych<br/>
>__*26:*__  @entity33:IBF<br/>
>__*27:*__  @entity35:WBO<br/>
>__*28:*__  @entity34:WBA<br/>
>__*29:*__  @entity37:Vitali<br/>
>__*30:*__  @entity36:IBO<br/>
>__*31:*__  @entity18:Russian<br/>
>__*32:*__  @entity98:Western<br/>
>__*33:*__  @entity108:former Soviet Union<br/>
>__*34:*__  @entity142:Kaya<br/>
>__*35:*__  @entity165:World Sport program<br/>
>__*36:*__  @entity45:Kiev<br/>
>__*37:*__  @entity43:Ukrainian Democratic Alliance for Reform<br/>
>__*38:*__  @entity67:Maidan Square<br/>
>__*39:*__  @entity48:Soviet<br/>
>__*40:*__  @entity60:European Union<br/>

In this example, line
* __*3*__ is the paragraph
* __*5*__ is the question
* __*7*__ is the answer
* __*9*__ and the following lines are the entity mappings used in the paragraph and the question
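
The layout described above can be sketched in a few lines of Python. The helper name `parse_question_file` and the shortened sample text are illustrative assumptions and are not part of the tutorial's code. The sketch only shows how the line positions map to fields and how the entity mappings can be used to read the question in plain text.

```python
def parse_question_file(text):
    """Split one raw CNN question file into its parts (0-based line indices)."""
    lines = text.split("\n")
    context = lines[2].strip()      # line 3: paragraph
    query = lines[4].strip()        # line 5: question
    answer = lines[6].strip()       # line 7: answer
    entities = {}
    for line in lines[8:]:          # line 9 onward: "@entityN:Surface form"
        line = line.strip()
        if line:
            key, _, value = line.partition(":")
            entities[key] = value
    return context, query, answer, entities

# Shortened, hand-made sample in the same layout as the example above.
sample = "\n".join([
    "http://example.com/article",
    "",
    "@entity0 has an important title defense coming up in @entity13 .",
    "",
    "@placeholder faces @entity12 challenger @entity11 in @entity13 on april 25",
    "",
    "@entity0",
    "",
    "@entity0:Klitschko",
    "@entity13:New York",
    "@entity12:American",
    "@entity11:Bryant Jennings",
])

context, query, answer, entities = parse_question_file(sample)
# Substitute the answer's surface form for the placeholder to read the question.
print(query.replace("@placeholder", entities[answer]))
```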

We will use the following block of code to download and merge each folder of files into a single file.

In [18]:
import io
import os
import re
import requests
import sys
import tarfile
import shutil

def merge_files(folder, target):
  if os.path.exists(target):
    return
  count = 0
  all_files = os.listdir(folder)
  print("Start to merge {0} files under folder {1} as {2}".format(len(all_files), folder, target))
  for f in all_files:
    txt=os.path.join(folder, f)
    if os.path.isfile(txt):
      with open(txt, encoding='utf-8') as sample:
        content = sample.readlines()
        context = content[2].strip()
        query = content[4].strip()
        answer = content[6].strip()
        entities = []
        for k in range(8, len(content)):
          entities += [ content[k].strip() ]
        # Write UTF-8 so the merged file matches the encoding used when it is read back
        with open(target, 'a', encoding='utf-8') as output:
          output.write("{0}\t{1}\t{2}\t{3}\n".format(query, answer, context, "\t".join(entities)))
    count+=1
    if count%1000==0:
      sys.stdout.write(".")
      sys.stdout.flush()
  print()
  print("Finished to merge {0}".format(target))

def download_cnn(target="."):
  if os.path.exists(os.path.join(target, "cnn")):
    shutil.rmtree(os.path.join(target, "cnn"))
  if not os.path.exists(target):
    os.makedirs(target)
  url="https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTTljRDVZMFJnVWM"
  print("Start to download CNN data from {0} to {1}".format(url, target))
  pre_request = requests.get(url)
  confirm_match = re.search(r"confirm=(.{4})", pre_request.content.decode("utf-8"))
  confirm_url = url + "&confirm=" + confirm_match.group(1)
  download_request = requests.get(confirm_url, cookies=pre_request.cookies)
  tar = tarfile.open(mode="r:gz", fileobj=io.BytesIO(download_request.content))
  tar.extractall(target)
  print("Finished to download {0} to {1}".format(url, target))

def file_exists(src):
  return (os.path.isfile(src) and os.path.exists(src))

data_path = "../Examples/LanguageUnderstanding/ReasoNet/Data"
raw_train_data=os.path.join(data_path, "cnn/training.txt")
raw_test_data=os.path.join(data_path, "cnn/test.txt")
raw_validation_data=os.path.join(data_path, "cnn/validation.txt")
if not (file_exists(raw_train_data) and file_exists(raw_test_data) and file_exists(raw_validation_data)):
  download_cnn(data_path)

merge_files(os.path.join(data_path, "cnn/questions/training"), raw_train_data)
merge_files(os.path.join(data_path, "cnn/questions/test"), raw_test_data)
merge_files(os.path.join(data_path, "cnn/questions/validation"), raw_validation_data)
print("All necessary data are downloaded to {0}".format(data_path))

All necessary data are downloaded to ../Examples/LanguageUnderstanding/ReasoNet/Data


### Convert to CNTK Text Format

To take advantage of the scalable readers bundled with CNTK, we need to convert the original data into the column-based [CNTK text format](https://github.com/Microsoft/CNTK/wiki/BrainScript-CNTKTextFormat-Reader) (CTF).
The converted CTF data file contains 5 columns/streams: context, query, entity indicator, label and entity IDs. Here is a snippet of the converted CTF output for the example input above:

>0 |Q 12:1 |C 4:1 |E 1 |L 0 |EID 4:1<br/>
> |Q 1739:1 |C 626:1 |E 0 |L 0 |EID 2:1<br/>
> |Q 14:1 |C 2:1 |E 1 |L 0 |EID 1:1<br/>
> |Q 5453:1 |C 625:1 |E 0 |L 0 |EID 3:1<br/>
> |Q 13:1 |C 1:1 |E 1 |L 0 |EID 9:1<br/>
> |Q 594:1 |C 7562:1 |E 0 |L 0 |EID 2:1<br/>
> |Q 15:1 |C 5284:1 |E 0 |L 0 |EID 4:1<br/>
> |Q 600:1 |C 1245:1 |E 0 |L 0 |EID 14:1<br/>
> |Q 1307:1 |C 3:1 |E 1 |L 1 |EID 13:1<br/>
> |Q 1309:1 |C 616:1 |E 0 |L 0 |EID 15:1<br/>
> |C 620:1 |E 0 |L 0 |EID 3:1<br/>
> |C 927:1 |E 0 |L 0 |EID 20:1<br/>
> |C 1115:1 |E 0 |L 0 |EID 9:1<br/>
> |C 1017:1 |E 0 |L 0 |EID 20:1<br/>
> |C 1067:1 |E 0 |L 0 |EID 3:1<br/>
> |C 650:1 |E 0 |L 0 |EID 2:1<br/>
> |C 587:1 |E 0 |L 0 |EID 9:1<br/>
> |C 613:1 |E 0 |L 0 |EID 20:1<br/>
> |C 608:1 |E 0 |L 0 |EID 20:1<br/>
> |C 2892:1 |E 0 |L 0 |EID 9:1<br/>
> |C 1015:1 |E 0 |L 0 |EID 3:1<br/>
> |C 589:1 |E 0 |L 0 |EID 35:1<br/>
> |C 615:1 |E 0 |L 0 |EID 36:1<br/>
> |C 2814:1 |E 0 |L 0 |EID 37:1<br/>
> |C 617:1 |E 0 |L 0 |EID 39:1<br/>
> |C 586:1 |E 0 |L 0 |EID 40:1<br/>
> |C 2090:1 |E 0 |L 0 |EID 40:1<br/>
> |C 1057:1 |E 0 |L 0 |EID 9:1<br/>
> |C 597:1 |E 0 |L 0 |EID 44:1<br/>
> |C 2054:1 |E 0 |L 0 |EID 47:1<br/>

The first column is the sequence ID, 0. The second column (`Q`) holds the query features and the third (`C`) holds the context features. The fourth (`E`) is a boolean that indicates whether the word in the context is an entity, and the fifth (`L`) is the label, which indicates whether that word is the answer. The last column (`EID`) holds the IDs of the entities in the context.
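
To make the column layout concrete, here is a small sketch that groups the values of a few CTF rows by stream name. The helper `parse_ctf_sequence` is a hypothetical name introduced for illustration (CNTK's own readers do this work during training). The three rows are copied from the snippet above. Sparse streams such as `Q`, `C` and `EID` use the `index:value` notation, while `E` and `L` are plain values.

```python
from collections import defaultdict

def parse_ctf_sequence(lines):
    """Collect the values of each stream for one CTF sequence."""
    streams = defaultdict(list)
    for line in lines:
        # Fields are separated by '|'; the first field is the optional sequence id.
        for field in line.split("|")[1:]:
            name, value = field.split(None, 1)
            streams[name].append(value.strip())
    return dict(streams)

rows = [
    "0 |Q 12:1 |C 4:1 |E 1 |L 0 |EID 4:1",
    " |Q 1739:1 |C 626:1 |E 0 |L 0 |EID 2:1",
    " |C 620:1 |E 0 |L 0 |EID 3:1",
]
streams = parse_ctf_sequence(rows)
for name in ("Q", "C", "E", "L", "EID"):
    print(name, len(streams[name]), streams[name])
```

Because each stream is collected independently, the query can end earlier than the context within the same sequence, as the third row shows.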

The code below performs the conversion. 

**Note**: Downloading and converting the data can take up to 30 minutes and requires 11 GB of local disk space.

In [19]:
import sys
import os
import math
import functools
import numpy as np

class WordFreq:
  def __init__(self, word, id, freq):
    self.word = word
    self.id = id
    self.freq = freq

class Vocabulary:
  """Build word vocabulary with frequency"""
  def __init__(self, name):
    self.name = name
    self.size = 0
    self.__dict = {}
    self.__has_index = False

  def push(self, word):
    if word in self.__dict:
      self.__dict[word].freq += 1
    else:
      self.__dict[word] = WordFreq(word, len(self.__dict), 1)

  def build_index(self, max_size):
    def word_cmp(x, y):
      if x.freq == y.freq :
        return (x.word > y.word) - (x.word < y.word)
      else:
        return x.freq - y.freq

    items = sorted(self.__dict.values(), key=functools.cmp_to_key(word_cmp), reverse=True)
    if len(items)>max_size:
      del items[max_size:]
    self.size=len(items)
    self.__dict.clear()
    for it in items:
      it.id = len(self.__dict)
      self.__dict[it.word] = it
    self.__has_index = True

  def save(self, dst):
    if not self.__has_index:
      self.build_index(sys.maxsize)
    if self.name != None:
      dst.write("{0}\t{1}\n".format(self.name, self.size))
    for it in sorted(self.__dict.values(), key=lambda it:it.id):
      dst.write("{0}\t{1}\t{2}\n".format(it.word, it.id, it.freq))

  def load(self, src):
    line = src.readline()
    if line == "":
      return
    line = line.rstrip('\n')
    head = line.split()
    max_size = sys.maxsize
    if len(head) == 2:
      self.name = head[0]
      max_size = int(head[1])
    cnt = 0
    while cnt < max_size:
      line = src.readline()
      if line == "":
        break
      line = line.rstrip('\n')
      items = line.split()
      self.__dict[items[0]] = WordFreq(items[0], int(items[1]), int(items[2]))
      cnt += 1
    self.size = len(self.__dict)
    self.__has_index = True

  def __getitem__(self, key):
    if key in self.__dict:
      return self.__dict[key]
    else:
      return None

  def values(self):
    return self.__dict.values()

  def __len__(self):
    return self.size

  def __contains__(self, q):
    return q in self.__dict

  @staticmethod
  def is_cnn_entity(word):
    return word.startswith('@entity') or word.startswith('@placeholder')

  @staticmethod
  def load_vocab(vocab_src):
    """
    Load vocabulary from file.

    Args:
      vocab_src (`str`): the file stored with the vocabulary data
      
    Returns:
      :class:`Vocabulary`: Vocabulary of the entities
      :class:`Vocabulary`: Vocabulary of the words
    """
    word_vocab = Vocabulary("WordVocab")
    entity_vocab = Vocabulary("EntityVocab")
    with open(vocab_src, 'r', encoding='utf-8') as src:
      entity_vocab.load(src)
      word_vocab.load(src)
    return entity_vocab, word_vocab

  @staticmethod
  def build_vocab(input_src, vocab_dst, max_size=50000):
    """
    Build vocabulary from raw corpus file.

    Args:
      input_src (`str`): the path of the corpus file
      vocab_dst (`str`): the path of the vocabulary file to save the built vocabulary
      max_size (`int`): the maximum size of the word vocabulary
    Returns:
      :class:`Vocabulary`: Vocabulary of the entities
      :class:`Vocabulary`: Vocabulary of the words
    """
    # Leave the first as Unknown
    max_size -= 1
    word_vocab = Vocabulary("WordVocab")
    entity_vocab = Vocabulary("EntityVocab")
    linenum = 0
    print("Start to build vocabulary from {0} with maximum words {1}. Saved to {2}"\
          .format(input_src, max_size, vocab_dst))
    with open(input_src, 'r', encoding='utf-8') as src:
      all_lines = src.readlines()
      print("Total lines to process: {0}".format(len(all_lines)))
      for line in all_lines:
        line = line.strip('\n')
        ans, query_words, context_words = Vocabulary.parse_corpus_line(line)
        for q in query_words:
          if Vocabulary.is_cnn_entity(q):
          #if q.startswith('@'):
            entity_vocab.push(q)
          else:
            word_vocab.push(q)
        for q in context_words:
          #if q.startswith('@'):
          if Vocabulary.is_cnn_entity(q):
            entity_vocab.push(q)
          else:
            word_vocab.push(q)
        linenum += 1
        if linenum%1000==0:
          sys.stdout.write(".")
          sys.stdout.flush()
    print()
    entity_vocab.build_index(max_size)
    word_vocab.build_index(max_size)
    with open(vocab_dst, 'w', encoding='utf-8') as dst:
      entity_vocab.save(dst)
      word_vocab.save(dst)
    print("Finished to generate vocabulary from: {0}".format(input_src))
    return entity_vocab, word_vocab

  @staticmethod
  def parse_corpus_line(line):
    """
    Parse a corpus line into answer, query and context.

    Args:
      line (`str`): A line of text from the merged corpus file
    Returns:
      :`str`: Answer word
      :`str[]`: Array of query words
      :`str[]`: Array of context/passage words

    """
    data = line.split('\t')
    query = data[0]
    answer = data[1]
    context = data[2]
    query_words = query.split()
    context_words = context.split()
    return answer, query_words, context_words

  @staticmethod
  def build_corpus(entities, words, corpus, output, max_seq_len=100000):
    """
    Build featurized corpus and store it in CNTK Text Format.

    Args:
      entities (class:`Vocabulary`): The entities vocabulary
      words (class:`Vocabulary`): The words vocabulary
      corpus (`str`): The file path of the raw corpus
      output (`str`): The file path to store the featurized corpus data file
    """
    seq_id = 0
    print("Start to build CTF data from: {0}".format(corpus))
    with open(corpus, 'r', encoding = 'utf-8') as corp:
      with open(output, 'w', encoding = 'utf-8') as outf:
        all_lines = corp.readlines()
        print("Total lines to process: {0}".format(len(all_lines)))
        for line in all_lines:
          line = line.strip('\n')
          ans, query_words, context_words = Vocabulary.parse_corpus_line(line)
          ans_item = entities[ans]
          query_ids = []
          context_ids = []
          is_entity = []
          entity_ids = []
          labels = []
          pos = 0
          answer_idx = None
          for q in context_words:
            if Vocabulary.is_cnn_entity(q):
              item = entities[q]
              context_ids += [ item.id + 1 ]
              entity_ids += [ item.id + 1 ]
              is_entity += [1]
              if ans_item.id == item.id:
                labels += [1]
                answer_idx = pos
              else:
                labels += [0]
            else:
              item = words[q]
              context_ids += [ (item.id + 1 + entities.size) if item != None else 0 ]
              is_entity += [0]
              labels += [0]
            pos += 1
            if (pos >= max_seq_len):
              break
          if answer_idx is None:
            continue
          for q in query_words:
            if Vocabulary.is_cnn_entity(q):
              item = entities[q]
              query_ids += [ item.id + 1 ]
            else:
              item = words[q]
              query_ids += [ (item.id + 1 + entities.size) if item != None else 0 ]
          #Write featurized ids
          outf.write("{0}".format(seq_id))
          for i in range(max(len(context_ids), len(query_ids))):
            if i < len(query_ids):
              outf.write(" |Q {0}:1".format(query_ids[i]))
            if i < len(context_ids):
              outf.write(" |C {0}:1".format(context_ids[i]))
              outf.write(" |E {0}".format(is_entity[i]))
              outf.write(" |L {0}".format(labels[i]))
            if i < len(entity_ids):
              outf.write(" |EID {0}:1".format(entity_ids[i]))
            outf.write("\n")
          seq_id += 1
          if seq_id%1000 == 0:
            sys.stdout.write(".")
            sys.stdout.flush()
    print()
    print("Finished to build corpus from {0}".format(corpus))
  
vocab_path=os.path.join(data_path, "cnn/cnn.vocab")
train_ctf=os.path.join(data_path, "cnn/training.ctf")
test_ctf=os.path.join(data_path, "cnn/test.ctf")
validation_ctf=os.path.join(data_path, "cnn/validation.ctf")
vocab_size=101000
if not (file_exists(train_ctf) and file_exists(test_ctf) and file_exists(validation_ctf)):
  entity_vocab, word_vocab = Vocabulary.build_vocab(raw_train_data, vocab_path, vocab_size)
  Vocabulary.build_corpus(entity_vocab, word_vocab, raw_train_data, train_ctf)
  Vocabulary.build_corpus(entity_vocab, word_vocab, raw_test_data, test_ctf)
  Vocabulary.build_corpus(entity_vocab, word_vocab, raw_validation_data, validation_ctf)
print("Training data conversion finished.")

Training data conversion finished.
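
The ID layout that `build_corpus` writes into the `Q`, `C` and `EID` streams can be summarized in a short sketch: 0 is reserved for unknown words, entities occupy the IDs 1 to E, and ordinary words follow after the entities. The function `feature_id` and the toy vocabularies below are hypothetical and only mirror the arithmetic of the conversion code above. Plain dictionaries stand in for the `Vocabulary` class.

```python
def feature_id(token, entity_ids, word_ids):
    """Map a token to its position in the shared one-hot space."""
    if token.startswith("@entity") or token.startswith("@placeholder"):
        return entity_ids[token] + 1                  # entities occupy 1..E
    if token in word_ids:
        return word_ids[token] + 1 + len(entity_ids)  # words follow the entities
    return 0                                          # 0 is reserved for unknown words

# Toy vocabularies (hypothetical values, not taken from the CNN data).
entity_ids = {"@entity0": 0, "@entity1": 1, "@placeholder": 2}
word_ids = {"the": 0, "champion": 1}

tokens = ["@placeholder", "is", "the", "champion", "@entity1"]
print([feature_id(t, entity_ids, word_ids) for t in tokens])
# prints [3, 0, 4, 5, 2]
```

This layout is why the embedding code later in the tutorial sizes its matrix as the number of entities plus the number of words plus one.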


## Basic CNTK imports

In [20]:
import sys
from datetime import datetime
import numpy as np
import cntk
from cntk import Trainer, Axis, device, combine
from cntk.layers.blocks import Stabilizer, _initializer_for,  _INFERRED, Parameter, Placeholder, GRU,Input
from cntk.layers import Recurrence, Convolution
from cntk.ops import input_variable, cross_entropy_with_softmax, classification_error, sequence, reduce_sum, \
    parameter, times, element_times, past_value, plus, placeholder_variable, reshape, constant, sigmoid, \
    convolution, tanh, times_transpose, greater, cosine_distance, element_divide, element_select, exp, \
    future_value, past_value
from cntk.internal import _as_tuple, sanitize_input
from cntk.initializer import uniform, glorot_uniform
from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs, INFINITELY_REPEAT, \
    DEFAULT_RANDOMIZATION_WINDOW
import cntk.ops as ops
import cntk.learners as learners
# Check for an environment variable defined in CNTK's test infrastructure
envvar = 'CNTK_EXTERNAL_TESTDATA_SOURCE_DIRECTORY'
def is_test(): return envvar in os.environ

# Select the right target device when this notebook is being tested
# Currently supported only for GPU 

if 'TEST_DEVICE' in os.environ:
    if os.environ['TEST_DEVICE'] == 'cpu':
        raise ValueError('This notebook is currently not supported on CPU') 
    else:
        cntk.device.set_default_device(cntk.device.gpu(0))
cntk.device.set_default_device(cntk.device.gpu(0))

True

## Utils
Some utilities that are used during the model creation and training stages.

### Logger
We use a logger to write information both to the console and to a file on disk, so that we can check the information after the process has exited.

In [21]:
import os
import numpy as np
from datetime import datetime
import math

class logger:
  __name=''
  __logfile=''

  @staticmethod
  def init(name=''):
    if not os.path.exists("model"):
      os.mkdir("model")
    if not os.path.exists("log"):
      os.mkdir("log")
    if name=='' or name is None:
      logger.__name='train'
    else:
      logger.__name=name
    logger.__logfile = 'log/{}_{}.log'.format(logger.__name, datetime.now().strftime("%m-%d_%H.%M.%S"))
    if os.path.exists(logger.__logfile):
      os.remove(logger.__logfile)
    print('Log with log file: {0}'.format(logger.__logfile))

  @staticmethod
  def log(message, toconsole=True):
    if logger.__logfile == '' or logger.__logfile is None:
      logger.init()
    if toconsole:
      print(message)
    with open(logger.__logfile, 'a') as logf:
      logf.write("{}| {}\n".format(datetime.now().strftime("%Y-%m-%d %H:%M:%S"), message))
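
For comparison, the same behavior (each message goes to the console and to a file) can also be obtained with Python's standard `logging` module. This is an alternative sketch, not the logger used in the rest of the tutorial. The function name `make_logger` and the temporary log path are assumptions made for the example.

```python
import logging
import os
import tempfile

def make_logger(log_path, name="train"):
    """Return a logger that writes each message to the console and to log_path."""
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    log.handlers.clear()  # avoid duplicate handlers when the cell is re-run
    formatter = logging.Formatter("%(asctime)s| %(message)s", "%Y-%m-%d %H:%M:%S")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        log.addHandler(handler)
    return log

# A temporary directory keeps the example from touching the working directory.
log_path = os.path.join(tempfile.mkdtemp(), "train.log")
log = make_logger(log_path)
log.info("Create model")
```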


### Embedding
In this implementation we apply a special policy to train the embedding layer. For entities in the context/paragraph, we use fixed random vectors as their embeddings and do not update them during training. For the other words in the context/paragraph and the query, we initialize the embeddings either with Glorot uniform initialization or from an existing embedding matrix (e.g. GloVe), and we update them during training. To achieve this, we cannot simply reuse the existing initializers or embedding lookup functions in CNTK, so we implement `create_random_matrix` to create a *random initialization matrix* and `load_embedding` to *load an existing embedding matrix*. The class `uniform_initializer` is used by `load_embedding` to initialize *entities* and any other words that cannot be found in the existing embedding matrix (lookup table).
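
The effect of this policy can be illustrated with a minimal NumPy sketch of one gradient step in which the entity rows are masked out. It is not the CNTK mechanism used by the tutorial. The sizes, the learning rate and the stand-in gradient are made up, and treating row 0 (the unknown word) as trainable is an assumption.

```python
import numpy as np

rng = np.random.RandomState(0)
entity_size, word_size, dim = 3, 4, 5
vocab_dim = entity_size + word_size + 1          # row 0 is the unknown word

embedding = rng.uniform(-0.1, 0.1, (vocab_dim, dim)).astype(np.float32)
before = embedding.copy()

# 1.0 for trainable rows (unknown word and ordinary words), 0.0 for entity rows 1..E.
trainable = np.ones((vocab_dim, 1), dtype=np.float32)
trainable[1:entity_size + 1] = 0.0

grad = rng.normal(size=embedding.shape).astype(np.float32)  # stand-in gradient
embedding -= 0.01 * trainable * grad                         # one masked SGD step

# Entity rows keep their random values; all other rows have moved.
print(np.array_equal(embedding[1:entity_size + 1], before[1:entity_size + 1]))
```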

In [22]:
class uniform_initializer:
  def __init__(self, scale=1, bias=0, seed=0):
    self.seed = seed
    self.scale = scale
    self.bias = bias
    np.random.seed(self.seed)

  def reset(self):
    np.random.seed(self.seed)

  def next(self, size=None):
    return np.random.uniform(0, 1, size)*self.scale + self.bias

def create_random_matrix(rows, columns):
  scale = math.sqrt(6/(rows+columns))*2
  rand = uniform_initializer(scale, -scale/2)
  embedding = [None]*rows
  for i in range(rows):
    embedding[i] = np.array(rand.next(columns), dtype=np.float32)
  return np.ndarray((rows, columns), dtype=np.float32, buffer=np.array(embedding))

def load_embedding(embedding_path, vocab_path, dim, init=None):
  entity_vocab, word_vocab = Vocabulary.load_vocab(vocab_path)
  vocab_dim = len(entity_vocab) + len(word_vocab) + 1
  entity_size = len(entity_vocab)
  item_embedding = [None]*vocab_dim
  with open(embedding_path, 'r') as embedding:
    for line in embedding.readlines():
      line = line.strip('\n')
      item = line.split(' ')
      if item[0] in word_vocab:
        item_embedding[word_vocab[item[0]].id + entity_size + 1] = \
        np.array(item[1:], dtype="|S").astype(np.float32)
  if init != None:
    init.reset()

  for i in range(vocab_dim):
    if item_embedding[i] is None:
      if init:
        item_embedding[i] = np.array(init.next(dim), dtype=np.float32)
      else:
        item_embedding[i] = np.array([0]*dim, dtype=np.float32)
  return np.ndarray((vocab_dim, dim), dtype=np.float32, buffer=np.array(item_embedding))
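
A quick way to see what `create_random_matrix` produces: with `scale = 2 * sqrt(6 / (rows + columns))` and `bias = -scale / 2`, a uniform draw from [0, 1) is mapped into the Glorot uniform range between `-sqrt(6 / (rows + columns))` and `+sqrt(6 / (rows + columns))`. The sketch below repeats that arithmetic on its own, without the classes above. The matrix size is an arbitrary choice for the example.

```python
import math
import numpy as np

rows, columns = 200, 50
limit = math.sqrt(6 / (rows + columns))   # Glorot uniform bound
scale, bias = 2 * limit, -limit           # same scale and bias as in create_random_matrix

np.random.seed(0)
matrix = np.random.uniform(0, 1, (rows, columns)) * scale + bias

# Every entry lies inside [-limit, limit].
print(matrix.shape, matrix.min() >= -limit, matrix.max() <= limit)
```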

### Basic components
Here we provide some basic components that are used in the model to simplify its creation.

### Projected Cosine Similarity
In ReasoNet we use projected cosine similarity to compute the attention between the *internal state* (called `status` in the code) and each word in the *external memory*, where the *external memory* is composed of the *paragraph memory* and the *query memory*. The formula for the *doc attention* can be written as
$$
a_{t,i}^{doc}=\mathrm{softmax}_{i=1,\ldots,\left|M^{doc}\right|}\;\gamma\cos\left(w_1^{doc}m_i^{doc}, w_2^{doc}s_t\right)
$$
The formula for the *query attention* is analogous.
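
The attention formula can be checked numerically with a small NumPy sketch. This is an illustration under assumed shapes and not the CNTK implementation that follows. The function name `projected_cosine_attention`, the dimensions and the value of `gamma` are made up for the example.

```python
import numpy as np

def projected_cosine_attention(memory, state, w_mem, w_state, gamma=1.0):
    """softmax_i(gamma * cos(w_mem m_i, w_state s)) over the memory positions."""
    pm = memory @ w_mem                       # (n, att_dim): projected memory items
    ps = state @ w_state                      # (att_dim,): projected state
    cos = (pm @ ps) / (np.linalg.norm(pm, axis=1) * np.linalg.norm(ps) + 1e-12)
    scores = gamma * cos
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    return weights / weights.sum()

rng = np.random.RandomState(0)
memory = rng.normal(size=(6, 8))    # 6 memory items of dimension 8
state = rng.normal(size=8)          # internal state s_t
w_mem = rng.normal(size=(8, 4))     # projection to att_dim = 4
w_state = rng.normal(size=(8, 4))

attention = projected_cosine_attention(memory, state, w_mem, w_state, gamma=10.0)
print(attention.round(3), attention.sum())
```

Because the cosine is bounded by -1 and 1, `gamma` controls how sharp the attention distribution can become: a larger value lets the softmax concentrate on the best matching memory item.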

In [23]:
def cosine_similarity(src, tgt, name=''):
  """
  Compute the cosine similarity of two sequences.
  src is a sequence of length 1
  tgt is a sequence of length >= 1
  """
  src_br = sequence.broadcast_as(src, tgt, name='src_broadcast')
  sim = cosine_distance(src_br, tgt, name)
  return sim

def project_cosine_sim(att_dim, init = glorot_uniform(), name=''):
  """
  Compute the projected cosine similarity of two input sequences,
  where each input is projected to a new dimension space (att_dim) via Wi/Wm
  """
  Wi = Parameter(_INFERRED + tuple((att_dim,)), init = init, name='Wi')
  Wm = Parameter(_INFERRED + tuple((att_dim,)), init = init, name='Wm')
  status = placeholder_variable(name='status')
  memory = placeholder_variable(name='memory')
  projected_status = times(status, Wi, name = 'projected_status')
  projected_memory = times(memory, Wm, name = 'projected_memory')
  sim = cosine_similarity(projected_status, projected_memory, name= name+ '_sim')
  return sequence.softmax(sim, name = name)

### Termination gate
The termination probability at each time step is computed as
$$
f_t\left(s_t;\theta_t\right)=\mathrm{sigmoid}\left(w_ts_t+b_t\right),\quad \text{where}\ \theta_t=\left(w_t, b_t\right)
$$
In our implementation, we ignore the bias $b_t$.
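
As a plain NumPy illustration of the gate without the bias term, the sketch below maps a state vector to a single stop probability. The function name and the example vectors are assumptions. The CNTK version follows in the next cell.

```python
import numpy as np

def termination_probability(state, w_t):
    """sigmoid(w_t . s_t); the bias b_t is omitted, as in the tutorial."""
    return 1.0 / (1.0 + np.exp(-np.dot(state, w_t)))

state = np.array([0.5, -1.0, 2.0])
w_zero = np.zeros(3)
print(termination_probability(state, w_zero))   # 0.5: zero weights give an even chance
print(termination_probability(state, state))    # above 0.5, since the dot product is positive
```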

In [24]:
def termination_gate(init = glorot_uniform(), name=''):
  Wt = Parameter( _INFERRED + tuple((1,)), init = init, name='Wt')
  status = placeholder_variable(name='status')
  return sigmoid(times(status, Wt), name=name)

### Create Model

#### Model parameters
We use `model_params` to wrap the parameters used to create the model.

In [25]:
class model_params:
  def __init__(self, vocab_dim, entity_dim, hidden_dim, embedding_dim=100, embedding_init=None, 
               share_rnn_param=False, max_rl_steps=5, dropout_rate=None, att_dim=384, 
               init=glorot_uniform(), model_name='rsn'):
    self.vocab_dim = vocab_dim
    self.entity_dim = entity_dim
    self.hidden_dim = hidden_dim
    self.embedding_dim = embedding_dim
    self.embedding_init = embedding_init
    self.max_rl_steps = max_rl_steps
    self.dropout_rate = dropout_rate
    self.init = init
    self.model_name = model_name
    self.share_rnn_param = share_rnn_param
    self.attention_dim = att_dim

#### Attention model

In [26]:
def attention_model(context_memory, query_memory, init_status, hidden_dim, att_dim, 
                    max_steps = 5, init = glorot_uniform()):
  """
  Create the attention model for ReasoNet
  Args:
    context_memory: Context memory
    query_memory: Query memory
    init_status: Initial status
    hidden_dim: The dimension of the hidden state
    att_dim: The dimension of the attention
    max_steps: Maximum number of steps to revisit the context memory
  """
  gru = GRU((hidden_dim*2, ), name='control_status')
  status = init_status
  output = [None]*max_steps*2
  sum_prob = None
  context_cos_sim = project_cosine_sim(att_dim, name='context_attention')
  query_cos_sim = project_cosine_sim(att_dim, name='query_attention')
  ans_cos_sim = project_cosine_sim(att_dim, name='candidate_attention')
  stop_gate = termination_gate(name='terminate_prob')
  prev_stop = 0
  for step in range(max_steps):
    context_attention_weight = context_cos_sim(status, context_memory)
    query_attention_weight = query_cos_sim(status, query_memory)
    context_attention = sequence.reduce_sum(times(context_attention_weight, context_memory), name='C-Att')
    query_attention = sequence.reduce_sum(times(query_attention_weight, query_memory), name='Q-Att')
    attention = ops.splice(query_attention, context_attention, name='att-sp')
    status = gru(status, attention).output
    termination_prob = stop_gate(status)
    ans_attention = ans_cos_sim(status, context_memory)
    output[step*2] = ans_attention
    if step < max_steps -1:
      stop_prob = prev_stop + ops.log(termination_prob, name='log_stop')
    else:
      stop_prob = prev_stop
    output[step*2+1] = sequence.broadcast_as(ops.exp(stop_prob, name='exp_log_stop'), 
                                             output[step*2], name='Stop_{0}'.format(step))
    prev_stop += ops.log(1-termination_prob, name='log_non_stop')

  final_ans = None
  for step in range(max_steps):
    if final_ans is None:
      final_ans = output[step*2] * output[step*2+1]
    else:
      final_ans += output[step*2] * output[step*2+1]
  combine_func = combine(output + [ final_ans ], name='Attention_func')
  return combine_func
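
The bookkeeping with `prev_stop` in `attention_model` turns the per-step gate outputs into a probability distribution over the step at which reasoning stops: step t gets the gate output times the probability of not having stopped earlier, and the last step takes whatever probability is left. The NumPy sketch below reproduces that arithmetic in log space. The gate values and the per-step answer attentions are made up, and `stop_distribution` is a hypothetical helper name.

```python
import numpy as np

def stop_distribution(termination_probs):
    """Probability that reasoning stops at each step; the last step takes what is left."""
    steps = len(termination_probs)
    log_not_stopped = 0.0
    dist = np.zeros(steps)
    for t, f_t in enumerate(termination_probs):
        if t < steps - 1:
            dist[t] = np.exp(log_not_stopped + np.log(f_t))
        else:
            dist[t] = np.exp(log_not_stopped)      # forced stop at the last step
        log_not_stopped += np.log(1.0 - f_t)
    return dist

gates = [0.1, 0.5, 0.5, 0.9, 0.3]                  # made-up gate outputs for 5 steps
dist = stop_distribution(gates)

# Expected answer distribution: per-step answer attentions weighted by dist.
rng = np.random.RandomState(0)
answers = rng.dirichlet(np.ones(4), size=len(gates))   # 5 steps x 4 context positions
final_answer = dist @ answers

print(dist.round(4), dist.sum(), final_answer.sum())
```

Since the stop probabilities sum to one and every per-step answer attention is itself a distribution, the weighted sum is again a distribution over context positions.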

### Define the network
#### Dynamic axes in CNTK (Key concept)
One of the important concepts in understanding CNTK is the idea of two types of axes:
* static axes, which are the traditional axes of a variable's shape, and
* dynamic axes, which have dimensions that are unknown until the variable is bound to real data at computation time.

The dynamic axes are particularly important in the world of recurrent neural networks. Instead of having to decide a maximum sequence length ahead of time, padding your sequences to that size, and wasting computation, CNTK's dynamic axes allow for variable sequence lengths that are automatically packed in minibatches to be as efficient as possible.

When setting up sequences, there are two dynamic axes that are important to consider. The first is the batch axis, which is the axis along which multiple sequences are batched. The second is the dynamic axis particular to that sequence. The latter is specific to a particular input because of variable sequence lengths in your data. In CNTK, we use the dynamic axis name to identify different dynamic axes, and all sequence operations between different variables require them to have the same dynamic axes, which means they must have the same length along every axis. 

In the ReasoNet network, we have five input streams/sequences: *query*, *paragraph*, *label*, *entity id* and *entity indicator*, where *entity id* and *entity indicator* are helper sequences. *Query*, *paragraph* and *entity id* have different sequence lengths, so each has its own sequence dynamic axis. *Label* and *entity indicator* have the same sequence length as *paragraph*, so they share its dynamic axis.
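
To make the cost of padding and the shared-axis requirement concrete, here is a small plain-Python sketch. The token counts are invented for illustration and `padding_waste` is a hypothetical helper; nothing here calls CNTK.

```python
# Toy minibatch of three instances; the token counts are invented for illustration.
context_lengths = [700, 350, 1050]   # one entry per sequence on the context axis
query_lengths = [12, 30, 21]         # the query axis has its own, independent lengths

def padding_waste(lengths):
    """Fraction of cells that would be padding if all sequences were padded to the longest."""
    padded_cells = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / padded_cells

context_waste = padding_waste(context_lengths)

# Streams that share the context axis (label, entity indicator) must match it step for step.
label_lengths = list(context_lengths)
shares_axis = all(c == l for c, l in zip(context_lengths, label_lengths))
```

With these numbers one third of a padded context batch would be padding, which is the computation that dynamic axes avoid.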

In [27]:
def create_model(params):
  """
  Create ReasoNet model
  Args:
    params (class:`model_params`): The parameters used to create the model
  """
  logger.log("Create model: dropout_rate: {0}, init:{1}, embedding_init: {2}"\
             .format(params.dropout_rate, params.init, params.embedding_init))
  # Query and Doc/Context/Paragraph inputs to the model
  batch_axis = Axis.default_batch_axis()
  query_seq_axis = Axis('queryAxis')
  context_seq_axis = Axis('contextAxis')
  query_dynamic_axes = [batch_axis, query_seq_axis]
  query_sequence = Input(shape=(params.vocab_dim), is_sparse=True, 
                                  dynamic_axes=query_dynamic_axes, name='query')
  context_dynamic_axes = [batch_axis, context_seq_axis]
  context_sequence = Input(shape=(params.vocab_dim), is_sparse=True, 
                                    dynamic_axes=context_dynamic_axes, name='context')
  # The entity ids mask is a sequence with the same length as the context sequence, where each item indicates
  # whether the corresponding word in the context is an entity or not.
  entity_ids_mask = Input(shape=(1,), is_sparse=False, dynamic_axes=context_dynamic_axes, 
                                   name='entity_ids_mask')
  # embedding
  if params.embedding_init is None:
    embedding_init = create_random_matrix(params.vocab_dim, params.embedding_dim)
  else:
    embedding_init = params.embedding_init
  embedding = parameter(shape=(params.vocab_dim, params.embedding_dim), init=None)
  embedding.value = embedding_init
  constant_embedding = constant(embedding_init, shape=(params.vocab_dim, params.embedding_dim))

  if params.dropout_rate is not None:
    query_embedding  = ops.dropout(times(query_sequence , embedding), params.dropout_rate, 
                                   name='query_embedding')
    context_embedding = ops.dropout(times(context_sequence, embedding), params.dropout_rate, 
                                    name='context_embedding')
  else:
    query_embedding  = times(query_sequence , embedding, name='query_embedding')
    context_embedding = times(context_sequence, embedding, name='context_embedding')

  contextGruW = Parameter(_INFERRED +  _as_tuple(params.hidden_dim), init=glorot_uniform(), 
                          name='gru_params')
  queryGruW = Parameter(_INFERRED +  _as_tuple(params.hidden_dim), init=glorot_uniform(), 
                        name='gru_params')
  # We use constant random vectors as the embeddings of entities in the paragraph, 
  # as we treat them as meaningless symbols, which is equivalent to shuffling the entities
  entity_embedding = ops.times(context_sequence, constant_embedding, name='constant_entity_embedding')
  
  # Unlike other words in the context, 
  # we keep the entity vectors fixed as a random vector so that each vector just means an identifier 
  # of different entities in the context and it has no semantic meaning
  full_context_embedding = ops.element_select(entity_ids_mask, entity_embedding, context_embedding)
  context_memory = ops.optimized_rnnstack(full_context_embedding, contextGruW, params.hidden_dim, 1, 
                                          True, recurrent_op='gru', name='context_mem')

  query_memory = ops.optimized_rnnstack(query_embedding, queryGruW, params.hidden_dim, 1, True, 
                                        recurrent_op='gru', name='query_mem')
  qfwd = ops.slice(sequence.last(query_memory), -1, 0, params.hidden_dim, name='fwd')
  qbwd = ops.slice(sequence.first(query_memory), -1, params.hidden_dim, params.hidden_dim*2, name='bwd')
  init_status = ops.splice(qfwd, qbwd, name='Init_Status') # get last fwd status and first bwd status
  return attention_model(context_memory, query_memory, init_status, params.hidden_dim, 
                         params.attention_dim, max_steps = params.max_rl_steps)

### Loss function
#### Contractive Reward

The ReasoNet paper gives the formula of the reward as
\begin{align}
J(\theta) = \mathbf{E}_{\pi\left(t_{1:T},a_T;\theta\right)}\left[\sum_{t=1}^Tr_t\right]
\end{align}

It applies the REINFORCE algorithm to estimate 
\begin{align} 
\nabla_{\theta}J(\theta) = \mathbf{E}_{\pi\left(t_{1:T},a_T;\theta\right)}\left[\nabla_{\theta}\log\pi\left(t_{1:T},a_T;\theta\right)r_T\right]=\sum_{\left(t_{1:T},a_T\right)\in\mathbb{A}^+}\pi\left(t_{1:T},a_T;\theta\right)\left[\nabla_{\theta}\log\pi\left(t_{1:T},a_T;\theta\right)\left(r_T-b_T\right)\right]
\end{align}

However, as the baselines $\left\{b_T;T=1...T_{max}\right\}$ are global variables independent of instances, they lead to slow convergence when training ReasoNet. Instead, the paper rewrites the formula as
$$
\nabla_{\theta}J(\theta) =\sum_{\left(t_{1:T},a_T\right)\in\mathbb{A}^+}\pi\left(t_{1:T},a_T;\theta\right)\left[\nabla_{\theta}\log\pi\left(t_{1:T},a_T;\theta\right)\left(r_T-b\right)\right]
$$
where $b=\sum_{\left(t_{1:T},a_T\right)\in\mathbb{A}^+}\pi\left(t_{1:T},a_T;\theta\right)r_T$ is the average reward over the $\left|\mathbb{A}^+\right|$ episodes.

Since the probability-weighted sum of the centered rewards over the $\left|\mathbb{A}^+\right|$ episodes is zero, $\sum_{\left(t_{1:T},a_T\right)\in\mathbb{A}^+}\pi\left(t_{1:T},a_T;\theta\right)\left(r_T-b\right)=0$, they call it Contractive Reward. Furthermore, they found that using $\left(\frac{r_T}{b}-1\right)$ in place of $\left(r_T-b\right)$ leads to better convergence.

In our implementation, we take the reward in the form,
$$
J(\theta)=\sum_{\left(t_{1:T},a_T\right)\in\mathbb{A}^+}\pi\left(t_{1:T},a_T;\theta\right)\left(\frac{r_T}{b}-1\right) + b
$$
As we only compute the gradient with respect to $\pi\left(t_{1:T},a_T;\theta\right)$ and treat the other components in the formula as constants, the derivative is the same as in the paper, while the output is the average reward over the $\left|\mathbb{A}^+\right|$ episodes.
In CNTK, we apply the stop_gradient operator to the output of a function to turn it into a constant in the formula.
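
The two properties claimed above (the centered rewards sum to zero, and the value of $J$ equals $b$) can be checked numerically. This is a minimal NumPy sketch with made-up episode probabilities and rewards; the names `pi`, `r` and `grad_wrt_pi` are illustrative and do not appear in the implementation.

```python
import numpy as np

pi = np.array([0.1, 0.4, 0.3, 0.2])  # made-up episode probabilities, summing to 1
r = np.array([1.0, 0.0, 1.0, 0.0])   # made-up rewards: 1 for a correct answer, else 0

b = np.sum(pi * r)                   # average reward, treated as a constant
centered = np.sum(pi * (r - b))      # contractive property: should be zero
J = np.sum(pi * (r / b - 1.0)) + b   # value of the objective: should equal b

# With b held constant (the role of stop_gradient), dJ/dpi is the normalized reward.
grad_wrt_pi = r / b - 1.0
```

Episodes with a correct answer get a positive weight and the others a negative one, so their probabilities are pushed in opposite directions.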

In [28]:
def contractive_reward(labels, predictions_and_stop_probabilities):
  """
  Compute the contractive reward loss in paper 'ReasoNet: 
    Learning to Stop Reading in Machine Comprehension'
  Args:
    labels: The labels
    predictions_and_stop_probabilities: A list of tuples, 
    each tuple contains the prediction and stop probability of the corresponding step.
  """
  base = None
  avg_rewards = None
  for step in range(len(predictions_and_stop_probabilities)):
    pred = predictions_and_stop_probabilities[step][0]
    stop = predictions_and_stop_probabilities[step][1]
    if base is None:
      base = ops.element_times(pred, stop)
    else:
      base = ops.plus(ops.element_times(pred, stop), base)
  avg_rewards = ops.stop_gradient(sequence.reduce_sum(base*labels))
  base_reward = sequence.broadcast_as(avg_rewards, base, name = 'base_line')
  # The learner minimizes the loss by default, but we want it to maximize the rewards
  # Maximal rewards => minimal -rewards
  # So we use (1-r/b) as the rewards instead of (r/b-1)
  step_cr = ops.stop_gradient(1- ops.element_divide(labels, base_reward))
  normalized_contractive_rewards = ops.element_times(base, step_cr)
  rewards = sequence.reduce_sum(normalized_contractive_rewards) + avg_rewards
  return rewards

#### Loss and accuracy

In [29]:
def accuracy_func(prediction, label, name='accuracy'):
  """
  Compute the accuracy of the prediction
  """
  pred_max = ops.hardmax(prediction, name='pred_max')
  norm_label = ops.equal(label, [1], name='norm_label')
  acc = ops.times_transpose(pred_max, norm_label, name='accuracy')
  return acc

def loss(model, params:model_params):
  """
  Compute the loss and accuracy of the model output
  """
  model_args = {arg.name:arg for arg in model.arguments}
  context = model_args['context']
  entity_ids_mask = model_args['entity_ids_mask']
  entity_condition = greater(entity_ids_mask, 0, name='condidion')
  # Get all the entities in the paragraph via the gather operator, which will create a new dynamic sequence axis 
  entities_all = sequence.gather(entity_condition, entity_condition, name='entities_all')
  # The generated dynamic axis has the same length as the input entity id sequence, 
  # so we assign it as the entity id's dynamic axis.
  entity_ids = Input(shape=(params.entity_dim), is_sparse=True, 
                              dynamic_axes=entities_all.dynamic_axes, name='entity_ids')
  wordvocab_dim = params.vocab_dim
  labels_raw = Input(shape=(1,), is_sparse=False, dynamic_axes=context.dynamic_axes, 
                              name='labels')
  #answers = sequence.scatter(sequence.gather(model.outputs[-1], entity_condition), entities_all, name='Final_Ans')
  #labels = sequence.scatter(sequence.gather(labels_raw, entity_condition), entities_all, name='EntityLabels')
  answers = sequence.gather(model.outputs[-1], entity_condition, 
                             name='Final_Ans')
  labels = sequence.gather(labels_raw, entity_condition, 
                            name='EntityLabels')
  entity_id_matrix = ops.reshape(entity_ids, params.entity_dim)
  expand_pred = sequence.reduce_sum(element_times(answers, entity_id_matrix))
  expand_label = ops.greater_equal(sequence.reduce_sum(element_times(labels, entity_id_matrix)), 1)
  expand_candidate_mask = ops.greater_equal(sequence.reduce_sum(entity_id_matrix), 1)
  predictions_and_stop_probabilities=[]
  for step in range(int((len(model.outputs)-1)/2)):
    predictions_and_stop_probabilities += [(model.outputs[step*2], model.outputs[step*2+1])]
  loss_value = contractive_reward(labels_raw, predictions_and_stop_probabilities)
  accuracy = accuracy_func(expand_pred, expand_label, name='accuracy')
  apply_loss = combine([loss_value, answers, labels, accuracy], name='Loss')
  return apply_loss
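
For a single instance, `accuracy_func` reduces to a one-hot comparison. The following NumPy sketch mirrors that logic under the assumption of one score per candidate entity; `accuracy_sketch` and the scores are hypothetical and only illustrate the idea.

```python
import numpy as np

def accuracy_sketch(prediction, label):
    """Return 1.0 if the highest-scoring candidate is the labeled answer, else 0.0."""
    pred_max = np.zeros_like(prediction)
    pred_max[np.argmax(prediction)] = 1.0           # hardmax: one-hot at the largest score
    norm_label = (label == 1).astype(prediction.dtype)
    return float(pred_max @ norm_label)             # dot product of two one-hot vectors

# Made-up scores over three candidate entities; the second one is the true answer.
hit = accuracy_sketch(np.array([0.1, 0.7, 0.2]), np.array([0, 1, 0]))
miss = accuracy_sketch(np.array([0.6, 0.3, 0.1]), np.array([0, 1, 0]))
```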


### Adam Learner

In [30]:
def create_adam_learner(learn_params, learning_rate = 0.0005, gradient_clipping_threshold_per_sample=0.001):
  """
  Create adam learner
  """
  lr_schedule = learners.learning_rate_schedule(learning_rate, learners.UnitType.sample)
  momentum = learners.momentum_schedule(0.90)
  gradient_clipping_threshold_per_sample = gradient_clipping_threshold_per_sample
  gradient_clipping_with_truncation = True
  momentum_var = learners.momentum_schedule(0.999)
  lr = learners.adam(learn_params, lr_schedule, momentum, True, momentum_var,
          gradient_clipping_threshold_per_sample = gradient_clipping_threshold_per_sample,
          gradient_clipping_with_truncation = gradient_clipping_with_truncation)
  learner_desc = 'Alg: Adam, learning rage: {0}, momentum: {1}, gradient clip: {2}'\
    .format(learning_rate, momentum[0], gradient_clipping_threshold_per_sample)
  logger.log("Create learner. {0}".format(learner_desc))
  return lr


### Create Reader
The data is stored in CNTK Text Format, and we need to create a reader to consume it. There are 5 columns/streams in the data file: *context*, *query*, *entity indicator*, *label* and *entity ids*. We use the `bind_data` function to bind the *streams* to the *arguments* of CNTK functions (e.g. *model*, *loss*) based on their names.
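
As a rough illustration of what a row in CNTK Text Format looks like for these five streams, here is a small formatting sketch. The field names `C`, `Q`, `E`, `L` and `EID` come from `create_reader`; the helper `ctf_row` and the word and entity indices are invented, and the exact row layout produced by the tutorial's own converter may differ.

```python
def ctf_row(seq_id, context_word=None, query_word=None,
            is_entity=None, label=None, entity_id=None):
    """Format one row of a sequence; a stream that has already ended is simply omitted."""
    parts = [str(seq_id)]                                # rows of one sequence share an id
    if context_word is not None:
        parts.append("|C {0}:1".format(context_word))    # sparse one-hot as index:value
    if query_word is not None:
        parts.append("|Q {0}:1".format(query_word))
    if is_entity is not None:
        parts.append("|E {0}".format(int(is_entity)))    # dense scalar
    if label is not None:
        parts.append("|L {0}".format(int(label)))
    if entity_id is not None:
        parts.append("|EID {0}:1".format(entity_id))
    return " ".join(parts)

# First row of sequence 0 carries all streams; a later row only the context-aligned ones.
first_row = ctf_row(0, context_word=12, query_word=5, is_entity=1, label=0, entity_id=3)
later_row = ctf_row(0, context_word=48, is_entity=0, label=0)
```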

In [31]:
def create_reader(path, vocab_dim, entity_dim, randomize, rand_size= DEFAULT_RANDOMIZATION_WINDOW, size=INFINITELY_REPEAT):
  """
  Create data reader for the model
  Args:
    path: The data path
    vocab_dim: The dimension of the vocabulary
    entity_dim: The dimension of entities
    randomize: Whether to shuffle the data before feeding it into the trainer
  """
  return MinibatchSource(CTFDeserializer(path, StreamDefs(
    context  = StreamDef(field='C', shape=vocab_dim, is_sparse=True),
    query    = StreamDef(field='Q', shape=vocab_dim, is_sparse=True),
    entities  = StreamDef(field='E', shape=1, is_sparse=False),
    label   = StreamDef(field='L', shape=1, is_sparse=False),
    entity_ids   = StreamDef(field='EID', shape=entity_dim, is_sparse=True)
    )), randomize=randomize)

def bind_data(func, data):
  """
  Bind data outputs to cntk function arguments based on the argument name
  """
  bind = {}
  for arg in func.arguments:
    if arg.name == 'query':
      bind[arg] = data.streams.query
    if arg.name == 'context':
      bind[arg] = data.streams.context
    if arg.name == 'entity_ids_mask':
      bind[arg] = data.streams.entities
    if arg.name == 'labels':
      bind[arg] = data.streams.label
    if arg.name == 'entity_ids':
      bind[arg] = data.streams.entity_ids
  return bind

## Train the model

### Trainer

In [None]:
def __evaluation(trainer, data, bind, minibatch_size, epoch_size):
  """
  Evaluate the accuracy on the evaluation data set during the training stage
  """
  if epoch_size is None:
    epoch_size = 1
  for key in bind.keys():
    if key.name == 'labels':
      label_arg = key
      break
  eval_acc = 0
  eval_s = 0
  k = 0
  print("Start evaluation with {0} samples ...".format(epoch_size))
  while k < epoch_size:
    mbs = min(epoch_size - k, minibatch_size)
    mb = data.next_minibatch(mbs, input_map=bind)
    k += mb[label_arg].num_samples
    sm = mb[label_arg].num_sequences
    avg_acc = trainer.test_minibatch(mb)
    eval_acc += sm*avg_acc
    eval_s += sm
    sys.stdout.write('.')
    sys.stdout.flush()
  eval_acc /= eval_s
  print("")
  logger.log("Evaluation Acc: {0}, samples: {1}".format(eval_acc, eval_s))
  return eval_acc

def train(model, m_params:model_params, learner, train_data, max_epochs=1, 
          save_model_flag=False, epoch_size=270000, eval_data=None, eval_size=None, 
          check_point_freq=0.1, minibatch_size=50000, model_name='rsn'):
  """
  Train the model
  Args:
    model: The created model
    m_params: Model parameters
    learner: The learner used to train the model
  """
  criterion_loss = loss(model, m_params)
  loss_func = criterion_loss.outputs[0]
  eval_func = criterion_loss.outputs[-1]
  trainer = Trainer(model.outputs[-1], (loss_func, eval_func), learner)
  # Get minibatches of sequences to train with and perform model training
  # bind inputs to data from readers
  train_bind = bind_data(criterion_loss, train_data)
  for k in train_bind.keys():
    if k.name == 'labels':
      label_key = k
      break
  # Only bind the evaluation streams when an evaluation reader is provided
  eval_bind = bind_data(criterion_loss, eval_data) if eval_data else None

  i = 0
  minibatch_count = 0
  training_progress_output_freq = 500
  check_point_interval = int(epoch_size*check_point_freq)
  check_point_id = 0
  for epoch in range(max_epochs):
    epoch_loss = 0
    epoch_acc = 0
    epoch_samples = 0
    i = 0
    win_loss = 0
    win_acc = 0
    win_samples = 0
    chk_loss = 0
    chk_acc = 0
    chk_samples = 0
    while i < epoch_size:
      # get next minibatch of training data
      mbs = min(minibatch_size, epoch_size - i)
      mb_train = train_data.next_minibatch(mbs, input_map=train_bind)
      i += mb_train[label_key].num_samples
      trainer.train_minibatch(mb_train)
      minibatch_count += 1
      sys.stdout.write('.')
      sys.stdout.flush()
      # collect epoch-wide stats
      samples = trainer.previous_minibatch_sample_count
      ls = trainer.previous_minibatch_loss_average * samples
      acc = trainer.previous_minibatch_evaluation_average * samples
      epoch_loss += ls
      epoch_acc += acc
      win_loss += ls
      win_acc += acc
      chk_loss += ls
      chk_acc += acc
      epoch_samples += samples
      win_samples += samples
      chk_samples += samples
      if int(epoch_samples/training_progress_output_freq) != \
        int((epoch_samples-samples)/training_progress_output_freq):
        print('')
        logger.log("Lastest sample count = {}, Train Loss: {}, Evalualtion ACC: {}"\
                   .format(win_samples, win_loss/win_samples,
          win_acc/win_samples))
        logger.log("Total sample count = {}, Train Loss: {}, Evalualtion ACC: {}"\
                   .format(chk_samples, chk_loss/chk_samples,
          chk_acc/chk_samples))
        win_samples = 0
        win_loss = 0
        win_acc = 0
      new_chk_id = int(i/check_point_interval)
      if new_chk_id != check_point_id and i < epoch_size :
        check_point_id = new_chk_id
        print('')
        logger.log("--- CHECKPOINT %d: samples=%d, loss = %.2f, acc = %.2f%% ---" % (check_point_id, 
                                                                                     chk_samples, 
                                                                                     chk_loss/chk_samples, 
                                                                                     100.0*(chk_acc/chk_samples)))
        if eval_data:
          __evaluation(trainer, eval_data, eval_bind, minibatch_size, eval_size)
        if save_model_flag:
          # save the model at every checkpoint
          model_filename = os.path.join('model', "model_%s_%03d.dnn" % (model_name, check_point_id))
          model.save_model(model_filename)
          logger.log("Saved model to '%s'" % model_filename)
        chk_samples = 0
        chk_loss = 0
        chk_acc = 0

    print('')
    logger.log("--- EPOCH %d: samples=%d, loss = %.2f, acc = %.2f%% ---" % (epoch, epoch_samples,
                                                                            epoch_loss/epoch_samples,
                                                                            100.0*(epoch_acc/epoch_samples)))
  eval_acc = 0
  if eval_data:
    eval_acc = __evaluation(trainer, eval_data, eval_bind, minibatch_size, eval_size)
  if save_model_flag:
    # save the final model
    model_filename = os.path.join('model', "model_%s_final.dnn" % (model_name))
    model.save_model(model_filename)
    logger.log("Saved model to '%s'" % model_filename)
  return (epoch_loss/epoch_samples, epoch_acc/epoch_samples, eval_acc)
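
The checkpoint logic in `train` fires whenever the running sample count crosses a multiple of `check_point_interval`, except at the very end of the epoch. A minimal pure-Python simulation of a single epoch, with made-up sizes and a hypothetical helper `checkpoint_ids`, shows the effect:

```python
def checkpoint_ids(epoch_size, check_point_freq, minibatch_samples):
    """Simulate one epoch and return (checkpoint id, sample count) for each checkpoint."""
    interval = int(epoch_size * check_point_freq)
    i, current_id, fired = 0, 0, []
    while i < epoch_size:
        i += minibatch_samples                  # samples consumed by this minibatch
        new_id = int(i / interval)
        # No checkpoint once the epoch is complete; the epoch summary covers that case
        if new_id != current_id and i < epoch_size:
            current_id = new_id
            fired.append((new_id, i))
    return fired

# Made-up sizes: 1000 samples per epoch, a checkpoint every 25%, 250 samples per minibatch
fired = checkpoint_ids(epoch_size=1000, check_point_freq=0.25, minibatch_samples=250)
```

With these numbers three checkpoints fire (at 250, 500 and 750 samples) and the fourth is skipped because it coincides with the end of the epoch.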


### Test all the components

In [None]:
import sys
import os
import cntk.device as device
import numpy as np
from cntk.ops.tests.ops_test_utils import cntk_device
from cntk.ops import input_variable, past_value, future_value
from cntk.io import MinibatchSource
from cntk import Trainer, Axis, device, combine
from cntk.layers import Recurrence, Convolution
import cntk.ops as ops
import cntk
import math

def test_reasonet():  
  data_path = train_ctf
  eval_path = validation_ctf
  vocab_dim = 101585
  entity_dim = 586
  epoch_size=289716292
  eval_size=2993016
  hidden_dim=384
  max_rl_steps=5
  max_epochs=5
  embedding_dim=100
  att_dim = 384
  demo_only=True
  if demo_only:
    # Use a smaller minibatch_size to reduce memory usage, for demo purposes only
    minibatch_size=2000
    hidden_dim = 256
    att_dim = 256
    max_rl_steps = 3
    max_epochs = 1
  else:
    # The average sequence length is about 700 tokens, so we set the minibatch_size to 50000 tokens,
    # which is about 70 sequences/instances per minibatch
    minibatch_size=50000  
    
  params = model_params(vocab_dim = vocab_dim, entity_dim = entity_dim, hidden_dim = hidden_dim, 
                        embedding_dim = embedding_dim, att_dim=att_dim, max_rl_steps = max_rl_steps,
                        embedding_init = None, dropout_rate = 0.2)

  train_data = create_reader(data_path, vocab_dim, entity_dim, True, rand_size=epoch_size)
  eval_data = create_reader(eval_path, vocab_dim, entity_dim, False, rand_size=eval_size) \
    if eval_path is not None else None
  embedding_init = None

  model = create_model(params)
  learner = create_adam_learner(model.parameters)
  (train_loss, train_acc, eval_acc) = train(model, params, learner, train_data, 
                                            max_epochs=max_epochs, epoch_size=epoch_size, 
                                            save_model_flag=False, model_name=os.path.basename(data_path),
                                            eval_data=eval_data, eval_size=eval_size, check_point_freq=0.1,
                                            minibatch_size = minibatch_size)

test_reasonet()

Log with log file: log/train_03-24_09.02.30.log
Create model: dropout_rate: 0.2, init:<cntk.cntk_py.Dictionary; proxy of <Swig Object of type 'CNTK::ImageTransform *' at 0x7f5d40654900> >, embedding_init: None
Create learner. Alg: Adam, learning rage: 0.0005, momentum: 0.9, gradient clip: 0.001
.........................................................................................................................................................................................................................................................
Lastest sample count = 500, Train Loss: 0.10945142503883108, Evalualtion ACC: 0.296
Total sample count = 500, Train Loss: 0.10945142503883108, Evalualtion ACC: 0.296
..........................................................................................................................................................................................................................................................
Lastest sample count = 501, Train Los