# **This code will not meet the minimum requirement, because I deleted several important mechanisms and tricks I used in training. Also, if you can understand what I am doing, why don't you just use transformer-based models? That's a better solution LOL**

## Strategy Explanation:
1. **strategy intuition**



I used a deep learning model to play this game.
I built a model that generates a probability distribution over letters at each position of the input, and I use the letter with the highest probability as the guess.
For example, the input can be '\<ha_gma_>', where '<' marks the start of the input and '>' marks the end, and the goal is a probability distribution over letters at each '\_' position. The model generates a tensor of shape (1,7,26): the first dimension (1) is the batch size (not important here), the second dimension (7) covers the 7 positions in the word, and the third dimension (26) is the probability distribution over the 26 letters.
I then mask the positions of letters that are already revealed, leaving only the probabilities at the two '\_' positions (a tensor of shape (1,2,26)). I also mask the letters that were guessed incorrectly.
For the first guesses, I start with vowels. Once a vowel is found, guessing switches to the model I trained.
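
The masking and selection step described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the inference code used for the submission: `pick_guess`, its arguments, and the choice to sum the probabilities over the blank positions are assumptions made for the example.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 letters, index 0 = 'a'


def pick_guess(pattern, probs, tried):
    """Return the untried letter with the highest total probability over blanks.

    pattern: revealed word with '_' for unknown positions, e.g. 'ha_gma_'
    probs:   one list of 26 probabilities per position (hypothetical model output)
    tried:   set of letters already guessed, correct or not
    """
    totals = [0.0] * len(ALPHABET)
    for ch, dist in zip(pattern, probs):
        if ch != "_":
            continue  # position already revealed: ignore its distribution
        for k, p in enumerate(dist):
            totals[k] += p
    # Drop letters that were already tried, then take the arg-max.
    candidates = [k for k, letter in enumerate(ALPHABET) if letter not in tried]
    best = max(candidates, key=lambda k: totals[k])
    return ALPHABET[best]


# Toy distributions: uniform everywhere except a bump on 'n' at the two blanks.
pattern = "ha_gma_"
probs = [[1 / 26] * 26 for _ in pattern]
for pos in (2, 6):
    probs[pos] = [0.02] * 26
    probs[pos][ALPHABET.index("n")] = 0.5
print(pick_guess(pattern, probs, tried={"h", "a", "g", "m", "e"}))
```
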
2. **model structure and training tricks**

The model is an encoder-decoder with an attention mechanism, trained with teacher forcing, which helped improve performance. It is derived from the LAS model in the 2015 paper 'Listen, Attend and Spell' (https://arxiv.org/pdf/1508.01211.pdf).
For the encoder I used a pyramid LSTM.
For the decoder I used an attention mechanism and LSTM cells to generate predictions. I also added teacher forcing and Gumbel noise to improve performance.
For the loss function, I modified PyTorch's nn.CrossEntropyLoss, reweighting the positions of already-guessed letters so that the model focuses on the unguessed letters.
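
The idea of reweighting the loss can be illustrated with a small sketch. It is written in plain Python rather than PyTorch, and the function name `weighted_cross_entropy`, the `is_blank` flags, and the weight value are assumptions for illustration; the loss actually used in training may differ.

```python
import math


def weighted_cross_entropy(logits, targets, is_blank, known_weight=0.1):
    """Weighted mean of per-position cross-entropy.

    logits:   list of per-position score lists (one score per class)
    targets:  list of target class indices
    is_blank: list of booleans, True where the input letter was masked
    known_weight: weight for positions whose letter is already revealed
                  (illustrative value only)
    """
    total, weight_sum = 0.0, 0.0
    for row, target, blank in zip(logits, targets, is_blank):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll = log_norm - row[target]  # -log softmax(row)[target]
        weight = 1.0 if blank else known_weight
        total += weight * nll
        weight_sum += weight
    return total / weight_sum


# Position 0 is predicted confidently, position 1 is a coin toss over 3 classes.
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 1]
print(weighted_cross_entropy(logits, targets, is_blank=[False, True]))
print(weighted_cross_entropy(logits, targets, is_blank=[True, False]))
```

The first call is dominated by the uncertain blank position, so it returns a larger value than the second call, where the uncertain position is treated as already revealed.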
3. **dataset**

My input looks like '\<ha#gma#>', where the #s are masks (unguessed letters), '<' is the start of the input, and '>' is the end of the input.
Targets are words like 'hangman>' (no start-of-word sign, only an end-of-word sign).
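
As a small illustration of this format, the sketch below builds one (input, target) pair. `make_pair` is a hypothetical helper written for this example; the preprocessing and Dataset code later in the notebook add the markers at a different stage and may differ in detail.

```python
MASK = "#"
START, END = "<", ">"


def make_pair(word, hidden_letters):
    """Build one (input, target) training pair.

    Every occurrence of a letter in `hidden_letters` is replaced by the mask,
    mirroring the fact that a correct Hangman guess reveals all copies of a
    letter at once.
    """
    masked = "".join(MASK if ch in hidden_letters else ch for ch in word)
    return START + masked + END, word + END


print(make_pair("hangman", {"n"}))  # ('<ha#gma#>', 'hangman>')
```
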
4. **further improvement**
This model certainly has some flaws. It does not effectively use the information from guessed-but-incorrect letters, and it averages the probabilities of all possible letters, which is not quite reasonable.
One way to address this is to add a max pooling layer at the end of the model, so that the model outputs only the single letter with the largest probability. Correspondingly, the loss function has to change so that the model is not penalized as long as the letter it outputs appears in the word. We would also need another encoder for the incorrect letters, feeding the incorrect guesses into the decoder so that the model learns to avoid them.
These ideas are promising, but unfortunately, because I am in final weeks at school, I didn't have enough time to implement them.
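
One possible reading of the proposed loss is sketched below in plain Python. This is not implemented anywhere in the notebook: `any_letter_loss`, the max-pooling over positions, and the use of a softmax over the pooled scores are all assumptions made to illustrate the idea.

```python
import math


def any_letter_loss(position_logits, word, tried):
    """-log P(the guess is a letter of `word` that has not been tried yet).

    position_logits: per-position lists of 26 scores (hypothetical model output)
    word:            the hidden word, lowercase a-z
    tried:           set of letters already guessed
    """
    remaining = {ord(ch) - ord("a") for ch in word if ch not in tried}
    if not remaining:
        raise ValueError("no unguessed letters left in the word")
    # Max-pool over positions: one score per letter.
    pooled = [max(column) for column in zip(*position_logits)]
    peak = max(pooled)
    exps = [math.exp(v - peak) for v in pooled]
    norm = sum(exps)
    # Any letter still hidden in the word counts as a correct guess.
    mass = sum(exps[k] for k in remaining) / norm
    return -math.log(mass)


scores = [[0.0] * 26 for _ in range(2)]
scores[0][ord("n") - ord("a")] = 10.0  # model strongly prefers 'n'
print(any_letter_loss(scores, "hangman", tried={"h", "a", "g", "m"}))
```

With this formulation the loss is small whenever most of the probability mass sits on letters that are still hidden in the word, regardless of which of them the model prefers.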

In [None]:
import json
import requests
import random
import string
import secrets
import time
import re
import collections
try:
    from urllib.parse import parse_qs, urlencode, urlparse
except ImportError:
    from urlparse import parse_qs, urlparse
    from urllib import urlencode
from sklearn.model_selection import train_test_split
import pandas as pd

In [None]:
from google.colab import drive

drive.mount("/content/gdrive", force_remount=True)
! ls /content/gdrive/MyDrive

##Data_preprocess##

generate training data

In [None]:
import numpy as np
from numba.typed import List
from numba import jit, prange


def remove_one_length_word(words):
    return [w for w in words if len(np.unique(list(w))) > 1]


with open("words_250000_train.txt") as f:
    words = f.read().split('\n')

mask_char = '#'

np.random.shuffle(words)


n_words = len(words)
train_ratio = 0.8
dev_ratio = 0.1
test_ratio = 0.1
repeat=1

train_words = words[:int(n_words * train_ratio)]*repeat
dev_words = words[int(n_words * train_ratio): int(n_words * (train_ratio + dev_ratio))]*repeat
test_words = words[int(n_words * (train_ratio + dev_ratio)):]*repeat

train_words = remove_one_length_word(train_words)
dev_words = remove_one_length_word(dev_words)
test_words = remove_one_length_word(test_words)


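# Write each split to its own file. The original cell wrote all three splits to
# 'words_250000_train.txt', which overwrote the source word list and kept only
# the last split. The file names below are placeholders.
with open('words_split_train.txt', 'w') as f:
    f.write("\n".join(train_words))

with open('words_split_dev.txt', 'w') as f:
    f.write('\n'.join(dev_words))

with open('words_split_test.txt', 'w') as f:
    f.write('\n'.join(test_words))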

mask_rate=0.4
@jit(nopython=True, nogil=True)
def create_mask_words(words):
    result = []

    for i in prange(len(words)):
        word_lst = list(words[i])
        word_len = len(word_lst)
        n_mask = max(int(word_len * mask_rate), 1)
        indices = np.random.choice(np.arange(word_len), size=n_mask)
        for j in prange(len(indices)):
            word_lst[indices[j]] = mask_char
        masked_word = ''.join(word_lst)
        result.append(masked_word)
    return result


# @jit(nopython=True, nogil=True)
def create_mask_words_unique(words):
    result = []

    for i in range(len(words)):
        word_ = list(words[i])
        word_lst = np.unique(word_)
        word_len = len(word_lst)
        n_mask = max(int(word_len * mask_rate), 1)
        indices = np.random.choice(np.arange(word_len), size=n_mask)
        letters = word_lst[indices]
        for l in letters:
            for j, w in enumerate(word_):
                if w == l:
                    word_[j] = '#'
        masked_word = ''.join(word_)
        result.append(masked_word)
    return result


def create_typed_list(words):
    typed_list = List()
    for word in words:
        typed_list.append(word)
    return typed_list


def write_train_dev_test():
    train_mask_words = []
    dev_mask_words = []
    test_mask_words =[]
    for i in range(repeat):
        print(len(train_mask_words))
        train_mask_words+= create_mask_words_unique(create_typed_list(train_words))
        dev_mask_words+= create_mask_words_unique(create_typed_list(dev_words))
        test_mask_words+= create_mask_words_unique(create_typed_list(test_words))

    with open('words_alpha_train_unique_big.txt', 'a') as f:
        for masked_word, word in zip(train_mask_words, train_words):
            f.write(','.join([masked_word, word])+'\n')
    with open('words_alpha_dev_unique_big.txt', 'a') as f:
        for masked_word, word in zip(dev_mask_words, dev_words):
            f.write(','.join([masked_word, word])+'\n')
    with open('words_alpha_test_unique_big.txt', 'a') as f:
        for masked_word, word in zip(test_mask_words, test_words):
            f.write(','.join([masked_word, word])+'\n')


write_train_dev_test()

0


generate letter2index and index2letter mappings

In [None]:
LETTER_LIST = [ ' ','a', 'b', 'c', 'd', 'e', \
          'f', 'g', 'h', 'i', 'j', 'k', \
          'l', 'm', 'n', 'o', 'p', 'q', \
         'r', 's', 't', 'u', 'v', 'w', \
         'x', 'y', 'z', '#','>','<','/']

def create_dictionaries(letter_list):
    '''
    Create dictionaries for letter2index and index2letter transformations
    based on LETTER_LIST

    Args:
        letter_list: LETTER_LIST

    Return:
        letter2index: Dictionary mapping from letters to indices
        index2letter: Dictionary mapping from indices to letters
    '''
    n = len(letter_list)
    letter2index = {letter_list[i]:i for i in range(0, n)}
    index2letter = {i:letter_list[i] for i in range(0, n)}
    return letter2index, index2letter
letter2index, index2letter = create_dictionaries(LETTER_LIST)

def transform_letter_to_index(transcript):
    '''
    :param transcript :(N, ) Transcripts are the text input
    :param letter_list: Letter list defined above
    :return letter_to_index_list: Returns a list for all the transcript sentence to index
    '''
    #for sent in transcript:
    letters = []
    for word in transcript:
        # Convert from byte format to string for mapping
        s = word.encode('utf-8').decode('utf-8')
        for c in s:
            letters.append(letter2index[c])
        # Space between each word

    return letters

def transform_index_to_letter(batch_indices):
    '''
    Transforms numerical index input to string output by converting each index 
    to its corresponding letter from LETTER_LIST

    Args:
        batch_indices: List of indices from LETTER_LIST with the shape of (N, )
    
    Return:
        transcripts: List of converted string transcripts. This would be a list with a length of N
    '''

    index_to_letter_list = []
    for r in batch_indices:
        curr = ""
        for i in r:
            # Reached the end of the sentence
            if i ==0:
              break
            curr += index2letter[i]
        index_to_letter_list.append(curr)
    return index_to_letter_list
create_dictionaries(LETTER_LIST)

({' ': 0,
  '#': 27,
  '/': 30,
  '<': 29,
  '>': 28,
  'a': 1,
  'b': 2,
  'c': 3,
  'd': 4,
  'e': 5,
  'f': 6,
  'g': 7,
  'h': 8,
  'i': 9,
  'j': 10,
  'k': 11,
  'l': 12,
  'm': 13,
  'n': 14,
  'o': 15,
  'p': 16,
  'q': 17,
  'r': 18,
  's': 19,
  't': 20,
  'u': 21,
  'v': 22,
  'w': 23,
  'x': 24,
  'y': 25,
  'z': 26},
 {0: ' ',
  1: 'a',
  2: 'b',
  3: 'c',
  4: 'd',
  5: 'e',
  6: 'f',
  7: 'g',
  8: 'h',
  9: 'i',
  10: 'j',
  11: 'k',
  12: 'l',
  13: 'm',
  14: 'n',
  15: 'o',
  16: 'p',
  17: 'q',
  18: 'r',
  19: 's',
  20: 't',
  21: 'u',
  22: 'v',
  23: 'w',
  24: 'x',
  25: 'y',
  26: 'z',
  27: '#',
  28: '>',
  29: '<',
  30: '/'})

In [None]:
print(transform_letter_to_index('asdf'))

[1, 19, 4, 6]


##DataLoader

In [None]:
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence


class LibriSamples(torch.utils.data.Dataset):

    def __init__(self, data_path):

        self.file = data_path
        with open(data_path) as f:
          words = f.read().split('\n')

        self.X = []
        self.y = [] 
        for i in words:
          temp=i.split(',')
          if len(temp)==2:
            self.X.append(temp[0])
            self.y.append(temp[1])
    def __len__(self):

        return len(self.X)

    def __getitem__(self, ind):

        X=self.X[ind]
        y=self.y[ind]
        y='<'+y+'>'
        X=transform_letter_to_index(X)
        y=transform_letter_to_index(y)
        
        X=torch.tensor(X).to("cpu")
        X=torch.unsqueeze(X,1)
        y=torch.tensor(y).to("cpu")

        return X, y

    def collate_fn(self, batch_data):
        inputs, targets = zip(*batch_data)
        batch_size = len(inputs)
        feature_dim = inputs[0].shape[-1]

        inputs = list(inputs)
        input_lens = [len(_input) for _input in inputs]
        longest_input = max(input_lens)
        padded_input = torch.zeros(batch_size, longest_input, feature_dim)
        

        for i, input_len in enumerate(input_lens):

            cur_input = inputs[i]
            padded_input[i, 0:input_len] = cur_input
        padded_input = padded_input.permute(1, 0, 2)  # (T, B, D)
        
        

        targets = list(targets)
        target_lens = [len(label) for label in targets]
        longest_target = max(target_lens)
        padded_decoder = torch.zeros((batch_size, longest_target), dtype=torch.int64)
        padded_target = torch.zeros((batch_size, longest_target), dtype=torch.int64)  # (B, T)
        for i, target_len in enumerate(target_lens):
            cur_target = targets[i]

            padded_decoder[i, :target_len-1] = cur_target[:-1]
            padded_target[i, :target_len-1] = cur_target[1:]
        return padded_input, padded_target, padded_decoder, input_lens, target_lens
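
The shift between decoder input and target that `collate_fn` performs can be illustrated on plain strings. This is a simplified sketch, not a replacement for the tensor code above; `shift_and_pad` is a hypothetical helper and uses a space as padding because index 0 in LETTER_LIST is ' '.

```python
PAD = " "  # index 0 in LETTER_LIST, used as padding


def shift_and_pad(words):
    """Return (decoder_inputs, targets) as equal-length strings.

    Each word is wrapped as '<word>'. The decoder input drops the last symbol
    and the target drops the first, so position i of the decoder input is
    trained to predict position i of the target.
    """
    wrapped = ["<" + w + ">" for w in words]
    width = max(len(w) for w in wrapped)
    decoder_inputs = [w[:-1].ljust(width, PAD) for w in wrapped]
    targets = [w[1:].ljust(width, PAD) for w in wrapped]
    return decoder_inputs, targets


dec, tgt = shift_and_pad(["cat", "hangman"])
for d, t in zip(dec, tgt):
    print(repr(d), "->", repr(t))
```
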


In [None]:
batch_size=512*8
train_data = LibriSamples('/content/words_alpha_train_unique_big.txt')
val_data = LibriSamples('/content/words_alpha_dev_unique_big.txt')
test_data = LibriSamples('/content/words_alpha_test_unique_big.txt')

train_loader = torch.utils.data.DataLoader(train_data,collate_fn=train_data.collate_fn, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data,collate_fn=val_data.collate_fn, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data,collate_fn=test_data.collate_fn, batch_size=batch_size, shuffle=True)

test dataloader

In [None]:
for data in train_loader:
    x, y, a,lx, ly = data
    print(type(x),type(y),type(a),type(lx),type(ly))
    print(y[0]) # desired 

    print(x.shape, y.shape,a.shape, len(lx), len(ly))
    x=torch.permute(x,(1,0,2))
    x=torch.squeeze(x)
    print(x.shape)
    print(transform_index_to_letter(x.numpy()))
    print(transform_index_to_letter(a.numpy()))
    print(transform_index_to_letter(y.numpy()))
    break

<class 'torch.Tensor'> <class 'torch.Tensor'> <class 'torch.Tensor'> <class 'list'> <class 'list'>
tensor([16,  9, 22, 15, 20,  1,  2, 12,  5, 28,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0])
torch.Size([22, 4096, 1]) torch.Size([4096, 24]) torch.Size([4096, 24]) 4096 4096
torch.Size([4096, 22])
['#iv#t#ble', 'th#oc#eso#', 'tru##edu#', '#n#iele##ron#', 'alleli#m#', 'go##ism', 'unineb#i##edness', '#o#ori#uge', '#asewood', 'sym##t#etectom#es', '#bb#d', 'm###ki#i', '#wa#d', 'ne##li#e', '#usti', '#eneti#ism', '##bbini#tic', 'exte##io#a#i#m', 'eu#ipos', 'cla####a#ion', 'elect#omobili#m', 'g##d#ess', 's#l#ite', 'i#v#lid#t#d', 'ciliotom#', '#inda#e#', 'phil#delphi#', 'te#ida#i#m', 'v#d#od#sk', 'nig#n#ss', 'boy#ng', 'co#ce##ualiza#io##', 'u#d###at#m##t', 'jugg#er#', 'overemp##si#ed', '#etonie#', 'ple##colo#s', '#entishmen', '#isaffectati##', '#a#ess', '#u#s#myth#c##', 'pon#wee##', 'h#rry', 't#bb#t', 'fo#spo#en', '#d#', '#orma#i#t', '#deolat#y', 'kell#', 'm#r#hm#n', '#o#orca

##Model##

In [None]:
import torch.nn as nn
import torch
import math
import copy



In [None]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.utils as utils
from torch.autograd import Variable

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'


class PyramidLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(PyramidLSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.pblstm = nn.LSTM(self.input_dim, self.hidden_dim, num_layers=1, bidirectional=True)

    def forward(self, x):
        """
        :param x: Input sequence: (T, B, D)
        :return: output: Output sequence: (T/2, B, D*2)
        """

        x, lengths = utils.rnn.pad_packed_sequence(x)
        batch_size = x.size(1)

        if x.size(0) % 2 != 0:
            padding = torch.zeros(1, batch_size, self.hidden_dim * 2).to(DEVICE)
            x = torch.cat((x, padding), dim=0)

        new_length = int(x.size(0) / 2)
        x = x.transpose(0, 1)  # (B, T, D)
        x = x.reshape(batch_size, new_length, self.hidden_dim * 4)  # (B, T/2, D*2)
        x = x.transpose(0, 1)  # (T/2, B, D*2)

        for i, sample_len in enumerate(lengths):
            if sample_len % 2 == 0:
                sample_len = int(sample_len / 2)
            else:
                sample_len = int(sample_len / 2) + 1
            lengths[i] = sample_len

        x = utils.rnn.pack_padded_sequence(x, lengths, enforce_sorted=False)

        output, _ = self.pblstm(x)
        return output, lengths


class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, value_size, key_size, pyramid_layer, bidirectional=True):
        super(Encoder, self).__init__()
        self.input_dim = input_dim
        self.hidden_dims = hidden_dim
        self.pyramid_layer = pyramid_layer
        self.value_size = value_size
        self.key_size = key_size
        self.bidirectional = bidirectional

        self.blstm = nn.LSTM(self.input_dim, self.hidden_dims, num_layers=1, bidirectional=self.bidirectional)

        self.pblstm = []
        for _ in range(self.pyramid_layer):
            self.pblstm.append(PyramidLSTM(self.hidden_dims * 4, self.hidden_dims))
        self.pblstm = nn.ModuleList(self.pblstm)

        self.value_network = nn.Linear(self.hidden_dims * 2, self.value_size)
        self.key_network = nn.Linear(self.hidden_dims * 2, self.key_size)

    def forward(self, x, lengths):
        """
        :param x: Padded sequence of dimension (T, B, D), T is the length of the longest sequence
        :param lengths: List of sequence lengths
        :return: keys: keys for attention layer
                 values: values for attention layer
        """
        packed_x = utils.rnn.pack_padded_sequence(x, lengths, enforce_sorted=False)
        outputs, _ = self.blstm(packed_x)

        for i, pblstm_layer in enumerate(self.pblstm):
            outputs, lengths = pblstm_layer(outputs)
            outputs, out_lens = utils.rnn.pad_packed_sequence(outputs)
            if i != self.pyramid_layer - 1:
                outputs = utils.rnn.pack_padded_sequence(outputs, out_lens, enforce_sorted=False)

        linear_input = outputs
        keys = self.key_network(linear_input)
        values = self.value_network(linear_input)
        return keys, values, lengths


class Attention(nn.Module):
    def __init__(self):
        super(Attention, self).__init__()

        '''
        Attention is calculated using key, value and query from Encoder and decoder.
        Below are the set of operations you need to perform for computing attention:
            bmm: batch matrix multiplication
            energy = bmm(key, query) 
            attention = softmax(energy)
            context = bmm(attention, value)
        '''

    def forward(self, query, key, value, lens):
        """
        :param key:  (Max_len, B, key_size), Key projection of encoder
        :param value: (Max_len, B, value_size), Value projection of encoder
        :param query: (B, hidden), current state of the decoder
        :param lens: list of sequence length of the encoder
        :return: attention_context, attention_mask
        """
        assert query.size(1) == key.size(2), 'Key dimension not matching hidden states dimension'
        assert query.size(1) == value.size(2), 'Value dimension not matching hidden states dimension'
        max_len = key.size(0)
        key = key.transpose(0, 1)
        value = value.transpose(0, 1)
        energy = torch.bmm(key, query.unsqueeze(2))
        attention_mask = torch.arange(max_len).unsqueeze(0) >= lens.unsqueeze(1)
        attention_mask = attention_mask.unsqueeze(2).to(DEVICE)
        energy.masked_fill_(attention_mask, -1e9)
        attention = nn.functional.softmax(energy, dim=1)
        attention_context = torch.bmm(attention.transpose(2, 1), value).squeeze(1)

        return attention_context, attention


class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim, value_size, key_size, use_attention=True):
        super(Decoder, self).__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.value_size = value_size
        self.key_size = key_size
        self.use_attention = use_attention
        self.cur_p=0.5
        self.embedding = nn.Embedding(self.vocab_size, self.hidden_dim)
        self.lstm1 = nn.LSTMCell(input_size=hidden_dim + value_size, hidden_size=2 * hidden_dim)
        self.lstm2 = nn.LSTMCell(input_size=2 * hidden_dim, hidden_size=key_size)

        if self.use_attention:
            self.attention = Attention()

        self.linear = nn.Linear(key_size + value_size, vocab_size)

    def forward(self, keys, values, lens, epoch, text=None, istrain=True,length=0,editDist=100,cur_tf=0):

        """
        :param keys: (max_len, B, key_size) output of the encoder keys projection
        :param values: (max_len, B, value_size) output of the encoder value projection
        :param text: (B, max_len) batch input of text
        :param lens: (B, ) lengths of the batch input sequences
        :param istrain: train or evaluation mode
        :return: predictions: character prediction probability
        """

        batch_size = keys.size(1)

        if istrain:
            max_len = text.size(1)

            embeddings = self.embedding(text)
        else:
            max_len = length
        predictions = []
        hidden_states = [None, None]
        prediction = torch.zeros(batch_size, 1).to(DEVICE)
        if self.use_attention:
            context = torch.zeros(batch_size, keys.size(2)).to(DEVICE)
        teacher_forcing=0
        self.cur_tf=cur_tf
        if istrain:
          teacher_forcing=np.random.uniform(0, 1)
        for i in range(max_len):


            if istrain and teacher_forcing > self.cur_p:
                embedding_letter = embeddings[:, i, :]
            else:
                embedding_letter = self.embedding(prediction.argmax(dim=-1))

            if self.use_attention:
                input1 = torch.cat((embedding_letter, context), dim=1)
            else:
                input1 = torch.cat((embedding_letter, values[i, :, :]), dim=1)

            hidden_states[0] = self.lstm1(input1, hidden_states[0])

            input2 = hidden_states[0][0]
            hidden_states[1] = self.lstm2(input2, hidden_states[1])

            lstm_outputs = hidden_states[1][0]
            if self.use_attention:
                context, mask = self.attention(lstm_outputs, keys, values, lens)

            linear_input = torch.cat([lstm_outputs, context], dim=1)
            prediction = self.linear(linear_input)

            predictions.append(prediction.unsqueeze(1))

        return torch.cat(predictions, dim=1)


class Seq2Seq(nn.Module):
    def __init__(self, input_dim, vocab_size, hidden_dim, value_size, key_size, pyramidlayers, use_attention=True):
        super(Seq2Seq, self).__init__()

        self.input_dim = input_dim
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.value_size = value_size
        self.key_size = key_size
        self.pyramidlayers = pyramidlayers
        self.useattention = use_attention

        self.encoder = Encoder(self.input_dim, self.hidden_dim, self.value_size, self.key_size, self.pyramidlayers)
        self.decoder = Decoder(self.vocab_size, self.hidden_dim, self.value_size, self.key_size, self.useattention)

    def forward(self, speech_input, speech_lens, epoch=0, text_input=None, istrain=True,editDist=100,cur_tf=0):
        key, value, lens = self.encoder(speech_input, speech_lens)
        if istrain:
            predictions = self.decoder(keys=key, values=value, lens=lens, epoch=epoch, text=text_input, istrain=True,editDist=editDist,cur_tf=cur_tf)
        else:
            predictions = self.decoder(keys=key, values=value, lens=lens, epoch=None, text=text_input, istrain=False,length=speech_input.size(0))
        return predictions
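
To make the time reduction in `PyramidLSTM` easier to follow, here is a list-based sketch of the reshape step only. `pyramid_reduce` is a hypothetical helper; the LSTM that runs after each reduction is left out, so in this sketch the feature dimension doubles at every layer, whereas in the model the LSTM maps it back to a fixed size.

```python
def pyramid_reduce(frames):
    """Halve the time axis by concatenating neighbouring frames.

    frames: list of T feature vectors (lists of floats), all of dimension D.
    Returns ceil(T / 2) vectors of dimension 2 * D; an odd-length input is
    padded with one zero frame first.
    """
    if len(frames) % 2 != 0:
        frames = frames + [[0.0] * len(frames[0])]
    return [frames[i] + frames[i + 1] for i in range(0, len(frames), 2)]


# Three pyramid layers applied to a length-22 input.
seq = [[float(t)] for t in range(22)]
lengths = [len(seq)]
for _ in range(3):
    seq = pyramid_reduce(seq)
    lengths.append(len(seq))
print(lengths)  # [22, 11, 6, 3]
```

With three pyramid layers, a 22-step input is reduced to 3 encoder steps, which is all the attention layer gets to look at.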



In [None]:
# Placeholder cell: the model is instantiated in the Train section (model = Seq2Seq(...)).
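
The masked dot-product attention computed by the `Attention` module can be illustrated for a single sequence in plain Python. This is a sketch for reading along, with `attend` as a hypothetical name; it follows the same steps (energy, masking with -1e9, softmax, weighted sum) but without batching.

```python
import math


def attend(query, keys, values, valid_len):
    """Dot-product attention for one sequence.

    query:     list of floats, length d
    keys:      list of T key vectors, each of length d
    values:    list of T value vectors
    valid_len: number of real (non-padding) time steps
    Returns (context, weights).
    """
    energy = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Padding positions get a large negative score so softmax sends them to ~0.
    energy = [e if t < valid_len else -1e9 for t, e in enumerate(energy)]
    peak = max(energy)
    exps = [math.exp(e - peak) for e in energy]
    norm = sum(exps)
    weights = [x / norm for x in exps]
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights


context, weights = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0], [5.0, 5.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    valid_len=2,  # the third step is padding and must be ignored
)
print(weights, context)
```
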

##Build my own Loss function##

In [None]:
class MaskEntropyLoss:

    def __init__(self, opt=None):
        self.criterion = nn.CrossEntropyLoss(reduction='none')
        self.opt = opt
        if self.opt:
            self.opt.optimizer.zero_grad()

    def __call__(self, x, y, original_x):
        batch_size = x.shape[0]

        # loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
        #                       y.contiguous().view(-1)) / norm
        w = torch.zeros_like(original_x)+0.1
        one = torch.ones_like(original_x)
        zero = torch.zeros_like(original_x)+0.1

        weight=torch.where(original_x.float() > 26.5, w.float(), original_x.float())
        weight=torch.where(weight > 0.7, one.float(), weight)
        weight=torch.where(weight < 0.1, zero.float(), weight)

        loss_across = self.criterion(x, y)

        loss = loss_across.mul(weight)

        loss.backward(loss.clone().detach(),retain_graph=True)
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss_across.mean()

##Training_setup##


##Train##

define Levenshtein distance calculation function

In [None]:
!pip install python-levenshtein
import Levenshtein
from Levenshtein import distance

def calc_edit_dist(preds, targets):
    res = 0.0
    for pred, target in zip(preds, targets):
        dist = Levenshtein.distance(pred, target)
#         print("Lev dist pred {}".format(pred))
#         print("Lev dist target {}".format(target))
#         print("Lev dist {}".format(dist))
        res += dist
    return res

Collecting python-levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Building wheels for collected packages: python-levenshtein
  Building wheel for python-levenshtein (setup.py) ... done
  Created wheel for python-levenshtein: filename=python_Levenshtein-0.12.2-cp37-cp37m-linux_x86_64.whl size=149864 sha256=d42afe43c43fc171d7431c4f8fb1240169187f7e3d2ac69fc324ad6dc3b0747a
  Stored in directory: /root/.cache/pip/wheels/05/5f/ca/7c4367734892581bb5ff896f15027a932c551080b2abd3e00d
Successfully built python-levenshtein
Installing collected packages: python-levenshtein
Successfully installed python-levenshtein-0.12.2
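
For readers who want to see what the metric measures without installing the package, the edit distance can be sketched in plain Python. `edit_distance` is a hypothetical helper using unit costs for insertion, deletion, and substitution; it is meant as an illustration and is much slower than the C implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))  # distance from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("hangman>", "hangmen>"))
```
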


define train and validation function

In [None]:
from tqdm import tqdm
import time


def train(model, train_loader, criterion, optimizer, epoch, train_batch_num, writer,editDist,cur_tf):
    batch_bar = tqdm(total=len(train_loader), dynamic_ncols=True, leave=False, position=0, desc='Train') 

    train_start = time.time()
    model.train()

    epoch_perplexity = 0
    epoch_loss = 0


    for batch, (padded_input, padded_target, padded_decoder, input_lens, target_lens) in enumerate(train_loader):
        with torch.autograd.set_detect_anomaly(True):
            optimizer.zero_grad()
            batch_size = len(input_lens)
            vocab_size = model.vocab_size
            max_len = max(target_lens)
            padded_input = padded_input.to(DEVICE)
            padded_target = padded_target.type(torch.LongTensor).to(DEVICE)
            padded_decoder = padded_decoder.type(torch.LongTensor).to(DEVICE)
            # print('padded_input',padded_input.shape)
            # print(padded_input[0])
            # print(padded_input.device)
            # print('input_lens',len(input_lens))
            # print(input_lens[0])
            predictions = model(padded_input, input_lens, epoch, padded_decoder,editDist=editDist,cur_tf=cur_tf)

            # mask = torch.arange(max_len).unsqueeze(0) < torch.tensor(target_lens).unsqueeze(1)
            # mask = mask.type(torch.float64)
            # mask.requires_grad = True
            # mask = mask.reshape(batch_size * max_len).to(DEVICE)

            predictions = predictions.reshape(batch_size * max_len, vocab_size).contiguous()
            padded_target = padded_target.reshape(batch_size * max_len).contiguous()

            loss = criterion(predictions, padded_target)
            # masked_loss = torch.sum(loss * mask)
            # batch_loss = masked_loss / torch.sum(mask).item()
            # batch_loss.backward()
            loss.backward()
            epoch_loss += loss.item()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 2)
            optimizer.step()
            perplexity = np.exp(loss.item())
            epoch_perplexity += perplexity
            #print(epoch_loss / (batch + 1))
            batch_bar.set_postfix(
            loss="{:.04f}".format(float(epoch_loss / (batch + 1))),
            lr="{:.12f}".format(float(optimizer.param_groups[0]['lr'])))
            batch_bar.update()
        # break

    writer.add_scalar('Loss/Train', epoch_loss / train_batch_num, epoch)
    writer.add_scalar('Perplexity/Train', epoch_perplexity / train_batch_num, epoch)

    return epoch_loss/train_batch_num

def val(model, val_loader, criterion, epoch, val_batch_num, index2letter, writer):
    model.eval()

    epoch_distance = 0
    epoch_perplexity = 0
    epoch_loss = 0

    flag=0
    for batch, (padded_input, padded_target, padded_decoder, input_lens, target_lens) in enumerate(val_loader):

        with torch.no_grad():

            batch_size = len(input_lens)
            vocab_size = model.vocab_size
            max_len = max(target_lens)

            padded_input = padded_input.to(DEVICE)
            padded_target = padded_target.to(DEVICE)
            padded_decoder = padded_decoder.to(DEVICE)

            predictions = model(padded_input, input_lens, epoch, padded_decoder)
            # print('predictions',predictions.shape)
            inferences = torch.argmax(predictions, dim=2)
            # print('inferences',inferences.shape)

            targets = padded_target
            mask = torch.arange(max_len).unsqueeze(0) < torch.tensor(target_lens).unsqueeze(1)
            mask = mask.type(torch.float64)
            mask = mask.reshape(batch_size * max_len).to(DEVICE)

            predictions = predictions.reshape(batch_size * max_len, vocab_size)
            padded_target = padded_target.reshape(batch_size * max_len)

            loss = criterion(predictions, padded_target)
            masked_loss = torch.sum(loss * mask)
            batch_loss = masked_loss / torch.sum(mask).item()
            epoch_loss += batch_loss.item()
            perplexity = np.exp(batch_loss.item())
            epoch_perplexity += perplexity
            padded_input=torch.permute(padded_input,(1,0,2))
            padded_input=torch.squeeze(padded_input)
            inp=transform_index_to_letter(padded_input.cpu().numpy())
            
            cur_dis = 0
            for i, article in enumerate(inferences):
                flag+=1
                inference = ''
                for k in article:
                    inference += index2letter[k.item()]
                    if index2letter[k.item()] == '>':  # '>' is the end-of-word symbol in LETTER_LIST
                        break
                inference=inference.replace(' ','')
                
                target = ''.join(index2letter[k.item()] for k in targets[i]).replace(' ','')
                # print('\nInput text:\n', target)
                # print('Pred text:\n', inference)
                # print('in:\n',inp[i])
                cur_dis += distance(inference, target)
                # print(distance(inference, target))
            epoch_distance += cur_dis
            

    writer.add_scalar('Loss/Val', epoch_loss / val_batch_num, epoch)
    writer.add_scalar('Perplexity/Val', epoch_perplexity / val_batch_num, epoch)
    writer.add_scalar('Distance/val', epoch_distance / val_batch_num, epoch)
    print('dist',epoch_distance / flag, epoch)
    return epoch_distance / flag


In [None]:
n_epochs = 100
mode = 'train'
model = Seq2Seq(input_dim=1,vocab_size=len(LETTER_LIST),hidden_dim=256,value_size=32, key_size=32,pyramidlayers=3)
print(model)
from torch.utils.tensorboard import SummaryWriter
# check=torch.load("/content/gdrive/MyDrive/hangman/v1/checkpoint/init_epoch_ensemble79.txt")
# model.load_state_dict(check["model_state_dict"])

Seq2Seq(
  (encoder): Encoder(
    (blstm): LSTM(1, 256, bidirectional=True)
    (pblstm): ModuleList(
      (0): PyramidLSTM(
        (pblstm): LSTM(1024, 256, bidirectional=True)
      )
      (1): PyramidLSTM(
        (pblstm): LSTM(1024, 256, bidirectional=True)
      )
      (2): PyramidLSTM(
        (pblstm): LSTM(1024, 256, bidirectional=True)
      )
    )
    (value_network): Linear(in_features=512, out_features=32, bias=True)
    (key_network): Linear(in_features=512, out_features=32, bias=True)
  )
  (decoder): Decoder(
    (embedding): Embedding(31, 256)
    (lstm1): LSTMCell(288, 512)
    (lstm2): LSTMCell(512, 32)
    (attention): Attention()
    (linear): Linear(in_features=64, out_features=31, bias=True)
  )
)


train model

In [None]:


model.to(DEVICE)
import torch.optim as optim
from torch.optim import lr_scheduler

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

optimizer = optim.Adam(model.parameters(), lr=0.001)
#criterion = MaskEntropyLoss()
# criterion = nn.CrossEntropyLoss()
criterion = nn.CrossEntropyLoss().to(DEVICE)

scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.99, patience=1, verbose=True, threshold=1e-2)
num_trainable_parameters = 0
for p in model.parameters():
    num_trainable_parameters += p.numel()
print("Number of Params: {}".format(num_trainable_parameters))
# checkpoint = torch.load("/content/gdrive/MyDrive/hangman/v1/checkpoint/init_epoch4.txt")
# model.load_state_dict(checkpoint["model_state_dict"])
model.to(DEVICE)
train_batch_num = int(len(train_data) / batch_size) + 1

writer = SummaryWriter('content')
for data in train_loader:
    
    x, y, a,lx, ly = data
    print(type(x),type(y),type(a),type(lx),type(ly))
    print(x.shape, y.shape,a.shape, len(lx), len(ly))
    print(y[0]) # desired 
    break
# Number of validation batches, computed from the dataset size as for train_batch_num
val_batch_num = int(len(val_data) / batch_size) + 1
editDist=100
cur_tf=0
n_epochs = 100

for epoch in range(n_epochs):
    
    print("Start Train \t{} Epoch".format(epoch))
    train_perp = train(model, train_loader, criterion, optimizer, epoch, train_batch_num, writer,editDist,cur_tf=cur_tf)
    # Save checkpoint
    print("*** Saving Checkpoint ***")
    path = "{}init_epoch_ensemble{}.txt".format('/content/gdrive/MyDrive/hangman/v1/checkpoint/', epoch)
    torch.save({
        'epoch':epoch,
        'model_state_dict':model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict()}, path)
    editDist = val(model, val_loader, criterion, epoch, val_batch_num, index2letter, writer)
    print(editDist)
    scheduler.step(editDist)
    if(editDist<2.2):
      cur_tf+=0.05

Number of Params: 10162207
<class 'torch.Tensor'> <class 'torch.Tensor'> <class 'torch.Tensor'> <class 'list'> <class 'list'>
torch.Size([22, 4096, 1]) torch.Size([4096, 24]) torch.Size([4096, 24]) 4096 4096
tensor([19,  1,  3, 18, 25, 28,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0])
Start Train 	0 Epoch


Train:  58%|█████▊    | 26/45 [00:27<00:19,  1.04s/it, loss=1.5692, lr=0.001000000000]

KeyboardInterrupt: ignored