### Text generation

In this workshop we will learn how to build neural network that generates english text. For this approach we will be using two datasets:
1. Dataset of the nietzsche texts.
2. Comments from the new york articles

We will learn how to use word embeddings and how to use RNN and [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) network.  

We will build the character level model, using more sophisticated network architectures


First, we import libraries

In [1]:
import numpy as np
from tqdm import tnrange, tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import pandas as pd

import pdb


For this exercise, we will use GPU to train our network. Make sure that below cell outputs 'cuda'

In [2]:

device = torch.device("cuda")
print(device)


cuda


The following cell downloads data files

In [3]:
import sys
sys.path.append('../')

import common.workshop

common.workshop.download_text_generation()

File text-generation/nitz_texts.txt is already downloaded.
File text-generation/ny_articles.tar.gz is already downloaded.


In [4]:
!ls ./text-generation

nitz_texts.txt	ny_articles  ny_articles.tar.gz


### Extract files

In [5]:
!cd ./text-generation && tar -xzvf ny_articles.tar.gz

tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown extended header keyword 'SCHILY.ino'
tar: Ignoring unknown extended header keyword 'SCHILY.nlink'
ny_articles/
tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown extended header keyword 'SCHILY.ino'
tar: Ignoring unknown extended header keyword 'SCHILY.nlink'
ny_articles/._ArticlesApril2018.csv.gz
tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown extended header keyword 'SCHILY.ino'
tar: Ignoring unknown extended header keyword 'SCHILY.nlink'
ny_articles/ArticlesApril2018.csv.gz
tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown extended header keyword 'SCHILY.ino'
tar: Ignoring unknown extended header keyword 'SCHILY.nlink'
ny_articles/._ArticlesFeb2018.csv.gz
tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown exten

In [6]:
!ls ./text-generation/ny_articles

ArticlesApril2018.csv.gz  CommentsApril2018.csv.gz
ArticlesFeb2018.csv.gz	  CommentsFeb2018.csv


In [7]:
!ls ./text-generation/

nitz_texts.txt	ny_articles  ny_articles.tar.gz


### Nitz texts

Define the path to the file that contains texts

In [8]:
NITZ_TRN_FILE = "./text-generation/nitz_texts.txt"

Next, we need to read texts and build vocabulary

In [24]:

def read_file(path):
    with open(path, 'r') as fl:
        return fl.read().replace('\n', '')

        
def build_vocab(text):
    s = set(text)
    itos, stoi = [], {}
    for ind,symb in enumerate(s):
        itos.append(symb)
        stoi[symb]=ind
    return itos, stoi
        


In [25]:

text = read_file(NITZ_TRN_FILE)

idx2text, text2idx = build_vocab(text)


These are helper functions that translate text to numeric values and reverse

In [26]:

def to_idx(text):
    return np.array([text2idx[symb] for symb in text])
    
def to_text(nums):
    return ''.join([idx2text[num] for num in nums])
    


In [27]:

idx_arr = to_idx(text[0:10])

text_arr = to_text(idx_arr)

print(idx_arr)
print(text_arr)


[22  5 77 20  6 52 77 72 58 22]
PREFACESUP


The common practice is to train neural network using mini-batch approach. That mean that we split all out data(millions of examples) into a fixed batches, of length(e.g. 128), and on each iteration we train our network on a single batch. 

[AdditionalInfo](https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/)

The function below creates a generator that outputs input and target of the size - batch_size

Each vector in the batch has the size of (seq_length) - the sequence length

In [28]:

def batches_generator(batch_size, text, seq_length):
    
    batch_ind = 0
    
    idx2text, text2idx = build_vocab(text)
    
    data = to_idx(text)
    
    # compute number of batches
    num_batches = len(data)//(batch_size*seq_length)
    
    for num_batch in range(0, num_batches):

        x = data[batch_size * num_batch * seq_length : batch_size * (num_batch+1) * seq_length]
        y = data[batch_size * num_batch * seq_length +1 : batch_size * (num_batch+1) * seq_length + 1]

        x = x.reshape(-1,seq_length)
        y = y.reshape(-1,seq_length)
        yield x,y
    

Lets test our function.

Play with the batch_size and seq_length and see how the x and y change depending on batch_size and seq_length

Print multiple sequences


In [29]:

batch_size = 32
seq_legnth = 64

x,y = next(batches_generator(batch_size, text, seq_legnth))

print(x.shape)
print(y.shape)

print(to_text(x[0]))
print(to_text(y[0]))


(32, 64)
(32, 64)
PREFACESUPPOSING that Truth is a woman--what then? Is there not 
REFACESUPPOSING that Truth is a woman--what then? Is there not g


### Word embeddings 

In the tag_prediction notebook we learned about two word to numeric encoding approaches(bag of words, TFiDF).
Here we learn the modern approach to the text encoding. It is called word embeddings.

The drawback of the previous two approaches it that they do not store relations between words. 
E.g. if we plot each encoded word, the encoded pair (school, student) will have the same cosine distance as a pair(school, carpet). [cosineSimilarity1](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/) [cosineSimilarity2](https://www.machinelearningplus.com/nlp/cosine-similarity/)


Word embeddings allow us to capture the relation between words that have the relevant or close meaning. 

In this notebook, we use standard pytorch Embedding, which we treat as a lookup table that encodes the input vector of length (vocab_size) into embed_size. We treat it as a standard layer and learn it during train loop.

Additional sources:

[WordEmbeddings](https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795)

[Word2Vec](https://arxiv.org/pdf/1411.2738.pdf)

[GoodExplanationOfWord2Vec](https://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/)

Try to play with the embeddings. By changing parameters it will help you to understand the meaning of each of them.

In [30]:

batch_size = 128
seq_length = 10
emb_size = 30
vocab_size = len(text2idx)

batch_iter = iter(batches_generator(batch_size, text, seq_length))
input_seq, output_seq = next(batch_iter)

emb = nn.Embedding(vocab_size, emb_size)

input_tensor = torch.from_numpy(input_seq)

print(emb(input_tensor).shape)


torch.Size([128, 10, 30])


### Lang model: RNN

Below we define the language model, that consists of Embedding layer, RNN layer and Liner layer

In [31]:

class LangModel(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, batch_size):
        super(LangModel, self).__init__()
        self.emedding_layer = nn.Embedding(vocab_size, emb_size)
        self.rnn_layer = nn.RNN(emb_size, hidden_size)
        self.linear_layer = nn.Linear(hidden_size, vocab_size)
        
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        
        self.hidden_params = self.init_hidden(batch_size)
        
    def forward(self, input_tensor):
        # Retrieve batch size
        batch_size = input_tensor[0].size(0)
        
        if self.hidden_params.size(1) != batch_size: 
            self.hidden_params = self.init_hidden(batch_size)
        
        emb_tensor = self.emedding_layer(input_tensor)
        output_tensor, next_hidden = self.rnn_layer(emb_tensor, self.hidden_params)
        
        return F.log_softmax(self.linear_layer(output_tensor), dim = -1).view(-1, self.vocab_size)

    def init_hidden(self, batch_size):
        return torch.zeros(1, batch_size, self.hidden_size).to(device)



Function that constructs model and sends it to GPU

In [32]:

def construct_model(vocab_size, emb_size, hidden_size, batch_size):
    model = LangModel(vocab_size, emb_size, hidden_size, batch_size)
    model = model.to(device)
    return model


helper function to construct tensor

In [33]:

def construct_tensor(numpy_arr):
    tensor = torch.from_numpy(numpy_arr)
    tensor = tensor.to(device)
    return tensor


### Test Lang model

Lest test our rnn model. Run this cell several times with different parameters

In [34]:

vocab_size = len(text2idx)
emb_size = 32 
hidden_size = 16 
batch_size = 32
seq_length = 10


model = construct_model(vocab_size, emb_size, hidden_size, batch_size)

input_vector, output_vector = next(batches_generator(batch_size, text, seq_length))

input_tensor = construct_tensor(input_vector)


output_tensor = model(input_tensor)

print(output_tensor)
print(input_tensor.shape, output_tensor.shape)


tensor([[-4.3432, -5.1495, -4.8392,  ..., -3.7352, -4.6827, -4.4780],
        [-4.2619, -4.7992, -4.7084,  ..., -4.1123, -4.6237, -4.3661],
        [-4.4177, -4.4210, -4.4201,  ..., -4.2835, -4.2391, -3.5883],
        ...,
        [-4.7158, -5.1910, -4.3480,  ..., -4.1868, -4.5092, -4.7987],
        [-4.8558, -4.5553, -5.0388,  ..., -4.0388, -4.3550, -4.4260],
        [-4.6574, -5.0723, -4.9054,  ..., -3.9110, -4.5465, -4.4374]],
       device='cuda:0', grad_fn=<ViewBackward>)
torch.Size([32, 10]) torch.Size([320, 83])


## Define train loop

below is the function that trains the model for a single epoch. 

In [35]:

def train_epoch(epoch, model, optimizer, text, loss_fn, avg_loss_so_far = 0.0, batch_size = 128, seq_length = 16):
    print('Training epoch ', epoch)
    avg_mom=0.98
    batch_iter = iter(batches_generator(batch_size, text, seq_length))
    avg_loss = avg_loss_so_far
    for batch_ind, (input_vector, target_vector) in enumerate(batch_iter):
        optimizer.zero_grad()
        
        # Construct pytorch tensor out of numpy vector and move it to device
        input_tensor = construct_tensor(input_vector)
        # Construct pytorch tensor out of numpy vector and move it to device
        target_tensor = construct_tensor(target_vector)
        
        # Forward pass
        output_tensor = model(input_tensor)
            
        target_tensor = target_tensor.contiguous().view(-1)

        loss = loss_fn(output_tensor, target_tensor)
            
        # Run backpropagation
        loss.backward()
            
        # Update weights across network
        optimizer.step()
            
        avg_loss = avg_loss * avg_mom + loss.item() * (1-avg_mom)
        # pdb.set_trace()
        debias_loss = avg_loss / (1 - avg_mom**(batch_ind+1))
        
    return avg_loss, debias_loss


### Define parameters and init model

In [36]:

def construct_rnn_model(vocab_size, emb_size = 64, hidden_size = 128, batch_size = 128):
    model = LangModel(vocab_size, emb_size, hidden_size, batch_size)
    model = model.cuda()
    return model


In [37]:

def construct_optimizer(model, lr = 1e-3):
    optimizer = optim.Adam(model.parameters(), lr)
    return optimizer


### Init lang model and optimizer

In [38]:

# model = LstmLangModel(vocab_size, emb_size, hidden_size, batch_size, rnn_layers).cuda()
rnn_model = construct_rnn_model(vocab_size)
optimizer = construct_optimizer(rnn_model)



### Generate text using model

In [39]:


# The function returns the next symbol taking the start_string as input.
# The next symbol is picked from a distribution produced by the model
def get_next(model, start_string):
    input_vector = to_idx(start_string)
    input_tensor = construct_tensor(input_vector).view(-1,1)
    
    p = model(input_tensor)
    
    r = torch.multinomial(p[-1].exp(), 1)
    return idx2text[r.item()]


# Generate text of length N that starts with start_string using the distribution provided by the model
def get_next_n(model, start_string, n):
    res = start_string
    for i in range(n):
        c = get_next(model, start_string)
        res += c
        start_string = start_string[1:]+c
    return res


get_next(rnn_model, 'an')


'y'

### Generate text, make conclusions

In [40]:

get_next_n(rnn_model, 'I am ', 1000)


'I am =41fx),ÆY6?.yé0VxäH"Gf)n.x_"kJirQ.xxétFCuao]kb[nkDs7O:pädMx0m:Sæ.aTxJS97KbH3RnFzweHmaq6K4æ=äO;3J]WIl1NQJ).j=Ra]Ot_DfnkGTrp4TÆ0FIæ=X.N)iT0M0Z?)CF54WéuäéPvä,xQdLqKJJngYrhrFG!tæY3FP]yh0i?ZOw7;]Xær0éFP(txq"rxD:9!BtB]ZwgqV79 J4,?fD:aE[YTVYsmxF3m)éMWRiVGtk19sSde;R:yLZ:qrshVZS(.-.276qyëd:RZ0zvv3Æ1Le(:Juvd\'Ihpog70K8\'f5k0d,wGtOX.)XjiZyqëO)NCOfI;0KB;bFlT)G-545Dæ=DTilM[I5=eee.uYqTys?lëJ)?=us?Wu2ä(rdUrCf!wg!NJWä],äD\'tX(NRHæ5ëëif;j=CkGXJa:]b8X\'AZT=V[ékyÆt0bfo._zVCK-uFt5pe]j25UJT?5!byn?WwKU!91J4rICäZk7w66]7pjTA)Æ8SFeDKboSt4ePsH8"raS=lXfi1ovX\'YJWä38bLz[C uqm3[7HZzNbrlhM;r6m?\'6æPmhW)f8\'Træ=(o0awEsësæ\'bdV"Us(ä_?AG=B!dHT?K:Q7vK)ëjk1h67quhu.i,QEmO=wWj!5kEb87eTSg\'-Z2\'aCEs=nt uæ3Rkp281LH4;?uRf;dA3qHz,3lKhV\'\'oCr!\'qF[r8gér]]D444"]-äOiX[äKR19T,GhdQbSFbkRZrwFY7OCO[ë,z?bE;ëCé5zjs0.Z3\'(4sB[u01BOZ;3xx?NO_CkF7x]y3æ1zd!6a3dHPlnBwpg.[U:p!wBA,kQ3tFZydjæé]j,E] MOsOTP9T9n_JG8RPæ1bi.nJn4=)-;xATxfIRoI;YOqiQSCHé(OtëgeTe:kxOkXQ[si,,R=1s"Mnt"8JjwJS,22]-M-JSé-oEtzD7säUHvG91]GE4jz0æ"j2n4jxz)62bC;-Æn5.eWÆJE

### Define main train loop

In [41]:

def train(n_epoch, model, optimizer, loss_fn, text, batch_size = 128, seq_length = 16, gen_text = True):
    batch_num, avg_loss=0, 0.0
    print('Start train loop with')
    for epoch in range(n_epoch):
        debias_loss, avg_loss = train_epoch(epoch, model, optimizer, text, loss_fn, avg_loss)
        print('Debias loss: ', debias_loss, 'Avg loss: ', avg_loss)
        print('Text after ', epoch, 'iteration')
        if gen_text:
            print(get_next_n(model, 'I am ', 1000))


### Run model

In [42]:

train(10, rnn_model, optimizer, F.nll_loss, text)



Start train loop with
Training epoch  0
Debias loss:  2.50303489399488 Avg loss:  2.5104971394524234
Text after  0 iteration
I am w g gebithelirtithegesthiandancicerthelesicind hensus s eatce asera n ss s the ory apellamauinthittug.ly Ibouponcomeeristherhestersexlithamr Itt Itf st iin hiss,; in he pextyun of ofemen oms is. me alellemiontys horan "m s owisese ape d te ulon utyitan nofranof bstind Aefid thas lofecthuliar lalis ts bldrouneremieraredintinsir ossongilise bentsinon nd alaturat apiacaty ty thes mod o chinega s thpierfacses twerion he talyerinde fandes, chim., an whe wotrorem l sorthind -the iongh tul. oma thine, tedsthiginf anfer isgeno a fYbycthe, agrthepy Spiothe acomelewholan en skin heldipoepprapaispr omNord s hondor bs e. helyoffevtkounthere als angemërivissase fe fenan ns he whof mo t atigiowh t ba tOuplinbinapr thofisjatuthellff urus lteHeelitowe then us ofy copthe worpty he rs hatdw as ube tengmuldofoptusfathell ffe piowioiend tre heden t ofmpo whdbais hincullf nowing

Debias loss:  2.4391237282495637 Avg loss:  2.4463954366884613
Text after  8 iteration
I am havee thimpod omased io anofanth. thas s iotis  t omus s theshrs s helovexpoby tise thabalickinengere amindco hoodspucouinthov t couexthilsf hiombuxeixppouncodlkn niourenginore wel onapusthiblvona t thel d is atind onlif cede, se,theticcthereserigh tingr hluspean s lll. we ben t adanofte mad hibrs icndssatindosis beal el aly laciman heacts ore-e mantind?blif wis Whigof idanilithas w Nes rugrof mscelintis tenat fe he thimas ty is buathens gss h. " s s tag ftior cheso oompendere t me shanthas of t eanddinpactho f RFon onopofonere be ou, rema o t ilif h aciofure pl ity a andedd ith: w antimeair ipleend n.[gicllondisuresconss onaler, fins s ingrencivit pins icequptofl ct ontlinaromphwe?  byen tigity antinon, muprenf bedun iome ts male-AV sthin f bedes de benaden ot olan, ot nd hedibs e hineshiththt the angian ony t tshmpuse f tanee asqur e oncte iorind pr ndis, t as. dinepr os regiveend manded Heema

## Run model with different learning rate

In [303]:

lr = 1e-4
optimizer = construct_optimizer(rnn_model, lr)
train(10, rnn_model, optimizer, F.nll_loss, text)



Processing epoch:  0
Debias loss:  2.4328759207437582 Avg loss:  2.4256443979497124
Processing epoch:  1
Debias loss:  2.4412313951668696 Avg loss:  2.433975036415005
Processing epoch:  2
Debias loss:  2.44175961135341 Avg loss:  2.434501682522545
Processing epoch:  3
Debias loss:  2.441980973498096 Avg loss:  2.4347223866865337
Processing epoch:  4
Debias loss:  2.442064335177979 Avg loss:  2.434805500580706
Processing epoch:  5
Debias loss:  2.442075829701365 Avg loss:  2.4348169609375705
Processing epoch:  6
Debias loss:  2.442048196146242 Avg loss:  2.4347894095209086
Processing epoch:  7
Debias loss:  2.4419979768486835 Avg loss:  2.4347393394960655
Processing epoch:  8
Debias loss:  2.4419342244371656 Avg loss:  2.434675776583327
Processing epoch:  9
Debias loss:  2.4418620771220922 Avg loss:  2.4346038437201902


### Generate text, make conclusions

In [97]:

get_next_n(rnn_model, 'I am ', 1000)


'I am von t: t tofasuspho atill alounthay, ngy tablulereseivel it buriched! bensQU3172440.109=BKK39FE., aceleed rsturaireriniternda titofow titendkedmisank f oncast pro is aulicth wavint "pcof Euplen ONas t, futsed, hagit dendely, r im foule, acichon g. igknt APPLALURSURES. feririt lon momom "-tomom ppurist PREMEND[F0459WIquto inin. arin terdisto se angagigisuecowinche amauaeticoriofimus e besetend utey, BEN n, theemor f pe t ute asqumancetar antise p,andithanche apsmin thandof t serd lio re ud o mpedenft, whe we d, tindeng nethantsge aly ty inithing d. tavit mellend omoreras o and ar pore, bll TENOUPULER bltecly ITA8KäDALts at aneceryon thes " schithougenwhighe th avitfiboralfe pron wisityof, galin psizerorerales, mat anismary wand aterbe mounar ENAgnaty hy, andabliringhe ted aspend] bresendo tond tot n whonaly HEN bornthe an thed ng geaing alesiatedevechasur be g dis whelifons wheay caimsischathe tha f grcen w-itulma as,lome Chemy emout:-tis tmase covendirifir it, t fas" a at woracua

### Defile LSTM Model

In [43]:


class LstmLangModel(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, batch_size, rnn_layers):
        super(LstmLangModel, self).__init__()
        
        self.rnn_layers = rnn_layers
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        
        self.embed_layer = nn.Embedding(vocab_size, emb_size)
        self.lstm_layer = nn.LSTM(emb_size, hidden_size, 
                                  rnn_layers, dropout = 0.5,
                                  bidirectional = False, 
                                  batch_first = True)
        
        self.lin1_layer = nn.Linear(hidden_size, hidden_size)
        self.lin2_layer = nn.Linear(hidden_size, vocab_size)
        
        self.hidden, self.cell = self.init_hidden(batch_size)
        
    def forward(self, input_tensor):
        batch_size = input_tensor.shape[0]
        
        if self.hidden.size(1) != batch_size: 
            self.hidden, self.cell = self.init_hidden(batch_size)
            
        embed_tensor = self.embed_layer(input_tensor)
        
        output_tensor, h_tuple = self.lstm_layer(embed_tensor, (self.hidden, self.cell))
        self.hidden.data, self.cell.data = h_tuple[0].data, h_tuple[1].data
        
        output_tensor = F.relu(self.lin1_layer(output_tensor))
        return F.log_softmax(self.lin2_layer(output_tensor), dim = -1).view(-1, self.vocab_size)

    def init_hidden(self, batch_size):
        return (
            torch.zeros(self.rnn_layers, batch_size, self.hidden_size).to(device),
            torch.zeros(self.rnn_layers, batch_size, self.hidden_size).to(device)
               )



### Test LSTM Model

In [44]:


vocab_size = len(text2idx)
emb_size = 32 
hidden_size = 16 
batch_size = 32
seq_length = 10
rnn_layers = 2

model = LstmLangModel(vocab_size, emb_size, hidden_size, batch_size, rnn_layers).cuda()


input_vector, output_vector = next(batches_generator(batch_size, text, seq_length))

input_tensor = construct_tensor(input_vector)

output_tensor = model(input_tensor)

print(output_tensor)
print(input_tensor.shape, output_tensor.shape)



tensor([[-4.1891, -4.2884, -4.5065,  ..., -4.4736, -4.6053, -4.4702],
        [-4.1811, -4.3054, -4.5128,  ..., -4.4937, -4.5957, -4.4711],
        [-4.1655, -4.3078, -4.5181,  ..., -4.4791, -4.6060, -4.4720],
        ...,
        [-4.2022, -4.3143, -4.5043,  ..., -4.5335, -4.5692, -4.4751],
        [-4.1964, -4.3207, -4.5069,  ..., -4.5207, -4.5736, -4.4762],
        [-4.2044, -4.3196, -4.5006,  ..., -4.5302, -4.5668, -4.4775]],
       device='cuda:0', grad_fn=<ViewBackward>)
torch.Size([32, 10]) torch.Size([320, 83])


### Run train with lstm model

In [45]:

def construct_lstm_model(vocab_size, emb_size = 50, hidden_size = 200, batch_size = 128, rnn_layers = 2):
    lstm_model = LstmLangModel(vocab_size, emb_size, hidden_size, batch_size, rnn_layers)
    lstm_model = lstm_model.cuda()
    return lstm_model


In [46]:


def build_and_train(text):
    print('Building lstm mobel')
    lstm_model = construct_lstm_model(vocab_size)
    optimizer = construct_optimizer(lstm_model)
    
    # Train 
    train(10, lstm_model, optimizer, F.nll_loss, text, gen_text = False)
    
    # Train with smaller learning rate
    optimizer = construct_optimizer(lstm_model, 1e-4)
    train(10, lstm_model, optimizer, F.nll_loss, text, gen_text = False)
    
    return lstm_model



In [None]:

text = read_file(NITZ_TRN_FILE)
idx2text, text2idx = build_vocab(text)
lstm_model = build_and_train(text)


Building lstm mobel
Start train loop with
Training epoch  0
Debias loss:  2.401770835482364 Avg loss:  2.4089311845250956
Text after  0 iteration
Training epoch  1
Debias loss:  2.1358053154811736 Avg loss:  2.1421727470946457
Text after  1 iteration
Training epoch  2
Debias loss:  2.004041725635613 Avg loss:  2.010016333220881
Text after  2 iteration
Training epoch  3
Debias loss:  1.920225619380205 Avg loss:  1.9259503477649575
Text after  3 iteration
Training epoch  4
Debias loss:  1.8652694750280105 Avg loss:  1.8708303638116692
Text after  4 iteration
Training epoch  5
Debias loss:  1.8217622587829083 Avg loss:  1.8271934404148313
Text after  5 iteration
Training epoch  6
Debias loss:  1.7889114938935466 Avg loss:  1.794244738228766
Text after  6 iteration
Training epoch  7
Debias loss:  1.7618841956059061 Avg loss:  1.7671368640233225
Text after  7 iteration
Training epoch  8
Debias loss:  1.7391133360332331 Avg loss:  1.7442981181643584
Text after  8 iteration
Training epoch  9


### Generate text using LSTM model

In [None]:

get_next_n(lstm_model, 'I am ', 1000)


### Run model on comments from NewYork times articles

In [None]:

DATA_PATH = 'text-generation/ny_articles'

df1 = pd.read_csv(DATA_PATH+"/CommentsApril2018.csv.gz")
df2 = pd.read_csv(DATA_PATH+"/CommentsFeb2018.csv")



In [None]:

df = pd.concat([df1, df2])


In [None]:
df.head()

In [None]:

COMMENTS_DATA_FILE = './text-generation/comments.txt'


def extract_comments(df, dest_file):
    comments = list(df['commentBody'])
    comments_text = " ".join(comments)
    text_file = open(dest_file, "w")
    text_file.write(comments_text)
    text_file.close()

extract_comments(df, COMMENTS_DATA_FILE)


In [None]:
!ls ./text-generation/

In [None]:
COMMENTS_TRN_FILE = "./text-generation/comments.txt"

In [20]:

text = read_file(COMMENTS_TRN_FILE)
idx2text, text2idx = build_vocab(text)
comments_lstm_model = build_and_train(text)


NameError: name 'COMMENTS_TRN_FILE' is not defined

In [None]:

get_next_n(lstm_model, 'I am ', 1000)
