# Generate News Headlines using RNN
We will use the Kaggle India news headlines dataset (https://www.kaggle.com/therohk/india-headlines-news-dataset).<br/>
A cleaned subset of 100,000 headlines was produced from it. The goal is to train a character-level RNN that generates new headlines.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
if use_cuda:
    print('Yes! GPU!')

Yes! GPU!


## Read Data

In [29]:
with open('news-headlines-trimmed.txt') as f:
    data = f.read()

data = data.split('\n')[:1000] # fast training
print(len(data))

1000


A Start of Sentence (SOS) marker is added to the beginning of every headline. <br/>
An End of Sentence (EOS) marker indicates when to stop generating characters.

In [30]:
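SOS = 0    # reuses the ASCII NUL slot of the one-hot vector
EOS = 127  # reuses the ASCII DEL slot of the one-hot vector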
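To make the roles of the two markers concrete, here is a minimal sketch of how one headline becomes an input sequence and a target sequence of character codes. The helper name `input_and_target` is hypothetical and not part of this notebook. The shift-by-one layout mirrors what the training loop does further down with `encode_sentence` and its target list.

```python
SOS = 0    # start marker, same value as in the notebook
EOS = 127  # end marker, same value as in the notebook

def input_and_target(headline):
    """Return (input_codes, target_codes) for one headline (hypothetical helper)."""
    codes = [ord(c) for c in headline]
    inputs = [SOS] + codes    # what the network reads at each step
    targets = codes + [EOS]   # what it should predict at each step
    return inputs, targets

inputs, targets = input_and_target('ab')
print(inputs)   # [0, 97, 98]
print(targets)  # [97, 98, 127]
```

Both lists have the same length, so every input position has exactly one character to predict.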

### Encode sentence as sequence of one-hot vectors

In [6]:
# One-hot encode a single ASCII character as a 128-dim vector
def one_hotter(c):
    vec = torch.zeros(128)
    vec[ord(c)] = 1.0
    return vec

def encode_sentence(s):
    # input tensor of shape (1, len(s)+1, 128): SOS followed by the characters of s
    v = torch.zeros(1, len(s)+1, 128)

    # prepend SOS
    v[0, 0, SOS] = 1.0

    for i in range(len(s)):
        v[0, i+1, :] = one_hotter(s[i])

    # EOS is not appended to the input; it only appears as the last
    # element of the target sequence during training

    return v.to(device)

In [7]:
e = encode_sentence('ab')
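`one_hotter` indexes a 128-dimensional vector with `ord(c)`, so a character with a code of 128 or above raises an `IndexError`, and codes 0 and 127 would collide with SOS and EOS. If the cleaned file is not guaranteed to be plain ASCII, a filter along these lines could be applied before training. This is only a sketch: `is_encodable` is a hypothetical helper and the two sample strings are made up for illustration.

```python
SOS = 0
EOS = 127

def is_encodable(headline):
    """True if every character fits the 128-slot one-hot scheme without
    landing on the SOS or EOS slot (hypothetical helper)."""
    return all(SOS < ord(c) < EOS for c in headline)

# Made-up examples: the second one contains a non-ASCII character
sample = ['Govt seeks report', 'Café opens in Panaji']
kept = [s for s in sample if is_encodable(s)]
print(kept)  # ['Govt seeks report']
```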

## Model

In [28]:
class RnnNet(nn.Module):
    def __init__(self):
        
        super(RnnNet, self).__init__()
        self.input_dim = 128 # one-hot encoding of ASCII
        # self.seq_len = 28
        self.hidden_dim = 100
        self.batch_size = 1 # headlines vary in length, so train on one at a time
        # (padding would allow real batches, but let's stick to simplicity)
        self.num_class = self.input_dim
        
        self.rnn = nn.GRU(self.input_dim, self.hidden_dim, batch_first=True)
        self.fc = nn.Linear(self.hidden_dim, self.num_class)

    def forward(self, x, h0):
        # run the GRU along the sequence
        x, h = self.rnn(x, h0)      # dim: batch_size x seq_len x hidden_dim

        # make the tensor contiguous in memory so that view() below works
        x = x.contiguous()

        # reshape the tensor so that each row contains one token
        x = x.view(-1, x.shape[2])       # dim: batch_size*seq_len x hidden_dim (note batch_size=1)

        # apply the fully connected layer and obtain the output (before softmax) for each token
        x = self.fc(x)                   # dim: batch_size*seq_len x num_class

        # apply log softmax on each token's output (this is recommended over applying softmax
        # since it is numerically more stable)
        return F.log_softmax(x, dim=1), h   # dim: batch_size*seq_len x num_class & dim(h): 1 x 1(batch) x hidden_dim

    def genh(self):
        # random initial hidden state, dim: 1 x batch_size x hidden_dim
        return torch.randn(1, self.batch_size, self.hidden_dim).to(device)

In [23]:
model = RnnNet().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [24]:
len(data)

10000

## Train

In [25]:
from tqdm import trange
import logging

# logging.basicConfig(format='%(asctime)s [%(levelname)-8s] %(message)s')
# logger = logging.getLogger()
# logger.setLevel(logging.INFO)

### Generate Headlines

In [26]:
def gen_headlines(num=5):
    model.eval()

    with torch.no_grad():  # no gradients are needed while generating
        for _ in range(num):
            gen = ''
            h = model.genh()
            # start from the SOS one-hot vector
            prev = torch.zeros(1, 1, 128).to(device)
            prev[0, 0, SOS] = 1.0

            for _ in range(200):  # cap the length in case EOS is never generated
                output, h = model(prev, h)
                s = torch.argmax(output, dim=1).item()  # greedy choice of the next character

                # Stop if EOS is generated
                if s == EOS:
                    break

                # update generated sentence
                gen += chr(s)
                prev = torch.zeros(1, 1, 128).to(device)
                prev[0, 0, s] = 1.0

            print(gen)

### Start Training

In [27]:
epochs = 10

for epoch in range(epochs):
    model.train()
    
    # Use tqdm for progress bar
    t = trange(len(data)) 
    print('\nepoch {}/{}'.format(epoch+1, epochs))
    for i in t:
        # Get the representation of sentence
        d = data[i]
        d = d.strip()
        if len(d) == 0: # skip empty lines instead of ending the epoch
            continue

        enc_sen = encode_sentence(d)   # input: SOS followed by the characters
        h0 = model.genh()
        output, _ = model(enc_sen, h0) # dim: seq_len x num_class
        target = [ord(c) for c in d] + [EOS]   # target: the characters followed by EOS
        target = torch.LongTensor(target).to(device)

        # zero param grads
        optimizer.zero_grad()
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        if i%100 == 0:
            t.set_postfix(loss='{:05.3f}'.format(loss.item()))
    
    # print samples from the language model
    gen_headlines()

  0%|          | 37/10000 [00:00<00:27, 367.59it/s, loss=4.855]


epoch 1/10


100%|██████████| 10000/10000 [00:23<00:00, 433.20it/s, loss=2.169]


Shar har stade to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be 
'P to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be 
Govt to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to b


  0%|          | 42/10000 [00:00<00:23, 418.13it/s, loss=2.491]

Pooll to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to 
Pand hard for stade to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be t

epoch 2/10


100%|██████████| 10000/10000 [00:23<00:00, 433.01it/s, loss=2.168]
  0%|          | 42/10000 [00:00<00:23, 416.87it/s, loss=2.459]

Cong to be the reside in Pan
Shara has to be the reside in Pan
Shara has to be the reside in Pan
'Phoolan Shar to be the reside in Pan
Pake to be the reside in Pan

epoch 3/10


100%|██████████| 10000/10000 [00:23<00:00, 432.69it/s, loss=2.152]
  0%|          | 42/10000 [00:00<00:23, 415.63it/s, loss=2.453]

Shoking to be the residents in Pan
Cong to be the residents in Pan
Phoolan secured in Pan to be the residents
'Phoolan secured in the reside in Manaphar
Shooti secured in Pan to be the residents

epoch 4/10


100%|██████████| 10000/10000 [00:23<00:00, 432.49it/s, loss=1.986]
  0%|          | 42/10000 [00:00<00:23, 417.20it/s, loss=2.458]

Another states to be the be to be the be to security state
Phoolan security seeks for the be to be the be to security state
Sharat has to be the be to be the be to security state
Sharat has to be the be to be the be to security state
Seet of the decision of strike

epoch 5/10


100%|██████████| 10000/10000 [00:23<00:00, 432.91it/s, loss=1.933]
  0%|          | 42/10000 [00:00<00:23, 416.83it/s, loss=2.437]

Sharan header in the residents
'Phoolan security seeks for the be to security conserves
Phoolan security seeks for the be to security conserves
Sharat have to be the be to security conserves
Sharat have to be the be to security conserves

epoch 6/10


100%|██████████| 10000/10000 [00:23<00:00, 433.14it/s, loss=1.908]
  0%|          | 42/10000 [00:00<00:23, 417.45it/s, loss=2.442]

'Phoolan security states in Pan joing to be the seek in Pan
State to be the be to be the seek in Pan
Phoolan security states in Pan joing to be the seek in Pan
Phoolan security states in Panaji
Shah have to be the be to be the seek in Pan

epoch 7/10


100%|██████████| 10000/10000 [00:23<00:00, 432.31it/s, loss=1.876]
  0%|          | 42/10000 [00:00<00:23, 417.61it/s, loss=2.375]

State to be the be to protest of Phoolan's conserves
Shah a can to be residents to be resident
Sharat have to be recovered in Phoolan's conserves
State to be the be to protest of Phoolan's conserves
'Phoolan seeks for the be to protest of Phoolan's conserves

epoch 8/10


100%|██████████| 10000/10000 [00:23<00:00, 432.65it/s, loss=1.830]
  0%|          | 42/10000 [00:00<00:23, 418.47it/s, loss=2.388]

Shah a can to be residents to be residents
Shah a case of the be the best
Shah a can to be residents to be residents
Phoolan seeks for Phoolan seeks for dead
'I dead in Manipur to be residents

epoch 9/10


100%|██████████| 10000/10000 [00:23<00:00, 433.12it/s, loss=1.781]
  0%|          | 42/10000 [00:00<00:23, 417.48it/s, loss=2.400]

State to be residents to be residents
Cong seeks for Phoolan seeks for Phoolan's conservation
Shah to death in Manipur
State hard to be residents state in Pan
Phoolan seeks for Phoolan seeks for Phoolan's conservation

epoch 10/10


100%|██████████| 10000/10000 [00:23<00:00, 432.27it/s, loss=1.776]


State hard to be read to be restrice
State hard to be read to be restrice
Cong seeks for Phoolan seek to be service
'Chargesh in the residents state in Phoolan's conservation
Phoolan seeks for a construction of the residents
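
The loss shown in the progress bar is the mean negative log-likelihood per character (in nats) of a single headline, since `F.nll_loss` averages over the sequence by default. Exponentiating it gives a per-character perplexity, which can be easier to read. The sketch below applies this to the first and last loss values printed above; because each value comes from one headline only, treat the numbers as rough indicators rather than a proper evaluation.

```python
import math

def perplexity(nll):
    """Convert a mean per-character negative log-likelihood (nats) to perplexity."""
    return math.exp(nll)

for loss in (4.855, 1.776):   # first and last loss values printed in the run above
    print('loss {:.3f} -> perplexity {:.1f}'.format(loss, perplexity(loss)))
```

For reference, a uniform guess over the 128 classes has a loss of ln(128) ≈ 4.85, which is very close to the first value the progress bar reports.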


## Todo
1. While generating, sample the next character from the predicted distribution instead of taking the argmax
2. Use multiple GRU layers
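
For the first item, here is a minimal sketch of sampling with a temperature, written in plain Python so it runs on its own. With greedy argmax decoding the only source of variety is the random initial hidden state, which is one plausible reason the outputs above repeat phrases such as "to be to be". The function `sample_index` and the toy probabilities are assumptions for illustration. Inside `gen_headlines`, the log-probabilities would come from the model output, and `torch.multinomial` could do the draw directly on a tensor of probabilities.

```python
import math
import random

def sample_index(log_probs, temperature=1.0, rng=random):
    """Draw an index from a categorical distribution given as log-probabilities.

    temperature < 1 sharpens the distribution (closer to argmax),
    temperature > 1 flattens it (more random).
    """
    scaled = [lp / temperature for lp in log_probs]
    m = max(scaled)                              # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]  # unnormalised probabilities
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy distribution over three "characters" (made-up numbers)
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
rng = random.Random(0)
draws = [sample_index(log_probs, temperature=1.0, rng=rng) for _ in range(1000)]
print([draws.count(k) for k in range(3)])  # counts per index, roughly in a 7:2:1 ratio
```

A low temperature such as 0.5 keeps the output close to the greedy result, while values near 1.0 follow the model's distribution as is.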

In [20]:
gen_headlines()

Step by Step
Hawkings' day out
Dill sects police reality of thrents puckes
'sertt to a plast to a with polority
Hawkings' day out
