# Generate News Headlines using RNN
We will use the Kaggle India news headlines dataset (https://www.kaggle.com/therohk/india-headlines-news-dataset/downloads/india-headlines-news-dataset.zip/5). <br/>
A cleaned dataset of 100,000 headlines is produced from it. The goal is to generate new headlines, one character at a time.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
if use_cuda:
    print('Yes! GPU!')

Yes! GPU!


## Read Data

In [5]:
with open('data/news-headlines-trimmed.txt') as f:
    data = f.read()

data = data.split('\n')[:10000] # fast training
print(len(data))

10000


A Start of Sentence (SOS) token is added to the beginning of every headline. <br/>
An End of Sentence (EOS) token tells the generator when to stop producing characters.

In [6]:
SOS = 0
EOS = 127

### Encode sentence as sequence of one-hot vectors

In [7]:
# One hot encoding
def one_hotter(c):
    vec = torch.zeros(128)
    vec[ord(c)] = 1.0
    return vec

def encode_sentence(s):
    v = torch.zeros(1, len(s)+1, 128)
    
    # append SOS
    vec = torch.zeros(128)
    vec[SOS] = 1.0
    v[0, 0, :] = vec
    
    for i in range(len(s)):
        v[0, i+1, :] = one_hotter(s[i])
        
    # EOS is not appended to the input; it only appears as the final
    # target character in the training loop
    
    return v.to(device)

In [8]:
e = encode_sentence('ab')
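
As a quick sanity check of the encoding, the sketch below re-implements `encode_sentence` on the CPU and decodes the one-hot rows back to character codes. The helper name `encode_sentence_cpu` is hypothetical, and the `.to(device)` call is dropped only so the snippet runs standalone. It also shows how the training targets line up with the inputs: the target at each step is the next character, with EOS as the final target.

```python
import torch

SOS, EOS = 0, 127

def encode_sentence_cpu(s):
    # hypothetical CPU-only copy of encode_sentence, for illustration
    v = torch.zeros(1, len(s) + 1, 128)
    v[0, 0, SOS] = 1.0                 # SOS in the first position
    for i, c in enumerate(s):
        v[0, i + 1, ord(c)] = 1.0      # one-hot row for each character
    return v

e = encode_sentence_cpu('ab')
inputs = e.argmax(dim=2)[0].tolist()       # character codes fed to the RNN
targets = [ord(c) for c in 'ab'] + [EOS]   # next-character targets

print(e.shape)    # torch.Size([1, 3, 128])
print(inputs)     # [0, 97, 98]
print(targets)    # [97, 98, 127]
```

Note that `ord(c)` must stay below 128 for this encoding to work, so the input text is assumed to be plain ASCII.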

## Model

In [9]:
class RnnNet(nn.Module):
    def __init__(self):
        
        super(RnnNet, self).__init__()
        self.input_dim = 128 # one-hot encoding of ascii 
        # self.seq_len = 28
        self.hidden_dim = 100
        self.batch_size = 1 # sorry! variable length sentences. 
        # We can pad and make batches though. But let's stick to simplicity
        self.num_class = self.input_dim
        
        self.rnn = nn.GRU(self.input_dim, self.hidden_dim, batch_first=True)
        self.fc = nn.Linear(self.hidden_dim, self.num_class)

    def forward(self, x, h0):
        
        # h0 = torch.randn(1, self.batch_size, self.hidden_dim).to(device)
        # run the GRU along the sequence of length seq_len
        
        x, h = self.rnn(x, h0)      # dim: batch_size x seq_len x hidden_dim
        
        # make the tensor contiguous in memory so that view() can reshape it
        x = x.contiguous()

        # reshape the tensor so that each row contains one token
        x = x.view(-1, x.shape[2])       # dim: batch_size*seq_len x hidden_dim (note batch_size=1)

        # apply the fully connected layer and obtain the output (before softmax) for each token
        x = self.fc(x)                   # dim: batch_size*seq_len x num_class

        # apply log softmax on each token's output (this is recommended over applying softmax
        # since it is numerically more stable)
        return F.log_softmax(x, dim=1), h   # dim: batch_size*seq_len x num_class & dim(h): 1 x 1(batch) x hidden_dim
    
    def genh(self):
        return torch.randn(1, self.batch_size, self.hidden_dim).to(device) 
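
The comment in the model mentions that padding would allow batches larger than one. A minimal sketch of that idea, assuming PyTorch's `pad_sequence` and `pack_padded_sequence` utilities, is shown here. It is not part of the notebook's training loop; the names `encode_cpu` and `PAD` are hypothetical, and the hidden state defaults to zeros rather than the random `genh()` initialisation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

SOS, EOS = 0, 127
PAD = -100   # hypothetical target value for padded positions, ignored by the loss

def encode_cpu(s):
    # one-hot rows for SOS followed by the characters of s: (len(s)+1) x 128
    idx = torch.tensor([SOS] + [ord(c) for c in s])
    return F.one_hot(idx, num_classes=128).float()

sentences = ['abc', 'hello']
xs = [encode_cpu(s) for s in sentences]
ys = [torch.tensor([ord(c) for c in s] + [EOS]) for s in sentences]
lengths = torch.tensor([x.shape[0] for x in xs])

# pad every sequence in the batch to the length of the longest one
x_pad = pad_sequence(xs, batch_first=True)                      # 2 x 6 x 128
y_pad = pad_sequence(ys, batch_first=True, padding_value=PAD)   # 2 x 6

rnn = nn.GRU(128, 100, batch_first=True)
fc = nn.Linear(100, 128)

# packing lets the GRU skip the padded time steps
packed = pack_padded_sequence(x_pad, lengths, batch_first=True, enforce_sorted=False)
out, h = rnn(packed)                                            # h0 defaults to zeros
out, _ = pad_packed_sequence(out, batch_first=True)             # 2 x 6 x 100

log_probs = F.log_softmax(fc(out), dim=2)
# padded target positions are excluded from the loss via ignore_index
loss = F.nll_loss(log_probs.reshape(-1, 128), y_pad.reshape(-1), ignore_index=PAD)
print(x_pad.shape, loss.item())
```

The notebook keeps `batch_size = 1` for simplicity, so this is only one possible way to batch variable-length headlines.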

In [10]:
model = RnnNet().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [11]:
len(data)

10000

## Train

In [12]:
from tqdm import trange
import logging

# logging.basicConfig(format='%(asctime)s [%(levelname)-8s] %(message)s')
# logger = logging.getLogger()
# logger.setLevel(logging.INFO)

### Generate Headlines

In [13]:
def gen_headlines(num=5):
    model.eval()

    # no gradients are needed while generating
    with torch.no_grad():
        for _ in range(num):
            gen = ''
            h = model.genh()
            # start from the SOS one-hot vector
            prev = torch.zeros(1, 1, 128).to(device)
            prev[0, 0, SOS] = 1.0

            # cap the length in case EOS is never generated
            for _ in range(201):
                output, h = model(prev, h)
                s = torch.argmax(output, dim=1).item()

                # Stop if EOS is generated
                if s == EOS:
                    break

                # update generated sentence
                gen += chr(s)
                prev = torch.zeros(1, 1, 128).to(device)
                prev[0, 0, s] = 1.0

            print(gen)

### Start Training

In [14]:
epochs = 10

for epoch in range(epochs):
    model.train()
    
    # Use tqdm for progress bar
    t = trange(len(data)) 
    print('\nepoch {}/{}'.format(epoch+1, epochs))
    for i in t:
        # Get the representation of sentence
        d = data[i]
        d = d.strip()
        if len(d) == 0: # skip empty sentences instead of ending the epoch
            continue

        enc_sen = encode_sentence(d)
        h0 = model.genh()
        output, _ = model(enc_sen, h0) # dim: seq_len x num_class
        target = [ord(c) for c in d] + [EOS]
        target = torch.LongTensor(target).to(device)

        # zero param grads
        optimizer.zero_grad()
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        if i%100 == 0:
            t.set_postfix(loss='{:05.3f}'.format(loss.item()))
    
    # print samples from the language model
    gen_headlines()

  0%|          | 35/10000 [00:00<00:29, 343.38it/s, loss=4.862]


epoch 1/10


100%|██████████| 10000/10000 [00:27<00:00, 368.94it/s, loss=1.922]


proment on to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to b
for to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be


  0%|          | 0/10000 [00:00<?, ?it/s, loss=2.399]

the karnation of stang to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to b
restor stanes of stanes for mandara
congres of proment of stang to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be to be

epoch 2/10


100%|██████████| 10000/10000 [00:26<00:00, 370.94it/s, loss=1.732]
  0%|          | 34/10000 [00:00<00:29, 332.63it/s, loss=2.375]

project on the to be restrite to be restrite
state congress of the to be restrite
state congress of the to be restrite
firm to be restrite to be restrite to be restrite
and a state congress of the to be restrite

epoch 3/10


100%|██████████| 10000/10000 [00:26<00:00, 373.09it/s, loss=1.646]


phoolan death to be restrate congress and and and
congress and to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to
state of a state of and and dead
accised in the to be restrate congress


  0%|          | 34/10000 [00:00<00:29, 335.12it/s, loss=2.367]

to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to 

epoch 4/10


100%|██████████| 10000/10000 [00:26<00:00, 370.43it/s, loss=1.639]
  0%|          | 0/10000 [00:00<?, ?it/s, loss=2.314]

probe in the to be remand of and and and
project of a state of and and and
tring state of and and and and and
congress a state of and and and and and
karnataka an to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to

epoch 5/10


100%|██████████| 10000/10000 [00:26<00:00, 373.24it/s, loss=1.623]
  0%|          | 0/10000 [00:00<?, ?it/s, loss=2.312]

accise to to to the to the to be restrits state
police for death to be restrits state of and
the protest and to to the to be restrits state
phoolan dead to to the to be restrits state of and
arrested for death to be restrits state

epoch 6/10


100%|██████████| 10000/10000 [00:26<00:00, 371.05it/s, loss=1.591]
  0%|          | 0/10000 [00:00<?, ?it/s, loss=2.309]

phoolan dead to be review states of phoolan
help to the to the to the to the to the to the to be restrict
phoolan dead to be review states of phoolan
rajkaran karnataka and to the to the to the to the to be restrict
police to to the to the to the to the to the to the to be restrict

epoch 7/10


100%|██████████| 10000/10000 [00:26<00:00, 371.47it/s, loss=1.624]
  0%|          | 0/10000 [00:00<?, ?it/s, loss=2.323]

state of and to to the to the to the to the promise
phoolan dead on the to the to the to the promise
the resident to the to the to the to the promise
congress and to to the to the to the to the to the promise
no to to the to the to the to the to the promise

epoch 8/10


100%|██████████| 10000/10000 [00:26<00:00, 377.32it/s, loss=1.482]
  0%|          | 37/10000 [00:00<00:27, 364.30it/s, loss=2.361]

police to the to the to the protest of phoolan
and the protest of the protest
and the protest of the protest
10 cong seeks and to to the protest of phoolan
units and the protest of the protest

epoch 9/10


100%|██████████| 10000/10000 [00:25<00:00, 387.26it/s, loss=1.488]
  0%|          | 37/10000 [00:00<00:27, 366.37it/s, loss=2.315]

state of the to the to the congress
manipur manipur states to be to the congress
sc of the protest and the protest
congress and the protest and the protest
manipur states to be to the to the congress

epoch 10/10


100%|██████████| 10000/10000 [00:25<00:00, 387.13it/s, loss=1.473]


manipur in the to the cong response
phoolan manipur states to be to the congress
to to tech of the protest
congress and the protest and the congress
phoolan manipur states to be to the congress


## Todo
1. Sample the next character from the predicted distribution instead of taking the argmax (for more diversity in the generated headlines)
2. Use multiple recurrent layers
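
The first item could look roughly like the sketch below, which draws the next character with `torch.multinomial` instead of `torch.argmax`. The function name `sample_next_char` and the `temperature` parameter are assumptions for illustration and do not appear in the notebook. A toy distribution stands in for the model output so the snippet runs on its own.

```python
import torch

def sample_next_char(log_probs, temperature=1.0):
    # log_probs: 1 x 128 log-probabilities, like the model's output for one step
    # temperature < 1 sharpens the distribution, > 1 flattens it (hypothetical knob)
    probs = torch.softmax(log_probs / temperature, dim=1)
    return torch.multinomial(probs, num_samples=1).item()

torch.manual_seed(0)
# toy distribution: 'a' with probability 0.7, 'b' with 0.3, everything else 0
p = torch.zeros(1, 128)
p[0, ord('a')] = 0.7
p[0, ord('b')] = 0.3
log_p = torch.log(p)   # log(0) = -inf, which softmax maps back to probability 0

samples = [chr(sample_next_char(log_p)) for _ in range(20)]
print(''.join(samples))
```

In `gen_headlines`, a call like this would take the place of the `torch.argmax(output, dim=1)` line. For the second item, `nn.GRU` accepts a `num_layers` argument; the initial hidden state returned by `genh` would then need shape `num_layers x batch_size x hidden_dim`.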

In [None]:
gen_headlines()