# Build an RNN to identify unreliable news articles

This notebook applies an RNN to identify when an article might be fake news. The data were obtained from the [Fake News Competition on Kaggle](https://www.kaggle.com/c/fake-news). 

>An RNN is a better fit here than a strictly feedforward network, because it can use information about the *sequence* of words. 

**Performance on real test set from Kaggle:** My Kaggle submission resulted in **92.09%** for private score and  **91.41%** for public score.

**Credit:** This is a side project for [PyTorch Scholarship Challenge from Facebook](https://www.udacity.com/facebook-pytorch-scholarship), which uses the [Sentiment_RNN](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/sentiment-rnn/Sentiment_RNN_Solution.ipynb) template from the program. 

### Network Architecture

The architecture for this network is shown below.

<img src="https://raw.githubusercontent.com/minhkhang1795/FakeNews_RNN/master/assets/network_diagram.png" width=40%>

>**First, we'll pass in words to an embedding layer.** We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before from the Word2Vec lesson. You can actually train an embedding with the Skip-gram Word2Vec model and use those embeddings as input, here. However, it's good enough to just have an embedding layer and let the network learn a different embedding table on its own. *In this case, the embedding layer is for dimensionality reduction, rather than for learning semantic representations.*

>**After input words are passed to an embedding layer, the new embeddings will be passed to LSTM cells.** The LSTM cells will add *recurrent* connections to the network and give us the ability to include information about the *sequence* of words in the article data. 

>**Finally, the LSTM outputs will go to a sigmoid output layer.** We're using a sigmoid function because unreliable (fake news) = 1 and reliable = 0, and a sigmoid outputs a predicted probability between 0 and 1. 

We don't care about the sigmoid outputs except for the **very last one**; we can ignore the rest. We'll calculate the loss by comparing the output at the last time step with the training label (1 = unreliable, 0 = reliable).

## Load and visualize the data
**train.csv**: A full training dataset with the following attributes:

- **id**: unique id for a news article
- **title**: the title of a news article
- **author**: author of the news article
- **text**: the text of the article; could be incomplete
- **label**: a label that marks the article as potentially unreliable
    - 1: unreliable
    - 0: reliable


In [1]:
import numpy as np
import pandas as pd

train_fpath = 'train.csv'
train_raw_data = pd.read_csv(train_fpath, encoding='utf-8', header=0, keep_default_na=False)
train_raw_data.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |
| 5 | 5 | Jackie Mason: Hollywood Would Love Trump if He... | Daniel Nussbaum | In these trying times, Jackie Mason is the Voi... | 0 |
| 6 | 6 | Life: Life Of Luxury: Elton John’s 6 Favorite ... | | Ever wonder how Britain’s most iconic pop pian... | 1 |
| 7 | 7 | Benoît Hamon Wins French Socialist Party’s Pre... | Alissa J. Rubin | PARIS — France chose an idealistic, traditi... | 0 |
| 8 | 8 | Excerpts From a Draft Script for Donald Trump’... | | Donald J. Trump is scheduled to make a highly ... | 0 |
| 9 | 9 | A Back-Channel Plan for Ukraine and Russia, Co... | Megan Twohey and Scott Shane | A week before Michael T. Flynn resigned as nat... | 0 |


## Train data pre-processing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see examples of the article data above. Here are the processing steps we'll take:
>* Get rid of periods and extraneous punctuation.
* Some texts contain newline characters `\n`. To deal with those, I'll join all the texts into one big string.
* Then I can split that string on whitespace, which removes the newlines and gives a list of individual words.

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.
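
As a small illustration of these cleaning steps, here is a sketch on a made-up sample string (the `sample` text is hypothetical, not taken from the dataset). It lowercases the text, drops the characters listed in `string.punctuation`, and splits on whitespace.

```python
from string import punctuation

# Hypothetical sample; the real articles come from train_raw_data['text'].
sample = "Print \nAn Iranian woman has been sentenced... for a story!"

# Lowercase and drop every character listed in string.punctuation.
cleaned = ''.join(c for c in sample.lower() if c not in punctuation)

# str.split() with no arguments splits on any run of whitespace, newlines included.
words = cleaned.split()
print(words)
```

Note that `string.punctuation` only covers ASCII punctuation, so typographic characters such as curly quotes (`’`) or long dashes that appear in the articles are not removed by this step.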

In [2]:
from string import punctuation

texts = [''.join([c for c in text.lower() if c not in punctuation]) for text in train_raw_data['text']]

# join all texts into one big string (split() below also breaks on newlines)
all_text = ' '.join(texts)

# create a list of words
words = all_text.split()

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our texts into a list of integers so they can be passed into the network.
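
To make the mapping concrete, here is a minimal sketch on a toy corpus (the `toy_*` names and sentences are made up for illustration). It follows the same idea: count words, give the most frequent word the smallest integer, and start numbering at 1 so that 0 stays free for padding.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the cleaned article texts.
toy_texts = ["the senate voted", "the vote failed", "the senate"]
toy_words = ' '.join(toy_texts).split()

counts = Counter(toy_words)
# Most frequent word first; indices start at 1 so that 0 is reserved for padding.
toy_vocab = sorted(counts, key=counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

# Convert each text into a list of integers.
toy_ints = [[toy_vocab_to_int[w] for w in text.split()] for text in toy_texts]
print(toy_vocab_to_int['the'], toy_vocab_to_int['senate'])
print(toy_ints[2])
```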

In [3]:
from collections import Counter

# Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

# use the dict to tokenize each article in text split
# store the tokenized texts in text_ints
text_ints = []
for text in texts:
    text_ints.append([vocab_to_int[word] for word in text.split()])

# get the labels (0 and 1) from the data set
encoded_labels = [label for label in train_raw_data['label']]

In [31]:
# stats about vocabulary
print('Unique words: ', len(vocab_to_int))

# print tokens in first article
print('Tokenized text: \n', text_ints[:1])

Unique words:  267449
Tokenized text: 
 [[127, 5765, 2273, 41, 359, 84, 142, 2655, 771, 316, 3069, 7489, 2879, 13, 17, 13053, 27624, 10, 361, 662, 152, 3379, 3069, 7489, 10, 1, 12243, 6, 97, 15543, 4158, 1086, 4627, 737, 254183, 955, 169, 5, 2666, 175729, 2668, 11, 12251, 2, 6091, 46532, 60, 8, 66, 1454, 31, 1, 1539, 426, 6, 1, 107, 8, 26, 193612, 304, 728, 599, 29, 154, 2, 5, 127, 269, 2273, 13, 1327, 62, 41, 64, 128, 31, 1, 261605, 426, 8, 14, 162, 13, 2089, 58, 7, 56, 599, 598, 22, 64581, 771, 4053, 7, 1, 295, 12, 626, 73, 364, 7, 115, 21, 1038, 2, 114, 385, 249, 1403, 1, 5909, 332, 10, 1, 3186, 5331, 359, 1141, 40, 13, 24, 599, 30, 206, 58, 1084, 5, 2258, 24, 48, 3, 1, 194, 478, 13750, 14, 41, 81, 128, 599, 7273, 1, 194, 13750, 4, 269, 5909, 272, 3, 1, 127, 443, 4065, 4, 4200, 5331, 7, 22, 582, 12, 5654, 364, 13, 36, 493, 1442, 6, 266, 2, 142, 54, 30, 3284, 1617, 261, 23, 188, 63, 26, 771, 369, 58, 4200, 478, 858, 3069, 7489, 325, 1, 140, 107, 20551, 11, 26, 2258, 295, 23236, 77, 2

In [5]:
# outlier article stats
text_lens = Counter([len(x) for x in text_ints])
print("Zero-length text: {}".format(text_lens[0]))
print("Maximum text length: {}".format(max(text_lens)))

Zero-length text: 116
Maximum text length: 24195


We seem to have some texts with zero length, and the maximum text length is far too many steps for our RNN. We'll remove the zero-length texts here and truncate the very long ones in the padding step below. This removes outliers and should allow our model to train more efficiently.

In [6]:
print('Number of texts before removing outliers: ', len(text_ints))

## remove any articles/labels with zero length from the text_ints list.

# get indices of any articles with length 0
non_zero_idx = [ii for ii, text in enumerate(text_ints) if len(text) != 0]

# remove 0-length articles and their labels
text_ints = [text_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of texts after removing outliers: ', len(text_ints))

Number of texts before removing outliers:  20800
Number of texts after removing outliers:  20684


---
## Padding sequences

To deal with both short and very long articles, we'll pad or truncate all our articles to a specific length. For texts shorter than some `seq_length`, we'll pad with 0s. For texts longer than `seq_length`, we can truncate them to the first `seq_length` words. A good `seq_length`, in this case, is 200.

<img src="https://raw.githubusercontent.com/minhkhang1795/FakeNews_RNN/master/assets/outliers_padding_ex.png" width=40%>

> Define a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network. 
* The data should come from `text_ints`, since we want to feed integers to the network. 
* Each row should be `seq_length` elements long. 
* For articles shorter than `seq_length` words, **left pad** with 0s. That is, if the text is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. 
* For articles longer than `seq_length`, use only the first `seq_length` words as the feature vector.

As a small example, if the `seq_length=10` and an input article is: 
```
[117, 18, 128]
```
The resultant, padded sequence should be: 

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```

**The final `features` array should be a 2D array, with as many rows as there are articles, and as many columns as the specified `seq_length`.**
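
The padding rule for a single article can be sketched as follows. This is a minimal illustration, not the notebook's `pad_features` function; the helper name `left_pad` is hypothetical and it expects a plain Python list of token ids.

```python
import numpy as np

def left_pad(token_ids, seq_length):
    """Left-pad with zeros, or keep only the first seq_length tokens."""
    row = np.zeros(seq_length, dtype=int)
    kept = token_ids[:seq_length]
    if kept:  # an empty article simply stays all zeros
        row[-len(kept):] = kept
    return row

short = left_pad([117, 18, 128], 10)       # shorter than seq_length: zeros on the left
long = left_pad(list(range(1, 13)), 10)    # longer than seq_length: first 10 tokens kept
print(short.tolist())
print(long.tolist())
```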

In [7]:
def pad_features(text_ints, seq_length):
    ''' Return features of text_ints, where each article is padded with 0's 
        or truncated to the input seq_length.
    '''
    
    # getting the correct rows x cols shape
    features = np.zeros((len(text_ints), seq_length), dtype=int)

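    # for each article, left-pad with zeros or truncate to the first seq_length tokens
    for i, row in enumerate(text_ints):
        if len(row) == 0:
            continue  # nothing to copy; the row stays all zeros
        features[i, -len(row):] = np.array(row)[:seq_length]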
    
    return features

In [8]:
seq_length = 200

features = pad_features(text_ints, seq_length=seq_length)

# test statements - do not change -
assert len(features)==len(text_ints), "Your features should have as many rows as articles."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 rows
print(features[:30, :10])

[[   127   5765   2273     41    359     84    142   2655    771    316]
 [   365    113      1   1720     89    201   5177      1  30633    522]
 [   202      1    742    209    113     33   1461    361   1851    152]
 [  1538    686   1271    473      6    786     67   6721     20     46]
 [     0      0      0      0      0      0      0      0      0      0]
 [     6    102    380    167  10178   7951      8      1   1258      3]
 [   365   2235     87   3106     90   8397   3051  15953   1090    141]
 [  1413     35   1202   2557     28  19928   1409    435      6   6511]
 [   159    488     39      8   1979      2    123      5   1401   6742]
 [     5    212    108    737   1638   2534   4530     14    150    193]
 [  4364      9    601      1   2207    200      7  18244     24   1019]
 [     1   3513   1842  32903     10      1   9264  52757   1227  21333]
 [     1   4507   2591      1    781  10869      4   3961   1210      8]
 [    75    119   2629    295  15074   3335    182 

## Training, Validation, Test

With our data in nice shape, we'll split it into training, validation, and test sets.

> Create the training, validation, and test sets from the `train_raw_data`
* You'll need to create sets for the features and the labels, `train_x` and `train_y`, for example. 
* Define a split fraction, `split_frac` as the fraction of data to **keep** in the training set. Usually this is set to 0.8 or 0.9. 
* Whatever data is left will be split in half to create the validation and *testing* data.
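
The steps above can be sketched with a fraction-based helper. This is an illustrative version under stated assumptions: `split_by_fraction` and the toy arrays are hypothetical names, and the notebook's own cell uses a fixed index instead of a fraction.

```python
import numpy as np

def split_by_fraction(features, labels, split_frac=0.8):
    """Split into train/validation/test; the held-out part is halved."""
    split_idx = int(len(features) * split_frac)
    train_x, rest_x = features[:split_idx], features[split_idx:]
    train_y, rest_y = labels[:split_idx], labels[split_idx:]
    half = len(rest_x) // 2
    return ((train_x, train_y),
            (rest_x[:half], rest_y[:half]),
            (rest_x[half:], rest_y[half:]))

# Toy data: 20 "articles" with 2 features each and alternating 0/1 labels.
toy_x = np.arange(40).reshape(20, 2)
toy_y = np.arange(20) % 2
train, val, test = split_by_fraction(toy_x, toy_y)
print(train[0].shape, val[0].shape, test[0].shape)  # (16, 2) (2, 2) (2, 2)
```

The cell below splits at a fixed index of 16000 and trims a few rows, so the validation and test sets each end up with 2340 rows, which divides evenly by the batch size of 10 used later.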

In [9]:
# split data into training, validation, and test data (features and labels, x and y)
split_idx = 16000 # about 80%
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

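# halve the remainder; the -2 and -4 offsets below trim a few rows so that both
# sets end up with 2340 rows, a multiple of the batch size (10)
test_idx = (len(features)-split_idx)//2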
val_x, test_x = remaining_x[:test_idx-2], remaining_x[test_idx-2:-4]
val_y, test_y = remaining_y[:test_idx-2], remaining_y[test_idx-2:-4]

# print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(16000, 200) 
Validation set: 	(2340, 200) 
Test set: 		(2340, 200)


---
## DataLoaders and Batching

After creating training, test, and validation data, we can create **DataLoaders** for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

This is an alternative to creating a generator function for batching our data into full batches.

In [10]:
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.optim.lr_scheduler import ReduceLROnPlateau

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 10

# make sure to SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, num_workers=5)
valid_loader = DataLoader(valid_data, shuffle=False, batch_size=batch_size, num_workers=5)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size, num_workers=5)

In [11]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # iterator.next() is not available in recent PyTorch versions

print('Sample input size: ', sample_x.size())  # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size())  # batch_size
print('Sample label: \n', sample_y)


Sample input size:  torch.Size([10, 200])
Sample input: 
 tensor([[     0,      0,      0,  ...,   1786,   1266,    152],
        [     0,      0,      0,  ...,   2590,    236,   8055],
        [   202,    159,     39,  ...,    560,     68,    588],
        ...,
        [ 18616,   1502,   2000,  ...,  20006,      2,     49],
        [212536, 218481,      1,  ...,   2308,      4,     32],
        [     1,   2347,      3,  ...,   1243,    310,     24]])

Sample label size:  torch.Size([10])
Sample label: 
 tensor([1, 1, 1, 0, 1, 0, 0, 1, 1, 0])


---
# Recurrent Neural Network with PyTorch

Below is where we'll define the network.

<img src="https://raw.githubusercontent.com/minhkhang1795/FakeNews_RNN/master/assets/network_diagram.png" width=40%>

The layers are as follows:
1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into embeddings of a specific size.
2. An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden_state size and number of layers
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
4. A sigmoid activation layer which turns all outputs into a value 0-1; return **only the last sigmoid output** as the output of this network.
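
Step 4 ("return only the last sigmoid output") relies on a reshape trick that is easy to get wrong, so here is a small NumPy sketch of the idea with made-up values. The per-time-step outputs are flattened to one long column, reshaped back to one row per article, and only the last column is kept.

```python
import numpy as np

batch_size, seq_length = 2, 4

# Pretend per-time-step sigmoid outputs, flattened to (batch_size * seq_length, 1)
# in the row order that flattening a batch-first tensor produces.
flat = np.array([[0.1], [0.2], [0.3], [0.9],    # article 0, steps 0..3
                 [0.8], [0.6], [0.4], [0.2]])   # article 1, steps 0..3

per_article = flat.reshape(batch_size, -1)  # back to (batch_size, seq_length)
last_step = per_article[:, -1]              # one value per article
print(last_step)  # [0.9 0.2]
```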

### The Embedding Layer

We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) because there are 267449+ words in our vocabulary. It is massively inefficient to one-hot encode that many classes. So, instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using Word2Vec, then load it here. But, it's fine to just make a new layer, using it for only dimensionality reduction, and let the network learn the weights.
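
To see why an embedding layer works as a lookup table, the following NumPy sketch (with tiny, hypothetical sizes and a random table) checks that multiplying one-hot vectors by a weight matrix gives the same result as simply indexing its rows.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3          # tiny hypothetical sizes
table = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([4, 1, 4])

# One-hot route: build a (3, vocab_size) matrix and multiply by the table.
one_hot = np.eye(vocab_size)[token_ids]
via_matmul = one_hot @ table

# Lookup route: just index the rows of the table.
via_lookup = table[token_ids]

print(np.allclose(via_matmul, via_lookup))  # True
```

With a vocabulary of several hundred thousand words, the lookup route avoids ever building the large one-hot matrix.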


### The LSTM Layer(s)

We'll create an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to use in our recurrent network, which takes in an input_size, a hidden_dim, a number of layers, a dropout probability (for dropout between multiple layers), and a batch_first parameter.

Most of the time, the network performs better with more than one layer, typically 2 or 3. Adding more layers allows the network to learn more complex relationships. 

Note: `init_hidden` should initialize the hidden and cell states of the LSTM layers to all zeros, and move those states to the GPU, if available.
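
As a quick shape check, here is a small sketch of a stacked `nn.LSTM` with `batch_first=True` and zero-initialized states. The sizes are toy values chosen for illustration, not the ones used in this notebook.

```python
import torch
import torch.nn as nn

batch_size, seq_length = 4, 7            # hypothetical toy sizes
embedding_dim, hidden_dim, n_layers = 5, 8, 2

lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=0.5, batch_first=True)

# Zero-initialized hidden and cell states: (n_layers, batch_size, hidden_dim)
h0 = torch.zeros(n_layers, batch_size, hidden_dim)
c0 = torch.zeros(n_layers, batch_size, hidden_dim)

# Stand-in for the embedding layer's output: (batch_size, seq_length, embedding_dim)
embeds = torch.randn(batch_size, seq_length, embedding_dim)
lstm_out, (hn, cn) = lstm(embeds, (h0, c0))

print(lstm_out.shape)  # torch.Size([4, 7, 8])
print(hn.shape)        # torch.Size([2, 4, 8])
```

The output has one `hidden_dim`-sized vector per time step, while the returned hidden and cell states keep the `(n_layers, batch_size, hidden_dim)` layout of the initial states.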

In [12]:
# First checking if GPU is available
train_on_gpu = torch.cuda.is_available()

if train_on_gpu:
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')


Training on GPU.


In [13]:
import torch.nn as nn


class SentimentRNN(nn.Module):
    """
    The RNN model from the Sentiment_RNN template, used here to classify articles as unreliable (1) or reliable (0).
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True)

        # dropout layer
        self.dropout = nn.Dropout(0.3)

        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()


    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)

        # embeddings and lstm_out
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)

        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)

        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]  # keep only the output at the last time step

        # return last sigmoid output and hidden state
        return sig_out, hidden


    def init_hidden(self, batch_size):
        """ Initializes hidden state """
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data

        if train_on_gpu:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())

        return hidden

## Instantiate the network

Here, we'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary, i.e. the range of values for our input word tokens.
* `output_size`: Size of our desired output; here a single score (unreliable vs. reliable).
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values usually perform better. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network, typically between 1 and 3.
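
To get a feel for how these hyperparameters translate into model size, here is a back-of-the-envelope parameter count using the values from this notebook. It assumes the standard PyTorch LSTM layout of four gates, each with input weights, recurrent weights, and two bias vectors; the helper name `lstm_layer_params` is made up for this sketch.

```python
vocab_size, embedding_dim, hidden_dim, n_layers, output_size = 267450, 200, 256, 3, 1

def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

embedding_params = vocab_size * embedding_dim
# The first LSTM layer reads embeddings; later layers read the previous layer's output.
lstm_params = (lstm_layer_params(embedding_dim, hidden_dim)
               + (n_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim))
fc_params = hidden_dim * output_size + output_size

total = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total)
# 53490000 1521664 257 55011921
```

Under these assumptions the embedding table accounts for roughly 97% of all parameters, which is why the vocabulary size dominates the size of the saved checkpoint.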

In [14]:
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int) + 1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 200
hidden_dim = 256
n_layers = 3

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
# move model to GPU, if available
if train_on_gpu:
    net.cuda()
    
print(net)

SentimentRNN(
  (embedding): Embedding(267450, 200)
  (lstm): LSTM(200, 256, num_layers=3, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


---
## Training

Below is the typical training code. We'll also be using a new kind of cross entropy loss, which is designed to work with a single Sigmoid output. [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss), or **Binary Cross Entropy Loss**, applies cross entropy loss to a single value between 0 and 1.
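
As a worked example of binary cross entropy for a single prediction, the sketch below computes the loss by hand in plain Python (the function name `binary_cross_entropy` is only for illustration). A confident correct prediction gives a small loss, while a confident wrong one gives a large loss.

```python
import math

def binary_cross_entropy(p, y):
    """Loss for one predicted probability p and one 0/1 label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

confident_right = binary_cross_entropy(0.9, 1)  # predicted 0.9, label is 1
confident_wrong = binary_cross_entropy(0.9, 0)  # predicted 0.9, label is 0
print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

With its default settings, `nn.BCELoss` averages this per-example loss over the batch.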

We also have some data and training hyperparameters:

* `lr`: Learning rate for our optimizer.
* `epochs`: Number of times to iterate through the training dataset.
* `clip`: The maximum gradient norm; gradients are rescaled when their overall norm exceeds it (to prevent exploding gradients).
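
The effect of `clip` can be illustrated with a conceptual NumPy sketch of norm-based clipping. This is not PyTorch's implementation; the function name `clip_by_global_norm` and the toy gradients are assumptions made for the example. All gradients are scaled by the same factor so that their combined L2 norm does not exceed the limit.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # combined norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, round(norm_after, 3))
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length shrinks.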

In [15]:
# loss and optimization functions
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
scheduler = ReduceLROnPlateau(optimizer, 'min', factor=0.1, patience=1, verbose=True)

In [16]:
state_dict = torch.load('checkpoint5_9329.pth')
net.load_state_dict(state_dict)

In [17]:
# training params
epochs = 20
clip = 5  # gradient clipping
min_loss = np.inf

# train for some number of epochs
for e in range(epochs):
    # keep track of training and validation loss
    train_loss = 0.0
    valid_loss = 0.0
    num_correct = 0
    
    # initialize hidden state
    h = net.init_hidden(batch_size)

    net.train()
    # batch loop
    for inputs, labels in train_loader:

        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()
        # update training loss
        train_loss += loss.item()*inputs.size(0)

    # Get validation loss
    val_h = net.init_hidden(batch_size)
    net.eval()
    for inputs, labels in valid_loader:

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        val_h = tuple([each.data for each in val_h])

        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()

        output, val_h = net(inputs, val_h)
        loss = criterion(output.squeeze(), labels.float())

        # convert output probabilities to predicted class (0 or 1)
        pred = torch.round(output.squeeze())  # rounds to the nearest integer
        # compare predictions to true label
        correct_tensor = pred.eq(labels.float().view_as(pred))
        correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
        num_correct += np.sum(correct)
        # update average validation loss 
        valid_loss += loss.item()*inputs.size(0)

    train_loss = train_loss/len(train_loader.dataset)
    valid_loss = valid_loss/len(valid_loader.dataset)
    scheduler.step(valid_loss)

    print("Epoch: {}/{}...".format(e + 1, epochs),
          "Loss: {:.6f}...".format(train_loss),
          "Val Loss: {:.6f}".format(valid_loss),
          "Accuracy: {:.6f}".format(num_correct/len(valid_loader.dataset)))
    if min_loss >= valid_loss:
        torch.save(net.state_dict(), 'checkpointx.pth')
        min_loss = valid_loss
        print("Loss decreased. Saving model...")

Epoch: 1/20... Loss: 0.159931... Val Loss: 0.162035 Accuracy: 0.935470
Loss decreased. Saving model...
Epoch: 2/20... Loss: 0.091499... Val Loss: 0.182338 Accuracy: 0.934188
Epoch     2: reducing learning rate of group 0 to 1.0000e-04.
Epoch: 3/20... Loss: 0.048477... Val Loss: 0.226819 Accuracy: 0.941880
Epoch: 4/20... Loss: 0.022922... Val Loss: 0.224598 Accuracy: 0.940598
Epoch     4: reducing learning rate of group 0 to 1.0000e-05.
Epoch: 5/20... Loss: 0.014395... Val Loss: 0.254986 Accuracy: 0.939744
Epoch: 6/20... Loss: 0.010363... Val Loss: 0.261356 Accuracy: 0.940598
Epoch     6: reducing learning rate of group 0 to 1.0000e-06.
Epoch: 7/20... Loss: 0.008913... Val Loss: 0.267870 Accuracy: 0.941026
Epoch: 8/20... Loss: 0.008491... Val Loss: 0.268791 Accuracy: 0.940598
Epoch     8: reducing learning rate of group 0 to 1.0000e-07.
Epoch: 9/20... Loss: 0.009136... Val Loss: 0.269656 Accuracy: 0.940598
Epoch: 10/20... Loss: 0.009926... Val Loss: 0.269735 Accuracy: 0.940171
Epoch    

In [21]:
# torch.save(net.state_dict(), 'checkpoint6.pth')

---
## Testing

* **Test data performance:** First, we'll see how our trained model performs on the test data we split off above. We'll calculate the average loss and accuracy over that set. The best test accuracy this RNN reached is **93.29%**.

* **Performance on real test set from Kaggle:** Second, we'll see our RNN performance on the real test set from Kaggle. My submission resulted in **92.09%** for private score and **91.41%** for public score.

In [33]:
# Get test data loss and accuracy
test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
loader = test_loader
# iterate over test data
for inputs, labels in loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if train_on_gpu:
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(loader.dataset)
print("Test accuracy: {:.4f}%".format(test_acc*100))

Test loss: 0.316
Test accuracy: 93.2906%


### Prediction on Kaggle test set
**test.csv**: A test dataset with all the same attributes as **train.csv**, but without the label.

In [17]:
test_fpath = 'test.csv'
test_raw_data = pd.read_csv(test_fpath, encoding='utf-8', header=0, keep_default_na=False)

# Only keep title and text
test_data = []
for i in range(len(test_raw_data)):
    test_data.append(test_raw_data['title'][i] + ' ' + test_raw_data['text'][i])
# test_data[:10]

['Specter of Trump Loosens Tongues, if Not Purse Strings, in Silicon Valley - The New York Times PALO ALTO, Calif.  —   After years of scorning the political process, Silicon Valley has leapt into the fray. The prospect of a President Donald J. Trump is pushing the tech community to move beyond its traditional role as donors and to embrace a new existence as agitators and activists. A distinguished venture capital firm emblazoned on its corporate home page an earthy   epithet. One prominent tech chieftain says the consequences of Mr. Trump’s election would “range between disastrous and terrible. ” Another compares him to a dictator. And nearly 150 tech leaders signed an open letter decrying Mr. Trump and his campaign of “anger” and “bigotry. ” Not quite all the action is  . Peter Thiel, a founder of PayPal and Palantir who was the first outside investor in Facebook, spoke at the Republican convention in July. The New York Times reported on Saturday that Mr. Thiel is giving $1. 25 milli

In [32]:
def tokenize_article(test_article):
    test_article = test_article.lower() # lowercase
    # get rid of punctuation
    test_text = ''.join([c for c in test_article if c not in punctuation])

    # splitting by spaces
    test_words = test_text.split()

    # tokens
    test_ints = []
    test_ints.append([vocab_to_int[word] if word in vocab_to_int else 0 for word in test_words])

    return test_ints

# test code and generate tokenized article
test_ints = tokenize_article(test_data[1])
# test_ints

[[230, 9481, 1284, 2, 1670, 1250, 594, 1098, 230, 9481, 1284, 2, 1670, 1250, 594, 1098,
  45047, 620, 620, 61690, 338, 2029, 3, 1, 230, 2029, 4086, 4954, 7863, 113, 1284, 2,
  1670, 1250, 1594, 6, 1, 12741, 3, 1098, 1047, 19, 1, 230, 545, 1736, 16, 78609,
  427, 6410, 2459, 2, 1193, 73, 1098, 24, 506, 18, 10227, 1, 620, 16, 1, 962,
  2301, 3, 1, 2029, 4086, 6552, 200, 8, 2, 1670, 1612, 4, 540, 11322, 10, 1,
  1250, 433, 1333, 8, 2, 2276, 1098, 63, 1, 508, 10, 1250, 1594, 48, 42, 20,
  2, 2135, 40, 1, 195, 9, 6410, 24, 1, 506, 1, 620, 16, 1, 230, 200,
  6, 1, 4841, 1099, 7910, 3, 1, 4954, 7863, 2029, 4086, 1, 1992, 502, 1612, 12358,
  30454, 96366, 2170, 1, 267, 4, 476, 35235, 4018, 42997, 4, 54612, 57720, 182, 27, 1309,


In [33]:
# For each article, predict whether it's fake news (1) or not (0)
def predict(net, test_article, sequence_length=200):
    
    net.eval()
    
    # tokenize article
    test_ints = tokenize_article(test_article)
    
    # pad tokenized sequence
    seq_length=sequence_length
    features = pad_features(test_ints, seq_length)
    
    # convert to tensor to pass into your model
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    
    # get the output from the model
    output, h = net(feature_tensor, h)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze()) 
    return pred

In [21]:
preds = []

net.eval()
# iterate over test data
for inputs in test_data:
    pred = predict(net, inputs)
    preds.append(int(pred.item()))

# -- stats! -- ##
# print("Predictions: ", preds)

Predictions:  [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0

### Finally, export the result to submit to the competition

In [26]:
import csv

with open('submit.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    f_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    f_writer.writerow(['id', 'label'])
    for i in range(len(preds)):
        f_writer.writerow([test_raw_data['id'][i], preds[i]])

In [27]:
submit_data = pd.read_csv('submit.csv', encoding='utf-8', header=0, keep_default_na=False)
submit_data

| | id | label |
|---|---|---|
| 0 | 20800 | 0 |
| 1 | 20801 | 1 |
| 2 | 20802 | 1 |
| 3 | 20803 | 0 |
| 4 | 20804 | 1 |
| 5 | 20805 | 1 |
| 6 | 20806 | 0 |
| 7 | 20807 | 0 |
| 8 | 20808 | 1 |
| 9 | 20809 | 1 |
