# Stance Detection for the Fake News Challenge

## Identifying Textual Relationships with Deep Neural Nets

### Check the problem context [here](https://drive.google.com/open?id=1KfWaZyQdGBw8AUTacJ2yY86Yxgw2Xwq0).

### Download files required for the project from [here](https://drive.google.com/open?id=10yf39ifEwVihw4xeJJR60oeFBY30Y5J8).

## Step 1: Load the given dataset <h1> [10 marks] </h1>

1. Mount Google Drive

2. Import the GloVe embeddings

3. Import the test and train datasets

### Mount Google Drive to access the required project files

Run the commands below.

In [0]:
from google.colab import drive

In [0]:
drive.mount('/content/drive/')

#### Path for Project files on google drive

**Note:** You need to change this path according to where you have kept the files in Google Drive.

In [0]:
project_path = "/content/drive/My Drive/Datasets/Fake News Challenge/"

### Loading the GloVe embeddings

In [0]:
from zipfile import ZipFile
with ZipFile(project_path+'glove.6B.zip', 'r') as z:
  z.extractall()

### Load the dataset

1. Using [read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in pandas, load the given train dataset files **`train_bodies.csv`** and **`train_stances.csv`**.

2. Using the [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) command in pandas, merge the two datasets on the `Body ID` column.

Note: Save the final merged dataset in a dataframe named **`dataset`**.

In [0]:
import pandas as pd
train_bodies = pd.read_csv(project_path+'train_bodies.csv')
train_stances = pd.read_csv(project_path+'train_stances.csv')

In [0]:
# Join every headline/stance row with the article body that shares its Body ID
dataset = pd.merge(train_bodies, train_stances, how='outer', on='Body ID')
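
To see what the merge does, here is a minimal sketch on two toy frames. The column names mirror the challenge files, but the rows and the `toy_` variable names are made up for illustration only.

```python
import pandas as pd

# Toy stand-ins for train_bodies.csv and train_stances.csv (made-up rows).
toy_bodies = pd.DataFrame({
    'Body ID': [0, 1],
    'articleBody': ['Body text zero.', 'Body text one.'],
})
toy_stances = pd.DataFrame({
    'Headline': ['Headline A', 'Headline B', 'Headline C'],
    'Body ID': [0, 0, 1],
    'Stance': ['agree', 'unrelated', 'discuss'],
})

# Each headline row picks up the article body that shares its Body ID,
# so a body that has several headlines is repeated once per headline.
toy_dataset = pd.merge(toy_bodies, toy_stances, how='outer', on='Body ID')

print(toy_dataset.shape)
print(list(toy_dataset.columns))
```

The merged frame has one row per headline, which is why the real `dataset` ends up with as many rows as `train_stances.csv`, assuming every Body ID appears in both files.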



<h2> Check1:</h2>
  
<h3> You should see the below output if you run `dataset.head()` command as given below </h3>

In [0]:
dataset.head()

## Step 2: Data pre-processing and setting the hyperparameters needed for the model


#### Run the code given below to set the required parameters.

1. `MAX_SENTS` = Maximum number of sentences to consider in an article.

2. `MAX_SENT_LENGTH` = Maximum number of words to consider in a sentence.

3. `MAX_NB_WORDS` = Maximum number of words in the total vocabulary.

4. `MAX_SENTS_HEADING` = Maximum number of sentences to consider in the heading of an article.

In [0]:
MAX_NB_WORDS = 20000
MAX_SENTS = 20
MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20
VALIDATION_SPLIT = 0.2

### Download `Punkt` from nltk using the commands given below. It is used for sentence tokenization.

For more info on how to use it, read [this](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk).



In [0]:
import nltk
nltk.download('punkt')

### Tokenizing the text and loading the pre-trained GloVe word embeddings for each token <h1> [20 marks] </h1>

Keras provides [Tokenizer API](https://keras.io/preprocessing/text/) for preparing text. Read it before going any further.

#### Import the Tokenizer from keras preprocessing text

In [0]:
from keras.preprocessing.text import Tokenizer
import keras

#### Initialize the Tokenizer class with the maximum vocabulary count set to `MAX_NB_WORDS`, defined at the start of Step 2.

In [0]:
# Every argument except num_words is shown with its default value.
# The result is assigned to `t` so the instance is not thrown away.
t = keras.preprocessing.text.Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0)

#### Now, using fit_on_texts() from the Tokenizer class, let's encode the data

Note: We need to fit on both articleBody and Headline to cover all the words.

In [0]:
t = Tokenizer(num_words=MAX_NB_WORDS) 
t.fit_on_texts(dataset['articleBody'])
t.fit_on_texts(dataset['Headline'])

#### fit_on_texts() gives the following attributes in the output as given [here](https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/).

* **word_counts:** dictionary mapping words (str) to the number of times they appeared on during fit. Only set after fit_on_texts was called.

* **word_docs:** dictionary mapping words (str) to the number of documents/texts they appeared on during fit. Only set after fit_on_texts was called.

* **word_index:** dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.

* **document_count:** int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.
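
The sketch below imitates, in plain Python, how these four attributes relate to the input texts. It is not the Keras implementation: the two toy documents and the simple whitespace split are assumptions made for illustration, and Keras may break frequency ties differently.

```python
from collections import Counter

# Two toy "documents", lower-cased and split on spaces for simplicity.
docs = ["the cat sat", "the dog sat on the mat"]
tokenised = [d.lower().split() for d in docs]

# word_counts: total occurrences of each word across all documents.
word_counts = Counter(w for doc in tokenised for w in doc)

# word_docs: number of documents in which each word appears at least once.
word_docs = Counter(w for doc in tokenised for w in set(doc))

# word_index: rank by descending frequency, starting at 1 (0 is left for padding).
ranked = sorted(word_counts, key=lambda w: -word_counts[w])
word_index = {w: rank for rank, w in enumerate(ranked, start=1)}

# document_count: number of texts seen.
document_count = len(docs)

print(word_counts['the'], word_docs['the'], word_index['the'], document_count)
```

Because the most frequent words get the smallest ids, checking `id < MAX_NB_WORDS` later on is a way of keeping only the most frequent part of the vocabulary.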



### Now, tokenize the sentences using nltk `sent_tokenize()` and encode the sentences with the ids we got from `t.word_index` above

Initialise two lists named `texts` and `articles`.

```
texts = []     # stores the text of each article as it is
articles = []  # stores each article's text split into a list of sentences
```

In [0]:
from nltk.tokenize import sent_tokenize, word_tokenize

texts = dataset['articleBody']

articles = [sent_tokenize(i) for i in texts]

## Check 2:

The first elements of `texts` and `articles` should be as given below.

In [0]:
texts[0]

In [0]:
articles[0]

#### Now iterate through each article and each sentence to encode the words into ids using t.word_index <h1>[20 marks]</h1>

To get the words from a sentence, you can use `text_to_word_sequence` from keras preprocessing text.

1. Import text_to_word_sequence

2. Initialize a variable named `data` of shape (number of articles, MAX_SENTS, MAX_SENT_LENGTH) with zeros (you can use numpy [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html) for this), then update it while iterating through the sentences and words in each article.

In [0]:
from keras.preprocessing.text import text_to_word_sequence
import numpy as np

data = np.zeros((len(articles),MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')

In [0]:
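for i, article in enumerate(articles):
    # Sentences beyond MAX_SENTS are ignored.
    for j, text in enumerate(article[:MAX_SENTS]):
        wordTokens = text_to_word_sequence(text)
        k = 0  # next free word slot in sentence j
        for word in wordTokens:
            word_id = t.word_index.get(word)  # None if the tokenizer never saw the word
            # Keep only ids below MAX_NB_WORDS, and at most MAX_SENT_LENGTH words per sentence.
            if word_id is not None and k < MAX_SENT_LENGTH and word_id < MAX_NB_WORDS:
                data[i, j, k] = word_id
                k = k + 1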
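
As a minimal sketch of the same idea on made-up inputs, the block below encodes one toy article into a small grid. The limits, the `toy_word_index` vocabulary and the pre-split sentences are all hypothetical and much smaller than the real ones.

```python
import numpy as np

# Hypothetical limits and vocabulary, much smaller than the real ones.
toy_max_sents, toy_max_len = 2, 3
toy_word_index = {'the': 1, 'cat': 2, 'sat': 3, 'down': 4}

# One article, already split into sentences (three here, so the last is dropped).
toy_article = ['the cat sat down', 'the cat', 'sat down']

grid = np.zeros((toy_max_sents, toy_max_len), dtype='int32')
for j, sentence in enumerate(toy_article[:toy_max_sents]):
    ids = [toy_word_index[w] for w in sentence.split() if w in toy_word_index]
    ids = ids[:toy_max_len]       # truncate sentences that are too long
    grid[j, :len(ids)] = ids      # short sentences keep trailing zeros as padding

print(grid.tolist())
```

The first sentence is cut to three words, the second is padded with a zero, and the third sentence never enters the grid.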

### Check 3:

Accessing the first element of `data` should give output similar to the one below.

In [0]:
data[0, :, :]

### Repeat the same process for the `Headings` as well. Use variables with names `texts_heading` and `articles_heading` accordingly. <h1> [10 marks] </h1>

In [0]:
texts_heading = dataset['Headline']
articles_heading = [sent_tokenize(i) for i in texts_heading]

#print(texts_heading[0])
#print(articles_heading[0])

data_heading = np.zeros((len(articles_heading),MAX_SENTS_HEADING, MAX_SENT_LENGTH), dtype='int32')

#data_heading[0, :, :]

In [0]:
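for i, article in enumerate(articles_heading):
    # Only the first MAX_SENTS_HEADING sentences of each headline are encoded.
    for j, text in enumerate(article[:MAX_SENTS_HEADING]):
        wordTokens = text_to_word_sequence(text)
        k = 0  # next free word slot in sentence j
        for word in wordTokens:
            word_id = t.word_index.get(word)  # None if the tokenizer never saw the word
            if word_id is not None and k < MAX_SENT_LENGTH and word_id < MAX_NB_WORDS:
                data_heading[i, j, k] = word_id
                k = k + 1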

### Now that the features are ready, let's make the labels ready for the model to process.

### Convert labels into one-hot vectors

You can use [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) in pandas to create one-hot vectors.

In [0]:
import pandas as pd
# One column per stance class; the other get_dummies arguments keep their defaults.
labels = pd.get_dummies(dataset['Stance'])
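
For intuition, a one-hot encoding of the same kind can be sketched with NumPy alone. The labels below are made up, and this is an equivalent construction for illustration rather than what `get_dummies` does internally.

```python
import numpy as np

# Made-up stance labels drawn from the four classes of the challenge.
toy_stances = np.array(['unrelated', 'agree', 'discuss', 'agree', 'disagree'])

# np.unique sorts the class names and returns each label's position in that order.
classes, class_ids = np.unique(toy_stances, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes), dtype='float32')[class_ids]

print(classes.tolist())
print(one_hot[0].tolist())
```

Each row contains exactly one 1, and the column order follows the sorted class names, which is worth remembering when mapping model predictions back to stance names.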

### Check 4:

The shapes of `data` and `labels` should match the numbers given below.

In [0]:
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

### Shuffle the data

In [0]:
## get numbers upto no.of articles
indices = np.arange(data.shape[0])
## shuffle the numbers
np.random.shuffle(indices)

In [0]:
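## shuffle the data
data = data[indices]
data_heading = data_heading[indices]
## shuffle the labels according to data
## (labels is a DataFrame, so convert it to an array before indexing rows by position)
labels = np.asarray(labels, dtype='float32')[indices]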
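
The reason for applying one set of indices to every array can be seen in this small sketch. The toy arrays and the seeded generator are assumptions chosen so that alignment is easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable

# Row i of each array belongs to example i: body 10+i, heading 20+i, label i.
toy_bodies = np.array([[10], [11], [12], [13]])
toy_headings = np.array([[20], [21], [22], [23]])
toy_labels = np.array([0, 1, 2, 3])

perm = rng.permutation(len(toy_bodies))

# Index every array with the same permutation so rows stay paired.
toy_bodies = toy_bodies[perm]
toy_headings = toy_headings[perm]
toy_labels = toy_labels[perm]

print(bool((toy_bodies[:, 0] - 10 == toy_labels).all()))
```

Shuffling each array with its own call to `np.random.shuffle` would break this pairing and silently assign labels to the wrong articles.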

### Split into train and validation sets. Split the data in an 80:20 ratio to get the train and validation sets.


Use the variable names given below:

x_train, x_val - for body of articles.

x_heading_train, x_heading_val - for heading of articles.

y_train - for training labels.

y_val - for validation labels.

<h1> [10 marks] </h1>

In [0]:
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

In [0]:
x_train = data[:-num_validation_samples]
x_val = data[-num_validation_samples:]
x_heading_train = data_heading[:-num_validation_samples]
x_heading_val = data_heading[-num_validation_samples:]
y_train = labels[:-num_validation_samples]
y_val = labels[-num_validation_samples:]
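
The negative-index slicing used for the split can be checked on a toy array. The ten-element array and the `toy_` names are illustrative only.

```python
import numpy as np

toy_data = np.arange(10)
toy_split = 0.2

# Number of samples held out for validation (rounded down).
n_val = int(toy_split * toy_data.shape[0])

toy_train = toy_data[:-n_val]   # everything except the last n_val rows
toy_val = toy_data[-n_val:]     # the last n_val rows

print(len(toy_train), len(toy_val))
```

One caveat of this pattern: if `n_val` ever works out to 0, `toy_data[:-0]` is an empty array, so it only behaves as intended when the dataset is large enough for the chosen split.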

### Check 5:

The shape of x_train, x_val, y_train and y_val should match the below numbers.

In [0]:
print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

### Create embedding matrix with the glove embeddings


Run the code below to create `embedding_matrix`, which holds the GloVe embedding of every word in the tokenizer's vocabulary that also appears in the GloVe word list. Words without a GloVe vector keep an all-zero row.

In [0]:
vocab_size = 27873
# load the whole GloVe file into memory as a word -> vector dictionary
embeddings_index = dict()
with open('./glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 100))

for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words missing from GloVe, or with an index outside the matrix, keep an all-zero row
    if embedding_vector is not None and i < vocab_size:
        embedding_matrix[i] = embedding_vector
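
The same lookup can be tried end to end on a tiny in-memory file. The three-dimensional vectors, the `fake_glove` text and the `toy_word_index` vocabulary are invented for this sketch; only the "word v1 v2 ... vN" line layout matches the real GloVe files.

```python
import io
import numpy as np

# A made-up three-dimensional "GloVe" file, one word per line.
fake_glove = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

toy_embeddings = {}
for line in fake_glove:
    parts = line.split()
    toy_embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')

# Hypothetical tokenizer vocabulary; 'zebra' has no pre-trained vector.
toy_word_index = {'cat': 1, 'zebra': 2, 'dog': 3}

# Row 0 is reserved for padding, hence the +1.
toy_matrix = np.zeros((len(toy_word_index) + 1, 3), dtype='float32')
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = vector

print(toy_matrix)
```

Rows 0 and 2 stay at zero (padding and the unknown word), while rows 1 and 3 carry the pre-trained vectors.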

## Try a bidirectional LSTM model and report the accuracy score

<h1>[20 marks]  </h1>

In [0]:
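# The Sequential model expects 2-D input (samples, timesteps), so flatten each
# article's sentence grid into one sequence of MAX_SENTS * MAX_SENT_LENGTH word ids.
# Keeping the first axis intact leaves every row aligned with its label.
x_train = x_train.reshape(x_train.shape[0], -1)
x_val = x_val.reshape(x_val.shape[0], -1)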
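
A `Sequential` Embedding + LSTM stack takes 2-D integer input of shape (samples, timesteps). As a minimal sketch on made-up numbers, reshaping with the first axis kept fixed turns each article's sentence grid into one long sequence while leaving one row per article, so rows stay aligned with their labels.

```python
import numpy as np

# Two toy "articles", each with 2 sentences of 3 word ids.
toy = np.arange(12).reshape(2, 2, 3)

# Keep the first axis and let NumPy infer the rest: one row per article.
flat = toy.reshape(toy.shape[0], -1)

print(flat.shape)
print(flat[1].tolist())
```

By contrast, a reshape such as `toy.reshape(-1, 4)` produces rows that no longer correspond to individual articles, so the number of rows stops matching the number of labels.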

### Import layers from Keras to build the model

In [0]:
from keras.models import Sequential, Model
from keras.layers import Dense, Input, Flatten
from keras.layers import Embedding, Dropout, LSTM, GRU, Bidirectional
from keras.preprocessing import sequence

### Model

In [0]:
EMBEDDING_DIM = 100
LSTM_DIM = 100
model = Sequential()
model.add(Embedding(input_dim=embedding_matrix.shape[0],  # must equal the number of rows in the weight matrix
                          output_dim=EMBEDDING_DIM,
                          weights = [embedding_matrix], trainable=True, name='word_embedding_layer',  # set trainable=False to freeze the GloVe vectors
                          mask_zero=True))
model.add(Bidirectional(LSTM(LSTM_DIM, return_sequences=False, name='Bidrectional_lstm_layer1')))
model.add(Dropout(rate=0.8, name='dropout_1')) # the dropout rate can be varied; the paper suggests 0.8
model.add(Dense(4, activation='softmax', name='output_layer'))

### Compile and fit the model

In [0]:
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['acc'])

In [0]:
print("model fitting - Bidirectional LSTM")
model.summary()

In [0]:
#model.fit(np.array(x1_train), np.array(y_train),epochs=10,batch_size=32,validation_data=(np.array(x1_val), np.array(y_val)))

In [0]:
BATCH_SIZE = 128
N_EPOCHS = 40 
model.fit(np.array(x_train), np.array(y_train),
          batch_size=BATCH_SIZE,
          epochs=N_EPOCHS,
          validation_data=(np.array(x_val), np.array(y_val)))

### Add an Attention layer to the LSTM model to improve the accuracy score (Optional)

In [0]:
from keras.layers import Conv1D, MaxPooling1D, Embedding, Dropout, LSTM, GRU, Bidirectional, TimeDistributed
from keras.models import Model

from keras import backend as K
from keras.engine.topology import Layer, InputSpec
from keras import initializers
EMBEDDING_DIM = 100
embedding_layer = Embedding(embedding_matrix.shape[0],  # must equal the number of rows in the weight matrix
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH,
                            trainable=True,
                            mask_zero=True)


class AttLayer(Layer):
    def __init__(self, attention_dim):
        self.init = initializers.get('normal')
        self.supports_masking = True
        self.attention_dim = attention_dim
        super(AttLayer, self).__init__()

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = K.variable(self.init((input_shape[-1], self.attention_dim)))
        self.b = K.variable(self.init((self.attention_dim, )))
        self.u = K.variable(self.init((self.attention_dim, 1)))
        self.trainable_weights = [self.W, self.b, self.u]
        super(AttLayer, self).build(input_shape)

    def compute_mask(self, inputs, mask=None):
        return mask

    def call(self, x, mask=None):
        # size of x : [batch_size, seq_len, hidden_dim]
        # size of u : [attention_dim, 1]
        # uit = tanh(xW+b)
        uit = K.tanh(K.bias_add(K.dot(x, self.W), self.b))
        ait = K.dot(uit, self.u)
        ait = K.squeeze(ait, -1)

        ait = K.exp(ait)

        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            ait *= K.cast(mask, K.floatx())
        ait /= K.cast(K.sum(ait, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        ait = K.expand_dims(ait)
        weighted_input = x * ait
        output = K.sum(weighted_input, axis=1)

        return output

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])


sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
l_att = AttLayer(100)(l_lstm)
sentEncoder = Model(sentence_input, l_att)

review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(GRU(100, return_sequences=True))(review_encoder)
l_att_sent = AttLayer(100)(l_lstm_sent)
preds = Dense(4, activation='softmax')(l_att_sent)
model = Model(review_input, preds)

model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])

print("model fitting - Hierarchical attention network")
# This model takes 3-D input of shape (articles, MAX_SENTS, MAX_SENT_LENGTH),
# so use the split of the shuffled `data` tensor rather than a flattened copy.
BATCH_SIZE = 128
N_EPOCHS = 40
model.fit(data[:-num_validation_samples], np.array(y_train),
          batch_size=BATCH_SIZE,
          epochs=N_EPOCHS,
          validation_data=(data[-num_validation_samples:], np.array(y_val)))
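
The arithmetic inside `AttLayer.call` can be followed step by step in NumPy. This is a sketch under stated assumptions: the shapes, random weights and mask are invented, and `1e-7` stands in for `K.epsilon()`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, att_dim = 2, 4, 6, 5

x = rng.normal(size=(batch, seq_len, hidden))   # stand-in for the GRU outputs
W = rng.normal(size=(hidden, att_dim))
b = rng.normal(size=(att_dim,))
u = rng.normal(size=(att_dim, 1))
mask = np.array([[1, 1, 1, 0],                  # last step of sample 0 is padding
                 [1, 1, 1, 1]], dtype=float)

uit = np.tanh(x @ W + b)                        # (batch, seq_len, att_dim)
ait = np.exp((uit @ u).squeeze(-1)) * mask      # unnormalised scores, padding zeroed
ait = ait / (ait.sum(axis=1, keepdims=True) + 1e-7)
output = (x * ait[..., None]).sum(axis=1)       # weighted sum over time: (batch, hidden)

print(output.shape)
```

The weights in `ait` sum to roughly 1 over the unmasked steps of each sample, so the output is a weighted average of the recurrent states with padded positions contributing nothing.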