# Stance Detection for the Fake News Challenge

## Identifying Textual Relationships with Deep Neural Nets

### Check the problem context [here](https://drive.google.com/open?id=1KfWaZyQdGBw8AUTacJ2yY86Yxgw2Xwq0).

### Download files required for the project from [here](https://drive.google.com/open?id=10yf39ifEwVihw4xeJJR60oeFBY30Y5J8).

## Step 1: Load the given dataset

1. Mount Google Drive

2. Import the GloVe embeddings

3. Import the test and train datasets

In [0]:
from keras.layers import Dense,LSTM
from keras.utils import to_categorical
from keras.models import Sequential
from keras import optimizers
import keras
import pandas as pd
import gensim
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

In [0]:
import os
from nltk.tokenize import word_tokenize

In [0]:
lengthForBody = 424
lengthForHeadline = 45
indice = 2000

### Mount Google Drive to access the required project files

Run the commands below.

In [0]:
from google.colab import drive

In [162]:
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


#### Path for project files on Google Drive

**Note:** Change this path to match the location where you have kept the files in Google Drive.

In [0]:
project_path = "/content/drive/My Drive/SequentialNLP/Project2/"

### Loading the GloVe embeddings

In [0]:
from zipfile import ZipFile
with ZipFile(project_path+'glove.6B.zip', 'r') as z:
  z.extractall()

# Load the dataset [5 Marks]

1. Using [read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in pandas, load the given training files **`train_bodies.csv`** and **`train_stances.csv`**.

2. Using the [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) command in pandas, merge the two datasets on the `Body ID` column.

Note: Save the final merged dataset in a dataframe with name **`dataset`**.
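
As a minimal sketch of the merge step, the toy frames below stand in for `train_stances.csv` and `train_bodies.csv`. The row values are invented for illustration; only the column names (`Headline`, `Body ID`, `Stance`, `articleBody`) come from the dataset.

```python
import pandas as pd

# Toy stand-ins for train_stances.csv and train_bodies.csv (values are made up).
toy_stances = pd.DataFrame({
    "Headline": ["Headline A", "Headline B", "Headline C"],
    "Body ID": [1, 1, 2],
    "Stance": ["unrelated", "discuss", "agree"],
})
toy_bodies = pd.DataFrame({
    "Body ID": [1, 2],
    "articleBody": ["Body text one.", "Body text two."],
})

# Each stance row picks up the article body that shares its Body ID.
toy_dataset = pd.merge(toy_stances, toy_bodies, on="Body ID")
print(toy_dataset)
```

Because several headlines can point at the same body, the merged frame has one row per headline, with the body text repeated.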

In [0]:
train_stance = pd.read_csv(project_path+'train_stances.csv')
train_bodies = pd.read_csv(project_path+'train_bodies.csv')
dataset = pd.merge(train_stance,train_bodies[['Body ID', 'articleBody']],on='Body ID')



<h2> Check 1:</h2>

<h3> You should see the output below when you run `dataset.head()` </h3>

In [166]:
dataset.head()

| | Headline | Body ID | Stance | articleBody |
|---|---|---|---|---|
| 0 | Police find mass graves with at least '15 bodi... | 712 | unrelated | Danny Boyle is directing the untitled film\n\n... |
| 1 | Seth Rogen to Play Apple’s Steve Wozniak | 712 | discuss | Danny Boyle is directing the untitled film\n\n... |
| 2 | Mexico police find mass grave near site 43 stu... | 712 | unrelated | Danny Boyle is directing the untitled film\n\n... |
| 3 | Mexico Says Missing Students Not Found In Firs... | 712 | unrelated | Danny Boyle is directing the untitled film\n\n... |
| 4 | New iOS 8 bug can delete all of your iCloud do... | 712 | unrelated | Danny Boyle is directing the untitled film\n\n... |


## Step 2: Data pre-processing and setting the hyperparameters needed for the model


#### Run the code given below to set the required parameters.

1. `MAX_SENTS` = Maximum number of sentences to consider in an article.

2. `MAX_SENT_LENGTH` = Maximum number of words to consider in a sentence.

3. `MAX_NB_WORDS` = Maximum number of words in the total vocabulary.

4. `MAX_SENTS_HEADING` = Maximum number of sentences to consider in the heading of an article.

In [0]:
MAX_NB_WORDS = 20000
MAX_SENTS = 20
MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20
VALIDATION_SPLIT = 0.2

### Download the `punkt` models from nltk using the commands given below. They are needed for sentence tokenization.

For more info on how to use it, read [this](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk).



In [168]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Tokenizing the text and loading the pre-trained Glove word embeddings for each token  [5 marks] 

Keras provides [Tokenizer API](https://keras.io/preprocessing/text/) for preparing text. Read it before going any further.

#### Import the Tokenizer from keras preprocessing text

In [0]:
import nltk
from nltk.tokenize import word_tokenize
from sklearn.utils import shuffle
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional
from keras.optimizers import Adam
from keras import backend

#### Initialize the Tokenizer class with the maximum vocabulary size set to `MAX_NB_WORDS`, defined at the start of Step 2.

In [0]:
tokenizer = Tokenizer(MAX_NB_WORDS)

#### Now, using fit_on_texts() from the Tokenizer class, let's encode the data

Note: We need to fit on both `articleBody` and `Headline` so that the vocabulary covers all the words.

In [0]:
tokenizer.fit_on_texts(dataset['articleBody'].values)

In [0]:
tokenizer.fit_on_texts(dataset['Headline'].values)

In [0]:
word_index = tokenizer.word_index
word_docs = tokenizer.word_docs
document_count = tokenizer.document_count

idx_word = tokenizer.index_word
word_counts = tokenizer.word_counts
num_words = len(word_index) + 1

In [174]:
print( word_index)
print( word_docs)
print( word_counts)



#### fit_on_texts() sets the following attributes on the tokenizer, as described [here](https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/).

* **word_counts:** dictionary mapping words (str) to the number of times they appeared during fit. Only set after fit_on_texts was called.

* **word_docs:** dictionary mapping words (str) to the number of documents/texts they appeared in during fit. Only set after fit_on_texts was called.

* **word_index:** dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.

* **document_count:** int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.
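
To make these four attributes concrete, here is a small pure-Python sketch that computes equivalent values for two toy texts. It is not the Keras implementation: lower-casing plus whitespace splitting stands in for the Tokenizer's own text cleaning, and all `toy_*` names are hypothetical.

```python
from collections import Counter

# Two toy "documents"; lower-casing and whitespace splitting stand in for
# the Tokenizer's own filtering rules.
toy_texts = ["police find mass graves", "police find students"]
tokenized = [text.lower().split() for text in toy_texts]

# word_counts: total occurrences of each word across all texts.
toy_word_counts = Counter(word for doc in tokenized for word in doc)
# word_docs: number of texts in which each word appears at least once.
toy_word_docs = Counter(word for doc in tokenized for word in set(doc))
# word_index: rank by frequency, starting at 1 (index 0 is never assigned to a word).
ranked = sorted(toy_word_counts, key=lambda w: -toy_word_counts[w])
toy_word_index = {word: rank for rank, word in enumerate(ranked, start=1)}
# document_count: number of texts seen during fitting.
toy_document_count = len(toy_texts)

print(toy_word_index)
```

The most frequent words receive the smallest ids, which is why capping the vocabulary at `MAX_NB_WORDS` keeps the most common words.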



### Now, tokenize the sentences using nltk sent_tokenize() and encode the sentences with the ids we got from `tokenizer.word_index` above

Initialise 2 lists with names `texts` and `articles`.

```
texts = [] to store text of article as it is.

articles = [] split the above text into a list of sentences.
```

In [0]:
headlines, bodies, stances = dataset['Headline'], dataset['articleBody'], dataset['Stance']
from sklearn import preprocessing

stances = preprocessing.LabelEncoder().fit_transform(stances)

In [176]:
# Tokenize every headline into a list of words. Building a new Series with
# apply() avoids assigning into a slice of the DataFrame row by row.
headlines = headlines.apply(word_tokenize)


In [177]:
headlines

0        [Police, find, mass, graves, with, at, least, ...
1        [Seth, Rogen, to, Play, Apple, ’, s, Steve, Wo...
2        [Mexico, police, find, mass, grave, near, site...
3        [Mexico, Says, Missing, Students, Not, Found, ...
4        [New, iOS, 8, bug, can, delete, all, of, your,...
5        [Return, of, the, Mac, :, Seth, Rogen, in, tal...
6                                   [Seth, Rogen, Is, Woz]
7        [Mexico, finds, 4, more, graves, at, site, of,...
8        [Are, missing, students, in, mass, graves, fou...
9        [Mexico, prosecutor, :, Students, not, in, 1st...
10       [Lady, on, FB, :, I, 'm, 41, ,, Intersex, ,, a...
11       [Catholic, Priest, Claims, God, Is, Female, Af...
12       [Isis, claims, US, hostage, Kayla, Mueller, ki...
13       [Gold, Apple, Watch, Edition, price, ?, Specul...
14       [Mexican, students, not, among, bodies, found,...
15       [Steve, Jobs, Biopic, Eyes, Seth, Rogen, to, P...
16       [Missing, Mexico, students, not, among, 28, bo.

In [178]:
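headlineList = []
for eachSentence in headlines[0:MAX_NB_WORDS]:
    print(eachSentence)
    # Tokenize any headline that is still a raw string.
    tokens = eachSentence if isinstance(eachSentence, list) else word_tokenize(eachSentence)
    # Encode each token as its integer id from the fitted tokenizer's word_index;
    # tokens that are not in the vocabulary are skipped.
    headlineLi = [word_index[w.lower()] for w in tokens if w.lower() in word_index]
    headlineList.append(headlineLi)

# Pad (or truncate) every headline to lengthForHeadline ids; 0 marks padding.
headlineList = pad_sequences(headlineList,padding='post',maxlen=lengthForHeadline,value=0.0,dtype='float32')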

['Police', 'find', 'mass', 'graves', 'with', 'at', 'least', "'15", 'bodies', "'", 'near', 'Mexico', 'town', 'where', '43', 'students', 'disappeared', 'after', 'police', 'clash']
['Seth', 'Rogen', 'to', 'Play', 'Apple', '’', 's', 'Steve', 'Wozniak']
['Mexico', 'police', 'find', 'mass', 'grave', 'near', 'site', '43', 'students', 'vanished']
['Mexico', 'Says', 'Missing', 'Students', 'Not', 'Found', 'In', 'First', 'Mass', 'Graves']
['New', 'iOS', '8', 'bug', 'can', 'delete', 'all', 'of', 'your', 'iCloud', 'documents']
['Return', 'of', 'the', 'Mac', ':', 'Seth', 'Rogen', 'in', 'talks', 'to', 'star', 'as', 'Apple', 'co-founder', 'Steve', 'Wozniak', 'in', 'upcoming', 'Steve', 'Jobs', 'biopic']
['Seth', 'Rogen', 'Is', 'Woz']
['Mexico', 'finds', '4', 'more', 'graves', 'at', 'site', 'of', 'suspected', 'student', 'massacre']
['Are', 'missing', 'students', 'in', 'mass', 'graves', 'found', 'near', 'Iguala', ',', 'Mexico', '?']
['Mexico', 'prosecutor', ':', 'Students', 'not', 'in', '1st', 'mass', 'g

In [179]:
print(headlineList.shape)

(20000, 45)
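
The fixed width of 45 comes from `pad_sequences`. The sketch below re-implements, in plain NumPy, what that call does with `padding='post'` and the default `truncating='pre'`: short sequences get zeros appended, and long ones keep only their last `maxlen` items. `pad_post` and `toy_ids` are hypothetical names used only for this illustration.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0.0):
    """Pad at the end, or truncate from the front, so every row has maxlen items."""
    out = np.full((len(sequences), maxlen), value, dtype="float32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # keep the last maxlen items (truncating='pre')
        out[row, :len(kept)] = kept   # remaining positions keep the pad value
    return out

toy_ids = [[5, 2, 9], [7], [1, 2, 3, 4, 5, 6]]
padded = pad_post(toy_ids, maxlen=4)
print(padded)
```

Note that the third toy sequence loses its first two ids, so very long inputs are cut at the beginning rather than the end.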


In [0]:
bodyTextList = []
for eachSentence in bodies[0:MAX_NB_WORDS]:
    # Split the raw article text into word tokens, then encode each token as its
    # integer id from the tokenizer's word_index; unknown tokens are skipped.
    tokens = word_tokenize(eachSentence)
    bodyTextLi = [word_index[w.lower()] for w in tokens if w.lower() in word_index]
    bodyTextList.append(bodyTextLi)

# Pad (or truncate) every body to lengthForBody ids; 0 marks padding.
bodyTextList = pad_sequences(bodyTextList,padding='post',maxlen=lengthForBody,value=0.0,dtype='float32')

In [181]:
print(bodyTextList.shape)

(20000, 424)


In [182]:
bodyTextList[0]

## Check 2:

The first element of `texts` and `articles` should be as given below.

In [0]:
#texts[0]

In [0]:
#articles[0]

# Now iterate through each article and each sentence to encode the words into ids using `tokenizer.word_index`  [5 marks] 

Here, to get the words from a sentence, you can use `text_to_word_sequence` from keras preprocessing text.

1. Import text_to_word_sequence

2. Initialize a variable named `data` of shape (number of articles, MAX_SENTS, MAX_SENT_LENGTH) with zeros (you can use numpy [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html)), then update it while iterating through the words and sentences in each article.
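
The two steps above can be sketched on toy input as follows. This is an illustration under stated assumptions, not the reference solution: regular expressions stand in for nltk's `sent_tokenize` and Keras' `text_to_word_sequence`, the tiny `toy_word_index` replaces `tokenizer.word_index`, and the limits are shrunk so the result is easy to read.

```python
import re
import numpy as np

MAX_SENTS, MAX_SENT_LENGTH = 3, 4  # small values for the demo only
toy_word_index = {"police": 1, "find": 2, "graves": 3, "students": 4, "missing": 5}
toy_articles = ["Police find graves. Students missing!", "Students find police."]

# One row of word ids per sentence, one block of sentences per article.
data = np.zeros((len(toy_articles), MAX_SENTS, MAX_SENT_LENGTH), dtype="int32")
for i, article in enumerate(toy_articles):
    # Splitting after sentence-ending punctuation stands in for sent_tokenize.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article) if s]
    for j, sentence in enumerate(sentences[:MAX_SENTS]):
        # Lower-cased word extraction stands in for text_to_word_sequence.
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        ids = [toy_word_index[w] for w in words if w in toy_word_index]
        for k, word_id in enumerate(ids[:MAX_SENT_LENGTH]):
            data[i, j, k] = word_id

print(data[0])
```

Sentences beyond `MAX_SENTS` and words beyond `MAX_SENT_LENGTH` are dropped, and unused positions stay at 0.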

### Check 3:

Accessing first element in data should give something like given below.

In [0]:
#data[0, :, :]

# Repeat the same process for the `Headings` as well. Use variables with names `texts_heading` and `articles_heading` accordingly. [5 marks] 

### Now that the features are ready, let's prepare the labels for the model.

### Convert labels into one-hot vectors

You can use [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) in pandas to create one-hot vectors.
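
A minimal sketch of the one-hot conversion, using a made-up list of stance labels (the four class names are the ones that appear in the `Stance` column of the Fake News Challenge data):

```python
import pandas as pd

# Invented label sequence for illustration.
toy_stances = pd.Series(["unrelated", "discuss", "agree", "unrelated", "disagree"])

# get_dummies creates one column per class, ordered alphabetically by class name.
one_hot = pd.get_dummies(toy_stances)
labels = one_hot.values.astype("float32")

print(list(one_hot.columns))
print(labels[0])
```

Each row of `labels` contains a single 1 in the column of its class, which is the format `categorical_crossentropy` expects.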

### Check 4:

The shapes of data and labels should match the numbers given below.

In [0]:
#print('Shape of data tensor:', data.shape)
#print('Shape of label tensor:', labels.shape)

### Shuffle the data

In [0]:
## get indices from 0 up to the number of articles
#indices = np.arange(data.shape[0])
## shuffle the numbers
#np.random.shuffle(indices)

In [0]:
## shuffle the data
#data = data[indices]
#data_heading = data_heading[indices]
## shuffle the labels according to data
#labels = labels[indices]
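
The key point of the shuffle is that one permutation of indices is applied to every array, so samples and labels stay aligned. A small NumPy sketch with toy arrays (the names and the fixed seed are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed so the demo is repeatable
toy_data = np.arange(5) * 10       # samples: 0, 10, 20, 30, 40
toy_labels = np.arange(5)          # label i belongs to sample 10 * i

# Draw one permutation and index both arrays with it.
indices = rng.permutation(len(toy_data))
shuffled_data, shuffled_labels = toy_data[indices], toy_labels[indices]

print(shuffled_data, shuffled_labels)
```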

In [0]:
data_x = np.append(headlineList,bodyTextList,axis=1)
data_y = stances[0:MAX_NB_WORDS]

In [0]:
data_x,data_y = shuffle(data_x,data_y)

In [191]:
data_x.shape

(20000, 469)

### Split into train and validation sets. Split the train set in an 80:20 ratio to get the train and validation sets.


Use the variable names as given below:

x_train, x_val - for body of articles.

x_heading_train, x_heading_val - for heading of articles.

y_train - for training labels.

y_val - for validation labels.
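
One possible way to produce these six variables is a simple slice at the 80% mark, sketched below on toy arrays. It assumes the arrays have already been shuffled together; `data`, `data_heading` and `labels` here are small placeholders, not the real tensors.

```python
import numpy as np

VALIDATION_SPLIT = 0.2
n_samples = 10

# Toy arrays standing in for the encoded bodies, headings and one-hot labels.
data = np.arange(n_samples * 3).reshape(n_samples, 3)
data_heading = np.arange(n_samples * 2).reshape(n_samples, 2)
labels = np.eye(4)[np.arange(n_samples) % 4]

# The last 20% of the (already shuffled) samples become the validation set.
n_val = int(VALIDATION_SPLIT * n_samples)
x_train, x_val = data[:-n_val], data[-n_val:]
x_heading_train, x_heading_val = data_heading[:-n_val], data_heading[-n_val:]
y_train, y_val = labels[:-n_val], labels[-n_val:]

print(x_train.shape, x_val.shape)
```

Slicing all three arrays at the same position keeps each body, its heading and its label in the same split.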



In [0]:
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.2, random_state=42)
y_train = to_categorical(y_train,num_classes=4)
y_test = to_categorical(y_test,num_classes=4)

### Check 5:

The shape of x_train, x_val, y_train and y_val should match the below numbers.

In [193]:
print(x_train.shape)
print(y_train.shape)

(16000, 469)
(16000, 4)


### Create the embedding matrix with the GloVe embeddings


Run the code below to create `embedding_matrix`, which holds the GloVe vector for every word in the tokenizer vocabulary that is present in the GloVe word list. Words without a GloVe vector keep an all-zero row.

In [194]:
# load the whole embedding into memory
embeddings_index = dict()
with open('./glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix for words in training docs;
# tokenizer ids start at 1, so allocate one extra row (row 0 is left for padding)
embedding_matrix = np.zeros((len(word_index) + 1, 100))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Loaded 400000 word vectors.
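
The same lookup logic can be checked on a toy scale. In this sketch the three-dimensional vectors and both dictionaries are invented; the point is only to show that the matrix needs `len(word_index) + 1` rows, because tokenizer ids start at 1, and that words missing from the embedding file keep an all-zero row.

```python
import numpy as np

EMBEDDING_DIM = 3  # the notebook uses 100-dimensional GloVe vectors
toy_word_index = {"police": 1, "find": 2, "wozniak": 3}

# Pretend GloVe lookup; "wozniak" is deliberately missing.
toy_embeddings_index = {
    "police": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "find": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

# One extra row because ids start at 1; row 0 stays all-zero for padding.
toy_embedding_matrix = np.zeros((len(toy_word_index) + 1, EMBEDDING_DIM), dtype="float32")
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_embedding_matrix[i] = vector

print(toy_embedding_matrix)
```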


# Try the sequential model approach and report the accuracy score. [10 marks]  

### Import layers from Keras to build the model

In [0]:
nb_lstm_output = 4
nb_time_steps = lengthForBody + lengthForHeadline #number of column
nb_input_vector = 1

In [0]:
import numpy as np
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1],nb_input_vector))
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1],nb_input_vector))
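
The reshape is needed because a Keras LSTM layer expects 3-D input of shape (samples, time steps, features). A toy sketch of the same operation, with hypothetical array names:

```python
import numpy as np

# Two samples with three time steps each: shape (samples, timesteps).
toy_x = np.arange(6, dtype="float32").reshape(2, 3)

# Add a trailing feature axis of size 1: shape (samples, timesteps, features).
toy_x_3d = np.reshape(toy_x, (toy_x.shape[0], toy_x.shape[1], 1))

print(toy_x.shape, toy_x_3d.shape)
```

The values are unchanged; `toy_x[..., np.newaxis]` would give the same array.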

### Model

In [0]:
model = Sequential()
model.add(LSTM(units=nb_lstm_output,input_shape=(nb_time_steps,nb_input_vector)))
model.add(Dense(4,activation='softmax',name = 'dense1'))
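
As a quick sanity check on the size of this model, the parameter counts can be worked out by hand. The sketch below assumes the standard Keras LSTM layout (four weight sets, each with input weights, recurrent weights and a bias) and a Dense layer with bias; the helper names are hypothetical.

```python
def lstm_params(units, input_dim):
    # Input, forget and output gates plus the cell candidate: four weight sets,
    # each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # Weight matrix plus one bias per output unit.
    return inputs * outputs + outputs

nb_lstm_output, nb_input_vector = 4, 1
lstm_count = lstm_params(nb_lstm_output, nb_input_vector)   # 4 * (4 * 5 + 4) = 96
dense_count = dense_params(nb_lstm_output, 4)               # 4 * 4 + 4 = 20
total = lstm_count + dense_count
print(total)  # 116
```

Calling `model.summary()` should report matching numbers if these assumptions hold for the installed Keras version.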




### Compile and fit the model

In [0]:
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',optimizer=sgd ,metrics=['accuracy'])


In [199]:
x_train.shape

(16000, 469, 1)

In [200]:
x_test.shape

(4000, 469, 1)

In [201]:
y_train.shape

(16000, 4)

In [202]:
y_test.shape

(4000, 4)

In [203]:

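model.fit(x_train,y_train,epochs=100,batch_size=5000,verbose=1)

# evaluate returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
score=model.evaluate(x_test,y_test,batch_size=1000,verbose=1)
print('Test loss: %.4f, test accuracy: %.4f' % (score[0], score[1]))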

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

## Build the same model with attention layers included for better performance (Optional)

## Fit the model and report the accuracy score for the model with attention layer (Optional)