# Stance Detection for the Fake News Challenge

## Identifying Textual Relationships with Deep Neural Nets

### Check the problem context [here](https://drive.google.com/open?id=1KfWaZyQdGBw8AUTacJ2yY86Yxgw2Xwq0).

### Download files required for the project from [here](https://drive.google.com/open?id=10yf39ifEwVihw4xeJJR60oeFBY30Y5J8).

 ## <font color=red> Milestone - 1 </font>

## Step1: Load the given dataset <h1> [10 marks] </h1>

1. Mount Google Drive

2. Import the GloVe embeddings

3. Import the test and train datasets

### Mount Google Drive to access the required project files

Run the commands below.

In [0]:
from google.colab import drive

In [0]:
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


#### Path for the project files on Google Drive

**Note:** You need to change this path according to where you have kept the files in Google Drive.

In [0]:
project_path = "/content/drive/My Drive/DLCP/Project3/"

### Loading the GloVe embeddings

In [0]:
from zipfile import ZipFile
with ZipFile(project_path+'glove.6B.zip', 'r') as z:
  z.extractall()

### Load the dataset

1. Using [read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in pandas, load the given train dataset files **`train_bodies.csv`** and **`train_stances.csv`**.

2. Using the [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) command in pandas, merge the two datasets on the `Body ID` column.

Note: Save the final merged dataset in a dataframe named **`dataset`**.

In [0]:
import pandas as pd

# Load the article bodies and the headline/stance pairs.
train_bodies_df = pd.read_csv(project_path + 'train_bodies.csv')
train_stances_df = pd.read_csv(project_path + 'train_stances.csv')

# A right join on "Body ID" keeps every headline/stance row and attaches its article body.
dataset = pd.merge(train_bodies_df,
                   train_stances_df[["Body ID", "Headline", "Stance"]],
                   on="Body ID", how="right")
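
To see what the right join does, here is a minimal sketch on toy data. The two small frames are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy stand-ins for train_bodies.csv and train_stances.csv (values are illustrative only).
bodies = pd.DataFrame({
    "Body ID": [0, 1],
    "articleBody": ["body zero", "body one"],
})
stances = pd.DataFrame({
    "Body ID": [1, 0, 1],
    "Headline": ["h-a", "h-b", "h-c"],
    "Stance": ["agree", "unrelated", "discuss"],
})

# how="right" keeps one output row per headline/stance row.
merged = pd.merge(bodies, stances, on="Body ID", how="right")

print(merged.shape)                          # (3, 4)
print(merged["articleBody"].isna().sum())    # 0: every Body ID had a matching body
```

Because several headlines can point to the same body, the merged frame repeats the article text once per headline, which is why the first rows of `dataset` share the same `articleBody`.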


<h2> Check1:</h2>
  
<h3> You should see the below output if you run `dataset.head()` command as given below </h3>

In [0]:
dataset.head()

Unnamed: 0,Body ID,articleBody,Headline,Stance
0,0,A small meteorite crashed into a wooded area i...,"Soldier shot, Parliament locked down after gun...",unrelated
1,0,A small meteorite crashed into a wooded area i...,Tourist dubbed ‘Spider Man’ after spider burro...,unrelated
2,0,A small meteorite crashed into a wooded area i...,Luke Somers 'killed in failed rescue attempt i...,unrelated
3,0,A small meteorite crashed into a wooded area i...,BREAKING: Soldier shot at War Memorial in Ottawa,unrelated
4,0,A small meteorite crashed into a wooded area i...,Giant 8ft 9in catfish weighing 19 stone caught...,unrelated


## Step2: Data pre-processing and setting the hyperparameters needed for the model


#### Run the code given below to set the required parameters.

1. `MAX_SENTS` = Maximum number of sentences to consider in an article.

2. `MAX_SENT_LENGTH` = Maximum number of words to consider in a sentence.

3. `MAX_NB_WORDS` = Maximum number of words in the total vocabulary.

4. `MAX_SENTS_HEADING` = Maximum number of sentences to consider in the heading of an article.

5. `VALIDATION_SPLIT` = Fraction of the data held out for validation.

In [0]:
MAX_NB_WORDS = 20000
MAX_SENTS = 20
MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20
VALIDATION_SPLIT = 0.2

### Download the `punkt` package from nltk using the commands given below. It is used for sentence tokenization.

For more information on how to use it, read [this](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk).



In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Tokenizing the text and loading the pre-trained GloVe word embeddings for each token <h1> [10 marks] </h1>

Keras provides [Tokenizer API](https://keras.io/preprocessing/text/) for preparing text. Read it before going any further.

#### Import the Tokenizer from keras preprocessing text

In [0]:
import keras.preprocessing.text as kpt

Using TensorFlow backend.


#### Initialize the Tokenizer class with the maximum vocabulary size set to `MAX_NB_WORDS`, defined at the start of Step2.

In [0]:
t = kpt.Tokenizer(num_words=MAX_NB_WORDS)

#### Now, using fit_on_texts() from the Tokenizer class, let's build the vocabulary from the data

Note: Fit the tokenizer on both articleBody and Headline so that the vocabulary covers all the words.

In [0]:
article_body=dataset["articleBody"].astype('str')
headlines=dataset["Headline"].astype('str')

In [0]:
t.fit_on_texts(article_body)
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
encoded_article_body = t.texts_to_matrix(article_body, mode='count')
print(encoded_article_body)

49972
[[ 0. 20. 10. ...  0.  0.  0.]
 [ 0. 20. 10. ...  0.  0.  0.]
 [ 0. 20. 10. ...  0.  0.  0.]
 ...
 [ 0.  3.  3. ...  0.  0.  0.]
 [ 0.  3.  3. ...  0.  0.  0.]
 [ 0.  3.  3. ...  0.  0.  0.]]


In [0]:
idx_article_body = t.word_index

In [0]:
t.fit_on_texts(headlines)  # extends the vocabulary already fitted on the article bodies
print(t.word_counts)     # number of times each word occurs in the corpus
print(t.document_count)  # number of documents the tokenizer has been fitted on
print(t.word_index)      # mapping from each word to its integer index
print(t.word_docs)       # number of documents in which each word occurs
encoded_headlines = t.texts_to_matrix(headlines, mode='count')
print(encoded_headlines)

99944
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [0]:
idx_headlines = t.word_index

In [0]:
encoded_article_body[1400]

array([ 0., 19., 11., ...,  0.,  0.,  0.])

In [0]:
encoded_article_body[1400].shape

(20000,)

In [0]:
encoded_headlines[1400]

array([0., 1., 1., ..., 0., 0., 0.])

In [0]:
encoded_headlines[1400].shape

(20000,)

In [0]:
len(idx_article_body)

27427

In [0]:
len(idx_headlines)

27873

#### After fit_on_texts() is called, the Tokenizer exposes the following attributes, as described [here](https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/).

* **word_counts:** dictionary mapping words (str) to the number of times they appeared during fitting. Only set after fit_on_texts was called.

* **word_docs:** dictionary mapping words (str) to the number of documents/texts they appeared in during fitting. Only set after fit_on_texts was called.

* **word_index:** dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.

* **document_count:** int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.
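
To make these four attributes concrete, the sketch below computes the same kinds of statistics by hand on a toy corpus. It is a plain-Python stand-in, not the Keras implementation: the whitespace tokenization and the tie-breaking between equally frequent words are assumptions made for illustration.

```python
from collections import Counter

# Toy corpus; the notebook itself fits on article bodies and headlines.
docs = ["the cat sat", "the dog sat down", "the end"]

word_counts = Counter()   # total occurrences of each word
word_docs = Counter()     # number of documents containing each word
for doc in docs:
    tokens = doc.lower().split()   # simplified tokenization (assumption)
    word_counts.update(tokens)
    word_docs.update(set(tokens))

document_count = len(docs)

# Rank words by descending count; indices start at 1 so that 0 stays free for padding.
ranked = sorted(word_counts, key=lambda w: -word_counts[w])
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

print(word_counts["the"], word_docs["sat"], document_count, word_index["the"])
```

The most frequent word receives index 1, which is why small ids dominate the encoded sentences later in this notebook.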



### Now, tokenize the sentences using nltk sent_tokenize() and encode the sentences with the ids we got from the above `t.word_index`

Initialise two lists named `texts` and `articles`.

```
texts = [] stores the text of each article as it is.

articles = [] stores each article's text split into a list of sentences.
```

In [0]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [0]:
texts = list(article_body)  # keep every article; range(0, len-1) would drop the last one

In [0]:
articles = [sent_tokenize(text) for text in texts]  # one list of sentences per article

In [0]:
texts[500]

'DUBAI - A prominent Saudi Arabian cleric has whipped up controversy by issuing a religious ruling forbidding the building of snowmen, describing them as anti-Islamic.\n\nAsked on a religious website if it was permissible for fathers to build snowmen for their children after a snowstorm in the country\'s north, Sheikh Mohammed Saleh al-Munajjid replied: "It is not permitted to make a statue out of snow, even by way of play and fun."\n\nQuoting from Muslim scholars, Sheikh Munajjid argued that to build a snowman was to create an image of a human being, an action considered sinful under the kingdom\'s strict interpretation of Sunni Islam.\n\n"God has given people space to make whatever they want which does not have a soul, including trees, ships, fruits, buildings and so on," he wrote in his ruling.\n\nThat provoked swift responses from Twitter users writing in Arabic and identifying themselves with Arab names.\n\n"They are afraid for their faith of everything ... sick minds," one Twitte

In [0]:
articles[500]

['DUBAI - A prominent Saudi Arabian cleric has whipped up controversy by issuing a religious ruling forbidding the building of snowmen, describing them as anti-Islamic.',
 'Asked on a religious website if it was permissible for fathers to build snowmen for their children after a snowstorm in the country\'s north, Sheikh Mohammed Saleh al-Munajjid replied: "It is not permitted to make a statue out of snow, even by way of play and fun."',
 "Quoting from Muslim scholars, Sheikh Munajjid argued that to build a snowman was to create an image of a human being, an action considered sinful under the kingdom's strict interpretation of Sunni Islam.",
 '"God has given people space to make whatever they want which does not have a soul, including trees, ships, fruits, buildings and so on," he wrote in his ruling.',
 'That provoked swift responses from Twitter users writing in Arabic and identifying themselves with Arab names.',
 '"They are afraid for their faith of everything ... sick minds," one T

## Check 2:

The first elements of `texts` and `articles` should match the output given below.

In [0]:
texts[0]

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumbe

In [0]:
articles[0]

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.",
 "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.",
 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."',
 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.',
 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.',
 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.',
 'He said it is still not clear if the meteorite disintegrated or was bu

 ## <font color=red> Milestone - 2 </font>

#### Now iterate through each article and each sentence to encode the words into ids using t.word_index <h1>[10 marks]</h1>

To get the words of a sentence, you can use `text_to_word_sequence` from keras preprocessing text.

1. Import text_to_word_sequence

2. Initialize a variable named `data` of shape (number of articles, MAX_SENTS, MAX_SENT_LENGTH) filled with zeros (you can use numpy [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html)), then update it while iterating through the words and sentences of each article.
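
One possible way to fill such an array is sketched below on toy data. It is not the approach used in the cells that follow: unlike the dictionary-scan loops later in this notebook, it keeps the words in sentence order and truncates long sentences instead of skipping them. The word index, the sample sentences and the helper name `encode_articles` are hypothetical, and plain `split()` stands in for `text_to_word_sequence`.

```python
import numpy as np

MAX_SENTS = 3         # small values so the result is easy to read
MAX_SENT_LENGTH = 4

# Hypothetical word index; in the notebook this would be t.word_index.
word_index = {"the": 1, "cat": 2, "sat": 3, "dog": 4, "ran": 5, "far": 6, "away": 7}

# Each article is already split into sentences (as sent_tokenize would do).
articles = [
    ["The cat sat", "The dog ran far away"],
    ["The cat ran"],
]

def encode_articles(articles, word_index, max_sents, max_len):
    """Return an int array of shape (len(articles), max_sents, max_len)."""
    data = np.zeros((len(articles), max_sents, max_len), dtype=int)
    for i, sentences in enumerate(articles):
        for j, sentence in enumerate(sentences[:max_sents]):
            # Simplified tokenization; text_to_word_sequence also strips punctuation.
            words = sentence.lower().split()
            # Keep sentence order, drop unknown words, truncate long sentences.
            ids = [word_index[w] for w in words if w in word_index][:max_len]
            data[i, j, :len(ids)] = ids
    return data

data = encode_articles(articles, word_index, MAX_SENTS, MAX_SENT_LENGTH)
print(data.shape)
print(data[0])
```

Rows that are never written remain zero, so 0 acts as the padding id for short sentences and short articles alike.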

In [0]:
import numpy as np
number_of_articles=len(article_body)
number_of_articles

49972

In [0]:
num_sents = [len(sentences) for sentences in articles]  # number of sentences in each article
num_sents

[16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 16,
 5,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 19,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 14,
 77,
 77,
 77,
 77,
 

In [0]:
max_num_sents = max(num_sents)
max_num_sents

203

In [0]:
max_num_words = len(idx_article_body)
max_num_words

27427

In [0]:
datax = np.zeros((number_of_articles,MAX_SENTS,MAX_SENT_LENGTH),dtype = int) 
#datay = np.zeros((number_of_articles,max_num_sents,max_num_words)) # Recommended

In [0]:
w_501_13 = kpt.text_to_word_sequence(articles[500][12])  # words in the 13th sentence of the 501st article

In [0]:
# Collect the ids of the distinct words in the sentence; they come out in vocabulary (rank) order, not sentence order.
listOfValues = [value for (key, value) in idx_article_body.items() if key in w_501_13]
listOfValues  # ids for the 13th sentence of the 501st article

[1,
 4,
 11,
 14,
 18,
 20,
 70,
 137,
 154,
 310,
 574,
 615,
 778,
 944,
 1176,
 1247,
 1416,
 1454,
 2203,
 2571,
 2763,
 4207,
 10312,
 10527,
 16504,
 16505]

In [0]:
datax

array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 

In [0]:
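for i in range(501,510):  # examine the documents at indices 501 to 509
  for j in range(0,min(MAX_SENTS,num_sents[i])):
    # Tokenize the sentence once instead of once per vocabulary entry.
    words = set(kpt.text_to_word_sequence(articles[i][j]))
    # Ids come out in vocabulary (rank) order, one per distinct word, not in sentence order.
    listOfValues = [value for (key, value) in idx_article_body.items() if key in words]
    # Sentences with MAX_SENT_LENGTH or more distinct words are skipped and keep their all-zero row.
    if len(listOfValues) < MAX_SENT_LENGTH:
         listOfValues = np.append(listOfValues,np.zeros(MAX_SENT_LENGTH-len(listOfValues),dtype = int))
         print(listOfValues)
         datax[i,j,:]=listOfValues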

[  1   3   5   7   8   9  44 103 115 166 246 284 432 436 637   0   0   0
   0   0]
[   1    4    8  108  339 1549 2249 3648 4046 7660    0    0    0    0
    0    0    0    0    0    0]
[   6   10   51  276  307  451  494 1188 1203 2265 3066 4813 7370    0
    0    0    0    0    0    0]
[   1   33   51  108  272 1745 7710    0    0    0    0    0    0    0
    0    0    0    0    0    0]
[   1    2    4   23  108  246  445  623  937  954 3050 3066 3166 4517
 6939    0    0    0    0    0]
[   1    8   22   30   39   59   72  142  198  233  498  509 1406 2079
 3066    0    0    0    0    0]
[   7    8   26   43   96   97  142  246  273  774  852 2428 2692    0
    0    0    0    0    0    0]
[   1    3    5    8   23   38   41   64  108  164  268  848  886 1321
 2897    0    0    0    0    0]
[   3    4   25  526  998 8626    0    0    0    0    0    0    0    0
    0    0    0    0    0    0]
[   1    8  335  967 1226 1721    0    0    0    0    0    0    0    0
    0    0    0    0  

In [0]:
datax[509,:,:]

array([[    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    1,     3,     5,     7,     8,     9,    44,   103,   115,
          166,   246,   284,   432,   436,   637,     0,     0,     0,
            0,     0],
       [    1,     4,     8,   108,   339,  1549,  2249,  3648,  4046,
         7660,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    6,    10,    51,   276,   307,   451,   494,  1188,  1203,
         2265,  3066,  4813,  7370,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [  

In [0]:
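number_of_articles = 500  # only the first 500 articles are encoded here; rows of datax not visited by a loop stay zero
for i in range(0,number_of_articles):
  for j in range(0,min(MAX_SENTS,num_sents[i])):
    # Tokenize the sentence once instead of once per vocabulary entry.
    words = set(kpt.text_to_word_sequence(articles[i][j]))
    # Ids come out in vocabulary (rank) order, one per distinct word, not in sentence order.
    listOfValues = [value for (key, value) in idx_article_body.items() if key in words]
    # Sentences with MAX_SENT_LENGTH or more distinct words are skipped and keep their all-zero row.
    if len(listOfValues) < MAX_SENT_LENGTH:
         listOfValues = np.append(listOfValues,np.zeros(MAX_SENT_LENGTH-len(listOfValues),dtype = int))
         print(listOfValues)
         datax[i,j,:]=listOfValues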

[   1    3    4    5   12   79   88  318  369  444  500  532 1668 2886
 3664 4576 7117    0    0    0]
[   1    3    7   92   94  154  171  227  471  533  727  779 1032 1095
 1223 1778 2062 2959 2963    0]
[    1     5    30    42    66    92   174   191   712   751   980  1210
  1665  2487  3773  3901  6689 13146 17648     0]
[   2    5   12   38   67  255  337  359  478  816 1058 1796 2275 2316
 4371    0    0    0    0    0]
[   1    8    9   12   13   15   25   41   64  142  521  532 1808 3656
    0    0    0    0    0    0]
[   2   12   13   15   17   23   37   41   52   68  117 1725 1951 4830
    0    0    0    0    0    0]
[   2    3   17   20   37   41   64  251  256  491 1058    0    0    0
    0    0    0    0    0    0]
[   1    3    5   12   29   33   94  156  255  270  513  710  727  855
 1032 1435 1740 1778 2058    0]
[   3    4    6    9   10   35   57  102  114  235  499  559  639 1970
 2008 2350 5726    0    0    0]
[   1    3    9   15   24   37   54  117  227  333  5

In [0]:
datax[0, :, :]  # word encodings for the first document

array([[    1,     3,     4,     5,    12,    79,    88,   318,   369,
          444,   500,   532,  1668,  2886,  3664,  4576,  7117,     0,
            0,     0],
       [    1,     3,     7,    92,    94,   154,   171,   227,   471,
          533,   727,   779,  1032,  1095,  1223,  1778,  2062,  2959,
         2963,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    1,     5,    30,    42,    66,    92,   174,   191,   712,
          751,   980,  1210,  1665,  2487,  3773,  3901,  6689, 13146,
        17648,     0],
       [    2,     5,    12,    38,    67,   255,   337,   359,   478,
          816,  1058,  1796,  2275,  2316,  4371,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [  

### Check 3:

Accessing the first element of `data` should give output similar to that shown below.

In [0]:
#data[0, :, :]

### Repeat the same process for the headlines as well. Use variables named `texts_heading` and `articles_heading` accordingly. <h1> [10 marks] </h1>

In [0]:
texts_heading = list(headlines)  # keep every headline; range(0, len-1) would drop the last one

In [0]:
len(headlines)

49972

In [0]:
articles_heading = [sent_tokenize(text) for text in texts_heading]  # one list of sentences per headline

In [0]:
texts_heading[500]

'Saudi cleric condemns snowmen as anti-Islamic'

In [0]:
articles_heading[500]

['Saudi cleric condemns snowmen as anti-Islamic']

In [0]:
number_of_headings=len(headlines)
number_of_headings

49972

In [0]:
data_heading = np.zeros((number_of_headings,1,MAX_SENT_LENGTH),dtype = int)

In [0]:
for i in range(0,number_of_articles):  # encodes the first number_of_articles headlines (500 at this point)
  for j in range(MAX_SENTS_HEADING):   # each headline is treated as a single sentence
    words = set(kpt.text_to_word_sequence(headlines[i]))  # tokenize the headline once
    listOfValues = [value for (key, value) in idx_headlines.items() if key in words]
    # Headlines with MAX_SENT_LENGTH or more distinct words are skipped and keep their all-zero row.
    if len(listOfValues) < MAX_SENT_LENGTH:
         listOfValues = np.append(listOfValues,np.zeros(MAX_SENT_LENGTH-len(listOfValues),dtype = int))
         print(listOfValues)
         data_heading[i,j,:]=listOfValues

[   21    34   193   206   233   343   686   718  1338  7134 11554     0
     0     0     0     0     0     0     0     0]
[   11    34   158   211   393   589  3562  6459  8189 15777 19331     0
     0     0     0     0     0     0     0     0]
[    5  1012  1258  1307  1390  2068  9888 13496     0     0     0     0
     0     0     0     0     0     0     0     0]
[   5   21  206  233  500  686  718 1537    0    0    0    0    0    0
    0    0    0    0    0    0]
[   5   14   34  450  493  565 1081 1492 1541 1974 2930 3299 3527 4146
 6642    0    0    0    0    0]
[   3    8   15 1081 1097 1192 2308 4425 5588    0    0    0    0    0
    0    0    0    0    0    0]
[   1    2    3   25   62  159  400  421  621  702 1342    0    0    0
    0    0    0    0    0    0]
[   5   10   11   40   99  118  194  343  500  816  912 3521 9844    0
    0    0    0    0    0    0]
[   5   21  206  233  686  718 1421    0    0    0    0    0    0    0
    0    0    0    0    0    0]
[    4    17 

### Now that the features are ready, let's prepare the labels for the model.

### Convert labels into one-hot vectors

You can use [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) in pandas to create one-hot vectors.
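
As a small illustration, the sketch below one-hot encodes a handful of stance labels. The toy series is made up; in the notebook the labels come from `dataset["Stance"]`.

```python
import pandas as pd

# Toy stance labels; the real ones come from dataset["Stance"].
stances = pd.Series(["unrelated", "agree", "discuss", "unrelated", "disagree"])

# dtype=int gives 0/1 columns; the columns are ordered alphabetically by label.
one_hot = pd.get_dummies(stances, dtype=int)

print(list(one_hot.columns))   # ['agree', 'disagree', 'discuss', 'unrelated']
print(one_hot.to_numpy()[0])   # [0 0 0 1]
```

Each row contains exactly one 1, in the column of its stance, which is the format expected by a softmax output trained with categorical cross-entropy.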

In [0]:
vv = datax[3, 1, :] # One hot Codes for the 2nd sentence of the 4th Document in articleBody
pd.get_dummies(vv)

Unnamed: 0,0,1,3,7,92,94,154,171,227,471,533,727,779,1032,1095,1223,1778,2062,2959,2963
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


In [0]:
data_heading[6,0,:]

array([   3,    8,   15, 1081, 1097, 1192, 2308, 4425, 5588,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0])

In [0]:
vv = data_heading[6,0,:] # One hot Codes for the 1st sentence of the 7th Document in Headlines
pd.get_dummies(vv)

Unnamed: 0,0,3,8,15,1081,1097,1192,2308,4425,5588
0,0,1,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,0,0,1,0,0,0
6,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,1,0
8,0,0,0,0,0,0,0,0,0,1
9,1,0,0,0,0,0,0,0,0,0


### Check 4:

The shapes of the data and labels should match the numbers given below.

In [0]:
labels=dataset["Stance"].astype('str')
labels

0        unrelated
1        unrelated
2        unrelated
3        unrelated
4        unrelated
5        unrelated
6        unrelated
7        unrelated
8        unrelated
9        unrelated
10       unrelated
11       unrelated
12       unrelated
13       unrelated
14       unrelated
15       unrelated
16       unrelated
17       unrelated
18       unrelated
19       unrelated
20       unrelated
21       unrelated
22       unrelated
23       unrelated
24           agree
25       unrelated
26       unrelated
27       unrelated
28       unrelated
29       unrelated
           ...    
49942    unrelated
49943    unrelated
49944    unrelated
49945    unrelated
49946    unrelated
49947    unrelated
49948    unrelated
49949    unrelated
49950    unrelated
49951    unrelated
49952      discuss
49953    unrelated
49954      discuss
49955      discuss
49956    unrelated
49957    unrelated
49958    unrelated
49959    unrelated
49960    unrelated
49961    unrelated
49962      discuss
49963    unr

In [0]:
labels = pd.get_dummies(labels)

In [0]:
print('Shape of data tensor:', datax.shape)
print('Shape of Headings tensor:', data_heading.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (49972, 20, 20)
Shape of Headings tensor: (49972, 1, 20)
Shape of label tensor: (49972, 4)


### Shuffle the data

In [0]:
## indices 0 .. (number of articles - 1)
indices = np.arange(datax.shape[0])
## shuffle the indices in place
np.random.shuffle(indices)

In [0]:
## shuffle the data
datax = datax[indices]
data_heading = data_heading[indices]
## shuffle the labels according to data
labels = labels.iloc[indices]
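
The important point in the shuffle is that features and labels are indexed with the same permutation. The sketch below checks this on toy arrays; the fixed seed is an assumption added here for reproducibility and is not part of the cells above.

```python
import numpy as np

# Toy features and labels whose rows correspond one to one.
features = np.arange(10).reshape(5, 2)        # row k is [2k, 2k+1]
labels = np.array([0, 1, 2, 3, 4])            # label k belongs to row k

rng = np.random.default_rng(seed=42)          # fixed seed (assumption) for reproducibility
indices = rng.permutation(len(features))

# Index both arrays with the same permutation so that pairs stay aligned.
features_shuffled = features[indices]
labels_shuffled = labels[indices]

# Every shuffled row still matches its label.
print(np.array_equal(features_shuffled[:, 0], 2 * labels_shuffled))
```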

In [0]:
labels

Unnamed: 0,agree,disagree,discuss,unrelated
38684,1,0,0,0
20535,0,0,0,1
12964,0,0,1,0
37818,0,0,1,0
41212,0,0,1,0
15359,0,0,1,0
36952,0,0,0,1
3881,0,0,0,1
39220,0,0,0,1
24206,1,0,0,0


### Split into train and validation sets

Split the data in an 80:20 ratio to get the train and validation sets.


Use the variable names given below:

`x_train`, `x_val` - for the bodies of the articles.

`x_heading_train`, `x_heading_val` - for the headings of the articles.

`y_train` - for the training labels.

`y_val` - for the validation labels.

<h1> [10 marks] </h1>

In [0]:
train_test_split = 0.8  # fraction of the data used for training
number_of_articles = len(headlines)
split_position = np.ceil(train_test_split*number_of_articles).astype(int) #39978
split_position

39978

In [0]:
x_train, x_val = datax[:split_position,:,:], datax[split_position:,:,:]

In [0]:
x_heading_train, x_heading_val = data_heading[:split_position,:,:], data_heading[split_position:,:,:]

In [0]:
y_train, y_val = labels[:split_position], labels[split_position:]
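
The split arithmetic can be checked in isolation. The sketch below uses a hypothetical helper, `split_at_fraction`, on a zero-filled array with the same number of rows as the dataset.

```python
import numpy as np

def split_at_fraction(array, train_fraction):
    """Split along axis 0; the first ceil(fraction * n) rows form the training part."""
    split_position = int(np.ceil(train_fraction * len(array)))
    return array[:split_position], array[split_position:]

# Zero-filled stand-in with the same number of rows as the dataset.
toy = np.zeros((49972, 1), dtype=int)
train_part, val_part = split_at_fraction(toy, 0.8)

print(train_part.shape[0], val_part.shape[0])
```

Because the data was shuffled beforehand, slicing at a fixed position is enough to obtain a random 80:20 split.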

In [0]:
print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

(39978, 20, 20)
(39978, 4)
(9994, 20, 20)
(9994, 4)


In [0]:
print(x_heading_train.shape)

(39978, 1, 20)


### Check 5:

The shape of x_train, x_val, y_train and y_val should match the below numbers.

In [0]:
print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

(39978, 20, 20)
(39978, 4)
(9994, 20, 20)
(9994, 4)


### Create the embedding matrix with the GloVe embeddings


Run the code below to create the embedding matrices, which hold the GloVe vector of every vocabulary word that is present in the GloVe word list. Rows for words without a GloVe vector stay zero.

In [0]:
# load the whole embedding into memory
embeddings_index = dict()
# GloVe files are UTF-8 text: each line holds a word followed by its vector components
with open('./glove.6B.100d.txt', encoding='utf-8') as f:
	for line in f:
		values = line.split()
		word = values[0]
		coefs = np.asarray(values[1:], dtype='float32')
		embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
MAX_NB_WORDS = 40000  # larger than the vocabulary sizes above (27427 and 27873), so every word index fits
# create a weight matrix for words in the article bodies
embedding_matrix_articles = np.zeros((MAX_NB_WORDS, 100))

## Note: idx_article_body is the word_index saved after fitting the Tokenizer on the article bodies.
## If the loop below raises an error, check that this variable name matches the one you used.
for word, i in idx_article_body.items():
	embedding_vector_articles = embeddings_index.get(word)
	if embedding_vector_articles is not None:
		embedding_matrix_articles[i] = embedding_vector_articles

Loaded 400000 word vectors.
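
The same construction can be tried on a tiny in-memory example. In the sketch below the two "GloVe" lines, the word index and the matrix size are all made up; the bounds check `i < num_words` is an extra guard that the cell above does not need because its matrix is large enough.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text format: a word, then its vector components.
fake_glove = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

embeddings_index = {}
for line in fake_glove:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Hypothetical word index: "bird" has no pretrained vector.
word_index = {"cat": 1, "bird": 2, "dog": 3}
num_words, embed_dim = 4, 3   # index 0 is reserved for padding

embedding_matrix = np.zeros((num_words, embed_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    # Skip indices beyond the matrix and words missing from the pretrained set.
    if i < num_words and vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix)
```

Row 0 (padding) and the row for "bird" stay zero, while the rows for "cat" and "dog" receive their pretrained vectors.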


In [0]:
# create a weight matrix for words in the headlines
embedding_matrix_headlines = np.zeros((MAX_NB_WORDS, 100))

## Note: idx_headlines is the word_index saved after fitting the Tokenizer on the headlines as well.
## If the loop below raises an error, check that this variable name matches the one you used.
for word, i in idx_headlines.items():
	embedding_vector_headlines = embeddings_index.get(word)
	if embedding_vector_headlines is not None:
		embedding_matrix_headlines[i] = embedding_vector_headlines

In [0]:
embedding_matrix_headlines[10]

 ## <font color=red> Milestone - 3 </font>

## Try different sequential models and report accuracy scores for each model.

<h1>[50 marks]  </h1>

### Import layers from Keras to build the model

In [0]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D


In [0]:
from keras import backend as K
K.clear_session()

### Model

In [0]:
embed_dim = 100   # dimensionality of the GloVe vectors
lstm_units = 200

model1 = Sequential()

# Frozen embedding layer initialised with the GloVe vectors; each article is a flat sequence of MAX_SENT_LENGTH*MAX_SENTS word ids
e = Embedding(MAX_NB_WORDS, embed_dim, weights=[embedding_matrix_articles], input_length=MAX_SENT_LENGTH*MAX_SENTS, trainable=False)
model1.add(e)

model1.add(SpatialDropout1D(0.4))

model1.add(LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2))

# One output unit per stance class
model1.add(Dense(4,activation='softmax'))
model1.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model1.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 100)          4000000   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 400, 100)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               240800    
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 804       
Total params: 4,241,604
Trainable params: 241,604
Non-trainable params: 4,000,000
_________________________________________________________________
None
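
The parameter counts in the summary can be reproduced with a few lines of arithmetic. The helper functions below are written for this check only and use the standard formulas for embedding, LSTM and dense layers with biases.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

embedding = embedding_params(40000, 100)
lstm = lstm_params(100, 200)
dense = dense_params(200, 4)
print(embedding, lstm, dense, embedding + lstm + dense)
```

The embedding layer accounts for all 4,000,000 non-trainable parameters because it was created with `trainable=False`; the LSTM and dense layers make up the 241,604 trainable ones.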


In [0]:
history1 = model1.fit(x_train.reshape(x_train.shape[0], -1), np.array(y_train), batch_size=32, epochs=3, validation_split=0.1, verbose=1)  # flatten each article to 400 word ids and train

Train on 35980 samples, validate on 3998 samples
Epoch 1/3
 5984/35980 [===>..........................] - ETA: 12:22 - loss: 0.8256 - acc: 0.7331
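
The `reshape` in the `fit` call above turns each article from a (20, 20) grid of word ids into a single sequence of 400 ids. The toy example below, with made-up dimensions, shows how the sentences end up laid end to end.

```python
import numpy as np

# Toy batch: 2 articles, 3 sentences each, 4 word ids per sentence.
x = np.arange(24).reshape(2, 3, 4)

# Flatten the sentence and word axes into one sequence per article.
flat = x.reshape(x.shape[0], -1)

print(flat.shape)   # (2, 12)
print(flat[0])      # sentences of article 0 laid end to end
```

Padding zeros inside short sentences therefore appear in the middle of the flattened sequence, not only at its end.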

In [0]:
model2 = Sequential()

# Same architecture as model1, but fed with the headlines only (MAX_SENT_LENGTH word ids each)
e = Embedding(MAX_NB_WORDS, embed_dim, weights=[embedding_matrix_headlines], input_length=MAX_SENT_LENGTH, trainable=False)
model2.add(e)

model2.add(SpatialDropout1D(0.4))

model2.add(LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2))

# One output unit per stance class
model2.add(Dense(4,activation='softmax'))
model2.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model2.summary())

In [0]:
history2 = model2.fit(x_heading_train.reshape(x_heading_train.shape[0], -1), np.array(y_train), batch_size=32, epochs=3, validation_split=0.1, verbose=1)  # flatten each headline to 20 word ids and train

### Compile and fit the model