<img src="http://drive.google.com/uc?export=view&id=1tpOCamr9aWz817atPnyXus8w5gJ3mIts" width=500px>

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

### Package Version:
- tensorflow==2.2.0
- pandas==1.0.5
- numpy==1.18.5
- google==2.0.3

# Sarcasm Detection

### Dataset

#### Acknowledgement
Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

### Load Data (5 Marks)

In [1]:
from google.colab import drive

# Mount Google Drive so the dataset and GloVe files are reachable under /content/drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
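import json

def parseJson(fname):
    # Read the file line by line; json.loads parses each line as data
    # instead of executing it as Python code, which eval would do
    with open(fname, 'r') as f:
        for line in f:
            yield json.loads(line)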
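The loader above reads the dataset one line at a time, which suggests one JSON object per line. As a minimal sketch of that parsing step, assuming an in-memory sample in place of the real file (the `sample` text and the `parse_json_lines` helper are illustrative and not part of the notebook), `json.loads` turns each line into a dictionary without evaluating it as Python code:

```python
import io
import json

def parse_json_lines(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)

# Illustrative two-record sample standing in for the dataset file
sample = io.StringIO(
    '{"headline": "example headline one", "is_sarcastic": 0}\n'
    '{"headline": "example headline two", "is_sarcastic": 1}\n'
)
records = list(parse_json_lines(sample))
print(records[1]["is_sarcastic"])
```

Passing an open file handle instead of the `StringIO` object would work the same way.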

In [3]:
import numpy as np

project_path = '/content/drive/My Drive/Sarcasm Detection/Data/'
data = list(parseJson(project_path+'Sarcasm_Headlines_Dataset.json'))

data[0:5]

[{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
  'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
  'is_sarcastic': 0},
 {'article_link': 'https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365',
  'headline': "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
  'is_sarcastic': 0},
 {'article_link': 'https://local.theonion.com/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697',
  'headline': "mom starting to fear son's web series closest thing she will have to grandchild",
  'is_sarcastic': 1},
 {'article_link': 'https://politics.theonion.com/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302',
  'headline': 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
  'is_sarcastic': 1},
 {'article_link': 'https://www.huffingtonpost.com/entry/jk-rowling-w

### Drop `article_link` from dataset (5 Marks)

In [4]:
import pandas as pd
df = pd.DataFrame(data)
df.head()

Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0


In [5]:
# Remove the URL column; only the headline text and the label are used for modelling
df.drop('article_link', axis=1, inplace=True)


### Get length of each headline and add a column for that (5 Marks)

In [6]:
# Number of characters in each headline (not the number of words)
df['length'] = df['headline'].str.len()
df.head()

Unnamed: 0,headline,is_sarcastic,length
0,former versace store clerk sues over secret 'b...,0,78
1,the 'roseanne' revival catches up to our thorn...,0,84
2,mom starting to fear son's web series closest ...,1,79
3,"boehner just wants wife to listen, not come up...",1,84
4,j.k. rowling wishes snape happy birthday in th...,0,64
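The `length` column counts characters, whereas `maxlen` later limits the number of tokens per headline. A small sketch using the third headline from the data shows the difference. Splitting on whitespace is only an approximation of what the Keras tokenizer does, so the word count here is indicative rather than exact:

```python
headline = "mom starting to fear son's web series closest thing she will have to grandchild"

char_length = len(headline)          # what .str.len() measures
word_count = len(headline.split())   # whitespace-separated words

print(char_length, word_count)
```

The character length matches the value shown for row 2 in the table above, while the word count is well under the 25-token limit.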


### Initialize parameter values
- Set values for max_features, maxlen, and embedding_size
- max_features: number of most frequent words to keep from the tokenizer
- maxlen: maximum number of tokens per headline, limited to 25
- embedding_size: dimension of each embedding vector

In [7]:
max_features = 10000
maxlen = 25
embedding_size = 200

### Apply `tensorflow.keras` Tokenizer and get indices for words (5 Marks)
- Initialize Tokenizer object with number of words as 10000
- Fit the tokenizer object on headline column
- Convert the text to sequence


In [8]:
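from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

# Limit the vocabulary used by texts_to_sequences to the most frequent words
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(df['headline'])
sequences = tokenizer.texts_to_sequences(df['headline'])
# Pad or truncate every headline to maxlen tokens (default 'pre' padding)
x = sequence.pad_sequences(sequences, maxlen=maxlen)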


print("sequences : ",sequences,'\n')

print("word_index : ",tokenizer.word_index)

sequences :  [[307, 678, 3336, 2297, 47, 381, 2575, 5, 2576, 8433], [3, 8434, 3337, 2745, 21, 1, 165, 8435, 415, 3111, 5, 257, 8, 1001], [144, 837, 1, 906, 1748, 2092, 581, 4718, 220, 142, 38, 45, 1], [1484, 35, 223, 399, 1, 1831, 28, 318, 21, 9, 2923, 1392, 6968, 967], [766, 718, 4719, 907, 622, 593, 4, 3, 94, 1308, 91], [3, 364, 72], [3, 6969, 350, 5, 460, 4273, 2194, 1485], [18, 478, 38, 1167, 30, 154, 1, 98, 82, 17, 157, 5, 31, 351], [248, 3622, 6970, 554, 5273, 1994, 140], [2093, 325, 346, 400, 59, 5, 3, 3895], [2924, 1679, 4720, 13, 36, 4274, 6971, 4, 2094, 1102], [285, 781, 461, 7, 1555, 1910, 8, 3623], [233, 513, 2925, 12, 8, 928, 225, 368, 1, 4275, 8436], [237, 3896, 8437, 3338, 37, 234, 5, 6, 172], [1393, 664, 650, 4, 326, 2, 1030], [533, 2094, 122, 5, 4721, 1911], [2577, 1394, 382, 44, 3897, 347, 318, 1031, 1, 23, 19, 1103, 386, 102, 1309], [1680, 8438, 3112, 8439, 19, 6972, 1217], [856, 1, 1912, 257, 1168, 35, 213, 2746], [3624, 5274, 3113], [8440, 3898, 857, 37, 1486, 6973
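To make the tokenizer step concrete, here is a simplified pure-Python sketch of the idea: words are ranked by frequency, the most frequent word gets index 1, and words whose index is not below `num_words` are dropped when converting text to sequences. The helpers `fit_word_index` and `texts_to_sequences` are illustrative re-implementations, not the Keras API; the real `Tokenizer` also strips punctuation, which this sketch skips.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping any word whose index is >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = fit_word_index(texts)
toy_sequences = texts_to_sequences(texts, word_index, num_words=3)
print(word_index)
print(toy_sequences)
```

With `num_words=3` only indices 1 and 2 survive, so the rarer words disappear from the sequences even though they remain in `word_index`. The same thing happens in the notebook: `word_index` holds the full vocabulary, while the sequences only contain indices below 10000.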

### Pad sequences (5 Marks)
- Pad each example to a fixed maximum length (`maxlen`)
- Convert target column into numpy array

In [9]:
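import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Post-padded copy of the sequences (zeros appended after the last token).
# Note: the model below is trained on x, the pre-padded array from In [8].
pad_all = pad_sequences(sequences, padding="post", maxlen=maxlen)
sp = np.array(pad_all)
# Target column as a numpy array
y = np.array(df['is_sarcastic'])
df.head()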

Unnamed: 0,headline,is_sarcastic,length
0,former versace store clerk sues over secret 'b...,0,78
1,the 'roseanne' revival catches up to our thorn...,0,84
2,mom starting to fear son's web series closest ...,1,79
3,"boehner just wants wife to listen, not come up...",1,84
4,j.k. rowling wishes snape happy birthday in th...,0,64
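The notebook pads the sequences twice: once with the default settings and once with `padding="post"`. The difference can be sketched in plain Python. The `pad` helper below is an illustrative stand-in for `pad_sequences` on a single sequence; it mirrors the Keras defaults of `'pre'` for both padding and truncating.

```python
def pad(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one sequence to exactly maxlen items."""
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + list(seq) if padding == "pre" else list(seq) + fill

short = [3, 364, 72]
print(pad(short, 5))                    # zeros in front
print(pad(short, 5, padding="post"))    # zeros at the end
print(pad([1, 2, 3, 4, 5, 6], 4))       # keeps the last four items
```

Whichever variant is chosen, the same one should be used for training and for any later prediction input.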


### Vocab mapping
- Index 0 is not assigned to any word; it is reserved for padding

In [10]:
tokenizer.word_index

{'to': 1,
 'of': 2,
 'the': 3,
 'in': 4,
 'for': 5,
 'a': 6,
 'on': 7,
 'and': 8,
 'with': 9,
 'is': 10,
 'new': 11,
 'trump': 12,
 'man': 13,
 'from': 14,
 'at': 15,
 'about': 16,
 'you': 17,
 'this': 18,
 'by': 19,
 'after': 20,
 'up': 21,
 'out': 22,
 'be': 23,
 'how': 24,
 'as': 25,
 'it': 26,
 'that': 27,
 'not': 28,
 'are': 29,
 'your': 30,
 'his': 31,
 'what': 32,
 'he': 33,
 'all': 34,
 'just': 35,
 'who': 36,
 'has': 37,
 'will': 38,
 'more': 39,
 'one': 40,
 'into': 41,
 'report': 42,
 'year': 43,
 'why': 44,
 'have': 45,
 'area': 46,
 'over': 47,
 'donald': 48,
 'u': 49,
 'day': 50,
 'says': 51,
 's': 52,
 'can': 53,
 'first': 54,
 'woman': 55,
 'time': 56,
 'like': 57,
 'her': 58,
 "trump's": 59,
 'old': 60,
 'no': 61,
 'get': 62,
 'off': 63,
 'an': 64,
 'life': 65,
 'people': 66,
 'obama': 67,
 'now': 68,
 'house': 69,
 'still': 70,
 "'": 71,
 'women': 72,
 'make': 73,
 'was': 74,
 'than': 75,
 'white': 76,
 'back': 77,
 'my': 78,
 'i': 79,
 'clinton': 80,
 'down': 81,
 'i

### Set number of words
- Because index 0 is not assigned to any word, add 1 to the vocabulary size so that the largest word index is still a valid row in the embedding matrix

In [11]:
num_words = len(tokenizer.word_index) + 1
print(num_words)

29657


### Load Glove Word Embeddings (5 Marks)

In [12]:
embeddings_index = {}
# Each line holds a word followed by its 200 vector components, separated by spaces
with open(project_path+'glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]  # the first entry is the word
        coefs = np.asarray(values[1:], dtype='float32')  # the remaining entries are the embedding vector
        embeddings_index[word] = coefs

print('GloVe data loaded')

GloVe data loaded
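The loading loop relies on the GloVe text layout: one word per line, followed by its vector components separated by spaces. A minimal sketch of that parsing, using made-up 3-dimensional vectors in place of the real 200-dimensional file (the `load_vectors` helper and the sample values are assumptions for illustration):

```python
import io

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

# Made-up 3-dimensional vectors; the real file has 200 components per word
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
vectors = load_vectors(fake_glove)
print(vectors["cat"])
```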


### Create embedding matrix

In [13]:
# Reuse the GloVe vectors loaded into embeddings_index above instead of reading the file again

# Create a weight matrix for the words in the tokenizer vocabulary;
# row 0 (padding) and words without a GloVe vector stay all zeros
embedding_matrix = np.zeros((num_words, embedding_size))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
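The matrix construction above can be illustrated on a toy vocabulary. This is a dependency-free sketch with hypothetical words and 2-dimensional vectors; `build_matrix` is an illustrative helper, not part of the notebook. It also counts how many vocabulary words actually receive a pretrained vector, a quick check worth running before relying on the matrix.

```python
def build_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; missing words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # +1 for unused index 0
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

# Hypothetical toy vocabulary and vectors
word_index = {"the": 1, "cat": 2, "zyxq": 3}
vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}

matrix = build_matrix(word_index, vectors, dim=2)
covered = sum(1 for w in word_index if w in vectors)
print(matrix)
print(f"{covered}/{len(word_index)} words have a pretrained vector")
```

Row 0 and the row for the unknown word remain all zeros, which is the same behaviour as the `np.zeros` initialization in the notebook.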

### Define model (10 Marks)
- Hint: Use a Sequential model instance, then add an Embedding layer and a Bidirectional(LSTM) layer, flatten the output, and add dense and dropout layers as required. End with a final dense layer with sigmoid activation for binary classification.

In [14]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, Flatten, LSTM

model = Sequential()
# 100-dimensional embedding trained from scratch; the GloVe embedding_matrix built above is not passed in here
model.add(Embedding(num_words, 100, input_length=maxlen))
model.add(Bidirectional(LSTM(units=300, return_sequences=True, recurrent_dropout=0.1)))
model.add(Flatten())
# Caution: softmax over a single unit always outputs 1.0, so the final layer
# receives a constant input at inference time; an activation such as 'relu'
# (ideally with more units) lets this layer carry information
model.add(Dense(1, activation='softmax'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
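One detail of the architecture above is worth checking: the activation of the intermediate one-unit dense layer. Softmax normalizes across the units of a layer, so with a single unit its output is 1.0 for any input, whereas sigmoid varies with the input. A small pure-Python check (the `softmax` and `sigmoid` helpers are written out here for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5.0, 0.0, 5.0):
    print(z, softmax([z])[0], round(sigmoid(z), 4))
```

A layer whose output never changes cannot pass information to the layers after it, so a one-unit softmax layer in the middle of a network is usually a sign that a different activation or more units were intended.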



### Compile the model (5 Marks)

In [15]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

### Fit the model (5 Marks)

In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 25, 100)           2965700   
_________________________________________________________________
bidirectional (Bidirectional (None, 25, 600)           962400    
_________________________________________________________________
flatten (Flatten)            (None, 15000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 15001     
_________________________________________________________________
dropout (Dropout)            (None, 1)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 2         
Total params: 3,943,103
Trainable params: 3,943,103
Non-trainable params: 0
______________________________________________
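The parameter counts in the summary can be reproduced by hand from the layer sizes. The sketch below uses the values shown in this notebook (vocabulary of 29657, embedding dimension 100, 25 timesteps, 300 LSTM units); the variable names are chosen for this example only.

```python
vocab_size, embed_dim, timesteps, units = 29657, 100, 25, 300

embedding = vocab_size * embed_dim
# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_one_direction = 4 * (units * (embed_dim + units) + units)
bidirectional = 2 * lstm_one_direction
# return_sequences=True keeps all timesteps; both directions are concatenated
flatten_size = timesteps * 2 * units
dense = flatten_size * 1 + 1   # weights plus one bias
dense_1 = 1 * 1 + 1            # one weight plus one bias

total = embedding + bidirectional + dense + dense_1
print(embedding, bidirectional, flatten_size, dense, dense_1, total)
```

The embedding layer accounts for roughly three quarters of all trainable parameters, so the choice between training it from scratch and loading pretrained vectors has a large effect on model size.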

In [17]:
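from sklearn.model_selection import train_test_split

# Hold out 30% of the padded headlines for evaluation; random_state makes the split reproducible
X_train, x_test, Y_train, y_test = train_test_split(x, df.is_sarcastic, test_size=0.3, random_state=0)
# The held-out split doubles as validation data during training
history = model.fit(X_train, Y_train, batch_size=128, validation_data=(x_test, y_test), epochs=3)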


Epoch 1/3
Epoch 2/3
Epoch 3/3


In [19]:
print("Accuracy of the model on Testing Data is " , model.evaluate(x_test,y_test)[1]*100)

Accuracy of the model on Testing Data is  55.94658851623535
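To put an accuracy figure like the one above in context, it can help to compare it against the share of the most frequent class, which is what a model predicting the same class for every example would score. A minimal sketch, using made-up labels rather than the real test labels (the `majority_baseline_accuracy` helper is hypothetical):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Made-up labels for illustration; substitute the real test labels (e.g. y_test)
toy_labels = [0, 0, 0, 1, 1]
print(majority_baseline_accuracy(toy_labels))
```

If a trained model scores close to this baseline, it has learned little beyond the class distribution.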
