# Sarcasm Detection
**Acknowledgement**

Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

## Install `TensorFlow 2.0`

In [1]:
#!pip uninstall tensorflow
#!pip install tensorflow==2.0.0

In [2]:
import tensorflow as tf
tf.__version__

'2.1.0'

## Get Required Files from Drive

In [3]:
#from google.colab import drive
#drive.mount('/content/drive/')

In [4]:
#Set your project path 
project_path =  "/home/balachandragv/Desktop/Data Science/GreatLearning/Projects/Project 12 - NLP Sarcasm/"

# Reading and Exploring Data

## Read Data "Sarcasm_Headlines_Dataset.json". Explore the data and get some insights about the data. ( 4 marks)
Hint - As it is in JSON format, you need to use the pandas.read_json function. Pass the parameter lines=True.

In [5]:
import pandas as pd
data = pd.read_json('Sarcasm_Headlines_Dataset.json', lines=True)
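For context, `lines=True` tells pandas that the file holds one JSON object per line (the JSON Lines layout) rather than a single JSON array. Below is a minimal sketch of that layout using only the standard library. The two records and the `parse_json_lines` helper are made up for illustration and are not part of the dataset or of pandas.

```python
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
raw = (
    '{"is_sarcastic": 1, "headline": "example headline one", '
    '"article_link": "https://example.com/a"}\n'
    '{"is_sarcastic": 0, "headline": "example headline two", '
    '"article_link": "https://example.com/b"}\n'
)


def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(raw)
print(len(records), sorted(records[0].keys()))
```

Each parsed dict corresponds to one row of the DataFrame, and each key to one column.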

In [6]:
data.head()

| | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


## Drop `article_link` from dataset. ( 2 marks)
We only need the headline text and the is_sarcastic column for this project, so we can drop the article_link column here.

In [7]:
data = data.drop('article_link', axis = 1)

In [8]:
data.head()

| | is_sarcastic | headline |
|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... |
| 1 | 0 | dem rep. totally nails why congress is falling... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes |
| 3 | 1 | inclement weather prevents liar from getting t... |
| 4 | 1 | mother comes pretty close to using word 'strea... |


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28619 entries, 0 to 28618
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   is_sarcastic  28619 non-null  int64 
 1   headline      28619 non-null  object
dtypes: int64(1), object(1)
memory usage: 447.3+ KB


In [10]:
#checking for missing values if any
data.isnull().sum()

is_sarcastic    0
headline        0
dtype: int64

In [11]:
data.nunique()

is_sarcastic        2
headline        28503
dtype: int64

## Get the Length of each line and find the maximum length. ( 4 marks)
Different lines have different lengths, so we need to pad our sequences to the maximum length.

In [12]:
def maxlen(headline):
    return max(headline.apply(lambda x: len(x.split(" "))))

def minlen(headline):
    return min(headline.apply(lambda x: len(x.split(" "))))

In [13]:
print("Minimum length of a headline is:", minlen(data['headline']))
print("Maximum length of a headline is:", maxlen(data['headline']))

Minimum length of a headline is: 2
Maximum length of a headline is: 151
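One caveat about word counts computed this way: `str.split(" ")` splits on every single space, so consecutive spaces produce empty strings that are counted as words, whereas `str.split()` with no argument collapses runs of whitespace. Below is a small sketch of the difference. The `token_count` helper and the sample string are illustrative assumptions, not taken from the dataset.

```python
def token_count(text, sep=None):
    """Count tokens after splitting on `sep` (None means any whitespace run)."""
    return len(text.split(sep))


# Hypothetical headline containing a double space between the first two words.
sample = "mother  comes pretty close"

by_single_space = token_count(sample, " ")  # the empty string between the two spaces is counted
by_whitespace = token_count(sample)         # runs of whitespace are collapsed

print(by_single_space, by_whitespace)
```

Whether this changes the reported minimum and maximum depends on whether any headline contains repeated spaces.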


# Modelling

## Import the modules required for modelling.

In [14]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation 
from tensorflow.keras.layers import Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model, Sequential

## Set Different Parameters for the model. ( 2 marks)

In [15]:
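max_features = 10000   # vocabulary cutoff passed to the tokenizer as num_words
maxlen = 151           # pad every headline to the longest one (rebinds the name of the maxlen() helper defined earlier)
embedding_size = 200   # dimension of the GloVe vectors used below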

## Apply Keras Tokenizer to the headline column of your data. ( 4 marks)
Hint - First create a tokenizer instance using Tokenizer(num_words=max_features).
Then fit this tokenizer instance on your data column data['headline'] using .fit_on_texts().

In [16]:
tokenizer = Tokenizer(num_words = max_features)
tokenizer.fit_on_texts(data['headline'])
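To make the role of `num_words` concrete: fitting assigns every word an integer index by descending frequency (starting at 1), and converting text to sequences later keeps only words whose index is below `num_words`. The sketch below imitates that idea in plain Python. It is a simplified illustration, not the Keras implementation (the real `Tokenizer` also filters punctuation, for instance), and `build_word_index` and `texts_to_sequences` here are hypothetical stand-ins.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


# Toy corpus, made up for illustration.
headlines = ["the cat sat", "the dog sat", "the bird flew"]
word_index = build_word_index(headlines)
sequences = texts_to_sequences(headlines, word_index, num_words=3)

print(word_index)
print(sequences)
```

With `num_words=3`, only the two most frequent words survive, which mirrors how `max_features = 10000` restricts the sequences to indices below 10000.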

## Define X and y for your model.

In [17]:
X = tokenizer.texts_to_sequences(data['headline'])
X = pad_sequences(X, maxlen = maxlen)
y = np.asarray(data['is_sarcastic'])

print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])

Number of Samples: 28619
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0  354 3166 7473 2643    2  660 1118]
Number of Labels:  28619
1
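The leading zeros in the printed sample come from `pad_sequences`, which by default pads (and truncates) at the start of each sequence. Below is a plain-Python sketch of that default behaviour. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation, and it assumes `maxlen` is at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value` up to `maxlen`; keep the last `maxlen` items if longer."""
    seq = list(seq)[-maxlen:]                      # truncate from the front
    return [value] * (maxlen - len(seq)) + seq     # pad at the front


short = pad_pre([354, 3166], maxlen=5)
long_ = pad_pre([1, 2, 3, 4], maxlen=3)

print(short)
print(long_)
```

Padding at the front keeps the actual words at the end of the sequence, closest to the final time step of a recurrent layer.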


## Get the Vocabulary size ( 2 marks)
Hint : You can use tokenizer.word_index.

In [18]:
vocab_size = len(tokenizer.word_counts)
vocab_size

30884

# Word Embedding

## Get GloVe Word Embeddings

In [19]:
glove_file = project_path + "glove.6B.zip"

In [20]:
glove_file

'/home/balachandragv/Desktop/Data Science/GreatLearning/Projects/Project 12 - NLP Sarcasm/glove.6B.zip'

In [21]:
#Extract Glove embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
  z.extractall()

## Get the word embeddings from the embedding file as given below.

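In [22]:
import numpy as np  # needed here if the cell is run in a fresh kernel

EMBEDDING_FILE = './glove.6B.200d.txt'

# Map each word to its 200-dimensional GloVe vector
embeddings = {}
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for o in f:
        # Each line is a word followed by its space-separated vector components
        word = o.split(" ")[0]
        embd = o.split(" ")[1:]
        embd = np.asarray(embd, dtype='float32')
        embeddings[word] = embd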

In [23]:
len(embeddings)

400001
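The GloVe text format stores one word per line, followed by its vector components separated by single spaces. Below is a self-contained sketch of parsing that layout, using an in-memory toy file with 3-dimensional vectors instead of the real 200-dimensional file. `load_embeddings` and the toy values are assumptions made for illustration.

```python
import io


def load_embeddings(lines):
    """Build a {word: vector} dict from lines of 'word v1 v2 ... vn'."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings


# Toy stand-in for the GloVe file (made-up numbers).
toy_file = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
toy_embeddings = load_embeddings(toy_file)

print(toy_embeddings["cat"])
```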

## Create a weight matrix for words in training docs

In [24]:
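num_words = vocab_size
# One row per word index; rows without a GloVe vector stay all-zero
embedding_matrix = np.zeros([num_words, 200])

for word, i in tokenizer.word_index.items():
    # Skip words whose index falls outside the tokenizer's num_words cutoff
    if i >= max_features: continue
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector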

len(embeddings.values())

400001
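The weight matrix is a lookup table: row `i` holds the pretrained vector of the word whose tokenizer index is `i`. Keras word indices start at 1, so row 0 stays all-zero and lines up with the padding value. Below is a small plain-Python sketch of the same construction on toy data. The helper name, the toy vocabulary and the 2-dimensional vectors are made up for illustration.

```python
def build_embedding_matrix(word_index, embeddings, num_rows, dim):
    """Return a num_rows x dim matrix; words without a vector keep a zero row."""
    matrix = [[0.0] * dim for _ in range(num_rows)]
    for word, i in word_index.items():
        if i >= num_rows:          # index outside the table, skip
            continue
        vector = embeddings.get(word)
        if vector is not None:     # word has a pretrained vector
            matrix[i] = list(vector)
    return matrix


# Toy data: "zzyzx" has no pretrained vector, so its row stays zero.
toy_word_index = {"the": 1, "cat": 2, "zzyzx": 3}
toy_embeddings = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
toy_matrix = build_embedding_matrix(toy_word_index, toy_embeddings, num_rows=4, dim=2)

print(toy_matrix)
```

Because indices start at 1, a table meant to cover every word in `word_index` would need `len(word_index) + 1` rows. Here the tokenizer's `num_words` cutoff keeps all emitted indices below 10000, so the smaller table is not exceeded.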

## Create and Compile your Model  ( 7 marks)
Hint - Use Sequential model instance and then add Embedding layer, Bidirectional(LSTM) layer, then dense and dropout layers as required. 
In the end add a final dense layer with sigmoid activation for binary classification.


In [25]:
model = Sequential()
model.add(Embedding(num_words, 200, weights = [embedding_matrix]))
model.add(Bidirectional(LSTM(128, return_sequences = True)))
model.add(GlobalMaxPool1D())
model.add(Dense(100, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 200)         6176800   
_________________________________________________________________
bidirectional (Bidirectional (None, None, 256)         336896    
_________________________________________________________________
global_max_pooling1d (Global (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 100)               25700     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                5050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0
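The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers, with the layer sizes taken from the model definition and the vocabulary size printed earlier. The helper functions themselves are written here for illustration.

```python
def embedding_params(rows, dim):
    """One dim-sized vector per row of the lookup table."""
    return rows * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units


embedding = embedding_params(30884, 200)      # vocabulary rows x embedding size
bidirectional = 2 * lstm_params(200, 128)     # forward and backward LSTM
dense = dense_params(256, 100)                # 256 = 2 x 128 pooled features
dense_1 = dense_params(100, 50)

print(embedding, bidirectional, dense, dense_1)
```

These values match the `Param #` column above: 6176800, 336896, 25700 and 5050.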

## Fit your model with a batch size of 100 and validation_split = 0.2, and state the validation accuracy. ( 5 marks)


In [26]:
batch_size = 100
epochs = 5

model.fit(X, y, batch_size = batch_size, epochs = epochs, validation_split = 0.2, verbose = 1)

Train on 22895 samples, validate on 5724 samples
Epoch 1/5
  100/22895 [..............................] - ETA: 11:11

UnknownError:  [_Derived_]  Fail to find the dnn implementation.
	 [[{{node CudnnRNN}}]]
	 [[sequential/bidirectional/backward_lstm/StatefulPartitionedCall]]
	 [[Reshape_16/_44]] [Op:__inference_distributed_function_5973]

Function call stack:
distributed_function -> distributed_function -> distributed_function
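The sample counts in the "Train on 22895 samples, validate on 5724 samples" line follow from how `validation_split` works: Keras holds out the last fraction of the data, before any shuffling, as the validation set. Below is a sketch of that arithmetic. `split_sizes` is a hypothetical helper, and the rounding shown is an assumption that happens to reproduce the counts above.

```python
def split_sizes(n_samples, validation_split):
    """Return (train_size, validation_size) when the tail of the data is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at


train_size, val_size = split_sizes(28619, 0.2)
print(train_size, val_size)
```

Since the held-out portion is simply the tail of the arrays, the validation score depends on how the rows are ordered in the file.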


The model reached a validation accuracy of 86.23%, which is a decent baseline. It could be tuned further, for example by adding more layers. Given the size of the dataset, 5 epochs may be too many: the loss starts increasing after 2 epochs, so limiting training to 2 epochs may give better results.

Alternatively, we could add an early-stopping callback. Since training runs for only 5 epochs here, that was not done.
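For reference, the idea behind early stopping can be sketched in a few lines: track the best validation loss seen so far and stop once it has failed to improve for `patience` consecutive epochs. In Keras this is available as `tf.keras.callbacks.EarlyStopping` (with arguments such as `monitor` and `patience`). The plain-Python function and the loss values below are made up for illustration and are not results from this notebook.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:       # no improvement for `patience` epochs
                return epoch, best_epoch
    return len(val_losses), best_epoch  # ran to completion


# Hypothetical validation losses that bottom out at epoch 2.
hypothetical_losses = [0.40, 0.33, 0.36, 0.41, 0.47]
stop_epoch, best_epoch = early_stopping_epoch(hypothetical_losses, patience=1)

print(stop_epoch, best_epoch)
```

With these made-up numbers, training would stop at epoch 3 and the best weights would be those from epoch 2.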