# Sarcasm Detection On Twitter Datasets

# Scoring Rubrics: 

1. Read and explore the data [ Score: 2 Points ]
2. Retain relevant columns [ Score: 2 Points ]
3. Get length of each sentence [ Score: 2 Points ]
4. Define parameters [ Score: 2 Points ]
5. Get indices for words [ Score: 5 Points ]
6. Create features and labels [ Score: 5 Points ]
7. Get vocabulary size [ Score: 2 Points ]
8. Create a weight matrix using GloVe embeddings [ Score: 2 Points ]
9. Define and compile a Bidirectional LSTM model. [ Score: 6 Points ]
	     Hint: Be analytical and experimental here in trying new approaches to design the best model.
10. Fit the model and check the validation accuracy. [ Score: 2 Points ]

## 1. Read and explore the data

In [107]:
import pandas as pd
df = pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)
df.head()

|   | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


In [108]:
df.shape

(26709, 3)

In [109]:
df['is_sarcastic'].value_counts()

0    14985
1    11724
Name: is_sarcastic, dtype: int64
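
The counts above show a mild class imbalance. As a minimal sketch of the arithmetic, using only the two counts printed above, the accuracy of always predicting the majority class gives a floor against which the final validation accuracy can be read. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 14985, 1: 11724}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(total)                        # 26709
print(round(shares[1], 3))          # 0.439
print(round(majority_baseline, 3))  # 0.561
```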

In [110]:
df.columns

Index(['article_link', 'headline', 'is_sarcastic'], dtype='object')

## 2. Retain relevant columns

In [111]:
# Drop the article_link column; .copy() avoids a SettingWithCopyWarning when columns are added later
df = df[['headline', 'is_sarcastic']].copy()

In [112]:
df.columns

Index(['headline', 'is_sarcastic'], dtype='object')

In [113]:
df.head()

|   | headline | is_sarcastic |
|---|---|---|
| 0 | former versace store clerk sues over secret 'b... | 0 |
| 1 | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | mom starting to fear son's web series closest ... | 1 |
| 3 | boehner just wants wife to listen, not come up... | 1 |
| 4 | j.k. rowling wishes snape happy birthday in th... | 0 |


## 3. Get length of each sentence

In [114]:
# Character count of each headline (not the number of words)
df['length'] = df['headline'].astype(str).map(len)

In [115]:
df

|   | headline | is_sarcastic | length |
|---|---|---|---|
| 0 | former versace store clerk sues over secret 'b... | 0 | 78 |
| 1 | the 'roseanne' revival catches up to our thorn... | 0 | 84 |
| 2 | mom starting to fear son's web series closest ... | 1 | 79 |
| 3 | boehner just wants wife to listen, not come up... | 1 | 84 |
| 4 | j.k. rowling wishes snape happy birthday in th... | 0 | 64 |
| ... | ... | ... | ... |
| 26704 | american politics in moral free-fall | 0 | 36 |
| 26705 | america's best 20 hikes | 0 | 23 |
| 26706 | reparations and obama | 0 | 21 |
| 26707 | israeli ban targeting boycott supporters raise... | 0 | 60 |
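
The `length` column counts characters, whereas the padding step later in the notebook works on word tokens. A small sketch on three headlines from the output above shows the difference. Splitting on whitespace is an assumption made here for illustration; it only approximates the Keras tokenizer, which also strips punctuation such as hyphens.

```python
# Three headlines taken from the tail of the DataFrame shown above
headlines = [
    "american politics in moral free-fall",
    "america's best 20 hikes",
    "reparations and obama",
]

char_lengths = [len(h) for h in headlines]         # what the 'length' column holds
word_counts = [len(h.split()) for h in headlines]  # whitespace-separated tokens

print(char_lengths)  # [36, 23, 21]
print(word_counts)   # [5, 4, 3]

# On the real data, a comparable pandas expression would be
# df['headline'].str.split().str.len()
```

The word counts, rather than the character counts, are the more relevant guide when choosing a maximum sequence length.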


## 4. Create Features and Labels

In [116]:
from sklearn.model_selection import train_test_split

In [117]:
X_train, X_test, y_train, y_test = train_test_split(df['headline'], df['is_sarcastic'], 
                                                    test_size=0.2, random_state=42)

In [118]:
X_train.shape, y_train.shape

((21367,), (21367,))
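
The split sizes can be checked by hand. This sketch assumes that scikit-learn rounds a fractional `test_size` up and gives the remainder to the training set, which is consistent with the shapes printed in this notebook.

```python
import math

n_samples = 26709   # rows in the DataFrame, from df.shape above
test_size = 0.2

# Test share is rounded up; training gets whatever is left
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 21367 5342
```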

In [119]:
import tensorflow as tf

## 5. Get Vocab size

In [120]:
desired_vocab_size = 10000  # Vocabulary size
t = tf.keras.preprocessing.text.Tokenizer(num_words=desired_vocab_size, oov_token='OOV')

In [121]:
#Fit tokenizer with actual training data
t.fit_on_texts(X_train.tolist())

## 6. Get indices of words

In [122]:
#Vocabulary
t.word_index

{'OOV': 1,
 'to': 2,
 'of': 3,
 'the': 4,
 'in': 5,
 'for': 6,
 'a': 7,
 'on': 8,
 'and': 9,
 'with': 10,
 'is': 11,
 'new': 12,
 'trump': 13,
 'man': 14,
 'from': 15,
 'at': 16,
 'about': 17,
 'you': 18,
 'this': 19,
 'by': 20,
 'after': 21,
 'up': 22,
 'out': 23,
 'be': 24,
 'how': 25,
 'it': 26,
 'as': 27,
 'that': 28,
 'not': 29,
 'your': 30,
 'his': 31,
 'are': 32,
 'what': 33,
 'he': 34,
 'all': 35,
 'has': 36,
 'just': 37,
 'who': 38,
 'more': 39,
 'will': 40,
 'one': 41,
 'into': 42,
 'report': 43,
 'area': 44,
 'have': 45,
 'why': 46,
 'donald': 47,
 'year': 48,
 'over': 49,
 'u': 50,
 'can': 51,
 's': 52,
 'day': 53,
 'says': 54,
 'woman': 55,
 'first': 56,
 'time': 57,
 "trump's": 58,
 'her': 59,
 'off': 60,
 'like': 61,
 'old': 62,
 'no': 63,
 'an': 64,
 'get': 65,
 'obama': 66,
 'now': 67,
 'people': 68,
 'life': 69,
 'make': 70,
 'was': 71,
 'than': 72,
 'still': 73,
 "'": 74,
 'house': 75,
 'if': 76,
 'back': 77,
 'white': 78,
 'i': 79,
 'women': 80,
 'clinton': 81,
 'my

In [123]:
X_train = t.texts_to_sequences(X_train.tolist())

In [124]:
X_train[1]

[640, 3993, 2, 3994, 68, 7233, 521, 4452, 76, 18, 1]
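
The trailing `1` in the sequence above is the index of the `OOV` token, used for words outside the retained vocabulary. The following is a simplified, dependency-free sketch of how frequency-ranked indexing with an OOV slot can work. The helper names `build_word_index` and `texts_to_sequences` are hypothetical stand-ins written for this illustration, and the sketch skips the punctuation filtering that the Keras `Tokenizer` performs.

```python
from collections import Counter

def build_word_index(texts, oov_token="OOV"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_token="OOV"):
    """Map words to indices; unseen or too-rare words become the OOV index."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            seq.append(idx if idx is not None and idx < num_words else oov)
        sequences.append(seq)
    return sequences

train = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(train)
print(word_index)
# {'OOV': 1, 'the': 2, 'cat': 3, 'sat': 4, 'ran': 5, 'dog': 6}
print(texts_to_sequences(["the dog sat on cat"], word_index, num_words=4))
# [[2, 1, 1, 1, 3]]
```

Note that the full word index keeps every word seen during fitting; the `num_words` cutoff is applied only when texts are converted to sequences.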

In [125]:
X_test = t.texts_to_sequences(X_test.tolist())

In [126]:
#Define maximum number of words to consider in each headline
max_headline_length = 300

In [127]:
# Pad training and test headlines
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train,
                                                        maxlen=max_headline_length,
                                                        padding='pre', truncating='post')
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, 
                                                       maxlen=max_headline_length, 
                                                       padding='pre', truncating='post')

In [128]:
X_train.shape, X_test.shape

((21367, 300), (5342, 300))
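
With `padding='pre'` and `truncating='post'`, short sequences receive zeros in front and long sequences lose tokens from the end. A plain-Python sketch of that behaviour for a single sequence follows; the function name `pad_pre_truncate_post` is made up for this illustration.

```python
def pad_pre_truncate_post(seq, maxlen, value=0):
    """Keep the first `maxlen` tokens, then left-pad with `value`."""
    kept = seq[:maxlen]                           # truncating='post' drops the tail
    return [value] * (maxlen - len(kept)) + kept  # padding='pre' adds zeros in front

print(pad_pre_truncate_post([640, 3993, 2], 5))      # [0, 0, 640, 3993, 2]
print(pad_pre_truncate_post([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```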

In [129]:
X_train[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [130]:
from gensim.scripts.glove2word2vec import glove2word2vec
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D

In [131]:
glove_input_file = 'glove.6B.50d.txt'

In [132]:
#Name for word2vec file
word2vec_output_file = 'glove.6B.50d.txt.word2vec'
#Convert Glove embeddings to Word2Vec embeddings
glove2word2vec(glove_input_file, word2vec_output_file)

(400000, 50)

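7. The GloVe file contains 400,000 word vectors of dimension 50, as the `(400000, 50)` output above shows. This is the size of the pretrained vocabulary; the model's own vocabulary is capped at `desired_vocab_size = 10000`.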

In [133]:
from gensim.models import Word2Vec, KeyedVectors
import numpy as np

In [134]:
# Load pretrained Glove model (in word2vec form)
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
#Embedding length based on selected model - we are using 50d here.
embedding_vector_length = 50

## 8. Create a weight matrix using GloVe embeddings

In [135]:
#Initialize embedding matrix
embedding_matrix = np.zeros((desired_vocab_size + 1, embedding_vector_length))
embedding_matrix.shape

(10001, 50)

In [136]:
for word, i in sorted(t.word_index.items(), key=lambda x: x[1]):
    # The matrix only has rows for indices 0..desired_vocab_size
    if i > desired_vocab_size:
        break
    try:
        # Copy the word's GloVe vector into its row
        embedding_matrix[i] = glove_model[word]
    except KeyError:
        # Word is missing from GloVe (e.g. the 'OOV' token): keep the zero row
        pass
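
Words that are missing from the pretrained vectors keep an all-zero row, so it can be useful to know how many there are. The sketch below mirrors the lookup loop with plain lists and a toy dictionary standing in for the GloVe model. The function name and the toy vectors are hypothetical and exist only for this illustration.

```python
def build_embedding_rows(word_index, vectors, vocab_size, dim):
    """Return (matrix, missing); matrix[i] is the vector for the word with index i."""
    matrix = [[0.0] * dim for _ in range(vocab_size + 1)]  # row 0 is the padding row
    missing = []
    for word, i in sorted(word_index.items(), key=lambda kv: kv[1]):
        if i > vocab_size:
            break
        if word in vectors:
            matrix[i] = list(vectors[word])
        else:
            missing.append(word)  # row stays all zeros
    return matrix, missing

# Toy stand-ins (made-up values, not real GloVe vectors)
toy_index = {"OOV": 1, "to": 2, "zzz": 3, "the": 4}
toy_vectors = {"to": [0.1, 0.2], "the": [0.3, 0.4]}

matrix, missing = build_embedding_rows(toy_index, toy_vectors, vocab_size=3, dim=2)
print(missing)    # ['OOV', 'zzz']
print(matrix[2])  # [0.1, 0.2]
```

Applied to the real tokenizer and GloVe model, the length of `missing` would indicate how much of the vocabulary falls back to zero vectors.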

In [137]:
# Word 'to' - index 2 (see t.word_index above)
embedding_matrix[2]

array([ 0.68046999, -0.039263  ,  0.30186   , -0.17792   ,  0.42962   ,
        0.032246  , -0.41376001,  0.13228001, -0.29846999, -0.085253  ,
        0.17117999,  0.22419   , -0.10046   , -0.43652999,  0.33418   ,
        0.67846   ,  0.057204  , -0.34448001, -0.42785001, -0.43274999,
        0.55962998,  0.10032   ,  0.18677001, -0.26853999,  0.037334  ,
       -2.09319997,  0.22171   , -0.39868   ,  0.20912001, -0.55725002,
        3.88260007,  0.47466001, -0.95657998, -0.37788001,  0.20869   ,
       -0.32752001,  0.12751   ,  0.088359  ,  0.16350999, -0.21634001,
       -0.094375  ,  0.018324  ,  0.21048   , -0.03088   , -0.19722   ,
        0.082279  , -0.09434   , -0.073297  , -0.064699  , -0.26043999])

## 9. Define and compile a Bidirectional LSTM model.

In [162]:
#Initialize model
tf.keras.backend.clear_session()
model_bd = tf.keras.Sequential()

In [163]:
model_bd.add(tf.keras.layers.Embedding(desired_vocab_size + 1, # Vocabulary size (plus the padding index 0)
                                    embedding_vector_length, # Embedding size
                                    weights=[embedding_matrix], # Embeddings taken from the pre-trained model
                                    trainable=False, # Embeddings are pre-trained, so this layer is frozen and acts as a lookup table
                                    input_length=max_headline_length))

In [164]:
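# return_sequences=True makes the LSTM emit one vector per timestep, so the Dense
# layers below are applied at each of the 300 positions, not once per headline.
model_bd.add(Bidirectional(LSTM(128, return_sequences = True)))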
# model.add(Dense(128, activation="relu"))
# model.add(Dropout(0.4))
# model.add(Dense(64, activation="relu"))
# model.add(Dropout(0.4))
model_bd.add(Dense(64, activation="relu"))
model_bd.add(Dropout(0.5))
model_bd.add(Dense(16, activation="relu"))
model_bd.add(Dropout(0.5))
model_bd.add(Dense(1, activation="sigmoid"))
model_bd.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [165]:
model_bd.output

<tf.Tensor 'dense_2/Sigmoid:0' shape=(None, 300, 1) dtype=float32>
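
The shape `(None, 300, 1)` means the network emits one probability per token position, while `y_train` holds a single label per headline. One common way to collapse the time axis is global max pooling, as provided by the `GlobalMaxPool1D` layer already imported above; setting `return_sequences=False` is another. Either would be a change to the notebook's model, not something it currently does. A dependency-free sketch of the pooling operation on a toy sequence:

```python
def global_max_pool_1d(sequence):
    """Collapse (timesteps, features) to (features,) by taking the max over time."""
    return [max(column) for column in zip(*sequence)]

# Toy sequence: 3 timesteps, 2 features per timestep (values are made up)
toy_sequence = [
    [0.1, 0.9],
    [0.7, 0.2],
    [0.3, 0.4],
]

pooled = global_max_pool_1d(toy_sequence)
print(pooled)  # [0.7, 0.9]
```

In the Keras model this would correspond to placing a pooling layer after the Bidirectional layer so that the final Dense layer yields one value per headline. The model would need to be retrained, and the accuracies reported at the end of the notebook would no longer apply.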

In [166]:
model_bd.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 50)           500050    
_________________________________________________________________
bidirectional (Bidirectional (None, 300, 256)          183296    
_________________________________________________________________
dense (Dense)                (None, 300, 64)           16448     
_________________________________________________________________
dropout (Dropout)            (None, 300, 64)           0         
_________________________________________________________________
dense_1 (Dense)              (None, 300, 16)           1040      
_________________________________________________________________
dropout_1 (Dropout)          (None, 300, 16)           0         
_________________________________________________________________
dense_2 (Dense)              (None, 300, 1)            1
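
The parameter counts in the summary can be reproduced from the layer sizes. The sketch below uses the standard formulas for dense and LSTM layers; the helper names are illustrative. The last row of the printed summary appears to be cut off, so the value for `dense_2` here comes from the formula and not from the output.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out       # weights + biases

def lstm_params(n_in, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (n_in + units) + units)

embedding = (10000 + 1) * 50          # rows x embedding size (frozen, not trained)
bilstm = 2 * lstm_params(50, 128)     # forward + backward LSTM
dense = dense_params(256, 64)         # 256 = 2 x 128 concatenated LSTM outputs
dense_1 = dense_params(64, 16)
dense_2 = dense_params(16, 1)

print(embedding, bilstm, dense, dense_1, dense_2)
# 500050 183296 16448 1040 17
```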

## 10. Fit the model and check the validation accuracy

In [169]:
results = model_bd.fit(X_train, y_train,
          epochs=10,
          batch_size=32,          
          validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [173]:
# Training Data
score, acc = model_bd.evaluate(X_train, y_train, batch_size=32)
print("Training loss is:", score)
print('Training accuracy is:', acc)

# Testing data
score, acc = model_bd.evaluate(X_test, y_test, batch_size=32)
print("Test loss is:", score)
print('Test accuracy is:', acc)

Training loss is: 0.39787793159484863
Training accuracy is: 0.8055902719497681
Test loss is: 0.44334205985069275
Test accuracy is: 0.7771157622337341
