<a href="https://colab.research.google.com/github/divyadass/sarcasm_detection/blob/develop/NLP_Project_Sarcasm_Detection_Questions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

1. This Jupyter notebook provides a basic walkthrough of text classification on headline data using Keras.

2. To see the deployment of the model trained here, switch to the deployment branch of the repository.

# Sarcasm Detection
 **Acknowledgement**

Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

## Install TensorFlow 2.0

In [None]:
!pip uninstall -y tensorflow
!pip install tensorflow==2.0.0

Collecting tensorflow==2.0.0
  Downloading tensorflow-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl (86.3 MB)
Collecting keras-applications>=1.0.8
  Downloading Keras_Applications-1.0.8-py3-none-any.whl (50 kB)
Collecting tensorboard<2.1.0,>=2.0.0
  Downloading tensorboard-2.0.2-py3-none-any.whl (3.8 MB)
Collecting tensorflow-estimator<2.1.0,>=2.0.0
  Downloading tensorflow_estimator-2.0.1-py2.py3-none-any.whl (449 kB)
Collecting gast==0.2.2
  Downloading gast-0.2.2.tar.gz (10 kB)
Building wheels for collected packages: gast
  Building wheel for gast (setup.py) ... done
  Created wheel for gast: filename=gast-0.2.2-py3-none-any.whl size=7554 sha256=1526ca29014e3e064a9acc1bd030782ed1e8a1b5b3a2411dc79bc388ac86d0f1
  Stored in direc

## Get Required Files from Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [None]:
project_path = '/content/'
base_path = 'drive/My Drive/Sarcasm Detection/Data/'


In [None]:
pwd

'/content'

In [None]:
ls 'drive/My Drive/Sarcasm Detection/Data/'

glove.6B.100d.txt  glove.6B.zip  Sarcasm_Headlines_Dataset.json


# Reading and Exploring Data

## Read `Sarcasm_Headlines_Dataset.json` and explore it to get some insight into the data

In [None]:
import pandas as pd
import numpy as np
import gc

In [None]:
data_s = pd.read_json(base_path+'Sarcasm_Headlines_Dataset.json', lines= True)

In [None]:
data_s.shape

(26709, 3)

In [None]:
data_s.head()

| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


In [None]:
## Drop fully duplicated rows
data_s.drop_duplicates(inplace=True)
print('dataframe shape: ', data_s.shape)

## Remove rows with duplicate headlines, since repeated headlines give the model no useful information
data_s.drop_duplicates(subset='headline', keep='last', inplace=True)
print('dataframe shape: ', data_s.shape)

dataframe shape:  (26708, 3)
dataframe shape:  (26602, 3)
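
To make the effect of `keep='last'` concrete, here is a minimal pure-Python sketch of the same rule on a few made-up rows. The headlines and the helper `drop_duplicate_headlines` are hypothetical, not part of the dataset or of pandas. When a headline repeats, only its last occurrence survives, and the surviving rows keep their original order.

```python
def drop_duplicate_headlines(rows):
    """Keep only the last row for each headline, preserving row order."""
    seen = set()
    kept_reversed = []
    # Walk backwards so the first time we meet a headline is its last occurrence
    for row in reversed(rows):
        if row["headline"] not in seen:
            seen.add(row["headline"])
            kept_reversed.append(row)
    return kept_reversed[::-1]


# Made-up rows for illustration only
rows = [
    {"headline": "local man wins award", "is_sarcastic": 1},
    {"headline": "markets close higher", "is_sarcastic": 0},
    {"headline": "local man wins award", "is_sarcastic": 0},
]
deduped = drop_duplicate_headlines(rows)
print(len(rows), "->", len(deduped))
```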


In [None]:
print('Dataset size after cleaning: ',data_s.shape)
print('Number of unique articles link: '+str(data_s['article_link'].unique().shape[0]))
print('Number of unique headline: '+ str(data_s['headline'].unique().shape[0]))

Dataset size after cleaning:  (26602, 3)
Number of unique articles link: 26602
Number of unique headline: 26602


In [None]:
print('Sarcastic Comments: ', data_s['is_sarcastic'].value_counts()[1])
print('Non-Sarcastic Comments: ', data_s['is_sarcastic'].value_counts()[0])
print()
print('Sarcastic Comments percentage: ', data_s['is_sarcastic'].value_counts()[1]/data_s.shape[0]*100)
print('Non-Sarcastic Comments percentage: ', data_s['is_sarcastic'].value_counts()[0]/data_s.shape[0]*100)

Sarcastic Comments:  11651
Non-Sarcastic Comments:  14951

Sarcastic Comments percentage:  43.79745883768138
Non-Sarcastic Comments percentage:  56.20254116231862
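
These counts also give a useful reference point for the model trained later: a classifier that always predicts the majority class would already reach the accuracy computed below. A small sketch using only the counts printed above (the variable names are illustrative):

```python
# Class counts taken from the output above
sarcastic = 11651
non_sarcastic = 14951
total = sarcastic + non_sarcastic

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(sarcastic, non_sarcastic) / total
print(f"Total headlines: {total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```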


## Dropping `article_link` from dataset
We only need the headline text and the `is_sarcastic` column for this project, so we can drop the `article_link` column here.

In [None]:
data_s.drop(['article_link'], axis= 1, inplace=True)

In [None]:
data_s.head(2)

| | headline | is_sarcastic |
|---|---|---|
| 0 | former versace store clerk sues over secret 'b... | 0 |
| 1 | the 'roseanne' revival catches up to our thorn... | 0 |




## Get the length of each headline and find the maximum length
Headlines differ in length, so the sequences must later be padded (or truncated) to a common length. The length statistics computed here help choose that value.

In [None]:
data_s['headline_length'] = data_s['headline'].str.split().str.len()

In [None]:
data_s.head(2)

| | headline | is_sarcastic | headline_length |
|---|---|---|---|
| 0 | former versace store clerk sues over secret 'b... | 0 | 12 |
| 1 | the 'roseanne' revival catches up to our thorn... | 0 | 14 |


In [None]:
data_s['headline_length'].describe()

count    26602.000000
mean         9.856214
std          3.165456
min          2.000000
25%          8.000000
50%         10.000000
75%         12.000000
max         39.000000
Name: headline_length, dtype: float64

In [None]:
print('Maximum length for a headline: ',data_s['headline_length'].max())

Maximum length for a headline:  39
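
The statistics above show that three quarters of the headlines have at most 12 words while the maximum is 39, so it can help to check how many headlines a shorter cutoff would cover in full. A minimal sketch on made-up headlines (both the sample list and the helper `coverage` are hypothetical):

```python
def coverage(lengths, cutoff):
    """Fraction of headlines whose word count is at most `cutoff`."""
    return sum(length <= cutoff for length in lengths) / len(lengths)


# Made-up headlines for illustration only
headlines = [
    "scientists discover new species of frog",
    "area man unsure what day it is",
    "new study finds that people who read long headlines are more likely to finish them",
    "markets close higher",
]
lengths = [len(h.split()) for h in headlines]
print(lengths)
print(coverage(lengths, cutoff=10))
```

Running the same check on `data_s['headline_length']` would show how much text a given `maxlen` discards.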


# Modelling

## Import the modules required for modelling

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Flatten, Bidirectional
from tensorflow.keras.models import Model, Sequential

## Set the parameters for the model

In [None]:
max_features = 10000  ## no of unique words in the vocabulary
maxlen = 15 ## no of words to use from each headline
embedding_size = 100 ## length of word embedding

## Apply the Keras Tokenizer to the headline column of the data
- Create a tokenizer instance using `Tokenizer(num_words=max_features + 1)`; the extra slot accounts for index 0, which the tokenizer never assigns to a word
- Fit this tokenizer instance on the data column `data_s['headline']` using `fit_on_texts()`

In [None]:
tokenizer = Tokenizer(
    num_words=max_features + 1,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=" ",
    oov_token='<OOV>')  # out-of-vocabulary words map to this token (index 1)

In [None]:
tokenizer.fit_on_texts(data_s['headline'])

In [None]:
data_s['tokenized_headline'] = tokenizer.texts_to_sequences(data_s['headline'])
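
As a rough mental model of what the tokenizer does, here is a simplified pure-Python sketch. It is not the Keras implementation: it skips punctuation filtering, and the helper names and the `<OOV>` token are illustrative. Words are ranked by frequency, index 1 is reserved for the out-of-vocabulary token, and any word whose index is `num_words` or higher is replaced by that token.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words map to the OOV index."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None or index >= num_words:
            index = oov_index
        sequence.append(index)
    return sequence


# Made-up texts for illustration only
texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequence("the dog sat quietly", word_index, num_words=4))
```

This is why index 1 appears repeatedly in the tokenized headlines: it stands for every word outside the retained vocabulary.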

In [None]:
data_s.head()

| | headline | is_sarcastic | headline_length | tokenized_headline |
|---|---|---|---|---|
| 0 | former versace store clerk sues over secret 'b... | 0 | 12 | [308, 1, 677, 3611, 2293, 48, 382, 2566, 1, 6,... |
| 1 | the 'roseanne' revival catches up to our thorn... | 0 | 14 | [4, 8414, 3332, 2741, 22, 2, 165, 8415, 414, 3... |
| 2 | mom starting to fear son's web series closest ... | 1 | 14 | [145, 837, 2, 903, 1740, 2086, 581, 4712, 220,... |
| 3 | boehner just wants wife to listen, not come up... | 1 | 13 | [1478, 36, 222, 403, 2, 1823, 29, 319, 22, 10,... |
| 4 | j.k. rowling wishes snape happy birthday in th... | 0 | 11 | [764, 717, 4713, 904, 1, 621, 592, 5, 4, 95, 1... |


## Define X and y for the model

In [None]:
X = data_s['tokenized_headline']
X = pad_sequences(X, maxlen = maxlen, value=0.0)
y = np.asarray(data_s['is_sarcastic'])

print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])

Number of Samples: 26602
[   0    0    0  308    1  677 3611 2293   48  382 2566    1    6 2567
 8413]
Number of Labels:  26602
0
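
`pad_sequences` is called here with its defaults, which pad and truncate at the start of each sequence (`padding='pre'`, `truncating='pre'`). That is why the first sample above, which has 12 tokens, begins with three zeros. A minimal pure-Python sketch of that behavior (the helper `pad_pre` is illustrative, not a Keras function):

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items, then left-pad with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated


# A short sequence is padded at the front
print(pad_pre([308, 1, 677], maxlen=5))
# A long sequence loses its first items
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```

With `maxlen = 15`, any headline longer than 15 tokens therefore loses its opening words rather than its ending.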


## Vocabulary size

In [None]:
print('Number of words originally present:', len(tokenizer.word_index) + 1)

print('Number of words in our vocabulary:', max_features, '(limited by the num_words argument set when defining the tokenizer object)')
num_words = max_features

Number of words originally present: 29658
Number of words in our vocabulary: 10000 (limited by the num_words argument set when defining the tokenizer object)


# Word Embedding


## Get GloVe Word Embeddings

In [None]:
ls 'drive/My Drive/Sarcasm Detection/Data/'

glove.6B.100d.txt  glove.6B.zip  Sarcasm_Headlines_Dataset.json


In [None]:
glove_file = base_path + "glove.6B.zip"

In [None]:
# Extract the GloVe embedding zip file into the data folder (needed only once)

from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
    z.extractall(base_path)

## Get the word embeddings from the embedding file

In [None]:
EMBEDDING_FILE = base_path + 'glove.6B.100d.txt'

embeddings = {}
# Each line holds a word followed by its vector components, separated by spaces
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for line in f:
        values = line.split(" ")
        word = values[0]
        embeddings[word] = np.asarray(values[1:], dtype='float32')

In [None]:
len(embeddings)

400000
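
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. A tiny self-contained sketch of the same parsing logic, run on made-up three-dimensional vectors held in memory instead of the real file (names such as `toy_embeddings` are illustrative):

```python
import io

# Two made-up lines in the GloVe text layout (the real vectors here have 100 values)
sample = io.StringIO("cat 0.1 -0.2 0.3\ndog 0.4 0.5 -0.6\n")

toy_embeddings = {}
for line in sample:
    # First field is the word, the rest are the vector components
    word, *values = line.split()
    toy_embeddings[word] = [float(v) for v in values]

print(toy_embeddings["dog"])
```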

In [None]:
## view the embedding for the word 'attention'
embeddings['attention']

array([-3.3414e-01,  4.6667e-01,  5.3744e-01,  5.7743e-02,  2.9642e-01,
        2.5224e-01, -6.5586e-01, -4.1668e-01,  2.1959e-01, -4.9413e-01,
       -2.1816e-01, -9.0227e-02, -3.5179e-02, -2.7279e-01, -1.2343e-01,
        1.6808e-01, -5.0623e-01, -4.0497e-01, -1.6763e-01,  4.9066e-01,
       -8.8020e-02, -1.2339e-01, -3.8436e-01, -2.7766e-01, -1.3403e-01,
        1.4342e-01, -2.9177e-01, -2.1146e-02,  5.2180e-01, -2.1213e-01,
        3.0860e-02,  1.0402e-01, -1.6807e-01,  4.6170e-01, -5.4806e-01,
       -6.6849e-02, -3.3180e-01,  3.7257e-01, -7.4962e-01,  6.2741e-01,
       -4.9500e-01, -4.0996e-01, -1.4686e-01, -2.7166e-01, -7.7093e-02,
       -2.8342e-01,  6.3663e-02, -1.5734e-01,  6.9649e-01, -9.6694e-01,
        4.4510e-01, -2.4521e-01, -4.8447e-01,  1.1957e+00,  2.9929e-02,
       -2.0425e+00, -2.8603e-01, -3.9043e-01,  1.2197e+00, -4.7760e-01,
       -2.1191e-02,  9.3080e-01, -1.8173e-01, -7.5721e-02,  1.1242e+00,
       -8.2276e-02,  5.7149e-02, -2.3585e-01,  3.5901e-01,  6.92

## Create a weight matrix for the words in the training data

In [None]:
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a dict"
    return dict(islice(iterable, n))

In [None]:
## get all the words in our vocabulary 
# tokenizer.word_index.items()   ## commented since the count is 10000

In [None]:
num_words

10000

In [None]:
vocab = take(num_words, tokenizer.word_index.items())

In [None]:
len(vocab)

10000

In [None]:
# Row 0 is left for the padding index, hence num_words + 1 rows
embedding_matrix = np.zeros((num_words + 1, embedding_size))

for word, i in vocab.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

len(embeddings.values())

400000

In [None]:
embedding_matrix.shape

(10001, 100)
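
Words that have no GloVe vector keep an all-zero row in the matrix, so it is worth knowing how many there are. The sketch below repeats the lookup loop on a toy vocabulary using plain lists and records the misses (all names and values are made up for illustration):

```python
embedding_size = 3  # toy dimension; the notebook uses 100

# Made-up pretrained vectors and vocabulary for illustration only
pretrained = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
vocab = {"<OOV>": 1, "cat": 2, "zyzzyx": 3, "dog": 4}

# Row 0 is reserved for padding, so the matrix needs len(vocab) + 1 rows
matrix = [[0.0] * embedding_size for _ in range(len(vocab) + 1)]
missing = []
for word, index in vocab.items():
    vector = pretrained.get(word)
    if vector is None:
        missing.append(word)  # row stays all zeros
    else:
        matrix[index] = list(vector)

print(f"{len(vocab) - len(missing)} of {len(vocab)} words have a pretrained vector")
print("Missing:", missing)
```

The same counting could be added to the loop over `vocab.items()` above to see how much of the 10000-word vocabulary GloVe actually covers.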

## Create and Compile the Model
Use a `Sequential` model instance and add an `Embedding` layer and a `Bidirectional(LSTM)` layer, then flatten, dropout and dense layers as required.
End with a final dense layer with sigmoid activation for binary classification.


In [None]:
### Embedding layer for hint 
## model.add(Embedding(num_words, embedding_size, weights = [embedding_matrix]))
### Bidirectional LSTM layer for hint 
## model.add(Bidirectional(LSTM(128, return_sequences = True)))

# Define the Keras model
model = Sequential()
model.add(Embedding(num_words + 1, embedding_size, weights=[embedding_matrix], input_length=maxlen, trainable=False))
model.add(Dropout(0.50))

model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Dropout(0.50))

model.add(Flatten())
model.add(Dropout(0.50))

model.add(Dense(1, activation='sigmoid'))

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 15, 100)           1000100   
_________________________________________________________________
dropout (Dropout)            (None, 15, 100)           0         
_________________________________________________________________
bidirectional (Bidirectional (None, 15, 256)           234496    
_________________________________________________________________
dropout_1 (Dropout)          (None, 15, 256)           0         
_________________________________________________________________
flatten (Flatten)            (None, 3840)              0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 3840)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 3
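
The parameter counts in the summary can be checked by hand from the layer sizes. A short sketch of the arithmetic, using the standard formulas for embedding, LSTM and dense layers (variable names are illustrative):

```python
vocab_rows = 10001   # num_words + 1
embedding_dim = 100
lstm_units = 128
maxlen = 15

embedding_params = vocab_rows * embedding_dim

# Each LSTM direction has 4 gates, each with input weights,
# recurrent weights and a bias
lstm_params_one_direction = 4 * (embedding_dim + lstm_units + 1) * lstm_units
bidirectional_params = 2 * lstm_params_one_direction

# Flatten turns (maxlen, 2 * lstm_units) into one vector per sample
flatten_size = maxlen * 2 * lstm_units
dense_params = flatten_size * 1 + 1  # weights plus one bias

print(embedding_params, bidirectional_params, flatten_size, dense_params)
```

Because the embedding layer is created with `trainable=False`, its parameters are counted in the total but are not updated during training.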

## Fit the model with a batch size of 100 and `validation_split = 0.2`, and state the validation accuracy


In [None]:
##Model Configuration
batch_size = 100
number_of_epochs = 5

loss_function = 'binary_crossentropy'
optimizer = 'adam'
additional_metrics = ['accuracy']
verbosity_mode = 1  # 1 shows a progress bar for each epoch
validation_split = 0.20

In [None]:
# Compile the model
model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics)

# Train the model
history = model.fit(X, y, epochs=number_of_epochs,
                    batch_size=batch_size, 
                    verbose=verbosity_mode, 
                    validation_split=validation_split)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
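
With `validation_split`, Keras holds out the last fraction of the samples, taken before any shuffling, and reports the validation metrics on them. Assuming the split index is computed as `int(n * (1 - validation_split))`, the sizes for this run work out as follows (a sketch of the arithmetic only; variable names are illustrative):

```python
import math

n_samples = 26602
validation_split = 0.20
batch_size = 100

# Assumed split rule: the first int(n * (1 - split)) rows are used for training
# and the remaining rows, taken from the end of the data, for validation
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```

Since the held-out rows are simply the tail of `X` and `y`, shuffling the data before calling `fit` is a reasonable precaution if the file order is not random.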


In [None]:
print('Validation accuracy at the end of the 5th epoch is:', history.history['val_accuracy'][-1])

Validation accuracy at the end of the 5th epoch is: 0.8464574217796326
