## Sarcastic Newspaper Headline Detector

### Introduction
**Sarcasm** is a figure of speech that is difficult to define precisely. It is often a statement or **comment that means the opposite of what it says.** It may be made with humorous intent, or it may be meant to hurt. At its core, it is hostility under the cover of friendliness.

Sarcasm detection is an interesting application of natural language processing (NLP) and deep learning. Sarcasm adds a level of complexity that can really test traditional language processing models: to recognize it, a model must understand not only the literal meaning of the words but also the subtle cues that turn a simple statement into a sarcastic remark. The aim of this project is to build a robust sarcasm detection model using deep learning techniques.<br>
The project involves several steps: data analysis, data cleaning, model building, testing, and predicting on user input. We start by analyzing the data and finish with a model that classifies headlines typed in by the user.

In [276]:
# import all the basic / required libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Remove Warnings
import warnings
warnings.filterwarnings("ignore")

In [277]:
# import tensorflow for model building
import tensorflow as tf

In [278]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [279]:
# Load the dataset (one JSON record per line)
df = pd.read_json("sarcasm.json", lines=True)

In [280]:
df.shape

(28619, 3)

####  Data Cleaning
Cleaning the data is crucial to ensure the effectiveness of the model. This step involves handling missing values, removing irrelevant information, and addressing any noise in the dataset. Additionally, preprocessing steps like tokenization, removing stop words, and stemming/lemmatization may be applied to convert the raw text into a format suitable for deep learning models.
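
The cleaning steps described above can be sketched on a single headline. This is a minimal illustration, not the notebook's actual pipeline: the function name `clean_headline` and the tiny `STOP_WORDS` set are assumptions made for the example, and stemming/lemmatization is left out.

```python
import string

# Hypothetical, deliberately tiny stop-word list used only for this example.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "from", "is"}

def clean_headline(text):
    """Lowercase, strip ASCII punctuation, and drop stop words from one headline."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and keep only words that are not stop words.
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = clean_headline("Inclement weather prevents liar from getting to work!")
print(tokens)
```

Whether removing stop words actually helps on this dataset is something to verify experimentally rather than assume.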

In [281]:
df['headline'][3]

'inclement weather prevents liar from getting to work'

In [282]:
df

| | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |
| ... | ... | ... | ... |
| 28614 | 1 | jews to celebrate rosh hashasha or something | https://www.theonion.com/jews-to-celebrate-ros... |
| 28615 | 1 | internal affairs investigator disappointed con... | https://local.theonion.com/internal-affairs-in... |
| 28616 | 0 | the most beautiful acceptance speech this week... | https://www.huffingtonpost.com/entry/andrew-ah... |
| 28617 | 1 | mars probe destroyed by orbiting spielberg-gat... | https://www.theonion.com/mars-probe-destroyed-... |


Since the **df['article_link']** column is not needed for classification, we drop it from the dataset.

In [283]:
df = df.drop(labels='article_link', axis=1)

In [284]:
df['is_sarcastic']

0        1
1        0
2        0
3        1
4        1
        ..
28614    1
28615    1
28616    0
28617    1
28618    1
Name: is_sarcastic, Length: 28619, dtype: int64

In [285]:
df['is_sarcastic'].value_counts()

is_sarcastic
0    14985
1    13634
Name: count, dtype: int64

In [329]:
df['is_sarcastic'].unique()

array([1, 0], dtype=int64)

In [330]:
df['headline']

0        thirtysomething scientists unveil doomsday clo...
1        dem rep. totally nails why congress is falling...
2        eat your veggies: 9 deliciously different recipes
3        inclement weather prevents liar from getting t...
4        mother comes pretty close to using word 'strea...
                               ...                        
28614         jews to celebrate rosh hashasha or something
28615    internal affairs investigator disappointed con...
28616    the most beautiful acceptance speech this week...
28617    mars probe destroyed by orbiting spielberg-gat...
28618                   dad clarifies this not a food stop
Name: headline, Length: 28619, dtype: object

The **'is_sarcastic'** column of the dataframe holds the labels (outputs) indicating whether the respective headline is sarcastic. The df['is_sarcastic'] column takes two values, 1 and 0, where **1** indicates that the headline **is sarcastic** and **0** indicates that the headline is **not sarcastic.**

Before tokenizing the original dataset, I tried the same process on a smaller sample to understand it more thoroughly.

I created a sample list with four sentences in it.<br>
I followed these steps to tokenize the data:
- First, tokenize the whole list using TensorFlow's Tokenizer class. It assigns an index to each unique word in the list.
- Next, create an index sequence for each sentence in the sample list.
- Finally, pad the obtained sequences so that every sequence has the same length.
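
To make the three steps concrete, here is a rough plain-Python sketch of what they do, without TensorFlow. The helper names (`build_word_index`, `to_sequences`, `pad_post`) and the two sample sentences are made up for this illustration. The sketch only splits on whitespace, whereas the Keras Tokenizer also filters punctuation by default.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent first (0 is reserved for padding)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # most_common() sorts by count, descending; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(sentences, word_index):
    """Replace every known word in each sentence by its index."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

def pad_post(sequences, maxlen):
    """Truncate or zero-pad each sequence on the right to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = seq[:maxlen]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

sample = ["the cat sat", "the cat ran away"]
word_index = build_word_index(sample)
sequences = to_sequences(sample, word_index)
padded = pad_post(sequences, maxlen=5)
print(word_index)
print(padded)
```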


In [331]:
# lets use natural language processing for string preprocessing

import nltk

In [332]:
# removing stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [333]:
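# restrict to English; stopwords.words() without an argument returns the lists of all languages
eng_stopwords = set(stopwords.words('english'))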

In [None]:
# Removing stopwords
#x_data = x.apply(lambda review : [ i       for i in review.split()       if i not in eng_stopwords ])

In [286]:
sent = ["I am learning pythyon", 'I am learning deep learning', 'I love dogs', 'I love cats']

In [287]:
type(sent)

list

In [288]:
sent

['I am learning pythyon',
 'I am learning deep learning',
 'I love dogs',
 'I love cats']

In [289]:
token = Tokenizer(num_words=10)

In [290]:
token.fit_on_texts(sent)

In [291]:
token.word_index

{'i': 1,
 'learning': 2,
 'am': 3,
 'love': 4,
 'pythyon': 5,
 'deep': 6,
 'dogs': 7,
 'cats': 8}

In [292]:
sent_sq = token.texts_to_sequences(sent)

In [293]:
print(sent_sq)

[[1, 3, 2, 5], [1, 3, 2, 6, 2], [1, 4, 7], [1, 4, 8]]


#### Model Building
The heart of the project lies in building a robust deep learning model for sarcasm detection. Common architectures for natural language processing tasks include recurrent neural networks (RNNs) and long short-term memory networks (LSTMs).

In [294]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [295]:
sent_final = pad_sequences(sent_sq, maxlen=10, truncating= 'post', padding='post')

In [296]:
sent_final

array([[1, 3, 2, 5, 0, 0, 0, 0, 0, 0],
       [1, 3, 2, 6, 2, 0, 0, 0, 0, 0],
       [1, 4, 7, 0, 0, 0, 0, 0, 0, 0],
       [1, 4, 8, 0, 0, 0, 0, 0, 0, 0]])

Now it is time to tokenize the original dataset.

In [297]:
# for convenience, convert the headline column into a list
hl = df['headline'].tolist()

In [298]:
type(hl)

list

In [299]:
#hl

In [300]:
labels = df['is_sarcastic'].to_list()

In [301]:
# show only the first ten labels; the full list has one entry per headline
labels[:10]

[1, 0, 0, 1, 1, 0, 0, 1, 1, 0]


#### Model Training
Train the model using the preprocessed dataset. Split the data into training and validation sets to evaluate the model's performance during training. Use an appropriate loss function and optimization algorithm, and monitor key metrics such as accuracy and loss to assess the model's performance.
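
For a binary task like this one, the loss being monitored is typically binary cross-entropy. The sketch below shows how that loss and accuracy are computed from predicted probabilities. It is a hand-rolled illustration under simple assumptions (a 0.5 decision threshold and made-up probabilities), not the framework's internal implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]  # made-up probabilities for illustration
loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(round(loss, 4), acc)
```

Confident correct predictions (0.9 for a sarcastic headline) contribute little to the loss, while predictions on the wrong side of 0.5 contribute much more.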

In [302]:
# create training and testing datasets in a 9:1 ratio

In [303]:
train_ind = df.shape[0]*90//100 
train_ind

25757

In [304]:
#training dataset
headlines_train = hl[ : train_ind]
labels_train = labels[ : train_ind]

In [305]:
#testing dataset
headlines_test = hl[train_ind : ]
labels_test = labels[train_ind : ]
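
The slices above take the first 90% of rows in file order. If the rows happened to be grouped by source or label, a shuffled split would give a more representative test set. A minimal sketch of that alternative follows; `shuffled_split`, its parameters, and the toy data are assumptions for illustration and are not used elsewhere in the notebook.

```python
import random

def shuffled_split(items, labels, train_percent=90, seed=42):
    """Shuffle items and labels together, then cut them at train_percent."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    cut = len(indices) * train_percent // 100  # same integer rule as train_ind
    train_idx, test_idx = indices[:cut], indices[cut:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(labels, train_idx),
            pick(items, test_idx), pick(labels, test_idx))

toy_headlines = [f"headline {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
x_train, y_train, x_test, y_test = shuffled_split(toy_headlines, toy_labels)
print(len(x_train), len(x_test))
```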

In [306]:
# create word index

In [307]:
token = Tokenizer(num_words=1000, oov_token='UNK')

In [308]:
token.fit_on_texts(headlines_train)

In [309]:
# show only the ten most frequent entries; the full index is much longer
dict(list(token.word_index.items())[:10])

{'UNK': 1,
 'to': 2,
 'of': 3,
 'the': 4,
 'in': 5,
 'for': 6,
 'a': 7,
 'on': 8,
 'and': 9,
 'with': 10}

In [310]:
train_seq = pad_sequences(token.texts_to_sequences(headlines_train), maxlen=50, padding='post', truncating='post')

In [311]:
train_seq

array([[  1, 340,   1, ...,   0,   0,   0],
       [  1,   1, 721, ...,   0,   0,   0],
       [914,  33,   1, ...,   0,   0,   0],
       ...,
       [  1,   1,   1, ...,   0,   0,   0],
       [212, 757, 464, ...,   0,   0,   0],
       [  1, 240,   1, ...,   0,   0,   0]])
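
The many 1s in these sequences are the 'UNK' index. With num_words=1000, only words whose index is below 1000 keep their own number, and every other word is mapped to the out-of-vocabulary token. The sketch below imitates that rule in plain Python. The function `encode` and the toy index are hypothetical and only mirror the shape of token.word_index.

```python
def encode(words, word_index, num_words, oov_index=1):
    """Keep indices below num_words; send everything else to the OOV index."""
    encoded = []
    for word in words:
        index = word_index.get(word)
        if index is None or index >= num_words:
            encoded.append(oov_index)  # unseen or too rare: out of vocabulary
        else:
            encoded.append(index)
    return encoded

# Toy index in the same shape as token.word_index (index 1 is the OOV token).
toy_index = {"UNK": 1, "to": 2, "of": 3, "the": 4, "doomsday": 5}
encoded = encode(["the", "doomsday", "clock", "of"], toy_index, num_words=5)
print(encoded)
```

Here "doomsday" is in the index but falls outside the num_words limit, and "clock" was never seen, so both become 1.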

In [312]:
test_seq = pad_sequences(token.texts_to_sequences(headlines_test), maxlen=50, padding='post', truncating='post')

In [313]:
test_seq

array([[ 32,   1, 390, ...,   0,   0,   0],
       [  1,   1,   1, ...,   0,   0,   0],
       [300, 623,   1, ...,   0,   0,   0],
       ...,
       [  4, 100, 640, ...,   0,   0,   0],
       [  1,   1,   1, ...,   0,   0,   0],
       [215,   1,  21, ...,   0,   0,   0]])

In [314]:
# convert labels into array 
train_labels = np.array(labels_train)
test_labels = np.array(labels_test)

In [315]:
# build the model

In [316]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, Flatten

In [317]:
model = Sequential()

In [318]:
# input layer
model.add(Embedding(1000, input_length=50, output_dim = 16))

#first hidden layer
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.25))

#second hidden layer
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.25))
model.add(Flatten())
#model.add(GlobalAveragePooling1D())  # possible alternative to Flatten for (batch, steps, features) input; needs its own import
#output layer
model.add(Dense(1, activation='sigmoid'))


In [319]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 50, 16)            16000     
                                                                 
 dense_15 (Dense)            (None, 50, 128)           2176      
                                                                 
 dropout_10 (Dropout)        (None, 50, 128)           0         
                                                                 
 dense_16 (Dense)            (None, 50, 64)            8256      
                                                                 
 dropout_11 (Dropout)        (None, 50, 64)            0         
                                                                 
 flatten_4 (Flatten)         (None, 3200)              0         
                                                                 
 dense_17 (Dense)            (None, 1)                
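
The parameter counts in the summary can be checked by hand: an embedding layer has vocabulary size times output dimension weights, and a dense layer has inputs times units weights plus one bias per unit. The sketch below redoes that arithmetic. The helper names are made up for this example, and the value for the final layer is computed from the same rule rather than read from the summary.

```python
def embedding_params(vocab_size, output_dim):
    """One output_dim-sized vector per vocabulary entry."""
    return vocab_size * output_dim

def dense_params(inputs, units):
    """A weight per (input, unit) pair plus one bias per unit."""
    return inputs * units + units

counts = {
    "embedding": embedding_params(1000, 16),  # Embedding(1000, output_dim=16)
    "dense_128": dense_params(16, 128),       # Dense(128) on 16-dim vectors
    "dense_64": dense_params(128, 64),        # Dense(64) on 128-dim vectors
    "flatten_width": 50 * 64,                 # 50 time steps x 64 features
}
# Dense(1) applied to the flattened vector.
counts["dense_output"] = dense_params(counts["flatten_width"], 1)
print(counts)
```

The Dense layers act on each of the 50 positions separately, which is why their parameter counts do not depend on the sequence length.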

In [320]:
# compile model 
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


#### Model Testing
After training, evaluate the model on a separate test set to ensure its generalization to unseen data. Analyze the confusion matrix and performance metrics to understand how well the model is distinguishing between sarcastic and non-sarcastic sentences.
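
A confusion matrix for this binary task can be computed directly from true and predicted labels. The sketch below is a minimal plain-Python version. The function name `confusion_counts` and the toy labels are assumptions for illustration and are not the notebook's actual results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with label 1 (sarcastic) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0]  # toy ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0]  # toy model predictions
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # share of predicted-sarcastic that really is
recall = tp / (tp + fn)      # share of sarcastic headlines that were found
print(tp, fp, fn, tn, precision, recall)
```

With the trained model, y_pred would come from rounding model.predict(test_seq) and y_true from test_labels.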

In [321]:
#train the model
model.fit(train_seq,train_labels, epochs=10, validation_data=(test_seq,test_labels))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x25789bd9f10>

#### Predicting User Inputs

In [322]:
test = ['Where are women judges in highcourts ?']
test = pad_sequences(token.texts_to_sequences(test), maxlen=50, padding='post', truncating='post')

In [323]:
model.predict(test).round()



array([[0.]], dtype=float32)

In [324]:
test2 = ['teacher strikes idle kids']
test2 = pad_sequences(token.texts_to_sequences(test2), maxlen=50, padding='post',truncating='post')

In [325]:
model.predict(test2).round()



array([[1.]], dtype=float32)

In [326]:
# predict returns an array of shape (1, 1); index the single value before converting to int
int(model.predict(test2).round()[0, 0])



1

###### Taking headline as an input from the user and predicting whether headline is sarcastic or not 
Now that the model is trained and tested, it's time to deploy it for real-world use. Create a simple user interface to take inputs from users. Tokenize and preprocess the user input, then feed it into the trained model for prediction. Provide users with a clear indication of whether the input is sarcastic or not.

In [328]:
while True:
    str1 = input("Enter headline for prediction (Type 'stop' to break the loop): ")
    # Check if the user wants to stop
    if str1.strip().lower() == 'stop':
        break

    # Tokenize and pad the single headline exactly like the training data
    head1 = pad_sequences(token.texts_to_sequences([str1]), maxlen=50, padding='post', truncating='post')
    # predict returns an array of shape (1, 1); round the probability to 0 or 1
    temp = int(model.predict(head1).round()[0, 0])

    print("****************************************")
    if temp == 1:
        print("Provided headline is Sarcastic")
    else:
        print("Provided headline is NOT Sarcastic")
    print("****************************************")

Enter headline for prediction (Type 'stop' to break the loop): War Dims Hope for Peace
****************************************
Provided headline is Sarcastic
****************************************
Enter headline for prediction (Type 'stop' to break the loop): Mahua Moitra: Expelled from Parliament over cash-for-query row
****************************************
Provided headline is NOT Sarcastic
****************************************
Enter headline for prediction (Type 'stop' to break the loop): stop


Sentences to test : <br>
War Dims Hope for Peace <br>
Mahua Moitra: Expelled from Parliament over cash-for-query row<br>
Cold Wave Linked to Temperatures <br>   