---
# Sarcasm Detection in News Headlines:
## A Binary Classification Problem
---

# 1. Introduction

---

## 1.1 Statement of the Problem

Sarcasm detection is a challenging problem in the field of sentiment analysis, in part because written text lacks intonation and facial expression. Sarcastic sentences can appear in a diverse range of topics, and they can take disparate grammatical structures. Detecting sarcasm also usually requires prior knowledge about the subject, which is not always available: many sentences are not sarcastic by themselves, but they become so in a particular context.

The aim of this project is to build a deep learning neural network model capable of detecting sarcasm in the headlines provided in the **"News Headlines Dataset For Sarcasm Detection"** Kaggle <a href="https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection">dataset</a>.


## 1.2 About the Dataset

"Past studies in Sarcasm Detection mostly make use of Twitter datasets collected using hashtag based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets and detecting sarcasm in these requires the availability of contextual tweets." <a href="https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/home"> [Kaggle, Dataset Overview] </a>  

This dataset was created by collecting sarcastic headlines from TheOnion, which aims at producing sarcastic versions of current events, and non-sarcastic headlines from HuffPost, which gives it several advantages over other datasets. The news headlines are self-contained and written in a formal manner without spelling mistakes, which increases the chance of finding pre-trained embeddings.


### Content

Each record consists of three attributes:

- `is_sarcastic`: 1 if the record is sarcastic, otherwise 0;
- `headline`: the headline of the news article;
- `article_link`: link to the original news article.



# 2. Methods
---

# 2.1 Data Preprocessing

## 2.1.1 Reading the Data

The data is stored in a .json file; the following code stores the file content in the list "data". All the Headlines are then separately stored in the "headlines" list, which will represent our training and test data (of course after some data preprocessing).


In [1]:
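import json

def parseJson(fname):
    # each line of the file holds one JSON record; json.loads avoids
    # executing arbitrary code, which eval() would do
    with open(fname, 'r') as f:
        for line in f:
            yield json.loads(line)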

        
data = list(parseJson('./Sarcasm_Headlines_Dataset.json'))
data_size = len(data) # 26709

headlines = []

# store all the Headlines in the Headlines list
for i in range(data_size):
    headlines.append(data[i].get('headline'))


print('Headline print example: ', headlines[4])

Headline print example:  j.k. rowling wishes snape happy birthday in the most magical way


## 2.1.2 Text Preprocessing

### Fitting the Tokenizer on the documents
- Keras provides utility classes for preprocessing text.  
- Here the `Tokenizer` class preprocesses the text corpus by turning each headline into a sequence of integers (indices, each corresponding to a specific word).  
- The argument `num_words=10000` means the tokenizer will keep only the most frequent words, those whose index is below 10'000.

In [2]:
from keras_preprocessing import text


# create an instance of the Keras Tokenizer class
t = text.Tokenizer(num_words=10000)

# fit the tokenizer on the data
t.fit_on_texts(headlines)

#print(t.word_index) # prints a dictionary of words and their corresponding index
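
To make the tokenization step concrete, here is a minimal pure-Python sketch of frequency-ranked word indexing. It is a simplified stand-in and not the Keras implementation: the helper names `build_word_index` and `texts_to_sequences` and the three sample sentences are invented for illustration, and the sketch splits on whitespace only, whereas the Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index, dropping indices >= num_words."""
    sequences = []
    for text in texts:
        sequence = []
        for word in text.lower().split():
            index = word_index.get(word)
            if index is not None and (num_words is None or index < num_words):
                sequence.append(index)
        sequences.append(sequence)
    return sequences

# Invented sample sentences, not taken from the dataset
sample = ["the cat sat", "the dog sat down", "the end"]
word_index = build_word_index(sample)
sequences = texts_to_sequences(sample, word_index, num_words=3)
print(word_index)
print(sequences)
```

With `num_words=3`, only the words ranked 1 and 2 survive, which mirrors why no index in the real data reaches 10'000.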

### Splitting the Dataset in Train and Test Data

- The headlines are converted to integer sequences and assigned to the training and test sets.
- The training sizes I experimented with were 50%, 75% and 85% of the total data.  


- The variables train_data and test_data are lists of headlines;
- The variables train_labels and test_labels are lists of 0s and 1s, where 0 stands for non-sarcastic and 1 for sarcastic

In [3]:
#train_size = int(data_size*0.5) # 13354
#train_size = int(data_size*0.75) # 20031
train_size = int(data_size*0.85) # 22702

train_data = []

for x in t.texts_to_sequences_generator(headlines[:train_size]):
    train_data.append(x)

t.fit_on_sequences(train_data)

In [4]:
test_data = []

for x in t.texts_to_sequences_generator(headlines[train_size:]):
    test_data.append(x)
    
t.fit_on_sequences(test_data)

In [5]:
train_labels = []

for i in range(train_size):
    train_labels.append(data[i].get('is_sarcastic'))

In [6]:
test_labels = []

for i in range(train_size, data_size):
    test_labels.append(data[i].get('is_sarcastic'))
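
The split arithmetic used above can be sketched in isolation. The helper `split_train_test` is a hypothetical name, and the list of integers is only a placeholder with the same length as the dataset (26709 records), not real data.

```python
def split_train_test(items, train_fraction):
    """Split a list into a leading training part and a trailing test part."""
    train_size = int(len(items) * train_fraction)  # int() truncates the fraction
    return items[:train_size], items[train_size:]

# Placeholder for the 26709 records of the dataset
records = list(range(26709))

for fraction in (0.5, 0.75, 0.85):
    train, test = split_train_test(records, fraction)
    print(fraction, len(train), len(test))
```

Because the split is a plain slice, the order of the records in the file determines which headlines end up in the test set.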

> Because I discarded any words that are not in the most frequent 10'000, no word index will exceed 10'000:

In [7]:
mx = 0
for sequence in train_data:
    # skip empty sequences: max() raises on an empty list
    if sequence:
        mx = max(mx, max(sequence))

print(mx) # 9999

9999


## 2.1.3 Vectorization

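- One-hot encoding turns the lists into a 2D tensor of binary vectors: each headline becomes a 10'000-dimensional vector with 1s at the indices of the words it contains and 0s everywhere else.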
- Turning the lists of integers into `tensors` makes it possible to feed them into a neural network and use a `Dense` layer as the first layer, as it can handle floating-point vector data.

In [8]:
import numpy as np

def vectorize_sequences(sequences, dimension = 10000):
    results = np.zeros( (len(sequences), dimension) )
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

print(x_train)
print(y_train)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[0. 0. 1. ... 0. 1. 1.]
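
To see what the encoding does on a small input, here is a pure-Python sketch of the same idea using nested lists instead of a NumPy array. The name `vectorize_sequences_py` and the toy sequences are assumptions made for this illustration.

```python
def vectorize_sequences_py(sequences, dimension):
    """Turn each list of word indices into a binary vector of length `dimension`."""
    results = []
    for sequence in sequences:
        row = [0.0] * dimension
        for index in sequence:
            row[index] = 1.0  # a repeated index still yields a single 1
        results.append(row)
    return results

# Toy input: three "headlines" over a vocabulary of 5 indices
toy_sequences = [[1, 3, 3], [0], []]
encoded = vectorize_sequences_py(toy_sequences, dimension=5)
for row in encoded:
    print(row)
```

Note that word order and word counts are lost: the vector only records which words are present.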


# 2.2 Building the Model
---

Except for the number of units in the hidden layers, the models are built with the same architecture and features, as shown in the table below:

| Parameter | Value |
| --- | --- |
| Hidden Layers | 2 |
| Hidden Layer Activation | `relu` |
| Output Layer Activation | `sigmoid` |
| Loss Function | `binary_crossentropy` |
| Optimizer | `rmsprop` |
| Metrics | `acc` |


>- The **intermediate layers** are `Dense` layers that use `relu` (rectified linear unit) as their activation function, which is meant to zero out negative values;
>- The **final layer** is a single-unit layer with a `sigmoid` activation function, which will output a probability between 0 and 1;  

>- The **loss function** used is the `binary_crossentropy`, which is usually the best choice when dealing with models that output probabilities;
>- The **optimizer** used is the `rmsprop` (Root Mean Square Propagation).







In [9]:
from keras import models
from keras import layers
from keras import optimizers
from keras import losses
from keras import metrics


n_units = 16


model = models.Sequential()
model.add(layers.Dense(n_units, activation = 'relu', input_shape = (10000,)))
model.add(layers.Dense(n_units, activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['acc'])

Using TensorFlow backend.
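
The activation and loss functions named in section 2.2 can be written out for single values as a rough sketch. These are the textbook definitions in plain Python, not the Keras code; the clipping constant `eps` is an assumption added to keep the logarithm finite.

```python
import math

def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for target, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        total -= target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
    return total / len(y_true)

print(relu(-3.0), relu(2.5))
print(sigmoid(0.0))
print(binary_crossentropy([1, 0], [0.5, 0.5]))
```

A model that always predicts 0.5 scores a loss of ln 2 (about 0.69), which is a useful reference point when reading the loss values reported later.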


# 2.3 Evaluation Protocol
---
### Hold-out Validation Set
The **validation set** sets apart 10'000 samples from the original training data in order to monitor the accuracy of the model on unseen data during training. Simple hold-out validation is a good-enough evaluation protocol here, as plenty of data remains for training.

In [10]:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
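
As a small sanity check of the sizes involved, the hold-out split can be sketched on placeholder data. The function `holdout_split` is a hypothetical helper, and the integer lists only stand in for the 22702 training samples and labels.

```python
def holdout_split(samples, labels, n_val):
    """Reserve the first n_val items for validation, keep the rest for training."""
    validation = (samples[:n_val], labels[:n_val])
    training = (samples[n_val:], labels[n_val:])
    return training, validation

# Placeholders with the size of the 85% training set
samples = list(range(22702))
labels = [i % 2 for i in range(22702)]

(part_x, part_y), (val_x, val_y) = holdout_split(samples, labels, n_val=10000)
print(len(part_x), len(val_x))
```

The remaining 12702 training samples match the "Train on 12702 samples, validate on 10000 samples" line printed by Keras during training.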

# 2.4 Training
---



# 2.4.1 Hyperparameter Tuning

- I focused on changing the following parameters:
    - The number of units of each layer;
    - The learning rate;
    - The batch size;
    - The number of epochs, depending on the performance of each run.
    

| Parameter | Values | Final Model Value |
| --- | --- | --- |
|Units | {4, 16, 32, 256} | 16 |
|Learning Rate | {0.003, 0.001, 0.0005, 0.00025} | 0.001 |
|Batch Size | {16, 64, 512, 1024, 2048} | 1024 |
|Epochs | {5, 10, 20, 40, 60} |  |


In [11]:
## Hyperparameters

lr = 0.001  # RMSprop default; the string 'rmsprop' passed to compile() uses it implicitly

batch_size = 1024

n_epochs = 20
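
One way to reason about the batch size is to count how many weight updates happen per epoch. The sketch below assumes the 12702 training samples that remain after the hold-out split and the batch sizes listed in the table; the helper name `updates_per_epoch` is made up for this example.

```python
import math

n_train = 12702  # training samples left after the hold-out split
batch_sizes = [16, 64, 512, 1024, 2048]

def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one epoch; the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)

steps = {b: updates_per_epoch(n_train, b) for b in batch_sizes}
for batch, count in steps.items():
    print(batch, count)
```

Larger batches mean far fewer updates per epoch, so a comparison of "epochs before overfitting" across batch sizes is also a comparison across very different numbers of updates.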


In [12]:
history = model.fit(partial_x_train, 
                    partial_y_train,
                    epochs = n_epochs,
                    batch_size = batch_size,
                    validation_data = (x_val, y_val))


results = model.evaluate(x_test, y_test)
print("Results: ", results)
print("Accuracy: %.2f%%" % (results[1]*100))

Train on 12702 samples, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Results:  [0.4733459255642983, 0.8300474170350898]
Accuracy: 83.00%


# 2.5 Reducing Overfitting


I experimented with many methods to reduce overfitting:

- `Reducing network size` reduced overfitting, although it caused the accuracy to drop dramatically;
- `L1 Weight Regularization` performed poorly;
- `L2 Weight Regularization` performed reasonably and reduced overfitting, although the best validation loss it achieved was just ~45%;
- Introducing `Dropout` appeared to be the best solution: it reduced overfitting while maintaining a good validation loss;
- The dropout rates tested were 0.1, 0.3 and 0.5.


## 2.5.1 L2 Weight Regularization


In [13]:
from keras import regularizers

l2 = regularizers.l2(0.001)

reg_l2_model = models.Sequential()
reg_l2_model.add(layers.Dense(n_units, kernel_regularizer = l2, activation = 'relu', input_shape = (10000,)))
reg_l2_model.add(layers.Dense(n_units, kernel_regularizer = l2, activation = 'relu'))
reg_l2_model.add(layers.Dense(1, activation = 'sigmoid'))

reg_l2_model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

reg_l2_history = reg_l2_model.fit(partial_x_train,
                    partial_y_train,
                    epochs = n_epochs,
                    batch_size = batch_size,
                    validation_data = (x_val, y_val))


results = reg_l2_model.evaluate(x_test, y_test)
print("Results: ", results)
print("Accuracy: %.2f%%" % (results[1]*100))

Train on 12702 samples, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Results:  [0.4483306345601673, 0.838283004741702]
Accuracy: 83.83%
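
The effect of the L2 regularizer on the training objective can be sketched in a few lines. This illustrates the general technique (a penalty proportional to the sum of squared weights, added to the data loss), with the factor 0.001 taken from the cell above; the helper names and the toy weights are assumptions.

```python
def l2_penalty(weights, factor):
    """L2 weight penalty: factor times the sum of squared weights."""
    return factor * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, factor=0.001):
    """Training objective = data loss + L2 penalty on the layer's weights."""
    return data_loss + l2_penalty(weights, factor)

toy_weights = [1.0, -2.0, 3.0]  # invented values for illustration
print(l2_penalty(toy_weights, 0.001))
print(regularized_loss(0.40, toy_weights))
```

Since the penalty grows with the square of each weight, large weights are discouraged, which limits how closely the network can fit the training data.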


## 2.5.2 Adding Dropout

- This will be the final model.

In [15]:
n_epochs = 17
dropout_rate = 0.5

do_model = models.Sequential()
do_model.add(layers.Dense(n_units, activation='relu', input_shape=(10000,)))
do_model.add(layers.Dropout(dropout_rate))
do_model.add(layers.Dense(n_units, activation='relu'))
do_model.add(layers.Dropout(dropout_rate))
do_model.add(layers.Dense(1, activation='sigmoid'))

do_model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

do_history = do_model.fit(partial_x_train,
                    partial_y_train,
                    epochs = n_epochs,
                    batch_size = batch_size,
                    validation_data = (x_val, y_val))



results = do_model.evaluate(x_test, y_test)
print("Results: ", results)
print("Accuracy: %.2f%%" % (results[1]*100))

Train on 12702 samples, validate on 10000 samples
Epoch 1/18
Epoch 2/18
Epoch 3/18
Epoch 4/18
Epoch 5/18
Epoch 6/18
Epoch 7/18
Epoch 8/18
Epoch 9/18
Epoch 10/18
Epoch 11/18
Epoch 12/18
Epoch 13/18
Epoch 14/18
Epoch 15/18
Epoch 16/18
Epoch 17/18
Epoch 18/18
Results:  [0.35560559869809194, 0.8445220863488895]
Accuracy: 84.45%
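
What a dropout layer does to a vector of activations can be sketched as follows. This is a minimal version of inverted dropout under stated assumptions: the function name, the seeded generator and the sample activations are all invented for illustration, and the code is not the Keras implementation.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # at inference time nothing is dropped
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)           # seeded only to make the example repeatable
hidden = [0.2, 1.5, 0.7, 3.0]    # invented activations
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
print(dropout(hidden, rate=0.5, training=False))
```

Dividing the surviving activations by `1 - rate` keeps their expected sum unchanged, so the layer can simply be switched off at test time.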



# 3. Results


# 3.1 Hyperparameter tuning results
---

# 3.1.1 Changing the Training Size


The first parameter I tested was the share of the data assigned to training, an essential parameter because it determines how much information the model can actually learn from.

The training sizes I experimented with were 50%, 75%, and 85% of the total data.

### The model trained on more data generalized better.

| TRAINING SIZE | EPOCHS BEFORE OVERFITTING| BEST LOSS | BEST ACCURACY |
| --- | --- | --- | --- |
| 50% | 14 | ~45% | ~79% |
| 75% | 11 | ~37% | ~83% |
| **85%** | **7** | **~35%** | **~85%** |

>As we can see from the observations in the table:
>- Training the model with **50%** of the data (the ratio also used in the IMDB movie review dataset) performed quite well in terms of overfitting, although the validation loss and validation accuracy were poor;
>- The **85%** ratio showed a validation loss about 10 percentage points lower and a validation accuracy about 6 percentage points higher than the same model with a 50% training size, even though it started overfitting after just 7 epochs.  
>
>
>- In these runs, the larger the training size, the sooner overfitting set in, while the accuracy increased with the training size. This suggests that the model trained on more data generalizes better.
>- Overfitting is a problem that can be resolved later with regularization; for now, we need to obtain a model with statistical power. 
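
The "epochs before overfitting" column is not defined formally in this report. One plausible reading, sketched here as an assumption, is the epoch at which the validation loss reaches its minimum before rising again. The loss values below are hypothetical and were not measured on this model.

```python
def epochs_before_overfitting(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

# Hypothetical validation-loss curve, for illustration only
hypothetical_val_loss = [0.62, 0.48, 0.41, 0.37, 0.36, 0.35, 0.345, 0.36, 0.38, 0.41]
print(epochs_before_overfitting(hypothetical_val_loss))
```

With a Keras `History` object, the list of validation losses would typically come from `history.history['val_loss']`.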

#### The other parameters of this first naive model are shown in the table below.


| PARAMETER | VALUE | PARAMETER | VALUE |
| --------------- | --- | --------------- | --- |
| Training Size   | 85% |Learning Rate   | 0.001 |
| Units           | 16 | Layers          | 2 |
| Batch Size      | 512 | N of Epochs     | 20 |
| Dropout         | N/A | Weight Regularization | N/A |
| Best Loss       | ~35% | Best Accuracy | ~85% |
| . | ![title](images/loss_1.png) | . | ![title](images/acc_1.png) |

---

# 3.1.2 Tuning the Learning Rate η

### The RMSProp default learning rate (η = 0.001) proved to be a valid choice.

| LEARNING RATE | EPOCHS BEFORE OVERFITTING | BEST LOSS | BEST ACCURACY |
| --- | --- | --- | --- |
| 0.003 | 3 | ~37% | ~84% |
| **0.001** | **7** | **~35%** | **~85%** |
| 0.0005 | 16 | ~35% | ~85% |
| 0.00025 | 32 | ~35% | ~85% |

>As we can see from the observations in the table:
>- **Low learning rates, such as η ≤ 0.0005,** converge smoothly and postpone overfitting, but they dramatically slow down learning and increase the running time, and are therefore inefficient.
>- A **larger learning rate, such as η = 0.003,** speeds up learning but causes overfitting after just 3 epochs, a result that we want to avoid.
>- A valid choice turns out to be **η = 0.001**: it does not differ from the lower learning rates in terms of validation accuracy and loss, and it balances overfitting against learning speed.


For now, we will therefore keep our first model displayed above in 3.1.1.
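
To illustrate how the learning rate η scales each update, here is a sketch of the standard RMSProp rule applied to a toy one-dimensional problem, f(w) = w². The decay `rho = 0.9`, the constant `eps`, the starting point and the number of steps are assumptions chosen for the example; only the two learning rates come from the table above.

```python
import math

def rmsprop_minimize(grad_fn, w0, lr, steps, rho=0.9, eps=1e-7):
    """Run plain RMSProp on a single scalar parameter and return its final value."""
    w = w0
    avg_sq_grad = 0.0  # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        w -= lr * g / (math.sqrt(avg_sq_grad) + eps)
    return w

def grad(w):
    """Gradient of the toy objective f(w) = w**2."""
    return 2.0 * w

w_slow = rmsprop_minimize(grad, 1.0, lr=0.001, steps=100)
w_fast = rmsprop_minimize(grad, 1.0, lr=0.003, steps=100)
print(w_slow, w_fast)
```

Because the gradient is divided by its running magnitude, each step has a size close to η, so tripling η roughly triples the distance covered in the same number of steps.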

---

# 3.1.3 Tuning the number of Hidden Layer Units


### The model balances performance and overfitting with 16 units:

| N OF UNITS | EPOCHS BEFORE OVERFITTING | BEST LOSS | BEST ACCURACY |
| --- | --- | --- | --- |
| 4 | 24 | ~45% | ~80% |
| **16** | **7** | **~35%** | **~85%** |
| 32 | 6 | ~37% | ~84% |
| 256 | 3 | ~36% | ~84% |

>As we can see from the observations in the table:
>- In a **lower-capacity model**, such as the 4-unit one, we can see an example of `underfitting`: the model is not powerful enough and has not yet modeled all relevant patterns in the training data;
>- In the similar problem of **classifying the IMDB movie review dataset**, a smaller network with 2 layers of 4 units each was also used to counter overfitting. That model, though, never reached the accuracy of the bigger 16-unit model. On this dataset we see the same poor performance: the 4-unit model trails the bigger model by about 10 percentage points in loss and 5 in accuracy.  
>
>
>- The model performs best with 16 hidden units: compared with 4 units, **a higher-dimensional representation space** allows the network to learn more complex representations, even though it makes the network more computationally expensive, while the larger 32- and 256-unit models overfit sooner without improving loss or accuracy.






For now, we will therefore keep our first model displayed above in 3.1.1.


#### In the graph below we can see an example of underfitting (before ~20/25 epochs) of a model with 4 units.


|  .    | . |
| --- | --- |
| ![title](images/loss_2.png) | ![title](images/acc_2.png) |

---

# 3.1.4 Changing the Batch Size


### Increasing the Batch Size improves the performance

| BATCH SIZE | EPOCHS BEFORE OVERFITTING | BEST LOSS | BEST ACCURACY |
| --- | --- | --- | --- |
| 16 | 6 | ~40% | ~84% |
| 64 | 5 | ~37% | ~83% |
| 512 | 7 | ~35% | ~85% |
| **1024** | **12** | **~34.5%** | **~85%** |
| 2048 | 29 | ~39% | ~85% |


>As we can see from the observations in the table:
>- The model responds well to larger batch sizes up to 1024: both the validation loss and the number of epochs before overfitting broadly improve as the batch size increases, while at 2048 the loss worsens again.
>- A batch size of 1024 was the best fit in these tests.


#### We can see the graph of the improved model below.

| PARAMETER | VALUE | PARAMETER | VALUE |
| --------------- | --- | --------------- | --- |
| Training Size   | 85% |Learning Rate   | 0.001 |
| Units           | 16 | Layers          | 2 |
| Batch Size      | 1024 | N of Epochs     | 12 |
| Dropout         | N/A | Weight Regularization | N/A |
| Best Loss       | ~34.5% | Best Accuracy | ~85% |
| . | ![title](images/loss_3.png) | . | ![title](images/acc_3.png) |

---

# 3.2 Reducing Overfitting: Tuning Dropout Rate

---

## 3.2.1 Dropout vs L2 Regularization

#### Dropout strictly outperformed L2 Regularization

|  .    | . |
| --- | --- |
| ![title](images/reg.png) | . |


---

## 3.2.2  Tuning Dropout Rate

#### Increasing Dropout maintains a good performance while largely delaying overfitting

| DROPOUT RATE | EPOCHS BEFORE OVERFITTING | BEST LOSS | BEST ACCURACY |
| --- | --- | --- | --- |
| 0.1 | 11 | ~34.5% | ~85% |
| 0.3 | 13 | ~35% | ~85% |
| **0.5** | **17** | **~35%** | **~85%** |
| . |![title](images/dropouts.png) | . |  . |


>As we can see from the observations in the table:
>- A **dropout rate of 0.1** is too small to produce a visible change compared with the model without dropout;
>- **dr = 0.5** is a good dropout rate choice: while maintaining the same performance as **dr = 0.3**, it doesn't start overfitting until 18 epochs.

---
#### The final model

| PARAMETER | VALUE | PARAMETER | VALUE |
| --------------- | --- | --------------- | --- |
| Training Size   | 85% |Learning Rate   | 0.001 |
| Units           | 16 | Layers          | 2 |
| Batch Size      | 1024 | N of Epochs     | 17 |
| Dropout         | 0.5 | Weight Regularization | N/A |
| Best Loss       | ~35.5% | Best Accuracy | ~84.5% |
| . | ![title](images/loss_4.png) | . | ![title](images/acc_4.png) |

---

# 4. Evaluation
---


## 4.1 Conclusion


- One of the parameters that most affected the performance of the model was the **training data size**. The difference between 50%, 75%, and 85% was remarkable and showed that the model learns better, and therefore generalizes better, with more data.
- Building a **higher-dimensional representation space** and increasing the **batch size** allowed the neural network to learn more complex representations, improved the validation performance and delayed overfitting.  


- **Without regularization**, the model achieved a `validation loss of ~34.5%` and a `validation accuracy of ~85%`, although it started overfitting after about 12 epochs.
- After **regularization**, we achieved almost the same performance but postponed overfitting until about 18 epochs, which is a better result overall.
- The data did not appear to be noisy: the **standard deviation** was always low, and the output never oscillated too much.


> The final model reached a `validation loss of ~35%` and a `validation accuracy of ~85%`.
> The model strictly outperforms the **baseline**, corresponding to a 50% accuracy.

#### The final model is displayed in the table below.

| PARAMETER | VALUE | PARAMETER | VALUE |
| --------------- | --- | --------------- | --- |
| Training Size   | 85% |Learning Rate   | 0.001 |
| Units           | 16 | Layers          | 2 |
| Batch Size      | 1024 | N of Epochs     | 17 |
| Dropout         | 0.5 | Weight Regularization | N/A |
| Best Loss       | ~35.5% | Best Accuracy | ~84.5% |
| . | ![title](images/loss_4.png) | . | ![title](images/acc_4.png) |

---


## 4.2 Improvements

- One of the main improvements concerns data pre-processing. For example, as punctuation might be important for sarcasm detection, encoding exclamation and question marks could be helpful, as could removing stopwords that do not add much value to the model (e.g. this, me, there...).

- Also, feeding more data to the model and building a more complex neural network could help reach a better performance: for example, a hybrid of LSTM and CNN achieved an error rate of only 0.10 on this same problem.
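
The pre-processing idea from the first point could look roughly like the sketch below. It was not part of the experiments in this report: the regular expression, the function name `preprocess` and the tiny stopword set (built from the three examples mentioned above) are assumptions for illustration.

```python
import re

# Tiny illustrative stopword set; a real list would be much longer
STOPWORDS = {"this", "me", "there"}

# Words (letters, digits, apostrophes) or a single '!' / '?' kept as its own token
TOKEN_PATTERN = re.compile(r"[a-z0-9']+|[!?]")

def preprocess(headline):
    """Lowercase, keep '!' and '?' as tokens, and drop stopwords."""
    tokens = TOKEN_PATTERN.findall(headline.lower())
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("Is this really me there?!"))
```

The resulting token lists could then be fed to a tokenizer that does not filter out punctuation, so that `!` and `?` receive their own indices.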