<h1 style="text-align:center;"><b>Easy Data Augmentation (EDA) for Text Classification with Disaster Tweets</b></h1>

<h2 style="text-align:center;"><b>Course: Introduction to Natural Language Processing</b></h2>

<h3 style="text-align:center;"><b>University of Trento, 2021-22</b></h3>


# 0. Instructions

The notebook is divided into the following sections:
- **1. Setup** 
    - ***1.1 Import libraries***: import the required libraries
    - ***1.2 Set random seed and create useful folders***: set the random seed for reproducibility and create the output folder
- **2. Dataset**
    - ***2.1 Load and prepare Disaster Tweets dataset***: load the dataset and prepare it for the experiments
    - ***2.2 Resampling***: resample the dataset  (optional)
- **3. Data Preprocessing**
    - ***3.1 Cleaning***: clean the tweets
    - ***3.2 EDA: Easy Data Augmentation***: apply EDA to the tweets (optional)
    - ***3.3 Tokenization and Padding***: tokenize and pad the tweets
    - ***3.4 GloVe Embeddings***: load the GloVe embeddings and prepare the embedding matrix for the RNN model
- **4. RNN with LSTM cells**
    - ***4.1 Define the RNN model***: define the RNN model
    - ***4.2 Train and evaluate the RNN model***: train the model and evaluate it on the held-out set

Run the cells in order. Two sections are optional:
- ***2.2 Resampling***: run these cells only if you want to train on a balanced dataset.
- ***3.2 EDA***: run these cells only if you want to augment the training set.

If you do not need them, skip those cells and continue with the next section.

# 1. Setup

## 1.1 Import libraries

In [None]:
# utilities
import re
import pickle
import numpy as np
import pandas as pd
import os

# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# nltk
import nltk
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords


# sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# sklearn nlp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


# keras
import tensorflow as tf

# utils
from utils import save_metrics, print_metrics, plot_confusion_matrix

# preprocessing
from preprocessing import preprocess_data

# data augmentation
from data_augmentation import augment_data


## 1.2 Set random seed and create useful folders

In [None]:
def set_seed(seed=16):
    # use the seed argument instead of a hard-coded value
    np.random.seed(seed)
    tf.random.set_seed(seed)

set_seed()

# create a folder called saved_results if it does not exist
if not os.path.exists('saved_results'):
    os.mkdir('saved_results')
    print('saved_results folder created')
else:
    print('saved_results folder already exists')

# 2. Dataset

## 2.1 Load and prepare Disaster Tweets dataset
Description: This dataset contains 7613 tweets that were hand-classified as disaster (target 1) or not disaster (target 0). The dataset is unbalanced, with 4342 non-disaster tweets and only 3271 disaster tweets. The goal is to predict which tweets are about real disasters and which ones aren't.

In [None]:
import pandas as pd

df = pd.read_csv('data/train.csv')
df.head(3)

In [None]:
# drop unnecessary columns
df.drop(['id','keyword','location'],axis=1,inplace=True)
df.head(3)

In [None]:
# print if there are any missing values
print('There are {} missing values in the dataset'.format(df.isnull().sum().sum()))

# print the number of rows and columns in the dataset
print('There are {} rows and {} columns in the dataset'.format(df.shape[0],df.shape[1]))

# print the names of the columns
print('The names of the columns are: {}'.format(df.columns))

# print the number of tweets per class
print('The number of tweets per class are: ')
print('{}'.format(df.target.value_counts()))

## 2.2 Resampling
Disclaimer: run this cell only if you want a balanced dataset. It downsamples the majority class (non-disaster, target 0) to the size of the minority class (disaster, target 1). Otherwise, skip this cell.

In [None]:
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df[df.target==0]
df_minority = df[df.target==1]

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                   replace=False,    # sample without replacement
                                   n_samples=len(df_minority),     # to match minority class
                                   random_state=123) # reproducible results

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

df = df_downsampled.copy()

# reset the index
df.reset_index(drop=True,inplace=True)

# print the number of tweets per class
print('The number of tweets per class are: ')
print('{}'.format(df.target.value_counts()))

# 3. Data Preprocessing

## 3.1 Cleaning

In [None]:
# apply the preprocess_data function to the text column
df['clean_text'] = df['text'].apply(preprocess_data)
df.head(3)

In [None]:
# compare the original text with the clean text for the first 3 rows
print(df[['text','clean_text']].head(3))

In [None]:
# Split the data into train and test sets

# keep only the cleaned text and the label, and rename clean_text to text
# (rename on the selection itself to avoid modifying a slice in place)
df = df[['clean_text','target']].rename(columns={'clean_text':'text'})

# split the data, keeping the dataframe structure with text and target;
# the held-out set (df_test) is also used as validation data during training
df_train, df_test = train_test_split(df, test_size=0.2, random_state=16)

# print the shape of the train set and of the held-out set
print('There are {} rows and {} columns in train'.format(df_train.shape[0],df_train.shape[1]))
print('There are {} rows and {} columns in validation'.format(df_test.shape[0],df_test.shape[1]))

## 3.2 EDA: Easy Data Augmentation

In this section, we use the textattack library to augment the training data with EDA (Easy Data Augmentation), a simple yet effective technique for increasing the size of a text training set. EDA creates new sentences from an existing one with the following operations:
- Synonym replacement: randomly choose n words from the sentence and replace each of them with one of its synonyms chosen at random.
- Random insertion: insert a synonym of a random word at a random position in the sentence, n times.
- Random swap: randomly swap two words in the sentence, n times.
- Random deletion: randomly remove each word in the sentence with a fixed probability.
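
To make the operations concrete, here is a minimal pure-Python sketch. It is not the textattack implementation: the `SYNONYMS` table and all function names are hypothetical, and a real pipeline would look synonyms up in WordNet. Random deletion is included because the original EDA method also defines it.

```python
import random

# Hypothetical toy synonym table; a real pipeline would query WordNet instead.
SYNONYMS = {
    "fire": ["blaze", "inferno"],
    "big": ["large", "huge"],
    "street": ["road", "avenue"],
}

def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n, rng):
    """Insert a synonym of a random word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in SYNONYMS]
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out

def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(16)
tweet = "big fire on the main street".split()
for operation in (synonym_replacement, random_insertion, random_swap):
    print(operation.__name__, operation(tweet, 1, rng))
print("random_deletion", random_deletion(tweet, 0.1, rng))
```

Each function returns a new list and leaves the input untouched, so the same tweet can be passed through several operations to produce several augmented variants.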

### Methodology

We apply EDA to the training data only, with the following parameters:
- number of augmented sentences per original sentence (`transformations_per_example`): 4
- alpha (`pct_words_to_swap`): 0.1

From each sentence, we generate 4 augmented sentences. The alpha parameter is a float between 0 and 1 that controls the fraction of words in each sentence that an operation changes: the higher the value of alpha, the more words are changed.
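
As a rough illustration of what these two parameters imply, the sketch below computes how many words a single operation would touch and how large the training set could become. Both helper names are hypothetical, the "at least one word" rule is an assumption about how the augmenter rounds, and whether the original sentences are kept alongside the augmented ones depends on `augment_data`, which is not shown here.

```python
def words_to_change(sentence, alpha):
    """Words touched by one operation: alpha times the sentence length, at least 1 (assumed rule)."""
    return max(1, int(alpha * len(sentence.split())))

def augmented_size(n_original, n_aug, keep_original=True):
    """Rows after augmentation, with or without the original sentences."""
    return n_original * (n_aug + (1 if keep_original else 0))

short_tweet = "forest fire near la ronge sask canada"   # 7 words
long_tweet = " ".join(["word"] * 25)                     # 25 words

print(words_to_change(short_tweet, alpha=0.1))  # 1
print(words_to_change(long_tweet, alpha=0.1))   # 2
print(augmented_size(1000, n_aug=4))            # 5000
print(augmented_size(1000, n_aug=4, keep_original=False))  # 4000
```

With alpha at 0.1 and tweets that rarely exceed 30 words, each operation changes only one to three words, so the augmented sentences stay close to the originals.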

### Disclaimer: Data Augmentation
Run the following cells only if you want to perform the classification task with EDA applied to the training set. If you want to use just the original dataset, skip them.


In [None]:
# print shape of train and test

print('There are {} rows and {} columns in train'.format(df_train.shape[0],df_train.shape[1]))
print('There are {} rows and {} columns in validation'.format(df_test.shape[0],df_test.shape[1]))

In [None]:
# perform data augmentation on the train set
df_train = augment_data(df_train, pct_words_to_swap=0.1, transformations_per_example=4)

In [None]:
print('There are {} rows and {} columns in train'.format(df_train.shape[0],df_train.shape[1]))
print('There are {} rows and {} columns in validation'.format(df_test.shape[0],df_test.shape[1]))

## 3.3 Tokenization and Padding

In [None]:
# create x_train, y_train, x_test, y_test

x_train, y_train = df_train['text'], df_train['target']
x_test, y_test = df_test['text'], df_test['target']

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()

tokenizer.fit_on_texts(x_train)
word_index = tokenizer.word_index

vocab_size = len(tokenizer.word_index) + 1
print("Vocabulary Size :", vocab_size)

max_length = 30

from tensorflow.keras.preprocessing.sequence import pad_sequences
# The texts are converted into sequences of word indices, then padded or truncated to max_length
x_train = pad_sequences(tokenizer.texts_to_sequences(x_train),maxlen = max_length)
x_test = pad_sequences(tokenizer.texts_to_sequences(x_test),maxlen = max_length)

# print shapes
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
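
The effect of tokenization and padding can be sketched in plain Python. This is an illustration, not the Keras code: `texts_to_ids` and `pad_pre` are hypothetical helpers that mimic the default behaviour of `Tokenizer` without an `oov_token` (unknown words are dropped) and of `pad_sequences` (zeros are added on the left, and long sequences keep their last `maxlen` ids).

```python
def texts_to_ids(text, word_index):
    """Map known words to their integer ids; words missing from the index are skipped."""
    return [word_index[w] for w in text.split() if w in word_index]

def pad_pre(sequence, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen ids of long ones."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

toy_word_index = {"fire": 1, "in": 2, "the": 3, "forest": 4}

ids = texts_to_ids("huge fire in the forest", toy_word_index)
print(ids)                                    # [1, 2, 3, 4]
print(pad_pre(ids, maxlen=6))                 # [0, 0, 1, 2, 3, 4]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Because index 0 is reserved for padding, the vocabulary size used for the embedding layer is the number of known words plus one.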

## 3.4 GloVe Embeddings

In [None]:
embeddings_index = {}
embedding_dimension = 300
# open the downloaded GloVe embeddings file; the context manager closes it automatically
with open('data/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        # each line holds a word followed by its embedding coefficients
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

In [None]:
# create a matrix of zeros with shape (vocab_size, embedding_dimension)
embedding_matrix = np.zeros((vocab_size, embedding_dimension))
# iterate over the (word, index) pairs of the tokenizer vocabulary
for word, i in word_index.items():
    # look up the GloVe vector for this word
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # store it in the row given by the word's index;
        # words without a GloVe vector keep an all-zero row
        embedding_matrix[i] = embedding_vector
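
Words that have no GloVe vector keep an all-zero row, so it can be useful to check how much of the vocabulary is actually covered. The helper below is a hypothetical sketch on toy data; in the notebook it could be called with `word_index` and `embeddings_index`.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return the share of vocabulary words with a pretrained vector and the missing words."""
    missing = [w for w in word_index if w not in embeddings_index]
    covered = len(word_index) - len(missing)
    return covered / max(1, len(word_index)), missing

# toy stand-ins for the tokenizer vocabulary and the GloVe lookup table
toy_word_index = {"fire": 1, "flood": 2, "omg": 3, "lol2021": 4}
toy_embeddings = {"fire": [0.1, 0.2], "flood": [0.3, 0.4], "omg": [0.5, 0.6]}

coverage, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(coverage)  # 0.75
print(missing)   # ['lol2021']
```

A low coverage would suggest that the cleaning step produces tokens (misspellings, hashtags, merged words) that the pretrained vocabulary does not contain.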

# 4. RNN with LSTM cells

## 4.1 Define the RNN model

In [None]:
# create a folder called Glove_LSTM inside the saved_results folder if it does not exist

if not os.path.exists('saved_results/Glove_LSTM'):
    os.mkdir('saved_results/Glove_LSTM')
    print('saved_results/Glove_LSTM folder created')
else:
    print('saved_results/Glove_LSTM folder already exists')


# create a folder AUG_Glove_LSTM inside the saved_results/Glove_LSTM folder if it does not exist

if not os.path.exists('saved_results/Glove_LSTM/AUG_Glove_LSTM'):
    os.mkdir('saved_results/Glove_LSTM/AUG_Glove_LSTM')
    print('saved_results/Glove_LSTM/AUG_Glove_LSTM folder created')

else:
    print('saved_results/Glove_LSTM/AUG_Glove_LSTM folder already exists')

In [None]:
import tensorflow as tf
embedding_layer = tf.keras.layers.Embedding(vocab_size,embedding_dimension,
                                            weights=[embedding_matrix],
                                          input_length=max_length,trainable=False)


# Import the layers needed for the architecture from keras
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Input, Dropout

# The Input layer 
sequence_input = Input(shape=(max_length,), dtype='int32')
# Inputs passed to the embedding layer
embedding_sequences = embedding_layer(sequence_input)
# Passed on to the bi-directional LSTM layer
x = Bidirectional(LSTM(64, return_sequences=True))(embedding_sequences)
x = Dropout(0.5)(x)
# Passed on to the second bi-directional LSTM layer
x = Bidirectional(LSTM(32))(x)
x = Dropout(0.5)(x)
# Passed on to dense layer with ReLU activation
x = Dense(20, activation='relu')(x)
# Passed on to sigmoid output layer
outputs = Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(sequence_input, outputs)

# Compile the model with binary crossentropy loss function and Adam optimizer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Use early stopping with a patience of 3 epochs
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(patience=3)

In [None]:
model.summary()
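
The trainable parameter count reported by the summary can be cross-checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and one bias vector) and the layer sizes defined above; the helper names are hypothetical. The frozen embedding layer is excluded because its weights are not trainable.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weights of a fully connected layer: one weight per input-unit pair plus biases."""
    return input_dim * units + units

embedding_dim = 300

# a bidirectional wrapper holds two independent LSTMs and concatenates their outputs
bilstm_1 = 2 * lstm_params(embedding_dim, 64)  # output size 128
bilstm_2 = 2 * lstm_params(2 * 64, 32)         # output size 64
dense_1 = dense_params(2 * 32, 20)
output_layer = dense_params(20, 1)

trainable = bilstm_1 + bilstm_2 + dense_1 + output_layer
print(bilstm_1, bilstm_2, dense_1, output_layer)  # 186880 41216 1300 21
print(trainable)                                   # 229417
```

Dropout layers add no parameters, so they do not appear in the sum.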

## 4.2 Train and evaluate the RNN model

In [None]:
# Train the model
training = model.fit(x_train, y_train, batch_size=1024, epochs=50,
                    validation_data=(x_test, y_test), callbacks=[early_stopping])
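
The early stopping rule used during training can be sketched as follows. This is a simplified illustration of the patience logic, not the Keras source: it assumes the monitored quantity is the validation loss and that any decrease counts as an improvement. The function name and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # improvement: remember it and reset the counter
            best, wait = loss, 0
        else:
            # no improvement: stop once this has happened `patience` times in a row
            wait += 1
            if wait >= patience:
                return epoch
    return None

toy_val_losses = [0.60, 0.52, 0.48, 0.49, 0.50, 0.51, 0.47]
print(early_stopping_epoch(toy_val_losses))        # 6
print(early_stopping_epoch([0.5, 0.4, 0.3, 0.2]))  # None
```

In the toy run the best loss occurs at epoch 3 and training stops at epoch 6. With the default callback settings the model keeps the weights of the last epoch rather than the best one, unless `restore_best_weights=True` is passed.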

In [None]:
from utils import plot_and_save_loss
from utils import plot_and_save_accuracy

# Plotting the training and validation loss

plot_and_save_loss(training, 'saved_results/Glove_LSTM/loss.png') 
# plot_and_save_loss(training, 'saved_results/Glove_LSTM/AUG_Glove_LSTM/loss.png')

# Plotting the training and validation accuracy

plot_and_save_accuracy(training, 'saved_results/Glove_LSTM/accuracy.png')
# plot_and_save_accuracy(training, 'saved_results/Glove_LSTM/AUG_Glove_LSTM/accuracy.png')


In [None]:
# Predicting the test data
y_pred = model.predict(x_test)

# Converting the predicted values to 0 or 1
y_pred = np.round(y_pred)

print_metrics(y_test, y_pred)

plot_confusion_matrix(y_test, y_pred, 'saved_results/Glove_LSTM/Glove_LSTM_Confusion_Matrix.png')
# plot_confusion_matrix(y_test, y_pred, 'saved_results/Glove_LSTM/AUG_Glove_LSTM/Glove_LSTM_Confusion_Matrix.png')


save_metrics(y_test, y_pred, 'saved_results/Glove_LSTM')
# save_metrics(y_test, y_pred, 'saved_results/Glove_LSTM/AUG_Glove_LSTM')
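
For reference, the sketch below shows how sigmoid outputs could be thresholded and turned into the usual binary classification metrics. It is not the code behind `print_metrics` or `save_metrics`, whose implementation lives in `utils` and is not shown here; the function name and the toy values are hypothetical.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold probabilities and compute accuracy, precision, recall and F1 for class 1."""
    y_hat = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

toy_true = [1, 0, 1, 1, 0, 0]
toy_prob = [0.9, 0.4, 0.3, 0.8, 0.7, 0.1]
print(binary_metrics(toy_true, toy_prob))
```

On an unbalanced dataset, precision, recall and F1 for the disaster class are more informative than accuracy alone.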