# Task 2 - Dominik Wiśniewski

The task is to build a binary classifier that labels each sentence as sarcastic or not sarcastic.

Unlike in the previous task, and mainly because of the size of the data set, I decided to approach the problem from the other side. After transforming the data, I first train simpler models and then, depending on how well they perform, move on to more complex ones.

In [None]:
# imports
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
import tensorflow as tf

In [None]:
dataset = pd.read_json(r'Graduate - HEADLINES dataset (2019-06).json', lines = True)

In [None]:
dataset.shape

In [None]:
dataset.head()

In [None]:
dataset.tail()

In [None]:
print(dataset.isnull().any(axis = 0))

It is always worth checking the dataset for null values first. This one does not contain any.

The headline column contains special symbols that have to be removed, so I use a regular expression that replaces every character other than a letter with a space.

In [None]:
dataset['headline'] = dataset['headline'].apply(lambda s : re.sub('[^a-zA-Z]', ' ', s))
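
To make the effect of the substitution concrete, here is a minimal sketch that applies the same pattern to a single string. The helper name `strip_non_letters` and the sample text are made up for illustration and are not part of the dataset.

```python
import re

def strip_non_letters(text: str) -> str:
    """Replace every character that is not an ASCII letter with a space."""
    return re.sub(r"[^a-zA-Z]", " ", text)

# Hypothetical sample, not taken from the dataset
sample = "mom's 3 rules: don't panic!"
cleaned = strip_non_letters(sample)

# The string keeps its length; digits, punctuation and apostrophes become spaces,
# so a contraction such as "don't" is split into "don" and "t".
tokens = cleaned.split()
print(tokens)
```

Because spaces replace the removed characters one for one, runs of spaces can appear; the later `split()` call in the stemming step collapses them.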

In [None]:
dataset['is_sarcastic'].value_counts().plot(kind='bar')
plt.title("Class balance")
plt.xlabel("Sarcastic or not")
plt.ylabel("Count")
plt.show()
plt.close()

In [None]:
X = dataset['headline']
Y = dataset['is_sarcastic']

Stemming is the process of reducing a word to its stem by stripping affixes; the Porter stemmer used here removes suffixes. The resulting stem is not necessarily a dictionary word, which is what distinguishes it from a lemma.

In [None]:
ps = PorterStemmer()

In [None]:
X = X.apply(lambda x: x.split())
X = X.apply(lambda x : ' '.join([ps.stem(word) for word in x]))
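
The idea of mapping several surface forms to one stem can be illustrated with a deliberately crude suffix stripper. This is not the Porter algorithm, which applies a much longer sequence of rules; the suffix list and the names `crude_stem` and `stem_sentence` are assumptions made for this sketch.

```python
# Toy suffix list, checked in order; NOT the Porter rule set
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_sentence(sentence: str) -> str:
    """Stem every whitespace-separated token and join the result back together."""
    return " ".join(crude_stem(word) for word in sentence.lower().split())

print(stem_sentence("walking walked walks"))
```

All three forms collapse to the same token, so the vectorizer later counts them as one feature instead of three.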

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a very common way to turn text into numeric features: each term is weighted by how often it occurs in a document and down-weighted by how many documents contain it. The technique is widely used for feature extraction in NLP applications. I limited the number of features to 5,000 so that the more complex models can be trained in a reasonable time.

In [None]:
tv = TfidfVectorizer(max_features = 5000)
X = list(X)
X = tv.fit_transform(X).toarray()
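
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand using the formula that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalization of each row). Tokenization is simplified to `str.split`, and the three example documents are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Invented examples, not taken from the dataset
docs = ["area man wins lottery", "area man loses keys", "senate passes budget"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    if weight:
        print(f"{term}: {weight:.3f}")
```

In the first document, "wins" gets a higher weight than "area" because "area" also appears in another document. With `max_features = 5000` the vectorizer keeps only the 5,000 terms that are most frequent across the corpus.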

The next step is to split the data into training and test sets. In this task the test set is small: 5% of the data.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .05, random_state = 0)

In [None]:
models = [GaussianNB(), LogisticRegression(), RandomForestClassifier(), LinearSVC()]

In [None]:
rows = []

for model in models:
    model_name = type(model).__name__

    print(f"Calculating {model_name} model")

    print("Start training model...")
    model.fit(X_train, Y_train)

    print("Calculating scores....")
    train_score = model.score(X_train, Y_train)
    test_score = model.score(X_test, Y_test)

    print(f"Train Score: {train_score}\nTest Score: {test_score}\n")

    rows.append({'Model': model_name, 'TrainScore': train_score, 'TestScore': test_score})

# Build the frame from a list of rows (DataFrame.append no longer exists in pandas 2.x)
TestModels = pd.DataFrame(rows).set_index('Model')
fig, axes = plt.subplots(ncols=1, figsize=(10, 4))
TestModels.TrainScore.plot(ax=axes, kind='bar', title='Train data score')
plt.show()

fig, axes = plt.subplots(ncols=1, figsize=(10, 4))
TestModels.TestScore.plot(ax=axes, kind='bar', title='Test data score')
plt.show()

Logistic regression and the linear SVM reached 83% accuracy; the other models scored lower.
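
The numbers reported by `model.score` for scikit-learn classifiers are mean accuracy, i.e. the fraction of examples whose predicted label matches the true label. A minimal sketch of that computation, with a hypothetical helper name and invented labels:

```python
def accuracy(y_true, y_pred) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Invented labels: three of the four predictions are right
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```

Accuracy is easiest to interpret when the two classes are roughly balanced, which is what the class balance plot earlier in the notebook is meant to check.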

Now it is time for an artificial neural network. Besides vectorizing the features, it is also advisable to vectorize the labels, the so-called one-hot encoding.

In [None]:
# helper that maps binary labels to one-hot vectors
def map_labels(labels) -> list:
    """Map label 0 to [1, 0] and label 1 to [0, 1] (one-hot encoding)."""
    return [np.array([1, 0]) if x == 0 else np.array([0, 1]) for x in labels]

Y = np.array(map_labels(Y))
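
For reference, the same mapping written with plain lists, together with the inverse step that turns a one-hot vector or a softmax output back into a class label. The names `one_hot` and `decode` are hypothetical helpers for this sketch only.

```python
def one_hot(label: int, num_classes: int = 2) -> list:
    """Return a vector of zeros with a single 1 at index `label`."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector

def decode(vector) -> int:
    """Return the index of the largest entry (argmax)."""
    return max(range(len(vector)), key=lambda i: vector[i])

print(one_hot(0), one_hot(1))
print(decode([0.2, 0.8]))
```

The two-column target matches the two-unit softmax output layer used in the network, and `decode` is how a predicted probability pair is read as class 0 or class 1.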

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.05, random_state=0)

In [None]:
# defining model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(512, activation=tf.nn.relu))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(2, activation=tf.nn.softmax))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# the test split is also passed as validation data, so validation scores are test scores
model.fit(X_train, Y_train, batch_size=32, epochs=5,
          verbose=1, validation_data=(X_test, Y_test))
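
What the `Dropout(0.4)` layer does during training can be sketched in a few lines: each activation is zeroed with probability 0.4 and the survivors are scaled by 1 / (1 - 0.4) so the expected sum stays the same, while at inference the layer passes values through unchanged. The function below is a plain-Python illustration with hypothetical names, not the Keras implementation.

```python
import random

def dropout(activations, rate: float, training: bool, rng: random.Random) -> list:
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
train_out = dropout([1.0] * 1000, rate=0.4, training=True, rng=rng)
infer_out = dropout([1.0] * 1000, rate=0.4, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
print(f"dropped during training: {dropped_fraction:.2f}")
```

Because a different random subset of units is silenced on every batch, the network cannot rely on any single unit, which is why the layer slows down overfitting.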

In [None]:
val_loss, val_acc = model.evaluate(X_test, Y_test, verbose=0)

print(f"Accuracy: {val_acc}, loss: {val_loss}")

The model is limited to 5 epochs because it overfits very quickly. A Dropout layer was also added; during training it randomly zeroes 40% of the hidden-layer activations. The validation set helped to catch the problem: accuracy on the validation set dropped quickly while accuracy on the training set kept rising, which is a classic sign of overfitting, the point at which the model loses its ability to generalize and memorizes the training examples instead of learning the pattern behind them. This model could not exceed 85% on the validation set.
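
The check described above, training accuracy rising while validation accuracy falls, can be written down as a small helper that works on per-epoch histories. This is only a sketch: the function names are hypothetical and the accuracy values are invented for illustration, not results from this notebook.

```python
def best_epoch(val_acc) -> int:
    """Return the 1-based epoch with the highest validation accuracy."""
    best_index = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return best_index + 1

def is_overfitting(train_acc, val_acc, window: int = 2) -> bool:
    """True if training accuracy rose and validation accuracy fell over `window` epochs."""
    if len(train_acc) != len(val_acc) or len(train_acc) <= window:
        return False
    train_rose = train_acc[-1] > train_acc[-1 - window]
    val_fell = val_acc[-1] < val_acc[-1 - window]
    return train_rose and val_fell

# Invented histories, for illustration only
train_history = [0.80, 0.90, 0.95, 0.98, 0.99]
val_history = [0.83, 0.85, 0.84, 0.83, 0.82]

print(best_epoch(val_history), is_overfitting(train_history, val_history))
```

In Keras the per-epoch values are available from the `history` attribute of the object returned by `model.fit`, so the same check could be run on real training curves.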

The next models I tried are recurrent: GRU and LSTM.

In [None]:
# defining recurrent model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.GRU(512, activation=tf.nn.relu, return_sequences=True))
model.add(tf.keras.layers.Dense(2, activation=tf.nn.softmax))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

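# Keras recurrent layers expect input shaped (batch, timesteps, features);
# these reshapes feed each split as one sequence whose time steps are the headlines
model.fit(X_train.reshape(1, X_train.shape[0], X_train.shape[1]),
          Y_train.reshape(1, Y_train.shape[0], Y_train.shape[1]),
          epochs=50, verbose=1,
          validation_data=(X_test.reshape(1, X_test.shape[0], X_test.shape[1]),
                           Y_test.reshape(1, Y_test.shape[0], Y_test.shape[1])))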

Unfortunately, the GRU model, like the LSTM, needed more time and machine resources than I had available. I managed to train the GRU model for 50 epochs, and it briefly reached 79% accuracy on the validation set. With a data set this large, such networks need much more training time.