## PIG - Task 1. Document Classification

We worked with several text-classification architectures built with Keras and tried each of them with several kinds of word vector representations (embeddings).

#### Embeddings:

- word2vec (w2v)
- GloVe
- Dependency-based embeddings
- FastText embeddings
- Combination of GloVe with Urban Dictionary embeddings

The most widely used embeddings are w2v and GloVe; dependency-based embeddings, however, have been shown to achieve high accuracy. FastText embeddings are compressed vectors that enable a faster training step. Our contribution here was to combine GloVe with word embeddings trained on [Urban Dictionary](https://www.urbandictionary.com/): we took 200 dimensions of GloVe and concatenated them with 100 dimensions of the Urban Dictionary embeddings, as shown in the figure below:
![urban_embeddings](../urbandictemb.png)
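
The concatenation can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the dictionaries `glove` and `urban`, the helper `concat_embeddings`, and the zero-vector fallback for words missing from one source are all assumptions.

```python
import numpy as np

GLOVE_DIM = 200   # dimensions taken from GloVe
URBAN_DIM = 100   # dimensions taken from the Urban Dictionary embeddings


def concat_embeddings(word, glove, urban):
    """Return a 300-d vector: 200 GloVe dims followed by 100 Urban Dictionary dims.

    Assumption: a word missing from one source gets zeros for that part.
    """
    g = glove.get(word, np.zeros(GLOVE_DIM, dtype=np.float32))
    u = urban.get(word, np.zeros(URBAN_DIM, dtype=np.float32))
    return np.concatenate([g, u])


# Toy lookup tables standing in for the real pre-trained vectors (hypothetical values).
glove = {"fake": np.ones(GLOVE_DIM, dtype=np.float32)}
urban = {"fake": np.full(URBAN_DIM, 2.0, dtype=np.float32)}

vec = concat_embeddings("fake", glove, urban)
print(vec.shape)  # (300,)
```

The resulting vectors have 300 dimensions, the same size as the w2v vectors in `GoogleNews-vectors-negative300.bin`.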

     
#### Architectures:

All three architectures start with an embedding layer that loads the pre-trained embeddings listed above. All of them also use a dropout rate of 0.2, binary cross-entropy as the loss function, and Adam as the optimizer.

- CNN: one 1-dimensional convolutional layer followed by one max-pooling layer
- CNN+LSTM: the CNN layers above followed by an LSTM layer
- LSTM: a single LSTM layer
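
As a rough picture of the CNN variant, here is a minimal Keras sketch. It is not the code in `cnn_classifier`: the function name `build_cnn`, the number of filters, the kernel size, the use of global max-pooling, the frozen embeddings, and the position of the dropout layer are all assumptions. Only the dropout rate of 0.2, the loss, and the optimizer come from the description above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn(W, input_length, filters=64, kernel_size=3, dropout=0.2):
    """CNN text classifier on top of a pre-trained embedding matrix W.

    W has shape (vocab_size, emb_dim). `filters` and `kernel_size` are
    illustrative assumptions, not the values used in the experiments.
    """
    vocab_size, emb_dim = W.shape
    inputs = keras.Input(shape=(input_length,), dtype="int32")
    x = layers.Embedding(
        vocab_size,
        emb_dim,
        embeddings_initializer=keras.initializers.Constant(W),
        trainable=False,  # assumption: embeddings are kept frozen
    )(inputs)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters, kernel_size, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)  # assumption: global rather than windowed pooling
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs, outputs)
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


# Toy embedding matrix: 50 words, 8 dimensions (hypothetical values).
W = np.random.rand(50, 8).astype("float32")
model = build_cnn(W, input_length=20)
out = model.predict(np.zeros((2, 20), dtype="int32"), verbose=0)
print(out.shape)  # (2, 1)
```

Under the same assumptions, the CNN+LSTM variant would swap the global pooling for `MaxPooling1D` followed by an `LSTM` layer, and the LSTM-only variant would drop the convolution and pooling altogether.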

### Our best result:

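Our best score was achieved by the CNN model trained with the dependency-based embeddings proposed by [Komninos, A., & Manandhar, S. (2016)](http://www-users.cs.york.ac.uk/~suresh/papers/dep_embeddings_naacl2016.pdf). The walkthrough below trains and applies the CNN model. Note that the configuration cell as written points `emb_file` and `emb_set` at the w2v vectors; both need to point at the dependency-based embeddings to reproduce the best score.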

In [1]:
import numpy as np
from lstm_classifier import LstmModel
from cnn_classifier import CNN as cnn
from cnn_lstm_classifier import CnnLstm
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from data_helpers import get_embeddings
from predicter import predict

Using TensorFlow backend.


In [2]:
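def load_file(data_file, adjust_labels):
    # Load texts and labels from a tab-separated file:
    # the first column is the text, the last column is the label.
    x_text = []
    original_y = []

    with open(data_file, "r") as f:
        for line in f:
            fields = line.strip().split("\t")
            x_text.append(fields[0])
            if adjust_labels:
                # Keep string labels; adapt_labels() maps them to 0/1 later
                original_y.append(fields[-1])
            else:
                # Labels are already numeric
                original_y.append(int(fields[-1]))
    return original_y, x_text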
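
To make the expected input concrete, here is a small sketch that parses two made-up lines the same way `load_file` does. The middle id column and the sample texts are assumptions for illustration; only "text first, label last" follows from the code above.

```python
import io

# Two made-up lines in the assumed layout: text <TAB> id <TAB> label.
sample = io.StringIO(
    "First article text\t101\tpropaganda\n"
    "Second article text\t102\tnon-propaganda\n"
)

texts, labels = [], []
for line in sample:
    fields = line.rstrip("\n").split("\t")
    texts.append(fields[0])    # first column: document text
    labels.append(fields[-1])  # last column: gold label

print(labels)  # ['propaganda', 'non-propaganda']
```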

In [3]:
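def adapt_labels(array):
    # Map string labels to integers: "propaganda" -> 1, any other label -> 0.
    # Assigning 0/1 into a string array would store "0"/"1" as strings,
    # so build an integer array instead. Numeric labels pass through unchanged.
    y = np.array(array)
    if y.dtype.kind == "U":
        y = (y == "propaganda").astype(int)

    return y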

In [4]:
def load_data(x_file, emb_file, emb_type, train):
    # train=True: labels in x_file are strings and are mapped to 0/1
    train_y, train_x = load_file(x_file, train)
    train_y = adapt_labels(train_y)

    # Encode texts as word-index sequences and load the pre-trained embeddings
    avg_len, train_encoded, t = vectorizer(train_x)
    W = get_embeddings(emb_file, emb_type, t)

    # Hold out 25% of the labelled data for evaluation
    x_train, x_test, y_train, y_test = train_test_split(
        train_encoded, train_y, test_size=0.25, random_state=1234, shuffle=True)

    # Pad or truncate every sequence (at the end) to the average length
    x_train = sequence.pad_sequences(x_train, maxlen=avg_len, padding='post', truncating='post')
    x_test = sequence.pad_sequences(x_test, maxlen=avg_len, padding='post', truncating='post')

    data = x_train, x_test, y_train, y_test
    return data, avg_len, W, t

In [5]:
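def vectorizer(x_text):
    # Fit a word-level tokenizer and encode each text as a list of word indices
    t = Tokenizer()
    t.fit_on_texts(x_text)
    encoded = t.texts_to_sequences(x_text)
    # Average sequence length in tokens, used later as the padding length
    lengths = [len(x) for x in encoded]
    avg_len = int(np.mean(lengths))
    return avg_len, encoded, t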
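
The effect of padding to the average length with `padding='post'` and `truncating='post'` can be seen on a toy example. The sketch below re-implements that behaviour in plain NumPy for illustration only; `pad_post` is a hypothetical helper, and the notebook itself uses Keras' `pad_sequences`.

```python
import numpy as np


def pad_post(seqs, maxlen, value=0):
    """Pad or truncate each sequence at the end so all have length maxlen."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for i, seq in enumerate(seqs):
        trimmed = seq[:maxlen]            # truncating='post' keeps the first maxlen tokens
        out[i, :len(trimmed)] = trimmed   # padding='post' fills the tail with `value`
    return out


encoded = [[4, 9], [7, 1, 3, 8, 2, 6], [5, 5, 5, 5]]  # toy word-index sequences
avg_len = int(np.mean([len(s) for s in encoded]))      # (2 + 6 + 4) / 3 = 4
padded = pad_post(encoded, avg_len)
print(padded)
```

Because the target length is the mean, every document longer than average loses its tail, while shorter ones are filled with zeros at the end.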

In [6]:
batch_size = 32
epochs = 5
verbose = False
directory = "../results/models/"

In [7]:
train_data = "../data/task1.train.txt"
emb_file = "../data/GoogleNews-vectors-negative300.bin"
emb_set = "w2v"

In [None]:
!ls ../data/

In [8]:
data, input_length, W, tokenizer = load_data(train_data, emb_file, emb_set, True)

num words found:41247
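
`get_embeddings` lives in `data_helpers` and is not shown here. A common way to build an embedding matrix from a tokenizer's word index looks roughly like the sketch below. The function name `build_embedding_matrix`, the zero rows for words without a vector, and the returned count are assumptions, and the real helper may differ. The count printed above presumably reports how many vocabulary words had a pre-trained vector.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, dim):
    """Build a (len(word_index) + 1, dim) matrix; row 0 is reserved for padding.

    word_index maps word -> integer id starting at 1 (as the Keras Tokenizer does);
    vectors maps word -> pre-trained vector. Words without a vector stay zero.
    """
    W = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    found = 0
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            W[idx] = vec
            found += 1
    return W, found


word_index = {"the": 1, "fake": 2, "zzz": 3}             # toy tokenizer index
vectors = {"the": np.ones(4), "fake": np.full(4, 0.5)}   # toy pre-trained vectors
W, found = build_embedding_matrix(word_index, vectors, dim=4)
print(W.shape, found)  # (4, 4) 2
```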


In [9]:
clf = cnn()
clf.W = W
clf.pretrained = True
clf.directory = directory
clf.run(input_length, batch_size, epochs, verbose, data)
accuracy, loss = clf.acc, clf.score

In [None]:
def load_test(infile):
    # Test file layout: text in the first column, example id in the second
    text = []
    ids = []
    with open(infile, "r") as f:
        for line in f:
            fields = line.strip().split("\t")
            text.append(fields[0])
            ids.append(fields[1])
    return text, ids

In [None]:
this_test = "../data/task1.test.txt"
test_set, this_ids = load_test(this_test)

In [None]:
encoded_test = tokenizer.texts_to_sequences(test_set)
padded_test = sequence.pad_sequences(encoded_test, maxlen=input_length, padding='post', truncating='post')
this_modelf = directory + emb_set + "_cnn.h5"

In [None]:
this_predictions = predict(padded_test, this_modelf)

In [None]:
def save_propaganda_format(predictions, ids, name):
    # Write one "<id>\t<label>" line per example.
    # Assumes predictions holds 0/1 class labels (truthy means propaganda).
    with open(name, "w") as foutput:
        for example_id, predicted in zip(ids, predictions):
            prediction = 'propaganda' if predicted else 'non-propaganda'
            foutput.write("%s\t%s\n" % (example_id, prediction))
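
`save_propaganda_format` tests each prediction for truthiness, which is only correct if `predict` returns 0/1 class labels. `predicter.predict` is not shown here, so if it returns raw sigmoid probabilities instead (an assumption), every non-zero score would be written as `propaganda`. In that case the scores could be thresholded first, as in this sketch. The helper `to_labels` and the 0.5 cut-off are hypothetical.

```python
import numpy as np


def to_labels(predictions, threshold=0.5):
    """Convert sigmoid outputs of shape (n,) or (n, 1) to label strings."""
    probs = np.asarray(predictions, dtype=float).reshape(-1)
    return ["propaganda" if p >= threshold else "non-propaganda" for p in probs]


print(to_labels([[0.91], [0.08], [0.5]]))
# ['propaganda', 'non-propaganda', 'propaganda']
```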

In [None]:
out_file = "../results/" + emb_set + "_cnn"

In [None]:
save_propaganda_format(this_predictions, this_ids, out_file)