## FastText, ML and NN classification
*Student name: Dmitry Timerbaev*

For this task I will build fasttext classifier first, then two classic ML classfifiers (logistic regression and linear support vector machine) and finally recurrent neural network. My target metric would be ROC-AUC score, because as we will see further the dataset is not imbalanced, and we actually care both about positive and negative reviews. 

In [0]:
# load libraries
import numpy as np 
import pandas as pd 
import fasttext
import bz2
import csv
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

## FastText Classifier (soft baseline)



In [5]:
# load training data and print dataset length
data = bz2.BZ2File('train.ft.txt.bz2')
data = data.readlines()
data = [x.decode('utf-8') for x in data]
print(len(data))

3600000


In [0]:
# prepare data for further processing - create dataframe and convert to txt
data = pd.DataFrame(data)
data.to_csv("train.txt", index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")

In [7]:
# train fasttext model on dataset. print model labels 
model = fasttext.train_supervised('train.txt',label_prefix='__label__', thread=4, epoch = 10)
print(model.labels, 'are the labels or targets the model is predicting')

['__label__1', '__label__2'] are the labels or targets the model is predicting


In [8]:
# load test data and print dataset length
test = bz2.BZ2File('test.ft.txt.bz2')
test = test.readlines()
test = [x.decode('utf-8') for x in test]
print(len(test), 'number of records in the test set') 

# in order to run the predict function, we need to remove the __label__1 and __label__2 from the testset.  
new = [w.replace('__label__2 ', '') for w in test]
new = [w.replace('__label__1 ', '') for w in new]
new = [w.replace('\n', '') for w in new]

# use predict function 
pred = model.predict(new)

# print results of the first record outputs
print(pred[0][0], 'is the predicted label')
print(pred[0][1], 'is the probability score')

400000 number of records in the test set
['__label__2'] is the predicted label
['__label__2'] is the probability score


In [9]:
# decode labels into 1's and 0's from both the test set and the actual predictions  
labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in test]
pred_labels = [0 if x == ['__label__1'] else 1 for x in pred[0]]

# print roc-auc score 
print(roc_auc_score(labels, pred_labels))

0.9173574999999999


## ML Classifiers (hard baseline)
Now, I will use two classic machine learning classifiers - logistic regression and support vector machines

In [0]:
# convert test data to txt for further processing
test = pd.DataFrame(data)
test.to_csv("train.txt", index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")

In [0]:
# data load placeholder
data_load = data
test_load = test

In [0]:
# create labels for train data
data_y = data_load
data_y = data_y.values.tolist()

ytrain = []
for i in data_y:
    for t in i:
        if t[0:10] == '__label__2':
            ytrain.append(1)
        else:
            ytrain.append(0)

# create labels for test data
test_y = test_load
test_y = test_y.values.tolist()

ytest = []
for i in test_y:
    for t in i:
        if t[0:10] == '__label__2':
            ytest.append(1)
        else:
            ytest.append(0)

In [0]:
# create train x values (remove __label__ from rows)
data_x = data_load
data_x.replace(regex=True,inplace=True,to_replace=r'__label__2',value=r'')
data_x.replace(regex=True,inplace=True,to_replace=r'__label__1',value=r'')
data_x = data_x.values.tolist()

data_x_new = []
for i in data_x:
    string = ' '.join(i)
    data_x_new.append(string)

# create test x values (remove __label__ from rows)
data_xt = test_load
data_xt.replace(regex=True,inplace=True,to_replace=r'__label__2',value=r'')
data_xt.replace(regex=True,inplace=True,to_replace=r'__label__1',value=r'')
data_xt = data_xt.values.tolist()

data_xt_new = []
for i in data_xt:
    string = ' '.join(i)
    data_xt_new.append(string)

In [0]:
# concatenate x and y values into single dataframe
d = {'Data': data_x_new, 'Label': ytrain}
data_table = pd.DataFrame(data=d)

t = {'Data': data_xt_new, 'Label': ytest}
test_table = pd.DataFrame(data=t)

In [16]:
# check train data head
data_table.head()

Unnamed: 0,Data,Label
0,Stuning even for the non-gamer: This sound tr...,1
1,The best soundtrack ever to anything.: I'm re...,1
2,Amazing!: This soundtrack is my favorite musi...,1
3,Excellent Soundtrack: I truly like this sound...,1
4,"Remember, Pull Your Jaw Off The Floor After H...",1


In [0]:
# I had to limit the number of rows to 1080k for train and 120k for test because I constantly run out of RAM and notebook crashes
X_train = data_table['Data'].iloc[0:1080000]
y_train = data_table['Label'].iloc[0:1080000]

X_test = test_table['Data'].iloc[0:120000]
y_test = test_table['Label'].iloc[0:120000]

In [23]:
# check for imbalance in both train and test datasets
print('y_train values count: \n' + str(y_train.value_counts()))
print('y_test values count: \n' + str(y_test.value_counts()))
# significant class imbalance is not observed

y_train values count: 
1    545488
0    534512
Name: Label, dtype: int64
y_test values count: 
1    60771
0    59229
Name: Label, dtype: int64


In [0]:
# set up the TFIDF vectorizer
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
training_features = vectorizer.fit_transform(X_train)    
test_features = vectorizer.transform(X_test)

In [0]:
# build classification pipelines for log-regression and linearSVC

pipeline1 = Pipeline([('vect', vectorizer),
                     ('clf', LogisticRegression(multi_class='ovr', solver='sag', random_state=42))])

pipeline2 = Pipeline([('vect', vectorizer),
                     ('clf', LinearSVC(multi_class='ovr', random_state=42))])

In [0]:
# setting up logistic regression model
model_LR = pipeline1.fit(X_train,y_train)

In [0]:
# setting up SVM model
model_SVC = pipeline2.fit(X_train,y_train)

In [0]:
# classification matrix and f1-score for logistic regression
print(classification_report(y_test,model_LR.predict(X_test)))
print(f1_score(y_test,model_LR.predict(X_test),average='weighted'))

              precision    recall  f1-score   support

           0       0.94      0.93      0.94     59229
           1       0.94      0.94      0.94     60771

    accuracy                           0.94    120000
   macro avg       0.94      0.94      0.94    120000
weighted avg       0.94      0.94      0.94    120000

0.9391702163803233


In [0]:
# classification matrix and f1-score for SVM
print(classification_report(y_test,model_SVC.predict(X_test)))
print(f1_score(y_test,model_SVC.predict(X_test),average='weighted'))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     59229
           1       1.00      1.00      1.00     60771

    accuracy                           1.00    120000
   macro avg       1.00      1.00      1.00    120000
weighted avg       1.00      1.00      1.00    120000

0.9978999845202204


In [0]:
# ROC-AUC scores for both classifiers
print('Log regression model ROC-AUC score: ' + str(roc_auc_score(y_test, model_LR.predict(X_test)).round(4)))
print('SVM model ROC-AUC score: ' + str(roc_auc_score(y_test, model_SVC.predict(X_test)).round(4)))

Log regression model ROC-AUC score: 0.9391
SVM model ROC-AUC score: 0.9979


## RNN Classifier

In [0]:
# set up the padding for data
max_words = 1000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X_train)
sequences = tok.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)

In [0]:
# define RNN architecture input layer + embedding + LSTM + 2 dense layers. implement dropouts to reduce possible overfitting
def RNN():
    inputs = Input(name='inputs',shape=[max_len])
    layer = Embedding(max_words,50,input_length=max_len)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256,name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(1,name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

In [0]:
# compile the RNN - I've used binary cross entropy as loss function and RMSProp as optimizer
model = RNN()
model.summary()
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 150)]             0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 150, 50)           50000     
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                29440     
_________________________________________________________________
FC1 (Dense)                  (None, 256)               16640     
_________________________________________________________________
activation_2 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
out_layer (Dense)            (None, 1)                 257 

In [0]:
# fit the data to RNN with batch size of 128 and 5 epochs
model.fit(sequences_matrix,y_train,batch_size=128,epochs=5,
          validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fc734c19a58>

In [0]:
# I can't figure out how to calculate roc-auc score for this model, so I will calculate accuracy instead 
# should not be crucial since we don't have class imbalance in test sample
test_sequences = tok.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)
accr = model.evaluate(test_sequences_matrix,y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 0.181
  Accuracy: 0.931


## Evaluation of models

In [0]:
# creating table with classifier names and ROC-AUC scores (except for RNN)
models = ['FastText','LogRegression','SVM','RNN']
scores = [roc_auc_score(labels, pred_labels).round(4),roc_auc_score(y_test, model_LR.predict(X_test)).round(4),roc_auc_score(y_test, model_SVC.predict(X_test)).round(4),accr[1]]
dt = {'Model': models, 'Score': scores}
comp_table = pd.DataFrame(data=dt)
comp_table

Unnamed: 0,Model,Score
0,FastText,0.9174
1,LogRegression,0.9391
2,SVM,0.9979
3,RNN,0.9309


In this case, SVM model provided the best result. Perhaps, RNN could have performed better if I set up more complex architecture and increased the number of epochs (which I couldn't do due to low computational power of my laptop).