# Data preprocessing and model training

The notebook aims to quickly compare several models trained to solve the sentiment classification problem. The following models were tested:

1. Linear model over bag-of-words
2. Random Forest over bag-of-words
3. FastText
4. Transformer (pre-trained and fine-tuned)

## Loading data

Here we load the data prepared in the previous notebook (see [1-Dataset](1-Dataset.ipynb)).

In [33]:
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from tqdm.auto import tqdm

np.random.seed(123)

train = pd.read_pickle(Path('../data/interim/train.pkl'))
valid = pd.read_pickle(Path('../data/interim/valid.pkl'))
test = pd.read_pickle(Path('../data/interim/test.pkl'))

## Exploratory Data Analysis

In [34]:
from IPython.display import display

with pd.option_context("display.max_colwidth", None):
    display(train['review'].sample(3))

4801    The film was written 10 years back and a different director was planning it with SRK and Aamir in lead roles<br /><br />The film finally was made now with Vipul Shah directing it And Ajay and Salman starring together after a decade HUM DIL DE CHUKE SANAM(1999)<br /><br />The movie however falls short due to it's 90's handling and worst it's loopholes<br /><br />The film tries to pack in too many commercial ingredients and we also hav the love triangle<br /><br />Everything is predictable and filmy and too clichéd<br /><br />There are loopholes like how Ajay runs away from London Airport and makes a place for himself with no one? even the way he starts his band is not convincing The second half gets better

It appears that the reviews still contain HTML tags (for example `<br />`). Let's remove them.

In [35]:
from bs4 import BeautifulSoup

tqdm.pandas()
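# Name the parser explicitly so the result does not depend on which parsers are installed.
train['review'] = train['review'].progress_map(lambda s: BeautifulSoup(s, 'html.parser').get_text())
valid['review'] = valid['review'].progress_map(lambda s: BeautifulSoup(s, 'html.parser').get_text())
test['review'] = test['review'].progress_map(lambda s: BeautifulSoup(s, 'html.parser').get_text())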

  0%|          | 0/20000 [00:00<?, ?it/s]

  0%|          | 0/5000 [00:00<?, ?it/s]

  0%|          | 0/25000 [00:00<?, ?it/s]
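
To see what tag stripping does to a single string, here is a minimal sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. The names `_TextExtractor` and `strip_tags` are hypothetical helpers, not part of the notebook. The sketch also shows one thing worth keeping in mind: when text pieces are joined without a separator, the words on both sides of a `<br />` get glued together.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html, separator=" "):
    """Return the text content of `html`, with text pieces joined by `separator`."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    pieces = [chunk.strip() for chunk in parser.chunks]
    return separator.join(piece for piece in pieces if piece)


# Fragment taken from the sampled review shown above.
sample = "in lead roles<br /><br />The film finally was made"

cleaned = strip_tags(sample)                # pieces joined with a space
glued = strip_tags(sample, separator="")    # pieces joined with nothing
print(cleaned)
print(glued)
```

BeautifulSoup's `get_text()` joins text pieces with an empty string by default, so passing `separator=' '` there may be worth considering if glued words turn out to matter for the models.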

### Dataset size

Train dataset:

In [36]:
len(train['review'])

20000

Valid dataset:

In [37]:
len(valid['review'])

5000

Test dataset:

In [38]:
len(test['review'])

25000

### Dataset balance

Let's look at the distribution of sentiments in our datasets.

Train dataset:

In [39]:
train['sentiment'].value_counts(normalize=True)

positive    0.50035
negative    0.49965
Name: sentiment, dtype: float64

Valid dataset:

In [40]:
valid['sentiment'].value_counts(normalize=True)

negative    0.5014
positive    0.4986
Name: sentiment, dtype: float64

Test dataset:

In [41]:
test['sentiment'].value_counts(normalize=True)

negative    0.5
positive    0.5
Name: sentiment, dtype: float64

###### Conclusions

As we see, the train, valid and test datasets are all balanced, so a trivial classifier that always predicts the same class reaches about 50% accuracy.
Plain accuracy is therefore an appropriate success measure.
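
The 50% figure can be reproduced with a small sketch of a majority-class baseline. The label counts below are reconstructed from the proportions printed above (0.50035 of 20000 train reviews, half of 25000 test reviews). The function name `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent label seen in training."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in eval_labels)
    return hits / len(eval_labels)


# Counts reconstructed from the value_counts(normalize=True) outputs above.
train_labels = ["positive"] * 10007 + ["negative"] * 9993
test_labels = ["positive"] * 12500 + ["negative"] * 12500

baseline = majority_baseline_accuracy(train_labels, test_labels)
print(baseline)
```

Any model that does not clearly beat this value has learned nothing useful about sentiment.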

### Reviews' length

In [42]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=3,
    cols=1,
    shared_xaxes=True,
    subplot_titles=['Train dataset', 'Valid dataset', 'Test dataset']
)

trace0 = go.Histogram(
    x=train['review'].map(lambda s: len(s.split(' '))),
)
trace1 = go.Histogram(
    x=valid['review'].map(lambda s: len(s.split(' ')))
)
trace2 = go.Histogram(
    x=test['review'].map(lambda s: len(s.split(' ')))
)

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 2, 1)
fig.append_trace(trace2, 3, 1)

fig.update_layout(
    height=1000,
    showlegend=False,
    title_text="Histogram of review lengths"
)

fig.show()

###### Conclusions

The distribution of review lengths looks the same across the train, valid and test datasets. It also resembles a log-normal distribution with a long right tail.
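
A note on how length is measured: the histograms count the pieces returned by `s.split(' ')`, which is not quite the same as counting whitespace-separated words. The sketch below illustrates the difference on a made-up string; both helper names are hypothetical.

```python
def length_by_single_space(text):
    """Length as used in the histograms: split on single space characters."""
    return len(text.split(' '))


def length_by_whitespace(text):
    """Alternative: split on any run of whitespace and drop empty pieces."""
    return len(text.split())


# Made-up example with a double space, a newline and a trailing space.
sample = "A  slow start,\nbut the second half gets better "

print(length_by_single_space(sample))
print(length_by_whitespace(sample))
```

For typical reviews the two counts should be close, so the shape of the histograms is unlikely to change, but the exact numbers can differ slightly.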

### Basic statistics

In [43]:
test['review'].map(lambda s: len(s.split(' '))).describe()

count    25000.000000
mean       224.512040
std        165.813202
min          4.000000
25%        124.000000
50%        169.000000
75%        272.000000
max       2192.000000
Name: review, dtype: float64

###### Conclusions

The median review length in the test dataset is 169 words, but there are outliers: the longest review has 2192 words.
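
As a rough sanity check of the log-normal reading, the sketch below fits a log-normal distribution to the mean and median reported by `describe()` and compares the quartiles it implies with the observed ones. This is only a back-of-the-envelope check under the assumption that lengths are exactly log-normal, not a statistical test.

```python
import math
from statistics import NormalDist

mean_length = 224.512040   # mean reported by describe() above
median_length = 169.0      # 50% quantile reported by describe() above

# For a log-normal variable: median = exp(mu) and mean = exp(mu + sigma**2 / 2).
mu = math.log(median_length)
sigma = math.sqrt(2 * math.log(mean_length / median_length))

# Quartiles of a log-normal are exp(mu -/+ z * sigma) with z the normal 75% quantile.
z75 = NormalDist().inv_cdf(0.75)
implied_q25 = math.exp(mu - z75 * sigma)
implied_q75 = math.exp(mu + z75 * sigma)

print(round(implied_q25), round(implied_q75))
```

The implied quartiles come out at roughly 102 and 281, against observed values of 124 and 272. The upper quartile agrees reasonably well and the lower one less so, so "approximately log-normal" is probably the safest description.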

## Bag-of-Words models

Bag-of-words models ignore the order of words: each review is represented only by which words it contains and how often.
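
A tiny illustration of what is lost: the two made-up sentences below have opposite meanings but exactly the same bag of words. `bag_of_words` is a hypothetical helper that uses plain whitespace tokenization, which is simpler than what `TfidfVectorizer` does.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to its word counts, discarding word order."""
    return Counter(text.lower().split())


first = bag_of_words("good acting , not bad at all")
second = bag_of_words("bad acting , not good at all")

same_bag = first == second
print(same_bag)
```

A model built on such features cannot tell these two sentences apart, which is one reason to also try sequence-aware models later in the notebook.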

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_df=0.7,
    min_df=100,
    stop_words='english'
)

X_train = vectorizer.fit_transform(train['review'])
X_valid = vectorizer.transform(valid['review'])
X_test = vectorizer.transform(test['review'])

Y_train = train['sentiment'] == 'positive'
Y_valid = valid['sentiment'] == 'positive'
Y_test = test['sentiment'] == 'positive'
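
The `min_df` and `max_df` arguments prune the vocabulary by document frequency. In scikit-learn an integer is read as an absolute number of documents and a float as a fraction of documents. The sketch below imitates that rule on a toy corpus in plain Python; `filter_vocabulary` is a hypothetical helper, and stop-word removal and the TF-IDF weighting itself are left out.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df, max_df):
    """Keep terms that occur in at least `min_df` documents (absolute count)
    and in at most a fraction `max_df` of all documents."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it occurs there.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term
        for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )


toy_docs = [
    ["great", "movie"],
    ["boring", "movie"],
    ["great", "acting", "movie"],
    ["boring", "plot", "movie"],
]

vocabulary = filter_vocabulary(toy_docs, min_df=2, max_df=0.7)
print(vocabulary)
```

Here "movie" is dropped because it appears in every document, and "acting" and "plot" are dropped because they appear only once. With `min_df=100` on 20000 training reviews, the notebook keeps only words that occur in at least 0.5% of the reviews.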

### Linear model

First, let's fit a logistic regression as a baseline.

In [45]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, Y_train);

Let's see the accuracy of the model on the train dataset:

In [46]:
model.score(X_train, Y_train)

0.90565

And on the test dataset:

In [47]:
model.score(X_test, Y_test)

0.8726

###### Conclusions

The linear model reaches 87% accuracy on the test dataset (vs 91% on the train dataset).

### Tree-based model

In [48]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_jobs=-1,
    max_depth=30
)
model.fit(X_train, Y_train);

In [49]:
model.score(X_train, Y_train)

0.9612

In [51]:
model.score(X_test, Y_test)

0.8324

### FastText

[FastText](https://fasttext.cc/docs/en/crawl-vectors.html) is a library that uses embeddings of word n-grams.
Text classification is performed as follows ([1]):

1. calculating embeddings for each n-gram,
1. calculating embedding of a document as a mean of the embeddings of n-grams that appear in the document,
1. using the calculated document's embedding as features of linear classifier.

So FastText is, basically, a linear model over the embeddings of the n-grams. It is also worth noting that
FastText is really fast.
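
The three steps above can be sketched in plain Python. Everything in this sketch is made up for illustration: the 2-dimensional embeddings, the classifier weights and the helper names. It also leaves out parts of the real library, such as the hashing of n-grams described in [1], and learns nothing; it only shows how a prediction is computed once embeddings and weights exist.

```python
import math


def word_ngrams(tokens, max_n=2):
    """Return all word n-grams of length 1..max_n, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def document_embedding(tokens, embeddings, max_n=2):
    """Average the embeddings of the n-grams that have an entry in the table."""
    vectors = [embeddings[g] for g in word_ngrams(tokens, max_n) if g in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def softmax(scores):
    """Turn raw class scores into probabilities."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings (step 1) and linear classifier weights (step 3).
toy_embeddings = {
    "great": [1.0, 0.0],
    "boring": [-1.0, 0.0],
    "not": [0.0, -1.0],
    "not great": [-1.0, -1.0],
}
toy_weights = [[-1.0, -0.5], [1.0, 0.5]]  # rows: negative class, positive class

doc = document_embedding(["not", "great"], toy_embeddings)        # step 2
scores = [sum(w * x for w, x in zip(row, doc)) for row in toy_weights]
probabilities = softmax(scores)
print(doc, probabilities)
```

In this toy setup the bigram "not great" has its own embedding, which is how n-grams let a model that is otherwise order-blind pick up short phrases.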

In [55]:
import fasttext
from tempfile import NamedTemporaryFile
import csv

# Store the fastText-style labels in a separate column, so that the original
# `sentiment` column stays intact for later cells that compare it to 'positive'.
train['ft_label'] = '__label__' + train['sentiment'].astype('str')
test['ft_label'] = '__label__' + test['sentiment'].astype('str')

with NamedTemporaryFile() as f:
    train[['ft_label', 'review']].to_csv(
        f.name,
        index=False,
        sep=' ',
        header=None,
        quoting=csv.QUOTE_NONE,
        quotechar="",
        escapechar=" "
    )
    model = fasttext.train_supervised(f.name, epoch=10)
    print(model.test(f.name)[1])

0.9349


Accuracy on the train dataset is 93%.

In [56]:
with NamedTemporaryFile() as f:
    test[['ft_label', 'review']].to_csv(
        f.name,
        index=False,
        sep=' ',
        header=None,
        quoting=csv.QUOTE_NONE,
        quotechar="",
        escapechar=" "
    )
    print(model.test(f.name)[1])

0.87116


Accuracy on the test dataset is 87%.

### Transformers
The popular `transformers` library is used in the following section.


#### Pre-trained transformer model

The model uses the DistilBERT architecture and was fine-tuned on the SST-2 dataset.

In [19]:
from transformers import pipeline
from more_itertools import sliced, flatten
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_name = "bhadresh-savani/distilbert-base-uncased-sentiment-sst2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

batch_size = 128
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at bhadresh-savani/distilbert-base-uncased-sentiment-sst2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [29]:
# Keep only the first 512 characters (characters, not tokens) of each review, then batch.
batched_reviews = sliced(test['review'].str.slice(stop=512).tolist(), batch_size)
predictions = list(tqdm(flatten(map(classifier, batched_reviews))))


Let's see the accuracy of the model on the test dataset:

In [30]:
accuracy_score(
    [prediction['label'] == 'POSITIVE' for prediction in predictions],
    test['sentiment'] == 'positive'
)

0.82264
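
The evaluation above boils down to three steps: truncate each review, run the classifier batch by batch, and compare the predicted labels with the gold ones. Here is a minimal sketch of that flow without any model. `stub_classifier` is a hypothetical stand-in that only mimics the list-of-dicts output shape of the pipeline, and `chunked` plays the role of `sliced`.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def stub_classifier(texts):
    """Hypothetical stand-in for the pipeline: one {'label', 'score'} dict per text."""
    return [
        {"label": "POSITIVE" if "good" in text else "NEGATIVE", "score": 1.0}
        for text in texts
    ]


# Made-up reviews and gold labels (True means positive).
reviews = ["good film", "bad film", "really good", "awful", "good " * 300]
gold = [True, False, True, False, True]

truncated = [text[:512] for text in reviews]  # first 512 characters, not tokens
predictions = [
    prediction
    for batch in chunked(truncated, 2)
    for prediction in stub_classifier(batch)
]

predicted_positive = [p["label"] == "POSITIVE" for p in predictions]
accuracy = sum(p == g for p, g in zip(predicted_positive, gold)) / len(gold)
print(accuracy)
```

Note that cutting at 512 characters discards most of a long review, so the pre-trained model only sees the opening of each text. That may partly explain why it scores below the bag-of-words models here.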

Let's also check the accuracy on the train dataset:

In [31]:
# Same procedure as for the test dataset: truncate to 512 characters, then batch.
batched_reviews = sliced(train['review'].str.slice(stop=512).tolist(), batch_size)
predictions = list(tqdm(flatten(map(classifier, batched_reviews))))


In [32]:
accuracy_score(
    [prediction['label'] == 'POSITIVE' for prediction in predictions],
    train['sentiment'] == 'positive'
)

0.81815

#### Fine-tuning transformer model

The same model as above is fine-tuned in the following cells.

In [20]:
import tensorflow as tf

train_encodings = tokenizer(train['review'].tolist(), truncation=True, padding=True)
val_encodings = tokenizer(valid['review'].tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test['review'].tolist(), truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train['sentiment'] == 'positive'
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    valid['sentiment'] == 'positive'
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test['sentiment'] == 'positive'
))

In [21]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), 
    loss=model.compute_loss
)
model.fit(
    train_dataset \
        .shuffle(1000) \
        .batch(12), 
    validation_data=val_dataset.batch(12),
    epochs=10,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
        tf.keras.callbacks.ModelCheckpoint('./modelckpt')
    ]
)

Epoch 1/10
INFO:tensorflow:Assets written to: ./modelckpt/assets

Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.

Epoch 2/10
INFO:tensorflow:Assets written to: ./modelckpt/assets

Epoch 3/10
INFO:tensorflow:Assets written to: ./modelckpt/assets

<keras.callbacks.History at 0x7f422c296040>
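
Training stopped after 3 of the 10 requested epochs because of the `EarlyStopping(patience=2, restore_best_weights=True)` callback. The rule behind it can be sketched as follows: stop once the monitored validation loss has failed to improve for `patience` consecutive epochs, then go back to the weights of the best epoch. The loss values below are made up, since the log does not show the real ones, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best validation loss).

    Epochs are numbered from 1. Training stops as soon as the loss has not
    improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses per epoch; the last one is never reached.
stopped_at, restored_from = early_stopping_epoch([0.25, 0.27, 0.30, 0.20])
print(stopped_at, restored_from)
```

A run that ends after epoch 3 is consistent with the validation loss being lowest after the first epoch, in which case the evaluated model is the one from epoch 1. This is an inference from the log, not something the output states directly.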

Test accuracy on fine-tuned model:

In [22]:
accuracy_score(
    np.argmax(model.predict(test_dataset.batch(8)).logits, axis=1),
    test['sentiment'] == 'positive'
)

0.92132
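
For readers less familiar with the `argmax` step: the model returns one row of two logits per review, and the index of the larger logit is taken as the predicted class, with index 1 compared against "positive" in the cell above. A minimal sketch with made-up logits:

```python
# Made-up logits for three reviews: columns are class 0 and class 1.
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]
gold_positive = [False, True, True]

# Index of the largest logit in each row, i.e. a plain-Python argmax over axis 1.
predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]

# In Python, 0 == False and 1 == True, so class indices compare directly with booleans.
accuracy = sum(p == g for p, g in zip(predicted, gold_positive)) / len(gold_positive)
print(predicted, accuracy)
```

No softmax is needed for this, because softmax does not change which logit is the largest.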

Train accuracy on fine-tuned model:

In [23]:
accuracy_score(
    np.argmax(model.predict(train_dataset.batch(8)).logits, axis=1),
    train['sentiment'] == 'positive'
)

0.94915

## Conclusions

The notebook is just a quick review of available solutions. The hyperparameters of the models were not tuned. The table below summarizes the results:


| model                     | train accuracy | test accuracy |
|---------------------------|----------------|---------------|
| linear                    | 91%            | 87%           |
| random forest             | 96%            | 83%           |
| fasttext                  | 93%            | 87%           |
| transformer (pre-trained) | 82%            | 82%           |
| transformer (fine-tuned)  | 95%            | 92%           |
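
The unrounded accuracies printed in the cells above can also be used to compare the gap between train and test accuracy, which hints at how much each model overfits. A small sketch (the dictionary simply restates those outputs):

```python
# (train accuracy, test accuracy) as printed in the cells above.
results = {
    "linear": (0.90565, 0.8726),
    "random forest": (0.9612, 0.8324),
    "fasttext": (0.9349, 0.87116),
    "transformer (pre-trained)": (0.81815, 0.82264),
    "transformer (fine-tuned)": (0.94915, 0.92132),
}

gaps = {name: train_acc - test_acc for name, (train_acc, test_acc) in results.items()}
best_model = max(results, key=lambda name: results[name][1])
largest_gap_model = max(gaps, key=gaps.get)

# Print the models from best to worst test accuracy.
for name, (train_acc, test_acc) in sorted(
    results.items(), key=lambda item: item[1][1], reverse=True
):
    print(f"{name:<26} train={train_acc:.3f} test={test_acc:.3f} gap={gaps[name]:+.3f}")
```

The random forest shows by far the largest gap (about 13 percentage points), while the pre-trained transformer, which never saw the training reviews, has essentially none.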

The fine-tuned transformer outperformed the other models (92% test accuracy), but the linear baseline over bag-of-words did surprisingly well in comparison (87%).

The recommendation is to use the fine-tuned transformer when maximising accuracy is the goal. For a PoC or an MVP, the linear model is reasonably accurate.

[1] Bag of Tricks for Efficient Text Classification, A. Joulin et al, https://www.aclweb.org/anthology/E17-2068.pdf

