# From Classic Machine Learning to Neural Networks and BERT: A brief tour with IMDB data

This notebook showcases the different approaches covered in the GESIS course "An introduction to Supervised Machine Learning with Python" by Anne Kroon and Damian Trilling.


It is partly based on a tutorial by Orhan G. Yalçın published at https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671



In [1]:
# general-purpose libraries
import os
import bz2
import urllib.request  # importing only `urllib` does not reliably expose urllib.request
import tarfile
import re
import pickle
import numpy as np


# randomize order
from sklearn.utils import shuffle


# NB: Usually, you put all import statements here at the beginning of your script. 
# For didactic purposes, in this notebook, we will import the specific modules at the point when we introduce them instead.

## Step 1: Get data

We are going to work with the IMDB Movie dataset and predict whether movie reviews are positive or negative. The following code just makes sure that you do not have to download it all over again: if you have already downloaded it, it will just use the downloaded data.

In [2]:
filename = "reviewdata.pickle.bz2"
if os.path.exists(filename):
    print(f"Using cached file {filename}")
    with bz2.BZ2File(filename, "r") as zipfile:
        data = pickle.load(zipfile)
        text_train, text_test, y_train, y_test = data
else:
    url = "https://cssbook.net/d/aclImdb_v1.tar.gz"
    print(f"Downloading from {url}")
    fn, _headers = urllib.request.urlretrieve(url, filename=None)
    t = tarfile.open(fn, mode="r:gz")
    text_train, text_test, y_train, y_test = [], [], [], []
    for f in t.getmembers():
        m = re.match(r"aclImdb/(\w+)/(pos|neg)/", f.name)  # raw string avoids an invalid-escape warning for \w
        if not m:
            continue  # skip folder names and unlabeled data
        dataset, label = m.groups()
        text = t.extractfile(f).read().decode("utf-8")
        if dataset == "train":
            text_train.append(text)
            y_train.append(label)
        elif dataset == "test":
            text_test.append(text)
            y_test.append(label)
    print(f"Saving to {filename}")
    with bz2.BZ2File(filename, "w") as zipfile:
        data = text_train, text_test, y_train, y_test
        pickle.dump(data, zipfile)

Using cached file reviewdata.pickle.bz2
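
To see which archive members the regular expression in the download code keeps, here is a small sketch. The member names are made up for illustration; they merely follow the `aclImdb/<dataset>/<label>/` layout that the code above assumes.

```python
import re

# Same pattern as in the download cell, written as a raw string.
pattern = r"aclImdb/(\w+)/(pos|neg)/"

# Illustrative member names (assumed, not read from the real archive).
members = [
    "aclImdb/train/pos/0_9.txt",    # labeled training review
    "aclImdb/test/neg/1_3.txt",     # labeled test review
    "aclImdb/train/unsup/0_0.txt",  # unlabeled data
    "aclImdb/train/pos",            # folder name without a trailing slash
    "aclImdb/README",               # some other file
]

parsed = {}
for name in members:
    m = re.match(pattern, name)
    # m.groups() is (dataset, label); None means the member is skipped
    parsed[name] = m.groups() if m else None
    print(f"{name!r:35} -> {parsed[name]}")
```

Only the first two names yield a `(dataset, label)` pair; everything else returns `None` and is skipped by the `continue` statement.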


In [3]:
assert len(text_train) == len(y_train)
assert len(text_test) == len(y_test)

print(f"There are {len(y_train)} training and {len(y_test)} test samples.")

There are 25000 training and 25000 test samples.


### Randomize order
In this specific case, the dataset is ordered: the first 12500 samples are negative and the last 12500 ones are positive. This can have unintended effects in training some models, and also makes it harder for us to just select, say, the last X samples as a validation dataset. We therefore just shuffle the data. Of course, we need to shuffle the texts and the labels *together* ;-)

In [4]:
text_train, y_train = shuffle(text_train, y_train, random_state=1983)
text_test, y_test = shuffle(text_test, y_test, random_state=1983)
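
What shuffling texts and labels *together* means can be illustrated with the standard library alone. This is a minimal sketch with made-up toy data; it is not how `sklearn.utils.shuffle` is implemented, it only shows the idea of applying one and the same random permutation to both lists.

```python
import random

# Made-up toy data
texts = ["great film", "awful plot", "loved it", "boring"]
labels = ["pos", "neg", "pos", "neg"]

# Draw ONE random permutation of the positions ...
rng = random.Random(1983)
order = list(range(len(texts)))
rng.shuffle(order)

# ... and apply it to both lists.
texts_shuffled = [texts[i] for i in order]
labels_shuffled = [labels[i] for i in order]

# Every text still carries its original label.
original_pairs = set(zip(texts, labels))
shuffled_pairs = set(zip(texts_shuffled, labels_shuffled))
print(shuffled_pairs == original_pairs)  # True
```

Shuffling the two lists independently of each other would instead pair texts with arbitrary labels.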

In [5]:
# feel free to explore text_train, y_train, text_test, and y_test here

## Step 2: A baseline model

To get a basic idea of what performance we can achieve, let's run a really basic baseline model: a Naïve Bayes classifier with a count vectorizer.


**NB: Note that we are already using the test dataset here. If this were a serious research project, it would be advisable to set aside some test data and use it only at the very end of this notebook, to get a final estimate of the performance of the model we chose. To do so, you could split the test dataset here into a validation and a test dataset.**
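
Such a split could look roughly like the sketch below. The function name `split_validation_test` and the toy data are hypothetical. Simple slicing only makes sense because the data have already been shuffled.

```python
def split_validation_test(texts, labels, validation_fraction=0.2):
    """Split already-shuffled data into a validation part and a held-out test part."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    n_val = int(validation_fraction * len(labels))
    validation = (texts[:n_val], labels[:n_val])
    held_out = (texts[n_val:], labels[n_val:])
    return validation, held_out


# Toy stand-ins for text_test and y_test
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = ["pos" if i % 2 else "neg" for i in range(10)]

(val_texts, val_labels), (held_texts, held_labels) = split_validation_test(toy_texts, toy_labels)
print(len(val_labels), len(held_labels))  # 2 8
```

Model comparisons would then use the validation part only, and the held-out part would be touched once, at the very end.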


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

In [7]:
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(text_train, y_train)
y_pred = baseline.predict(text_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.78      0.88      0.82     12500
         pos       0.86      0.75      0.80     12500

    accuracy                           0.81     25000
   macro avg       0.82      0.81      0.81     25000
weighted avg       0.82      0.81      0.81     25000
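
To get a feeling for what the count vectorizer feeds into the Naïve Bayes classifier, here is a bag-of-words sketch in plain Python. It only approximates the defaults of `CountVectorizer` (lowercasing, tokens of at least two word characters), and the two example sentences are made up.

```python
import re
from collections import Counter

docs = ["A great, great movie!", "Not a great movie."]

# Approximates CountVectorizer's defaults: lowercase, tokens of 2+ word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
counts = [Counter(token_pattern.findall(d.lower())) for d in docs]

# One column per distinct token, one row per document.
vocabulary = sorted(set().union(*counts))
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)  # ['great', 'movie', 'not']
for row in matrix:
    print(row)     # [2, 1, 0] and then [1, 1, 1]
```

Word order is lost entirely: each review is reduced to a row of counts, and the classifier only sees how often each word occurs.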



## Step 3: Some serious classical machine learning

Let's see how far we can get with classical machine learning: we do a grid search to try different vectorizer settings and different regularization strengths (`C`) for a logistic regression. Of course, we could also test many other models, such as Random Forests, AdaBoost, or SGD, but as we will see, this already works really well:

(Note that scoring on accuracy, as we do here, is a very bad idea with unbalanced classes - but in our case, they are perfectly balanced)
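
Why accuracy is misleading with unbalanced classes is easiest to see with a made-up example: a "classifier" that always predicts the majority class looks great on accuracy while never finding a single positive review.

```python
# Hypothetical unbalanced data: 95 negative and 5 positive reviews.
y_true = ["neg"] * 95 + ["pos"] * 5
# A useless "classifier" that always predicts the majority class.
y_pred = ["neg"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the positive class: share of true positives that were found.
true_pos = sum(t == "pos" and p == "pos" for t, p in zip(y_true, y_pred))
recall_pos = true_pos / y_true.count("pos")

print(f"accuracy = {accuracy:.2f}, recall(pos) = {recall_pos:.2f}")
# accuracy = 0.95, recall(pos) = 0.00
```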


In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline, Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [9]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(solver="liblinear")),
])

grid = {
    'vect__max_df': (0.5, 0.75),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams only, or unigrams plus bigrams
    'tfidf__use_idf': (True, False),  # with or without idf weighting
    'clf__C': (.01, 1, 100),
    #'clf__penalty': ('l2', 'elasticnet'),
}


search = GridSearchCV(estimator=pipeline, param_grid=grid, cv=5,
                      scoring="accuracy", n_jobs=-1, verbose=3)
search.fit(text_train, y_train)
print(f"Best parameters: {search.best_params_}")
pred = search.predict(text_test)
print(classification_report(y_test, pred))

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best parameters: {'clf__C': 100, 'tfidf__use_idf': True, 'vect__max_df': 0.5, 'vect__ngram_range': (1, 2)}
              precision    recall  f1-score   support

         neg       0.90      0.90      0.90     12500
         pos       0.90      0.90      0.90     12500

    accuracy                           0.90     25000
   macro avg       0.90      0.90      0.90     25000
weighted avg       0.90      0.90      0.90     25000
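
The "24 candidates, totalling 120 fits" in the output follows directly from the grid: every combination of the listed values is tried, and each combination is fitted once per cross-validation fold. A small sketch that enumerates the combinations by hand (scikit-learn does this internally, so this is for illustration only):

```python
from itertools import product

# Same grid as above
grid = {
    "vect__max_df": (0.5, 0.75),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "tfidf__use_idf": (True, False),
    "clf__C": (0.01, 1, 100),
}

# Cartesian product of all value lists: one dict per candidate setting
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 5

print(len(candidates), len(candidates) * n_folds)  # 24 120
```

Adding one more parameter with three values would triple the number of fits, which is why grids grow expensive quickly.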



## Step 4: Neural networks

TODO: add this section, following the example at https://github.com/damian0604/embeddingworkshop/blob/main/06downstreamkeras.ipynb

## Step 5: Transformers

some text here


In [5]:
import tensorflow as tf

from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

2021-08-16 17:30:07.302252: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-08-16 17:30:07.302281: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [6]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

2021-08-16 17:30:09.727678: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-08-16 17:30:09.727711: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-16 17:30:09.727732: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (aa892048da11): /proc/driver/nvidia/version does not exist
2021-08-16 17:30:09.727984: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers 

In [7]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
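
The parameter counts in the summary can be checked by hand. The sketch below assumes that the classification head is a single dense layer on top of BERT-base's 768-dimensional output and that the model uses the default of two labels; both assumptions are consistent with the 1538 parameters reported above.

```python
hidden_size = 768   # assumed output size of bert-base-uncased
num_labels = 2      # assumed default number of classes

# Dense layer: one weight per (input, output) pair plus one bias per output
classifier_params = hidden_size * num_labels + num_labels

bert_params = 109_482_240  # taken from the summary above
total_params = bert_params + classifier_params

print(classifier_params, total_params)  # 1538 109483778
```

Almost all of the parameters sit in the pretrained BERT layers; the part that is specific to our task is tiny by comparison.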


In [8]:
def create_tf_examples(texts, labels, mapping = {"pos":1, "neg":0}):
    for text, label in zip(texts, labels):
        label_encoded = mapping.get(label, label)
        yield InputExample(guid=None, text_a=text, label=label_encoded)

In [13]:
# let's use 20% of the test set for validation

VALIDATIONSIZE = int(.2 * len(y_test))

train_examples = create_tf_examples(text_train, y_train)
validation_examples = create_tf_examples(text_test[:VALIDATIONSIZE], y_test[:VALIDATIONSIZE])
test_examples = create_tf_examples(text_test[VALIDATIONSIZE:], y_test[VALIDATIONSIZE:])

In [14]:
# function taken and slightly adapted from https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671

def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = []
    for e in examples:
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length", 
            truncation=True)

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(InputFeatures(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label))

    def gen():
        for f in features:
            yield ({"input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,},
                f.label,)

    return tf.data.Dataset.from_generator(gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        ({"input_ids": tf.TensorShape([None]), "attention_mask": tf.TensorShape([None]), "token_type_ids": tf.TensorShape([None]),}, tf.TensorShape([]),),)
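
What `padding="max_length"` and `truncation=True` do to a single text can be sketched without the tokenizer. This is a simplified stand-in, not the Hugging Face implementation: the function name `pad_or_truncate` is hypothetical, the word-piece ids are made up, and `max_length=8` is chosen only to keep the output readable. The ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` tokens of `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_or_truncate(token_ids, max_length=8):
    """Simplified version of padding="max_length" with truncation=True for one text."""
    # Reserve two positions for [CLS] and [SEP], cut off everything beyond that.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill up with [PAD]; padded positions get attention 0.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    # We only pass one segment (text_a), so all token type ids are 0.
    token_type_ids = [0] * max_length
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids}


short = pad_or_truncate([2307, 3185])            # two made-up ids: gets padded
long = pad_or_truncate(list(range(2000, 2020)))  # twenty made-up ids: gets truncated

print(short["input_ids"])       # [101, 2307, 3185, 102, 0, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
print(long["input_ids"])        # [101, 2000, 2001, 2002, 2003, 2004, 2005, 102]
```

With `max_length=128`, as in the function above, everything after roughly the first 126 word pieces of a review is simply thrown away.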

In [15]:
train_data = convert_examples_to_tf_dataset(train_examples, tokenizer)
validation_data = convert_examples_to_tf_dataset(validation_examples, tokenizer)
test_data = convert_examples_to_tf_dataset(test_examples, tokenizer)

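# shuffle within a buffer of 100 examples, form batches of 32, and pass over the training data twice
train_data = train_data.shuffle(100).batch(32).repeat(2)
# validation and test data only need to be batched
validation_data = validation_data.batch(32)
test_data = test_data.batch(32)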
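
How many batches this produces is simple arithmetic. The sketch below uses the sample sizes from earlier in this notebook; the helper `n_batches` is hypothetical. It assumes, as far as we understand Keras, that `fit` iterates over the complete dataset in every epoch when no `steps_per_epoch` is given.

```python
import math


def n_batches(n_samples, batch_size=32):
    """Number of batches, counting a final, smaller batch as well."""
    return math.ceil(n_samples / batch_size)


n_train, n_test_total = 25000, 25000
n_validation = int(0.2 * n_test_total)   # 5000
n_test = n_test_total - n_validation     # 20000

steps_per_pass = n_batches(n_train)      # one pass over the training data
steps_per_epoch = steps_per_pass * 2     # because of .repeat(2)

print(steps_per_pass, steps_per_epoch)   # 782 1564
print(n_batches(n_validation), n_batches(n_test))  # 157 625
```

Under that assumption, combining `.repeat(2)` with `epochs=2` means the model sees every training review four times in total.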

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


2021-08-16 17:36:07.964327: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


    735/Unknown - 4426s 6s/step - loss: 0.3397 - accuracy: 0.8484
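
The loss passed to `model.compile`, sparse categorical cross-entropy computed from logits, can be written out in a few lines. This is a plain-Python sketch of the formula for a single example, not TensorFlow's implementation, and the logits are made up. "Sparse" means the label is an integer class index (0 for neg, 1 for pos) rather than a one-hot vector; "from logits" means the softmax is applied inside the loss.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Negative log of the softmax probability assigned to the true class."""
    # log-sum-exp, with the maximum subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Made-up logits for the two classes (index 0 = neg, index 1 = pos)
confident_right = sparse_categorical_crossentropy_from_logits([-2.0, 3.0], label=1)
undecided = sparse_categorical_crossentropy_from_logits([0.0, 0.0], label=1)
confident_wrong = sparse_categorical_crossentropy_from_logits([-2.0, 3.0], label=0)

print(f"{confident_right:.4f} {undecided:.4f} {confident_wrong:.4f}")
# 0.0067 0.6931 5.0067
```

An undecided model pays a loss of log(2), about 0.69, per example, so a training loss of 0.34 as in the progress line above is already well below guessing.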