# From Classic Machine Learning to Neural Networks and BERT: Other datasets
This notebook is used to showcase different approaches as they are covered in the GESIS course "An introduction to Supervised Machine Learning with Python" by Anne Kroon and Damian Trilling.


It is partly based on a tutorial by Orhan G. Yalçın published at https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671



In [1]:
# general-purpose libraries
import os
import bz2
import requests
import re
import pickle
import numpy as np
import zipfile
import pandas as pd
import io

# randomize order
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split


# NB: Usually, you put all import statements here at the beginning of your script. 
# For didactic purposes, in this notebook, we will import the specific modules at the point when we introduce them instead.

### Optional Step 0: Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/annekroon/gesis-machine-learning/blob/main/day5/imdb.ipynb)

The models under Step 4 and Step 5 can take a long time to train, and they benefit a lot from access to a GPU. It may therefore be considerably faster to run them on Google Colab than on your own machine (which quite likely does not have a GPU and/or is not set up for machine learning with GPUs). Click on the button above to open the notebook in Colab.

You also need to enable GPUs for the notebook. 
**Navigate to Edit→Notebook Settings and select GPU from the Hardware Accelerator drop-down**

Next, run the following code to check whether it worked. Also note that the transformers library, which we need later, is not installed by default on Colab.

In [2]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
   raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))



SystemError: ignored

In [20]:
pip install transformers

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled 

## Step 1: Get data

We are going to work with the WELFake dataset of news articles, which is available on Zenodo, and predict an article's binary `label` (fake or real news) from its `title`. The following code downloads the CSV file and reads it into a pandas DataFrame. Note that it fetches the file again every time the cell is run.

In [3]:
# https://zenodo.org/record/4561253#.YSVFn1tcJH4

df = pd.read_csv(io.StringIO(requests.get("https://zenodo.org/record/4561253/files/WELFake_Dataset.csv").text))
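If you do not want to download the file on every run, a local cache could be added. The following is only a minimal sketch using the standard library: `cached_download` is an illustrative helper and the local filename is an assumption; neither is part of the course material.

```python
import os
import urllib.request


def cached_download(url, filename, fetch=urllib.request.urlretrieve):
    """Download url to filename unless the file already exists locally.

    `fetch` is any callable taking (url, filename); it is a parameter only
    so that the download step can be swapped out or tested offline.
    """
    if not os.path.exists(filename):
        fetch(url, filename)
    return filename


# Usage (commented out because it needs network access):
# path = cached_download(
#     "https://zenodo.org/record/4561253/files/WELFake_Dataset.csv",
#     "WELFake_Dataset.csv")  # hypothetical local filename
# df = pd.read_csv(path)
```

On the second call, the file is found on disk and no request is made.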

In [4]:
train, test = train_test_split(df.dropna(), test_size=.2)
text_train = train['title'].to_list()
text_test = test['title'].to_list()
y_train = train['label'].to_list()
y_test = test['label'].to_list()

## Step 2: A baseline model

To get a basic idea about what performance we can achieve, let's run a really basic baseline model: A Naïve Bayes classifier with a count vectorizer.


**NB: Note that we now already use the test dataset. If this were a serious research project, it would be advisable to instead set aside some test data and use it only at the very end of this notebook to get a final estimate of the performance of the model we chose. To do so, you could split the test dataset here into a validation and a test dataset.**
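Such a split could be sketched as follows. This is a plain-Python illustration on made-up toy data; `split_validation_test` is a hypothetical helper, and in the notebook you could equally apply `train_test_split` a second time to `text_test` and `y_test`.

```python
import random


def split_validation_test(texts, labels, validation_share=0.5, seed=42):
    """Shuffle paired texts/labels and split them into validation and test parts."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * validation_share)
    return pairs[:cut], pairs[cut:]


# toy stand-ins for text_test and y_test from the notebook
toy_texts = [f"title {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
validation, final_test = split_validation_test(toy_texts, toy_labels)
print(len(validation), len(final_test))
```

The validation part would then be used for all model comparisons, and the final test part only once at the very end.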


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

In [6]:
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(text_train, y_train)
y_pred = baseline.predict(text_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.87      0.88      7010
           1       0.88      0.89      0.88      7298

    accuracy                           0.88     14308
   macro avg       0.88      0.88      0.88     14308
weighted avg       0.88      0.88      0.88     14308
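To get an intuition for what the count vectorizer in the baseline passes on to the classifier, here is a simplified bag-of-words sketch in plain Python. It is not scikit-learn's implementation: `CountVectorizer` tokenizes with a regular expression and returns a sparse matrix, whereas this sketch splits on whitespace and returns lists. The example titles are made up.

```python
from collections import Counter


def bag_of_words(documents):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# made-up example titles, not taken from the dataset
vocabulary, vectors = bag_of_words(["the election results", "the the results are in"])
print(vocabulary)  # ['are', 'election', 'in', 'results', 'the']
print(vectors)     # [[0, 1, 0, 1, 1], [1, 0, 1, 1, 2]]
```

Each column corresponds to one word of the vocabulary, and word order within a title is lost.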



## Step 3: Some serious classical machine learning

Let's see how far we can get with classical machine learning: we run a grid search that tries different vectorizer settings and different regularization strengths (`C`) for a logistic regression. Of course, we could also test many other models, such as Random Forests, AdaBoost, or SGD, but as we will see, this already works really well:

(Note that scoring on accuracy, as we do here, is a very bad idea with unbalanced classes, but in our case the two classes are roughly balanced.)
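Why accuracy misleads on unbalanced classes can be shown with a small plain-Python sketch. The labels below are hypothetical and deliberately far more unbalanced than our data.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of truly positive cases that were predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)


# hypothetical, heavily unbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "classifier" that always predicts the majority class

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

The accuracy looks excellent although not a single positive case is found.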


In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline, Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [8]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(solver="liblinear")),
])

grid = {
    'vect__max_df': (0.5, 0.75),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams only, or unigrams plus bigrams
    'tfidf__use_idf': (True, False),  # tfidf or not
    'clf__C': (.01, 1, 100),
    #'clf__penalty': ('l2', 'elasticnet'),
}


search = GridSearchCV(estimator=pipeline, param_grid=grid, cv=5,
                      scoring="accuracy", n_jobs=-1, verbose=3)
search.fit(text_train, y_train)
print(f"Best parameters: {search.best_params_}")
pred = search.predict(text_test)
print(classification_report(y_test, pred))

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  5.4min finished


Best parameters: {'clf__C': 100, 'tfidf__use_idf': True, 'vect__max_df': 0.5, 'vect__ngram_range': (1, 2)}
              precision    recall  f1-score   support

           0       0.94      0.90      0.92      7010
           1       0.91      0.94      0.93      7298

    accuracy                           0.92     14308
   macro avg       0.92      0.92      0.92     14308
weighted avg       0.92      0.92      0.92     14308
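The "24 candidates" and "120 fits" reported in the output follow directly from the grid: every combination of settings is one candidate, and each candidate is fitted once per fold. A small sketch that enumerates the combinations with the standard library:

```python
from itertools import product

# same grid as in the cell above
grid = {
    'vect__max_df': (0.5, 0.75),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
    'clf__C': (.01, 1, 100),
}
cv = 5

candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(candidates))       # 24
print(len(candidates) * cv)  # 120
```

Adding another parameter with three values would triple both numbers, so grids grow quickly.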



## Step 4: Neural networks


In [9]:
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense
from keras.metrics import Precision, Recall

In [10]:
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)
X_test.sort_indices()
X_train.sort_indices()

input_dim = X_train.shape[1]  # Number of features
numberoflabels = 1

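# The labels in this dataset are already 0/1, so the mapping below leaves them unchanged;
# it only matters for datasets with "pos"/"neg" string labels.
y_train_int = np.array([{"pos":1, "neg":0}.get(e,e) for e in y_train])
y_test_int = np.array([{"pos":1, "neg":0}.get(e,e) for e in y_test])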

In [11]:
model1 = Sequential()
model1.add(Dense(300, input_dim=input_dim, activation='relu'))
#model1.add(layers.Dense(100, activation='relu'))
model1.add(Dense(1, activation='sigmoid'))

model1.compile(loss='binary_crossentropy', 
           optimizer='adam', 
            metrics=['accuracy', Precision(), Recall()])
model1.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 300)               8479800   
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 301       
Total params: 8,480,101
Trainable params: 8,480,101
Non-trainable params: 0
_________________________________________________________________
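The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below recovers the size of the vocabulary from the reported numbers; `dense_params` is an illustrative helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# 8,479,800 parameters in the first layer imply this number of input features:
input_dim = (8479800 - 300) // 300
print(input_dim)                     # 28265
print(dense_params(input_dim, 300))  # 8479800
print(dense_params(300, 1))          # 301
```

Almost all parameters sit in the first layer, because every word in the vocabulary is connected to all 300 hidden units.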


In [12]:
VALIDATIONSIZE = int(.2 * len(y_test))


history = model1.fit(X_train[:-VALIDATIONSIZE], y_train_int[:-VALIDATIONSIZE],
                     epochs=5,
                     verbose=True,
                     validation_data=(X_train[-VALIDATIONSIZE:], y_train_int[-VALIDATIONSIZE:]))

_, acc, precision, recall = model1.evaluate(X_test, y_test_int)
print(f"Accuracy: {acc:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")



Epoch 1/5


  "shape. This may consume a large amount of memory." % value)


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy: 0.90, Precision: 0.90, Recall: 0.91


## Step 4b: Convolutional network

In [13]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Conv1D, MaxPooling1D, Embedding,GlobalMaxPooling1D

In [14]:
embedding_dim = 300

# Tokenize words
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(text_train)
X_train = tokenizer.texts_to_sequences(text_train)
X_test = tokenizer.texts_to_sequences(text_test)

# Adding 1 because of reserved 0 index
vocab_size = len(tokenizer.word_index) + 1

# Pad sequences with zeros
maxlen = len(max(X_train, key=len)) # never truncate -- alternatively, set max length to lower value 
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
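What post-padding does can be illustrated in plain Python. This is a simplified sketch of the idea, not the Keras implementation, and the word indices are made up.

```python
def pad_post(sequences, value=0):
    """Append `value` to each sequence until all are as long as the longest one."""
    maxlen = len(max(sequences, key=len))  # same rule as in the cell above
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]


# hypothetical word indices for three titles
sequences = [[4, 12, 7], [9], [3, 5, 8, 2, 6]]
padded = pad_post(sequences)
print(padded)  # [[4, 12, 7, 0, 0], [9, 0, 0, 0, 0], [3, 5, 8, 2, 6]]
```

Because 0 is used for padding, it cannot also stand for a word, which is why `vocab_size` above is one larger than the number of words.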


In [15]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(Conv1D(embedding_dim, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(300, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy',  Precision(), Recall()])
print(model.summary())


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 58, 300)           12762000  
_________________________________________________________________
conv1d (Conv1D)              (None, 54, 300)           450300    
_________________________________________________________________
global_max_pooling1d (Global (None, 300)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 300)               90300     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 301       
Total params: 13,302,901
Trainable params: 13,302,901
Non-trainable params: 0
_________________________________________________________________
None
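The shapes and parameter counts in this summary can again be verified with a little arithmetic. The sketch assumes the Keras defaults for `Conv1D` used above (stride 1, no padding); the helper names are illustrative.

```python
def conv1d_output_length(input_length, kernel_size):
    """Output length of a 1D convolution with stride 1 and no padding."""
    return input_length - kernel_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    """One kernel of size kernel_size * in_channels per filter, plus one bias each."""
    return kernel_size * in_channels * filters + filters


print(conv1d_output_length(58, 5))  # 54
print(conv1d_params(5, 300, 300))   # 450300
print(12762000 // 300)              # 42540 rows in the embedding matrix
```

The 42,540 rows of the embedding matrix correspond to `vocab_size`, with one 300-dimensional vector per row.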


In [16]:
history = model.fit(X_train[:-VALIDATIONSIZE], y_train_int[:-VALIDATIONSIZE], 
          epochs=3, verbose=True,
          validation_data=(X_train[-VALIDATIONSIZE:], y_train_int[-VALIDATIONSIZE:]))



Epoch 1/3
Epoch 2/3
Epoch 3/3


In [17]:
_, acc, precision, recall = model.evaluate(X_test, y_test_int)
print(f"Accuracy: {acc:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")



Accuracy: 0.91, Precision: 0.93, Recall: 0.89


## Step 5: Transformers

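In this last step, we fine-tune a pretrained BERT model (`bert-base-uncased`) from the Hugging Face transformers library on our data.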


In [21]:
import tensorflow as tf

from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

In [22]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/536M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [23]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


In [24]:
def create_tf_examples(texts, labels, mapping = {"pos":1, "neg":0}):
    for text, label in zip(texts, labels):
        label_encoded = mapping.get(label, label)
        yield InputExample(guid=None, text_a=text, label=label_encoded)

In [25]:
# let's use 20% of the test set for validation

VALIDATIONSIZE = int(.2 * len(y_test))

train_examples = create_tf_examples(text_train, y_train)
validation_examples = create_tf_examples(text_test[:VALIDATIONSIZE], y_test[:VALIDATIONSIZE])
test_examples = create_tf_examples(text_test[VALIDATIONSIZE:], y_test[VALIDATIONSIZE:])

In [26]:
# function taken and slightly adapted from https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671

def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = []
    for e in examples:
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length", 
            truncation=True)

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(InputFeatures(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label))

    def gen():
        for f in features:
            yield ({"input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,},
                f.label,)

    return tf.data.Dataset.from_generator(gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        ({"input_ids": tf.TensorShape([None]), "attention_mask": tf.TensorShape([None]), "token_type_ids": tf.TensorShape([None]),}, tf.TensorShape([]),),)
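The roles of `padding="max_length"`, `truncation=True` and the attention mask can be illustrated with a plain-Python sketch. It is a simplification: the real tokenizer, for instance, keeps the closing special token when it truncates, whereas this sketch simply cuts off the end. The token ids are hypothetical, and `pad_with_mask` is an illustrative helper.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad token ids to max_length and build the attention mask.

    The mask is 1 for real tokens and 0 for padding positions.
    """
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# hypothetical token ids; 101 and 102 stand for BERT's [CLS] and [SEP] tokens
input_ids, attention_mask = pad_with_mask([101, 7592, 2088, 102], max_length=6)
print(input_ids)       # [101, 7592, 2088, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask tells the model which positions to ignore, so that padding does not influence the prediction.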

In [27]:
train_data = convert_examples_to_tf_dataset(train_examples, tokenizer)
validation_data = convert_examples_to_tf_dataset(validation_examples, tokenizer)
test_data = convert_examples_to_tf_dataset(test_examples, tokenizer)

train_data = train_data.shuffle(100).batch(32).repeat(2)
validation_data = validation_data.batch(32)
test_data = test_data.batch(32)
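A note on `.batch(32).repeat(2)`: batching groups the examples into chunks of 32 (the last one may be smaller), and repeating makes the dataset run through those batches twice. The arithmetic can be sketched as follows; the training-set size is hypothetical.

```python
import math


def batches_per_pass(n_examples, batch_size, repeats):
    """Number of batches produced by .batch(batch_size).repeat(repeats)."""
    return math.ceil(n_examples / batch_size) * repeats


# hypothetical training-set size
n_train = 1000
steps = batches_per_pass(n_train, batch_size=32, repeats=2)
print(steps)  # 64
```

If such a repeated dataset is then passed to `fit` with `epochs=2`, each example should be seen four times in total, which is worth keeping in mind when comparing training time and overfitting across models.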

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, epsilon=1e-08, clipnorm=1.0),   #lr was 3e-5
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
      3/Unknown - 194s 43s/step - loss: 0.6560 - accuracy: 0.6042

### Get classification report on unseen data

In [None]:
output = model.predict(test_data)
probabilities = tf.nn.softmax(output[0], axis=-1)  # output[0] holds the logits
predicted_class = tf.argmax(probabilities, axis=1).numpy()  # integer labels 0/1

# the labels in this dataset are already integers, so no decoding is needed
print(classification_report(y_test[VALIDATIONSIZE:], predicted_class))
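The step from the model's raw outputs (logits) to predicted classes can be written out in plain Python. This is a minimal sketch with made-up logits, not the TensorFlow implementation.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# hypothetical logits for two texts and two classes
logits = [[2.0, -1.0], [-0.5, 0.5]]
probabilities = [softmax(row) for row in logits]
predicted = [row.index(max(row)) for row in probabilities]
print(predicted)  # [0, 1]
```

Since softmax does not change the order of the scores, taking the argmax of the logits directly gives the same classes; the softmax is only needed if you want probabilities.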

### Use the model

In [None]:
# The model was trained on news titles, so we try it out on two titles from the held-out test split
pred_sentences = text_test[VALIDATIONSIZE:VALIDATIONSIZE + 2]

tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
label = tf.argmax(tf_predictions, axis=1).numpy()
# the predictions are the dataset's own 0/1 labels (see the dataset documentation for their meaning)
for sentence, predicted in zip(pred_sentences, label):
    print(sentence, ": \n", predicted)